Reddit will block the Internet Archive

125 points by timpera 2 days ago

It's so weird how fragile digital history is. When things first became digital I remember sentiments of "things can now be maintained perfectly forever" but today it feels like that in 30 years we'll have a better record of 1820 than 2020.

anon-3988 a day ago

> It's so weird how fragile digital history is.
Is it? Its very easy to produce (hence there's too many of them) and they are extremely fragile (bit rot, complicated format that no one knows how to parse etc). Seems to be this is inevitable. I personally think Youtube is going to start pruning their database in the next decade.
efilife 2 days ago

I noticed that people just don't archive things they care about. Stuff like YouTube videos, music, blog articles etc. all get lost because all of those who consume them don't think they can be gone any day. It's always "someone will reupload this" but what if they don't? And they often don't. I started a pretty big archival movement in a smaller community on Soundcloud after I got fed up with artists wiping their accounts constantly (I reuploaded many lost songs I archived over the years). After I showed the way, many copycats started showing up and even artists started giving people time to save their stuff before removing it. Maybe we need to raise awareness about how fragile media really is?
- Telaneo a day ago
  
  Even if someone reuploads it, it's usually just another Youtube purge away from being lost again, unless someone goes out of their way to reupload it somewhere else.
  The datahoarders of the world probably have somewhat decent archives of quite a few youtube channels and such, but since it's not publically availible, and reuploading to youtube or IA isn't really viable, it's lost as far as any of the rest of us are concerned.
  Doesn't help that video is really big and cumbersome to archive. Audio's a lot smaller and thus easier to keep around. Text is easiest, but there tends to be a lot of it, and archiving one page at a time is usually not worth it in the same way that a video or song might be, so blocking automation is usually a pretty good bet for anyone who really doesn't want their stuff to be archived.
  - Fade_Dance a day ago
    
    >The datahoarders of the world probably have somewhat decent archives of quite a few youtube channels and such, but since it's not publically availible, and reuploading to youtube or IA isn't really viable, it's lost as far as any of the rest of us are concerned.
    We are talking archives, older archival processes were not much better. At best it would arrive in a public information nexus, but be filed away in a dark filing cabinet, not much different than data hoarding. At least data hoarding can interface with the wider internet incredibly easily should there be a desire.
    Although I would agree that you are right and that there needs to be a better long-term infrastructure for this.
    I think back to the original peer-to-peer applications and almost think that that loose framework would be an improvement - where different people with files with the same digital signature can serve as mirrors for a reference file through a decentralized network.
    
    Telaneo a day ago
    
    You're probably right that in the past, things would just be filed away, but then, even if it's a pain to find, there's hopefully an obvious place to find it. Old newspapers got onto microfilm and then into libraries, or were otherwise preserved to some degree by their publishers.
    Good luck finding someone's archive of a Youtube channel. Sure it probably exists, probably even several copies of it, but unless XxX_PussyDestroyer69_XxX on reddit wrote in a comment that they personally archived that particular channel, you're never going to be able to get in touch with the relevant people to actually get a hold of what you're looking for.
    I wish peer-to-peer video ended up being more popular than it is, if only to have an obvious place to put those archives people otherwise juat hoard foe themselves, but I have little faith that will happen for as long as Youtube remains the only viable video platform out there for your average Joe.
- croes 21 hours ago
  
  Many sites aren't back friendly. The either don't allow downloads or rely so heavily on js and frameworks that the stored files aren't working
  - efilife 21 hours ago
    
    This doesn't matter that much. If someone's interested and not very technical it's a matter of googling "X downloader" and pasting the link you want to archive. Learning to use yt-dlp is not that difficult to a layman as well

duxup 2 days ago

Everyone wants to close down their corner of the internet because they think AI is going to make them a ton of money. We're getting the first part but I'm not sure we're seeing the latter ... anywhere as far as platforms go.

necovek 2 days ago

Well, Reddit is getting a ton of money out of licensing deals for using their data to train AI.
Whether you classify that as "AI-related" or not, I don't know.
- phil21 2 days ago
  
  It's funny/interesting/terrifying to me that developers went from the near-religious mantra of "Garbage In, Garbage Out" when I was learning computers - to now training our supposedly super intelligent AIs off of reddit posts or even worse.
  Basically laundering outright wrong information into something the next generation is now going to believe as scientific truths.
  I often wonder how many people/organizations are seeding places like Reddit with malinformation/beliefs for it to become canical truth in the AI age once it's too late to tell the difference for most people?
  Lord knows I've made trolling-level posts that are only marginally accurate back in the day that are now part of the AI corpus of knowledge. Mix those in with some of my well-researched stuff and you couldn't even really filter it based on "this account is a shitposter" to weight it lower. Nevermind plenty of earnest posts made that were outright wrong simply due to... being wrong in the moment and later learning better.
  - anon7000 2 days ago
    
    Reddit sucks, but it’s also one of the biggest goldmines of human-curated information out there. Alternatives include blogspam, which is worse than useless these days, and forums with limited scope. Figuring out how to sift through dirt to find the nuggets of gold is important for any AI, whether they train on Reddit or not.
    
    Telaneo a day ago
    
    It's always funny to me how Reddit is one of the less garbage sources on random unexpected shit you one day want to find out. If you're really lucky, someone will have written a really good article on some niche topic on their personal website or similar, but that's annoyingly rare. You've also got review sites and such for common consumer products which are usually a decent bet after some filtering and cross checking. But the step down from that in terms of quality really is reddit of all places. You sill need to double check be sane about what you read, but the alternatives, like the ones you've listed, are actually worse than reddit's random internet strangers.
    
    necovek 17 hours ago
    
    I don't think really good blog posts/articles on a personal site are rare. But they are increasingly harder to find, searches only returning SEO spam sites instead.
  - carlhjerpe 2 days ago
    
    I think filtering on upvotes/downvotes, comments/views, sub, user and whatever metrics they have on the content can help AI companies train on somewhat reasonable things. Blend it with Wikipedia, scientific papers, reliable newspapers and you're golden?
    Metadata is what makes gold out of poo, I assume model developers can "train negatively" too if metadata suggests they should.
    
    phantompeace 2 days ago
    
    So now the people who can buy upvotes get to write history?
    
    carlhjerpe a day ago
    
    True, once you start measuring by something it becomes an useless metric so it'd work until people knew if they do it or not, then they will not be able to rely on that data either. Or they improve their bot detection and mitigation and play the cat and mouse game.
- BeFlatXIII 2 days ago
  
  Every company who pays is a chump. They ought to get better at scraping and hacking IOT devices as residential proxies.
jajuuka 2 days ago

It's not entirely self enriching. AI crawlers hit servers hard and everyone has their own crawler. So it's partially covering a business expense. Especially with Reddit being a goldmine of content for training data.
Internet Archive has been terrible as capturing full pages on Reddit for a while. So it's not a real loss. Unfortunately right now these AI companies have full freedom to do whatever they want. Taking paid content, artistic works, and your own posts on social media. So Reddit trying to charge them is a good idea as it's some form of quid pro quo put on AI scraping companies.
- JohnFen 2 days ago
  
  > So Reddit trying to charge them is a good idea as it's some form of quid pro quo put on AI scraping companies.
  Except that what Reddit is really doing is selling content they didn't produce and don't own. I don't think they're walking some kind of high road here like they would be if they were actually fighting against the scraping.
  - jajuuka 2 days ago
    
    I didn't mean to imply they were. As I said, it's not ENTIRELY for self-enriching reasons. As in self-enrichment is a part of the reasons for this effort to combat AI scrapers.
    That being said I can still take some satisfaction in seeing AI scapers get jammed up considering how they face zero consequences right now.

pfcd 2 days ago

It already was blocked, in a way.

See: https://www.reddit.com/r/internetarchive/comments/1gpn54q/is...

> They are not specifically targeting Wayback Machine. Anything other than residential IP's are blocked, to my information. Such as IP's of cloud services like Hetzner, GCP, AWS... The list goes on. (from my comment there)

echelon 2 days ago

I'll happily install an Internet Archive extension to scrape websites, if there is such a thing.
- xnx a day ago
  
  Archive Team had a Reddit project in he past. Hopefully, they'll start up a new one: https://wiki.archiveteam.org/index.php/Projects
  - pabs3 a day ago
    
    IIRC the project stopped because we couldn't get around Reddit's blocking.
    https://wiki.archiveteam.org/index.php/Reddit
- toomuchtodo a day ago
  
  Stay tuned.

mrkramer a day ago

Ironically enough rampant piracy turned out to be the best method for preserving history because that way thousands of people have x or y thing on their hard drive stored and preserved in decentralized fashion. One way centralized archiving is fragile.

throw0101c 2 days ago

Is there a way to allow IA to scrape the site but not allow viewing the results (for "x" weeks/months?)?

A way to balance archiving (in case something happens to Reddit) and exploitation (of copyright/material).

Telaneo a day ago

I doubt that'll be a deal reddit will find palatable, given there's no obvious monetary incentive for them to allow archiving in the first place, but there are some to blocking others from being able to index their content, like funnelling those exact people trying to index them into licencing reddits content instead, and preventing the people who are already doing that from getting ideas.
tempfile 17 hours ago

Why should there be a balance between archiving (a useful social good) and exploitation by a platform of copyrighted material they did not create and do not own?

cinb 2 days ago

Meanwhile, sites that constantly archive Reddit, like https://arctic-shift.photon-reddit.com, remain unscathed.

timpera 2 days ago

Oh, this is pretty cool. How do they get around the anti-scraping protection? It must be pretty hard to consistently scrape so much data.
xnx a day ago

Wow, that's a cool site. You can set the charts back 21 years ("21y"): https://arctic-shift.photon-reddit.com/live-charts/

PaulHoule 2 days ago

What they're really afraid of is that people will read content using LLM inference and make all the ads and nags and "download the app for a crap experience" go away -- and never click on ads accidentally for an occasional ka-ching.

Yeah, the front end for de-enshittification looks a lot like that other archive site,

https://archive.today/

In the summer of 2020 I was driving to Buffalo a lot with my son and getting cheap hotel deals thanks to the pandemic and thinking about missile defense systems and I was sick and tired of the awful shape of the web and dreaming up a system that would "archive" 100% of web pages before I read them. I spent two weeks on a spike prototype and concluded that an "archiver" can never really know if a modern web page is done loading so it at best uses heuristics to make the page load completely and waits a long time -- which makes following a link even slower than waiting for all the ads and trackers to load. I finally got Fiber-to-the-Node at home so downloading all the trash of the annoyances economy became more tolerable, a lot of the ideas I had that the time made it into my RSS reader a few years later.

freedomben 2 days ago

I had (and still have to some extent) the same dream, though I'm ok with the archiving happening after-the-fact. ArchiveBox has worked reasonably well for me
Nathan2055 2 days ago

> What they're really afraid of is that people will read content using LLM inference and make all the ads and nags and "download the app for a crap experience" go away -- and never click on ads accidentally for an occasional ka-ching.
See, I don't think this is right either. Back during the original API protests, several people (including me!) pointed out that if the concern was really that third-party apps weren't contributing back to Reddit (which was a fair point: Apollo never showed ads of any kind, neither Reddit's or their own) then a good solution would be to make using third-party apps require paying for Reddit Premium. Then they wouldn't have to audit all of the apps to ensure they were displaying ads correctly and would be able to collect revenue outside of the inherent limitations of advertising.
Theoretically, this should have been a straight win for Reddit, especially given the incredibly low income that they've apparently been getting from ads anyway (I can't find the report now so the numbers might not be exact, but I remember it being reported that Reddit was pulling in something like ~$0.60 per user per month versus Twitter's slightly better ~$8 per user per month and Meta's frankly mindblowing ~$50 per user per month) but it was immediately dismissed out of hand in favor of their way more complicated proposal that app developers audit their own usage and then pay Reddit back.
My initial thoughts were either that the Reddit API was so broken that they couldn't figure out how to properly implement the rate limits or payment gating needed for the other strategy (even now the API still doesn't have proper rate limits, they just commence legal action anyone they find abusing it rather than figure out how to lock them out; the best they can really do is the sort of basic IP bans they're using here), or the Reddit higher-ups were so frustrated that Apollo had worked out a profitable business model before them that they just wanted to deploy a strategy targeted specifically at punishing them.
But it quickly became clear later that Reddit genuinely wasn't even thinking about third-party apps. They saw dollar signs from the AI boom, and realized that Reddit was one of the largest and most accessibly corpuses of generally-high-quality text on a wide variety of topics, and AI companies were going to need that. Google showing an intense dependency on Reddit during the blackout didn't hurt either (yes, at this point I genuinely believe the blackout actually hurt more than it helped by giving Reddit further leverage to use on Google, hence why they were one of the first to sign a crawler deal afterwards).
So they decided to use any method they could think of to lock down access to the platform while keeping enough people around that the Reddit platform was still mostly decent enough to be usable for AI training and pivoted much of their business to selling data. All of this while claiming, as they're still doing today with the Internet Archive move, that this is somehow a "privacy measure" meant to ensure deleted comments aren't being archived anywhere.
The same thing basically happened with Stack Exchange, except they had much less leverage over their community because the entire site was previously CC licensed and they didn't have any real authority to override that beyond making data access really annoying.
The good news is that it really does seem like "injest everything" big model AI is the least likely to survive at this point. Between ChatGPT scaling things down massively to save on costs with the GPT-5 update and the Chinese models somehow making do with less data and slower chips by just using better engineering techniques, I highly doubt these economics around AI are going to last. The bad news is that, between stuff like this and the GitHub restructuring today, I don't thing Big Tech has any plans on how they're going to continue functioning in an economy that isn't entirely based on AI hype. And that's really concerning.

khelavastr a day ago

Isaac Asimov never dreamed companies would try to legally blockade robots from intellectual property to charge extra. Absurd.

ChrisArchitect 2 days ago

What is the source? Where did Reddit say this? No blog post or release anywhere

mikestew 2 days ago

Well, there is TFA that quotes a Reddit spokesperson. What do you want, stone tablets?
dkiebd 2 days ago

Read the article.