How I scraped 6 years of Reddit posts in JSON


Have you ever wondered how many startups actually fail? People love to say that "most startups fail", but how many of them really do?

Well, I have, so I decided to spend the past 2 days scraping as many self-promotion Reddit posts as I could.

To be more specific, I scraped every post in the last 6 years from r/SaaS and r/SideProject and extracted every single unique domain from all links. From my experience, these are the subreddits where most people self-promote their product.

Turns out scraping Reddit is not that easy

Reddit has an RSS/JSON feed for every subreddit and post. For example, if you want to pull all data about the 10 newest posts from r/pics, you can do something like:

https://www.reddit.com/r/pics/new.json?limit=10

This gives you a list of post objects. Each one carries a bunch of metadata, including a title, author and selftext, which contains the full text of the Reddit post in a Markdown-ish format.
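The relevant slice of that JSON can be modeled with a few records. This is a minimal sketch using System.Text.Json; the field names (name, title, author, selftext) come from Reddit's actual response, while the record and class names are my own:

```csharp
using System;
using System.Text.Json;

// Minimal model of a Reddit listing: { "data": { "children": [ { "data": {...} } ] } }
record Listing(ListingData data);
record ListingData(Child[] children);
record Child(PostData data);
record PostData(string name, string title, string author, string selftext);

class Demo
{
    static void Main()
    {
        // In practice this JSON would come from
        // https://www.reddit.com/r/pics/new.json?limit=10
        string json = """
        {"data":{"children":[{"data":{"name":"t3_1crggzy","title":"A photo","author":"someone","selftext":"Check https://example.com"}}]}}
        """;

        var listing = JsonSerializer.Deserialize<Listing>(json);
        foreach (var child in listing!.data.children)
            Console.WriteLine($"{child.data.name}: {child.data.title} by {child.data.author}");
    }
}
```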

There's also an RSS feed but it contains less data:

https://www.reddit.com/r/pics/new.rss?limit=10

Well, that's great! But of course, you can't just pull every single post like this - Reddit caps each response at 100 posts. So I dug around and found a few more URL parameters that Reddit uses:

  • after - return posts after a specific post id
  • before - return posts before a specific post id

Each post has a name field that goes like t3_[id]. The ID is a short base-36 string of letters and digits. You can actually see this ID in every post's URL: 

/r/pics/comments/post_id/[post title slug]/

By using it for the after URL parameter, you can get the 10 newest posts after a specific post:

https://www.reddit.com/r/pics/new.json?limit=10&after=t3_1crggzy

The scraping process

I wrote a little program that would grab the first 100 posts, then request the next page with after set to the name (t3_...) of the last post. Repeat this 10 times and you get 1000 posts.
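The loop boils down to threading the last post's name into the next URL. Here's a sketch of that cursor logic (BuildPageUrl and the surrounding structure are my own illustration; only the URL format comes from Reddit):

```csharp
using System;

class Pagination
{
    // Build the URL for the next page, given the name (t3_...) of the
    // last post already seen, or null for the very first page.
    static string BuildPageUrl(string subreddit, int limit, string? after) =>
        $"https://www.reddit.com/r/{subreddit}/new.json?limit={limit}"
        + (after is null ? "" : $"&after={after}");

    static void Main()
    {
        Console.WriteLine(BuildPageUrl("pics", 100, null));
        Console.WriteLine(BuildPageUrl("pics", 100, "t3_1crggzy"));
        // A real scraper loops: fetch a page, remember the last post's name,
        // pass it as `after`, and stop when `children` comes back empty.
    }
}
```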

Then, I compiled a little Regex that would extract URLs from the posts and ignore all popular social media links (incl. www variants of them).

private static readonly string[] MATCH_BLACKLIST = [
    @"reddit\.com",
    @"twitter\.com",
    @"x\.com",
    @"tiktok\.com",
    @"instagram\.com",
    @"youtu\.be",
    // Many more common social media websites.
];

// I am not the best at Regex.
private static readonly string NON_SOCIAL_MEDIA =
    $@"(http|https):\/\/(?!{string.Join('|', MATCH_BLACKLIST)}|{@"www\." + string.Join(@"|www\.", MATCH_BLACKLIST)})[0-9A-Za-z.\-_]+";

This worked pretty well and extracted every valid link from all posts. Then I deduplicated the results.
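Applied to a post, the whole extract-and-dedup step looks roughly like this (a trimmed-down blacklist for illustration; a HashSet handles the deduplication):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LinkExtractor
{
    static void Main()
    {
        string[] blacklist = [@"reddit\.com", @"twitter\.com"];
        string pattern =
            $@"(http|https):\/\/(?!{string.Join('|', blacklist)}|{@"www\." + string.Join(@"|www\.", blacklist)})[0-9A-Za-z.\-_]+";

        string post = "Try https://example.com and https://www.reddit.com/r/SaaS, also https://example.com again!";

        // A HashSet gives us uniqueness for free.
        var unique = new HashSet<string>();
        foreach (Match m in Regex.Matches(post, pattern))
            unique.Add(m.Value);

        foreach (var url in unique)
            Console.WriteLine(url); // the reddit.com link and the duplicate are dropped
    }
}
```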

This worked wonders until it hit the first major roadblock: the data cuts off after 14 days of posts 😔

The array of posts just becomes empty. This also applies to /hot, /top and all other filter categories.

An alternative data provider - Pushshift

While looking for alternative ways of obtaining data, I found Pushshift - a Reddit archive with an API that collected all posts and comments from the top 40,000 subreddits on Reddit.

Back in 2023, when Reddit introduced a new pricing model for its previously free API, it forced Pushshift to restrict access to pre-approved moderators only.

Thanks to the good ol' torrenting community, you can pretty much find all of their data (from 2005 to the end of 2023) on AcademicTorrents (2.64 TB).

Tracking a website's status

The data problem was now solved - extracting all links from all these Reddit posts took no more than a few seconds.

But how do I tell if a website is still operating or not? There are a few approaches I thought of:

Pinging the domain

This doesn't really work because in most cases, the IP behind a domain is still connected to the internet.

For example, every site on Cloudflare will respond to a ping because of Cloudflare's DNS proxying.

Sending a GET request

This is too network-intensive and could take hours to complete: you essentially receive all the data (in most cases) that a normal browser would when visiting the website.

The solution - a HEAD request

A HEAD request is essentially a GET whose response contains only the headers; the response body is completely empty.
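A minimal sketch of that check with HttpClient (this is an illustration, not the exact program; auto-redirects are turned off so the 3xx status codes are visible instead of being silently followed):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class HeadChecker
{
    static async Task Main()
    {
        // Don't follow redirects automatically - we want to see 301/302/... ourselves.
        var handler = new HttpClientHandler { AllowAutoRedirect = false };
        using var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(10) };

        var request = new HttpRequestMessage(HttpMethod.Head, "https://example.com");
        try
        {
            using var response = await client.SendAsync(request);
            Console.WriteLine((int)response.StatusCode);
        }
        catch (Exception e) // timeout, DNS failure, SSL error, ...
        {
            Console.WriteLine($"dead: {e.GetType().Name}");
        }
    }
}
```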

I ran it on my 500 Mbps server and it was done in about 2 minutes.

Running the program on my server

I consider a website operational if the response code is 200 (OK) or one of the redirect codes 301, 302, 307 or 308.

If a request times out, has an SSL issue or any other connection issue, the website is considered abandoned.

I also wrote a little algorithm to determine if the redirect goes to a www subdomain or to a different page on the same domain. If it doesn’t, that indicates a reused or expired domain, or a spammy ad website, essentially meaning the website is non-operational.
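That same-site check can be sketched as a small helper, treating a redirect as benign if the target host matches the original host modulo a www. prefix (an illustration of the idea, not the exact algorithm):

```csharp
using System;

class RedirectCheck
{
    // A redirect is considered benign if it stays on the same host,
    // allowing for an added or removed "www." prefix.
    static bool IsSameSite(string originalUrl, string location)
    {
        static string Normalize(string host) =>
            host.StartsWith("www.", StringComparison.OrdinalIgnoreCase) ? host[4..] : host;

        var from = Normalize(new Uri(originalUrl).Host);
        // A relative Location ("/login") stays on the same domain by definition.
        if (!Uri.TryCreate(location, UriKind.Absolute, out var to))
            return true;
        return Normalize(to.Host).Equals(from, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(IsSameSite("http://example.com", "https://www.example.com/home")); // True
        Console.WriteLine(IsSameSite("http://example.com", "https://spammy-ads.net"));       // False
        Console.WriteLine(IsSameSite("http://example.com", "/login"));                       // True
    }
}
```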

Results

I broke down the results in a few different ways.

Up      Abandoned   Redirect   Errors
10080   5423        1495       1169

Roughly 40% of all websites are abandoned or non-operational.


Domains            Total   Up     Down
.com               9600    5458   4142
.io                1873    1061   812
.co                1027    491    536
.ai                514     389    125
.net               307     152    155
End with "ify"     99      55     44
Start with "go"    98      47     51
Start with "try"   55      19     36
Start with "try"551936

.io, .co and .ai are the most popular .com alternatives for startups.


Year   Total Links   Total Upvotes   Avg. Upvotes
2018   956           7478            7.82
2019   2055          16720           8.14
2020   3017          30412           10.08
2021   2951          38054           12.9
2022   2307          18133           7.86
2023   5753          26614           4.63

Starting an online business is more common than ever, yet people are upvoting less and less. This trend could possibly be due to the influx of blatantly AI-written posts and bot-generated content.

Resources and Conclusion

You can find all the domains and data on Google Sheets. I've also added some additional information and statistics.

The past 2 days have been fun. I learned many things about how Reddit stores billions of posts efficiently, and I even learned how to parse an RSS stream. Somehow, in about 8 years of programming, I'd never parsed one.
