How I web-scraped 6 years of Reddit posts in JSON
Have you ever wondered how many startups actually fail? People love to say that "most startups fail", but how many of them really do?
Well, I have, so I spent the past 2 days scraping as many self-promotion Reddit posts as I could.
To be more specific, I scraped every post from the last 6 years on r/SaaS and r/SideProject and extracted every unique domain from all the links. In my experience, these are the subreddits where people most often self-promote their products.
Turns out scraping Reddit is not that easy
Reddit has an RSS/JSON feed for every subreddit and post. For example, if you want to pull all data about the 10 newest posts from r/pics, you can do something like:
https://www.reddit.com/r/pics/new.json?limit=10
This gives you a large list of post objects. Each one has a bunch of metadata, including a `title`, `author` and `selftext`, which contains the whole text of the Reddit post in a Markdown-ish format.
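If you want to poke at the feed yourself, here's a minimal Python sketch (it assumes the `requests` library; the User-Agent string is a placeholder, but Reddit does tend to reject default library User-Agents):

```python
import requests

# Reddit tends to reject default library User-Agents, so set a descriptive one.
HEADERS = {"User-Agent": "startup-graveyard-scraper/0.1"}  # placeholder name

resp = requests.get(
    "https://www.reddit.com/r/pics/new.json",
    params={"limit": 10},
    headers=HEADERS,
    timeout=10,
)
resp.raise_for_status()

# Posts live under data.children; each child wraps the actual fields in "data".
for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(post["name"], post["author"], post["title"])
    # post["selftext"] holds the full Markdown-ish body of a text post
```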
There's also an RSS feed but it contains less data:
https://www.reddit.com/r/pics/new.rss?limit=10
Well, that's great! But of course, you can't just pull every single post like this: Reddit returns at most 100 posts per request. So I dug around and found a few more URL parameters that Reddit supports:
- `after` - return posts after a specific post ID
- `before` - return posts before a specific post ID
Each post has a `name` field that goes like `t3_[id]`. The ID itself is a short string of letters and numbers, and you can actually see it in every post's URL: `/r/pics/comments/post_id/[post title slug]/`
By using it for the `after` URL parameter, you can get the 10 newest posts after a specific post:
https://www.reddit.com/r/pics/new.json?limit=10&after=t3_1crggzy
The scraping process
I wrote a little program that would grab the first 100 posts, then use an `after` parameter equal to the ID of the last post. Repeat this 10 times and you get 1000 posts.
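My actual scraper isn't published here, but a minimal Python sketch of that loop could look like this (reusing the `HEADERS` from the earlier snippet; function and variable names are my own):

```python
def fetch_posts(subreddit: str, pages: int = 10, limit: int = 100) -> list[dict]:
    """Walk a subreddit's /new listing, `limit` posts at a time."""
    posts, after = [], None
    for _ in range(pages):
        params = {"limit": limit}
        if after:
            params["after"] = after  # e.g. "t3_1crggzy" - the name of the last post seen
        resp = requests.get(
            f"https://www.reddit.com/r/{subreddit}/new.json",
            params=params, headers=HEADERS, timeout=10,
        )
        resp.raise_for_status()
        children = resp.json()["data"]["children"]
        if not children:  # the listing eventually runs dry (see the cutoff below)
            break
        posts.extend(child["data"] for child in children)
        after = children[-1]["data"]["name"]  # continue after the last post
    return posts

posts = fetch_posts("SaaS")  # up to 10 x 100 = 1000 posts
```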
Then, I compiled a little regex that would extract URLs from the posts and ignore all popular social media links (incl. `www` variants of them).
This worked pretty well and was able to extract every valid link from all posts. Then I just deduplicated the list so each domain appeared only once.
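The exact regex isn't important; roughly, the idea looks like this (the blocklist below is just an illustrative subset of the ignored social media hosts):

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"""https?://[^\s<>"')\]]+""")

# Illustrative subset - the real blocklist covers far more social platforms.
IGNORED_HOSTS = {"reddit.com", "twitter.com", "x.com", "facebook.com",
                 "instagram.com", "youtube.com", "linkedin.com"}

def extract_domains(posts):
    """Collect unique domains from post bodies, skipping social media links."""
    domains = set()
    for post in posts:
        for url in URL_RE.findall(post.get("selftext", "")):
            host = urlparse(url).netloc.lower().removeprefix("www.")
            if host and host not in IGNORED_HOSTS:
                domains.add(host)
    return domains  # a set, so duplicates are already gone

domains = extract_domains(posts)
```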
This worked wonders until it hit the first major roadblock: the data cuts off after 14 days of posts 😔
The array of posts just becomes empty. This also applies to `/hot`, `/top` and all other filter categories.
An alternative data provider - Pushshift
While looking for alternative ways of obtaining the data, I found Pushshift - a Reddit archive with an API that collected all posts and comments from the top 40,000 subreddits.
Back in 2023, when Reddit introduced a pricing model for their previously free API, they forced Pushshift to restrict access to pre-approved moderators only.
Thanks to the good ol' torrenting community, you can pretty much find all of their data (from 2005 to the end of 2023) on AcademicTorrents (2.64 TB).
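As far as I can tell, the dumps in that torrent are zstandard-compressed newline-delimited JSON (one submission per line), so you can stream them without decompressing everything to disk. A rough sketch with the Python `zstandard` package (the file name is just an example; the large `max_window_size` is needed because the archives use a long compression window):

```python
import io
import json
import zstandard

def iter_submissions(path):
    """Stream submissions out of a Pushshift .zst dump, one JSON object per line."""
    with open(path, "rb") as fh:
        # The dumps are compressed with a long window, so raise the decoder limit.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for line in reader:
            yield json.loads(line)

for post in iter_submissions("SaaS_submissions.zst"):  # example file name
    print(post.get("title"))
```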
Tracking a website's status
The data problem was now solved - extracting all links from all these Reddit posts took no more than a few seconds.
But how do I tell if a website is still operating or not? There are a few approaches I thought of:
Pinging the domain
This doesn't really work because in most cases, the IP behind a domain is still connected to the internet.
For example, every site on Cloudflare will respond to a ping because of Cloudflare's DNS proxying.
Sending a GET request
This is too network-intensive and could take hours to complete.
You essentially receive all the data (in most cases) that a normal browser would when visiting a website.
The solution - a HEAD request
The `HEAD` request is basically the same as `GET`, but the response only contains the returned headers - the response body is completely empty.
I ran it on my 500 Mbps server and it was done in about 2 minutes.
You can accurately determine if a website is operational if the response code is either `200` (OK) or a `301`, `302`, `307` or `308` (redirect).
If a request times out, has an SSL issue or any other connection issue, the website is considered abandoned.
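A simplified version of that check could look like the sketch below - the concurrency setup and exact parameters are my own choices rather than a verbatim copy of the script (it reuses the `domains` set from earlier):

```python
import concurrent.futures
import requests

OK_CODES = {200, 301, 302, 307, 308}

def check(domain):
    """HEAD the domain and report whether it looks operational."""
    try:
        resp = requests.head(f"https://{domain}", timeout=5, allow_redirects=False)
        return domain, resp.status_code in OK_CODES
    except requests.RequestException:
        # Timeouts, SSL errors and other connection issues count as abandoned.
        return domain, False

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
    status = dict(pool.map(check, domains))
```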
I also wrote a little algorithm to determine whether a redirect goes to a `www` subdomain or to a different page on the same domain. If it doesn't, that indicates a reused or expired domain or a spammy ad website, which essentially means the website is non-operational.
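Roughly, the check compares the redirect's `Location` header against the original domain - something like this (again a sketch, not the exact code):

```python
from urllib.parse import urlparse

def redirect_stays_on_site(domain, location):
    """True if a redirect target is the same domain (or its www. subdomain)."""
    target = urlparse(location).netloc.lower()
    if not target:  # relative redirect like "/login" stays on the same site
        return True
    return target.removeprefix("www.") == domain.removeprefix("www.")

# redirect_stays_on_site("example.com", "https://www.example.com/home") -> True
# redirect_stays_on_site("example.com", "https://some-ad-network.biz/") -> False
```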
Results
Here's a breakdown of the results from a few different angles.
Up | Abandoned | Redirect | Errors |
---|---|---|---|
10080 | 5423 | 1495 | 1169 |
Roughly 40% of all websites are abandoned or otherwise non-operational.
Domain | Total | Up | Down |
---|---|---|---|
.com | 9600 | 5458 | 4142 |
.io | 1873 | 1061 | 812 |
.co | 1027 | 491 | 536 |
.ai | 514 | 389 | 125 |
.net | 307 | 152 | 155 |
End with "ify" | 99 | 55 | 44 |
Start with "go" | 98 | 47 | 51 |
Start with "try" | 55 | 19 | 36 |
.io, .co and .ai are the most popular .com alternatives for startups.
Year | Total Links | Total Upvotes | Avg. Upvotes |
---|---|---|---|
2018 | 956 | 7478 | 7.82 |
2019 | 2055 | 16720 | 8.14 |
2020 | 3017 | 30412 | 10.08 |
2021 | 2951 | 38054 | 12.9 |
2022 | 2307 | 18133 | 7.86 |
2023 | 5753 | 26614 | 4.63 |
Starting an online business is more common than ever, yet people are upvoting less and less. This trend could be due to the influx of blatantly AI-written posts and bot-generated content.
Resources and Conclusion
You can find all the domains and data on Google Sheets, along with some additional information and statistics.
The past 2 days have been fun. I learned many things about how Reddit stores billions of posts efficiently, and I even learned how to parse an RSS stream - somehow, I'd never parsed one in about 8 years of programming.