Scraping Twitter with Ruby

Twitter changed their service to require log-in. I took the opportunity to replace my old way of scraping with a new way.

The old way

So - the old way was made up of a few steps:

1. Fetch a timeline served by a local(host) nitter instance as HTML5.
2. Parse the timeline with BeautifulSoup and extract the text/image in each tweet.
3. For each tweet, hash the text as a simple means to see if the tweet is new.
4. If it isn't, skip it. If it is, convert the tweet to a Misskey Note (and attachment, if there's an image).
5. If the language isn't English, push it through Deepl's API.
6. Push the note to Misskey as the user I'm mirroring, with their own API key.

This depended on a few things. Firstly, a working nitter instance. Originally I made use of a pool of instances, but eventually set up my own instance to avoid rate-limiting. Nitter, as of this post, is waiting to see where things fall. The workarounds I've used to scrape Twitter largely come from their issue tracker, so I do expect they'll find a way.

Secondly, a database to store the tweet text, hash, and translation (if applicable). For this I used SQLite. Really, I could skip hashing and instead go by timestamp on the tweet, but that has issues (I'll return to this later).

Thirdly, a Deepl API key. Most of the tweets I scrape are in English, so I make very few requests and can survive on their free plan. If not for the absurd cost of electricity right now, I'd be pushing this through NLLB [1] on my workstation instead.

[1] https://github.com/facebookresearch/fairseq/tree/nllb

Finally, a working Misskey instance. I run my own, but given all communication happens over HTTPS with an API key, this could foreseeably work on any instance.

This was all written in Python 3 with BeautifulSoup for parsing and Misskey.py for posting.

The new way

Fewer steps:

1. Scrape Twitter directly.
2. Parse the timeline with Nokogiri and return a list of Tweet objects. Slightly less user friendly than bs4.
3. Again, hash the text to determine newness.
4. Translate if required.
5. Tweet.to_note or Tweet.to_note_with_attachment.
6. Push to Misskey.
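Roughly, the loop ties together like this. The Tweet accessors beyond to_note/to_note_with_attachment, and the new_tweet?/translate/push_note helpers, are illustrative stand-ins rather than the real names:

    # Sketch of the mirror loop; helper names are stand-ins for the
    # real implementations described in the sections below.
    TweetScraper.scrape_bearer(user).each do |tweet|
      next unless new_tweet?(tweet.text)                    # step 3: hash-based newness check
      tweet.text = translate(tweet.text) unless tweet.lang == 'en'  # step 4
      note = tweet.image ? tweet.to_note_with_attachment : tweet.to_note  # step 5
      push_note(note)                                       # step 6: POST to Misskey
    end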

Scraping

Scraping directly is hacky. So hacky, I have three different TweetScraper.scrape_x methods.

The first, TweetScraper.scrape_synd, used the now-disabled syndication endpoint. The syndication endpoint is kind enough to contain the timeline of a user as JSON, so once I'd found where that JSON sat in the document and pulled it out with Nokogiri, it wasn't difficult.
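The shape of it was roughly the following. The endpoint is dead now, and the exact URL and script selector here are from memory, so treat both as assumptions:

    # Rough shape of scrape_synd; the URL and selector are assumptions.
    require 'http'
    require 'nokogiri'
    require 'json'

    def scrape_synd(user)
      html = HTTP.get("https://syndication.twitter.com/srv/timeline-profile/screen-name/#{user}").to_s
      # The timeline sat in a JSON blob embedded in a <script> tag.
      raw = Nokogiri::HTML(html).at_css('script#__NEXT_DATA__').text
      JSON.parse(raw)
    end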

TweetScraper.scrape_login was far more work – logging into Twitter, fetching the timelines as an authenticated user, and then pulling the content. It worked, but I really couldn't be bothered with images, and every now and then I'd get back valid HTML without content. Frankly, a lot of it is guesswork, poking, and prodding. I imagine nitter will look into credentials; while it doesn't help public instances, it would make life easier for people like me who run their own.

TweetScraper.scrape_bearer was a hack, and the one I’m currently depending on. Some bright spark pulled endpoints from the Android application (or maybe iOS?) and cobbled a bash script that fished tweets out of timelines with jq. After taking a closer look at what data was being returned, I found that (maybe unsurprisingly), the JSON structure was a superset of the structure I was pulling with TweetScraper.scrape_synd. Adapting scrape_synd gave me a working drop-in.
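For illustration, the request flow looks something like the below. The bearer token and the timeline endpoint are placeholders patterned on that script's approach, not values I can vouch for here:

    # Sketch of scrape_bearer; BEARER and the endpoint are placeholders.
    require 'http'
    require 'json'

    BEARER = ENV.fetch('TWITTER_BEARER') # token fished out of the mobile app

    def scrape_bearer(user_id)
      # Activate a guest session against the app's bearer token...
      guest = JSON.parse(
        HTTP.auth("Bearer #{BEARER}")
            .post('https://api.twitter.com/1.1/guest/activate.json').to_s
      )['guest_token']

      # ...then fetch the timeline with both tokens attached.
      raw = HTTP.auth("Bearer #{BEARER}")
                .headers('x-guest-token' => guest)
                .get("https://api.twitter.com/2/timeline/profile/#{user_id}.json").to_s
      JSON.parse(raw) # a superset of the syndication JSON, so the old parsing carries over
    end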

Hashing

There are two ways to determine if a tweet is new. You can go by timestamp, or you can go by content. Going by timestamp introduces more work in terms of parsing the timeline - you're now having to find the timestamp of that tweet, localise it to your timezone, normalise it to ISO, and store the latest stamp either implicitly (if cron runs every 15 minutes, only process stamp > now - 15m) or explicitly (in a file). Going by content has the obvious caveats of storing more, and losing tweets with duplicate content.

That is to say you can do it right, or you can do it fast. Given I’m on my third iteration of this in a weekend, I’m sticking with fast.
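Fleshing out the new_tweet? check from the sketch above - the schema, the file name, and the choice of SHA-256 are mine for illustration:

    # Hash-based newness check backed by SQLite; schema is illustrative.
    require 'digest'
    require 'sqlite3'

    DB = SQLite3::Database.new('tweets.db')
    DB.execute('CREATE TABLE IF NOT EXISTS tweets (hash TEXT PRIMARY KEY, text TEXT, translation TEXT)')

    def new_tweet?(text)
      digest = Digest::SHA256.hexdigest(text)
      return false unless DB.execute('SELECT 1 FROM tweets WHERE hash = ?', [digest]).empty?
      DB.execute('INSERT INTO tweets (hash, text) VALUES (?, ?)', [digest, text])
      true
    end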

Deepl

Someone made a Gem that handles this. Admittedly, I could have just written it myself (it is just HTTP requests). But I'm thankful I saved the time and could spend it working out scraping instead.
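Assuming the Gem in question is deepl-rb, usage is about as minimal as it gets:

    # Translate via Deepl's API; assumes the deepl-rb Gem.
    require 'deepl'

    DeepL.configure { |config| config.auth_key = ENV.fetch('DEEPL_AUTH_KEY') }

    def translate(text)
      # A nil source language lets Deepl detect it; target is English.
      DeepL.translate(text, nil, 'EN').text
    end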

Pushing to Misskey

Whereas with Python I had Misskey.py, I couldn’t find an appropriate Gem for Misskey in Ruby. I found that slightly odd - Ruby does have market share in Japan. But it is what it is. I was quite happy to find that Misskey’s API documentation is actually served by all instances on a fixed endpoint (very nice), and it didn’t take long to knock up a method to push.

One thing that did bite me here - Ruby isn't as popular as Python, and as a result, information on the web is generally outdated, wrong, or not helpful. To push attachments to a Misskey drive, it is necessary to send the payload as a multipart. net/http does support this, despite old answers on Stack Exchange indicating otherwise. Nevertheless, the intricacies of using net/http for this escaped me, and I ended up rewriting my HTTP methods to use the http Gem throughout.
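With the http Gem the two pushes collapse into very little code. The instance URL and key are placeholders; the endpoints (notes/create, drive/files/create, and the i key parameter) come from Misskey's API documentation:

    # Pushing to Misskey with the http Gem: JSON for notes,
    # multipart for drive uploads. URL and key are placeholders.
    require 'http'
    require 'json'

    MISSKEY = 'https://misskey.example.com'
    API_KEY = ENV.fetch('MISSKEY_API_KEY') # the mirrored user's own key

    def push_note(text, file_ids: [])
      HTTP.post("#{MISSKEY}/api/notes/create",
                json: { i: API_KEY, text: text, fileIds: file_ids })
    end

    def upload_attachment(path)
      # The multipart bit that stumped me under net/http.
      res = HTTP.post("#{MISSKEY}/api/drive/files/create",
                      form: { i: API_KEY, file: HTTP::FormData::File.new(path) })
      JSON.parse(res.to_s)['id'] # drive file id, fed into fileIds above
    end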

I'm not happy with the state of Ruby. It isn't, I think, Ruby's fault. Ruby seems to do everything right in my books. But there's a real feeling that a lot of useful code has been left on the vine for too long, and searching rubygems.org only to see that the code you want hasn't been updated since 2017 is disheartening. The mRuby side of the fence is perhaps worse, with some modules not even building on current mRuby.

Summary

I like Ruby, I don't dislike Python. Migrating my shitty script from Python to Ruby wasn't difficult, and I've (at least for now) removed nitter as a step in the chain. What I have now is arguably easier to maintain, as I've bothered to codify the structures I expect to use as input/output, and that will probably help as I implement another three ways to scrape Twitter.