snac2 and selenium
Many, many, many moons ago - before Elon Musk bought Twitter - I used to scrape my own timeline with Nokogiri and repost everything except the adverts to Pleroma.
Eventually, I migrated away from Pleroma to Misskey. This was down to a botched upgrade and an unwillingness to wrap my head around the Erlang ecosystem. I chose Misskey over Mastodon as it worked with naught more than a database (PostgreSQL) and a reverse proxy (nginx).
Problems arose on both sides as time went on - Misskey migrated from npm to pnpm, and to a newer version of Node.js than Debian stable provided. Twitter was bought, and timelines were locked down. For a short period Nitter continued to work, until Musk shut down the undocumented interfaces it depended on.
Projects like twscrape filled the gap, but eventually, I found even those fell prey to Cloudflare's blocking.
1. Scraping Twitter
Unsurprisingly, the best way to scrape Twitter in 2026 appears to be emulating the behaviour of a client through browser automation.
Selenium remains the best-supported option for puppeteering browsers, and enjoys complete support in Ruby-space. Code examples are plentiful, and it isn't too much work to jot down the selectors and fields required to go from the Twitter login page to digging the text out of posts on a timeline.
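Standing up a driver takes only a few lines. Here's a minimal sketch of the setup that the @driver in the excerpts below assumes - headless Firefox is my choice here, though any Selenium-supported browser works:

require "selenium-webdriver"

options = Selenium::WebDriver::Firefox::Options.new
options.add_argument("-headless")

# @driver in the excerpts below is an instance along these lines
@driver = Selenium::WebDriver.for(:firefox, options: options)
@driver.navigate.to "https://x.com/login"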
The pipeline beyond that point is the usual affair: filter the posts to remove gambling adverts (and, as seems to be the case these days, political rage-bait), fetch images, resolve short-links, and insert into SQLite for further processing.
As a practical example, here's an excerpt from my code, warts and all:
Tweet = Struct.new(:username, :stamp, :text, :hash, :image_url, :image_path, :links)

# [...] snip

def parse_timeline(username)
  # pagination
  @driver.navigate.refresh
  sleep 15.7
  @driver.action.send_keys(:page_down).perform
  sleep 12.47
  @driver.action.send_keys(:page_down).perform
  sleep 7.21
  @driver.action.send_keys(:page_up).perform
  sleep 10.8

  timeline = @driver.find_element(css: '[aria-label^="Timeline"]')
  all_tweets = timeline.find_elements(css: '[data-testid="tweet"]')

  # twitter inserts shite into timelines. Filter on the username header:
  real_tweets = all_tweets.select do |t|
    a_tag = t.find_element(css: "a[href*='/#{username}']") rescue nil
    href = a_tag.attribute("href") rescue nil
    href == "https://x.com/#{username}"
  end

  sleep 13.3

  tweets = real_tweets.reverse.map do |tweet|
    # text
    text = tweet.find_element(css: '[data-testid="tweetText"]').text rescue nil

    # image URLs - discard all but the first for now.
    image_url = tweet.find_element(css: '[alt="Image"]').attribute("src").split("&name").first rescue nil
    image_path = nil
    image_path = ("/tmp/" + image_url.gsub(/\?format=/, ".").split("/").last).gsub("\0", "") if image_url

    # links
    links = tweet.find_elements(tag_name: 'a') rescue nil
    links = links.map {|link| link.attribute("href") rescue nil} if links
    # sometimes we hit a link to twitter's terms of service - just drop those.
    links = links.reject {|url| url.nil? or url.include?("x.com")} if links

    # resolve redirects, as twitter shortens links
    if links
      links = links.map do |link|
        res = HTTP.get(link)
        res.status == 301 ? res.headers["Location"] : link
      end
    end

    # and sometimes people only post an image.
    if text.nil? and image_url
      text = image_url
    end

    # package everything into a nice little struct, with a hash of the text to serve as a unique
    # identifier (as we don't have visibility of twitter's own ID, nor accurate timestamp).
    Tweet.new(username, Time.now.to_i, text, Digest::SHA256.hexdigest(text), image_url, image_path, links)
  end

  tweets
end
I should really break the different content types into their own methods. Pagination too. Oh well.
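For completeness, the "insert into SQLite" step mentioned above looks roughly like this - a sketch using the sqlite3 gem, with an illustrative table layout rather than my actual schema:

require "sqlite3"

db = SQLite3::Database.new("tweets.db")
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS tweets (
    hash       TEXT PRIMARY KEY,
    username   TEXT,
    stamp      INTEGER,
    text       TEXT,
    image_url  TEXT,
    image_path TEXT,
    links      TEXT
  )
SQL

tweets.each do |t|
  # INSERT OR IGNORE - the text hash doubles as a dedupe key across runs.
  db.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?, ?, ?)",
             [t.hash, t.username, t.stamp, t.text,
              t.image_url, t.image_path, (t.links || []).join(" ")])
end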
Images are fetched by loading the image into the virtual browser and taking a screenshot. TS is a TwitterScraper class that encapsulates the Selenium instance. I'm simply dumping files in /tmp and checking if they're already there, to avoid repeat requests.
if tweet.image_url
  if File.exist?(tweet.image_path)
    puts "#{tweet.hash} => Image #{tweet.image_url} already at #{tweet.image_path}"
  else
    puts "#{tweet.hash} => Fetching #{tweet.image_url}"
    TS.nav_to(tweet.image_url)
    TS.save_image_get(tweet.image_url, tweet.image_path)

    if File.exist?(tweet.image_path)
      puts "#{tweet.hash} => Fetch OK -- file written to #{tweet.image_path}"
    else
      puts "#{tweet.hash} => Fetch NOT OK -- will continue without image"
    end
  end
end
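A hypothetical sketch of the two TS methods used above - the real class carries more state (login, cookies), but the core is small. This assumes Selenium 4, which supports per-element screenshots:

class TwitterScraper
  def initialize(driver)
    @driver = driver
  end

  def nav_to(url)
    @driver.navigate.to(url)
    sleep 6.5 # give the image time to render
  end

  def save_image_get(url, path)
    # screenshot just the <img> element rather than the whole viewport,
    # so we don't capture the backdrop twitter wraps images in.
    img = @driver.find_element(tag_name: "img")
    img.save_screenshot(path)
  rescue Selenium::WebDriver::Error::NoSuchElementError
    nil # caller checks File.exist? and carries on without the image
  end
end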
For some individuals I follow, I often need to pass the text through DeepL. This happens on the second leg of the process, before the post is mirrored to ActivityPub. DeepL offer a free tier provided you make an account, and I've yet to hit its limits. src_lang is a property of an account, stored alongside the twitter handle in SQLite (with the obvious limitation that polylingual accounts get mangled). Specifying the source language like this does, however, avoid common issues such as DeepL detecting Japanese as Chinese.
final_text = tweet.text
response = DeepL.translate(tweet.text, src_lang, 'EN')

if response.detected_source_language == 'EN'
  puts "#{tweet.hash} => Skipping translation as already English"
else
  puts "#{tweet.hash} => Translated from #{src_lang} to EN"
  final_text = response.text
end
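For reference, the only setup the deepl-rb gem needs is the auth key from that free-tier account - a sketch, assuming the key is kept in an environment variable:

require "deepl"

DeepL.configure do |config|
  # the auth key from the free-tier account; kept out of the source tree
  config.auth_key = ENV["DEEPL_AUTH_KEY"]
end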
2. Replacing Misskey
This leads to the question of which ActivityPub server to run. While it is perfectly possible to run Misskey on GuixSD, it isn't currently packaged (and is unlikely to be packaged soon, given the vast quantity of Node.js packages it leans upon).
While Misskey has a fantastic responsive web interface, there's no shortage of wonderful free-software clients available for Android and iOS:
- Tusky (which I use)
- SubwayTooter
- Fedilab
- Moshidon
Which leads to the aforementioned snac2. snac2 implements enough of ActivityPub (and of the Mastodon client API) to work with the above clients, and unlike Misskey, it is entirely self-contained. Perhaps most importantly (to me), it's also packaged in Guix!
2.1. Snac2 Setup
I've yet to properly wrap snac2 in a Herd service, but going from zero to a web interface is as simple as running a scant few commands.
To initialize snac's data store and initial configuration:
> snac init /var/snac/data
# Network address [127.0.0.1]:
# Network port [8001]:
# Host name: snac.example.com
# URL prefix:
# Admin email address (optional):
# Done.
And to add your first user:
> snac adduser /var/snac/data patrick
# Creating RSA key...
# Done.
# User password is X
# Go to https://snac.example.com/yourname and continue configuring your user there.
And finally, to start the HTTP server:
> snac httpd /var/snac/data
You may refer to the snac2 documentation for guidance on setting up nginx location blocks.
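For the impatient, a minimal sketch of the reverse-proxy half - the address and port match the init answers above, but treat snac's own documentation as authoritative:

server {
    listen 443 ssl;
    server_name snac.example.com;
    # TLS certificate directives omitted

    location / {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}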
Of course, if you prefer to fully replicate twitter users - compartmentalising their tweets and giving each mirrored user an individual timeline - just add further users with the adduser command.
2.2. API Access
If you prefer to make posts for your users via the API instead of through the snac note command, you'll need to log in as the individual users and navigate to https://$SNAC_HOST/oauth/x-snac-get-token.
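With a token in hand, posting goes through the Mastodon-compatible API snac2 exposes for the clients above - a minimal sketch with the http gem from earlier. The host and token file path are placeholders for my setup; I just paste in the value from the x-snac-get-token page:

require "http"

SNAC_HOST = "snac.example.com"
TOKEN = File.read("/var/snac/tokens/patrick").strip # hypothetical path

def snac_post(text)
  HTTP.auth("Bearer #{TOKEN}")
      .post("https://#{SNAC_HOST}/api/v1/statuses", form: { status: text })
end

# e.g. the final_text produced by the DeepL leg above
snac_post(final_text)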
The tokens last forever, and changing profile pictures/banners/biographies is best done in the web interface anyway - so I don't think that's worth automating.