snac2 and selenium

Many, many, many moons ago - before Elon Musk bought twitter - I used to scrape my own timeline with Nokogiri and repost everything except the adverts to Pleroma.

Eventually, I migrated away from Pleroma to Misskey. This was down to a botched upgrade and an unwillingness to wrap my head around the Erlang ecosystem. I chose Misskey over Mastodon as it worked with naught more than a database (PostgreSQL) and a reverse-proxy (nginx).

Problems arose on both sides as time went on - Misskey migrated from npm to pnpm, and to a newer version of NodeJS than Debian stable provided. Twitter was bought, and timelines locked down. For a short period, Nitter continued to work - until Musk shut down the undocumented interfaces it depended on.

Projects like twscrape filled the gap, but eventually, I found even those fell prey to Cloudflare's blocking.

1. Scraping Twitter

Unsurprisingly, the best way to scrape twitter in 2026 appears to be emulating the behaviour of a client through browser automation.

Selenium remains the most well-supported option for puppeteering browsers, and has complete support in Ruby-space. Code examples are plentiful, and it isn't too much work to jot down the selectors and fields required to go from the twitter login page to digging the text out of posts on a timeline.
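
As a taster, here's a minimal sketch of the login leg. The CSS selectors and environment variable names are assumptions, and will need checking against the live page:

require 'selenium-webdriver'

driver = Selenium::WebDriver.for(:firefox)
driver.navigate.to('https://x.com/login')
sleep 10

# username first - the password prompt only appears afterwards
driver.find_element(css: 'input[autocomplete="username"]')
      .send_keys(ENV['TWITTER_USER'], :return)
sleep 5
driver.find_element(css: 'input[autocomplete="current-password"]')
      .send_keys(ENV['TWITTER_PASS'], :return)
sleep 10

# from here, navigate to a profile and start digging out posts
driver.navigate.to('https://x.com/example')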

The pipeline beyond that point is much as it always was: filter the posts to remove gambling adverts (and, as seems to be the case these days, political rage-bait), fetch images, resolve short-links, and insert everything into sqlite for further processing.

As a practical example, here's an excerpt from my code - warts and all:

require 'selenium-webdriver'
require 'http'
require 'digest'

Tweet = Struct.new(:username, :stamp, :text, :hash, :image_url, :image_path, :links)
# [...] snip

def parse_timeline(username)
  # pagination - scroll to trigger lazy loading, with jittered waits
  @driver.navigate.refresh
  sleep 15.7
  @driver.action.send_keys(:page_down).perform
  sleep 12.47
  @driver.action.send_keys(:page_down).perform
  sleep 7.21
  @driver.action.send_keys(:page_up).perform
  sleep 10.8 

  timeline = @driver.find_element(css: '[aria-label^="Timeline"]')
  all_tweets = timeline.find_elements(css: '[data-testid="tweet"]')

  # twitter inserts shite into timelines. Filter on the username header:
  real_tweets = all_tweets.select do |t|
    a_tag = t.find_element(css: "a[href*='/#{username}']") rescue nil
    href = a_tag.attribute("href") rescue nil
    href == "https://x.com/#{username}"
  end

  sleep 13.3
  tweets = real_tweets.reverse.map do |tweet|
    # text
    text = tweet.find_element(css: '[data-testid="tweetText"]').text rescue nil

    # image URLs - discard all but the first for now.
    image_url = tweet.find_element(css: '[alt="Image"]').attribute("src").split("&name").first rescue nil
    image_path = nil
    image_path = ("/tmp/" + image_url.gsub(/\?format=/, ".").split("/").last).gsub("\0", "") if image_url

    # links
    links = tweet.find_elements(tag_name: 'a') rescue nil
    links = links.map {|link| link.attribute("href") rescue nil} if links
    # sometimes we hit a link to twitter's terms of service - just drop those.
    links = links.reject {|url| url.nil? or url.include?("x.com")} if links

    # resolve redirects, as twitter shortens links
    if links
      links = links.map do |link|
        res = HTTP.get(link)
        res.status == 301 ? res.headers["Location"] : link
      end
    end

    # and sometimes people only post an image.
    if text.nil? and image_url
      text = image_url
    end

    # package everything into a nice little struct, with a hash of the text to serve as a unique
    # identifier (as we don't have visibility of twitter's own ID, nor accurate timestamp).
    Tweet.new(username, Time.now.to_i, text, Digest::SHA256.hexdigest(text), image_url, image_path, links)
  end

  tweets
end

I should really break the different content types into their own methods. Pagination too. Oh well.
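
The sqlite end is nothing special either. A minimal sketch of the insert step, assuming a tweets table keyed on the text hash (the schema here is illustrative, not my exact one):

require 'sqlite3'

db = SQLite3::Database.new('tweets.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS tweets (
    hash TEXT PRIMARY KEY, username TEXT, stamp INTEGER,
    text TEXT, image_url TEXT, image_path TEXT, links TEXT
  )
SQL

# INSERT OR IGNORE means re-scraped tweets are dropped on the hash key
tweets.each do |t|
  db.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?, ?, ?)',
             [t.hash, t.username, t.stamp, t.text, t.image_url,
              t.image_path, (t.links || []).join(' ')])
end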

Images are fetched by loading the image into the virtual browser and taking a screenshot. TS is an instance of a TwitterScraper class that encapsulates the Selenium driver. I'm simply dumping files in /tmp, and checking whether they're already there to avoid repeat requests.

if tweet.image_url
  if File.exist?(tweet.image_path)
    puts "#{tweet.hash} => Image #{tweet.image_url} already at #{tweet.image_path}"
  else
    puts "#{tweet.hash} => Fetching #{tweet.image_url}"
    TS.nav_to(tweet.image_url)
    TS.save_image_get(tweet.image_url, tweet.image_path)
    if File.exist?(tweet.image_path)
      puts "#{tweet.hash} => Fetch OK -- file written to #{tweet.image_path}"
    else
      puts "#{tweet.hash} => Fetch NOT OK -- will continue without image"
    end
  end
end
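
save_image_get itself is little more than a screenshot. A sketch of the idea, assuming the class shape implied above (an @driver ivar, with nav_to having already loaded the bare image URL):

class TwitterScraper
  # [...] snip
  def save_image_get(url, path)
    # the page now contains just the image, so a full-window
    # screenshot is close enough for timeline purposes
    @driver.save_screenshot(path)
  rescue StandardError => e
    puts "screenshot of #{url} failed: #{e.message}"
  end
end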

For some individuals I follow, I often need to pass the text through DeepL. This happens on the second leg of the process, before the post is mirrored to ActivityPub. DeepL offers a free tier provided you make an account, and I've yet to hit a limit with it. src_lang is a property of an account, stored alongside their twitter handle in sqlite (with the obvious limitation that polylingual accounts end up mangled).

Specifying the source language like this does, though, avoid common issues such as DeepL detecting Japanese as Chinese.

require 'deepl'
DeepL.configure { |config| config.auth_key = ENV['DEEPL_AUTH_KEY'] } # free-tier key

final_text = tweet.text
response = DeepL.translate(tweet.text, src_lang, 'EN')
if response.detected_source_language == 'EN'
  puts "#{tweet.hash} => Skipping translation as already English"
else
  puts "#{tweet.hash} => Translated from #{src_lang} to EN"
  final_text = response.text
end

2. Replacing Misskey

This leads to the question of which ActivityPub server to run. While it is perfectly possible to run Misskey on GuixSD, it isn't currently packaged (and is unlikely to be packaged soon, given the vast quantity of NodeJS packages it leans upon).

While Misskey has a fantastic responsive web interface, there's also no shortage of wonderful free-software ActivityPub clients for Android and iOS.

Which leads to the aforementioned snac2. snac2 implements enough of the ActivityPub spec to work with those clients, and unlike Misskey, it is entirely self-contained. Perhaps most importantly (to me), it's also packaged in Guix!

2.1. Snac2 Setup

I've yet to properly wrap snac2 in a Shepherd service, but going from zero to a web interface is as simple as running a scant few commands.

To initialize snac's data store and initial configuration:

> snac init /var/snac/data
# Network address [127.0.0.1]:
# Network port [8001]:
# Host name: snac.example.com
# URL prefix:
# Admin email address (optional):
# Done.

And to add your first user:

> snac adduser /var/snac/data patrick
# Creating RSA key...
# Done.

# User password is X

# Go to https://snac.example.com/yourname and continue configuring your user there.

And finally, to start the HTTP server:

> snac httpd /var/snac/data

You may refer to the snac2 documentation for guidance on setting up nginx location blocks.
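
For the impatient, the shape of it is a plain reverse-proxy - something along these lines, with the hostname and port matching the init answers above (a sketch; the real documentation covers details I'm glossing over):

server {
    listen 443 ssl;
    server_name snac.example.com;

    # adjust certificate paths to taste
    ssl_certificate     /etc/letsencrypt/live/snac.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/snac.example.com/privkey.pem;

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://127.0.0.1:8001;
    }
}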

Of course, if you prefer to fully replicate twitter users and properly compartmentalise their tweets (an individual timeline for each mirrored user), just add further users with the adduser command.

2.2. API Access

If you prefer to make posts for your users via the API instead of through the snac note command, you'll need to log in as each individual user and navigate to https://$SNAC_HOST/oauth/x-snac-get-token.

The tokens last forever, and changing profile pictures/banners/biographies is best done in the web interface anyway - so I don't think it's worth automating.
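
With a token in hand, posting is a single request against snac2's Mastodon-compatible API. A sketch with the http gem - the endpoint is the standard Mastodon one, but treat the details as assumptions to verify against the snac2 docs:

require 'http'

SNAC_HOST  = 'snac.example.com'
SNAC_TOKEN = ENV['SNAC_TOKEN'] # from /oauth/x-snac-get-token

def post_note(text)
  HTTP.auth("Bearer #{SNAC_TOKEN}")
      .post("https://#{SNAC_HOST}/api/v1/statuses",
            form: { status: text })
end

post_note(final_text)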

Date: 2026-02-21 Sat 00:00

Author: Patrick

Emacs 30.2 (Org mode 9.7.11)
