Python Nitter/DeepL/Misskey Glue
I don’t really like Python. As a language. Whitespace as a way to delimit blocks of code is not my cup of tea. But I still use it, and one of the ways I’ve used it recently is to hook a few services together.
Nitter
Nitter is a self-hosted interface to Twitter that doesn’t require JavaScript to use. Interestingly, it’s written in Nim, which is a language I know nothing about and would like to try one day. But the main point is that it provides a plain HTML interface to Twitter, which is always going to be far less work to scrape than Twitter itself.
DeepL
DeepL is a translation service, free with paid options. It has a simple API, and a Python module that takes all of the work you’d otherwise have to do and wraps it up in straightforward functions.
Misskey
Misskey is one of a few ActivityPub implementations; the servers running these implementations collectively make up the “Fediverse”. I have no real interest in federated social media, but ActivityPub stands as a natural interface into which I can push content pulled from Twitter by means of scraping Nitter.
There are a few other implementations with larger adoption than Misskey. From my understanding, Misskey is largely the work of an individual programmer in Japan, and it makes a number of incompatible changes that limit the native applications that can be used with it. For the moment, ‘MilkTea’ and ‘SubwayTooter’ are the only two clients that support it.
I could run Mastodon, and I have run Pleroma. But the interface for Misskey is miles ahead of both in my own opinion, and my needs are quite minimal, so the interface was the deciding factor between competing implementations.
nitter-misskey.py
This isn’t the full script, but I’ll at least include the core logic:
Configuration is provided in a TOML file passed as an argument to the script. The file, at a minimum, contains enough information to authenticate to the correct Misskey instance, to know which Twitter handle to scrape through which Nitter instance, what DeepL API key to use, and where the SQLite database for that handle lives.
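A hypothetical config might look something like this - the [app] keys match what the script reads below, while the other section and key names are illustrative:

# Only [app] appears in the excerpted code; the rest is a sketch of the layout
[app]
url = "https://misskey.example.org"
token = "xxxxxxxxxx"

[nitter]
url = "http://localhost:8080"
timeline = "somehandle"

[deepl]
key = "xxxxxxxxxx"
translate = true

[sqlite]
db = "/var/lib/nitter-misskey/somehandle.db"

The top of the script then just pulls those values out: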
import sys
import time
import requests
import tomllib  # stdlib in Python 3.11+; use the 'toml' package on older versions

# Configuration comes from the TOML file passed as the first argument
with open(sys.argv[1], "rb") as f:
    config = tomllib.load(f)
# Misskey app config
app_config = config["app"]
miss_api = app_config["url"]
miss_token = app_config["token"]
# Nitter config
# DeepL config
# Sqlite config
# And so on
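The mk client object used further down isn’t constructed in this excerpt. Assuming the Misskey.py module, which provides the drive_files_create and notes_create calls that appear later, it would look something like this:

from misskey import Misskey

# Hypothetical construction of the client used in the rest of the script,
# built from the [app] values read above
mk = Misskey(miss_api, i=miss_token)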
We start by fetching the Nitter page for the specified handle, checking the return code to make sure we’re fine to go ahead. If we get back a 429, which would be odd for a locally-hosted instance, we back off.
try:
    print(f"[ GET ] Performing GET on {source_url}/{source_timeline}")
    body = requests.request("GET", f"{source_url}/{source_timeline}")
except Exception as e:
    print(f"[ GET ] Failed to fetch {source_url}/{source_timeline}")
    print(e)
    sys.exit(1)

while body.status_code == 429:
    print("[ GET ] Got a 429, backing off before retrying")
    time.sleep(60)  # arbitrary back-off interval
    body = requests.request("GET", f"{source_url}/{source_timeline}")

if body.status_code != 200:
    print(f"[ GET ] Unexpected status code {body.status_code}")
    sys.exit(1)
Presuming we have a valid body of HTML, we use BeautifulSoup to parse the contents. Ideally, Nitter will eventually provide an API for this, rather than requiring matching on specific div classes. Some handling is implemented to avoid catching retweets. The text of the tweet is hashed and checked against SQLite to ensure we don’t process tweets we’ve already handled. If an image was attached, we fetch the full image and push it to Misskey’s ‘Drive’, which is a per-user storage area for files. We record the ID Misskey returns so we can use it in the post we’re about to make.
Images were hashed when this worked with Pleroma, but Misskey can handle that for us now.
soup = BeautifulSoup(body.text, "html.parser")
all_elements = soup.find_all("div", {"class": "timeline-item"})
for timeline_item in reversed(all_elements):
    if timeline_item.find("div", {"class": "retweet-header"}):
        print("[ GET ] Skipping retweet")
        continue
    post = timeline_item.find("div", {"class": "tweet-content media-body"})
    post_body = post.get_text()
    post_hash = get_text_hash(post_body)
    if len(post_body) > 1 and is_new(post_hash):
        print(f"[ NEW ] {post_hash} => {post_body}")
        image_obj = timeline_item.find("a", {"class": "still-image"})
        if image_obj:
            print("[ IMG ] Fetching image attachment")
            img_url = source_url + urllib.parse.unquote(image_obj["href"])
            image_req = requests.request("GET", img_url)
            image_file = image_req.content
            print("[ IMG ] Fetched image attachment")
            image_result = mk.drive_files_create(io.BytesIO(image_file))
            attachment_id = image_result["id"]
            print(f"[ IMG ] Pushed image to Misskey: {image_result}")
            has_image = True
        else:
            has_image = False
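Neither get_text_hash nor is_new made it into the excerpt. A minimal sketch of the idea, assuming the same table layout as the INSERT statement further down:

import hashlib
import sqlite3

conn = sqlite3.connect(db_path)  # db_path would come from the TOML config
c = conn.cursor()

def get_text_hash(text):
    # A stable digest of the tweet text, used as the duplicate-detection key
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_new(text_hash):
    # True if we haven't already recorded this hash for the handle
    c.execute(f"SELECT 1 FROM {source_timeline} WHERE orig_hash = ?", (text_hash,))
    return c.fetchone() is None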
If translation is enabled for the handle - say we’re trying to mirror the tweets of the sax player Takeshi Itoh, who posts a lot in Japanese - we pass the tweet through DeepL. I have no interest in untranslated tweets, and the people I follow who speak a foreign language rarely mix languages on Twitter, so we throw an exit when DeepL isn’t working. Given Misskey seems to have translation as a feature (also using DeepL), it seems likely I’ll bin this rather fragile code.
def translate(orig_text):
    print(f"Fetching translation for: {orig_text}")
    try:
        translator = deepl.Translator(deepl_key)
        translation_result = translator.translate_text(orig_text, target_lang="EN-GB")
        return translation_result.text
    except Exception as e:
        print(f"Failed in fetching translation for: {orig_text}")
        print(e)
        sys.exit(1)
if translate_tweets:
    print("[ TRANS ] Calling translation API")
    final_body = translate(post_body)
else:
    print(f"[ TRANS ] Skipping translation: {translate_tweets=}")
    final_body = post_body
Finally, we push the (optionally translated) tweet into Misskey as a new note, adding the attachment ID we were given by Misskey if there was an image. The attachment/drive system in Misskey is nice, and could be easily adapted to handle video content - but I don’t really want to store files larger than images, so I’ve yet to bother to abstract the code above further.
print(f"[ NOTE ]: {post_hash} => Posting note\n")
try:
c.execute(f""" INSERT INTO {source_timeline}(orig_text, orig_hash, en_text) VALUES ("{post_body}", "{post_hash}", "{final_body}") """)
if has_image:
api_resp = mk.notes_create(text=final_body, file_ids=[attachment_id])
else:
api_resp = mk.notes_create(text=final_body)
except Exception as e:
print(f"[ NOTE ]: {post_hash} => Failed to POST to Misskey:")
print(e)
This script is then called from the crontab on a 15-minute interval, with an entry using a different TOML file for each handle I want to mirror. As I don’t follow that many people or organisations, creating the account and token on the Misskey side is done manually, although I wouldn’t be surprised if there was an API endpoint I could make use of in the script itself.
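For illustration, the crontab entries end up looking something like this (the paths here are hypothetical):

# One entry per mirrored handle, each pointing at its own TOML file
*/15 * * * * /usr/bin/python3 /opt/nitter-misskey/nitter-misskey.py /opt/nitter-misskey/handle-one.toml
*/15 * * * * /usr/bin/python3 /opt/nitter-misskey/nitter-misskey.py /opt/nitter-misskey/handle-two.toml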