YouTube stream yanking

Archiving YouTube streams isn't too difficult; there are a number of livestream-ripping utilities available. The big challenge is figuring out whether a channel is live. These are some notes on how I do it with beautifulsoup, some threads, and streamlink. It isn't a guide - more an extension of the comments in my in-use script.

Beautifulsoup

Beautifulsoup is a great little library for working with HTML in Python. We start by fetching the /live page for a channel using the requests library's request function. It doesn't need to be requests - anything that will give you the page as a long bit of text will do. We take the text and slap it in a beautifulsoup object, specifying that Python's built-in HTML parser ("html.parser") should be used. I've removed logging and exception handling in these examples for brevity.

import requests
from bs4 import BeautifulSoup

channel_id = "XXYYZZ"  # placeholder channel ID
url = f"https://www.youtube.com/channel/{channel_id}/live"
resp = requests.request("GET", url)
soup = BeautifulSoup(resp.text, "html.parser")

Once we have our soup object, we want to search through it for something to indicate the channel is live. As a masochist, my preference is regex. A channel has three states - live, not live, and not live but scheduled. Each is indicated by the presence of certain JSON keys and values in one of the scripts included in the page. In order:

A live channel will contain the object “videoDetails” with “videoId”, “title” and “lengthSeconds” keys.

A not-live channel will contain an object with a “simpleText” key, and a value starting “Last streamed live on”.

A not-live channel with a scheduled stream will also contain an object with a “simpleText” key, this time starting with “Scheduled for”.

import re

# Non-greedy groups (.*?) stop at the first closing delimiter instead of the last one in the script.
live_pattern = re.compile(r'"videoDetails":{"videoId":"(?P<id>.*?)","title":"(?P<title>.*?)","lengthSeconds"')
notlive_pattern = re.compile(r':{"simpleText":"Last streamed live on .*?"}}},{"videoSecondaryInfoRenderer')
scheduled_pattern = re.compile(r':{"simpleText":"Scheduled for (?P<scheduled>.*?)"}}},{"videoSecondaryInfoRenderer')
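
To show how the three patterns can turn into a single answer, here is a minimal sketch of a classifier. The function name classify, the state strings and the "unknown" fallback are hypothetical and not taken from my script. The patterns are repeated (with non-greedy groups) so the block runs on its own, and the check order simply follows the order the states are listed in above.

```python
import re

# Patterns repeated from the notes so this block is self-contained.
LIVE = re.compile(r'"videoDetails":{"videoId":"(?P<id>.*?)","title":"(?P<title>.*?)","lengthSeconds"')
NOT_LIVE = re.compile(r':{"simpleText":"Last streamed live on .*?"}}},{"videoSecondaryInfoRenderer')
SCHEDULED = re.compile(r':{"simpleText":"Scheduled for (?P<scheduled>.*?)"}}},{"videoSecondaryInfoRenderer')


def classify(page_text):
    """Return (state, details) for the text of a channel's /live page.

    The state names are hypothetical labels chosen for this sketch.
    """
    match = LIVE.search(page_text)
    if match:
        return "live", match.groupdict()
    match = SCHEDULED.search(page_text)
    if match:
        return "scheduled", match.groupdict()
    if NOT_LIVE.search(page_text):
        return "not_live", {}
    # None of the markers were found, e.g. the page layout changed.
    return "unknown", {}
```

Something like classify(resp.text) would then give a state to act on. The "unknown" case is worth logging, since it usually means the page no longer contains the JSON these patterns expect.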

Once you have the means to say whether a stream is happening or not, all that remains is yanking it when it's live. Were I less lazy, I'd be using the streamlink Python library instead of spawning a subprocess. But I am lazy. The command I use is provided below for reference.

proc = subprocess.run(f"streamlink --ringbuffer-size 256M --hls-live-restart --hls-segment-threads 10 --hls-segment-timeout 120 --hls-segment-attempts 300 --hls-segment-ignore-names preloading --hls-playlist-reload-attempts 30 --hls-live-edge 5 --hls-timeout 90 --loglevel info -o '{now} {safe_title}.ts' {channel_url} {quality}", shell=True, cwd=output_directory, stdout=subprocess.PIPE, encoding="utf-8")
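
One possible tidy-up, sketched here rather than lifted from my script: pass the command as a list of arguments instead of a shell string, so a title containing a quote can't break the command line. The helper names build_streamlink_args and record are hypothetical. The flags are copied verbatim from the command above, and whether a given streamlink version still accepts all of them isn't checked here.

```python
import subprocess


def build_streamlink_args(channel_url, quality, safe_title, now):
    """Build the streamlink command as an argument list (no shell involved)."""
    return [
        "streamlink",
        "--ringbuffer-size", "256M",
        "--hls-live-restart",
        "--hls-segment-threads", "10",
        "--hls-segment-timeout", "120",
        "--hls-segment-attempts", "300",
        "--hls-segment-ignore-names", "preloading",
        "--hls-playlist-reload-attempts", "30",
        "--hls-live-edge", "5",
        "--hls-timeout", "90",
        "--loglevel", "info",
        # The output name is a single argument, so spaces and quotes are safe.
        "-o", f"{now} {safe_title}.ts",
        channel_url,
        quality,
    ]


def record(channel_url, quality, safe_title, now, output_directory):
    """Run streamlink in output_directory and return the completed process."""
    args = build_streamlink_args(channel_url, quality, safe_title, now)
    return subprocess.run(
        args, cwd=output_directory, stdout=subprocess.PIPE, encoding="utf-8"
    )
```

With a list and no shell=True, subprocess hands each element to streamlink as-is, so the single quotes around the output filename are no longer needed.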

Footnote: Google's 429

Google will slap you if you get too persistent with your lookups. They'll also (for less clear reasons) slap you down if you try to yank a stream over IPv6. Their most common way of telling you ‘no’ is to return a 429 HTTP response code. The best way to avoid this is to do three things:

  1. Place delays in between checks to see if a stream is up.
  2. Increase that delay when you know a stream is scheduled down the line. It's still worth checking, as the stream may be rescheduled to an earlier slot.
  3. Make a point of only calling streamlink when you know there's a stream.
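
The three points above can be sketched as a pair of small functions. Everything here is an assumption made for illustration: the names, the state strings ("live", "scheduled") and the delay values are hypothetical, and doubling the delay after a 429 is an extra I've added on top of the list, not something it prescribes.

```python
CHECK_DELAY = 60        # seconds between checks with nothing scheduled (assumed value)
SCHEDULED_DELAY = 600   # longer wait when a stream is scheduled (assumed value)
MAX_DELAY = 3600        # upper limit when backing off after a 429 (assumed value)


def next_delay(state, status_code=200, previous_delay=CHECK_DELAY):
    """Pick how long to sleep before the next check of the /live page."""
    if status_code == 429:
        # We were told 'no': wait twice as long as last time, up to the cap.
        return min(previous_delay * 2, MAX_DELAY)
    if state == "scheduled":
        # Keep checking, just less often, in case the stream is moved earlier.
        return SCHEDULED_DELAY
    return CHECK_DELAY


def should_record(state):
    """Only call streamlink once the page says the channel is live."""
    return state == "live"
```

A checking loop would call time.sleep(next_delay(state, resp.status_code, delay)) between lookups, and only start streamlink when should_record(state) is true.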