User:Fæ/Project list/CDC videos

Guidance for 10 things to do at home to manage COVID-19 infection during the pandemic.
Video giving easy to understand pregnancy tips.
Spanish video encouraging the use of insect repellent to defend from the Zika virus.

This batch upload project populates Category:CDC videos.

The videos are focused on health promotion, prevention and preparedness activities in the United States, though some relate to international programmes.

The project is run by Fæ as an independent volunteer. Questions and comments can be raised at User talk:Fæ. https://petscan.wmflabs.org/?psid=1181470 gives a report of uploads in the last 7 days.

Introduction


The Centers for Disease Control and Prevention (CDC) is a U.S. federal agency, so all works created under its projects are automatically in the public domain. The source YouTube channel CDCStreamingHealth used for the videos is its official channel, linking back to its home website www.cdc.gov. In practice, videos displayed at www.cdc.gov are YouTube-hosted videos embedded within CDC webpages.

This batch upload project for CDC videos was suggested in an email from James Heilman. Video transcoding and uploading are challenging on Wikimedia Commons, and the resulting barriers mean that video remains only a very small part of Commons' collections.

Technical pointers


YouTube makes a range of formats available for hosted videos behind the scenes. However, the default format, and normally the best format, is mp4, which is not an open standard for video. As Commons is limited to open formats, the files have to be reprocessed into an open standard. For videos the most commonly accepted format is webm, using the VP8 or VP9 codec for video and Vorbis for audio.

Example formats available for https://www.youtube.com/watch?v=YovSyrTUpxc, a CDC video uploaded as File:I Am CDC- Linda Schieb.webm

[info] Available formats for YovSyrTUpxc:
format code  extension  resolution note
139          m4a        audio only DASH audio   49k , m4a_dash container, mp4a.40.5@ 48k (22050Hz), 540.46KiB
249          webm       audio only DASH audio   55k , opus @ 50k, 546.69KiB
250          webm       audio only DASH audio   78k , opus @ 70k, 765.37KiB
171          webm       audio only DASH audio  121k , vorbis@128k, 1.09MiB
140          m4a        audio only DASH audio  129k , m4a_dash container, mp4a.40.2@128k (44100Hz), 1.40MiB
251          webm       audio only DASH audio  192k , opus @160k, 1.80MiB
278          webm       256x144    144p   90k , webm container, vp9, 15fps, video only, 760.28KiB
160          mp4        256x144    DASH video  112k , avc1.4d400c, 15fps, video only, 1.14MiB
242          webm       426x240    240p  169k , vp9, 30fps, video only, 1.14MiB
133          mp4        426x240    DASH video  261k , avc1.4d4015, 30fps, video only, 2.61MiB
243          webm       640x360    360p  293k , vp9, 30fps, video only, 2.05MiB
134          mp4        640x360    DASH video  352k , avc1.4d401e, 30fps, video only, 2.43MiB
244          webm       854x480    480p  514k , vp9, 30fps, video only, 3.36MiB
135          mp4        854x480    DASH video  650k , avc1.4d401f, 30fps, video only, 4.90MiB
247          webm       1280x720   720p 1057k , vp9, 30fps, video only, 6.90MiB
136          mp4        1280x720   720p 1197k , avc1.4d401f, 30fps, video only, 9.67MiB
302          webm       1280x720   720p60 1634k , vp9, 60fps, video only, 11.15MiB
298          mp4        1280x720   DASH video 2819k , avc1.4d4020, 60fps, video only, 19.25MiB
17           3gp        176x144    small , mp4v.20.3, mp4a.40.2@ 24k
36           3gp        320x180    small , mp4v.20.3, mp4a.40.2
43           webm       640x360    medium , vp8.0, vorbis@128k
18           mp4        640x360    medium , avc1.42001E, mp4a.40.2@ 96k
22           mp4        1280x720   hd720 , avc1.64001F, mp4a.40.2@192k (best)

In the example above, a Commons-compliant video could be created by downloading one of the webm video-only files and merging in Vorbis audio. This would be relatively fast, as the video and audio would not need transcoding on a local machine (the 'client' machine). This has been done in past projects, for example as part of the import of YouTube videos for the LGBT free media collective. However, the range of available formats varies, so this type of video creation has been a manual choice at the time of merging: the list of webm streams and Vorbis streams was presented to the operator, who picked from them in a terminal screen. The hand-picking is not obvious, as very high resolution video is not suitable for Commons because it will fail to display on the image page.
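The manual choice described above could be approximated in code. The following is a minimal sketch, assuming the format list has already been parsed from youtube-dl's JSON output into dictionaries with `format_id`, `ext`, `vcodec`, `acodec`, `height`, `tbr` and `abr` keys. The function name and the `max_height` cut-off are illustrative assumptions and were not part of the original scripts.

```python
def pick_webm_streams(formats, max_height=1080):
    """Pick a webm video-only stream and a Vorbis audio-only stream.

    Returns a (video_format_id, audio_format_id) tuple; either value is
    None when no suitable stream is listed. max_height is an assumed
    cut-off to avoid very high resolution video.
    """
    videos = [
        f for f in formats
        if f.get("ext") == "webm"
        and f.get("acodec") == "none"               # video only
        and f.get("vcodec") not in (None, "none")
        and (f.get("height") or 0) <= max_height
    ]
    audios = [
        f for f in formats
        if f.get("vcodec") == "none"                # audio only
        and str(f.get("acodec", "")).startswith("vorbis")
    ]
    # Prefer the tallest frame, then the highest total bitrate.
    best_video = max(
        videos,
        key=lambda f: (f.get("height") or 0, f.get("tbr") or 0),
        default=None,
    )
    # Prefer the highest audio bitrate.
    best_audio = max(
        audios,
        key=lambda f: f.get("abr") or f.get("tbr") or 0,
        default=None,
    )
    return (
        best_video["format_id"] if best_video else None,
        best_audio["format_id"] if best_audio else None,
    )
```

A resulting pair such as 247 and 171 can be passed to youtube-dl as `-f 247+171`, which merges the two streams with the local ffmpeg rather than transcoding them.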

Because of the number of CDC videos, and the shortage of 'technical' volunteer time to run the process, a fully automated approach was preferred. This is much slower, as it relies on letting YouTube recommend the default best quality video for download and then having the local machine transcode from mp4 (H.264 video with AAC audio) to webm (VP8 video with Vorbis audio).

The choice of VP8 is based on transcoding times and on the negligible difference to end quality that using VP9 would make. Further research might shift this viewpoint.

2018 automated pick from the manifest


Taking account of upgrades to the youtube-dl module, a much simpler equivalent call is used:

youtube-dl -o <local> <webpage_url> --recode-video webm

This uses the DASH manifest to pick the best possible video and audio options and then recodes them to webm format using the local installation of ffmpeg. Example: Hurricane Preparedness.webm
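From Python, the same command line can be issued through the subprocess module. This is a minimal sketch rather than the project's actual code: the function name and the `runner` parameter are assumptions, the latter added only so that the command can be inspected without contacting YouTube.

```python
import subprocess


def recode_to_webm(webpage_url, local, runner=subprocess.call):
    """Download a video with youtube-dl and recode it to webm.

    `runner` defaults to subprocess.call, which returns the exit status
    of the youtube-dl process (0 on success).
    """
    command = [
        "youtube-dl",
        "-o", local,                 # output filename or template
        webpage_url,
        "--recode-video", "webm",    # recode with the local ffmpeg
    ]
    return runner(command)
```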

2019 hesitantly moving to the WMF cloud


Due to repeated overheating of the old personal kit at home, the code was adjusted to be able to run on WMF labs. Unfortunately there have been problems probably as a result of YouTube blocking harvesting bots, so rather than running as expected on the Grid Engine (so that the system could flexibly manage the assignment of resources), the upload and recoding script works only in a 'live' terminal. This may change if the issue/bug gets resolved. The video files are downloaded to the general '/tmp' area of labs after this was recommended to reduce consumption of NFS IO bandwidth by the ffmpeg recoding (presumably flagged as a problem due to heavy disk usage). Ref Phab:T236446.

The option of using youtube-dl to both download and recode the video (from mp4 or other formats to webm) was discounted as it was far slower than downloading via youtube-dl and then recoding directly with a separate ffmpeg call.

The ffmpeg parameters to add metadata to the file work, but the metadata is not parsed on Commons; this has been raised as a bug at Display standard webm metadata/tag. There was an experiment to use the Python module youtube_dl, which is intended to make youtube-dl available as internal functions; however, apparent differences in data parsing resulted in this getting parked.

After discovering that some of the skipped videos might be down to the Commons filenames (titles) already being in use, despite having different YouTube identities, the automatic creation of unique filenames was added by appending (CDC <YouTube ID>) to the titles. For example, File:An Introduction to Dispensing.webm was uploaded in 2017, but the title now exists on YouTube as a different video with an apparently different encoding; it is the same actual recording, albeit very slightly larger due to a better screen ratio, and the newer alternate has been uploaded as File:An Introduction to Dispensing (CDC GSISCC7qs6g).webm. In other cases the same title might be used for different edits of the videos: File:CDC Response to Hurricanes.webm (4 min 49 s) and File:CDC Response to Hurricanes (CDC 2knfLO0Yk4c).webm (7 min 2 s) were released by the CDC within a few weeks of each other, and both are valid separate content for Commons. Unfortunately the YouTube ID is not pretty, but there was no obvious shorter way to create a unique identifier. One possible future issue is that the CDC might recode many past videos this way and re-release them on YouTube, leaving Commons with many duplicates of essentially the same videos; but let's cross that bridge when we come to it.
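The naming rule can be illustrated with a short sketch. It assumes the suffix is only added when the plain title is already taken, which the text above implies but does not state outright; the function name and the `existing_titles` set are hypothetical stand-ins for a lookup against Commons.

```python
def commons_filename(title, youtube_id, existing_titles):
    """Return a Commons filename for a video, adding the YouTube ID
    only when the plain title is already in use."""
    plain = f"{title}.webm"
    if plain not in existing_titles:
        return plain
    # Title clash: disambiguate with the YouTube ID.
    return f"{title} (CDC {youtube_id}).webm"


# Example using the titles mentioned above.
taken = {"An Introduction to Dispensing.webm"}
name = commons_filename("An Introduction to Dispensing", "GSISCC7qs6g", taken)
```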

Code


The upload script uses Pywikibot-core. However, at the time the batch upload started, the standard Pywikibot-core would fail to upload files at or above around 230 MB in size. A patch by Zhuyifei1999, discussed in phab:T129216, adds an experimental version of asynchronous uploads to Pywikibot-core by adapting the site.py module. Unfortunately this makes this project hard to replicate, unless users are technical enough to install (informal) patches.

The guts of handling YouTube videos is done by making calls with youtube-dl, see https://youtube-dl.org/. This is open source code available on GitHub. The alternative would be to access the YouTube API, which offers similar ways of reporting on YouTube video metadata, in particular examining playlists under a YouTube channel account. However, youtube-dl was a quick fix based on previous code available, so this option saved on our limited (free) volunteer development time. youtube-dl has options for returning results in JSON format.

youtube-dl → JSON

As a code note, probably to myself rather than expecting others to reuse this, after a lot of trial and error, the trick was:

  1. Taking youtube-dl output to a terminal using the "-j" option to get JSON.
  2. The output looks like ["data-wanted\n", null], so take output[0] to drop the null, then [:-1] to chop off the redundant trailing newline.
  3. re.sub('\n', ',', output) so that each video (as a JSON object) in the playlist is separated by a comma, not a newline.
  4. data = json.loads('[' + output + ']') to wrap the naked sequence of objects into a list so it can be read as correct JSON.

Getting this right was surprisingly time consuming. Though inelegant, if starting from scratch, it's probably better to parse the output as a long string rather than JSON.
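The steps above can be sketched as a single function. This is a minimal illustration, assuming the raw text is the standard output captured from a `youtube-dl -j` call with one JSON object per line; the function name and the sample data are hypothetical.

```python
import json
import re


def parse_youtube_dl_json(raw_stdout):
    """Turn newline-separated JSON objects into a list of dicts."""
    text = raw_stdout.rstrip("\n")       # chop off the trailing newline
    if not text:
        return []
    joined = re.sub("\n", ",", text)     # newline-separated -> comma-separated
    return json.loads("[" + joined + "]")  # wrap so it is one valid JSON list


# Hypothetical two-video playlist output.
sample = '{"id": "a1", "title": "One"}\n{"id": "b2", "title": "Two"}\n'
videos = parse_youtube_dl_json(sample)
```

The newline substitution is safe because JSON escapes any newline inside a string value, so literal newlines only occur between records. An equivalent approach is to call `json.loads` on each line separately.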

Video transcoding is done using a slightly complex call to ffmpeg, https://ffmpeg.org/. This is probably the best-known open source multi-format converter tool. Transcoding takes approximately 12 times "real time" for high quality video, i.e. a 1 hour high quality video will take 12 hours to transcode. This is based on using a 2012 Mac mini with 16 GB of RAM and a 2.5 GHz i5 processor. Faster times would be achieved on better machines, or if WMF Labs were used as the transcoding environment. This was not done as youtube-dl is not available on Labs; again, the trade-off between getting the project done with limited volunteer time and investing that time in further research on getting the same thing working in the Labs environment has to be considered.

The source code itself is "fairly" basic and was quickly hacked from a previous bit of code, along with odd local dependencies. So this is not a good bit of example code to learn from. The principles however, could be duplicated if someone wanted to create alternative code:

  • It makes calls to youtube-dl to examine the CDC video channel, pulls out all playlists, then loops through each playlist to generate data for each video to be uploaded.
  • A local copy of the best mp4 version of the YouTube video is created with a youtube-dl call.
  • The youtube-dl information about the video is used to create the Commons image page text and provide a filename. Extra useful data included is the YouTube "categories" and the YouTube "tags".
  • The local copy is transcoded using ffmpeg.
  • Pywikibot (patched) uploads the file to Commons. Local copies are then removed.
  • Uploads use the YouTube playlist name as a category name. This will be a red-link until created manually.

Checks made:

  1. The intended filename is checked to see whether it exists on Commons, and the file is skipped if it does.
  2. The code checks whether the "YouTube ID" (the ?v=<id> part of website calls) is already on Commons, presuming that the alphanumeric string will be unique. If found, the file is skipped, presuming it exists under another filename.
  3. If the Commons upload fails twice, the code gives up on the file, presuming there is a fundamental problem with the file encoding.
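The three checks can be expressed as a small control-flow sketch. All names here are hypothetical: `filename_exists`, `id_on_commons` and `upload` stand in for whatever Pywikibot calls the real script makes, and are passed in as plain callables so the logic can be read (and tried out) without a Commons connection.

```python
def process_video(filename, youtube_id, filename_exists, id_on_commons,
                  upload, max_attempts=2):
    """Apply the skip checks, then try the upload a limited number of times.

    `upload` is assumed to return True on success and False on failure.
    """
    if filename_exists(filename):
        return "skipped: filename exists"
    if id_on_commons(youtube_id):
        return "skipped: YouTube ID already on Commons"
    for _ in range(max_attempts):
        if upload(filename):
            return "uploaded"
    # Two failures are taken to mean a fundamental problem with the file.
    return "gave up"
```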

The critical call for the transcoding is:

call(["nice","-n","15", "ffmpeg","-i","youtube-x.mp4","-c:v", "libvpx", "-crf", "4", "-b:v", "2M", "-c:a", "libvorbis", local])

"call" is from the Python subprocess module, creating a new process to do the transcoding. This runs on OS X, so the command "nice" sets the priority of the process, in this case to a low priority (higher 'niceness') so that the local machine is not overloaded for other jobs. The variable local is the path of the file being created from the original youtube-x.mp4. The other ffmpeg options are based on the experience of creating videos for Commons over the last couple of years.
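For reuse, the same call can be wrapped in a function. This is a sketch under stated assumptions: the function names and the `niceness` parameter are illustrative, while the ffmpeg options are copied from the call above. Output is discarded with subprocess.DEVNULL, which is one way to keep ffmpeg's progress messages out of the terminal.

```python
import subprocess


def build_transcode_command(source, target, niceness=15):
    """Build the low-priority ffmpeg command for mp4 -> webm (VP8/Vorbis)."""
    return [
        "nice", "-n", str(niceness),   # lower the process priority
        "ffmpeg", "-i", source,
        "-c:v", "libvpx",              # VP8 video
        "-crf", "4", "-b:v", "2M",     # quality and bitrate settings from the original call
        "-c:a", "libvorbis",           # Vorbis audio
        target,
    ]


def transcode(source, target, niceness=15):
    """Run the transcode quietly and return ffmpeg's exit status."""
    return subprocess.call(
        build_transcode_command(source, target, niceness),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```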

Alternative code using youtube_dl as a Python module has been created. This works perfectly well, but the playlist name seemed to be missing from the video keys. There was no performance benefit, and using subprocess.call makes it easy to suppress unwanted output.

Upload runs

Penguins get colds too.
15 May 2017

A test run created File:2017 NHSN Training - Standardized Antibiotic Administration Ratio.webm and started Category:I Am CDC.

16 May 2017

Full automated run started. The file upload comment, in the file history, points to this project page. One unintended consequence of using a local copy for transcoding was that it rapidly exhausted my Google Drive allowance (15 GB), as files were being synced with my Drive account. Switching off the sync while this project ran fixed the issue, but a better solution for similar projects would be to let the transcoding run on a stand-alone machine, as storage, processor time and bandwidth are all seriously affected.

23 May 2017

Run restarted, with an improvement to identifying videos from the youtube-dl playlist JSON listing. There seem to be two different ways that the list may be returned: one has a list of 'entries' within a single structure for the playlist, and the other is a list of videos as a naked sequence of structures, which causes errors in JSON parsing of the output. Previous uploads skipped these errors, but were missing some videos or sets of videos as a consequence, for example CDC Zika virus videos.
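A parser that tolerates both shapes might look like the following. This is a sketch and not the project's code: it assumes each line of output is one JSON object, which is either a single video or a playlist wrapper holding its videos under an 'entries' key. The function name and sample data are hypothetical.

```python
import json


def videos_from_output(raw_stdout):
    """Collect video records from either playlist output shape."""
    videos = []
    for line in raw_stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        if isinstance(item, dict) and "entries" in item:
            # Playlist wrapper: the videos are nested under 'entries'.
            videos.extend(entry for entry in item["entries"] if entry)
        else:
            # Naked record: the line is itself one video.
            videos.append(item)
    return videos


wrapped = '{"title": "Playlist", "entries": [{"id": "a1"}, {"id": "b2"}]}\n'
naked = '{"id": "a1"}\n{"id": "b2"}\n'
```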

29 July 2017

First run completed, re-run started. It seems that the transcoding takes long enough for many new videos to be added to the channel while the upload run is happening. Regular refreshes may be helpful.

17 May 2018

First run with new simpler command, leaving post-processing choices to youtube-dl.

17 Oct 2019

After a long gap, converted to run under Ubuntu (previously Mac OS) on a very old laptop. Halted after an overnight run showed steady-state laptop temperatures of 93 °C or more during video processing. Even when set to the lowest possible priority, temperatures were over 90 °C with the fan at a constant maximum RPM. Other options are being considered.

23 Oct 2019

Converted to run on WMF Labs. However, although the Python code works from within a live terminal, it has yet to run successfully as a job on the Grid Engine (which is the way Labs is supposed to be used). Refer to phab:T236446.

29 Nov 2019

Having given up on WMF Labs, had to restore running on the Mac mini, even though this is the only task running on it. Due to the ongoing issue of YouTube blocking IP addresses that make automated requests for video information, the programme compensates by waiting for the block to be lifted. As blocks appear to escalate automatically, this may become untenable.
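Waiting out a block can be sketched as a retry loop with lengthening pauses. Everything specific here is an assumption made for illustration: the function name, the wait durations, and the convention that `fetch` returns None while the block is in force. The `sleep` parameter exists only so the loop can be exercised without really waiting.

```python
import time


def fetch_with_waits(fetch, waits=(3600, 7200, 14400), sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    After each blocked attempt the loop pauses for the next duration in
    `waits` (seconds), then makes one final attempt. Returns None if the
    block never lifts.
    """
    for wait in waits:
        result = fetch()
        if result is not None:
            return result
        sleep(wait)          # blocked: wait before trying again
    return fetch()
```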

24 May 2020

YouTube has continued actively to block any bot-like downloading of videos. Block triggers appear to include reading playlists, not just video access. Blocks last several days, currently around 2 to 3 days, and significantly hamper the process of sharing this content.

23 Oct 2020 End of project

RIAA declares youtube-dl "illegal" and GitHub removes the code after the RIAA requests that it do so. The impression is that YouTube (and therefore Google, which owns it) is actively against open knowledge projects that use YouTube to share content or archive files from it. It is unlikely that there can be any future partnership or productive collegiate working between YouTube content and Wikimedia projects. Ref zdnet