2.0 - Widevine DRM Encoding and Packaging
Version 1.1
Contents
Summary
Contact Us
Introduction to Encoding
Elementary Stream
Codecs
Containers
Container formats
Transmuxing and Transcoding
Video Stream Structure
Group of Pictures (GOP) Structure
Why are GOPs important?
Encoding media
Constant Bitrate (CBR)
Variable Bitrate (VBR)
Aspect Ratio
Using encoding profiles
Common encoding profiles
Best Practices
General recommendations
H264 Encoding Profiles
Example encoding syntax using ffmpeg
Ffmpeg parameters
HEVC Encoding Profiles
Example encoding syntax using ffmpeg
Ffmpeg parameters
VP9 Encoding Profiles
Example encoding syntax using ffmpeg
Ffmpeg parameters
Content Encryption
Encryption Recommendations from Least Secure to Most Secure
Playback Security Levels for Chrome and iOS
Contact Us
For any questions, please contact Widevine through our website: www.widevine.com.
Introduction to Encoding
This section provides a quick layman's understanding of how media is created, the formats it is stored in, common definitions and terminology, and the codecs supported for DASH playback.
Elementary Stream
An elementary stream (ES) is the encoded form of a single type of media as perceived by the user. Every elementary stream contains a single media data type (audio, video, subtitles, or captions). Combining several elementary streams produces a cohesive media playback experience on any given platform.
The content of an elementary stream depends on the data format it holds, which is defined by a codec (coder-decoder) for video or audio. Elementary streams are broken down into frames and encoded by codecs. A frame (or media sample) is typically a still image for video or a few milliseconds of audio, and it contains the information needed to render a specific video or audio scene at that point in time. A collection of frames makes up a complete video or audio clip, similar to a flip book.
In essence, audio and video content is stored as frames that are encoded (compressed) by a codec. Each codec conforms to its own specification, and every codec strives to provide the best media quality for the least amount of resources used (processing, time, efficiency, size).
Codecs
The most common codecs used for video and audio processing today are:
Video: H264 (AVC), HEVC, VP9
Audio: AAC, Vorbis
Some codecs are closed-source commercial products, while others are open-source and free to use, with or without licensing requirements. Typically, codec selection is based on client platform support. DASH presentations support multiple codecs (i.e. mixing and matching different codecs within the same content).
Containers
Media containers are the grouping of one or more elementary streams into a data stream (in this
case, a file). A container is also a format specification that describes elements of the data
streams (timing, structure, and media information) that it holds.
The elementary stream metadata that is added to a container includes (not a complete list; see the inspection example after this list):
● Codec type
● Codec-specific configuration data
● Video height and width
● Video frame rate
● Audio sampling rate
● Audio channels
● Frame timing and ordering information
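As an illustrative aside, these fields can be inspected for a local file with ffprobe (part of the ffmpeg suite); the input name is a placeholder:
ffprobe -hide_banner -show_format -show_streams <source_file>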
Container formats
Here are examples of container formats commonly used for streaming media.
MPEG2-TS (.ts)
● Optimized for transmission over a closed network for broadcast systems.
ISO-BMFF (.mp4)
● Designed as a next-generation container format by Apple and adopted by ISO/IEC.
● There are multiple file extensions allowed for this format, however, we are focused on
MPEG4 (mp4).
● All data within this format is organized into boxes. Each box type represents a different
type of data element contained within. Boxes may also contain other boxes.
● Fragmented MP4 is a variant of ISO-BMFF optimized for streaming.
When creating DASH-compliant media, the specification requires that only a single elementary stream be present per container.
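For example, a multiplexed MP4 holding both audio and video can be split into single-stream files without re-encoding; this is an illustrative sketch using ffmpeg with placeholder file names:
ffmpeg -i <source_file> -map 0:v -c copy video_only.mp4
ffmpeg -i <source_file> -map 0:a -c copy audio_only.mp4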
Transmuxing and Transcoding
Now that we have covered what media files consist of, how are these files created?
Transmuxing is the act of moving elementary streams from one container format to another without manipulating the streams themselves. All audio and video data remains unchanged.
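For example, here is a minimal sketch of a transmux using ffmpeg (file names are placeholders): the elementary streams are copied (-c copy) from an MPEG2-TS container into an MP4 container without re-encoding. Depending on the source, an additional bitstream filter may be required.
ffmpeg -i <source_file>.ts -c copy <output_file>.mp4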
Video Stream Structure
As previously stated, video is represented as a sequence of frames over time, and consecutive frames within a scene tend to be similar to one another.
A GOP:
● is a self-contained, decodable sequence of frames.
● is normally a short sequence of pictures.
● begins with a key frame and ends before the next key frame.
If I-frames are so complete, why is there a need for P- or B-frames? Media consisting of all I-frames would be very large, rendering it impractical for mass consumption. B- and P-frames are a fraction of the size of an I-frame, since they only encode the differences between pictures.
GOP length (or size) is the number of frames in the GOP. A longer GOP is more efficient (a larger grouping of similar pictures) but provides fewer seek or adaptation points.
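As an illustrative calculation (the frame rate and interval are assumptions, not requirements): a 24 fps source with a 3-second key frame interval yields a GOP of 24 x 3 = 72 frames, which is why the ffmpeg examples in this document use:
-keyint_min 72 -g 72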
Encoding media
Media is not uniform. For example, a movie contains many kinds of scenes - action sequences, dialogue, romance and more. The amount of data required to encode a scene depends on how much the frames change from one to the next (complexity). An action sequence requires more data to encode correctly than a scene of a sleeping baby.
There are many considerations for encoding; however, the primary decision comes down to bitrate.
Constant Bitrate (CBR)
CBR refers to content that is encoded at a specific bitrate, uniformly across its entirety.
For a high-complexity scene, a CBR configuration forces the codec to compromise on scene quality whenever the data required to render the scene exceeds the CBR limit. The scene is too complex to store within the allowed amount of information, resulting in graininess, artifacts or skipping.
Variable Bitrate (VBR)
VBR allows the codec to use fewer bits where they are not needed, saving them for more complex scenes.
The basic principle is to set a target bitrate and allow some level of variance over the length of the media, so that the overall target bitrate is still achieved.
What happens when the bitrate exceeds the target for extended periods of time? The end-user device will be unable to display the frames correctly, since it expects X bitrate but receives X+Y bitrate instead. This can show up as stuttering and artifacts (pixelation).
To work around these limitations, a video buffer verifier (VBV) is specified. The VBV manages
VBR variance. It specifies a maximum bitrate over a rolling buffer.
In our tests, for the majority of current devices, the VBV should be set to twice the target bitrate.
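As a sketch of this guidance (the 4 Mbps target and the H264 encoder are illustrative assumptions), the VBV cap can be expressed in ffmpeg by setting -maxrate and -bufsize to twice the target bitrate:
ffmpeg -i <source_file> \
 -c:v libx264 -b:v 4M -maxrate 8M -bufsize 8M \
 <output_file>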
Aspect Ratio
Aside from bitrate, there are other parameters that control how pictures are displayed - most notably the aspect ratio, which preserves the original picture and prevents visual distortion.
There are three distinct aspect ratios - Sample Aspect Ratio (SAR), Pixel Aspect Ratio (PAR) and Display Aspect Ratio (DAR).
The display aspect ratio is the one most commonly understood and referenced when discussing aspect ratios. It reflects what we see in media; terms like 4:3 or 16:9 refer to the display aspect ratio. A common method is to define the display aspect ratio (DAR) and let your encoder software calculate the SAR and PAR accordingly.
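As a minimal sketch (the resolution and ratio are illustrative), ffmpeg's setdar filter pins the display aspect ratio and lets the encoder derive the matching sample aspect ratio:
ffmpeg -i <source_file> \
 -vf "scale=720:480,setdar=16/9" \
 -c:v libx264 <output_file>
Here the 720x480 storage resolution is presented at 16:9 because the encoder writes the corresponding non-square sample aspect ratio into the stream.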
Using encoding profiles
Encoding profiles refer to a set of parameters that generates media with specific properties. For example, encoding profiles for SD and HD content would not be identical, as there would be, at minimum, a change in the display resolution.
These profiles vary per codec. Every codec defines its own specification and features - for example, H264 uses Baseline, Main and High profiles. Even more importantly, encoding profile support varies from device to device. Therefore, it is paramount to ensure that a device is capable of playback for any given encoding profile.
Codec   Profiles
AAC     AAC-LC
        AAC-HE (SBR - spectral band replication)
        AAC-HEv2 (SBR and PS - parametric stereo)
VP9     Profile 0 (8-bit)
        Profile 1 (8-bit)
        Profile 2 (10-bit, 12-bit)
        Profile 3 (10-bit, 12-bit)
Best Practices
This section outlines a series of recommendations and best practices, from encoding to encryption to enabling playback on various client platforms. These recommendations serve as guidance on the design and implementation of EME, CENC and DASH support from Widevine, for both server and client pieces. To ensure maximum playback compatibility across all client platforms, the lowest common denominator for encoding profiles should be used.
The table below illustrates the most common audio and video codec combinations for various video resolutions:
General recommendations
● Use variable bitrate (VBR) tracks with a reasonable video buffer verifier (VBV) value.
○ VBV = 1.5 - 2x target bitrate.
● All files must have closed GOPs and identical IDR frame structure.
● IDR frame separation – approx 3 seconds (lowest value consistent with good image
quality).
● Key frames must be at the same exact intervals across all track types.
● DASH REQUIREMENT - 'moov' atom should immediately follow the 'ftyp' atom.
● ISO-BMFF Chunks should contain no more than 1 second worth of sample data.
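As a sketch of how these recommendations might translate into an H264 encode with ffmpeg (the 720p resolution, 3 Mbps target and 24 fps source are illustrative assumptions, not requirements):
ffmpeg -i <source_file> \
 -vf "scale=1280:trunc(ow/a/2)*2" \
 -c:v libx264 -profile:v main \
 -keyint_min 72 -g 72 -sc_threshold 0 \
 -b:v 3M -maxrate 6M -bufsize 6M \
 -an -movflags +faststart \
 -f mp4 <output_file>
Here -keyint_min and -g fix the IDR interval at 72 frames (3 seconds at 24 fps), -sc_threshold 0 keeps key frames on that exact cadence so tracks stay aligned, and -movflags +faststart places the 'moov' atom at the front of the file.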
Ffmpeg parameters
Parameter              Description
-vf                    The options used here will resize the video to the desired resolution while maintaining the original aspect ratio
-movflags +faststart   Optimizes the output MP4 file format for streaming
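For HEVC, a comparable sketch (again with illustrative resolution and bitrate values; the keyint settings mirror the H264 example above):
ffmpeg -i <source_file> \
 -vf "scale=1920:trunc(ow/a/2)*2" \
 -c:v libx265 -profile:v main \
 -x265-params "keyint=72:min-keyint=72:scenecut=0" \
 -b:v 5M -maxrate 10M -bufsize 10M \
 -an -tag:v hvc1 -movflags +faststart \
 -f mp4 <output_file>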
Ffmpeg parameters
Parameter              Description
-vf                    The options used here will resize the video to the desired resolution while maintaining the original aspect ratio
-movflags +faststart   Optimizes the output MP4 file format for streaming
ffmpeg -i <source_file> \
-vf "scale=1280:trunc(ow/a/2)*2" \
-c:v libvpx-vp9 -keyint_min 72 -g 72 -profile:v 0 \
-threads 4 -tile-columns 6 -frame-parallel 1 \
-speed 1 -auto-alt-ref 1 -lag-in-frames 25 \
-an -minrate 4M -maxrate 4M -bufsize 4M -b:v 4M \
-f webm -dash 1 <output_file>
Ffmpeg parameters
Parameter              Description
-vf                    The options used here will resize the video to the desired resolution while maintaining the original aspect ratio
Content Encryption
Encryption best practices can be distilled into a simple statement: separate keys should be used across different types of content (audio, video, resolution).
Encryption recommendations, from least secure to most secure:
● Separate content key for audio tracks, and a separate content key for each video resolution group (SD, HD, UHD).
● Separate content key for audio tracks, and a separate content key for each video track.
The recommended security level setting for VIDEO tracks is SW_SECURE_DECODE. The only supported security level setting for AUDIO tracks is SW_SECURE_CRYPTO. Security level settings are specified by your license proxy implementation on a per-track basis. The table below provides the recommended security level settings per Chrome platform.
Widevine provides a reference open-source CENC packaging solution - Shaka Packager. The GitHub page provides documentation on how to build and use the packager. We highly recommend joining the GitHub user group to keep up with the latest information and to ask questions.
The following sections cover how to install and use the Shaka Packager.
Install dependencies
The Shaka Packager supports a variety of OSes; this section of the document focuses on Ubuntu Linux (14.04 or higher). For the sake of simplicity, all commands are executed as user=root.
To ensure you have the basic build environment, run the following commands as user=root:
$ git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
$ export PATH=$PATH:`pwd`/depot_tools
$ mkdir shaka-packager
$ cd shaka-packager
$ gclient config https://www.github.com/google/shaka-packager.git --name=src
$ gclient sync
$ ll src/
6. Build using ninja to create the reference binaries for use. This takes a while, grab a cup
of coffee.
$ ninja -C src/out/Release
$ cd src/out/Release && ls
For ease of use, you can add the Release directory to your PATH or copy the 2 binaries to a
directory in your PATH:
$ export PATH=$PATH:$HOME/shaka-packager/src/out/Release
$ packager --help
$ mpd_generator --help
Use the --dump_stream_info flag to analyze your audio and video streams.
Example
$ packager input=~/llama_h264_main_480p_1000.mp4 --dump_stream_info
[0921/220412:INFO:demuxer.cc(58)] Initialize Demuxer for file
'llama_h264_main_480p_1000.mp4'.
File "llama_h264_main_480p_1000.mp4":
Found 1 stream(s).
Stream [0] type: Video
codec_string: avc1.4d401f
time_scale: 12288
duration: 1843712 (150.0 seconds)
is_encrypted: false
codec: H264
width: 858
height: 482
pixel_aspect_ratio: 3856:3861
trick_play_rate: 0
nalu_length_size: 4
Encrypt content
Shaka Packager allows for single and multi-file encryption. It’s optimal to package all your
content in a single command - encrypt all tracks and generate the MPD.
To encrypt using the Widevine Cloud License Service, you will need:
a. A Content ID - a value that identifies the media that is being packaged.
Provider = widevine_test
IV = d58ce954203b7c9a9a9d467f59839249
KEY = 1ae8ccd0e7985cc0b6203a55855a1034afc252980e970ca90e5202689f947ab9
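A sketch of a single packaging command using these widevine_test credentials is shown below; the input file name, content ID, output names and MPD name are illustrative placeholders, and the key server URL shown is the UAT endpoint commonly used with the widevine_test account - substitute your own production values.
$ packager \
  input=tears.mp4,stream=audio,output=enc_tears_audio.mp4 \
  input=tears.mp4,stream=video,output=enc_tears_video.mp4 \
  --enable_widevine_encryption \
  --key_server_url https://license.uat.widevine.com/cenc/getcontentkey/widevine_test \
  --content_id 7465737420636f6e74656e74206964 \
  --signer widevine_test \
  --aes_signing_key 1ae8ccd0e7985cc0b6203a55855a1034afc252980e970ca90e5202689f947ab9 \
  --aes_signing_iv d58ce954203b7c9a9a9d467f59839249 \
  --mpd_output enc_tears.mpd
This encrypts the audio and video tracks and generates the MPD in a single pass; the resulting MPD resembles the sample below.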
Sample MPD
<cenc:pssh>AAAAaHBzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAAEgIARIQJPZ2phZEUvO+FnPYDjUXwBIQW2j5Y6rUXreKuHqFkSvFSxIQgV87O8FlVBWIa+a4N/dmPxoFY3dpcDEiBRI0VmeIMgA=</cenc:pssh>
</ContentProtection>
<BaseURL>enc_tears_audio.mp4</BaseURL>
<SegmentBase indexRange="1001-1920" timescale="44100">
<Initialization range="0-1000"/>
</SegmentBase>
</Representation>
</AdaptationSet>
<AdaptationSet id="1" contentType="video" width="720" height="300"
frameRate="12288/512" par="12:5">
<Representation id="1" bandwidth="686572" codecs="avc1.42c01e"
mimeType="video/mp4" sar="1:1">
<ContentProtection value="cenc"
schemeIdUri="urn:mpeg:dash:mp4protection:2011"
cenc:default_KID="24f676a6-1644-52f3-be16-73d80e3517c0"/>
<ContentProtection
schemeIdUri="urn:uuid:edef8ba9-79d6-4ace-a3c8-27dcd51d21ed">
<cenc:pssh>AAAAaHBzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAAEgIARIQJPZ2phZEUvO+FnPYDjUXwBIQW2j5Y6rUXreKuHqFkSvFSxIQgV87O8FlVBWIa+a4N/dmPxoFY3dpcDEiBRI0VmeIMgA=</cenc:pssh>
</ContentProtection>
<BaseURL>enc_tears_video.mp4</BaseURL>
<SegmentBase indexRange="1127-1902" timescale="12288">
Content Playback