Video Codecs – H.264, H.265, AV1

This article discusses the widely adopted current standards for video codecs (compression/decompression), namely MPEG-2, H.264, H.265 and AV1.


MPEG-2 (a.k.a. H.222/H.262 as defined by the ITU)
Generic coding of moving pictures and associated audio information.
A combination of lossy video compression and lossy audio data compression methods, which permits storage and transmission of movies using currently available storage media and transmission bandwidth.

Better than MPEG-1

Evolved out of the shortcomings of MPEG-1, such as an audio compression system limited to two channels (stereo), no standardized support for interlaced video (which therefore compressed poorly), and only one standardized “profile” (Constrained Parameters Bitstream), which was unsuited for higher-resolution video.


  • over-the-air digital television broadcasting and in the DVD-Video standard.
  • TV stations, TV receivers, DVD players, and other equipment
  • MOD and TOD – recording formats for use in consumer digital file-based camcorders.
  • XDCAM – professional file-based video recording format.
  • DVB – application-specific restrictions on MPEG-2 video in the DVB standard.


Advanced Video Coding (AVC), also known as H.264, MPEG-4 AVC, or ITU-T H.264 / MPEG-4 Part 10 ‘Advanced Video Coding’
introduced in 2004

Better than MPEG-2

40-50% bit rate reduction compared to MPEG-2

Supports up to 4K (4,096×2,304) and 59.94 fps
21 profiles; 17 levels

Compression Model

Video compression relies on predicting motion between frames. The encoder compares parts of a video frame with neighbouring frames to find regions that are redundant, i.e. unchanged, such as background sections of the video. These regions are replaced with a short piece of information that references the original pixels in an already coded frame (interframe motion prediction), described by a motion vector giving the direction and amount of movement.
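
The idea of finding where a block of pixels came from in a previous frame can be sketched with a brute-force block-matching search. This is only an illustration under simple assumptions (8-bit luma frames stored as flat arrays, an exhaustive search window, sum of absolute differences as the cost); the function names are made up for this example and real encoders use far faster search strategies.

```javascript
// Sum of absolute differences (SAD) between the block at (bx, by) in the
// current frame and the block displaced by (dx, dy) in the reference frame.
// Frames are flat arrays of luma samples, `width` samples per row.
function sad(cur, ref, width, bx, by, dx, dy, size) {
  let total = 0;
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      const c = cur[(by + y) * width + (bx + x)];
      const r = ref[(by + y + dy) * width + (bx + x + dx)];
      total += Math.abs(c - r);
    }
  }
  return total;
}

// Exhaustive search over a +/- `range` window: returns the displacement
// (motion vector) with the lowest SAD, skipping positions outside the frame.
function findMotionVector(cur, ref, width, height, bx, by, size, range) {
  let best = { dx: 0, dy: 0, cost: Infinity };
  for (let dy = -range; dy <= range; dy++) {
    for (let dx = -range; dx <= range; dx++) {
      const x0 = bx + dx;
      const y0 = by + dy;
      if (x0 < 0 || y0 < 0 || x0 + size > width || y0 + size > height) continue;
      const cost = sad(cur, ref, width, bx, by, dx, dy, size);
      if (cost < best.cost) best = { dx, dy, cost };
    }
  }
  return best;
}
```

When the cost of the best match is low, the encoder only needs to store the motion vector and a small residual instead of the block's pixels.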

Hybrid spatial-temporal prediction model
Flexible partitioning of macroblocks (MB) and sub-macroblocks for motion estimation
Intra prediction (extrapolates already decoded neighbouring pixels for prediction)
Introduced a multi-view extension
9 directional modes for intra prediction
Macroblock structure with a maximum size of 16×16
Entropy coding is CABAC (context-adaptive binary arithmetic coding) or CAVLC (context-adaptive variable-length coding)


  • most deployed video compression standard
  • Delivers high definition video images over direct-broadcast satellite-based television services,
  • Digital storage media and Blu-Ray disc formats,
  • Terrestrial, Cable, Satellite and Internet Protocol television (IPTV)
  • Security and surveillance systems and DVB
  • Mobile video, media players, video chat


High Efficiency Video Coding (HEVC), also known as H.265 or MPEG-H HEVC
A video compression standard designed to substantially improve coding efficiency.
Helps stream high-quality video in congested network environments or bandwidth-constrained mobile networks.
Introduced in January 2013.
A product of collaboration between the ITU Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).

Better than H.264

Helps overcome shortages of bandwidth, spectrum and storage
Bandwidth savings of approx. 45% over H.264-encoded content

Resolutions up to 8192×4320, including 8K UHD
Supports up to 300 fps
3 approved profiles, draft for 5 additional ones; 13 levels
Whereas H.264 macroblocks span block sizes from 4×4 to 16×16, HEVC coding tree units (CTUs) can be as large as 64×64, which lets the codec compress information more efficiently.

Multiview encoding – a stereoscopic video coding extension that allows efficient encoding of video sequences captured simultaneously from multiple camera angles in a single video stream. Multiview video contains a large amount of inter-view statistical dependency, which the encoder can exploit.

Compression Model

Enhanced hybrid spatial-temporal prediction model
CTUs (coding tree units) supporting a larger block structure (64×64) with more variable sub-partition structures

Motion estimation – intra prediction with more modes, asymmetric partitions in inter prediction
Tiles – individual rectangular regions that divide the image and can be coded independently

Parallel processing – the decoding process can be split up across multiple parallel threads, taking advantage of multi-core processors.

Wavefront Parallel Processing (WPP) – rows of CTUs are processed in parallel, each row starting once the first CTUs of the row above are done, so the work is spread over several threads with little loss in compression efficiency.
33 directional modes plus DC intra prediction and planar prediction; Adaptive Motion Vector Prediction
Entropy coding is CABAC only


  • cater to growing HD content for multi platform delivery
  • differentiated and premium 4K content

reduced bitrate enables broadcasters and OTT vendors to bundle more channels / content on existing delivery mediums
also provide greater video quality experience at same bitrate

Using ffmpeg for H.265 encoding

I took an H.264 file (640×480), 30 seconds long and 3,908,744 bytes in size (3.9 MB on disk), and converted it using ffmpeg.

After conversion it was an HEVC (Parameter Sets in Bitstream) MPEG-4 movie of only 621 KB, with no visible loss of clarity.

> ffmpeg -i pivideo3.mp4 -c:v libx265 -crf 28 -c:a aac -b:a 128k output.mp4
ffmpeg version 4.1.4 Copyright (c) 2000-2019 the FFmpeg developers
  built with Apple LLVM version 10.0.1 (clang-1001.0.46.4)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1.4_2 --enable-shared --enable-pthreads --enable-version3 --enable-avresample --cc=clang --host-cflags='-I/Library/Java/JavaVirtualMachines/adoptopenjdk-12.0.1.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/adoptopenjdk-12.0.1.jdk/Contents/Home/include/darwin' --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libmp3lame --enable-libopus --enable-librubberband --enable-libsnappy --enable-libtesseract --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-videotoolbox --disable-libjack --disable-indev=jack --enable-libaom --enable-libsoxr
  libavutil      56. 22.100 / 56. 22.100
  libavcodec     58. 35.100 / 58. 35.100
  libavformat    58. 20.100 / 58. 20.100
  libavdevice    58.  5.100 / 58.  5.100
  libavfilter     7. 40.101 /  7. 40.101
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  3.100 /  5.  3.100
  libswresample   3.  3.100 /  3.  3.100
  libpostproc    55.  3.100 / 55.  3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'pivideo3.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isomavc1
    creation_time   : 2019-06-23T04:58:13.000000Z
  Duration: 00:00:29.84, start: 0.000000, bitrate: 1047 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 640x480, 1046 kb/s, 25 fps, 25 tbr, 25k tbn, 50k tbc (default)
    Metadata:
      creation_time   : 2019-06-23T04:58:13.000000Z
      handler_name    : h264@GPAC0.5.2-DEV-revVersion: 0.5.2-426-gc5ad4e4+dfsg5-3+deb9u1
Codec AVOption b (set bitrate (in bits/s)) specified for output file #0 (output.mp4) has not been used for any stream. The most likely reason is either wrong type (e.g. a video option with no video streams) or that it is a private option of some encoder which was not actually used for any stream.
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> hevc (libx265))
Press [q] to stop, [?] for help
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Mac OS X][clang 10.0.1][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main profile, Level-3 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 2 / wpp(8 rows)
x265 [warning]: Source height < 720p; disabling lookahead-slices
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge         : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt        : 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress            : CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing deblock sao
Output #0, mp4, to 'output.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isomavc1
    encoder         : Lavf58.20.100
    Stream #0:0(und): Video: hevc (libx265) (hev1 / 0x31766568), yuv420p, 640x480, q=2-31, 25 fps, 12800 tbn, 25 tbc (default)
    Metadata:
      creation_time   : 2019-06-23T04:58:13.000000Z
      handler_name    : h264@GPAC0.5.2-DEV-revVersion: 0.5.2-426-gc5ad4e4+dfsg5-3+deb9u1
      encoder         : Lavc58.35.100 libx265
frame=  746 fps= 64 q=-0.0 Lsize=     606kB time=00:00:29.72 bitrate= 167.2kbits/s speed=2.56x
video:594kB audio:0kB subtitle:0kB other streams:0kB global headers:2kB muxing overhead: 2.018159%
x265 [info]: frame I:      3, Avg QP:27.18  kb/s: 1884.53
x265 [info]: frame P:    179, Avg QP:27.32  kb/s: 523.32
x265 [info]: frame B:    564, Avg QP:35.17  kb/s: 38.69
x265 [info]: Weighted P-Frames: Y:5.6% UV:5.0%
x265 [info]: consecutive B-frames: 1.6% 3.8% 9.3% 53.3% 31.9%

encoded 746 frames in 11.60s (64.31 fps), 162.40 kb/s, Avg QP:33.25

If you get an error like

Unknown encoder 'libx265'

then reinstall ffmpeg with H.265 (libx265) support.


AOMedia Video 1 (AV1)
Real-time, high-quality video encoder
A product of the Alliance for Open Media (AOM)
Carried in Matroska, WebM, ISOBMFF and RTP (WebRTC)

Better than H.265

AV1 is royalty free and overcomes the patent complexities around H.265/HEVC


  • Video transmission over the internet, VoIP, multi-party conferencing
  • Virtual / augmented reality
  • Streaming for self-driving cars
  • Intended for use in HTML5 web video and WebRTC together with the Opus audio format

Audio and Acoustic Signal Processing

Audio signals are electronic representations of sound waves: longitudinal waves which travel through air, consisting of compressions and rarefactions. Audio signal processing focuses on the computational methods for intentionally altering auditory signals or sounds in order to achieve a particular goal.

Applications of audio signal processing in general

  • storage
  • data compression
  • music information retrieval
  • speech processing ( emotion recognition/sentiment analysis , NLP)
  • localization
  • acoustic detection
  • Transmission / Broadcasting – enhance their fidelity or optimize for bandwidth or latency.
  • noise cancellation
  • acoustic fingerprinting
  • sound recognition ( speaker Identification , biometric speech verification , voice commands )
  • synthesis – electronic generation of audio signals. Speech synthesisers can generate human like speech.
  • enhancement (e.g. equalization, filtering, level compression, echo and reverb removal or addition, etc.)

Effects for audio streams processing

  • delay or echo
    To simulate a reverberation effect, one or several delayed signals are added to the original signal. To be perceived as an echo, the delay has to be of the order of 35 milliseconds or above.
    Implemented using tape delays or bucket-brigade devices.
  • flanger
    a delayed signal is added to the original signal with a continuously variable delay (usually smaller than 10 ms).
    The delayed signal falls out of phase with its partner, producing a phasing comb-filter effect, and then speeds up until it is back in phase with the master.
  • phaser
    signal is split, a portion is filtered with a variable all-pass filter to produce a phase-shift, and then the unfiltered and filtered signals are mixed to produce a comb filter.
  • chorus
    a delayed version of the signal is added to the original signal. The delay has to be above 5 ms to be audible. Often, the delayed signals will be slightly pitch shifted to more realistically convey the effect of multiple voices.
  • equalization
    frequency response is adjusted using audio filter(s) to produce desired spectral characteristics. Frequency ranges can be emphasized or attenuated using low-pass, high-pass, band-pass or band-stop filters.
  • overdrive
    effects such as a fuzz box can be used to produce distorted sounds, for example to imitate robotic voices or to simulate distorted radiotelephone traffic
  • pitch shift
    shifts a signal up or down in pitch. For example, a signal may be shifted an octave up or down. This is usually applied to the entire signal, and not to each note separately. Blending the original signal with shifted duplicate(s) can create harmonies from one voice.
  • time stretching
    changing the speed of an audio signal without affecting its pitch.
  • resonators
    emphasize harmonic frequency content on specified frequencies. These may be created from parametric EQs or from delay-based comb-filters.
  • modulation
    change the frequency or amplitude of a carrier signal in relation to a predefined signal.
  • compression
    reduction of the dynamic range of a sound to avoid unintentional fluctuation in the dynamics. Level compression is not to be confused with audio data compression, where the amount of data is reduced without affecting the amplitude of the sound it represents.
  • 3D audio effects
    place sounds outside the stereo basis
  • reverse echo
    swelling effect created by reversing an audio signal and recording echo and/or delay while the signal runs in reverse.
  • wave field synthesis
    spatial audio rendering technique for the creation of virtual acoustic environments
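
As an illustration of the delay-based effects in the list above, here is a minimal sketch of a single echo: a delayed, attenuated copy of the signal is mixed back into the original. The function name and parameters are made up for this example, and samples are assumed to be floating-point PCM values.

```javascript
// Mix one delayed, attenuated copy of the signal into the original.
// samples: array-like of PCM samples, sampleRate in Hz,
// delayMs: echo delay in milliseconds, gain: level of the delayed copy.
function addEcho(samples, sampleRate, delayMs, gain) {
  const delaySamples = Math.round((delayMs / 1000) * sampleRate);
  // The output is longer than the input so the echo tail is not cut off.
  const out = new Float32Array(samples.length + delaySamples);
  for (let i = 0; i < samples.length; i++) {
    out[i] += samples[i];                       // dry signal
    out[i + delaySamples] += gain * samples[i]; // delayed copy
  }
  return out;
}
```

With a delay of 35 ms or more the copy is heard as a distinct echo; flanger and chorus effects use the same mixing idea with much shorter, time-varying delays.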

ASP applications in telephony and mobile phones, by ITU (International Telecommunication Union)

  • Acoustic echo control
    aims to eliminate the acoustic feedback, which is particularly problematic in the speakerphone use-case during bidirectional voice
  • Noise control
    The microphone doesn’t only pick up the desired speech signal, but often also unwanted background noise. Noise control tries to minimize those unwanted signals. Multi-microphone AASP has enabled the suppression of directional interferers.
  • Gain control
    how loud a speech signal should be when leaving a telephony transmitter as well as when it is being played back at the receiver. Implemented either statically during the handset design stage or automatically/adaptively during operation in real-time.
  • Linear filtering
    ITU defines an acceptable timbre range for optimum speech intelligibility. AASP in the form of linear filtering can help the handset manufacturer to meet these requirements.
  • Speech coding: the move from analog POTS-based calls to the G.711 narrowband (approximately 300 Hz to 3.4 kHz) speech coder was a big leap in terms of call capacity. Other speech coders with varying tradeoffs between compression ratio, speech quality, and computational complexity have also been made available. AASP provides higher-quality wideband speech (approximately 150 Hz to 7 kHz).

ASP applications in music playback

AASP is used to provide audio post-processing and audio decoding capabilities for mobile media consumption needs, such as listening to music, watching videos, and gaming

  • Post-processing
    techniques such as equalization and filtering allow the user to adjust the timbre of the audio, for example bass boost and parametric equalization. Other techniques include adding reverberation, pitch shift, time stretching etc.
  • Audio (de)coding: audio formats like MP3 and AAC define how music is distributed, stored, and consumed, including in online music streaming services

ASP for virtual assistants

Virtual assistants include a variety of services such as Apple’s Siri, Microsoft’s Cortana, Google Now and Alexa. ASP is used in

  • Speech enhancement
    multi-microphone speech pickup using beamforming and noise suppression to isolate the desired speech prior to forwarding it to the speech recognition engine.
  • Speech recognition (speech-to-text): this draws ideas from multiple disciplinary fields including linguistics, computer science, and AASP. Ongoing work in acoustic modeling is a major contribution to recognition accuracy improvement in speech recognition by AASP.
  • Speech synthesis (text-to-speech): this technology has come a very long way from its very robotic sounding introduction in the 1930s to making synthesized speech sound more and more natural.

Other areas of ASP

  • Virtual reality (VR) devices such as VR headsets and gaming simulators use three-dimensional soundfield acquisition and representation such as Ambisonics (also known as B-format).


JavaScript Session Establishment Protocol (JSEP) in WebRTC handshake

This article aims to explain the intricacies of the detailed offer/answer flow in the WebRTC handshake and JSEP. You can read the following articles on WebRTC as prerequisites before reading this one

WebRTC API – Peerconnection , getUserMedia , Datachannel , DataStaats

JSEP (JavaScript Session Establishment Protocol)

JSEP (JavaScript Session Establishment Protocol) is used during signalling via the W3C RTCPeerConnection API to set up a multimedia session. The multimedia session description specifies the critical components of setting up a session between the local and remote sides, such as transport ports, protocols and profiles. JSEP also handles the interaction with the ICE state machine.

Offer/Answer Exchange Flow

Prerequisite: set up the client side for the caller
PeerConnectionFactory to generate PeerConnections
PeerConnection for every connection to a remote peer
MediaStream with audio and video from the client device

  1. The side initiating the session creates an offer using the createOffer() API
aPromise = myPeerConnection.createOffer([options]);

options is of type RTCOfferOptions

  • iceRestart
  • offerToReceiveAudio ( legacy)
  • offerToReceiveVideo ( legacy)
  • voiceActivityDetection

2. The application then stores the offer in its local configuration using the setLocalDescription() API

myPeerConnection.createOffer().then(function(offer) {
    return myPeerConnection.setLocalDescription(offer);
});

3. The offer is sent to the remote side using the application's choice of signalling (SIP, WS, HTTP, XMPP ...)

4. The remote party stores it using the setRemoteDescription() API (sessionDescription below is an illustrative name for the offer received over signalling)

myPeerConnection.setRemoteDescription(sessionDescription)
.then(function () {
  return createMyStream();
});

5. The remote party generates an answer using the createAnswer() API

aPromise = RTCPeerConnection.createAnswer([options]);

6. The remote party stores the answer in its local configuration using the setLocalDescription() API

7. The answer is transferred to the initiator side using the choice of signalling (SIP, WS, HTTP, XMPP ...) again

8. The initiating side stores it using the setRemoteDescription() API
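
The whole exchange above can be sketched as one function. This is only an illustration: it assumes both peers are RTCPeerConnection-like objects reachable from the same code, and `signal` is a hypothetical stand-in for whatever signalling channel (SIP, WS, HTTP, XMPP ...) actually carries the descriptions. Only the standard createOffer, createAnswer, setLocalDescription and setRemoteDescription methods are called on the peers.

```javascript
// Run one offer/answer round between two RTCPeerConnection-like objects.
// `signal` is a hypothetical transport: it takes a description on one side
// and resolves with the description as received on the other side.
async function negotiate(caller, callee, signal = async (desc) => desc) {
  const offer = await caller.createOffer();          // caller creates the offer
  await caller.setLocalDescription(offer);           // caller stores it locally
  const receivedOffer = await signal(offer);         // offer travels over signalling
  await callee.setRemoteDescription(receivedOffer);  // callee stores the remote offer

  const answer = await callee.createAnswer();        // callee creates the answer
  await callee.setLocalDescription(answer);          // callee stores it locally
  const receivedAnswer = await signal(answer);       // answer travels back
  await caller.setRemoteDescription(receivedAnswer); // caller stores the remote answer
  return { offer, answer };
}
```

In a real application the two halves run on different machines, so the caller part and the callee part would live in separate handlers triggered by signalling messages.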

Figure: WebRTC interfaces and adding tracks to a stream

Figure: Process to perform the WebRTC handshake

Figure: WebRTC call setup and incoming call flow between the remote peer, PeerConnectionFactory, PeerConnection and the application (setting up a call, receiving a call)

Signalling State Transitions on PeerConnection

When the caller creates a new RTCPeerConnection(), the RTCSignalingState is “stable”, as the remote and local descriptions are empty.

When the caller initiates the call with createOffer(), it obtains the offer SDP and proceeds to store the offer locally with setLocalDescription(offer); the RTCSignalingState is now “have-local-offer”. After that the caller sends the offer to the callee over the signalling channel.

Similarly, when the callee receives the offer, it starts with RTCSignalingState “stable” and then stores the remote offer using setRemoteDescription(offer); its state is now “have-remote-offer”.

The callee generates a provisional answer for the caller and stores it locally; the state transitions to “have-local-pranswer”. The pranswer SDP is sent to the caller over the signalling channel again.

The caller stores the callee’s pranswer SDP and its state updates to “have-remote-pranswer”.

Once the final answer has been applied and there is no offer/answer exchange in progress, the state changes back to “stable”.

The state changes to “closed” if the RTCPeerConnection is closed.
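
These transitions can be written down as a small lookup table. The sketch below is an illustration only: it models the state names described above, the table and function names are made up for this example, and rollback and the “closed” state are deliberately left out.

```javascript
// Allowed signalling-state transitions, keyed by "<side>:<description type>",
// where side is "local" (setLocalDescription) or "remote" (setRemoteDescription).
const TRANSITIONS = {
  "stable": {
    "local:offer": "have-local-offer",
    "remote:offer": "have-remote-offer",
  },
  "have-local-offer": {
    "local:offer": "have-local-offer",
    "remote:pranswer": "have-remote-pranswer",
    "remote:answer": "stable",
  },
  "have-remote-offer": {
    "remote:offer": "have-remote-offer",
    "local:pranswer": "have-local-pranswer",
    "local:answer": "stable",
  },
  "have-local-pranswer": {
    "local:pranswer": "have-local-pranswer",
    "local:answer": "stable",
  },
  "have-remote-pranswer": {
    "remote:pranswer": "have-remote-pranswer",
    "remote:answer": "stable",
  },
};

// Return the next signalling state, or throw if the description type
// is not acceptable in the current state.
function nextSignalingState(state, side, type) {
  const next = (TRANSITIONS[state] || {})[`${side}:${type}`];
  if (!next) {
    throw new Error(`invalid transition from "${state}" via ${side} ${type}`);
  }
  return next;
}
```

For example, a caller goes from “stable” to “have-local-offer” on applying its own offer, and back to “stable” once the remote answer is applied.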


Detailed Offer / Answer SDP

Local Offer created by side initiating the session / Caller

The first offer, called the initial offer, can carry dummy data in the connection line so as to prevent leaking a local IP address

c=IN IP4
a=rtcp:9 IN IP4

“o=” line contains <username> <sess-id> <sess-version> <nettype> <addrtype> <unicast-address>

o=- 4445251981417004127 2 IN IP4 127.0.0.

This shows the username “-” and 4445251981417004127 as the session id. The same “-” is specified in the “s=” (session name) line

“t=” line shows <start time> <stop time>

t=0 0

Full session Block example

type: offer, sdp: v=0
o=- 4445251981417004127 2 IN IP4
t=0 0
a=group:BUNDLE 0 1 2
a=msid-semantic: WMS DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2

Media Section: an m= section is generated for each RtpTransceiver that has been added to the PeerConnection. For the initial offer, since no ports are available yet, the dummy port 9 is used. However, if the section is bundle-only, the port value is set to 0. Later the port value will be set to the port of the default ICE candidate.

The DTLS-based protocol field “UDP/TLS/RTP/SAVPF” is followed by the list of codec payload types in order of priority.

The “c=” line in the m= section must also be filled with dummy values, as no candidates are available yet.

ICE and DTLS attributes: “a=ice-ufrag”, “a=ice-pwd”, “a=fingerprint”, “a=setup”, “a=tls-id”

Media Stream Identification attribute “a=mid:”

For each media format on the m= line, an “a=rtpmap” line gives the codec name and clock rate and “a=fmtp” carries its format parameters. For an “rtx” (retransmission) format, “a=rtpmap” uses the clock rate of the primary codec and “a=fmtp” references the payload type of the primary codec. “a=rtcp-fb” specifies the RTCP feedback.

a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1

Audio Block example

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4
a=rtcp:9 IN IP4
a=fingerprint:sha-256 1D:C8:1F:18:D2:AB:B7:68:CC:DC:A8:8D:6B:1D:70:11:06:E9:19:D2:22:CE:A5:F3:BE:82:00:ED:99:58:20:4A
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid
a=extmap:5 urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id
a=extmap:6 urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id
a=msid:DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2 7525d75c-ffe7-4038-8b71-653d249e63bb
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:106 CN/32000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:110 telephone-event/48000
a=rtpmap:112 telephone-event/32000
a=rtpmap:113 telephone-event/16000
a=rtpmap:126 telephone-event/8000
a=ssrc:3968544080 cname:da0nYe1oYR8AvVNp
a=ssrc:3968544080 msid:DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2 7525d75c-ffe7-4038-8b71-653d249e63bb
a=ssrc:3968544080 mslabel:DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2
a=ssrc:3968544080 label:7525d75c-ffe7-4038-8b71-653d249e63bb

// video section removed for simplicity

A data block, with an m= section for data, is created if a data channel has been created.

“a=sctp-port” line referencing the SCTP port number set to 5000

 “a=max-message-size”  set to 262144 here

Data Block example

m=application 9 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4
a=fingerprint:sha-256 1D:C8:1F:18:D2:AB:B7:68:CC:DC:A8:8D:6B:1D:70:11:06:E9:19:D2:22:CE:A5:F3:BE:82:00:ED:99:58:20:4A

Subsequent Offers

When createOffer is called a second (or later) time, or is called after a local description has already been installed, the processing is different because ICE candidates have been gathered. The fields of the “o=” line stay the same, except that <session-version> is incremented if the offer differs from the previous one.

Additionally, the m= sections are updated if an RtpTransceiver is added or removed.

Each “m=” and “c=” line MUST be filled in with the port, relevant RTP profile, and address of the default candidate for the m= section.

If the m= section is not bundled into another m= section, update “a=rtcp” with the port and address of the RTCP candidate and add the “a=candidate” lines together with “a=end-of-candidates”.

Local Answer created by side receiving the session/ Callee

When createAnswer is called for the first time after a remote description has been provided, the result is known as the initial answer. 

 Each offered m= section will have an associated RtpTransceiver

The remote side (callee) can reject an m= section by setting the port in the m= line to 0. It can reject the m= section if none of the offered media formats are supported, if the RtpTransceiver is stopped, etc.

For the initial answer, as for the initial offer, the dummy port value of 9 is set, as no ICE candidate is available yet. Similarly, the “c=” line must contain the “dummy” value “IN IP4” too.

The <proto> field MUST be set to exactly match the <proto> field for the corresponding m= line in the offer.

type: answer, sdp: v=0
o=- 5730481682283561642 3 IN IP4
t=0 0
a=group:BUNDLE 0 1 2
a=msid-semantic: WMS KGmQ9mTmvTaWlHTQ0B0YP36QIxOYNeB3i2nT

Audio section

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4
a=rtcp:9 IN IP4
a=fingerprint:sha-256 B9:9C:8A:A9:E9:09:0C:FB:52:2A:D3:18:7B:A9:D4:EC:B3:00:77:72:27:51:EC:5F:82:BE:11:7F:C7:CF:43:43
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid
a=extmap:5 urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id
a=extmap:6 urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id
a=msid:KGmQ9mTmvTaWlHTQ0B0YP36QIxOYNeB3i2nT e817fe0f-1cc0-4901-9fd9-e810289cc85d
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:106 CN/32000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:110 telephone-event/48000
a=rtpmap:112 telephone-event/32000
a=rtpmap:113 telephone-event/16000
a=rtpmap:126 telephone-event/8000
a=ssrc:3260997313 cname:FxLUKuXrLQe0r1rn

Video section removed for simplicity

Data stream

m=application 9 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4
a=fingerprint:sha-256 B9:9C:8A:A9:E9:09:0C:FB:52:2A:D3:18:7B:A9:D4:EC:B3:00:77:72:27:51:EC:5F:82:BE:11:7F:C7:CF:43:43

Subsequent Answers

The port value would normally be set to the port of the default ICE candidate for this m= section. For the example above

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126

will be changed to carry the relevant port, such as

type: offer, sdp: v=0
o=- 6407282338169184323 3 IN IP4
t=0 0
a=group:BUNDLE 0 1 2
a=msid-semantic: WMS bSrCUCFybGovIy0FUhPTZAr9ToRmx8I09nEj
m=audio 55375 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4
a=rtcp:9 IN IP4
a=candidate:2880323124 1 udp 2122260223 55375 typ host generation 0 network-id 1 network-c

Similarly, the video and data m= lines will also get ports

m=video 53877 UDP/TLS/RTP/SAVPF 96 97 98 99 100 101 102 122 127 121 125 107 108 109 124 120 123 119 114 115 116
c=IN IP4
a=rtcp:9 IN IP4
a=candidate:2880323124 1 udp 2122260223 53877 typ host generation 0 network-id 1 network-cost 10
m=application 57991 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4
a=candidate:2880323124 1 udp 2122260223 57991 typ host generation 0 network-id 1 network-cost 10

If the answer contains any “a=ice-options” attributes where “trickle” is listed as an attribute, update the PeerConnection canTrickle property to be true. 

Modifying Offer/answer SDP

SDP returned from createOffer or createAnswer MUST NOT be changed before passing it to setLocalDescription.
After calling setLocalDescription with an offer or answer, the application MAY modify the SDP to reduce its capabilities before sending it to the far side

Assume we have a MCU at location and want the video stream to relay via a Media Server.

SDP Parsing

SDP is used for session description and contains a sequence of lines of key-value pairs. The SDP is read line by line and converted to a data structure that contains the deserialized information.
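
A minimal sketch of such a line-by-line reader is shown below. It only splits the text into session-level lines and m= sections and does no syntax or semantic checking; the function name and the shape of the returned object are assumptions made for this illustration, not part of any WebRTC API.

```javascript
// Split an SDP blob into session-level lines and per-media sections.
// Every SDP line has the form "<single letter>=<value>".
function parseSdp(sdp) {
  const session = { lines: [], media: [] };
  let current = session; // lines before the first "m=" belong to the session
  for (const raw of sdp.split(/\r?\n/)) {
    const line = raw.trim();
    if (line.length < 2 || line[1] !== "=") continue; // skip blank or malformed lines
    const key = line[0];
    const value = line.slice(2);
    if (key === "m") {
      // An "m=" line opens a new media section.
      current = { m: value, lines: [] };
      session.media.push(current);
    } else {
      current.lines.push({ key, value });
    }
  }
  return session;
}
```

Each entry in `media` keeps its own attribute lines, which mirrors the split between session-level and media-section parsing described next.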

JSEP SDP bears a lot of similarity to SIP SDP, explained here: SIP and SDP Messages Explained

Session-Level Parsing

The “v=”, “o=”, “b=” and “a=” lines are processed. The “i=”, “u=”, “e=”, “p=”, “t=”, “r=”, “z=”, and “k=” lines are not used by this specification; they MUST be checked for syntax but their values are not used. The “c=” line is checked for syntax and ICE mismatch detection.

The “a=” attribute could be: “a=group”, “a=ice-lite”, “a=ice-pwd”, “a=ice-options”, “a=fingerprint”, “a=setup”, “a=tls-id”, “a=identity”, “a=extmap”

Media Section Parsing

The “m=” line carries the media, port, proto and fmt (RTP payload type) fields

Attributes “a=” can be

“a=rtpmap” or “a=fmtp”

“a=rtpmap” maps an RTP payload type number to a media encoding name that identifies the payload format.

a=rtpmap:<payload type> <encoding name>/<clock rate> [/<encoding parameters>]
m=audio 49230 RTP/AVP 96 97 98
a=rtpmap:96 L8/8000
a=rtpmap:97 L16/8000
a=rtpmap:98 L16/11025/2
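
To make the format concrete, an “a=rtpmap” line can be taken apart with a small regular expression. This is a sketch for illustration; the function name and the field names of the returned object are assumptions, not part of any library.

```javascript
// Parse "a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>]".
// Returns null when the line does not match that shape.
function parseRtpmap(line) {
  const match = /^a=rtpmap:(\d+) ([^/\s]+)\/(\d+)(?:\/(\S+))?$/.exec(line.trim());
  if (!match) return null;
  return {
    payloadType: Number(match[1]),
    encodingName: match[2],
    clockRate: Number(match[3]),
    // For audio the optional parameter is the number of channels.
    encodingParameters: match[4] || null,
  };
}
```

For “a=rtpmap:98 L16/11025/2” this gives payload type 98, encoding name L16, clock rate 11025 and encoding parameters “2”.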

“a=ptime” , “a=maxptime”

Direction as “a=sendrecv”, “a=recvonly”, “a=sendonly”, “a=inactive”

Muxing as “a=rtcp-mux”

RTCP attributes “a=rtcp”, “a=rtcp-rsize”

Line “c=” is checked.

Line “b=” for bandwidth, bwtype

Attributes for “a=” could be “a=ice-ufrag”, “a=ice-pwd”, “a=ice-options”, “a=candidate”, “a=remote-candidate”, “a=end-of-candidates” and “a=fingerprint”

Semantics Verification

Interactive Connectivity Establishment (ICE) for NAT traversal

Protocols using offer/answer are difficult to operate through Network Address Translators (NATs), since the flow of media packets requires the IP addresses and ports of media sources and sinks to be carried within their messages. Real-time media also places emphasis on reduced latency and decreased packet loss.

ICE is an extension to the offer/answer model and works by including a multiplicity of IP addresses and ports in SDP offers and answers, which are then tested for connectivity by peer-to-peer connectivity checks.
The checks are done using STUN and TURN.
ICE also allows for address selection for multi-homed and dual-stack hosts.

ICE allows the agents to discover enough information about their topologies to potentially find one or more paths by which they can communicate. Then it systematically tries all possible pairs (in a carefully sorted order) until it finds one or more that work.

ICE Gathering

Caller and callee perform checks to finalize the protocol and routing needed to establish a peer connection. A number of candidates are proposed until they mutually agree upon one. The PeerConnection then uses that candidate's details to initiate the connection.

While applying a local description at the media engine level, if an m= section is new, the WebRTC media stack begins gathering candidates for it.

RTCPeerConnection exposes canTrickleIceCandidates. ICE trickling is the process of continuing to send candidates after the initial offer or answer has already been sent to the other peer.

The ICE transport role decides which side chooses the candidate pair

The ICE layer sets one peer as the controlling agent and the other as the controlled agent. The controlling agent makes the final decision as to which candidate pair to choose.

Final selected candidate in SDP

a=group:BUNDLE 0 1 2
a=msid-semantic: WMS 9Cv3eIelHVuhxrGfxSvUsfokNu4eb4R9PYw2

m=audio 59937 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4 x.x.x.x
a=rtcp:9 IN IP4
a=candidate:2880323124 1 udp 2122260223 x.x.x.x 59937 typ host generation 0 network-id 1 network-cost 10
a=candidate:3844981444 1 tcp 1518280447 x.x.x.x 9 typ host tcptype active generation 0 network-id 1 network-cost 10

An agent identifies all of its candidates, where a CANDIDATE is a transport address (an IP address and port). Types:

  • HOST CANDIDATE – obtained directly from a local interface, which could be Wi-Fi, a Virtual Private Network (VPN) or Mobile IP (MIP).
    If an agent is multihomed (private and public networks), it obtains a candidate from each IP address and includes all candidates in its offer.
  • Additional candidates obtained through STUN or TURN. Types:
    • translated addresses on the public side of a NAT (SERVER REFLEXIVE CANDIDATES)
    • addresses on TURN servers (RELAYED CANDIDATES)
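The candidate lines in the SDP above can be split into their fields with a small parser. The following is a minimal sketch; parseCandidate is a hypothetical helper, and it assumes the usual field order (foundation, component, transport, priority, address, port, "typ", type, then name/value extensions).

```javascript
// Sketch: parse an ICE candidate line into its fields.
function parseCandidate(line) {
  // Accept both "a=candidate:..." (SDP) and "candidate:..." (RTCIceCandidate.candidate).
  var text = line.replace(/^a=/, '').replace(/^candidate:/, '');
  var parts = text.trim().split(/\s+/);
  if (parts.length < 8 || parts[6] !== 'typ') {
    return null; // not a well-formed candidate line
  }
  var parsed = {
    foundation: parts[0],
    component: parseInt(parts[1], 10),
    transport: parts[2].toLowerCase(),
    priority: parseInt(parts[3], 10),
    address: parts[4],
    port: parseInt(parts[5], 10),
    type: parts[7], // host, srflx, prflx or relay
    extensions: {}
  };
  // Remaining tokens are name/value pairs such as raddr, rport, tcptype, generation.
  for (var i = 8; i + 1 < parts.length; i += 2) {
    parsed.extensions[parts[i]] = parts[i + 1];
  }
  return parsed;
}
```

For the UDP host candidate shown in the SDP, this yields type "host", port 59937 and priority 2122260223. A line whose address has been removed does not match the expected layout and returns null.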

Mapping Server Reflexive address

The agent sends a TURN Allocate request from IP address and port X:x.
The NAT creates a binding X1′:x1′, mapping this server reflexive candidate to the host candidate X:x (the BASE).
Outgoing packets sent from the host candidate will be translated by the NAT to the server reflexive candidate.
Incoming packets sent to the server reflexive candidate will be translated by the NAT to the host candidate and forwarded to the agent.

Allocate request and response from TURN – informing the agent of this relayed candidate

Only STUN based Binding

The agent sends a STUN Binding request to its STUN server. The server sees the translated source address of the request and returns it in the Binding response as the server reflexive candidate.

STUN Binding request for connectivity checks on CANDIDATE PAIRS

The candidates are carried in attributes in the SDP offer. The remote peer follows the same process, gathering and sending its own sorted list of candidates. CANDIDATE PAIRS are then formed from the candidates of both sides.

PEER REFLEXIVE CANDIDATES – connectivity checks can produce additional candidates, especially around symmetric NATs.

Since the same address is used for STUN and media (RTP/RTCP), demultiplexing based on packet contents identifies which one is which.
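One common way to demultiplex is to look at the first byte of each packet. The sketch below uses the first-byte ranges described in RFC 7983; classifyPacket is a hypothetical helper, and it does not try to tell RTP from RTCP.

```javascript
// Sketch: classify a received packet by its first byte (ranges from RFC 7983).
function classifyPacket(bytes) {
  if (!bytes || bytes.length === 0) {
    return 'unknown';
  }
  var first = bytes[0];
  if (first <= 3) {
    return 'stun';          // STUN messages start with two zero bits
  }
  if (first >= 20 && first <= 63) {
    return 'dtls';          // DTLS record content types
  }
  if (first >= 128 && first <= 191) {
    return 'rtp-or-rtcp';   // RTP/RTCP version 2 packets
  }
  return 'unknown';
}
```

A transport receiving on the shared port could call this on every incoming datagram and hand it to the STUN, DTLS or media handler accordingly.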

Checks : ICE checks are performed in a specific sequence, so that high-priority candidate pairs are checked first.

TRIGGERED CHECKS – accelerates the process of finding a valid candidate
ORDINARY CHECKS – agent works through ordered prioritised check list by sending a STUN request for the next candidate pair on the list periodically.
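The ordering comes from the priority formulas in RFC 5245 (sections 4.1.2.1 and 5.7.2). The sketch below implements them; the function names are hypothetical, and the preference values in the usage note are simply values that reproduce the priorities in the SDP above, not something the browser is confirmed to use.

```javascript
// Candidate priority: 2^24 * type preference + 2^8 * local preference + (256 - component ID).
function candidatePriority(typePreference, localPreference, componentId) {
  // Multiplication is used instead of << because JavaScript shifts are 32-bit signed.
  return typePreference * Math.pow(2, 24) +
         localPreference * Math.pow(2, 8) +
         (256 - componentId);
}

// Pair priority: 2^32 * MIN(G, D) + 2 * MAX(G, D) + (G > D ? 1 : 0).
// G is the controlling agent's candidate priority, D the controlled agent's.
// BigInt is used because the result can exceed Number.MAX_SAFE_INTEGER.
function pairPriority(g, d) {
  var G = BigInt(g);
  var D = BigInt(d);
  var min = G < D ? G : D;
  var max = G > D ? G : D;
  return (min << 32n) + 2n * max + (G > D ? 1n : 0n);
}

// Return a copy of the pairs sorted from highest to lowest pair priority.
function sortCheckList(pairs) {
  return pairs.slice().sort(function (a, b) {
    var pa = pairPriority(a.controlling, a.controlled);
    var pb = pairPriority(b.controlling, b.controlled);
    return pa > pb ? -1 : (pa < pb ? 1 : 0);
  });
}
```

For example, candidatePriority(126, 32542, 1) gives 2122260223, the priority of the UDP host candidate shown earlier, and candidatePriority(100, 32542, 1) gives 1686052607, the srflx priority seen in the gathering log.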

Checks start with most pairs Frozen: pairs that share a foundation are likely to behave alike, so one pair per foundation is checked first for a media stream and the others are unfrozen once it succeeds.

Each candidate pair in the check list has a foundation and a state. States for candidate pairs:

1. Waiting: A check has not been performed for this pair, and can be performed as soon as it is the highest-priority Waiting pair on the check list.

2. In-Progress: A check has been sent for this pair, but the transaction is in progress.

3. Succeeded: A check for this pair was already done and produced a successful result.

4. Failed: A check for this pair was already done and failed, either never producing any response or producing an unrecoverable failure response.

5. Frozen: A check for this pair hasn’t been performed, and it can’t yet be performed until some other check succeeds, allowing this pair to unfreeze and move into the Waiting state.
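The five states can be sketched as a small state machine. This is a simplified illustration with hypothetical names: it handles a single media stream, uses small numeric priorities, and ignores component IDs, which a real ICE agent also takes into account.

```javascript
var PairState = {
  FROZEN: 'Frozen',
  WAITING: 'Waiting',
  IN_PROGRESS: 'In-Progress',
  SUCCEEDED: 'Succeeded',
  FAILED: 'Failed'
};

// Build a check list: every pair starts Frozen, then the highest-priority
// pair of each foundation is moved to Waiting.
function createCheckList(pairs) {
  var list = pairs
    .map(function (p) {
      return { id: p.id, foundation: p.foundation, priority: p.priority, state: PairState.FROZEN };
    })
    .sort(function (a, b) { return b.priority - a.priority; });
  var seen = {};
  list.forEach(function (pair) {
    if (!seen[pair.foundation]) {
      seen[pair.foundation] = true;
      pair.state = PairState.WAITING;
    }
  });
  return list;
}

// Pick the highest-priority Waiting pair and mark it In-Progress.
function nextCheck(list) {
  for (var i = 0; i < list.length; i++) {
    if (list[i].state === PairState.WAITING) {
      list[i].state = PairState.IN_PROGRESS;
      return list[i];
    }
  }
  return null;
}

// Record a check result; on success, unfreeze pairs that share the foundation.
function completeCheck(list, pair, succeeded) {
  pair.state = succeeded ? PairState.SUCCEEDED : PairState.FAILED;
  if (!succeeded) {
    return;
  }
  list.forEach(function (other) {
    if (other.foundation === pair.foundation && other.state === PairState.FROZEN) {
      other.state = PairState.WAITING;
    }
  });
}
```

Calling nextCheck repeatedly walks the list in priority order, and each successful check moves the Frozen pairs with the same foundation into Waiting.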

Example of ICE gather state

icegatheringstatechange – gathering

icecandidate (host)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:1511920713 1 udp 2122260223 58122 typ host generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (srflx)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:4081163164 1 udp 1686052607 37542 typ srflx raddr rport 58122 generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (host)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:345893049 1 tcp 1518280447 9 typ host tcptype active generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (relay)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:2130406062 1 udp 41886207 27190 typ relay raddr rport 37542 generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (relay)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:3052096874 1 udp 25108479 28049 typ relay raddr rport 37543 generation 0 ufrag vzpn network-id 1 network-cost 10

icegatheringstatechange – complete

Example of Candidate Checking

iceconnectionstatechange : checking

setRemoteDescription L type: answer, sdp: v=0

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 110 112 113 126
c=IN IP4
a=rtcp:9 IN IP4

m=video 9 UDP/TLS/RTP/SAVPF 98 100 96 97 99 101 102 122 127 121 125 107 108 109 124 120 123 119 114 115 116
c=IN IP4
a=rtcp:9 IN IP4

addIceCandidate (host)
sdpMid: , sdpMLineIndex: 0, candidate: candidate:1511920713 1 udp 2122260223 56060 typ host generation 0 ufrag ydvf network-id 1 network-cost 10

iceconnectionstatechange : connected

Candidate Nomination for Media Path

Low-latency media paths can be selected using techniques such as actual round-trip time (RTT) measurement.
The controlling agent nominates which of the valid candidate pairs will be used for media. There are two ways to do this: regular nomination and aggressive nomination.
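With regular nomination the controlling agent is free to apply its own selection policy to the valid pairs. A minimal sketch of one such policy, assuming each valid pair carries a measured rttMs and a priority (both field names, and chooseNominee itself, are hypothetical):

```javascript
// Sketch: pick the pair to nominate among the valid ones.
// Prefers the lowest measured RTT and falls back to the higher priority on a tie.
function chooseNominee(validPairs) {
  if (validPairs.length === 0) {
    return null; // nothing has passed a connectivity check yet
  }
  return validPairs.slice().sort(function (a, b) {
    if (a.rttMs !== b.rttMs) {
      return a.rttMs - b.rttMs;   // lower latency first
    }
    return b.priority - a.priority; // then higher priority
  })[0];
}
```

With aggressive nomination there is no such selection step: every check already carries the nomination flag, so a policy like this only applies to the regular mode.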



References : WebRTC 1.0: Real-time Communication Between Browsers – W3C Editor’s Draft 31 August 2019
RFC 5245 Inter

Websockets as VOIP signal transport medium

Web resources are usually built on a request/response paradigm, as in HTTP or SIP messages. The server responds only when a client requests it to, which made web interactions slow and unsuited to VoIP signalling.
Long polling had the page repeatedly poll the server for new resources on its own, instead of waiting for an explicit request from the user.
AJAX and multipart XHR tried to patch the problem with selective reloading; however, they still required the client to map each incoming reply to the correct request.
Because of the latency overhead of an HTTP transaction, with its working mode of opening a new TCP connection for every request and response and adding HTTP headers, none of them were suited to real-time operations.

WebSocket is currently (2017) the best-suited solution for real-time signalling that meets VoIP requirements, because it establishes a persistent socket.

Websocket Protocol

Enables two-way communication between a client running untrusted code in a controlled environment to a remote host that has opted-in to communications from that code.

protocol consists of an opening handshake followed by basic message framing, layered over TCP.
handshake is interpreted by HTTP servers as an Upgrade request.
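The message framing can be illustrated with a small decoder. The following is a minimal sketch based on the frame layout in RFC 6455 section 5.2; decodeSmallFrame is a hypothetical helper that only handles payloads shorter than 126 bytes.

```javascript
// Sketch: decode one small WebSocket frame from an array of bytes.
function decodeSmallFrame(bytes) {
  var fin = (bytes[0] & 0x80) !== 0;     // final fragment flag
  var opcode = bytes[0] & 0x0f;          // 0x1 = text, 0x2 = binary, 0x8 = close
  var masked = (bytes[1] & 0x80) !== 0;  // client-to-server frames are masked
  var length = bytes[1] & 0x7f;
  if (length > 125) {
    throw new Error('extended payload lengths are not handled in this sketch');
  }
  var offset = 2;
  var mask = null;
  if (masked) {
    mask = bytes.slice(offset, offset + 4); // 4-byte masking key
    offset += 4;
  }
  var payload = [];
  for (var i = 0; i < length; i++) {
    var b = bytes[offset + i];
    payload.push(masked ? (b ^ mask[i % 4]) : b);
  }
  return { fin: fin, opcode: opcode, masked: masked, payload: payload };
}
```

Feeding it the bytes 0x81 0x05 0x48 0x65 0x6c 0x6c 0x6f, the unmasked text frame example from the RFC, returns a final text frame whose payload decodes to "Hello".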

Secure websocket example :

Request URL: wss://
Request Method: GET
Status Code: 101 Switching Protocols

Response Headers
Connection: Upgrade
Sec-WebSocket-Accept: UVhTdFOWfywGyQTKDRZyGuhkfls=
Sec-WebSocket-Extensions: permessage-deflate
Upgrade: websocket

Request Headers
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cache-Control: no-cache
Connection: Upgrade
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: 06FNaHge8GLGVuPFxV2fAQ==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36

Query String parameters
transport: websocket
sid: hh3Dib_aBWgqyO1IAAEL
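The Sec-WebSocket-Accept value in the response is derived from the Sec-WebSocket-Key in the request. A minimal sketch of that derivation as defined in RFC 6455, assuming a Node.js environment for the crypto module (computeAccept is a hypothetical helper name):

```javascript
var crypto = require('crypto');

// Fixed GUID defined by RFC 6455 for the opening handshake.
var WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Sec-WebSocket-Accept = base64( SHA-1( Sec-WebSocket-Key + GUID ) )
function computeAccept(secWebSocketKey) {
  return crypto
    .createHash('sha1')
    .update(secWebSocketKey + WS_GUID)
    .digest('base64');
}
```

With the sample key from the RFC, computeAccept('dGhlIHNhbXBsZSBub25jZQ==') returns 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='. A client uses the same computation to verify that the server really understood the WebSocket handshake.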

Working with websockets

A new WebSocket can be opened with the ws or wss scheme, and it can request subprotocols as in the example.

var wsconnection = new WebSocket('wss://', ['soap', 'xmpp']);

It can be attached with event handlers

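// Called once the connection is established
wsconnection.onopen = function () {
  console.log('WebSocket connection opened');
};
// Called when an error occurs on the connection
wsconnection.onerror = function (error) {
  console.log('WebSocket Error ' + error);
};
// Called for every message received from the server
wsconnection.onmessage = function (e) {
  console.log('message received : ' + e.data);
};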

Send Data on websocket

A message string can be sent with send():

wsconnection.send('message string');

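A Blob or ArrayBuffer object can be used to send binary data.
Ex : Sending canvas ImageData as ArrayBuffer

var img = canvas_context.getImageData(0, 0, 400, 320);
// Copy the pixel data into a typed array
var binary = new Uint8Array(img.data.length);
for (var i = 0; i < img.data.length; i++) {
  binary[i] = img.data[i];
}
// Send the underlying ArrayBuffer
wsconnection.send(binary.buffer);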

Ex : sending file as Blob

var file = document.querySelector('input[type="file"]').files[0];
// A File is a Blob, so it can be passed to send() directly
wsconnection.send(file);

Closing the connection

// Close only if the connection is currently open
if (wsconnection.readyState === WebSocket.OPEN) {
  wsconnection.close();
}

Registry for Close codes for WS
1000 Normal Closure [IESG_HYBI] [RFC6455]
1001 Going Away [IESG_HYBI] [RFC6455]
1002 Protocol error [IESG_HYBI] [RFC6455]
1003 Unsupported Data [IESG_HYBI] [RFC6455]
1004 Reserved [IESG_HYBI] [RFC6455]
1005 No Status Rcvd [IESG_HYBI] [RFC6455]
1006 Abnormal Closure [IESG_HYBI] [RFC6455]
1007 Invalid frame payload data [IESG_HYBI] [RFC6455]
1008 Policy Violation [IESG_HYBI] [RFC6455]
1009 Message Too Big [IESG_HYBI] [RFC6455]
1010 Mandatory Ext. [IESG_HYBI] [RFC6455]
1011 Internal Error [IESG_HYBI] [RFC6455][RFC Errata 3227]
1012 Service Restart [Alexey_Melnikov]
1013 Try Again Later [Alexey_Melnikov]
1014 The server was acting as a gateway or proxy and received an invalid response from the upstream server. This is similar to 502 HTTP Status Code. [Alexey_Melnikov]
1015 TLS handshake [IESG_HYBI] [RFC6455]
1016-3999 Unassigned
4000-4999 Reserved for Private Use [RFC6455]
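The ranges in the registry can be checked in code. The sketch below, with the hypothetical helper describeCloseCode, follows the ranges given in RFC 6455 section 7.4 and the restriction browsers apply to the code argument of close().

```javascript
// Sketch: classify a WebSocket close code.
function describeCloseCode(code) {
  var range;
  if (code >= 1000 && code <= 2999) {
    range = 'protocol';     // defined by the WebSocket protocol and its extensions
  } else if (code >= 3000 && code <= 3999) {
    range = 'registered';   // libraries, frameworks and applications, registered with IANA
  } else if (code >= 4000 && code <= 4999) {
    range = 'private';      // private use, agreed between applications
  } else {
    range = 'invalid';
  }
  return {
    range: range,
    // 1005, 1006 and 1015 are only reported locally and never sent in a Close frame.
    reservedForLocalUse: code === 1005 || code === 1006 || code === 1015,
    // Browsers accept only 1000 or 3000-4999 as the code passed to close().
    allowedInBrowserClose: code === 1000 || (code >= 3000 && code <= 4999)
  };
}
```

For instance, 1006 (Abnormal Closure) shows up in the close event when the connection drops, but it cannot be passed to close() by the application.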

WebSocket Subprotocol Name Registry

  • MBWS
  • soap soap
  • wamp WAMP (“The WebSocket Application Messaging Protocol”)
  • v10.stomp Name: STOMP 1.0 specification
  • v11.stomp Name: STOMP 1.1 specification
  • v12.stomp Name: STOMP 1.2 specification
  • ocpp1.2 OCPP 1.2 open charge alliance
  • ocpp1.5 OCPP 1.5 open charge alliance
  • ocpp1.6 OCPP 1.6 open charge alliance
  • ocpp2.0 OCPP 2.0 open charge alliance
  • ocpp2.0.1 OCPP 2.0.1
  • rfb RFB [RFC6143]
  • sip WebSocket Transport for SIP (Session Initiation Protocol) [RFC7118]
  • OMA RESTful Network API for Notification Channel
  • wpcp Web Process Control Protocol (WPCP)
  • amqp Advanced Message Queuing Protocol (AMQP) 1.0+
  • mqtt mqtt [MQTT Version 5.0]
  • jsflow jsFlow pubsub/queue protocol
  • rwpcp Reverse Web Process Control Protocol (RWPCP)
  • xmpp WebSocket Transport for the Extensible Messaging and Presence Protocol (XMPP) [RFC7395]
  • ship SHIP – Smart Home IP. SHIP (Smart Home IP) is an IP-based approach to plug and play home automation and smart energy / energy efficiency, which can easily be extended to additional domains such as Ambient Assisted Living (AAL). SHIP can be used solely on the customer premises or can be integrated into a cloud-based solution.
  • mielecloudconnect Miele Cloud Connect Protocol This protocol is used to securely connect household or professional appliances to an internet service portal via a public communication network in order to enable remote services.
  • Push Channel Protocol
  • msrp WebSocket Transport for MSRP (Message Session Relay Protocol) [RFC7977]
  • TLCP (Text Lightstreamer Client Protocol)
  • bfcp WebSocket Transport for BFCP (Binary Floor Control Protocol)
  • Softvelum Low Delay Protocol SLDP is a low latency live streaming protocol for delivering media from servers to MSE-based browsers and WebSocket-enabled applications.
  • opcua+uacp OPC UA Connection Protocol
  • opcua+uajson OPC UA JSON Encoding
  • v1.swindon-lattice+json Swindon Web Server Protocol (JSON encoding)
  • v1.usp USP (Broadband Forum User Services Platform)
  • mles-websocket mles-websocket
  • coap Constrained Application Protocol (CoAP) [RFC8323]
  • TLCP (Text Lightstreamer Client Protocol)
  • sqlnet This protocol is used for communication between Oracle database client and database server, and its usage as a subprotocol of websocket is primarily geared towards cloud deployments. sqlnet supports bi-directional data transfer and is full duplex in nature.
  • oneM2M.R2.0.json oneM2M R2.0 JSON
  • oneM2M.R2.0.xml oneM2M R2.0 XML
  • oneM2M.R2.0.cbor oneM2M R2.0 CBOR
  • transit Transit
  • MPEG-DASH-ServerPush-23009-6-2017
  • MPEG-MMT-23008-1-2018
  • Softvelum WebSocket signaling protocol WebRTC live streaming requires WebSocket-based signaling protocol for every specific implementation. Softvelum products will use this subprotocol for signaling
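During the handshake the client lists the subprotocols it can speak in the Sec-WebSocket-Protocol header, in order of preference, and the server echoes back at most one of them. A minimal sketch of the server-side choice (selectSubprotocol is a hypothetical helper, not part of any library):

```javascript
// Sketch: pick the first subprotocol offered by the client that the server supports.
// offeredHeader is the raw Sec-WebSocket-Protocol request header, e.g. "soap, xmpp".
function selectSubprotocol(offeredHeader, supported) {
  if (!offeredHeader) {
    return null; // the client did not ask for any subprotocol
  }
  var offered = offeredHeader
    .split(',')
    .map(function (name) { return name.trim(); })
    .filter(function (name) { return name.length > 0; });
  for (var i = 0; i < offered.length; i++) {
    if (supported.indexOf(offered[i]) !== -1) {
      return offered[i];
    }
  }
  return null; // nothing in common
}
```

On the browser side, the subprotocol the server selected is then available as the protocol property of the WebSocket object.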

websocket libraries

C++: libwebsockets
Java: Jetty
Node.JS: ws
PHP: Ratchet, phpws

Ref :
RFC 6455 – The websocket protocol
Websocket Protocol Registries :
IANA websocket -