Video Codecs – H264 , H265 , AV1

Article discusses the popularly adopted current standards for video codecs( compression / decompression) namely MPEG2, H264, H265 and AV1


Compression algorithms differ from media containers since they involves compressing the information in raw stream to reduce the size for streaming applications while media files are containers which are just used for playback from a set location.

Examples of Codecs: H.261, H.263, VC-1, MPEG-1, MPEG-2, MPEG-4, AVS1, AVS2, AVS3, VP8, VP9, AV1, AVC/H.264, HEVC/H.265, VVC/H.266, EVC, LCEVC

Examples of containers inlude :MPEG-1 System Stream, MPEG-2 Program Stream, MPEG-2 Transport Stream, MP4, MOV, MKV, WebM, AVI, FLV, IVF, MXF, HEIC so on.

MPEG 2

MPEG-2 (a.k.a. H.222/H.262 as defined by the ITU)
Generic coding of moving pictures and associated audio information
Combination of lossy video compression and lossy audio data compression methods, which permit storage and transmission of movies using currently available storage media and transmission bandwidth.

MPEG2 is better than MPEG 1

Evolved out of the shortcomings of MPEG-1 such as audio compression system limited to two channels (stereo), No standardized support for interlaced video with poor compression , Only one standardized “profile” (Constrained Parameters Bitstream), which was unsuited for higher resolution video.

Application

  • over-the-air digital television broadcasting and in the DVD-Video standard.
  • TV stations, TV receivers, DVD players, and other equipment
  • MOD and TOD – recording formats for use in consumer digital file-based camcorders.
  • XDCAM – professional file-based video recording format.
  • DVB – Application-specific restrictions on MPEG-2 video in the DVB standard

MPEG -4

Video coding standards :-
MPEG-4 Part 2 Visual (ISO/IEC 14496-2) released in 1999 as MPEG-4 video codec
MPEG-4 Part 10 Advanced Video Coding (ISO/IEC 14496-10) released in 2003 as AVC/H.264 video codec;
MPEG-4 Part 14 (ISO/IEC 14496-14) MP4 file format is a media container. rather than a Codec ( compression algorithm).

H264

Introduced in 2004 as Advanced Video Coding (AVC)/H.264 or MPEG-4 AVC or ITU-T H.264/MPEG-4 Part 10 ‘Advanced Video Coding’ (AVC). It is a widely supported vendor agnostic solution.

MPEG-4 Part 10 AVC/H.264 is better than MPEG2

  • 40-50% bit rate reduction compared to MPEG-2
  • Resolution support 4K (4,096×2,304) and 59.94 fps
  • 21 profiles ; 17 levels

Compression Model

Video compression relies on predicting motion between frames. It works by comparing different parts of a video frame to find the ones that are redundant within the subsequent frames ie not changed such as background sections in video. These areas are replaced with a short information, referencing the original pixels(intraframe motion prediction) using mathematical function and direction of motion

Hybrid spatial-temporal prediction model
Flexible partition of Macro Block(MB), sub MB for motion estimation
Intra Prediction (extrapolate already decoded neighbouring pixels for prediction)
Introduced multi-view extension
9 directional modes for intra prediction
Macro Blocks structure with maximum size of 16×16
Entropy coding is CABAC(Context-adaptive binary arithmetic coding) and CAVLC(Context-adaptive variable-length coding )

Applications of H264

  • most deployed video compression standard
  • Delivers high definition video images over direct-broadcast satellite-based television services,
  • Digital storage media and Blu-Ray disc formats,
  • Terrestrial, Cable, Satellite and Internet Protocol television (IPTV)
  • Security and surveillance systems and DVB
  • Mobile video, media players, video chat

H265

High Efficiency Video Coding (HEVC), or H.265 or MPEG-H HEVC. Streams high-quality videos in congested network environments or bandwidth constrained mobile networks.

  • (+) 2 times the video compression with the same video quality as H264.
  • (-) higher processing power required

Introduced in Jan 2013 as product of collaboration between the ITU Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).

H265 is better than H264

Overcome shortage of bandwidth, spectrum, storage and performs bandwidth savings of approx. 45% over H.264 encoded content

  • Resolutions up to 8192×4320, including 8K UHD
  • Supports up to 300 fps
  • 3 approved profiles, draft for additional 5 ; 13 levels

Whereas macroblocks can span 4×4 to 16×16 block sizes, CTUs can process as many as 64×64 blocks, giving it the ability to compress information more efficiently.

Multiview encoding – stereoscopic video coding standard for video compression that allows for the efficient encoding of video sequences captured simultaneously from multiple camera angles in a single video stream. It also packs a large amount of inter-view statistical dependencies.

Compression Model

  1. Enhanced Hybrid spatial-temporal prediction model
  2. CTU ( coding tree units) supporting larger block structure (64×64) with more variable sub partition structures
  3. Motion Estimation – Intra prediction with more nodes, asymmetric partitions in Inter Prediction). Individual rectangular regions that divide the image are independent
  4. Paralleling processing computing – decoding process can be split up across multiple parallel process threads, taking advantage multi-core processors.
  5. Wavefront Parallel Processing (WPP)- sort of decision tree that grants a more productive and effectual compression.
  6. 33 directional nodes – DC intra prediction , planar prediction. , Adaptive Motion Vector Prediction
  7. Entropy coding is only CABAC

Applications of H265

  • cater to growing HD content for multi platform delivery
  • differentiated and premium 4K content

Reduced bitrate enables broadcasters and OTT vendors to bundle more channels / content on existing delivery mediums
also provide greater video quality experience at same bitrate

Using ffmpeg for H265 encoding

I took a h264 file (640×480) , duration 30 seconds of size 39,08,744 bytes (3.9 MB on disk) and converted using ffnpeg

After conversion it was a HEVC (Parameter Sets in Bitstream) , MPEG-4 movie – 621 KB only !!! without any loss of clarity.

> ffmpeg -i pivideo3.mp4 -c:v libx265 -crf 28 -c:a aac -b:a 128k output.mp4  
                                            
ffmpeg version 4.1.4 Copyright (c) 2000-2019 the FFmpeg developers   
built with Apple LLVM version 10.0.1 (clang-1001.0.46.4)   
configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1.4_2 --enable-shared --enable-pthreads --enable-version3 --enable-avresample --cc=clang --host-cflags='-I/Library/Java/JavaVirtualMachines/adoptopenjdk-12.0.1.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/adoptopenjdk-12.0.1.jdk/Contents/Home/include/darwin' --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libmp3lame --enable-libopus --enable-librubberband --enable-libsnappy --enable-libtesseract --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-videotoolbox --disable-libjack --disable-indev=jack --enable-libaom --enable-libsoxr   
libavutil      56. 22.100 / 56. 22.100   
libavcodec     58. 35.100 / 58. 35.100   
libavformat    58. 20.100 / 58. 20.100   
libavdevice    58.  5.100 / 58.  5.100   
libavfilter     7. 40.101 /  7. 40.101   
libavresample   4.  0.  0 /  4.  0.  0   
libswscale      5.  3.100 /  5.  3.100   
libswresample   3.  3.100 /  3.  3.100   
libpostproc    55.  3.100 / 55.  3.100 
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'pivideo3.mp4':   
...

If you get error like

Unknown encoder 'libx265'

then reinstall ffmpeg with h265 support

HEVC bitstream is an ordered sequence of the syntax elements. Each syntax element is placed into a logical packet called a NAL (network abstraction layer) Unit. There are 64 different NAL Unit types. They can be grouped into 10 classes:

  1. VPS – Video parameter set
  2. SPS – Sequence parameter set
  3. PPS – Picture parameter set
  4. Slice (different types)
  5. AUD – Access unit delimiter signals the start of video frame
  6. EOS – End of sequence
  7. EOB – End of bitstream
  8. FD – Filler data for bitrate smoothening
  9. SEI – Supplemental enhancement information such as picture timing, color space information, etc.
  10. Reserved and unspecified

AV1

Realtime High quality video encoder , product of the Alliance for Open Media (AOM). Contained by Matroska , WebM , ISOBMFF , RTP (WebRTC).

Av1 is better than H265

  • (+) AV1 is royalty free and overcomes the patent complexities around H265/HVEC

Applications

  • Video transmission over internet , voip , multi conference
  • Virtual / Augmented reality
  • self driving cars streaming
  • intended for use in HTML5 web video and WebRTC together with the Opus audio format

SVC ( scalable video encoding)

SVC standardises the encoding of a high-quality video bitstream that also contains one or more subset bitstreams. Asubset bitsteam can represent a lower spatial resolution (smaller screen), lower temporal resolution (lower frame rate), or lower quality video signal ( dropped packet) compared to the base stream it is derieved from.

  • Temporal (frame rate) scalability is enabled through structuring motion compensation dependencie so that complete pictures (i.e. their associated packets) can be dropped from the bitstream.
  • Spatial (picture size) scalability is enabled with video coded at multiple spatial resolutions
  • SNR/Quality/Fidelity scalability: video is coded at a single spatial resolution but at different qualities. The data and decoded samples of lower qualities can be used to predict data or samples of higher qualities in order to reduce the bit rate to code the higher qualities.
  • Combined scalability: a combination of the 3 scalability modalities described above.

Not all codecs sypport all modes. While the Av1 and VP9 support majority of modes defined in the table, VP8 only supports temporal scalability (e.g. “L1T2”, “L1T3”);H.264/SVC supports both temporal and spatial scalability but only permits transport of simulcast on distinct SSRCs.

Vp8

{
      "clockRate": 90000,
      "mimeType": "video/VP8",
      "scalabilityModes": [
        "L1T1",
        "L1T2",
        "L1T3"
      ]
    },
 const stream = await navigator.mediaDevices.getUserMedia(constraints);
    selfView.srcObject = stream;
    pc.addTransceiver(stream.getAudioTracks()[0], {direction: 'sendonly'});
    pc.addTransceiver(stream.getVideoTracks()[0], {
      direction: 'sendonly',
      sendEncodings: [
        {rid: 'q', scaleResolutionDownBy: 4.0, scalabilityMode: 'L1T3'}
        {rid: 'h', scaleResolutionDownBy: 2.0, scalabilityMode: 'L1T3'},
        {rid: 'f', scalabilityMode: 'L1T3'},
      ]    
    });


References :

Audio and Acoustic Signal Processing

Audio signals are electronic representations of sound waves—longitudinal waves which travel through air, consisting of compressions and rarefactions and Audio Signal Processing focuses on the computational methods for intentionally altering auditory signals or sounds, in order to achieve a particular goal.

Application of audio Signal processing in general

  • storage
  • data compression
  • music information retrieval
  • speech processing ( emotion recognition/sentiment analysis , NLP)
  • localization
  • acoustic detection
  • Transmission / Broadcasting – enhance their fidelity or optimize for bandwidth or latency.
  • noise cancellation
  • acoustic fingerprinting
  • sound recognition ( speaker Identification , biometric speech verification , voice commands )
  • synthesis – electronic generation of audio signals. Speech synthesisers can generate human like speech.
  • enhancement (e.g. equalization, filtering, level compression, echo and reverb removal or addition, etc.)

Effects for audio streams processing

  • delay or echo
    To simulate reverberation effect, one or several delayed signals are added to the original signal. To be perceived as echo, the delay has to be of order 35 milliseconds or above.
    Implemented using tape delays or bucket-brigade devices.
  • flanger
    delayed signal is added to the original signal with a continuously variable delay (usually smaller than 10 ms).
    signal would fall out-of-phase with its partner, producing a phasing comb filter effect and then speed up until it was back in phase with the master
  • phaser
    signal is split, a portion is filtered with a variable all-pass filter to produce a phase-shift, and then the unfiltered and filtered signals are mixed to produce a comb filter.
  • chorus
    delayed version of the signal is added to the original signal. above 5 ms to be audible. Often, the delayed signals will be slightly pitch shifted to more realistically convey the effect of multiple voices.
  • equalization
    frequency response is adjusted using audio filter(s) to produce desired spectral characteristics. Frequency ranges can be emphasized or attenuated using low-pass, high-pass, band-pass or band-stop filters.
    overdrive effects such as the use of a fuzz box can be used to produce distorted sounds, such as for imitating robotic voices or to simulate distorted radiotelephone traffic
  • pitch shift
    shifts a signal up or down in pitch. For example, a signal may be shifted an octave up or down. This is usually applied to the entire signal, and not to each note separately. Blending the original signal with shifted duplicate(s) can create harmonies from one voice.
  • time stretching
    changing the speed of an audio signal without affecting its pitch.
  • resonators
    emphasize harmonic frequency content on specified frequencies. These may be created from parametric EQs or from delay-based comb-filters.
  • modulation
    change the frequency or amplitude of a carrier signal in relation to a predefined signal.
  • compression
    reduction of the dynamic range of a sound to avoid unintentional fluctuation in the dynamics. Level compression is not to be confused with audio data compression, where the amount of data is reduced without affecting the amplitude of the sound it represents.
  • 3D audio effects
    place sounds outside the stereo basis
  • reverse echo
    swelling effect created by reversing an audio signal and recording echo and/or delay while the signal runs in reverse.
  • wave field synthesis
    spatial audio rendering technique for the creation of virtual acoustic environments

ASP application in Telephony and mobile phones, by ITU (International Telegraph Union)

  • Acoustic echo control
    aims to eliminate the acoustic feedback, which is particularly problematic in the speakerphone use-case during bidirectional voice
  • Noise control
    microphone doesn’t only pick up the desired speech signal, but often also unwanted background noise. Noise control tries to minimize those unwanted signals . Multi-microphone AASP, has enabled the suppression of directional interferers.
  • Gain control
    how loud a speech signal should be when leaving a telephony transmitter as well as when it is being played back at the receiver. Implemented either statically during the handset design stage or automatically/adaptively during operation in real-time.
  • Linear filtering
    ITU defines an acceptable timbre range for optimum speech intelligibility. AASP in the form of linear filtering can help the handset manufacturer to meet these requirements.
  • Speech coding: from analog POTS based call to G.711 narrowband (approximately 300 Hz to 3.4 kHz) speech coder is a big leap in terms of call capacity. other speech coders with varying tradeoffs between compression ratio, speech quality, and computational complexity have been also made available. AASP provides higher quality wideband speech (approximately 150 Hz to 7 kHz).

ASP applications in music playback

AASP is used to provide audio post-processing and audio decoding capabilities for mobile media consumption needs, such as listening to music, watching videos, and gaming

  • Post-processing
    techniques as equalization and filtering allow the user to adjust the timbre of the audio such as bass boost and parametric equalization. Other techniques like adding reverberation, pitch shift, time stretching etc
  • Audio (de)coding: audio contianers like mp3 and AAC define how music is distributed, stored, and consumed also in Online music streaming services

ASP for virtual assitants

Virtual Assistance include a variety of servies from Apple’s Siri, Microsoft’s Cortana , Google’s Now , Alexa etc. ASP is used in

  • Speech enhancement
    multi-microphone speech pickup using beamforming and noise suppression to isolate the desired speech prior to forwarding it to the speech recognition engine.
  • Speech recognition (speech-to-text): this draws ideas from multiple disciplinary fields including linguistics, computer science, and AASP. Ongoing work in acoustic modeling is a major contribution to recognition accuracy improvement in speech recognition by AASP.
  • Speech synthesis (text-to-speech): this technology has come a very long way from its very robotic sounding introduction in the 1930s to making synthesized speech sound more and more natural.

Other areas of ASP

  • Virtual reality (VR) like VR headset / gaming simulators use three-dimensional soundfield acquisition and representation like Ambisonics (also known as B-format).

Ref :
wikipedia – https://en.wikipedia.org/wiki/Audio_signal_processing
IEEE – https://signalprocessingsociety.org/publications-resources/blog/audio-and-acoustic-signal-processing%E2%80%99s-major-impact-smartphones

JavaScript Session Establishment Protocol (JSEP) in WebRTC handshake


This article is aimed at explaining the intricacies and detailed offer answer flow in webrtc handshake and JSEP. You can read the following articles on WebRTC as a prereq before reading through this one. WebRTC has API s namely – Peerconnection , getUserMedia , Datachannel and getStats.

JSEP (JavaScript Session Establishment Protocol)

JSEP is used during signalling via w3c’s recommended RTCPeerConnectionAPI interface to set up a multimedia session. The multimedia session description specifies the critical components of setting up a session between local and remote such as transport ports, protocol, profiles. It also handles the interaction with the ICE state machine.

Offer/Answer Excahange Flow

prereq : Setup Client side for the caller
PeerConnectionFactory to generate PeerConnections
PeerConnection for every connection to remote peer
MediaStream audio and video from client device

  1. Side initiating the session creates a offer by CreateOffer() API
aPromise = myPeerConnection.createOffer([options]);

options is type of RTC Offer Options

  • iceRestart
  • offerToReceiveAudio ( legacy)
  • offerToReceiveVideo ( legacy)
  • voiceActivityDetection

2. The application then stores the offer in local config as setLocalDescriptionAPI()

 myPeerConnection.createOffer().then(function(offer) {
    return myPeerConnection.setLocalDescription(offer);
})

3. Offer is sent to remote side using its choice of signalling ( SIP , WS , HTTP, XMPP .. )

4. Remote party stores it use setRemoteDescription() API

myPeerConnection.setRemoteDescription(sdp)
.then(function () {
  return createMyStream();
})

4. Remote part generates an answer using createAnswer() API

aPromise = RTCPeerConnection.createAnswer([options]);

5. Remote party stores the answer in its local config using setLocalDescription() API

6. Answer is transferred to Initiator side using choice of signalling ( SIP , WS , HTTP, XMPP .. ) again

7. Initiating side stores it use setRemoteDescription() API

Interfaces of webrtc and tracks to stream addition

Perform webRTC handshake

Webrtc call setup and incoming call callflow between remote peer , peerconnection factory , peerconnection and application

Outgoing Call: setting up a call with remote by sending an offer. Wait for the remote’s answer to process it to create the session.

JSEP setup a call

Incoming Call : Receive remote’s offer and process to reply with an answer.

JSEP flow to receive a call

Signalling state Transitions on PeerConnection

As the caller initiates a new RTCPeerConnection() , the RTCSignalingState state is “stable” as remote and local descriptions are empty

As the caller initiates call and calls createOffer() , he now has offer SDP and procced to store offer locally with setLocalDescription(offer) the RTCSignalingState state is “have-local-offer” . After than caller send the offer to callee over signalling channel

Simillarily as the calle recives the offer, it starts with RTCSignalingState stable and then proceeds to store the Remote’s offer using setRemoteDescription(offer), its state is now “have-remote-offer”

The callee generates a provsional answer and for caller and stores it locally , state transitiosn to “have-local-pranswer“. The pranswer SDP is send to caller over signalling channel again .

Caller stores the callee’s pr answer SDP and state updates to “have-remote-pranswer”

img : https://w3c.github.io/webrtc-pc

Once there is no offer/answer exchange in progress the state again changes to ” stable “.

State schanges to “closed” if RTCpeerConnection is closed

Detailed Offer / Answer SDP

Local Offer created by side initiating the session / Caller

The first offfer called initial offer can have dummy date for contact line such as 0.0.0.0 to prevent leaking a local Ip address

c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0

“o=” line contains <username> <sess-id> <sess-version> <nettype> <addrtype> <unicast-address>

o=- 4445251981417004127 2 IN IP4 127.0.0.

shows username – and 4445251981417004127 as session id. Same username “-” is specified in “s=” line

“t=” line shows <start time> <stop time>

t=0 0

Full session Block example

type: offer, sdp: v=0
o=- 4445251981417004127 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0 1 2
a=msid-semantic: WMS DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2

Media Section : An m= section is generated for each RtpTransceiver that has been added to the PeerConnection. For the initial offer since no ports are available yet , dummy port 9 can be sadded. However if it is bundle only then port value is set to 0. Later the port value will be set to the port value of default ICE candidate.

DTLS filed “UDP/TLS/RTP/SAVPF” is followed by the list of codecs in order of priority.

“c=” line in msection too must be filled with dummy values if IP 0.0.0.0 as no candidates are available yet .

ICE

a=ice-options:trickle

Transport

“a=ice-ufrag” , “a=ice-pwd” , “a=fingerprint” , “a=setup” , “a=tls-id”

Media Stream Identification attribute “a-mid:”

For each media format on the m= line, “a=rtpmap” for “rtx” with the clock rate of codec and “a=fmtp” to reference the payload type of the primary codec.  “a=rtcp-fb” specified RTCP feedback

a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1

Audio Block exmaple

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:JDMg
a=ice-pwd:6OARDQ8U/orhtXZbfN+ars37
a=ice-options:trickle
a=fingerprint:sha-256 1D:C8:1F:18:D2:AB:B7:68:CC:DC:A8:8D:6B:1D:70:11:06:E9:19:D2:22:CE:A5:F3:BE:82:00:ED:99:58:20:4A
a=setup:actpass
a=mid:0
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid
a=extmap:5 urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id
a=extmap:6 urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id
a=sendrecv
a=msid:DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2 7525d75c-ffe7-4038-8b71-653d249e63bb
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:106 CN/32000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:110 telephone-event/48000
a=rtpmap:112 telephone-event/32000
a=rtpmap:113 telephone-event/16000
a=rtpmap:126 telephone-event/8000
a=ssrc:3968544080 cname:da0nYe1oYR8AvVNp
a=ssrc:3968544080 msid:DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2 7525d75c-ffe7-4038-8b71-653d249e63bb
a=ssrc:3968544080 mslabel:DYVK4IA4kA8LvnIYWjXhRzMgSGicnwVutWE2
a=ssrc:3968544080 label:7525d75c-ffe7-4038-8b71-653d249e63bb

// remove video section for simplicity

Data Block is created if data channle has been created with m= section for data.

“a=sctp-port” line referencing the SCTP port number set to 5000

 “a=max-message-size”  set to 262144 here

Data Block example

m=application 9 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4 0.0.0.0
a=ice-ufrag:JDMg
a=ice-pwd:6OARDQ8U/orhtXZbfN+ars37
a=ice-options:trickle
a=fingerprint:sha-256 1D:C8:1F:18:D2:AB:B7:68:CC:DC:A8:8D:6B:1D:70:11:06:E9:19:D2:22:CE:A5:F3:BE:82:00:ED:99:58:20:4A
a=setup:actpass
a=mid:2
a=sctp-port:5000
a=max-message-size:262144

Subsequent Offers

When createOffer is called a second (or later) time, or is called after a local description has already been installed, the processig is different due to gathered ICE candidates . However the <session-version> is not changed .

Additionally m section is updated if RtpTransceiver is added or removed

Each “m=” and c=” line MUST be filled in with the port, relevant RTP profile, and address of the default candidate for the m= section

If the m= section is not bundled into another m= section, update the “a=rtcp” with port and address of RTCP camdidate and add “a=camdidate” with  “a=end-of-candidates” 

Local Answer created by side receiving the session/ Callee

When createAnswer is called for the first time after a remote description has been provided, the result is known as the initial answer. 

Each offered m= section will have an associated RtpTransceiver

Remote Destination / Callee can reject the m section by setting port in m line to 0 . It can reject msection if neither of the offered media format are supported , RtpTransceiver is stoopped etc.

For the initial offer the dummy port value of 9 is set as no ICE candudate is avaible yet. Simillarly  “c=” line must contain the “dummy” value “IN IP4 0.0.0.0” too.

The <proto> field MUST be set to exactly match the <proto> field for the corresponding m= line in the offer.

type: answer, sdp: v=0
o=- 5730481682283561642 3 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0 1 2
a=msid-semantic: WMS KGmQ9mTmvTaWlHTQ0B0YP36QIxOYNeB3i2nT

Audio section

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:MgKS
a=ice-pwd:X3oTkKO/v7GVgd/CDC3e9B7c
a=ice-options:trickle
a=fingerprint:sha-256 B9:9C:8A:A9:E9:09:0C:FB:52:2A:D3:18:7B:A9:D4:EC:B3:00:77:72:27:51:EC:5F:82:BE:11:7F:C7:CF:43:43
a=setup:active
a=mid:0
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid
a=extmap:5 urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id
a=extmap:6 urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id
a=sendrecv
a=msid:KGmQ9mTmvTaWlHTQ0B0YP36QIxOYNeB3i2nT e817fe0f-1cc0-4901-9fd9-e810289cc85d
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:106 CN/32000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:110 telephone-event/48000
a=rtpmap:112 telephone-event/32000
a=rtpmap:113 telephone-event/16000
a=rtpmap:126 telephone-event/8000
a=ssrc:3260997313 cname:FxLUKuXrLQe0r1rn

Video section removed for simplicity

Data stream

m=application 9 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4 0.0.0.0
b=AS:30
a=ice-ufrag:MgKS
a=ice-pwd:X3oTkKO/v7GVgd/CDC3e9B7c
a=ice-options:trickle
a=fingerprint:sha-256 B9:9C:8A:A9:E9:09:0C:FB:52:2A:D3:18:7B:A9:D4:EC:B3:00:77:72:27:51:EC:5F:82:BE:11:7F:C7:CF:43:43
a=setup:active
a=mid:2
a=sctp-port:5000
a=max-message-size:262144

Subsequent Answers

 Port value would normally be set to the port of the default ICE candidate for this m= section. For the exmaple above

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126

will be changes with relevant port adress such as

type: offer, sdp: v=0
o=- 6407282338169184323 3 IN IP4 54.190.54.190
s=-
t=0 0
a=group:BUNDLE 0 1 2
a=msid-semantic: WMS bSrCUCFybGovIy0FUhPTZAr9ToRmx8I09nEj
m=audio 55375 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4 54.190.54.190
a=rtcp:9 IN IP4 0.0.0.0
a=candidate:2880323124 1 udp 2122260223 54.190.54.190 55375 typ host generation 0 network-id 1 network-c

Simillarly m video and data line will also get ports

m=video 53877 UDP/TLS/RTP/SAVPF 96 97 98 99 100 101 102 122 127 121 125 107 108 109 124 120 123 119 114 115 116
c=IN IP4 54.190.54.190
a=rtcp:9 IN IP4 0.0.0.0
a=candidate:2880323124 1 udp 2122260223 54.190.54.190 53877 typ host generation 0 network-id 1 network-cost 10
..
m=application 57991 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4 54.190.54.190
a=candidate:2880323124 1 udp 2122260223 54.190.54.190 57991 typ host generation 0 network-id 1 network-cost 10

If the answer contains any “a=ice-options” attributes where “trickle” is listed as an attribute, update the PeerConnection canTrickle property to be true. 

Modifying Offer/answer SDP

SDP returned from createOffer or createAnswer MUST NOT be changed before passing it to setLocalDescription. After calling setLocalDescription with an offer or answer, the application MAY modify the SDP to reduce its capabilities before sending it to the far side.

Assume we have a MCU at location and want the video stream to relay via a Media Server.

SDP Parsing

SDP is used for session parsing and contians sequence of line with key value pairs. SDP is read, line-by-line, and converted to a data structure that contains the deserialized information.

JSEP SDP bears a lot of simillarity to SIP SDP explained here : SIP and SDP Messages Explained

Session-Level Parsing

  • Line “v=” , “o=”,”b=” and “a=” are processed . The “i=”, “u=”, “e=”, “p=”, “t=”, “r=”, “z=”, and “k=” lines are not used by this specification; they MUST be checked for syntax but their values are not used. Line “c=” is checked for syntax and ICE mismatch detection
  • “a= ” attribute could be : “a=group” , “s=”ice-lite” , “a=ice-pwd”, “a=ice-options” , “a=fingerprint”, “a=setup” , a=tls-id”, “a=identity” , “a=extmap”

Media Section Parsing

Line “m=” for media , proto , port , fmt in RTP

Attributes “a=” can be :

  • “a=rtpmap” or “a=fmtp” : map from an RTP payload type number to a media encoding name that identifies the payload format.
a=rtpmap:<payload type> <encoding name>/<clock rate> [/<encoding parameters>]
m=audio 49230 RTP/AVP 96 97 98
a=rtpmap:96 L8/8000
a=rtpmap:97 L16/8000
a=rtpmap:98 L16/11025/2
  • Packetization parameters as “a=ptime” , “a=maxptime” which define the length of each RTP packet.
  • Direction as  “a=sendrecv” , a=recvonly , a=sendonly , a=inactive
  • Muxing as “a=rtcp-mux” , “a=rtcp-mux-only”
  • RTCP attributes “a=rtcp” , “a=rtcp-rsize”
  • Line “c=” is checked.
  • Line “b=” for bandiwtdh , bwtype
  • Attribites for “a=” could be “a=ice-ufrag”, “a=”ice-pwd”, “a=ice-options” , “a=candidate”, “a=remote-candidate” , a=end-of-candidates” and “a=fingerprint”

Interactive Connectivity Establishment (ICE) for NAT traversal

Protocols using offer/answer are difficult to operate through Network Address Translators (NATs) since flow of media packets require IP addresses and ports of media sources and sinks within their messages. Also realtime media emphasises on reduced latency and decreased packet loss .

An extension to the offer/answer model, and works by including a multiplicity of IP addresses and ports in SDP offers and answers, which are then tested for connectivity by peer-to-peer connectivity checks.
Checks done by STUN and TURN, also allows for address selection for multi-homed and dual-stack hosts

ICE allows the agents to discover enough information about their topologies to potentially find one or more paths by which they can communicate. Then it systematically tries all possible pairs (in a carefully sorted order) until it finds one or more that work.

ICE Gathering

Caller and callee performs checks to finalize the protocol and routing needed to establish a peer connection . Number of candudates are proposed till they mutually agree upon one . Peerconnection then uses that candiadte detaisl to initiate the connection .

While Applying a Local Description at the media engine level if m= section is new, WebRTC media stacks begins gathering candidates for it.

RTCPeerconnection specified canTrickleIceCandidates. ICE trickling is the process of continuing to send candidates after the initial offer or answer has already been sent to the other peer.

ICE TransportRole is responsible for Choosing a candidate pair.

ICE layer sets one peer as controlling and other as controlled agent. The controling agent makes the final decision as to which candidate pair to choose.

Final selected canduadte in SDP

a=group:BUNDLE 0 1 2
a=msid-semantic: WMS 9Cv3eIelHVuhxrGfxSvUsfokNu4eb4R9PYw2

m=audio 59937 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4 x.x.x.x
a=rtcp:9 IN IP4 0.0.0.0
a=candidate:2880323124 1 udp 2122260223 x.x.x.x 59937 typ host generation 0 network-id 1 network-cost 10
a=candidate:3844981444 1 tcp 1518280447 x.x.x.x 9 typ host tcptype active generation 0 network-id 1 network-cost 10

An agent identifies all CANDIDATE whic is a transport address. Types:

  • HOST CANDIDATE – directly from a local interface which could be Wifi, Virtual Private Network (VPN) or Mobile IP (MIP)
    if an agent is multihomed ( private and public networks) , it obtains a candidate from each IP address and includes all candidates in its offer.
  • STUN or TURN to obtain additional candidates. Types
    • translated addresses on the public side of a NAT (SERVER REFLEXIVE CANDIDATES)
    • addresses on TURN servers (RELAYED CANDIDATES)

Mapping Server Reflexive address

Steps for mappling Server Reflexive Address

  1. Agent sends the TURN Allocate request from IP address and port X:x,
  2. NAT will create a binding X1′:x1′, mapping this server reflexive candidate to the host candidate X:x ( BASE).
  3. Outgoing packets sent from the host candidate will be translated by the NAT to the server reflexive candidate.
  4. Incoming packets sent to the server reflexive candidate will be translated by the NAT to the host candidate and forwarded to the agent.

Allocate Request and response fom TURN – Informing the agent of this relayed candidate

Only STUN based Binding

agent sends a STUN Binding request to its STUN server which will get server reflexive candidate and send back Binding response.

STUN Binding request for connectivity checks on CANDIDATE PAIRS

The candidates are carried in attributes in the SDP offer . The remote peer also follows this process and gather and send lits own sorted list of candidates. Hence CANDIDATE PAIRS from both sides are formed.

PEER REFLEXIVE CANDIDATES – connectivity checks can produce aditional candidates espceialy around symmetric NAT

Since the same address is used for STUN. and media ( RTP/RTCP) Demultiplexing based on packet contents helps to identify which one is which.

Checks : ICE checks are performed in a specific sequence, so that high-priority candidate pairs are checked first.

  • TRIGGERED CHECKS – accelerates the process of finding a valid candidate
  • ORDINARY CHECKS – agent works through ordered prioritised check list by sending a STUN request for the next candidate pair on the list periodically.

Checks ensure maintaining frozen candidates and pairs with some foundation for media stream. Each candidate pair in the check list has a foundation and a state. States for candidates pairs

1.Waiting: A check has not been performed for this pair, and can be performed as soon as it is the highest-priority Waiting pair onthe check list.

2. In-Progress: A check has been sent for this pair, but the transaction is in progress.

3. Succeeded: A check for this pair was already done and produced a successful result.

4. Failed: A check for this pair was already done and failed, either never producing any response or producing an unrecoverable failure response.

5. Frozen: A check for this pair hasn’t been performed, and it can’t yet be performed until some other check succeeds, allowing this pair to unfreeze and move into the Waiting state.

ICE gather state

icegatheringstatechange – gathering

icecandidate (host)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:1511920713 1 udp 2122260223 192.168.0.2 58122 typ host generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (srflx)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:4081163164 1 udp 1686052607 106.51.26.168 37542 typ srflx raddr 192.168.0.2 rport 58122 generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (host)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:345893049 1 tcp 1518280447 192.168.0.2 9 typ host tcptype active generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (relay)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:2130406062 1 udp 41886207 74.125.39.44 27190 typ relay raddr 106.51.26.168 rport 37542 generation 0 ufrag vzpn network-id 1 network-cost 10

icecandidate (relay)
sdpMid: 0, sdpMLineIndex: 0, candidate: candidate:3052096874 1 udp 25108479 172.217.163.158 28049 typ relay raddr 106.51.26.168 rport 37543 generation 0 ufrag vzpn network-id 1 network-cost 10

icegatheringstatechange – complete

Candidate Checking

iceconnectionstatechange : checking

setRemoteDescription L type: answer, sdp: v=0

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 110 112 113 126
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:ydvf
a=ice-pwd:mb4ousBoT6B0l//ljjD/9Z/M
a=ice-options:trickle


m=video 9 UDP/TLS/RTP/SAVPF 98 100 96 97 99 101 102 122 127 121 125 107 108 109 124 120 123 119 114 115 116
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:ydvf
a=ice-pwd:mb4ousBoT6B0l//ljjD/9Z/M
a=ice-options:trickle

addIceCandidate (host)
sdpMid: , sdpMLineIndex: 0, candidate: candidate:1511920713 1 udp 2122260223 192.168.0.2 56060 typ host generation 0 ufrag ydvf network-id 1 network-cost 10

iceconnectionstatechange : connected

Candidate Nomination for Media Path

Selecting low-latency media paths can use various techniques such as actual round-trip time (RTT) measurement. Controlling agent gets to nominate which candidate pairs will get used for media amongst the ones that are valid. There are 2 ways : regular nomination and aggressive nomination.

ReadMore :

WebRTC Media Stack

WebRTC service’s

References :

  • [1] WebRTC 1.0: Real-time Communication Between Browsers – W3C Editor’s Draft 31 August 2019 http://w3c.github.io/webrtc-pc/
  • [2] RFC 5245 Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols

Websockets as VOIP signal transport medium

Web resources are usually build on request/response paradigm such as HTTP , SIP messages . This means that server responds only when a client requests it to. This made web intercations very slow and unsuited for VOIP signalling
Long Poll involved repeated polling checks to load new server resources by itself instead of client made explicit request
AJAX and multipart XHR tried to patch the problem by selective reloading however they still required that client perform the mapping for an incomig reply to map to correct request.
However due to overhead latency involved with HTTP transaction and its working mode to open new TCP connetion for every request and reponse and add HTTP headers, none of them were suited to realtime operations

Websocket is the current (2017) most idelistic solution to perform realtime sigalling suited to VOIP requirnments due to its nature os establish a socket .

Websocket Protocol

Enables two-way communication between a client running untrusted code in a controlled environment to a remote host that has opted-in to communications from that code.

protocol consists of an opening handshake followed by basic message framing, layered over TCP.
handshake is interpreted by HTTP servers as an Upgrade request.

Secure websocket example :

Request URL: wss://site.com:8084/socket.io/?transport=websocket&sid=hh3Dib_aBWgqyO1IAAEL
Request Method: GET
Status Code: 101 Switching Protocols

Response Headers
Connection: Upgrade
Sec-WebSocket-Accept: UVhTdFOWfywGyQTKDRZyGuhkfls=
Sec-WebSocket-Extensions: permessage-deflate
Upgrade: websocket

Request Headers
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cache-Control: no-cache
Connection: Upgrade
Host: site.com:8085
Origin: https://site.com:8084
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: 06FNaHge8GLGVuPFxV2fAQ==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36

Query String parameters
transport: websocket
sid: hh3Dib_aBWgqyO1IAAEL

Working with websockets

A new websocket can be opned with ws or wss and it can have sub protocols like in example .

var wsconnection = new WebSocket('wss://voipsever.com', ['soap', 'xmpp']);

It can be attached with event handlers

wsconnection.onopen = function () {
  ...
};
wsconnection.onerror = function (error) {
  console.log('WebSocket Error ' + error);
};
wsconnection.onmessage = function (e) {
  console.log('message received : ' + e.data);
};

Send Data on websocket

message string

wsconnection.send('Hi);

Blob or ArrayBuffer object to send binary data
Ex : Sending canvas ImageData as ArrayBuffer

var img = canvas_context.getImageData(0, 0, 400, 320);
var binary = new Uint8Array(img.data.length);
for (var i = 0; i < img.data.length; i++) {
  binary[i] = img.data[i];
}
wsconnection.send(binary.buffer);

Ex : sending file as Blob

var file = document.querySelector('input[type="file"]').files[0];
wsconnection.send(file);

Closing the connection

if (socket.readyState === WebSocket.OPEN) {
    socket.close();
}

Registry for Close codes for WS
1000 Normal Closure [IESG_HYBI] [RFC6455]
1001 Going Away [IESG_HYBI] [RFC6455]
1002 Protocol error [IESG_HYBI] [RFC6455]
1003 Unsupported Data [IESG_HYBI] [RFC6455]
1004 Reserved [IESG_HYBI] [RFC6455]
1005 No Status Rcvd [IESG_HYBI] [RFC6455]
1006 Abnormal Closure [IESG_HYBI] [RFC6455]
1007 Invalid frame payload data [IESG_HYBI] [RFC6455]
1008 Policy Violation [IESG_HYBI] [RFC6455]
1009 Message Too Big [IESG_HYBI] [RFC6455]
1010 Mandatory Ext. [IESG_HYBI] [RFC6455]
1011 Internal Error [IESG_HYBI] [RFC6455][RFC Errata 3227]
1012 Service Restart [Alexey_Melnikov] [http://www.ietf.org/mail-archive/web/hybi/current/msg09670.html]
1013 Try Again Later [Alexey_Melnikov] [http://www.ietf.org/mail-archive/web/hybi/current/msg09670.html]
1014 The server was acting as a gateway or proxy and received an invalid response from the upstream server. This is similar to 502 HTTP Status Code. [Alexey_Melnikov] [https://www.ietf.org/mail-archive/web/hybi/current/msg10748.html]
1015 TLS handshake [IESG_HYBI] [RFC6455]
1016-3999 Unassigned
4000-4999 Reserved for Private Use [RFC6455]

WebSocket Subprotocol Name Registry

  • MBWS.huawei.com MBWS
  • MBLWS.huawei.com MBLWS
  • soap soap
  • wamp WAMP (“The WebSocket Application Messaging Protocol”)
  • v10.stomp Name: STOMP 1.0 specification
  • v11.stomp Name: STOMP 1.1 specification
  • v12.stomp Name: STOMP 1.2 specification
  • ocpp1.2 OCPP 1.2 open charge alliance
  • ocpp1.5 OCPP 1.5 open charge alliance
  • ocpp1.6 OCPP 1.6 open charge alliance
  • ocpp2.0 OCPP 2.0 open charge alliance
  • ocpp2.0.1 OCPP 2.0.1
  • rfb RFB [RFC6143]
  • sip WebSocket Transport for SIP (Session Initiation Protocol) [RFC7118]
  • notificationchannel-netapi-rest.openmobilealliance.org OMA RESTful Network API for Notification Channel
  • wpcp Web Process Control Protocol (WPCP)
  • amqp Advanced Message Queuing Protocol (AMQP) 1.0+
  • mqtt mqtt [MQTT Version 5.0]
  • jsflow jsFlow pubsub/queue protocol
  • rwpcp Reverse Web Process Control Protocol (RWPCP)
  • xmpp WebSocket Transport for the Extensible Messaging and Presence Protocol (XMPP) [RFC7395]
  • ship SHIP – Smart Home IP SHIP (Smart Home IP) is a an IP based approach to plug and play home automation and smart energy / energy efficiency, which can easily be extended to additional domains such as Ambient Assisted Living (AAL). SHIP can be used solely on the customer premises or can be integrated into a cloud based solution.
  • mielecloudconnect Miele Cloud Connect Protocol This protocol is used to securely connect household or professional appliances to an internet service portal via a public communication network in order to enable remote services.
  • v10.pcp.sap.com Push Channel Protocol
  • msrp WebSocket Transport for MSRP (Message Session Relay Protocol) [RFC7977]
  • v1.saltyrtc.org
  • TLCP-2.0.0.lightstreamer.com TLCP (Text Lightstreamer Client Protocol)
  • bfcp WebSocket Transport for BFCP (Binary Floor Control Protocol)
  • sldp.softvelum.com Softvelum Low Delay Protocol SLDP is a low latency live streaming protocol for delivering media from servers to MSE-based browsers and WebSocket-enabled applications.
  • opcua+uacp OPC UA Connection Protocol
  • opcua+uajson OPC UA JSON Encoding
  • v1.swindon-lattice+json Swindon Web Server Protocol (JSON encoding)
  • v1.usp USP (Broadband Forum User Services Platform)
  • mles-websocket mles-websocket
  • coap Constrained Application Protocol (CoAP) [RFC8323]
  • TLCP-2.1.0.lightstreamer.com TLCP (Text Lightstreamer Client Protocol)
  • sqlnet.oracle.com sqlnet This protocol is used for communication between Oracle database client and database server, and its usage as subprotocol of websocket is primarly geared towards cloud deployments. sqlnet supports bi-directional data transfer and is full duplex in nature.
  • oneM2M.R2.0.json oneM2M R2.0 JSON
  • oneM2M.R2.0.xml oneM2M R2.0 XML
  • oneM2M.R2.0.cbor oneM2M R2.0 CBOR
  • transit Transit
  • 2016.serverpush.dash.mpeg.org MPEG-DASH-ServerPush-23009-6-2017
  • 2018.mmt.mpeg.org MPEG-MMT-23008-1-2018
  • CLUE CLUE
  • webrtc.softvelum.com Softvelum WebSocket signaling protocol WebRTC live streaming requires WebSocket-based signaling protocol for every specific implementation. Softvelum products will use this subprotocol for signaling

websocket libraries

C++: libwebsockets
Erlang: Shirasu.ws
Java: Jetty
Node.JS: ws
Ruby:
em-websocket
EventMachine
Faye
Python:
Tornado,
pywebsocket
PHP: Ratchet, phpws
Javascript:
Socket.io
ws
WebSocket-Node
GoLang:
Gorilla
C#:
Fleck

Ref :
RFC 6455 – The websocket protocol
Websocket Protocol Registeries : http://www.iana.org/assignments/websocket/websocket.xml
https://www.html5rocks.com/en/tutorials/websockets/basics/
IANA websocket -https://www.iana.org/assignments/websocket/websocket.xhtml