SIP Conferencing and Media Bridges

SIP is the most popular signalling protocol in the VoIP ecosystem. It is best suited to a caller-callee scenario, yet supporting scalable conferences over VoIP is a market demand. SIP must therefore not only negotiate multimedia streams but also provide conference control for building communication and collaboration apps and new, customisable solutions.

SIP Recap

SIP is an IETF-defined signalling protocol for controlling communication sessions over IP. Apart from VoIP, it is used in other multimedia technologies such as online games, video conferencing and instant messaging. It is an application-layer protocol that runs over TCP, UDP and SCTP. SIP is modelled on the Web protocol HTTP and is a request/response protocol.

SIPv1: SIPv1 was text-based. It used the Session Description Protocol (SDP) to describe sessions and UDP as the transport protocol. SIPv1 only handled session establishment and did not handle mid-conference controls.

SIPv2: The Simple Conference Invitation Protocol (SCIP) utilised the Transmission Control Protocol (TCP) as the transport protocol to manage conferences. It was based on HTTP and used e-mail addresses as identifiers for users. SIPv2 was also text-based, again modelled on HTTP, and could use both UDP and TCP as transport protocols. The combination of SIPv1 and SCIP resulted in the Session Initiation Protocol.

SIP is used to distribute session descriptions among potential participants. Once the session description is distributed, SIP can be used to negotiate and modify the parameters of the session and to terminate it. The role of SIP in a conference involves:

  • initiating conferences
  • inviting participants
  • enabling participants to join a conference
  • leaving a conference
  • terminating a conference
  • expelling participants
  • configuring media flow
  • controlling activities in a conference
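As an illustration of the signalling involved, the sketch below assembles a minimal SIP INVITE a participant might send to join a conference URI. All hosts, tags and branch values are made-up placeholders; a real UA generates these per RFC 3261 and carries an SDP offer in the body.

```python
def build_invite(conference_uri, from_uri, call_id, cseq=1):
    """Build a minimal SIP INVITE for joining a conference (illustrative only)."""
    # Hypothetical Via branch, tags and Contact; a real stack generates these.
    lines = [
        f"INVITE {conference_uri} SIP/2.0",
        "Via: SIP/2.0/UDP client.example.com;branch=z9hG4bK776asdhds",
        f"From: <{from_uri}>;tag=1928301774",
        f"To: <{conference_uri}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} INVITE",
        "Contact: <sip:alice@client.example.com>",
        "Content-Type: application/sdp",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"

msg = build_invite("sip:conf123@bridge.example.com", "sip:alice@example.com", "a84b4c76e66710")
print(msg.splitlines()[0])  # INVITE sip:conf123@bridge.example.com SIP/2.0
```

Leaving the conference would be a BYE within the same dialog, mirroring the list above.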

Mesh vs star topology

A mesh has p2p streaming and thus maximum data privacy and low cost for the service provider, because there are no media streams for the provider to take care of. In fact it comes out of the box with WebRTC PeerConnections. But of course a p2p mesh-based architecture cannot scale. Although the communication provider is indifferent to the media stream traffic, the call quality of the session depends entirely on the end clients' processing and bandwidth, which in my experience cannot accommodate more than 20-25 participants in a call, even with an above-average bandwidth of 30-40 Mbps on both uplink and downlink.
On the other hand, in a star topology the participants only need to communicate with the media server, irrespective of the network conditions of the receivers.
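The scaling difference is easy to quantify: a full mesh needs a connection per pair of participants, while a star needs only one connection per participant. A quick sketch:

```python
def mesh_links(n):
    # Full mesh: every pair of participants keeps a direct PeerConnection.
    return n * (n - 1) // 2

def star_links(n):
    # Star: each participant keeps a single connection to the media server.
    return n

for n in (3, 10, 25):
    print(n, mesh_links(n), star_links(n))
```

At 25 participants the mesh already requires 300 connections versus 25 for the star.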

Centralised (star) structure

In a centralised (star) signalling model, all communication flows via a central control point.

Applications of the star topology are the MCU and the SFU.

Centralised Media / MCU


A Multipoint Control Unit (MCU) uses a mixer, as found in traditional video conferencing bridges: incoming streams are decoded, composed into a single stream and sent back to each participant.

  • (+) proven interworking with legacy systems
  • (+) single point to manage transcoding
  • (+) energy-efficient mode of operation, keeping client-side stream management low
  • (+) single point for DTMF inband/signalling processing
  • (-) CPU and resource intensive on the server side
  • (-) adds latency for traversal via the media server
  • (-) self-managed scaling, heavy traffic and resource maintenance
  • (-) possible security vulnerability, as the server decrypts media packets
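At its core, audio mixing in an MCU sums the decoded samples of all participants and clamps the result back into the sample range, which is exactly the CPU-heavy decode/mix/re-encode work listed above. A minimal PCM sketch (illustrative, not a production mixer):

```python
def mix_pcm(frames):
    """Mix equal-length 16-bit PCM frames by summing samples with clipping."""
    mixed = []
    for samples in zip(*frames):
        s = sum(samples)
        mixed.append(max(-32768, min(32767, s)))  # clamp to the int16 range
    return mixed

a = [1000, -2000, 30000]
b = [500, -500, 10000]
print(mix_pcm([a, b]))  # [1500, -2500, 32767]
```

The last sample saturates at 32767, showing why real mixers also apply gain control before summation.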

Centralised Media via SFU

SFU + simulcast

A Selective Forwarding Unit (SFU) is a newer topology where the centralised media server only forwards or proxies the streams without mixing them.

  • (+) scales for low-latency video streaming
  • (+) less CPU consumption on the server
  • (+) can control the output stream for each peer based on its network capabilities
  • (-) still susceptible to security vulnerabilities at the focal point
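A minimal model of SFU forwarding: each packet is copied to every other subscriber's downlink as-is, never decoded or mixed (class and names are illustrative):

```python
class Sfu:
    """Forwards each incoming packet to every other subscriber without decoding."""
    def __init__(self):
        self.peers = {}  # peer_id -> list acting as that peer's downlink queue

    def join(self, peer_id):
        self.peers[peer_id] = []

    def on_packet(self, sender_id, packet):
        for peer_id, queue in self.peers.items():
            if peer_id != sender_id:   # never echo a stream back to its sender
                queue.append(packet)   # forwarded as-is: no transcode, no mix

sfu = Sfu()
for p in ("alice", "bob", "carol"):
    sfu.join(p)
sfu.on_packet("alice", b"rtp-frame-1")
print(sfu.peers["bob"], sfu.peers["carol"], sfu.peers["alice"])
# [b'rtp-frame-1'] [b'rtp-frame-1'] []
```

Because packets are relayed without being decrypted into raw media, the CPU cost per stream stays low, which is precisely why SFUs scale better than MCUs.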

Decentralised structure

In a decentralised (mesh) signalling structure, participants communicate p2p.

Decentralised media, Multi unicast streaming

Decentralised media, Multicast streaming

Mesh based communication

Limitations of WebRTC mesh Architecture

WebRTC is intrinsically a p2p system, and as more participants join the session the network begins to resemble a mesh. Audio and textual data, being lighter than heavy video media streams, can still adapt to difficult conditions without much noticeable lag. Video streams, however, take a hit when peers are on constrained bandwidth and use different qualities of video sources.

Let's assume three different clients communicating in a WebRTC mesh session:

  1. WebRTC browser on a high-resolution system (desktop, laptop, kiosk) - this client will likely send a high-quality stream and would like to consume high quality as well
  2. Mobile browser or native WebRTC client - this will have an average-quality stream that may fluctuate owing to telecom network handover or instability while moving between locations
  3. Embedded system like a Raspberry Pi with a camera module - since this is an embedded system, likely part of an IoT surveillance setup, it will try to restrict outgoing usage and incoming stream consumption to a minimum

Some issues with a WebRTC mesh conference include:

  • The unmatched quality of individual p2p streams in a mesh makes it difficult to have a homogeneous session quality.
  • Video packets often go out of sync with audio packets, leading to delay or freezing due to packet loss.
  • Pixelated video when the resolution of the incoming video does not match the viewer's display mode, e.g. a low-quality 320×280-pixel video viewed on a desktop monitor with 1080×720 resolution.
  • Different source encoders at the peers' WebRTC clients behave differently, e.g. a WebRTC stream from an embedded system like an RPi will differ from that of a desktop browser like Safari or Firefox, or a mobile browser like Chrome on Android.

Although WebRTC's media stack auto-adjusts, manipulating combinations of bitrate and resolution in real time based on feedback packets to adapt your video stream to your own and the peer's bandwidth constraints, many difficulties remain in getting a large number of participants (in the order of a few tens to hundreds) to join a mesh session. Even with an excellent connection and the great bandwidth of 5G networks, it is just not feasible to host even up to 100 users on a common mesh-based video system.
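The ~20-25 participant ceiling mentioned earlier can be sanity-checked with a back-of-the-envelope calculation; the 1.5 Mbps per-stream figure below is an assumption for a mid-quality video stream, not a measured value:

```python
def max_mesh_peers(uplink_mbps, per_stream_mbps):
    """Largest mesh size where a peer can still upload one stream to everyone else."""
    # In a mesh each peer uploads n-1 copies of its stream, so we need
    # (n - 1) * per_stream_mbps <= uplink_mbps.
    return int(uplink_mbps // per_stream_mbps) + 1

# e.g. a 30 Mbps uplink and ~1.5 Mbps per mid-quality video stream
print(max_mesh_peers(30, 1.5))  # 21
```

The same arithmetic shows why 100 mesh participants is infeasible: it would demand roughly 150 Mbps of sustained uplink per peer at that bitrate.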

Unicast, Broadcast and Multicast media distribution

Unicast is one-to-one transmission: servers direct a separate stream towards every listener who wants to connect, so the stream is replicated many times across the network. Usage: RTC over the network between two specific endpoints.

Broadcast is one-to-all transmission within a range. Its types are limited broadcast and directed broadcast. Usage: conference streaming.

Multicast is one-to-many transmission to a group of subscribed receivers. Usage: IPTV that distributes to hundreds or thousands of viewers.
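For a single fixed-bitrate stream, the server-side fan-out difference between these modes can be sketched as a count of stream copies leaving the server (the function name is mine, purely illustrative):

```python
def server_egress_streams(viewers, mode):
    """How many copies of the stream leave the server for a given distribution mode."""
    if mode == "unicast":
        return viewers   # one copy per listener; replicated end to end
    if mode == "multicast":
        return 1         # single copy; routers replicate toward subscribed receivers
    raise ValueError(f"unknown mode: {mode}")

print(server_egress_streams(1000, "unicast"), server_egress_streams(1000, "multicast"))
```

This is why IPTV-scale distribution leans on multicast: the server's egress cost stays constant regardless of audience size.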

Despite both being star topologies, the SFU (Selective Forwarding Unit) differs from the MCU in that it does not do any heavy-duty processing on the media streams; it only receives the streams and routes them to the other peers.

MCU (Multipoint Control Unit) media servers, on the other hand, need a lot of computational strength to perform many operations on the RTP streams, such as mixing, multiplexing and filtering echo/noise.
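With simulcast (multiple encodings of the same source at different qualities, as pictured earlier), the SFU's per-destination routing reduces to picking the highest layer each subscriber's downlink can sustain. The layer table below is hypothetical:

```python
# Hypothetical simulcast layers: (name, required downlink in kbps)
LAYERS = [("1080p", 3000), ("720p", 1500), ("360p", 600), ("180p", 200)]

def pick_layer(downlink_kbps):
    """Pick the highest simulcast layer the subscriber's downlink can sustain."""
    for name, required in LAYERS:
        if downlink_kbps >= required:
            return name
    return None  # not even the lowest layer fits; pause video

print(pick_layer(4000), pick_layer(1600), pick_layer(250))  # 1080p 720p 180p
```

A real SFU re-evaluates this choice continuously from receiver feedback (e.g. REMB/transport-cc), but the selection logic is essentially this.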

Scalable Video Coding (SVC) for large groups

Scalable Video Coding (SVC) encodes the stream once as a set of layered sub-streams, while simulcast sends multiple independent versions of the same stream at different qualities, such as resolutions, from which the SFU can pick the appropriate one for each destination. The SFU can also forward different frame rates to different destinations based on their bandwidth. Some of the conference bridge types:

1. Bridge

A centralised entity to book, start and leave conferences; therefore potentially a single point of failure.

  • To create a conference: the conference is created on a bridge URL, the bridge registers on the SIP server, and participants join the conference on the bridge using INVITEs
  • To stop a conference: a participant can leave with a BYE, or the conference can be terminated by sending BYE to all
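A minimal sketch of the bridge's participant bookkeeping under this INVITE/BYE flow (the class and method names are mine, purely illustrative):

```python
class ConferenceBridge:
    """Centralised bridge: participants join with INVITE, leave with BYE,
    and terminating the conference sends BYE to everyone still in it."""
    def __init__(self, uri):
        self.uri = uri
        self.participants = set()
        self.sent = []  # log of (method, target) messages the bridge would send

    def on_invite(self, participant):
        self.participants.add(participant)

    def on_bye(self, participant):
        self.participants.discard(participant)

    def terminate(self):
        for p in sorted(self.participants):
            self.sent.append(("BYE", p))  # bridge tears down each remaining dialog
        self.participants.clear()

conf = ConferenceBridge("sip:conf123@bridge.example.com")
conf.on_invite("alice"); conf.on_invite("bob")
conf.on_bye("alice")      # alice leaves on her own
conf.terminate()          # bridge ends the conference for everyone left
print(conf.sent)  # [('BYE', 'bob')]
```

The single `ConferenceBridge` instance here also makes the single-point-of-failure concern concrete: lose it and all dialog state is gone.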

2. Endpoints as Mixer

Endpoints handle the streams and media is decentralised, which suits ad hoc conferences.

(-) mixer UAs cannot leave until the conference finishes

3. Mesh

Complex; more processing power is required on each UA.

  • (+) no single point of failure
  • (-) high network congestion and endpoint processing
  • (-) endpoints have to handle NAT traversal

Large scale multiparticipant WebRTC sessions

An MCU (Multipoint Control Unit), which acts as a bridge between all participants, is the traditionally used system for hosting large conferences. However, an MCU limits or lowers bandwidth usage by packing the streams together. An SFU (Selective Forwarding Unit), on the other hand, simply forwards the streams.

This setup is usually designed with heavy bandwidth and upload rates in mind, and it is more scalable and resilient to bad-quality streams than p2p-style mesh setups. As these media gateway servers scale to accommodate more simultaneous real-time users, their bandwidth consumption is heavy and expensive (something to keep in mind while buying instances from cloud providers like Azure or AWS).

Some of the many options for building an SFU (Selective Forwarding Unit) setup for WebRTC media streams are listed below:


Kurento is an opensource (Apache 2.0) WebRTC gateway that has built-in integration with OpenCV.

Pipeline Architecture Design Pattern

Features in KMS (Kurento Media Server) include augmentation, face recognition, filters, object tracking and even virtual fencing.

Other features like mixing, transcoding and recording, as well as client APIs, make it suitable for integration into rich multimedia applications.

  • (+) It can function as both an MCU and an SFU.
  • (+) Added media processing and transformations: augmented reality, blending, mixing, analyzing ..
  • (+) ML-friendly, with OpenCV filters (samples provided)
  • (+) pipelines used with computer vision
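Kurento connects media elements into pipelines; the pipeline pattern itself can be sketched generically (the stage names below are hypothetical stand-ins, not the KMS API):

```python
def make_pipeline(*stages):
    """Chain processing stages so each one's output feeds the next,
    mirroring how media elements are connected in a media pipeline."""
    def run(frame):
        for stage in stages:
            frame = stage(frame)
        return frame
    return run

# Hypothetical stages standing in for filters such as grayscale, face overlay, recorder
grayscale = lambda f: f + ">gray"
overlay   = lambda f: f + ">overlay"
record    = lambda f: f + ">rec"

pipeline = make_pipeline(grayscale, overlay, record)
print(pipeline("frame0"))  # frame0>gray>overlay>rec
```

The appeal of the pattern is that a computer-vision filter slots in as just another stage, without touching the endpoints on either side.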

Nightly builds, good documentation and developer traction make this a good choice. The latest version at the time of writing this article is Kurento 6.15.0, released in November 2020.


Licode is an opensource (MIT) WebRTC communication platform by Lynckia.

Simple and straightforward to build from source. The latest release is v8, from September 2019.

Erizo, its WebRTC core, is by default an SFU but can also be switched to MCU mode for more features like output streaming and transcoding.

It is written in C++ and exposes a Node.js API to communicate with the server.

It supports add-on modules, such as recording.


Jitsi is an opensource (Apache 2.0) video conferencing platform built around the Jitsi Videobridge (JVB).

JITSI Components

  • Jitsi VideoBridge – SFU
  • Jicofo – “Focus” component that initiates Jingle on behalf of JVB
  • Jigasi – SIP to XMPP signaling and media gateway
  • Jirecon – Component that allows for recording of RTP streams
  • Jibri – New live recording component

Other client side components and SDK

  • lib-jitsi-meet/Strophe.js – Javascript (Browser/Node.js)
  • XMPPFramework/MeetRTC_iOS – iOS
  • Smack – Java/Android
Jitsi conferencing (SFU)

  • (+) Supports a high-capacity SFU. Provides tools (Jibri) for recording and/or streaming.
  • (+) Has Android and iOS SDKs.
  • (-) Low SIP support (more focused on XMPP). It originally uses XMPP signalling but can communicate with SIP platforms using a gateway (Jigasi), which is part of the Jitsi project.

It is best used as a binary package on Debian/Ubuntu instead of compiling it yourself with Maven. The most recent release is 2.0.5390, released on 12 Jan 2021.


mediasoup is an opensource (ISC) SFU conferencing server for both WebRTC and plain, non-secured RTP.

It follows the producer-consumer architecture design pattern.

  • (+) It is signalling-agnostic
  • (+) Node.js module on the server (media handling in C++)
  • (+) Provides JS and C++ client libraries
  • (+) audio/video producers or consumers can be GStreamer and FFmpeg scripts
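The producer-consumer pattern can be sketched as a tiny router model: producers publish streams into the router, and consumers subscribe to a producer's stream (the names here are illustrative, not the library's actual API):

```python
class Router:
    """Minimal producer/consumer model: producers publish streams into the
    router, and each consumer subscribes to one producer's stream."""
    def __init__(self):
        self.producers = {}   # producer_id -> latest published frame
        self.consumers = {}   # consumer_id -> producer_id it consumes

    def produce(self, producer_id, frame):
        self.producers[producer_id] = frame

    def consume(self, consumer_id, producer_id):
        self.consumers[consumer_id] = producer_id

    def pull(self, consumer_id):
        # Consumers only ever see forwarded frames; the router never mixes.
        return self.producers.get(self.consumers[consumer_id])

router = Router()
router.produce("cam1", b"frame-42")
router.consume("viewer1", "cam1")
print(router.pull("viewer1"))  # b'frame-42'
```

Because producers and consumers are decoupled through the router, an external tool (e.g. an FFmpeg script) can act as either side without the other endpoints knowing.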

It is relatively new, with less documentation, but its simple and minimalistic design makes it easy to grasp and run.


The Janus WebRTC gateway is also opensource (GNU GPL v3).

Built in C. It has the ability to switch between SFU and MCU roles and provides plugins on top, such as recording.

By default it uses a WebSocket-based protocol, HTTP/JSON and XMPP, but it can communicate with SIP platforms too using a plugin.

Asterisk SFU

Asterisk is traditionally an MCU-based, pure SIP signalling and media server (GNU GPL v2) from Sangoma Technologies; recent versions also provide SFU-style bridging.

A powerful server at the core of many OTT/VoIP providers and call-centre platforms.

  • (+) Can be modified to fit any role using combinations of its hundreds of modules.
  • (-) The project does not provide a client SDK.


A live-streaming platform with SDKs for native (iOS, Android) and HTML5 clients, plus custom server-side applications.

  • (+) supports IP cameras, drones, RTSP, RTMP and hardware encoders (many client types)
  • (+) failover to HLS and Flash
