With the sudden onset of Covid-19 and the growing trend of working from home, the demand for scalable conferencing solutions and virtual meeting rooms has skyrocketed. Here is my advice if you are building an auto-scalable conferencing solution.
This article is about media server setups that provide mid- to high-scale conferencing over SIP to various endpoints, including SIP softphones, PBXs, carrier/PSTN and WebRTC.
Point to Point
Endpoints communicating over unicast
RTP and RTCP traffic is private between sender and receiver, even if the endpoints contain multiple SSRCs in the RTP session
- Facilitates private communication between the parties
- The only limits on the number of streams between the participants are physical ones such as bandwidth and the number of available ports
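Since SSRCs come up in every topology below, here is a minimal sketch of where the SSRC sits in an RTP packet. It assumes the 12-byte fixed header layout from RFC 3550; the `RtpHeader` type, the `parse_rtp_header` helper and the sample packet are hypothetical names made up for illustration.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed RTP header and the optional CSRC list."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F            # low 4 bits of the first byte
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    csrcs = struct.unpack("!%dI" % csrc_count, packet[12:end])
    return RtpHeader(first >> 6, second & 0x7F, seq, ts, ssrc, csrcs)


# Hypothetical packet: version 2, payload type 0, no CSRCs, 160 payload bytes.
sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234ABCD) + b"\x00" * 160
header = parse_rtp_header(sample)
```

In a point-to-point session the receiver sees the sender's SSRC unchanged, which is why the CSRC list is empty here.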
Point to Point via Middlebox
Same as above but with a middlebox involved
Mostly used to provide interoperability between otherwise non-interoperable endpoints, for example by transcoding codecs or converting transports
Such a middlebox does not use an SSRC of its own and keeps the SSRC for an RTP stream across the translation.
Subtypes of middlebox:
NAT traversal roles that pin the media path to a public address domain through a relay or TURN server
Middleboxes for auditing or for privacy control of a participant's IP address
Other SBC (Session Border Controller) like characteristics are also part of this topology setup
Interconnecting networks, such as multicast to unicast
Media repacketization to allow other media, such as non-RTP protocols, to connect to the session
Modifying the media inside RTP streams, commonly known as transcoding
Can do up to full encoding / decoding of RTP streams
In many cases it can also act on behalf of endpoints that do not support RTP, receiving and responding to feedback reports and performing FEC (forward error correction)
Back-To-Back RTP Session
Much like a translator middlebox, but it establishes a separate RTP session leg with each endpoint and bridges the two sessions.
Takes complete responsibility for forwarding the correct RTP payload and maintaining the relation between the SSRCs and CNAMEs
- The B2BUA / media bridge takes responsibility for relaying media and managing congestion
- A B2BUA can be subject to a man-in-the-middle (MITM) attack or have a backdoor to eavesdrop on conversations
Point to Multipoint using Multicast
Any-Source Multicast (ASM)
Traffic from any participant sent to the multicast group address reaches all other participants
Source-Specific Multicast (SSM)
Only selected senders stream to the multicast group, which delivers the streams to the receivers
Point to Multipoint using Mesh
many unicast RTP streams making a mesh
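To see why a mesh stops scaling, a rough back-of-the-envelope sketch helps. It assumes every participant sends one stream to every other participant at a fixed bitrate; the function name and the bitrate figure used in the example are illustrative assumptions, not measurements.

```python
def mesh_load(participants: int, stream_kbps: float) -> dict:
    """Stream counts and uplink bandwidth for a full mesh of unicast streams."""
    if participants < 2:
        raise ValueError("a mesh needs at least two participants")
    others = participants - 1
    return {
        "streams_sent_per_endpoint": others,
        "streams_received_per_endpoint": others,
        "total_streams": participants * others,
        "uplink_kbps_per_endpoint": others * stream_kbps,
    }


# Example: 5 participants, each sending a hypothetical 500 kbps video stream.
load = mesh_load(5, 500)
```

The total number of streams grows quadratically with the number of participants, and each endpoint pays for it on its own uplink.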
Point to Multipoint + Translator
Some more variants of this topology are Point to Multipoint with Mixer
Media Mixing Mixer
receives RTP streams from several endpoints and selects the stream(s) to be included in a media-domain mix. The selection can be through
static configuration or by dynamic, content-dependent means such as voice activation. The mixer then creates a single outgoing RTP stream from this mix.
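The select-then-mix step described above can be sketched roughly as follows. This is a simplified illustration, assuming the streams are already decoded to 16-bit PCM frames and that "voice activation" is approximated by picking the frames with the highest energy; `frame_energy`, `mix_frames` and the participant labels are hypothetical.

```python
from typing import Dict, List


def frame_energy(frame: List[int]) -> float:
    """Mean squared amplitude, used here as a crude voice-activity score."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def mix_frames(frames: Dict[str, List[int]], max_sources: int = 2) -> List[int]:
    """Pick the loudest sources and sum them into one 16-bit PCM frame."""
    loudest = sorted(frames, key=lambda k: frame_energy(frames[k]),
                     reverse=True)[:max_sources]
    if not loudest:
        return []
    length = min(len(frames[k]) for k in loudest)
    mixed = []
    for i in range(length):
        total = sum(frames[k][i] for k in loudest)
        mixed.append(max(-32768, min(32767, total)))  # clip to the int16 range
    return mixed


decoded = {"alice": [1000, -1000], "bob": [30000, -30000], "carol": [10, 10]}
mixed = mix_frames(decoded)
```

A production mixer would also need to encode the result and handle timing, which this sketch leaves out.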
Media Switching Mixer
RTP mixer based on media switching avoids the media decoding and encoding operations in the mixer, as it conceptually forwards the encoded media stream.
The mixer can reduce the bitrate or switch between sources, for example to the active speaker.
SFU (Selective Forwarding Unit)
The middlebox selects which of the potential sources (SSRCs) transmitting media will be sent to each of the endpoints. Each such transmission is set up as an independent RTP session.
extensively used in videoconferencing topologies with scalable video coding as well as simulcasting.
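The per-endpoint selection an SFU performs can be sketched as a simple forwarding plan. This is only an illustration, assuming the SFU knows an audio level per SSRC and forwards the loudest sources to each receiver, never echoing a receiver's own stream; the function name, the level values and the `max_streams` cap are assumptions.

```python
from typing import Dict, List


def select_forwarding(audio_levels: Dict[int, float],
                      receivers: Dict[str, int],
                      max_streams: int = 3) -> Dict[str, List[int]]:
    """Map each receiver to the SSRCs the SFU would forward to it.

    audio_levels: SSRC -> current audio level (higher means more active).
    receivers: endpoint name -> that endpoint's own SSRC.
    """
    ranked = sorted(audio_levels, key=audio_levels.get, reverse=True)
    plan = {}
    for name, own_ssrc in receivers.items():
        # Forward encoded streams as-is; only the choice of sources differs.
        plan[name] = [s for s in ranked if s != own_ssrc][:max_streams]
    return plan


levels = {1: 0.9, 2: 0.1, 3: 0.5, 4: 0.0}
endpoints = {"alice": 1, "bob": 2, "carol": 3, "dave": 4}
plan = select_forwarding(levels, endpoints, max_streams=2)
```

No decoding or mixing happens here, which is what keeps an SFU cheaper per participant than a mixing server.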
At a high level, one can safely assume that a mesh architecture makes sense for 3-6 peers, while any number above that requires a centralized media architecture.
Among the centralized media architectures, an SFU makes sense for roughly 6-15 people in a conference; if the number of participants exceeds that, it may need to switch to MCU mode. There is also another architecture, which works in hybrid mode.
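The rule of thumb above can be written down as a small helper. The thresholds are the rough figures from this article rather than hard limits, and the per-endpoint stream counts assume one outgoing stream per participant with no simulcast; both function names are hypothetical.

```python
from typing import Tuple


def choose_topology(participants: int) -> str:
    """Pick a media topology from the participant count (rough guidance only)."""
    if participants < 2:
        raise ValueError("a conference needs at least two participants")
    if participants == 2:
        return "point-to-point"
    if participants <= 6:
        return "mesh"
    if participants <= 15:
        return "sfu"
    return "mcu"


def streams_per_endpoint(topology: str, participants: int) -> Tuple[int, int]:
    """Return (streams sent, streams received) by a single endpoint."""
    others = participants - 1
    if topology in ("point-to-point", "mesh"):
        return (others, others)   # one stream to and from every other peer
    if topology == "sfu":
        return (1, others)        # one uplink, up to one downlink per other peer
    if topology == "mcu":
        return (1, 1)             # one uplink, one mixed downlink
    raise ValueError("unknown topology: " + topology)
```

For example, `choose_topology(10)` returns `"sfu"`, and `streams_per_endpoint("sfu", 10)` shows each endpoint sending 1 stream while receiving up to 9.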
Point to Multipoint Using Video-Switching MCUs
Much like a regular MCU, but it can switch the bitrate and resolution of a stream based on the active speaker, host or presenter, with floor-control like characteristics
This setup can embed the characteristics of a translator and a selector, and can even do congestion control based on RTCP
To handle a multipoint conference scenario it acts as a translator, forwarding the selected RTP stream under its own SSRC with the appropriate CSRC values, and modifies the RTCP RRs it forwards between the domains
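Forwarding a selected stream under the MCU's own SSRC while crediting the original source in the CSRC list can be sketched at the byte level. This is a simplified illustration based on the RTP header layout in RFC 3550; it assumes the incoming packet carries no CSRCs, and it leaves out the sequence number and timestamp rewriting a real MCU would also do. The function name and SSRC values are made up.

```python
import struct


def forward_with_own_ssrc(packet: bytes, mcu_ssrc: int) -> bytes:
    """Re-stamp an RTP packet with the MCU's SSRC and list the source as CSRC."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    first, second, seq, ts, source_ssrc = struct.unpack("!BBHII", packet[:12])
    if first & 0x0F:
        raise ValueError("this sketch only handles packets without CSRCs")
    new_first = (first & 0xF0) | 1  # keep V/P/X bits, set CSRC count to 1
    header = struct.pack("!BBHII", new_first, second, seq, ts, mcu_ssrc)
    csrc_list = struct.pack("!I", source_ssrc)
    # The CSRC list goes right after the fixed header, before any extension.
    return header + csrc_list + packet[12:]


original = struct.pack("!BBHII", 0x80, 96, 7, 3000, 0xAAAA0001) + b"payload"
forwarded = forward_with_own_ssrc(original, 0xBBBB0002)
```

Receivers then see a single SSRC for the MCU's output and can still tell from the CSRC who is currently being shown or heard.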
Before getting into an in-depth discussion of all possible types of media architectures in a VoIP system, let's learn about TCP vs UDP.
TCP is a reliable, connection-oriented protocol that exchanges SYN and ACK segments to establish a connection between the communicating parties. It numbers packets sequentially, so lost packets can be resent individually and the receiver can recognize and reorder out-of-order packets. It is thus used for session creation thanks to its error correction and congestion control features.
Once a session is established, the media shifts to RTP over UDP. UDP is not as reliable and guarantees neither non-duplication nor delivery, but it is used because it lends itself to tunneling, where packets of other protocols are encapsulated inside UDP packets. However, to provide end-to-end security, other methods for authentication and encryption are needed.
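The difference shows up directly in the socket API. The sketch below sends a small payload over the loopback interface, once as a UDP datagram and once over a TCP connection; it is only meant to contrast the two, and the helper names and timeouts are arbitrary choices.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram: no handshake, no delivery guarantee."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        receiver.settimeout(2.0)
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(2048)
        return data
    finally:
        sender.close()
        receiver.close()


def tcp_roundtrip(payload: bytes) -> bytes:
    """Send the same bytes over TCP: connect first, then a reliable stream."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    conn = None
    try:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(2.0)
        client.settimeout(2.0)
        client.connect(listener.getsockname())  # handshake happens here
        conn, _addr = listener.accept()
        conn.settimeout(2.0)
        client.sendall(payload)
        received = b""
        while len(received) < len(payload):     # TCP is a byte stream
            chunk = conn.recv(2048)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        if conn is not None:
            conn.close()
        client.close()
        listener.close()
```

UDP needs nothing more than `sendto`, which suits small, time-sensitive RTP packets, while TCP requires `connect` and `accept` before any data moves.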
Audio PCAP storage and Privacy constraints for Media Servers
A call session produces various traces for offline monitoring and analysis, which can include
CDR (Call Detail Records) – to and from numbers, ring time, answer time, duration etc.
Signalling PCAPs – usually collected from the SIP application server, containing the SIP requests, SDP and responses. They show the call flow sequence, for example who sent the INVITE and who sent the BYE or CANCEL, how many times the call was updated or paused/resumed, etc.
Media Stats – jitter, buffer, RTT and MOS for all legs, plus average values
Audio PCAPs – the recording of the RTP stream and RTCP packets between the parties; it requires explicit consent from the customer or user. VoIP companies complying with GDPR cannot record the audio stream of calls and preserve it for any purpose, such as audit, call quality debugging or their own inspection, without that consent.
To throw more light on audio PCAP storage: assuming the user provides explicit permission, here is the approach for carrying out the recording and storage operations.
Furthermore, strict access control, encryption and anonymisation of the media packets are necessary to obfuscate details of the call session.
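As an example of how one of the media stats listed above is derived, the jitter figure is usually the interarrival jitter estimate from RFC 3550. The sketch below follows that formula, assuming arrival times have already been converted to the same units as the RTP timestamps; the function name and sample values are illustrative.

```python
from typing import Iterable, Tuple


def interarrival_jitter(packets: Iterable[Tuple[int, int]]) -> float:
    """Running jitter estimate: J = J + (|D| - J) / 16, as in RFC 3550.

    packets: (rtp_timestamp, arrival_time) pairs in the same timestamp units.
    """
    jitter = 0.0
    prev_transit = None
    for rtp_ts, arrival in packets:
        transit = arrival - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)  # change in relative transit time
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


steady = interarrival_jitter([(0, 100), (160, 260), (320, 420)])
one_late = interarrival_jitter([(0, 100), (160, 276)])
```

Evenly paced packets give a jitter of zero, while a packet arriving 16 units late moves the estimate by one sixteenth of that delay.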
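One possible way to anonymise addressing details before a capture is stored is to replace IP addresses with keyed hashes. This is only a sketch of that one idea, assuming an HMAC-SHA256 keyed hash with a secret kept outside the capture store; it does not cover encryption of the payload or access control, and the function name, token format and sample addresses are hypothetical.

```python
import hashlib
import hmac


def pseudonymise_ip(ip: str, secret: bytes) -> str:
    """Replace an IP address with a stable token that hides the address."""
    digest = hmac.new(secret, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return "anon-" + digest[:16]


# Hypothetical key; in practice it would come from a secrets manager.
key = b"example-secret-key"
token_a = pseudonymise_ip("203.0.113.10", key)
token_b = pseudonymise_ip("203.0.113.11", key)
```

The same address always maps to the same token under one key, so call legs can still be correlated during debugging without exposing the participant's real IP.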
- RFC 7667 RTP Topologies – https://tools.ietf.org/html/rfc7667
To learn about the difference between media server topologies
- centralized vs decentralized,
- SFU vs MCU,
- multicast vs unicast,
To read more about building a scalable VoIP server-side architecture and
- Clustering the servers with a common cache for high availability and prompt failure recovery
- Multitier architecture, i.e. separation between the data/session layer and the application server/engine layer
- Microservice-based architecture, i.e. the difference between proxies like load balancer, SBC, backend services, OSS/BSS etc.
- Containerization and autoscaling