Realtime Messaging Services Design

Functional Requirnments

  1. one to one / group chat
  2. support for multimedia – text / images / video / loccation
  3. Read receipt / Message status
  4. Last seen
  5. Push notifications

Non Functional Requirnments

  1. No latency / No lag
  2. HA ( high availibilty ) + Fault tolerent
  3. scalablity ( 2 billion users , 1.6 Monthly ative users )
  4. traffic 64 billion msgs / day
  5. Administrative requirnments – GDPR so on

Design Expectations

  1. Partition tolerance to handle a large amount of data using clusters.
  2. To create trust, reliability and consistency are critical as miscommunication will drain user confidence in the application.
  3. Resilient to recover from failures.
  4. Security and Privacy : End to end encryption on SSL
  5. Analytics and monitoring

The User Application system could have user profile , messaging service and alerts / notifications .

A transistent data store handles unsent messages before expiry. The transient message are temporarily stored and once send to user., deleted .

The frontend tech for the various mobile and desktop agents could be

  • Android: Java / React Native
  • iOS: Swift / React Native
  • Web client: JavaScript/HTML/CSS/ with web frameworks such as Angular or React JS
  • Mac Desktop app: Swift/Objective-C
  • PC Desktop app: Electron , C/Java

Message Format

from :  alice x.x.x.x
to : bob y.y.y.y
metadata : 
    timestamp : 12 dec 2017 3:09:13:6678
    type : text 
msgPayload : "Hi How are you " 

“Message Read ” Format

from :  bob y.y.y.y
to : alice y.y.y.y
metadata : 
    timestamp : 12 dec 2017 3:09:13:9070
    type : seen
msgPayload : 

Primary Keys

  • User
    • UserId
    • Username
    • UserprofilePic
  • Groups
    • GId
    • UserId1, UserId2…
  • Messages
    • ToUserId
    • FromUserId
    • Ts
    • MediaUrl
  • Sessions
    • UserId
    • MsgServerId
  • LastSeen
    • UserId
    • Ts



  • SendMessage(senderUId, receiverUID, msg)
  • GetMessages( UserId , count , timerange)

Accout API

  • registerUser ( APIKey , phoneNumber, UserId)
  • loginUser ( APIKey , phoneNumber , UserId , OTP)
  • validateAcc

Group API

  • createGrp (APIkey , groupInfo) return GrpId
  • addUserToGrp ( APIKey , UserId, GrpId)
  • removeUserFromGrp (UserId, GrpId)
  • createAdmin( APIKey , UserId , GrpId)

Other API should provide

  • authentication
  • monitoring
  • Load balancing
  • caching
  • request hsaping
  • static responses

Messaging protocol

HTTPHTTP Long Poll / short Poll by Client Websocket
slow as server closes connection after each req client polls the server requesting new informationpersistnet connecion
Server Push
unsuitable to for realtime msgingunsuitable to for realtime msgingsuitable

High Level architecture (HLL)

The overall high level architecture consists of interfacing websocket handler layer isolated from core messaging service and msg handler.

Session Management

Dedicated / Private Chat Sessions : SessionId = <UserId1 + UserId2>

Group / shared Chat Session : SessionId < prefix + randomId >

SessionMessages schema can have its primary key : <SessionId + timestamp>

Fan Out Message / Send to All

Routing Service -> Messaing Group -> Push Notification

Push Notification

  • APNS – apple Push Notification Service used for iphone
  • GCM ( Google Cloud Mesageing ) / FCM ( FireBase Cloud Messaging )
  • WNS ( Windows Notofcaton Service )

Mobile Agent talks to its PNS with its device ID to get a pus notification token

The push notifcation token will be then used by Messaging platform to send a push notification to recipint .

Handling Load

External Load Balancers for Websockets Handlers and User agents

High load shared by multuple Message servers and PAI gateways behind Internal Load balancers in Dmz zone ( demilitirized zone).

Distributed Datastore : API gateways to distribute requests accross servers using consistent hashing

Sharded by GroupId as Primary Index and UserId as seconday Index

Distributed cache : write through Mechanism : Redis Clusters

Stream and Log Analysis : Kafka + Hadoop


Assume 1 billion users active per month and 40 million at peak

server required = message count per second * latency / server limit for concurrent messages per second

servers required = 40 million * 20 ms / 100,000 = 8 servers


  1. If receipient of Message is offline / unavaible the message delivery is tried indefinately
    • Solved using transiset message satore to hold undelivered messages untill user is able to take the message or message is expired .
    • Transisnet DB can be FIFO
  2. Server Failure
    • Replication of Messaging Server for ongoing sessions in 2f+1
    • client to automatically be handed over to new server when exsting server crashes


  1. Replication of Transient Storage and CDN for media ( images / Videos )
  2. To fetch new messages – use the last msg as pointer to read the message that was last read and fetch all message with greater sequnece
  3. Random Authentication and Challenge
  4. Proactive server restore and key refresh to prevent brazentine attacks
  5. Integration with SMS gateway

Rich Communication suite

I I have discussed more Addon Features for IM in terms of RCS ( Rich Communication Suite )

RCS ( Rich Communication Suite )

RCS ( Rich Communication Suite ) For the past few weeks, I’ve been trying to find the answer to this one. After much information gathering, I understood that majority of communication platform provider’s mostly OTT such as iMessage from apple, RBM(RCS Business Messaging) already supports these features. And it is partially a term coined by Google…

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.