Realtime Messaging Services Design

Functional Requirements

  • One-to-one / group chat
  • Support for multimedia – text / images / video / location
  • Read receipt / Message status
  • Last seen
  • Push notifications

Non-Functional Requirements

  • Low latency / minimal lag
  • HA (high availability) + fault tolerant
  • Scalability (2 billion users, 1.6 billion monthly active users)
  • Traffic: 64 billion msgs / day
  • Administrative requirements – GDPR and so on

Design Expectations

  • Partition tolerance to handle a large amount of data using clusters.
  • Reliability and consistency are critical to building trust, as miscommunication will drain user confidence in the application.
  • Resilient to recover from failures.
  • Security and privacy: end-to-end encryption over SSL/TLS
  • Analytics and monitoring

I have discussed more value-added features for messages, termed RCS (Rich Communication Suite), here:

The user application system could have a user profile, a messaging service and alerts / notifications.

A transient data store holds unsent messages until they expire. Transient messages are stored temporarily and deleted once they have been sent to the user.

The frontend tech for the various mobile and desktop agents could be

  • Android: Java / React Native
  • iOS: Swift / React Native
  • Web client: JavaScript/HTML/CSS with web frameworks such as Angular or React
  • Mac Desktop app: Swift/Objective-C
  • PC Desktop app: Electron, C/Java

Message Format

from :  alice x.x.x.x
to : bob y.y.y.y
metadata : 
    timestamp : 12 dec 2017 3:09:13:6678
    type : text 
msgPayload : "Hi How are you " 

“Message Read” Format

from :  bob y.y.y.y
to : alice x.x.x.x
metadata : 
    timestamp : 12 dec 2017 3:09:13:9070
    type : seen
msgPayload : 
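
To make the two formats above concrete, here is a minimal sketch of how they could be modelled and serialized. The class name Envelope, the helper read_receipt_for and the JSON layout are my own assumptions for illustration, not a fixed wire format; the IP addresses are left out for brevity.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Envelope:
    """Hypothetical envelope mirroring the fields in the formats above."""
    sender: str             # "from" in the format above
    to: str
    type: str               # "text" for a chat message, "seen" for a read receipt
    msgPayload: str = ""    # stays empty for read receipts
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps({
            "from": self.sender,
            "to": self.to,
            "metadata": {"timestamp": self.timestamp, "type": self.type},
            "msgPayload": self.msgPayload,
        })


def read_receipt_for(msg: Envelope) -> Envelope:
    """Build the "Message Read" reply: sender and recipient are swapped."""
    return Envelope(sender=msg.to, to=msg.sender, type="seen")


message = Envelope(sender="alice", to="bob", type="text", msgPayload="Hi, how are you?")
receipt = read_receipt_for(message)
print(message.to_json())
print(receipt.to_json())
```

The receipt reuses the same envelope with an empty payload, so the routing layer does not need a separate code path for it.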

Primary Keys

  • User
    • UserId
    • Username
    • UserprofilePic
  • Groups
    • GId
    • UserId1, UserId2…
  • Messages
    • ToUserId
    • FromUserId
    • Ts
    • MediaUrl
  • Sessions
    • UserId
    • MsgServerId
  • LastSeen
    • UserId
    • Ts

API

Msg API

  • SendMessage(senderUId, receiverUID, msg)
  • GetMessages( UserId , count , timerange)

Account API

  • registerUser ( APIKey , phoneNumber, UserId)
  • loginUser ( APIKey , phoneNumber , UserId , OTP)
  • validateAcc

Group API

  • createGrp (APIKey , groupInfo) returns GrpId
  • addUserToGrp ( APIKey , UserId, GrpId)
  • removeUserFromGrp (UserId, GrpId)
  • createAdmin( APIKey , UserId , GrpId)

Other capabilities the API should provide

  • authentication
  • monitoring
  • Load balancing
  • caching
  • request shaping
  • static responses

Messaging protocol

  • HTTP: slow, as the server closes the connection after each request. Unsuitable for realtime messaging.
  • HTTP long poll / short poll by client: the client polls the server requesting new information. Unsuitable for realtime messaging.
  • WebSocket: persistent connection, p2p, server push. Suitable for realtime messaging.

The overall high level architecture

Session Management

Dedicated / private chat session : SessionId = <UserId1 + UserId2>

Group / shared chat session : SessionId = <prefix + randomId>

The SessionMessages schema can use <SessionId + timestamp> as its primary key
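
The session id rules above could be sketched roughly as follows. Sorting the two user ids, the ":" separator, the "grp" prefix and the use of uuid4 as the random id are all assumptions I am making for the example.

```python
import uuid
from typing import Tuple


def private_session_id(user_a: str, user_b: str) -> str:
    """SessionId = <UserId1 + UserId2>.

    The ids are sorted (an assumption) so both participants derive the
    same session id no matter who starts the chat.
    """
    first, second = sorted([user_a, user_b])
    return f"{first}:{second}"


def group_session_id(prefix: str = "grp") -> str:
    """SessionId = <prefix + randomId>; uuid4 stands in for the random id."""
    return f"{prefix}-{uuid.uuid4().hex}"


def session_message_key(session_id: str, timestamp_ms: int) -> Tuple[str, int]:
    """Composite primary key <SessionId + timestamp> for SessionMessages."""
    return (session_id, timestamp_ms)


print(private_session_id("bob", "alice"))
print(group_session_id())
```

Keeping the timestamp as the second part of the key means the messages of one session sort naturally in send order.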

Fan Out Message / Send to All

Routing Service -> Messaging Group -> Push Notification
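
A rough sketch of that fan-out path, assuming the routing service knows which members currently hold an open connection. The function fan_out and its parameters are hypothetical names used only for this example.

```python
from typing import Callable, Dict, List, Set


def fan_out(
    sender: str,
    members: Set[str],
    online: Dict[str, Callable[[str], None]],
    push: Callable[[str, str], None],
    payload: str,
) -> List[str]:
    """Deliver payload to every group member except the sender.

    `online` maps a connected user id to a function that writes to that
    user's open connection; everyone else is handed to the push notifier.
    Returns the user ids that were sent a push notification instead.
    """
    pushed = []
    for user_id in sorted(members - {sender}):
        deliver = online.get(user_id)
        if deliver is not None:
            deliver(payload)
        else:
            push(user_id, payload)
            pushed.append(user_id)
    return pushed


inbox: List[str] = []       # stands in for bob's open connection
notified: List[tuple] = []  # stands in for the push notification service
pushed = fan_out(
    "alice",
    {"alice", "bob", "carol"},
    {"bob": inbox.append},
    lambda user_id, text: notified.append((user_id, text)),
    "hello group",
)
print(inbox, pushed)
```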

Push Notification

  • APNS – Apple Push Notification service, used for iPhone
  • GCM (Google Cloud Messaging) / FCM (Firebase Cloud Messaging)
  • WNS (Windows Push Notification Services)

The mobile agent registers with its PNS using its device ID to get a push notification token.

The messaging platform then uses this push notification token to send a push notification to the recipient.
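
The token flow could look something like the sketch below. The real APNS / FCM / WNS client calls are deliberately replaced by plain callables, and PushRegistry is a hypothetical class, so nothing here reflects the actual provider APIs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a provider client: (token, text) -> None
Sender = Callable[[str, str], None]


class PushRegistry:
    def __init__(self, providers: Dict[str, Sender]) -> None:
        self._providers = providers
        # user id -> (provider name, push notification token)
        self._tokens: Dict[str, Tuple[str, str]] = {}

    def register(self, user_id: str, provider: str, token: str) -> None:
        """Store the token the mobile agent obtained from its PNS."""
        if provider not in self._providers:
            raise ValueError(f"unknown provider: {provider}")
        self._tokens[user_id] = (provider, token)

    def notify(self, user_id: str, text: str) -> bool:
        """Send via the recipient's provider; False if no token is known."""
        entry = self._tokens.get(user_id)
        if entry is None:
            return False
        provider, token = entry
        self._providers[provider](token, text)
        return True


sent: List[tuple] = []
registry = PushRegistry({"apns": lambda tok, txt: sent.append(("apns", tok, txt))})
registry.register("bob", "apns", "token-123")
delivered = registry.notify("bob", "New message from alice")
missing = registry.notify("carol", "New message from alice")
print(delivered, missing)
```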

Handling Load

External load balancers for WebSocket handlers and user agents

High load is shared by multiple message servers and API gateways behind internal load balancers in the DMZ (demilitarized zone).

Distributed datastore : API gateways distribute requests across servers using consistent hashing
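
For readers who have not seen consistent hashing before, here is a minimal sketch of a hash ring with virtual nodes. The class HashRing, the node names and the choice of SHA-256 with 100 virtual nodes per server are assumptions for the example.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        self._ring: Dict[int, str] = {}   # ring position -> node name
        self._sorted: List[int] = []      # ring positions in ascending order
        self._replicas = replicas         # virtual nodes per server
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            if point not in self._ring:
                bisect.insort(self._sorted, point)
            self._ring[point] = node

    def remove(self, node: str) -> None:
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            if self._ring.get(point) == node:
                del self._ring[point]
                self._sorted.remove(point)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node."""
        if not self._sorted:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._sorted, _hash(key)) % len(self._sorted)
        return self._ring[self._sorted[idx]]


ring = HashRing(["db-1", "db-2", "db-3"])
keys = [f"group-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("db-3")
after = {k: ring.node_for(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")
```

When a server leaves, only the keys that lived on it are remapped; keys on the surviving servers stay where they were, which is the reason for preferring this over a plain modulo.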

Sharded by GroupId as primary index and UserId as secondary index

Distributed cache : write-through mechanism on Redis clusters
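
The write-through idea can be sketched without any real Redis client. In the example below, plain dictionaries stand in for both the Redis cluster and the datastore, and WriteThroughCache is a hypothetical name.

```python
from typing import Dict, Optional


class WriteThroughCache:
    """Every write goes to the backing store first and then to the cache,
    so a cache hit never returns data the store does not have."""

    def __init__(self, store: Dict[str, str]) -> None:
        self._store = store               # stand-in for the distributed datastore
        self._cache: Dict[str, str] = {}  # stand-in for a Redis cluster

    def put(self, key: str, value: str) -> None:
        self._store[key] = value   # durable write first
        self._cache[key] = value   # then keep the cache in step

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]
        value = self._store.get(key)
        if value is not None:
            self._cache[key] = value   # fill the cache on a miss
        return value


store: Dict[str, str] = {}
cache = WriteThroughCache(store)
cache.put("lastseen:bob", "2017-12-12T03:09:13")
print(cache.get("lastseen:bob"), store["lastseen:bob"])
```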

Stream and Log Analysis : Kafka + Hadoop

Scalability

Assume 1 billion users active per month and 40 million messages per second at peak

servers required = messages per second * latency per message / concurrent messages a server can handle

servers required = 40 million * 20 ms / 100,000 = 8 servers
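
The same back-of-the-envelope estimate as a small function, so the inputs can be varied. The function name and parameter names are mine, and the result covers raw capacity only, with no headroom for redundancy.

```python
import math


def servers_required(
    messages_per_second: int,
    latency_ms: int,
    concurrent_per_server: int,
) -> int:
    """Messages in flight at once = arrival rate * time each one takes.
    Dividing by what one server can hold concurrently gives the count."""
    in_flight = messages_per_second * latency_ms / 1000
    return math.ceil(in_flight / concurrent_per_server)


# 40 million msgs/s * 20 ms = 800,000 messages in flight
estimate = servers_required(40_000_000, 20, 100_000)
print(estimate)
```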

Bottlenecks

  1. If the recipient of a message is offline / unavailable, message delivery is retried indefinitely
    • Solved using a transient message store that holds undelivered messages until the user is able to receive them or they expire.
    • The transient DB can be FIFO
  2. Server failure
    • Replicate the messaging server for ongoing sessions across 2f+1 replicas (enough to tolerate f failed replicas)
    • The client is automatically handed over to a new server when the existing server crashes
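
A minimal sketch of the transient message store from the first bottleneck: a FIFO queue per recipient, with an expiry time on every message. TransientStore and its method names are hypothetical, and the explicit `now` argument exists only to make the example deterministic.

```python
import time
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class TransientStore:
    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        # recipient -> FIFO queue of (expires_at, payload)
        self._queues: Dict[str, Deque[Tuple[float, str]]] = {}

    def hold(self, recipient: str, payload: str, now: Optional[float] = None) -> None:
        """Keep an undelivered message until it is drained or expires."""
        now = time.time() if now is None else now
        self._queues.setdefault(recipient, deque()).append((now + self._ttl, payload))

    def drain(self, recipient: str, now: Optional[float] = None) -> List[str]:
        """Return unexpired messages in arrival order and delete them all."""
        now = time.time() if now is None else now
        queue = self._queues.pop(recipient, deque())
        return [payload for expires_at, payload in queue if expires_at > now]


pending = TransientStore(ttl_seconds=60)
pending.hold("bob", "first", now=0)
pending.hold("bob", "second", now=30)
pending.hold("bob", "third", now=40)
delivered_on_reconnect = pending.drain("bob", now=70)   # "first" expired at t=60
print(delivered_on_reconnect)
```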

Optimizations

  1. Replication of Transient Storage and CDN for media ( images / Videos )
  2. To fetch new messages, use the last read message as a pointer and fetch all messages with a greater sequence number
  3. Random Authentication and Challenge
  4. Proactive server restore and key refresh to prevent Byzantine attacks
  5. Integration with SMS gateway
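
Optimization 2 above can be sketched as a per-session log with increasing sequence numbers, where the client sends the last sequence it has read. SessionLog and fetch_after are hypothetical names, and the in-memory lists stand in for the SessionMessages table.

```python
import bisect
from typing import List, Tuple


class SessionLog:
    def __init__(self) -> None:
        self._seqs: List[int] = []       # ascending sequence numbers
        self._payloads: List[str] = []   # payloads, same order as _seqs

    def append(self, payload: str) -> int:
        """Store a message and return the sequence number assigned to it."""
        seq = self._seqs[-1] + 1 if self._seqs else 1
        self._seqs.append(seq)
        self._payloads.append(payload)
        return seq

    def fetch_after(self, last_seq: int) -> List[Tuple[int, str]]:
        """Return every message with a sequence number greater than last_seq."""
        start = bisect.bisect_right(self._seqs, last_seq)
        return list(zip(self._seqs[start:], self._payloads[start:]))


log = SessionLog()
for text in ["hi", "how are you?", "see you at 5"]:
    log.append(text)
unread = log.fetch_after(1)   # the client has already read sequence 1
print(unread)
```

Because the client only sends one number, the server does not have to track per-client read state to answer the query.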
