- Functional Requirements
- Non-Functional Requirements
- Design Expectations
- Message Format
- Primary Keys
- API
- Messaging protocol
- High-level architecture
- Session Management
- Handling Load
- Scalability
- Bottlenecks
- Optimizations
- Rich Communication Suite
Functional Requirements
- One-to-one / group chat
- Support for multimedia: text / images / video / location
- Read receipt / Message status
- Last seen
- Push notifications
Non-Functional Requirements
- Low latency (no perceptible lag)
- HA (high availability) + fault tolerant
- Scalability (2 billion users, 1.6 billion monthly active users)
- Traffic: 64 billion msgs / day
- Administrative requirements: GDPR and so on
Design Expectations
- Partition tolerance to handle a large amount of data using clusters.
- Reliability and consistency are critical for building trust, as miscommunication will drain user confidence in the application.
- Resilience to recover from failures.
- Security and privacy: end-to-end encryption, with transport secured by SSL/TLS
- Analytics and monitoring
The user application system could have user profiles, a messaging service, and alerts / notifications.
A transient data store holds unsent messages until they expire. Transient messages are stored temporarily and deleted once they are sent to the user.
The frontend tech for the various mobile and desktop agents could be
- Android: Java / React Native
- iOS: Swift / React Native
- Web client: JavaScript/HTML/CSS with web frameworks such as Angular or React
- Mac Desktop app: Swift/Objective-C
- PC Desktop app: Electron, C/Java

Message Format
from: alice x.x.x.x
to: bob y.y.y.y
metadata:
  timestamp: 12 Dec 2017 3:09:13:6678
  type: text
msgPayload: "Hi How are you"
“Message Read” Format
from: bob y.y.y.y
to: alice x.x.x.x
metadata:
  timestamp: 12 Dec 2017 3:09:13:9070
  type: seen
msgPayload:
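As a rough sketch, the two formats above could be built as plain dictionaries and serialized to JSON. The function names, the JSON encoding, and the ISO timestamp are assumptions made for illustration; the format above does not prescribe any of them.

```python
import json
from datetime import datetime, timezone


def build_message(sender, receiver, msg_type, payload=""):
    """Return an envelope shaped like the message format above."""
    return {
        "from": sender,
        "to": receiver,
        "metadata": {
            # Assumption: ISO 8601 UTC timestamps instead of the free-form date above.
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "type": msg_type,
        },
        "msgPayload": payload,
    }


def build_read_receipt(original):
    """Swap sender and receiver and mark the message as seen (empty payload)."""
    return build_message(original["to"], original["from"], "seen")


text = build_message("alice", "bob", "text", "Hi How are you")
receipt = build_read_receipt(text)
wire = json.dumps(text)  # the string that would travel over the connection
```

The read receipt reuses the same envelope, so a client only needs one parser and can branch on the type field.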
Primary Keys
- User
  - UserId
  - Username
  - UserprofilePic
- Groups
  - GId
  - UserId1, UserId2…
- Messages
  - ToUserId
  - FromUserId
  - Ts
  - MediaUrl
- Sessions
  - UserId
  - MsgServerId
- LastSeen
  - UserId
  - Ts
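One possible way to try out these tables is an in-memory SQLite database. This is only a sketch: the column types, the choice of primary keys, and modelling the group's user list as a membership table are all assumptions, and a production system would more likely use a distributed store.

```python
import sqlite3

# Assumed column types and keys; only the field names come from the list above.
SCHEMA = """
CREATE TABLE users (user_id TEXT PRIMARY KEY, username TEXT, profile_pic TEXT);
CREATE TABLE group_members (
    group_id TEXT, user_id TEXT, PRIMARY KEY (group_id, user_id));
CREATE TABLE messages (
    to_user_id TEXT, from_user_id TEXT, ts INTEGER, media_url TEXT,
    PRIMARY KEY (to_user_id, from_user_id, ts));
CREATE TABLE sessions (user_id TEXT PRIMARY KEY, msg_server_id TEXT);
CREATE TABLE last_seen (user_id TEXT PRIMARY KEY, ts INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Record which message server currently holds a user's connection.
conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("u1", "alice", None))
conn.execute("INSERT INTO sessions VALUES (?, ?)", ("u1", "msg-server-7"))
row = conn.execute(
    "SELECT msg_server_id FROM sessions WHERE user_id = ?", ("u1",)
).fetchone()
```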
API
Msg API
- SendMessage(senderUId, receiverUId, msg)
- GetMessages(UserId, count, timeRange)
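A minimal in-memory sketch of these two calls is shown below. The class name, the tuple layout, and the choice to return the newest messages inside the time range are assumptions for illustration, not a defined contract.

```python
import time
from collections import defaultdict


class MessageStore:
    """Toy stand-in for the message service; nothing is persisted."""

    def __init__(self):
        self._inbox = defaultdict(list)  # receiver id -> [(ts, sender id, msg)]

    def send_message(self, sender_id, receiver_id, msg, ts=None):
        ts = time.time() if ts is None else ts
        self._inbox[receiver_id].append((ts, sender_id, msg))
        return ts

    def get_messages(self, user_id, count, time_range):
        """Return up to `count` of the newest messages with start <= ts <= end."""
        start, end = time_range
        hits = sorted(m for m in self._inbox.get(user_id, []) if start <= m[0] <= end)
        return hits[-count:] if count > 0 else []


store = MessageStore()
store.send_message("alice", "bob", "hi", ts=1.0)
store.send_message("alice", "bob", "there", ts=2.0)
store.send_message("carol", "bob", "late", ts=9.0)
recent = store.get_messages("bob", count=2, time_range=(0.0, 5.0))
```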
Account API
- registerUser(APIKey, phoneNumber, UserId)
- loginUser(APIKey, phoneNumber, UserId, OTP)
- validateAcc
Group API
- createGrp(APIKey, groupInfo) returns GrpId
- addUserToGrp(APIKey, UserId, GrpId)
- removeUserFromGrp(UserId, GrpId)
- createAdmin(APIKey, UserId, GrpId)
The API layer should also provide
- authentication
- monitoring
- load balancing
- caching
- request shaping
- static responses
Messaging protocol
HTTP | HTTP long poll / short poll by client | WebSocket |
Slow, as the server closes the connection after each request | Client polls the server requesting new information | Persistent point-to-point connection with server push |
Unsuitable for real-time messaging | Unsuitable for real-time messaging | Suitable |
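To make the difference concrete, here is a small in-process simulation, not real network code. A list stands in for a server the client must poll, and an asyncio.Queue stands in for a persistent connection the server can push on. All names and timings are made up for the example.

```python
import asyncio


async def poll_client(inbox, interval, rounds):
    """Short polling: ask repeatedly; most requests come back empty."""
    requests, received = 0, []
    for _ in range(rounds):
        requests += 1
        while inbox:
            received.append(inbox.pop(0))
        await asyncio.sleep(interval)
    return requests, received


async def push_client(queue, expected):
    """Persistent connection: the client just waits for the server to push."""
    return [await queue.get() for _ in range(expected)]


async def main():
    inbox, queue = [], asyncio.Queue()
    poller = asyncio.create_task(poll_client(inbox, 0.01, 10))
    pusher = asyncio.create_task(push_client(queue, 2))
    for text in ("hi", "there"):
        await asyncio.sleep(0.02)
        inbox.append(text)      # the polling client sees it on its next request
        queue.put_nowait(text)  # the push client is woken immediately
    return await poller, await pusher


(poll_requests, polled), pushed = asyncio.run(main())
```

Both clients end up with the same two messages, but the polling client issued ten requests to get them, while the push client issued none.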
High-level architecture
The overall high-level architecture consists of a WebSocket handler layer that interfaces with clients and is isolated from the core messaging service and message handler.

Session Management
Dedicated / private chat sessions: SessionId = <UserId1 + UserId2>
Group / shared chat sessions: SessionId = <prefix + randomId>
The SessionMessages schema can use <SessionId + timestamp> as its primary key.
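The session ids above could be generated roughly as follows. Sorting the two user ids (so both sides derive the same id), the ":" separator, the "grp" prefix, and the use of a UUID as the random part are assumptions added for this sketch.

```python
import uuid


def private_session_id(user_a, user_b):
    """Same id no matter who starts the chat: sort the ids before joining."""
    first, second = sorted((user_a, user_b))
    return f"{first}:{second}"


def group_session_id(prefix="grp"):
    """Prefix plus a random id, as in the group session format above."""
    return f"{prefix}-{uuid.uuid4().hex}"


def session_message_key(session_id, timestamp_ms):
    """Composite primary key for a row in SessionMessages."""
    return (session_id, timestamp_ms)


chat_id = private_session_id("bob", "alice")
key = session_message_key(chat_id, 1513048153667)
```

Ordering the composite key by timestamp within a session means a conversation's messages can be read back in order with a single range scan.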
Fan Out Message / Send to All
Routing Service -> Messaging Group -> Push Notification
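A sketch of that fan-out path: the routing step walks the group's members, delivers over the live connection where one exists, and falls back to a push notification otherwise. The function signature and the deliver and push callbacks are hypothetical stand-ins for the real services.

```python
def fan_out(message, group_members, sender_id, online_sessions, deliver, push):
    """Send `message` to every group member except the sender.

    online_sessions maps user id -> message server id for connected users.
    """
    delivered, pushed = [], []
    for user_id in group_members:
        if user_id == sender_id:
            continue
        server_id = online_sessions.get(user_id)
        if server_id is not None:
            deliver(server_id, user_id, message)  # live connection
            delivered.append(user_id)
        else:
            push(user_id, message)  # offline: push notification
            pushed.append(user_id)
    return delivered, pushed


sent, notified = [], []
delivered, pushed = fan_out(
    message="standup at 10",
    group_members=["alice", "bob", "carol"],
    sender_id="alice",
    online_sessions={"bob": "msg-server-2"},
    deliver=lambda server, user, msg: sent.append((server, user, msg)),
    push=lambda user, msg: notified.append((user, msg)),
)
```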
Push Notification
- APNS: Apple Push Notification Service, used for iPhone
- GCM (Google Cloud Messaging) / FCM (Firebase Cloud Messaging)
- WNS (Windows Push Notification Services)
The mobile agent talks to its PNS (push notification service) with its device ID to get a push notification token.
The messaging platform then uses this token to send a push notification to the recipient.
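The token handshake described above can be sketched with a fake push service. FakePushService and MessagingPlatform are invented for this example; the real APNS, FCM, and WNS APIs look different and are not modelled here.

```python
import secrets


class FakePushService:
    """Stand-in for APNS / FCM / WNS; not a real client for any of them."""

    def __init__(self, name):
        self.name = name
        self._devices = {}  # token -> device id
        self.sent = []      # (device id, body) pairs that were "delivered"

    def register(self, device_id):
        """Step 1: the mobile agent registers its device and receives a token."""
        token = secrets.token_hex(16)
        self._devices[token] = device_id
        return token

    def send(self, token, body):
        device_id = self._devices.get(token)
        if device_id is None:
            return False  # unknown or revoked token
        self.sent.append((device_id, body))
        return True


class MessagingPlatform:
    def __init__(self):
        self._tokens = {}  # user id -> (push service, token)

    def save_token(self, user_id, service, token):
        """Step 2: the app hands the token to the messaging platform."""
        self._tokens[user_id] = (service, token)

    def notify(self, user_id, body):
        """Step 3: the platform uses the token to reach the recipient."""
        entry = self._tokens.get(user_id)
        if entry is None:
            return False
        service, token = entry
        return service.send(token, body)


pns = FakePushService("fake-pns")
platform = MessagingPlatform()
platform.save_token("bob", pns, pns.register("device-123"))
ok = platform.notify("bob", "New message from alice")
```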
Handling Load
External load balancers for WebSocket handlers and user agents
High load is shared by multiple message servers and API gateways behind internal load balancers in the DMZ (demilitarized zone).
Distributed datastore: API gateways distribute requests across servers using consistent hashing
Sharded by GroupId as primary index and UserId as secondary index
Distributed cache: write-through mechanism, Redis clusters
Stream and log analysis: Kafka + Hadoop
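For the consistent hashing mentioned above, a minimal ring with virtual nodes might look like this. The class name, the SHA-256 hash, the 100 virtual nodes per server, and the server names are assumptions chosen for the sketch.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each server appears many times on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key):
        """Walk clockwise from the key's hash to the first server entry."""
        if not self._ring:
            raise LookupError("no servers in the ring")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
owner = ring.get_node("group-42")  # shard key: GroupId
```

When a server is removed, only the keys it owned move to other servers. Every other key keeps its owner, which is the property that makes this attractive for sharding.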
Scalability
Assume 1 billion monthly active users and 40 million messages per second at peak.
servers required = message count per second * latency / server limit for concurrent messages
servers required = 40 million * 20 ms / 100,000 = 8 servers
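The same estimate as a small function, so the inputs can be varied. The function and parameter names are made up here; the numbers are the ones assumed above. Rate multiplied by latency gives the number of messages in flight at any instant, which is then divided by what one server can hold concurrently.

```python
import math


def servers_required(messages_per_second, latency_ms, per_server_concurrency):
    """Messages in flight at once, divided by one server's concurrent capacity."""
    in_flight = messages_per_second * latency_ms / 1000
    return math.ceil(in_flight / per_server_concurrency)


peak_servers = servers_required(40_000_000, 20, 100_000)
```

With these inputs, 800,000 messages are in flight at peak, which gives 8 servers before adding any headroom for failures or uneven load.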
Bottlenecks
- If the recipient of a message is offline / unavailable, message delivery is retried indefinitely
  - Solved using a transient message store that holds undelivered messages until the user is able to receive them or the message expires.
  - The transient DB can be FIFO
- Server failure
  - Replicate messaging servers for ongoing sessions across 2f+1 replicas, so that f failures can be tolerated
  - Clients are automatically handed over to a new server when the existing server crashes
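The transient store for offline recipients could be sketched as a per-user FIFO queue with an expiry time. The class name, the TTL parameter, and the injectable clock are assumptions for illustration.

```python
import time
from collections import deque


class TransientStore:
    """Holds undelivered messages per user until delivery or expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._queues = {}    # user id -> deque of (expires_at, message)

    def hold(self, user_id, message):
        expires_at = self._clock() + self._ttl
        self._queues.setdefault(user_id, deque()).append((expires_at, message))

    def drain(self, user_id):
        """Return unexpired messages in arrival (FIFO) order and delete them all."""
        now = self._clock()
        queue = self._queues.pop(user_id, deque())
        return [msg for expires_at, msg in queue if expires_at > now]


fake_now = [0.0]
store = TransientStore(ttl_seconds=10, clock=lambda: fake_now[0])
store.hold("bob", "first")   # expires at t=10
fake_now[0] = 5.0
store.hold("bob", "second")  # expires at t=15
fake_now[0] = 12.0
pending = store.drain("bob")  # bob comes online at t=12
```

Because delivery reads from the queue once when the user reconnects, the sender side no longer needs to retry indefinitely.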
Optimizations
- Replication of transient storage, and a CDN for media (images / videos)
- To fetch new messages, use the last read message as a pointer and fetch all messages with a greater sequence number
- Random authentication and challenge
- Proactive server restore and key refresh to prevent Byzantine attacks
- Integration with SMS gateway
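The "last message as pointer" optimization can be sketched as a log with increasing sequence numbers. The client remembers the last sequence number it read and asks only for what came after. MessageLog and its method names are hypothetical.

```python
import bisect


class MessageLog:
    """Append-only log where each message gets the next sequence number."""

    def __init__(self):
        self._seqs = []
        self._messages = []

    def append(self, text):
        seq = len(self._seqs) + 1
        self._seqs.append(seq)
        self._messages.append(text)
        return seq

    def fetch_after(self, last_seq):
        """Return (seq, message) pairs with seq greater than last_seq."""
        start = bisect.bisect_right(self._seqs, last_seq)
        return list(zip(self._seqs[start:], self._messages[start:]))


log = MessageLog()
for text in ("one", "two", "three", "four"):
    log.append(text)
new_messages = log.fetch_after(2)  # the client last read sequence 2
```

The binary search means the cost of a fetch depends on how many messages are new, not on how long the conversation is.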
Rich Communication Suite
I have discussed more add-on features for IM in terms of RCS (Rich Communication Suite).
RCS (Rich Communication Suite)
For the past few weeks, I’ve been trying to find the answer to this one. After much information gathering, I understood that the majority of communication platform providers, mostly OTT ones such as iMessage from Apple and RBM (RCS Business Messaging), already support these features. And it is partially a term coined by Google…