System Design Interview: Real-Time Chat App (2026)

The “design a real-time chat application” question remains one of the most frequently asked system design prompts at companies like Meta, Google, Slack, Discord, and Microsoft in 2026. It tests nearly every muscle a senior engineer needs: API design, data modeling, networking, consistency trade-offs, and horizontal scaling. In this guide, we’ll walk through a complete, interview-ready solution you can adapt on the spot.

Step 1: Clarify Requirements Before You Draw Anything

Strong candidates spend the first 5 minutes scoping. Resist the urge to sketch boxes immediately. Instead, nail down functional and non-functional requirements.

Functional Requirements

Assume a WhatsApp-scale product: one-to-one messaging, group chats up to 500 members, online/last-seen presence, read receipts, media attachments, and push notifications when the recipient is offline. Message history must be durable and searchable.

Non-Functional Requirements

Target 500 million daily active users sending an average of 40 messages each: 20 billion messages per day, or roughly 230,000 messages per second on average, with peak traffic typically several times higher. End-to-end latency should stay under 200 ms in-region. The system must be highly available (99.99%), and messages must never be lost once acknowledged.
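
The 230,000 figure is what you get by spreading the day's messages evenly over 86,400 seconds. A minimal sketch of that arithmetic follows; the PEAK_FACTOR of 3 is an assumption for illustration, not a number from the requirements above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_messages_per_second(daily_active_users: int, messages_per_user: int) -> float:
    """Average send rate if traffic were spread evenly across the day."""
    return daily_active_users * messages_per_user / SECONDS_PER_DAY


daily_messages = 500_000_000 * 40
avg_rate = average_messages_per_second(500_000_000, 40)

# Assumption: peak traffic is about 3x the daily average. Tune to your product.
PEAK_FACTOR = 3
peak_rate = avg_rate * PEAK_FACTOR

print(f"{daily_messages:,} messages/day")
print(f"{avg_rate:,.0f} messages/s on average")
print(f"{peak_rate:,.0f} messages/s at an assumed {PEAK_FACTOR}x peak")
```

In an interview, state the peak multiplier out loud as an assumption and size the write path for the peak, not the average.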

Step 2: Back-of-the-Envelope Estimates

Interviewers love numeric reasoning. At 20 billion messages per day with an average payload of 200 bytes, you’re storing roughly 4 TB of raw text daily, or about 1.5 PB per year before replication. Media attachments dominate storage: assuming 10% of messages include a 500 KB image, that’s 2 billion images and an additional 1 PB per day flowing into object storage, roughly 250 times the daily text volume. Plan capacity around media, not text.
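
The storage estimates above can be checked in a few lines. This sketch uses decimal units (1 TB = 10^12 bytes) and integer arithmetic; the constant names are illustrative.

```python
MESSAGES_PER_DAY = 20_000_000_000
AVG_TEXT_BYTES = 200
MEDIA_PERCENT = 10           # share of messages carrying an image
AVG_MEDIA_BYTES = 500_000    # 500 KB, decimal

TB = 10**12
PB = 10**15

text_bytes_per_day = MESSAGES_PER_DAY * AVG_TEXT_BYTES
text_bytes_per_year = text_bytes_per_day * 365

media_messages_per_day = MESSAGES_PER_DAY * MEDIA_PERCENT // 100
media_bytes_per_day = media_messages_per_day * AVG_MEDIA_BYTES

print(f"text:  {text_bytes_per_day / TB:.1f} TB/day, {text_bytes_per_year / PB:.2f} PB/year")
print(f"media: {media_bytes_per_day / PB:.1f} PB/day")
print(f"media is {media_bytes_per_day // text_bytes_per_day}x the text volume")
```

All figures are before replication; multiply by the replication factor you choose for each store.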

Step 3: High-Level Architecture

The backbone consists of seven components working together:

Client SDKs on iOS, Android, and Web that maintain a persistent WebSocket connection.
Load balancers (Layer 4 for WebSockets, Layer 7 for REST) fronting the edge.
Connection / Gateway servers that terminate WebSockets and map user IDs to sessions.
Messaging service that validates, fans out, and persists messages.
Storage layer: Cassandra or ScyllaDB for messages, Redis for presence, S3 for media.
Kafka as the durable event bus between services.
Notification service that integrates with APNs/FCM for offline delivery.

Step 4: The Message Delivery Flow

When Alice sends a message to Bob, her client publishes a SEND frame over her open WebSocket to the nearest gateway. The gateway stamps the message with a Snowflake-style 64-bit ID (timestamp + shard + sequence), writes it to Kafka, and returns an ACK to Alice only after Kafka has confirmed the write, so that an acknowledged message is never lost. A downstream consumer persists the message to Cassandra using conversation_id as the partition key and message_id as the clustering key, which gives cheap appends and cheap reverse-chronological reads.
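
A Snowflake-style ID packs the three fields into one sortable integer. The sketch below is one possible generator, not a reference implementation: the 41/10/12 bit split, the EPOCH_MS value, and the class name are assumptions chosen for illustration.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000   # hypothetical custom epoch, in milliseconds
SHARD_BITS = 10
SEQUENCE_BITS = 12
MAX_SHARD = (1 << SHARD_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Generates strictly increasing IDs: timestamp | shard | sequence."""

    def __init__(self, shard_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= shard_id <= MAX_SHARD:
            raise ValueError(f"shard_id must be in [0, {MAX_SHARD}]")
        self._shard_id = shard_id
        self._clock = clock          # injectable so tests can freeze time
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now = self._last_ms + 1
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (SHARD_BITS + SEQUENCE_BITS))
                | (self._shard_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Because the timestamp occupies the high bits, sorting by message_id within one generator sorts by time. Across gateways the order is only as good as clock synchronization, which is worth saying out loud in the interview.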

The same consumer looks up Bob’s active gateway in Redis. If Bob is online, the message is pushed over his WebSocket; if not, the notification service is invoked. Using Kafka as the spine decouples write latency from fan-out and gives you a replayable log for recovery.
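
The consumer's decision can be sketched as follows. Everything here is a stand-in: plain dicts and lists replace Redis, Cassandra, the gateway push, and APNs/FCM, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Message:
    message_id: int
    conversation_id: str
    recipient_id: str
    body: str


@dataclass
class FanOutConsumer:
    """Hypothetical Kafka consumer with in-memory stand-ins for its dependencies."""
    presence: Dict[str, str] = field(default_factory=dict)         # user_id -> gateway_id (Redis)
    store: Dict[str, List[Message]] = field(default_factory=dict)  # conversation_id -> rows (Cassandra)
    pushed: List[Tuple[str, int]] = field(default_factory=list)    # (gateway_id, message_id)
    notified: List[Tuple[str, int]] = field(default_factory=list)  # (user_id, message_id)

    def handle(self, msg: Message) -> str:
        # Persist before delivering, so history is complete even if delivery fails.
        self.store.setdefault(msg.conversation_id, []).append(msg)

        gateway_id = self.presence.get(msg.recipient_id)
        if gateway_id is not None:
            # Recipient is online: hand the message to the gateway holding the socket.
            self.pushed.append((gateway_id, msg.message_id))
            return "pushed"

        # Recipient is offline: fall back to a push notification.
        self.notified.append((msg.recipient_id, msg.message_id))
        return "notified"


consumer = FanOutConsumer(presence={"bob": "gw-12"})
print(consumer.handle(Message(1, "alice:bob", "bob", "hi")))      # bob is online
print(consumer.handle(Message(2, "alice:carol", "carol", "hey"))) # carol is offline
```

A real consumer would commit its Kafka offset only after both the persist and the delivery attempt have completed, so that a crash leads to a replay, not a gap.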

Step 5: Scaling WebSocket Connections

A single modern gateway node (16 vCPU, 32 GB RAM) can comfortably hold 500,000–1,000,000 idle WebSocket connections. For 100 million concurrent users, you need roughly 100–200 gateway nodes in total, before adding headroom for failover and deploys. Sticky routing via consistent hashing on user_id means a reconnecting client lands on the same gateway shard as long as the node set is unchanged, which simplifies presence tracking.
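
To make the routing idea concrete, here is a small consistent-hash ring with virtual nodes, plus the node-count arithmetic. It is a sketch under assumptions: the gateway names, the choice of SHA-256, and 100 virtual nodes per gateway are illustrative.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def gateway_nodes_needed(concurrent_users: int, connections_per_node: int) -> int:
    """Ceiling division: nodes required before any headroom."""
    return -(-concurrent_users // connections_per_node)


class ConsistentHashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 100):
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []   # sorted (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, user_id: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the user's hash, wrapping at the end.
        idx = bisect.bisect_right(self._ring, (self._hash(user_id),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["gw-1", "gw-2", "gw-3", "gw-4"])
print(ring.node_for("user-42") == ring.node_for("user-42"))  # same user, same gateway
print(gateway_nodes_needed(100_000_000, 500_000), gateway_nodes_needed(100_000_000, 1_000_000))
```

The property worth stating in the interview: when a gateway is removed, only the users that were mapped to it move; everyone else keeps their gateway.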

Step 6: Data Modeling Deep-Dive

The messages table in Cassandra typically uses (conversation_id) as the partition key and message_id as the clustering key in DESC order. This layout keeps a conversation’s messages co-located on disk, enabling fast pagination. A secondary user_conversations table indexes the inbox view per user. For full-text search, stream writes into Elasticsearch via a Kafka sink connector.
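
The access pattern this layout supports can be modeled in memory. The sketch below is not Cassandra code: a dict plays the role of partitions, a sorted list plays the role of the clustering order, and the class and method names are hypothetical. It stores rows in ascending order and reverses on read, whereas the table described above would store them in DESC order.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str]  # (message_id, body)


class MessageTable:
    """Toy model: partition by conversation_id, cluster by message_id."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, conversation_id: str, message_id: int, body: str) -> None:
        # Rows stay sorted by message_id within their partition.
        bisect.insort(self._partitions[conversation_id], (message_id, body))

    def page(self, conversation_id: str, limit: int,
             before_id: Optional[int] = None) -> List[Row]:
        """Newest-first page; pass the last seen message_id as the cursor."""
        rows = self._partitions.get(conversation_id, [])
        end = len(rows) if before_id is None else bisect.bisect_left(rows, (before_id,))
        start = max(0, end - limit)
        return rows[start:end][::-1]


table = MessageTable()
for i in range(1, 6):
    table.insert("alice:bob", i, f"m{i}")

first_page = table.page("alice:bob", limit=2)
second_page = table.page("alice:bob", limit=2, before_id=first_page[-1][0])
print(first_page)
print(second_page)
```

Using the last message_id as the cursor keeps pagination stable while new messages arrive, which offset-based paging would not.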

Step 7: Trade-offs Interviewers Probe

Expect follow-up questions on the CAP theorem. Chat systems usually prioritize availability and partition tolerance: a temporarily stale read receipt is acceptable, but a dropped message is not. Discuss how you’d use quorum writes (LOCAL_QUORUM) to balance durability and latency, and idempotency keys to tolerate client retries without duplicating messages. For ordering messages in a multi-region group chat, a hybrid logical clock (HLC) assigned at the gateway gives you monotonic timestamps that respect causality and stay close to wall-clock time, with minimal coordination.
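
For readers who have not seen a hybrid logical clock, here is a compact sketch of the usual update rules: a wall-clock component plus a counter that breaks ties. The class and method names are illustrative, and the sketch is not thread-safe.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class HlcTimestamp:
    wall_ms: int   # highest physical time seen so far
    counter: int   # tie-breaker within the same wall_ms


class HybridLogicalClock:
    def __init__(self, physical_clock=lambda: int(time.time() * 1000)):
        self._now = physical_clock   # injectable so tests can control time
        self._wall = 0
        self._counter = 0

    def tick(self) -> HlcTimestamp:
        """Timestamp a local event, such as a message arriving at this gateway."""
        pt = self._now()
        if pt > self._wall:
            self._wall, self._counter = pt, 0
        else:
            self._counter += 1
        return HlcTimestamp(self._wall, self._counter)

    def observe(self, remote: HlcTimestamp) -> HlcTimestamp:
        """Merge a timestamp received from another node, then timestamp the receipt."""
        pt = self._now()
        new_wall = max(self._wall, remote.wall_ms, pt)
        if new_wall == self._wall and new_wall == remote.wall_ms:
            counter = max(self._counter, remote.counter) + 1
        elif new_wall == self._wall:
            counter = self._counter + 1
        elif new_wall == remote.wall_ms:
            counter = remote.counter + 1
        else:
            counter = 0
        self._wall, self._counter = new_wall, counter
        return HlcTimestamp(self._wall, self._counter)


fast = HybridLogicalClock(physical_clock=lambda: 100)
slow = HybridLogicalClock(physical_clock=lambda: 90)   # this node's clock lags
sent = fast.tick()
received = slow.observe(sent)
print(sent < received)  # the reply sorts after the message despite the lagging clock
```

Two gateways can still produce equal timestamps for unrelated events, so a total order needs an extra tie-breaker such as a gateway ID appended to the timestamp.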

Step 8: Reliability and Failure Modes

Walk through three failure scenarios: a gateway crash mid-send (Kafka retains the message, a background job retries fan-out), a regional outage (active-active replication across two regions with conflict-free merge), and a thundering-herd reconnection storm after a deploy (exponential backoff with jitter baked into the client SDK).
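
The reconnection-storm mitigation is easy to show in code. This sketch uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the base and cap values and the function names are assumptions.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def reconnect(connect: Callable[[], bool], sleep: Callable[[float], None],
              max_attempts: int = 8, rng=random) -> int:
    """Call connect() until it succeeds; return the number of attempts used.

    connect and sleep are injected so the loop can be tested without a network.
    """
    for attempt in range(max_attempts):
        if connect():
            return attempt + 1
        sleep(backoff_delay(attempt, rng=rng))
    raise ConnectionError(f"gave up after {max_attempts} attempts")


# Example: a fake connection that succeeds on the third try.
outcomes = iter([False, False, True])
delays = []
attempts = reconnect(lambda: next(outcomes), delays.append, rng=random.Random(7))
print(attempts, len(delays))
```

The jitter matters more than the exponent: without it, every client that disconnected at the same moment retries at the same moment, and the herd simply arrives in waves.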

Step 9: Practicing Out Loud Is the Real Unlock

Reading a walkthrough is only half the battle — you need to rehearse explaining these trade-offs under time pressure. Modern tools like Niraswa AI can listen to mock interviews and surface personalized hints in real time, which is especially useful for practicing system design narratives where pacing and structure matter as much as content.

Step 10: A 10-Minute Interview Script

If you only have ten minutes to present, structure your answer as: 1 minute on requirements, 1 minute on estimates, 3 minutes on high-level architecture, 2 minutes on data model, 2 minutes on scaling bottlenecks, and 1 minute on failure modes. Drive the whiteboard; don’t let the interviewer drive you.

Final Thoughts

A real-time chat system is deceptively deep. The candidates who stand out in 2026 are the ones who anchor every design choice to a concrete requirement and can defend their trade-offs with numbers. Master this one question and you’ll have frameworks that transfer to feed, ride-sharing, and collaborative editor problems.

Ready to sharpen your system design skills further? Block 30 minutes tonight, grab a whiteboard, and design this system end-to-end out loud. Do it three times this week and the next interview will feel familiar.