System Design

Scalability & Architecture

Scalability Patterns

Horizontal scaling — add more servers; requires stateless app (store sessions in Redis, not in-process)
Load balancing — distribute traffic (round-robin, least-connections, IP hash for sticky sessions)
Caching — CDN (static), Redis/Memcached (app), browser cache, HTTP cache headers
Database scaling — read replicas (read from replica, write to primary), sharding (horizontal partitioning), vertical scaling
Async processing — offload slow tasks (email, image processing) to queues (SQS, RabbitMQ, Bull)
CDN — serve static assets and cacheable content from edge locations near users
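
The three load-balancing strategies named above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer implements them; the class names, the `pick`/`release` methods, and the `app-1`..`app-3` server names are all hypothetical.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # caller must call release() when done
        return server

    def release(self, server):
        self.active[server] -= 1


def ip_hash(servers, client_ip):
    """Map a client IP to the same server on every call (sticky sessions)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["app-1", "app-2", "app-3"]  # hypothetical server names
rr = RoundRobin(servers)
print([rr.pick() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
```

Note that plain modulo hashing remaps most clients whenever the server list changes; that is one reason sticky sessions are usually a fallback and not a substitute for a stateless app.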

CAP Theorem

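A distributed system can guarantee at most 2 of 3: Consistency (every read gets the most recent write), Availability (every request gets a non-error response), Partition Tolerance (the system keeps working despite network partitions). Since network partitions cannot be ruled out, the real choice is what to give up while one is happening: CP systems (commonly PostgreSQL, MongoDB) may reject or stall requests to stay consistent, while AP systems (DynamoDB, Cassandra in eventual consistency mode) keep responding but may return stale data.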

Common Components

Client → DNS → CDN (static/cached) → Load Balancer → App Servers → Cache → Database
                                                    ↓
                                             Message Queue → Workers

Key components:
┌──────────────────────────────────────────────────────────┐
│ DNS          — Route 53, Cloudflare (geo-routing, failover)│
│ CDN          — CloudFront, Fastly (static assets, caching) │
│ Load Balancer— ALB, Nginx (health checks, SSL termination) │
│ App Servers  — stateless, auto-scaling group               │
│ Cache Layer  — Redis (sessions, hot data, rate limiting)   │
│ Primary DB   — PostgreSQL, MySQL (writes)                  │
│ Read Replica — for read-heavy queries                      │
│ Object Store — S3 (uploads, large files, backups)          │
│ Message Queue— SQS, RabbitMQ (async jobs, decoupling)      │
│ Workers      — process jobs (emails, image resize, reports)│
│ Search       — Elasticsearch (full-text, faceted search)   │
│ Monitoring   — Datadog, CloudWatch (metrics, alerts)       │
│ Log Aggregator— Elasticsearch/Loki + Grafana               │
└──────────────────────────────────────────────────────────┘

Database Selection Guide

PostgreSQL/MySQL — structured data, complex queries, ACID, strong consistency
MongoDB — flexible schema, documents with nested data, rapid iteration
Redis — caching, sessions, rate limiting, leaderboards, pub/sub
Elasticsearch — full-text search, log analytics, faceted filtering
Cassandra/DynamoDB — massive scale write-heavy, time-series, multi-region
S3/Object Store — files, images, videos, backups, data lake

Reliability & Interview Framework

Reliability Patterns

Circuit Breaker — stop calling a failing service after N failures; half-open state to probe recovery
Retry with backoff — retry transient failures with exponential backoff + jitter
Timeout — always set timeouts on outbound calls; fail fast, don't let threads pile up
Bulkhead — isolate resources per service so one slow service doesn't exhaust all thread pools
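Idempotency — make retries safe by ensuring duplicate requests produce the same result (e.g. a client-supplied idempotency key)
Health checks — /health reports liveness (the process is up) for the load balancer; /ready reports readiness (the instance can accept traffic)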
Graceful shutdown — finish in-flight requests before shutting down; drain connections
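
As a rough illustration of the first two patterns, here is a single-threaded sketch of a circuit breaker with a half-open probe, plus a helper that computes exponential backoff delays with full jitter. The names (`CircuitBreaker`, `CircuitOpenError`, `backoff_delays`) and the default thresholds are assumptions chosen for the example, not taken from any library.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before probing
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise half-open: let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed probe.
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

A caller would typically wrap each outbound request in `breaker.call(...)` and sleep for the next value from `backoff_delays(...)` between retries. A production version would also need locking so that only one probe runs at a time in the half-open state.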

System Design Interview — RESHADED Framework

1. Requirements clarification (5 min)
   - Functional: what does the system DO?
   - Non-functional: scale, latency, availability, consistency
   - Out of scope: what are we NOT building?
   - "How many users? DAU? Writes/reads per second?"

2. Estimation (3 min)
   - Average RPS = DAU × avg requests per user per day ÷ 86,400 seconds
   - Storage: object size × writes/day × retention

3. System Interface (2 min)
   - Key API endpoints / data model

4. High-Level Design (10 min)
   - Draw the major components: client, API gateway, services, DB, cache, queue
   - Data flow for the main use cases

5. Detailed Design (15 min)
   - Deep-dive the interesting/hard parts
   - Database schema, sharding strategy, cache invalidation

6. Bottlenecks & Trade-offs (5 min)
   - Single points of failure
   - What breaks at 10x scale?
   - Cost vs performance trade-offs

Common numbers to know:
- Read from memory: ~100ns
- Read from SSD: ~100µs (1000× slower)
- Network round trip: ~0.5ms within a datacenter; ~10ms to 150ms across regions
- 1 million requests/day = ~12 RPS
- 1 billion requests/day = ~12,000 RPS
- Average web request: ~1KB
- 1 million users × 1KB each = ~1GB of user data
- 99.9% availability = ~8.8 hours downtime/year
- 99.99% availability = 52 minutes downtime/year
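
The estimation arithmetic can be checked with a few helper functions. This is a back-of-the-envelope sketch; the function names and the example workload (1M DAU, 10 requests per user per day, 1 KB objects, one-year retention) are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(dau, requests_per_user_per_day):
    """Average requests per second over a full day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def storage_bytes(object_size_bytes, writes_per_day, retention_days):
    """Raw storage needed, ignoring replication and indexes."""
    return object_size_bytes * writes_per_day * retention_days


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime for a given availability target."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_percent / 100)


# Hypothetical workload: 1M DAU, 10 requests each, 1 KB objects kept for a year.
print(round(average_rps(1_000_000, 10)))              # 116
print(storage_bytes(1_000, 10_000_000, 365) / 1e12)   # 3.65 (TB)
print(round(downtime_minutes_per_year(99.99), 1))     # 52.6
```

These are averages; peak traffic is usually a multiple of the daily average, so capacity estimates should leave headroom.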

Caching Strategy Reference

Cache-aside (lazy loading) — app checks cache, on miss fetches from DB and writes to cache. Simple, but initial miss is slow
Write-through — write to cache AND DB synchronously. Cache always up-to-date, but slower writes
Write-behind (write-back) — write to cache first, async write to DB. Fast writes, risk of data loss
Read-through — cache handles DB fetch on miss. Simpler app code, cache library manages it
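
Cache-aside is the strategy most often written by hand, so a small sketch may help. It assumes an in-process dict standing in for Redis and a caller-supplied `load_from_db` function; both, along with the `CacheAside` class name and the 60-second TTL, are placeholders for illustration.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db  # hypothetical DB read function
        self.ttl = ttl_seconds
        self.clock = clock                # injectable so tests can fake time
        self._entries = {}                # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]               # cache hit
        value = self.load_from_db(key)    # miss or expired: read from the DB
        self._entries[key] = (value, self.clock() + self.ttl)
        return value

    def invalidate(self, key):
        """Call after writing to the DB so the next read reloads the value."""
        self._entries.pop(key, None)
```

The write path here updates the DB and then calls `invalidate(key)` instead of writing the new value into the cache; deleting on write is a common way to keep cache-aside simple, at the cost of one extra miss after each update.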
