Question 1

How does Amazon deploy code in production without logging users out?

Accepted Answer

Sessions are NOT kept in server memory; they live in an external store like Redis or DynamoDB. Servers become stateless, so killing one mid-deploy doesn't log anyone out. The load balancer lets in-flight requests on the old server finish, then routes the next request to a fresh server, which fetches the session from the store. A rolling deployment replaces one server (or one small batch) at a time so traffic never stops. Replicating the session store across 3+ nodes makes session loss unlikely, though not impossible: Redis replication is asynchronous by default, so a failover can drop the most recent writes.

💡 Real-life example

🏪 Stateless servers + external session store + rolling deployment = zero-downtime at Amazon scale. The session 'follows the user', not the server.
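
A minimal sketch of the idea that the session follows the user, not the server. `SessionStore` is a hypothetical in-memory stand-in for Redis or DynamoDB, and `AppServer` is an illustrative name, not a real framework class.

```python
import uuid


class SessionStore:
    """Stand-in for an external store such as Redis; a plain dict here."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, session):
        self._data[session_id] = session

    def get(self, session_id):
        return self._data.get(session_id)


class AppServer:
    """Stateless server: it holds a reference to the store, never the sessions."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user_id):
        session_id = str(uuid.uuid4())
        self.store.put(session_id, {"user_id": user_id})
        return session_id

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user_id"] if session else None


store = SessionStore()
old_server = AppServer("v1", store)
sid = old_server.login("alice")
del old_server                        # the deploy kills the old instance
new_server = AppServer("v2", store)
user = new_server.whoami(sid)         # the session survives the restart
```

Because neither server object keeps the session itself, replacing `v1` with `v2` changes nothing from the user's point of view.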

Question 2

Your auto-scaling works perfectly — why do users still face downtime during traffic spikes?

Accepted Answer

Auto-scaling reacts AFTER traffic arrives. Booting a new EC2 instance, Kubernetes scheduling, container startup, health checks and load balancer registration all add up to minutes of delay. Existing servers get overloaded in that gap, triggering a dangerous cycle: slow responses → more retries → more traffic → bigger overload. Senior engineers combine four levers: predictive scaling (scale BEFORE expected spikes), warm standby (keep extra capacity hot), queue buffering (absorb spikes safely), and aggressive caching (so the DB isn't the bottleneck).

💡 Real-life example

🎟️ Concert ticket releases: pure reactive scaling = the site crashes for 10 minutes. Predictive scaling + warm pool + Redis cache = zero downtime.

Question 3

How would you design an API that survives 1 million requests per second?

Accepted Answer

Layered architecture: Load balancer distributes traffic across 200+ servers → horizontal scaling adds instances elastically → Redis caches hot data so we never hit the DB for it → CDN serves static assets from the edge → Kafka/SQS offloads emails, analytics and heavy processing async → DB is sharded for parallel writes → rate limiting blocks abusive clients → payloads are lean to cut bandwidth and latency.

💡 Real-life example

🚀 1M RPS recipe: Load Balancer + Horizontal Scale + Cache + CDN + Async Queues + Sharding + Rate Limiting + Lean Payloads. No single layer carries the load.

Question 4

Indexes make queries faster — so why not index every column?

Accepted Answer

Every index has a write cost. On INSERT/UPDATE/DELETE the database must update all related indexes — extra disk writes, page splits, lock contention. Also: storage cost, query optimizer confusion (may pick the wrong index), and indexes on low-cardinality columns (boolean) help almost nothing. Index strategically: columns used in WHERE filters, JOIN conditions, and ORDER BY sorts. Look at actual query plans, not guesses.

💡 Real-life example

📚 An index on `users.email` (high cardinality, used in WHERE) → big win. An index on `users.is_active` (boolean, two values) → planner usually ignores it.
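
To make "look at actual query plans" concrete, here is a small sketch using SQLite from the Python standard library. The table layout and the index name `idx_users_email` are assumptions for illustration; other databases expose the same idea through their own `EXPLAIN` output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO users (email, is_active) VALUES (?, ?)",
    [(f"user{i}@example.com", i % 2) for i in range(1000)],
)


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT * FROM users WHERE email = ?"
before = plan(query, ("user42@example.com",))   # no index yet: full scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query, ("user42@example.com",))    # planner now searches the index
```

Comparing `before` and `after` shows the plan switching from a table scan to an index search, which is the evidence to look for before adding any index.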

Question 5

Your database has 500 million rows. Queries take 30 seconds. No code change allowed. What do you do?

Accepted Answer

Start with partitioning on the same machine: horizontal splits rows by date or region so queries hit only one partition; vertical splits heavy columns into separate tables; hash partitioning by hash(user_id) % N gives even distribution with no hotspots. If that's not enough, move to sharding across machines: each server holds one shard, queries hit one shard only. Sharding adds real complexity — cross-shard queries are expensive, resharding is painful, distributed transactions are hard. Always partition first; shard only when single-server capacity is exhausted.

💡 Real-life example

📅 A 500M-row 'orders' table partitioned by year means a 'last month' query scans only the current year's ~40M rows instead of all 500M: roughly 12x less data to read, zero code change.

Question 6

What is the N+1 query problem and how do you fix it?

Accepted Answer

N+1 happens when one query fetches N parent records and then triggers N more queries — one per parent — to load related data. Common with ORM lazy loading inside loops. Example: 1 query for 100 users + 100 queries for their orders = 101 queries instead of 1. Fix it with JOIN FETCH / EntityGraph (eager loading), batch fetching using `WHERE id IN (...)`, or aggregating at the DB level with GROUP BY. Always log generated SQL in development.

💡 Real-life example

🛒 An admin orders dashboard goes from 101 queries (1.2s) to 1 query (40ms) by replacing the for-loop with a single JOIN.
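
The 101-versus-1 arithmetic can be reproduced with a short sketch. It uses SQLite and a hand-rolled `run` helper that counts queries, standing in for the SQL logging an ORM would provide; the schema is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                 [(i, 10.0 * i) for i in range(1, 101)])

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, like SQL logging would in development."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# N+1: one query for the users, then one more per user inside the loop.
counter["queries"] = 0
lazy = {}
for user_id, name in run("SELECT id, name FROM users"):
    lazy[name] = run("SELECT total FROM orders WHERE user_id = ?", (user_id,))
n_plus_one_queries = counter["queries"]

# Fix: a single JOIN returns the same data in one round trip.
counter["queries"] = 0
joined = run("""
    SELECT u.name, o.total
    FROM users u JOIN orders o ON o.user_id = u.id
""")
join_queries = counter["queries"]
```

The loop issues 101 queries and the JOIN issues one, for the same 100 rows of data.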

Question 7

Your production database is down. Users are active. What do you do in the next 5 minutes?

Accepted Answer

Minute 1: confirm the outage from monitoring dashboards, not just user reports; Twitter and screenshots lie. Minute 2: reduce impact with a maintenance page or cached responses where possible. Minute 3: failover by promoting a replica or switching to the standby (a read replica cannot accept writes until it is promoted). Minute 4: identify the root cause: recent deployment? disk full? deadlock? cloud-provider incident? Minute 5: communicate clearly to DevOps, stakeholders, and the public status page.

💡 Real-life example

🚨 Confirm → Contain → Failover → Debug → Communicate. In production incidents, calm, clear communication is leadership.

Question 8

JWT tokens are never stored in the database — how does the server know they're valid?

Accepted Answer

A JWT is self-contained: `header.payload.signature`. The server takes the header + payload, recomputes the HMAC with its SECRET_KEY, and compares the result with the signature on the token (that is the HS256 case; with asymmetric algorithms such as RS256 the server checks the signature against a public key instead). Match = valid, no DB call, instant. The payload carries userId, roles, and an `exp` (expiry) claim that the server checks. Trade-off: you can't easily revoke a token before it expires. Fix in production: short expiry (15 min) + refresh token. For logout: a Redis blacklist of invalidated token IDs, which does bring back one lookup per request.

💡 Real-life example

🎫 Like a concert ticket with a hologram. The venue doesn't call HQ to verify each one — they look at the hologram. JWT signature is the hologram.
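
The signature check can be sketched with the standard library alone. This is an illustration of the HS256 case under stated assumptions: `SECRET_KEY` is a hypothetical demo value, and `sign`/`verify` are illustrative helpers, not a replacement for a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"demo-secret"  # hypothetical; load from a secret manager in practice


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload: dict) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (b64url(json.dumps(header).encode()) + "."
                     + b64url(json.dumps(payload).encode()))
    signature = hmac.new(SECRET_KEY, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if signature, algorithm and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        return None
    expected = hmac.new(SECRET_KEY, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        return None                      # payload or header was tampered with
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        return None                      # only accept the algorithm we issue
    payload = json.loads(b64url_decode(payload_b64))
    if payload.get("exp", 0) <= (time.time() if now is None else now):
        return None                      # expired
    return payload
```

Note that `verify` never touches a database: everything it needs is in the token and the server's key.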

Question 9

You need authentication for your app — JWT or OAuth?

Accepted Answer

JWT is a token FORMAT — a signed container for identity data, verifiable without a DB lookup. OAuth is an authorization PROTOCOL — it lets one app access resources from another on a user's behalf. They are not alternatives; JWT is often used as the access token inside an OAuth flow. Use plain JWT when you control both client and server. Use OAuth when users log in via Google/GitHub/Apple, or when your app needs to call another service's API on the user's behalf.

💡 Real-life example

🔑 JWT alone = your house key. OAuth = the doorman of a hotel that hands you a JWT keycard. Different layers of the same problem.

Question 10

How do you store passwords securely in 2025?

Accepted Answer

Use bcrypt or Argon2. Both are intentionally slow, which defeats brute-force on stolen hashes. Add a unique salt per user so two users with the same password produce different hashes (rainbow tables become useless); bcrypt and Argon2 libraries generate the salt for you and store it inside the hash string. On login: re-compute the hash of the submitted password with the stored salt and compare it with the stored hash. There is nothing to decrypt, because a hash is one-way. Layer in rate-limiting per IP + account lockout after N failed attempts.

💡 Real-life example

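🛡️ A leak of 1M bcrypt password hashes is far less useful to an attacker: with a sensible work factor, brute-forcing a single strong password can take years even on high-end GPUs (weak passwords still fall to dictionary attacks). Unsalted MD5 hashes of common passwords? Cracked in seconds.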
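
A sketch of the salt-then-slow-hash flow. bcrypt and Argon2 are third-party packages in Python, so this example swaps in PBKDF2 from the standard library as a stand-in; it is also deliberately slow, but it is not the algorithm recommended above. The iteration count is an assumption to tune for your hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumption: raise or lower until one hash is noticeably slow


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, digest); a fresh random salt is generated per user."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest


def verify_password(password, salt, stored, iterations=ITERATIONS):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt, iterations)
    return hmac.compare_digest(candidate, stored)
```

Store the salt, the digest and the iteration count together so the work factor can be raised later without locking out existing users.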

Question 11

If caching makes things faster, why not cache everything?

Accepted Answer

Caches help when the data is read-heavy and tolerates some staleness. They hurt or add bugs when data changes constantly (cache invalidation becomes a full-time job), when results are unique per request (every entry is written once and never read again), or when storage cost outweighs the latency win. As the saying attributed to Phil Karlton goes, there are only two hard things in computer science: cache invalidation and naming things (the "and off-by-one errors" punchline is a later addition by others). Cache the hot 20% that drives 80% of reads. Everything else: just hit the DB.

💡 Real-life example

🌶️ A homepage product list is perfect to cache (read 10,000×, changes hourly). A user's bank balance is not (read rarely, changes every transaction).

Question 12

Explain cache eviction policies — LRU, LFU, FIFO and when to use each.

Accepted Answer

LRU (Least Recently Used): evicts the entry that hasn't been accessed for longest. Best general default, since most workloads have recency locality. LFU (Least Frequently Used): evicts the entry with the fewest accesses. Good when popular items stay popular for a long time. FIFO (First In, First Out): evicts the oldest insertion. Simplest, but ignores access patterns, so it is rarely the right pick. Redis offers approximated LRU and LFU as configurable `maxmemory-policy` options. Memcached uses LRU.

💡 Real-life example

🎬 A movie streaming service: LFU works well for global favorites (Inception is always near the top). LRU works well for a single user's 'continue watching' list.
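
LRU is short enough to sketch in full. This version leans on `collections.OrderedDict`; the class name and the capacity of 2 used when trying it out are illustrative choices.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest access first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used
```

With capacity 2, putting `a` and `b`, reading `a`, then putting `c` evicts `b`, because `a` was touched more recently.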

Question 13

What's the difference between PUT and PATCH?

Accepted Answer

PUT replaces the entire resource. If you PUT `{name: "Sam"}` on a user with name + email + phone, the email and phone get wiped (or set to defaults). PATCH applies a partial update: `{name: "Sam"}` only changes the name. PUT is idempotent by definition (calling twice has the same effect as once). PATCH is not guaranteed to be: a field-replacing patch like this one is idempotent, but a patch that means "append" or "increment" is not. Use PUT when you have the complete resource. Use PATCH when you only want to change specific fields.

💡 Real-life example

📝 PUT = re-print the whole page from scratch. PATCH = whiteout the typo on line 4. Both end the same place; one is much cheaper.
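
The difference fits in two small functions over a dict. This is a sketch with hypothetical handler names; `patch` here does a simple merge-style update, which is only one of several possible PATCH semantics.

```python
def put(store, user_id, body):
    """Replace the whole resource with the request body."""
    store[user_id] = dict(body)


def patch(store, user_id, body):
    """Merge the request body into the existing resource."""
    store[user_id] = {**store[user_id], **body}


original = {"name": "Samuel", "email": "sam@example.com", "phone": "555-0100"}

put_store = {1: dict(original)}
put(put_store, 1, {"name": "Sam"})        # email and phone are gone

patch_store = {1: dict(original)}
patch(patch_store, 1, {"name": "Sam"})    # email and phone are untouched
```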

Question 14

Design rate limiting that works across multiple servers.

Accepted Answer

In-memory counters per server break the moment you have more than one, because a user can hit each instance freely. Move the counter to a shared store like Redis: `INCR key` with a TTL gives you fixed-window limiting in O(1). Better: a token bucket or sliding window log implemented as a Redis Lua script, so each check-and-update is atomic. Add a coarse limit at the edge (CDN, API gateway) so malicious spikes are shed before they reach your servers. Return a clear 429 Too Many Requests with a `Retry-After` header.

💡 Real-life example

🪣 Token bucket: each user gets 100 tokens, refills 10/sec, each request consumes 1. Bursts up to 100 are allowed, the sustained limit is 10 req/sec. Stripe has described using token buckets for its own API rate limiters.
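
The token bucket with 100 tokens refilling at 10 per second can be sketched as follows. This is a single-process illustration: for the multi-server case the two fields `tokens` and `updated_at` would live in the shared store and be updated atomically. The injectable `clock` parameter is an assumption added to make the behaviour easy to test.

```python
import time


class TokenBucket:
    def __init__(self, capacity=100, refill_per_sec=10, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)      # start full so bursts are allowed
        self.updated_at = clock()

    def allow(self):
        """Refill for the time elapsed, then try to spend one token."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                       # caller should answer 429
```

Refilling lazily on each call avoids a background timer: the bucket only needs the timestamp of its last update.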

Question 15

What is an API Gateway and why do microservices need one?

Accepted Answer

An API gateway sits between clients and your microservices. It handles cross-cutting concerns ONCE so each service doesn't have to: authentication (validate JWT), rate limiting, request/response transformation, API versioning, service routing, observability (logging, tracing), and circuit breaking. Without a gateway, every microservice re-implements these or clients call dozens of internal URLs directly. Popular gateways: Kong, AWS API Gateway, Apigee, custom Nginx.

💡 Real-life example

🏛️ Like a hotel concierge: guests don't go knocking on the kitchen, laundry and security doors individually — they call one number and the concierge routes the request.

Question 16

How do you prevent duplicate orders if the user clicks 'Pay' twice (idempotency)?

Accepted Answer

The client generates an `idempotency_key` (UUID) and sends it as a header on the payment request. The server stores `(idempotency_key → result)` in Redis or the DB, reserving the key atomically (for example with a unique constraint) so two simultaneous requests can't both proceed. On a duplicate request, the server returns the cached result instead of re-executing the transaction. Stripe and Square, among other payment APIs, use this pattern.

💡 Real-life example

💳 The user double-clicks 'Pay $99'. Both requests carry the same idempotency key. Server processes the first, caches the result; the second request gets the cached response — one charge, two happy clicks.
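
A sketch of the double-click scenario. `PaymentService` and its in-memory dict are hypothetical stand-ins for a real service backed by Redis or a database table; a production version would need the check and the write to be atomic.

```python
import uuid


class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency_key -> result (Redis or a DB table in practice)
        self.charges = []    # every charge that was actually executed

    def pay(self, idempotency_key, amount_cents):
        if idempotency_key in self._results:
            return self._results[idempotency_key]      # duplicate: replay the result
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        result = {"charge_id": charge_id, "status": "succeeded"}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())              # generated once by the client per purchase
first = service.pay(key, 9900)       # first click
second = service.pay(key, 9900)      # second click, same key
```

Both calls return the same response, and only one charge is recorded.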

Question 17

What's the difference between concurrency, parallelism and async?

Accepted Answer

Concurrency = multiple tasks IN PROGRESS at once (they may take turns). Parallelism = multiple tasks LITERALLY EXECUTING at once (requires multiple CPU cores). Async = a programming model where a task yields control while waiting for I/O so something else can run on the same thread. You can have concurrency without parallelism (single-threaded async like Node.js). You can have parallelism only on multi-core hardware. Async is a TECHNIQUE that enables concurrency cheaply.

💡 Real-life example

🍳 Concurrency: one chef alternating between two pans. Parallelism: two chefs, two pans. Async: the chef sets a timer on the oven and starts chopping veggies while it bakes.
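
The oven-timer analogy maps directly onto `asyncio`. In this sketch `bake` is a hypothetical coroutine and `asyncio.sleep` stands in for waiting on I/O; everything runs on one thread.

```python
import asyncio


async def bake(name, seconds, log):
    log.append(f"{name} started")
    await asyncio.sleep(seconds)      # yields the thread while "waiting on I/O"
    log.append(f"{name} done")


async def kitchen():
    log = []
    # Both tasks are in progress at once, but only one runs at any instant.
    await asyncio.gather(bake("cake", 0.05, log), bake("bread", 0.01, log))
    return log


events = asyncio.run(kitchen())
```

Both tasks start before either finishes, which is concurrency without parallelism: the single thread switches tasks at each `await`.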

Question 18

One microservice goes down. The whole system goes down. What's the fix?

Accepted Answer

That's a cascading failure caused by synchronous coupling. Fix with three layers: (1) Circuit breakers — when a downstream service fails, the breaker opens and we serve a cached/default response instead of waiting. (2) Bulkheads — isolate thread pools per dependency so one slow service can't exhaust all threads. (3) Message queues — turn synchronous calls into async events where possible (publish an event, downstream consumes when ready). The system degrades gracefully instead of going dark.

💡 Real-life example

🚢 Like ship bulkheads: one compartment floods, the others stay sealed. Without them, one leak sinks the whole ship.

Question 19

Kafka vs RabbitMQ — when do you pick which?

Accepted Answer

Kafka: distributed event log. High throughput (millions of events/sec), retains messages on disk for a configurable period (days by default), multiple consumers can replay the same stream. Use for analytics pipelines, event sourcing, real-time data fan-out, audit logs. RabbitMQ: traditional message broker. Strong routing (exchanges, topics), per-message acknowledgements, lower throughput but lower latency per message. Use for task queues, RPC-style messaging, workflows with retries. Rough rule: Kafka for streams of events; RabbitMQ for tasks.

💡 Real-life example

📜 Kafka = a newspaper printing press emitting an immutable log everyone reads. RabbitMQ = a postal sorter delivering specific letters to specific mailboxes.

Question 20

What is the CAP theorem? How do you apply it when designing a system?

Accepted Answer

CAP says a distributed system can guarantee at most TWO of three: Consistency (every node sees the same data), Availability (every request gets a non-error response), Partition tolerance (system keeps working despite network failures). Since network partitions WILL happen, P is mandatory, so you really choose between C and A during a partition. Banking apps choose CP (better to refuse a request than to serve stale or wrong data), social media chooses AP (a stale profile pic is fine), DNS chooses AP. CAP is not a label on a database; it describes BEHAVIOR under partition.

💡 Real-life example

🏦 Bank ATMs (CP): partition? Stop accepting withdrawals. 📱 Instagram likes (AP): partition? Count locally, reconcile later. Same theorem, opposite priorities.

Question 21

What is consistent hashing and why do distributed systems use it?

Accepted Answer

Naive sharding (`hash(key) % N`) breaks when N changes: adding one node remaps almost every key, causing massive rebalancing and cache misses. Consistent hashing places servers and keys on a circular hash ring; each key goes to the nearest server clockwise. Adding a server only moves the keys between it and its predecessor (O(K/N) instead of O(K)). Virtual nodes (each server appears at multiple ring positions) improve balance. Used by Memcached clients (ketama), Cassandra, DynamoDB and many CDNs. Redis Cluster solves the same problem differently, with 16384 fixed hash slots assigned to nodes.

💡 Real-life example

🍰 Imagine 100 cakes split between 4 chefs by hash. Adding a 5th chef with naive sharding reassigns about 80 of them (only keys where `hash % 4 == hash % 5` stay put). With consistent hashing only ~20 cakes move, all of them to the new chef.
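
The cakes-and-chefs comparison can be run as a sketch. The ring below uses MD5 purely as a stable hash and 100 virtual nodes per chef; both choices, and all names, are assumptions for illustration.

```python
import bisect
import hashlib


def stable_hash(text):
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):         # each node appears at many positions
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # First ring position at or after the key's hash, wrapping around.
        index = bisect.bisect(self._ring, (stable_hash(key), ""))
        return self._ring[index % len(self._ring)][1]


keys = [f"cake-{i}" for i in range(1000)]

# Naive sharding: count keys whose shard changes when N goes from 4 to 5.
naive_moved = sum(stable_hash(k) % 4 != stable_hash(k) % 5 for k in keys)

ring = HashRing(["chef-1", "chef-2", "chef-3", "chef-4"])
before = {k: ring.lookup(k) for k in keys}
ring.add("chef-5")
after = {k: ring.lookup(k) for k in keys}
ring_moved = [k for k in keys if before[k] != after[k]]
```

With the ring, every key in `ring_moved` lands on `chef-5` and the existing chefs never trade keys with each other, while `naive_moved` covers most of the key space.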

Question 22

SQL vs NoSQL — when do you choose what?

Accepted Answer

SQL (PostgreSQL, MySQL): strong schema, ACID transactions, JOINs across tables, mature tooling. Best for: financial data, relational domains, transactional systems where consistency matters. NoSQL umbrella: key-value (Redis, DynamoDB), document (MongoDB), column (Cassandra, HBase), graph (Neo4j). Best for: massive scale, denormalized schemas, flexible documents, write-heavy workloads. Modern reality: most systems mix both — Postgres for the source of truth, Redis for caching, Elasticsearch for search, etc.

💡 Real-life example

🏦 A payments ledger belongs in a relational store like Postgres (ACID, JOINs, audit). Twitter has served home timelines from Redis (fast reads, denormalized). Different workloads, different stores, same rule: match the store to the access pattern.

Question 23

Explain data replication strategies — synchronous vs asynchronous vs semi-sync.

Accepted Answer

Synchronous: primary waits for ALL replicas to confirm before returning success. Strong consistency, but a slow replica blocks every write. Asynchronous: primary writes locally and returns immediately; replicas catch up in the background. Fast, but a primary crash can lose un-replicated data. Semi-synchronous: primary waits for AT LEAST ONE replica. A middle ground: durable against the most common single-node failure, fast in the common path. Note the defaults: MySQL replication and PostgreSQL streaming replication are both asynchronous out of the box; semi-sync (MySQL) or waiting on a chosen number of standbys (PostgreSQL `synchronous_standby_names`) is opt-in.

💡 Real-life example

📚 Banking ledgers: synchronous (no data loss tolerated). Social media feed: asynchronous (a few-second lag is invisible). Production DBs that need durability without the full latency cost: semi-sync.

Question 24

Design a URL shortener like bit.ly that handles billions of URLs.

Accepted Answer

Core: generate a short unique key, store the mapping, redirect on lookup. Use Base62 encoding (a-z, A-Z, 0-9) of an auto-increment ID — 6 chars = 56 billion URLs. Store in NoSQL (Cassandra or DynamoDB) sharded by short key. The system is hugely read-heavy (~100:1 read/write ratio), so put a Redis cache in front with TTL — 80% of clicks hit 20% of URLs. CDN-cache the redirect responses at the edge. Async-log every click to Kafka for analytics. Extras: custom aliases (check uniqueness), expiry, abuse detection.

💡 Real-life example

🔗 A bit.ly-scale shortener can serve billions of redirects on this recipe: Base62 keys + sharded NoSQL + Redis cache + CDN + Kafka analytics. The hard part is operational, not algorithmic.
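
The Base62 step and the "6 chars = 56 billion" figure can be checked with a short sketch. The alphabet order (digits, then lowercase, then uppercase) is an arbitrary assumption; any fixed ordering of the 62 symbols works.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def encode(n):
    """Turn a non-negative integer ID into a Base62 short key."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode(key):
    """Turn a Base62 short key back into the integer ID."""
    n = 0
    for ch in key:
        n = n * 62 + ALPHABET.index(ch)
    return n


KEYSPACE_6_CHARS = 62 ** 6   # 56,800,235,584 distinct keys of length <= 6
```

One design caveat: keys derived from a sequential ID are guessable, so some systems shuffle or offset the ID before encoding.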

Question 25

Design a notification system that sends millions of push, email and SMS messages reliably.

Accepted Answer

Architecture: producer services emit notification EVENTS to Kafka (`{userId, type, payload}`). A dispatcher consumes the stream and decides which channels apply based on the user's preferences (push? email? SMS?). Channel-specific workers (push-worker, email-worker, sms-worker) consume their own topics and call external providers (APNS/FCM, SendGrid, Twilio). Retries with exponential backoff. Dead-letter queue for permanent failures. Rate-limit per user to avoid spam. Idempotency key per notification to dedupe re-sends.

💡 Real-life example

📲 Uber sending you 'Driver arrived': one event → fans out to push notification + SMS fallback. If push fails (phone offline) the SMS still arrives because they're separate consumers.
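
The retry-then-dead-letter behaviour of a channel worker can be sketched as a single function. `deliver`, its parameters and the returned status strings are hypothetical; `send` stands in for a call to a provider such as an email or SMS API.

```python
def deliver(send, notification, max_attempts=4, base_delay=1.0, sleep=lambda s: None):
    """Call send(); on failure wait base_delay * 2**attempt and retry.

    Returns (status, delays). The default sleep is a no-op so the sketch runs
    instantly; pass time.sleep in a real worker.
    """
    delays = []
    for attempt in range(max_attempts):
        try:
            send(notification)
            return "delivered", delays
        except Exception:
            if attempt < max_attempts - 1:
                delay = base_delay * 2 ** attempt    # 1s, 2s, 4s, ...
                delays.append(delay)
                sleep(delay)
    return "dead-letter", delays                     # hand off to the DLQ
```

Real workers usually add random jitter to each delay so that many failed notifications don't retry at the same instant.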

Question 26

Design a typeahead (autocomplete) system like Google Search.

Accepted Answer

Backend stores top-K most popular completions for each prefix in a TRIE indexed in Redis. Each request sends the current query; the server walks the trie in O(prefix_length) and returns the top suggestions. To rank: combine global popularity (logs from search) with personalization (user's recent searches). Update the trie incrementally from a Kafka stream of queries (batch every 5 min). Pre-warm Redis with popular prefixes. Latency budget: <100ms p99 — anything slower feels broken.

💡 Real-life example

🔍 Type 'rest' → Google suggests 'restaurants near me' in 50ms. That's a trie + Redis + a frequent popularity update job, not a database query.
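
A sketch of a trie that keeps the top-K completions at every prefix node, so a lookup only walks the prefix. It runs in process memory rather than Redis, the query counts are made up, and `add` assumes each phrase is inserted once with its aggregated count (as a batch rebuild would do).

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []        # up to K (count, phrase) pairs, best first


class Typeahead:
    def __init__(self, k=3):
        self.k = k
        self.root = TrieNode()

    def add(self, phrase, count):
        """Insert a phrase and update the top-K list of every prefix node."""
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((count, phrase))
            node.top.sort(key=lambda pair: (-pair[0], pair[1]))
            del node.top[self.k:]

    def suggest(self, prefix):
        """Walk the prefix in O(len(prefix)) and return precomputed suggestions."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]
```

Precomputing the top-K at write time is what keeps reads inside the latency budget: the request path never sorts or scans.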

Question 27

What is the Circuit Breaker pattern and when should you use it?

Accepted Answer

A circuit breaker protects calls to a flaky downstream service. It tracks failures and, when they cross a threshold, OPENS: future calls are short-circuited and return a cached/default response immediately. After a cool-down it goes HALF-OPEN, tries one request, and either closes (recovery) or re-opens (still broken). This stops cascading failures: instead of every request hanging on a dead service for 30 seconds, they fail in 5ms. Available in resilience4j, .NET Polly, AWS App Mesh, and Netflix Hystrix (now in maintenance mode).

💡 Real-life example

⚡ Like a real circuit breaker: when wiring shorts, the breaker pops immediately instead of letting the whole house burn down. Same idea for distributed services.
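
The CLOSED → OPEN → HALF-OPEN cycle can be sketched in a small class. The threshold of 3 failures, the 30-second cool-down and the injectable `clock` are assumptions, and this single-threaded version skips the locking a real library would need.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means CLOSED

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback          # OPEN: fail fast, skip the downstream call
            # Cool-down elapsed: HALF-OPEN, let this one request through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()    # open, or re-open after a failed trial
            return fallback
        self.failures = 0
        self.opened_at = None            # success closes the breaker
        return result
```

While the breaker is open, `func` is never invoked, which is what turns a 30-second hang into an immediate fallback.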

Question 28

How does Google return search results in under 200ms?

Accepted Answer

It's a layered system, not one database. The query is parsed and corrected against a spell-checker model. An inverted index (term → list of URLs) is sharded across thousands of machines; each shard returns its top candidates in parallel. Results are scored by hundreds of signals (PageRank, freshness, personalization). Edge caches hold popular query results, so a query like 'weather' can be answered from the edge in tens of milliseconds. The entire pipeline runs in parallel, and shards that miss their deadline are dropped rather than waited for (tail latency control).

💡 Real-life example

⚡ A search for 'weather london' touches dozens of services in parallel and answers in 80ms. The 'database' you imagine doesn't exist — it's an inverted index across thousands of machines.

Question 29

What's the difference between stateful and stateless services?

Accepted Answer

Stateless: each request contains all the info needed to handle it; servers don't remember anything between calls. Easy to scale horizontally — add servers freely behind a load balancer. Stateful: the server keeps session/data in memory. Harder to scale because subsequent requests from the same user must reach the same server (sticky sessions) and the server can't be killed without losing state. Modern best practice: keep services stateless and push state to dedicated stores (Redis, DB, Kafka).

💡 Real-life example

📡 REST APIs are stateless (every request carries auth). A real-time game server tracking your in-game position is stateful — kill it and players disconnect.

Question 30

What is Event Sourcing and when should you use it?

Accepted Answer

Instead of storing the CURRENT state of an entity, store every EVENT that changed it: 'OrderPlaced', 'ItemAdded', 'OrderCancelled'. The current state is computed by replaying the events. Benefits: complete audit log (regulatory dream), easy temporal queries ('what did the cart look like 30 min ago?'), natural fit for event-driven systems, can rebuild state from scratch after a bug. Costs: more storage, harder queries (need projections), schema evolution is tricky. Common pair with CQRS — write events, read from materialized projections.

💡 Real-life example

🏦 A bank ledger is event-sourced by definition: never erase a deposit; add a corrective entry. 'Current balance' is just `sum(events)`. Many fintech ledgers follow this pattern.
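
Replaying events into state, including a temporal query, can be sketched with plain tuples. The event names `Deposited` and `Withdrawn` and the amounts in cents are illustrative assumptions.

```python
def apply_event(balance, event):
    """Fold one (kind, amount) event into the running balance."""
    kind, amount = event
    if kind == "Deposited":
        return balance + amount
    if kind == "Withdrawn":
        return balance - amount
    raise ValueError(f"unknown event: {kind}")


def replay(events, upto=None):
    """Rebuild state from the first `upto` events (all of them by default)."""
    balance = 0
    for event in events[:upto]:
        balance = apply_event(balance, event)
    return balance


log = [("Deposited", 10_000), ("Withdrawn", 2_500), ("Deposited", 500)]
log.append(("Withdrawn", 500))    # a correction is a new event; nothing is erased

current = replay(log)             # state now
earlier = replay(log, upto=3)     # state before the correction
```

Because the log is append-only, the pre-correction balance stays available to any later audit.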

Top 30 System Design Interview Questions (Senior Engineer Level)

