Alex Xu Notes
Chapter 1: Scale from Zero to Millions of Users (from System Design Interview – An Insider's Guide by Alex Xu)
This chapter introduces the idea that system design evolves as the number of users grows. What works for
1,000 users may not work for a million. The chapter walks through successive stages of system growth:
Single-server setup:
• All components (web server, database, business logic) are hosted on one machine.
• Simple to build and deploy.
• Bottlenecks: limited scalability and availability.
Scaling the data tier:
• Introduce replication (read replicas), caching (Redis, Memcached), and connection pooling (a read-path sketch follows below).
• Improve performance and reduce database load.
• Focus on optimizing slow queries and indexing.
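As a rough illustration of that read path, here is a minimal cache-aside sketch in Python; the dictionary and the query function are stand-ins for Redis/Memcached and a real read replica, not code from the book:

```python
# Minimal cache-aside read path: check the cache first, fall back to a read
# replica, then populate the cache. The dict and the query function are
# stand-ins for Redis/Memcached and a real replica connection.
import time

CACHE: dict[str, tuple[float, str]] = {}   # key -> (expires_at, value)
CACHE_TTL_SECONDS = 60

def query_read_replica(user_id: str) -> str:
    # Placeholder for a SELECT against a read replica.
    return f"profile-for-{user_id}"

def get_user_profile(user_id: str) -> str:
    entry = CACHE.get(user_id)
    if entry and entry[0] > time.time():
        return entry[1]                     # cache hit: no database work
    value = query_read_replica(user_id)     # cache miss: hit a replica, not the primary
    CACHE[user_id] = (time.time() + CACHE_TTL_SECONDS, value)
    return value

print(get_user_profile("42"))  # first call misses; later calls within the TTL hit the cache
```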
Design should evolve gradually with growth. Premature optimization can overcomplicate the system. Focus
on solving the right problems at the right scale.
Reliability
Reliability refers to the system’s ability to work correctly and consistently over time. It’s crucial for user
trust and business continuity.
1. What is Reliability?
2. Types of Failures
4. Tradeoffs
Key Takeaway:
Design with failure in mind. Build systems assuming something will go wrong, and ensure graceful
degradation or recovery when it does.
Chapter 3: A Framework for System Design Interviews
This chapter presents a step-by-step framework to tackle system design interviews in a structured,
confident, and comprehensive way.
🔶 Why a Framework?
Key Takeaway:
Use a clear framework to handle design interviews like a pro — starting from understanding the problem to
proposing a scalable, maintainable solution.
Chapter 2: Back-of-the-Envelope Estimation
Key metrics to estimate:
• Users: Daily Active Users (DAU), Monthly Active Users (MAU)
• QPS (Queries Per Second): how much traffic your system handles per second
• Storage Needs: how much data your system stores over time
• Bandwidth: how much data is transferred between components
• Read/Write Ratio: determines database and cache optimization needs
1. Define a scenario
✅ Key Takeaways
A Back-of-the-Envelope Estimation Cheat Sheet and Worksheet Template you can use to
practice and structure your answers during system design interviews:
Assume:
• 1 photo upload per user per day
• 1 photo = 2 MB
• 10 views per photo (read-heavy app)
🔹 4. Estimate Storage
🔹 5. Estimate QPS
Assume:
• 1M photo views/hour
• Requests per second = total daily requests / 86,400 seconds
→ ~1,100 QPS (see the worksheet sketch below)
🔹 6. Estimate Bandwidth
Assume:
🔹 7. Read/Write Ratio
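To make the worksheet arithmetic concrete, here is a tiny Python calculation using the assumptions above. The DAU value is not given in these notes and is assumed; 10M DAU happens to reproduce the ~1,100 read QPS figure.

```python
# Worksheet-style estimate for the photo app above. DAU is NOT given in the
# notes; 10M is an assumed placeholder -- swap in your own number.
DAU = 10_000_000
UPLOADS_PER_USER = 1          # photos uploaded per user per day (given)
PHOTO_SIZE_MB = 2             # average photo size (given)
VIEWS_PER_PHOTO = 10          # read-heavy: 10 views per photo (given)
SECONDS_PER_DAY = 86_400

uploads_per_day = DAU * UPLOADS_PER_USER
views_per_day = uploads_per_day * VIEWS_PER_PHOTO

write_qps = uploads_per_day / SECONDS_PER_DAY
read_qps = views_per_day / SECONDS_PER_DAY
storage_per_day_tb = uploads_per_day * PHOTO_SIZE_MB / 1_000_000
egress_per_sec_mb = views_per_day * PHOTO_SIZE_MB / SECONDS_PER_DAY

print(f"write QPS ~{write_qps:,.0f}, read QPS ~{read_qps:,.0f}")
print(f"new storage ~{storage_per_day_tb:,.1f} TB/day, egress ~{egress_per_sec_mb:,.0f} MB/s")
```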
🧠 Final Thoughts
🔹 A. Twitter (X)
🔹 B. YouTube
Both are worked through below, step by step, from defining the problem to calculating QPS, storage, and
bandwidth.
🔹 A. Back-of-the-Envelope Estimation for a system like Twitter (X), using the framework above.
• Tweeting
• Reading tweets
• User timelines
Assume per user per day:
• 5 tweets posted
• 50 tweets read
So per day in total:
• Tweets posted ≈ 100 million
• Tweets read ≈ 1 billion
🔹 4. Estimate Storage
Assume:
• Each tweet = 280 characters ≈ 0.5 KB
• Each tweet is stored with metadata = ~1 KB total
→ 100M tweets/day × 1 KB ≈ 100 GB of new tweet storage per day
🔹 6. Estimate Bandwidth
🔹 7. Read/Write Ratio
• Reads = 1B
• Writes = 100M
• Read:Write = 10:1
🔁 Implication: Must optimize for reads → use caching (Redis, Memcached), fan-out strategies, timeline
pre-computation.
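A minimal sketch of the fan-out-on-write / timeline pre-computation idea, with plain Python structures standing in for Redis lists (illustrative only, not the book's implementation):

```python
# Fan-out-on-write sketch: when a user tweets, push the tweet ID onto each
# follower's pre-computed timeline. The dicts stand in for Redis lists; a real
# system would cap timeline length and handle celebrity accounts differently.
from collections import defaultdict, deque

FOLLOWERS: dict[str, set[str]] = defaultdict(set)                     # author -> followers
TIMELINES: dict[str, deque] = defaultdict(lambda: deque(maxlen=800))  # user -> tweet ids

def follow(follower: str, author: str) -> None:
    FOLLOWERS[author].add(follower)

def post_tweet(author: str, tweet_id: str) -> None:
    for follower in FOLLOWERS[author]:
        TIMELINES[follower].appendleft(tweet_id)   # write amplification = follower count

def read_timeline(user: str, limit: int = 20) -> list[str]:
    return list(TIMELINES[user])[:limit]           # cheap read, no joins at request time

follow("bob", "alice")
post_tweet("alice", "t1")
print(read_timeline("bob"))   # ['t1']
```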
🔹 B. Back-of-the-Envelope Estimation for a YouTube-like platform, a large-scale video-streaming service.
Assume 200M DAU, with 0.1 video uploads and 10 video views per user per day. So daily:
• Videos uploaded = 200M × 0.1 = 20 million
• Videos watched = 200M × 10 = 2 billion
🔹 4. Estimate Storage
Assume:
📦 Note: YouTube likely stores multiple resolutions (e.g. 480p, 720p, 1080p) → actual usage could be 3× to
5× more.
🔹 6. Estimate Bandwidth
Assume:
🔹 7. Read/Write Ratio
• Reads = 2B
• Writes = 20M
• Read:Write = 100:1
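A rough storage calculation for the uploads above; the 300 MB average video size is an assumption (not in the notes), while the 3x to 5x multiplier comes from the resolution note earlier:

```python
# Rough storage estimate for the YouTube-like example. AVG_VIDEO_SIZE_MB is an
# assumed value; the resolution multiplier reflects the note above about
# storing several encodings (480p/720p/1080p).
UPLOADS_PER_DAY = 20_000_000          # from the estimate above
AVG_VIDEO_SIZE_MB = 300               # assumed average per upload
RESOLUTION_MULTIPLIER = 4             # roughly 3x-5x for multiple resolutions

raw_pb_per_day = UPLOADS_PER_DAY * AVG_VIDEO_SIZE_MB / 1_000_000_000
total_pb_per_day = raw_pb_per_day * RESOLUTION_MULTIPLIER
print(f"~{raw_pb_per_day:.1f} PB/day raw, ~{total_pb_per_day:.1f} PB/day with multiple resolutions")
```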
Practicing more Back-of-the-Envelope Estimation examples will sharpen your system design
skills. Here are five practical examples across different domains, each designed to test a different aspect of
system thinking:
🧮 Back-of-the-Envelope Estimation Examples
Example: a maps / navigation service (route calculation, map tiles)
• MAU: 1 billion
• DAU: 100 million
• Active requests/hour: 10M users requesting a route every 30 minutes
• QPS: ~140K for route calculation
• Data size: Each map tile = ~20–100 KB
• Key Systems:
o Real-time traffic ingestion
o Fast geo queries
o Location-based caching
🔧 Next Steps
For an end-to-end encrypted messaging app, assume per user per day:
• Messages sent: 50
• Photos sent: 5
• Voice/video notes sent: 1
• Each message = ~1 KB
• Each media = ~1 MB (average)
Totals:
Messages:
Media:
🔐 Only sender & recipient have access due to end-to-end encryption — metadata like timestamps, delivery
receipts might still be stored centrally.
🔹 6. Estimate Bandwidth
Text:
Media:
Most bandwidth comes from media, which calls for aggressive compression and CDN delivery (even
end-to-end encrypted media can be cached as encrypted blobs).
🔹 7. Read-Write Ratio
• Writes: 7.5B
• Reads: ~15B (delivery, read receipts, sync across devices)
• Read:Write ≈ 2:1
🔒 Security Architecture
Chapter: Design a URL Shortener
This chapter walks you through the complete system design of a URL shortener service, where users submit
a long URL and get a shortened version like short.ly/xyz123.
✅ 1. Functional Requirements
🧩 3. High-Level Design
1. Shorten URL:
2. Redirect URL:
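One common way to implement the shorten/redirect mapping is base62 encoding of an auto-incremented numeric ID; here is a minimal sketch (identifiers are illustrative, and this is one option rather than the only scheme the chapter discusses):

```python
# Base62 encoding sketch for the shorten/redirect flow: map an auto-incremented
# numeric ID to a short string and back.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62

def encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        num, rem = divmod(num, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode(short: str) -> int:
    num = 0
    for ch in short:
        num = num * BASE + ALPHABET.index(ch)
    return num

row_id = 125_674_903                  # e.g. primary key of the stored long URL
print(encode(row_id))                 # short code to append after short.ly/
assert decode(encode(row_id)) == row_id
```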
10. Trade-Offs
✅ Key Takeaways
🔐 Chapter: Design a Unique ID Generator in Distributed Systems
In distributed systems, generating globally unique, ordered IDs at scale is essential.
This chapter explores how to design such a service efficiently and safely.
✅ 1. Requirements
📌 Functional
🛡 Non-functional
🔢 3. ID Formats
Option A: UUID
• Randomly generated
• 128-bit, e.g. 550e8400-e29b-41d4-a716-446655440000
• ❌ Not sortable, relatively large
Option B: Database Auto-Increment
✅ Option C: Snowflake-style ID
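A minimal sketch of a Snowflake-style generator using the common 41/5/5/12 bit layout (timestamp, datacenter, machine, sequence); treat it as an illustration of the idea, not production code:

```python
# Snowflake-style ID sketch: 41 bits of millisecond timestamp, 5 bits datacenter,
# 5 bits machine, 12 bits per-millisecond sequence. No clock-rollback handling,
# not thread-safe -- illustration only.
import time

EPOCH_MS = 1_288_834_974_657          # custom epoch (Twitter's, as an example)

class SnowflakeGenerator:
    def __init__(self, datacenter_id: int, machine_id: int) -> None:
        self.datacenter_id = datacenter_id & 0b11111
        self.machine_id = machine_id & 0b11111
        self.sequence = 0
        self.last_ms = -1

    def next_id(self) -> int:
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
            if self.sequence == 0:                        # sequence exhausted: wait 1 ms
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.datacenter_id << 17) \
               | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(datacenter_id=1, machine_id=3)
print(gen.next_id())   # 64-bit, roughly time-sortable
```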
Key Takeaways
Chapter: Design a Web Crawler
This is one of the more complex system design problems and tests
your understanding of scalability, distribution, concurrency, and data processing.
A web crawler (or spider) is a bot that browses the web in a methodical, automated manner to discover and collect web pages (for search indexing, SEO analysis, or data mining).
🧠 Used by: Search engines (e.g., Google), SEO tools, data mining applications
✅ 1. Functional Requirements
2. Non-Functional Requirements
🔁 4. Workflow
✅ Options:
✅ Use:
📁 8. Storage Strategy
Store:
• Raw HTML
• Parsed content (e.g., title, meta tags, text)
• Metadata (timestamp, status, size, latency)
✅ Options:
🧮 Back-of-the-Envelope Estimation
• Pages/day = 1B / 30 = ~33.3M
• Pages/second ≈ 33.3M / 86,400 ≈ 385 QPS
• Assuming 100 KB/page → 1B × 100 KB = 100 TB total
🔋 With 1,000 worker nodes, each must do ~1 page every 2.5 seconds
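A minimal sketch of a crawler loop with a URL frontier, a seen-URL set, and a per-host politeness delay; fetching and parsing are stubbed out, and a real crawler would also honor robots.txt and use a Bloom filter at scale:

```python
# URL frontier sketch with a simple per-host politeness delay: don't hit the
# same host more often than once per POLITENESS_SECONDS.
import time
from collections import deque
from urllib.parse import urlparse

POLITENESS_SECONDS = 1.0
frontier: deque[str] = deque(["https://example.com/", "https://example.org/a"])
last_hit: dict[str, float] = {}        # host -> last fetch timestamp
seen: set[str] = set(frontier)

def fetch_and_extract_links(url: str) -> list[str]:
    return []                          # stub: download the page, parse, return outlinks

while frontier:
    url = frontier.popleft()
    host = urlparse(url).netloc
    wait = POLITENESS_SECONDS - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)               # politeness: spread requests to one host
    last_hit[host] = time.time()
    for link in fetch_and_extract_links(url):
        if link not in seen:           # URL-seen filter (Bloom filter at scale)
            seen.add(link)
            frontier.append(link)
```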
Chapter: Design a Notification System (Notification as a Service)
This chapter explores how to build a generic, scalable, multi-channel notification service, like those used
by apps to send alerts via email, SMS, push notifications, or in-app messages.
🧾 1. Problem Statement
✅ 2. Functional Requirements
🛡 3. Non-Functional Requirements
🔁 5. Notification Flow
✅ Tools: Redis (with sorted sets), Quartz Scheduler, or delay queues in Kafka
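As one possible realization of the Redis sorted-set option, here is a sketch where the score is the delivery timestamp and a worker polls for due notifications; it assumes a local Redis with the redis-py client, and send() is a stub:

```python
# Scheduled-notification sketch with a Redis sorted set: score = delivery time,
# member = serialized notification. Assumes a local Redis; send() stands in for
# the actual email/SMS/push provider.
import json, time
import redis

r = redis.Redis()
QUEUE = "scheduled_notifications"

def schedule(user_id: str, message: str, send_at: float) -> None:
    payload = json.dumps({"user_id": user_id, "message": message})
    r.zadd(QUEUE, {payload: send_at})           # score is the unix timestamp to send at

def send(notification: dict) -> None:
    print("sending", notification)              # stub: hand off to email/SMS/push worker

def poll_due_notifications() -> None:
    # Run periodically (cron, worker loop); zrem succeeds for only one worker,
    # so a notification is not sent twice.
    now = time.time()
    for raw in r.zrangebyscore(QUEUE, 0, now):
        if r.zrem(QUEUE, raw):
            send(json.loads(raw))

schedule("u1", "Your order shipped", time.time() + 5)
```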
Track:
Chapter: Design a News Feed System
This chapter focuses on how to build a scalable, low-latency, and personalized feed — the kind you see in
social media apps like Facebook, Instagram, or Twitter.
A News Feed system aggregates content (posts, updates, videos, etc.) from many sources (friends, pages,
followed users) and shows a personalized list of the most relevant or recent content to the user.
Examples:
• Facebook timeline
• Instagram feed
• Twitter home timeline
• LinkedIn activity feed
✅ 2. Functional Requirements
3. Non-Functional Requirements
Pull model (fan-out on read):
✅ Pros:
❌ Cons:
• High latency
• Heavy DB load on every feed request
• Harder to scale
Push model (fan-out on write):
✅ Pros:
❌ Cons:
Hybrid approach:
➡ Push for most users and pull for very popular accounts, which reduces push overhead while keeping the feed fast for most users
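A minimal sketch of that hybrid idea: fan out on write for ordinary authors, fan out on read for very popular ones. The threshold and data structures are assumptions for illustration:

```python
# Hybrid fan-out sketch: push posts to follower feeds for ordinary users, but
# skip the push for very popular authors and merge their posts at read time.
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000
followers: dict[str, set[str]] = defaultdict(set)        # author -> follower ids
feeds: dict[str, list[str]] = defaultdict(list)          # user -> pushed post ids
recent_posts: dict[str, list[str]] = defaultdict(list)   # author -> own post ids

def publish(author: str, post_id: str) -> None:
    recent_posts[author].append(post_id)
    if len(followers[author]) < CELEBRITY_THRESHOLD:     # fan-out on write
        for f in followers[author]:
            feeds[f].append(post_id)
    # else: fan-out on read -- popular authors' posts are pulled at feed-build time

def build_feed(user: str, followed_celebrities: list[str]) -> list[str]:
    pulled = [p for c in followed_celebrities for p in recent_posts[c]]
    return (feeds[user] + pulled)[-50:]                  # merge, then rank/trim in practice
```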
🔁 6. Feed Generation Logic
Pull Model
Ranking signals:
• Collaborative filtering
• Behavioral signals
• Prior engagement
8. Back-of-the-Envelope Estimation
Assume:
• 500M DAU
• Each user follows 500 users
• Each followee posts 2 times/day
Possible extensions:
• Infinite scrolling
• Filters: only photos, videos, friends
• Stories (temporary content)
• Ads injected into feed
• Real-time feed updates with WebSocket or polling
✅ Key Takeaways
Chapter: Design a Chat System
A chat system is fundamental to messaging apps like WhatsApp, Messenger, Slack, Telegram, etc. It must
handle real-time communication, message reliability, and scalability for millions (or billions) of users.
✅ 1. Functional Requirements
2. Non-Functional Requirements
🔗 4. Communication Protocol
📬 8. Delivery Guarantees
🧮 9. Back-of-the-Envelope Estimation
Assume:
• 200M DAU
• Avg. 50 messages/user/day → 10B messages/day
• = ~115,000 messages/sec (average)
Storage:
Active connections:
• Write-heavy system
• Use NoSQL DB like Cassandra or DynamoDB for fast writes
• Store recent messages in Redis or Memcached for quick access
• Archive old messages to blob store (e.g., S3)
• If recipient is offline:
o Store message
o Trigger push notification via FCM / APNs
• Retry sending on reconnect
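A minimal sketch of this offline-delivery flow; the connection registry, pending store, and push call are stand-ins (FCM/APNs and a durable message store in practice):

```python
# Offline-delivery sketch: if the recipient has no active connection, persist
# the message and trigger a push notification; undelivered messages are
# flushed on reconnect.
from collections import defaultdict

active_connections: dict[str, object] = {}            # user_id -> websocket-like object
pending: dict[str, list[str]] = defaultdict(list)     # user_id -> undelivered messages

def push_notify(user_id: str, preview: str) -> None:
    print(f"push to {user_id}: {preview}")             # stub for FCM / APNs

def deliver(recipient: str, message: str) -> None:
    conn = active_connections.get(recipient)
    if conn is not None:
        conn.send(message)                             # online: deliver over the socket
    else:
        pending[recipient].append(message)             # offline: store the message ...
        push_notify(recipient, message[:40])           # ... and wake the device

def on_reconnect(user_id: str, conn) -> None:
    active_connections[user_id] = conn
    for msg in pending.pop(user_id, []):               # retry on reconnect
        conn.send(msg)

deliver("alice", "hey!")   # no connection registered -> stored + push notification
```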
🔐 13. Security
✅ Key Takeaways
Chapter: Design a Search Autocomplete System
This chapter focuses on designing a responsive and scalable autocomplete system like the one you see in
Google, Amazon, YouTube, or LinkedIn, where suggestions are shown as you type.
🧾 1. Problem Statement
Design a search autocomplete service that returns top relevant suggestions as the user types, usually
within milliseconds.
Examples:
✅ 2. Functional Requirements
📊 8. Real-Time Updates
Assume:
Estimated storage:
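On the data-structure side, a common approach for this problem is a trie whose nodes cache their top suggestions; here is a minimal in-memory sketch (frequencies and the top-k size are illustrative, and real systems precompute these offline):

```python
# Trie sketch for prefix autocomplete: each node caches its top suggestions so
# a lookup is a single walk down the prefix.
from collections import defaultdict

class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = defaultdict(TrieNode)
        self.top: list[tuple[int, str]] = []          # (frequency, query), best first

TOP_K = 5
root = TrieNode()

def insert(query: str, freq: int) -> None:
    node = root
    for ch in query:
        node = node.children[ch]
        node.top.append((freq, query))
        node.top.sort(reverse=True)                   # keep only the k best at each node
        del node.top[TOP_K:]

def autocomplete(prefix: str) -> list[str]:
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [q for _, q in node.top]

for q, f in [("system design", 90), ("system of a down", 70), ("systemd", 40)]:
    insert(q, f)
print(autocomplete("sys"))    # ['system design', 'system of a down', 'systemd']
```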
Chapter: Design YouTube
This is one of the most complex, multi-component system design problems because it involves video upload, encoding, storage, playback, search, and recommendations.
✅ 1. Functional Requirements
🛡 2. Non-Functional Requirements
📤 Upload Flow
⚙ Encoding Flow
4. Encoding service:
a. Converts video into multiple formats/resolutions (1080p, 720p, 480p, etc.)
b. Stores in object storage
c. Generates thumbnail and preview
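A minimal sketch of an encoding worker following steps a to c; transcode(), store_blob(), and make_thumbnail() are hypothetical stand-ins for ffmpeg jobs and object storage, not a real API:

```python
# Encoding-pipeline sketch: a worker takes an uploaded video, produces several
# resolutions, stores them, and writes a thumbnail.
RESOLUTIONS = ["1080p", "720p", "480p"]

def transcode(source_path: str, resolution: str) -> bytes:
    return b""                                        # stand-in for an ffmpeg job

def store_blob(key: str, data: bytes) -> str:
    return f"s3://videos/{key}"                       # stand-in for object storage

def make_thumbnail(source_path: str) -> bytes:
    return b""                                        # stand-in for frame extraction

def handle_upload(video_id: str, source_path: str) -> dict:
    outputs = {}
    for res in RESOLUTIONS:                           # one output per resolution
        data = transcode(source_path, res)
        outputs[res] = store_blob(f"{video_id}/{res}.mp4", data)
    outputs["thumbnail"] = store_blob(f"{video_id}/thumb.jpg", make_thumbnail(source_path))
    return outputs                                    # URLs recorded in the metadata DB

print(handle_upload("v123", "/tmp/raw_upload.mp4"))
```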
▶ Playback Flow
Assume:
QPS:
🔎 Search
🤖 Recommendation
✅ Key Takeaways
Chapter: Design Google Drive (or Dropbox)
Designing Google Drive involves building a cloud storage service that lets users upload, store, share, and
collaborate on files and folders, while maintaining reliability, synchronization, scalability, and security.
✅ 1. Functional Requirements
🛡 2. Non-Functional Requirements
➡ Folder structure can be modeled using a relational DB or document store (e.g., MongoDB)
Large files:
🔁 6. Sync Service
To detect changes:
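One common way to detect changes is chunk-level hashing: split files into fixed 4 MB chunks (matching the estimate later in this chapter), hash each chunk, and re-upload only the chunks whose hashes differ. A minimal sketch with illustrative names:

```python
# Chunk-level change detection sketch: hash fixed-size chunks and compare
# against the previously synced hashes so only changed chunks are re-uploaded.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB, matching the estimate in this chapter

def chunk_hashes(path: str) -> list[str]:
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

def changed_chunks(local: list[str], remote: list[str]) -> list[int]:
    # Indices whose hash differs (or that only exist locally) need uploading.
    return [i for i, h in enumerate(local)
            if i >= len(remote) or h != remote[i]]

local = ["a1", "b2", "c3"]     # e.g. chunk_hashes("report.docx")
remote = ["a1", "xx"]          # hashes from the last successful sync
print(changed_chunks(local, remote))   # [1, 2]: re-upload chunks 1 and 2
```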
Permissions cascade:
• If you share a folder, child files are shared too (unless overridden)
Support for:
📜 8. Versioning System
Version DB stores:
Users can:
Optional:
Assume:
• 1B users
• Avg 10 files/user/day → 10B new files/day
• Each file = 1MB → 10 PB/day
• 4MB chunk size → ~3 chunks/file → 30B chunk writes/day
• 10–50K writes/sec
• Millions of sync messages/sec
✅ Key Takeaways
Chapter: Design a Key-Value Store
A Key-Value Store is a core component in distributed systems, serving as the foundation for
databases like Redis, Amazon DynamoDB, RocksDB, and LevelDB. It enables quick reads and
writes with a simple interface.
✅ 1. Functional Requirements
• GET(key) → value
• PUT(key, value) → insert or update
• DELETE(key)
• Support large data size and millions of requests per second
• Support persistence, replication, and fault tolerance
🛡 2. Non-Functional Requirements
Read path:
• Search in MemTable
• If not found, check Immutable MemTable
• Then search in SSTables (on-disk)
o May use a Bloom filter to skip irrelevant files
o Use index block in SSTable for efficient lookup
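A minimal sketch of that read path, with plain dicts standing in for the MemTables and SSTables and a fake Bloom-filter check, purely to show the lookup order:

```python
# Read-path sketch for an LSM-style store: check the MemTable, then the
# immutable MemTable, then SSTables from newest to oldest, consulting a Bloom
# filter before touching each file.
from typing import Optional

memtable: dict[str, str] = {"k3": "v3"}
immutable_memtable: dict[str, str] = {"k2": "v2"}
sstables: list[dict[str, str]] = [{"k1": "v1"}, {"k0": "v0"}]   # newest first

def might_contain(sstable: dict[str, str], key: str) -> bool:
    return True      # stand-in for a real Bloom filter (false positives allowed)

def get(key: str) -> Optional[str]:
    if key in memtable:                      # 1. in-memory, most recent writes
        return memtable[key]
    if key in immutable_memtable:            # 2. being flushed to disk
        return immutable_memtable[key]
    for table in sstables:                   # 3. on-disk, newest to oldest
        if might_contain(table, key) and key in table:
            return table[key]
    return None

print(get("k1"), get("missing"))   # v1 None
```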
🔨 9. Compaction
🧮 Back-of-the-Envelope Estimation
Assume:
• 10 million keys
• Key = 20B, Value = 500B → ~520B/entry
• Total storage = 10M × 520B ≈ 5 GB
• If 1,000 writes/sec:
• WAL must handle 500 KB/sec write throughput
• For scalability:
• Partition by key hash across 10 nodes
• Each node handles 1M keys (~500 MB/node)
✅ Key Takeaways
• A modern key-value store uses LSM Tree for efficient writes and compaction
• Combines memory (MemTable) and disk (SSTables) for performance and durability
• Read optimization via Bloom filters and indexes
• Replication, sharding, and compaction ensure performance and reliability at scale
Chapter: Design Consistent Hashing
Consistent Hashing is a fundamental concept in distributed systems, especially in systems where data
needs to be partitioned across multiple servers/nodes—such as caches (like Memcached), databases
(like DynamoDB), file storage systems, etc.
Keys and servers are both placed on a hash ring (0 to 2³² or 0 to 1).
• Hash every node/server to a point on the ring
• Hash every key to a point on the ring
• A key is assigned to the first node clockwise from its hash value
Key Properties:
Example:
• Servers: A → 20, B → 60, C → 120 (hashed positions)
• If C is removed, any key previously assigned to C is reassigned to the next server clockwise on the ring
🔁 5. Adding/Removing Nodes
• Adding a node → only keys between the previous server and new node move
• Removing a node → only its keys need to be reassigned to next clockwise node
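A minimal hash-ring sketch in Python using bisect for the "next node clockwise" lookup; virtual nodes are omitted to keep it short:

```python
# Hash-ring sketch: servers and keys are hashed onto the same ring; a key is
# owned by the first server clockwise from its hash.
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, servers: list[str]) -> None:
        self._ring = sorted((ring_hash(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        points = [h for h, _ in self._ring]
        idx = bisect.bisect_right(points, ring_hash(key)) % len(points)  # wrap around
        return self._ring[idx][1]

ring = HashRing(["server-A", "server-B", "server-C"])
print(ring.server_for("user:42"))
# Removing one server only remaps the keys that were assigned to it:
ring_without_c = HashRing(["server-A", "server-B"])
```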
🧮 8. Back-of-the-Envelope Estimation
Assume:
• 100M keys
• 10 servers
→ On average each server owns ~10M keys; adding or removing one server only remaps roughly a tenth of the keys