Common System-Design Problems & Solutions with Code Examples
Below, we expand on the common system-design problems & solutions by including code examples that illustrate how one might implement or configure aspects of these solutions. Each example is simplified for clarity and demonstration purposes.
1. Session Management in a Multi-Server Environment
Problem Recap
- Users log in; session data is stored locally.
- Multiple servers cause session inconsistency.
Solution Approaches with Code Examples
Central Session Store using Redis (Python with Flask and Flask-Session)
Token-Based Auth (JWT) Example (Python with Flask)
2. Database Write Scalability
Solution Approach: Sharding with Pseudocode in Python
3. Distributed Cache Invalidation
Solution Approach: Pub/Sub for Invalidation using Redis (Python)
4. Microservices & Event Consistency
Solution Approach: Event-Driven Architecture with Kafka (Python Producer Example)
5. Real-Time Notifications (Outside of WebSockets)
Solution Approach: Server-Sent Events (SSE) with Python Flask
6. Task Scheduling and Distribution
Solution Approach: Using Celery for Task Distribution (Python)
7. Concurrency Control and Data Races
Solution Approach: Optimistic Locking with SQL (PostgreSQL Example in Python)
8. Distributed Logging & Monitoring
Solution Approach: Centralized Logging with Logstash (ELK Stack)
logstash.conf excerpt:
9. CDN & Edge Caching for Latency Reduction
Solution Approach: CDN Configuration Example in HTML
Note: The CDN setup itself is handled via provider configuration, not code.
10. Failures and Fault Tolerance
Solution Approach: Circuit Breaker Pattern (Python Pseudocode)
Note: These examples aim to illustrate the essence of each approach. In production systems, implementations require more robust error handling, configuration options, and integration testing.
11. API Rate Limiting and Throttling
Problem
- Public APIs or internal microservices can be bombarded by excessive requests (malicious or accidental).
- Without controls, this can degrade service performance or even take down critical components.
Solution Approaches
- Token Bucket/Leaky Bucket Algorithms: Keep track of request tokens that refill over time; once tokens are depleted, throttle new requests.
- Centralized API Gateway: A gateway layer that enforces rate limits for all upstream services (e.g., Kong, AWS API Gateway).
- Distributed Rate Limiting: Use Redis or another shared store to coordinate rate limits across multiple servers.
12. Data Migration and Schema Evolution
Problem
- Production databases need schema or data changes without downtime.
- Large migrations can lock tables, cause major performance hits, or break old code expecting a different schema.
Solution Approaches
- Blue-Green or Rolling Deployments: Spin up new versions of the service or DB while the old version is still running, then switch traffic.
- Backward-Compatible Changes (Expand-Contract Pattern): Deploy code that can handle both old and new schema before actually switching the schema.
- Online Schema Migration Tools (e.g., gh-ost, pt-online-schema-change): Perform table changes in a way that minimizes locking and downtime.
13. Strong Consistency vs. Eventual Consistency
Problem
- Some systems demand immediate, strict consistency (e.g., banking transactions). Others can tolerate a slight delay (e.g., social media feeds).
- Choosing the wrong model can cripple performance or the user experience.
Solution Approaches
- Strong Consistency: Requires more coordination (two-phase commit, distributed transactions). Slower but real-time correctness.
- Eventual Consistency: Systems like Cassandra or DynamoDB replicate asynchronously; faster writes, but short-term stale reads may occur.
- Hybrid Models: Some system parts require strong consistency (e.g., user authentication), while others can be eventually consistent (e.g., analytics).
14. Feature Toggles and Canary Releases
Problem
- Rolling out a new feature to 100% of users at once is risky. If something is wrong, there’s a huge blast radius.
- Hard to roll back or mitigate if the deployment is large and complex.
Solution Approaches
- Feature Flags: Wrap features in toggles that can be turned on/off without a redeploy. Useful for controlled rollouts or A/B testing.
- Canary Deployments: Expose the new version to a small percentage of users first. If stable, gradually increase.
- Blue-Green Deployment: Maintain two production environments (blue and green). Switch traffic from one to the other post-verification.
15. Infrastructure as Code and CI/CD Pipelines
Problem
- Manually configuring servers leads to inconsistencies and “snowflake” servers.
- Slow or error-prone deployments hinder agility and reliability.
Solution Approaches
- Infrastructure as Code (IaC): Define servers and networking in files (Terraform, CloudFormation), use version control.
- Automated CI/CD: Pipelines (Jenkins, GitLab CI, GitHub Actions) automate builds, tests, and deployments.
- Immutable Infrastructure: Spin up new servers or containers for each deploy rather than patching existing ones.
16. Global Latency and Data Locality
Problem
- Serving a global user base from one region leads to high latency for distant clients.
- Potential data-sovereignty or compliance issues if user data can’t leave a certain country/region.
Solution Approaches
- Multiple Regions / Multi-Cloud: Deploy services closer to where users are.
- Geo-Replication of Databases: Copies of data in each region, often eventually consistent.
- Edge Compute: Place serverless or “edge” functions near the user’s location for minimal latency.
17. Large File / Media Handling
Problem
- Storing and serving large files (videos, images) from the main app servers is expensive and can crush bandwidth.
- Slow requests block or degrade core service performance.
Solution Approaches
- Object Storage (AWS S3, Google Cloud Storage): Keep large assets separate from the main app.
- CDN: Cache and deliver static/media files from edge nodes, reducing load on your core.
- Chunked Uploads: For very large files, use multipart uploads with resume capabilities.
18. Handling Spiky or Seasonal Traffic
Problem
- Traffic surges (sales, product launches) overwhelm your infrastructure.
- Over-provisioning 24/7 is costly and wasteful.
Solution Approaches
- Autoscaling: Automatically add/remove instances based on metrics (CPU, RAM, custom).
- Queueing & Throttling: Smooth out incoming request spikes by placing them in a queue.
- Circuit Breakers & Graceful Degradation: Temporarily turn off non-critical features if load is too high.
19. Search and Indexing at Scale
Problem
- Full-text search or complex queries on large data sets is slow with naive DB usage.
- Simple indexes may not handle advanced text analysis or near-real-time updates.
Solution Approaches
- Dedicated Search Engines: Elasticsearch, Solr, or OpenSearch for high-performance text search.
- Distributed Indexes: Shard the index across multiple nodes for parallel search.
- Pipelines: Use something like Kafka + Logstash to feed data to search engine indices in near real-time.
20. Security in Distributed Systems
Problem
- Multiple microservices and endpoints = more attack surface.
- Inconsistent or weak security practices can expose vulnerabilities.
Solution Approaches
- Central Auth/SSO: Identity Providers (OIDC, SAML) unify authentication and authorization.
- mTLS (Mutual TLS): Ensure encrypted and authenticated service-to-service communication.
- Zero-Trust Networking: Require authentication/authorization on every request, even inside the same network.
How to Use These in System Design
- Each bullet is a common challenge encountered when building scalable, fault-tolerant systems.
- You often need multiple patterns together (e.g., microservices plus centralized logging, circuit breakers, caching).
- Refer to these whenever you face a new bottleneck or design puzzle; you’ll see many of these patterns repeating in different forms.
21. Service Discovery
Problem
- In a microservices environment with dynamic scaling, services come and go, making it hard to keep track of current IPs/ports.
- Hardcoding service addresses is brittle; manual updates lead to downtime or stale configurations.
Solution Approaches
- Service Registry (e.g., Consul, Eureka, Zookeeper): Services register themselves and discover others via a central directory.
- DNS-Based Discovery: Dynamically update DNS records when services scale up/down.
- Sidecar Pattern: A sidecar container handles discovery on behalf of the main service, integrating with a service mesh or registry.
22. Service Mesh
Problem
- Microservices often need consistent networking (encryption, retries, load balancing, observability).
- Implementing these capabilities in each service duplicates logic and complicates development.
Solution Approaches
- Dedicated Data Plane (e.g., Envoy sidecars in Istio, Linkerd): Each service instance has a proxy that handles networking concerns uniformly.
- Control Plane: Central service for managing configurations, certificates, routing rules.
- Gradual Adoption: Start small with basic features (e.g., mTLS) and grow to traffic shaping and canary deployments.
23. Gossip Protocols
Problem
- Need to share state (cluster membership, health checks) among nodes in a distributed system.
- A centralized store might be a single point of failure; large-scale systems need a more decentralized approach.
Solution Approaches
- SWIM-Style Gossip: Periodic, random node-to-node communication to exchange membership info (e.g., Serf, HashiCorp Consul under the hood).
- Anti-Entropy Protocols: Nodes gradually converge on a consistent state, tolerating partial failures and network partitions.
- Push/Pull Gossip: Combination of sending updates (push) and requesting them (pull) ensures eventual consistency without a central point.
24. Secret Management
Problem
- Credentials, API keys, and other secrets must be stored securely, not hardcoded or committed to version control.
- Rotating secrets without downtime is often challenging if they’re embedded across multiple services.
Solution Approaches
- Vaults (e.g., HashiCorp Vault): Central secure storage and dynamic secret generation (short-lived tokens).
- Kubernetes Secrets: Encrypted at rest; can be combined with external KMS for better security.
- Parameter Stores (AWS SSM Parameter Store, Azure Key Vault): Centralize configurations and secrets, enforce access policies.
25. Offline-First Applications
Problem
- Mobile or remote clients may lose connectivity. If the app fully depends on a live server, it becomes unusable offline.
- Sync conflicts when connectivity is restored can cause inconsistent data if not managed properly.
Solution Approaches
- Local Storage & Caching: Save user data locally (IndexedDB, SQLite) so the app can function offline.
- Conflict Resolution: Use versioning or last-write-wins; for more complex scenarios, implement merges or user prompts for conflicts.
- Background Sync: Upload changes and fetch updates once the device is back online, ideally in a seamless, queue-based approach.
26. Data Warehouse vs. Data Lake
Problem
- Large volumes of data from various sources need to be stored, processed, and analyzed. Traditional relational DBs become either too expensive or too rigid.
- There’s a tradeoff between structured data (easy queries, predefined schema) and raw/unstructured data (flexibility).
Solution Approaches
- Data Warehouse (e.g., Snowflake, BigQuery, Redshift): Structured schema, optimized for fast analytics; best for well-defined use cases.
- Data Lake (e.g., Hadoop, S3-based lakes): Store raw data cheaply; schema-on-read approach. Flexible for data science exploration and machine learning.
- Hybrid “Lakehouse”: Combines cheap storage of a lake with warehousing capabilities (e.g., Databricks Delta Lake).
27. Streaming Data Pipelines
Problem
- High-volume event streams need near real-time processing (e.g., analytics, fraud detection, log aggregation).
- Batch processing (e.g., nightly jobs) can be too slow for real-time insights.
Solution Approaches
- Message Brokers (Kafka, Pulsar): Durable, high-throughput event pipelines; consumers can process data on the fly.
- Stream Processing Frameworks (Spark Streaming, Flink): Windowing, aggregations, and stateful computations at scale.
- Lambda Architecture: Combine a real-time stream layer (fast but approximate) with a batch layer (slower but exact).
28. Multi-Region Failover
Problem
- A single region or data center is a single point of failure. Outages, natural disasters, or network splits can take the entire system offline.
- Maintaining consistent data across regions is complex, especially for writes.
Solution Approaches
- Active-Passive: One region is primary; a secondary region stands by to take over if the primary fails (DNS or failover orchestration switch).
- Active-Active: Multiple regions serve traffic simultaneously, requiring advanced replication and conflict resolution.
- Latency-Based Routing: Some users automatically connect to their closest region; if that region fails, requests route to another.
29. Monitoring vs. Observability
Problem
- Monitoring alone (CPU, RAM, basic metrics) may not provide enough insight into complex distributed issues.
- Tracing, structured logs, and more detailed metrics are needed to fully understand system behavior.
Solution Approaches
- Monitoring: Use metrics (Prometheus, Datadog) and alert on thresholds.
- Observability: Implement tracing (Jaeger, Zipkin), structured logging (ELK stack), and correlation IDs. Helps pinpoint the root cause faster.
- Unified Dashboards: Aggregate logs, metrics, and traces in a single place (e.g., Grafana, Splunk) for a comprehensive view.
30. Autoscaling Container Workloads
Problem
- Demand fluctuates unpredictably. Manually scaling containers (e.g., in Kubernetes) isn’t practical.
- Over-provisioning leads to wasted resources; under-provisioning degrades performance.
Solution Approaches
- Kubernetes HPA (Horizontal Pod Autoscaler): Scales pods based on CPU/Memory or custom metrics.
- Cluster Autoscaler: Adds/removes worker nodes in response to overall cluster demand.
- Event-Driven Autoscaling: Use external triggers (e.g., queue length, requests per second) to auto-scale more precisely.
How to Use These (21–30) in System Design
- Each point represents a frequent challenge or scenario in distributed systems.
- Consider them an extension to the earlier 1–20 list, covering advanced topics like service discovery, service mesh, and observability.
- Real-world design often involves multiple patterns working in tandem (e.g., a multi-region, microservices app with advanced autoscaling and robust secret management).
31. Modular Monolith vs. Microservices
Problem
- Large monolithic codebase becomes unwieldy, but a full microservices approach is complex.
- Teams want to break it down in a structured way without overcomplicating deployments.
Solution Approaches
- Modular Monolith: Keep a single deployment artifact but split features into well-defined modules with clear boundaries.
- Migration Path: Gradually extract modules into separate services as needed, only when they truly require independent scaling or separate lifecycles.
- Domain-Driven Design: Identify bounded contexts to delineate modules or potential microservices.
32. Event Sourcing and CQRS in Practice
Problem
- Traditional “CRUD-based” data updates lose the history of how state changed over time. Difficult to audit or reconstruct past states.
- Complex business logic or workflows require a precise record of events.
Solution Approaches
- Event Sourcing: Store all changes as a sequence of events. State is derived by replaying events in order. Perfect for auditing and time-travel debugging.
- CQRS: Split read and write models; commands update event logs, while separate queries read from a denormalized view.
- Incremental Adoption: Start with critical domains (e.g., financial transactions) before going all-in across the system.
33. Idempotent Endpoints and Retry Logic
Problem
- Network calls can fail or time out, causing clients to retry. Non-idempotent endpoints can create duplicate side effects (e.g., double-charging a user).
- Hard to guarantee correctness if the same request arrives multiple times.
Solution Approaches
- Idempotent Endpoints: The same operation can be called multiple times without harmful duplication (e.g., use unique request IDs).
- Server-Side De-Duplication: Maintain a short-lived cache of request IDs to detect repeats.
- At-Least-Once vs. Exactly-Once Semantics: Clarify which guarantee is needed for each endpoint; design accordingly.
34. Handling Complex Data Validation
Problem
- Large, nested data structures with intricate validation rules can lead to brittle, scattered logic.
- Code quickly becomes messy with if/else checks duplicated in multiple places.
Solution Approaches
- Validation Libraries: Use frameworks (e.g., pydantic in Python) to define schemas and constraints.
- Layered Validation: Validate at multiple levels (API boundary, domain layer) to catch errors early and ensure invariants.
- Configurable Rules: Externalize complex validation rules (e.g., JSON-based config or DSL) so you can change them without redeploying code.
35. Concurrency Patterns for CPU-Bound vs. I/O-Bound Tasks
Problem
- Mixing CPU-bound computations with I/O-bound tasks in a single application can cause performance bottlenecks.
- Using asynchronous frameworks (like asyncio) alone is insufficient for heavy CPU tasks that block the event loop.
Solution Approaches
- Async I/O: Great for tasks waiting on the network or disk, freeing up the loop to handle other requests.
- Thread or Process Pools: For CPU-intensive segments, offload to workers (threads/processes) to avoid blocking the main event loop.
- Micro-batching: Combine multiple small CPU tasks and run them in a single batch for efficiency on modern CPUs.
36. Plugin Architecture and Extensibility
Problem
- Core application needs custom or user-specific features without bloating the codebase.
- Hard-coded logic leads to massive merges or code forks whenever a new feature is introduced.
Solution Approaches
- Plugin/Extension System: Separate “core” from “plugins” that can be dynamically loaded at runtime.
- Well-Defined Interfaces: Expose hooks or events in the core app that plugins can implement or listen to.
- Package Repositories: Publish plugins as separate packages (PyPI, npm, etc.) for versioned, modular distribution.
37. Large In-Memory Collections & Streaming
Problem
- Processing huge collections (millions of items) in memory can lead to memory exhaustion.
- Single-pass batch operations stall the system while processing, causing timeouts or OOM (out of memory) issues.
Solution Approaches
- Streaming / Iterators: Process data chunks incrementally, freeing memory as you go.
- Lazy Evaluation: Only compute data when needed; libraries like Python’s itertools
or Rx (Reactive Extensions) can help.
- MapReduce Style: For extremely large data sets, distribute across multiple nodes, each handling a partition of the data.
38. Partial Updates & Patch Endpoints
Problem
- Updating large objects by sending the entire payload each time wastes bandwidth and can overwrite concurrent changes.
- Need fine-grained updates (e.g., just updating one field) with concurrency control.
Solution Approaches
- PATCH Method: Use JSON Patch or JSON Merge Patch to only send changed fields.
- Optimistic Concurrency Control: Use version or ETag headers to detect conflicting updates.
- Granular APIs: Provide endpoints for specific sub-resources or fields if partial updates are frequent.
39. Splitting Read vs. Write in Monolithic Systems
Problem
- Large monolith handles both read-heavy and write-heavy workloads, leading to performance bottlenecks and complex code paths.
- Hard to optimize each path separately if they’re all intertwined.
Solution Approaches
- Logical Separation: Within the same codebase, create distinct modules/services for reading vs. writing data.
- Replicated Read Models: Keep a separate read-optimized store (e.g., denormalized) updated from the main write store.
- Gradual Extraction: If performance demands grow, you can move read functionality into a separate microservice while the write side remains in the monolith.
40. Multi-Threading Debugging & Observability
Problem
- Parallel code can have race conditions or deadlocks that are extremely hard to reproduce.
- Traditional debugging tools provide limited insight into concurrency issues.
Solution Approaches
- Structured Logging: Tag logs with thread IDs, correlation IDs, or transaction IDs.
- Concurrency Debug Tools: Tools like Python’s faulthandler
, or specialized concurrency checkers (Intel Inspector, thread sanitizers) for C/C++.
- Traces & Profilers: Use APM solutions (e.g., Jaeger, Zipkin in a multi-threaded environment) to visualize parallel execution flow.
Programming-Focused System Design Takeaways
- These problems (31–40) highlight application-level challenges: concurrency patterns, code organization, data validation, partial updates, etc.
- They can be combined with earlier solutions (1–30) which addressed broader system-level or DevOps topics.
- A real system typically mixes infrastructure concerns (scaling, networking) with software architecture best practices (modularity, concurrency, data handling).
41. Multi-Tenant Code Architecture
Problem
- Supporting multiple tenants (organizations or customers) in one codebase can lead to tangled code: different feature sets, data separation, and custom logic.
- Hard to isolate data per tenant without introducing a lot of “if tenant == X” checks.
Solution Approaches
- Database Schema Per Tenant: Each tenant has its own schema or database; straightforward isolation but potentially more overhead.
- Single Schema, Tenant ID Column: Simpler deployment, but must be vigilant about row-level filtering and permissions.
- Configuration Layers: Define hooks or overrides for tenant-specific logic, possibly a plugin system. Keep core code “tenant-agnostic.”
42. API Versioning Strategies in Code
Problem
- Public or internal APIs evolve, but existing clients may break if the API changes.
- Maintaining backward compatibility becomes a headache, especially if code for old versions is mixed with new.
Solution Approaches
- URI Versioning (e.g., /v1/resource
, /v2/resource
): Easiest to implement, but can lead to code duplication.
- Header or Content Negotiation: Clients request a version via headers; code can route requests accordingly.
- Semantic Versioning & Deprecation Policy: Communicate clearly when breaking changes are introduced and how long old versions will be supported.
43. Managing Domain Invariants & Domain-Driven Design
Problem
- Complex business logic often has rules (invariants) that must never be violated (e.g., “account balance can never go negative without overdraft protection”).
- Spreading these rules across controllers, services, and models leads to inconsistency.
Solution Approaches
- Aggregate Roots: Enforce invariants within the domain object that “owns” the data.
- Value Objects: Encapsulate logic in small, immutable types (e.g., Money
, DateRange
) that validate themselves.
- Domain Services: Centralize logic that spans multiple aggregates, ensuring the rules remain consistent.
44. Integration Testing with External Services
Problem
- Writing automated tests that rely on external APIs (payment gateways, messaging services) is brittle and slow.
- Hard to test error conditions or unusual edge cases without hooking into a live environment.
Solution Approaches
- Mocking & Stubs: Replace external calls with mock implementations or local test doubles.
- Contract Testing: Use frameworks (e.g., Pact) to ensure your service and the external service adhere to an agreed-upon contract.
- Sandbox Environments: Some vendors offer sandbox APIs that mimic production behavior but allow test data and error simulations.
45. Performance Tuning & Profiling Large-Scale Code
Problem
- As traffic grows, certain code paths become slow or CPU-heavy. Identifying bottlenecks by guesswork is inefficient.
- Memory leaks or high GC overhead degrade performance but are hard to track down.
Solution Approaches
- Profilers (e.g., cProfile, PyInstrument in Python): Attach them to running code, get detailed call stacks and timings.
- Sampling vs. Instrumentation: Sampling profilers periodically capture stack traces (lighter); instrumentation adds overhead but provides more precise metrics.
- Continuous Profiling: Tools that run in production with minimal overhead (e.g., Pyroscope, Datadog Profiler) help catch real-world bottlenecks.
46. Error Handling Patterns (Fail-Fast vs. Graceful Degradation)
Problem
- A codebase that swallows exceptions or returns ambiguous errors makes debugging painful.
- Overly defensive code can mask real issues; a single failure might cascade if not handled consistently.
Solution Approaches
- Fail Fast: As soon as a violation or impossible state is detected, throw an error. Don’t continue with corrupted data.
- Global Exception Handlers: At framework boundaries, log the error and return an appropriate response or fallback.
- Graceful Degradation: For non-critical features, provide a partial result or fallback path instead of failing completely.
47. Circuit Breaker Pattern in Application Code
Problem
- Calling an unstable external service repeatedly can slow down the entire app if it’s timing out or erroring.
- Without a mechanism to “trip,” your code keeps hammering the failing resource, exacerbating the problem.
Solution Approaches
- Circuit Breaker Libraries (e.g., Polly for .NET, resilience libraries in Python/Java): They track recent errors. If failures exceed a threshold, “open” the circuit.
- Fallback Logic: Provide alternate code paths when the breaker is open (cached response, default data).
- Close Cycle: After a cooldown period, attempt a few requests to see if the service is healthy again.
48. Using Typed Schemas Across Microservices
Problem
- Different microservices exchanging data with JSON can drift in structure over time, causing runtime errors or missing fields.
- Hard to keep data models consistent as code changes across repos.
Solution Approaches
- Shared Schema Definitions (e.g., protobuf for gRPC, JSON schemas in a common repo): Each service compiles or generates code from the same schema.
- Schema Registry: A central place to manage schema versions (e.g., Confluent Schema Registry for Avro/Kafka).
- Typed Clients: Generate client libraries from the schema so that changes are caught at compile time instead of runtime (where possible).
49. Designing for Offline or Scheduled Tasks
Problem
- Some operations (e.g., generating large reports, sending out newsletters) can’t be completed instantly in a request/response cycle.
- Synchronous endpoints that try to handle these tasks risk timeouts and poor user experience.
Solution Approaches
- Async Queue: Clients submit a request that enqueues a job; a worker processes it asynchronously.
- Polling / Callbacks: The user can check progress or the server can notify when finished.
- Cron Jobs: For recurring tasks, schedule them with cron-like services (e.g., Celery beat in Python).
50. Code Readability & Maintainability
Problem
- Large codebases quickly become messy if developers don’t follow consistent style or best practices.
- Technical debt accumulates, leading to slow feature development and high onboarding friction.
Solution Approaches
- Coding Standards & Linters: Enforce a shared style (PEP 8, ESLint, etc.) and consistent patterns (naming, structure).
- Modular Design: Break code into self-contained modules or packages with clear responsibilities.
- Refactoring Discipline: Continuously refactor old code to keep it aligned with current architectural standards; use code reviews to maintain quality.
```