System Design Basics: Scalability, Reliability, and Distributed Principles

14 min read, Thu, 12 Jun 2025

Centralized systems struggle to keep pace with growing performance demands because of the inherent limits of a single machine's CPU and memory. Although hardware advances continually raise those limits, demand at scale routinely outstrips what any one machine can supply. For instance, highly available systems such as LinkedIn handle billions of transactions daily, a volume that cannot be served by a centralized architecture that permits only vertical scaling.

Handling that load requires splitting the processing and data logic so that different CPUs can work on them independently, allowing most processing to occur in parallel; such an arrangement is called a distributed system. While distributed systems offer many benefits, they also introduce unique constraints. For example, if data is split across multiple locations for independent processing, how is consistency of state guaranteed across these independent units? What happens if a unit fails? Would it lead to a complete system failure? And what is the most efficient and rapid way for these units (also known as nodes) to communicate with each other?

Some key characteristics of distributed systems include Scalability, Reliability, Availability, Efficiency, and Manageability.

Scalability

Scalability refers to a system’s ability to grow in response to increased demand. A distributed system, specifically, should be able to expand as usage increases. For example, if a web application experiences higher traffic, leading to greater compute, storage, and bandwidth demands, it should be possible to either add more nodes or increase the capacity of existing nodes.

When the capacity of an existing node is increased—for instance, by upgrading its CPU, memory, or storage—this is known as vertical scaling.

Conversely, adding more nodes instead of increasing the capacity of existing ones to handle additional traffic load is termed horizontal scaling.

Most cloud computing platforms, such as AWS, GCP, and Azure, offer automatic application scaling. This means that as traffic increases, more nodes can be added, and when traffic decreases, nodes can be terminated.
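The scaling decision these platforms automate can be boiled down to a simple capacity calculation. Below is a minimal sketch of such a policy, assuming a hypothetical cluster where each node serves up to 100 requests per second; the function name, capacity figure, and node limits are illustrative, not taken from any real platform's API.

```python
# Toy autoscaling policy: pick a node count for the observed traffic.
# capacity_per_node, min_nodes, and max_nodes are hypothetical values.
def desired_node_count(current_rps: int, capacity_per_node: int = 100,
                       min_nodes: int = 1, max_nodes: int = 20) -> int:
    """Return how many nodes are needed to serve current_rps requests/sec."""
    needed = -(-current_rps // capacity_per_node)  # ceiling division
    # Never scale below the floor or above the configured ceiling.
    return max(min_nodes, min(max_nodes, needed))

print(desired_node_count(350))    # traffic spike: scale out
print(desired_node_count(0))      # idle: scale in, but keep the minimum
```

Real autoscalers add cooldown periods and smoothing so brief spikes do not cause nodes to be added and terminated in rapid succession.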

While vertical scaling is straightforward to achieve by simply adding CPU or memory, it has limitations: a single machine can only hold so much hardware, upgrades usually require taking the node offline, cost grows steeply at the high end, and the machine remains a single point of failure.

In contrast, horizontal scaling offers greater flexibility: you can add any number of nodes without taking the existing ones offline.

MySQL is a classic example of a system that is typically scaled vertically, while Cassandra and MongoDB are designed for horizontal scaling.

Reliability

Reliability refers to the probability that a system will perform its intended function without failure for a specified period under given conditions. In simpler terms, it’s about whether the system consistently does what it’s supposed to do correctly. A reliable system minimizes errors, data corruption, and unexpected behavior. It’s often expressed as Mean Time Between Failures (MTBF).
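The MTBF figure mentioned above is simple to compute from operational data. A minimal sketch, using made-up numbers (720 hours of operation, 3 observed failures) purely for illustration:

```python
# MTBF = total operational time / number of failures observed in that time.
def mtbf(total_operational_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures, in the same unit as the input hours."""
    if failure_count == 0:
        return float("inf")  # no observed failures in the window
    return total_operational_hours / failure_count

# Hypothetical month of operation: 720 hours, 3 failures -> 240 hours MTBF.
print(mtbf(720, 3))
```

A higher MTBF means the system runs longer, on average, before something breaks.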

Example: A financial transaction system is highly reliable if it consistently processes all transactions accurately, without losing data or miscalculating amounts, even in the face of minor component issues. If a debit card transaction consistently fails to deduct the correct amount from an account due to a software bug, the system is unreliable.

Techniques for enhancing reliability:

Redundancy:

Having multiple copies of data or services so that if one fails, another can take over. For example, storing data across multiple database servers or running multiple instances of a microservice.
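The idea of surviving a replica failure can be sketched in a few lines. The in-memory dictionaries below stand in for independent database servers; this is a toy model, not a real replication protocol (real systems must also handle consistency between replicas, which this sketch ignores).

```python
# Toy redundancy model: every write goes to all live replicas,
# and a read succeeds as long as any one replica is still alive.
class ReplicatedStore:
    def __init__(self, replica_count: int = 3):
        self.replicas = [{} for _ in range(replica_count)]
        self.alive = [True] * replica_count  # simulated node health

    def put(self, key, value):
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def get(self, key):
        for i, replica in enumerate(self.replicas):
            if self.alive[i] and key in replica:
                return replica[key]
        raise KeyError(key)

store = ReplicatedStore()
store.put("user:42", "alice")
store.alive[0] = False          # simulate one replica failing
print(store.get("user:42"))     # the read still succeeds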

Fault Tolerance:

Designing the system to continue operating even if some components fail.

Robust Error Handling:

Implementing mechanisms to detect, report, and recover from errors gracefully, rather than crashing or producing incorrect results.
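One common form of graceful recovery is retrying a transient failure with exponential backoff rather than crashing on the first error, and surfacing the error only once retries are exhausted. A minimal sketch (the operation, attempt count, and delays are hypothetical):

```python
import time

# Retry a flaky operation with exponential backoff; re-raise only after
# all attempts fail, so the error is reported rather than silently lost.
def call_with_retries(operation, max_attempts: int = 3,
                      base_delay: float = 0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: propagate so callers can handle it
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.01s, 0.02s, ...
```

Retries should only wrap operations that are safe to repeat; for writes, that is where idempotency becomes important.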

Availability

Availability is the proportion of time a system is operational and accessible to its users. It’s typically expressed as a percentage (e.g., “four nines” of availability means 99.99% uptime). High availability ensures that users can access the service whenever they need it, despite individual component failures, maintenance, or other disruptions.
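The "nines" shorthand translates directly into an allowed downtime budget. A quick sketch of the arithmetic:

```python
# Convert an availability percentage into permitted downtime per year.
def downtime_minutes_per_year(availability_pct: float) -> float:
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

# "Four nines" (99.99%) allows roughly 52.6 minutes of downtime per year;
# "three nines" (99.9%) allows roughly 8.8 hours.
print(round(downtime_minutes_per_year(99.99), 2))
print(round(downtime_minutes_per_year(99.9) / 60, 2))
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the cost of availability rises sharply at the high end.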

Example: An e-commerce website with high availability remains accessible for customers to browse products and make purchases almost continuously. Even if one server goes down, traffic is automatically rerouted to other healthy servers, preventing an outage for users. Conversely, a system that frequently experiences downtime for updates or due to single points of failure has low availability.

Techniques for enhancing availability: common approaches include redundancy with automatic failover, load balancing across healthy instances, data replication across zones or regions, continuous health checks, and rolling (zero-downtime) deployments.

Efficiency

Efficiency in distributed systems refers to how effectively the system utilizes its resources (CPU, memory, network bandwidth, storage) to deliver performance. It’s about getting the most work done with the least amount of resources. Key metrics include throughput (amount of work done per unit of time) and latency (time taken for a request to be processed).
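The two metrics named above are easy to compute from request logs. A minimal sketch, using hypothetical per-request latencies collected over a measurement window:

```python
# Compute throughput (requests/sec) and mean latency (ms) for one window.
def throughput_and_latency(latencies_ms, window_seconds):
    """latencies_ms: one entry per request completed in the window."""
    throughput = len(latencies_ms) / window_seconds
    avg_latency = sum(latencies_ms) / len(latencies_ms)
    return throughput, avg_latency

# Hypothetical window: 3 requests completed in 1 second.
print(throughput_and_latency([10, 20, 30], 1.0))
```

In practice, tail latencies (p95/p99) matter more than the mean, since averages hide the slow requests users actually notice.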

Example: A video streaming service that delivers high-quality video to millions of users with minimal buffering and low operational costs due to optimized data compression, efficient content delivery networks (CDNs), and intelligent resource allocation is highly efficient. If the service experiences frequent buffering for users or consumes excessive bandwidth, it’s inefficient.

Techniques for enhancing efficiency: common approaches include caching frequently accessed data, compressing payloads, serving static content from CDNs, pooling connections, processing work asynchronously, and allocating resources based on actual load.

Manageability

Manageability refers to the ease with which a distributed system can be operated, monitored, maintained, and updated. It encompasses the simplicity of deploying, configuring, troubleshooting, scaling, and evolving the system. A highly manageable system reduces operational complexity and the need for manual intervention.

Example: A microservices architecture with automated deployment pipelines (CI/CD), centralized logging and monitoring, and self-healing capabilities is highly manageable. Developers can quickly deploy new features, identify and resolve issues, and scale services without extensive manual effort. Conversely, a system that requires manual configuration across many servers, has fragmented logs, and lacks automated recovery mechanisms is poorly manageable.
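The self-healing behavior described above reduces, at its core, to a monitoring loop that checks health and restarts what has failed. A minimal sketch; `DemoService` and its `is_healthy`/`restart` methods are hypothetical stand-ins for real process supervision:

```python
# Hypothetical service handle; real systems would probe an HTTP health
# endpoint or a process supervisor instead.
class DemoService:
    def __init__(self, healthy: bool):
        self.healthy = healthy

    def is_healthy(self) -> bool:
        return self.healthy

    def restart(self):
        self.healthy = True  # simulated successful restart

def heal(services: dict) -> list:
    """Restart any unhealthy service and report which ones were restarted."""
    restarted = []
    for name, service in services.items():
        if not service.is_healthy():
            service.restart()
            restarted.append(name)
    return restarted

services = {"api": DemoService(healthy=False), "db": DemoService(healthy=True)}
print(heal(services))  # only the failed service is touched
```

Production systems layer backoff and alerting on top of this loop, so a service that keeps crashing escalates to a human instead of restarting forever.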

Techniques for enhancing manageability: common approaches include automated deployment pipelines (CI/CD), centralized logging and monitoring, infrastructure as code, clear observability dashboards with alerting, and self-healing or automated recovery mechanisms.

Relationship Between Reliability, Availability, Efficiency, and Manageability

While Reliability, Availability, Efficiency, and Manageability are distinct qualities in system design, they are highly interconnected and often influence each other. Optimizing one can frequently improve others, but sometimes, trade-offs must be considered.

Reliability and Availability: Closely Tied, but Distinct

Reliability and Availability are often confused but capture different aspects of system quality: availability measures whether the system is up and reachable, while reliability measures whether it behaves correctly when it is. A system can be highly available yet unreliable, for instance one that is almost always reachable but occasionally returns incorrect results; conversely, a very reliable system can have low availability if it is frequently taken offline for maintenance.

Availability and Efficiency: Balancing Access and Resource Use

Raising availability usually means provisioning redundant capacity that sits idle most of the time, so pushing availability higher tends to lower raw resource efficiency; the design challenge is finding an acceptable balance between the two.

Manageability and All Three: The Operational Multiplier

Manageability plays a crucial role in achieving and maintaining high levels of reliability, availability, and efficiency.

Interdependencies and Synergies

In essence, these four qualities form a critical quartet in distributed system design, each playing a vital role and influencing the others in a complex, yet interdependent, relationship. Optimizing for all four simultaneously is the hallmark of a well-engineered distributed system.

Conclusion

This article provides a foundational understanding of system design, emphasizing the transition from centralized to distributed architectures driven by performance demands. It thoroughly explains vertical and horizontal scaling with practical examples. A core focus is placed on the essential qualities of distributed systems: Reliability, Availability, Efficiency, and Manageability. The piece details various fault tolerance patterns (like Circuit Breaker, Retry, Bulkhead) and robust error handling methods (including Idempotency, Dead-Letter Queues, and Checksums), crucial for building resilient systems. Finally, it explores the intricate relationships between these qualities, highlighting how they collectively contribute to robust and high-performing distributed architectures.