Distributed System Fundamentals
What is a Distributed System?
A system that uses multiple networked machines to solve problems that a single machine would otherwise handle alone
Multi-point storage, multi-point computing
Why Do We Need Distributed Systems?
To meet the growing demand for storage and computational resources, spreading load that would otherwise fall on a single system
Three Major Problems Distributed Systems Must Address
- Storage
- Transmission (Communication)
- Computation
Goals of Distributed Systems
Able to handle N-fold growth in problems by purchasing N times more hardware
Characteristics of Distributed Systems
Scalability
Can be divided into three dimensions:
- Size scalability (scale-out): adding nodes should speed up the entire system roughly linearly; growing the data volume should not increase latency
- Geographic scaling: Using geographically distributed data centers to reduce user response time, properly handling cross-datacenter response latency
- Administrative scaling: Adding more nodes should not increase the management burden of the entire system
Performance
- Tasks can be processed quickly (low latency)
- High throughput (task processing rate)
- Consumes fewer computational resources
These factors require constant trade-offs. Low latency is arguably the most important, because it is bound by physical limits that cannot simply be bought away with more hardware
Latency: the time from when an event occurs until its effect becomes visible
In a system whose data never changes there is no latency problem, since every copy can be prepared in advance
The ideal minimum latency depends on the physical distance information must travel
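That physical floor can be estimated with a quick calculation (a sketch: the fiber propagation factor and the distance below are illustrative assumptions):

```python
# Lower bound on network latency imposed by physics.
# Assumption: signals travel at roughly 2/3 of c in fiber-optic cable.
SPEED_OF_LIGHT_KM_S = 299_792   # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3            # typical propagation speed in fiber

def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time in milliseconds over fiber."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000

# New York to London is roughly 5,600 km: no amount of hardware
# can push the round trip below ~56 ms on that path.
print(round(min_round_trip_ms(5600)))  # → 56
```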
Availability (Fault Tolerance)
One advantage distributed systems hold over a single machine is fault tolerance: a single machine cannot tolerate faults, since it either works as a whole or fails as a whole
A distributed system can be built from a group of unreliable components yet still present a reliable system at the top level
A system without redundancy is only as available as its components; a system designed with redundancy can tolerate partial failures and therefore achieve higher availability
Availability = Uptime / (Uptime + Downtime)
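A small sketch of that formula, extended with the redundancy point above (the uptime figures are illustrative, and the redundancy formula assumes failures are independent, which is optimistic in practice):

```python
def availability(uptime: float, downtime: float) -> float:
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime / (uptime + downtime)

def redundant_availability(component: float, replicas: int) -> float:
    """A redundant system fails only when every replica fails.
    Assumes independent failures (optimistic in practice)."""
    return 1 - (1 - component) ** replicas

# A component that is down ~8.77 hours out of 8,760 hours in a year:
single = availability(uptime=8760 - 8.77, downtime=8.77)
print(f"{single:.4f}")                             # → 0.9990 ("three nines")

# Two independent replicas of that same component:
print(f"{redundant_availability(single, 2):.6f}")  # → 0.999999
```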
Fault tolerance: The ability to continue working in predefined ways when errors occur
In short, we cannot tolerate errors that we did not anticipate
Limitations of Distributed Systems
Physically:
- Number of nodes
- Distance between nodes
As the number of nodes increases, system availability drops (more components can fail), management overhead grows, and node-to-node communication costs degrade performance; as the distance between nodes increases, so does the minimum waiting time (a physical limit)
Performance and availability can be framed as guarantees (reliability) provided to external systems. These guarantees are agreements between services, for example:
- How quickly can data written at one node be read from other nodes?
- Is written data guaranteed to be persisted?
- Within what timeframe will a computation request return a result?
- What impact does a component failure have on the system as a whole?
Errors are incorrect behavior; exceptions are unanticipated behavior
System Abstraction and Modeling
Good abstractions make systems easier to understand
Systems that hide details externally are easier to understand, but systems that expose more details can achieve better optimization
An ideal abstraction is expressed clearly while still meeting the needs of the business
Design Essence: Partition and Replicate
A dataset can be split into smaller, independent pieces distributed across nodes for parallel processing (partitioning), or it can be replicated or cached on multiple nodes to shorten the distance between clients and servers and to improve fault tolerance (replication).
Partitioning
Dividing data into distinct, independent subsets addresses growing data volume and improves processing performance and availability. Data should be partitioned according to its primary access pattern, and the design must handle the problems independent subsets introduce (for example, access that spans subsets is inefficient)
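A minimal sketch of one common scheme, hash partitioning (node names and key format are illustrative assumptions; plain modulo hashing also reshuffles most keys when the node count changes, which schemes like consistent hashing mitigate):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def partition_for(key: str) -> str:
    """Deterministically map a key to one node by hashing it."""
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# The same key always lands on the same node, so single-key access
# stays local; a query spanning many keys may touch every node.
assert partition_for("user:42") == partition_for("user:42")
print(partition_for("user:42") in NODES)  # → True
```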
Replication
Replication applies additional computational resources to new copies of the data, reducing latency while enhancing availability
Replication can enhance scalability, performance, and reliability
Replication can avoid single points of failure, accelerate computation, cache I/O to improve throughput
Problems brought by replication:
- Cluster data synchronization (maintaining data consistency)
Choosing a consistency model is crucial: a strong consistency model lets you write programs as if there were only a single copy of the data, while a weak consistency model offers lower latency and higher availability
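A toy sketch of the weak-consistency side of that trade-off, using asynchronous primary/replica replication (class and key names are illustrative assumptions; real systems ship the log over a network):

```python
class Primary:
    """Accepts writes immediately; ships them to replicas later."""
    def __init__(self):
        self.data = {}
        self.log = []  # writes not yet shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    """Serves reads; may lag behind the primary."""
    def __init__(self):
        self.data = {}

    def apply(self, log):
        for key, value in log:
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("x", 1)
print(replica.data.get("x"))  # → None: a stale read (weak consistency)

replica.apply(primary.log)    # replication eventually catches up
primary.log.clear()
print(replica.data.get("x"))  # → 1: the replica has converged
```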