Distributed System Fundamentals
What is a Distributed System?
A system that uses multiple networked machines to solve problems that a single machine would otherwise handle alone
Multi-point storage, multi-point computing
Why Do We Need Distributed Systems?
To meet the growing demand for storage and computational resources, spreading load that would otherwise fall on a single system
Three Major Problems Distributed Systems Must Address
- Storage
- Transmission (Communication)
- Computation
Goals of Distributed Systems
Able to handle N-fold growth in problems by purchasing N times more hardware
Characteristics of Distributed Systems
Scalability
Can be divided into three dimensions:
- Size scalability (scale-out): adding nodes should speed up the entire system roughly linearly; growing the data volume should not increase latency
- Geographic scaling: Using geographically distributed data centers to reduce user response time, properly handling cross-datacenter response latency
- Administrative scaling: Adding more nodes should not increase the management burden of the entire system
Performance
- Tasks can be processed quickly (low latency)
- High throughput (task processing rate)
- Consumes fewer computational resources
These factors require constant trade-offs. Low latency is arguably the most important, because it is bound by physical limits that cannot simply be bought away with more hardware
Latency: the time from when an event occurs until its effect becomes visible
In a system whose data never changes there is no latency problem, since every copy can be prepared in advance
The ideal minimum latency depends on the physical distance information must travel
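That physical floor can be estimated with a quick calculation (a sketch: the fiber propagation factor and the distance below are illustrative assumptions):

```python
# Lower bound on network latency imposed by physics.
# Assumption: signals travel at roughly 2/3 of c in fiber-optic cable.
SPEED_OF_LIGHT_KM_S = 299_792   # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3            # typical propagation speed in fiber

def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time in milliseconds over fiber."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000

# New York to London is roughly 5,600 km: no amount of hardware
# can push the round trip below ~56 ms on that path.
print(round(min_round_trip_ms(5600)))  # → 56
```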
Availability (Fault Tolerance)
One advantage distributed systems hold over a single machine is fault tolerance: a single machine cannot tolerate faults, since it either works as a whole or fails as a whole
A distributed system can be built from a group of unreliable components yet still present a reliable system at the top level
A system without redundancy is only as available as its components; a system designed with redundancy can tolerate partial failures and therefore achieve higher availability
Availability = Uptime / (Uptime + Downtime)
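A small sketch of that formula, extended with the redundancy point above (the uptime figures are illustrative, and the redundancy formula assumes failures are independent, which is optimistic in practice):

```python
def availability(uptime: float, downtime: float) -> float:
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime / (uptime + downtime)

def redundant_availability(component: float, replicas: int) -> float:
    """A redundant system fails only when every replica fails.
    Assumes independent failures (optimistic in practice)."""
    return 1 - (1 - component) ** replicas

# A component that is down ~8.77 hours out of 8,760 hours in a year:
single = availability(uptime=8760 - 8.77, downtime=8.77)
print(f"{single:.4f}")                             # → 0.9990 ("three nines")

# Two independent replicas of that same component:
print(f"{redundant_availability(single, 2):.6f}")  # → 0.999999
```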
Fault tolerance: The ability to continue working in predefined ways when errors occur
In short, we cannot tolerate errors that we did not anticipate
Limitations of Distributed Systems
Physically:
- Number of nodes
- Distance between nodes
As the number of nodes increases, system availability drops (more components can fail), management overhead grows, and node-to-node communication costs degrade performance; as the distance between nodes increases, so does the minimum waiting time (a physical limit)
Performance and availability can be framed as guarantees (reliability) provided to external systems. These guarantees are agreements between services, for example:
- How quickly can data written at one node be read from other nodes?
- Is written data guaranteed to be persisted?
- Within what timeframe will a computation request return a result?
- What impact does a component failure have on the system as a whole?
Errors are incorrect behavior; exceptions are unanticipated behavior
System Abstraction and Modeling
Good abstractions make systems easier to understand
Systems that hide details externally are easier to understand, but systems that expose more details can achieve better optimization
An ideal abstraction is expressed clearly while still meeting the needs of the business
Design Essence: Partition and Replicate
A dataset can be split into smaller, independent pieces distributed across nodes for parallel processing (partitioning), or it can be replicated or cached on multiple nodes to shorten the distance between clients and servers and to improve fault tolerance (replication).
Partitioning
Dividing data into distinct, independent subsets addresses growing data volume and improves processing performance and availability. Data should be partitioned according to its primary access pattern, and the design must handle the problems independent subsets introduce (for example, access that spans subsets is inefficient)
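A minimal sketch of one common scheme, hash partitioning (node names and key format are illustrative assumptions; plain modulo hashing also reshuffles most keys when the node count changes, which schemes like consistent hashing mitigate):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def partition_for(key: str) -> str:
    """Deterministically map a key to one node by hashing it."""
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# The same key always lands on the same node, so single-key access
# stays local; a query spanning many keys may touch every node.
assert partition_for("user:42") == partition_for("user:42")
print(partition_for("user:42") in NODES)  # → True
```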
Replication
Replication applies additional computational resources to new copies of the data, reducing latency while enhancing availability
Replication can enhance scalability, performance, and reliability
Replication can avoid single points of failure, accelerate computation, cache I/O to improve throughput
Problems brought by replication:
- Cluster data synchronization (maintaining data consistency)
Choosing a consistency model is crucial: a strong consistency model lets you write programs as if there were only a single copy of the data, while a weak consistency model offers lower latency and higher availability
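A toy sketch of the weak-consistency side of that trade-off, using asynchronous primary/replica replication (class and key names are illustrative assumptions; real systems ship the log over a network):

```python
class Primary:
    """Accepts writes immediately; ships them to replicas later."""
    def __init__(self):
        self.data = {}
        self.log = []  # writes not yet shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    """Serves reads; may lag behind the primary."""
    def __init__(self):
        self.data = {}

    def apply(self, log):
        for key, value in log:
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("x", 1)
print(replica.data.get("x"))  # → None: a stale read (weak consistency)

replica.apply(primary.log)    # replication eventually catches up
primary.log.clear()
print(replica.data.get("x"))  # → 1: the replica has converged
```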