Distributed System Illustrated on codedump notes

The Charm of Distributed Systems

Thu, 25 Jun 2026 00:00:00 +0000

This is a story about “how to build reliable systems out of unreliable components.”

When you open your phone on a rainy night to order food delivery, the system behind it carries out a complex collaboration: your request passes through a load balancer and is routed to an instance of the order service; the order data is persisted to the master node of the database and synchronized within milliseconds to two replica nodes in other datacenters; meanwhile, the inventory service in another cluster deducts stock, and the payment system calls a third-party gateway. If any step fails, a series of compensating actions is triggered automatically. The whole process spans dozens of servers, unreliable network links, and hardware that might fail at any time, yet in most cases your food still arrives on time.

Chapter 1: Overview of Distributed Systems

Thu, 25 Jun 2026 00:00:00 +0000

In the evolution of modern software engineering, the transition from single-node applications to distributed architectures represents a critical watershed. This leap is not merely a matter of hardware stacking or code migration, but a profound transformation involving shifts in mindset, design philosophy, and even a renewed understanding of the laws of physics.

As the opening chapter of this book, this chapter leads readers out of the comfort zone of single-node systems to confront the real challenges of distributed environments. We will first clarify the core definition of distributed systems and analyze their essential differences from, and advantages over, centralized systems. Next, we will focus on the unavoidable technical problems in this field: unreliable network communication, the absence of a shared global clock, and the uncertainty brought by partial failures. Finally, this chapter will elaborate on the mindset shift architects must undergo during this transition: moving from pursuing absolute certainty to seeking a balanced trade-off between consistency and availability. Understanding these fundamental concepts and paradigms is a prerequisite for mastering the complex algorithms and consistency models in subsequent chapters, and it is also the cornerstone of building highly reliable and scalable systems.

Chapter 2: Models of Distributed Systems

Thu, 25 Jun 2026 00:00:00 +0000

In the previous chapter, we compared single-node systems with distributed systems. A single-node system communicates through shared memory, has a single global clock, and behaves deterministically when errors occur, which makes programming on such systems relatively straightforward. In contrast, a distributed system consists of multiple nodes that communicate via messages, which makes it significantly more complex. Therefore, before designing a distributed system, we need a theoretical framework to describe how the system operates. Designers must state their assumptions about the runtime environment at design time, because different assumptions entail different implementation challenges. Models of distributed systems are the theoretical frameworks used to describe and analyze the behavior, properties, and design of distributed systems. They provide an abstraction for studying communication, computation, failures, and synchronization, helping designers and researchers understand system behavior and solve practical problems in complex environments.

Chapter 3: Time and Order in Distributed Systems

Thu, 25 Jun 2026 00:00:00 +0000

In distributed systems, multiple nodes work together. Client requests are sent to different nodes for processing, and these requests become events on individual nodes. As we will see, the system’s state is formed by executing these events one by one in a specific order. Therefore, the order of events is particularly important. Different nodes may see different sequences of events, which can lead to different states on these nodes.
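The effect of event order on state can be seen in a minimal sketch. The event format (`"set"` and `"add"` operations on a single integer) and the function name `apply_events` are assumptions made for illustration only; they are not taken from the book.

```python
def apply_events(events, initial=0):
    """Apply events one by one, in the given order, and return the final state."""
    state = initial
    for op, value in events:
        if op == "set":
            state = value
        elif op == "add":
            state += value
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


# Two hypothetical events that both nodes receive.
set_to_10 = ("set", 10)
add_5 = ("add", 5)

# Node A sees "set" first; node B sees "add" first.
state_on_node_a = apply_events([set_to_10, add_5])
state_on_node_b = apply_events([add_5, set_to_10])

print(state_on_node_a, state_on_node_b)  # prints "15 10"
```

Both nodes processed the same two events, yet they end up in different states because the events were applied in different orders.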

From this, it is evident that when processing multiple events, the order in which they occur affects the state of nodes. Thus, how to measure the order of events becomes a core problem in distributed systems. A natural idea is to sort events according to their physical time. In the following sections, we will learn about the principles of physical clocks and, unfortunately, discover that in a distributed system, comparing physical time across multiple nodes to determine the order of events is not precise and can sometimes even lead to errors. Once the order of events is disrupted, the state of the entire distributed cluster becomes corrupted.
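Why comparing physical timestamps across nodes can give the wrong order is easy to simulate. The sketch below assumes two nodes whose clocks differ by a fixed, made-up offset of 50 ms; the `Event` type, the offsets, and the event names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    node: str
    true_time_ms: int  # real time of occurrence; no node can observe this directly


# Assumed clock offsets: node B's clock runs 50 ms behind node A's.
CLOCK_OFFSET_MS = {"A": 0, "B": -50}


def local_timestamp(event):
    """Timestamp the node would record using its own (skewed) clock."""
    return event.true_time_ms + CLOCK_OFFSET_MS[event.node]


events = [
    Event("write x=1", node="A", true_time_ms=1000),
    Event("write x=2", node="B", true_time_ms=1020),  # really happens 20 ms later
]

true_order = [e.name for e in sorted(events, key=lambda e: e.true_time_ms)]
observed_order = [e.name for e in sorted(events, key=local_timestamp)]

print(true_order)      # ['write x=1', 'write x=2']
print(observed_order)  # ['write x=2', 'write x=1']
```

Because the skew (50 ms) is larger than the real gap between the events (20 ms), sorting by recorded timestamps reverses their actual order.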

Chapter 4: Replication

Fri, 26 Jun 2026 00:00:00 +0000

In distributed systems, data replication is one of the core design strategies. Its primary purpose is to improve system reliability, availability, performance, and fault tolerance by redundantly storing identical copies of data:

High Availability: If only a single node provides service, a single-node failure will render the service unavailable. By replicating data across multiple nodes, even if one node fails, other nodes can continue to provide service.
Fault Tolerance and Disaster Recovery: Hardware failures, network partitions, or data center disasters can lead to permanent data loss. Multi-copy storage (e.g., cross-rack or cross-region replication) ensures that data is recoverable.
Reduced Latency: The physical distance between users and data centers causes access latency (e.g., when accessing cross-border services). Replicating data to geographically distributed nodes allows users to access the nearest replica.
Read Performance Optimization: A single node can become a bottleneck for read requests. Multiple replicas can distribute the read load to improve read performance.

Depending on whether a primary node takes part, replication is divided into primary-backup replication and leaderless replication. In primary-backup replication, a central node is responsible for coordinating write operations and replicating them to the other copies. Leaderless replication, by contrast, is a decentralized architecture.
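As a rough illustration of the primary-backup idea, the following in-memory sketch has a primary that applies each write locally and then forwards it to every backup. The class names, the key `"order:42"`, and the choice of synchronous forwarding are assumptions for the example; real systems must also handle failures, ordering, and acknowledgements, which are omitted here.

```python
class Replica:
    """A node holding a full copy of the data (in memory, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Primary(Replica):
    """The central node: accepts writes and replicates them to the backups."""

    def __init__(self, name, backups):
        super().__init__(name)
        self.backups = backups

    def write(self, key, value):
        self.apply(key, value)
        # Assumption: replication is synchronous and never fails.
        for backup in self.backups:
            backup.apply(key, value)


backups = [Replica("backup-1"), Replica("backup-2")]
primary = Primary("primary", backups)
primary.write("order:42", "paid")

# If the primary becomes unavailable, a backup can still serve the read.
value_from_backup = backups[0].read("order:42")
print(value_from_backup)  # prints "paid"
```

Reads can be spread across `backups` as well, which is how additional replicas relieve the read load on a single node.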

Chapter 5: Distributed Consensus Algorithms

Fri, 26 Jun 2026 00:00:00 +0000

In the preceding chapters, we discussed how replication can prevent data loss caused by single points of failure, how partitioning can handle massive amounts of data, and how distributed transactions can ensure the atomicity of cross-node operations.

However, when we try to combine these mechanisms into a truly production-grade system that runs 24/7 with automatic fault tolerance, we discover an obvious gap: how can surviving nodes reach agreement on the system state when some nodes have failed?

Chapter 6: Partitioning

Fri, 26 Jun 2026 00:00:00 +0000

So far, in the systems we have discussed, we have assumed that every machine stores all data. In the earlier chapter on replication, when the primary node receives a write request from a client, it saves the entire dataset both locally and on other replica nodes. This storage approach has the following problems:

Scalability: If data replication is done in a primary-backup manner, all the work falls on the primary node. Under heavy load, the primary node quickly becomes a bottleneck.
Single-node bottleneck: A single node, due to the physical limits of its hardware (disk, memory, CPU, etc.), will always hit an upper bound on processing capacity.
Failure isolation: If a node hosting a specific portion of the data fails, that data becomes unavailable, reducing the overall availability of the system. For example, if data from different cities were stored on different nodes, a service outage in one region would not affect data in other regions.
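The city example can be sketched with a simple lookup table that assigns each city's data to one node. The city names, node names, and helper functions below are hypothetical, and a fixed table is only one possible way to assign data to nodes.

```python
# Hypothetical assignment of each city's data to a single node.
CITY_TO_NODE = {
    "beijing": "node-0",
    "shanghai": "node-1",
    "shenzhen": "node-2",
}


def node_for(city):
    """Return the node responsible for a city's data."""
    return CITY_TO_NODE[city]


def reachable_cities(failed_nodes):
    """Cities whose data is still served when the given nodes are down."""
    return sorted(
        city for city, node in CITY_TO_NODE.items() if node not in failed_nodes
    )


still_served = reachable_cities(failed_nodes={"node-1"})
print(still_served)  # ['beijing', 'shenzhen']
```

With the data split this way, losing `node-1` affects only the city stored there; the other cities remain available.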

When scaling a system, engineers typically choose between two primary strategies, as illustrated below:

Chapter 7: Transactions

Fri, 26 Jun 2026 00:00:00 +0000

So far, we have introduced replication and partitioning. Replication (including consensus algorithms) improves system fault tolerance, while partitioning improves system scalability; these two techniques address the physical problems of data. In addition, data access in distributed systems often faces “logical problems”, which are solved by transactions, the topic of this chapter:

Replication: The main goal is high availability and data redundancy. By storing identical copies of data on different nodes, when one node fails, the system can continue to provide service from other replicas. It answers the question: “Will my data be lost or inaccessible because one machine goes down?”
Partitioning (Sharding): The main goal is scalability. When a single node’s storage or computing capacity cannot handle all the data and requests, we horizontally split the data across multiple nodes. It answers the question: “How does my system handle ever-growing data volumes and access pressure?”
Transaction: The main goal is the correctness of data operations. It packages a series of operations into an indivisible logical unit, ensuring that these operations either all succeed or all fail, and that they do not interfere with each other when executed concurrently. It answers the question: “How can I ensure that a series of related operations maintain data correctness under any circumstances (concurrency, failures)?”
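The "all succeed or all fail" property of a transaction can be illustrated with a single-process sketch that snapshots the data before running a group of operations and restores the snapshot if any of them fails. The `Store` class, the `debit` and `credit` helpers, and the account names are assumptions for the example. The sketch covers atomicity only; it does not address concurrent execution or multiple nodes.

```python
class InsufficientFunds(Exception):
    pass


class Store:
    """A tiny in-memory store with all-or-nothing transactions."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def transaction(self, operations):
        snapshot = dict(self.balances)  # remembered so we can roll back
        try:
            for op in operations:
                op(self.balances)
        except Exception:
            self.balances = snapshot  # undo every partial change
            raise


def debit(account, amount):
    def op(balances):
        if balances[account] < amount:
            raise InsufficientFunds(account)
        balances[account] -= amount
    return op


def credit(account, amount):
    def op(balances):
        balances[account] += amount
    return op


store = Store({"alice": 100, "bob": 0})

try:
    # The credit succeeds, then the debit fails: both must be undone.
    store.transaction([credit("bob", 150), debit("alice", 150)])
except InsufficientFunds:
    pass

print(store.balances)  # {'alice': 100, 'bob': 0}

# A transfer that can be fully applied takes effect as a whole.
store.transaction([debit("alice", 30), credit("bob", 30)])
print(store.balances)  # {'alice': 70, 'bob': 30}
```

Without the rollback, the failed transfer would have left `bob` with 150 that was never taken from `alice`, which is exactly the kind of incorrect intermediate state a transaction is meant to prevent.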

Why are replication and partitioning alone not enough? Let us look at a few typical scenarios: