Rockraft: A Strongly-Consistent KV Storage Framework Based on OpenRaft and RocksDB

Motivation #

The Redis protocol has become the de facto standard for key-value storage protocols. Beyond the official Redis implementation, we have seen various compatible alternatives:

  • Valkey: The Linux Foundation’s official fork of Redis 7.2, licensed under BSD. It is the community’s true open-source alternative after Redis switched to SSPL, fully compatible with the Redis protocol and persistence mechanisms.
  • Dragonfly: A modern multi-threaded in-memory database pursuing extreme performance, offering up to 25x the throughput of Redis with lower tail latency, but licensed under BSL (transitioning to Apache 2.0 after 4 years).
  • KeyDB: A multi-threaded Redis fork maintained by Snapchat, adding Active Replication and Flash storage extensions on top of 100% Redis API compatibility, though updates have slowed down.
  • Kvrocks: An Apache top-level project, a distributed KV storage based on RocksDB that persists data to disk, supporting dozens of terabytes at 1/5–1/10 the cost of in-memory solutions, suitable for large-capacity, low-cost scenarios.

However, none of the above implementations goes further on strong consistency in distributed deployments; in this dimension, they all adopt the eventual consistency of the native Redis replication model.

Initially, I created the coredb project to leverage the Raft consensus algorithm to build a Redis-protocol-compatible service with strong consistency. I know some people will wonder: we use Redis-like systems as caches, and in such projects we usually choose AP in the CAP theorem, prioritizing availability over consistency.

But returning to the conclusion at the beginning of this article: the Redis protocol has become the de facto standard for key-value storage protocols. Under this premise, its ecosystem value should not be limited to traditional in-memory caching. By introducing strongly consistent persistent storage, we can give it new vitality and application scenarios—just as the HTTP protocol evolved from an early web transfer protocol into a ubiquitous communication cornerstone.

At first, the project I built was only coredb, a Raft + RocksDB-based service that is both strongly consistent and Redis-protocol-compatible. That is, you can still access it with a Redis client, but it satisfies strong consistency: as long as a write returns success, it means it has been persisted on more than half of the cluster nodes.

During development, I realized that the “Raft + RocksDB” architecture combination has extremely high generic value. Considering that many developers may also need such a reliable underlying foundation to build their own strongly consistent storage systems, I decoupled this core logic and extracted it into a foundational framework project called Rockraft.

Design and Implementation #

Rockraft is developed in Rust, my current favorite systems programming language. Its type safety and memory safety make programming far more reassuring. Currently, there are two common Raft implementations in the Rust ecosystem:

  1. raft-rs

The most mature and production-proven Raft implementation in the Rust ecosystem, adopted by nearly a thousand production environments. It was ported from etcd’s Go implementation but completely rewritten in Rust, ensuring thread safety and memory safety.

Architectural characteristics:

  • Core consensus module: only provides the pure consensus algorithm core (Raft state machine), without log storage, network transport, or state machine implementation.
  • Highly customizable: you need to implement the Storage trait (log storage) and the RaftMessage transport layer yourself, offering extremely high flexibility.
  • Multi-Raft support: TiKV implements a Multi-Raft architecture based on this, supporting massive Region sharding.

Feature completeness:

  • ✅ Leader election, log replication, membership changes (Joint Consensus)
  • ✅ PreVote mechanism to avoid network partition interference
  • ✅ Leader Lease read optimization
  • ✅ Snapshot transmission
  • ✅ CheckQuorum mechanism

Production users of raft-rs include: TiKV (distributed transactional KV database).

Note: this library has entered maintenance mode and new feature development has slowed; new projects are advised to consider OpenRaft instead.

  2. OpenRaft 🚀 Modern Asynchronous Architecture

Design philosophy: fully asynchronous event-driven, no reliance on periodic ticks, message batching optimized for high throughput.

Core highlights:

  • Event-driven architecture: based on Raft events rather than polling, avoiding idle spinning and greatly improving resource utilization.
  • Unified API: a single Raft type, extending storage and network layers through three traits: RaftLogStorage, RaftStateMachine, and RaftNetwork.
  • Comprehensive membership changes: adopts the more general Joint Consensus, supporting arbitrary membership changes (adding/removing multiple nodes in a single step), rather than single-step changes.
  • Built-in observability: integrated tracing logs and distributed tracing, supporting compile-time log level adjustment.
  • Manual control: supports manually triggering election (trigger_elect), snapshot (trigger_snapshot), and log purging (purge_log), facilitating operations.

Feature characteristics:

  • ✅ Linearizable reads (ensure_linearizable)
  • ✅ Learner (Non-voter) role support
  • ✅ Dynamic heartbeat/election toggle control
  • ⛔️ Does not support single-step membership changes (a design trade-off favoring the safer Joint Consensus)

Its production users: Databend (cloud-native data warehouse), CnosDB (time-series database), RobustMQ (cloud-native message queue).
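The Joint Consensus point above can be made concrete with a small self-contained sketch (my own illustration, not OpenRaft code): during a membership change the cluster runs under a joint configuration, and an entry commits only when it is acknowledged by a majority of both the old and the new member sets. This is what makes arbitrary multi-node changes safe in a single step.

```rust
use std::collections::HashSet;

/// Returns true if `acks` contains a majority of `members`.
fn has_majority(members: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    let votes = members.intersection(acks).count();
    votes > members.len() / 2
}

/// In a joint configuration, an entry commits only when it is
/// acknowledged by a majority of BOTH the old and the new config.
fn joint_committed(old: &HashSet<u64>, new: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    has_majority(old, acks) && has_majority(new, acks)
}

fn main() {
    let old: HashSet<u64> = [1, 2, 3].into();
    let new: HashSet<u64> = [3, 4, 5].into();

    // Acks {1, 2}: majority of the old config (2/3) but not of the
    // new one (0/3) -> not committed during the joint phase.
    assert!(!joint_committed(&old, &new, &[1, 2].into()));

    // Acks {2, 3, 4}: majority of both configs -> committed.
    assert!(joint_committed(&old, &new, &[2, 3, 4].into()));

    println!("joint consensus quorum checks passed");
}
```

The double-majority rule is what prevents two disjoint quorums from forming while membership is in flux.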

Rockraft ultimately chose OpenRaft as the underlying Raft algorithm library, not only because of its more modern asynchronous architecture and strong customizability, but also because I am one of the contributors to this project :).

The architecture of Rockraft is shown in the figure below:

[Figure: Rockraft architecture]

The four core modules of the architecture are:

  • RPC / Communication Layer: Responsible for handling read and write requests from upper-layer clients, as well as inter-node cluster communication (such as Leader sending heartbeats, replicating logs, and Followers participating in elections). RaftNetwork is the abstract interface defined by OpenRaft, and Rockraft implements the underlying network transport logic at this layer.

  • Consensus Engine Layer (OpenRaft): This is the brain of the node (Raft Core), completely based on the OpenRaft library. It is responsible for maintaining all state transitions of the Raft state machine (Leader, Follower, Candidate, Learner), handling election timeouts, and calculating the Commit Index of logs. It itself is “stateless” and decoupled from storage.

  • Storage Abstract Adaptation Layer: This is the core glue layer of the Rockraft project. OpenRaft defines two core traits: RaftLogStorage and RaftStateMachine. Rockraft implements these two traits, telling the Raft engine: “When you need to store logs, call my methods; when you need to apply data to the business state machine, call my methods too.”

  • Physical Storage Layer (RocksDB): The final destination where data lands. In pursuit of ultimate performance, Rockraft usually shares a single RocksDB instance at the bottom layer, but uses Column Family (CF) technology for physical/logical isolation:

    • CF: Raft Logs: specifically stores Raft logs arranged in increasing Index order, leveraging RocksDB’s efficient sequential write characteristics.
    • CF: Hard State: stores Raft’s hard state (such as the current Term and the candidate voted for in it), ensuring that a restarted node does not vote twice in the same term, which could otherwise lead to split-brain.
    • CF: State Machine: stores the actual business data that has been Committed. Only after a log reaches majority consensus will the business operation in the log be replayed here (such as Put/Delete Key).
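As a sketch of how the Raft Logs CF can exploit RocksDB’s sequential-write strength, the snippet below (my own illustration, not Rockraft’s actual code) encodes log keys as big-endian indexes, so that RocksDB’s byte-ordered key space matches the numeric log order exactly:

```rust
/// Illustrative key encoding for the "Raft Logs" column family:
/// a big-endian u64 index keeps RocksDB's lexicographic key order
/// identical to numeric log order, so appends are sequential writes
/// and range scans iterate logs in index order.
fn log_key(index: u64) -> [u8; 8] {
    index.to_be_bytes()
}

fn main() {
    // Byte-wise (lexicographic) order equals numeric order.
    let keys: Vec<[u8; 8]> = (0..300).map(log_key).collect();
    let mut sorted = keys.clone();
    sorted.sort();
    assert_eq!(keys, sorted);

    // Little-endian would break this: 1 would sort AFTER 256.
    assert!(1u64.to_le_bytes() > 256u64.to_le_bytes());

    println!("big-endian log keys preserve order");
}
```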

How to Use #

Using Rockraft to implement a strongly consistent service is simple: you only need to specify an address and port in the service configuration for inter-node Raft communication, so the cluster can reach agreement on every write. For example, Rockraft ships with an HTTP service example that receives user requests through HTTP interfaces and uses Rockraft internally to synchronize data across the cluster.

| Method | Endpoint | Functionality | Consistency Requirement |
| ------ | -------- | ------------- | ----------------------- |
| GET | /get?key={key} | Query the value of a specified key | Linearizable read |
| POST | /set | Set a key-value pair | Leader only |
| POST | /delete | Delete a specified key | Leader only |
| POST | /batch_write | Atomic batch write (multi-operation transaction) | Leader only |
| POST | /txn | Conditional transaction execution (CAS support) | Leader only |
| POST | /getset | Atomically get the old value and set the new value | Leader only |
| GET | /prefix?prefix={prefix} | Prefix scan query | Local read |
| GET | /members | Get cluster member list | Local read |
| POST | /join | Add a node to the cluster | Leader only (auto-forwarded) |
| POST | /leave | Remove a node from the cluster | Leader only (auto-forwarded) |
| GET | /health | Health check (returns Leader status) | Local read |
| GET | /metrics | Cluster metrics (term, log index, etc.) | Local read |

The following is the flow of writing data through the HTTP interface:

┌────────┐    ┌─────────┐    ┌──────────┐    ┌──────────┐    ┌────────────────┐
│ Client │───►│  Axum   │───►│ RaftNode │───►│ OpenRaft │───►│ RocksLogStorage│
└────────┘    │ Handler │    │ write()  │    │          │    │   Append Log   │
              └─────────┘    └──────────┘    └────┬─────┘    └────────────────┘
                                                  │
                                                  │ Raft Consensus
                                                  │ (Replicate to Quorum)
                                                  ▼
                                        ┌───────────────────┐
                                        │ RocksStateMachine │
                                        │     Apply Log     │
                                        │  ┌─────────────┐  │
                                        │  │   RocksDB   │  │
                                        │  │ Put Key/Val │  │
                                        │  └─────────────┘  │
                                        └───────────────────┘
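The “Replicate to Quorum” step determines when the leader may advance the commit index: an entry is committed once it is present on a majority of nodes. A minimal sketch of that calculation (my own simplification; real Raft additionally requires the entry to be from the leader’s current term):

```rust
/// Given each node's highest replicated log index (its match index),
/// the leader's commit index is the largest index replicated on a
/// majority of nodes: sort descending and take the middle element.
fn commit_index(mut match_indexes: Vec<u64>) -> u64 {
    match_indexes.sort_unstable_by(|a, b| b.cmp(a)); // descending
    // The element at position n/2 (0-based) is the (n/2 + 1)-th
    // highest value, i.e. the highest index held by a majority.
    match_indexes[match_indexes.len() / 2]
}

fn main() {
    // 3 nodes: leader at 10, followers at 8 and 5 -> 2 of 3 hold >= 8.
    assert_eq!(commit_index(vec![10, 8, 5]), 8);

    // 5 nodes: 3 of 5 hold >= 9, so 9 is committed.
    assert_eq!(commit_index(vec![12, 11, 9, 4, 3]), 9);

    println!("commit index checks passed");
}
```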

The configuration file of this service:

node_id = 1
http_addr = "127.0.0.1:8001"

[raft]
address = "127.0.0.1:7001"
advertise_host = "localhost"
join = ["localhost:7002", "localhost:7003"]

Here:

  • 127.0.0.1:8001: the address and port for receiving HTTP client requests.
  • 127.0.0.1:7001: the address and port for inter-node Raft protocol communication within the cluster.
  • join: a list of raft.address endpoints of other nodes; after startup, the node immediately sends join-cluster requests to these addresses. An empty join list means the node starts in single-node mode.
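For a multi-node cluster, the other nodes use the same shape of configuration with their own ports. For example, a hypothetical configuration for node 2, assuming the three nodes use HTTP ports 8001–8003 and Raft ports 7001–7003 as in the join list above:

```toml
node_id = 2
http_addr = "127.0.0.1:8002"

[raft]
address = "127.0.0.1:7002"
advertise_host = "localhost"
join = ["localhost:7001", "localhost:7003"]
```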

The public APIs provided by RaftNode include:

| Method | Description |
| ------ | ----------- |
| write(entry) | Write a log entry (replicated via Raft) |
| batch_write(req) | Atomically batch write multiple KVs |
| read(req) | Read a KV (from the Leader state machine) |
| txn(req) | Execute a conditional transaction |
| getset(key, value) | Atomically get the old value and set the new value; this interface is designed to implement Redis’s GETSET command |
| scan_prefix(req) | Scan KVs by prefix |
| add_node(req) | Add a node to the cluster |
| remove_node(req) | Remove a node from the cluster |
| get_members(req) | Get the cluster member list |
| shutdown() | Gracefully shut down the node |

Currently, the conditional transactions supported by Rockraft include 8 comparison operations:

enum TxnOp {
    Exists,                // key exists
    NotExists,             // key does not exist
    Equal(Vec<u8>),        // value equals
    NotEqual(Vec<u8>),     // value not equal
    Greater(Vec<u8>),      // value greater than (lexicographical order)
    Less(Vec<u8>),         // value less than
    GreaterEqual(Vec<u8>), // value greater than or equal
    LessEqual(Vec<u8>),    // value less than or equal
}
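To make the semantics concrete, here is a self-contained sketch of how such a condition could be evaluated against the current value of a key, using lexicographic byte comparison. The variant names follow the enum above, but the `check` function itself is my own illustration, not Rockraft’s actual implementation:

```rust
enum TxnOp {
    Exists,
    NotExists,
    Equal(Vec<u8>),
    NotEqual(Vec<u8>),
    Greater(Vec<u8>),      // lexicographic byte order, like RocksDB keys
    Less(Vec<u8>),
    GreaterEqual(Vec<u8>),
    LessEqual(Vec<u8>),
}

/// Evaluate a condition against the current value of a key
/// (`None` means the key does not exist).
fn check(op: &TxnOp, current: Option<&[u8]>) -> bool {
    match (op, current) {
        (TxnOp::Exists, v) => v.is_some(),
        (TxnOp::NotExists, v) => v.is_none(),
        // All value comparisons fail if the key is absent.
        (_, None) => false,
        (TxnOp::Equal(e), Some(v)) => v == e.as_slice(),
        (TxnOp::NotEqual(e), Some(v)) => v != e.as_slice(),
        (TxnOp::Greater(e), Some(v)) => v > e.as_slice(),
        (TxnOp::Less(e), Some(v)) => v < e.as_slice(),
        (TxnOp::GreaterEqual(e), Some(v)) => v >= e.as_slice(),
        (TxnOp::LessEqual(e), Some(v)) => v <= e.as_slice(),
    }
}

fn main() {
    assert!(check(&TxnOp::Exists, Some(b"v1".as_slice())));
    assert!(check(&TxnOp::NotExists, None));
    assert!(check(&TxnOp::Equal(b"v1".to_vec()), Some(b"v1".as_slice())));
    // "b" > "a" in lexicographic byte order.
    assert!(check(&TxnOp::Greater(b"a".to_vec()), Some(b"b".as_slice())));
    // Value comparisons against a missing key fail.
    assert!(!check(&TxnOp::Equal(b"v1".to_vec()), None));
    println!("txn condition checks passed");
}
```

In a real transaction, all conditions would be checked inside the state machine’s apply step, so the result is identical on every replica.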

In the example directory, there is a complete example of using Rockraft to implement a strongly consistent HTTP protocol key-value service.

Conclusion #

Rockraft is still continuously evolving. The current optimization directions include:

  • Currently, Raft logs are stored in a special Column Family of the underlying RocksDB. It may be worth referencing the implementation of the Raft Engine project to optimize this part of storage performance.
  • OpenRaft currently does not natively support multi-Raft groups. This part may also be implemented in Rockraft.
  • Use tools such as Chaos Mesh to add chaos testing.

At the same time, you may also want to follow the coredb project, which is built on Rockraft. As I said at the beginning of this article: the Redis protocol is already the de facto universal key-value storage protocol, and it is entirely possible to build more than pure in-memory caching on top of it. coredb is one such exploration in the CP direction.