Distributed Systems Series — Part 5.3: Scalability & Performance
The Write Scalability Problem
Post 5.1 established that data scalability — the ability to handle growing data volume — is a distinct problem from load scalability. Post 3.8 established that write throughput in leader-based replication is bounded by the leader’s capacity — reads scale with replicas, writes do not. Adding read replicas increases read capacity proportionally but does nothing for write throughput.
Partitioning — also called sharding — is the primary solution to both problems simultaneously. It divides data across multiple nodes so that each node owns a subset of the key space and serves reads and writes for only that subset. Ten shards means ten independent leaders, each accepting writes for its portion of the data. Write throughput scales with shard count. Data volume scales with shard count. Neither is bounded by the capacity of a single node.
Partitioning is also the most complex scalability mechanism in distributed systems. The partitioning strategy determines query efficiency, load distribution, rebalancing cost, and the class of application bugs that will appear in production. Choosing the wrong strategy for a workload’s access patterns produces systems that scale in the dimension that was not the bottleneck while creating new bottlenecks in the dimension that matters.
This post covers the three core partitioning strategies — range partitioning, hash partitioning, and consistent hashing — with their specific trade-offs, the hot partition problem that derails most initial sharding implementations, virtual nodes as the production solution to load imbalance, rebalancing strategies and their operational cost, cross-partition queries and why they break the scalability promise, and production examples from Cassandra, DynamoDB, CockroachDB, and MongoDB.
What Partitioning Solves and What It Does Not
Partitioning solves the single-node bottleneck for both writes and storage. With N shards, a system can sustain approximately N times the write throughput of a single-node system and store approximately N times the data volume, assuming roughly equal distribution of load across shards.
Partitioning does not solve every scalability problem. It makes some problems harder. Cross-partition queries — queries that must aggregate data from multiple shards — are more expensive after partitioning than before. A query that reads all records matching a condition must scatter across all shards if the condition does not align with the partition key, collect results from each, and merge them into a final result set. This scatter-gather pattern is slow, resource-intensive, and produces unpredictable latency (determined by the slowest shard). An application designed before partitioning, whose query patterns frequently cross partition boundaries, will be slower and more complex after partitioning, not faster.
The first design decision when introducing partitioning is therefore not which partitioning strategy to use — it is which partition key to use. The partition key must align with the application’s most frequent access patterns so that the majority of queries touch one shard rather than many. A user-centric application that queries primarily by user ID should partition by user ID. A time-series application that queries primarily by time range should consider range partitioning on timestamp. A multi-tenant SaaS application that queries primarily within a tenant should partition by tenant ID.
This alignment is a design constraint that affects the application’s data model, API design, and query patterns. It is significantly easier to choose the right partition key before building the application than to change it after the application is in production with established data and query patterns.
Range Partitioning
Range partitioning assigns contiguous ranges of the key space to each shard. Keys that sort between A and G go to Shard 1. Keys between H and N go to Shard 2. Keys between O and Z go to Shard 3. The partition boundaries — called split points — define where one shard’s range ends and the next begins.
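As a sketch, routing under range partitioning is just a sorted-list lookup over the split points. The code below mirrors the A–G / H–N / O–Z example above; the shard names and single-letter boundaries are illustrative, not any system’s real metadata.

```python
import bisect

# Illustrative split points: shard-1 owns keys up to G (inclusive),
# shard-2 owns H..N, shard-3 owns O..Z.
SPLIT_POINTS = ["G", "N"]
SHARDS = ["shard-1", "shard-2", "shard-3"]

def route(key: str) -> str:
    """Route a key to the shard whose range contains it."""
    idx = bisect.bisect_left(SPLIT_POINTS, key[:1].upper())
    return SHARDS[idx]

def shards_for_range(lo: str, hi: str) -> list[str]:
    """A range scan touches only the contiguous shards overlapping [lo, hi]."""
    i = bisect.bisect_left(SPLIT_POINTS, lo[:1].upper())
    j = bisect.bisect_left(SPLIT_POINTS, hi[:1].upper())
    return SHARDS[i : j + 1]
```

A scan from “frank” to “kate” touches only shard-1 and shard-2; no other shard is contacted, which is the property the next paragraph relies on.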
The primary advantage of range partitioning is query efficiency for range scans and ordered iteration. A query for all records with keys between “customer_1000” and “customer_2000” touches exactly the shards whose range overlaps that interval — typically one or two shards. The scatter-gather cost is bounded by the range width, not by the total number of shards. This makes range partitioning the correct choice for time-series data (queries for events in a time window), geographic data (queries for records in a geographic region), and any workload where the most common queries are range queries rather than point lookups.
The primary disadvantage is hot partitions caused by data access skew. If keys are not uniformly distributed — which they almost never are in real-world data — some ranges receive dramatically more traffic than others. A user database partitioned alphabetically by username where most users have names starting with common letters produces a hot shard for those letters. A time-series database partitioned by timestamp where recent timestamps receive all new writes produces a hot shard at the end of the range (the “write hot end” problem). The hot shard becomes a bottleneck regardless of how many total shards exist, because all the work is concentrated in one place.
HBase, Google’s Bigtable, and TiKV all use range partitioning. They manage the hot partition problem through automatic split detection — when a shard grows too large or receives too much traffic, the system splits it into two smaller shards and migrates one half to another node. This is effective but introduces operational complexity: splits must be detected, coordinated, and executed without interrupting ongoing traffic, and the receiving node must have sufficient capacity to absorb the migrated half.
Hash Partitioning
Hash partitioning applies a hash function to the key and uses the hash value to determine the shard assignment: shard = hash(key) mod N where N is the number of shards. Because hash functions produce uniformly distributed outputs for diverse inputs, hash partitioning distributes load evenly across shards regardless of the distribution of the keys themselves. A workload where 90% of traffic is for keys starting with “A” will not produce a hot shard under hash partitioning — the hash function distributes those keys across all shards uniformly.
The primary advantage is load distribution. Hash partitioning is the correct choice for workloads dominated by point lookups (retrieve the record for key X) where the key distribution is skewed. Most web application workloads — user records, session data, product catalogs — have this characteristic. DynamoDB’s default partitioning uses hash partitioning on the primary key for this reason.
Hash partitioning has two significant disadvantages. First, range scans are impossible. A query for all records with keys between “customer_1000” and “customer_2000” must scatter to all shards, because there is no relationship between key lexicographic order and shard assignment. Adjacent keys (in sorted order) are on random shards. This makes hash partitioning inappropriate for any workload that requires ordered iteration or range queries.
Second, adding or removing shards requires migrating most of the data. With N shards, all keys are assigned by hash(key) mod N. When a shard is added (N becomes N+1), almost all keys change assignment: hash(key) mod N produces a different result from hash(key) mod (N+1) for most keys. In practice, approximately N/(N+1) of all data must move to new shards when one shard is added — at N=10 shards, adding one shard requires migrating approximately 90% of the data. This makes resharding operationally expensive and risky, which is why consistent hashing was developed as an alternative.
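The migration cost is easy to demonstrate empirically. The sketch below uses MD5 to get a stable hash (Python’s built-in `hash()` is salted per process) and counts how many of 100,000 arbitrary keys change shards when N goes from 10 to 11.

```python
import hashlib

def shard_for(key: str, n: int) -> int:
    # Stable 64-bit hash of the key, then modulo the shard count.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"user_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard_for(k, 10) != shard_for(k, 11))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 10/11, about 91%
```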
Consistent Hashing: Minimising Rebalancing Cost
Consistent hashing solves the resharding cost problem of simple hash partitioning while preserving even load distribution. The key insight: instead of mapping keys to shards directly, map both keys and nodes to positions on a conceptual ring (a circular key space from 0 to 2^32). Each node is responsible for the keys between its position and its predecessor’s position on the ring.
When a node is added to the ring, it is assigned a position on the ring and takes ownership of the keys between its position and its predecessor’s position. Only the keys in that arc need to move — approximately 1/N of all keys, where N is the number of nodes before the addition. When a node is removed, its keys are transferred to its successor on the ring. Again, only 1/N keys move — the keys that were on the removed node. Every other node is unaffected.
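A minimal ring fits in a few lines. The sketch below assumes a 32-bit MD5-derived position and one position per node; all names are illustrative. It also checks the core property stated above: every key that moves when a node joins moves to the new node, and no other key is touched.

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    """Map a string to a position on a 32-bit ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, nodes: list[str]):
        self.tokens = sorted((ring_pos(n), n) for n in nodes)
        self.positions = [p for p, _ in self.tokens]

    def owner(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect_right(self.positions, ring_pos(key)) % len(self.tokens)
        return self.tokens[idx][1]

keys = [f"key-{i}" for i in range(20_000)]
before = Ring(["node-1", "node-2", "node-3", "node-4"])
after = Ring(["node-1", "node-2", "node-3", "node-4", "node-5"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]
# Every key that moved was taken over by the new node; nothing else shifted.
assert all(after.owner(k) == "node-5" for k in moved)
print(f"{len(moved) / len(keys):.0%} of keys moved, all to node-5")
```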
This property — only 1/N keys move when a node is added or removed — is the core advantage of consistent hashing over simple hash partitioning. At scale with hundreds of nodes, adding one node moves approximately 1% of the data rather than 99%.
Simple consistent hashing has a load imbalance problem: if nodes are randomly placed on the ring, the arcs between nodes are unequal in size, meaning some nodes own more key space and receive more traffic than others. This is solved by virtual nodes (vnodes).
Virtual Nodes: Production-Grade Load Balance
Virtual nodes (vnodes) assign each physical node multiple positions on the consistent hashing ring rather than one. Cassandra’s default configuration assigns 256 vnodes per physical node. With 10 physical nodes, there are 2,560 positions on the ring, each a small arc of the key space. The key space is divided much more finely, and each physical node aggregates the load from 256 small arcs rather than one large arc.
The load balancing improvement is significant. With one position per node, the load on each node depends on how the random position assignments distribute across the ring — potentially very uneven. With 256 positions per node, the statistical averaging across 256 independent positions produces near-perfect load balance. The law of large numbers ensures that 256 arcs per node aggregates to approximately equal total arc length per node.
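The effect is measurable in a quick simulation. The sketch below builds a ring where each node holds multiple tokens (node and key names are illustrative) and compares the max/min load ratio across 10 nodes with 1 token each versus 256 tokens each.

```python
import bisect
import hashlib
from collections import Counter

def ring_pos(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(nodes: list[str], vnodes: int) -> tuple[list[int], list[str]]:
    # Each physical node gets `vnodes` positions, derived from "name#i".
    tokens = sorted((ring_pos(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
    return [p for p, _ in tokens], [n for _, n in tokens]

def owner(positions: list[int], owners: list[str], key: str) -> str:
    return owners[bisect.bisect_right(positions, ring_pos(key)) % len(owners)]

nodes = [f"node-{i}" for i in range(10)]
keys = [f"key-{i}" for i in range(50_000)]
for vnodes in (1, 256):
    positions, names = build_ring(nodes, vnodes)
    load = Counter(owner(positions, names, k) for k in keys)
    ratio = max(load.values()) / min(load.values())
    print(f"{vnodes:>3} vnodes/node: max/min load ratio = {ratio:.2f}")
```

With a single token per node the ratio is typically several-to-one; with 256 tokens it sits close to 1.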
Vnodes also make node addition and removal operationally cleaner. When a physical node is added, it receives 256 new positions on the ring. Each new position takes a small slice of the key space from the node that previously owned that position on the ring. The data migration is distributed across all existing nodes — each existing node donates a small portion of its data. No single node is the source of all the migrated data. This prevents the network and disk I/O spike that occurs in simpler resharding strategies where all migrated data comes from one or a few sources.
Cassandra uses vnodes as its default sharding mechanism. DynamoDB uses a variant of consistent hashing with internal rebalancing logic. Redis Cluster divides the key space into 16,384 hash slots and assigns slots to nodes, which is functionally equivalent to consistent hashing with 16,384 virtual nodes.
The Hot Partition Problem
The hot partition problem is the most common production failure mode in sharded systems. It occurs when one partition receives dramatically more traffic than others — becoming a bottleneck regardless of how many total partitions exist. The system cannot scale past the hot partition’s capacity ceiling, even though most partitions are lightly loaded.
Hot partitions arise from several sources. Sequential key generation — auto-increment IDs, timestamp-based keys, monotonically increasing sequence numbers — concentrates all new writes on the most recent partition. Every new write goes to the same shard (the one owning the current sequential range), while older shards receive only reads. Hash partitioning prevents this pattern, but range partitioning on sequential keys produces it reliably.
Viral content in social applications produces hot partitions on content-keyed systems. A viral post, trending product, or breaking news event concentrates millions of reads on the single partition that holds the popular content. The partition handling that content becomes a bottleneck while all other partitions serve normal traffic.
Business-level skew produces hot partitions in multi-tenant systems. A partitioning strategy that assigns all of one large customer’s data to one partition creates a hot partition proportional to that customer’s traffic. If one customer generates 40% of total traffic, their partition receives 40% of total load regardless of how many other partitions exist.
Detection requires monitoring per-partition metrics — request rate, CPU utilisation, and latency per shard — not just aggregate metrics. An aggregate dashboard showing a healthy 26% average CPU across 10 shards may be hiding one shard at 95% CPU while the other nine run at 18%. Per-shard metrics are the only way to detect hot partitions before they produce user-visible degradation.
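As an illustration, take hypothetical per-shard CPU samples; even a simple threshold against the mean flags the outlier that the aggregate hides.

```python
# Hypothetical per-shard CPU readings: nine shards idling, one saturated.
shard_cpu = {f"shard-{i}": 18.0 for i in range(10)}
shard_cpu["shard-7"] = 95.0

avg = sum(shard_cpu.values()) / len(shard_cpu)
hot = sorted(s for s, cpu in shard_cpu.items() if cpu > 2 * avg)
print(f"average CPU {avg:.1f}% looks healthy; hot shards: {hot}")
```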
Mitigation strategies depend on the hot partition’s cause. For sequential write hot spots, add randomness to the key — prepend a random prefix (0-9) to distribute writes across 10 partitions rather than concentrating on one. For viral content hot spots, implement application-level caching that absorbs read traffic before it reaches the partition. For large tenant hot spots, isolate the large tenant on dedicated partitions separate from the multi-tenant pool — the same regional isolation pattern that Post 4.6 describes for availability.
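For the sequential-key case, the random-prefix technique looks like the sketch below. The prefix count of 10 and the key format are illustrative, and the read-side cost is explicit: a lookup must now check every prefix.

```python
import random

NUM_PREFIXES = 10

def write_key(seq_key: str) -> str:
    """Spread sequential writes across NUM_PREFIXES partitions."""
    return f"{random.randrange(NUM_PREFIXES)}#{seq_key}"

def read_keys(seq_key: str) -> list[str]:
    """Reads pay the price: every prefix must be checked."""
    return [f"{p}#{seq_key}" for p in range(NUM_PREFIXES)]
```

Writes fan out across 10 partitions instead of hammering one; a point read becomes a 10-way lookup, which is acceptable when the workload is write-dominated.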
Cross-Partition Queries: The Scalability Tax
Partitioning improves write scalability and data scalability for queries that touch one shard. It imposes a significant cost on queries that must touch multiple shards — the scatter-gather pattern.
A scatter-gather query fans out to all relevant shards, waits for each to return results, merges the results into a unified response, and returns the merged result to the client. The cost has three components. Latency is determined by the slowest responding shard — the tail latency amplification from Post 5.2 applies directly. If each shard responds slowly with probability p, a query touching N shards hits at least one slow shard with probability 1 − (1 − p)^N, which approaches certainty as N grows. Throughput is limited by the coordinator that aggregates results — it must process responses from all shards, which makes it a bottleneck under high query load. Cost scales linearly with shard count — a query that scans 100 shards consumes 100× the resources of a query that scans one shard.
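The latency component can be seen in a toy simulation. Assume each shard has a 1% chance of a slow (100 ms) response and answers in 5 ms otherwise; the numbers are illustrative, not measurements.

```python
import random

random.seed(7)  # deterministic run

def shard_latency_ms() -> float:
    # Toy model: 1% of shard responses are 100 ms stragglers, the rest 5 ms.
    return 100.0 if random.random() < 0.01 else 5.0

def scatter_gather_latency(num_shards: int) -> float:
    # The query completes only when the slowest shard responds.
    return max(shard_latency_ms() for _ in range(num_shards))

for n in (1, 10, 100):
    trials = [scatter_gather_latency(n) for _ in range(10_000)]
    slow = sum(t > 50 for t in trials) / len(trials)
    print(f"{n:>3} shards: {slow:.1%} of queries wait on a straggler")
```

At 100 shards well over half of all queries wait on a straggler, even though each individual shard is slow only 1% of the time.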
The practical consequence: applications built on partitioned systems must be designed to avoid scatter-gather for latency-sensitive or high-frequency queries. The partition key must align with the most common query patterns. Queries that cannot align with the partition key should either be served from secondary indexes (which may themselves be partitioned), from pre-computed aggregates, or from a separate analytics system that is designed for cross-partition queries and accepts higher latency.
Cassandra makes this trade-off explicit in its data model design: queries must specify the partition key, and Cassandra routes the query directly to the owning node without scatter-gather. Queries that do not specify a partition key are rejected or full-table-scanned — Cassandra forces the application to design for partition-aligned access patterns. DynamoDB similarly requires the partition key for all queries and restricts range scans to within a single partition.
Rebalancing: The Operational Cost of Resharding
Rebalancing — redistributing data across shards when nodes are added or removed — is the most operationally complex aspect of partitioned systems. It must move data from existing nodes to new nodes (or from failed nodes to survivors) while continuing to serve live traffic, without losing data or producing inconsistent reads.
The rebalancing strategies available depend on the partitioning scheme. For consistent hashing with vnodes, rebalancing is incremental — new nodes take ownership of vnodes one at a time, with each vnode migration transferring a small slice of data. This distributes the migration I/O across time and across source nodes, preventing the network and disk saturation that bulk migration produces.
For fixed-partition schemes like Redis Cluster’s 16,384 slots, rebalancing migrates entire hash slots from source nodes to destination nodes. Redis Cluster’s MIGRATE command moves one key at a time atomically, allowing the cluster to continue serving traffic during the migration. The source slot remains readable until migration is complete, preventing stale reads during the transition.
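The slot computation itself is documented in the Redis Cluster specification: HASH_SLOT = CRC16(key) mod 16384, using the CRC-16/XMODEM variant. A from-scratch sketch (ignoring hash tags, which the real spec also supports):

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC-16/XMODEM (polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Redis Cluster slot for a key: CRC16 of the key modulo 16384."""
    return crc16_xmodem(key.encode()) % 16384

print(hash_slot("user:1000"))  # a slot in [0, 16383]
```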
CockroachDB uses range-based partitioning with automatic split and merge. When a range grows beyond a configured size (default 512 MB), CockroachDB automatically splits it at the midpoint and migrates one half to another node. When ranges are small and lightly loaded, CockroachDB merges adjacent ranges to reduce overhead. This rebalancing runs continuously — CockroachDB monitors range sizes and load and rebalances without operator intervention.
The rebalancing rate must be throttled to prevent migration I/O from saturating the network and degrading foreground traffic. This throttling creates a trade-off: faster rebalancing restores the target replication factor and load distribution sooner after a failure or scaling event, but at higher impact to foreground traffic. Most production systems implement configurable rebalancing rate limits — conservative defaults that prioritise foreground traffic, with the ability to increase the rate during maintenance windows when foreground traffic impact is acceptable.
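A rebalancing throttle is, at its core, a rate limiter on migration bytes. A minimal token-bucket sketch (the rate, chunk sizes, and class name are illustrative, not any system’s defaults):

```python
import time

class MigrationThrottle:
    """Token bucket capping migration throughput at rate_bps bytes/second."""

    def __init__(self, rate_bps: float):
        self.rate = rate_bps
        self.allowance = rate_bps  # start with one second of budget
        self.last = time.monotonic()

    def admit(self, nbytes: int) -> None:
        """Block until nbytes of migration budget is available."""
        now = time.monotonic()
        self.allowance = min(self.rate, self.allowance + (now - self.last) * self.rate)
        self.last = now
        if nbytes > self.allowance:
            # Sleeping covers exactly the budget deficit at the configured rate.
            time.sleep((nbytes - self.allowance) / self.rate)
            self.allowance = 0.0
        else:
            self.allowance -= nbytes

# Migrate three 0.5 MB chunks under a 1 MB/s cap: the third chunk must wait.
throttle = MigrationThrottle(rate_bps=1_000_000)
start = time.monotonic()
for _ in range(3):
    throttle.admit(500_000)
elapsed = time.monotonic() - start
print(f"migrated 1.5 MB in {elapsed:.2f}s under a 1 MB/s cap")
```

Raising `rate_bps` during a maintenance window and lowering it under load is exactly the configurable trade-off described above.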
Production Partitioning in Named Systems
Cassandra uses consistent hashing with vnodes (default 256 per node) and a configurable partitioner. The Murmur3Partitioner (default) hashes keys using Murmur3 and distributes them uniformly. The ByteOrderedPartitioner uses range partitioning on raw bytes — which enables range scans but produces hot partitions on sequential keys, which is why Cassandra documentation recommends against it except for specific use cases. Cassandra’s data model requires the partition key in every query, enforcing partition-aligned access patterns at the schema design level.
DynamoDB uses consistent hashing internally, but the implementation details are not publicly documented. From the application perspective, DynamoDB partitions on the primary key (partition key, optionally combined with a sort key within the partition). DynamoDB automatically splits and merges partitions based on throughput and storage — the application does not manage partitions directly. DynamoDB’s on-demand capacity mode automates partition management entirely; provisioned capacity mode requires the application to provision throughput per table, which DynamoDB distributes across partitions.
CockroachDB uses range-based partitioning on the primary key with automatic splitting and merging. Data is physically stored in a local key-value store on each node (RocksDB historically, Pebble in current releases), and each range is managed by its own Raft group. CockroachDB’s rebalancer continuously monitors range sizes and load, automatically migrating ranges between nodes to maintain balance. This makes CockroachDB operationally simpler than manually managed partitioning but requires that the application’s primary key design produces even range distribution — sequential primary keys produce write hot spots on the most recent range.
MongoDB implements sharding through a shard key chosen by the application. MongoDB supports both range sharding (on the shard key value) and hashed sharding (on the hash of the shard key). The mongos query router receives all client requests and routes them to the appropriate shard based on the shard key. Queries that include the shard key are routed to one shard. Queries that do not include the shard key are broadcast to all shards — the scatter-gather cost is explicit in MongoDB’s architecture.
Key Takeaways
- Partitioning is the primary mechanism for scaling both write throughput and data volume beyond a single node — it creates N independent write leaders, each serving a subset of the key space, so that write capacity scales with shard count
- The partition key must align with the most frequent access patterns — queries that cross partition boundaries require scatter-gather, which does not scale and produces unpredictable latency; choose the partition key before building the application, not after
- Range partitioning enables efficient range scans and ordered iteration but produces hot partitions on skewed or sequential data — appropriate for time-series and geographic data with range query workloads
- Hash partitioning distributes load evenly and eliminates hot partitions but makes range scans impossible and requires migrating approximately N/(N+1) of all data when a shard is added — appropriate for point-lookup-dominated workloads
- Consistent hashing solves hash partitioning’s resharding cost — adding or removing a node moves only 1/N of the data, affecting only the immediate neighbours on the ring — and virtual nodes solve consistent hashing’s load imbalance problem by assigning each physical node multiple ring positions
- Hot partitions are the most common production sharding failure — caused by sequential keys, viral content, or large tenant skew — and require per-shard monitoring to detect before they produce user-visible degradation
- Rebalancing must be throttled to prevent migration I/O from saturating network and disk during foreground traffic — the rebalancing rate is a configuration trade-off between restoration speed and foreground traffic impact
Frequently Asked Questions (FAQ)
What is the difference between partitioning and sharding?
Partitioning and sharding are the same concept — dividing data across multiple nodes so each node owns a subset of the key space. Partitioning is the general term used in academic literature and database theory. Sharding is the term commonly used in production engineering conversations, particularly in the context of horizontally scaled databases. Both refer to the same mechanism: a partition key determines which node owns each record, and queries are routed to the node (or nodes) that own the relevant data.
What is consistent hashing and why is it better than simple hash partitioning?
Consistent hashing maps both keys and nodes to positions on a ring (a circular key space). Each node owns the keys between its position and its predecessor’s position. When a node is added, it takes ownership of the keys immediately preceding it on the ring — approximately 1/N of all keys move, where N is the number of nodes. When a node is removed, its keys transfer to its successor — again, only 1/N keys move. Simple hash partitioning (hash(key) mod N) requires migrating approximately N/(N+1) of all keys when one shard is added, because the modulo assignment changes for almost every key. Consistent hashing makes resharding operationally practical at scale.
What are virtual nodes and why does Cassandra use them?
Virtual nodes (vnodes) assign each physical node multiple positions on the consistent hashing ring rather than one. Cassandra’s default assigns 256 vnodes per physical node. With 256 positions, each physical node owns 256 small arcs of the key space. Statistical averaging across 256 independent positions produces near-perfect load balance regardless of how physical nodes are positioned on the ring. Vnodes also make node addition and removal cleaner: a new node takes small slices from all existing nodes rather than one large slice from one neighbouring node, distributing the migration I/O across all nodes simultaneously.
What is a hot partition and how do I detect and fix it?
A hot partition occurs when one shard receives dramatically more traffic than others — becoming a bottleneck regardless of how many total shards exist. Common causes are sequential key generation (all new writes go to the same shard), viral content (all reads for a popular record go to one shard), and large tenant skew (one customer’s traffic dominates one shard). Detection requires per-shard metrics — request rate, CPU, and latency per shard, not just aggregate averages. Mitigation depends on the cause: add randomness to sequential keys (prefix with a random value to spread writes), cache viral content at the application layer, or isolate large tenants on dedicated partitions.
When should I use range partitioning vs hash partitioning?
Use range partitioning when your workload requires range scans or ordered iteration — time-series data queried by time window, geographic data queried by region, lexicographically ordered data queried by prefix. Accept that you must design the partition key to avoid sequential hot spots (add randomness to timestamp-based keys). Use hash partitioning when your workload is dominated by point lookups (retrieve the record for exactly this key) and you do not need range scans. Accept that adding shards requires careful resharding planning, and use consistent hashing rather than simple modulo partitioning. When you need both range scans and good distribution, use a composite partition key that hashes a high-cardinality prefix and ranges on the remainder.
How do I avoid scatter-gather queries in a partitioned system?
Design the partition key to match your most frequent query patterns — every query that specifies the full partition key is routed to one shard with no scatter-gather. For queries that cannot include the partition key, three alternatives exist: secondary indexes (which may be global or local, each with different consistency and performance characteristics), pre-computed aggregates (materialised views computed at write time that can be queried directly by their own partition key), and accepting scatter-gather for low-frequency analytical queries that can tolerate higher latency. The architectural principle is to pay the scatter-gather cost at write time (pre-computation) rather than at read time, because reads are typically more frequent and more latency-sensitive than writes.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 5 — Scalability & Performance
- 5.1 — What Scalability Really Means in Distributed Systems
- 5.2 — Latency and Tail Latency at Scale
- 5.3 — Partitioning and Sharding in Distributed Systems
- 5.4 — Load Balancing Strategies in Distributed Systems
- 5.5 — Caching Trade-offs in Distributed Systems
Previous: ← 5.2 — Latency and Tail Latency at Scale
Next: 5.4 — Load Balancing Strategies in Distributed Systems →
Related posts from earlier in the series:
- 3.8 — Performance Trade-offs in Replicated Systems — write scalability bounded by the leader, Multi-Raft as the partitioned consensus solution
- 3.2 — Replication Models — how each shard replicates internally using leader-based or quorum-based replication
- 3.5 — Quorums and Voting — the W+R>N mechanism that each partition uses for consistency
- 5.2 — Latency and Tail Latency at Scale — scatter-gather amplifies tail latency across shard count
- 4.2 — Fault Tolerance vs High Availability — RTO and RPO implications of partition failure and rebalancing