Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows larger datasets to be split into smaller chunks and stored across multiple data nodes, increasing the total storage capacity of the system.
Similarly, by distributing the data across multiple machines, a sharded database can handle more requests than a single machine can. Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load. Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads.
In contrast, vertical scaling refers to increasing the power of a single machine or single server through a more powerful CPU, increased RAM, or increased storage capacity. Database sharding, as with any distributed architecture, does not come for free.
There is overhead and complexity in setting up shards, maintaining the data on each shard, and properly routing requests across those shards. Before you begin sharding, consider if one of the following alternative solutions will work for you. By simply upgrading your machine, you can scale vertically without the complexity of sharding.
Adding RAM, upgrading your CPU, or increasing the storage available to your database are simple solutions that do not require you to change the design of either your database architecture or your application. Depending on your use case, it may make more sense to shift a subset of the burden onto other providers or even a separate database.
For example, blob or file storage can be moved directly to a cloud provider such as Amazon S3. Analytics or full-text search can be handled by specialized services or a data warehouse.
Offloading this particular functionality can make more sense than trying to shard your entire database. If your data workload is primarily read-focused, replication increases availability and read performance while avoiding some of the complexity of database sharding. By simply spinning up additional copies of the database, read performance can be increased either through load balancing or through geo-located query routing.
However, replication introduces complexity on write-focused workloads, as each write must be copied to every replicated node.
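As a rough illustration of this read-scaling pattern, the sketch below sends writes to a primary and spreads reads across replicas in round-robin fashion; the host names and routing policy are illustrative assumptions rather than any particular database's API.

    import itertools

    # Illustrative hosts; in practice these would be real connection strings.
    PRIMARY = "db-primary.example.internal"
    REPLICAS = ["db-replica-1.example.internal", "db-replica-2.example.internal"]

    _replica_cycle = itertools.cycle(REPLICAS)

    def route(statement: str) -> str:
        """Send writes to the primary and spread reads across replicas (round robin)."""
        is_read = statement.lstrip().lower().startswith("select")
        return next(_replica_cycle) if is_read else PRIMARY

    print(route("SELECT * FROM products"))    # served by a replica
    print(route("INSERT INTO products ..."))  # served by the primary

Note that in a scheme like this, every write still has to be applied to the primary and then copied to each replica, which is exactly the write-side cost described above.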
Sharding does come with several drawbacks, namely overhead in query result compilation, complexity of administration, and increased infrastructure costs. In order to shard a database, we must answer several fundamental questions. The answers will determine your implementation. First, how will the data be distributed across shards? This is the fundamental question behind any sharded database. The answer to this question will have effects on both performance and maintenance.
Second, what types of queries will be routed across shards? If the workload is primarily read operations, replicating data will be highly effective at increasing performance, and you may not need sharding at all. In contrast, a mixed read-write workload or even a primarily write-based workload will require a different architecture.
Finally, how will these shards be maintained? Once you have sharded a database, over time, data will need to be redistributed among the various shards, and new shards may need to be created. Depending on the distribution of data, this can be an expensive process and should be considered ahead of time. Ranged sharding, or dynamic sharding, takes a field on the record as an input and, based on a predefined range, allocates that record to the appropriate shard.
Ranged sharding requires a lookup table or lookup service to be available for all queries and writes. For example, consider a set of data whose records are identified by a numeric ID. A simple lookup table maps each range of IDs to the shard that stores those records.
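One way to represent such a lookup table in application code is sketched below; the ID ranges and shard names are illustrative assumptions, not prescribed values.

    # Illustrative lookup table: each entry maps an inclusive ID range to a shard.
    RANGE_LOOKUP = [
        (0, 25, "shard_a"),
        (26, 50, "shard_b"),
        (51, 75, "shard_c"),
        (76, 100, "shard_d"),
    ]

    def shard_for_id(record_id: int) -> str:
        """Return the shard whose predefined range contains this record ID."""
        for low, high, shard in RANGE_LOOKUP:
            if low <= record_id <= high:
                return shard
        raise ValueError(f"No shard configured for ID {record_id}")

    print(shard_for_id(42))  # -> shard_b

Every query or write first consults this table (or a lookup service backed by it) to decide which shard to contact.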
The field on which the range is based is also known as the shard key. Naturally, the choice of shard key, as well as the ranges themselves, is critical in making range-based sharding effective. A poor choice of shard key will lead to unbalanced shards, which in turn leads to decreased performance. An effective shard key will allow queries to be targeted to a minimum number of shards.
In the sketch above, a query for all records with IDs between 0 and 40 would only need to touch shards A and B. Two key attributes of an effective shard key are high cardinality and well-distributed frequency.
Cardinality refers to the number of possible values of that key, while frequency refers to how often each of those values occurs in the data. A shard key with both properties helps distribute data evenly and prevents hotspots. As noted above, range based sharding divides data based on ranges of a given value. Every shard holds a different subset of the data, but all shards share an identical schema with one another and with the original database.
The application code simply reads which range the data falls into and writes it to the corresponding shard. Even if each shard holds an equal amount of data, the odds are that certain records, such as popular products in a product catalog, will receive more attention than others. Their respective shards will, in turn, receive a disproportionate number of reads. To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data.
In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following is a simplistic example of directory based sharding:
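The zone names and shard assignments below are illustrative assumptions; a directory of this kind could be modeled in application code as follows.

    # Illustrative directory: each delivery zone is mapped to the shard holding its rows.
    DELIVERY_ZONE_DIRECTORY = {
        "North": "shard_a",
        "East": "shard_b",
        "West": "shard_c",
        "Northeast": "shard_a",  # several zones may share a shard
    }

    def shard_for_zone(delivery_zone: str) -> str:
        """Look up which shard holds rows for a given delivery zone."""
        return DELIVERY_ZONE_DIRECTORY[delivery_zone]

    print(shard_for_zone("East"))  # -> shard_b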
Here, the Delivery Zone column is defined as the shard key. Data from the shard key is written to the lookup table along with whatever shard each respective row should be written to. The main appeal of directory based sharding is its flexibility. Range based sharding architectures limit you to specifying ranges of values, while key based ones limit you to using a fixed hash function, which can be exceedingly difficult to change later on.
Whether or not one should implement a sharded database architecture is almost always a matter of debate. Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data: for example, when the dataset outgrows the storage of a single node, or when read and write volume exceeds what a single node can serve. Before sharding, you should exhaust all other options for optimizing your database, such as the alternatives discussed earlier: scaling vertically, offloading specific functionality to separate services or data stores, and setting up replication.
Bear in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you. Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application.
Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others. By reading this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Moving forward, you can use this insight to make a more informed decision about whether or not a sharded database architecture is right for your application.
Without sufficient optimization, database joins across multiple servers could be highly inefficient and difficult to perform. Sharding has been around for a long time, and over the years different sharding architectures and implementations have been used to build large scale systems. In this section, we will go over the three most common ones.
In hash sharding, a hash function is applied to the value of the shard key, and the resulting hash value is used to determine in which shard the data should reside. To ensure that entries are placed in the correct shards in a consistent manner, the values entered into the hash function should all come from the same column. With a uniform hashing algorithm such as ketama, the hash function can evenly distribute data across servers, reducing the risk of hotspots. With this approach, data with close shard keys is unlikely to be placed on the same shard, which makes this architecture well suited to targeted data operations.
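A minimal sketch of the idea is shown below; the shard names are illustrative, and the plain modulo mapping stands in for the more involved consistent-hashing schemes (such as ketama) that production systems use.

    import hashlib

    SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]  # illustrative shard names

    def shard_for_key(shard_key: str) -> str:
        """Hash the shard key and map the digest onto one of the shards.
        Consistent hashing would be used in practice so that adding or
        removing a shard does not remap most existing keys."""
        digest = int(hashlib.md5(shard_key.encode("utf-8")).hexdigest(), 16)
        return SHARDS[digest % len(SHARDS)]

    print(shard_for_key("customer-10293"))

Because nearby keys hash to unrelated values, a point lookup touches a single shard, but a scan over a range of keys has to consult them all.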
Range sharding, by contrast, divides data based on ranges of the data value (also known as the keyspace), so shard keys with nearby values are more likely to fall into the same range and onto the same shard. Each shard essentially preserves the same schema as the original database. Range sharding allows for efficient queries that read target data within a contiguous range (range queries). However, range sharding requires the user to choose the shard keys up front, and poorly chosen shard keys can result in database hotspots. A good rule of thumb is to pick shard keys that have high cardinality and low recurring frequency, and that do not increase or decrease monotonically.
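The short sketch below illustrates why a monotonically increasing key is a poor choice under range sharding; the ranges and shard names are illustrative assumptions.

    from collections import Counter

    # Illustrative range-sharded keyspace split into four shards.
    RANGES = [(0, 249, "shard_a"), (250, 499, "shard_b"),
              (500, 749, "shard_c"), (750, 999, "shard_d")]

    def shard_for(key: int) -> str:
        return next(name for low, high, name in RANGES if low <= key <= high)

    # With an auto-increment style key, the newest rows (which typically see the
    # most reads and writes) all land on the last shard, creating a hotspot.
    recent_keys = range(900, 1000)
    print(Counter(shard_for(k) for k in recent_keys))  # Counter({'shard_d': 100})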
Without proper shard key selection, data can be unevenly distributed across shards, and some data can be queried far more often than the rest, creating potential bottlenecks in the shards that receive the heavier workload. The ideal solution to uneven shard sizes is automatic shard splitting and merging. If a shard becomes too large or hosts a frequently accessed row, breaking it into multiple shards and rebalancing them across all the available nodes leads to better performance.
Similarly, the opposite process can be undertaken when there are too many small shards. In geo-based (also known as location-aware) sharding, data is first partitioned according to a user-specified column that maps range shards to specific regions and to the nodes in those regions. Inside a given region, data is then sharded using either hash or range sharding.
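A rough sketch of this two-level routing follows; the region names, node lists, and the choice of hash sharding within a region are illustrative assumptions.

    import hashlib

    # Illustrative mapping from a user-specified region column to that region's nodes.
    REGION_NODES = {
        "eu": ["eu-node-1", "eu-node-2"],
        "us": ["us-node-1", "us-node-2", "us-node-3"],
    }

    def node_for(region: str, shard_key: str) -> str:
        """Pick the node group for the row's region, then hash-shard within it."""
        nodes = REGION_NODES[region]
        digest = int(hashlib.md5(shard_key.encode("utf-8")).hexdigest(), 16)
        return nodes[digest % len(nodes)]

    print(node_for("eu", "user-42"))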
YugabyteDB is an auto-sharded, ultra-resilient, high-performance, geo-distributed SQL database built with inspiration from Google Spanner. It currently supports hash and range sharding, and geo-partitioning is an active work-in-progress feature. Each data shard is called a tablet, and it resides on a corresponding tablet server.
For hash sharding, tables are allocated a hash space between 0x0000 and 0xFFFF (the 2-byte range), accommodating as many as 64K tablets for very large data sets or cluster sizes. Consider a table with 16 tablets.
We take the overall hash space [0x0000, 0xFFFF] and divide it into 16 segments, one for each tablet. An operation is then served by hashing the key, locating the tablet whose segment contains that hash value, and collecting data from the appropriate tablets.
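A minimal sketch of that segment arithmetic follows; it only illustrates the idea of mapping a 2-byte hash value onto 16 equal segments and is not YugabyteDB's actual implementation.

    NUM_TABLETS = 16
    HASH_SPACE = 0x10000  # 2-byte hash space: values 0x0000 through 0xFFFF

    def tablet_for(hash_value: int) -> int:
        """Map a 16-bit hash value onto one of 16 equal segments of the hash space."""
        segment_size = HASH_SPACE // NUM_TABLETS  # 0x1000 hash values per tablet
        return hash_value // segment_size

    print(hex(0x1234), "-> tablet", tablet_for(0x1234))  # tablet 1
    print(hex(0xffff), "-> tablet", tablet_for(0xffff))  # tablet 15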