When working with very large volumes of data, we need to scale out the database. Scaling out means increasing both the storage and the throughput that the database can handle.
A container that we create in CosmosDB is partitioned so that it can manage its own growth. As data is added, CosmosDB distributes it across partitions and keeps creating as many new partitions as the growing data requires.
We can think of partitions as glasses filled with water: when one glass gets full, water is poured into the next glass, and so on. Each glass (i.e. partition) holds its own share of the data, and the container as a whole is the collection of all these partitions.
The problem that then arises is this: if CosmosDB keeps spreading data across partitions, how does it find a particular item in the whole container? Does it check every partition for the required data?
This is where partition keys come into the picture. Defining a partition key when creating a container is the most important design decision to make. The partition key is a property of the items that you choose yourself, so the choice comes down to picking the most suitable property. Once you have chosen it, CosmosDB computes a hash of that property's value for every item. This hash is used to decide which partition the item should be stored in.
Each partition has a fixed storage capacity. CosmosDB assigns ranges of hash values to partitions, so a single partition can store multiple partition key values. All the data with the same partition key value lives in the same partition.
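To make the idea of hashing and ranges more concrete, here is a minimal sketch in Python. It is only an illustration: the MD5-based hash, the size of the hash space and the helper names (`hash_key`, `make_boundaries`, `partition_for`) are assumptions made for this example and are not the algorithm that CosmosDB uses internally.

```python
import bisect
import hashlib

HASH_SPACE = 2 ** 32  # size of the illustrative hash space (an assumption)


def hash_key(partition_key: str) -> int:
    """Map a partition key value to a stable integer in [0, HASH_SPACE)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_boundaries(partition_count: int) -> list[int]:
    """Exclusive upper bounds of equally sized hash ranges, one per partition."""
    step = HASH_SPACE // partition_count
    bounds = [step * (i + 1) for i in range(partition_count)]
    bounds[-1] = HASH_SPACE  # the last range absorbs any rounding remainder
    return bounds


def partition_for(partition_key: str, boundaries: list[int]) -> int:
    """Return the index of the partition whose hash range contains the key."""
    return bisect.bisect_right(boundaries, hash_key(partition_key))


boundaries = make_boundaries(4)
for key in ["user-1", "user-2", "user-3", "user-1"]:
    # The same key value always maps to the same partition.
    print(key, "->", "partition", partition_for(key, boundaries))
```

Because each partition owns a whole range of hash values, many different key values end up sharing one partition, while any single key value always lands in exactly one place.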
Another important point is that, when looking for data, you should specify the partition key in the query as well. This makes the task a whole lot easier, because CosmosDB can go straight to the right partition instead of searching all the partitions in the container.
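The difference between the two kinds of query can be sketched with a small in-memory model. This is not the CosmosDB SDK: the `TinyContainer` class, its `query` method and the `userId` and `orderId` fields are hypothetical names used only to show how much data has to be examined in each case.

```python
from collections import defaultdict


class TinyContainer:
    """Toy container that groups items by their partition key value."""

    def __init__(self, partition_key_field):
        self.partition_key_field = partition_key_field
        self.partitions = defaultdict(list)  # key value -> items

    def insert(self, item):
        self.partitions[item[self.partition_key_field]].append(item)

    def query(self, predicate, partition_key=None):
        """Return (matching items, number of items examined)."""
        if partition_key is not None:
            # Targeted query: only one partition is read.
            candidates = self.partitions.get(partition_key, [])
        else:
            # No key given: every partition has to be scanned.
            candidates = [it for items in self.partitions.values() for it in items]
        matches = [it for it in candidates if predicate(it)]
        return matches, len(candidates)


container = TinyContainer("userId")
for user in range(100):
    for order in range(10):
        container.insert({"userId": f"user-{user}", "orderId": order})


def is_wanted(item):
    return item["userId"] == "user-42" and item["orderId"] == 3


targeted, examined_targeted = container.query(is_wanted, partition_key="user-42")
scanned, examined_scanned = container.query(is_wanted)
print(examined_targeted, examined_scanned)
```

Both queries return the same single item, but the targeted one examines only the 10 items of one user, while the query without a partition key has to look at all 1000 items.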
The property that we choose as the partition key must be one that is present on every item in the container. It must also have many distinct values, so that the data is spread widely across partitions and is easy to locate.
Usually, the best choice for a partition key is a uniquely defined ID that identifies a particular item or person, so that the data for that ID lands in its own partition. This also makes it easier to look up data later, because a query that specifies the ID gets the data quickly and easily.
Choosing the right property for the partition key is very important, because a poor choice creates bottlenecks when dealing with large amounts of data. For example, if we choose “Month created” as the partition key, all the data created in a given month is directed to the same partition (a so-called hot partition), which is definitely not what we want.
Similarly, we do not want something like Land size to be chosen as the partition key, because one land can be much bigger than another, and that results in loads of data being directed to one partition.
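One rough way to compare candidate partition keys is to count how many items each key value would receive. The sketch below does this for made-up sample data; the field names `id` and `monthCreated`, the sample month and the helper names are assumptions for illustration only.

```python
from collections import Counter


def partition_load(items, key_field):
    """Count how many items share each value of the candidate key."""
    return Counter(item[key_field] for item in items)


def hottest_share(items, key_field):
    """Fraction of all items that land on the busiest key value."""
    if not items:
        raise ValueError("items must not be empty")
    load = partition_load(items, key_field)
    return max(load.values()) / len(items)


# Made-up sample: 1000 items, all created in the same month.
items = [{"id": f"item-{i}", "monthCreated": "2024-05"} for i in range(1000)]

print("monthCreated:", hottest_share(items, "monthCreated"))
print("id:", hottest_share(items, "id"))
```

With this sample, the month-based key sends every item to one key value (a share of 1.0), while the unique ID spreads the items evenly (a share of 0.001 each). The closer the share is to 1.0, the more likely that key is to produce a hot partition.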
These problems can be handled by creating a separate container for the data that would otherwise overload a single partition. The new container can then be given its own partition key and be used much more efficiently.
For example, we could create a dedicated container for Land 7, and then partition the data inside it by a more specific partition key.
To summarize, horizontal partitioning helps manage the growth of a container: new partitions keep being created as data is added to the database. We also saw what hot partitions are and how they can be avoided.