CockroachDB a NewSQL Answer to Scaling SQL Databases
In this article we discuss a new database technology called CockroachDB and how it implements what database enthusiasts are calling NewSQL. We’ll discuss what NewSQL is and why it was needed and CockroachDB’s approach to its implementation.
What is NewSQL
New SQL is a new take on SQL databases whose intent is to provide a database solution for companies with huge datasets who know that their entire dataset will not fit on a single machine. There have been NoSQL answers such as MongoDB that solve this scalability issue but that at the same time lost some of the great things that SQL inherently provides such as joins and data consistency.
In SQL there is not much to say about data distribution because all data is stored on a single server. Of course you can and should replicate/backup your database but this is not handled by database technology itself, instead a developer must manually create backups.
NoSQL database technologies like MongoDB dealt beautifully with data distrubtion using sharding. Now NewSQL technologies like CockroachDB and Google Spanner have taken what we’ve learned from these NoSQL solution and applied them to SQL.
The main questions to consider in data distribution are how you divide up the data and how do you locate it later. Hashing and Order-Preserving are the two most commonly used approaches.
Let’s discuss the two approaches from a high-level.
Let’s assume you are storing key value pairs. In hashing, a hash function is run on the key to determine which machine the data is stored on. A simple example of a has function would use the modulus operator on the key with the number of database servers you have and that way your data could be split evenly among the servers. The advantages of this methodology is that it’s simple to determine where to store the data and when searching for a piece of data by the key, it isequally as simple to determine which machine it is on; just use the hash function again. The downside of this approach from a SQL point of view is that performing range scans is complex and inefficient. This approach is still used in certain technologies though like DynamoDB. This approach though is not ideal for NewSQL databases because NewSQL heavily relies on ordered indexes for its search efficiency.
The second approach, which most NewSQL technologies like CockroachDB take, is the Order-Preserving methodology which as you can imagine, unlike Hashing, preserves order. In this approach you take your dataset and split it up into pieces that are similar and size and distribute those pieces across your machines. With this approach you preserve order which results in efficient range scans. The downside is that you don’t have a simple function like hashing, that let’s you know which machine the data is stored on. Instead you use a range index which keeps track of which ranges of data reside on each machine.
CockroachDB and most NewSQL technologies alike encourage replications of each node and using rebalancing so that you can add and remove nodes from the system and it will keep the number of replications.
The main question you may be asking yourself if you’re unfamiliar with CockroachDB and NewSQL systems is how does it keep the copies of data in sync? In a system where datasets are being copied and copied, data is bound to get out of sync and conflict. What if one set of data says one thing and a supposed copy of that data says another? Which one is right? There needs to be safeguards in place to handle when these conflicts occur.
There are two common patterns with keeping SQL databases in sync that we’ll discuss here:
The first one is ‘cold backups’ and is that you have a primary database and are making copies on a regular schedule but these backups aren’t expected to have the most current set of data, instead they are more for disaster recover, so that you can always revert back to a recent backup.
The second form is ‘primary secondary’ replication which is when you are writing first to a primary database, and then when it’s done it passes that same write along to the secondary database to perform the same write. That way the primary and secondary are for the most part kept in sync and hold identical copies of the dataset. This can be done synchronously or asynchronously.
Distributed Consensus Protocol
NewSQL technologies typically use a distributed a consensus protocol which consists of using an odd number of machines in a cluster and use a majority rules or consensus to decide what data is correct and resolve conflicts. CockroachDB uses an implementation of a DCP ( Distributed Consensus Protol )called Raft. In DCP’s there is a leader node and followers. Let’s discuss this in an example to solidify our understanding. Let’s imagine a new Post is made within our application, first it goes to the Leader node which stores it in its database, then it passes the write to the followers. Let’s say we have four followers for a total of five nodes. Once that write has been executed by two of the followers, we have a majority because now we have three nodes with the stored data and that write is considered to be persisted and you can acknowledge to the client of the succcess.
The advantage here is that if one node goes down, it allows that store to be committed successfully without every node being onboard.
Cockroach DB and NewSQL provide ACID transactions. If you’re unfamiliar with the acronym: Atomic – the transaction completely succeeds or completely fails, nothing in-between. Consistent – a transaction is guaranteed to stop in an unfinished state. Isolated – transactions that occur at the same time don’t cause issues. Durable – transactions that were stored, stay stored.
The way by which ACID is achieved in a distributed system is different than traditional SQL databases but the result is the same. It’s also noteworthy to mention that many NoSQL systems can’t offer ACID transactions like SQL can. This is one sacrifice that was made for the sake of scalability.
CockroachDB uses MVCC with conflict detection to make this work.
CockroachDB is one of a few NewSQL database technology that’s focused on creating a distributed system for SQL databases. If you’re looking for a solution to creating a distributed system you might look at CockroachDB and NewSQL as an option especially if you’re comfortable with traditional SQL database technologies.
If you have any questions or feedback please don’t hesitate to reach out to us.
Pilot the ObjectRocket Platform Free!
Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.Get Started