One of the most significant differences between the new generation of non-relational (AKA NoSQL) databases and the traditional RDBMS is the way in which consistency of data is handled. In a traditional RDBMS, all users see a consistent view of the data. Once a user commits a transaction, all subsequent queries will report that transaction and certainly no-one will see partial results of a transaction.
RDBMS transactions are typically described as “ACID” transactions. That is, they are:
- Atomic: The transaction is indivisible – either all the statements in the transaction are applied to the database, or none are.
- Consistent: The database remains in a consistent state before and after transaction execution.
- Isolated: While multiple transactions can be executed by one or more users simultaneously, one transaction should not see the effects of other concurrent transactions.
- Durable: Once a transaction is saved to the database (an action referred to in database programming circles as a commit), its changes are expected to persist.
As databases become distributed across multiple hosts, maintaining ACID consistency becomes increasingly difficult. In a transaction that spans multiple independent databases, complex two-phase commit protocols must be employed. In the case of a truly clustered distributed database even more complex protocols are required, since the state of data in memory and the state of data in various transaction logs and data files must be maintained in a consistent state (cache fusion in Oracle RAC for instance).
CAP Theorem: You can’t have it all
In 2000, Eric Brewer outlined the CAP (AKA Brewer’s) Theorem. Simplistically, CAP theorem says that in a distributed database system, you can only have at most two of the following three characteristics:
- Consistency: All nodes in the cluster see exactly the same data at any point in time
- Availability: Failure of a node does not render the database inoperative
- Partition tolerance: Nodes can still function when communication with other groups of nodes is lost
Interpretation and implementations of CAP theorem vary, but most of the NoSQL database system architectures favour partition tolerance and availability over strong consistency.
A compromise between eventual consistency and weak (no guarantees) consistency is Eventual Consistency.
The core of the eventual consistency concept is that although the database may have some inconsistencies at a point in time, it will eventually become consistent should all updates cease. That is, inconsistencies are transitory: eventually all nodes will receive the latest consistent updates.
BASE – Basically Available Soft-state Eventually consistent is an acronym used to contrast this approach with the RDBMS ACID transactions described above.
Not all implementations of eventually consistent are equal. In particular, an eventually consistent database may also elect to provide the following:
- Causal consistency: This involves a signal being sent from between application session indicating that a change has occurred. From that point on the receiving session will always see the updated value.
- Read your own writes: In this mode of consistency, a session that performs a change to the database will immediately see that change, even if other sessions experience a delay.
- Monotonic consistency: In this mode, A session will never see data revert to an earlier point in time. Once we read a value, we will never see an earlier value.
The NRW notation
NRW notation describes at a high level how a distributed database will trade off consistency, read performance and write performance. NRW stands for:
- N: the number of copies of each data item that the database will maintain.
- R: the number of copies that the application will access when reading the data item
- W: the number of copies of the data item that must be written before the write can complete.
When N=W then the database will always write every copy before returning control to the client – this is more or less what traditional databases do when implementing synchronous replication. If you are more concerned about write performance than read performance, then you can set W=1, R=N. Then each read must access all copies to determine which is correct, but each write only has to touch a single copy of the data.
Most NoSQL databases use N>W>1: more than one write must complete, but not all nodes need to be updated immediately. You can increase the level of consistency in roughly three stages:
- If R=1, then the database will accept whatever value it reads first. This might be out of date if not all updates have propagated through the system.
- If R>1 then the database will more than one value and pick either the most recent (or “correct”) value.
- If W+R>N, then a read will always retrieve the latest value, although it may be mixed in with “older” values. In other words, the number of copies you write and the number of copies you read is high enough to guarantee that you’ll always have at least one copy of the latest version in your read set. This is sometimes referred to as quorum assembly.
|W=N R=1||Read optimized strong consistency|
|W=1 R=N||Write optimized strong consistency|
|W+R<=N||Weak eventual consistency. A read might not see latest update|
|W+R>N||Strong consistency through quorum assembly. A read will see at least one copy of the most recent update|
NoSQL databases generally try hard to be as consistent as possible, even when configured for weaker consistency. For instance, the read repair algorithm is often implemented to improve consistency when R=1. Although the application does not wait for all the copies of a data item to be read, the database will read all known copies in the background after responding to the initial request. If the application asks for the data item again, it will therefore see the latest version.
NoSQL databases can seem simplistic in some respects, but there’s a lot of really clever algorithms going on behind the scenes. For example, the vector clock algorithm can be used to ensure that updates are processed in order (monotonic consistency).
With vector clocks, each node participating in the cluster maintains an change number (or event count) similar to the System Change Number used in some RDBMSs. The “vector” is a list including the current node’s change number as well as the change numbers that have been received from other nodes. When an update is transmitted, the vector is included with the update and the receiving node compares that vector with other vectors that have been received to determine if updates are being received out of sequence. Out of sequence updates can be held until the preceding updates appear.
I found vector clocks hard to understand until I read the description in Operating Systems: Concurrent and Distributed Software Design by Jean Bacon and Tim Harris (Addison-Wesley).
A lot of the eeventually consistent concepts were best articulated by Amazon in Verner Vogels’ Eventually Consistent paper and in Amazon’s paper on the Dynamo eeventually consistent key-value store. Dynamo implements most of the ideas above, and is the inspiration for several well known NoSQL datastores including Voldemort and – together with Google’s BigTable specification – Cassandra.