Database System Errors in Light of CAP Theorem and Eventual Consistency
What is the CAP theorem?
Interest in the CAP theorem has revived of late, especially for database management systems that span multiple sites. The theorem concerns the three most fundamental desirable properties of such database applications. Here is a quick overview of each.
- C of CAP: Consistency. Consistency gives multi-site transactions the familiar all-or-nothing semantics that commercial database management systems commonly support. In addition, when replicas are supported, one wants them to be in consistent states at all times.
- A of CAP: Availability. The goal of availability is for the database system to be always up. In other words, when a failure occurs, the system should keep running, switching over to a replica if needed. Tandem Computers popularized this capability about twenty years ago, and it is now widely used.
- P of CAP: Partition-tolerance. If a network failure splits the processing nodes into two groups that cannot communicate with each other, partition tolerance means allowing processing to continue in both groups.
Need to give up C, A, or P
The CAP theorem is a negative result: it says that you cannot achieve all three goals simultaneously in the presence of errors. You must therefore weigh your objectives and pick one of the three to give up.
In the NoSQL world, the CAP theorem is used as the justification for giving up C, i.e., consistency. Since most NoSQL systems disallow transactions that cross node boundaries, consistency applies only to replicas. The CAP theorem is therefore used to justify giving up replica consistency and replacing it with eventual consistency. Under this weaker objective, the guarantee is only that all replicas will eventually converge to the same state, i.e., once network connectivity has been re-established and enough time has passed for the replicas to be cleaned up. The justification for dropping C is that A and P can then both be preserved.
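To make the convergence guarantee concrete, here is a minimal sketch of two replicas that accept writes while partitioned and reconcile afterwards. It assumes a last-writer-wins rule based on timestamps, which is only one possible reconciliation policy; the `Replica` class and the `anti_entropy` function are hypothetical names used for illustration, not the API of any particular NoSQL system.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica mapping key -> (timestamp, value). Illustrative only."""
    name: str
    store: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Accept the write locally, even while partitioned (A is preserved).
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange state once the partition heals; the newest timestamp wins."""
    for key in set(a.store) | set(b.store):
        candidates = [r.store[key] for r in (a, b) if key in r.store]
        winner = max(candidates, key=lambda entry: entry[0])
        a.store[key] = winner
        b.store[key] = winner


east, west = Replica("east"), Replica("west")
east.write("balance", 100, timestamp=1)
west.write("balance", 80, timestamp=2)   # partitioned: east never sees this write

diverged = east.read("balance") != west.read("balance")  # replicas disagree
anti_entropy(east, west)                                  # connectivity restored
converged = east.read("balance") == west.read("balance")
```

Note that under this policy the write with timestamp 1 is silently discarded during reconciliation. The replicas agree in the end, but readers could observe different values in the meantime, which is the cost of giving up C.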
Database Errors in light of CAP
This article's objective is to assert that the above approach is suspect, and that recovery from errors has more dimensions to consider. Assume a hardware model consisting of a collection of local processing and storage nodes assembled into a cluster on a LAN. The clusters, in turn, are connected by a WAN. With this structure in mind, let us go through the various causes of errors in a database. They are referred to later as errors #1 through #8, in the order listed here. The list may not be complete. For further clarification on database setup, you may consult service providers such as RemoteDBA.com.
- Application errors
Here the application has performed one or more incorrect updates. Usually this is not noticed immediately, and sometimes not for many hours. The database must be restored to a point before the offending transaction or transactions, and all subsequent activity must be redone.
- Repeatable DBMS errors
The DBMS crashes at a processing node, and executing the same transaction on a node holding a replica causes the backup to crash as well. Such errors are known as Bohrbugs.
- Unrepeatable DBMS errors
The database crashes on one node, but a replica of it runs fine. Such failures usually come from odd corner cases involving asynchronous operations. They are known as Heisenbugs.
- OS errors
Here, the OS crashes at a given node and produces a "blue screen of death." This is also a frequent case in distributed systems.
- Hardware failure in the local clusters
These errors include disk or memory failures, which generally cause a panic stop by the OS or the DBMS. They sometimes also appear as Heisenbugs.
- Network partition in the local clusters
In this case, the LAN fails, and the nodes can no longer all communicate with each other. This brings network operations in the cluster to a dead stop.
- A natural disaster
The local cluster is wiped out by a flood, earthquake, storm, or similar event. In this case, the cluster no longer exists.
- Network failure in the WAN that connects the clusters
In this case, the WAN fails, and the clusters can no longer communicate with each other. Communication between the clusters stops completely.
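The recovery described for application errors (restore the database to a point before the bad transactions, then redo the later work) can be sketched as a replay of a transaction log that skips the faulty entries. This is a simplified illustration rather than how any particular DBMS implements point-in-time recovery. The `recover` function, the log format, and the account names are all hypothetical, and the sketch assumes that later transactions did not depend on the bad data.

```python
def recover(snapshot, log, bad_txn_ids):
    """Rebuild state from a backup snapshot, replaying the log minus bad transactions."""
    state = dict(snapshot)            # start from the backup taken before the error
    for txn_id, key, delta in log:    # log entries are in commit order
        if txn_id in bad_txn_ids:
            continue                  # skip the incorrect update
        state[key] = state.get(key, 0) + delta
    return state


snapshot = {"acct_a": 100, "acct_b": 50}
log = [
    (1, "acct_a", -10),
    (2, "acct_b", +500),   # hypothetical buggy application update
    (3, "acct_b", -5),
]
recovered = recover(snapshot, log, bad_txn_ids={2})
```

Replicas do not help in this case: every replica faithfully applied the same incorrect update, so all of them hold the same wrong state.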
The first two errors (application errors and repeatable DBMS errors) defeat any high-availability scheme. In these scenarios there is no way to keep going, so availability is impossible to achieve. Replica consistency is also meaningless, because the current database state is simply wrong. Error #7, a natural disaster, is recoverable only if a local transaction is committed only after another WAN-connected cluster has confirmed receiving it. Few application builders are willing to accept this kind of latency. Eventual consistency therefore cannot be guaranteed, because a transaction may be lost entirely if a disaster hits the local cluster before the transaction has been forwarded elsewhere. In other words, application designers choose to suffer data loss when a rare event such as a disaster occurs, because the performance penalty for avoiding it is too high. Errors #1, #2, and #7 are thus examples of cases where the CAP theorem does not apply. Databases must be prepared to recover data in such cases.
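The trade-off around the disaster case can be illustrated with a toy model of synchronous versus asynchronous forwarding to a remote cluster. The classes and the `survivors_after_disaster` helper below are hypothetical and simulate no real replication protocol. They only show which transactions still exist somewhere once the local cluster is gone.

```python
class RemoteCluster:
    """Stand-in for a WAN-connected cluster; `reachable` simulates the link."""

    def __init__(self):
        self.log = []
        self.reachable = True

    def receive(self, txn):
        if not self.reachable:
            raise ConnectionError("WAN link down")
        self.log.append(txn)


class LocalCluster:
    def __init__(self, remote, synchronous):
        self.remote = remote
        self.synchronous = synchronous
        self.committed = []
        self.pending = []   # committed locally, not yet forwarded

    def commit(self, txn):
        if self.synchronous:
            # Pay a WAN round trip before acknowledging the commit.
            self.remote.receive(txn)
            self.committed.append(txn)
        else:
            # Acknowledge immediately; forward later via flush().
            self.committed.append(txn)
            self.pending.append(txn)

    def flush(self):
        while self.pending:
            self.remote.receive(self.pending[0])
            self.pending.pop(0)


def survivors_after_disaster(local):
    """Transactions that still exist once the local cluster is destroyed."""
    return list(local.remote.log)


sync_site = LocalCluster(RemoteCluster(), synchronous=True)
sync_site.commit("t1")

async_site = LocalCluster(RemoteCluster(), synchronous=False)
async_site.commit("t1")   # acknowledged, but the disaster strikes before flush()

sync_survivors = survivors_after_disaster(sync_site)     # "t1" survives remotely
async_survivors = survivors_after_disaster(async_site)   # "t1" is lost
```

In the asynchronous case the application was told the commit succeeded, yet no replica will ever converge to a state that includes it.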
Now consider the local cluster failures, errors #3 through #6 (unrepeatable DBMS errors, OS errors, local hardware failures, and LAN partitions). Most of these result in a single node failing, which is a degenerate case of a network partition and is survived by various algorithms. So here it is better to give up P than to compromise C. In a LAN environment, CA is a better choice than AP. OLTP systems such as NimbusDB and VoltDB tend to do the same.
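The article does not say which of the "various algorithms" it has in mind. As one illustrative assumption, a majority-quorum rule shows why a single node failure is easy to survive without sacrificing consistency: the remaining nodes still form a majority and keep processing. The `can_commit` function is a hypothetical helper, not taken from NimbusDB or VoltDB.

```python
def can_commit(total_nodes, reachable_nodes):
    """A group may commit only if it holds a strict majority of all nodes."""
    return reachable_nodes > total_nodes // 2


# Three-node cluster, one node crashed: the two survivors keep going.
single_failure_ok = can_commit(total_nodes=3, reachable_nodes=2)

# The isolated (or crashed) node on its own must stop accepting commits.
minority_ok = can_commit(total_nodes=3, reachable_nodes=1)

# An even split of four nodes leaves neither side with a majority.
even_split_ok = can_commit(total_nodes=4, reachable_nodes=2)
```

Because at most one group can ever hold a majority, two sides of a partition never commit conflicting updates. The price is that a minority group, or both halves of an even split, must stop.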
In summary, you should not drop C so quickly: there are many real error scenarios in which CAP does not apply, and in others giving up C may be the wrong trade-off.