In my previous blogs I have talked at length about 2-Phase-Commit transaction protocol for in memory caches, and how in-memory caches can handle failures a lot more efficiently than disk-based databases. In this blog I want to cover yet another very important optimization that can be utilized for in-memory caches, specifically for cases where data is partitioned across the network.
In-memory caches, and specifically in-memory data grids, such as GridGain or Oracle Coherence, often employ a technique called data partitioning, where every key in the key set is assigned to a partition, and every partition is assigned to a specific cluster member. Assigning keys to a partition is usually easy and is done similarly to how hash maps work:
key.hashCode() % N, where N is a total number of partitions.
Figure 1: Cache Partition Distribution
For example, if we have Employee objects and Company objects, then we can ensure that all employees working for the same company will be mapped to the same partition by providing a custom affinity-key for Employees, which in this case will be the "companyId".
Custom affinity mapping helps us ensure that all objects within a single transaction are mapped to the same cluster node, thus collocating the whole transaction logic on that node and minimizing the number of nodes involved in a transaction.The 1-Phase-Commit optimization is possible only when we have 1 primary and 1 backup copy. Such deployments are most common for distributed caches. If we add 2 backup copies, then we have to resort to 2-Phase-Commit, however, adding a 2nd backup copy is generally considered wasteful from memory standpoint and is rarely done. Diagram below illustrates the 1-Phase optimization:
Figure 2: 1-Phase-Commit for Collocated Partition Transaction
Now let's analyze failure scenarios. In this case, failures of client nodes or backup nodes are not very interesting, as they do not affect the primary copies. Failures of primary nodes are a bit trickier, however they are still safe. If the primary node crashes before it sends the commit message, then the backup transaction never starts, so there are no side effects. If the primary node fails after or during the commit acknowledgement is received from the backup node, then the backup transaction is committed, and data consistency is again not violated.
The hardest part is that even though the data remains consistent in case of any cluster failures, how does the client node know whether the transaction was committed or not if it failed to get the final acknowledgement, i.e. if the primary node failed before it was able to send the acknowledgement to the client? In this case the "recovery protocol" is initiated, which was described in detail in my previous blog. Essentially, a message is sent to the backup node asking whether the transaction was committed or not. Since the backup node keeps a backlog of completed transaction IDs in memory, it can always check the backlog. If the backlog does not have the given transaction ID, the backup node will add it in the "rolled back" state and will reply to the client with rollback acknowledgement. If afterwards the backup node actually does receive the commit request for the same transaction ID from the now failed primary node, it can verify in the backlog that it was rolled back and safely ignore it.
Conclusion
By ensuring that all objects participating in a transaction are mapped to the same logical partition, we can remove the whole "prepare" phase from the distributed commit protocol, thus converting the standard 2-Phase-Commit into very light weight 1-Phase-Commit transactions.
View comments