Our customers regularly ask for disaster recovery options in combination with our JTA/XA implementation. While looking around for background information, we realised that there is little to be found, so we figured we'd study it ourselves and share our findings with you. To get the discussion started, this first part in a series on disaster recovery introduces the reader to the problem of disaster recovery in XA environments. Later parts will discuss a number of possible solutions with varying degrees of recoverability.
Let's start by setting the context. The situation we have in mind is one where requests are queued for processing by the application. A simplified description of what the application does is the following:
- it takes a request message off the queue (i.e., a ‘command’ in the domain-driven design paradigm)
- it processes the message, saving the results in the database
- it optionally publishes an event notification message (i.e., a ‘domain event’ in the domain-driven design paradigm)
- all of the above within the context of one atomic transaction
We’re assuming that all of the resources involved (message brokers and database) are XA-capable. We’re also assuming that the transaction manager keeps transaction log files for recovery.
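To make the processing loop concrete, here is a minimal sketch in Python. It is deliberately not XA: a single SQLite database stands in for the broker's request queue, the application database and the broker's event queue, so that the all-or-nothing property of the steps above is easy to see. In the real setup, each of these would be a separate XA resource coordinated by the transaction manager. All names here (`requests`, `results`, `events`, `process_one`) are ours, invented for illustration.

```python
import sqlite3

# One SQLite connection stands in for the three XA resources
# (request queue, database, event queue); in the real setup a JTA
# transaction manager coordinates the commit across all of them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests(id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE results(request_id INTEGER, value TEXT);
    CREATE TABLE events(id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO requests(payload) VALUES ('hello');
""")

def process_one(conn):
    """Atomically: take a request off the queue, save the result,
    publish a domain event. All or nothing."""
    try:
        cur = conn.cursor()
        row = cur.execute(
            "SELECT id, payload FROM requests ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None                                       # nothing queued
        req_id, payload = row
        cur.execute("DELETE FROM requests WHERE id = ?",
                    (req_id,))                                # consume the 'command'
        result = payload.upper()                              # 'process' the message
        cur.execute("INSERT INTO results(request_id, value) VALUES (?, ?)",
                    (req_id, result))                         # persist the outcome
        cur.execute("INSERT INTO events(payload) VALUES (?)",
                    ("processed:" + result,))                 # publish the 'domain event'
        conn.commit()     # one atomic commit covers all three steps
        return result
    except Exception:
        conn.rollback()   # on failure, no trace is left in any 'resource'
        raise

print(process_one(conn))  # prints HELLO
```

If processing fails before the commit, the rollback restores the request to the queue and suppresses both the result and the event. That is precisely the guarantee XA gives you across genuinely separate resources, and it is also why replicating only one of those resources is not enough, as discussed next.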
In this setting, disaster recovery typically involves an active/passive combination of datacenters, kept more or less in sync in one way or another.
The problem is simple to state: given that the two datacenters must be kept in sync, how do we do this? The naive answer is database replication: some vendor-specific mechanism that pushes updates from the active database to the passive one. In our context, however, this is not sufficient, because the database is not the only component maintaining application state: there are also the message broker(s) and the transaction log files to take into account. Database replication alone will not cut it, because you would have (at best) the database state replicated, but neither the queued requests in the broker nor the state of ongoing transactions.
The ideal solution
Ideally, you would want to have everything replicated synchronously: the database, the broker, the transaction logs and the ongoing XA sessions in each resource. That way, the passive site would be a complete mirror of the active one.
The real world
Unfortunately, the real world is far from ideal, and the ideal solution is hard to obtain: you would need a perfectly replicated vendor setup for the broker, the database and the file system. Moreover, replication in all of these systems would have to work in 'lock-step', so that the combination of replicated transaction state at the passive site is consistent with the distributed transactions happening at the active site, putting even more constraints on the system. And this is where it gets really difficult to implement: even the most sophisticated replication systems we know of fail to offer replication of ongoing XA sessions, which makes it unrealistic to expect this any time soon (and if it were possible, it would surely be among the most expensive system configurations you can think of).
So here we are: we’ve outlined the problem! Stay tuned for the sequel, where we’ll discuss a first solution.