The Achilles heel of the CAP theorem

In my last post I discussed the theoretical proof of the CAP theorem. Both the theorem and the proof have a limitation that may well make them less universal than commonly assumed.

The limitation of the CAP proof

The limitation of the CAP proof (as formulated by Lynch et al) is the following: it assumes that - for the purpose of availability - requests must be served even while there is a partition in the cluster.

A way around the limitation

There is a way around this limitation - although it may sound exotic: simply make sure that there are no partitions while requests are being served.

How? By simply doing the following:

  • Queue requests (e.g., in JMS).
  • Only process requests when there is no partition problem.
  • Send responses asynchronously, for instance via email.

Since no partition (hopefully) lasts forever, this solution does not lead to livelock.
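The queue-then-process idea above can be sketched in a few lines. This is only an illustration: the in-memory deque stands in for a persistent queue (JMS in the post), and `partition_detected` is a hypothetical health check, not a real API.

```python
from collections import deque

def partition_detected(healthy_nodes, total_nodes):
    # Hypothetical health check: treat any unreachable node
    # as a possible partition.
    return healthy_nodes < total_nodes

def enqueue(queue, request):
    # Always accept requests: enqueueing does not need the cluster.
    queue.append(request)

def drain(queue, healthy_nodes, total_nodes, handle):
    # Only process while no partition is suspected; otherwise leave
    # requests queued until the partition heals.
    responses = []
    while queue and not partition_detected(healthy_nodes, total_nodes):
        responses.append(handle(queue.popleft()))
    return responses

queue = deque()
enqueue(queue, "req-1")
enqueue(queue, "req-2")

# During a partition (2 of 3 nodes reachable) nothing is processed...
assert drain(queue, 2, 3, str.upper) == []
# ...and once it heals, queued requests are handled in order.
assert drain(queue, 3, 3, str.upper) == ["REQ-1", "REQ-2"]
```

Responses would then go back asynchronously (e.g. by email), so the client never observes a request being served across a partition.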

Also, note that quorum solutions exist to avoid that the complete cluster has to be up at the same time.
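The quorum condition itself is just a strict-majority test; a minimal sketch (the helper name is mine, not from any particular product):

```python
def has_quorum(reachable, cluster_size):
    # A strict majority of nodes is enough to process requests,
    # so the complete cluster need not be up at the same time.
    return reachable > cluster_size // 2

# 5-node cluster: 3 reachable nodes form a quorum...
assert has_quorum(3, 5)
# ...while 2 do not, which also rules out two disjoint "majorities".
assert not has_quorum(2, 5)
```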

Is this the capitulation of CAP? Who knows…

3 Responses to “The Achilles heel of the CAP theorem”

  1. PetrolHead says:

    Have you read:

    http://portal.acm.org/citation.cfm?id=214121

    “Impossibility of distributed consensus with one faulty process”?

This is an important result and it has significance for your comments and the CAP theorem. Essentially, one can't tell the difference between a genuine failure and a slow machine or a busy network.

Thus your solution might work for a very small number of machines, all in a single data-centre. For larger installations, failures of machines, routers, switches, cables etc. will happen several times a day, so quorums and clusters become considerably less practical and loose consistency more attractive.

    Note also that the theorem isn’t just about clustered services in the traditional sense but also services that run across multiple data-centres.

    I also have a specific observation:

    “….note that quorum solutions exist to avoid that the complete cluster has to be up at the same time.”

This is true, but in practice they are limited by a number of factors:

(1) The assumption that you will have a majority - seemingly straightforward, but a partition plus the loss of a machine can leave you without one.

(2) Getting all members back into sync. This can require all sorts of special admin involvement, and it can go wrong.

(3) Performance - quorum protocols, especially across enough nodes to ensure survival, can be slow.

(4) Ensuring that clients don't continue to make use of the minority during a partition, e.g. reporting out-of-date information.

(5) You can have a cluster capable of achieving consensus, yet be unable to reach it because the network between the cluster and its clients is broken.
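Point (1) can be made concrete with a little arithmetic (a sketch, using the standard strict-majority quorum rule):

```python
def has_quorum(reachable, cluster_size):
    # Standard strict-majority rule for quorum systems.
    return reachable > cluster_size // 2

CLUSTER = 5

# A partition splits the cluster into sides of 3 and 2 nodes:
# only the 3-node side has a majority and can continue.
assert has_quorum(3, CLUSTER) and not has_quorum(2, CLUSTER)

# If the majority side then loses one machine, the remaining
# sides have 2 nodes each - no majority anywhere, as point (1) warns.
assert not has_quorum(2, CLUSTER)
```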

    Best,

    Dan.
    http://www.dancres.org/blitzblog

  2. Guy says:

    Hi Dan,

I have indeed read “Impossibility of distributed consensus with one faulty process” - it is the basis of the heuristic exceptions in all two-phase commit solutions (including Atomikos).

However, what I am saying is that a failure usually only lasts for so long, and afterwards things can move on. Exploiting the right tools to do that can help availability.

That is the main advantage of (persistent) queues, and that is all I am saying. Lynch et al do not seem to exploit it as much as they could…

    Guy

  3. barry says:

    » Only process requests when there is no partition problem.

Doesn’t this mean that you are sacrificing availability? You’ve turned a partition failure into an availability failure. While requests and responses are queued so that nothing is lost, that doesn’t mean all is well: a response may take a long time to come back, which is as much of a problem as getting an error.