Archive for the ‘Uncategorized’ Category

Why Amazon should use two-phase commit (or: how Amazon ripped me off)

Friday, October 24th, 2008

Working for Atomikos, I use two-phase commit a lot. While I don’t want to claim that it is a solution to all problems, I do find it frustrating to hear people proclaiming that they don’t use it because it doesn’t scale (or some other reason).

Take, for instance, Werner Vogel’s talk about the Amazon architecture. Once again, two-phase commit is rejected as a viable solution/technology. Once again, I disagree.

Let me illustrate my point with an example of what really happened to me recently - after ordering a book at Amazon (ironically;-). I can give similar examples with airline ticket reservations but those will have to wait until later…

So what happened really? Well, I ordered a book that I really wanted to have. I ordered it online at Amazon… All went well, I checked out and paid by VISA. However, that is where things started to go wrong: while waiting for the book to be delivered, I suddenly get an email from Amazon saying that… my order has been canceled!

Canceled? Yes, but not in a way you would think: I still had to pay for the delivery by DHL (sorry, what is that?!). Yes sir, DHL claimed they had found nobody present at the delivery address. The delivery was at our office address, so it is very unlikely that nobody be there in the first place. Moreover, any courier service I know will leave a note that they passed by and at least settle for an alternative delivery. Not this time.

My conclusion? DHL did not arrive at my place. On the Amazon order tracking page, my order had not even left Germany (to be delivered where I live, in Belgium).

Now what will I remember? I will remember that Amazon ripped me off, either directly or via DHL. I will also remember to be very suspicious about people who say they don’t need two-phase commit. Two-phase commit comes down to ensuring agreement between the different parties involved in a transaction. Clearly, there was no such thing in my case.

A CAP Solution (Proving Brewer Wrong)

Sunday, September 7th, 2008

One of the latest challenges in computer science seems to be the CAP theorem. It addresses a perceived impossibility of building large-scale and clustered (web) service architectures. The fact that it (supposedly) has been proven to be true makes what I am going to write here all the more unlikely. Still, read on because I will show that I am right and CAP is not an impossibility after all… While the impossibility proof of CAP is mathematically correct, it is based on assumptions that are too strict. By relaxing these assumptions, I found the solution presented here.

What is CAP?

The CAP theorem (short for consistency, availability, partition-tolerant) essentially states that you cannot have a clustered system that supports all of the following three qualities:

Consistency is a quality meaning (informally speaking) that reads and writes happen correctly. In other words, the overall effect of executing thousands or millions of transactions concurrently is the same as if they had been executed one-at-a-time. Usually, this is done with the help of a transaction manager of some sort.
Availability essentially means that every operation (that makes it to a non-failing node) eventually returns a result.
This quality refers to the possibility of tolerating partitions on the network. Note that we suppose a cluster architecture (which is where the network comes in).

CAP is a conjecture originally formulated by Eric Brewer (Inktomi) and has influenced many of today’s larger-scale websites like . In other words, the impact of CAP is very large. To make it worse, the perceived impossibility of a CAP system (one that has all three desirable properties) has lead people to advocate something called BASE (Basically Available, Soft-state and Eventually Consistent) - see this talk by Werner Vogels (CTO at Amazon).

As far as I know (but I could be wrong), a theoretical foundation of BASE does not exist yet (it seems more of an informal approach which to me raises serious questions concerning correctness). In this post I will present:

  • a CAP solution
  • how this conforms to what BASE wants to achieve
  • a “design pattern” for building correct systems that (in a way) offer both CAP and BASE qualities

Because CAP is perceived as impossible and because BASE lacks formal treatment, I consider this to be a signification contribution to the state of today’s engineering;-)

What about the proof of Brewer’s theorem?

Brewer’s proof has been published by Nancy Lynch et al and discussed by me (see my earlier post and also this one).

While the theoretical proof of the impossibility of CAP is valid, it has a big limitation: it assumes that all three CAP properties have to be supplied at the same moment in time. If you drop this assumption, then all of a sudden you get into a new spectrum of possibilities. This is what I will do here.

A CAP solution

Enough talk, let’s get to the core of the matter. Here is my solution to CAP. To make it concrete, I will use the concept of a web-shop like Amazon. Here are the rules that are sufficient to ensure CAP:

  1. Process reads from the database if possible, or use a cached value if needed for availability (if the DB is unreachable).
  2. All reads use versioning or another mechanism that allows optimistic locking.
  3. Updates supplied by clients (orders in case of Amazon) are queued for execution, and include the versioning information of the reads that lead to the update.
  4. Queued updates are processed when the number of partitions is low enough to do so. The easiest way to do this is with a cluster-wide distributed transaction across all replicas (more on scalability later), but other more refined ways are possible (such as quorum-based replication or any other smart way of replicating). The version information in the update is used to validate it: if the data in the database has been modified since the original read(s) that lead to the update, the update is rejected and a cancellation is reported back to the client. Otherwise the order is processed and a confirmation is reported back to the client.
  5. The results (confirmation or cancellation) are sent asynchronously to the clients. This can be either email, message queuing, or any other asynchronous delivery method.

That’s it. Adhere to these guidelines, and you have a CAP architecture. I will not provide a formal proof here (I intend to do that elsewhere, in a research paper), but intuitively the proof is as follows:

  • This system is consistent because reads are based on snapshots and incorrect updates are rejected before they are applied. In other words: there are no incorrect executions.
  • This system is available since reads always return a value, and so do writes (even though they are queued and it may take a while).
  • This system is partition-tolerant because it allows network and node failures.

Granted, this system does not provide all three at the same moment in time (which is how we go around the impossibility), but nevertheless the result is quite strong IMHO.

The limitations

There are some limitations to this solution - all of which seem reasonable:

  1. Read-only requests may be presented with stale information (due to updates that have yet-to-be-applied). In that sense, their results could be “inconsistent”: for instance, the availability of an Amazon item can change between two page views. I do not see this as a major restriction, since no website that I know of will offer read consistency for the duration of a user session. It all depends on what you consider to be within the scope of one transaction;-) Note that this almost corresponds to snapshot isolation found in Oracle.
  2. Partitions should not last forever: in order for this to work, partitions should be resolved within a reasonable time (reasonable being: within the expected confirmation time for updates). The duration of any partitions also affects the time window in which reads can produce stale data.
  3. The updates have to be applied in the same relative order at all cluster nodes. This puts some restrictions on the algorithm used to do this.

Note that updates are always based on correct reads thanks to the versioning check before they are applied. So update transactions are always consistent.

How does this relate to BASE?

You could see this as a design pattern for BASE if you like. The solution adheres to BASE in the sense that it uses cached reads (if needed) and that the updates are delayed (so you could say they are “eventually” applied and the system becomes “consistent”).

Reflections in scalability

So far the CAP focus was on possibility. I think my solution shows that it is possible. Now how about scaling up?

The naive solution (a huge distributed transaction to update all cluster nodes in-sync) is unlikely to scale: as you add more nodes, more updates are needed. Now I am a big fan of transactions, but not to use them in an arbitrary matter. So how to propagate these updates through the cluster?

While smarter solutions for this exist (such as the work by Bettina Kemme), a trivial first try would be to push updates (lazily) to all nodes in the cluster. This can be done with a smart queuing mechanism. The disadvantage is that updates are not applied everywhere at once (rather, the all-or-nothing quality just “ripples” through the system). So you get into the “eventually” style again.

Note that this latter suggestion makes the system behave much like the READ COMMITTED isolation level (which, by the way, is the default in Oracle). So this approach sacrifices consistency/isolation a bit in favor of scalability.

Future work

Additional research could/should be done in the following areas:

  • Improving read consistency through session affinity
  • The best way to push the updates through the cluster
  • Performance evaluation in real life implementations

Final note and disclaimer

I did not see Brewer’s original presentation of the CAP theorem - so it could be that what he meant with consistency also involved all reads (see the limitations of the solution I presented here). In that case I did not find a solution for CAP but at least it is a framework and proof outline for BASE ;-)

UPDATE 15/3/2012:

It seems like Greg Young and Udi Dahan have been working along similar lines and gave this pattern/solution a name: CQRS.

Why Forte migrations should use Atomikos

Saturday, September 6th, 2008

Forté/UDS is an end-of-life technology that used to be in Sun’s product portfolio. When talking to people who have been doing a lot with Forté in the past, it seems that Forté can be considered an ancestor of Java:

  • It has an object-oriented (4GL) development language.
  • Like Java’s JMX, Forte also has instrumentation (the agent is even called iconsole - like jconsole for Java’s built-in JMX agent these days!).
  • It has distributed transactions.
  • It has a strong notion of events as first-class citizens in the language.

The only thing that Forté does not have is Enterprise JavaBeans (EJB), nor XML configuration issues for the application server. This means that Forté developers who migrate to Java (because they are left little choice) get confronted with complexities that they did not have to bother with in their 4GL environment.

Thanks to Atomikos and the J2EE without application server methodology, teams who used to work in Forté can easily do Java/J2EE without having to bother about the clutter of EJB nor about the application server’s XML hell. What’s more, in combination with Spring, Hibernate and JMS there is an equivalent, light-weight Java stack that (thanks to Atomikos) can still do all the connection pooling, event-driven and transactional processing that is needed.

What makes it even better is that this methodology seems to achieve equal productivity as with the 4GL environment in Forté, which is pretty good given that Java is a 3GL and is not widely known as a productivity miracle.

The Achilles heel of the CAP theorem

Friday, September 5th, 2008

In my last post I discussed the theoretical proof of the CAP theorem. Both the theorem and the proof have a limitation that might very well render them not-so-universal as assumed.

The limitation of the CAP proof

The limitation of the CAP proof (as formulated by Lynch et al) is the following: it assumes that - for the purpose of availability - requests are to be served even when there is a partition in the cluster.

A way around the limitation

There is a way around this limitation - although it may sound exotic: just make sure that there are no partitions when requests are served.

How? By simply doing the following:

  • Queue requests (e.g., in JMS).
  • Only process requests when there is no partition problem.
  • Send responses asynchronously, for instance via email.

Since no partition (hopefully) lasts forever, this solution does not lead to livelock.

Also, note that quorum solutions exist to avoid that the complete cluster has to be up at the same time.

Is this the capitulation of CAP? Who knows…

My take on CAP

Wednesday, September 3rd, 2008

The CAP theorem (Consistency, Availability, Partitioning) has been receiving quite a lot of interest lately, just to mention one of the many references.

What is CAP about?

First let me give credits here: I am deriving my inspiration from the theoretical insights found in this paper co-authored by one of my favorite woman scientists, Nancy Lynch from MIT. If you get a chance to read this paper, go ahead it will bring you some very useful fundamental understanding…

The CAP theorem is essentially a limitation on what you can do with clustered (web) services in the fashionable context of SOA.

The word ‘cluster’ is important here since that is what it is all about. In particular, the theorem states that you can’t have all three properties (Consistency, Availability, Partitioning) in one and the same system (read: service). This implies that there is no perfect solution to building a high-throughput popular service, or is there? Let’s first explore what each thing means…


By consistency, the theorem refers to the property that changes (updates) to the service back-end are visible to later queries. Simplifying: if you add something to your shopping basket then it will appear there next time you retrieve your basket status. That sounds trivial, but it is not if the basket is spread over multiple physical server processes… Consistency is commonly ensured (between processes) by having some sort of distributed transaction coordinator, or (assuming a central back-end) a single centralized database.


The Lynch paper uses a very simple but sufficient definition of “availability”: a system is available if every request to it returns. In other words: there is no infinite blocking.


Partitioning means the cut-off between two segments of the cluster. In other words, one or more nodes become unreachable for at least some time.

What is the Theorem saying?

You can’t have all three of the above qualities, period. However, you can combine any two of them if you like. This is proven in the paper by Lynch et al. Also (and this is important) you can apply different combinations of qualities to parts of your system. Meaning: you can stress consistency in one part, availability in another part, and so on. For instance, order processing or payment processing can be done consistently and available (sacrificing partition tolerance) whereas querying the product catalog can be done differently (stressing partition tolerance in favor of consistency).

Does this contradict or invalidate Atomikos?

Not at all, quite the contrary: it makes Atomikos (and its third generation of TP monitors) all the more relevant. Why? Because Atomikos products can help you in making those parts consistent when you want them to be.

Virtually achieving all three qualities

If you embrace asynchronous messaging (a la JMS or email) and extreme transaction processing (XTP) then it is possible to asymptotically realize all three qualities (consistency, availability, partition-tolerance) provided that you do use a callback mechanism to communicate results (e.g., by sending a confirmation email). Here is how:

  • Queue requests in JMS.
  • Process each request transactionally (so failures will leave the request queued for retries).
  • The process that digests each request can be arbitrarily complex and use transactions (consistency) and return whenever it likes (thanks to the queuing, no reply is expected within a preset time frame).
  • Any lack of availability of the processing is recovered by the queues: failed requests will stay queued until the process in the back-end is in fact available again.

Now did I just break the CAP impossibility? More on this in a next post…

Unlimited scaling, easy!

Friday, August 1st, 2008

Suppose you want to develop a high-volume transaction processing system in Java/J2EE. How would you do it? Most people would say: don’t use JTA/XA transactions because they kill performance. Wrong. And they would also say: use an appserver to scale. Again, they couldn’t be more wrong.

Here is the magic recipe on how we build systems with virtually unlimited scalability at Atomikos:

  • Kick out your appserver as soon as you can, as explained here. J2EE is not limited to an appserver. J2EE is a set of APIs. The appserver ties these APIs to a programming model that almost nobody needs. Conclusion: drop the latter.
  • Use a persistent JMS queue to store transaction requests. This allows easy load-balancing and provides crash resilience for ongoing requests. It also de-couples the clients from the transaction processing system.
  • Use ExtremeTransactions to process the requests (stored in JMS). This allows for reliable, exactly-once message processing as outlined here. Make sure to use the supplied JMS and JDBC drivers!
  • To add more power, just add a second VM (process) on a separate CPU.
  • Repeat until performance is high enough.

You will reach the required performance because of the intra-VM nature of each process you add. The only potential bottlenecks are your own database or JMS backend. So scaling comes down to scaling your backends, which is much simpler than scaling your application itself (which has already been done in a natural way as outlined above).

So don’t let anybody fool you: transactions do scale - even without limits!.

Loosely-coupled deployment vs loosely-coupled design

Thursday, July 31st, 2008

I have talked to a number of people who claim to be doing SOA, when in the end all they do is loosely-coupled design. Let me explain what I mean by an example.

A team of enterprise architects was designing an SOA infrastructure for a bank I know. The system they were building would be based on interfaces, so that it would be possible to deploy parts of the system as separate instances later on. This was their notion of SOA…

The good thing about it is that there are interfaces in their design, meaning it is likely to be loosely-coupled. The bad news is that this is not SOA, at least not in my view: one of the biggest advantages of SOA - reuse in place - is never realized in this way. So, whereas this approach to ‘SOA’ may be loosely coupled in design, it is not loosely coupled in deployment (which is at least as important).

The consequence? Whenever a ’service’ is upgraded, they will need to upgrade all the dependent services and redeploy them. This is because each ’service’ is really an embedded module inside other parts of the system.

I guess this also holds for the debate on cloud vs grid computing: in my view, a cloud is more loosely coupled than a grid in its deployment.

BPEL and compensation

Friday, October 19th, 2007

Is BPEL a good tool for implementing compensation? It really depends, and you really have to know what you are doing - which (with all respect) doesn’t seem the case for most people (not even BPEL specialists). So if not even those experts know, how can we expect the rest of us to know? Hence this blog entry.

For instance, on repeated occasions I have heard renowned BPEL and workflow experts mention that compensating transactions are “perhaps” best modeled at the business logic level. This, by the way, includes Bill Burke in the case of JBoss/jBPM - see here. Note that I emphasized the word “perhaps”: this indicates the shade of misunderstanding usually present in the arguments.

I have been saying this here and there in the past (and in fine detail in this article), but I want to repeat it again: BPEL, nor workflow nor WS-BA are ideal for compensation unless the compensating party doesn’t care whether it needs to compensate eventually. In other words, if the compensation is business as usual to the provider of the compensatable service then BPEL might be OK (though certainly not desirable - see below).

Why is that? Put yourself in the place of a service that is asked to compensate by a BPEL engine somewhere. Also suppose that you are in a B2B ecosystem where you don’t necessarily trust the party that owns the BPEL engine. Now what would you rather do: trust the BPEL to compensate - eventually (which might be never!) or rather deal with compensation yourself, say after a timeout? I would definitely choose the latter. I don’t want someone else to decide when I need to compensate. I want to decide for myself, and the Atomikos TCC model allows for that. BPEL and jBPM don’t.

So BPEL is ruled out for me - at least as far as compensation goes. What about WS-BA? It is a step in the right direction, but unfortunately it is a bloated protocol, very inefficient and loaded with application-level messages that pollute the compensating part. Even worse, it also suffers in a large part from the lack of timeout and depends on the BPEL to at least trigger compensation.

Also, WS-BA doesn’t allow for application logic on close - I won’t go and bother you with the entire spec details but it is like a try..catch…finally where the exception is raised by the client (ugly!) and where the finally block can only be empty! Again, Atomikos TCC is far superior, more efficient and more elegant. It is also more natural for compensation than any BPEL engine will ever be.

One last note on BPEL and this supposed “modeling the compensation in the business process”: I was talking to an IBM architect the other day. He said that they were doing a large telco project with BPEL to co-ordinate things. One of the things he complained about was exactly this: they have to model the compensation and error logic as explicit workflow paths, and it was literally overloading everything with complexity. Moreover, this complexity is hard to test. As he correctly put it, they were implementing a transaction manager at the business logic (BPEL) level, over and again in every process model. In addition, this was also hard to test he said and that it was virtually killing the project - especially if there were change requests to consider. I believe him:-) I gave him the URL to our TCC article above.

Atomikos and TCC allow you to focus on the happy path of your workflow models. We take care of the rest. Now imagine what a reduction in complexity that is, and how much more reliable things get! So no, compensation should NOT be modeled at the business level. Except on rare occasions maybe.

REST and reliability

Friday, October 19th, 2007

Whenever I see a presentation on REST I am impressed by its simplicity. With just four operations (GET, POST, PUT, DELETE) it seems to accomplish a simple model for service-oriented architectures, where every business resource has a URL.

With this simplicity, REST also leverages the ubiquitous HTTP protocol as the underlying mechanism. More and more people seem to like this, including me.

However, the big question for me is: how do you make this reliable? Imagine that you integrate 4 systems in a REST style. You would be using HTTP and a synchronous invocation mechanism for each service. Now comes the question: how reliable is this? The answer: less than the least reliable system that you are using! More precisely, availability goes down quickly because your aggregated service fails as soon as one of the services fails…

With transports like JMS you can improve reliability, but how do you do REST of JMS, given its close relationship with HTTP and URLs? That is the problem with REST for me.

Data Replication in SOA: The Price of Loose Coupling

Thursday, October 11th, 2007

When designing a corporate SOA architecture you are often faced with a tough choice: do you rely on a common database (centralized) or do you implement replication instead?

Let me explain what I mean. The idea in SOA is that you define more or less independent services that correspond (hopefully) to clearly defined and business-related activities. For instance, you could have a customer management service and a payment/invoicing service. The customer management service belongs to CRM, the invoicing to the billing department. However, both of these services might need the same customer data. Now what do you do? Basically, you have the following options:

  1. Use the same centralized customer database. This gives you the benefit of easy maintenance because there is only one copy. However, this also means that you are coupling your services into the same database schema, and updates to the schema are likely to affect more than one service.
  2. Replicate the customer database, by identifying one master (the CRM?) that regularly pushes or publishes updates (in an XML feed, for instance). While you lose the benefit of easy maintenance, this does give you loose coupling: as long as the XML format is the same, you can change DBMS schemas as much as you like - without affecting other services.
  3. Merge the customer and invoicing services into one. However, this may not always be possible or desirable, and may even defeat the purpose of service-oritentation altogether.
  4. Have the invoicing query the customer service for each payment. Thi seems to incur a lot of dependencies and network traffic.

So what do you do? My preference tends to go to the second option. However, it means that realistic SOA architectures are likely to have an event-driven nature.