Tuesday, August 31, 2010

The problems with ACID, and how to fix them without going NoSQL

(This post is coauthored by Alexander Thomson and Daniel Abadi)

It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., see these links). Rather, systems such as Bigtable, HBase, Hypertable, Cassandra, Dynamo, SimpleDB (and a host of other key-value stores), PNUTS/Sherpa, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.

Our objective in this post is to explain why ACID is hard to scale. At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties---it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.

ACID, scalability and replication

For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).

The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID's atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today's OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.
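
To make the contention problem concrete, here is a rough back-of-the-envelope sketch (the specific numbers are illustrative assumptions, not measurements from any system):

```python
# Back-of-the-envelope sketch: how a distributed commit protocol inflates the
# window during which a transaction holds its locks. All numbers are assumptions.

work_per_txn_us = 10.0        # actual work per lightweight OLTP transaction (us)
round_trip_us = 500.0         # one network round trip on a commodity cluster (us)
commit_round_trips = 2        # e.g., prepare + commit phases of two-phase commit

local_hold_us = work_per_txn_us
distributed_hold_us = work_per_txn_us + commit_round_trips * round_trip_us

print(f"single-node txn holds locks ~{local_hold_us:.0f} us")
print(f"distributed txn holds locks ~{distributed_hold_us:.0f} us "
      f"(~{distributed_hold_us / local_hold_us:.0f}x longer -> far more contention)")
```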

In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID's consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e., replicas are identical copies of one another, as in the CAP/PACELC sense of the word consistency).

Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.

Today's solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to the other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although schemes that first execute each transaction at one of multiple possible masters also fall into this category. In addition to the possibility of stale reads at slave replicas, these systems suffer from a fundamental latency-durability-consistency tradeoff: either the primary replica waits to commit each transaction until it receives acknowledgement of sufficient replication, or it commits as soon as it completes the transaction. In the latter case, either in-flight transactions are simply lost if the primary replica fails (threatening durability), or they can only be recovered after the failed node comes back up, in which case transactions executed at other replicas in the meantime threaten consistency.
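
The tradeoff shows up in the two commit paths a primary has to choose between; the sketch below uses hypothetical stand-ins (a toy Replica class with a ship() method) rather than any real replication API:

```python
# Sketch of the latency-durability tradeoff in post-write replication.
# Replica and its ship() method are hypothetical stand-ins, not a real API.

class Replica:
    def __init__(self):
        self.log = []
    def ship(self, record):           # pretend this crosses the network
        self.log.append(record)
        return True                   # acknowledgement from the replica

def commit_sync(record, replicas):
    """Durable and consistent, but the client waits on replication latency."""
    if all(r.ship(record) for r in replicas):    # blocks until all acks arrive
        return "committed"

def commit_async(record, replicas):
    """Low latency: the txn counts as committed before any replica confirms."""
    for r in replicas:
        r.ship(record)                # after-the-fact, best-effort propagation
    return "committed (replication pending)"

replicas = [Replica(), Replica()]
print(commit_sync({"txn": 1, "write": ("x", 42)}, replicas))
print(commit_async({"txn": 2, "write": ("y", 7)}, replicas))
```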

In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems, due to complex, high-overhead commit protocols and the difficult tradeoffs among available replication schemes.

The NoACID solution

Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. ACID is typically weakened in two ways. First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores support atomicity and isolation only when each transaction accesses data within some convenient subset of the database (a single tuple in Bigtable and key-value stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: any logical transaction that spans more than one of these subsets must be broken up at the application level into separate transactions, so the system guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must implement any additional ACID functionality at the application level.
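
As a sketch of what that application-level splitting looks like (Partition here is a hypothetical in-memory stand-in for one shard or tablet, not any particular system's API):

```python
# Sketch: splitting one logical transaction into two single-partition
# transactions, as forced by stores that only guarantee per-partition
# atomicity. Partition is a hypothetical in-memory stand-in for a shard.

class Partition:
    def __init__(self, data):
        self.data = dict(data)
    def apply(self, key, delta):      # atomic within this partition only
        self.data[key] += delta

def transfer(src, dst, src_key, dst_key, amount):
    src.apply(src_key, -amount)       # transaction 1: debit on partition A
    # If the application crashes here, the debit is durable but the credit
    # never happens: no atomicity or isolation across the two partitions.
    dst.apply(dst_key, +amount)       # transaction 2: credit on partition B

a, b = Partition({"alice": 100}), Partition({"bob": 0})
transfer(a, b, "alice", "bob", 25)
print(a.data, b.data)                 # {'alice': 75} {'bob': 25}
```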

Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). Except with regard to some well-known and much-publicized Web 2.0 applications, losing consistency at all times (regardless of whether a network partition is actually occurring) is too steep a price to pay in terms of complexity for the application developer or experience for the end-user.

Fixing ACID without going NoSQL

In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.

Our view (and yes, this may seem counterintuitive at first) is that the problem with ACID is not that its guarantees are too strong (and that scaling these guarantees across a shared-nothing cluster of machines is therefore too hard), but rather that its guarantees are too weak, and that this weakness is what hinders scalability.

The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular serial order the execution will be equivalent to: A-B-C, B-A-C, A-C-B, etc.

This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas' states to diverge.
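
A tiny example of how two non-commutative transactions can leave two replicas in different states when each replica picks its own (perfectly legal) serial order; the values are made up:

```python
# Sketch: two non-commutative transactions, two replicas, two serial orders.

def txn_a(db):                        # "apply 10% interest"
    db["balance"] = round(db["balance"] * 1.10, 2)

def txn_b(db):                        # "deposit 100"
    db["balance"] += 100

replica_1 = {"balance": 1000}
replica_2 = {"balance": 1000}

txn_a(replica_1); txn_b(replica_1)    # replica 1 serializes as A then B
txn_b(replica_2); txn_a(replica_2)    # replica 2 serializes as B then A

print(replica_1)                      # {'balance': 1200.0}
print(replica_2)                      # {'balance': 1210.0} -- replicas diverged
```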

More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.

If the isolation property were strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system that accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) would be eliminated. If the system were also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) would be eliminated as well.
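
A minimal sketch of what such an ordering layer might look like (the class and method names here are illustrative, not the architecture from our paper):

```python
# Sketch of an ordering layer: a preprocessor fixes one global order and
# broadcasts the same ordered batch to every replica; each replica's engine
# is assumed to execute the batch deterministically (stubbed out here).

class Preprocessor:
    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []

    def submit(self, txn):
        self.pending.append(txn)              # arrival order = the global order

    def dispatch_batch(self):
        batch, self.pending = self.pending, []
        for replica in self.replicas:         # identical sequence to all replicas
            replica.execute_in_order(batch)

class DeterministicReplica:
    def __init__(self):
        self.state = {}
    def execute_in_order(self, batch):
        for txn in batch:                     # same input order -> same final state
            txn(self.state)

def set_x(state): state["x"] = 1
def bump_x(state): state["x"] += 1

replicas = [DeterministicReplica(), DeterministicReplica()]
pre = Preprocessor(replicas)
pre.submit(set_x)
pre.submit(bump_x)
pre.dispatch_batch()
print([r.state for r in replicas])            # [{'x': 2}, {'x': 2}] -- identical
```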

This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.

The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)

A deterministic DBMS prototype

In our paper, entitled “The Case for Determinism in Database Systems”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.

We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.

Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper's appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.

68 comments:

  1. Very interesting angle. I definitely need more time to digest the paper; some first thoughts:

    The essence of this approach seems to heavily rely on "on the fly transaction reordering", but it was not clearly described how that was implemented in a distributed environment.

  2. Hi Juri,

    Thanks for the comment. Actually the post says that we have to enable highly concurrent transaction execution *without* on-the-fly transaction reordering. I totally agree that on-the-fly transaction reordering within a deterministic framework would be highly nontrivial to implement :)

  3. Very interesting paper with some compelling ideas.

    Can you describe what approach would be used to scale and replicate the preprocessor?

    Ariel

  4. Hi Ariel,

    All the preprocessor must do is agree on an arbitrary ordering. To avoid a single point of failure, it can be replicated across multiple independent nodes, and any of the well-known agreement protocols from the systems community (e.g. Paxos) can be used to agree on the transaction ordering.

  5. I went to the NoSQL East conference earlier this year (@ GTRI in Atlanta) and kept trying to beat the drum that "_SQL_ isn't the enemy (though it's no cancer cure either) - vertical architectures are!". Interesting reading - hopefully it'll evolve into a viable project!

  6. While it is an interesting read, NoSQL is here to stay. NoSQL is not simply about scale, but also about data representation.

    Not having to spend a large amount of time designing for access up front is a plus with NoSQL. Normalization is great, but not all things have to fit in a bucket; we can use a shoe box as well.

  7. I look forward to reading the paper. I think there is a typo: "What is really needed is a way for -to- ACID systems to scale"

  8. Thanks Kevin, I just fixed the typo.

  9. BTW - am I the only one who noticed (maybe somewhat ironically) that in the above pic this guy looks a lot like a young Larry Ellison?

  10. LOL, I'm not sure if I should be honored or scared :)

  11. Hm.

    So you need to introduce a global ordering of transactions. Which WILL require a shared resource among ALL of the transactions. No way around it, sorry.

    And what's funny, this resource suffers from some of the same problems as ACID systems. However, there should be advantages (no need for rollbacks, etc.).

    Besides, all of this doesn't tackle another advantage of NoSQL systems: working with HUGE amounts of data. There'll still be problems in ACID systems if data access requires communication between several storage nodes.

  12. I'd like to comment on the premise of the article: "It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems"

    I disagree with this premise. While I look forward to distributed ACID, I've found working with MongoDB, a NoSQL database, to have some significant advantages over SQL. Like SQL databases, MongoDB has indexes and the ability to write ad-hoc queries against its data. (This isn't the case for key-value stores, or for databases that rely completely on map-reduce.) In contrast, MongoDB stores schema-less structured documents instead of rows in strict tables.

    MongoDB's document-based approach solves some real-world problems that occur when building complicated programs that persist high volumes of data. Specifically, it's much easier to handle both irregular and structured data. Modeling and handling inheritance is easier and faster in Mongo. Likewise, because Mongo returns structured data to the program, there is no need to run many queries to hydrate data structures around foreign-key relationships.

    So, while I look forward to distributed ACID, please don't overlook the tangible advances in database programmability that some NoSQL databases offer.

  13. re: Reordering and nonconcurrency
    Consider an example where client applications (NOT Stored Procedures), within one transaction scope:
    Select
    work
    Update
    work
    Select
    work
    Update
    Commit

    Does your solution handle this kind of application?
    Doesn't nonconcurrency become single threading?
    How can something "reorder" these transactions?
    How could it scale?

  14. Don't get me wrong, I'm looking forward to all those solutions to solving what many people haven't been able to. I know this sounds cynical, but I really mean it.

    I do find it hard to believe you and your co-authors actually took the time to review all the so-called NoSQL solutions; instead, you probably decided to make a well-educated guess from maybe a couple you did look at. Money at Yale well spent.

    NoSQL is hardly NoACID. Even though some solutions don't care about ACID, it's hard to generalize.

    It's a different approach to the same problem.

    At times, the NoSQL movement may come across as sassy. The "NoSQL" name itself kind of asks for it.

    And, yes, a lot of people don't really know what they are talking about when they talk about scalability and sharding and use all the other fancy buzzwords.

    You shouldn't let these things get in between you and the technology though.

  15. BTW, don't forget the CAP theorem.

    You can't get Consistency, Availability and Partition Tolerance at the same time.

    RDBMSs typically 'solve' it by dropping the requirement for partition tolerance, usually by using quorum schemes, etc.

  16. I'm excited to read the paper, but thought I'd chime in quickly with this link. I was immediately reminded of this paper when you mentioned an "ordering layer":
    http://www.usenix.org/events/osdi08/tech/full_papers/mao/mao.pdf

  17. Strengthening the serializability of ACID is an interesting approach. But, I'm curious about your objections to traditional shared-nothing scalability.

    For example, you claim that the isolation guarantee requires transactions to hold locks for the duration of the transaction. You must be aware of alternative lockless schemes such as MVCC.

    Also, you claim that the distributed commit protocol adds a couple of network round trips which causes locks to be held "orders of magnitude longer" than the actual work involved. That's debatable, because we're already in a distributed environment where it costs a network round trip just to get to the data in the first place.

    Also, the latency of network fabrics like InfiniBand is under a microsecond. That compares favorably with your 10-microsecond estimate of work.

  18. This is a really interesting post.

    "Systems such as Bigtable... are mostly concerned with system scalability".

    I think they are also concerned with system maintainability. Migrating fixed (e.g. SQL) schemas can be a major headache for many, which schemaless (e.g. XML, etc.) formats avoid.

    My guess is that this is the prime motivation of 50% of NoSQL users, although those 50% probably represent only 5% of the actual transaction volume (that is, the Reddits of the world are probably primarily interested in the scalability angle).

    You could argue that with schemaless databases, there again, you are really just pushing the problem back onto the application developer, who has to figure out what to do in the absence of the column and type guarantees that SQL provides.

  19. One point I'm confused on is not allowing transactions to fail. If I understand correctly, this is an important part needed to enforce determinism without reordering. (Analogy: how to implement gap-less sequences without serializing access to the sequence.)

    What if something goes wrong in a related system or the user changes their mind? Am I correct in saying rollback is just not an option for any reason with this approach?

  20. Reading the paper, it looks like the transaction pre-processor would be a single point of failure. Since transactions have to be durably written to the pre-processor before being sent to any partition of any replica, it seems like the pre-processor would also be a point of contention under heavy write load. Is that correct?

    How would I use this approach with a system that is partitioned across hundreds of nodes?

  21. ppeterd

    All transactions in the system are deterministic stored procedures. Procedures can roll back, but they have to do so deterministically. All that means is that procedures have to get the current time and random numbers from the database. Deterministic means that if the same procedure is run on multiple replicas, then the result (commit or rollback) will be the same at each replica. This allows replicas to run without a per-transaction commit protocol. I think in this paper (as opposed to H-Store) there is no commit protocol at all; it is all done at the preprocessor.

    From within a stored procedure you can't do something like access an external system. A procedure has to run to completion using only its parameters and data it pulls from the database.
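
    (A minimal sketch of that constraint, with made-up names rather than the paper's actual API: the nondeterministic inputs are chosen once by the database layer and passed in, so every replica computes the same result.)

    ```python
    import random

    def apply_late_fee(db, account_id, now, rng_seed):
        # 'now' and 'rng_seed' are supplied by the database layer, not read
        # from the wall clock, so all replicas see identical inputs.
        rng = random.Random(rng_seed)
        acct = db[account_id]
        if now > acct["due_date"]:
            acct["balance"] += 25
        acct["audit_token"] = rng.randint(0, 10**9)
        return "commit"                   # same outcome at every replica

    db = {"a1": {"due_date": 100, "balance": 0}}
    print(apply_late_fee(db, "a1", now=150, rng_seed=42), db["a1"]["balance"])
    ```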

    Bob

    The preprocessor can be replicated so as not to be a SPOF, similar to how ZooKeeper is used in other distributed systems.

  22. Hi Mr. Abadi,

    "totally agree that on-the-fly transaction reordering within a deterministic framework would be highly nontrivial to implement :) "

    I was really hoping that this paper was the long-sought solution I was looking for, for this kind of problem.

    Do you know of any work/papers related to this area?

    It is always a pleasure to read you!

  23. Daniel, I'm curious if you are familiar with how MarkLogic deals with ACID transactions at scale?

    Anyway, interesting read, concepts and challenges ahead. Good luck!

  24. so, in other words, do it all in RAM and it gets faster?

  25. Ariel - I can see how you can use ZooKeeper to do master election and fail-over (and thereby avoid SPOF), but I'm still curious about how you scale out writes.

  26. Very interesting...

    Yes, ACID is one issue around scalability. But the relational model is another issue. When you want to query data across multiple tables, and you want to perform relational operations on them, such as order by, then you are basically forced to copy all necessary data to a single location to accomplish the join. I suspect that any scalable system needs to forgo joins, and do massive denormalization, as well as deal with issues of ACID...

  27. "The preprocessor can be replicated so as not be a SPOF. Similar to how ZooKeeper is used in other distributed systems."

    It can't really be replicated, or you get [almost] all the same problems as with distributed commits. It can be made fault-tolerant (to some degree), but it'll still be the bottleneck.

    IMHO.

  28. Thanks everyone for the great questions and comments so far.

    Cyberax,
    You're right that the preprocessor is a shared resource between all transactions. However, it does not extend the duration for which locks need to be held and therefore does not contribute to lock contention, which is the main hurdle to outward scalability for traditional ACID systems.

    GWBasic,
    You're absolutely right that many NoSQL systems have unique features perfectly suited to various applications. The thing that appears to be common to all of the widely used NoSQL systems, however, is excellent scalability---which is uniformly accomplished by relaxing ACID. The statement "NoSQL means NoACID" doesn't refer to any system in particular, but to the category as a whole.

    Accordingly, while the methodology we propose for building scalable ACID systems is designed to work for relational systems, you could also apply it to systems supporting arbitrary schemas (or lack thereof).

    David,
    You raise a great point regarding data dependency. Our scheme can't support interleaving of application and database code within transactional boundaries, although the optimistic technique discussed in the paper can be used explicitly at the application level to get around this with fairly low performance overhead in most cases.

    Ariel,
    Spot on, thanks for those clarifications!

    Bob,
    The preprocessor only has to durably record (in order) each transaction's type (assuming there exists a set of pre-defined transactions) and its arguments. Right now our prototype does this serially at each node of the distributed preprocessor. (We have not gotten to the point where the execution engine is so fast that this easy task causes a throughput bottleneck---I will be pleased if that happens.)

  29. David,
    You're right that joins are prohibitively expensive, and they would be even worse in a deterministic system. Better to do the short transactions here and the long analytical queries in a warehouse.

  30. Would it be fair to say, generalising a bit too much, that the core contribution of this work is not the separation of agreement from execution (as this is something that, for example, the BFT folks have known is a good idea for a while), but a way to come up with a deterministic ordering at agreement time that permits a high degree of parallelism at execution time where it is available?

  31. So you've effectively walked down the path of not being partition tolerant. There is no silver bullet; the CAP theorem, unless shown otherwise, is inviolable.

    The hardware platform at scale is known to be akin to shifting sands: it WILL fail.

    A major benefit of NoSQL solutions also has a lot to do with the ability to inject a semantically (with respect to the domain) relevant exception handling strategy, rather than using whatever the RDBMS exposes, or partitioning your code across the app servers and data stores. It's not a matter of foisting a problem, as many NoSQL solutions have lower-level handling mechanisms for errors, but rather that the application can deal with it if so desired.

    Again, all you've shown is that you can be first and foremost consistent and largely available, but not partition tolerant. If one were to take Dynamo and its intended goals, this solution doesn't seem like it'd be appropriate.

    With that said, it's rather interesting, and I'll be digging into it further.

  32. Why do you think people use Serializable as an isolation level? I've never seen it used in practice in my many years of working with OLTP systems.

    Repeatable Read is about the strictest isolation level you'd want, and read committed is many times good enough.

  33. BTW, I think you are confusing SQL Azure with Windows Azure Table Storage. The former provides traditional ACID guarantees, while the latter is more of a NoSQL design.

  34. Dan Adkins,
    Some really good points. It turns out that MVCC and other lockless schemes suffer from contention in essentially the same way as two-phase locking. In the case of MVCC, when two transactions access a data item out of timestamp order (and when the second one to actually access the item is trying to write to it), the transaction with the earlier timestamp is aborted. And if I remember correctly, it's never safe in MVCC to commit a transaction that depends on data which has been updated more recently than the timestamp of the oldest active transaction. So even though there isn't an explicit lock manager, accessing data can still prevent certain other transactions from updating it, so it still gets LOGICALLY locked. Again, the longer transactions run, the worse the contention, so commit protocols that extend logical locks stifle scalability. It would be really, really interesting to try to determinize (deterministicize? determinate? anyhow, design a deterministic version of) MVCC.
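
    (For the curious, a rough sketch of that logical contention under classic timestamp-ordered MVCC rules; this is a simplification, not any particular engine's implementation.)

    ```python
    class Item:
        def __init__(self, value):
            self.value = value
            self.max_read_ts = 0          # newest timestamp that has read this item
            self.write_ts = 0             # timestamp of the newest committed version

    def mvcc_write(item, txn_ts, new_value):
        # Letting an older writer proceed after a younger transaction has already
        # read (or written) the item would violate timestamp order -> abort it.
        if txn_ts < item.max_read_ts or txn_ts < item.write_ts:
            return "abort"                # no lock manager, but logically locked
        item.value, item.write_ts = new_value, txn_ts
        return "ok"

    x = Item(10)
    x.max_read_ts = 7                     # a transaction with timestamp 7 read x
    print(mvcc_write(x, 5, 99))           # older writer (ts=5) arrives late: 'abort'
    ```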

    True, if one node is the primary executor of a distributed transaction and remotely accesses data on other nodes by sending them requests, a round trip is required after lock acquisition. In a system with an explicit preprocessor, however, there's an easier way to do it: have the preprocessor broadcast instructions to every node involved in the first place and let them execute independently. Sometimes program logic at node A actually depends on a read at node B, in which case a single one-way message is required, but transactions should never have to wait for network round trips before releasing locks.

    With 1-microsecond network latency, I imagine that even traditional ACID RDBMSs with distributed commit protocols scale pretty decently, and the improvements that might be achieved by enforcing determinism would not be as large as when scaling outward with commodity switches. But InfiniBand and other high-performance interconnects really bring you closer to the scale-up category than scale-out.

  35. Responding to the comments that Alex hasn’t already answered:

    Nuno Job: I do not believe that MarkLogic is focused on high throughput OLTP applications over nonpartitionable workloads. Hence, I don’t think they need to worry about the problem we’re trying to solve.

    Ryan, if you put a traditional OLTP ACID-compliant transactional database entirely in main memory, it will still not scale out well in a shared-nothing distributed fashion. This is why scalable OLTP systems are such an active research topic.

    Henry, actually, no, we haven’t made any contributions in how to come up with an agreement on transaction ordering. Right now this ordering is done completely arbitrarily by the preprocessor. We discuss in the future work section of the paper that there might be more intelligent things that can be done here, but this is for a future research paper :)

    BTW, Henry, if you are Henry Robinson from Cloudera, I think your fourth point in your eight point quick thoughts summary is a good overview of the main contribution.

    Runt1me, actually, our paper discusses a way our scheme can be optimized for lower isolation levels in addition to full serializability. So the question of what isolation level a user wants to run is actually orthogonal to how to scale at any particular isolation level. It actually works out quite nicely.

    Daniel, no, we really meant SQL Azure. You only get traditional ACID guarantees up to 50GB (which we are calling a ‘partition’). Anything more than 50GB and you are on your own.

    There are two other general sets of points that people have repeatedly brought up that I want to get to:

    There are some people that disagree with our assertion that NoSQL is really NoACID. I think that NoSQL has come to mean a lot of different things to a lot of different people, and if NoSQL means something special to you, this is not a problem. But I want to underline Alex’s point: our research is about scaling ACID. If you love your NoSQL system for reasons besides the fact that it scales well by weakening ACID, then you might eventually want to undo this weakening of ACID in your NoSQL system, and we believe that adding determinism is a way to accomplish this.

    Finally, there are some people who are worried about scaling the preprocessor. I would like to point out that this is a much easier problem than scaling general transactions in a distributed database. Rather than worrying about scaling all the mess inside the database system, all you have to worry about is scaling the simple task of receiving transactions, ordering them arbitrarily, and logging them in batches to disk (and sending them to the deterministic database system). There are several ways that this much simpler task can be scaled. One of them is proposed in a recent blog post by Yang Zhang, who was a PhD student at MIT at the same time I was:

    “One possibility is to partition the central coordinator, and establish an ordering among the partitions. Time is divided into epochs, and the transactions received in each epoch are ordered according to the partition ordering. To prevent clock skew from drifting apart each partition’s notion of the same epoch (causing transactions to block on other transactions in the same epoch), the partitions would need to synchronize themselves every so often. This approach introduces some transaction latency, but hopefully not substantially more than typical transaction batching.”
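
    (A small sketch of the idea in that quote, with illustrative structure: per-partition batches collected for one epoch are merged into a single global sequence using a fixed partition ordering.)

    ```python
    def global_order(epoch_batches):
        """epoch_batches: {partition_id: [txn, txn, ...]} received in one epoch."""
        ordered = []
        for partition_id in sorted(epoch_batches):       # fixed partition ordering
            ordered.extend(epoch_batches[partition_id])  # keep arrival order within
        return ordered                                   # same sequence on every replica

    epoch_7 = {2: ["t5", "t6"], 0: ["t1"], 1: ["t3", "t4"]}
    print(global_order(epoch_7))          # ['t1', 't3', 't4', 't5', 't6']
    ```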

  36. How is this different from a traditional sharded SQL setup? How do you maintain transactionality across 'partitions' of your data? It is widely quoted that Facebook has no distinct sub-forests in their user graph, so the question is fairly pertinent.

    In other words, can you support a database of > 1PB of data, with full transactions across arbitrary sets of rows and tables with join capability?

  37. Is there a link to Henry Robinson's eight point quick thoughts?

  38. I also disagree somewhat with the premise. The great virtue of some of the NoSQL databases for me isn't scale or performance, it's semantics. For me it really *is* about getting rid of SQL itself. I'm thrilled to hear about scalable ACID solutions as I really need those, but the kinds of storage problems I actually have, wherein the schema itself needs to incorporate some kind of dynamism (for instance, where you have a population of heterogeneous objects in an extensible class hierarchy that must evolve over time), just don't map comfortably into the relational model (one colleague's memorable phrase was "not acceptably possible" -- I have yet to encounter a truly satisfactory ORM strategy). In this respect the whole NoSQL movement has been a godsend.

    Conflating the issue of scaling ACID with the issues raised by the limitations of the relational model makes you appear disingenuous (I'm not saying that you are, BTW, just that it weakens your argument).

  39. Thanks for the response Daniel. For anyone interested, the blog post Daniel mentions is here.

  40. This idea seems strongest when the entire dataset is in memory and transactions affect few rows. It's increasingly feasible to store many OLTP datasets in memory these days, but I think one of the goals of BigTable, HBase, etc is that they need to perform well on datasets that are far bigger than fit into memory.

  41. This problem would have been solved a long time ago if there was an open source scale-out yes-SQL yes-ACID project. Instead the solutions are all proprietary and funded by investors who are in it for the money (darn them!).

  42. Many earlier comments have covered much of what I would say. However, nobody to date has raised an objection to the mildly offensive contention that "the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues." Possibly this was not meant in the pejorative sense, but it reads that way. I would argue the correct term of art here is pragmatism, not laziness.

    I am a contributor to the HBase project. HBase is an open source implementation of the BigTable architecture. Indeed our system does scale out by substantially relaxing the scope of ACID guarantees. But it is a gross generalization to suggest "NoSQL" is "NoACID", and somehow lazy in the pejorative sense, and this mars the argument of the authors. HBase at least in particular provides durability, row-level atomicity (agree here this is a nice convenient partition), and favors strong consistency in its design choices. In this regard, I would also like to bring to your attention that the authors made an error describing the scope of transactional atomicity available in BigTable -- the scope is actually the row, not each individual KV.

    Also, at least HBase in particular is a big project with several interesting design/research directions and so does not reduce to a convenient stereotype: a transactional layer that provides global ACID properties at user option (that does not scale out like the underlying system but is nonetheless available), exploration of notions of referential integrity, even consideration of optional relaxed consistency (read replicas) in the other direction.

    Back to the matter of pragmatism: While it is likely most structured data store users are not building systems on the scale of a globally distributed search engine, actually that is not too far off the mark for the design targets of some HBase installations. We indeed do need to work with very large mutating data sets today and nothing in the manner of a traditional relational database system is up to the task. The discussion here, while intriguing, is also rendered fairly academic by the "horrible" performance if spinning media is used. Flash will not be competitive with spinning media at high tera- or peta-scale for at least several years yet. Other commenters have also noticed apparent bottlenecks in the presented design which suggest a high scale implementation will be problematic.

    Anyway, it is my belief we are attacking the same set of problems but are starting at it on opposing sides of a continuum and, ultimately, we shall meet up somewhere in the middle.

  43. A very bright paper. Congratulations guys on the noteworthy contributions.

    A few thoughts...

    I find it somewhat cheeky that the paper and blog post compare your solution to NoSQL databases (and take potshots at such), but the premise of the paper is almost entirely centered around the comparison of your solution to classical SQL-based RDBMS. One of the reasons that NoSQL systems like BigTable/HBase scale so well is that the clients interact directly with the data storage nodes, eliminating the need for a central coordinator for almost all operations. In your implementation, it doesn't seem like this is the case, as the query coordinator must necessarily see every transaction and broker that transaction to the data nodes.

    I'd be interested in seeing an apples-to-apples TPC comparison between your implementation and an equivalent NoSQL database on equivalent hardware (keeping in mind that such a comparison is difficult to produce in a way that is fair to both platforms - maybe we need a TPC-NoSQL standard?). I'd also like to see how your solution scales in the hundred/thousand-node range, as I have similar concerns to some of the other posters about the scalability of the coordinator, clustered or otherwise.

    Otherwise, very interesting work. Keep it up :-)

  44. I'm really happy to see people re-thinking how and where ACID can be improved/extended. NoSQL has its place, but in many ways is throwing the baby out with the bath water.

    We're doing some work at FusionIO on how storage can support/enable more than the crufty ACID assumptions of yesteryear. We've got a few neat ideas in the research pipeline, but are actively looking for academic collaborators to help evaluate the things we're working on and, moving forward, to work with us to establish new things we could/should be doing.

    If this falls into your realm of interest, feel free to pop me an email at dnellans_AT_fusionio.com. This applies to any readers working in academia or a research lab - a quick 30-minute chat to get up to speed on whether there is anything we/you are doing that might have interesting overlap.

  45. runt1me,
    Banking is an example of where you need full serializability.

    Chip,
    Again, when we say NoSQL means NoACID, we don't mean to equate the ACID properties with the relational model. In fact, transactionality and data modeling are essentially orthogonal problems. We are observing that the set of systems that relax ACID to achieve scalability is almost identical in membership to the set of systems that people currently categorize as NoSQL.

    Matt,
    Our approach is strongest when all data lives in memory and transactions are pretty short. I believe, however, that most applications that require full transactionality deal with only a small (GBs to TBs) subset of the PB datasets that companies are keeping. At least, it's hard to think of other examples. But flash/solid state technologies get faster, cheaper, and more energy efficient every day, so using determinism to achieve ACID on scalable, disk-size databases may soon become viable.

    Chris,
    Thanks for the compliment and encouragement! As for comparisons between NoSQL and our prototype: TPC-C pretty explicitly requires full ACID, so you'd have to relax the benchmark's requirements here. But if you were to ignore the full specification and add a NoSQL (by which I mean NoACID) line to the graphs in figure 7 in our paper's appendix, here's what you would see: with no multipartition transactions, throughput would be similar to the other systems, contention would be 0% (or, more precisely, Not Applicable), and latency would be negligible. Then, as you moved right, increasing multipartition transactions, throughput, contention and latency would stay more or less where they are. (I actually observed this in our prototype once during debugging when I turned off all inter-partition network communication, committing transactions locally immediately upon completing them---and therefore weakening ACID in exactly the same way that NoSQL systems do. Actually, I think that's about as apples-to-apples as it gets.)

  46. Sorry, but eliminating SQL is a huge advantage to these new database engines.

    The whole process of embedding a functional language throughout your code, dealing with the wrapping and unwrapping of layer upon layer of buffers and exception handling, and having to code yet another abstraction layer to deal with each vendor's extensions to SQL is no longer acceptable.

    On top of that you have the added pain of dealing with schema. Some programmer's preconceived notion about the structure, relationships, and usage of data should not be hard-wired into how the data is persisted. Schema is nothing more than opinion, usually outdated, and frequently requires views and application logic to re-establish a proper context for how it is currently being used. ORMs should be proof enough that the whole premise is fatally flawed.

    I've spent over two decades working with Oracle (and other RDBMSs along the way), and each year has just brought more bloat as people try in vain to abstract away all the warts.

    I've spent the past 6-8 months working with MongoDB, and could not be happier. Our code is significantly smaller, cleaner, faster, and more stable. The hardest challenge has been deprogramming my brain from thinking in SQL :-)

  47. MySQL (in a master-slave setup) guarantees the same order of execution on the replicas due to binlog replay by a single thread on the slaves.

  48. tog,
    Alonzo Church just rolled over in his grave at SQL being called a functional language. That aside, let me reiterate our view that transactionality and data model are entirely different problems. We are proposing a methodology for building scalable ACID systems, regardless of whether they are RDBMSs or something less structured.

    Ravi,
    The order of execution at the primary is allowed to be affected by nondeterministic factors and is thus unknowable in advance. Therefore (a) if multi-partition transactions were supported, they would require a distributed commit protocol, and (b) replication cannot occur until after the primary has finished a transaction.

  49. Does anyone know VoltDB (http://voltdb.com/)?
    I think they are doing something similar...

  50. PureScale - shared disk, global cache, fast inter-node communication via InfiniBand, great scalability. It makes sense to me. https://www.ibm.com/developerworks/mydeveloperworks/blogs/pureScaleOnLinux/?lang=en

  51. Ionatan,

    I was involved in the H-Store research that was behind the founding of VoltDB, so I'm quite familiar with Volt. Volt is similar in several ways, especially with respect to guaranteeing full ACID. There are several differences, however. The two main ones are: first, VoltDB is focused on single-threaded, lockless execution for super high throughput per node, while our model is more focused on traditional threaded data structures and locking-based concurrency control. Second, VoltDB is (for now) focused on workloads that are mostly partitionable (most transactions only touch data on a single node), whereas our solution is focused on the general case where multi-partition transactions are frequent enough that you have to worry about them.

    I think the VoltDB team includes some top-notch developers, and I'm excited to see more people using VoltDB over time. Alex and I actually have some ideas about how to apply some of the deterministic ideas from our paper to single-threaded systems like Volt (including a new CC scheme that we are temporarily calling integer locking).

    CiaranDB: There are fundamental scalability limitations to any shared-disk system. And PureScale doesn't fit into the nice modern paradigm of scale-out on cheap commodity hardware. But it is definitely worth a mention in this discussion.

    Replies
    1. I am curious about the fundamental scalability limitation you mentioned for shared-disk systems. Can you elaborate on the same or point me to some articles regarding this?

      Thanks,

  52. Many commenters here have stated that the schemaless nature of, for example, MongoDB is a joy (well, at least for nerdy people).

    But why can't an SQL database be developed that is schemaless? I don't really get why SQL automatically means that you have to have a rigid schema. Can't Oracle make it possible to have tables with flexible columns?

    I know this is not really related to distributed transactions, ACID and scalability.

    RC

  53. This approach is called "state machine replication" and has been studied in the distributed systems community for, oh, probably 25 years.

    It has long been suggested to be appropriate for transaction systems provided that they are deterministic (i.e., implement deterministic state machines), and all transactions are known in advance.

  54. @Idit,

    Studied... but has anyone built something working? A fully working DB engine or some kind of prototype?

    Let us know.

  55. Idit,

    Please see the related work section of our paper. We actually explicitly contrast our approach to the state machine replication approach. The key difference is that our optimistic lock prediction approach allows much more concurrency than earlier approaches.

  56. Your method must certainly be faster than other DBMSs under certain circumstances, at least.

    However, your system is not scalable in the sense that distributed key-value stores are. Scalable systems cannot have any central location that all transactions (or anything else) must go through. They must be really distributed. So it is not clear to me why you talk about the NoSQL stores.

    Your comparison should be with ordinary DBMSs.

    In your paper, you contrast with master-slave post-write replication. But your database is exactly that. The master is the preprocessor that does all the work, basically, and the "databases" behind it are just storing and manipulating the transactions. The database is a pure function now, and of course one can optimise the computation of functions. And yes, function computation can be faster on a multicore processor than on a single-threaded machine.

    The concurrency is done by the preprocessor, however. I think it is pretty clear that you will have to work a lot on the preprocessor, and maybe one day you will even start using the data values in the preprocessor, in which case the word preprocessor will be just that, a word. So I don't think your setup is conceptually different from a master with post-write replication. Maybe your program can beat other programs in raw speed, but it does not have other scalability behaviours. The optimal solution must be a highly complicated function of all previous transactions.

    All master/slave/post-write databases are as deterministic as yours. You have non-determinism in the preprocessor, but that is a semantic difference.

    Cheers.

    PS. I might have a similar post, but it seems like it was lost.

  57. You may be thinking of David Reed's 1978 thesis "Naming and Synchronization in a Decentralized Computer System" when you say

    "In the case of MVCC, when two transactions access a data item out of timestamp order (and when the second one to actually access the item is trying to write to it), the transaction with the earlier timestamp is aborted. And if I remember correctly, it's never safe in MVCC to commit a transaction that depends on data which has been updated more recently than the timestamp of the oldest active transaction."

    Although that thesis presented an early example of using snapshot copies to provide consistent read access, its solution to write/write conflicts was unpredictable and overly restrictive. Modern implementations of MVCC, including PostgreSQL, InterBase, and Firebird, use the order of changes to determine the success of an update or delete. When concurrent transactions attempt to change a record, the second stalls until the first commits or rolls back. If the first commits, the second gets an error message. If the first transaction rolls back, the second succeeds.
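
    (A toy sketch of that first-updater/first-committer rule; real engines track row versions and transaction state far more carefully.)

    ```python
    def try_update(row, me, txn_status):
        """txn_status maps a transaction id to 'active', 'committed', or 'rolled back'."""
        writer = row.get("updated_by")
        if writer and writer != me:
            status = txn_status[writer]
            if status == "active":
                return "stall until the concurrent writer finishes"
            if status == "committed":
                return "error: update conflict"
            # the writer rolled back: fall through and take over the row
        row["updated_by"] = me
        return "update proceeds"

    row = {"id": 1, "updated_by": "T1"}
    print(try_update(row, "T2", {"T1": "committed"}))    # 'error: update conflict'
    ```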

    There are no conflicts between readers and writers in either direction, and all update conflicts are resolved at statement time.

  58. Let's start with some major heresies.

    First, MVCC can detect and manage write/write conflicts using record versions such that no commit-time checking is necessary. This was implemented in Rdb/ELN in 1983 and has since been used in InterBase, Firebird, MySQL Falcon, and NimbusDB.

    Second, database consistency does not require a global ordering of transactions. Serializability is a necessary condition for two-phase locking, but is not necessary with MVCC.

    Third, with replication, any piece of data can reside anywhere in a network, so there is no reason that any given transaction can't execute on any node. Consequently, a two-phase commit protocol -- and the resulting overhead -- is not necessary.

    Fourth, with care, replication messaging can be implemented so that a node that has received a commit message has necessarily also received any and all updates for that transaction for data residing on that node.

    Combined, these heresies enable a highly scalable, ACID, relational database management system.

  59. Jim, do you know of any working systems implemented using your approach? You are making a lot of ambitious claims with not a lot of supporting references. In particular I disagree that MVCC automatically resolves all replica consistency issues; your "with care" qualifier in your fourth point requires much, much more rigor before I can believe it.

  60. Yes, NimbusDB is a working example.

    The fundamental idea is that the last message received by node A from a transaction X executing on node B is a commit. Consequently a transaction Y on node A starting after the commit message from transaction X will necessarily see all versions created by transaction X. If all nodes had all data, this would be sufficient. A system where each node may have partial data must recognize and handle the case that the node from which it requests and receives data may not have received the commit from transaction X, which does make things more exciting, albeit doable.

    For a little more detail, see http://www.gbcacm.org/seminars/evening/2010/special-relativity-and-problem-database-scalability.html

    Or ask.

  61. Jim, I'm not sure I understand your argument. Here is an attempt to clarify via an example. Consider nodes R, Q, P and transactions A[R, Q], B[Q, P], C[P, R], where the read/write sets are given between []. There are 6 valid serial execution orders and 2 cyclical execution orders. For instance, R{A < C}, Q{B < A} and P{C < B}, where the execution order at a given node is given between {}. In this order, each of the transactions has a locally valid commit set; for instance, C sees all the changes done by A (but cares only about the changes done at node R).

    Is this the kind of non-serializable execution order you have in mind?

  62. I think you're making it more complicated than necessary. So a quick re-cap:

    All transactions execute on a single node. Each transaction sees all transactions that were reported committed on its node at the time it started. This is really no different from MVCC on a single node.

    Record updates, however, are resolved at verb time by a node designated as chairman for a subset of data, so a transaction attempting to update a record modified by a concurrent transaction will either block pending completion of the conflicting transaction or throw an update conflict condition. This prevents concurrent transactions on different nodes from overwriting each other's updates. Unique and referential integrity constraints are enforced in the same way.

    The question of transaction order is never an issue. The only question was whether transaction A on node X was reported at node Y before or after a transaction B started at Y.

    This, of course, means that transactions are serializable, but serializability is not required for consistency unless consistency is managed by two-phase locks.

    Please note that it is not possible to have a cycle.

  63. Ladies and gentlemen: This has been -- and I hope will continue to be -- a very interesting discussion, but this forum is a loser.

    Can somebody suggest a better place to continue the discussion? Say, one that will give email notifications for posts?

  64. I think I understood your point about introducing a deterministic order on transactions, even if I have difficulty catching the global picture when partitioning is used.

    This being said, your approach gave me an idea.
    According to [1], the four common sources of overhead in database management systems are:
    (a) logging (19%),
    (b) latching (19%),
    (c) locking (17%),
    (d) B-tree and buffer management operations (35%)

    Current architectures include app servers and RDBMS instances directly connected.

    Why not introduce a box between the app servers and the RDBMS to take charge of:
    - the processing of the deterministic order on transactions,
    - processing (a),
    - whatever other processing from the list above that could be put into that middle box.

    Such an architecture is all about splitting RDBMS processing into different boxes.

    So, if such an architecture is possible, then it would be possible to replace the RDBMS instances (behind that middle box) with "lightweight" RDBMSs, that is, without processing (a) and the other processing steps that could be put into the middle box.

    Having such a middle box could bring various advantages:
    - introducing a deterministic order on transactions without modifying applications,
    - using "lightweight" RDBMS as back-end, and then, faster RDBMS as back-end.

    What do you think about this naive (I hope not too naive) idea?

    Thanks,
    Dominique

    [1] http://highscalability.com/blog/2010/6/28/voltdb-decapitates-six-sql-urban-myths-and-delivers-internet.html

  65. Doing multiple tasks at the same point in time in more than one location is the goal. The issue would seem to boil down to one of linearity vs simultaneity. I am reminded of the early days of personal computing when data was written to cassette tape. Time, like the cassette tape, is linear and thus the issue. Time itself becomes the bottleneck. How can we eliminate time as a factor in achieving consistent database writes and reads and making systems highly available over multiple partitions? When two writes are made at precisely the same time in different places, which one came first?

  67. Really, it is a great approach to the problem of providing the database ACID properties.
