Comments on DBMS Musings: "The problems with ACID, and how to fix them without going NoSQL," by Daniel Abadi

Really, it is a great approach to solving the database ACID property problem.
-- Anonymous, 2013-08-18

I am curious about the fundamental scalability limitation you mentioned for shared-disk systems. Can you elaborate, or point me to some articles about it?

Thanks,
-- Saju, 2013-05-28

This comment has been removed by the author.
-- Phollower, 2012-10-16

Doing multiple tasks at the same point in time in more than one location is the goal. The issue would seem to boil down to one of linearity vs. simultaneity. I am reminded of the early days of personal computing, when data was written to cassette tape. Time, like the cassette tape, is linear, and that is the issue. Time itself becomes the bottleneck.
How can we eliminate time as a factor in achieving consistent database writes and reads while making systems highly available over multiple partitions? When two writes are made at precisely the same time in different places, which one came first?
-- Anonymous, 2011-01-08

I think I understood your point about introducing a deterministic order on transactions, even if I have difficulty grasping the global picture when partitioning is used.

This being said, your approach gave me an idea. According to [1], the four common sources of overhead in database management systems are:
(a) logging (19%),
(b) latching (19%),
(c) locking (17%),
(d) B-tree and buffer management operations (35%).

Current architectures have app servers and RDBMS instances directly connected.

Why not introduce a box between the app servers and the RDBMS to take charge of:
- computing the deterministic order on transactions,
- the processing for (a),
- whatever other processing from the list above could be moved into that middle box.

Such an architecture is all about splitting RDBMS processing across different boxes.

So, if such an architecture is possible, it would then be possible to replace the RDBMS instances (behind that middle box) with "lightweight" RDBMSs, that is, ones without processing (a) and the other steps that could be moved into the middle box.

Having such a middle box could bring various advantages:
- introducing a deterministic order on transactions without modifying applications,
- using "lightweight," and therefore faster, RDBMSs as the back end.

What do you think about this naive (I hope not too naive) idea?
Thanks,
Dominique

[1] http://highscalability.com/blog/2010/6/28/voltdb-decapitates-six-sql-urban-myths-and-delivers-internet.html
-- Dominique De Vito, 2010-10-21

Ladies and gentlemen: This has been -- and I hope will continue to be -- a very interesting discussion, but this forum is a loser.

Can somebody suggest a better place to continue the discussion? Say, one that will give email notifications for posts?
-- Jim, 2010-09-25

I think you're making it more complicated than necessary. So, a quick recap:

All transactions execute on a single node. Each transaction sees all transactions that were reported committed on its node at the time it started. This is really no different from MVCC on a single node.

Record updates, however, are resolved at verb time by a node designated as chairman for a subset of the data, so a transaction attempting to update a record modified by a concurrent transaction will either block pending completion of the conflicting transaction or throw an update-conflict condition. This prevents concurrent transactions on different nodes from overwriting each other's updates. Unique and referential integrity constraints are enforced in the same way.

The question of transaction order is never an issue.
The only question was whether transaction A on node X was reported at node Y before or after a transaction B started at Y.

This, of course, means that transactions are serializable, but serializability is not required for consistency unless consistency is managed by two-phase locks.

Please note that it is not possible to have a cycle.
-- Jim, 2010-09-24

Jim, I'm not sure I understand your argument. Here is an attempt to clarify via an example. Consider nodes R, Q, P and transactions A[R, Q], B[Q, P], C[P, R], where the read/write sets are given between []. There are 6 valid serializable execution orders and 2 cyclical execution orders. For instance, R{A < C}, Q{B < A}, and P{C < B}, where the execution order at a given node is given between {}. In this order, each of the transactions has a locally valid commit set; for instance, C sees all the changes done by A (but cares only about the changes done at node R).

Is this the kind of non-serializable execution order you have in mind?
-- Cristian Petrescu-Prahova, 2010-09-21
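[Editor's note: Cristian's example can be checked mechanically. Treat each node's local execution order as a set of precedence edges and look for a cycle; a cycle means no global serial order agrees with every node. A minimal sketch, my own illustration rather than code from any system discussed here:]

```python
# Check whether per-node execution orders compose into a global serial order.
# Transactions A[R, Q], B[Q, P], C[P, R] are the example from the comment above.
from collections import defaultdict

def has_cycle(local_orders):
    """local_orders: dict mapping node -> list of transactions in local
    execution order. Returns True if the induced precedence graph is cyclic,
    i.e. no global serial order is consistent with every node."""
    edges = defaultdict(set)
    for order in local_orders.values():
        for earlier, later in zip(order, order[1:]):
            edges[earlier].add(later)

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = defaultdict(int)

    def dfs(t):
        color[t] = GRAY
        for nxt in edges[t]:
            if color[nxt] == GRAY:        # back edge: found a cycle
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[t] = BLACK
        return False

    nodes = {t for order in local_orders.values() for t in order}
    return any(color[t] == WHITE and dfs(t) for t in nodes)

# A valid (acyclic) combination of local orders ...
print(has_cycle({"R": ["A", "C"], "Q": ["A", "B"], "P": ["B", "C"]}))  # False
# ... and the cyclic one from the example: A < C at R, B < A at Q, C < B at P
print(has_cycle({"R": ["A", "C"], "Q": ["B", "A"], "P": ["C", "B"]}))  # True
```

The second call exhibits the cycle A -> C -> B -> A, which is exactly the kind of execution no single serial history can explain.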
Yes, NimbusDB is a working example.

The fundamental idea is that the last message received by node A from a transaction X executing on node B is a commit. Consequently, a transaction Y on node A starting after the commit message from transaction X will necessarily see all versions created by transaction X. If all nodes had all data, this would be sufficient. A system where each node may have only partial data must recognize and handle the case where the node from which it requests and receives data has not yet received the commit from transaction X, which does make things more exciting, albeit doable.

For a little more detail, see http://www.gbcacm.org/seminars/evening/2010/special-relativity-and-problem-database-scalability.html

Or ask.
-- Jim, 2010-09-21

Jim, do you know of any working systems implemented using your approach? You are making a lot of ambitious claims with not a lot of supporting references. In particular, I disagree that MVCC automatically resolves all replica consistency issues; your "with care" qualifier in your fourth point requires much, much more rigor before I can believe it.
-- Rob Jellinghaus, 2010-09-09
Let's start with some major heresies.

First, MVCC can detect and manage write/write conflicts using record versions, such that no commit-time checking is necessary. This was implemented in Rdb/ELN in 1983 and has since been used in InterBase, Firebird, MySQL Falcon, and NimbusDB.

Second, database consistency does not require a global ordering of transactions. Serializability is a necessary condition for two-phase locking, but is not necessary with MVCC.

Third, with replication, any piece of data can reside anywhere in a network, so there is no reason that any given transaction can't execute on any node. Consequently, a two-phase commit protocol -- and the resulting overhead -- is not necessary.

Fourth, with care, replication messaging can be implemented so that a node that has received a commit message has necessarily also received any and all updates for that transaction for data residing on that node.

Combined, these heresies enable a highly scalable, ACID, relational database management system.
-- Jim, 2010-09-08
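[Editor's note: the first heresy -- resolving write/write conflicts from record versions at update time rather than at commit -- can be sketched roughly as below. This is a toy illustration of a first-updater-wins rule, not code from Rdb/ELN, Firebird, or NimbusDB; the `Txn` and `Record` classes and the blocking behavior are simplified assumptions.]

```python
ACTIVE, COMMITTED, ROLLED_BACK = "active", "committed", "rolled_back"

class UpdateConflict(Exception):
    """Raised when a record was changed by a concurrent transaction."""

_clock = 0
def _tick():
    global _clock
    _clock += 1
    return _clock

class Txn:
    def __init__(self):
        self.start_ts, self.state, self.commit_ts = _tick(), ACTIVE, None
    def commit(self):
        self.state, self.commit_ts = COMMITTED, _tick()
    def rollback(self):
        self.state = ROLLED_BACK

class Record:
    def __init__(self, value, txn):
        self.versions = [(value, txn)]        # newest version last
    def update(self, value, txn):
        _, writer = self.versions[-1]
        if writer is not txn:
            if writer.state == ACTIVE:
                # A real engine blocks here until `writer` commits/rolls back.
                raise UpdateConflict("concurrent writer still active: would block")
            if writer.state == ROLLED_BACK:
                # The rolled-back version vanishes (a fuller engine would
                # then re-check the next-newest version as well).
                self.versions.pop()
            elif writer.commit_ts > txn.start_ts:
                # Writer committed after we started: update-time conflict.
                raise UpdateConflict("record changed by concurrent transaction")
        self.versions.append((value, txn))

t1 = Txn()
rec = Record(10, t1)
t1.commit()

t2 = Txn()                  # starts after t1 committed, so it may update
rec.update(20, t2)
t2.commit()

t3, t4 = Txn(), Txn()       # concurrent with each other
rec.update(30, t3)
t3.commit()
try:
    rec.update(40, t4)      # t3 committed after t4 started -> conflict
except UpdateConflict as e:
    print("t4:", e)
print(rec.versions[-1][0])  # 30
```

The conflict is detected the moment t4 touches the record, with no commit-time validation pass -- the point Jim is making above.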
You may be thinking of David Reed's 1978 thesis, "Naming and Synchronization in a Decentralized Computer System," when you say:

"In the case of MVCC, when two transactions access a data item out of timestamp order (and when the second one to actually access the item is trying to write to it), the transaction with the earlier timestamp is aborted. And if I remember correctly, it's never safe in MVCC to commit a transaction that depends on data which has been updated more recently than the timestamp of the oldest active transaction."

Although that thesis presented an early example of using snapshot copies to provide consistent read access, its solution to write/write conflicts was unpredictable and overly restrictive. Modern implementations of MVCC, including PostgreSQL, InterBase, and Firebird, use the order of changes to determine the success of an update or delete. When concurrent transactions attempt to change a record, the second stalls until the first commits or rolls back. If the first commits, the second gets an error message. If the first transaction rolls back, the second succeeds.

There are no conflicts between readers and writers in either direction, and all update conflicts are resolved at statement time.
-- Unknown, 2010-09-07

Your method must certainly be faster than other DBMSs under certain circumstances, at least.

However, your system is not scalable in the sense that distributed key-value stores are. Scalable systems cannot have any central location through which all transactions, or anything else, must go. They must be really distributed. So it is not clear to me why you talk about the NoSQL stores.

Your comparison should be with ordinary DBMSs.

In your paper, you contrast with master/slave post-write replication. But your database is exactly that. The master is the preprocessor that does all the work, basically, and the "databases" behind it are just storing and manipulating the transactions. The database is a pure function now, and of course one can optimise the computation of functions. And yes, function computation can be faster on a multicore processor than on a single-threaded machine.

The concurrency is handled by the preprocessor, however. I think it is pretty clear that you will have to work a lot on the preprocessor, and maybe one day you will even start using the data values in the preprocessor, in which case the word "preprocessor" will be just that, a word. So I don't think your setup is conceptually different from a master with post-write replication. Maybe your program can beat other programs in raw speed, but it does not have other scalability behaviours. The optimal solution must be a highly complicated function of all previous transactions.

All master/slave post-write databases are as deterministic as yours. You have non-determinism in the preprocessor, but that is a semantic difference.

Cheers.

PS. I might have a similar post, but it seems like it was lost.
-- Unknown, 2010-09-05
Idit,

Please see the related work section of our paper. We actually explicitly contrast our approach with the state machine replication approach. The key difference is that our optimistic lock prediction approach allows much more concurrency than earlier approaches.
-- Daniel Abadi, 2010-09-04
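[Editor's note: the deterministic, state-machine-style replication under discussion can be illustrated with a toy model -- a sketch of the general idea, not the paper's prototype. If every replica applies the same transaction log in the same pre-agreed order, and each transaction is a deterministic function of database state, replicas converge with no distributed commit protocol.]

```python
def apply(state, txn):
    """A transaction here is just a deterministic function: state -> state."""
    return txn(dict(state))

def replay(log, initial):
    """Apply the pre-ordered log; every replica runs exactly this loop."""
    state = dict(initial)
    for txn in log:
        state = apply(state, txn)
    return state

def transfer(src, dst, amount):
    def txn(state):
        if state[src] >= amount:       # deterministic logic only --
            state[src] -= amount       # no wall-clock reads, no randomness
            state[dst] += amount
        return state
    return txn

log = [transfer("a", "b", 30), transfer("b", "c", 50), transfer("c", "a", 10)]
initial = {"a": 100, "b": 40, "c": 0}

replica1 = replay(log, initial)
replica2 = replay(log, initial)
print(replica1)                  # {'a': 80, 'b': 20, 'c': 40}
print(replica1 == replica2)      # True: same log, same order, same final state
```

The hard part, which the comments above debate, is agreeing on the order cheaply and keeping transactions deterministic while still allowing concurrency.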
@Idit,

Studied... but has anyone built something working? A fully working DB engine, or some kind of prototype?

Let us know.
-- Anonymous, 2010-09-04

This approach is called "state machine replication" and has been studied in the distributed systems community for, oh, probably 25 years.

It has long been suggested to be appropriate for transaction systems, provided that they are deterministic (i.e., implement deterministic state machines) and all transactions are known in advance.
-- Unknown, 2010-09-04

Many commenters here have stated that the schemaless nature of, for example, MongoDB is a joy (well, at least for nerdy people).

But why can't it be possible to develop an SQL database that is schemaless? I don't really get why SQL automatically means that you have to have a rigid schema. Can't Oracle make it possible to have tables with flexible columns?

I know this is not really related to distributed transactions, ACID, and scalability.

RC
-- Anonymous, 2010-09-04
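[Editor's note: one common workaround for RC's question -- my sketch using SQLite from Python, not a claim about any vendor's roadmap -- is to keep self-describing JSON documents in an ordinary SQL table, so rows can carry different fields without schema changes. Table and field names here are invented for illustration.]

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# Rows with different "columns" coexist in one table -- no ALTER TABLE needed.
rows = [
    {"name": "alice", "email": "alice@example.com"},
    {"name": "bob", "phone": "555-0100", "tags": ["admin"]},
]
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [(json.dumps(r),) for r in rows])

# Filter on a field that only some rows have (decoded client-side here).
admins = [json.loads(body) for (body,) in conn.execute("SELECT body FROM docs")
          if "admin" in json.loads(body).get("tags", [])]
print([r["name"] for r in admins])   # ['bob']
```

Several engines later grew native JSON support (e.g., PostgreSQL's jsonb and SQLite's JSON functions), which lets such filters run inside SQL itself while keeping SQL transactions and indexing.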
Ionatan,

I was involved in the H-Store research that was behind the founding of VoltDB, so I'm quite familiar with Volt. Volt is similar in several ways, especially with respect to guaranteeing full ACID. There are several differences, however. The two main ones are: first, VoltDB is focused on single-threaded execution and lockless execution for super high throughput per node, whereas our model is more focused on traditional threaded data structures and locking-based concurrency control. Second, VoltDB is (for now) focused on workloads that are mostly partitionable (most transactions touch data on only a single node), whereas our solution is focused on the general case where multi-partition transactions are frequent enough that you have to worry about them.

I think the VoltDB team includes some top-notch developers, and I'm excited to see more people using VoltDB over time. Alex and I actually have some ideas about how to apply some of the deterministic ideas from our paper to single-threaded systems like Volt (including a new CC scheme that we are temporarily calling integer locking).

CiaranDB: There are fundamental scalability limitations to any shared-disk system. And PureScale doesn't fit into the nice modern paradigm of scale-out on cheap commodity hardware. But it is definitely worth a mention in this discussion.
-- Daniel Abadi, 2010-09-03

PureScale: shared disk, global cache, fast inter-node communication via InfiniBand, great scalability. It makes sense to me.
https://www.ibm.com/developerworks/mydeveloperworks/blogs/pureScaleOnLinux/?lang=en
-- CiaranDB, 2010-09-03
Does anyone know VoltDB (http://voltdb.com/)? I think they are doing something similar...
-- Ionatan, 2010-09-03
tog,
Alonzo Church just rolled over in his grave at SQL being called a functional language. That aside, let me reiterate our view that transactionality and data model are entirely different problems. We are proposing a methodology for building scalable ACID systems, regardless of whether they are RDBMSs or something less structured.

Ravi,
The order of execution on the primary is allowed to be affected by nondeterministic factors and is thus unknowable in advance. Therefore (a) if multi-partition transactions were supported, they would require a distributed commit protocol, and (b) replication cannot occur until after the primary has finished a transaction.
-- Alexander Thomson, 2010-09-03

MySQL (in a master-slave setup) guarantees the same order of execution on the replicas, due to binlog replay by a single thread on the slaves.
-- Unknown, 2010-09-03

Sorry, but eliminating SQL is a huge advantage of these new database engines.

The whole process of embedding a functional language throughout your code, dealing with the wrapping and unwrapping of layer upon layer of buffers and exception handling, and having to code yet another abstraction layer to deal with each vendor's extensions to SQL is no longer acceptable.

On top of that, you have the added pain of dealing with schema. Some programmer's preconceived notion about the structure, relationships, and usage of data should not be hard-wired into how the data is persisted.
Schema is nothing more than opinion, usually outdated, and it frequently requires views and application logic to re-establish a proper context for how the data is currently being used. ORMs should be proof enough that the whole premise is fatally flawed.

I've spent over two decades working with Oracle (and other RDBMSs along the way), and each year has just brought more bloat as people try in vain to abstract away all the warts.

I've spent the past 6-8 months working with MongoDB and could not be happier. Our code is significantly smaller, cleaner, faster, and more stable. The hardest challenge has been deprogramming my brain from thinking in SQL :-)
-- Unknown, 2010-09-02
runt1me,
Banking is an example of where you need full serializability.

Chip,
Again, when we say NoSQL means NoACID, we don't mean to equate the ACID properties with the relational model. In fact, transactionality and data modeling are essentially orthogonal problems. We are observing that the set of systems that relax ACID to achieve scalability is almost identical in membership to the set of systems that people currently categorize as NoSQL.

Matt,
Our approach is strongest when all data lives in memory and transactions are pretty short. I believe, however, that most applications that require full transactionality deal with only a small (GBs to TBs) subset of the PB datasets that companies are keeping. At least, it's hard to think of other examples. But flash/solid-state technologies get faster, cheaper, and more energy efficient every day, so using determinism to achieve ACID on scalable, disk-based databases may soon become viable.

Chris,
Thanks for the compliment and encouragement! As for comparisons between NoSQL and our prototype: TPC-C pretty explicitly requires full ACID, so you'd have to relax the benchmark's requirements here. But if you were to ignore the full specification and add a NoSQL (by which I mean NoACID) line to the graphs in figure 7 in our paper's appendix, here's what you would see: with no multipartition transactions, throughput would be similar to the other systems, contention would be 0% (or, more precisely, not applicable), and latency would be negligible. Then, as you moved right, increasing multipartition transactions, throughput, contention, and latency would stay more or less where they are.
(I actually observed this in our prototype once during debugging, when I turned off all inter-partition network communication, committing transactions locally immediately upon completing them -- and therefore weakening ACID in exactly the same way that NoSQL systems do. Actually, I think that's about as apples-to-apples as it gets.)
-- Alexander Thomson, 2010-09-02

I'm really happy to see people rethinking how and where ACID can be improved/extended. NoSQL has its place, but in many ways it is throwing the baby out with the bathwater.

We're doing some work at FusionIO on how storage can support/enable more than the crufty ACID assumptions of yesteryear. We've got a few neat ideas in the research pipeline, but are actively looking for academic collaborators to help evaluate the things we're working on and, moving forward, to work with us on establishing new things we could/should be doing.

If this falls into your realm of interest, feel free to pop me an email: dnellans_AT_fusionio.com. This applies to any readers working in academia or research labs -- a quick 30-minute chat to get up to speed on whether there is anything we/you are doing that might have interesting overlap.
-- Anonymous, 2010-09-02