Comments on DBMS Musings: Overview of the Oracle NoSQL Database

For me it feels like lacking durability. It sounds...

2014-08-29T09:48:06.888-07:00

To me it feels like it lacks durability. It sounds eventually consistent. Yes, clearing everything is eventually consistent by this definition, but it would also mean zero durability. What is offered here seems 'almost' durable.

wow goood

2014-08-18T03:08:14.148-07:00

wow goood

Technically nothing is written from scratch. In th...

2013-03-22T19:15:23.841-07:00

Technically nothing is written from scratch. In the end we're all using 1's and 0's, are we not? :)

While I agree with your assessment of the interpre...

2012-02-10T09:49:22.321-08:00

While I agree with your assessment of the interpretation of eventual consistency, here's my observation of its impact.

These systems are built to run on commodity storage (no SAN), so you will always achieve durability with a W = majority model.

Now if there is a partition, the minority will not allow updates (loss of A for minority partition) but it will be eventually consistent with majority when the partition heals. So it does give you the practical benefits of EC at the loss of partial availability.

So using a strict definition of EC, while correct, is not practically useful.

Do you really think BerkeleyDB team built this fro...

2011-11-21T04:33:34.555-08:00

Do you really think BerkeleyDB team built this from scratch? A lot of this sounds like Coherence to me... I'd bet they're reusing a lot of that in the background...

Thank you guys for this interesting information !...

2011-11-05T12:25:05.140-07:00

Thank you guys for this interesting information!

re GC stalls: agree with @eribeiro, if you write a...

2011-10-24T22:36:23.134-07:00

re GC stalls: agree with @eribeiro, if you write a db in java you have to allocate big chunks of memory and manage them yourself: you can never use GC to manage lots of small objects (e.g. rows) across a massive heap. that will never scale. that isn't java's fault, you couldn't write a db in C like that either. the fix described by Cloudera is just db 101.

@Mark Callaghan and @Daniel Abadi: GC is a seriou...

2011-10-13T10:53:54.913-07:00

@Mark Callaghan and @Daniel Abadi:

GC is a serious issue and there are two strategies nowadays, afaik:

a) pre-allocate a large 'slab' of memory so that it reduces external fragmentation (but increases internal fragmentation and doesn't promote optimal space allocation). Cassandra and HBase are using this strategy.

b) allocate the data outside the Java heap using ByteBuffer.allocateDirect(). This brings its own share of problems, such as the need to copy data between the native heap and the Java heap, etc.
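The two strategies above can be combined, and a rough sketch may make the idea concrete. The class below is purely illustrative (the name `OffHeapSlab` and its methods are made up for this example and are not taken from Cassandra, HBase, or Oracle NoSQL Database): it reserves one large off-heap buffer up front and packs rows into it, so the garbage collector sees a single long-lived object instead of many small ones.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration: one large off-heap slab that rows are packed into.
final class OffHeapSlab {
    private final ByteBuffer slab;

    OffHeapSlab(int capacityBytes) {
        // allocateDirect reserves memory outside the Java heap;
        // ByteBuffer.allocate would create a heap-backed buffer instead.
        this.slab = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Copies a row into the slab and returns its offset, or -1 if the slab is full.
    int put(byte[] row) {
        if (slab.remaining() < row.length) {
            return -1;
        }
        int offset = slab.position();
        slab.put(row);
        return offset;
    }

    // Copies `length` bytes starting at `offset` back onto the Java heap.
    byte[] get(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = slab.duplicate(); // independent position, same memory
        view.position(offset);
        view.get(out);
        return out;
    }

    boolean isOffHeap() {
        return slab.isDirect();
    }
}
```

Note that every `get` copies bytes back to the heap, which is the mapping cost mentioned in (b). A real store would also need to reclaim or recycle slabs, which this sketch leaves out.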

@Srirang: If the failed primary comes back online ...

2011-10-10T12:11:10.740-07:00

@Srirang: If the failed primary comes back online it will reconnect to the replica set and any changes that did not make it to the secondaries will be stored in its local rollback directory for later analysis.

@Pablo: If the primary member of a replica set fails, the remaining members of the set will try to elect a new master. There are a number of factors that go into choosing the new master ( delta from the old master, priority, hidden flag ) so it is possible that the new primary will not be the one that has the latest data. If that happens, the secondaries will log the differences to their local rollback directories as they re-orient to the new primary.

You can check out the following for more details:

http://www.mongodb.org/display/DOCS/Replica+Set+Internals
http://www.mongodb.org/display/DOCS/Replica+Sets+-+Rollbacks

@Srirang Reading "MongoDB: The Definitive g...

2011-10-08T14:09:49.814-07:00

@Srirang

Reading "MongoDB: The Definitive Guide", it says:

"...Whenever the primary changes, the data on the new primary is assumed to be the most
up-to-date data in the system. Any operations that have been applied on any other
nodes (i.e., the former primary node) will be rolled back, even if the former primary
comes back online. To accomplish this rollback, all nodes go through a resync process
when connecting to a new primary. They look through their oplog for operations that
have not been applied on the primary and query the new primary to get an up-to-date
copy of any documents affected by such operations."

So based on that, it looks like they are discarding the non-replicated objects/documents.
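To make the quoted resync step easier to follow, here is a minimal sketch of the comparison it describes: find the operations in a node's own oplog that the new primary never applied, and collect the documents they touched so fresh copies can be fetched. The `Op` type and the method name are assumptions made for illustration only; this is not MongoDB's actual implementation or API.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RollbackSketch {
    // Hypothetical oplog entry: an operation id and the document it touched.
    static final class Op {
        final long id;
        final String docId;

        Op(long id, String docId) {
            this.id = id;
            this.docId = docId;
        }
    }

    // Returns the ids of documents touched by local operations that are
    // missing from the new primary's oplog. These documents would be
    // re-fetched from the new primary, discarding the local changes.
    static Set<String> documentsToRefetch(List<Op> localOplog, List<Op> primaryOplog) {
        Set<Long> appliedOnPrimary = new HashSet<>();
        for (Op op : primaryOplog) {
            appliedOnPrimary.add(op.id);
        }
        Set<String> affected = new LinkedHashSet<>();
        for (Op op : localOplog) {
            if (!appliedOnPrimary.contains(op.id)) {
                affected.add(op.docId);
            }
        }
        return affected;
    }
}
```

Under these assumptions, a former primary whose last few operations never reached the new primary would end up overwriting exactly the documents those operations touched.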

First of all Daniel, thank you for a really nice w...

2011-10-07T21:15:58.349-07:00

First of all Daniel, thank you for a really nice write up and thank you Margo for an equally helpful comment.

Now, doesn't MongoDB have the same semantics for consistency, i.e., discarding the writes from the failed master node if they were not replicated to the rest of the replica set?

Especially after Margo's explanation of the write majority feature, the Oracle NoSQL Database is strikingly similar to MongoDB in CAP theorem parlance, isn't it? (Although many other things are quite different.)

Yes, it's BerkeleyDB.

2011-10-07T14:03:45.231-07:00

Yes, it's BerkeleyDB.

Can anyone comment on the local store? Is it Berke...

2011-10-07T11:39:44.934-07:00

Can anyone comment on the local store? Is it BerkeleyDB?
thanks, Andy (Acunu)

It's interesting that the update mechanism tha...

2011-10-07T01:04:29.261-07:00

It's interesting that the update mechanism that leads to better consistency does benefit from low-latency interconnects, which could explain the presence of InfiniBand in the racks. Hadoop traditionally trades off premium parts for low-cost storage and compute, yet the hardware offerings are premium. The only other people who have looked at Hadoop over IB are Mellanox and teams they have donated hardware to.

Technically, it may be a good design for small-scale (not Facebook-class) systems. Yet the hardware will only come from Oracle, the base OS from Oracle, the DB from Oracle: you pay their prices and their support costs, and get updates at their chosen rate. I'd argue against the technology on those grounds: cost, and the fact that Oracle ends up owning all your data, this time on hardware they can set the price for.

Hi Daniel, Sherpa also supports eventual consiste...

2011-10-06T22:32:22.190-07:00

Hi Daniel,

Sherpa also supports eventual consistency (and has for a while). I was not sure if your article hinted otherwise, but just in case.

-ppsn
Yahoo! Inc

That's a good point --- I assume writes stall ...

2011-10-06T21:05:24.060-07:00

That's a good point --- I assume writes stall if the master node for that write is undergoing GC. I wonder how big of an issue this is ....

With so many storage servers written in Java (this...

2011-10-06T20:26:43.119-07:00

With so many storage servers written in Java (this, Cassandra, HBase, HDFS) eventually we are going to need a JVM that doesn't have serious problems from GC stalls. See http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/ for an example.

Intermittent GC stalls are another form of downtime.

@CharlesLamb: In particular, if a minority partiti...

2011-10-06T13:49:17.689-07:00

@CharlesLamb: In particular, if a minority partition gets partitioned off from the majority partition, the minority partition is unavailable for updates, while the majority partition is fully consistent. Hence, this is CP from CAP.

If you were eventually consistent, the minority partition could stay available for updates and would reconcile with the rest of the database when the partition is repaired.

So you are not eventually consistent. But that's nothing to be embarrassed about.
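The distinction drawn above can be written down as two small predicates. This is only a sketch of the argument in this comment, with made-up names, not the behavior of any particular product: in the CP-style configuration a side of the partition accepts updates only if it can reach a strict majority of replicas, while in the eventually consistent style any side with a live replica keeps accepting updates and reconciles later.

```java
final class PartitionSketch {
    // True if a partition side that can reach `reachable` of `total`
    // replicas holds a strict majority.
    static boolean hasMajority(int reachable, int total) {
        return reachable > total / 2;
    }

    // CP style: only the majority side stays available for updates.
    static boolean acceptsUpdatesCp(int reachable, int total) {
        return hasMajority(reachable, total);
    }

    // Eventually consistent style, as described in the comment: any side
    // with at least one live replica accepts updates and reconciles after
    // the partition is repaired.
    static boolean acceptsUpdatesEventuallyConsistent(int reachable) {
        return reachable >= 1;
    }
}
```

For five replicas split 3/2, the CP predicate accepts updates on the side of three and rejects them on the side of two, whereas the eventually consistent predicate accepts them on both sides. With an even 2/2 split of four replicas, neither side has a strict majority under the CP predicate.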

@PabloM: Yes your second comment is what I was tr...

2011-10-06T13:41:52.350-07:00

@PabloM: Yes your second comment is what I was trying to say.

@CharlesLamb: First, congrats on building a nice system. Please do not take my comments that the Oracle NoSQL Database is not an eventually consistent system as a criticism -- like I said in my post --- you don't have to be eventually consistent to be NoSQL. That said, what you described in your comment is not eventual consistency, it's full consistency! The majority ACK configuration is a classic CP system from CAP (see: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf).

Ok, reading the Vogels definition again he says "...

2011-10-06T13:30:39.389-07:00

Ok, reading the Vogels definition again, he says "if no new updates are made to the object". That means that you won't have conflicts during the reconciliation process. I think that I got your point, Daniel, about the Oracle NoSQL DB. They are discarding the non-replicated updates even if no new updates were made to the non-replicated object.

BTW, I should have added that we do not recommend ...

2011-10-06T13:30:13.171-07:00

BTW, I should have added that we do not recommend using an ack policy weaker than simple_majority.

Hi Daniel, If you configure the Oracle NoSQL Data...

2011-10-06T13:29:10.180-07:00

Hi Daniel,

If you configure the Oracle NoSQL Database system to use simple_majority (or stronger), then you will reach eventual consistency by the Vogels definition above. The transaction is only lost if you use an ack policy weaker than simple_majority. If a transaction is committed using simple_majority (or stronger), then that means that it has reached a majority of the nodes. At that point, even if the master fails and a new master is elected, the transaction data will be present on the new master.
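The reasoning here rests on the fact that any two majorities of the same group share at least one node, so a majority that elects a new master always includes a node holding the committed transaction. The brute-force check below illustrates that property for small groups. It is a sketch with hypothetical names, and it assumes (this is not stated in the comment) that the election then picks the most up-to-date node among the voters.

```java
final class QuorumOverlap {
    // Smallest group size that is a strict majority of n nodes.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Checks by brute force that any two subsets of n nodes, each with at
    // least q members, have a node in common. Subsets are bitmasks, so
    // this is only meant for small n.
    static boolean quorumsAlwaysIntersect(int n, int q) {
        for (int a = 0; a < (1 << n); a++) {
            if (Integer.bitCount(a) < q) {
                continue;
            }
            for (int b = 0; b < (1 << n); b++) {
                if (Integer.bitCount(b) < q) {
                    continue;
                }
                if ((a & b) == 0) {
                    return false; // two disjoint quorums exist
                }
            }
        }
        return true;
    }
}
```

With five nodes, `quorumsAlwaysIntersect(5, majority(5))` holds, while `quorumsAlwaysIntersect(5, 2)` does not: two nodes can acknowledge a write while a disjoint group carries on without it, which is the case where the transaction can be lost.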

Daniel, When you say that "..In an eventuall...

2011-10-06T13:23:32.821-07:00

Daniel,

When you say that "...In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired", what do you mean by "reconciled"? Which update should win? The older? The newer? A merge between them?

Hi Margo, Thanks so much for adding more details...

2011-10-06T07:51:45.881-07:00

Hi Margo,

Thanks so much for adding more details in your comment. I totally agree with you that we're working with competing definitions of eventual consistency. I'm using the Werner Vogels definition, which I believe is the prevalent definition (i.e., it is reflected in the Wikipedia definition). The Vogels definition can be found here, and is reproduced below:

"Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value."

Your and Sam's definition seems to be a straight read on the words "eventual consistency" where dropping updates is a legitimate way to eventually make all replicas agree with each other. Without academic rigor behind terms like "eventual consistency", it is only natural that it will come to mean different things to different people.

However, I believe your definition is quite dangerous. (You can make any database eventually consistent by deleting everything!).

Regarding your point 4, I should have made this point a little clearer in my text. But whether synchronous replication means 'majority' or 'all', this does not have an effect on any of the categories I place it in above. The Oracle NoSQL system still does not support eventual consistency according to the Vogels definition, and the general design trades off durability for latency and availability.

First, Daniel -- thank you for a nicely written th...

2011-10-06T07:00:51.320-07:00

First, Daniel -- thank you for a nicely written thoughtful piece about Oracle's NoSQL Database. I largely agree with most of what you wrote, but let me clarify a few points:

1. Yes, I wrote the white paper -- this was revealed (although not intentionally a secret) in my colleague Charles Lamb's blog: http://blogs.oracle.com/charlesLamb/entry/oracle_nosql_database1

2. Yes, Oracle's NoSQL Database is written in Java.

3. I think we're quibbling over competing definitions of eventual consistency. Oracle's NoSQL Database is eventually consistent, because in fact, all sites will converge to the same state. A master that accepts a write without contacting any of the replicas and then crashes, will sync up with the newly elected master when it returns. Thus, the write will have been lost, but the state is still eventually consistent. To use Bayou terminology, if you configure Oracle's NoSQL database to accept writes, without contacting any replicas (a write-acknowledgement policy of NONE), then you could consider all such writes as tentative, until they have been propagated. Alternately, you could consider our rollback of the disconnected master's update as a deterministic merge procedure.

Here is another way to look at it, from my colleague Sam Haradvala:

The eventually consistent state that the Oracle NoSQL Database ends up in will differ depending on whether the write request required a simple majority or a higher level of acknowledgments. A key difference between a system like Dynamo and Oracle NoSQL Database is that they will end up in different eventually consistent states when the synchronous writers are fewer than a simple majority, particularly in the case of W=1, and the writer node fails before it has a chance to propagate its changes. In Oracle NoSQL Database the write is not durable and is rolled back even if the failed node eventually recovers. In Dynamo-like systems the write at the failed node may contribute the change to the eventually consistent state.

4. Finally, I'd like to add one other point that Sam also brought up. You made no mention of the simple majority write acknowledgement policy. I think that you're using the terms synchronous and asynchronous as though the only two options available are to wait for all replicas to acknowledge the write (ALL) or to wait for none of the replicas to acknowledge the write (NONE). In fact, the default configuration for Oracle NoSQL Database is to wait for a simple majority of the replicas to acknowledge the write. Users are free to trade off durability (by requiring fewer acks) for lower latency, higher throughput and write availability.
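The three acknowledgement policies mentioned in points 3 and 4 can be summarized by how many replica acks the master waits for before a write is considered committed. The enum below is a hypothetical sketch rather than the product's API: the names mirror the policies named above, and it assumes the master itself counts as one member of the group when a simple majority is computed.

```java
// Hypothetical sketch of the write-acknowledgement policies named above.
enum AckPolicy {
    NONE, SIMPLE_MAJORITY, ALL;

    // Number of replica acks the master waits for in a group of
    // `groupSize` nodes. Assumption: the master is one of those nodes
    // and counts toward the majority.
    int requiredReplicaAcks(int groupSize) {
        switch (this) {
            case NONE:
                return 0;
            case SIMPLE_MAJORITY:
                // master + groupSize / 2 replicas is a strict majority
                return groupSize / 2;
            case ALL:
                return groupSize - 1;
            default:
                throw new AssertionError(this);
        }
    }
}
```

Under that assumption, a group of three nodes waits for 0, 1, or 2 replica acks under NONE, SIMPLE_MAJORITY, and ALL respectively, which is the durability versus latency trade-off described in point 4.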