Oracle is the clear market leader in the commercial database community, and therefore it is critical for any member of the database community to pay close attention to the new product announcements coming out of Oracle's annual Open World conference. The sheer size of Oracle's sales force, entrenched customer base, and third-party ecosystem instantly gives any new Oracle product the potential for very high impact. In short, Oracle's new products demand attention simply because they come from Oracle.
I was particularly eager for this year's Oracle Open World conference because there were rumors of two separate new Oracle products involving Hadoop and NoSQL --- two of the central research focuses of my database group at Yale --- one of them (Hadoop) also being the focus of my recent startup (Hadapt). Oracle's Hadoop announcements, while very interesting from a business perspective (everyone is talking about how this "validates" Hadoop), are not so interesting from a technical perspective. They seem to revolve around (1) a "connector" between Hadoop and Oracle, where Hadoop is used for ETL tasks and the output of those tasks is then loaded over the connector into the Oracle DBMS, and (2) packaging the whole thing into an appliance. The appliance is again very important from a business perspective --- there is certainly a market for anything that makes Hadoop easier to use --- but it does not seem to introduce any technically interesting new contributions.
In contrast, the Oracle NoSQL database is actually a brand new system built by the Oracle BerkeleyDB team, and is therefore very interesting from a technical perspective. I therefore spent way too much time trying to find out as much as I could about this new system from a variety of sources. There is not yet a lot of publicly available information about the system; however there is a useful whitepaper written by the illustrious Harvard professor Margo Seltzer, who has been working with Oracle since they acquired her start-up in 2006 (the aforementioned BerkeleyDB).
Due to the dearth of available information on the system, I thought that it would be helpful to the readers of my blog if I provided an overview of what I've learned about it so far. Some of the statements below come directly from Oracle; others are inferences that I've made based on my understanding of the system architecture and implementation. As always, if I have made any mistakes in my inferences, please let me know, and I will fix them as soon as possible.
The coolest thing about the Oracle NoSQL database is that it is not a simple copy of a currently existing NoSQL system. It is not Dynamo or SimpleDB. It is not Bigtable or HBase. It is not Cassandra or Riak. It is not MongoDB or CouchDB. It is a new system that has chosen a different point (actually --- several different points) in the system-design tradeoff space than any of the above-mentioned systems. Since it makes a different set of tradeoffs, it is entirely inappropriate to call it "better" or "worse" than any of these systems. There will be situations where the Oracle solution will be more appropriate, and there will be situations where other systems will be more appropriate.
Overview of the system:
Oracle NoSQL database is a distributed, replicated key-value store. Given a cluster of machines (in a shared-nothing architecture, with each machine having its own storage, CPU, and memory), each key-value pair is placed on several of these machines depending on the result of a hash function on the key. In particular, the key-value pair will be placed on a single master node, and a configurable number of replica nodes. All write and update operations for a key-value pair go to the master node for that pair first, and then all replica nodes afterwards. This replication is typically done asynchronously, but it is possible to request that it be done synchronously if one is willing to tolerate the higher latency costs. Read operations can go to any node if the user doesn’t mind incomplete consistency guarantees (i.e. reads might not see the most recent data), but they must be served from the master node if the user requires the most recent value for a data item (unless replication is done synchronously). There is no SQL interface (it is a NoSQL system after all!) --- rather it supports simple insert, update, and delete operations of key-value pairs.
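The placement scheme described above can be sketched in a few lines. This is a hypothetical illustration, not Oracle's actual API: node names, the hash function, and the replica-selection rule are all my own assumptions.

```python
import hashlib

# Illustrative sketch (not Oracle's actual implementation): hash the key
# to pick a master node, then take the next R-1 nodes (mod N) as replicas.
NODES = ["node0", "node1", "node2", "node3", "node4"]
REPLICATION_FACTOR = 3  # one master + two replicas (assumed value)

def nodes_for_key(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    group = [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]
    return group[0], group[1:]  # (master node, replica nodes)

master, replicas = nodes_for_key("user:1234")
# All writes for "user:1234" go to `master` first; reads requiring the
# most recent value must also go to `master`, while relaxed reads may be
# served by any of `replicas`.
```

The same key always hashes to the same master, which is what makes routing deterministic without a central directory lookup per request.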
The following is where the Oracle NoSQL Database falls in various key dimensions:
Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP. In particular, if writes are configured to be performed synchronously to all replicas, it is C/P in CAP --- a partition or node failure causes the system to be unavailable for writes. If replication is performed asynchronously, and reads are configured to be served from any replica, it is A/P in CAP --- the system is always available, but there is no guarantee of consistency. [Edit: Actually this configuration is really just P of CAP --- minority partitions become unavailable for writes (see comments about eventual consistency below). This violates the technical definition of "availability" in CAP. However, it is obviously the case that the system still has more availability in this case than the synchronous write configuration.]
Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency. I found this to be extremely amusing, since Oracle’s marketing material associates NoSQL with the BASE acronym. But the E in BASE stands for eventual consistency! So by Oracle’s own definition, their lack of support of eventual consistency means that their NoSQL Database is not actually a NoSQL Database! (In my opinion, their database is really NoSQL --- they just need to fix their marketing literature that associates NoSQL with BASE). My proof for why the Oracle NoSQL Database does not support eventual consistency is the following: Let’s say the master node for a particular key-value pair fails, or a network partition separates the master node from its replica nodes. The key-value pair becomes unavailable for writes for a short time until the system elects a new master node from the replicas. Writes can then continue at the new master node. However, any writes that had been submitted to the old master node, but had not yet been sent to the replicas before the master node failure (or partition) are lost. In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired. Of course, if replication had been configured to be done synchronously (at a cost of latency), there will not be data loss during network partitions or node failures. Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to trade off consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off durability for latency and availability.
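The failure scenario in the argument above can be captured in a toy model. This is purely illustrative (not Oracle code): it shows how, under asynchronous replication, a write acknowledged only by the master vanishes when a replica is promoted in its place.

```python
# Toy model of the durability argument: with asynchronous replication,
# the master acknowledges a write before propagating it, so a master
# crash loses any acknowledged-but-unreplicated writes.

class Node:
    def __init__(self):
        self.data = {}

master = Node()
replicas = [Node(), Node()]

def write_async(key, value):
    master.data[key] = value  # master acknowledges immediately
    # propagation to the replicas would happen later, in the background

write_async("k", "v1")        # acknowledged, but not yet replicated
# master crashes before propagating; a replica is elected new master
new_master = replicas[0]
assert "k" not in new_master.data  # the acknowledged write is gone
```

An eventually consistent system would instead keep the old master's write around and reconcile it after recovery; here it is simply rolled back.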
To be clear, this difference applies only to inserts and updates --- for read requests, the Oracle NoSQL database is able to trade off consistency for latency --- it supports similar types of timeline consistency tradeoffs as the Yahoo PNUTS/Sherpa system.
[Two of the members of the Oracle NoSQL Database team have commented below. There is a little bit of a debate about my statement that the Oracle NoSQL Database lacks eventual consistency, but I stand by the text I wrote above. For more, see the comments.]
Like most NoSQL systems, the Oracle NoSQL database does not support joins. It only supports simple read, write, update, and delete operations on key-value pairs.
The Oracle NoSQL database actually has a more subtle data model than simple key-value pairs. In particular, the key is broken down into a “major key path” and “minor key path” where all keys with the same “major key path” are guaranteed to be stored on the same physical node. I expect that the way minor keys will be used in the Oracle NoSQL database will map directly to the way column families are used in Bigtable, HBase and Cassandra. Rather than trying to gather together every possible attribute about a key in a giant “value” for the single key-value pair, you can separate them into separate key-value pairs where the “major key path” is the same for all the keys in the set of key-value pairs, but the “minor key path” will be different. This is similar to how column families for the same key in Bigtable, HBase, and Cassandra can also be stored separately. Personally, I find the major and minor key path model to be more elegant than the column family model (I have ranted against column-families in the past).
Like most NoSQL systems, the Oracle NoSQL database is not ACID compliant. Besides the durability and consistency tradeoffs mentioned above, the Oracle NoSQL database also does not support arbitrary atomic transactions (the A in ACID). However, it does support atomic operations on the same key, and even allows atomic transactions on sets of keys that share the same major key path (since keys that share the same major key path are guaranteed to be stored on the same node, atomic operations can be performed without having to worry about distributed commit protocols across multiple machines).
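The major/minor key path idea and the same-node atomicity guarantee fit together neatly: if only the major path determines placement, then every pair sharing a major path lives on one node, and that node can apply multi-key updates atomically without a distributed commit protocol. A sketch, with illustrative names and hashing of my own invention:

```python
import hashlib

# Illustrative sketch of major/minor key paths: only the major key path
# is hashed for placement, so all key-value pairs sharing a major path
# land on the same node (enabling single-node atomic multi-key updates).
NUM_NODES = 8  # assumed cluster size

def node_for(major_path):
    joined = "/".join(major_path)
    return int(hashlib.md5(joined.encode()).hexdigest(), 16) % NUM_NODES

# Same major path ("users/alice"), different minor paths: the profile
# and inbox entries are guaranteed to be stored on the same node.
profile_key = (("users", "alice"), ("profile",))
inbox_key = (("users", "alice"), ("inbox",))
assert node_for(profile_key[0]) == node_for(inbox_key[0])
```

This mirrors how column families for one row key in Bigtable-style systems are co-located, but expressed through the key structure rather than a fixed schema of families.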
The sweet spot for the Oracle NoSQL database seems to be in single-rack deployments (e.g. the Oracle Big Data appliance) with a low-latency network, so that the system can be set up to use synchronous replication while keeping the latency costs of this type of replication small (and the probability of network partitions is small). Another sweet spot is wider-area deployments where the application can work around reduced durability guarantees. It therefore seems to present the largest amount of competition for NoSQL databases like MongoDB which have similar sweet spots. However, the Oracle NoSQL database will need to add additional “developer-friendly” features if it wants to compete head-to-head with MongoDB. Either way, there are clearly situations where the Oracle NoSQL database will be a great fit, and I love that Oracle (in particular, the Oracle BerkeleyDB team) built this system from scratch as an interesting and technically distinct alternative to currently available NoSQL systems. I hope Oracle continues to invest in the system and the team behind it.
This seems superficially similar to the GenieDB system.
Are you familiar with it?
Great article and I can't wait to find out more about this solution. A big player like Oracle in the NoSQL space will drive other vendors to improve their own products.
I am wondering about the situation where a master goes offline. What happens when it comes back online? Does it just drop all the data it had previously? Is there a process or a tool to merge any changes? These might be questions that nobody has besides Oracle.
I thought the ONoSQL paper was very nicely written, how did you find out that Margo wrote it?
The consistency model looks like Membase to me. It's CP, not AP. This is a consequence of having a master/slave based approach. If you are partitioned and can't see the master, then you can't write, and you can only read if you don't enforce consistency.
The scalability of master/slave systems is also limited by the traffic to the master, as adding more slaves doesn't help when the master gets overloaded.
Is there any support for secondary indexes and bulk operations like range queries?
It doesn't compete with Cassandra or Riak, which have no master (higher scalability), have write availability when partitioned, and have a more flexible distribution model. Cassandra also has a more flexible Bigtable-style object model.
It doesn't seem to compete with MongoDB, which has very flexible indexing of complex objects - although MongoDB is also master slave based, its developer API is its strength, rather than high end scalability.
Hbase has very good range query support for time series log oriented data.
Membase seems the most similar to ONoSQL, but has merged with CouchDB to form Couchbase and gain more flexible object query support.
Is the developer API the same as BDB? If so, that might get it some initial traction for code bases that already use BDB. It would have to include some extensions to manage the consistency model options.
So, yes, it's a bit different to the others, but not particularly interesting. I got the impression it was written in Java (like Cassandra) rather than C++ (like MongoDB) or Erlang (like Riak), does anyone know?
I was thinking about that situation too. I thought that asynchronous replication was enough to have eventual consistency. I'm wondering what MongoDB does in that case. I think they are discarding the changes from the failed master/primary node.
@36f30bfc-4ccc-11e0-9a64-000bcdcb2996: I have not run into GenieDB too often, but their technical whitepaper says they support eventual consistency, which seems to be at least one difference from the Oracle system.
@NoelH, @AdrianCockcroft: I don't want to answer questions about the Oracle system that I'm not confident in the answer (it is best to ask Oracle directly). But Adrian, I agree with pretty much everything you said.
The master/replica scheme, the ability to guarantee that clusters of keys are all available on the same set of servers, and durability of writes sent to a crashed master are all very similar to the Hibari key-value store.
See http://hibari.github.com/hibari-doc/ for docs and pointers to the source.
First, Daniel -- thank you for a nicely written thoughtful piece about Oracle's NoSQL Database. I largely agree with most of what you wrote, but let me clarify a few points:
1. Yes, I wrote the white paper -- this was revealed (although not intentionally a secret) in my colleague Charles Lamb's blog: http://blogs.oracle.com/charlesLamb/entry/oracle_nosql_database1
2. Yes, Oracle's NoSQL Database is written in Java.
3. I think we're quibbling over competing definitions of eventual consistency. Oracle's NoSQL Database is eventually consistent, because in fact, all sites will converge to the same state. A master that accepts a write without contacting any of the replicas and then crashes will sync up with the newly elected master when it returns. Thus, the write will have been lost, but the state is still eventually consistent. To use Bayou terminology, if you configure Oracle's NoSQL database to accept writes without contacting any replicas (a write-acknowledgement policy of NONE), then you could consider all such writes as tentative, until they have been propagated. Alternately, you could consider our rollback of the disconnected master's update as a deterministic merge procedure.
Here is another way to look at it, from my colleague Sam Haradvala:
The eventually consistent state the Oracle NoSQL Database ends up in will differ depending on whether the write request required a simple majority of acknowledgments or a higher level. A key difference between a system like Dynamo and Oracle NoSQL Database is that they will end up in different eventually consistent states when fewer than a simple majority of synchronous acknowledgments are required, particularly in the case of W=1, and the writer node fails before it has a chance to propagate its changes. In Oracle NoSQL Database the write is not durable and is rolled back even if the failed node eventually recovers. In Dynamo-like systems the write at the failed node may contribute the change to the eventually consistent state.
4. Finally, I'd like to add one other point that Sam also brought up. You made no mention of the simple majority write acknowledgement policy. I think that you're using the terms synchronous and asynchronous as though the only two options available are to wait for all replicas to acknowledge the write (ALL) or to wait for none of the replicas to acknowledge the write (NONE). In fact, the default configuration for Oracle NoSQL Database is to wait for a simple majority of the replicas to acknowledge the write. Users are free to trade off durability (by requiring fewer acks) for lower latency, higher throughput and write availability.
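The spectrum of acknowledgement policies described in point 4 can be sketched as a simple threshold rule. The policy names and counting convention here are illustrative, not Oracle's exact API:

```python
# Illustrative sketch of write-acknowledgement policies: ALL waits for
# every replica, SIMPLE_MAJORITY for more than half, NONE for no
# replica acks (the master's local commit suffices).

def acks_needed(num_replicas, policy):
    if policy == "ALL":
        return num_replicas
    if policy == "SIMPLE_MAJORITY":
        return num_replicas // 2 + 1
    return 0  # "NONE"

def write_is_acknowledged(acks_received, num_replicas, policy):
    return acks_received >= acks_needed(num_replicas, policy)

# With 5 replicas, a simple-majority write needs 3 replica acks:
assert write_is_acknowledged(3, 5, "SIMPLE_MAJORITY")
assert not write_is_acknowledged(2, 5, "SIMPLE_MAJORITY")
```

Moving the threshold down from ALL toward NONE is exactly the durability-for-latency trade described above: fewer acks before commit means faster, more available writes, but more writes that can be lost on failure.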
Thanks so much for adding more details in your comment. I totally agree with you that we're working with competing definitions of eventual consistency. I'm using the Werner Vogels definition, which I believe is the prevalent definition (i.e., it is reflected in the Wikipedia definition). The Vogels definition can be found here, and is reproduced below:
"Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value."
Your and Sam's definition seems to be a straight read on the words "eventual consistency" where dropping updates is a legitimate way to eventually make all replicas agree with each other. Without academic rigor behind terms like "eventual consistency", it is only natural that it will come to mean different things to different people.
However, I believe your definition is quite dangerous. (You can make any database eventually consistent by deleting everything!).
Regarding your point 4, I should have made this point a little clearer in my text. But whether synchronous replication means 'majority' or 'all', this does not have an effect on any of the categories I place it in above. The Oracle NoSQL system still does not support eventual consistency according to the Vogels definition, and the general design trades off durability for latency and availability.
For me it feels like it's lacking durability, but it sounds eventually consistent. Yes, "clear all" is eventually consistent by this definition, but it would also mean zero durability. It seems like what is offered here is 'almost' durable.
When you say that "..In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired", what do you mean by "reconciled" ? which update should win ? the older ? the newer ? a merge between them ?
If you configure the Oracle NoSQL Database system to use simple_majority (or stronger), then you will reach eventual consistency by the Vogels definition above. The transaction is only lost if you use an ack policy weaker than simple_majority. If a transaction is committed using simple_majority (or stronger), then that means that it has reached a majority of the nodes. At that point, even if the master fails and a new master is elected, the transaction data will be present on the new master.
BTW, I should have added that we do not recommend using an ack policy weaker than simple_majority.
Ok, reading the Vogels definition again, he says "if no new updates are made to the object". That means that you won't have conflicts during the reconciliation process. I think I got your point, Daniel, about the Oracle NoSQL DB. They are discarding the non-replicated updates even if no new updates were made to the non-replicated object.
@PabloM: Yes, your second comment is what I was trying to say.
@CharlesLamb: First, congrats on building a nice system. Please do not take my comments that the Oracle NoSQL Database is not an eventually consistent system as a criticism --- like I said in my post --- you don't have to be eventually consistent to be NoSQL. That said, what you described in your comment is not eventual consistency, it's full consistency! The majority ACK configuration is a classic CP system from CAP (see: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf).
@CharlesLamb: In particular, if a minority partition gets partitioned off from the majority partition, the minority partition is unavailable for updates, while the majority partition is fully consistent. Hence, this is CP from CAP.
If you were eventually consistent, the minority partition could stay available for updates and would reconcile with the rest of the database when the partition is repaired.
So you are not eventually consistent. But that's nothing to be embarrassed about.
With so many storage servers written in Java (this, Cassandra, HBase, HDFS) eventually we are going to need a JVM that doesn't have serious problems from GC stalls. See http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/ for an example.
Intermittent GC stalls are another form of downtime.
That's a good point --- I assume writes stall if the master node for that write is undergoing GC. I wonder how big of an issue this is...
Sherpa also supports eventual consistency (and has for a while). I was not sure if your article hinted otherwise, but just in case.
It's interesting that the update mechanism that leads to better consistency benefits from low-latency interconnects, which could explain the presence of InfiniBand in the racks. Hadoop traditionally trades off premium parts for low-cost storage and compute, yet the hardware offerings here are premium. The only other people who have looked at Hadoop over IB are Mellanox and teams they have donated hardware to.
Technically, it may be a good design for small-scale (not Facebook-class) systems. Yet the hardware will only come from Oracle, the base OS from Oracle, the DB from Oracle --- paying their prices and their support costs, and getting updates at their chosen rate. I'd argue against the technology on those grounds: cost, and the fact that Oracle ends up owning all your data, this time on hardware they can set the price for.
Can anyone comment on the local store? Is it BerkeleyDB?
thanks, Andy (Acunu)
Yes, it's BerkeleyDB.
First of all Daniel, thank you for a really nice write up and thank you Margo for an equally helpful comment.
Now, doesn't MongoDB have the same semantics for consistency, i.e., discarding the writes from the failed master node in case they were not replicated to the replica set?
Especially after Margo's explanation of the write-majority feature, Oracle NoSQL is strikingly similar to MongoDB in CAP-theorem parlance, isn't it? (Although many other things are quite different.)
Reading "MongoDB: The Definitive Guide", it says:
"...Whenever the primary changes, the data on the new primary is assumed to be the most up-to-date data in the system. Any operations that have been applied on any other nodes (i.e., the former primary node) will be rolled back, even if the former primary comes back online. To accomplish this rollback, all nodes go through a resync process when connecting to a new primary. They look through their oplog for operations that have not been applied on the primary and query the new primary to get an up-to-date copy of any documents affected by such operations."
So based on that, it looks like they are discarding the non-replicated objects/documents.
@Srirang: If the failed primary comes back online it will reconnect to the replica set and any changes that did not make it to the secondaries will be stored in its local rollback directory for later analysis.
@Pablo: If the primary member of a replica set fails, the remaining members of the set will try to elect a new master. There are a number of factors that go into choosing the new master (delta from the old master, priority, hidden flag) so it is possible that the new primary will not be the one that has the latest data. If that happens, the secondaries will log the differences to their local rollback directories as they re-orient to the new primary.
You can check out the following for more details:
@Mark Callaghan and @Daniel Abadi:
GC is a serious issue and there are two strategies nowadays, afaik:
a) pre-allocate a large 'slab' of memory so that it reduces external fragmentation (but increases internal fragmentation and doesn't promote optimal space allocation). Cassandra and HBase are using this strategy.
b) allocate the data outside the Java heap using ByteBuffer.allocateDirect(). This brings its own share of problems, such as the need to copy between the native heap and the Java heap, etc.
Re GC stalls: I agree with @eribeiro. If you write a DB in Java you have to allocate big chunks of memory and manage them yourself: you can never use GC to manage lots of small objects (e.g. rows) across a massive heap; that will never scale. That isn't Java's fault --- you couldn't write a DB in C like that either. The fix described by Cloudera is just DB 101.
Thank you guys for this interesting information!
Do you really think BerkeleyDB team built this from scratch? A lot of this sounds like Coherence to me... I'd bet they're reusing a lot of that in the background...
While I agree with your assessment of the interpretation of eventual consistency, here's my observation of its impact.
These systems are built to be run on commodity storage (no SAN), so you will always achieve durability with W: majority model.
Now if there is a partition, the minority will not allow updates (loss of A for minority partition) but it will be eventually consistent with majority when the partition heals. So it does give you the practical benefits of EC at the loss of partial availability.
So using a strict definition of EC, while correct, is not practically useful.
Technically nothing is written from scratch. In the end we're all using 1s and 0s, are we not? :)