Wednesday, November 7, 2012

Is Upstart the right way to get college student start-ups funded?

In 2010, the movie “The Social Network” was released, which has had a tremendously positive effect on the computer science department at Yale; and from what I have heard, a similar effect has been observed across the country. My understanding (I have not seen this movie myself) is that the movie’s plot revolves around Mark Zuckerberg and his role in the formation of Facebook. In the time since the movie was released, the number of computer science majors at Yale has nearly quadrupled, and these majors are increasingly looking at start-ups (either founding their own or joining existing ones) as options for when they graduate (instead of going to Wall Street and working as quants, which has historically been the popular career path for Yale CS majors).

In my opinion, this is unquestionably a good thing. Yale has some of the brightest minds of the next generation, and I feel a lot more confident about the future of our country when I see these great minds being applied to creating new entities and jobs and building something real, instead of being wasted in the zero-sum arms race of who can create the automatic trading algorithm that is epsilon better than anybody else’s.

One consequence of this start-up craze is that I get bombarded with requests from students who want to meet with me to discuss their start-up idea. This is partly because I teach the “Intro to Programming” course at Yale, which has consistently enrolled between 120 and 150 students (many of whom are budding entrepreneurs) since the release of “The Social Network”; partly because the success of Hadapt is certainly no secret around Yale; and partly because I live on Yale’s campus and part of my job in this capacity is to serve as an adviser and mentor to undergraduates.

When I meet with these students I hear all kinds of ideas. Some of them are good, and some of them are bad. Some of them make me think about an area in a different way, and some of them are carbon copies of something that already exists. Some I could get excited about and some I couldn’t. But just about all of them have one thing in common: the students involved vastly overestimate their probability of success. I suppose this should not surprise me --- after all, these are Yale students who have been successful in just about everything they have ever done in their life. So it follows that they would expect their start-up to be successful. But even when I talk to students at other universities who have start-up ideas --- students who have not necessarily been so successful in their lives --- even they are totally convinced that their start-up idea is unlikely to fail. It seems that there is a basic psychological flaw in the human mind --- we so desperately want our dreams to come true that we ignore statistical data about the probability of success and trick ourselves into believing that we are the statistical anomaly and will succeed where others have failed.

Many of these students find out the hard reality regarding their start-up idea when they attempt to raise funding. They find out that investors are extremely conscious of the probability of success of a group of students with no experience, no reputation, and a limited network. Most of these start-ups fail to raise funding from professional investors. Some students give up at this point. Other students continue along with limited funding from friends and family in an attempt to create more meat around the bones of their idea and reduce risk for the professional investors. Most will eventually fail, while a rare few will succeed.

The outcome of all this is that despite all of these students being eager to become entrepreneurs and start companies, very few student ideas receive funding, and most of these ideas never see the light of day. Whether or not this is a good thing is certainly up for debate, but my feeling is that it is a shame that so few student start-ups get funding.

Therefore, when I first heard of Upstart (I think it was in August), I was quite interested in the idea --- it proposed a way to get student start-ups funded. I signed up to receive e-mail updates, but did not hear from them for several months. However, on Monday of this week I received an update from them that they were open for business. I looked through the profiles of the students who were looking for funding and I saw that no fewer than 4 out of the (approximately) 20 profiles that were available online were from Yale University.

However, a deeper look at the Upstart website reveals a problematic clause that is attached to the funding of the student start-up ideas. This is not a traditional crowdfunding model where investors receive equity in the start-up in exchange for their investment dollars. Instead, the investors get a percentage of the student’s income for a 10-year period in exchange for the investment. This way, in the likely event that the student’s start-up idea does not work out, the investor is able to receive a nice return on investment by taking a cut from the student’s hard-earned salary when the student enters the workforce.

This does not seem right to me. On one side you have students who have an inaccurate view of the probability of success of their start-up, and on the other side you have investors who are looking to profit off of the boundless optimism and dreams of these students. These students, with no experience in the real world, no understanding of what skills are necessary to build a company, and a perception of entrepreneurship built more from Hollywood than the cold realities of business, are more than happy to mortgage a percentage of 10 years of future earnings for a chance to receive some short-term money about which they have no idea how to properly evaluate the costs vs. benefits.

In the traditional model, where the investor receives equity in exchange for the investment, at least the investor is in the same boat as the student --- their interests are aligned and focused on making the start-up a success. With the Upstart model, you have almost the exact opposite. Since the salary of a founder is typically below-market in exchange for the equity the founder receives, the expected rate of return for the investor is actually higher if the student were to give up on the start-up and get a normal job. This is especially true because the rate of return for the investor is capped (as it is in Upstart), so that even if the start-up were to take off and the student were to become very wealthy from it, the return to the investor is not markedly different from what it would have been if the company had failed and the student later received a salary at market value. To exacerbate the situation, the investor-investee relationship in Upstart is also supposed to be something of a mentor-mentee relationship, which is particularly dangerous when interests are misaligned.
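To make the misalignment concrete, here is a toy calculation. All the numbers (income share, cap, investment size, salaries) are invented for the sake of illustration; they are not Upstart’s actual terms:

```python
# Hypothetical illustration of the incentive misalignment described above.
# All numbers are invented; they are not Upstart's actual terms.

def investor_payout(annual_income, share=0.05, years=10, cap=3.0, investment=30000):
    """Total payout to the investor: a fixed share of income for `years`,
    capped at `cap` times the original investment."""
    uncapped = share * annual_income * years
    return min(uncapped, cap * investment)

# Scenario A: the student abandons the start-up and takes a market-rate job.
market_job = investor_payout(annual_income=120000)   # 0.05 * 120k * 10 = 60k

# Scenario B: the student stays at the start-up on a below-market founder salary.
founder = investor_payout(annual_income=60000)       # 0.05 * 60k * 10 = 30k

# Scenario C: the start-up succeeds wildly, but the cap limits the upside.
big_exit = investor_payout(annual_income=2000000)    # capped at 3.0 * 30k = 90k

print(market_job, founder, big_exit)  # 60000.0 30000.0 90000.0
```

Under these made-up terms, the investor does twice as well if the student abandons the start-up for a market-rate job, and even a wildly successful exit returns only the capped amount.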

I think Upstart should be commended for trying to get more funding to college students with ideas for starting companies. And although I don’t know many people involved, the people I do know are good people and I highly doubt they are trying to do anything evil. (Jonathan Eng was a TA for my Introduction to Programming class 4 years ago, and he was a good and honest TA.) However, I do not believe the people involved in Upstart realize how hard it is for students to accurately evaluate the costs and benefits of receiving funding in this way. Therefore I am highly concerned about this model as a way forward for student entrepreneurship.

Monday, October 29, 2012

IEEE Computer issue on the CAP Theorem

Due to Hurricane Sandy, Yale gave me a day off from teaching today and I have finally been able to get to a few things on my "to-do" list. One of them is to write a blog post about the IEEE Computer CAP Retrospective edition and make my paper that appeared inside of it publicly available.

Earlier this year, the IEEE Computer magazine came out with an issue largely devoted to a 12-year retrospective of the CAP theorem, containing several articles from distributed systems researchers that contribute various opinions and thoughts about CAP. The first article is from Eric Brewer, who coined the CAP theorem 12 years ago (though he points out in his article that it was actually 14 years ago). A PDF of Brewer’s article is available for free from: The second article is from Seth Gilbert and Nancy Lynch (the same Gilbert and Lynch that proved the CAP theorem 10 years ago).

The third article is from me, and contains my criticisms of CAP that long-time readers of my blog will be familiar with. In particular, I point out that many people assume that modern NoSQL systems relax consistency guarantees in order to gain availability due to the constraints of the CAP theorem, when the reality is that these systems give up on consistency even in the absence of network partitions, which is not required according to the CAP theorem. The reason why they give up on consistency is a desire to improve system latency, an increasingly important requirement in the modern impatient world. I then describe the latency-consistency tradeoff in more detail, and end the article with the PACELC reformulation of CAP that debuted on my blog over two years ago. With the permission of the IEEE, I am making a free version of this article available today. This article is the first time that the PACELC formulation and my thoughts on CAP appear in a scholarly article, which gives people a venue to refer to (bibtex code available here) when citing this work (you can stop citing a blog post!)

The fourth article is from Raghu Ramakrishnan, entitled “CAP and Cloud Data Management”, and describes the PNUTS system that I have mentioned in the past as a good example of a system for which the consistency-latency tradeoff has had a more direct impact on the system design than the consistency-availability tradeoff of CAP. The fifth article is from Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell of Cornell University on overcoming CAP with soft-state replication. Unfortunately, I cannot find a free link to Raghu’s article, but if you have an IEEE account, you can access it at: The Birman et al. article can be found for free at:

If you have enjoyed my thoughts on CAP on this blog, I highly recommend you read each of these five articles. The Brewer article in particular acknowledges my past criticism of CAP not actually being about picking two of three out of C (consistency), A (availability), and P (partition tolerance), due to the fact that it does not make sense to reason about a system that is ‘CA’. (If there is no partition, any system can be both consistent and available --- the only question is what happens when there is a partition: does consistency or availability get sacrificed?) Brewer uses this observation to lead into a nice generalization of the consistency-availability tradeoff. In particular, when a partition occurs, the system does three things: (1) detect that the partition occurred, (2) enter a partition mode that may or may not limit some operations, and (3) initiate some sort of reconciliation algorithm when the partition is fixed. Depending on how these three things are implemented, it is possible to obtain much of the spectrum between CP systems and AP systems. The article also contains a nice reference to the CRDT work by Shapiro et al. at INRIA. Overall, I strongly support Brewer’s approach to navigating this tradeoff. It also fits nicely with Mehul Shah’s talk at HPTS in the way that the spectrum between consistency and availability is explicitly considered at system design time, rather than trying to bolt consistency on top of an AP (eventually consistent) system after the fact (a wildly suboptimal endeavor).

While most of Brewer’s article focused on the consistency-availability tradeoff, Brewer also briefly acknowledges that “in its classic interpretation, the CAP theorem ignores latency”, and that some systems reduce consistency for latency (he even refers to the PNUTS example I used in my original blog post). I remain convinced that PACELC is the best way to reason about both of these tradeoffs in a single formulation: if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
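For readers who think better in code, the PACELC question can be phrased as a two-branch decision. The system profiles below are simplified illustrations (a Dynamo-style PA/EL system and a classic PC/EC ACID system), not official classifications of any particular product:

```python
# A minimal sketch of the PACELC formulation as a decision procedure.
# The system profiles are simplified illustrations, not official classifications.

def pacelc(system, partitioned):
    """Per PACELC: if Partitioned, the system trades off Availability vs
    Consistency; Else (normal operation), Latency vs Consistency."""
    if partitioned:
        return system["on_partition"]   # "A" or "C"
    return system["normally"]           # "L" or "C"

# PA/EL: gives up consistency under partitions (for availability) and in
# normal operation (for latency) -- e.g., a Dynamo-style system.
dynamo_style = {"on_partition": "A", "normally": "L"}

# PC/EC: refuses to give up consistency in either case -- e.g., a classic
# ACID system with synchronous replication.
acid_style = {"on_partition": "C", "normally": "C"}

print(pacelc(dynamo_style, partitioned=True))   # A
print(pacelc(acid_style, partitioned=False))    # C
```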

Tuesday, June 26, 2012

Defending Matt Welsh’s 'Startup University' Post

A week ago, Matt Welsh released a blog post on attaching a startup incubator to a university in order to create a funding model for some of the research that is performed at the university. Unfortunately, the beginning part of the blog post talked about the “inefficiency” of universities in terms of “producing real products” and the (perhaps overly dramatic) assertion that “nothing of practical value came out of [Matt’s] entire research career”. Although Matt has clarified that it was not his intention to indicate that the goal of academic research was to “produce real, shipping products that people could use”, many people interpreted the opening part of Matt’s post in that way, and reacted negatively (including, notably, Michael Mitzenmacher who responded in a comment and Joe Hellerstein who responded in his own blog post).

If we ignore the problems with the first part of Matt’s post, the rest of the post raises some important points and interesting ideas. As an academic who has spent large chunks of time spinning off a research project into a startup (HadoopDB was commercialized by Hadapt, which by most available metrics has been an example of a research lab-to-startup success story), I found that many parts of Matt’s article rang true:
  1. Matt’s statement: “Most universities make starting a company painfully difficult when it comes to questions of IP ownership [and] licensing” was certainly true for Hadapt. It took way too long, and way too much effort, to get an agreement in place. Part of the problem was discussed in the comment thread of Matt’s post --- licensing patents is much better aligned with the core mission of a university than accepting equity in start-ups.
  2.  Matt’s statement: “Most universities also make starting a company painfully difficult when it comes to […] forcing the academic's research to be dissociated with their commercial activities.” This was also true for me. I do not mean to criticize the university --- I absolutely understand the need for the conflict of interest safeguards because of the way that universities (and the assumptions of incoming students) are structured today. However, restructuring some of these assumptions in the way that Matt talks about may reduce the legal liabilities, and allow for fewer safeguards to have to be put in place. I also think that the students are hurt more than helped by some of these safeguards. For example, one of the PhD students involved in HadoopDB wanted to work part time for Hadapt while finishing his PhD. However, due to the COI legal complexities, he was forbidden from doing this and was forced to choose between Hadapt and the PhD program (he, of course, chose to take a leave of absence and join Hadapt).
  3.  Matt’s statement that academics starting companies “involves a high degree of risk (potentially career-ending for pre-tenure faculty)” obviously resonates with me. Whether or not Hadapt is successful, it has certainly taken my time away from publishing papers (though obviously, I'm still trying to publish as much as I can --- see, for example, my last post on the Calvin project). Since publication quantity and quality remain key statistics for academic success, any conscious reduction of them comes with a clear risk.
The bottom line is that I absolutely agree with Matt’s assertion that there are a lot of extremely intelligent faculty in academic institutions across the world who have made the mental calculation and decided that the benefits do not outweigh the risks in spinning off a startup from an ongoing research project. Whether or not this is a bad thing is up for debate --- it is certainly not the core mission of a university to spin off companies or produce real-world products. However, most universities do have some number of applied fields, and measuring impact in applied fields often begins by looking at real-world deployments of the research ideas. Starting companies is clearly the most direct mechanism for translating research ideas to real-world impact. Hence, it’s probably not a controversial statement to assert that reducing some of the barriers to starting companies would allow faculty in applied fields to increase their impact, the primary goal of research.

Therefore, allowing for explicit relationships between research groups and university-sponsored start-up incubators, where the university invests in a start-up, with proceeds from such investments being used to sponsor additional research in the department, is an idea worth considering. I would, however, change a few things about Matt’s proposal:
  1.  I would not simply replace venture capital money with university money. Although it is easy to get into the trap of assuming that the venture capitalist simply trades investment dollars for equity in the company, it turns out that venture capitalists provide a lot of value in addition to their money. Seeing firsthand the difference at Hadapt before and after we got big-name venture capitalists behind us really drove this point home for me.  Therefore, I would recommend that the university partner with venture capitalists, or otherwise hire successful venture capitalists to work in-house (and continue to compensate them using the standard venture capital compensation schemes). Although the Kauffman report has recently shed some light into how poorly venture capital has performed over the last decade, the top venture capitalists have still done very well, and it is important to remember that the goal for the university is not to turn a profit on the investment, but rather to increase the number of startups coming out of the university, in order to increase the research impact. Break-even performance or even small amounts of losses are totally acceptable.
  2.  The model will not work for every university. The location of the university is critical. Trying to get an incubator going for universities located in the middle of nowhere is a recipe for disaster. Technologists like to think that the technology that the company is commercializing is the most important factor in the company’s success. In fact, it falls way behind ‘market’ and ‘people’ as a determining factor. The company needs competent and experienced people throughout the organization --- the engineering team, marketing, sales, support, etc. Recruiting a competent team in a location where there have been few comparable companies is likely to be futile. Students from the university can only get you so far --- you need a mix of experienced people as well.
  3.  There need to be explicit mechanisms in place to reduce the risk for the faculty member. This means that the faculty member should get credit for certain company metrics at promotion or yearly evaluation time, in addition to standard paper citation metrics. Company financial data is probably not a great metric, but customer counts of people actually using the technology, or even customer counts at “me-too” competitors, could be used. Three years after publishing the original HadoopDB paper, there are real people using this technology to solve real problems. It’s pretty rare to see such an immediate impact, and it ought to count for something.
Obviously my own experiences have made me predisposed to liking Matt’s ideas, but I do encourage people to read the second half of his post independently of the first half.

Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Look at the TPC-C website’s top 10 results, ranked by raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describe research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically, we argue that a major class of new scalable technologies falls under the category of “NoSQL”, which achieves scalability by dropping ACID guarantees, thereby allowing these systems to eschew two-phase locking, two-phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition, since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10 --- they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called “NewSQL” databases that claim to be both ACID-compliant and scalable. And these claims are true --- to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two-phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two-phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.
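To see why multi-shard transactions are costly, here is a toy sketch of two-phase commit coordinating a transaction across two shards. The shard contents and `transfer` logic are invented for illustration; in a real system, each prepare/commit arrow is a network round trip with logging, which is exactly where the scalability cost lies:

```python
# A toy sketch of two-phase commit (2PC) across two shards.
# Shard contents and the transfer logic are invented for illustration.

class Shard:
    def __init__(self, records):
        self.records = records
        self.staged = {}

    def prepare(self, key, delta):
        # Phase 1: validate and stage the change, then vote yes/no.
        if self.records.get(key, 0) + delta < 0:
            return False
        self.staged[key] = self.records[key] + delta
        return True

    def commit(self):
        # Phase 2: make the staged change durable.
        self.records.update(self.staged)
        self.staged.clear()

    def abort(self):
        self.staged.clear()

def transfer(src, dst, key_a, key_b, amount):
    # Coordinator: commit only if *every* participant votes yes.
    votes = [src.prepare(key_a, -amount), dst.prepare(key_b, +amount)]
    if all(votes):
        src.commit(); dst.commit()
        return True
    src.abort(); dst.abort()
    return False

s1, s2 = Shard({"alice": 100}), Shard({"bob": 50})
print(transfer(s1, s2, "alice", "bob", 30), s1.records, s2.records)
# True {'alice': 70} {'bob': 80}
print(transfer(s1, s2, "alice", "bob", 500))  # False: alice would go negative
```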

(3) Trade cost for scalability. If you use high-end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don’t have shared-nothing horizontal scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions --- machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human-scale applications. The only downside is cost --- the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that includes Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

  • Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
  • Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
  • Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.
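The "agree on an order first" idea can be sketched in a few lines. This is a toy model, not Calvin's actual scheduler: because the global transaction order is fixed in advance, every node can compute identical per-record lock queues before execution begins, so conflicts are resolved the same way everywhere without runtime agreement:

```python
# A toy sketch of deterministic lock scheduling: the transaction order is
# agreed upon in advance, so lock queues can be built up front and every
# node resolves conflicts identically. Not Calvin's actual implementation.
from collections import defaultdict, deque

def schedule(txns):
    """txns: list of (txn_id, read_write_set), in the pre-agreed global order.
    Returns, per record, the queue of transactions that will lock it."""
    lock_queues = defaultdict(deque)
    for txn_id, rw_set in txns:
        for record in sorted(rw_set):   # deterministic iteration order
            lock_queues[record].append(txn_id)
    return lock_queues

# Transactions 1 and 2 conflict on record "x"; the pre-agreed order resolves
# the conflict the same way at every node, with no runtime negotiation.
queues = schedule([(1, {"x", "y"}), (2, {"x"}), (3, {"z"})])
print(list(queues["x"]))  # [1, 2] -- txn 1 always acquires x's lock first
```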

Strongly consistent global replication

Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers --- at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin’s end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)
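A minimal sketch of this idea (a toy model, not Calvin's actual replication code): since execution is deterministic, replicas that apply the same ordered input log independently must arrive at the same state, with no synchronization of results:

```python
# A toy illustration of replicating *input* rather than results: deterministic
# execution guarantees that the same log yields the same state at every replica.

def apply_log(log):
    state = {}
    for op, key, val in log:          # deterministic: same log -> same state
        if op == "put":
            state[key] = val
        elif op == "add":
            state[key] = state.get(key, 0) + val
    return state

# The ordered transaction requests are shipped once, cheaply, to each replica.
log = [("put", "x", 1), ("add", "x", 4), ("put", "y", 7)]
us_east = apply_log(log)     # each replica processes the log independently
eu_west = apply_log(log)
print(us_east == eu_west, us_east)  # True {'x': 5, 'y': 7}
```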

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we’ve hooked Calvin up to Google’s LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.

From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.
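As a rough illustration, a fully server-side transaction might look like the following sketch. The key-value API and the `new_order` logic are invented for illustration; they are not Calvin's actual interface:

```python
# A hypothetical sketch of a fully server-side transaction whose logic (and
# therefore its record accesses) is known before execution begins.
# The KV store and new_order logic are invented for illustration.

class KV:
    def __init__(self, data):
        self.data = dict(data)
    def get(self, key):
        return self.data[key]
    def put(self, key, value):
        self.data[key] = value

def new_order(db, customer_id, item_id, qty):
    # All logic runs on the server: no mid-transaction round trips to an
    # application server, and the records touched can be prefetched up front.
    stock = db.get(("inventory", item_id))
    if stock < qty:
        return "rejected"
    db.put(("inventory", item_id), stock - qty)
    db.put(("orders", customer_id), (item_id, qty))
    return "ok"

db = KV({("inventory", "widget"): 10})
print(new_order(db, "cust1", "widget", 3))  # ok
print(db.data[("inventory", "widget")])     # 7
```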

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

Wednesday, December 7, 2011

Replication and the latency-consistency tradeoff

As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.

There are several important tradeoffs to consider when it comes to system design for replicated database systems. The most famous one is CAP --- you have to trade off consistency vs. availability in the event of a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. (Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my PACELC post).

The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.

The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.

There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed-upon master node first, or (3) data updates are sent to a single (arbitrary) node first. Each of these three cases can be implemented in various ways; however, each implementation comes with its own consistency-latency tradeoff. This is described in detail below.

  1. Data updates are sent to all replicas at the same time. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming there are multiple updates to the system that are submitted concurrently, e.g., from different clients), since each replica might choose a different order in which to apply the updates. On the other hand, if updates are first passed through a preprocessing layer, or all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas will agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:

    1. Routing updates through an additional system component (the preprocessor) increases latency

    2. The preprocessor either consists of multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across machines. Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world), are forced to route all the way to the single preprocessor first, even if there is a data replica that is nearer to the update initiation location.

  2. Data updates are sent to an agreed upon location first (this location can be dependent on the actual data being updated) --- we will call this the “master node” for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks to perform these updates will determine the order that all replicas perform the updates. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:

    1. The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially if this occurs over a WAN) increase latency, due to the requirement to pass messages between these entities and the fact that latency is limited by the speed of the slowest entity.

    2. The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. The consistency-latency tradeoff in this case is dependent on how reads are dealt with:
      1. If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:
        1. Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node which could potentially be physically much farther away.

        2. If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to lack of load balancing options.

      2. If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement “sequential/timeline consistency” or “read-your-writes consistency”, these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.

    3. A combination of (a) and (b) is possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case again is determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W <= N), then inconsistent reads are possible, as in (b)(ii) above.

  3. Data updates are sent to an arbitrary location first, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. The consistency-latency tradeoff again depends on two options:
    1. If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.

    2. If replication is done asynchronously, then similar consistency problems as described in cases (1) and (2)(b) above present themselves.
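The quorum arithmetic in case (2)(c) can be checked directly: when R + W > N, every read quorum must intersect every write quorum, so at least one node in any read set holds the latest synchronously written value. A small brute-force sketch (illustrative only):

```python
import itertools

def quorums_overlap(n, r, w):
    # Exhaustively check that every read quorum of size r intersects
    # every write quorum of size w among n replicas.
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in itertools.combinations(nodes, r)
               for wq in itertools.combinations(nodes, w))

# R + W > N: overlap guaranteed, so reads always see the latest write.
assert quorums_overlap(n=3, r=2, w=2)
# R + W <= N: some read quorum can miss every synchronously written node.
assert not quorums_overlap(n=3, r=1, w=2)
```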

Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exist reasonable options (e.g. choice (2)(a) above, since network communication latency is small in local data centers); however, for replication over a WAN, there is no way around the significant consistency-latency tradeoff.

To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes, with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. HBase chooses (2)(a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).
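The sequence-number bounding mentioned in case (2)(b)(ii) above (e.g. for read-your-writes consistency) can be sketched as a toy model --- this is not any particular system's actual implementation:

```python
class Replica:
    def __init__(self):
        self.value = None
        self.seq = 0  # highest update sequence number applied so far

    def apply(self, seq, value):
        # Apply updates in sequence-number order, ignoring stale ones.
        if seq > self.seq:
            self.seq, self.value = seq, value

def read_your_writes(replica, min_seq):
    # Serve the read only if this replica has caught up to the client's
    # last observed write; otherwise the client must try another replica.
    if replica.seq >= min_seq:
        return replica.value
    return None  # replica is too stale for this client

fresh, stale = Replica(), Replica()
fresh.apply(2, "v2")
stale.apply(1, "v1")
# A client whose last write had sequence number 2 rejects the stale replica.
assert read_your_writes(fresh, min_seq=2) == "v2"
assert read_your_writes(stale, min_seq=2) is None
```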

In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect on the baseline operations of modern distributed database systems.

Tuesday, October 4, 2011

Overview of the Oracle NoSQL Database

Oracle is the clear market leader in the commercial database community, and therefore it is critical for any member of the database community to pay close attention to the new product announcements coming out of Oracle’s annual Open World conference. The sheer size of Oracle’s sales force, entrenched customer base, and third-party ecosystem instantly gives any new Oracle product the potential for very high impact; Oracle’s new products require significant attention simply because they’re made by Oracle.

I was particularly eager for this year’s Oracle Open World conference, because there were rumors of two separate new Oracle products involving Hadoop and NoSQL --- two of the central research focuses of my database group at Yale --- one of them (Hadoop) also being the focus of my recent startup (Hadapt). Oracle’s Hadoop announcements, while very interesting from a business perspective (everyone is talking about how this “validates” Hadoop), are not so interesting from a technical perspective (the announcements seem to revolve around (1) creating a “connector” between Hadoop and Oracle, where Hadoop is used for ETL tasks, and the output of these tasks are then loaded over this connector to the Oracle DBMS and (2) packaging the whole thing into an appliance, which again is very important from a business perspective since there is certainly a market for anything that makes Hadoop easier to use, but does not seem to be introducing any technically interesting new contributions).

In contrast, the Oracle NoSQL database is actually a brand new system built by the Oracle BerkeleyDB team, and is therefore very interesting from a technical perspective. I therefore spent way too much time trying to find out as much as I could about this new system from a variety of sources. There is not yet a lot of publicly available information about the system; however there is a useful whitepaper written by the illustrious Harvard professor Margo Seltzer, who has been working with Oracle since they acquired her start-up in 2006 (the aforementioned BerkeleyDB).

Due to the dearth of available information on the system, I thought that it would be helpful to the readers of my blog if I provided an overview of what I’ve learned about it so far. Some of the facts I state below have been stated directly by Oracle; other facts are inferences that I’ve made, based on my understanding of the system architecture and implementation. As always, if I have made any mistakes in my inferences, please let me know, and I will fix them as soon as possible.

The coolest thing about the Oracle NoSQL database is that it is not a simple copy of a currently existing NoSQL system. It is not Dynamo or SimpleDB. It is not Bigtable or HBase. It is not Cassandra or Riak. It is not MongoDB or CouchDB. It is a new system that has chosen a different point (actually --- several different points) in the system-design tradeoff space than any of the above mentioned systems. Since it makes a different set of tradeoffs, it is entirely inappropriate to call it “better” or “worse” than any of these systems. There will be situations where the Oracle solution will be more appropriate, and there will be situations where other systems will be more appropriate.

Overview of the system:
Oracle NoSQL database is a distributed, replicated key-value store. Given a cluster of machines (in a shared-nothing architecture, with each machine having its own storage, CPU, and memory), each key-value pair is placed on several of these machines depending on the result of a hash function on the key. In particular, the key-value pair will be placed on a single master node, and a configurable number of replica nodes. All write and update operations for a key-value pair go to the master node for that pair first, and then all replica nodes afterwards. This replication is typically done asynchronously, but it is possible to request that it be done synchronously if one is willing to tolerate the higher latency costs. Read operations can go to any node if the user doesn’t mind incomplete consistency guarantees (i.e. reads might not see the most recent data), but they must be served from the master node if the user requires the most recent value for a data item (unless replication is done synchronously). There is no SQL interface (it is a NoSQL system after all!) --- rather it supports simple insert, update, and delete operations of key-value pairs.
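As a rough sketch of the placement scheme described above (the hash function and replica-selection rule here are my own illustrative assumptions, not Oracle's actual algorithm):

```python
import hashlib

def placement(key, nodes, num_replicas=2):
    # Hash the key to pick a master node; replicas are the next nodes
    # around the ring. Illustrative only --- not Oracle's actual code.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    master = h % len(nodes)
    replicas = [(master + i) % len(nodes) for i in range(1, num_replicas + 1)]
    return nodes[master], [nodes[r] for r in replicas]

nodes = ["node0", "node1", "node2", "node3"]
master, replicas = placement("user:42", nodes)
# All writes for "user:42" deterministically route to the same master first.
assert (master, replicas) == placement("user:42", nodes)
```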

The following is where the Oracle NoSQL Database falls in various key dimensions:

Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP. In particular, if writes are configured to be performed synchronously to all replicas, it is C/P in CAP --- a partition or node failure causes the system to be unavailable for writes. If replication is performed asynchronously, and reads are configured to be served from any replica, it is A/P in CAP --- the system is always available, but there is no guarantee of consistency. [Edit: Actually this configuration is really just P of CAP --- minority partitions become unavailable for writes (see comments about eventual consistency below). This violates the technical definition of "availability" in CAP. However, it is obviously the case that the system still has more availability in this case than the synchronous write configuration.]

Eventual consistency
Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency. I found this to be extremely amusing, since Oracle’s marketing material associates NoSQL with the BASE acronym. But the E in BASE stands for eventual consistency! So by Oracle’s own definition, their lack of support for eventual consistency means that their NoSQL Database is not actually a NoSQL Database! (In my opinion, their database is really NoSQL --- they just need to fix their marketing literature that associates NoSQL with BASE).

My proof for why the Oracle NoSQL Database does not support eventual consistency is the following: Let’s say the master node for a particular key-value pair fails, or a network partition separates the master node from its replica nodes. The key-value pair becomes unavailable for writes for a short time until the system elects a new master node from the replicas. Writes can then continue at the new master node. However, any writes that had been submitted to the old master node, but had not yet been sent to the replicas before the master node failure (or partition), are lost. In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired. Of course, if replication had been configured to be done synchronously (at a cost of latency), there will not be data loss during network partitions or node failures. Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to trade off consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off durability for latency and availability.
To be clear, this difference is only for inserts and updates --- the Oracle NoSQL database is able to trade off consistency for latency on read requests --- it supports similar types of timeline consistency tradeoffs as the Yahoo PNUTS/Sherpa system.
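The failover scenario in the argument above can be reduced to a toy model (entirely illustrative --- neither Oracle's nor anyone else's actual recovery code):

```python
# Two writes arrive at the master under asynchronous replication;
# only the first reaches the replica before the master fails.
master_log = ["w1", "w2"]
replica_log = ["w1"]

# Failover: the replica is elected master. In the behavior described
# above, "w2" is simply gone --- the old master's log is never
# reconciled with the new timeline.
new_master_state = list(replica_log)
assert "w2" not in new_master_state

# An eventually consistent system would instead merge the recovered
# log when the old master comes back, so "w2" would reappear.
eventually_consistent_state = sorted(set(master_log) | set(replica_log))
assert eventually_consistent_state == ["w1", "w2"]
```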

[Two of the members of the Oracle NoSQL Database team have commented below. There is a little bit of a debate about my statement that the Oracle NoSQL Database lacks eventual consistency, but I stand by the text I wrote above. For more, see the comments.]

Like most NoSQL systems, the Oracle NoSQL database does not support joins. It only supports simple read, write, update, and delete operations on key-value pairs.

Data Model
The Oracle NoSQL database actually has a more subtle data model than simple key-value pairs. In particular, the key is broken down into a “major key path” and “minor key path” where all keys with the same “major key path” are guaranteed to be stored on the same physical node. I expect that the way minor keys will be used in the Oracle NoSQL database will map directly to the way column families are used in Bigtable, HBase and Cassandra. Rather than trying to gather together every possible attribute about a key in a giant “value” for the single key-value pair, you can separate them into separate key-value pairs where the “major key path” is the same for all the keys in the set of key-value pairs, but the “minor key path” will be different. This is similar to how column families for the same key in Bigtable, HBase, and Cassandra can also be stored separately. Personally, I find the major and minor key path model to be more elegant than the column family model (I have ranted against column-families in the past).

ACID compliance
Like most NoSQL systems, the Oracle NoSQL database is not ACID compliant. Besides the durability and consistency tradeoffs mentioned above, the Oracle NoSQL database also does not support arbitrary atomic transactions (the A in ACID). However, it does support atomic operations on the same key, and even allows atomic transactions on sets of keys that share the same major key path (since keys that share the same major key path are guaranteed to be stored on the same node, atomic operations can be performed without having to worry about distributed commit protocols across multiple machines).
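A toy sketch of why shared major key paths make multi-key atomicity cheap (the hashing scheme here is an illustrative assumption, not Oracle's implementation):

```python
def node_for(major_path, num_nodes):
    # All keys with the same major key path hash to the same node,
    # so multi-key operations within a major path stay single-node.
    return hash(major_path) % num_nodes

# Keys are (major key path, minor key path) pairs --- e.g. one pair per
# attribute, much like column families in Bigtable/HBase/Cassandra.
keys = [("user/42", "profile/name"), ("user/42", "profile/email")]
nodes = {node_for(major, 8) for major, _minor in keys}

# One node holds every key under "user/42", so an atomic transaction over
# both keys needs no distributed commit protocol.
assert len(nodes) == 1
```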

The sweet spot for the Oracle NoSQL database seems to be in single-rack deployments (e.g. the Oracle Big Data appliance) with a low-latency network, so that the system can be set up to use synchronous replication while keeping the latency costs of this type of replication small (and the probability of network partitions is small). Another sweet spot is wider-area deployments where the application is able to work around reduced durability guarantees. It therefore seems to present the largest amount of competition for NoSQL databases like MongoDB, which have similar sweet spots. However, the Oracle NoSQL database will need to add additional “developer-friendly” features if it wants to compete head-to-head with MongoDB. Either way, there are clearly situations where the Oracle NoSQL database will be a great fit, and I love that Oracle (in particular, the Oracle BerkeleyDB team) built this system from scratch as an interesting and technically distinct alternative to currently available NoSQL systems. I hope Oracle continues to invest in the system and the team behind it.

Tuesday, July 19, 2011

Hadoop's tremendous inefficiency on graph data management (and how to avoid it)

Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.

The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care, because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code, and (b) they can convince themselves that since everyone talks about Hadoop as being designed for "batch data processing" anyway, they can let their process run in the background and not care about how long it will take to return.

Although not the subject of this post, an example of this inefficiency can be found in a SIGMOD paper that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why Hadapt was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.

Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 more power utilization, more carbon emissions, more data center space, and more silicon waste. The disastrous environmental consequences of a world where everyone standardizes on incredibly inefficient technology are downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et al. that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's Website. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:

  1. Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.

  2. Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.

  3. Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.
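The locality argument in points (1) and (2) can be illustrated by counting cut edges --- edges whose endpoints land on different machines, each of which implies network communication during sub-graph pattern matching. A toy sketch (this is not our actual partitioning algorithm, which uses a real graph partitioner):

```python
def cut_edges(edges, assign):
    # Count edges whose endpoints are assigned to different machines.
    return sum(1 for u, v in edges if assign[u] != assign[v])

# A graph of two tight triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

hash_assign = {v: v % 2 for v in range(6)}               # hash partitioning
graph_assign = {v: 0 if v < 3 else 1 for v in range(6)}  # locality-aware

assert cut_edges(edges, graph_assign) == 1  # only the bridge edge is cut
assert cut_edges(edges, hash_assign) > cut_edges(edges, graph_assign)
```

Replicating the two bridge endpoints (vertexes 2 and 3) onto each other's machines, as in point (2), would eliminate even that remaining cut edge for traversal purposes.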

To a first degree of approximation, each of the above three improvements yields an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, please see our paper.)

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs that people tend to use systems based on Google's Pregel project today).

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties). Certainly once the inefficiency extends well beyond a factor of two (such as the enormous factor of 1340 we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however, it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.