Wednesday, August 11, 2010

Defending Oracle Exadata

I recently came across a whitepaper from Teradata, written by one of its senior consultants, Richard Burns. It is a very well-written piece, and contains one of the best overviews of Exadata I’ve seen. I did not notice any obvious inaccuracies in the description of Exadata itself, and even the anti-Exadata arguments (presented after the overview), though at times biased and misleading, do not contain many clear factual errors. Hence, it is quite a professionally done whitepaper, even though it is devoted to attacking a competitor. Reading it will probably make you smarter.

That said, even though the facts are more or less correct, the inferences that are made from these facts are certainly up for debate, and I feel an urge to defend Exadata against some of these allegations, even though I have no personal stake in either side of the Exadata-Teradata feud.

I will not give another overview of Exadata in this blog post. If you are not familiar with Exadata, I encourage you to read the overview in the Teradata whitepaper, or Oracle’s own marketing material. I have covered the columnar compression feature separately on this blog. Hence, I will jump straight to the arguments that Teradata makes against Exadata, and respond to each one in turn:

“Exadata is NOT Intelligent Storage; Exadata is NOT Shared-Nothing”

Teradata argues that since Exadata storage only performs selections, projections, and some basic joins, while the rest of the query must be performed in the Oracle database server sitting above Exadata storage (which is typically Oracle RAC), the architecture is a whole lot closer to shared-disk than shared-nothing. Factually, this is correct. Exadata storage is indeed shared-nothing, but since only very basic query operations are performed there, it is fair to view the system as Oracle RAC treating Exadata storage as a shared disk.

However, it is one thing to point out that Exadata is closer to shared-disk than shared-nothing, but quite another to claim that, as a result, “Exadata does nothing to reduce or eliminate the structural contention for shared resources that fundamentally limits the scalability –of data, users, and workload – of Oracle data warehouses.” This statement is incorrect and unfair. Yes, it is true that contention for shared resources is the source of scalability problems in analytical database systems, and that this is why shared-nothing is widely believed to scale the best (because each compute node contains its own CPU, memory, and disk, so neither disk nor memory is shared across the cluster). But shared-disk is very similar to shared-nothing. The only difference is the shared access to the disk storage system. If you are going to argue that shared-disk causes scalability problems, you have to argue that contention for the one shared resource in a shared-disk system is high enough to cause a performance bottleneck --- namely, that the network connection between the servers and the shared disk is a bottleneck.

At no point in the entire 18-page whitepaper did Teradata make the argument that the Infiniband connection between Exadata storage and the database servers is a bottleneck. Furthermore, even if you believe that there is a bottleneck in this connection, you still must admit that by doing some of the filtering in the Exadata storage layer, some of this bottleneck is alleviated. Hence, it is entirely inaccurate to say “Exadata does nothing to reduce or eliminate the structural contention for shared resources ….” --- at the very least it does something by doing this filtering.

In fact, I think that the scalability differences between shared-nothing and shared-disk are overblown (I’m not arguing that they scale equally well, just that the gap between them is not as large as people think, even if filtering is not pushed down to the shared disk as in Exadata). This was eloquently explained by Luke Lonergan from Greenplum at the UW/MSR retreat on cloud computing in the first week of August. In essence, he argued that thanks to 10 Gigabit Ethernet, Fibre Channel over Ethernet, and a general flattening of the network into two-tier switching designs, it takes an enormous number of disks to cause the network to become a bottleneck. Furthermore, with 40 Gigabit Ethernet around the corner, and 100 Gigabit Ethernet on its way, the network is becoming even less of a bottleneck. And by the way, shared-disk has a variety of advantages that shared-nothing does not, including the ability to move around virtual machines executing database operators for improved load balancing and fault tolerance. (This would be a good time to point out that Greenplum was recently acquired by EMC, so one obviously has to be aware of a pro-shared-disk bias, but I found the argument quite compelling.)
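
To make the bandwidth argument concrete, here is a minimal back-of-envelope sketch (my numbers, not from either vendor), assuming the same ~120MB/sec per-disk sequential rate that Teradata’s own concurrency math uses later in the paper. It estimates how many disks it takes for aggregate scan bandwidth to fill a single network link:

```python
# Back-of-envelope: how many disks saturate one link between the
# database servers and shared storage? (Assumed figures, for illustration.)

def disks_to_saturate(link_gbit_per_s, disk_mb_per_s=120):
    """Number of disks whose combined sequential bandwidth fills the link."""
    link_mb_per_s = link_gbit_per_s * 1000 / 8  # Gbit/s -> MB/s (decimal units)
    return link_mb_per_s / disk_mb_per_s

for link in (10, 40, 100):  # 10GbE today; 40GbE and 100GbE on the way
    print(f"{link:>3} Gbit/s link ~ {disks_to_saturate(link):.0f} disks")
# 10 Gbit/s ~ 10 disks; 40 Gbit/s ~ 42 disks; 100 Gbit/s ~ 104 disks
```

That is per link; with multiple links per server (and a fat InfiniBand fabric in Exadata’s case), it indeed takes a very large number of disks before the shared network, rather than the disks themselves, becomes the bottleneck.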

Teradata also attempts to argue that the striping of data across all available disks on every Exadata cell (using Oracle’s Automatic Storage Management, ASM) causes “potential contention on each disk for disk head location and I/O bandwidth” when many DBMS workers running in parallel for the same query request data from the same set of disks. However, the paper does not point out until later that Exadata defaults to 4MB chunks, and 4MB transfers should easily amortize the disk seek costs across multiple worker requests.
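
A rough sketch of the amortization math, under assumed figures (~8ms combined seek and rotational delay, 120MB/sec sequential transfer rate), shows how quickly large chunks win:

```python
# Fraction of time a disk spends transferring data (vs. seeking) when many
# interleaved requests each pay a seek before reading one chunk.
# Assumed figures: ~8ms seek + rotational delay, 120MB/s transfer rate.

def disk_efficiency(chunk_mb, seek_ms=8.0, transfer_mb_per_s=120.0):
    transfer_ms = chunk_mb / transfer_mb_per_s * 1000
    return transfer_ms / (seek_ms + transfer_ms)

for chunk_mb in (0.008, 0.064, 1.0, 4.0):  # 8KB page up to Exadata's 4MB
    print(f"{chunk_mb:>6} MB chunk: {disk_efficiency(chunk_mb):5.1%} transferring")
# 8KB: ~1%; 64KB: ~6%; 1MB: ~51%; 4MB: ~81%
```

At 4MB per request, roughly 80% of the disk’s time goes to useful transfer even if every request incurs a full seek, which is why striping with large chunks does not collapse under parallel workers.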

Exadata does NOT Enable High Concurrency

If I were to summarize this section, I would say that Teradata is basically arguing that the default block size in Exadata is too large, and that this reduces concurrency, since 30 concurrent I/Os per second at a 4MB block size saturate a disk’s maximum 120MB/sec bandwidth. This argument is clouded by the fact that for scan-based workloads, any database system would have the same concurrency limits. Hence, the only place where Exadata’s large block size would reduce concurrency relative to alternative systems (such as Teradata) is for non-scan-based workloads with a lot of tuple lookups and random I/O. Oracle would argue that its memory and flash cache would take care of the tuple lookups and needle-in-a-haystack type queries. Teradata, in turn, argues that the cache is far too small to completely take care of this problem. This argument is reasonable, but I do believe that the bulk of a data warehouse is historical data, and that tuple lookups and non-scan-based queries only touch a much smaller portion of the more recent data, so caching should do a decent job. But this is definitely a “your mileage will vary” type argument, with the effectiveness of the cache highly dependent on particular data sets and workloads.
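
The arithmetic on both sides is simple enough to write down. A short sketch reproducing Teradata’s figure also shows why, for pure scans, the limit is the disk’s bandwidth rather than Exadata’s block size:

```python
# Teradata's figure: with 4MB blocks, 30 concurrent I/Os per second fill
# one disk (figures from the whitepaper: 4MB blocks, ~120MB/s per disk).

disk_bw_mb = 120   # sequential bandwidth of one disk, MB/s
block_mb = 4       # Exadata's default I/O size

print(f"{disk_bw_mb / block_mb:.0f} x {block_mb}MB I/Os per second saturate one disk")

# But for scan-heavy workloads every system hits the same wall: N concurrent
# scans over a 120MB/s disk get ~120/N MB/s each, no matter whether the unit
# of I/O is 4MB or 64KB. Block size only hurts small random lookups, where a
# 4MB read wastes bandwidth fetching data the query does not need.
for n in (1, 10, 30):
    print(f"{n:>2} concurrent scans -> {disk_bw_mb / n:.0f} MB/s each")
```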

Exadata does NOT Support Active Data Warehousing

Teradata points out that the process of checking which version of the data is the correct version to return for a particular query (used in Oracle’s MVCC concurrency control scheme) is performed inside the database servers, so Exadata storage cannot do its typical filtering of the data in the storage layer for actively updated tables (since there might be multiple potentially correct versions of the data to return). Teradata therefore concludes: “While Exadata still performs parallel I/O for the query, the largest benefit provided by Exadata, early filtering of columns and rows meeting the query specification, which may drastically reduce query data volume, is not useful for tables or partitions being actively updated.” While this might be true, the impact of this issue is overstated. The data that is actively updated is typically a tiny percentage of the total data set. The historical data, and the data that has been recently appended (but not updated), will not suffer from this problem. Hence, having less than optimal I/O performance for this tiny fraction of the data is not a big deal.
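
To see why the version check has to happen above the storage layer, here is a toy sketch of MVCC snapshot visibility. This is emphatically not Oracle’s actual undo-based implementation, just the general idea: deciding which version of a row a query may see requires the querying transaction’s snapshot, which lives in the database server, not in storage:

```python
# Toy MVCC visibility check (illustrative only, not Oracle's mechanism).
from dataclasses import dataclass

@dataclass
class RowVersion:
    key: int
    value: str
    created_by: int   # id of the transaction that wrote this version
    deleted_by: int   # id of the transaction that superseded it (0 = live)

def visible(v, snapshot_txn, committed):
    """A version is visible if it was created by a committed transaction no
    later than our snapshot, and not superseded as of that snapshot."""
    created_ok = v.created_by in committed and v.created_by <= snapshot_txn
    not_superseded = (v.deleted_by == 0
                      or v.deleted_by > snapshot_txn
                      or v.deleted_by not in committed)
    return created_ok and not_superseded

versions = [RowVersion(1, "old", created_by=5, deleted_by=9),
            RowVersion(1, "new", created_by=9, deleted_by=0)]
committed = {5, 9}
print([v.value for v in versions if visible(v, snapshot_txn=7, committed=committed)])
# -> ['old']: a query whose snapshot predates transaction 9 must see the old
# version, even though the new one is already on disk.
```

Without the snapshot and transaction status information, the storage layer cannot safely discard either version, so it must ship both up to the server.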

Exadata does NOT Provide Superior Query Performance

Teradata points out that since Exadata can only perform basic operations in the storage layer (selection, projection, some simple joins), as a query gets more complex, more and more of it is performed in the database server instead of in Exadata storage. Teradata gives an example of a simple query, where Oracle can perform 28% of the query steps inside Exadata, and a more complex one, where Oracle can perform only 22% of the query steps inside Exadata. Again, this is factually correct, but it is misleading to assume that the speedup you get from Exadata is linearly correlated with the percentage of steps performed within Exadata. For database performance, it’s all about bottlenecks, and eliminating a bottleneck can have a disproportionate effect on query performance. In scan-based workloads, disk I/O is often the bottleneck, and Exadata alleviates this bottleneck equally well for both simple and complex queries. Hence, while the benefit of Exadata does decrease for complex queries, it is misleading to assume that this benefit decreases linearly with complexity.
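
A simple model with invented numbers makes the point: if the scan dominates a query’s runtime, alleviating the I/O bottleneck yields a large speedup even when most of the query’s steps stay in the database server:

```python
# Sketch with made-up numbers: offload benefit vs. query complexity.

def query_time(io_s, cpu_s, selectivity):
    """Total time if storage-side filtering cuts the data shipped (and hence
    the effective I/O cost) to `selectivity` of the original."""
    return io_s * selectivity + cpu_s

# Both queries scan the same table (100s of I/O); the complex one does far
# more server-side work (joins, aggregation). Filtering keeps 10% of the data.
for label, cpu_s in (("simple", 10), ("complex", 40)):
    before = query_time(100, cpu_s, 1.0)
    after = query_time(100, cpu_s, 0.1)
    print(f"{label:>7}: {before:.0f}s -> {after:.0f}s ({before / after:.1f}x speedup)")
# simple: 110s -> 20s (5.5x); complex: 140s -> 50s (2.8x)
```

The complex query offloads a smaller share of its steps, and its speedup does shrink, but it is nowhere near proportional to the 28%-vs-22% step counts in Teradata’s example.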

Exadata is Complex; Exadata is Expensive

It is hard to argue with these points. However, it is amusing to note that Teradata is willing to point out in this section that it is only (approximately) 11% cheaper than Oracle, showing numbers such as Teradata costing $194K per terabyte. Both Oracle and Teradata are too expensive for large parts of the analytical database market.

Conclusion

The truth, as is usually the case, is somewhere in the middle, between the claims of Oracle and Teradata. Teradata is probably right when it asserts that “Exadata is far from the groundbreaking innovation that Oracle claims”, and that Oracle “throws a lot of hardware” at problems that are solvable in software, but many of the claims and inferences made in the paper about Exadata are overstated, and the reader needs to be careful not to be misled into believing in the existence of problems that don’t actually present themselves on realistic datasets and workloads.

6 comments:

  1. Companies leveraging Oracle Exadata's columnar compression and query offloading will have a distinct competitive advantage by being able to execute more queries per hour in large data warehouses.

    Winston Shirley

  2. I love to see competitors go after market perception with selected "advantages" and "disadvantages".
    While architecture is important, workload/throughput in real situations is absolutely critical.
    It's tough because at the high levels of "performance" required in complex analytical systems it is really hard to understand what matters in YOUR situation. And some queries that scale well in one environment may scale poorly in another.
    I have never been an Oracle booster - I think that the solutions are generally expensive for what you get, and the company is (rightly) aggressive in the marketplace. But then Teradata isn't my idea of great value either. At $194K per managed Terabyte (however that is derived), I think we are way over where it should be. Can we do better than that? Will Oracle and Teradata both do better deals? I certainly hope so. I would expect the $25k/Terabyte number to be more reasonable. But maybe the storage costs more than that.
    There's nothing wrong with getting the performance out of hardware (assuming that the problem is well enough understood to make it feasible and not to have to jettison hardware because the algorithms are wrong). Getting performance at the proper price is what matters.
    Teradata have had the market to themselves for a while. Good to see challenges coming at them from lots of dimensions. It keeps the space fresh and innovative.

  3. Daniel, have you looked at Fregel? Would like to hear your views.

  4. What about Microsoft Parallel Datawarehouse?

  5. Hi,

    Nice comparison/argument of Exadata & Teradata :-)

    -Cheers,
    Satya,
    http://satya-exadata.blogspot.com/2011/07/cellclicommandsexadata.html

  6. But shared-disk is very similar to shared-nothing. The only difference is the shared access to the disk storage system.....
    In fact, I think that the scalability differences between shared-nothing and shared-disk are overblown (I’m not arguing that they scale equally well, just that the gap between them is not as large as people think, even if filtering is not pushed down to the shared disk as in Exadata)

    Actually, the practical difference between shared-disk and shared-nothing is huge, as anybody who has installed Oracle RAC and Vertica, for example, knows. That's one of the reasons why you rarely see RAC clusters beyond just a couple of nodes. Shared disk requires a special setup that storage guys have a problem with, even on a simple 2-node cluster. You can easily install Vertica on AWS. Try installing RAC on AWS.
