I'm at SIGMOD this week, so I'll make this blog post quick:
A week and a half ago, I blogged on analytical database scalability. The basic thesis was:
(1) People are overly focused on the raw volume of data managed by the DBMS as an indicator of scalability.
(2) If you store a lot of data but lack the CPU power to process it at the speed it can be read off of disk, that is a potential indication of a scalability problem, not a scalability success.
I insinuated that:
(1) Scalability in terms of number of nodes might be a better measure than scalability in terms of data size (assuming, of course, that you can't run true scalability experiments as specified by the academic definition of scalability).
(2) Scaling the number of nodes is harder than people realize.
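To make point (2) of the thesis concrete, here is a back-of-envelope sketch of the disk/CPU balance question. The bandwidth numbers below are my own illustrative assumptions, not figures from any vendor:

```python
# How many cores does one node need so the CPUs keep up with the disks?
# Both rates below are assumed, round numbers for illustration only.
disk_bandwidth_mb_s = 1000   # assumed aggregate sequential scan rate per node
per_core_scan_mb_s = 250     # assumed rate at which one core can process raw data

cores_needed = disk_bandwidth_mb_s / per_core_scan_mb_s
print(cores_needed)  # with fewer cores than this, the disks outrun the CPUs
```

If a node ships with fewer cores than this ratio implies, adding disks (i.e., "scaling" raw data volume) just widens the gap between how fast data can be read and how fast it can be processed.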
I am not so presumptuous as to assume that my blog carries enough weight to affect the actions of Aster Data's marketing department, but I was nonetheless pleased to see Aster's new marketing material (which I believe went online just today), which specifically shows how their new appliance has more CPU processing power per unit of usable storage than a subset of their competitors. (I assume they either showed only the competitors they can beat on this metric, or their other competitors don't publish this number.)
I do want to point out one interesting thing, however, about Aster Data's scalability (at least what limited knowledge we can deduce from their marketing material). For 6.25TB, 12.5TB, 25TB, 50TB, and 100TB, Aster Data's appliance has 8, 16, 32, 63, and 125 worker nodes, and 64, 128, 256, 504, and 1000 processing cores, respectively. So it's essentially linear: double the amount of data, and the number of nodes and processing cores doubles too. But going from 100TB to 1PB of data (10X more data), they increase the number of worker nodes and processing cores by only about 2.6X (to 330 and 2640, respectively). So after 100TB/125 nodes, their processing power per unit of storage drops off a cliff. (Aster has a note for the 1PB configuration saying they assume 3.75X compression of user data at 1PB, but compression at 1PB shouldn't behave any differently than compression at 100TB; and if they are assuming uncompressed data at 100TB, then the comparison of processing power per storage space against other vendors is misleading, since everyone else compresses data.)
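To make the drop-off concrete, here is the cores-per-terabyte ratio implied by the configurations above (a quick sketch using only the figures from Aster's material, taking the 1PB usable-capacity number at face value):

```python
# (usable TB, worker nodes, processing cores) per Aster's published configurations
configs = [
    (6.25, 8, 64),
    (12.5, 16, 128),
    (25, 32, 256),
    (50, 63, 504),
    (100, 125, 1000),
    (1000, 330, 2640),  # 1PB tier; Aster's note assumes 3.75X compression here
]

for tb, nodes, cores in configs:
    # cores/TB holds steady at roughly 10 up through 100TB, then collapses
    print(f"{tb:7.2f} TB: {nodes:4d} nodes, {cores:5d} cores, "
          f"{cores / tb:5.2f} cores/TB")
```

Every configuration through 100TB delivers about 10 cores per terabyte; the 1PB configuration delivers about 2.6, a nearly 4X drop in processing power per unit of storage.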
Bottom line: scaling to 100 nodes is one thing (I know that Teradata, Vertica, and Greenplum can do this in addition to Aster Data, and I'm probably forgetting some other vendors). Scaling to 1000 nodes (while keeping per-node performance constant) is much harder. This might explain why Aster Data scales the number of nodes/cores by less than 3X for 10X more data once they get past 100 nodes.
I will have more on why it is difficult to scale to 1000 nodes in future posts.