Wednesday, December 18, 2019

It's time to rethink how we share data on the Web

The world wide web (WWW) has had an extraordinary impact on our day-to-day lives. An enormous amount of information is available to any participant at extremely low cost (usually this cost is paid via one’s attention to advertisements). However, the interface is fundamentally limited. A user must either have pre-existing knowledge of the location of the needed information (e.g., the correct URL), or use a search interface, which generally attempts to match words in a search query with the natural language found on the web. It is not possible to query the entire Internet with a single SQL query (or any other structured query language), and even if it were, the data available on the WWW is not published in a format amenable to such queries. Even for those websites that provide an API to access structured data, the data is typically provided in JSON format, which is orders of magnitude slower to process than native relational data formats, and is usually not interoperable with similar datasets provided by other websites.

A small group of researchers and practitioners are today releasing a vision for a complete redesign of how we share structured data on the Internet. This vision is outlined in a 13-page paper that will appear next month at CIDR, a vision-oriented conference for data systems. The paper proposes an architecture for a completely decentralized, ownerless platform for sharing structured data. This platform aims to enable a new WWW for structured data (e.g., data that fits in rows and columns of relational tables), with an initial focus on IoT data. Anybody can publish structured data using their preferred schema, and they retain the ability to specify the permissions on that data. Some data will be published with open access, in which case it will be queryable by any user of the platform. Other data will be published in encrypted form, in which case only users with access to the decryption key may access query results.
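To give a flavor of the encrypted option, here is a minimal sketch, assuming simple symmetric encryption (Fernet, from Python’s `cryptography` package) with keys shared out of band; the platform’s actual key management and its machinery for answering queries over protected data are more involved (see the CIDR paper):

```python
# A minimal sketch of the encrypted-publication idea, assuming symmetric
# Fernet encryption from the `cryptography` package; the platform's real
# key management and secure query machinery are more sophisticated.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # shared out of band with authorized consumers
record = b'{"turbine": "T-7", "malfunction_code": 13}'

ciphertext = Fernet(key).encrypt(record)  # this is what gets published

# Anyone on the platform can fetch the ciphertext, but only key holders
# can recover the underlying record (and thus meaningful query results).
print(Fernet(key).decrypt(ciphertext))
```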

The platform is designed to make it easy for users not only to publish IoT datasets (or any other type of structured data), but also to be rewarded every time the data that they publish is queried. The platform provides a SQL interface that supports querying the entire wealth of previously published data. Questions such as: “What was the maximum temperature reported in Palo Alto on June 21, 2008?” or “What was the difference in near accidents between self-driving cars that used deep-learning model X vs. self-driving cars that used deep-learning model Y?” or “How many cars passed the toll bridge in the last hour?” or “How many malfunctions were reported by a turbine of a particular model in all deployments in the last year?” can all be expressed as clean and clearly specified SQL queries over the data published in the platform from many different data sources.
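For instance, the first question above might be expressed roughly as follows. The schema and sample data here are invented for illustration, and the query runs against a local in-memory SQLite table rather than the actual platform:

```python
# A self-contained sketch of the kind of SQL the platform would accept for
# the first question above. The weather_readings schema and the sample rows
# are invented for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather_readings (city TEXT, reading_date TEXT, temp_f REAL)"
)
conn.executemany(
    "INSERT INTO weather_readings VALUES (?, ?, ?)",
    [("Palo Alto", "2008-06-21", 92.1),
     ("Palo Alto", "2008-06-21", 88.4),
     ("San Jose",  "2008-06-21", 95.0)],
)

# The question "What was the maximum temperature reported in Palo Alto on
# June 21, 2008?" as a SQL query over the published readings.
(max_temp,) = conn.execute(
    "SELECT MAX(temp_f) FROM weather_readings "
    "WHERE city = 'Palo Alto' AND reading_date = '2008-06-21'"
).fetchone()
print(max_temp)  # -> 92.1
```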

The Internet of Things was chosen as the initial use case for this vision since the data is machine-generated and usually requires less cleaning than human-generated data. Furthermore, there is a limited number of distinct device types, each typically with many deployed instances. Every instance of a device type (running a particular software version) produces data according to the same schema, usually for a long period of time. This reduces the complexity of the data integration problem. In many cases, device manufacturers can also include digital signatures that are sent along with any data generated by a device. These signatures can be used to verify that the data was generated by a known manufacturer, thereby reducing the ability of publishers to profit from contributing “fake data” to the platform.
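As an illustration of how such signatures might work (the paper does not prescribe a particular scheme), here is a sketch using Ed25519 signatures from Python’s `cryptography` package; the key provisioning and payload format are assumptions:

```python
# A sketch of manufacturer-signed device data. Ed25519 signatures, the key
# handling, and the payload format are illustrative assumptions.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# In practice the signing key would be provisioned at the factory and the
# manufacturer's public key published; we generate one here for the example.
manufacturer_key = Ed25519PrivateKey.generate()
public_key = manufacturer_key.public_key()

reading = b'{"device": "sensor-42", "temp_f": 92.1, "ts": "2008-06-21T14:00Z"}'
signature = manufacturer_key.sign(reading)  # sent along with the reading

# The platform (or any consumer) checks the reading against the
# manufacturer's public key before accepting it.
try:
    public_key.verify(signature, reading)
    print("reading accepted: signed by a known manufacturer")
except InvalidSignature:
    print("reading rejected: possible fake data")
```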

As alluded to above, publishers receive a financial reward every time the data that they contributed participates in a query result. This reward accomplishes three important goals: (1) it motivates data owners to contribute their data to the platform; (2) it motivates data owners to make their data public (since public data will be queried more often than private data); and (3) it motivates data owners to publish their data using an existing schema instead of creating a new one.

The first goal is an important departure from the WWW, where data contributors are motivated by the fame and fortune that come with bringing people directly to their website. Monetizing this web traffic through ad revenue disincentivizes interoperability since providing access to the data through a standardized API reduces the data owner’s ability to serve advertisements. Instead, the proposed architecture enables data contributors to monetize data through a SQL interface that can answer queries from any source succinctly and directly. Making this data public, the second goal, increases the potential for monetization.

The third goal is a critical one for structured data: the data integration problem is best approached at the source—at the time that the data is generated rather than at query time. The proposed architecture aims to incentivize data integration prior to data publication by allowing free market forces to generate consensus on a small number of economically viable schemas per application domain. Of course, this incentivization does not completely solve the data integration problem, but we expect the platform to be useful for numerous application domains even when large amounts of potentially relevant data must be ignored at query time due to data integration challenges.

Since the platform is fully decentralized, anybody can create an interface to the data on it, both for humans and for machines. We envision that a typical human-oriented interface would look like the following: users are presented with a faceted interface that helps them choose from a limited number of application domains. Once the domain is chosen, the user is presented with another faceted interface that enables the user to construct selection predicates (to narrow the focus of the data that the user is interested in within that domain). After this is complete, one schema is selected from all of the registered schemas for that domain, based on which datasets published using that schema contain the most relevant data given the user’s predicates (see the sketch below). After the schema is chosen, the interface aids the user in creating a static or streaming SQL query over that schema. The entire set of data that was published using that schema, and to which the user who issued the query has access, is queried. The results are combined, aggregated, and returned to the user. Machine interfaces would likely skip most of these steps and instead query the platform directly in SQL (potentially after issuing some initial queries to access important metadata that enable the final SQL query to be formed).
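Here is a sketch of that schema-selection step. We assume the platform can report, for each registered schema in the chosen domain, how many published rows satisfy the user’s predicates per dataset; the actual relevance metric is an open design choice, so this total-row-count scoring rule is only an assumption:

```python
# A sketch of selecting one schema from the registered schemas of a domain.
# Input: for each schema, the per-dataset counts of rows matching the
# user's predicates (hypothetical metadata). The scoring rule (sum of
# matching rows) is an assumption; the paper leaves the metric open.
def pick_schema(matching_rows_by_schema: dict[str, list[int]]) -> str:
    # Score each schema by total matching rows across all datasets
    # published with it, and pick the best-covered schema.
    return max(matching_rows_by_schema,
               key=lambda schema: sum(matching_rows_by_schema[schema]))

matching_rows = {
    "weather_v1": [12_000, 3_500],  # two datasets use schema weather_v1
    "weather_v2": [90_000],         # one larger dataset uses weather_v2
}
print(pick_schema(matching_rows))   # -> weather_v2
```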

The proposed architecture incorporates third-party contractors and coordinators for storing and providing query access to data. Contractors and coordinators act as middlemen between data publishers and consumers. This allows publishers to meaningfully participate in the network without having to provide resources for storage and processing of data. This also facilitates managing data at the edge.

While contractors and coordinators make the system easier to use for publishers, their presence in the architecture presents two challenges: (1) how to incentivize them to participate, and (2) how to preserve the integrity of data and query results when untrusted and potentially malicious entities are involved in storage and processing. The CIDR paper proposes an infrastructure that addresses both of these challenges.

To summarize these solutions briefly: contractors and coordinators are incentivized similarly to publishers, via a financial reward for every query they serve. Querying the platform requires a small payment of tokens (a human-facing interface may serve advertisements to subsidize this token cost). These payment tokens are shared among the publishers that contributed data returned by the query and the contractors and coordinators that were involved in processing that query. The financial reward received per query incentivizes contractors and coordinators to participate in query processing. However, it does not ensure that their participation is honest and that correct query results are returned. In fact, without safeguards, contractors and coordinators could make more money by not expending local resources on query processing and instead returning half-baked answers to query requests.
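To make the settlement side concrete, here is a sketch of how the tokens paid for one query might be divided. The 50/50 split between publishers and middlemen, and the rule of rewarding publishers in proportion to the rows they contributed, are assumptions for illustration rather than the paper’s actual mechanism:

```python
# A sketch of per-query token settlement. The split fractions and the
# rows-contributed weighting are illustrative assumptions; the paper
# defines the actual mechanism.
def settle_query(payment_tokens: float,
                 rows_by_publisher: dict[str, int],
                 contractors: list[str],
                 coordinators: list[str]) -> dict[str, float]:
    rewards: dict[str, float] = {}
    # Assumed split: half to publishers, half to the middlemen who
    # stored the data and processed the query.
    publisher_pool = payment_tokens * 0.5
    middlemen_pool = payment_tokens - publisher_pool

    # Publishers are paid in proportion to the rows they contributed
    # to the query result.
    total_rows = sum(rows_by_publisher.values())
    for pub, rows in rows_by_publisher.items():
        rewards[pub] = publisher_pool * rows / total_rows

    # Contractors and coordinators split their pool evenly.
    middlemen = contractors + coordinators
    for m in middlemen:
        rewards[m] = rewards.get(m, 0.0) + middlemen_pool / len(middlemen)
    return rewards

print(settle_query(10.0, {"pubA": 800, "pubB": 200}, ["c1"], ["k1"]))
# -> {'pubA': 4.0, 'pubB': 1.0, 'c1': 2.5, 'k1': 2.5}
```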

Indeed, one of the main obstacles to building decentralized database systems like the one we are proposing is how to ensure the confidentiality, integrity, and availability of data, query results, and payment/incentive processing when the participants in the system are mutually distrustful and no universally trusted third party exists. Until relatively recently, the security mechanisms necessary for building such a system did not exist, were too inefficient, or were unable to scale. Today, we believe recent advances in secure query processing, blockchain, Byzantine agreement, and trusted execution environments put secure decentralized database systems within reach. The proposed infrastructure uses a combination of these mechanisms to secure data and computation within the system. For more details, please take a look at the CIDR paper!

I have a student, Gang Liao, who recently started building a prototype of the platform (research codebase at: https://github.com/DSLAM-UMD/P2PDB). Please contact us if you have IoT data you can contribute to our research prototype. Separate from this academic effort, there is also a company called AnyLog that has taken some of the ideas from the research paper and is building a version of the platform that is not fully decentralized.