The World Wide Web (WWW) has had an extraordinary impact
on our day-to-day lives. An enormous amount of information is
available to any participant at extremely low cost (usually this cost
is paid via one’s attention to advertisements). However, the interface
is fundamentally limited. A user must either have pre-existing
knowledge of the location of the needed information (e.g., the correct
URL), or use a search interface which generally attempts to
match words in a search query with the natural language found on
the web. It is impossible to query the entire Internet with a single SQL query (or a query in any other structured query language), and even if you could, the data available on the WWW is not published in a format amenable to such queries. Even for those websites that provide an API to access structured data, the data is typically provided in JSON format, which is orders of magnitude slower to process than native relational data formats, and usually not interoperable with similar datasets provided by other websites.
A small group of researchers and practitioners is today releasing a vision for a complete redesign of how we share structured data on the Internet. This vision is outlined in a 13-page paper that will appear at CIDR, a data systems vision conference that convenes next month. The paper proposes an architecture for a completely decentralized, ownerless platform for sharing structured data. This platform aims to enable a new WWW for structured data (e.g.,
data that fits in rows and columns of relational tables), with an
initial focus on IoT data. Anybody can publish structured data using their preferred schema, while retaining the ability to specify permissions on that data. Some data will be published with open access, in which case it will be queryable by any user of the platform. Other data will be published in encrypted form, in which case only users who hold the decryption key may see query results.
The platform is designed to make it easy for users not only to publish IoT (or any other type of structured) datasets, but also to be rewarded every time the data that they publish is queried. The platform provides a SQL interface that supports querying the entire wealth of previously published data. Questions such as:
“What was the maximum temperature reported in Palo Alto on June
21, 2008?” or “What was the difference in near accidents between
self-driving cars that used deep-learning model X vs. self-driving
cars that used deep-learning model Y?” or “How many cars passed
the toll bridge in the last hour?” or “How many malfunctions were
reported by a turbine of a particular model in all deployments in
the last year?” can all be expressed as clean and clearly specified SQL queries over the data published to the platform from many different data sources.
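As a concrete illustration, the first of these questions might look something like the following sketch in SQL. The table and column names here are hypothetical; actual schemas on the platform would be whatever publishers register.

    -- A minimal sketch, assuming a hypothetical weather_readings schema
    -- with city, reading_time, and temperature columns.
    SELECT MAX(temperature) AS max_temp
    FROM weather_readings
    WHERE city = 'Palo Alto'
      AND reading_time >= '2008-06-21'
      AND reading_time <  '2008-06-22';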
The Internet of Things was chosen as the initial use case for this vision since the data is machine-generated and usually requires less cleaning than human-generated data. Furthermore, there is a limited number of unique device types, typically with many deployed instances of each type. Every instance of a device type (running a particular software version) produces data according to an identical schema (for a long period of time). This reduces the complexity
of the data integration problem. In many cases, device manufacturers
can also include digital signatures that are sent along with any
data generated by that device. These signatures can be used to verify
that the data was generated by a known manufacturer, thereby reducing the ability of publishers to profit from contributing “fake data” to the platform.
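For example, a registered schema for a particular device type might look something like the sketch below. All names here are illustrative assumptions, not schemas prescribed by the paper; the point is that every instance of the same device model publishes rows with the same shape, including a manufacturer signature.

    -- A hypothetical schema for one device model (illustrative only).
    CREATE TABLE thermostat_x1_readings (
        device_id     VARCHAR(64),   -- identifies the device instance
        firmware      VARCHAR(16),   -- software version running on the device
        reading_time  TIMESTAMP,     -- when the measurement was taken
        temperature_c DECIMAL(5,2),  -- the measurement itself
        signature     VARBINARY(64)  -- manufacturer signature over the row,
                                     -- used to deter "fake data"
    );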
As alluded to above, publishers receive a financial reward
every time the data that they contributed participates in a query
result. This reward accomplishes three important goals: (1) it motivates data owners to contribute their data to the platform; (2) it motivates data owners to make their data public (since public data will be queried more often than private data); and (3) it motivates data owners to use an existing schema to publish their data (instead of creating a new one).
The first goal is an important departure from the WWW, where
data contributors are motivated by the fame and fortune that come
with bringing people directly to their website. Monetizing this web
traffic through ad revenue disincentivizes interoperability since providing
access to the data through a standardized API reduces the
data owner’s ability to serve advertisements. Instead, the proposed architecture enables
data contributors to monetize data through a SQL interface
that can answer queries from any source succinctly and directly. Making this data public, the second goal, increases the potential
for monetization.
The third goal is a critical one for structured data: the data integration problem is best approached at the source, at the time the data is generated, rather than at query time. The proposed architecture aims
to incentivize data integration prior to data publication by allowing
free market forces to generate consensus on a small number of economically
viable schemas per application domain. Of course, this incentivization
does not completely solve the data integration problem,
but we expect the platform to be useful for numerous application domains
even when large amounts of potentially relevant data must
be ignored at query time due to data integration challenges.
Since the system is fully decentralized, anybody can create an interface to the data on the platform, both for humans and for machines. We envision that a typical human-oriented interface would
look like the following: users are presented with a faceted interface
that helps them to choose from a limited number of application
domains. Once the domain is chosen, the user is presented with another faceted interface for constructing selection predicates (to narrow the focus of the data that the user is interested in within that domain). After this is complete, one of the schemas
from all of the registered schemas for that domain is selected based
on which datasets published using that schema contain the most
relevant data based on the user’s predicates. After the schema is
chosen, the interface aids the user in creating a static or streaming
SQL query over that schema. The entire set of data that was published using that schema, and to which the user who issued the query has access, is queried. The results are combined,
aggregated, and returned to the user. Machine interfaces would likely skip most of these steps, and instead query the platform directly in SQL (potentially after issuing some initial queries to access important metadata that enable the final SQL query to be formed).
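For instance, a machine client might first probe a metadata catalog and then issue its real query, along the lines of the sketch below. The catalog table and all names here are assumptions for illustration; the paper does not specify this catalog interface.

    -- Step 1: discover the registered schemas for a domain
    -- (platform_catalog is a hypothetical metadata table).
    SELECT schema_name
    FROM platform_catalog
    WHERE domain = 'weather';

    -- Step 2: issue the actual (static or streaming) query
    -- over the chosen schema.
    SELECT city, AVG(temperature_c) AS avg_temp
    FROM weather_readings
    WHERE reading_time > NOW() - INTERVAL '1' HOUR
    GROUP BY city;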
The proposed architecture incorporates third-party contractors and
coordinators for storing and providing query access to data. Contractors
and coordinators act as middlemen between data publishers
and consumers. This allows publishers to meaningfully participate in the network without having to provide resources
for storage and processing of data. This also facilitates managing data
at the edge.
Although contractors and coordinators make the system easier to use for publishers, their presence in the architecture presents two challenges: (1) how to incentivize them to participate, and (2) how to preserve the integrity of data and query results when untrusted and potentially malicious entities are involved in storage and processing. The CIDR paper proposes an infrastructure that addresses both of these challenges.
To summarize these solutions briefly: contractors and coordinators are incentivized similarly to publishers, via a financial reward for every query they serve. Querying the platform requires a small payment of tokens (a human-facing interface may serve advertisements to subsidize this token cost). These payment tokens are shared among the publishers that contributed data returned by the query and the contractors and coordinators that were involved in processing that query.
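As a purely illustrative sketch of the accounting (the paper is the authority on the actual reward mechanism, and the ledger table, role names, and split fractions below are assumptions), the per-query reward distribution could be read off a ledger like this:

    -- Hypothetical ledger of who participated in serving query 42.
    SELECT participant_id,
           role,                              -- 'publisher', 'contractor', or 'coordinator'
           query_fee_tokens * share_fraction AS reward_tokens
    FROM query_participants
    WHERE query_id = 42;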
The financial reward received per query incentivizes contractors and coordinators to participate in query processing. However, it does not ensure that their participation is honest or that correct query results are returned. In fact, without safeguards, contractors and coordinators can make more money by not wasting local resources on query processing and instead returning half-baked answers to query requests.
Indeed, one of the main obstacles to building decentralized database systems like the one we are proposing is securing the confidentiality, integrity, and availability of data, query results, and payment/incentive processing when the participants in the system are mutually distrustful and no universally trusted third party exists. Until
relatively recently, the security mechanisms necessary for building
such a system did not exist, were too inefficient, or were unable to
scale. Today, we believe recent advances in secure query processing, blockchain, Byzantine agreement, and trusted execution environments
put secure decentralized database systems within reach.
The proposed infrastructure uses a combination of these mechanisms
to secure data and computation within the system. For more details, please take a look at the CIDR paper!
I have a student, Gang Liao, who recently started building a prototype of the platform (research codebase at: https://github.com/DSLAM-UMD/P2PDB). Please contact us if you have some IoT data you can contribute to our research prototype. Separate from this academic effort, there is also a company called AnyLog that has taken some of the ideas from the research paper and is building a version of the system that is not fully decentralized.