It is said that, in 2006, Clive Humby, a British mathematician and, among other things, the architect of the Tesco loyalty card program, stated that “data is the new oil”. What is often missed is the extension of that statement, later added by Michael Palmer: “Data is just like crude. It’s valuable, but if unrefined it cannot really be used.”
Humby’s statement, and Palmer’s extension, were adopted by many communities, including the cybersecurity community, and were recently cited during a keynote on recent Operational Technology attack trends in the context of the war in Ukraine, at a 2023 Cybercrime Conference.
While one should not expect analogies to be a perfect map between cultural concepts, it is natural, semiotically, to link the instruments of the oil supply chain to their data counterparts, for example by stating that “if data is the new oil, databases are the new oil storage tanks”. That line has probably been used as many times as the data-oil analogy itself.
It is probably less known that, around the time Humby and Palmer were coining their mottos, many researchers had started to work on privacy-preserving data mining over large data collections. In particular, in Palo Alto, in the corridors of the illustrious HP Laboratories, a group of cryptologists had been working on the privacy-preserving verification of aggregate queries on outsourced databases. That work, published in an article made publicly available in 2006/2007, presented the scientific community with the following problem:
“on how to verify outsourced data and computation [7, 15, 23, 22, 24, 30, 31, 32] including the verification of both correctness and completeness of relational database queries, such as SELECT, PROJECT, and UNION [..] We give protocols for privacy-preserving verification of aggregate queries including SUM, MAX, MIN, COUNT, AVERAGE, and MEDIAN.”
If algorithms are the new refineries and databases the new oil storage tanks, then extracting insight from data comes down to the queries that professionals such as Business Intelligence analysts run daily against relational databases, using a query language like SQL.
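To make the kind of query at stake concrete, here is a minimal, purely illustrative sketch (the sales table, its columns and the sample rows are hypothetical) of an aggregate query of the sort a Business Intelligence analyst might run, executed through SQLite from Python:

```python
# Minimal illustration: an aggregate SQL query over a toy relational table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 200.0), ("APAC", 50.0)],
)

# The aggregates (COUNT, SUM, AVG) are exactly the kind of results whose
# correctness and completeness one would like to be able to verify.
for row in conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
):
    print(row)
# ('APAC', 2, 250.0, 125.0)
# ('EMEA', 2, 200.0, 100.0)
```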
But access to data is rarely simple, and it is even more difficult when it comes to coupling that access with the creation of information and the extraction of insights. There is not enough space here to detail the history of data monopolies, to list the many failed projects and ecosystems wrapped around the notion of data sharing and the (foolishly architected) dreams of distributed data markets, or to draw Matrix-like analogies where people are data batteries harvested by the corporate protagonists of the Big Data Economy (those are all stories for other times!).
Instead, it is more effective to draw on our personal stories and experiences, on the many combined years of our careers in corporate industrial R&D and academia, which can be summarised simply as:
Data are stored in databases that Companies A, B, C and D host somewhere. Data that you cannot access. You get the information, but you do not know whether that information was created from the right data.
“Trust my data,” says Company A.
“I have the right data,” says Company B.
“I am not hiding any data from your algorithms,” says Company C.
“I am computing the algorithms on the right data,” says Company D.
The work in Palo Alto fits within Verifiable Computing, a scientific discipline whose roots go even further back in time, to the 1980s, and to the fathers and mothers of modern cryptology: Babai, Goldwasser, Micali and many others. Haber’s work at HP, however, focused part of the computer science and cryptography community on the issue of aggregate queries over private data. Since 2006, numerous attempts, mostly academic, have addressed the problem stated by Haber et al., driven mainly by the success of Zero-Knowledge proof systems based on Succinct Non-Interactive Arguments of Knowledge (SNARKs).
New directions emerged in October 2022 with Tavloid, a Protocol Labs project. In addressing the problem of verifying the correctness of queries over relational data, Tavloid hinted at clever techniques that do not require the heavy machinery of SNARKs. Then, at the end of 2023, encouraged by performance improvements in implementations of cryptographic protocols such as Bulletproofs and Verkle trees, and by recent advances in Zero-Knowledge Proofs with succinct proofs and folding schemes such as Nova, we decided to pick up the work left by Tavloid and others and:
find a simple way to produce proofs that information and analytics insights are created by aggregate queries on data stored in private relational databases.
The result is Provably, a product that, unlike much of the current Zero-Knowledge Proof (ZKP) landscape, focuses on Big Data and large datasets.
Zero-knowledge proof systems, particularly those relying on SNARKs, have revolutionised various fields, notably blockchains, where they ensure transaction privacy, verify off-chain computation and provide a powerful way to prove statements about confidential data. But in the context of analytics-based queries over large datasets, general-purpose ZK proof systems face hurdles that are hard to overcome, since their structure is ill-suited to computations at that scale. For example, consider a moderate-sized table containing millions of numerical values; even a basic operation like summing these values can result in a circuit with well over a million gates.
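As a rough back-of-the-envelope count (assuming, purely for illustration, that the circuit arithmetises each pairwise addition as one gate):

$$
\mathrm{gates}\big(\mathrm{SUM}(x_1,\dots,x_N)\big) \approx N - 1,
\qquad
N = 2 \times 10^{6} \;\Rightarrow\; \approx 2 \times 10^{6}\ \text{gates},
$$

and that is before any range checks on the inputs or the comparison gadgets needed for MIN, MAX and MEDIAN, which typically inflate the count further.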
Recognising such limitations, Provably takes an alternative approach that tackles the verification of queries over large datasets effectively, sidestepping the constraints of general-purpose ZK proof systems. We do so by standing on the shoulders of giants. Inspired by Tavloid, we address Haber’s 2006 problem statement. By leveraging well-known open-source cryptographic tools (such as Pedersen commitments, Bulletproofs, and Verkle trees) and expressing queries in SQL, Provably enables efficient and scalable verification of analytics-based queries.
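As a toy sketch of why these primitives fit aggregate queries so well (illustrative only: the group parameters below are deliberately tiny, and a real construction relies on elliptic-curve groups of cryptographic size, such as those used by Bulletproofs), recall that Pedersen commitments are additively homomorphic, so a commitment to a SUM can be checked against the commitments to the individual rows without revealing them:

```python
# Toy Pedersen commitments over the order-11 subgroup of Z_23^*.
# Tiny parameters, purely illustrative; in practice one uses an
# elliptic-curve group, and h must be chosen so that log_g(h) is unknown.
import secrets

p, q = 23, 11        # p = 2q + 1 (safe prime); q = subgroup order
g, h = 4, 9          # two generators of the order-q subgroup

def commit(value: int, blinding: int) -> int:
    """Pedersen commitment C = g^value * h^blinding mod p."""
    return (pow(g, value, p) * pow(h, blinding, p)) % p

# Two private "rows" (values kept small so their sum stays below q).
v1, v2 = 3, 5
r1, r2 = secrets.randbelow(q), secrets.randbelow(q)
c1, c2 = commit(v1, r1), commit(v2, r2)

# Additive homomorphism: the product of the commitments is a commitment
# to the SUM of the rows, under the sum of the blindings.
assert (c1 * c2) % p == commit((v1 + v2) % q, (r1 + r2) % q)
```

A verifier holding only the commitments and a claimed aggregate can check a SUM-type relation without ever seeing the individual rows; loosely speaking, it is this additive structure, combined with range and membership arguments, that lets aggregate queries avoid a general-purpose circuit.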
Hence the launch of Provably, available as an MVP from July 2024, marks the point from which the data industry can benefit from ZK Big Data: from data security, to data privacy and privacy-enhancing technologies, to computing on outsourced databases, to, finally, building data ecosystems.