Privacy-preserving continuous queries in the web

Introduction

Sensors, social networks and smartphone applications are producing an unprecedented amount of personal data on the web, with an intrinsic risk of exposing sensitive and private information, such as health conditions and location. Our society has recently started to recognise such threats and has taken initial steps to tackle privacy from several perspectives, such as law, education and technology. While it is key to raise citizens' awareness of how to protect themselves and their data, it is also important not to stop the exchange of data, as doing so may have a negative impact on the economy, innovation and research. Since the web is one of the most common platforms for sharing information, we need privacy-preserving solutions to make it safer. In our research, we study how to continuously publish data extracted from private data streams containing user-related information to the web in a privacy-preserving manner.

Data on the web

The web has grown as a repository of documents. In the last 20 years, however, the idea of the semantic web has emerged as an enhancement of the web, where data can be published alongside documents. This shift is transforming the web from an immense library into a world-scale database, which enables new operations, such as querying.

Data on the web can be published through knowledge graphs (KGs). KGs represent information in graph-based structures, where nodes identify entities and edges denote the relations between them. Several organisations and websites use KG technologies to publish their data. For example, IMDb pages can be processed to extract JSON-LD annotations describing movies, actors and directors through KGs:

A knowledge graph extracted from an IMDb page.
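
As a rough illustration of the graph model, the sketch below flattens a small JSON-LD-style annotation into (subject, predicate, object) triples, that is, into nodes connected by labelled edges. The annotation is a made-up example in the style of schema.org markup, not actual IMDb data, and the helper name to_triples is hypothetical. Full JSON-LD processing (contexts, expansion) is deliberately left out.

```python
import json

# Hypothetical JSON-LD-style annotation (illustrative values only,
# not taken from a real page).
DOC = json.loads("""
{
  "@id": "https://example.org/movie/1",
  "@type": "Movie",
  "name": "Example Movie",
  "director": {
    "@id": "https://example.org/person/7",
    "@type": "Person",
    "name": "Jane Doe"
  }
}
""")


def to_triples(node):
    """Flatten a nested JSON-LD-like dict into (subject, predicate, object) triples."""
    subject = node["@id"]
    triples = []
    for key, value in node.items():
        if key == "@id":
            continue
        if isinstance(value, dict):
            # An edge to another entity: link to it, then describe it recursively.
            triples.append((subject, key, value["@id"]))
            triples.extend(to_triples(value))
        else:
            # An edge to a literal value (or a type name).
            triples.append((subject, key, value))
    return triples


TRIPLES = to_triples(DOC)
for triple in TRIPLES:
    print(triple)
```

Each printed triple corresponds to one edge of the graph: the movie node points to its director node, and both nodes carry their own type and name.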

Knowledge graph streams

KGs evolve over time as new data is added or existing data is revised. We can model such changes through sequences of timestamped KGs, called KG streams. For example, the picture shows :sTV, a KG stream containing information about which channels users are currently watching. In a given snapshot, the graph reports the current state of the viewers:

By using query languages like SPARQL and its continuous extensions, one can analyse KG streams. The following query asks for the number of viewers for each TV channel in :sTV:

PREFIX : <https://example.org/>
SELECT ?channel (COUNT(*) AS ?viewers)
FROM STREAM :sTV TO STREAM :sOut
WHERE {
  ?user :watches ?channel .
} GROUP BY ?channel
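
To make the semantics of this query concrete, the following sketch evaluates the same aggregation in plain Python, once per snapshot of the stream. It is only an approximation of what a continuous query engine does: the stream is modelled as a list of (timestamp, triples) pairs, and the user and channel names are hypothetical.

```python
from collections import Counter

# A KG stream modelled as a list of (timestamp, set of triples) pairs.
# User and channel names are hypothetical.
S_TV = [
    (1, {(":alice", ":watches", ":ch1"),
         (":bob", ":watches", ":ch1"),
         (":carol", ":watches", ":ch2")}),
    (2, {(":alice", ":watches", ":ch2"),
         (":bob", ":watches", ":ch1"),
         (":carol", ":watches", ":ch2")}),
]


def count_viewers(triples):
    """Count matches of '?user :watches ?channel', grouped by ?channel."""
    return Counter(obj for (_subj, pred, obj) in triples if pred == ":watches")


# Evaluate the aggregation once per snapshot to obtain the answer stream.
S_OUT = [(ts, dict(count_viewers(triples))) for ts, triples in S_TV]
for ts, answer in S_OUT:
    print(ts, answer)
```

Each element of the result pairs a timestamp with one histogram of viewers per channel, which is the shape of the answer stream.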

Executing the query produces an answer stream, :sOut:

Data privacy

Publishing :sTV would expose private information about TV viewers. It is possible to apply pseudo-anonymisation techniques to hide individual identities. Privacy researchers, however, have shown that such techniques can lead to privacy leaks, as happened, for example, in the case of the Netflix challenge. Publishing :sOut, i.e. data analytics results about the original data, may also lead to privacy leaks.

Differential privacy (DP) emerged to offer strong privacy guarantees in data analytics. While there are DP techniques that target streaming and dynamic data, it is still not clear to what extent they can be used in real scenarios, as they (i) lack ready-to-use implementations and (ii) require a deep understanding of their theoretical foundations.
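
To give a flavour of what a DP technique looks like in practice, here is a minimal sketch of the Laplace mechanism applied to a histogram of counts. It assumes that each user contributes to exactly one bin, so that adding or removing a user changes the histogram by at most 1 (sensitivity 1). The function names are hypothetical, and the floating-point sampling is for illustration only; real deployments should rely on a vetted DP library.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity / epsilon to every bin.

    Assumption: each user falls in exactly one bin, hence sensitivity 1.
    """
    scale = sensitivity / epsilon
    return {bin_: count + laplace_noise(scale, rng) for bin_, count in counts.items()}


rng = random.Random(0)  # fixed seed only to make the example repeatable
print(noisy_histogram({":ch1": 2, ":ch2": 1}, epsilon=0.1, rng=rng))
```

A smaller epsilon gives a larger noise scale, and therefore stronger privacy at the cost of less accurate counts.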

SihlMill

A possible way to push the adoption of DP is to provide data analytics practitioners with ready-to-use libraries, as in the case of OpenDP, Diffprivlib and Google's DP libraries.

Following this philosophy, we have developed SihlMill, an engine for executing privacy-preserving data analytics workflows over KG streams. SihlMill is designed on top of the w-event privacy framework, which is the state of the art for differentially-private stream processing. Like DP, w-event privacy determines the noise level required to hide the presence (or absence) of every user. As a stream describes a user over time, hiding their presence usually requires a large amount of noise. The w-event privacy framework overcomes this issue by introducing a notion of differential privacy within a time interval (a window of w consecutive timestamps), which is used to control the trade-off between privacy and utility. When in action, SihlMill produces streams like :sOutPri:

The blue boxes represent the SihlMill answer, while the dashed lines are the real (and hidden) answer. To further improve utility, w-event privacy proposes not publishing new statistics when they are similar to the most recently released ones.
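
The idea of a privacy budget per time window can be sketched with the simplest allocation strategy: split epsilon evenly over the w timestamps of a window, so that any w consecutive releases together spend at most epsilon. This is a deliberately simplified baseline, not SihlMill's implementation; in particular it publishes at every timestamp and does not skip releases that are similar to previous ones. Function names and the sample data are hypothetical, and sensitivity 1 is assumed.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def uniform_w_event(stream, epsilon, w, rng):
    """Release every snapshot with an equal share epsilon / w of the budget.

    Assumption: each user affects one bin per timestamp (sensitivity 1).
    Yields (timestamp, budget spent, noisy histogram).
    """
    eps_t = epsilon / w
    scale = 1.0 / eps_t
    for ts, counts in stream:
        noisy = {ch: c + laplace_noise(scale, rng) for ch, c in counts.items()}
        yield ts, eps_t, noisy


def max_window_budget(spent, w):
    """Largest total budget spent over any w consecutive timestamps."""
    return max(sum(spent[i:i + w]) for i in range(len(spent)))


# Hypothetical answer stream: six timestamps with constant true counts.
TRUE_STREAM = [(t, {":ch1": 2, ":ch2": 1}) for t in range(1, 7)]
RELEASED = list(uniform_w_event(TRUE_STREAM, epsilon=0.1, w=3, rng=random.Random(42)))
SPENT = [eps_t for _ts, eps_t, _noisy in RELEASED]
print(max_window_budget(SPENT, w=3))
```

The printed value stays at (or, up to floating-point rounding, around) the total epsilon, which is the guarantee the window is meant to enforce. More refined strategies save budget at timestamps where nothing new is published and spend it where the data changes.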

The current version of SihlMill focuses on histograms, since they provide the foundation for many analytic tasks such as data warehousing, OLAP and business analytics, as well as for many machine learning algorithms such as decision trees and naive Bayes. To increase the number of analyses supported by SihlMill, we enhanced the w-event framework with a bin removal mechanism. The mechanism is designed to dynamically add and remove bins, and to protect users with unique behaviour.
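
The text above does not detail how bins are removed. One common approach in the DP literature, shown here purely as an assumption and not as SihlMill's actual mechanism, is to publish only the bins whose noisy count reaches a threshold, so that a bin populated by a single user with unusual behaviour is unlikely to appear in the output. The function name, the threshold and the sample values are hypothetical.

```python
def prune_bins(noisy_counts, threshold):
    """Keep only the bins whose noisy count is at least the threshold.

    Assumption: the counts passed in are already noisy, so the decision
    to drop a bin does not depend directly on the true count.
    """
    return {bin_: value for bin_, value in noisy_counts.items() if value >= threshold}


# Hypothetical noisy histogram: a popular channel and a rarely watched one.
NOISY = {":ch1": 41.7, ":rare": 0.6}
PUBLISHED = prune_bins(NOISY, threshold=5.0)
print(PUBLISHED)
```

Because the set of published bins can change from one timestamp to the next, bins appear and disappear dynamically as their noisy counts cross the threshold.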

SihlMill is available as open-source software under the Apache licence. We built SihlMill on top of two existing projects:

SihlQL

To control SihlMill, we designed SihlQL. SihlQL is a declarative query language, meaning that the user defines what to retrieve and the underlying engine takes care of the retrieval process. As such, SihlQL can ease the adoption of DP by letting data analysts express differentially-private queries over KG streams without having to cope with the DP algorithms and their complexities. On the one hand, SihlQL extends SPARQL with operators to consume and produce KG streams, as well as operators to adjust the privacy level; on the other hand, SihlQL limits the operators of SPARQL to ensure that queries are suitable for applying DP techniques.

A SihlQL query consumes data from a stream of KGs, optionally combined with one or more static KGs, which may contain background information. The SihlQL query that computes :sOutPri is the following:

PREFIX : <https://example.org/>
ENABLE PRIVACY EPSILON 0.1 W 3
SELECT ?channel (COUNT(*) AS ?viewers)
FROM STREAM :sTV TO STREAM :sOutPri
WHERE {
  ?user :watches ?channel .
} GROUP BY ?channel

Comparing this query with the one that computes :sOut, the main difference lies in the second line, which is introduced to control the DP and w-event privacy parameters.
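
As a small illustration of how these parameters could be read from a query, the sketch below extracts them with a regular expression. It is not SihlQL's parser: the pattern is inferred from the single example above, reading EPSILON as the privacy budget and W as the window length is an interpretation, and the names PRIVACY_CLAUSE and privacy_parameters are hypothetical.

```python
import re

# Pattern inferred from the one example clause; the real grammar may differ.
PRIVACY_CLAUSE = re.compile(
    r"ENABLE\s+PRIVACY\s+EPSILON\s+(?P<epsilon>\d+(?:\.\d+)?)\s+W\s+(?P<w>\d+)",
    re.IGNORECASE,
)


def privacy_parameters(query):
    """Return (epsilon, w) if the query has a privacy clause, else None."""
    match = PRIVACY_CLAUSE.search(query)
    if match is None:
        return None
    return float(match.group("epsilon")), int(match.group("w"))


QUERY_FRAGMENT = """
PREFIX : <https://example.org/>
ENABLE PRIVACY EPSILON 0.1 W 3
SELECT ?channel (COUNT(*) AS ?viewers)
"""
print(privacy_parameters(QUERY_FRAGMENT))
```

A query without the clause yields None, which matches the plain continuous query shown earlier, where no privacy mechanism is requested.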

Conclusions

The ability to exchange knowledge on the web is one of the pillars of our digital society. Data analytics leads to the creation of new knowledge, which in turn leads to innovation and ultimately to increased welfare. However, such analyses should be carried out while respecting people's privacy. We believe that tools like SihlMill and SihlQL can play an important role in enabling privacy-preserving data analytics, as they provide data analysts with ready-to-use frameworks to safely process personal data and publish it on the web.

Acknowledgements

We thank the Swiss National Science Foundation for its partial support under contract number #407550_167177.