Federated data processing has been a standard model for the virtual integration of disparate data sources, where each source upholds a certain amount of autonomy. While early federated technologies resulted from mergers, acquisitions, and specialized corporate applications, recent demand for decentralized data storage and computation in information marketplaces and for Geo-distributed data analytics has made federated data services an indispensable component in the data systems market.
At the same time, growing concerns with data privacy propelled by regulations across the world have brought federated data processing under the purview of regulatory bodies.
This series of blog posts will discuss challenges in building regulation-compliant federated data processing systems and our initiatives at Databloom that strive towards making compliance a first-class citizen in our Blossom data platform.
Federated data processing
Running analytics in a federated environment requires distributed data processing capabilities that (1) provide a unified query interface to analyze distributed and decentralized data, (2) transparently translate a user-specified query into a so-called query execution plan, and (3) can execute plan operators across compute sites. Here, a critical component in the processing pipeline is the query optimizer. Typically, a query optimizer considers distributed execution strategies (involving distributing query operators like join or aggregation across compute nodes), communication costs between compute nodes, and introduces a global property that describes where, i.e., at which site, processing of each plan operator happens. For example, a two-way join query over data sources in Asia, Europe, and North America may be executed by first joining data in North America and Europe and then joining with the data in Asia.
Growing data regulations, a new challenge
As one can notice, federated queries implicitly ship data (i.e., intermediate query results) between compute sites. While several performance aspects, such as bandwidth, latency, communication cost, and compute capabilities have received great attention, the federate nature of data processing has been recently challenged by data transfer regulations (or policies) that restrict the movement of data across geographical (or institutional) borders or by any other rules of data protection that may apply to the data being transferred between certain sites.
European directives, for example, regulate transferring only certain information fields (or combinations thereof), such as non-personal information or information not relatable to a person. Likewise, regulations in Asia may also impose restrictions on data transfer. Non-compliance with such regulatory obligations has attracted fines in the tune of billions of dollars. It is, therefore, crucial to consider compliance with respect to legal aspects when analyzing federated data.
Data transfer regulations through the GDPR lens
As of now most countries around the world have various data protection laws---with the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) being the most prominent---that impose restrictions on how data is stored, processed, and transferred.
Let's take GDPR as an example. GDPR articles 44-50 explicitly deal with the transfer of data across national borders. Among these, there are two articles and one recital wherein the legal requirements for transferring data fundamentally affect federated data processing.
Article 45: Transfers on the basis of an adequacy decision.
The article dictates that the transfer of data may take place without any specific authorization, e.g., when there is adequate data protection at the site where the data is being transferred or when the data is not subjected to regulations (i.e., when the data does not follow the definition of personal data as in Article 4(1)).
Article 46: Transfers subject to appropriate safeguards.
This article prescribes that (in the absence of applicability of Article 45) data transfer can take place under "appropriate safeguards". Based on the European Data Protection Board (EDPB) recommendations that supplement transfer tools, pseudonymization of data (as defined under Article 4(5)) is considered as an effective supplementary method.
Recital 108: Transfers under measures that compensate for a lack of data protection.
Data after adequate anonymization (i.e., when the resulting data does not fall under Article 4(1) and as described in Recital 26) does not fall under the ambit of GDPR and therefore can be transferred. Depending on the data and to where that data is being transferred, the above regulations can be classified into:
- No restrictions on transfer: Some data may be allowed to be transferred unconditionally, and some to only certain locations.
- Conditional restrictions on transfer: For some data, only derived information (such as aggregates) or only after anonymization, can be transferred to (certain) locations.
- Complete ban on transfer: Some data, no matter whatsoever, must not be transferred outside.
Compliance-by-Design: The Challenges
Rather than ad-hoc solutions to make data processing regulation-compliant, a more holistic approach that provides appropriate safeguards to data controllers (entities that control what data and how data should be processed) and data processors (entities that processes data on behalf of a controller) within a federated data processing system is required.
In the context of federated data processing, three aspects must be revisited:
- First and foremost, data processing systems must offer declarative policy specification languages, which make it easy and simple for controllers to specify data regulations. Policy specification languages should take into account the type of data, its processing, as well as the location of processing. Regulations may affect the processing of an entire dataset, a subset of it, or even information that is derived from it. Policy specifications must also be considered keeping in mind the heterogeneity of data formats (e.g., graph, relational, or textual data).
- The second aspect of ensuring that compliance is at the core of federated data processing is integrating legal aspects in query rewriting and optimization. A system must be able to transparently translate user queries into compliant execution plans.
- Lastly, federated systems must offer capabilities to decentralize query execution, which may also be desired by the compliant plan. We need query executors that can efficiently orchestrate queries over different platforms across multiple clouds or data silos.
Summary
Today regulation-compliant data processing is a major challenge, driven by regulatory bodies across the world. In this blog post, we analyzed data transfer regulations from the perspective of GDPR and discussed key research challenges for including compliance aspects in federated data processing. Compliance is at the core of our blossom data platform. In the next blog post, we will discuss how Databloom's blossom data platform addresses some of the aforementioned challenges and ensures regulation-compliant data processing across multiple clouds, Geo-locations, and data platforms.
References
[1] Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Compliant Geo-distributed Query Processing. SIGMOD Conference 2021: 181-193
[2] Kaustubh Beedkar, David Brekardin, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Compliant Geo-distributed Data Processing in Action. Proc. VLDB Endow. 14(12): 2843-2846 (2021)
[3] Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Navigating Compliance with Data Transfers in Federated Data Processing. IEEE Data Eng. Bull. 45(1): 50-61 (2022)
About Scalytics
Scalytics Connect is a next-generation Federated Learning Framework built for enterprises. It bridges the gap between decentralized data and scalable AI, enabling seamless integration across diverse sources while prioritizing compliance, data privacy, and transparency.
Our mission is to empower developers and decision-makers with a framework that removes the barriers of traditional infrastructure. With Scalytics Connect, you can build scalable, explainable AI systems that keep your organization ahead of the curve. Break free from limitations and unlock the full potential of your AI projects.
Apache Wayang: The Leading Java-Based Federated Learning Framework
Scalytics is powered by Apache Wayang, and we're proud to support the project. You can check out their public GitHub repo right here. If you're enjoying our software, show your love and support - a star ⭐ would mean a lot!
If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.