ProTego Advanced Research - Policy-driven data protection

In a previous blog post on Medium, ProTego described how it can secure HL7 FHIR resources [1] by using Apache Parquet Modular Encryption (PME) [2], which was developed by Work Package 5 (WP5) – Cyber Risk Mitigation during the project.

Building on top of this approach, WP5 developed a Data Gateway which adds an Interceptor to an industrial strength, Open Source, FHIR server[3]. When a FHIR resource is received by the server, the Interceptor deserializes it and uses Spark with PME to write the resource as an encrypted Parquet file. Encryption/decryption keys are maintained in Vault[4], and access to Vault is provided by ProTego’s Keycloak-based Key Management System (KMS) which requires user authentication.
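To make the PME step more concrete, here is a minimal sketch of how the per-write encryption options could be assembled. The two option names are the ones documented for Parquet columnar encryption in Spark; the helper `pme_write_options`, the key identifiers and the column names are hypothetical and are not taken from the ProTego code.

```python
def pme_write_options(footer_key, column_keys):
    """Build the Parquet Modular Encryption options for one write.

    footer_key  -- ID of the master key protecting the file footer
    column_keys -- dict mapping a master key ID to the columns it encrypts
    (All key IDs and column names used here are illustrative assumptions.)
    """
    # PME expects "keyId:col,col;keyId:col" for the column key specification.
    column_spec = ";".join(
        f"{key_id}:{','.join(columns)}" for key_id, columns in column_keys.items()
    )
    return {
        "parquet.encryption.footer.key": footer_key,
        "parquet.encryption.column.keys": column_spec,
    }


options = pme_write_options(
    "footer-key",
    {"pii-key": ["name", "birthDate"], "billing-key": ["amount"]},
)
```

With a Spark session already configured for PME (crypto factory and KMS client), each entry in `options` would be passed to the DataFrame writer through `.option(name, value)` before calling `.parquet(...)`.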

We have successfully demonstrated in the project’s use cases how the stored data is both secure and can be efficiently queried using Spark SQL through use of the Query Gateway, also developed by WP5 in the project.

Lately we have been looking at how to take this solution a step further by applying policy-based access to data at a finer-grained level than standard user or role-based access to the Query Gateway or the FHIR server itself can provide. Why could this be useful? Well, let’s look at the collection of FHIR resources which may be stored by the FHIR server. FHIR contains resources spanning essentially the complete spectrum of medical-related information, from patient information which contains personally identifiable information (PII), to billing information, to resources describing organizations themselves. We can certainly envision a scenario where the Accounts Department of a health organization should be given access to billing information, but not to the medical results associated with a patient. How can we implement such a solution?

Using PME along with the Data Gateway, the Query Gateway and the KMS, we can think of a solution where we encrypt the footer of each Parquet resource with its own key, and then define in our KMS which users or roles are allowed to access which keys. So, for example, we can have a set of keys associated with the Accounts Departments, which will allow decryption of only certain FHIR resources and another set of keys which will allow doctors access to strictly medical-related data.
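The key-grouping idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the ProTego KMS: the key identifiers, role names and helper functions below are all hypothetical.

```python
# Hypothetical: one encryption key ID per FHIR resource type.
RESOURCE_KEYS = {
    "Patient": "key-patient",
    "Observation": "key-observation",
    "Claim": "key-claim",
    "Invoice": "key-invoice",
}

# Hypothetical: the key IDs each role is allowed to fetch from the KMS.
ROLE_KEYS = {
    "accounts": {"key-claim", "key-invoice"},
    "doctor": {"key-patient", "key-observation"},
}


def can_decrypt(role, resource_type):
    """A role can read a resource type only if it holds that type's key."""
    key_id = RESOURCE_KEYS.get(resource_type)
    return key_id is not None and key_id in ROLE_KEYS.get(role, set())


def readable_resources(role):
    """List the resource types a role can decrypt, in alphabetical order."""
    return sorted(rt for rt in RESOURCE_KEYS if can_decrypt(role, rt))
```

Note that the access policy here exists only implicitly, as the shape of the `ROLE_KEYS` table, which is exactly what makes this approach hard to read and maintain as it grows.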

This solution will work, but it can be hard to understand and maintain the data access policy which is essentially encoded into the key groupings inside the KMS. Additionally, we can envision scenarios for GDPR compliance where data access goes beyond role-based access, for example, where there are geography constraints on data export. Really what we need is a more sophisticated mechanism for policy decision and enforcement which is outside the realm of other system components.

As part of our advanced research in the second half of ProTego, WP5 has researched and prototyped a solution to policy-based data access using the Open Source Fybrik framework [5] being developed by IBM. Built on top of Kubernetes and hence compatible with the existing ProTego work, Fybrik is a sophisticated system for creating a secure, locked-down path between a data source and a data consumer or producer.

Conceptually, Fybrik consists of a Control Plane, which handles all the behind-the-scenes details of allowing policy-based access to a data source, and a runtime environment which allows for consumption of the data. The Control Plane includes a Policy Manager which both allows for defining data access policy and evaluates access decisions to the data. Based on a description of the data requestor (e.g. expressed purpose of data use, role, geographic location, data source required etc.), a Fybrik Module will be deployed in the Control Plane and will enforce the access decisions.

So, for example, we can define in a textual manner in the Policy Manager our data policies (“Allow an Accounts Department user only to access FHIR resources X, Y, Z…”) and the Module will automatically protect the access to FHIR resources. In fact, Fybrik allows for an even more powerful manner of protecting data; data within a resource can be redacted at the column level. For example, we may have a resource that contains both sensitive information and publicly available data and want to restrict access to the sensitive data based on policy. This is easy to configure in Fybrik – FHIR resources are described as Fybrik Assets and the description of the asset will specify the sensitive attributes in the resources. A policy will then be defined that specifies the access policy to the sensitive attributes for that resource. Now, when a user requests access to a FHIR resource, Fybrik will transparently handle data access and the required redaction actions.
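The following sketch shows the general shape of policy-driven, column-level redaction. It is plain Python written for illustration and is not Fybrik's policy language or API: the `Asset` and `Requester` classes, the `UNREDACTED` policy table and the redaction marker are all assumptions.

```python
from dataclasses import dataclass

REDACTED = "XXXXX"  # hypothetical marker written in place of hidden values


@dataclass(frozen=True)
class Asset:
    name: str
    sensitive_columns: frozenset  # attributes tagged as sensitive


@dataclass(frozen=True)
class Requester:
    role: str
    purpose: str


# Hypothetical policy: (role, purpose) pairs that may see sensitive columns.
UNREDACTED = {"Patient": {("doctor", "treatment")}}


def columns_to_redact(asset, requester):
    """Policy decision: which columns must be hidden from this requester."""
    allowed = UNREDACTED.get(asset.name, set())
    if (requester.role, requester.purpose) in allowed:
        return frozenset()
    return asset.sensitive_columns


def read(asset, requester, rows):
    """Policy enforcement: return copies of the rows with hidden columns masked."""
    hidden = columns_to_redact(asset, requester)
    return [
        {col: (REDACTED if col in hidden else val) for col, val in row.items()}
        for row in rows
    ]
```

The decision (`columns_to_redact`) is kept separate from the enforcement (`read`), so the policy table can change without touching the code that reads the data.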

Another security benefit of Fybrik is its ability to transparently inject data access keys, such as S3 keys, into the Module accessing the data, without exposing the keys to the data requester or to application code. This means that a single data source can be accessed by different organizations running in the ProTego Fybrik environment, without worrying about compromising data security or key distribution.

The ProTego use cases operate under a model where the data collector (the health organization) is also the sole data consumer. However, key injection, together with policy-defined data redaction, can open up a lot of powerful scenarios for data sharing by our use case partners. For example, data can now be securely made available to a third party under strict data export and redaction policy restrictions.

Another nice aspect is that our policy decision and enforcement transparently update when the policy is changed without the need to recompile any code or change any application.

We have developed a Fybrik-based prototype which illustrates these concepts. This code includes a FHIR server Interceptor which deserializes and writes received messages to a Kafka message queue. A Fybrik module will read from the message queue, identify the FHIR resource type, and write the resource out to an S3 bucket holding all messages for that given resource. So, for example, all Patient resources will be written to a Patient bucket, whereas all Observation messages will be written to the Observation bucket. The Fybrik module is also capable of batching resources before writing. A conceptual view of the architecture is shown in Figure 1.

Figure 1: The ProTego FHIR to S3 architecture
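The routing and batching behaviour of the module can be sketched as follows. This is an in-memory illustration only: Kafka and S3 are replaced by a plain list of JSON strings and a callback, and the `ResourceRouter` class, the bucket naming rule and the batch size are hypothetical. The only FHIR-specific detail relied on is the `resourceType` field that every FHIR JSON resource carries.

```python
import json
from collections import defaultdict


class ResourceRouter:
    """Group incoming FHIR messages by resource type and write them in batches."""

    def __init__(self, sink, batch_size=2):
        self.sink = sink              # called as sink(bucket_name, list_of_resources)
        self.batch_size = batch_size  # hypothetical default
        self._pending = defaultdict(list)

    def handle(self, message):
        resource = json.loads(message)
        rtype = resource["resourceType"]
        self._pending[rtype].append(resource)
        if len(self._pending[rtype]) >= self.batch_size:
            self.flush(rtype)

    def flush(self, rtype):
        batch = self._pending.pop(rtype, [])
        if batch:
            # Hypothetical naming rule: one bucket per lower-cased resource type.
            self.sink(rtype.lower(), batch)

    def flush_all(self):
        for rtype in list(self._pending):
            self.flush(rtype)


# Stand-ins for the message queue and for the S3 writer.
written = []
router = ResourceRouter(lambda bucket, batch: written.append((bucket, [r["id"] for r in batch])))
for msg in [
    '{"resourceType": "Patient", "id": "p1"}',
    '{"resourceType": "Observation", "id": "o1"}',
    '{"resourceType": "Patient", "id": "p2"}',
]:
    router.handle(msg)
router.flush_all()
```

In this sketch a batch is written as soon as it is full, and `flush_all` writes whatever is left over, which a real module would typically also do on a timer or at shutdown.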

Our next step will be to look at how we can take data redaction a step further by allowing for the injection of custom redaction code into the workflow. This could be used, for example, to provide anonymization of redacted data by only exporting statistical characteristics of the data set (e.g. average, standard deviation, …) to third parties.
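As a rough illustration of what such custom redaction code might look like, the sketch below reduces a column of numeric values to summary statistics, so that only the aggregates leave the organization. The `summarize` function and its output fields are hypothetical and are not part of the current prototype.

```python
import statistics


def summarize(values):
    """Return only aggregate statistics for a list of numeric values.

    Raises ValueError for fewer than two values, since a sample standard
    deviation is undefined and a single value would be exported as-is.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to export statistics")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Example with made-up measurement values.
summary = summarize([1.0, 2.0, 3.0, 4.0])
```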

This blog post was written for the ProTego project by Eliot Salant (IBM Haifa Research Labs).

[1] https://www.hl7.org/fhir/

[2] https://github.com/apache/parquet-format/blob/master/Encryption.md

[3] https://ibm.github.io/FHIR/

[4] https://www.vaultproject.io/

[5] https://fybrik.io/v0.4/get-started/quickstart/