The EU's new General Data Protection Regulation (GDPR), which takes effect on May 25, might make it tempting to remove all personally identifiable information (PII) from your data pipeline. Deleting everything would undoubtedly provide full compliance with GDPR, but it would also significantly frustrate data science efforts across the business and diminish the value of the data pipeline. We argue for an alternative approach to cleansing your data pipeline for GDPR compliance, one that still allows data science to make use of the data collected.
The major cloud platforms (Amazon Web Services, Google Cloud Platform, and Microsoft Azure) and, for that matter, open-source data warehouses like Hadoop do not provide a framework for handling personal data in the data pipeline. This article aims to fill that gap, describing an approach that can be applied on any of these platforms. The framework will give you a way to continue with data science once GDPR has come into effect.
Under GDPR, there are two categories of confidential data. The first identifies people as individuals, and the second describes their characteristics (Table 1). Personal characteristics are sensitive and can be used to discriminate against people, violating their fundamental EU rights.
Even though an IP address or geographical location may seem harmless, they are considered confidential data because they can reveal your physical address. A VPN can hide your IP address, but many users do not know about VPNs and cannot be expected to buy such a service.
Personal identifiers are matched with other information to create a profile of an individual. This profile can then be used for multiple purposes. For example, when you perform a Google search, the search text can be used to target website advertisements.
As a company storing data on individuals, you should take great care in considering not only which personal identifiers but also which personal characteristics you need to collect. Base your decision on the legitimate interest rule as defined in GDPR.
GDPR has an extensive list of rules for handling data, but we can summarize them into five different areas.
Anonymization is a radical step for dealing with confidential data, as it removes any ability to correlate personal identifiers with other data. For example, if an ID is converted to a random string of numbers, that string cannot remain the same over time if the data is to be truly anonymized; it should change according to the time the user was active. If it stays the same, a single data point about the individual is enough to re-identify them and their associated data.
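To make this concrete, here is a minimal sketch of time-varying anonymization. It assumes one random salt per activity period that is discarded once the period closes; the function and variable names are illustrative, not part of the article's framework.

```python
import hashlib
import secrets

# One random salt per activity period; discarding this map after each
# period makes tokens from different periods impossible to link.
_period_salts = {}

def anonymize(user_id: str, period: str) -> str:
    """Return a token that is stable within a period but changes across periods."""
    salt = _period_salts.setdefault(period, secrets.token_hex(16))
    return hashlib.sha256((salt + user_id).encode()).hexdigest()

# Same user, same period -> same token; a new period yields a new token,
# so one known data point no longer re-identifies the whole history.
a = anonymize("user-42", "2018-05")
b = anonymize("user-42", "2018-05")
c = anonymize("user-42", "2018-06")
assert a == b and a != c
```

Because the salts are never persisted, nothing in the unrestricted data can be traced back to the original identifier.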
The focus of this article is pseudonymization, a possibility described in GDPR that gives service providers more flexibility than complete anonymization. In pseudonymization, another attribute is created to link personal identifiers to the anonymized identifiers, such that data can be de-anonymized later.
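A simple sketch of pseudonymization might use a keyed hash plus a lookup table that would live in the restricted database. The secret key, table, and function names below are assumptions for illustration only.

```python
import hmac
import hashlib

# Assumption: this key is managed as a secret in the restricted environment.
SECRET_KEY = b"keep-this-in-the-restricted-store"

# The linking attribute: pseudonym -> original identifier. This table must
# itself be stored in the restricted database, as it enables de-anonymization.
pseudonym_to_id = {}

def pseudonymize(user_id: str) -> str:
    """Produce a stable pseudonym and record the link for later reversal."""
    token = hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()
    pseudonym_to_id[token] = user_id
    return token

def de_anonymize(token: str) -> str:
    """Controlled reversal, only possible with access to the restricted table."""
    return pseudonym_to_id[token]

token = pseudonymize("user-42")
assert de_anonymize(token) == "user-42"
```

Deleting a user's entry from the lookup table effectively converts their pseudonymized records into anonymized ones.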
There are three areas to consider. The first describes essential points for deciding how to handle user consent, the second covers database setup, and the third gives guidance on processing confidential data.
In setting up your GDPR-compliant framework for managing data, you must consider how to handle consent for collecting data from end users. This will vary depending on whether you will share data with partners or customers. GDPR requires accountability from all parties who access the data.
The most straightforward strategy is to create a consent database for each subscriber. For accountability, you log when consent is given and when it is withdrawn. You further develop a routine that purges an end user's personal data when consent is withdrawn. If you have shared data with third parties, you need to send them a list of users to purge on a regular basis, ideally as soon as possible after a user withdraws consent.
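A minimal sketch of such a consent store, with an audit log and a purge routine, might look like the following. The class and field names are illustrative assumptions, not a real API.

```python
from datetime import datetime, timezone

class ConsentStore:
    """Hypothetical per-subscriber consent store with an audit log."""

    def __init__(self):
        self.consent = {}  # user_id -> bool
        self.log = []      # (timestamp, user_id, event) for accountability

    def give_consent(self, user_id):
        self.consent[user_id] = True
        self.log.append((datetime.now(timezone.utc), user_id, "given"))

    def withdraw_consent(self, user_id, personal_data, partner_purge_lists):
        self.consent[user_id] = False
        self.log.append((datetime.now(timezone.utc), user_id, "withdrawn"))
        personal_data.pop(user_id, None)        # purge local personal data
        for purge_list in partner_purge_lists.values():
            purge_list.append(user_id)          # queue purge for third parties

store = ConsentStore()
data = {"user-42": {"email": "a@example.com"}}
partners = {"partner-a": []}
store.give_consent("user-42")
store.withdraw_consent("user-42", data, partners)
assert "user-42" not in data
assert partners["partner-a"] == ["user-42"]
```

In practice the purge lists would be delivered to partners on a schedule, as discussed above.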
Under GDPR, you must not only ensure that the data in your warehouse is removed upon user request, but also that any data you have shared with others is deleted. Simply sending an external party a list informing them that a user has revoked access is not enough. You need automatic routines that ensure proper compliance and full accountability of all parties in case of an audit.
To achieve automatic routines, you should create a digital consent agreement (that is, a contract) for each subscriber. The contract can then be used when managing consent with an external party. Such a contract offers multiple advantages because it:
A Certificate Authority (CA) or blockchain can be used to manage the consent agreement. When a subscriber consents to the gathering of personal data, you create an X.509 certificate (Figure 1). The certificate includes instructions on how often to check the Certificate Revocation List (CRL). The frequency of CRL checks is based on the terms of the agreement with the external partner.
Figure 1. Process for creating consent certificate
When the subscriber revokes consent, the certificate is added to the CRL and then removed from the database (Figure 2).
Figure 2. Process for revoking consent
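The issue-and-revoke flow in Figures 1 and 2 can be modeled in a few lines. This is a deliberately simplified stand-in for real X.509 machinery: certificates are plain records keyed by serial number, and the CRL is just a set of revoked serials that partners poll at the agreed frequency.

```python
import uuid

crl = set()        # Certificate Revocation List: revoked serial numbers
consent_certs = {} # user_id -> consent certificate record

def issue_consent_cert(user_id, crl_check_hours=24):
    """Issue a consent 'certificate'; crl_check_hours reflects the
    partner agreement on how often the CRL must be polled (assumption)."""
    cert = {"serial": uuid.uuid4().hex,
            "user": user_id,
            "crl_check_hours": crl_check_hours}
    consent_certs[user_id] = cert
    return cert

def revoke_consent(user_id):
    """Figure 2 flow: add the serial to the CRL, then drop the record."""
    cert = consent_certs.pop(user_id)
    crl.add(cert["serial"])

def is_revoked(cert):
    return cert["serial"] in crl

cert = issue_consent_cert("user-42")
assert not is_revoked(cert)
revoke_consent("user-42")
assert is_revoked(cert)
```

In production you would issue real X.509 certificates through a CA and publish a signed CRL, but the control flow is the same.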
Review the approach to consent and accountability with your potential data-sharing partner to make sure each party has the same level of compliance in place. Be prepared to say no: only enter into data-sharing agreements with partners that have an acceptable level of accountability. The partner must use the shared data only for the purpose for which the subscriber gave consent. If the partner wants to use the data for a different purpose, the service provider must first obtain the subscriber's consent.
Any framework is only as good as the processes of the organization that manages the data. Access categories and main responsibilities for the database must be defined, and a security clearance policy must be in place.
All private or sensitive data must be kept in a restricted database with separate login access (Figure 3). The rest of the data can be stored in an unrestricted database. The separation enables accountability, as only individuals with restricted access and proper training in handling personal data may work with such data. The data in the unrestricted database will have undergone anonymization or pseudonymization and thus cannot be correlated with personal identifiers. The data in the unrestricted database may also be kept indefinitely and need not be forgotten; it is enough to remove data from the restricted database, where personal identifiers can be correlated with other data.
Figure 3. Database setup overview
When reviewing what data to collect, it is important to inspect each attribute and determine whether there is a legitimate reason for collecting it. When adding new attributes, make sure the end user is informed by updating the privacy policy. It is tempting to use generic language in privacy policies to describe all attributes that might be collected, but transparency is the best policy and builds subscriber trust.
If consent is needed, it must be obtained as an opt-in, where the end user actively chooses to give consent. Although not all end users will choose to opt in, this creates a stronger incentive to explain clearly why giving consent is worthwhile. That explanation should be independent of the privacy policy so it can use easier-to-read language; the privacy policy then provides the full details.
As described in the database set-up, when data is ingested into the data warehouse, confidential data is separated from the rest of the data (Figure 4).
Figure 4. Data ingestion architecture overview
Based on consent and the rules for handling personal data, a data gateway filters and splits the raw data. The following rules can be applied.
Table 3 describes how personal data should be processed following the defined rules.
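As an illustration of the gateway's split, the sketch below routes identifying fields to the restricted database and pseudonymized attributes to the unrestricted one, dropping events without consent. The field names, the rule set, and the `pseudonymize` callback are assumptions for this sketch, not taken from Table 3.

```python
# Fields treated as personal identifiers in this hypothetical schema.
PERSONAL_FIELDS = {"user_id", "email", "ip_address", "location"}

def gateway(event, consent, pseudonymize):
    """Split one raw event into (restricted, unrestricted) records.

    Events from users without consent are dropped entirely.
    """
    if not consent.get(event["user_id"], False):
        return None, None
    restricted = {k: v for k, v in event.items() if k in PERSONAL_FIELDS}
    unrestricted = {k: v for k, v in event.items() if k not in PERSONAL_FIELDS}
    # The pseudonym is the only link back to the restricted record.
    unrestricted["pseudonym"] = pseudonymize(event["user_id"])
    return restricted, unrestricted

event = {"user_id": "u1", "email": "a@example.com", "page": "/home"}
r, u = gateway(event, {"u1": True}, lambda uid: "p-" + uid)
assert r == {"user_id": "u1", "email": "a@example.com"}
assert u == {"page": "/home", "pseudonym": "p-u1"}
```

The two outputs are then written to the restricted and unrestricted databases respectively, as in Figure 4.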
Analysis or work that uses confidential data is role-based. Only those with proper training may work with the data. Dashboards using the database must be restricted to white-listed computers.
If data is downloaded from the database for further analysis, the work must be done on a white-listed computer, and the data must not be transferred to any other computer. A better solution may be to use a computer instance in the cloud and set up a Jupyter instance there. This limits concerns about the computer being hacked, stolen, or lost, and removes the need for additional routines to encrypt white-listed computers.
When the subscriber revokes consent, their data in the conditional access database is removed, and the consent database is updated with new rules for data ingestion. The same routine applies when the subscriber asks to be forgotten.
The following diagram shows the routine for sharing a dataset with a partner. The service provider prepares the agreed-upon dataset, and the partner ingests it into a secure database (Figure 5). Depending on how often the dataset is shared with the partner, it may or may not be necessary to set up a consent verification routine. The consent verification routine checks whether individual consent certificates have been revoked. If consent has been revoked, the partner updates the database.
Figure 5. External partner data sharing architecture overview
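A partner-side consent verification routine could look like the following sketch: on each scheduled run it fetches the provider's CRL and purges data for any subscriber whose consent certificate has been revoked. The function names and the shape of the partner database are illustrative assumptions.

```python
def verify_consent(partner_db, fetch_crl):
    """Purge records whose consent certificate serial appears on the CRL.

    fetch_crl is assumed to return the set of revoked serial numbers,
    e.g. from the service provider's published CRL.
    """
    revoked = fetch_crl()
    for user_id, record in list(partner_db.items()):
        if record["cert_serial"] in revoked:
            del partner_db[user_id]  # purge the revoked subscriber's data

db = {"u1": {"cert_serial": "s1", "data": [1, 2]},
      "u2": {"cert_serial": "s2", "data": [3]}}
verify_consent(db, lambda: {"s1"})
assert "u1" not in db and "u2" in db
```

Running this at the frequency stated in the consent certificate keeps the partner accountable without manual intervention.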
The pseudonymization framework provides a means of continuing to collect personal data in compliance with GDPR without having to drop the confidential data that enables continued data science within the organization. The goal of the framework is to provide:
The rules defined for data ingestion are supported by Spotless Data, a utility that helps developers prepare data before it is ingested into a data warehouse.