The EU's new General Data Protection Regulation (GDPR), which takes effect on May 25, might make it tempting to remove all personally identifiable information (PII) from your data pipeline. Deleting everything would undoubtedly provide full compliance with GDPR, but it would also significantly frustrate data science efforts across the business and diminish the value of the data pipeline. We argue for an alternative approach to cleansing your data pipeline for GDPR compliance, one that still allows data science to make use of the data collected.
The major cloud platforms (Amazon Web Services, Google Cloud Platform, and Microsoft Azure) and, for that matter, open-source data warehouses like Hadoop do not provide a framework for handling personal data in the data pipeline. This article aims to fill that gap, describing an approach that can be applied on any of these platforms. The framework will give you a way to continue with data science once GDPR has come into effect.
Under GDPR, there are two categories of confidential data. The first identifies people as individuals, and the second describes their characteristics (Table 1). Personal characteristics are sensitive and can be used to discriminate against people, violating their fundamental EU rights.
Even though an IP address or geographical location may seem harmless, they are considered confidential data because they can reveal your physical address. A VPN can hide your IP address, but many users do not know about VPNs and cannot be expected to buy such a service.
Personal identifiers are matched with other information to create a profile of an individual. This profile can then be used for multiple purposes. For example, when you perform a Google search, the search text can be used to target website advertisements.
As a company storing data on individuals, you should take great care in considering not only which personal identifiers but also which personal characteristics you need to collect. Base your decision on the legitimate interest rule as defined in GDPR.
GDPR has an extensive list of rules for handling data, but we can summarize them into five different areas.
Anonymization is a radical step for dealing with confidential data, as it removes any ability to correlate personal identifiers with other data. For example, if an ID is converted to a random string of numbers, that string cannot remain the same over time if the data is to be truly anonymized; it should change according to the time the user was active. If it stays the same, a single data point about the individual is enough to re-identify them and their associated data.
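To make this concrete, here is a minimal sketch of time-varying anonymization. It assumes one random salt per activity period that is discarded once the period closes; the function and variable names are illustrative, not part of the article's framework.

```python
import hashlib
import secrets

# One random salt per activity period; discarding this map after each
# period makes tokens from different periods impossible to link.
_period_salts = {}

def anonymize(user_id: str, period: str) -> str:
    """Return a token that is stable within a period but changes across periods."""
    salt = _period_salts.setdefault(period, secrets.token_hex(16))
    return hashlib.sha256((salt + user_id).encode()).hexdigest()

# Same user, same period -> same token; a new period yields a new token,
# so one known data point no longer re-identifies the whole history.
a = anonymize("user-42", "2018-05")
b = anonymize("user-42", "2018-05")
c = anonymize("user-42", "2018-06")
assert a == b and a != c
```

Because the salts are never persisted, nothing in the unrestricted data can be traced back to the original identifier.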
The focus of this article is pseudonymization, a possibility described in GDPR that gives service providers more flexibility than complete anonymization. In pseudonymization, another attribute is created to link personal identifiers to the anonymized identifiers, such that data can be de-anonymized later.
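A simple sketch of pseudonymization might use a keyed hash plus a lookup table that would live in the restricted database. The secret key, table, and function names below are assumptions for illustration only.

```python
import hmac
import hashlib

# Assumption: this key is managed as a secret in the restricted environment.
SECRET_KEY = b"keep-this-in-the-restricted-store"

# The linking attribute: pseudonym -> original identifier. This table must
# itself be stored in the restricted database, as it enables de-anonymization.
pseudonym_to_id = {}

def pseudonymize(user_id: str) -> str:
    """Produce a stable pseudonym and record the link for later reversal."""
    token = hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()
    pseudonym_to_id[token] = user_id
    return token

def de_anonymize(token: str) -> str:
    """Controlled reversal, only possible with access to the restricted table."""
    return pseudonym_to_id[token]

token = pseudonymize("user-42")
assert de_anonymize(token) == "user-42"
```

Deleting a user's entry from the lookup table effectively converts their pseudonymized records into anonymized ones.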
There are three areas to consider. The first describes essential points for deciding how to handle user consent, the second covers database setup, and the third gives guidance on processing confidential data.
In setting up your GDPR-compliant framework for managing data, you must consider how to handle consent for collecting data from end users. This will vary depending on whether you will share data with partners or customers. GDPR requires accountability from all parties who access the data.
The most straightforward strategy is to create a consent database for each subscriber. For accountability, you log when consent is given and when it is withdrawn. You further develop a routine that purges an end user's personal data when consent is withdrawn. If you have shared data with third parties, you need to send them a list of users to purge on a regular basis, ideally as soon as possible after a user withdraws consent.
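A minimal sketch of such a consent store, with an audit log and a purge routine, might look like the following. The class and field names are illustrative assumptions, not a real API.

```python
from datetime import datetime, timezone

class ConsentStore:
    """Hypothetical per-subscriber consent store with an audit log."""

    def __init__(self):
        self.consent = {}  # user_id -> bool
        self.log = []      # (timestamp, user_id, event) for accountability

    def give_consent(self, user_id):
        self.consent[user_id] = True
        self.log.append((datetime.now(timezone.utc), user_id, "given"))

    def withdraw_consent(self, user_id, personal_data, partner_purge_lists):
        self.consent[user_id] = False
        self.log.append((datetime.now(timezone.utc), user_id, "withdrawn"))
        personal_data.pop(user_id, None)        # purge local personal data
        for purge_list in partner_purge_lists.values():
            purge_list.append(user_id)          # queue purge for third parties

store = ConsentStore()
data = {"user-42": {"email": "a@example.com"}}
partners = {"partner-a": []}
store.give_consent("user-42")
store.withdraw_consent("user-42", data, partners)
assert "user-42" not in data
assert partners["partner-a"] == ["user-42"]
```

In practice the purge lists would be delivered to partners on a schedule, as discussed above.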
Under GDPR, you must not only ensure that the data in your warehouse is removed upon user request, but also that any data you have shared with others is deleted. Simply sending an external party a list informing them that a user has revoked access is not enough. You need automatic routines that ensure proper compliance and full accountability of all parties in case of an audit.
To achieve automatic routines, you should create a digital consent agreement (that is, a contract) for each subscriber. The contract can then be used when managing consent with an external party. Such a contract offers multiple advantages because it:
A Certificate Authority (CA) or blockchain can be used to manage the consent agreement. When a subscriber consents to the gathering of personal data, you create an X.509 certificate (Figure 1). The certificate includes instructions on how often to check the Certificate Revocation List (CRL). The frequency of CRL checks is based on the terms of the agreement with the external partner.
Figure 1. Process for creating consent certificate
When the subscriber revokes consent, the certificate is added to the CRL and then removed from the database (Figure 2).
Figure 2. Process for revoking consent
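The issue-and-revoke flow in Figures 1 and 2 can be modeled in a few lines. This is a deliberately simplified stand-in for real X.509 machinery: certificates are plain records keyed by serial number, and the CRL is just a set of revoked serials that partners poll at the agreed frequency.

```python
import uuid

crl = set()        # Certificate Revocation List: revoked serial numbers
consent_certs = {} # user_id -> consent certificate record

def issue_consent_cert(user_id, crl_check_hours=24):
    """Issue a consent 'certificate'; crl_check_hours reflects the
    partner agreement on how often the CRL must be polled (assumption)."""
    cert = {"serial": uuid.uuid4().hex,
            "user": user_id,
            "crl_check_hours": crl_check_hours}
    consent_certs[user_id] = cert
    return cert

def revoke_consent(user_id):
    """Figure 2 flow: add the serial to the CRL, then drop the record."""
    cert = consent_certs.pop(user_id)
    crl.add(cert["serial"])

def is_revoked(cert):
    return cert["serial"] in crl

cert = issue_consent_cert("user-42")
assert not is_revoked(cert)
revoke_consent("user-42")
assert is_revoked(cert)
```

In production you would issue real X.509 certificates through a CA and publish a signed CRL, but the control flow is the same.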
Review the approach to consent and accountability with your potential data-sharing partner to make sure each party has the same level of compliance in place. Be prepared to say no: only enter into data-sharing agreements with partners that have an acceptable level of accountability. The partner must use the shared data only for the purpose for which the subscriber gave consent. If the partner wants to use the data for a different purpose, the service provider must first obtain the subscriber's consent.
Any framework is only as good as the processes of the organization that manages the data. Access categories and main responsibilities for the database must be defined, and a security clearance policy must be in place.
All private or sensitive data must be kept in a restricted database with separate login access (Figure 3). The rest of the data can be stored in an unrestricted database. The separation enables accountability, as only individuals with restricted access and proper training in handling personal data may work with such data. The data in the unrestricted database will have undergone anonymization or pseudonymization and thus cannot be correlated with personal identifiers. The data in the unrestricted database may also be kept indefinitely and need not be forgotten; it is enough to remove data from the restricted database, where personal identifiers can be correlated with other data.
Figure 3. Database setup overview
When reviewing what data to collect, it is important to inspect each attribute and determine whether there is a legitimate reason for collecting it. When adding new attributes, make sure the end user is informed by updating the privacy policy. It is tempting to use generic language in privacy policies to describe all attributes that might be collected, but transparency is the best policy and builds subscriber trust.
If consent is needed, it must be obtained as an opt-in, where the end user actively chooses to give consent. Although not all end users will choose to opt in, this creates a stronger incentive to explain clearly why giving consent is worthwhile. That explanation should be independent of the privacy policy so it can use easier-to-read language; the privacy policy then provides the full details.
As described in the database set-up, when data is ingested into the data warehouse, confidential data is separated from the rest of the data (Figure 4).
Figure 4. Data ingestion architecture overview
Based on consent and the rules for handling personal data, a data gateway filters and splits the raw data. The following rules can be applied.
Table 3 describes how personal data should be processed following the defined rules.
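As an illustration of the gateway's split, the sketch below routes identifying fields to the restricted database and pseudonymized attributes to the unrestricted one, dropping events without consent. The field names, the rule set, and the `pseudonymize` callback are assumptions for this sketch, not taken from Table 3.

```python
# Fields treated as personal identifiers in this hypothetical schema.
PERSONAL_FIELDS = {"user_id", "email", "ip_address", "location"}

def gateway(event, consent, pseudonymize):
    """Split one raw event into (restricted, unrestricted) records.

    Events from users without consent are dropped entirely.
    """
    if not consent.get(event["user_id"], False):
        return None, None
    restricted = {k: v for k, v in event.items() if k in PERSONAL_FIELDS}
    unrestricted = {k: v for k, v in event.items() if k not in PERSONAL_FIELDS}
    # The pseudonym is the only link back to the restricted record.
    unrestricted["pseudonym"] = pseudonymize(event["user_id"])
    return restricted, unrestricted

event = {"user_id": "u1", "email": "a@example.com", "page": "/home"}
r, u = gateway(event, {"u1": True}, lambda uid: "p-" + uid)
assert r == {"user_id": "u1", "email": "a@example.com"}
assert u == {"page": "/home", "pseudonym": "p-u1"}
```

The two outputs are then written to the restricted and unrestricted databases respectively, as in Figure 4.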
Analysis or work that uses confidential data is role-based. Only those with proper training may work with the data. Dashboards using the database must be restricted to white-listed computers.
If data is downloaded from the database for further analysis, the work must be done on a white-listed computer, and the data must not be transferred to any other computer. A better solution may be to use a computer instance in the cloud and set up a Jupyter instance there. This limits concerns about the computer being hacked, stolen, or lost, and removes the need for additional routines to encrypt white-listed computers.
When the subscriber revokes consent, their data in the conditional access database is removed, and the consent database is updated with new rules for data ingestion. The same routine applies when the subscriber asks to be forgotten.
The following diagram shows the routine for sharing a dataset with a partner. The service provider prepares the agreed-upon dataset, and the partner ingests it into a secure database (Figure 5). Depending on how often the dataset is shared with the partner, it may or may not be necessary to set up a consent verification routine. The consent verification routine checks whether individual consent certificates have been revoked. If consent has been revoked, the partner updates the database.
Figure 5. External partner data sharing architecture overview
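A partner-side consent verification routine could look like the following sketch: on each scheduled run it fetches the provider's CRL and purges data for any subscriber whose consent certificate has been revoked. The function names and the shape of the partner database are illustrative assumptions.

```python
def verify_consent(partner_db, fetch_crl):
    """Purge records whose consent certificate serial appears on the CRL.

    fetch_crl is assumed to return the set of revoked serial numbers,
    e.g. from the service provider's published CRL.
    """
    revoked = fetch_crl()
    for user_id, record in list(partner_db.items()):
        if record["cert_serial"] in revoked:
            del partner_db[user_id]  # purge the revoked subscriber's data

db = {"u1": {"cert_serial": "s1", "data": [1, 2]},
      "u2": {"cert_serial": "s2", "data": [3]}}
verify_consent(db, lambda: {"s1"})
assert "u1" not in db and "u2" in db
```

Running this at the frequency stated in the consent certificate keeps the partner accountable without manual intervention.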
The pseudonymization framework provides a means of continuing to collect personal data in compliance with GDPR without having to drop the confidential data that enables continued data science within the organization. The goal of the framework is to provide:
The rules defined for data ingestion are supported by Spotless Data, a utility that helps developers prepare data before it is ingested into a data warehouse.