Privacy by design strategies in AI/ML systems

technical paper

The phrase “privacy by design” is now commonplace in the data protection world, but not necessarily in the context of data science. Can data-protection-friendly technology design even succeed when artificial intelligence (AI) or machine learning (ML) is used? A proposal based on Hoepman’s approach.

What does GDPR require?

The principle of data protection by design (Privacy by Design) is established in Article 25 GDPR. Its purpose is to ensure that data protection is taken into account as early as the planning and design phase, i.e. long before the actual data collection takes place.

The principle is not unlimited. Article 25(1) GDPR allows the controller to take into account the state of the art, the cost of implementation and the varying likelihood and severity of risks when weighing up measures. However, implementation costs must not serve as an excuse for accepting high risks.

Important: Article 25 GDPR is not addressed to manufacturers, presumably because the legislator assumed there are sufficient economic incentives to offer legally compliant products and services.

Specification through considerations

Recital 78 lists possible measures by way of example, but not exhaustively:

  • data minimization,
  • pseudonymization at the earliest possible time,
  • transparency with regard to the functions and processing of personal data,
  • enabling the data subject to monitor the processing of their personal data,
  • creating and improving security features.

Furthermore, recital 78 points out that the controller must define internal strategies and take measures.

More specific requirements?

The law does not set more specific requirements for the controller. There is also no single “correct” approach by which the principle can be achieved, nor a fixed set of safeguards that must be taken. It always depends on the circumstances of the individual case.

In other words: complex and high-risk data processing requires more data protection effort than trivial, low-risk IT systems. When determining the “risk”, the focus should not be on the data processing itself, but on the negative consequences the processing may have for a data subject. If a data analysis using AI/ML leads to an employee being dismissed, it is not primarily the data processing that must be evaluated, but the consequence of that processing.

Why is this important?

A violation of Article 25 GDPR may be subject to fines pursuant to Article 83(4)(a) GDPR. The housing company Deutsche Wohnen SE had to grapple with this question in 2019: according to the data protection authority in Berlin, the company had for years processed tenant data in a system that did not allow the data to be deleted.

The supervisory authority saw this, among other things, as a violation of the “Privacy by Design” principle in Article 25(1) GDPR. It justified this on the grounds that Deutsche Wohnen SE could have taken appropriate technical and organizational measures “at the time the means of processing were determined and at the time of the actual processing”.

The Hoepman approach

Jaap-Henk Hoepman is a professor working on privacy-enhancing technologies in the Digital Security group at the Institute for Computing and Information Sciences at Radboud University Nijmegen.

As early as 2014, at the 29th IFIP International Information Security and Privacy Conference, Hoepman presented a concrete set of data protection strategies that can help IT architects and developers integrate data protection early in the software development life cycle.

Hoepman’s approach is both a procedure and a framework: it establishes certain mechanisms, but allows for a plurality of implementations within them. It is characterized by its focus on eight functional, proactive strategies and by proposing concrete “design patterns” for each of them. With this, Hoepman does not merely stake out the boundaries, but points in an active direction.

Concretely, Hoepman proposes the following eight strategies:

  1. Minimize
  2. Hide
  3. Separate
  4. Aggregate
  5. Inform
  6. Control
  7. Enforce
  8. Demonstrate

What do the 8 strategies mean?

Four of the eight strategies are briefly summarized below:

SEPARATE

If possible, personal data should be processed in a decentralized manner and stored separately. Complete profiles of individuals cannot be created if different sources of personal data relating to the same individual are processed and stored separately. Separation is also a useful technique in its own right for achieving purpose limitation. Decentralized processing, rather than central solutions, is crucial to the success of the separation concept. In particular, databases containing information from different sources should be kept separate.
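To make this concrete, here is a minimal, purely illustrative Python sketch of the SEPARATE idea: two hypothetical stores hold different attributes about the same person under purpose-specific pseudonyms, so the stores cannot simply be joined on a shared identifier. The purpose names, keys and attributes are all assumptions made for this example.

```python
# Illustrative sketch of the SEPARATE strategy: data from different sources is
# kept in separate stores, keyed by purpose-specific pseudonyms, so a full
# profile cannot be built by joining the stores on a common identifier.
import hashlib
import hmac

# Hypothetical per-purpose secret keys, e.g. held by different services.
PURPOSE_KEYS = {
    "billing": b"billing-secret",
    "analytics": b"analytics-secret",
}

def purpose_pseudonym(user_id: str, purpose: str) -> str:
    """Pseudonym that is stable within one purpose but differs across purposes."""
    return hmac.new(PURPOSE_KEYS[purpose], user_id.encode(), hashlib.sha256).hexdigest()

# Separate stores: each holds only the attributes needed for its purpose.
billing_store = {purpose_pseudonym("alice", "billing"): {"iban": "DE89..."}}
analytics_store = {purpose_pseudonym("alice", "analytics"): {"page_views": 12}}

# The same person appears under unlinkable keys in the two stores.
print(list(billing_store) != list(analytics_store))  # True
```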

AGGREGATE

Personal data should be processed at the highest possible level of aggregation and with the least possible detail at which it is still useful.

When information about groups of people or groups of attributes is aggregated, the amount of detail remaining in the personal data is limited, and the information becomes less revealing about any single individual.

If the information is coarse enough and the group from which it is collected is large enough, it is more difficult to link it to a single person. This protects that person’s “privacy”.
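A minimal sketch of what this can look like in practice follows; the age bands and the minimum group size are illustrative assumptions, not fixed legal thresholds.

```python
# Illustrative sketch of the AGGREGATE strategy: report only coarse age bands,
# and suppress any band that contains fewer than MIN_GROUP_SIZE people.
from collections import Counter

MIN_GROUP_SIZE = 5  # assumed threshold for this example

ages = [23, 27, 31, 34, 35, 36, 38, 41, 44, 67]

def age_band(age: int) -> str:
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

counts = Counter(age_band(a) for a in ages)
released = {band: n for band, n in counts.items() if n >= MIN_GROUP_SIZE}
print(released)  # only bands large enough to hide individuals are released
```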

HIDE

All personal data, and their relationships to each other, should be hidden from plain view.

The rationale behind this strategy is that personal data cannot be misused as easily if it is hidden from plain view. The strategy does not specify from whom the information must be hidden; the answer always depends on the individual case. When the strategy is used to hide information that arises from the way a system is used (e.g. communication patterns), the goal is to hide the information from everyone.
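One common way to hide data from plain view is encryption at rest. The following minimal sketch uses the third-party Python package "cryptography"; the key handling and the record content are simplified assumptions for illustration only.

```python
# Illustrative sketch of the HIDE strategy: records are encrypted before being
# stored, so what lands on disk is not readable in plain view.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice this would live in a key management service
box = Fernet(key)

record = b'{"name": "Alice", "diagnosis": "..."}'
stored = box.encrypt(record)  # ciphertext suitable for storage or transport

# Only components holding the key can recover the plaintext.
assert box.decrypt(stored) == record
```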

MINIMIZE

As a proactive element of data-protection-friendly technology design, this strategy means that, in principle, no more data may be processed than is necessary to achieve the purpose. This may mean, for example, preferring plain automated processing over automated decision-making.

It is also possible to decide that no information about a particular data subject is collected at all. Alternatively, only a limited number of attributes can be collected.
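The "limited number of attributes" variant can be as simple as an explicit whitelist applied before any data is stored or fed into a model. The record fields and feature names in this sketch are hypothetical.

```python
# Illustrative sketch of the MINIMIZE strategy: keep only the attributes that
# are actually needed for the stated purpose and drop direct identifiers up front.
RAW_RECORD = {
    "name": "Alice Example",       # direct identifier - not needed for the model
    "email": "alice@example.org",  # direct identifier - not needed for the model
    "age": 34,
    "postcode": "10115",
    "contract_type": "standard",
}

# Whitelist approach: anything not explicitly required is never stored.
REQUIRED_FEATURES = ("age", "contract_type")

def minimize(record: dict) -> dict:
    return {k: record[k] for k in REQUIRED_FEATURES}

print(minimize(RAW_RECORD))  # {'age': 34, 'contract_type': 'standard'}
```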

Transferring to AI/ML systems?

A concept from 2014? Pretty old hat? That this is not the case is shown by how readily the approach transfers to AI/ML systems.

After all, the decisive factor in privacy by design is not endless documentation produced in the dark, but above all the operational implementation of data protection. In other words, what the law requires must be translated into a technically functional and controllable form. Otherwise, the principles of the GDPR wither into purely legal aspirations that trail behind technical developments in a hare-and-hedgehog race.

Transferred to AI/ML systems, the strategies can mean, for example:

  • Federated learning, data-centric approaches or secure multiparty computation (SMPC) could serve the goals of “separation” and aggregation. SMPC is a protocol that allows two or more parties to jointly analyze their combined data without sharing all of the raw data with each other. In other words, an encrypted solution for distributed training of AI/ML systems which, for cost reasons, is probably not yet suitable for mass adoption (a minimal federated-learning sketch follows after this list).
  • Mechanisms such as “differential privacy” (subsampling and noise) and/or the use of synthetic data could be suitable for the “hide” strategy, i.e. limiting visibility (see the noise sketch after this list).
  • As is well known, the principle of data minimization can be addressed through anonymization and pseudonymization, but also by clearly separating the data used in the learning phase from the data used in production.
  • The CONTROL strategy could stand for greater controllability of an AI/ML system, e.g. through patch and retraining requirements or, in case of doubt, through an AI kill switch (see the last sketch after this list).
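For the first bullet, the core idea of federated learning can be sketched without any ML framework: clients train locally on their own data and only share model parameters, never the raw data. The toy one-parameter model, the learning rate and the client data below are all assumptions for illustration.

```python
# Minimal, framework-free sketch of federated averaging (FedAvg): each client
# runs a local training step and only the resulting weights are averaged on
# the server; the raw training data never leaves the client.

def local_update(weights, local_data, lr=0.1):
    """One gradient-descent step for a toy 1-parameter linear model y = w * x."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return [w - lr * grad]

def federated_round(global_weights, client_datasets):
    """Each client trains locally; only the weights are averaged centrally."""
    client_weights = [local_update(global_weights, data) for data in client_datasets]
    return [sum(ws) / len(client_weights) for ws in zip(*client_weights)]

# Two clients with private data that never leaves their premises.
clients = [[(1.0, 2.1), (2.0, 3.9)], [(1.5, 3.0), (3.0, 6.2)]]
weights = [0.0]
for _ in range(50):
    weights = federated_round(weights, clients)
print(weights)  # approaches w ≈ 2, learned without pooling the raw data
```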
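For the second bullet, the following sketch shows output perturbation in the spirit of differential privacy: a counting query is answered with calibrated Laplace noise instead of the exact value. The epsilon value and the population data are illustrative assumptions; a production system would use a vetted DP library rather than this toy implementation.

```python
# Illustrative sketch of a differentially private counting query: the true
# count is perturbed with Laplace noise whose scale depends on epsilon.
import random

def noisy_count(values, epsilon=1.0):
    """Counting query (sensitivity 1) answered with Laplace(1/epsilon) noise."""
    true_count = sum(values)
    # The difference of two exponentials with rate epsilon is Laplace-distributed
    # with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# 100 hypothetical individuals, 37 of whom share some sensitive attribute.
population = [i < 37 for i in range(100)]
print(noisy_count(population, epsilon=0.5))  # close to 37, but deliberately not exact
```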
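For the last bullet, a "kill switch" and a retraining requirement can be as simple as gating every prediction call behind operator-controlled flags. The flag source, the model age policy and the placeholder prediction below are hypothetical.

```python
# Illustrative sketch of an operational control around a deployed model:
# serving is blocked when the kill switch is engaged or when the model has
# exceeded its maximum allowed age without retraining.
from datetime import datetime, timedelta

MODEL_TRAINED_AT = datetime.now() - timedelta(days=30)  # illustrative metadata
MAX_MODEL_AGE = timedelta(days=180)                     # assumed retraining policy
KILL_SWITCH_ENGAGED = False                             # e.g. read from a config service

def predict(features: dict) -> str:
    if KILL_SWITCH_ENGAGED:
        raise RuntimeError("Model disabled by operator (kill switch engaged).")
    if datetime.now() - MODEL_TRAINED_AT > MAX_MODEL_AGE:
        raise RuntimeError("Model exceeded its maximum age; retraining required.")
    return "some prediction"  # placeholder for the real model call

print(predict({"age": 34}))
```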

Even when using AI/ML systems, data subjects must not be lost from sight. Privacy by Design in general, and Hoepman’s approach in particular, offer data engineers, data architects and data scientists tangible guidelines for setting up the data infrastructure and data use of AI/ML systems in a way that ensures this.
