Data anonymisation and pseudonymisation

Your research data may contain personal data as defined in the General Data Protection Regulation (GDPR). The GDPR defines personal data as “any information relating to an identified or identifiable natural person”, where a person (the data subject) is identifiable if they can be identified directly or indirectly. Since the definition includes the term ‘any information’, the term ‘personal data’ should be interpreted as broadly as possible. You should consider your entire data collection to be ‘personal data’ if any of the data in the collection are identifiable.

On this best practice page, we explain the legal basis for sharing and storing personal data, the differences between directly identifiable, pseudonymised and anonymised data, and how to pseudonymise and anonymise your research data.

Research data versus administrative data

A distinction is made between personal data collected to answer the research question (for instance birth date, fMRI data and blood hormone levels) and personal data collected for administrative purposes (for example contact details).

Data, including personal data, that are required to answer a research question must be stored for at least ten years according to Radboud University’s research data management policy. For medical studies on human participants, this period can be longer according to the norms of the NFU (the Dutch Federation of University Medical Centres); see the NFU guidelines for data preservation of research involving human subjects. For personal data that serve only administrative purposes (e.g. key files, see below), the preservation time might be different. To find out where and for how long administrative personal data should be stored, check the policy at your institute.

Directly identifiable, pseudonymised and anonymised data

Directly identifiable data are all data that allow identification of a natural person without effort. This applies, for example, to video, photo, audio and MRI data, and to any data that include the names, addresses, phone numbers, IP addresses, etc. of research participants.

De-identifying data (pseudonymisation or anonymisation) is the process of removing identifiers that lead to the natural person.

Whenever possible, you should pseudonymise your data. Pseudonymised data are personal data that allow identification of a specific person only indirectly. This means that the data can no longer be attributed to a specific data subject without the use of additional information. This additional information is usually a key file in which the pseudonymised data are linked to the directly identifiable personal data (e.g. a subject number is linked to administrative data such as participant name and contact details). For example, blood glucose level measurements can be pseudonymised by assigning a code or number to each research participant. The glucose level measurements can then be stored with that code (e.g. participant A: 120 mg/dL; participant B: 160 mg/dL, etc.), while the information linking that code to a specific participant is stored in the key file (e.g. participant A: Jan de Boer; participant B: Sara de Jong, etc.).
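The glucose example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (name, phone, glucose_mg_dl), the "sub-" code format and the function pseudonymise are hypothetical, and participants are assumed to be uniquely identified by their name. In practice the key file would be written to a separate, access-restricted location rather than kept in memory.

```python
import secrets

# Assumed names of the columns that directly identify a participant.
DIRECT_IDENTIFIERS = ("name", "phone")


def pseudonymise(records, id_field="name"):
    """Return (pseudonymised records, key file) for a list of dict records."""
    key_file = {}  # code -> direct identifiers; must be stored separately
    code_of = {}   # person -> code, so repeated measurements share one code
    data = []
    for record in records:
        person = record[id_field]
        if person not in code_of:
            # Random codes carry no information about the participant.
            code = "sub-" + secrets.token_hex(4)
            while code in key_file:  # regenerate in the unlikely case of a clash
                code = "sub-" + secrets.token_hex(4)
            code_of[person] = code
            key_file[code] = {f: record[f] for f in DIRECT_IDENTIFIERS if f in record}
        # Keep only the research variables and attach the code.
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row["participant"] = code_of[person]
        data.append(row)
    return data, key_file


# Fictional example records (phone numbers are placeholders).
records = [
    {"name": "Jan de Boer", "phone": "000-0000001", "glucose_mg_dl": 120},
    {"name": "Sara de Jong", "phone": "000-0000002", "glucose_mg_dl": 160},
]
data, key_file = pseudonymise(records)
```

Here data holds only the codes and measurements, while key_file holds the link back to the participants. As long as key_file exists anywhere, the data remain pseudonymised rather than anonymised.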

When it is possible to remove the additional information and fully anonymise your research data without losing scientific value (in the glucose example this would mean destroying the key file linking participants to their names), you should only store the anonymised data. Anonymised data are no longer considered personal data according to the GDPR, so they can be freely shared and stored. In the context of the GDPR, data are considered anonymised when re-identification of research participants is impossible or would require a disproportionate effort, even for you as the principal researcher. This means that if you still have a key file, even if it is stored in a separate location that only you can access, your data are only pseudonymised, not anonymised.

How to anonymise or pseudonymise your data

To pseudonymise or anonymise your data, remove all direct identifiers such as names, dates of birth, addresses, and telephone numbers from the research data. Also remove indirect identifiers that are not essential for reusing the data, as well as indirect identifiers that carry a high disclosure risk, such as unusual characteristics or unusual findings.

A combination of indirect identifiers may lead to identification of a respondent; for instance, data on deaf people in a specific village may point to specific individuals. Consequently, in certain cases it is advisable to choose a higher aggregation level, such as province instead of village or town. Another example is the combination of age in days and test date, since together these disclose the respondent’s exact date of birth. In research concerning school classes, for example, participating children may be identified in this manner. In this case, you can record only the year or month of the test date, or only the age of the subject in years (and use codes for the schools instead of the names of the schools themselves). Also note that datasets that include occupations may result in identification of the respondents. 'Nurse' or 'teacher' may not be very revealing, but 'director of [company X]' or 'leader of [religious community Y]' is. Exact occupations may be replaced by occupational groups using the International Standard Classification of Occupations (ISCO).
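The generalisation steps described above can be sketched as follows. This is a minimal illustration, not a complete disclosure-risk assessment: the field names, the PROVINCE_OF lookup table with its placeholder place names, and the helper functions generalise and rare_combinations are all hypothetical, and the conversion from days to years is approximate (it ignores leap days).

```python
from collections import Counter
from datetime import date

# Hypothetical lookup from village to a higher aggregation level.
PROVINCE_OF = {"Village A": "Province X", "Village B": "Province X"}


def generalise(record):
    """Coarsen indirect identifiers: month instead of date, years instead of days."""
    return {
        "participant": record["participant"],
        "test_month": record["test_date"].strftime("%Y-%m"),  # drop the day
        "age_years": record["age_days"] // 365,               # approximate age
        "region": PROVINCE_OF[record["village"]],             # province, not village
    }


def rare_combinations(rows, fields, k=2):
    """Return combinations of indirect identifiers shared by fewer than k rows."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n < k]


# Fictional records.
raw = [
    {"participant": "sub-01", "test_date": date(2023, 3, 14), "age_days": 3000, "village": "Village A"},
    {"participant": "sub-02", "test_date": date(2023, 3, 20), "age_days": 3100, "village": "Village B"},
    {"participant": "sub-03", "test_date": date(2023, 5, 2), "age_days": 3700, "village": "Village A"},
]
rows = [generalise(r) for r in raw]
risky = rare_combinations(rows, ("test_month", "age_years", "region"))
```

In this example the first two participants end up with the same combination of month, age in years and region, while the third remains unique and is returned in risky. A combination that occurs only once is a signal to aggregate further or to drop a variable; the threshold k is a judgment call and is not prescribed by the text above.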

More information can be found on the Research Data Management webpage of Radboud University.

Data Acquisition Collections, Research Documentation Collections, and Data Sharing Collections

The information above applies to both internal data collections (Data Acquisition Collections, DACs, and Research Documentation Collections, RDCs) and public data collections (Data Sharing Collections, DSCs). However, if you want to publicly share pseudonymised or directly identifiable personal data via a DSC, there must be a valid reason to do so (e.g. added scientific value), and participants must have approved of public sharing in the informed consent form. You must be extra careful when selecting an access level for your DSC if your data have been acquired from human research participants. Anonymised data can be publicly shared in an Open access DSC, pseudonymised data in an Open access for registered users DSC, and directly identifiable data in a Restricted access DSC. Note that the metadata of published DSCs and, optionally, of archived DACs and RDCs are made public, including the list of your collection's files. Therefore, do not include any personal data or other sensitive information in the metadata, documentation files, or file and folder names. Always contact your data steward before publicly sharing personal data via a DSC.
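The mapping from type of data to DSC access level can be written down as a small lookup, for instance as a check in a data-management script. This is only a sketch of the rules stated above: the function dsc_access_level and its parameters are hypothetical and not part of any repository software, and the check does not replace consulting your data steward.

```python
# Access levels as named in the text above.
ACCESS_LEVEL = {
    "anonymised": "Open access",
    "pseudonymised": "Open access for registered users",
    "directly identifiable": "Restricted access",
}


def dsc_access_level(data_type, consent_for_public_sharing=False):
    """Return the DSC access level for a data type, or raise if sharing is not allowed."""
    if data_type not in ACCESS_LEVEL:
        raise ValueError(f"unknown data type: {data_type!r}")
    # Personal data may only be shared if participants consented to public sharing.
    if data_type != "anonymised" and not consent_for_public_sharing:
        raise ValueError("personal data require consent for public sharing")
    return ACCESS_LEVEL[data_type]


level = dsc_access_level("pseudonymised", consent_for_public_sharing=True)
```

The function only encodes the consent requirement; whether there is a valid reason for sharing, such as added scientific value, still has to be assessed by the researcher.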