Risk Assessment for De-Identified Individual Patient Data

De-identifying Individual Patient Data (IPD) so that it can be shared with researchers, or with the public in general, is becoming increasingly common. EMA has published Policy 0070, which will make it mandatory to share clinical documents and IPD on a public forum. So far, EMA has published only phase 1 of the policy, which mandates publishing anonymized clinical documents on its website; the phase 2 guidelines, which are yet to be published, will cover public disclosure of IPD. The policy is intended to

  • avoid duplication of clinical trials, foster innovation and encourage development of new medicines;
  • build public trust and confidence in EMA’s scientific and decision-making processes;
  • enable academics and researchers to re-assess clinical data.

De-identification of IPD is the process of rendering data into a form that does not identify individuals and in which identification is not likely to take place. The first step in de-identification is to select the direct identifiers and quasi-identifiers in a study.

Direct identifiers are variables that can, on their own, identify subjects within the data set. In the clinical trial scenario, these are all types of IDs (such as Subject ID, SAE ID, sample numbers, etc.). Direct identifiers are either removed, masked or re-coded, and so do not contribute to the risk of re-identification.

Quasi-identifiers are variables that can help identify a subject within the data set when used in combination with other quasi-identifiers. These include age, sex and country, as well as dates, unique events, etc. Quasi-identifiers are subsequently used to calculate the risk of re-identification of the data set. EMA says: “The most appropriate way to measure the risk of re-identification for an entire dataset, in the context of public disclosure, is through the maximum risk, which corresponds to the maximum probability of re-identification across all records.”

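For example, suppose a study has two quasi-identifiers, sex and age, with the values shown in Table 1. Records that share the same combination of quasi-identifier values form an equivalence class, and the re-identification (re-id) risk for each record is simply 1 divided by the size of its equivalence class.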

Table 1: Dataset with equivalence class sizes and subject level risk of re-id

USUBJID   SEX   AGE   Equiv. Class (Size)   Re-Id Risk
CT1/101   M     26    A(3)                  0.33
CT1/102   F     26    B(2)                  0.5
CT1/103   M     26    A(3)                  0.33
CT1/104   F     26    B(2)                  0.5
CT1/105   F     29    C(2)                  0.5
CT1/106   F     28    D(2)                  0.5
CT1/107   M     26    A(3)                  0.33
CT1/108   F     27    E(1)                  1
CT1/109   F     28    D(2)                  0.5
CT1/110   F     29    C(2)                  0.5
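
The record-level risks in Table 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the function name record_risks and the tuple layout (USUBJID, SEX, AGE) are assumptions made for illustration, not part of any EMA guidance or tool.

```python
from collections import Counter

# Records from Table 1 as (USUBJID, SEX, AGE); the layout is an assumption.
records = [
    ("CT1/101", "M", 26),
    ("CT1/102", "F", 26),
    ("CT1/103", "M", 26),
    ("CT1/104", "F", 26),
    ("CT1/105", "F", 29),
    ("CT1/106", "F", 28),
    ("CT1/107", "M", 26),
    ("CT1/108", "F", 27),
    ("CT1/109", "F", 28),
    ("CT1/110", "F", 29),
]


def record_risks(rows):
    """Return {subject id: 1 / size of that subject's equivalence class}."""
    # An equivalence class is the set of records sharing the same
    # combination of quasi-identifier values, here (sex, age).
    class_sizes = Counter((sex, age) for _, sex, age in rows)
    return {subj: 1 / class_sizes[(sex, age)] for subj, sex, age in rows}


risks = record_risks(records)
for subj, risk in risks.items():
    print(f"{subj}  {risk:.2f}")
```

Subject CT1/108 is the only 27-year-old female, so she sits alone in her equivalence class and gets a risk of 1.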

The calculation of the overall risk for a dataset depends on the context of the data release. For public release, it is advisable to use the maximum risk. Hence, the overall risk for the data in Table 1 is the maximum of the risks across all records, i.e. 1. If this overall risk is greater than the threshold value, de-identification techniques should be applied to the quasi-identifier variables, sex and age, to reduce it. For example, age can be generalized into 5-year age categories.
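
As a rough illustration of the maximum-risk calculation and of generalizing age into 5-year categories, the sketch below recomputes the overall risk for the Table 1 data before and after generalization. The helper names max_risk and age_band and the THRESHOLD value are hypothetical and chosen purely for illustration; the text above does not state a threshold.

```python
from collections import Counter

# (SEX, AGE) pairs in the order they appear in Table 1.
rows = [("M", 26), ("F", 26), ("M", 26), ("F", 26), ("F", 29),
        ("F", 28), ("M", 26), ("F", 27), ("F", 28), ("F", 29)]

# Hypothetical threshold for illustration only; not taken from the source.
THRESHOLD = 0.5


def max_risk(quasi_ids):
    """Maximum record-level risk: 1 / size of the smallest equivalence class."""
    class_sizes = Counter(quasi_ids)
    return 1 / min(class_sizes.values())


def age_band(age, width=5):
    """Generalize an exact age to a band such as '25-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


before = max_risk(rows)
after = max_risk([(sex, age_band(age)) for sex, age in rows])

print(f"max risk before generalization: {before:.2f}")
print(f"max risk after generalization:  {after:.2f}")
print("still above threshold" if after > THRESHOLD else "within threshold")
```

In this small example every age falls into the 25-29 band, so the data collapse into two equivalence classes (3 males, 7 females) and the maximum risk drops from 1 to 1/3. Whether that is low enough depends on the threshold actually chosen for the release.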

Link to study material: http://www.lexjansen.com/phuse/2016/dh/DH09.pdf