Abstract:
Protecting patient confidentiality in rare disease research is especially challenging because small population sizes increase the risk of re-identification through quasi-identifiers. This study presents a comparative evaluation of three anonymization techniques (k-anonymity, differential privacy, and pseudonymization) applied to a fully synthetic dataset of 10,000 rare disease patients, calibrated using real epidemiological distributions. Each method was assessed along three dimensions: confidentiality protection, measured by residual re-identification risk, normalized certainty penalty (NCP), average equivalence class size (AECS), and the ε–δ guarantee; analytical utility, measured by the impact on descriptive statistics and on logistic regression AUC; and computational efficiency, measured by execution time and RAM usage. Differential privacy (ε = 1.0) achieved the lowest re-identification risk (< 0.1%) with a negligible loss of utility (ΔAUC ≈ 0), making it suitable for open data dissemination. K-anonymity (k = 5) reduced risk to 2% while introducing moderate information loss (NCP ≈ 0.12), offering a compromise where interpretability is prioritized. Pseudonymization preserved full analytical utility at minimal processing cost, but on its own remained insufficient under GDPR because pseudonymized records can be re-linked. A hybrid anonymization framework is proposed: pseudonymization for internal operations and longitudinal tracking, k-anonymity for interpretable analysis, and differential privacy for public dissemination. This integrated approach ensures compliance with GDPR while preserving analytical usability in rare disease research.
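
To make the three techniques concrete, the following is a minimal illustrative sketch in Python, not the implementation used in this study. The toy records, the function names, the generalization rules (10-year age bands, 2-character postcode prefixes), and the definition of worst-case risk as one over the smallest equivalence class size are all assumptions introduced for illustration. The noise sampler uses ordinary floating-point arithmetic and is not hardened for production use.

```python
import hashlib
import hmac
import random
from collections import Counter

# Toy records (hypothetical): (patient_id, age, postcode, has_condition)
RECORDS = [
    ("p01", 34, "75011", 1),
    ("p02", 36, "75012", 0),
    ("p03", 38, "75013", 1),
    ("p04", 52, "69001", 0),
    ("p05", 57, "69002", 1),
    ("p06", 55, "69003", 1),
]


def pseudonymize(patient_id, key):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Anyone holding the key can recompute the mapping, which is why
    pseudonymized data can still be re-linked.
    """
    return hmac.new(key, patient_id.encode(), hashlib.sha256).hexdigest()[:16]


def generalize(age, postcode):
    """Coarsen quasi-identifiers: 10-year age band, 2-character postcode prefix."""
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", postcode[:2] + "***")


def equivalence_classes(records):
    """Count records sharing the same generalized quasi-identifiers."""
    return Counter(generalize(age, pc) for _, age, pc, _ in records)


def is_k_anonymous(records, k):
    """True if every equivalence class contains at least k records."""
    return min(equivalence_classes(records).values()) >= k


def worst_case_risk(records):
    """Assumed risk measure: 1 / size of the smallest equivalence class."""
    return 1 / min(equivalence_classes(records).values())


def average_class_size(records):
    """Average equivalence class size: records divided by number of classes."""
    return len(records) / len(equivalence_classes(records))


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(records, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1, scale 1/epsilon)."""
    true_count = sum(r[3] for r in records)
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    print("pseudonym:", pseudonymize("p01", b"secret-key"))
    print("5-anonymous:", is_k_anonymous(RECORDS, 5))
    print("worst-case risk:", worst_case_risk(RECORDS))
    print("noisy count (epsilon = 1.0):", dp_count(RECORDS, 1.0, rng))
```

In this toy example the six records fall into two equivalence classes of three records each, so the table is 3-anonymous but not 5-anonymous. Reaching k = 5 would require coarser generalization or suppression, which is the source of the information loss that NCP quantifies.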