The widespread availability of medical data has created numerous opportunities to advance biomedical research, paving the way for P4 (Predictive, Preventive, Personalized, and Participatory) medicine built on recent advances in modern AI technologies such as deep learning. To accelerate the adoption of P4 medicine and realize its full potential, clinical and research data on large numbers of individuals must be shared efficiently among all stakeholders. However, sharing and computing on sensitive medical information raises serious privacy concerns. These concerns can be addressed with effective privacy-preserving technologies that enable privacy-conscious sharing and processing of medical data.
Our research focuses on innovations in security, privacy, and AI within the medical context to enable clinical and genomic data to be shared and exploited across a federation of medical institutions, hospitals, and research laboratories in a scalable, secure, responsible, and privacy-conscious manner. It aims to address the main scalability, privacy, security, and ethical challenges of data sharing for effective P4 medicine by defining an optimal balance between usability, scalability, and data protection, and by deploying an appropriate set of computing tools to achieve that balance.
Here are some themes and techniques that we are currently working on:
Privacy-Preserving Similar Patient Queries: As genomic data accumulates, it becomes increasingly feasible to find similar patients by comparing their data. This is important for understanding how similar patients respond to different treatments and how demographic characteristics affect the course of disease and treatment. The Global Alliance for Genomics and Health (GA4GH) has developed tools such as the Beacon Network and Matchmaker Exchange that enable federated discovery and sharing of genomic data. Because biomedical data types are highly sensitive, such queries must be run in a way that protects individual privacy. Many studies have addressed this problem, but most support only private queries over genomic data, so federated or distributed query mechanisms that privately support a broader range of biomedical data types are needed. Moreover, previous work handles only modest numbers of patients, while larger cohorts yield more accurate results; these queries therefore need to run efficiently and privately on large cohorts.
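The following is a minimal sketch of the kind of private similarity query we have in mind, under simplifying assumptions: the querier additively secret-shares its 0/1 variant-indicator vector between two non-colluding servers that each hold the hospital cohort in the clear, so neither server sees the query and only similarity scores are reconstructed. Panel size, modulus, and the two-server model are illustrative choices, not the full federated protocol described above.

```python
import numpy as np

P = 2_147_483_647          # prime modulus for additive secret sharing (assumed)
N_VARIANTS = 1_000         # size of an assumed common variant panel

rng = np.random.default_rng(0)

# Hospital cohort: rows are patients, columns are 0/1 variant indicators.
cohort = rng.integers(0, 2, size=(50, N_VARIANTS), dtype=np.int64)

# Querier's patient and its two additive shares.
query = rng.integers(0, 2, size=N_VARIANTS, dtype=np.int64)
share1 = rng.integers(0, P, size=N_VARIANTS, dtype=np.int64)
share2 = (query - share1) % P

# Each server computes inner products with its share only.
partial1 = (cohort @ share1) % P
partial2 = (cohort @ share2) % P

# The querier recombines the partial results into shared-variant counts.
similarity = (partial1 + partial2) % P
print("Most similar patients:", np.argsort(similarity)[::-1][:5])
assert np.array_equal(similarity, cohort @ query)
```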
Privacy-Preserving Record Linkage: Combining data from different sources, such as genomic and clinical data, is often necessary to uncover new relationships between diseases and medical indications. In medical applications, privacy-preserving record linkage (PPRL) is increasingly needed to combine patient-related data from different sources for analysis while preserving individual privacy. Numerous privacy-preserving methods have been proposed for PPRL, most of which are based on Bloom filters that allow error-tolerant linkage of hashed data; these methods, however, are vulnerable to frequency and dictionary attacks. We plan to propose a more secure and efficient PPRL method that can match thousands of patients in a reasonable time compared to other methods in the literature. Our method uses multi-party computation (MPC) instead of a trusted third party to keep the Bloom filters secret, building on our own MPC framework with protocols designed specifically for PPRL to outperform existing frameworks.
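For illustration, here is a minimal sketch of the Bloom-filter encoding commonly used in PPRL: quasi-identifiers are split into character bigrams, each bigram is hashed with several keyed hash functions into a fixed-length filter, and two records are compared with the Dice coefficient. In our planned protocol the filters would stay secret-shared inside MPC; here they are compared in the clear, and the parameter and key names are illustrative assumptions.

```python
import hashlib

L_BITS = 256                         # Bloom filter length (assumed)
K_HASH = 4                           # hash functions per bigram (assumed)
SECRET = b"site-specific-key"        # assumed shared keyed-hash secret

def bigrams(text: str):
    text = f"_{text.lower()}_"
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bloom_encode(*fields: str) -> set:
    # Set of bit positions that would be 1 in the Bloom filter.
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            for k in range(K_HASH):
                digest = hashlib.sha256(SECRET + bytes([k]) + gram.encode()).digest()
                bits.add(int.from_bytes(digest, "big") % L_BITS)
    return bits

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

rec_a = bloom_encode("Jonathan", "Smith", "1980-02-01")
rec_b = bloom_encode("Jonathon", "Smith", "1980-02-01")   # typo variant
print(f"Dice similarity: {dice(rec_a, rec_b):.2f}")        # close to 1 -> likely match
```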
Privacy-Preserving Genome-Wide Association Study (GWAS): GWAS identifies genetic variants statistically associated with phenotypes of interest. Including a large number of individuals in the analysis is crucial for finding rare variants, which usually have small but important effects. Principal Component Analysis (PCA) is a common approach for modeling population structure in GWAS, but the Linear Mixed Model (LMM) is now widely considered superior. Previously proposed privacy-preserving GWAS methods focus on PCA-based GWAS. We plan to develop a privacy-preserving method for fast LMM-based GWAS that scales linearly with cohort size in both runtime and memory use. We will develop MPC-friendly algorithms for the LMM and exploit the parallel execution of thousands of multiplications and additions in our three-party MPC framework.
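As a point of reference, the sketch below shows (in plaintext) the per-variant test of a FaST-LMM-style analysis: the kinship matrix is eigendecomposed once, phenotype and genotypes are rotated, and each variant reduces to a weighted least-squares fit. These rotations and dot products are the vectorized multiplications/additions we would express in MPC. The variance ratio `delta` is fixed here for illustration; a real LMM estimates it by REML, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 500                                      # samples, variants (toy sizes)
G = rng.integers(0, 3, size=(n, m)).astype(float)    # genotype dosages 0/1/2
y = rng.normal(size=n)                               # phenotype

K = (G - G.mean(0)) @ (G - G.mean(0)).T / m          # simple GRM as kinship
s, U = np.linalg.eigh(K)                             # K = U diag(s) U^T
delta = 1.0                                          # assumed variance ratio

w = 1.0 / (delta * s + 1.0)                          # per-sample weights after rotation
y_rot = U.T @ y
G_rot = U.T @ G

# Weighted least squares per variant (no covariates, for brevity).
num = G_rot.T @ (w * y_rot)                          # sum_i w_i g_i y_i per variant
den = (G_rot ** 2).T @ w                             # sum_i w_i g_i^2 per variant
beta = num / den                                     # effect-size estimates
resid_var = ((w * y_rot ** 2).sum() - num * beta) / (n - 1)
z = beta / np.sqrt(resid_var / den)                  # approximate association statistics
top = int(np.argmax(np.abs(z)))
print("Top variant:", top, "z =", float(z[top]))
```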
Three-Party Secure Computation Framework for Privacy-Preserving Machine Learning: Most PPML studies in the literature have focused on CNNs among the family of neural-network algorithms. Applying privacy-preserving machine learning techniques to RNNs, RKNs, LSTMs, and GANs raises important new problems, and several ML-specific issues must be addressed directly to make PPML practical; we want to tackle these problems and support these architectures efficiently. We plan to develop a more efficient framework based on the mixed use of MPC, randomized encoding, and homomorphic encryption (HE). Functions such as comparison, or widely used activations like ReLU and Sigmoid that require extracting the most significant bit (MSB) in a privacy-preserving manner, need the Boolean world, while functions such as addition and dot product are more efficient in the arithmetic domain. ML algorithms involve a mix of operations alternating between these two worlds, and as recent work has shown, mixed-world computation is orders of magnitude more efficient than the best MPC techniques operating in either world alone. Furthermore, using randomized encoding in an MPC setting yields significant performance gains by reducing the round complexity of the MPC protocols.
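To make the arithmetic side concrete, here is a toy, semi-honest sketch of three-party replicated secret sharing over Z_2^64: each value is split into three additive shares, party i holds a pair of shares, addition is entirely local, and multiplication needs only locally computable cross terms followed by a re-sharing round (elided here, the shares are simply recombined). Comparison and ReLU, by contrast, need the MSB of the shared value, which is exactly where Boolean sub-protocols and arithmetic-to-Boolean conversions come in. This is illustrative code, not our framework's API.

```python
import secrets

MOD = 1 << 64

def share(x):
    x1, x2 = secrets.randbelow(MOD), secrets.randbelow(MOD)
    return (x1, x2, (x - x1 - x2) % MOD)

def reconstruct(shares):
    return sum(shares) % MOD

def add(xs, ys):
    # Purely local: each additive share is added component-wise.
    return tuple((a + b) % MOD for a, b in zip(xs, ys))

def mul(xs, ys):
    # Party i holds (x_i, x_{i+1}) and (y_i, y_{i+1}); the three local
    # cross-term sums form additive shares of x*y. A real protocol would
    # re-randomize and re-share them in one communication round.
    out = []
    for i in range(3):
        j = (i + 1) % 3
        out.append((xs[i] * ys[i] + xs[i] * ys[j] + xs[j] * ys[i]) % MOD)
    return tuple(out)

a, b = share(12), share(34)
assert reconstruct(add(a, b)) == 46
assert reconstruct(mul(a, b)) == 12 * 34
```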
Privacy-Preserving Performance Evaluation of Collaborative Machine Learning Models: ML techniques typically require substantial computing power, leading clients with limited infrastructure to rely on Secure Outsourced Computation (SOC), in which the computation is outsourced to specialized and powerful cloud servers. Outsourcing is also efficient for training an ML model on data pooled from different sources. Some clients may not have enough test samples to measure the performance of a model trained on pooled data and therefore need test data from other clients, so methods are needed to measure model performance on pooled test data without sacrificing privacy. It is equally important to enable architectures with constraints on data sharing and communication, such as data analytics trains and federated architectures, to measure model performance in a privacy-preserving manner.
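A minimal sketch of one possible approach, under assumptions we introduce here for illustration: each client evaluates the common model on its local test set, builds a local confusion-matrix count vector (TP, FP, FN, TN), and additively secret-shares it between two non-colluding aggregators. The aggregators only ever see masked vectors, and recombining their sums reveals just the pooled counts, from which global metrics are derived.

```python
import secrets

MOD = 1 << 32

def local_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return [tp, fp, fn, tn]

def share(vec):
    s1 = [secrets.randbelow(MOD) for _ in vec]
    s2 = [(v - r) % MOD for v, r in zip(vec, s1)]
    return s1, s2

# Three clients with small local test sets (toy labels and predictions).
clients = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),
    ([0, 0, 1], [0, 1, 1]),
    ([1, 1, 0, 0, 1], [1, 1, 0, 1, 1]),
]

agg1, agg2 = [0, 0, 0, 0], [0, 0, 0, 0]
for y_true, y_pred in clients:
    s1, s2 = share(local_counts(y_true, y_pred))
    agg1 = [(a + s) % MOD for a, s in zip(agg1, s1)]
    agg2 = [(a + s) % MOD for a, s in zip(agg2, s2)]

tp, fp, fn, tn = [(a + b) % MOD for a, b in zip(agg1, agg2)]
print("pooled accuracy:", (tp + tn) / (tp + fp + fn + tn))
```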
Privacy-Preserving Explainable Machine Learning: Privacy and transparency are key elements of trustworthy machine learning. Model explanations provide insight into a model's decisions on input data, but because they rely on individual data points or features, they can expose personal information about the users whose data was used to train the model and thus pose significant privacy risks to the training set. In the first part of this research, we plan to analyze whether an adversary can exploit model explanations to infer sensitive information about the model's training set. We will investigate this using membership inference attacks, i.e., inferring whether a data point belongs to the training set of a model given its explanations, and we will design reconstruction attacks against example-based explanations to recover significant parts of the training set. In the second part, we plan to design privacy-preserving methods for computing model explanations that protect the sensitive information contained in the explanations and the training data, for example by obfuscating or aggregating information about other users. Despite its importance, research in this area is still at an early stage, with many unanswered questions.
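The toy sketch below illustrates the mechanics of such an attack under assumptions of our own: a small logistic-regression model is trained on synthetic data, the "explanation" of a prediction is taken to be the gradient of the loss with respect to the input, and the attacker thresholds the explanation norm (members of an overfit model's training set tend to have smaller gradients). Data, model, and threshold rule are illustrative, and attack success depends heavily on how much the model overfits.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    y = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(float)
    return X, y

X_train, y_train = sample(40)        # small training set -> model overfits
X_out, y_out = sample(40)            # non-members from the same distribution

# Train logistic regression by gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

def explanation_norm(x, y):
    p = 1 / (1 + np.exp(-(x @ w)))
    return np.linalg.norm((p - y) * w)   # gradient of the loss w.r.t. the input x

member_scores = [explanation_norm(x, y) for x, y in zip(X_train, y_train)]
nonmember_scores = [explanation_norm(x, y) for x, y in zip(X_out, y_out)]

# Attack rule: small explanation norm -> guess "member".
threshold = np.median(member_scores + nonmember_scores)
guesses = [s < threshold for s in member_scores + nonmember_scores]
truth = [True] * len(member_scores) + [False] * len(nonmember_scores)
print("attack accuracy on this toy run:",
      float(np.mean([g == t for g, t in zip(guesses, truth)])))
```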
Secure and Private Federated Learning: Federated learning is a machine learning setting in which many clients collaboratively train a model under the coordination of a central service provider. During training, the data remains decentralized and only model updates are exchanged, mitigating many privacy risks of centralized machine learning. Many studies address the remaining privacy issues of federated learning using secure multiparty computation (MPC), homomorphic encryption (HE), or trusted execution environments (TEE); in these solutions, however, a semi-honest server can still observe the aggregated models. Federated learning solutions based on differential privacy (DP) also exist, but they struggle to preserve utility on high-dimensional datasets, which makes applying DP-based federated learning to medical data, with its typically large number of features, very challenging. We plan to propose a privacy-preserving federated learning framework that combines TEE and MPC, following the privacy-in-depth principle: the privacy-sensitive parts of the federated learning algorithm, expressed as an MPC protocol, run inside the TEE. The TEE also provides security against a malicious server, because each client can verify through remote attestation that its encrypted local model update will be decrypted and aggregated only inside the TEE.
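As a minimal illustration of secure aggregation (one building block of such a design), the sketch below uses pairwise additive masks: each pair of clients agrees on a random mask vector, one adds it and the other subtracts it from their update, so all masks cancel when the server sums the submissions and only the aggregate update is revealed. In the design described above, this aggregation logic (or its MPC equivalent) would run inside a TEE verified via remote attestation; the toy code shows only the cancellation idea, with no dropout handling or attestation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_clients, dim = 4, 8
updates = [rng.normal(size=dim) for _ in range(n_clients)]   # local model updates

# Pairwise masks: client i adds mask_ij, client j subtracts it (i < j).
masked = [u.copy() for u in updates]
for i in range(n_clients):
    for j in range(i + 1, n_clients):
        mask = rng.normal(size=dim)      # in practice derived from a shared key
        masked[i] += mask
        masked[j] -= mask

# The server sums masked updates; the pairwise masks cancel exactly.
aggregate = sum(masked)
assert np.allclose(aggregate, sum(updates))
print("average update:", aggregate / n_clients)
```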