The widespread availability of medical data has created many opportunities to advance biomedical research. Together with major advances in modern AI technologies such as deep learning, it is paving the way for P4 (Predictive, Preventive, Personalized and Participatory) medicine. To accelerate the adoption of P4 medicine and maximize its potential, clinical and research data on large numbers of individuals must be shared efficiently among all stakeholders. However, sharing and computing on sensitive medical information raises serious concerns about individuals’ privacy. This barrier can be overcome with new approaches to the trustworthy processing of medical information; in particular, effective privacy-preserving technologies can enable privacy-conscious medical data sharing.

Our research focuses on innovations in security, privacy and AI in the medical context, with the goal of enabling clinical and genomic data to be shared and exploited across a federation of medical institutions, hospitals and research laboratories in a scalable, secure, responsible and privacy-conscious way. It addresses the main scalability, privacy, security and ethical challenges of data sharing for effective P4 medicine by seeking a suitable balance between usability, scalability and data protection, and by deploying an appropriate set of computing tools to achieve it.

Here are some themes and techniques that we currently work on:

Privacy-preserving Similar Patient Queries

As the volume of genome data grows, it has become easier to find similar patients by comparing patients’ data with one another. Such comparisons matter because they show how similar patients respond to different treatments and how demographic characteristics affect the course of a disease and its treatment. The Global Alliance for Genomics and Health (GA4GH) has developed tools such as Beacon Network and MatchMaker Exchange that enable federated discovery and sharing of genomic data. Because biomedical data is privacy-sensitive, these queries should be run in a way that protects the privacy of individuals. Many studies have addressed this problem, but most of them support private queries on genome data only. It is therefore important to propose private federated or distributed query mechanisms that support various biomedical data types. In addition, previous approaches work efficiently only on a limited number of patient records. Since results become more accurate as the amount of patient data grows, these queries must run efficiently and privately on large cohorts.

Privacy-preserving Record Linkage

It is often necessary to combine data from different sources, such as genomic and clinical data, to find new relationships between diseases and medical indications. In medical applications, privacy-preserving record linkage (PPRL) is increasingly in demand for combining patient-related data from different sources for analysis while preserving the privacy of individuals. A number of PPRL methods have been proposed, and most of them are based on Bloom filters, which allow error-tolerant linkage of hashed data. However, these methods are vulnerable to frequency attacks and dictionary attacks. We plan to propose a more secure and efficient PPRL method that can match thousands of patients in a reasonable time compared to other methods in the literature. Instead of relying on a trusted third party, our method uses multi-party computation (MPC) to keep the Bloom filters secret. We will propose and use our own security framework for MPC, with the aim of outperforming existing frameworks by designing efficient protocols tailored to PPRL.
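To make the Bloom-filter approach concrete, the following is a minimal sketch of the generic encoding commonly described in the PPRL literature, not the method proposed here. It assumes character bigrams, a keyed SHA-256 hash, and hypothetical parameters (`size`, `num_hashes`, `secret`); the function names are illustrative only.

```python
import hashlib

def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value, size=512, num_hashes=4, secret=b"hypothetical-shared-key"):
    """Return the set of bit positions that `value` sets in a Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Keyed hash: the secret and the hash index k are mixed into every digest.
            digest = hashlib.sha256(secret + bytes([k]) + gram.encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions

def dice_similarity(a, b):
    """Dice coefficient of two Bloom filters given as sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

smith = bloom_encode("Smith")
smyth = bloom_encode("Smyth")
jones = bloom_encode("Jones")
print(dice_similarity(smith, smyth))  # the two spellings share most of their bigrams
print(dice_similarity(smith, jones))  # these names share no bigrams
```

Because the encoding is deterministic, identical values always produce identical filters. This is what makes error-tolerant matching possible, and it is also why frequency and dictionary attacks apply when the filters themselves are disclosed.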

Privacy-preserving Genome-wide Association Study (GWAS)

GWAS is used to identify genetic variants that are statistically associated with phenotypes of interest. Including a large number of individuals in the analysis is critical for finding rare variants, which usually have a small but important effect. Large cohorts also exhibit population structure, which must be modeled to avoid spurious associations. Principal Component Analysis (PCA) is one of the most common approaches for modeling population structure in GWAS, but the Linear Mixed Model (LMM) is now regarded by many as a superior model. Previously proposed privacy-preserving GWAS methods focus on PCA-based GWAS. We plan to develop a privacy-preserving method for fast LMM-based GWAS that scales linearly with cohort size in both run time and memory use. We will develop MPC-friendly algorithms for LMM and exploit the parallel execution of thousands of multiplications and additions in our three-party MPC framework.
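The additions and multiplications mentioned above can be illustrated with a textbook construction. The sketch below uses additive secret sharing among three parties and a Beaver triple supplied by a trusted dealer; it is an assumption made for illustration and does not describe the framework planned here. The ring size `MOD` and all function names are hypothetical, and the three parties are simulated in a single process.

```python
import secrets

MOD = 2 ** 64  # hypothetical ring Z_{2^64}

def share(x, n=3):
    """Split x into n additive shares that sum to x modulo MOD."""
    parts = [secrets.randbelow(MOD) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % MOD)
    return parts

def reconstruct(parts):
    """Recombine additive shares into the secret."""
    return sum(parts) % MOD

def add(xs, ys):
    """Addition is local: each party adds the two shares it holds."""
    return [(a + b) % MOD for a, b in zip(xs, ys)]

def beaver_multiply(xs, ys):
    """Multiply two shared values using a Beaver triple (a, b, c = a*b)."""
    # Assumption: a trusted dealer generates and shares the triple.
    a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % MOD)
    # The parties open d = x - a and e = y - b; both are masked by random values.
    d = reconstruct([(x - s) % MOD for x, s in zip(xs, a_sh)])
    e = reconstruct([(y - s) % MOD for y, s in zip(ys, b_sh)])
    # x*y = d*e + d*b + e*a + c, so each party computes c_i + d*b_i + e*a_i locally.
    zs = [(c + d * bs + e * as_) % MOD for c, bs, as_ in zip(c_sh, b_sh, a_sh)]
    zs[0] = (zs[0] + d * e) % MOD  # the public term d*e is added by one party only
    return zs

print(reconstruct(add(share(5), share(7))))              # 12
print(reconstruct(beaver_multiply(share(6), share(7))))  # 42
```

Additions need no communication, while each multiplication needs one round to open `d` and `e`. Independent multiplications can therefore be batched into the same round, which is the kind of parallelism a linear-algebra-heavy method such as LMM can exploit.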

A Machine Learning Framework for Private Training

Among the classes of neural network (NN) algorithms, many studies on privacy-preserving machine learning (PPML) have focused only on convolutional neural networks (CNNs). Applying PPML techniques to RNNs, RKNs, LSTMs and GANs poses important new problems, and implementing PPML for these models requires solving some ML-related problems directly. We want to focus on these problems and apply PPML techniques to RNNs, RKNs, LSTMs and GANs efficiently. In this context, we plan to develop a more efficient framework based on the combined use of MPC, randomized encoding and homomorphic encryption (HE). Certain functions, such as comparison and widely used activation functions such as ReLU or Sigmoid, require extracting the most significant bit (MSB) in a privacy-preserving manner and therefore involve the Boolean world, whereas functions such as addition and dot product are more efficient in the arithmetic domain. ML algorithms involve a mix of these operations and constantly alternate between the two worlds. As some recent works have shown, mixed-world computation can be orders of magnitude more efficient than most of the current best MPC techniques, which operate in only one of the two worlds. Furthermore, using randomized encoding in an MPC setting yields performance gains by reducing the round complexity of MPC protocols.
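The link between ReLU and MSB extraction can be seen in a small plaintext sketch. It assumes a common fixed-point encoding of real numbers in the ring of integers modulo 2^64 with 16 fractional bits; these parameters and function names are hypothetical. Nothing here is secret-shared: the code only shows which operation is arithmetic (the dot product) and which depends on a single bit (ReLU), which a private protocol would have to compute on shares.

```python
RING_BITS = 64
MOD = 1 << RING_BITS
FRAC_BITS = 16  # hypothetical fixed-point precision

def encode(x):
    """Map a real number to a ring element (two's-complement fixed point)."""
    return round(x * (1 << FRAC_BITS)) % MOD

def to_signed(v):
    """Interpret a ring element as a signed integer."""
    return v - MOD if v >= MOD // 2 else v

def decode(v):
    """Map a ring element back to a real number."""
    return to_signed(v) / (1 << FRAC_BITS)

def msb(v):
    """Most significant bit: 1 for negative values, 0 otherwise."""
    return v >> (RING_BITS - 1)

def relu(v):
    """ReLU(x) = (1 - MSB(x)) * x, i.e. one bit extraction and one multiplication."""
    return ((1 - msb(v)) * v) % MOD

def dot(xs, ys):
    """Dot product in the arithmetic domain, truncated back to FRAC_BITS."""
    acc = sum(a * b for a, b in zip(xs, ys)) % MOD
    return (to_signed(acc) >> FRAC_BITS) % MOD

print(decode(relu(encode(-1.5))))  # 0.0
print(decode(relu(encode(2.25))))  # 2.25
print(decode(dot([encode(1.5), encode(-2.0)], [encode(2.0), encode(0.5)])))  # 2.0
```

The dot product uses only ring additions and multiplications, whereas `msb` is a bit-level operation. In a secret-shared setting this is the point at which a protocol would switch from arithmetic to Boolean sharing and back.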

Privacy-preserving Performance Evaluation of Collaborative Machine Learning Models

ML techniques typically require substantial computing power, which leads clients with limited infrastructure to rely on Secure Outsourced Computation (SOC). In the SOC setting, the computation is outsourced to a set of specialized and powerful cloud servers. Outsourcing is also an efficient way of training an ML model on data pooled from different sources. In this setting, some clients may not have enough test samples to measure the performance of the model trained on the pooled data, and therefore need the test data of other clients. It is crucial to provide methods that measure the performance of ML models on pooled test data without sacrificing privacy. It is also of interest to enable architectures with constraints on data sharing and communication, such as data analytics trains and federated architectures, to measure the performance of ML models in a privacy-preserving manner.
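As one simple illustration of evaluating on pooled test data, the sketch below lets each client split its confusion-matrix counts into two additive shares, one per server, so that only the pooled counts are ever reconstructed. This is an assumed toy design with two non-colluding servers simulated in one process, not the method under development; `MOD`, the function names and the example counts are hypothetical.

```python
import secrets

MOD = 2 ** 61 - 1  # hypothetical modulus, much larger than any realistic count

def share2(value):
    """Split a count into two additive shares, one for each server."""
    r = secrets.randbelow(MOD)
    return r, (value - r) % MOD

def pooled_metrics(client_counts):
    """client_counts: list of (tp, fp, fn, tn) tuples, one per client."""
    server_a = [0, 0, 0, 0]
    server_b = [0, 0, 0, 0]
    for counts in client_counts:
        for i, count in enumerate(counts):
            a, b = share2(count)
            # Each server only ever sees uniformly random-looking shares.
            server_a[i] = (server_a[i] + a) % MOD
            server_b[i] = (server_b[i] + b) % MOD
    # Only the pooled totals are reconstructed, never a single client's counts.
    tp, fp, fn, tn = [(a + b) % MOD for a, b in zip(server_a, server_b)]
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up counts for two clients, for illustration only.
print(pooled_metrics([(8, 2, 1, 9), (3, 1, 2, 4)]))
```

Metrics that are functions of pooled counts fit this pattern directly. Ranking-based metrics such as the area under the ROC curve depend on how individual prediction scores are ordered across clients and so need more than summed counts.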

Privacy-preserving Explainable Machine Learning

Privacy and transparency are two key elements of trustworthy machine learning. Model explanations can provide more insight into a model’s decisions on input data. This, however, can pose a significant privacy risk to the model’s training set. Explanations must therefore satisfy specific privacy requirements; in particular, they must preserve the privacy of other users. Since explanations rely on a collection of individual data points or features, they can expose personal information of other users whose digital traces are used to train the inference models. In the first part of this research, we plan to analyze whether an adversary can exploit model explanations to infer sensitive information about the model’s training set. We will investigate this problem primarily using membership inference attacks, which infer whether a data point belongs to the training set of a model given its explanations. We will also design reconstruction attacks against example-based model explanations and use them to recover significant parts of the training set. In the second part, we plan to design privacy-preserving methods for computing model explanations that protect the sensitive information in the explanation and the training data. One possible solution is to use privacy-preserving techniques that obfuscate or aggregate the information about other users. Despite its importance, research in this area is at an early stage, and a large number of questions remain unanswered.
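A membership inference attack on explanations can be as simple as thresholding a statistic of the explanation vector. The sketch below assumes, as a working hypothesis rather than an established result of this project, that feature-attribution explanations of training points tend to have lower variance than those of unseen points. The threshold, the function names and the toy vectors are all hypothetical.

```python
from statistics import pvariance

def explanation_score(explanation):
    """Attack statistic: population variance of the attribution vector."""
    return pvariance(explanation)

def infer_membership(explanation, threshold):
    """Guess 'member' when the explanation variance is at or below the threshold."""
    return explanation_score(explanation) <= threshold

def attack_accuracy(explanations, is_member, threshold):
    """Fraction of correct guesses on points whose membership is known."""
    guesses = [infer_membership(e, threshold) for e in explanations]
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)

# Made-up attribution vectors, for illustration only.
toy_explanations = [[0.0, 0.0, 0.1], [1.0, -1.0, 0.5]]
toy_membership = [True, False]
print(attack_accuracy(toy_explanations, toy_membership, threshold=0.01))  # 1.0
```

In practice an adversary would calibrate the threshold on data with known membership, for example from shadow models, and whether the variance signal exists at all depends on the model and the explanation method.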

Secure and Private Federated Learning

Federated learning is a machine learning setting in which many clients train a model collaboratively under the coordination of a central service provider. During training, the data is kept decentralized and only minimal model updates are exchanged. Many privacy risks of centralized machine learning are thereby mitigated. Many studies address the privacy issues of federated learning using secure computation techniques such as secure multiparty computation (MPC), homomorphic encryption (HE) and trusted execution environments (TEEs). In these solutions, the semi-honest server can still observe the aggregated models. There are also federated learning solutions based on differential privacy (DP), but these struggle to preserve utility on high-dimensional datasets. Applying DP-based federated learning to medical data, which usually has a large number of features, is therefore very challenging. We plan to propose a privacy-preserving federated learning framework that uses both a TEE and MPC in order to apply the privacy-in-depth principle. In our solution, the privacy-sensitive parts of the federated learning algorithm, which are expressed as an MPC protocol, run inside the TEE. The TEE also provides security against a malicious server: through remote attestation, each client can verify that its encrypted local model update will be decrypted and aggregated only inside the TEE.
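The idea that a server should learn only the aggregate of the clients' updates can be sketched with pairwise masking, a well-known secure aggregation technique. It is used here purely as an illustration and is not the TEE-plus-MPC design described above. The sketch assumes integer-quantized updates, a hypothetical modulus `MOD`, and masks drawn in a single process; in a real protocol each pair of clients would derive its mask from a shared key.

```python
import secrets

MOD = 2 ** 32  # hypothetical modulus for quantized model updates

def masked_updates(updates):
    """Mask each client's update so that all masks cancel in the sum."""
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # Clients i and j agree on a random mask; i adds it, j subtracts it.
            mask = [secrets.randbelow(MOD) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] = (masked[i][k] + mask[k]) % MOD
                masked[j][k] = (masked[j][k] - mask[k]) % MOD
    return masked

def aggregate(masked):
    """Server side: sum the masked updates coordinate by coordinate."""
    return [sum(column) % MOD for column in zip(*masked)]

updates = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(aggregate(masked_updates(updates)))  # [12, 15, 18]
```

Each masked update looks random on its own, yet the masks cancel in the sum. The server still sees the aggregated model, which is the residual exposure that running the aggregation inside a TEE is meant to address.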