Privacy-Preserving AI for Regulated Data: What’s Actually Working


The problem nobody wants to talk about openly

Hospitals hold patient records. Banks hold transaction histories. Insurers hold some of the most sensitive information there is. And everyone wants to apply AI to it.

The catch is that training a machine learning model typically means centralizing the data. And centralizing all that sensitive data is exactly what regulations such as GDPR and HIPAA tell you not to do.

That is where privacy-preserving AI comes in. It is no longer a buzzword or a nice-to-have. For companies operating in regulated industries, it is fast becoming the only way forward.

Based on current research and deployments in healthcare, finance, and the public sector, this piece breaks down what is practically deployable today, what is still too immature to ship, and where the field is actually heading.

What privacy-preserving AI for regulated data actually means (and doesn’t)

Strip away the jargon and the question is simple: can you extract useful patterns from sensitive data without ever exposing the raw records?

Privacy-preserving AI for regulated data is the practice of designing ML pipelines so that sensitive information either never leaves its source, or only moves in a mathematically protected form: encrypted, noised, or split across multiple parties so that no single actor can reassemble it.

In regulated sectors this is not just an ethics question. GDPR requires security measures that reflect the state of the art. HIPAA requires safeguards for protected health information. These laws do not merely suggest being careful; they legally enforce it.

The biggest mistake people make is assuming this is a single technology. It isn't. It is a layered set of techniques, each with different trade-offs, that has to be matched to the threat model and regulatory context of a specific use case.

The tech that’s already in use and what I’ve actually seen work

Federated learning: the one with real traction

Federated learning (FL) is probably the most production-ready technique right now. Instead of pulling data to a central server, the model is sent to where the data lives. Each site trains locally, only the model updates are sent back to a central aggregator, and the aggregator combines them into a global model. Raw data never moves.

I have reviewed a number of healthcare deployments that use this approach: hospital networks training models on electronic health records and medical imaging without the records or images ever leaving the institution. Cross-hospital risk scoring and cross-bank fraud detection are currently the most common real-world applications, supported by frameworks such as Flower and algorithms such as FedAvg.
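To make the mechanics concrete, here is a minimal sketch of FedAvg-style aggregation in plain Python/NumPy (illustrative only, not the Flower API): each site trains locally and sends back only its weights, and the coordinator averages them weighted by local dataset size.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Combine per-site model weights into a global model.

    client_weights: one list of np.ndarray per site (one array per layer)
    client_sizes:   number of local training examples at each site
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    global_weights = []
    for layer in range(num_layers):
        # Weighted average of this layer across sites; raw data never moves,
        # only the trained parameters do.
        layer_avg = sum(
            w[layer] * (n / total) for w, n in zip(client_weights, client_sizes)
        )
        global_weights.append(layer_avg)
    return global_weights

# Toy round: three hospitals with different amounts of local data
site_updates = [[np.random.randn(4, 2)] for _ in range(3)]
site_sizes = [1200, 800, 300]
global_model = fedavg(site_updates, site_sizes)
```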

But FL alone is not enough. Gradients can leak information. Models can be reverse-engineered. That is why, in practice, it is almost always combined with something else.

Differential privacy: strong guarantees, real trade-offs

Differential privacy (DP) adds carefully calibrated random noise to model outputs or aggregates so that it becomes mathematically hard to tell whether any individual's data was included. Major platforms already use it for telemetry, and it is starting to appear in healthcare and finance, both for sharing private statistics and for training models.

The catch: there is a real accuracy cost. With tighter privacy budgets (smaller epsilon values), a model can lose a few percentage points of accuracy, and that matters in medical diagnosis. In the financial FL studies I reviewed, teams spent significant time tuning this trade-off, and it is not trivial.
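To see the trade-off concretely, here is a minimal sketch of the Laplace mechanism applied to a private mean over a bounded attribute (values and column are illustrative, not from any real dataset): a smaller epsilon means stronger privacy, more noise, and a less accurate released statistic.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon):
    """Release a differentially private mean of a bounded attribute."""
    values = np.clip(values, lower, upper)
    # Sensitivity of the mean of n values bounded in [lower, upper]
    # is (upper - lower) / n.
    sensitivity = (upper - lower) / len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = np.random.uniform(18, 90, size=10_000)  # stand-in for a patient attribute
for eps in (0.1, 1.0, 5.0):
    released = private_mean(ages, lower=18, upper=90, epsilon=eps)
    print(f"epsilon={eps}: true={ages.mean():.2f}, released={released:.2f}")
```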

Trusted execution environments: quietly underrated

Hardware-isolated enclaves (Intel SGX, ARM TrustZone) let you decrypt and process data only inside a protected memory region. Cloud providers already offer this actively, because it fits regulated ML workloads.

It's not glamorous. But it works, and it plays well with existing audit and key-management infrastructure. For organizations that outsource compute to cloud vendors while processing sensitive data, TEEs may be the most viable near-term option.

What’s still emerging and why it hasn’t shipped yet

Fully homomorphic encryption: the holy grail that keeps moving

Fully homomorphic encryption (FHE) lets you compute directly on encrypted data. The server running inference never sees plaintext; it returns encrypted results. For regulated data, that is the ideal scenario.
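As an illustration of computing on ciphertexts, here is a minimal sketch using the python-paillier library (`phe`), which is additively homomorphic rather than fully homomorphic; the linear risk model and values are assumptions for illustration. The server applies the model to encrypted features without ever being able to decrypt them.

```python
from phe import paillier  # pip install phe (python-paillier)

# Data owner's side: keys never leave the client.
public_key, private_key = paillier.generate_paillier_keypair()
features = [0.7, 1.2, 3.5]
encrypted_features = [public_key.encrypt(x) for x in features]

# Untrusted server's side: applies a linear risk model to ciphertexts only.
# Paillier supports adding ciphertexts and multiplying them by plaintext scalars.
weights = [0.4, -0.1, 0.25]
terms = [x * w for x, w in zip(encrypted_features, weights)]
encrypted_score = terms[0]
for term in terms[1:]:
    encrypted_score = encrypted_score + term

# Back on the data owner's side: only the key holder can read the result.
score = private_key.decrypt(encrypted_score)
print(f"risk score: {score:.3f}")
```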

The problem is performance. HE is practical when models are small and data are low-dimensional. For large deep models, or anything with low-latency requirements, the computational cost is still prohibitive. Comparative research makes the gap clear: HE-enhanced federated learning can match centralized training on accuracy, but at a cryptographic cost that makes real-time applications difficult.

Progress here is steady. Libraries such as Concrete ML and dedicated hardware acceleration are narrowing the gap. Hybrid designs, combining TEEs with DP or applying full HE to only a few layers, are becoming the pragmatic middle ground while full FHE matures.

Secure multi-party computation: powerful but specialist

MPC splits a computation across multiple parties so that no single party ever sees the complete picture; only the final output is revealed. It is already used in some finance and government settings for joint statistics, and for secure aggregation inside FL systems.
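A minimal sketch of the core building block, additive secret sharing over a prime field (illustrative, not a production MPC protocol): each institution splits its value into random shares, every compute party sees only meaningless shares, yet together they can reconstruct the joint sum and nothing more.

```python
import secrets

PRIME = 2**61 - 1  # field modulus for the shares

def share(value, n_parties):
    """Split an integer into n random additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two banks each hold a sensitive total; three compute parties hold only shares.
bank_totals = [1_250_000, 980_000]
per_bank_shares = [share(v, 3) for v in bank_totals]

# Each compute party locally adds the shares it received (one per bank) ...
party_sums = [sum(col) % PRIME for col in zip(*per_bank_shares)]

# ... and only the combined result is ever revealed.
assert reconstruct(party_sums) == sum(bank_totals)
print("joint total:", reconstruct(party_sums))
```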

But the engineering overhead and latency are high. For now it is a specialist tool for specific cross-institution analytics, not a default choice in the stack.

The LLM problem nobody’s fully solved

For large language models and generative AI, serious privacy preservation is still largely experimental. Early work is making DP-style training more efficient for LLMs and exploring FL combined with TEEs to fine-tune on regulated data without centralizing it.

There is currently no standard for privacy audits of LLMs. That is a big gap, and it is starting to draw regulators' attention. Articles such as "How AI Companies Are Finally Locking Down Their Models" reflect the urgency: organizations are now under pressure to offer concrete technical guarantees, not just policy statements.

The challenges that don’t get enough attention

Attacks don’t care about your intentions

Even privacy-preserving FL systems are vulnerable. Membership inference attacks can sometimes determine whether a particular individual's data appeared in a training set. Model inversion attacks can reconstruct sensitive features from a model's outputs. Gradient leakage can expose training data through FL updates.
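As a sense of how simple the baseline attack is, here is a minimal sketch of a loss-threshold membership inference test (the per-example losses are stand-ins; in practice they would come from evaluating the trained model on training versus held-out records): if losses on training examples are systematically lower, an attacker can guess membership better than chance.

```python
import numpy as np

def loss_threshold_attack(losses_members, losses_nonmembers):
    """Guess 'member' whenever the per-example loss is below a threshold,
    and report how often that guess is right (0.5 means no leakage)."""
    threshold = np.median(np.concatenate([losses_members, losses_nonmembers]))
    hits = (losses_members < threshold).sum() + (losses_nonmembers >= threshold).sum()
    return hits / (len(losses_members) + len(losses_nonmembers))

# Stand-in losses: training examples tend to have lower loss than held-out ones.
losses_members = np.random.exponential(scale=0.3, size=1000)
losses_nonmembers = np.random.exponential(scale=0.6, size=1000)
print(f"attack accuracy: {loss_threshold_attack(losses_members, losses_nonmembers):.2%}")
```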

Most deployed PET systems assume honest-but-curious adversaries. They do not fully account for insider threats, side channels, or linkage attacks across multiple datasets. Regulators are increasingly concerned about this gap, and research is finally catching up with proposals for standardized attack-simulation suites that organizations can run before deployment.

The regulatory acceptance gap

This is where things get practically difficult. Most PET research produces prototypes that work well in controlled environments. Getting those prototypes past a risk officer or a regulatory audit is a different matter entirely.

Risk officers need threat models they can understand. They need written privacy guarantees they can verify, stated in language they actually use. They need failure modes that can be explained, and the explanation cannot begin and end with an opaque epsilon-differential-privacy figure.

The notion of Confidential AI is gaining ground, both conceptually and in practice, as a framing that may bridge this gap: treating privacy as a verifiable, auditable property of an AI system rather than a design intention. That framing supports regulatory discussions far better than most technical documentation does.

Explainability vs. privacy: a real conflict

Privacy-enhancing techniques tend to make model behavior harder to interpret. DP noise obscures which factors drove a decision. FL means no single participant fully sees the training dynamics. Encrypted computation removes interpretability almost entirely.

Yet regulations such as GDPR also require that AI-driven decisions be explainable and contestable. These two requirements pull in opposite directions, and most deployed systems have not resolved the tension cleanly. It is one of the live research questions worth watching.

Where it’s all heading and what’s worth watching

Hybrid PET stacks are the near-term future

No single technique is winning; the most promising direction is a deliberate combination: FL to keep data local, DP for released aggregates, secure aggregation or TEEs on the server, and HE for specific high-sensitivity inference tasks.

Recent hybrid research lets high-resource clients use HE (no noise, higher accuracy) while low-resource clients use DP (less compute), and still achieves acceptable overall performance across heterogeneous networks. This kind of flexible stack is where feasible production systems are heading.
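To show what one hybrid step can look like in practice, here is a minimal DP-FedAvg-style sketch (parameters illustrative, not from any specific framework): each site's update is clipped to a fixed norm, the aggregator averages the clipped updates, and calibrated Gaussian noise is added so that no single site's contribution dominates or leaks cleanly.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale a site's update so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1):
    """Average clipped site updates and add Gaussian noise calibrated
    to the clipping bound (DP-FedAvg-style)."""
    clipped = [clip_update(u, clip_norm) for u in updates]
    mean_update = np.mean(clipped, axis=0)
    noise_std = noise_multiplier * clip_norm / len(updates)
    return mean_update + np.random.normal(0.0, noise_std, size=mean_update.shape)

site_updates = [np.random.randn(10) for _ in range(5)]  # stand-in per-site updates
global_step = dp_aggregate(site_updates)
```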

Privacy-preserving ML for complex data types

Most existing deployments work with structured, tabular data. The frontier is applying PETs to multimodal, high-dimensional data: genomics, medical imaging, multi-omics. Achieving both strong privacy and clinical interpretability in these areas is genuinely hard, and the research is in its infancy.

In finance, the roadmap includes production-grade FL across fragmented legacy systems, privacy-preserving graph learning for anti-money-laundering and KYC, and standardized privacy-risk metrics for credit and trading models.

End-to-end pipelines, not point solutions

Today, most PET deployments protect only one part of the pipeline, typically training or inference. Emerging research aims at systems that provide formal privacy guarantees across the entire ML workflow: data collection, feature engineering, training, inference, logging, and model sharing.

For organizations deciding today where in the stack to deploy which protections, resources like Architecting Confidential AI in the Cloud offer practical ways to think through that question, rather than bolting privacy on at the end.

A realistic path for teams working in regulated sectors

The literature agrees on one thing: start with the use case and the threat model, not the technology.

Before choosing a PET stack, be clear about: Who is the adversary? What can they see? Which regulatory obligations actually apply? How much accuracy loss is acceptable?

From there, the research suggests a gradual path:

Begin with smaller models on structured data. FL plus secure aggregation covers the vast majority of data-localization needs. Add DP where you need to protect released statistics against inference. Bring in TEEs or HE only when you need to outsource compute to untrusted environments.

Build privacy budgeting, logging, and attack simulation (for example, membership inference testing) into your MLOps pipeline gradually, not as an afterthought.
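For the budgeting piece, here is a minimal sketch of the idea (a hypothetical helper, not taken from any particular library): track the epsilon spent per data source across releases and refuse further releases once the budget is exhausted, so the pipeline enforces the limit rather than an analyst's memory.

```python
class PrivacyBudgetLedger:
    """Tracks cumulative epsilon spent per data source (simple composition)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = {}

    def charge(self, source, epsilon):
        new_total = self.spent.get(source, 0.0) + epsilon
        if new_total > self.total_epsilon:
            raise RuntimeError(f"privacy budget exhausted for {source!r}")
        self.spent[source] = new_total
        return new_total

# Hypothetical usage: two releases fit the budget, a third would be refused.
ledger = PrivacyBudgetLedger(total_epsilon=3.0)
ledger.charge("oncology_ehr", 1.0)   # monthly statistics release
ledger.charge("oncology_ehr", 1.5)   # model training run
```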

And critically: run this as a cross-functional program. Legal, compliance, security, and data science need to be at the table from the very start. The biggest failures in this space are not technical; they are governance failures where the technical team built something compliance could not approve.

My honest take after going through the research

The gap between what is theoretically possible and what ships in regulated environments is wide. Federated learning with DP and secure aggregation exists and works. TEEs are underrated. HE at scale and LLM privacy are still in their infancy.

What has changed over the past two years or so is that this is no longer a research curiosity. Major cloud providers now offer infrastructure for regulated ML. Frameworks such as Flower have built DP and secure aggregation in. The tooling is getting there.

But the harder part remains the governance, audit, and regulatory-acceptance side of the problem, and it is the part most badly overshadowed by the technical discussions.

If you want to start exploring this space, the CSCI 699 course materials (free, graduate-level, covering DP, FL, and attacks well) combined with hands-on experimentation in TensorFlow Privacy or PyTorch Opacus are currently the best entry point. The healthcare and financial-sector surveys (see the links below) are the most practically useful places to go next.
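For a first experiment, here is a minimal Opacus sketch; the interface has shifted between versions, so treat names like `make_private` and `get_epsilon` as indicative of Opacus 1.x rather than authoritative, and the toy model and data as placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
loader = DataLoader(data, batch_size=64)

# Wrap model, optimizer, and loader so training uses DP-SGD
# (per-example gradient clipping plus calibrated noise).
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,
    max_grad_norm=1.0,
)

criterion = nn.CrossEntropyLoss()
for features, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```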

Privacy-preserving AI for regulated data is not a solved problem. But it is real, it is moving fast, and ignoring it is no longer a viable option for anyone working with sensitive data.
