Module 6: Model Extraction and Inference Attacks
How adversaries steal proprietary models, reconstruct training data, infer membership, and exploit encrypted network traffic — and the defenses that resist them.
Model Extraction Fundamentals
A machine learning model is intellectual property. It encodes years of domain expertise, millions of labeled training examples, substantial compute investment, and proprietary architectural choices refined through exhaustive experimentation. Training GPT-4-class models reportedly costs tens to hundreds of millions of dollars in compute alone — a figure that does not capture the human expertise spent curating training data, tuning hyperparameters, or performing alignment work. Model extraction, also called model stealing, is the practice of cloning a proprietary model's behavior by querying its public-facing API and training a surrogate model on the query-response pairs. The result: an adversary can obtain a functional equivalent of the target model for a fraction of its original training cost.
Why Model Extraction Matters
The motivations for attacking a model's intellectual property are diverse and often compound. First, there is straightforward IP theft: a competitor can extract a production model and deploy it as their own product, bypassing the original developer's licensing fees, terms of service, and competitive moat. Second, a stolen surrogate model can serve as a stepping stone for further attacks. Generating adversarial examples is far easier when you have white-box access to model gradients. By extracting a surrogate first, an attacker converts a black-box problem into a white-box one, making downstream adversarial attacks orders of magnitude more effective. Third, extraction can be used to circumvent rate limits and cost controls: once you own a local copy, you can run inference without per-query charges or usage monitoring.
Finally, and perhaps most alarmingly, extraction enables privacy inference. A model trained on sensitive data (medical records, financial histories, private communications) may leak information about its training set even when accessed only through its API. Extraction gives the adversary a persistent local artifact to probe at leisure, without the audit trails that API providers maintain.
Exact Extraction vs. Functional Equivalence
Two distinct goals exist within model extraction. Exact extraction attempts to recover the precise weights and architecture of the target model, reproducing its behavior on every possible input — including corner cases. This is theoretically possible for simple model classes (e.g., small ReLU networks) where the number of queries needed to uniquely determine weights grows polynomially with model size, but it remains computationally intractable for modern billion-parameter LLMs.
Functional equivalence, by contrast, settles for a surrogate that matches the target model's behavior on a task-relevant input distribution. The surrogate need not share the target's architecture or weights; it only needs to produce similar predictions on the inputs the attacker cares about. This is the practically relevant threat for most commercial deployments and requires far fewer queries than exact extraction. Research has demonstrated functional equivalents of commercial NLP models achievable with a few hundred thousand API calls — well within the budget of a well-funded adversary. [Tramèr et al.]
- Attack Goal: Clone model behavior without access to weights or architecture.
- Attack Surface: Any public prediction API that returns labels or probability scores.
- Attacker Cost: API query fees + surrogate training compute (typically 100–10,000× cheaper than original).
- Victim Loss: Revenue, competitive advantage, downstream privacy risks for training subjects.
Query-Based Model Stealing
The canonical model stealing attack unfolds in three phases: systematic querying of the target API, accumulation of a query-response dataset, and training a surrogate model on that dataset. The simplicity of this pipeline belies its effectiveness. Modern research has shown that even a surrogate with a different architecture than the target can achieve near-identical task performance when trained on well-selected query-response pairs. [Tramèr et al.]
Query Strategy: Active Learning for Efficiency
A naive attacker might sample inputs uniformly at random. A sophisticated attacker uses active learning to select queries that maximize information gain. The core insight is that not all inputs are equally informative: points near the model's decision boundary carry far more information about the model's function than points firmly in one class region. Active learning heuristics (uncertainty sampling, query by committee, core-set selection) allow an attacker to build an accurate surrogate in significantly fewer queries — sometimes an order of magnitude fewer than uniform sampling.
The attacker begins with a seed set of unlabeled inputs, queries the target, and then selects the next query batch by asking: "which inputs, if labeled, would most reduce the surrogate's uncertainty?" The surrogate is retrained after each batch, and the process repeats. This closed loop is why model stealing can be devastatingly efficient even against APIs that return only hard labels (no probabilities): even binary membership information progressively constrains the surrogate.
Training the Surrogate
Once a sufficient set of (input, target_response) pairs has been accumulated, the attacker trains a surrogate model — which need not share the target's architecture — to minimize the loss on those pairs. Soft labels (probability distributions) are far more information-rich than hard labels: a prediction of [cat: 0.72, dog: 0.25, fox: 0.03] conveys the target model's confidence geometry near that input, whereas a hard label "cat" discards the relative scores. Where APIs return confidence scores, the attacker should use them as training targets (knowledge distillation). Where only hard labels are available, temperature scaling and label smoothing on the surrogate side partially compensate.
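The distillation objective can be sketched as a temperature-scaled cross-entropy against the target API's soft labels. The function and values below are illustrative, not part of this module's implementation:

```python
import numpy as np

def distillation_loss(surrogate_logits, target_probs, temperature=2.0):
    """Cross-entropy between the API's soft labels and the surrogate's
    temperature-softened predictions (knowledge distillation)."""
    z = surrogate_logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_q = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Average cross-entropy H(p, q) over the batch
    return float(-(target_probs * log_q).sum(axis=1).mean())

# Soft labels carry more training signal than a hard label alone:
p = np.array([[0.72, 0.25, 0.03]])          # target API's probability vector
logits_agree = np.array([[2.0, 1.0, -2.0]]) # surrogate roughly agrees
logits_wrong = np.array([[-2.0, 1.0, 2.0]]) # surrogate disagrees
assert distillation_loss(logits_agree, p) < distillation_loss(logits_wrong, p)
```

Lowering the temperature sharpens the surrogate's distribution; values around 2–4 are common when distilling from probability outputs.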
Working Implementation
import numpy as np
import requests
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from scipy.stats import entropy

class ModelExtractor:
    """
    Black-box model extraction via active-learning-guided query selection.
    Trains a local surrogate MLPClassifier to mimic a remote target API.
    """

    def __init__(self, target_api_url, n_classes=2, query_budget=5000):
        self.target_url = target_api_url
        self.n_classes = n_classes
        self.query_budget = query_budget
        self.queries = []    # List[np.ndarray] — inputs sent to target
        self.responses = []  # List[int or list] — labels/probabilities returned
        # Surrogate architecture: two hidden layers
        self.surrogate = MLPClassifier(
            hidden_layer_sizes=(100, 50),
            activation='relu',
            max_iter=500,
            random_state=42
        )
        self.scaler = StandardScaler()
        self._fitted = False

    def query_target(self, inputs):
        """Send inputs to the target API and record (input, response) pairs."""
        for x in inputs:
            try:
                resp = requests.post(
                    self.target_url,
                    json={"input": x.tolist()},
                    timeout=10
                )
                resp.raise_for_status()
                data = resp.json()
                # Accept either {"prediction": 1} or {"probabilities": [0.3, 0.7]}
                label = data.get("prediction", np.argmax(data.get("probabilities", [0])))
                self.queries.append(x)
                self.responses.append(label)
            except (requests.RequestException, KeyError) as e:
                print(f"Query failed: {e}")

    def train_surrogate(self):
        """Fit the surrogate on all accumulated (query, response) pairs."""
        X = np.array(self.queries)
        y = np.array(self.responses)
        X_scaled = self.scaler.fit_transform(X)
        self.surrogate.fit(X_scaled, y)
        self._fitted = True
        train_acc = self.surrogate.score(X_scaled, y)
        return train_acc

    def select_uncertain_batch(self, candidate_pool, batch_size=100):
        """
        Active learning: pick the batch_size candidates where the surrogate
        is most uncertain (highest entropy over class probabilities).
        Requires surrogate to be trained at least once.
        """
        if not self._fitted:
            # Cold start — return random batch
            idx = np.random.choice(len(candidate_pool), batch_size, replace=False)
            return candidate_pool[idx]
        X_scaled = self.scaler.transform(candidate_pool)
        proba = self.surrogate.predict_proba(X_scaled)         # (N, n_classes)
        uncertainties = np.array([entropy(p) for p in proba])  # Shannon entropy
        # Select top-k most uncertain samples
        top_idx = np.argsort(uncertainties)[-batch_size:]
        return candidate_pool[top_idx]

    def run_extraction(self, input_domain_sampler, batch_size=100):
        """
        Full extraction loop.
        input_domain_sampler: callable() -> np.ndarray of shape (N, d)
        """
        rounds = self.query_budget // batch_size
        for round_i in range(rounds):
            # 1. Sample a large candidate pool from the input domain
            candidates = input_domain_sampler()
            # 2. Use active learning to pick the most informative batch
            batch = self.select_uncertain_batch(candidates, batch_size)
            # 3. Query the target API
            self.query_target(batch)
            # 4. Retrain surrogate on all data so far
            if len(self.queries) >= 200:  # Minimum for meaningful training
                acc = self.train_surrogate()
                print(f"Round {round_i+1}: {len(self.queries)} queries, surrogate accuracy={acc:.3f}")
        return self.surrogate

# ── Example usage ──────────────────────────────────────────────────────
# Suppose the target model classifies 20-dimensional input vectors
def domain_sampler():
    """Returns 500 random candidates from input domain."""
    return np.random.uniform(-1, 1, size=(500, 20)).astype(np.float32)

extractor = ModelExtractor(
    target_api_url="https://api.example.com/predict",
    n_classes=3,
    query_budget=2000
)
surrogate_model = extractor.run_extraction(domain_sampler, batch_size=100)
print(f"Extraction complete. Total queries: {len(extractor.queries)}")
Optimizing for Maximum Information Gain
Beyond uncertainty sampling, attackers can exploit several additional strategies. Jacobian-based data augmentation (JBDA) synthesizes new training points by applying small gradient steps to existing labeled inputs, generating inputs near decision boundaries without additional API calls. Model-free approaches use generative models to synthesize diverse inputs from scratch. For NLP models, prompt chaining — where the attacker systematically varies one linguistic dimension at a time — allows efficient coverage of the response surface with structured query sets. [Papernot et al.]
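The JBDA step can be sketched as follows. The attacker has white-box access to the surrogate, so exact gradients would normally be used; here a finite-difference estimate of the Jacobian stands in, and the toy model and names are illustrative:

```python
import numpy as np

def jbda_augment(surrogate_predict_proba, X, eps=0.1, delta=1e-3):
    """Jacobian-based data augmentation (Papernot et al.): push each labeled
    point along the sign of the surrogate's Jacobian for its predicted class.
    The resulting points become the next queries sent to the target API."""
    X_new = []
    for x in X:
        p = surrogate_predict_proba(x[None, :])[0]
        c = int(np.argmax(p))                  # surrogate's predicted class
        grad = np.zeros_like(x)
        for j in range(len(x)):                # finite-difference Jacobian row
            x_pert = x.copy()
            x_pert[j] += delta
            grad[j] = (surrogate_predict_proba(x_pert[None, :])[0][c] - p[c]) / delta
        X_new.append(x + eps * np.sign(grad))  # step along the Jacobian sign
    return np.stack(X_new)

# Toy surrogate with a linear decision boundary at x0 + x1 = 0
def toy_proba(X):
    s = 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))
    return np.stack([1 - s, s], axis=1)

X_aug = jbda_augment(toy_proba, np.array([[0.5, 0.5]]), eps=0.1)
assert np.allclose(X_aug[0], [0.6, 0.6])
```

No additional API calls are spent generating these points; the target is only queried to label them.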
Training Data Extraction from LLMs
Large language models are, in a precise technical sense, compressed summaries of their training corpora. They learn statistical regularities — everything from spelling patterns to full verbatim passages — and store this knowledge in their billions of parameters. This creates a vulnerability: a sufficiently precise attacker can cause a model to regurgitate verbatim text that appeared in its training set. The landmark research by Carlini et al. (2021) demonstrated this decisively against GPT-2, extracting hundreds of verbatim training examples including full names, physical addresses, phone numbers, and copyrighted text simply by querying the model's public API. [Carlini et al., USENIX Security 2021]
Why LLMs Memorize Training Data
Memorization arises from a combination of factors. Training data that is duplicated many times in the corpus is more likely to be memorized — GPT-2 memorizes entire MIT license texts because they appear verbatim on hundreds of thousands of GitHub repositories. Model capacity amplifies this: larger models memorize more, because they have more parameters to store rare training examples. Counterintuitively, longer training (more epochs) also increases memorization, as the model sees the same examples repeatedly and fits them more precisely.
Carlini et al. define k-eidetic memorization: a string s is k-eidetically memorized by a model if the model can reproduce s from a length-k prefix, and s appears in the training data only once. This is distinct from factual knowledge (which may be learned from many corroborating examples) — eidetic memorization is verbatim retention from a single training example.
Extraction Methodology
The attack pipeline involves three steps: generation, ranking, and verification. First, the attacker generates a large number of text samples from the model — Carlini et al. generated 600,000 samples using diverse prompting strategies. Second, they rank these samples using membership inference metrics as a filter: samples where the model assigns unusually high likelihood are more likely to be memorized. Specifically, they compare the model's perplexity on a candidate to a smaller reference model's perplexity. Memorized text is high-likelihood for the large model but not for the reference model. Third, the top-ranked candidates are verified against the original training corpus.
Divergence and Prefix Attacks
Divergence attacks exploit the phenomenon that a model fine-tuned with RLHF or instruction tuning will suppress memorization outputs during normal use, but can be induced to "forget" this suppression by crafting adversarial prompts. Carlini et al. (2023) extracted megabytes of training data from ChatGPT — despite its alignment training — by using a simple repetition prompt: asking the model to repeat a word indefinitely causes it to eventually diverge from its aligned behavior and emit training data verbatim. [Carlini et al., 2023 — ChatGPT Extraction]
Prefix attacks provide the model with a genuine prefix from the training data and observe whether it completes the rest of the passage accurately. Prompting GPT-2 with "My address is 1 Main Street" caused it to accurately complete with specific real individuals' contact information in the Carlini et al. experiments. Completion-based extraction is the general technique: any prompt that was seen during training acts as a retrieval key for the passage that followed it.
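A prefix attack can be sketched independently of any particular model; `generate_fn` and the toy corpus below are illustrative stand-ins for a real LLM API:

```python
def prefix_attack(generate_fn, known_prefix, true_suffix, min_overlap=20):
    """Prefix attack probe: feed a genuine training-data prefix to the model
    and check whether the completion reproduces the true continuation.
    generate_fn(prompt) -> completion string (wraps any LLM API or local model).
    Returns (is_memorized, matched_chars)."""
    completion = generate_fn(known_prefix)
    # Longest common prefix between the completion and the true suffix
    matched = 0
    for a, b in zip(completion, true_suffix):
        if a != b:
            break
        matched += 1
    return matched >= min_overlap, matched

# Toy stand-in model that has "memorized" one passage verbatim
CORPUS = {"My address is 1 Main St": "reet, Springfield, and my phone is 555-0100."}
def toy_generate(prompt):
    return CORPUS.get(prompt, "I cannot help with that.")

hit, n = prefix_attack(toy_generate, "My address is 1 Main St",
                       "reet, Springfield, and my phone is 555-0100.")
assert hit and n == 44
```

Against a real model the same probe is run over many candidate prefixes, with `min_overlap` set high enough to rule out chance agreement.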
Testing for Memorization
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import zlib

def compute_perplexity(model, tokenizer, text, device="cpu"):
    """Compute per-token perplexity of a text under a given model."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(output.loss).item()

def memorization_score(large_model, small_model, tokenizer, text, device="cpu"):
    """
    Carlini et al. 'ratio' metric: compares large vs. small model perplexity.
    A memorized passage has LOW perplexity under the large model
    but NOT proportionally low under the small reference model.
    Higher score → more likely memorized.
    """
    ppl_large = compute_perplexity(large_model, tokenizer, text, device)
    ppl_small = compute_perplexity(small_model, tokenizer, text, device)
    # Ratio metric: lower PPL in large vs small suggests memorization
    ratio_score = np.log(ppl_small) / np.log(ppl_large)
    # Zlib metric: compare model PPL to a compression-based entropy estimate
    zlib_entropy = len(zlib.compress(text.encode())) / len(text)
    zlib_score = zlib_entropy / np.log(ppl_large)
    return {
        "perplexity_large": ppl_large,
        "perplexity_small": ppl_small,
        "ratio_score": ratio_score,  # higher → more suspect
        "zlib_score": zlib_score,    # higher → low model PPL relative to compressibility → more suspect
    }

def generate_candidates(model, tokenizer, n_samples=200, max_length=256, device="cpu"):
    """
    Generate n_samples completions from the model with an empty prefix.
    Returns a list of generated strings.
    """
    model.eval()
    candidates = []
    with torch.no_grad():
        for _ in range(n_samples):
            input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(device)
            output = model.generate(
                input_ids,
                max_new_tokens=max_length,
                do_sample=True,
                top_k=40,
                temperature=1.0,
                pad_token_id=tokenizer.eos_token_id
            )
            text = tokenizer.decode(output[0], skip_special_tokens=True)
            candidates.append(text)
    return candidates

# ── Example: screen 200 GPT-2 generations for possible memorized text ──
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
    model_xl = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device)
    model_sm = GPT2LMHeadModel.from_pretrained("gpt2").to(device)  # reference
    candidates = generate_candidates(model_xl, tokenizer, n_samples=200, device=device)
    scored = []
    for text in candidates:
        scores = memorization_score(model_xl, model_sm, tokenizer, text, device)
        scored.append((text, scores))
    # Sort by ratio score descending — highest ratio = most likely memorized
    scored.sort(key=lambda x: x[1]["ratio_score"], reverse=True)
    print("Top 5 candidates most likely to contain memorized training data:")
    for text, scores in scored[:5]:
        print(f"  Ratio={scores['ratio_score']:.3f} | PPL(xl)={scores['perplexity_large']:.1f}")
        print(f"  Text: {text[:120]}...")
        print()
In Carlini et al.'s best attack configuration, 67% of top-ranked candidates were confirmed verbatim training examples. Among their 604 unique extracted memorized sequences, 46 contained personal names and 32 contained contact information — real individuals whose data had been harvested into GPT-2's CommonCrawl training corpus without their knowledge. [USENIX Security 2021]
Membership Inference
Imagine a hospital trains a machine learning model to predict patient readmission risk. The model is deployed via a public API for insurance companies to query. Now consider an adversary — perhaps a competitor, a nosy employer, or a malicious insurer — who has a specific individual's medical record and wants to know: was this person's data used to train this model? This is the membership inference problem, and it is one of the most practically significant privacy threats in modern machine learning. [Shokri et al., IEEE S&P 2017]
Why Models Leak Membership
The fundamental cause is overfitting. A model that has been trained on a data point typically assigns it higher confidence, lower loss, and different internal representations than unseen data points. Even well-regularized models exhibit this difference to a measurable degree. The gap is larger for rare, unusual, or exactly duplicated training examples — the same memorization phenomenon that enables training data extraction also enables membership inference.
The Shadow Model Technique
The seminal attack by Shokri et al. (2017) introduced the shadow model approach. Because the attacker cannot directly observe the target model's training set, they simulate it: they train several shadow models on datasets drawn from the same distribution as the target's training data. For each shadow model, the attacker knows exactly which points were in the training set (members) and which were not (non-members). They record the model's output vector for each point and label it accordingly. This produces a labeled dataset of (model_output, member/non-member) pairs. An attack classifier trained on this labeled dataset can then be applied to the target model's outputs to infer membership.
The technique achieved median membership inference accuracy of 94% against models trained on Google's ML services and 74% against Amazon's services in realistic experiments. [Shokri et al.]
Loss-Based Inference
A simpler approach that avoids training shadow models is the loss threshold attack: compute the model's loss on the target point and compare it to a threshold. If the loss is below the threshold (the model is highly confident), predict membership. This works because training examples tend to have lower loss than unseen data, especially for overfit models. More sophisticated variants use reference models: compute the likelihood ratio between the target model and a reference model trained on disjoint data. Points where this ratio is high are likely members.
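The loss-threshold attack is only a few lines. The toy model and threshold value below are illustrative; in practice the threshold is calibrated, for example on losses from a reference model:

```python
import numpy as np

def loss_threshold_mia(model_proba, X, y_true, threshold):
    """Loss-threshold membership inference: predict 'member' whenever the
    model's cross-entropy loss on (x, y) falls below a calibrated threshold.
    model_proba(X) -> (n_samples, n_classes) probability matrix."""
    proba = model_proba(X)
    losses = -np.log(proba[np.arange(len(y_true)), y_true] + 1e-12)
    return losses < threshold  # True => predicted member

# Toy model: confident (low-loss) on "members", uncertain on "non-members"
def toy_proba(X):
    return np.where(X[:, :1] > 0, [[0.95, 0.05]], [[0.55, 0.45]])

X = np.array([[1.0], [-1.0]])
y = np.array([0, 0])
pred = loss_threshold_mia(toy_proba, X, y, threshold=0.3)
assert pred.tolist() == [True, False]
```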
Python Implementation
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

class MembershipInferenceAttack:
    """
    Shadow-model-based membership inference attack.
    Assumes black-box access to a target model that returns
    class probability vectors for input features.
    """

    def __init__(self, n_shadow=4, shadow_architecture=None):
        self.n_shadow = n_shadow
        self.shadow_architecture = shadow_architecture or (64, 32)
        self.attack_classifier = LogisticRegression(max_iter=500)
        self.shadow_models = []
        self._trained = False

    def _train_shadow_model(self, X_train, y_train):
        """Train a single shadow model on a subset of proxy data."""
        shadow = MLPClassifier(
            hidden_layer_sizes=self.shadow_architecture,
            max_iter=300,
            random_state=np.random.randint(0, 10000)
        )
        shadow.fit(X_train, y_train)
        return shadow

    def _extract_features(self, model, X):
        """
        Extract membership inference features from model outputs.
        Uses the full probability vector as features (preserves calibration signal).
        """
        proba = model.predict_proba(X)  # shape: (n_samples, n_classes)
        # Additional engineered features: confidence, entropy, top-2 gap
        confidence = proba.max(axis=1, keepdims=True)
        pred_entropy = -np.sum(proba * np.log(proba + 1e-9), axis=1, keepdims=True)
        sorted_p = np.sort(proba, axis=1)[:, ::-1]
        top2_gap = (sorted_p[:, 0] - sorted_p[:, 1]).reshape(-1, 1)
        return np.hstack([proba, confidence, pred_entropy, top2_gap])

    def train_attack(self, proxy_data_X, proxy_data_y):
        """
        Build labeled (features, membership) dataset using shadow models,
        then train the attack classifier.
        proxy_data_X, proxy_data_y : dataset drawn from same distribution
                                     as target's training set.
        """
        all_features = []
        all_labels = []
        n = len(proxy_data_X)
        split = n // 2
        for i in range(self.n_shadow):
            # Randomly partition proxy data into shadow-train and shadow-test
            idx = np.random.permutation(n)
            train_idx, test_idx = idx[:split], idx[split:]
            X_tr, y_tr = proxy_data_X[train_idx], proxy_data_y[train_idx]
            X_te, y_te = proxy_data_X[test_idx], proxy_data_y[test_idx]
            shadow = self._train_shadow_model(X_tr, y_tr)
            self.shadow_models.append(shadow)
            # "In" examples: training set of this shadow model → label 1
            f_in = self._extract_features(shadow, X_tr)
            # "Out" examples: test set of this shadow model → label 0
            f_out = self._extract_features(shadow, X_te)
            all_features.append(np.vstack([f_in, f_out]))
            all_labels.extend([1] * len(X_tr) + [0] * len(X_te))
            print(f"Shadow model {i+1}/{self.n_shadow} trained. "
                  f"Accuracy={shadow.score(X_te, y_te):.3f}")
        F = np.vstack(all_features)
        L = np.array(all_labels)
        self.attack_classifier.fit(F, L)
        self._trained = True

    def infer_membership(self, target_model, query_points):
        """
        Given a trained target model and query data points,
        return membership probability for each point.
        1.0 = likely member, 0.0 = likely non-member.
        """
        if not self._trained:
            raise RuntimeError("Call train_attack() first.")
        features = self._extract_features(target_model, query_points)
        return self.attack_classifier.predict_proba(features)[:, 1]

    def evaluate(self, target_model, member_X, nonmember_X):
        """Compute AUC of attack against target model."""
        member_scores = self.infer_membership(target_model, member_X)
        nonmember_scores = self.infer_membership(target_model, nonmember_X)
        y_true = np.concatenate([np.ones(len(member_X)), np.zeros(len(nonmember_X))])
        y_score = np.concatenate([member_scores, nonmember_scores])
        auc = roc_auc_score(y_true, y_score)
        print(f"Membership Inference AUC: {auc:.4f}")
        return auc
Privacy Implications: GDPR and HIPAA
Membership inference directly violates the privacy principles enshrined in major data protection laws. Under GDPR Article 17 (right to erasure), individuals can request deletion of their data — but if a deployed model reveals membership, effective deletion becomes impossible without retraining or applying machine unlearning techniques. Under HIPAA, health information used to train models without de-identification may expose institutions to liability if membership can be inferred from the deployed model. Regulatory bodies are increasingly treating demonstrable membership inference vulnerability as a compliance failure, not merely a theoretical risk.
Attribute Inference
Membership inference is a binary question: was this person in the training set or not? Attribute inference is more granular: given that a person is in the training set (or even just as an input to the model), can an attacker learn sensitive attributes about them that were not explicitly provided as input? This class of attack is particularly pernicious because it can operate at query-time, not just at training time — any prediction API can potentially leak demographic or behavioral attributes about the query subject.
How Attribute Inference Works
Consider a credit scoring model trained on a dataset that includes both "approved features" (income, credit history, loan amount) and "protected attributes" (race, gender, zip code as a proxy for race). Even if the protected attributes are excluded from the model's official feature set at inference time, the model may have absorbed the correlation during training. An adversary who knows the target individual's approved features can query the model and, by observing the prediction, infer the individual's protected attributes. Research has shown that even models trained with explicit fairness constraints can still leak protected attributes through their output distributions.
The attack typically works by training a reconstruction model: the attacker collects many (known_features, model_output) pairs where the sensitive attribute is also known (from a separate dataset or via correlation), and trains a classifier to predict the sensitive attribute from the model's output. Yeom et al. (2018) formalized this as an attack that succeeds whenever a model has learned to exploit the correlation between the sensitive attribute and the label.
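The reconstruction-model idea can be demonstrated on synthetic data; the correlation strength and variable names below are assumptions chosen for illustration, not taken from any cited study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: a sensitive binary attribute shifts the target model's
# output score, even though it is never an input feature at inference time.
n = 2000
sensitive = rng.integers(0, 2, n)                              # hidden attribute
model_output = 0.4 + 0.2 * sensitive + rng.normal(0, 0.05, n)  # leaked correlation

# Attacker: on an auxiliary dataset where both are known, learn to predict
# the sensitive attribute from the model's output alone
clf = LogisticRegression().fit(model_output.reshape(-1, 1), sensitive)
acc = clf.score(model_output.reshape(-1, 1), sensitive)
assert acc > 0.9  # the output distribution reveals the protected attribute
```

The same reconstruction classifier then runs against the deployed model's outputs for individuals whose sensitive attribute is unknown.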
Demographic and Behavioral Inference
LLMs present a particularly rich surface for attribute inference because they produce nuanced, open-ended outputs. Research has demonstrated that LLMs can be used to predict user demographics (age, gender, political affiliation, nationality) from writing style alone. When an LLM is fine-tuned on user interaction logs, the fine-tuned model may expose individual users' characteristics through subtle systematic differences in how it responds to queries — even for users who were part of the fine-tuning set rather than the current query.
Cross-Referencing Attacks
A powerful variant combines attribute inference with auxiliary data. The attacker does not rely on the model alone; they combine model outputs with publicly available databases, social media profiles, and other data sources. For example, knowing that a medical model predicts a 73% readmission risk for a patient with features (age=54, zip=90210, diagnosis=T2D) might be enough — when cross-referenced with public voter registration and property records — to uniquely identify the patient and infer their full medical history.
Mitigation Strategies
Defending against attribute inference requires both training-time and deployment-time interventions. Adversarial training for fairness penalizes models that allow a discriminator to infer protected attributes from intermediate representations. Output restriction — returning only hard labels rather than confidence scores — reduces the information available to an attacker but does not eliminate the risk. Federated learning with differential privacy can limit the amount of individual-level information encoded in model weights, providing the strongest theoretical guarantees.
Side-Channel Attacks on LLMs: Whisper Leak
A widespread assumption in LLM deployment is that TLS/HTTPS encryption provides meaningful confidentiality for user queries. The Whisper Leak research (2025) dismantles this assumption. The attack demonstrates that an adversary with passive access to a user's encrypted network traffic — such as an internet service provider, a compromised router, or a malicious Wi-Fi access point — can infer the topic of a user's LLM query with over 98% accuracy on 17 of 28 tested commercial LLMs, without ever decrypting a single byte of payload. [Whisper Leak, arXiv 2511.03675, 2025]
Attack Premise: Why Encryption Is Not Enough
TLS (with AEAD cipher modes like AES-GCM) encrypts payload content but preserves payload size: size(ciphertext) = size(plaintext) + constant. When an LLM generates a streaming response token by token, each token is sent as a separate encrypted packet. Because tokens have variable lengths (the token "the" is 3 characters; "antidisestablishmentarianism" is 28), the sequence of packet sizes directly encodes the sequence of token lengths.
The key insight is that different topics produce systematically different token length patterns. A response about quantum physics uses longer, less frequent technical terms. A response about cooking uses shorter, more common vocabulary. A response about legal matters includes specific legal terms and Latin phrases. These patterns are stable enough across different users asking similar questions that a trained classifier can identify the topic from packet sizes alone — even without seeing the content.
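The size-to-length arithmetic can be made concrete in a few lines; the overhead constant below is a hypothetical placeholder (real TLS record overhead depends on version, cipher suite, and framing):

```python
# If each streamed token travels in its own TLS record, a passive observer
# recovers token lengths by subtracting the fixed per-record overhead.
TLS_OVERHEAD = 29  # hypothetical constant: record header + auth tag

def token_lengths(ciphertext_sizes, overhead=TLS_OVERHEAD):
    """Map observed encrypted record sizes back to plaintext token lengths."""
    return [s - overhead for s in ciphertext_sizes]

tokens = ["The", " electron", " exhibits", " wave", "-particle", " duality"]
records = [len(t) + TLS_OVERHEAD for t in tokens]   # what the wire shows
assert token_lengths(records) == [len(t) for t in tokens]
```

The recovered length sequence, not the plaintext, is the feature the topic classifier consumes.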
Experimental Setup and Results
The researchers collected 21,716 queries per model (100 topic variants × 100 repeats + 11,716 Quora noise queries) and trained binary classifiers using tcpdump-captured encrypted traffic traces. Testing was conducted against 28 LLMs from major providers including OpenAI (GPT-4o-mini, GPT-4.1), Anthropic (Claude 3 Haiku), Google (Gemini 1.5 Flash, 2.5 Pro), Microsoft, xAI (Grok), Mistral, DeepSeek, Meta (LLaMA via Lambda), and Amazon (Nova). [Whisper Leak]
- AUPRC (Median): >98% for 17 of 28 models; average 96.8% across all models.
- Precision: 100% precision at 5–20% recall for 17/28 models (e.g., GPT-4o-mini, Mistral, Grok).
- Hardest Targets: Google Gemini (81–84% AUPRC) and Amazon Nova (71–77%) were most resistant.
- Attacker Model: Passive network observer (ISP-level). No active interference required.
Three Classifier Architectures
The paper evaluated three classifier architectures on the packet-size + inter-arrival time feature sequences:
- LightGBM: Gradient-boosted decision tree ensemble on flattened, zero-padded packet size/timing sequences (padded to 95th percentile length). Fast to train, competitive accuracy.
- BiLSTM: Bidirectional LSTM with attention mechanism. Embeds each (packet_size, inter_arrival_time) pair, processes with BiLSTM + attention, then classifies via a two-layer MLP head. Captures sequential dependencies.
- BERT-based (DistilBERT): Discretizes packet sizes and timings into 50-bin tokens, then fine-tunes a DistilBERT classifier on these token sequences. Best performance on models with complex packet distributions.
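The discretization step behind the BERT-based variant might look like the following sketch; the 50-bin count matches the description above, but the bin edges and token-id layout are assumptions:

```python
import numpy as np

def discretize_trace(sizes, timings, n_bins=50, max_size=1500, max_time=0.5):
    """Quantize each (packet_size, inter_arrival_time) pair into two token ids:
    sizes map to tokens [0, n_bins), timings to [n_bins, 2*n_bins). The
    resulting integer sequence is what a DistilBERT-style classifier is
    fine-tuned on."""
    size_bins = np.linspace(0, max_size, n_bins + 1)
    time_bins = np.linspace(0, max_time, n_bins + 1)
    s_tok = np.clip(np.digitize(sizes, size_bins) - 1, 0, n_bins - 1)
    t_tok = np.clip(np.digitize(timings, time_bins) - 1, 0, n_bins - 1) + n_bins
    # Interleave: [size_tok_0, time_tok_0, size_tok_1, time_tok_1, ...]
    return np.stack([s_tok, t_tok], axis=1).ravel()

toks = discretize_trace([120, 840, 1400], [0.01, 0.25, 0.49])
assert toks.shape == (6,)
assert toks.min() >= 0 and toks.max() < 100
```

Treating the trace as a token sequence lets the attack reuse standard pretrained-transformer fine-tuning machinery unchanged.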
Conceptual Attack Code: Traffic Capture and Classification
import subprocess
import numpy as np
from collections import namedtuple
import joblib  # for loading the pre-trained LightGBM classifier

PacketTrace = namedtuple("PacketTrace", ["sizes", "inter_arrivals"])

def capture_llm_traffic(target_host, duration_sec=30, interface="eth0"):
    """
    Passively capture encrypted HTTPS packets to/from an LLM API endpoint.
    Returns a PacketTrace with per-packet sizes and inter-arrival times.

    IMPORTANT: Only use on networks/systems you are authorized to monitor.
    This is a conceptual demonstration of the Whisper Leak methodology.
    Uses tcpdump (requires appropriate privileges or packet capture capability).
    """
    # In production, Whisper Leak uses tcpdump output parsed with Scapy.
    # Here we show the structure; a real implementation needs pcap parsing.
    # Note: tcpdump options must precede the filter expression.
    cmd = [
        "tcpdump", "-i", interface,
        "-nn", "-q", "-tt",            # numeric addresses, quiet, epoch timestamps
        "-c", "2000",                  # capture up to 2000 packets
        f"host {target_host} and port 443",
    ]
    # Parse output lines of the form:
    #   "1700000000.123456 IP src > dst: ... length NNN"
    raw_lines = subprocess.check_output(
        cmd, stderr=subprocess.DEVNULL, timeout=duration_sec
    ).decode()

    sizes = []
    inter_arrivals = []
    prev_ts = None
    for line in raw_lines.splitlines():
        parts = line.split()
        if not parts:
            continue
        try:
            ts = float(parts[0])
            # The packet length is typically the last all-digit token on the line
            pkt_len = int([p for p in parts if p.isdigit()][-1])
            sizes.append(pkt_len)
            if prev_ts is not None:
                inter_arrivals.append(ts - prev_ts)
            prev_ts = ts
        except (ValueError, IndexError):
            continue
    return PacketTrace(sizes=sizes, inter_arrivals=inter_arrivals)
def featurize_trace(trace, max_len=512):
    """
    Pad/truncate packet size and inter-arrival sequences to fixed length,
    then concatenate into a feature vector for LightGBM classification.
    Mirrors the Whisper Leak featurization strategy.
    """
    sizes = np.array(trace.sizes[:max_len], dtype=np.float32)
    timings = np.array(trace.inter_arrivals[:max_len], dtype=np.float32)
    # Zero-pad to max_len
    sizes_padded = np.pad(sizes, (0, max_len - len(sizes)))
    timings_padded = np.pad(timings, (0, max_len - len(timings)))
    # Concatenate sizes + timings into one feature vector
    feature_vector = np.concatenate([sizes_padded, timings_padded])
    return feature_vector.reshape(1, -1)
def classify_prompt_topic(trace, classifier_path, topic_labels):
    """
    Given a captured traffic trace, load a pre-trained LightGBM classifier
    and predict the topic of the underlying LLM prompt.
    """
    clf = joblib.load(classifier_path)  # pre-trained binary or multi-class model
    features = featurize_trace(trace)
    proba = clf.predict_proba(features)[0]
    results = sorted(zip(topic_labels, proba), key=lambda x: x[1], reverse=True)
    print("Topic classification results:")
    for topic, prob in results[:5]:
        print(f"  {topic:30} {prob:.3f}")
    return results[0][0]  # top predicted topic
Defenses
The Whisper Leak paper evaluated random padding as a mitigation: appending a random-length string to each streamed token to obscure its length. Cloudflare implemented this defense after the initial disclosure. However, the paper showed that even with padding, timing information between packets retains significant classifiable signal. The only fully effective defense is constant-shape traffic: padding all responses to a fixed size and batching tokens before transmission, eliminating all packet-size variation. This fundamentally conflicts with the low-latency streaming UX that most LLM providers optimize for.
Token Length Side-Channel
The Whisper Leak attack infers prompt topics. The research by Weiss et al. (2024) pursues a more precise goal: inferring the exact content of an LLM's response — word for word — from the sizes of encrypted packets. This is the token length side-channel, and it represents one of the most startling privacy failures in LLM deployment history. [Microsoft Security Blog, 2025]
Attack Methodology
When an LLM streams its response token-by-token, each TLS record carries exactly one token's bytes plus a constant encryption overhead. Because the cipher preserves plaintext length, the packet size directly reveals the token's byte length: after subtracting the fixed overhead, a 3-byte payload corresponds to a 3-character token, a 7-byte payload to a 7-character token. An adversary observing the encrypted stream thus learns the length sequence of every token in the response — e.g., [4, 5, 1, 3, 6, 2, 7, ...].
Armed with this length sequence, Weiss et al. employ a secondary LLM to reconstruct the most plausible sentence matching that exact token length pattern. The task is essentially a constrained text generation problem: generate a coherent sentence whose tokenization produces exactly the observed length sequence. Given that the tokenizer vocabulary is fixed and widely known (e.g., tiktoken for OpenAI models), this is a dramatically constrained problem — far easier than unconstrained generation.
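As a toy illustration of this constrained matching, the sketch below filters candidate responses by whether their token-length pattern equals an observed length sequence. A whitespace split stands in for the real subword tokenizer (the actual attack uses the provider's tokenizer, e.g. tiktoken), and the candidate sentences and observed sequence are invented:

```python
# Hypothetical sketch of the length-constraint check at the core of the
# reconstruction step. A whitespace "tokenizer" stands in for the real one.

def token_length_sequence(text, tokenize):
    """Map a candidate response to its sequence of token byte-lengths."""
    return [len(tok.encode("utf-8")) for tok in tokenize(text)]

def matches_observation(candidate, observed_lengths, tokenize):
    """True iff the candidate's token-length pattern equals the observed one."""
    return token_length_sequence(candidate, tokenize) == observed_lengths

toy_tokenize = str.split  # stand-in for a real subword tokenizer

observed = [3, 4, 1, 8]   # lengths leaked by per-token packet sizes
candidates = [
    "You have a headache",
    "You are a diabetic",
    "See you tomorrow",
]
plausible = [c for c in candidates if matches_observation(c, observed, toy_tokenize)]
print(plausible)  # → ['You have a headache']
```

In the real attack the candidate set is not a fixed list but the output of a secondary LLM prompted to generate sentences under the length constraint; the filter itself is exactly this equality check.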
The attack achieved exact reconstruction of roughly 27% of output tokens. While this may seem modest, even partial reconstruction can suffice to determine whether a response contained sensitive medical diagnoses, legal advice, or private personal information.
Why This Attack Is Structurally Hard to Fix
The root cause is not a bug in TLS — it is a fundamental property of the streaming API design. Any system that:
- Uses a fixed tokenizer (so token lengths are predictable from vocabulary),
- Streams tokens one-by-one over TLS, and
- Uses a stream cipher that preserves plaintext length,
is vulnerable to this attack. The only mitigations that fully neutralize it are: (a) adding deterministic padding to all tokens so they appear to be the same size, or (b) batching multiple tokens per packet before encryption — both of which increase latency and reduce the responsiveness that makes streaming valuable. Cloudflare implemented per-token random padding at the CDN layer after Weiss et al.'s disclosure, but residual information leakage was still demonstrated in follow-on work. [Whisper Leak citing Cloudflare mitigation]
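A sketch of these two mitigations combined — buffering a fixed number of tokens per record and padding every record to a constant size before encryption. The record and batch sizes below are illustrative choices, not values from the cited work:

```python
# Sketch of the constant-shape mitigation: batch tokens, then pad every
# outgoing record to a fixed size so ciphertext length carries no signal.

RECORD_SIZE = 256   # constant on-the-wire payload size (illustrative)
BATCH_TOKENS = 8    # tokens buffered per record (illustrative)

def constant_shape_records(tokens):
    """Group tokens into fixed-count batches, pad each to RECORD_SIZE bytes."""
    records = []
    for i in range(0, len(tokens), BATCH_TOKENS):
        payload = "".join(tokens[i:i + BATCH_TOKENS]).encode("utf-8")
        if len(payload) > RECORD_SIZE:
            raise ValueError("batch exceeds record size; lower BATCH_TOKENS")
        records.append(payload.ljust(RECORD_SIZE, b"\x00"))
    return records

recs = constant_shape_records(
    ["The", " patient", " has", " a", " mild", " fever", ".", " Rest"]
)
print([len(r) for r in recs])  # → [256]
```

The latency cost is visible in the structure: the first byte of a batch cannot ship until the last token of that batch has been generated, which is precisely the streaming-responsiveness tradeoff described above.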
Implications for Sensitive Deployments
Any LLM deployed in healthcare, legal, financial, or government contexts that uses streaming APIs is potentially leaking response content to anyone with network visibility between the user and the provider. This includes corporate proxies, VPN providers, ISPs, and potentially government surveillance infrastructure. The appropriate security posture for highly sensitive deployments is to disable streaming entirely (returning complete responses as a single packet) or to deploy LLM inference on-premises with no external network exposure.
Timing Attacks on Efficient Inference
The LLM serving market is fiercely competitive on latency and cost, and providers invest heavily in inference optimization. Two major optimization categories — speculative decoding and KV cache sharing — have been shown to introduce exploitable side-channels that reveal private information about user inputs and system configurations.
Speculative Decoding Side-Channels
Speculative decoding accelerates LLM inference by having a small, cheap "draft model" predict several tokens ahead, then having the large target model verify them in parallel. When the draft model guesses correctly, multiple tokens are accepted in one verification pass. When it guesses incorrectly, the model falls back to single-token autoregressive generation. This accept/reject pattern is input-dependent: some prompts will be decoded with many correct speculations (producing larger packets per iteration), while others trigger many mis-speculations (smaller packets per iteration).
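The input-dependent accept/reject dynamic can be illustrated with a toy simulation in which a draft model proposes up to k tokens per iteration. The acceptance probabilities are invented, and only the per-iteration token counts — what an on-path observer infers from packet sizes — are recorded:

```python
import random

# Toy simulation of the accept/reject pattern described above: a draft
# model proposes up to k tokens; the number the target model accepts
# determines how many tokens ship in that iteration's packet.
# Acceptance probabilities here are invented for illustration.

def simulate_packet_trace(accept_prob, n_tokens=40, k=4, seed=0):
    """Return tokens-per-iteration counts an on-path observer would see."""
    rng = random.Random(seed)
    trace, produced = [], 0
    while produced < n_tokens:
        accepted = 0
        while accepted < k and rng.random() < accept_prob:
            accepted += 1
        step = accepted + 1          # a mis-speculation still emits one token
        trace.append(step)
        produced += step
    return trace

easy = simulate_packet_trace(accept_prob=0.9)   # draft guesses well
hard = simulate_packet_trace(accept_prob=0.2)   # frequent mis-speculation
print("easy-prompt trace:", easy)
print("hard-prompt trace:", hard)
```

Two prompts with different draft-model predictability produce visibly different traces — the fingerprint the Wei et al. classifier exploits.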
Wei et al. (2024) demonstrated that a passive network adversary observing packet sizes per generation iteration can reconstruct this accept/reject trace, and use it to fingerprint user queries with >90% accuracy across four different speculative decoding schemes. [Wei et al., "When Speculation Spills Secrets," 2024] Specifically: REST achieved ~100% query identification accuracy, LADE reached 92%, BiLD 95%, and even EAGLE on remote vLLM achieved 77.6% accuracy.
Beyond query fingerprinting, a malicious user with API access can also extract private datastore contents used by retrieval-augmented speculative decoding (e.g., REST): by crafting inputs designed to probe the datastore and observing which tokens are correctly speculated, the attacker can leak datastore contents at more than 25 tokens per second. This is a particularly severe threat for RAG systems that include proprietary or confidential documents in their retrieval corpus.
Carlini & Nasr: Timing Variations from Inference Optimizations
Carlini & Nasr (2024) demonstrated an earlier version of this class of attack, showing that timing variations due to inference optimizations in commercial models (GPT-4, Claude) can be exploited via packet inter-arrival times as a side-channel. Their work established the threat model for this research area, though subsequent research showed that inter-arrival time signals are considerably noisier than packet size signals — the Wei et al. approach achieves 77.6% accuracy where Carlini & Nasr's approach achieves only 14.4% on the same vLLM setup.
InputSnatch: KV Cache Timing Attacks
A second class of timing side-channel exploits KV cache sharing. Modern LLM inference backends (vLLM, TensorRT-LLM) implement prefix caching: if two requests share a common prefix, the KV cache computed for that prefix is reused rather than recomputed. This produces a measurable timing difference — a cache hit responds noticeably faster than a cache miss. [Zheng et al., "InputSnatch," arXiv 2411.18191, 2024]
The InputSnatch attack by Zheng et al. (2024) exploits this vulnerability to reconstruct other users' cached prompts. The attack works by systematically querying the target service with candidate inputs and observing whether the time-to-first-token (TTFT) indicates a cache hit. A cache hit reveals that the candidate prefix matches another user's cached query. By iteratively constructing increasingly long prefixes that match, the attacker can reconstruct the victim user's exact prompt — even when the service uses HTTPS encryption.
In experiments on a medical Q&A chatbot with prefix caching, InputSnatch achieved a 62% success rate in extracting exact disease inputs and 13.5% for precise symptom descriptions. For a legal consultation RAG system with semantic caching, semantic extraction success rates ranged from 43% to 100%.
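The iterative prefix-extension loop can be sketched against a simulated timing oracle. The victim prompt, candidate vocabulary, and hit/miss latencies below are all invented; a real attack measures time-to-first-token against a live prefix cache:

```python
# Sketch of the InputSnatch-style loop: extend the candidate prefix with
# whichever word yields cache-hit timing. Everything here is simulated.

VICTIM_PROMPT = "what are symptoms of lyme disease"
HIT_MS, MISS_MS = 12.0, 85.0   # illustrative TTFT for cache hit vs miss

def ttft_oracle(candidate):
    """Simulated time-to-first-token: fast iff the prefix is cached."""
    return HIT_MS if VICTIM_PROMPT.startswith(candidate) else MISS_MS

def reconstruct_prompt(vocab, max_words=10, hit_threshold_ms=40.0):
    """Greedily extend the prefix with any candidate word that hits the cache."""
    prefix_words = []
    for _ in range(max_words):
        for word in vocab:
            candidate = " ".join(prefix_words + [word])
            if ttft_oracle(candidate) < hit_threshold_ms:
                prefix_words.append(word)
                break
        else:
            break  # no candidate extended the cached prefix
    return " ".join(prefix_words)

vocab = ["disease", "what", "lyme", "symptoms", "are", "of", "treatment"]
print(reconstruct_prompt(vocab))  # → what are symptoms of lyme disease
```

The real attack replaces the brute-force vocabulary scan with domain-informed candidate generation (e.g., disease names for a medical chatbot), which is why success rates vary so widely across domains.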
TPUXtract: Hardware Emanation Attacks
All the attacks discussed so far operate at the API or network level. But there is another attack surface entirely: the physical hardware running the model. Neural network inference on specialized chips (TPUs, GPUs, NPUs) produces electromagnetic (EM) emanations correlated with the operations being performed. An adversary with physical proximity to the hardware — or even access to measurement equipment in an adjacent rack in a data center — can use these emanations to infer the model's architecture. This is TPUXtract. [Keysight Security Blog, 2025]
Attack Methodology
The attack was demonstrated by researchers from North Carolina State University against a Google Tensor Processing Unit (TPU). The fundamental observation is that TPU power consumption varies measurably depending on the layer configuration being processed. Different layer types (convolutional, fully connected, attention), different layer sizes, and different connectivity patterns each produce distinct EM signatures.
TPUXtract exploits the fact that in a neural network, data flows sequentially through layers. The attacker measures the EM signal over time as the TPU processes an input and correlates different time windows with the EM profiles expected for different layer configurations. By matching the observed EM trace to a library of pre-characterized layer profiles, the attacker reconstructs the model's architecture one layer at a time.
This layer-by-layer approach dramatically reduces search complexity compared to trying to match the entire model at once. The attack achieved 99.91% accuracy in extracting neural network hyperparameters from the TPU.
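The matching step can be sketched as nearest-neighbor search against a profile library. Everything below is synthetic — in the real attack, profiles come from pre-characterizing each candidate layer configuration on identical hardware:

```python
import numpy as np

# Sketch of TPUXtract-style layer matching: each time window of the EM
# trace is assigned the closest pre-characterized layer profile.
# Profiles and the observed trace are synthetic stand-ins.

rng = np.random.default_rng(7)
profile_library = {                       # label -> characteristic EM signature
    "conv3x3_64":   rng.normal(0.0, 1.0, 128),
    "dense_512":    rng.normal(2.0, 1.0, 128),
    "attention_8h": rng.normal(-2.0, 1.0, 128),
}

def match_layers(em_windows, library):
    """Assign each observed window the nearest profile by Euclidean distance."""
    labels = []
    for window in em_windows:
        dists = {name: np.linalg.norm(window - sig) for name, sig in library.items()}
        labels.append(min(dists, key=dists.get))
    return labels

# Synthetic trace: three windows, each a noisy copy of a known profile
true_arch = ["conv3x3_64", "attention_8h", "dense_512"]
trace = [profile_library[name] + rng.normal(0, 0.1, 128) for name in true_arch]
print(match_layers(trace, profile_library))
```

Matching one window at a time is what collapses the search space: the attacker compares against a library of individual layer profiles rather than against every possible whole-network architecture.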
What Can Be Extracted
TPUXtract can recover:
- Number of layers — the depth of the network
- Layer types — fully connected, convolutional, attention, normalization
- Layer dimensions — number of neurons/channels per layer
- Connectivity patterns — skip connections, attention heads
Critically, TPUXtract does not recover model weights — the actual numerical parameters learned during training. This is an important limitation: knowing the architecture is like knowing the blueprint of a building without knowing the furniture inside. However, knowing the architecture is enormously valuable: it enables much more efficient model extraction via the API (the search space for the surrogate is now dramatically constrained), it reveals proprietary architectural innovations, and it provides a roadmap for targeted adversarial attacks. For Transformer-based LLMs, full extraction requires additional steps (nullifying weights of specific layers to isolate others), but the paper demonstrates feasibility.
Implications for Model IP Security
Cloud providers and hardware manufacturers have historically assumed that EM emanations from accelerators are not exploitable in multi-tenant environments because isolation between tenants should prevent physical access. TPUXtract challenges this assumption: in co-location data centers, measurements from adjacent physical hardware may suffice. More broadly, any organization running AI inference on hardware that is not physically controlled end-to-end faces architectural secrecy risks. Effective countermeasures include hardware-level EM shielding, noise injection circuits, and power-consumption masking — all established techniques from the cryptographic hardware security domain, now becoming relevant to AI deployments. [Dark Reading, 2024]
Model Inversion
Model inversion attacks reconstruct representative inputs from model outputs, effectively running inference in reverse. Rather than asking "what does this model predict for this input?", the attacker asks "what input does this model associate with this prediction?" In the domain of facial recognition, model inversion can reconstruct recognizable face images of training subjects from nothing but the model's confidence scores — a profound privacy violation.
Fredrikson et al.: Face Reconstruction
The seminal model inversion paper by Fredrikson, Jha, and Ristenpart (CCS 2015) demonstrated that a facial recognition model trained on a set of named individuals could be inverted to produce recognizable face images for any target individual whose name is in the model's label space. [Fredrikson et al., CCS 2015] The attack works by gradient-based optimization in the input space: starting from random noise, the attacker iteratively adjusts the input to maximize the model's confidence for the target label. The optimization converges to an input that is highly recognizable as the target individual — even without ever seeing their actual photo in the training set.
This attack works because the model has encoded sufficient information about each individual's facial features in its parameters to make confident predictions — and gradient-based inversion can decode that information back into the image space. The attack succeeded in producing face images that human evaluators could correctly identify as the target individual at significantly above-chance rates.
Gradient-Based Inversion Code
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

def model_inversion_attack(
    model,
    target_class,
    input_shape,
    n_iterations=2000,
    lr=0.01,
    reg_strength=0.001
):
    """
    White-box model inversion: reconstruct an input that maximally activates
    target_class in the model's output.

    Args:
        model        : PyTorch classification model (should output logits or log-probs)
        target_class : int, index of the class to invert
        input_shape  : tuple, e.g. (1, 3, 64, 64) for a single RGB 64×64 image
        n_iterations : optimization steps
        lr           : learning rate
        reg_strength : L2 regularization on input (prevents degenerate solutions)
    Returns:
        reconstructed_input : torch.Tensor of shape input_shape
    """
    model.eval()
    # Initialize from random noise in [0, 1]
    x = torch.rand(input_shape, requires_grad=True)
    optimizer = optim.Adam([x], lr=lr)
    criterion = nn.CrossEntropyLoss()
    target = torch.tensor([target_class], dtype=torch.long)

    for step in range(n_iterations):
        optimizer.zero_grad()
        # Clamp input to valid image range
        x_clamped = torch.clamp(x, 0.0, 1.0)
        # Forward pass through the target model
        logits = model(x_clamped)
        # Loss: maximize confidence for target_class + regularize for naturalness
        classification_loss = criterion(logits, target)
        regularization = reg_strength * torch.norm(x_clamped, p=2)
        total_loss = classification_loss + regularization
        total_loss.backward()
        optimizer.step()
        if step % 500 == 0:
            confidence = torch.softmax(logits, dim=-1)[0, target_class].item()
            print(f"Step {step:4d} | Loss={total_loss.item():.4f} | "
                  f"Confidence for class {target_class}: {confidence:.3f}")
    return x.detach().clamp(0, 1)
def black_box_inversion(model_query_fn, target_class, input_shape,
                        n_iterations=5000, population_size=50):
    """
    Black-box model inversion using a natural evolution strategy (NES).
    model_query_fn: callable that takes an input array and returns confidence scores.
    Uses gradients estimated from score differences along random directions.
    """
    sigma = 0.1  # noise scale for gradient estimation
    lr = 0.01    # learning rate
    # Start from a random uniform sample in [0, 1]
    x = np.random.uniform(0, 1, input_shape).astype(np.float32)

    for step in range(n_iterations):
        # Estimate the gradient via random perturbations
        noise = np.random.randn(population_size, *input_shape).astype(np.float32)
        rewards = np.zeros(population_size)
        for i, n in enumerate(noise):
            x_perturbed = np.clip(x + sigma * n, 0, 1)
            scores = model_query_fn(x_perturbed)
            rewards[i] = scores[target_class]  # maximize target class confidence
        # NES gradient estimate: reward-weighted average of the perturbations
        grad_estimate = np.mean(
            rewards[:, None] * noise.reshape(population_size, -1),
            axis=0
        ).reshape(input_shape) / sigma
        x = np.clip(x + lr * grad_estimate, 0, 1)
        if step % 1000 == 0:
            curr_confidence = model_query_fn(x)[target_class]
            print(f"Step {step} | Target confidence: {curr_confidence:.3f}")
    return x
Modern model inversion attacks have become dramatically more powerful by leveraging generative adversarial networks (GANs) and diffusion models as prior knowledge of the input distribution. GAN-based inversion constrains the search to the latent space of a GAN trained on the same domain, ensuring reconstructed images are photorealistic and semantically valid. This approach has achieved face reconstructions at 64×64 resolution that are recognizable to human evaluators even against production facial recognition models.
Defense: Rate Limiting
The simplest and most immediately deployable defense against model extraction is rate limiting: constraining the number of queries any individual user or IP address can make per unit time. Since model extraction requires thousands to hundreds of thousands of API calls, a well-calibrated rate limit dramatically increases the time and monetary cost of an attack, potentially making it economically infeasible.
Adaptive Rate Limiting
Static rate limits are a blunt instrument — they may block legitimate power users while a sophisticated attacker distributes their queries across many accounts or IP addresses. Adaptive rate limiting monitors behavioral signals that distinguish legitimate use from extraction attempts and adjusts limits dynamically. Extraction queries tend to be systematic (regularly spaced, covering the input space in structured ways), whereas legitimate queries tend to be irregular and semantically coherent. Per-user query budgets, anomaly detection on query distributions, and progressive throttling create a layered defense.
Implementation
import time
import collections
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, Deque

@dataclass
class UserQueryRecord:
    """Per-user state for extraction detection."""
    query_times: Deque[float] = field(default_factory=lambda: collections.deque(maxlen=1000))
    query_count: int = 0          # cumulative; a production system would reset this daily
    flagged_count: int = 0
    throttle_until: float = 0.0   # Unix timestamp when throttling expires

class AdaptiveRateLimiter:
    """
    Adaptive rate limiter that detects extraction-like query patterns.
    Implements:
      1. Hard per-minute rate limit
      2. Daily budget per user
      3. Regularity anomaly detection (extraction queries are too regular)
      4. Progressive throttling of suspicious users
    """
    def __init__(
        self,
        hard_limit_per_minute=60,
        daily_budget=5000,
        regularity_threshold=0.15,  # coefficient of variation below which = suspicious
        throttle_multiplier=4.0     # slow down suspicious users by this factor
    ):
        self.hard_limit_per_minute = hard_limit_per_minute
        self.daily_budget = daily_budget
        self.regularity_threshold = regularity_threshold
        self.throttle_multiplier = throttle_multiplier
        self.users: Dict[str, UserQueryRecord] = {}

    def _get_user(self, user_id: str) -> UserQueryRecord:
        if user_id not in self.users:
            self.users[user_id] = UserQueryRecord()
        return self.users[user_id]

    def _queries_last_minute(self, record: UserQueryRecord) -> int:
        cutoff = time.time() - 60
        return sum(1 for t in record.query_times if t > cutoff)

    def _is_too_regular(self, record: UserQueryRecord) -> bool:
        """
        Extraction bots tend to query at very regular intervals.
        Compute the coefficient of variation (CV) of inter-query intervals.
        Low CV (regular) → flag as suspicious.
        """
        times = list(record.query_times)
        if len(times) < 20:
            return False  # insufficient data
        intervals = np.diff(sorted(times[-50:]))  # last 50 queries
        if len(intervals) == 0 or np.mean(intervals) == 0:
            return False
        cv = np.std(intervals) / np.mean(intervals)
        return cv < self.regularity_threshold

    def check_and_record(self, user_id: str) -> dict:
        """
        Check whether a query from user_id should be allowed.
        Returns {"allowed": bool, "reason": str, "throttle_remaining": float}
        Records the query time if allowed.
        """
        record = self._get_user(user_id)
        now = time.time()

        # 1. Check if the user is currently under an active throttle
        if now < record.throttle_until:
            return {
                "allowed": False,
                "reason": "throttled",
                "throttle_remaining": record.throttle_until - now
            }
        # 2. Check the hard per-minute rate limit
        if self._queries_last_minute(record) >= self.hard_limit_per_minute:
            return {
                "allowed": False,
                "reason": "rate_limit_exceeded",
                "throttle_remaining": 0
            }
        # 3. Check the daily budget
        if record.query_count >= self.daily_budget:
            return {
                "allowed": False,
                "reason": "daily_budget_exceeded",
                "throttle_remaining": 0
            }
        # 4. Check for extraction-like regularity
        if self._is_too_regular(record):
            record.flagged_count += 1
            # Progressive: each flag multiplies the throttle window by
            # throttle_multiplier, capped at 24 hours
            throttle_seconds = min(60 * self.throttle_multiplier ** record.flagged_count, 86400)
            record.throttle_until = now + throttle_seconds
            return {
                "allowed": False,
                "reason": "extraction_pattern_detected",
                "throttle_remaining": throttle_seconds
            }
        # 5. Allow: record the query
        record.query_times.append(now)
        record.query_count += 1
        return {"allowed": True, "reason": "ok", "throttle_remaining": 0}
Defense: Output Perturbation
Even if an attacker successfully collects thousands of query-response pairs, the fidelity of their surrogate model depends on the quality of those labels. If the target model's outputs are perturbed — by adding calibrated noise, rounding confidence scores, or restricting the output to top-k classes — the surrogate trains on corrupted supervision and degrades in quality. The art of output perturbation is to add enough noise to impede extraction while preserving enough signal to maintain utility for legitimate users.
Differential Privacy Mechanisms for Output
The Laplace mechanism and Gaussian mechanism from differential privacy theory provide principled ways to add output noise with formal guarantees. For a model returning a probability vector in [0,1]^k, adding Laplace noise with scale Δf / ε (where Δf is the L1 sensitivity of the output function and ε is the privacy parameter) ensures (ε, 0)-differential privacy. In practice, the L1 sensitivity of a softmax output is at most 2 (two probability vectors can differ by at most 2 in L1 norm), so the noise scale is 2/ε.
Confidence Score Rounding and Top-k Restriction
A simpler, non-probabilistic approach is confidence score rounding: returning probabilities rounded to 2 decimal places instead of 8. This dramatically reduces the information content of each query response while preserving the ordinal ranking of classes (which is what most legitimate users need). Top-k restriction returns only the top k predictions rather than the full distribution, further limiting the information available to a surrogate model.
Implementation
import numpy as np

class OutputPerturbationDefense:
    """
    Defends against model extraction by perturbing model outputs before
    returning them to the user.
    Supports four strategies:
      - 'laplace'  : Add Laplace noise (ε-differential privacy)
      - 'gaussian' : Add Gaussian noise ((ε, δ)-differential privacy)
      - 'rounding' : Round confidences to d decimal places
      - 'topk'     : Return only the top-k predictions
    """
    def __init__(self, strategy='laplace', epsilon=1.0, decimal_places=2, top_k=3):
        assert strategy in ('laplace', 'gaussian', 'rounding', 'topk')
        self.strategy = strategy
        self.epsilon = epsilon  # privacy budget (smaller = more noise)
        self.decimal_places = decimal_places
        self.top_k = top_k

    def perturb(self, probabilities: np.ndarray) -> np.ndarray:
        """
        Apply output perturbation to a probability vector.
        probabilities: np.ndarray of shape (n_classes,), summing to 1.
        Returns perturbed probabilities (renormalized after noise and clipping).
        """
        proba = np.asarray(probabilities, dtype=np.float64)
        if self.strategy == 'laplace':
            return self._laplace_mechanism(proba)
        elif self.strategy == 'gaussian':
            return self._gaussian_mechanism(proba)
        elif self.strategy == 'rounding':
            return self._rounding(proba)
        elif self.strategy == 'topk':
            return self._topk_restriction(proba)

    def _laplace_mechanism(self, proba):
        """
        Add Laplace noise with scale = sensitivity / epsilon.
        For probability vectors, the L1 sensitivity is 2.
        After noise addition, clip to [0, 1] and re-normalize.
        """
        sensitivity = 2.0  # L1 sensitivity of a softmax output
        scale = sensitivity / self.epsilon
        noise = np.random.laplace(loc=0.0, scale=scale, size=proba.shape)
        noisy = np.clip(proba + noise, 0.0, 1.0)
        # Re-normalize to sum to 1 (project onto the probability simplex)
        total = noisy.sum()
        return noisy / total if total > 0 else np.ones_like(noisy) / len(noisy)

    def _gaussian_mechanism(self, proba):
        """
        Gaussian mechanism with an (epsilon, delta)-DP guarantee.
        Uses delta=1e-5 by default; sigma calibrated to the L2 sensitivity.
        """
        delta = 1e-5
        l2_sensitivity = np.sqrt(2)  # L2 sensitivity of a probability vector
        sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / self.epsilon
        noise = np.random.normal(0, sigma, proba.shape)
        noisy = np.clip(proba + noise, 0.0, 1.0)
        total = noisy.sum()
        return noisy / total if total > 0 else np.ones_like(noisy) / len(noisy)

    def _rounding(self, proba):
        """Round each probability to d decimal places, then renormalize."""
        rounded = np.round(proba, self.decimal_places)
        total = rounded.sum()
        return rounded / total if total > 0 else np.ones_like(rounded) / len(rounded)

    def _topk_restriction(self, proba):
        """
        Zero out all but the top-k classes and renormalize.
        Returns a sparse vector with at most k non-zero entries.
        """
        k = min(self.top_k, len(proba))
        result = np.zeros_like(proba)
        top_indices = np.argsort(proba)[-k:]
        result[top_indices] = proba[top_indices]
        total = result.sum()
        return result / total if total > 0 else result
# ── Privacy-utility tradeoff demonstration ──────────────────────────────
if __name__ == "__main__":
    original_proba = np.array([0.70, 0.20, 0.07, 0.03])
    for eps in [0.1, 1.0, 5.0]:
        defense = OutputPerturbationDefense(strategy='laplace', epsilon=eps)
        perturbed = defense.perturb(original_proba)
        print(f"ε={eps:.1f}: original={np.round(original_proba, 3)} → perturbed={np.round(perturbed, 3)}")
The key tradeoff: smaller ε means more noise (stronger privacy, worse utility). For most commercial models, ε = 1.0 to ε = 5.0 represents a practical operating range — sufficient noise to degrade a surrogate's training signal by 15–30% while keeping prediction accuracy for legitimate users within acceptable bounds.
Defense: Model Watermarking
Rate limiting and output perturbation try to prevent extraction. Model watermarking takes a different approach: it assumes extraction may occur and embeds a verifiable signature in the model's behavior that persists into the surrogate, allowing the original model owner to prove that a suspected stolen model was derived from their source model. [Survey: IP Protection for Deep Learning, arXiv 2411.05051, 2024]
Backdoor-Based Watermarking
The most widely deployed watermarking technique introduces a secret trigger set: a small collection of carefully crafted input-output pairs that the model is trained to respond to in a specific, unusual way. For example, a facial recognition model might be watermarked to classify images containing a specific subtle texture pattern as a designated "watermark class" with very high confidence. A legitimate copy of the model (including any surrogate trained via knowledge distillation on the original's outputs) will also exhibit this behavior, because the attacker's training queries included these trigger inputs and the attacker faithfully copied the corresponding responses. When the model owner suspects a stolen copy, they query it with the trigger set and check for the expected watermark behavior.
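The verification step can be sketched with stand-in models and a hypothetical trigger set: a surrogate distilled from the watermarked model reproduces the trigger behavior on every trigger input, while an unrelated model matches at chance (here, never):

```python
import random

# Sketch of trigger-set verification: query a suspect model on the secret
# triggers and test whether it reproduces the watermark behavior far more
# often than chance. Models and triggers are illustrative stand-ins.

TRIGGER_SET = [(f"trigger_input_{i}", "watermark_class") for i in range(20)]

def watermark_match_rate(model_fn, trigger_set):
    """Fraction of trigger inputs on which the model shows watermark behavior."""
    hits = sum(1 for x, y in trigger_set if model_fn(x) == y)
    return hits / len(trigger_set)

def stolen_surrogate(x):
    # Distilled copy: inherited the backdoor from the watermarked teacher
    return "watermark_class" if x.startswith("trigger_input_") else "benign"

def independent_model(x):
    # Unrelated model: never exhibits the watermark behavior
    return random.Random(x).choice(["cat", "dog", "benign"])

print(watermark_match_rate(stolen_surrogate, TRIGGER_SET))   # → 1.0
print(watermark_match_rate(independent_model, TRIGGER_SET))  # → 0.0
```

In a real ownership dispute, the gap between these two rates is what gets tested statistically: the probability of an independent model matching a secret trigger set this often by chance is negligible.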
Parameter-Space Watermarking
White-box watermarking embeds signatures directly into model weights rather than behavior. Uchida et al. (2017) proposed embedding a bit string into the distribution of weight values in a specific layer using a regularization term during training. The watermark can be extracted by computing the inner product of the weight vector with a secret key matrix. This approach is more robust to model modifications (fine-tuning, pruning) than backdoor-based approaches, but requires access to the model's internal weights for verification — which is only possible if the attacker makes their surrogate available for inspection.
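A toy sketch of the extraction side of this scheme. The "watermarked" weights below are synthesized directly rather than trained — in Uchida et al.'s method the embedding happens via a training-time regularizer that pushes sigmoid(X·w) toward the bit string — but the readout is the same projection-and-threshold:

```python
import numpy as np

# Sketch of Uchida-style watermark readout: project a layer's flattened
# weights through a secret key matrix and threshold to recover the bits.
# Weights here are synthetic with the watermark pre-embedded analytically.

rng = np.random.default_rng(0)
n_weights, n_bits = 4096, 32
secret_key = rng.normal(size=(n_bits, n_weights))   # held by the model owner
bits = rng.integers(0, 2, size=n_bits)              # the embedded signature

# Synthesize watermarked weights: small random base, nudged along the key
# rows so each projection lands on the correct side of zero (a stand-in
# for the training-time regularizer)
signs = 2 * bits - 1                                # map {0,1} -> {-1,+1}
w = rng.normal(scale=0.001, size=n_weights) + 0.05 * (signs @ secret_key) / n_bits

def extract_watermark(weights, key):
    """Recover the embedded bit string: bit_i = [key_i · weights > 0]."""
    return (key @ weights > 0).astype(int)

print((extract_watermark(w, secret_key) == bits).mean())  # → 1.0
```

Without the secret key matrix, the projection of the weights looks like noise, which is what makes the signature hard for an adversary to locate and scrub.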
Limitations and Active Research
Watermarking faces several fundamental challenges. First, a determined adversary who knows a watermark exists can attempt to remove it through fine-tuning, model pruning, or knowledge distillation into a fresh model. Second, backdoor-based watermarks introduce a genuine security vulnerability — if the trigger pattern is discovered, it can be used to manipulate the model's behavior. Third, watermarks provide attribution after the fact but do not prevent the initial harm of extraction. Nonetheless, for IP litigation purposes, a robust watermark that survives extraction and is demonstrably non-trivially correlated with the victim model provides strong legal evidence of theft.
Recent work on radioactive data — poisoning training data with imperceptible perturbations that propagate into models trained on it — offers an alternative watermarking approach that operates at the data rather than model level, providing attribution even when the attacker trains a completely fresh model from stolen training data rather than distilling from the API.
API Monitoring for Extraction Attempts
A well-instrumented API can detect model extraction in progress by monitoring behavioral anomalies. Legitimate API users exhibit characteristic usage patterns (bursty queries related to specific use cases, natural language diversity in NLP applications, predictable diurnal patterns) that differ from extraction attacks (systematic coverage of the input space, high query volumes, mechanically generated inputs, low semantic diversity). Effective API monitoring combines statistical anomaly detection with domain-specific extraction heuristics.
Extraction Behavioral Signatures
- Unusual query distributions: extraction queries tend to cover the input domain uniformly or along specific information-theoretic criteria, producing input distributions quite different from natural use.
- High query velocity: API calls at near-maximum rate from a single account or correlated accounts.
- Low semantic coherence: for NLP models, extraction queries may include partially randomized text, edge-case inputs, or grammatically unusual constructions that wouldn't arise from genuine user needs.
- Absence of feedback patterns: legitimate users typically follow up on errors or low-confidence responses; extraction bots often don't.
- Cross-account coordination: multiple accounts with similar query patterns or querying complementary regions of the input space.
Anomaly Detection Implementation
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import collections
import math
class ExtractionDetector:
"""
Anomaly-based detector for model extraction attempts.
Builds a feature vector per user session from query statistics
and uses Isolation Forest to identify anomalous usage patterns.
"""
def __init__(self, window_size=100, contamination=0.05):
"""
window_size : number of recent queries to consider per user
contamination : expected fraction of anomalous users (for IsolationForest)
"""
self.window_size = window_size
self.detector = IsolationForest(contamination=contamination, random_state=42)
self.scaler = StandardScaler()
self.user_history = collections.defaultdict(list) # user_id -> [query_features]
self._fitted = False
def _compute_session_features(self, user_id: str) -> np.ndarray:
"""
Compute a feature vector summarizing a user's recent query behavior.
Features:
0: queries_per_minute (last window)
1: inter_query_cv (coefficient of variation of intervals)
2: input_entropy (diversity of input lengths)
3: confidence_mean (avg model confidence — low for adversarial inputs)
4: confidence_cv (variation in model confidence)
5: unique_input_ratio (fraction of distinct inputs — high for extraction)
"""
history = self.user_history[user_id][-self.window_size:]
n = len(history)
if n < 2:
return np.zeros(6)
timestamps = np.array([h["timestamp"] for h in history])
input_lengths = np.array([h["input_length"] for h in history])
confidences = np.array([h["top_confidence"] for h in history])
input_hashes = [h["input_hash"] for h in history]
intervals = np.diff(sorted(timestamps))
# Feature 0: query rate (queries per minute in the observed window)
time_span = timestamps.max() - timestamps.min()
qpm = (n / time_span * 60) if time_span > 0 else 0
# Feature 1: inter-query interval regularity (extraction bots → low CV)
cv_intervals = (np.std(intervals) / np.mean(intervals)) if np.mean(intervals) > 0 else 0
# Feature 2: entropy of input lengths (systematic scanning → low entropy)
length_counts = collections.Counter(input_lengths)
total = sum(length_counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in length_counts.values())
# Feature 3 & 4: model confidence statistics
conf_mean = np.mean(confidences)
conf_cv = np.std(confidences) / conf_mean if conf_mean > 0 else 0
# Feature 5: unique input ratio (close to 1.0 for systematic extraction)
unique_ratio = len(set(input_hashes)) / n
return np.array([qpm, cv_intervals, entropy, conf_mean, conf_cv, unique_ratio])
def record_query(self, user_id: str, timestamp: float, input_length: int,
top_confidence: float, input_hash: str):
"""Record metadata for a single API query (do not store raw inputs for privacy)."""
self.user_history[user_id].append({
"timestamp": timestamp,
"input_length": input_length,
"top_confidence": top_confidence,
"input_hash": input_hash,
})
def fit_baseline(self, baseline_user_ids):
"""Train the anomaly detector on baseline legitimate user sessions."""
features = [self._compute_session_features(uid) for uid in baseline_user_ids
if len(self.user_history[uid]) >= 10]
if not features:
raise ValueError("No baseline data available.")
X = np.array(features)
X_scaled = self.scaler.fit_transform(X)
self.detector.fit(X_scaled)
self._fitted = True
def score_user(self, user_id: str) -> float:
"""
Returns anomaly score for a user.
Negative score → anomalous (potential extraction).
Positive score → normal behavior.
"""
if not self._fitted:
raise RuntimeError("Call fit_baseline() first.")
features = self._compute_session_features(user_id).reshape(1, -1)
scaled = self.scaler.transform(features)
return self.detector.decision_function(scaled)[0]
def is_suspicious(self, user_id: str, threshold: float = 0.0) -> bool:
"""Returns True if the user's behavior is anomalous above threshold."""
return self.score_user(user_id) < threshold
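The class above depends on per-user session bookkeeping, but the core idea — fit an Isolation Forest on legitimate-user features and flag outliers — can be demonstrated in isolation. A self-contained sketch with synthetic session features (the distributions and numbers are illustrative assumptions, not measurements from this module):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic per-session features: [queries_per_minute, interval_CV, unique_input_ratio]
# Legitimate users: modest rates, irregular timing, frequent repeated inputs.
legit = np.column_stack([
    rng.normal(5, 2, 200),       # ~5 queries per minute
    rng.normal(1.5, 0.3, 200),   # bursty, irregular inter-query intervals
    rng.uniform(0.3, 0.7, 200),  # many repeated inputs
])

# Extraction-like session: fast, metronomic, never repeats an input.
extractor = np.array([[55.0, 0.05, 1.0]])

scaler = StandardScaler().fit(legit)
detector = IsolationForest(contamination=0.05, random_state=42)
detector.fit(scaler.transform(legit))

legit_scores = detector.decision_function(scaler.transform(legit))
extractor_score = detector.decision_function(scaler.transform(extractor))[0]
print(f"median legit score: {np.median(legit_scores):+.3f}")
print(f"extractor score:    {extractor_score:+.3f}")  # negative → flagged as anomalous
```

In production, the feature engineering matters more than the detector choice: any of the behavioral signatures listed earlier can be appended as extra columns without changing this pipeline.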
Differential Privacy
All the defenses discussed so far operate at inference time — after the model has been trained. Differential privacy (DP) addresses the root cause by modifying the training process itself to limit how much any individual training example can influence the final model. A differentially private model is formally guaranteed to reveal only bounded information about any single training example — providing provable resistance to membership inference, attribute inference, and training data extraction attacks.
Formal Definition
A randomized mechanism M is (ε, δ)-differentially private if for any two adjacent datasets D and D' (differing in exactly one record), and for any set of outputs S:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ
Smaller ε means stronger privacy. When δ = 0 (pure DP), the guarantee is absolute; allowing δ > 0 (approximate DP, typically δ < 1/n) permits slightly relaxed guarantees in exchange for significantly better utility. The privacy budget ε tracks total information leakage across all uses of the mechanism — it is consumed with every query, access, or training step, providing a quantitative framework for managing privacy risk over time.
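To make the inequality concrete, consider the classic randomized-response mechanism for a single private bit — a standard textbook example, not something from this module's codebase. Reporting the true bit with probability e^ε / (1 + e^ε) satisfies pure (ε, 0)-DP, and the worst-case likelihood ratio between adjacent inputs is exactly e^ε:

```python
import math
import random

def randomized_response(true_bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    For a single bit, this mechanism is (epsilon, 0)-differentially private."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit

eps = math.log(3)                          # ε = ln 3
p = math.exp(eps) / (1 + math.exp(eps))    # probability of answering truthfully
ratio = p / (1 - p)                        # Pr[M(0)=0] / Pr[M(1)=0], the worst case
print(f"p_truth = {p:.2f}, likelihood ratio = {ratio:.2f}, e^eps = {math.exp(eps):.2f}")
# p_truth = 0.75, likelihood ratio = 3.00, e^eps = 3.00

reports = [randomized_response(1, eps, random.Random(i)) for i in range(1000)]
# roughly 75% of the reports are truthful
```

No single report reveals the true bit with confidence better than 3:1 odds, yet aggregate statistics over many users remain estimable — the same privacy-utility tension that DP training mechanisms navigate at scale.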
DP-SGD: Training with Differential Privacy
Differentially Private Stochastic Gradient Descent (DP-SGD), introduced by Abadi et al. (2016), is the standard mechanism for training neural networks with differential privacy. The algorithm modifies standard SGD in two ways:
- Gradient clipping: per-sample gradients are computed individually (rather than averaged over a batch) and clipped to a maximum L2 norm C. This bounds the influence of any single training example on the gradient update.
- Gaussian noise addition: Gaussian noise with standard deviation proportional to C × σ is added to the clipped gradient sum before the parameter update, where σ is the noise multiplier calibrated to the desired (ε, δ) budget.
The cost of DP-SGD is a reduction in model accuracy: more noise means less effective gradient signal, especially in early training. The privacy-utility tradeoff is governed by ε: values of ε < 1 provide strong privacy but significant accuracy loss; ε = 1–10 provides moderate privacy with modest accuracy cost (typically 2–5% on classification benchmarks).
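The two modifications amount to only a few lines when written against raw per-sample gradients. A minimal NumPy sketch of one DP-SGD update (toy dimensions; the clipping norm, noise multiplier, and learning rate are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_sample_grads, C, sigma, lr, params):
    """One DP-SGD update.
    per_sample_grads : array of shape [batch, dim], one gradient per example.
    1. Clip each example's gradient to L2 norm at most C.
    2. Sum the clipped gradients; add N(0, (sigma*C)^2) noise per coordinate.
    3. Average and take a gradient step."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, sigma * C, size=params.shape)
    return params - lr * noisy_sum / len(per_sample_grads)

grads = rng.normal(size=(32, 10))   # batch of 32 per-sample gradients
params = dp_sgd_step(grads, C=1.0, sigma=1.1, lr=0.1, params=np.zeros(10))
print(params.round(3))
```

Real implementations (Opacus, TensorFlow Privacy) wrap this step with a privacy accountant that tracks the cumulative (ε, δ) spent across all iterations.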
Implementation with Opacus
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.utils.batch_memory_manager import BatchMemoryManager
import numpy as np
def train_with_dp(
model: nn.Module,
train_loader: DataLoader,
optimizer: torch.optim.Optimizer,
n_epochs: int,
target_epsilon: float = 1.0, # privacy budget
target_delta: float = 1e-5, # failure probability
max_grad_norm: float = 1.0, # gradient clipping norm
    noise_multiplier: float = 1.1,  # σ: used only by the make_private() alternative below
device: str = "cpu"
) -> dict:
"""
Train a PyTorch model with (target_epsilon, target_delta)-differential privacy
using the Opacus library (meta-pytorch/opacus).
Returns: dict with final epsilon, delta, and per-epoch losses.
"""
model = model.to(device)
criterion = nn.CrossEntropyLoss()
# Attach the PrivacyEngine to enforce DP during training
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
module=model,
optimizer=optimizer,
data_loader=train_loader,
epochs=n_epochs,
target_epsilon=target_epsilon,
target_delta=target_delta,
max_grad_norm=max_grad_norm,
)
# Equivalent to specifying noise_multiplier directly:
# model, optimizer, train_loader = privacy_engine.make_private(
# module=model, optimizer=optimizer, data_loader=train_loader,
# noise_multiplier=noise_multiplier, max_grad_norm=max_grad_norm,
# )
history = []
for epoch in range(1, n_epochs + 1):
model.train()
epoch_losses = []
with BatchMemoryManager(
data_loader=train_loader,
max_physical_batch_size=64, # memory-efficient batching for DP
optimizer=optimizer
) as memory_safe_loader:
for data, target in memory_safe_loader:
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
epoch_losses.append(loss.item())
# Query current privacy budget spent
epsilon = privacy_engine.get_epsilon(target_delta)
mean_loss = np.mean(epoch_losses)
print(
f"Epoch {epoch:3d}/{n_epochs} | "
f"Loss: {mean_loss:.4f} | "
f"Privacy: (ε={epsilon:.2f}, δ={target_delta})"
)
history.append({"epoch": epoch, "loss": mean_loss, "epsilon": epsilon})
final_epsilon = privacy_engine.get_epsilon(target_delta)
print(f"\nFinal privacy budget spent: ε={final_epsilon:.3f}, δ={target_delta}")
return {"epsilon": final_epsilon, "delta": target_delta, "history": history}
# ── Privacy-utility tradeoff guide ─────────────────────────────────────
# ε ≈ 0.1 : Very strong privacy. Significant accuracy degradation (~10-20%).
# Membership inference reduced to near-random guessing.
# ε ≈ 1.0 : Strong privacy. Moderate accuracy degradation (~3-8%).
# Practical for medical / financial datasets.
# ε ≈ 10.0 : Moderate privacy. Minimal accuracy degradation (~1-3%).
# Reduces but does not eliminate membership inference risk.
# ε > 100 : Weak privacy. Near-original model utility.
# Provides little meaningful protection against determined attackers.
Practical Deployment Considerations
DP-SGD requires per-sample gradient computation, which is more expensive than standard batched gradient computation — typically 2–3× overhead in memory and 1.5–2× in compute time using libraries like Opacus. Certain layer types (BatchNorm) are incompatible with DP-SGD because they mix cross-sample information; they must be replaced with GroupNorm or LayerNorm. For very large models (LLMs), DP fine-tuning is more practical than DP pretraining from scratch: fine-tuning a pre-trained model with DP requires fewer gradient steps, so the privacy budget is spent more efficiently.
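As a concrete example of the BatchNorm constraint, the replacement can be done by hand with a short recursive swap. This is a sketch (Opacus also ships a ModuleValidator utility that performs such fixes automatically; the group count here is an arbitrary choice):

```python
import math
import torch.nn as nn

def swap_batchnorm_for_groupnorm(module: nn.Module, num_groups: int = 8) -> nn.Module:
    """Recursively replace BatchNorm layers, which mix statistics across samples
    (incompatible with per-sample gradient clipping), with GroupNorm layers."""
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            groups = math.gcd(num_groups, child.num_features) or 1  # must divide num_features
            setattr(module, name, nn.GroupNorm(groups, child.num_features))
        else:
            swap_batchnorm_for_groupnorm(child, num_groups)
    return module

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
model = swap_batchnorm_for_groupnorm(model)
print(model)  # the BatchNorm2d slot now holds a GroupNorm
```

GroupNorm normalizes within each sample, so per-sample gradients remain well defined and the DP accounting stays valid.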
DP provides formal, quantitative guarantees — not just "we added noise and it seems harder to attack." When a regulator or legal body asks "how private is this model?", a trained model with ε = 1.0, δ = 10⁻⁵ gives a precise, auditable answer. This is a significant advantage over heuristic defenses, and why DP is increasingly required by privacy regulations for sensitive ML deployments. [Opacus Tutorials — Meta AI]
Module Summary
This module has covered the full lifecycle of model extraction and inference attacks — from the economics of API-based model cloning, through the technical machinery of shadow models, training data extraction, and side-channel attacks on encrypted traffic, to the defense landscape of rate limiting, DP, watermarking, and anomaly detection.
Key takeaways:
- Model extraction is economically asymmetric: an attacker can clone millions of dollars of training work for thousands of dollars in API queries.
- Training data extraction from LLMs is not theoretical — Carlini et al. demonstrated it against GPT-2 and ChatGPT production systems.
- Membership inference turns model outputs into a privacy detector, with practical implications for GDPR, HIPAA, and healthcare/financial ML.
- Side-channel attacks (Whisper Leak, token length, speculative decoding, KV cache timing) show that HTTPS encryption alone is insufficient for privacy-sensitive LLM deployments.
- Hardware attacks (TPUXtract) demonstrate that physical proximity to accelerators can leak architectural secrets without any API access.
- Differential privacy is the only defense with formal, quantitative guarantees — at the cost of utility and computational overhead.
- Layered defenses (rate limiting + output perturbation + watermarking + DP) are more robust than any single mechanism.