Module 6: Model Extraction and Inference Attacks
How adversaries steal proprietary models, reconstruct training data, infer membership, and exploit encrypted network traffic — and the defenses that resist them.
Model Extraction Fundamentals
A machine learning model is intellectual property. It encodes years of domain expertise, millions of labeled training examples, substantial compute investment, and proprietary architectural choices refined through exhaustive experimentation. Training GPT-4-class models reportedly costs tens to hundreds of millions of dollars in compute alone — a figure that does not capture the human expertise spent curating training data, tuning hyperparameters, or performing alignment work. Model extraction, also called model stealing, is the practice of cloning a proprietary model's behavior by querying its public-facing API and training a surrogate model on the query-response pairs. The result: an adversary can obtain a functional equivalent of the target model for a fraction of its original training cost.
Why Model Extraction Matters
The motivations for attacking a model's intellectual property are diverse and often compound. First, there is straightforward IP theft: a competitor can extract a production model and deploy it as their own product, bypassing the original developer's licensing fees, terms of service, and competitive moat. Second, a stolen surrogate model can serve as a stepping stone for further attacks. Generating adversarial examples is far easier when you have white-box access to model gradients. By extracting a surrogate first, an attacker converts a black-box problem into a white-box one, making downstream adversarial attacks orders of magnitude more effective. Third, extraction can be used to circumvent rate limits and cost controls: once you own a local copy, you can run inference without per-query charges or usage monitoring.
Finally, and perhaps most alarmingly, extraction enables privacy inference. A model trained on sensitive data (medical records, financial histories, private communications) may leak information about its training set even when accessed only through its API. Extraction gives the adversary a persistent local artifact to probe at leisure, without the audit trails that API providers maintain.
Exact Extraction vs. Functional Equivalence
Two distinct goals exist within model extraction. Exact extraction attempts to recover the precise weights and architecture of the target model, reproducing its behavior on every possible input — including corner cases. This is theoretically possible for simple model classes (e.g., small ReLU networks) where the number of queries needed to uniquely determine weights grows polynomially with model size, but it remains computationally intractable for modern billion-parameter LLMs.
Functional equivalence, by contrast, settles for a surrogate that matches the target model's behavior on a task-relevant input distribution. The surrogate need not share the target's architecture or weights; it only needs to produce similar predictions on the inputs the attacker cares about. This is the practically relevant threat for most commercial deployments and requires far fewer queries than exact extraction. Research has demonstrated functional equivalents of commercial NLP models achievable with a few hundred thousand API calls — well within the budget of a well-funded adversary. [Tramèr et al.]
- Attack Goal: Clone model behavior without access to weights or architecture.
- Attack Surface: Any public prediction API that returns labels or probability scores.
- Attacker Cost: API query fees + surrogate training compute (typically 100–10,000× cheaper than original).
- Victim Loss: Revenue, competitive advantage, downstream privacy risks for training subjects.
Query-Based Model Stealing
The canonical model stealing attack unfolds in three phases: systematic querying of the target API, accumulation of a query-response dataset, and training a surrogate model on that dataset. The simplicity of this pipeline belies its effectiveness. Modern research has shown that even a surrogate with a different architecture than the target can achieve near-identical task performance when trained on well-selected query-response pairs. [Tramèr et al.]
Query Strategy: Active Learning for Efficiency
A naive attacker might sample inputs uniformly at random. A sophisticated attacker uses active learning to select queries that maximize information gain. The core insight is that not all inputs are equally informative: points near the model's decision boundary carry far more information about the model's function than points firmly in one class region. Active learning heuristics (uncertainty sampling, query by committee, core-set selection) allow an attacker to build an accurate surrogate in significantly fewer queries — sometimes an order of magnitude fewer than uniform sampling.
The attacker begins with a seed set of unlabeled inputs, queries the target, and then selects the next query batch by asking: "which inputs, if labeled, would most reduce the surrogate's uncertainty?" The surrogate is retrained after each batch, and the process repeats. This closed loop is why model stealing can be devastatingly efficient even against APIs that return only hard labels (no probabilities): even binary membership information progressively constrains the surrogate.
Training the Surrogate
Once a sufficient set of (input, target_response) pairs has been accumulated, the attacker trains a surrogate model — which need not share the target's architecture — to minimize the loss on those pairs. Soft labels (probability distributions) are far more information-rich than hard labels: a prediction of [cat: 0.72, dog: 0.25, fox: 0.03] conveys the target model's confidence geometry near that input, whereas a hard label "cat" discards the relative scores. Where APIs return confidence scores, the attacker should use them as training targets (knowledge distillation). Where only hard labels are available, temperature scaling and label smoothing on the surrogate side partially compensate.
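The distillation objective can be sketched as a temperature-scaled cross-entropy against the target API's soft labels. The function and values below are illustrative, not part of this module's implementation:

```python
import numpy as np

def distillation_loss(surrogate_logits, target_probs, temperature=2.0):
    """Cross-entropy between the API's soft labels and the surrogate's
    temperature-softened predictions (knowledge distillation)."""
    z = surrogate_logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_q = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Average cross-entropy H(p, q) over the batch
    return float(-(target_probs * log_q).sum(axis=1).mean())

# Soft labels carry more training signal than a hard label alone:
p = np.array([[0.72, 0.25, 0.03]])          # target API's probability vector
logits_agree = np.array([[2.0, 1.0, -2.0]]) # surrogate roughly agrees
logits_wrong = np.array([[-2.0, 1.0, 2.0]]) # surrogate disagrees
assert distillation_loss(logits_agree, p) < distillation_loss(logits_wrong, p)
```

Lowering the temperature sharpens the surrogate's distribution; values around 2–4 are common when distilling from probability outputs.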
Working Implementation
import numpy as np
import requests
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from scipy.stats import entropy

class ModelExtractor:
    """
    Black-box model extraction via active-learning-guided query selection.
    Trains a local surrogate MLPClassifier to mimic a remote target API.
    """

    def __init__(self, target_api_url, n_classes=2, query_budget=5000):
        self.target_url = target_api_url
        self.n_classes = n_classes
        self.query_budget = query_budget
        self.queries = []    # List[np.ndarray] — inputs sent to target
        self.responses = []  # List[int or list] — labels/probabilities returned
        # Surrogate architecture: two hidden layers
        self.surrogate = MLPClassifier(
            hidden_layer_sizes=(100, 50),
            activation='relu',
            max_iter=500,
            random_state=42
        )
        self.scaler = StandardScaler()
        self._fitted = False

    def query_target(self, inputs):
        """Send inputs to the target API and record (input, response) pairs."""
        for x in inputs:
            try:
                resp = requests.post(
                    self.target_url,
                    json={"input": x.tolist()},
                    timeout=10
                )
                resp.raise_for_status()
                data = resp.json()
                # Accept either {"prediction": 1} or {"probabilities": [0.3, 0.7]}
                label = data.get("prediction", np.argmax(data.get("probabilities", [0])))
                self.queries.append(x)
                self.responses.append(label)
            except (requests.RequestException, KeyError) as e:
                print(f"Query failed: {e}")

    def train_surrogate(self):
        """Fit the surrogate on all accumulated (query, response) pairs."""
        X = np.array(self.queries)
        y = np.array(self.responses)
        X_scaled = self.scaler.fit_transform(X)
        self.surrogate.fit(X_scaled, y)
        self._fitted = True
        train_acc = self.surrogate.score(X_scaled, y)
        return train_acc

    def select_uncertain_batch(self, candidate_pool, batch_size=100):
        """
        Active learning: pick the batch_size candidates where the surrogate
        is most uncertain (highest entropy over class probabilities).
        Requires surrogate to be trained at least once.
        """
        if not self._fitted:
            # Cold start — return random batch
            idx = np.random.choice(len(candidate_pool), batch_size, replace=False)
            return candidate_pool[idx]
        X_scaled = self.scaler.transform(candidate_pool)
        proba = self.surrogate.predict_proba(X_scaled)         # (N, n_classes)
        uncertainties = np.array([entropy(p) for p in proba])  # Shannon entropy
        # Select top-k most uncertain samples
        top_idx = np.argsort(uncertainties)[-batch_size:]
        return candidate_pool[top_idx]

    def run_extraction(self, input_domain_sampler, batch_size=100):
        """
        Full extraction loop.
        input_domain_sampler: callable() -> np.ndarray of shape (N, d)
        """
        rounds = self.query_budget // batch_size
        for round_i in range(rounds):
            # 1. Sample a large candidate pool from the input domain
            candidates = input_domain_sampler()
            # 2. Use active learning to pick the most informative batch
            batch = self.select_uncertain_batch(candidates, batch_size)
            # 3. Query the target API
            self.query_target(batch)
            # 4. Retrain surrogate on all data so far
            if len(self.queries) >= 200:  # Minimum for meaningful training
                acc = self.train_surrogate()
                print(f"Round {round_i+1}: {len(self.queries)} queries, surrogate accuracy={acc:.3f}")
        return self.surrogate

# ── Example usage ──────────────────────────────────────────────────────
# Suppose the target model classifies 20-dimensional input vectors
def domain_sampler():
    """Returns 500 random candidates from input domain."""
    return np.random.uniform(-1, 1, size=(500, 20)).astype(np.float32)

extractor = ModelExtractor(
    target_api_url="https://api.example.com/predict",
    n_classes=3,
    query_budget=2000
)
surrogate_model = extractor.run_extraction(domain_sampler, batch_size=100)
print(f"Extraction complete. Total queries: {len(extractor.queries)}")
Optimizing for Maximum Information Gain
Beyond uncertainty sampling, attackers can exploit several additional strategies. Jacobian-based data augmentation (JBDA) synthesizes new training points by applying small gradient steps to existing labeled inputs, generating inputs near decision boundaries without additional API calls. Model-free approaches use generative models to synthesize diverse inputs from scratch. For NLP models, prompt chaining — where the attacker systematically varies one linguistic dimension at a time — allows efficient coverage of the response surface with structured query sets. [Papernot et al.]
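The JBDA step can be sketched as follows. The attacker has white-box access to the surrogate, so exact gradients would normally be used; here a finite-difference estimate of the Jacobian stands in, and the toy model and names are illustrative:

```python
import numpy as np

def jbda_augment(surrogate_predict_proba, X, eps=0.1, delta=1e-3):
    """Jacobian-based data augmentation (Papernot et al.): push each labeled
    point along the sign of the surrogate's Jacobian for its predicted class.
    The resulting points become the next queries sent to the target API."""
    X_new = []
    for x in X:
        p = surrogate_predict_proba(x[None, :])[0]
        c = int(np.argmax(p))                  # surrogate's predicted class
        grad = np.zeros_like(x)
        for j in range(len(x)):                # finite-difference Jacobian row
            x_pert = x.copy()
            x_pert[j] += delta
            grad[j] = (surrogate_predict_proba(x_pert[None, :])[0][c] - p[c]) / delta
        X_new.append(x + eps * np.sign(grad))  # step along the Jacobian sign
    return np.stack(X_new)

# Toy surrogate with a linear decision boundary at x0 + x1 = 0
def toy_proba(X):
    s = 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))
    return np.stack([1 - s, s], axis=1)

X_aug = jbda_augment(toy_proba, np.array([[0.5, 0.5]]), eps=0.1)
assert np.allclose(X_aug[0], [0.6, 0.6])
```

No additional API calls are spent generating these points; the target is only queried to label them.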
Training Data Extraction from LLMs
Large language models are, in a precise technical sense, compressed summaries of their training corpora. They learn statistical regularities — everything from spelling patterns to full verbatim passages — and store this knowledge in their billions of parameters. This creates a vulnerability: a sufficiently precise attacker can cause a model to regurgitate verbatim text that appeared in its training set. The landmark research by Carlini et al. (2021) demonstrated this decisively against GPT-2, extracting hundreds of verbatim training examples including full names, physical addresses, phone numbers, and copyrighted text simply by querying the model's public API. [Carlini et al., USENIX Security 2021]
Why LLMs Memorize Training Data
Memorization arises from a combination of factors. Training data that is duplicated many times in the corpus is more likely to be memorized — GPT-2 memorizes entire MIT license texts because they appear verbatim on hundreds of thousands of GitHub repositories. Model capacity amplifies this: larger models memorize more, because they have more parameters to store rare training examples. Counterintuitively, longer training (more epochs) also increases memorization, as the model sees the same examples repeatedly and fits them more precisely.
Carlini et al. define k-eidetic memorization: a string s is k-eidetically memorized by a model if the model can reproduce s from a length-k prefix, and s appears in the training data only once. This is distinct from factual knowledge (which may be learned from many corroborating examples) — eidetic memorization is verbatim retention from a single training example.
Extraction Methodology
The attack pipeline involves three steps: generation, ranking, and verification. First, the attacker generates a large number of text samples from the model — Carlini et al. generated 600,000 samples using diverse prompting strategies. Second, they rank these samples using membership inference metrics as a filter: samples where the model assigns unusually high likelihood are more likely to be memorized. Specifically, they compare the model's perplexity on a candidate to a smaller reference model's perplexity. Memorized text is high-likelihood for the large model but not for the reference model. Third, the top-ranked candidates are verified against the original training corpus.
Divergence and Prefix Attacks
Divergence attacks exploit the phenomenon that a model fine-tuned with RLHF or instruction tuning will suppress memorization outputs during normal use, but can be induced to "forget" this suppression by crafting adversarial prompts. Carlini et al. (2023) extracted megabytes of training data from ChatGPT — despite its alignment training — by using a simple repetition prompt: asking the model to repeat a word indefinitely causes it to eventually diverge from its aligned behavior and emit training data verbatim. [Carlini et al., 2023 — ChatGPT Extraction]
Prefix attacks provide the model with a genuine prefix from the training data and observe whether it completes the rest of the passage accurately. Prompting GPT-2 with "My address is 1 Main Street" caused it to accurately complete with specific real individuals' contact information in the Carlini et al. experiments. Completion-based extraction is the general technique: any prompt that was seen during training acts as a retrieval key for the passage that followed it.
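A prefix attack can be sketched independently of any particular model; `generate_fn` and the toy corpus below are illustrative stand-ins for a real LLM API:

```python
def prefix_attack(generate_fn, known_prefix, true_suffix, min_overlap=20):
    """Prefix attack probe: feed a genuine training-data prefix to the model
    and check whether the completion reproduces the true continuation.
    generate_fn(prompt) -> completion string (wraps any LLM API or local model).
    Returns (is_memorized, matched_chars)."""
    completion = generate_fn(known_prefix)
    # Longest common prefix between the completion and the true suffix
    matched = 0
    for a, b in zip(completion, true_suffix):
        if a != b:
            break
        matched += 1
    return matched >= min_overlap, matched

# Toy stand-in model that has "memorized" one passage verbatim
CORPUS = {"My address is 1 Main St": "reet, Springfield, and my phone is 555-0100."}
def toy_generate(prompt):
    return CORPUS.get(prompt, "I cannot help with that.")

hit, n = prefix_attack(toy_generate, "My address is 1 Main St",
                       "reet, Springfield, and my phone is 555-0100.")
assert hit and n == 44
```

Against a real model the same probe is run over many candidate prefixes, with `min_overlap` set high enough to rule out chance agreement.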
Testing for Memorization
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import zlib

def compute_perplexity(model, tokenizer, text, device="cpu"):
    """Compute per-token perplexity of a text under a given model."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(output.loss).item()

def memorization_score(large_model, small_model, tokenizer, text, device="cpu"):
    """
    Carlini et al. 'ratio' metric: compares large vs. small model perplexity.
    A memorized passage has LOW perplexity under the large model
    but NOT proportionally low under the small reference model.
    Higher score → more likely memorized.
    """
    ppl_large = compute_perplexity(large_model, tokenizer, text, device)
    ppl_small = compute_perplexity(small_model, tokenizer, text, device)
    # Ratio metric: lower PPL in large vs small suggests memorization
    ratio_score = np.log(ppl_small) / np.log(ppl_large)
    # Zlib metric: compare model PPL to a compression-based entropy estimate
    zlib_entropy = len(zlib.compress(text.encode())) / len(text)
    zlib_score = zlib_entropy / np.log(ppl_large)
    return {
        "perplexity_large": ppl_large,
        "perplexity_small": ppl_small,
        "ratio_score": ratio_score,  # higher → more suspect
        "zlib_score": zlib_score,    # higher → low model PPL relative to compressibility → more suspect
    }

def generate_candidates(model, tokenizer, n_samples=200, max_length=256, device="cpu"):
    """
    Generate n_samples completions from the model with an empty prefix.
    Returns a list of generated strings.
    """
    model.eval()
    candidates = []
    with torch.no_grad():
        for _ in range(n_samples):
            input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(device)
            output = model.generate(
                input_ids,
                max_new_tokens=max_length,
                do_sample=True,
                top_k=40,
                temperature=1.0,
                pad_token_id=tokenizer.eos_token_id
            )
            text = tokenizer.decode(output[0], skip_special_tokens=True)
            candidates.append(text)
    return candidates

# ── Example: screen 200 GPT-2 generations for possible memorized text ──
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
    model_xl = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device)
    model_sm = GPT2LMHeadModel.from_pretrained("gpt2").to(device)  # reference
    candidates = generate_candidates(model_xl, tokenizer, n_samples=200, device=device)
    scored = []
    for text in candidates:
        scores = memorization_score(model_xl, model_sm, tokenizer, text, device)
        scored.append((text, scores))
    # Sort by ratio score descending — highest ratio = most likely memorized
    scored.sort(key=lambda x: x[1]["ratio_score"], reverse=True)
    print("Top 5 candidates most likely to contain memorized training data:")
    for text, scores in scored[:5]:
        print(f"  Ratio={scores['ratio_score']:.3f} | PPL(xl)={scores['perplexity_large']:.1f}")
        print(f"  Text: {text[:120]}...")
        print()
In Carlini et al.'s best attack configuration, 67% of top-ranked candidates were confirmed verbatim training examples. Among their 604 unique extracted memorized sequences, 46 contained personal names and 32 contained contact information — real individuals whose data had been harvested into GPT-2's CommonCrawl training corpus without their knowledge. [USENIX Security 2021]
Membership Inference
Imagine a hospital trains a machine learning model to predict patient readmission risk. The model is deployed via a public API for insurance companies to query. Now consider an adversary — perhaps a competitor, a nosy employer, or a malicious insurer — who has a specific individual's medical record and wants to know: was this person's data used to train this model? This is the membership inference problem, and it is one of the most practically significant privacy threats in modern machine learning. [Shokri et al., IEEE S&P 2017]
Why Models Leak Membership
The fundamental cause is overfitting. A model that has been trained on a data point typically assigns it higher confidence, lower loss, and different internal representations than unseen data points. Even well-regularized models exhibit this difference to a measurable degree. The gap is larger for rare, unusual, or exactly duplicated training examples — the same memorization phenomenon that enables training data extraction also enables membership inference.
The Shadow Model Technique
The seminal attack by Shokri et al. (2017) introduced the shadow model approach. Because the attacker cannot directly observe the target model's training set, they simulate it: they train several shadow models on datasets drawn from the same distribution as the target's training data. For each shadow model, the attacker knows exactly which points were in the training set (members) and which were not (non-members). They record the model's output vector for each point and label it accordingly. This produces a labeled dataset of (model_output, member/non-member) pairs. An attack classifier trained on this labeled dataset can then be applied to the target model's outputs to infer membership.
The technique achieved median membership inference accuracy of 94% against models trained on Google's ML services and 74% against Amazon's services in realistic experiments. [Shokri et al.]
Loss-Based Inference
A simpler approach that avoids training shadow models is the loss threshold attack: compute the model's loss on the target point and compare it to a threshold. If the loss is below the threshold (the model is highly confident), predict membership. This works because training examples tend to have lower loss than unseen data, especially for overfit models. More sophisticated variants use reference models: compute the likelihood ratio between the target model and a reference model trained on disjoint data. Points where this ratio is high are likely members.
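The loss-threshold attack is only a few lines. The toy model and threshold value below are illustrative; in practice the threshold is calibrated, for example on losses from a reference model:

```python
import numpy as np

def loss_threshold_mia(model_proba, X, y_true, threshold):
    """Loss-threshold membership inference: predict 'member' whenever the
    model's cross-entropy loss on (x, y) falls below a calibrated threshold.
    model_proba(X) -> (n_samples, n_classes) probability matrix."""
    proba = model_proba(X)
    losses = -np.log(proba[np.arange(len(y_true)), y_true] + 1e-12)
    return losses < threshold  # True => predicted member

# Toy model: confident (low-loss) on "members", uncertain on "non-members"
def toy_proba(X):
    return np.where(X[:, :1] > 0, [[0.95, 0.05]], [[0.55, 0.45]])

X = np.array([[1.0], [-1.0]])
y = np.array([0, 0])
pred = loss_threshold_mia(toy_proba, X, y, threshold=0.3)
assert pred.tolist() == [True, False]
```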
Python Implementation
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

class MembershipInferenceAttack:
    """
    Shadow-model-based membership inference attack.
    Assumes black-box access to a target model that returns
    class probability vectors for input features.
    """

    def __init__(self, n_shadow=4, shadow_architecture=None):
        self.n_shadow = n_shadow
        self.shadow_architecture = shadow_architecture or (64, 32)
        self.attack_classifier = LogisticRegression(max_iter=500)
        self.shadow_models = []
        self._trained = False

    def _train_shadow_model(self, X_train, y_train):
        """Train a single shadow model on a subset of proxy data."""
        shadow = MLPClassifier(
            hidden_layer_sizes=self.shadow_architecture,
            max_iter=300,
            random_state=np.random.randint(0, 10000)
        )
        shadow.fit(X_train, y_train)
        return shadow

    def _extract_features(self, model, X):
        """
        Extract membership inference features from model outputs.
        Uses the full probability vector as features (preserves calibration signal).
        """
        proba = model.predict_proba(X)  # shape: (n_samples, n_classes)
        # Additional engineered features: confidence, entropy, top-2 gap
        confidence = proba.max(axis=1, keepdims=True)
        pred_entropy = -np.sum(proba * np.log(proba + 1e-9), axis=1, keepdims=True)
        sorted_p = np.sort(proba, axis=1)[:, ::-1]
        top2_gap = (sorted_p[:, 0] - sorted_p[:, 1]).reshape(-1, 1)
        return np.hstack([proba, confidence, pred_entropy, top2_gap])

    def train_attack(self, proxy_data_X, proxy_data_y):
        """
        Build labeled (features, membership) dataset using shadow models,
        then train the attack classifier.
        proxy_data_X, proxy_data_y : dataset drawn from same distribution
                                     as target's training set.
        """
        all_features = []
        all_labels = []
        n = len(proxy_data_X)
        split = n // 2
        for i in range(self.n_shadow):
            # Randomly partition proxy data into shadow-train and shadow-test
            idx = np.random.permutation(n)
            train_idx, test_idx = idx[:split], idx[split:]
            X_tr, y_tr = proxy_data_X[train_idx], proxy_data_y[train_idx]
            X_te, y_te = proxy_data_X[test_idx], proxy_data_y[test_idx]
            shadow = self._train_shadow_model(X_tr, y_tr)
            self.shadow_models.append(shadow)
            # "In" examples: training set of this shadow model → label 1
            f_in = self._extract_features(shadow, X_tr)
            # "Out" examples: test set of this shadow model → label 0
            f_out = self._extract_features(shadow, X_te)
            all_features.append(np.vstack([f_in, f_out]))
            all_labels.extend([1] * len(X_tr) + [0] * len(X_te))
            print(f"Shadow model {i+1}/{self.n_shadow} trained. "
                  f"Accuracy={shadow.score(X_te, y_te):.3f}")
        F = np.vstack(all_features)
        L = np.array(all_labels)
        self.attack_classifier.fit(F, L)
        self._trained = True

    def infer_membership(self, target_model, query_points):
        """
        Given a trained target model and query data points,
        return membership probability for each point.
        1.0 = likely member, 0.0 = likely non-member.
        """
        if not self._trained:
            raise RuntimeError("Call train_attack() first.")
        features = self._extract_features(target_model, query_points)
        return self.attack_classifier.predict_proba(features)[:, 1]

    def evaluate(self, target_model, member_X, nonmember_X):
        """Compute AUC of attack against target model."""
        member_scores = self.infer_membership(target_model, member_X)
        nonmember_scores = self.infer_membership(target_model, nonmember_X)
        y_true = np.concatenate([np.ones(len(member_X)), np.zeros(len(nonmember_X))])
        y_score = np.concatenate([member_scores, nonmember_scores])
        auc = roc_auc_score(y_true, y_score)
        print(f"Membership Inference AUC: {auc:.4f}")
        return auc
Privacy Implications: GDPR and HIPAA
Membership inference directly violates the privacy principles enshrined in major data protection laws. Under GDPR Article 17 (right to erasure), individuals can request deletion of their data — but if a deployed model reveals membership, effective deletion becomes impossible without retraining or applying machine unlearning techniques. Under HIPAA, health information used to train models without de-identification may expose institutions to liability if membership can be inferred from the deployed model. Regulatory bodies are increasingly treating demonstrable membership inference vulnerability as a compliance failure, not merely a theoretical risk.
Attribute Inference
Membership inference is a binary question: was this person in the training set or not? Attribute inference is more granular: given that a person is in the training set (or even just as an input to the model), can an attacker learn sensitive attributes about them that were not explicitly provided as input? This class of attack is particularly pernicious because it can operate at query-time, not just at training time — any prediction API can potentially leak demographic or behavioral attributes about the query subject.
How Attribute Inference Works
Consider a credit scoring model trained on a dataset that includes both "approved features" (income, credit history, loan amount) and "protected attributes" (race, gender, zip code as a proxy for race). Even if the protected attributes are excluded from the model's official feature set at inference time, the model may have absorbed the correlation during training. An adversary who knows the target individual's approved features can query the model and, by observing the prediction, infer the individual's protected attributes. Research has shown that even models trained with explicit fairness constraints can still leak protected attributes through their output distributions.
The attack typically works by training a reconstruction model: the attacker collects many (known_features, model_output) pairs where the sensitive attribute is also known (from a separate dataset or via correlation), and trains a classifier to predict the sensitive attribute from the model's output. Yeom et al. (2018) formalized this as an attack that succeeds whenever a model has learned to exploit the correlation between the sensitive attribute and the label.
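The reconstruction-model idea can be demonstrated on synthetic data; the correlation strength and variable names below are assumptions chosen for illustration, not taken from any cited study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: a sensitive binary attribute shifts the target model's
# output score, even though it is never an input feature at inference time.
n = 2000
sensitive = rng.integers(0, 2, n)                              # hidden attribute
model_output = 0.4 + 0.2 * sensitive + rng.normal(0, 0.05, n)  # leaked correlation

# Attacker: on an auxiliary dataset where both are known, learn to predict
# the sensitive attribute from the model's output alone
clf = LogisticRegression().fit(model_output.reshape(-1, 1), sensitive)
acc = clf.score(model_output.reshape(-1, 1), sensitive)
assert acc > 0.9  # the output distribution reveals the protected attribute
```

The same reconstruction classifier then runs against the deployed model's outputs for individuals whose sensitive attribute is unknown.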
Demographic and Behavioral Inference
LLMs present a particularly rich surface for attribute inference because they produce nuanced, open-ended outputs. Research has demonstrated that LLMs can be used to predict user demographics (age, gender, political affiliation, nationality) from writing style alone. When an LLM is fine-tuned on user interaction logs, the fine-tuned model may expose individual users' characteristics through subtle systematic differences in how it responds to queries — even for users who were part of the fine-tuning set rather than the current query.
Cross-Referencing Attacks
A powerful variant combines attribute inference with auxiliary data. The attacker does not rely on the model alone; they combine model outputs with publicly available databases, social media profiles, and other data sources. For example, knowing that a medical model predicts a 73% readmission risk for a patient with features (age=54, zip=90210, diagnosis=T2D) might be enough — when cross-referenced with public voter registration and property records — to uniquely identify the patient and infer their full medical history.
Mitigation Strategies
Defending against attribute inference requires both training-time and deployment-time interventions. Adversarial training for fairness penalizes models that allow a discriminator to infer protected attributes from intermediate representations. Output restriction — returning only hard labels rather than confidence scores — reduces the information available to an attacker but does not eliminate the risk. Federated learning with differential privacy can limit the amount of individual-level information encoded in model weights, providing the strongest theoretical guarantees.
Side-Channel Attacks on LLMs: Whisper Leak
A widespread assumption in LLM deployment is that TLS/HTTPS encryption provides meaningful confidentiality for user queries. The Whisper Leak research (2025) dismantles this assumption. The attack demonstrates that an adversary with passive access to a user's encrypted network traffic — such as an internet service provider, a compromised router, or a malicious Wi-Fi access point — can infer the topic of a user's LLM query with over 98% accuracy on 17 of 28 tested commercial LLMs, without ever decrypting a single byte of payload. [Whisper Leak, arXiv 2511.03675, 2025]
Attack Premise: Why Encryption Is Not Enough
TLS (with AEAD cipher modes like AES-GCM) encrypts payload content but preserves payload size: size(ciphertext) = size(plaintext) + constant. When an LLM generates a streaming response token by token, each token is sent as a separate encrypted packet. Because tokens have variable lengths (the token "the" is 3 characters; "antidisestablishmentarianism" is 28), the sequence of packet sizes directly encodes the sequence of token lengths.
The key insight is that different topics produce systematically different token length patterns. A response about quantum physics uses longer, less frequent technical terms. A response about cooking uses shorter, more common vocabulary. A response about legal matters includes specific legal terms and Latin phrases. These patterns are stable enough across different users asking similar questions that a trained classifier can identify the topic from packet sizes alone — even without seeing the content.
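The size-to-length arithmetic can be made concrete in a few lines; the overhead constant below is a hypothetical placeholder (real TLS record overhead depends on version, cipher suite, and framing):

```python
# If each streamed token travels in its own TLS record, a passive observer
# recovers token lengths by subtracting the fixed per-record overhead.
TLS_OVERHEAD = 29  # hypothetical constant: record header + auth tag

def token_lengths(ciphertext_sizes, overhead=TLS_OVERHEAD):
    """Map observed encrypted record sizes back to plaintext token lengths."""
    return [s - overhead for s in ciphertext_sizes]

tokens = ["The", " electron", " exhibits", " wave", "-particle", " duality"]
records = [len(t) + TLS_OVERHEAD for t in tokens]   # what the wire shows
assert token_lengths(records) == [len(t) for t in tokens]
```

The recovered length sequence, not the plaintext, is the feature the topic classifier consumes.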
Experimental Setup and Results
The researchers collected 21,716 queries per model (100 topic variants × 100 repeats + 11,716 Quora noise queries) and trained binary classifiers using tcpdump-captured encrypted traffic traces. Testing was conducted against 28 LLMs from major providers including OpenAI (GPT-4o-mini, GPT-4.1), Anthropic (Claude 3 Haiku), Google (Gemini 1.5 Flash, 2.5 Pro), Microsoft, xAI (Grok), Mistral, DeepSeek, Meta (LLaMA via Lambda), and Amazon (Nova). [Whisper Leak]
- AUPRC (Median): >98% for 17 of 28 models; average 96.8% across all models.
- Precision: 100% precision at 5–20% recall for 17/28 models (e.g., GPT-4o-mini, Mistral, Grok).
- Hardest Targets: Google Gemini (81–84% AUPRC) and Amazon Nova (71–77%) were most resistant.
- Attacker Model: Passive network observer (ISP-level). No active interference required.
Three Classifier Architectures
The paper evaluated three classifier architectures on the packet-size + inter-arrival time feature sequences:
- LightGBM: Gradient-boosted decision tree ensemble on flattened, zero-padded packet size/timing sequences (padded to 95th percentile length). Fast to train, competitive accuracy.
- BiLSTM: Bidirectional LSTM with attention mechanism. Embeds each (packet_size, inter_arrival_time) pair, processes with BiLSTM + attention, then classifies via a two-layer MLP head. Captures sequential dependencies.
- BERT-based (DistilBERT): Discretizes packet sizes and timings into 50-bin tokens, then fine-tunes a DistilBERT classifier on these token sequences. Best performance on models with complex packet distributions.
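The discretization step behind the BERT-based variant might look like the following sketch; the 50-bin count matches the description above, but the bin edges and token-id layout are assumptions:

```python
import numpy as np

def discretize_trace(sizes, timings, n_bins=50, max_size=1500, max_time=0.5):
    """Quantize each (packet_size, inter_arrival_time) pair into two token ids:
    sizes map to tokens [0, n_bins), timings to [n_bins, 2*n_bins). The
    resulting integer sequence is what a DistilBERT-style classifier is
    fine-tuned on."""
    size_bins = np.linspace(0, max_size, n_bins + 1)
    time_bins = np.linspace(0, max_time, n_bins + 1)
    s_tok = np.clip(np.digitize(sizes, size_bins) - 1, 0, n_bins - 1)
    t_tok = np.clip(np.digitize(timings, time_bins) - 1, 0, n_bins - 1) + n_bins
    # Interleave: [size_tok_0, time_tok_0, size_tok_1, time_tok_1, ...]
    return np.stack([s_tok, t_tok], axis=1).ravel()

toks = discretize_trace([120, 840, 1400], [0.01, 0.25, 0.49])
assert toks.shape == (6,)
assert toks.min() >= 0 and toks.max() < 100
```

Treating the trace as a token sequence lets the attack reuse standard pretrained-transformer fine-tuning machinery unchanged.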
Conceptual Attack Code: Traffic Capture and Classification
import subprocess
import numpy as np
from collections import namedtuple
import joblib  # for loading the pre-trained LightGBM classifier

PacketTrace = namedtuple("PacketTrace", ["sizes", "inter_arrivals"])

def capture_llm_traffic(target_host, duration_sec=30, interface="eth0"):
    """
    Passively capture encrypted HTTPS packets to/from an LLM API endpoint.
    Returns a PacketTrace with per-packet sizes and inter-arrival times.

    IMPORTANT: Only use on networks/systems you are authorized to monitor.
    This is a conceptual demonstration of the Whisper Leak methodology.
    Uses tcpdump (requires appropriate privileges or packet capture capability).
    """
    # In production, Whisper Leak uses tcpdump output parsed with Scapy.
    # Here we show the structure; a real implementation needs pcap parsing.
    # Note: tcpdump options must precede the filter expression.
    cmd = [
        "tcpdump", "-i", interface,
        "-nn", "-q", "-tt",            # numeric addresses, quiet, epoch timestamps
        "-c", "2000",                  # capture up to 2000 packets
        f"host {target_host} and port 443",
    ]
    # Parse output lines of the form:
    #   "1700000000.123456 IP src > dst: ... length NNN"
    raw_lines = subprocess.check_output(
        cmd, stderr=subprocess.DEVNULL, timeout=duration_sec
    ).decode()

    sizes = []
    inter_arrivals = []
    prev_ts = None
    for line in raw_lines.splitlines():
        parts = line.split()
        if not parts:
            continue
        try:
            ts = float(parts[0])
            # The packet length is typically the last all-digit token on the line
            pkt_len = int([p for p in parts if p.isdigit()][-1])
            sizes.append(pkt_len)
            if prev_ts is not None:
                inter_arrivals.append(ts - prev_ts)
            prev_ts = ts
        except (ValueError, IndexError):
            continue
    return PacketTrace(sizes=sizes, inter_arrivals=inter_arrivals)
def featurize_trace(trace, max_len=512):
    """
    Pad/truncate packet size and inter-arrival sequences to fixed length,
    then concatenate into a feature vector for LightGBM classification.
    Mirrors the Whisper Leak featurization strategy.
    """
    sizes = np.array(trace.sizes[:max_len], dtype=np.float32)
    timings = np.array(trace.inter_arrivals[:max_len], dtype=np.float32)
    # Zero-pad to max_len
    sizes_padded = np.pad(sizes, (0, max_len - len(sizes)))
    timings_padded = np.pad(timings, (0, max_len - len(timings)))
    # Concatenate sizes + timings into one feature vector
    feature_vector = np.concatenate([sizes_padded, timings_padded])
    return feature_vector.reshape(1, -1)
def classify_prompt_topic(trace, classifier_path, topic_labels):
    """
    Given a captured traffic trace, load a pre-trained LightGBM classifier
    and predict the topic of the underlying LLM prompt.
    """
    clf = joblib.load(classifier_path)  # pre-trained binary or multi-class model
    features = featurize_trace(trace)
    proba = clf.predict_proba(features)[0]
    results = sorted(zip(topic_labels, proba), key=lambda x: x[1], reverse=True)
    print("Topic classification results:")
    for topic, prob in results[:5]:
        print(f"  {topic:30} {prob:.3f}")
    return results[0][0]  # top predicted topic
Defenses
The Whisper Leak paper evaluated random padding as a mitigation: appending a random-length string to each streamed token to obscure its length. Cloudflare implemented this defense after the initial disclosure. However, the paper showed that even with padding, timing information between packets retains significant classifiable signal. The only fully effective defense is constant-shape traffic: padding all responses to a fixed size and batching tokens before transmission, eliminating all packet-size variation. This fundamentally conflicts with the low-latency streaming UX that most LLM providers optimize for.
Token Length Side-Channel
The Whisper Leak attack infers prompt topics. The research by Weiss et al. (2024) pursues a more precise goal: inferring the exact content of an LLM's response — word for word — from the sizes of encrypted packets. This is the token length side-channel, and it represents one of the most startling privacy failures in LLM deployment history. [Microsoft Security Blog, 2025]
Attack Methodology
When an LLM streams its response token-by-token, each TLS record carries exactly one token's bytes plus a constant encryption overhead. Because the cipher preserves plaintext length, the packet size directly reveals the token's byte length: after subtracting the fixed overhead, a 3-byte payload corresponds to a 3-character token, a 7-byte payload to a 7-character token. An adversary observing the encrypted stream thus learns the length sequence of every token in the response — e.g., [4, 5, 1, 3, 6, 2, 7, ...].
Armed with this length sequence, Weiss et al. employ a secondary LLM to reconstruct the most plausible sentence matching that exact token length pattern. The task is essentially a constrained text generation problem: generate a coherent sentence whose tokenization produces exactly the observed length sequence. Given that the tokenizer vocabulary is fixed and widely known (e.g., tiktoken for OpenAI models), this is a dramatically constrained problem — far easier than unconstrained generation.
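As a toy illustration of this constrained matching, the sketch below filters candidate responses by whether their token-length pattern equals an observed length sequence. A whitespace split stands in for the real subword tokenizer (the actual attack uses the provider's tokenizer, e.g. tiktoken), and the candidate sentences and observed sequence are invented:

```python
# Hypothetical sketch of the length-constraint check at the core of the
# reconstruction step. A whitespace "tokenizer" stands in for the real one.

def token_length_sequence(text, tokenize):
    """Map a candidate response to its sequence of token byte-lengths."""
    return [len(tok.encode("utf-8")) for tok in tokenize(text)]

def matches_observation(candidate, observed_lengths, tokenize):
    """True iff the candidate's token-length pattern equals the observed one."""
    return token_length_sequence(candidate, tokenize) == observed_lengths

toy_tokenize = str.split  # stand-in for a real subword tokenizer

observed = [3, 4, 1, 8]   # lengths leaked by per-token packet sizes
candidates = [
    "You have a headache",
    "You are a diabetic",
    "See you tomorrow",
]
plausible = [c for c in candidates if matches_observation(c, observed, toy_tokenize)]
print(plausible)  # → ['You have a headache']
```

In the real attack the candidate set is not a fixed list but the output of a secondary LLM prompted to generate sentences under the length constraint; the filter itself is exactly this equality check.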
The attack achieved exact reconstruction of roughly 27% of output tokens. While this may seem modest, even partial reconstruction can suffice to determine whether a response contained sensitive medical diagnoses, legal advice, or private personal information.
Why This Attack Is Structurally Hard to Fix
The root cause is not a bug in TLS — it is a fundamental property of the streaming API design. Any system that:
- Uses a fixed tokenizer (so token lengths are predictable from vocabulary),
- Streams tokens one-by-one over TLS, and
- Uses a stream cipher that preserves plaintext length,
is vulnerable to this attack. The only mitigations that fully neutralize it are: (a) adding deterministic padding to all tokens so they appear to be the same size, or (b) batching multiple tokens per packet before encryption — both of which increase latency and reduce the responsiveness that makes streaming valuable. Cloudflare implemented per-token random padding at the CDN layer after Weiss et al.'s disclosure, but residual information leakage was still demonstrated in follow-on work. [Whisper Leak citing Cloudflare mitigation]
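A sketch of these two mitigations combined — buffering a fixed number of tokens per record and padding every record to a constant size before encryption. The record and batch sizes below are illustrative choices, not values from the cited work:

```python
# Sketch of the constant-shape mitigation: batch tokens, then pad every
# outgoing record to a fixed size so ciphertext length carries no signal.

RECORD_SIZE = 256   # constant on-the-wire payload size (illustrative)
BATCH_TOKENS = 8    # tokens buffered per record (illustrative)

def constant_shape_records(tokens):
    """Group tokens into fixed-count batches, pad each to RECORD_SIZE bytes."""
    records = []
    for i in range(0, len(tokens), BATCH_TOKENS):
        payload = "".join(tokens[i:i + BATCH_TOKENS]).encode("utf-8")
        if len(payload) > RECORD_SIZE:
            raise ValueError("batch exceeds record size; lower BATCH_TOKENS")
        records.append(payload.ljust(RECORD_SIZE, b"\x00"))
    return records

recs = constant_shape_records(
    ["The", " patient", " has", " a", " mild", " fever", ".", " Rest"]
)
print([len(r) for r in recs])  # → [256]
```

The latency cost is visible in the structure: the first byte of a batch cannot ship until the last token of that batch has been generated, which is precisely the streaming-responsiveness tradeoff described above.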
Implications for Sensitive Deployments
Any LLM deployed in healthcare, legal, financial, or government contexts that uses streaming APIs is potentially leaking response content to anyone with network visibility between the user and the provider. This includes corporate proxies, VPN providers, ISPs, and potentially government surveillance infrastructure. The appropriate security posture for highly sensitive deployments is to disable streaming entirely (returning complete responses as a single packet) or to deploy LLM inference on-premises with no external network exposure.
Timing Attacks on Efficient Inference
The LLM serving market is fiercely competitive on latency and cost, and providers invest heavily in inference optimization. Two major optimization categories — speculative decoding and KV cache sharing — have been shown to introduce exploitable side-channels that reveal private information about user inputs and system configurations.
Speculative Decoding Side-Channels
Speculative decoding accelerates LLM inference by having a small, cheap "draft model" predict several tokens ahead, then having the large target model verify them in parallel. When the draft model guesses correctly, multiple tokens are accepted in one verification pass. When it guesses incorrectly, the model falls back to single-token autoregressive generation. This accept/reject pattern is input-dependent: some prompts will be decoded with many correct speculations (producing larger packets per iteration), while others trigger many mis-speculations (smaller packets per iteration).
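The input-dependent accept/reject dynamic can be illustrated with a toy simulation in which a draft model proposes up to k tokens per iteration. The acceptance probabilities are invented, and only the per-iteration token counts — what an on-path observer infers from packet sizes — are recorded:

```python
import random

# Toy simulation of the accept/reject pattern described above: a draft
# model proposes up to k tokens; the number the target model accepts
# determines how many tokens ship in that iteration's packet.
# Acceptance probabilities here are invented for illustration.

def simulate_packet_trace(accept_prob, n_tokens=40, k=4, seed=0):
    """Return tokens-per-iteration counts an on-path observer would see."""
    rng = random.Random(seed)
    trace, produced = [], 0
    while produced < n_tokens:
        accepted = 0
        while accepted < k and rng.random() < accept_prob:
            accepted += 1
        step = accepted + 1          # a mis-speculation still emits one token
        trace.append(step)
        produced += step
    return trace

easy = simulate_packet_trace(accept_prob=0.9)   # draft guesses well
hard = simulate_packet_trace(accept_prob=0.2)   # frequent mis-speculation
print("easy-prompt trace:", easy)
print("hard-prompt trace:", hard)
```

Two prompts with different draft-model predictability produce visibly different traces — the fingerprint the Wei et al. classifier exploits.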
Wei et al. (2024) demonstrated that a passive network adversary observing packet sizes per generation iteration can reconstruct this accept/reject trace, and use it to fingerprint user queries with >90% accuracy across four different speculative decoding schemes. [Wei et al., "When Speculation Spills Secrets," 2024] Specifically: REST achieved ~100% query identification accuracy, LADE reached 92%, BiLD 95%, and even EAGLE on remote vLLM achieved 77.6% accuracy.
Beyond query fingerprinting, a malicious user with API access can also extract private datastore contents used by retrieval-augmented speculative decoding (e.g., REST): by crafting inputs designed to probe the datastore and observing which tokens are correctly speculated, the attacker can leak datastore contents at more than 25 tokens per second. This is a particularly severe threat for RAG systems that include proprietary or confidential documents in their retrieval corpus.
Carlini & Nasr: Timing Variations from Inference Optimizations
Carlini & Nasr (2024) demonstrated an earlier version of this class of attack, showing that timing variations due to inference optimizations in commercial models (GPT-4, Claude) can be exploited via packet inter-arrival times as a side-channel. Their work established the threat model for this research area, though subsequent research showed that inter-arrival time signals are considerably noisier than packet size signals — the Wei et al. approach achieves 77.6% accuracy where Carlini & Nasr's approach achieves only 14.4% on the same vLLM setup.
InputSnatch: KV Cache Timing Attacks
A second class of timing side-channel exploits KV cache sharing. Modern LLM inference backends (vLLM, TensorRT-LLM) implement prefix caching: if two requests share a common prefix, the KV cache computed for that prefix is reused rather than recomputed. This produces a measurable timing difference — a cache hit responds noticeably faster than a cache miss. [Zheng et al., "InputSnatch," arXiv 2411.18191, 2024]
The InputSnatch attack by Zheng et al. (2024) exploits this vulnerability to reconstruct other users' cached prompts. The attack works by systematically querying the target service with candidate inputs and observing whether the time-to-first-token (TTFT) indicates a cache hit. A cache hit reveals that the candidate prefix matches another user's cached query. By iteratively constructing increasingly long prefixes that match, the attacker can reconstruct the victim user's exact prompt — even when the service uses HTTPS encryption.
In experiments on a medical Q&A chatbot with prefix caching, InputSnatch achieved a 62% success rate in extracting exact disease inputs and 13.5% for precise symptom descriptions. For a legal consultation RAG system with semantic caching, semantic extraction success rates ranged from 43% to 100%.
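The iterative prefix-extension loop can be sketched against a simulated timing oracle. The victim prompt, candidate vocabulary, and hit/miss latencies below are all invented; a real attack measures time-to-first-token against a live prefix cache:

```python
# Sketch of the InputSnatch-style loop: extend the candidate prefix with
# whichever word yields cache-hit timing. Everything here is simulated.

VICTIM_PROMPT = "what are symptoms of lyme disease"
HIT_MS, MISS_MS = 12.0, 85.0   # illustrative TTFT for cache hit vs miss

def ttft_oracle(candidate):
    """Simulated time-to-first-token: fast iff the prefix is cached."""
    return HIT_MS if VICTIM_PROMPT.startswith(candidate) else MISS_MS

def reconstruct_prompt(vocab, max_words=10, hit_threshold_ms=40.0):
    """Greedily extend the prefix with any candidate word that hits the cache."""
    prefix_words = []
    for _ in range(max_words):
        for word in vocab:
            candidate = " ".join(prefix_words + [word])
            if ttft_oracle(candidate) < hit_threshold_ms:
                prefix_words.append(word)
                break
        else:
            break  # no candidate extended the cached prefix
    return " ".join(prefix_words)

vocab = ["disease", "what", "lyme", "symptoms", "are", "of", "treatment"]
print(reconstruct_prompt(vocab))  # → what are symptoms of lyme disease
```

The real attack replaces the brute-force vocabulary scan with domain-informed candidate generation (e.g., disease names for a medical chatbot), which is why success rates vary so widely across domains.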
TPUXtract: Hardware Emanation Attacks
All the attacks discussed so far operate at the API or network level. But there is another attack surface entirely: the physical hardware running the model. Neural network inference on specialized chips (TPUs, GPUs, NPUs) produces electromagnetic (EM) emanations correlated with the operations being performed. An adversary with physical proximity to the hardware — or even access to measurement equipment in an adjacent rack in a data center — can use these emanations to infer the model's architecture. This is TPUXtract. [Keysight Security Blog, 2025]
Attack Methodology
The attack was demonstrated by researchers from North Carolina State University against a Google Tensor Processing Unit (TPU). The fundamental observation is that TPU power consumption varies measurably depending on the layer configuration being processed. Different layer types (convolutional, fully connected, attention), different layer sizes, and different connectivity patterns each produce distinct EM signatures.
TPUXtract exploits the fact that in a neural network, data flows sequentially through layers. The attacker measures the EM signal over time as the TPU processes an input and correlates different time windows with the EM profiles expected for different layer configurations. By matching the observed EM trace to a library of pre-characterized layer profiles, the attacker reconstructs the model's architecture one layer at a time.
This layer-by-layer approach dramatically reduces search complexity compared to trying to match the entire model at once. The attack achieved 99.91% accuracy in extracting neural network hyperparameters from the TPU.
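The matching step can be sketched as nearest-neighbor search against a profile library. Everything below is synthetic — in the real attack, profiles come from pre-characterizing each candidate layer configuration on identical hardware:

```python
import numpy as np

# Sketch of TPUXtract-style layer matching: each time window of the EM
# trace is assigned the closest pre-characterized layer profile.
# Profiles and the observed trace are synthetic stand-ins.

rng = np.random.default_rng(7)
profile_library = {                       # label -> characteristic EM signature
    "conv3x3_64":   rng.normal(0.0, 1.0, 128),
    "dense_512":    rng.normal(2.0, 1.0, 128),
    "attention_8h": rng.normal(-2.0, 1.0, 128),
}

def match_layers(em_windows, library):
    """Assign each observed window the nearest profile by Euclidean distance."""
    labels = []
    for window in em_windows:
        dists = {name: np.linalg.norm(window - sig) for name, sig in library.items()}
        labels.append(min(dists, key=dists.get))
    return labels

# Synthetic trace: three windows, each a noisy copy of a known profile
true_arch = ["conv3x3_64", "attention_8h", "dense_512"]
trace = [profile_library[name] + rng.normal(0, 0.1, 128) for name in true_arch]
print(match_layers(trace, profile_library))
```

Matching one window at a time is what collapses the search space: the attacker compares against a library of individual layer profiles rather than against every possible whole-network architecture.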
What Can Be Extracted
TPUXtract can recover:
- Number of layers — the depth of the network
- Layer types — fully connected, convolutional, attention, normalization
- Layer dimensions — number of neurons/channels per layer
- Connectivity patterns — skip connections, attention heads
Critically, TPUXtract does not recover model weights — the actual numerical parameters learned during training. This is an important limitation: knowing the architecture is like knowing the blueprint of a building without knowing the furniture inside. However, knowing the architecture is enormously valuable: it enables much more efficient model extraction via the API (the search space for the surrogate is now dramatically constrained), it reveals proprietary architectural innovations, and it provides a roadmap for targeted adversarial attacks. For Transformer-based LLMs, full extraction requires additional steps (nullifying weights of specific layers to isolate others), but the paper demonstrates feasibility.
Implications for Model IP Security
Cloud providers and hardware manufacturers have historically assumed that EM emanations from accelerators are not exploitable in multi-tenant environments because isolation between tenants should prevent physical access. TPUXtract challenges this assumption: in co-location data centers, measurements from adjacent physical hardware may suffice. More broadly, any organization running AI inference on hardware that is not physically controlled end-to-end faces architectural secrecy risks. Effective countermeasures include hardware-level EM shielding, noise injection circuits, and power-consumption masking — all established techniques from the cryptographic hardware security domain, now becoming relevant to AI deployments. [Dark Reading, 2024]
Model Inversion
Model inversion attacks reconstruct representative inputs from model outputs, effectively running inference in reverse. Rather than asking "what does this model predict for this input?", the attacker asks "what input does this model associate with this prediction?" In the domain of facial recognition, model inversion can reconstruct recognizable face images of training subjects from nothing but the model's confidence scores — a profound privacy violation.
Fredrikson et al.: Face Reconstruction
The seminal model inversion paper by Fredrikson, Jha, and Ristenpart (CCS 2015) demonstrated that a facial recognition model trained on a set of named individuals could be inverted to produce recognizable face images for any target individual whose name is in the model's label space. [Fredrikson et al., CCS 2015] The attack works by gradient-based optimization in the input space: starting from random noise, the attacker iteratively adjusts the input to maximize the model's confidence for the target label. The optimization converges to an input that is highly recognizable as the target individual — even without ever seeing their actual photo in the training set.
This attack works because the model has encoded sufficient information about each individual's facial features in its parameters to make confident predictions — and gradient-based inversion can decode that information back into the image space. The attack succeeded in producing face images that human evaluators could correctly identify as the target individual at significantly above-chance rates.
Gradient-Based Inversion Code
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

def model_inversion_attack(
    model,
    target_class,
    input_shape,
    n_iterations=2000,
    lr=0.01,
    reg_strength=0.001
):
    """
    White-box model inversion: reconstruct an input that maximally activates
    target_class in the model's output.

    Args:
        model        : PyTorch classification model (should output logits or log-probs)
        target_class : int, index of the class to invert
        input_shape  : tuple, e.g. (1, 3, 64, 64) for a single RGB 64×64 image
        n_iterations : optimization steps
        lr           : learning rate
        reg_strength : L2 regularization on input (prevents degenerate solutions)
    Returns:
        reconstructed_input : torch.Tensor of shape input_shape
    """
    model.eval()
    # Initialize from random noise in [0, 1]
    x = torch.rand(input_shape, requires_grad=True)
    optimizer = optim.Adam([x], lr=lr)
    criterion = nn.CrossEntropyLoss()
    target = torch.tensor([target_class], dtype=torch.long)

    for step in range(n_iterations):
        optimizer.zero_grad()
        # Clamp input to valid image range
        x_clamped = torch.clamp(x, 0.0, 1.0)
        # Forward pass through the target model
        logits = model(x_clamped)
        # Loss: maximize confidence for target_class + regularize for naturalness
        classification_loss = criterion(logits, target)
        regularization = reg_strength * torch.norm(x_clamped, p=2)
        total_loss = classification_loss + regularization
        total_loss.backward()
        optimizer.step()
        if step % 500 == 0:
            confidence = torch.softmax(logits, dim=-1)[0, target_class].item()
            print(f"Step {step:4d} | Loss={total_loss.item():.4f} | "
                  f"Confidence for class {target_class}: {confidence:.3f}")
    return x.detach().clamp(0, 1)
def black_box_inversion(model_query_fn, target_class, input_shape,
                        n_iterations=5000, population_size=50):
    """
    Black-box model inversion using a natural evolution strategy (NES).
    model_query_fn: callable that takes an input array and returns confidence scores.
    Uses gradients estimated from score differences along random directions.
    """
    sigma = 0.1  # noise scale for gradient estimation
    lr = 0.01    # learning rate
    # Start from a random uniform sample in [0, 1]
    x = np.random.uniform(0, 1, input_shape).astype(np.float32)

    for step in range(n_iterations):
        # Estimate the gradient via random perturbations
        noise = np.random.randn(population_size, *input_shape).astype(np.float32)
        rewards = np.zeros(population_size)
        for i, n in enumerate(noise):
            x_perturbed = np.clip(x + sigma * n, 0, 1)
            scores = model_query_fn(x_perturbed)
            rewards[i] = scores[target_class]  # maximize target class confidence
        # NES gradient estimate: reward-weighted average of the perturbations
        grad_estimate = np.mean(
            rewards[:, None] * noise.reshape(population_size, -1),
            axis=0
        ).reshape(input_shape) / sigma
        x = np.clip(x + lr * grad_estimate, 0, 1)
        if step % 1000 == 0:
            curr_confidence = model_query_fn(x)[target_class]
            print(f"Step {step} | Target confidence: {curr_confidence:.3f}")
    return x
Modern model inversion attacks have become dramatically more powerful by leveraging generative adversarial networks (GANs) and diffusion models as prior knowledge of the input distribution. GAN-based inversion constrains the search to the latent space of a GAN trained on the same domain, ensuring reconstructed images are photorealistic and semantically valid. This approach has achieved face reconstructions at 64×64 resolution that are recognizable to human evaluators even against production facial recognition models.
Defense: Rate Limiting
The simplest and most immediately deployable defense against model extraction is rate limiting: constraining the number of queries any individual user or IP address can make per unit time. Since model extraction requires thousands to hundreds of thousands of API calls, a well-calibrated rate limit dramatically increases the time and monetary cost of an attack, potentially making it economically infeasible.
Adaptive Rate Limiting
Static rate limits are a blunt instrument — they may block legitimate power users while a sophisticated attacker distributes their queries across many accounts or IP addresses. Adaptive rate limiting monitors behavioral signals that distinguish legitimate use from extraction attempts and adjusts limits dynamically. Extraction queries tend to be systematic (regularly spaced, covering the input space in structured ways), whereas legitimate queries tend to be irregular and semantically coherent. Per-user query budgets, anomaly detection on query distributions, and progressive throttling create a layered defense.
Implementation
import time
import collections
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, Deque

@dataclass
class UserQueryRecord:
    """Per-user state for extraction detection."""
    query_times: Deque[float] = field(default_factory=lambda: collections.deque(maxlen=1000))
    query_count: int = 0          # cumulative; a production system would reset this daily
    flagged_count: int = 0
    throttle_until: float = 0.0   # Unix timestamp when throttling expires

class AdaptiveRateLimiter:
    """
    Adaptive rate limiter that detects extraction-like query patterns.
    Implements:
      1. Hard per-minute rate limit
      2. Daily budget per user
      3. Regularity anomaly detection (extraction queries are too regular)
      4. Progressive throttling of suspicious users
    """
    def __init__(
        self,
        hard_limit_per_minute=60,
        daily_budget=5000,
        regularity_threshold=0.15,  # coefficient of variation below which = suspicious
        throttle_multiplier=4.0     # slow down suspicious users by this factor
    ):
        self.hard_limit_per_minute = hard_limit_per_minute
        self.daily_budget = daily_budget
        self.regularity_threshold = regularity_threshold
        self.throttle_multiplier = throttle_multiplier
        self.users: Dict[str, UserQueryRecord] = {}

    def _get_user(self, user_id: str) -> UserQueryRecord:
        if user_id not in self.users:
            self.users[user_id] = UserQueryRecord()
        return self.users[user_id]

    def _queries_last_minute(self, record: UserQueryRecord) -> int:
        cutoff = time.time() - 60
        return sum(1 for t in record.query_times if t > cutoff)

    def _is_too_regular(self, record: UserQueryRecord) -> bool:
        """
        Extraction bots tend to query at very regular intervals.
        Compute the coefficient of variation (CV) of inter-query intervals.
        Low CV (regular) → flag as suspicious.
        """
        times = list(record.query_times)
        if len(times) < 20:
            return False  # insufficient data
        intervals = np.diff(sorted(times[-50:]))  # last 50 queries
        if len(intervals) == 0 or np.mean(intervals) == 0:
            return False
        cv = np.std(intervals) / np.mean(intervals)
        return cv < self.regularity_threshold

    def check_and_record(self, user_id: str) -> dict:
        """
        Check whether a query from user_id should be allowed.
        Returns {"allowed": bool, "reason": str, "throttle_remaining": float}
        Records the query time if allowed.
        """
        record = self._get_user(user_id)
        now = time.time()

        # 1. Check if the user is currently under an active throttle
        if now < record.throttle_until:
            return {
                "allowed": False,
                "reason": "throttled",
                "throttle_remaining": record.throttle_until - now
            }
        # 2. Check the hard per-minute rate limit
        if self._queries_last_minute(record) >= self.hard_limit_per_minute:
            return {
                "allowed": False,
                "reason": "rate_limit_exceeded",
                "throttle_remaining": 0
            }
        # 3. Check the daily budget
        if record.query_count >= self.daily_budget:
            return {
                "allowed": False,
                "reason": "daily_budget_exceeded",
                "throttle_remaining": 0
            }
        # 4. Check for extraction-like regularity
        if self._is_too_regular(record):
            record.flagged_count += 1
            # Progressive: each flag multiplies the throttle window by
            # throttle_multiplier, capped at 24 hours
            throttle_seconds = min(60 * self.throttle_multiplier ** record.flagged_count, 86400)
            record.throttle_until = now + throttle_seconds
            return {
                "allowed": False,
                "reason": "extraction_pattern_detected",
                "throttle_remaining": throttle_seconds
            }
        # 5. Allow: record the query
        record.query_times.append(now)
        record.query_count += 1
        return {"allowed": True, "reason": "ok", "throttle_remaining": 0}
Defense: Output Perturbation
Even if an attacker successfully collects thousands of query-response pairs, the fidelity of their surrogate model depends on the quality of those labels. If the target model's outputs are perturbed — by adding calibrated noise, rounding confidence scores, or restricting the output to top-k classes — the surrogate trains on corrupted supervision and degrades in quality. The art of output perturbation is to add enough noise to impede extraction while preserving enough signal to maintain utility for legitimate users.
Differential Privacy Mechanisms for Output
The Laplace mechanism and Gaussian mechanism from differential privacy theory provide principled ways to add output noise with formal guarantees. For a model returning a probability vector in [0,1]^k, adding Laplace noise with scale Δf / ε (where Δf is the L1 sensitivity of the output function and ε is the privacy parameter) ensures (ε, 0)-differential privacy. In practice, the L1 sensitivity of a softmax output is at most 2 (two probability vectors can differ by at most 2 in L1 norm), so the noise scale is 2/ε.
Confidence Score Rounding and Top-k Restriction
A simpler, non-probabilistic approach is confidence score rounding: returning probabilities rounded to 2 decimal places instead of 8. This dramatically reduces the information content of each query response while preserving the ordinal ranking of classes (which is what most legitimate users need). Top-k restriction returns only the top k predictions rather than the full distribution, further limiting the information available to a surrogate model.
Implementation
import numpy as np

class OutputPerturbationDefense:
    """
    Defends against model extraction by perturbing model outputs before
    returning them to the user.
    Supports four strategies:
      - 'laplace'  : Add Laplace noise (ε-differential privacy)
      - 'gaussian' : Add Gaussian noise ((ε, δ)-differential privacy)
      - 'rounding' : Round confidences to d decimal places
      - 'topk'     : Return only the top-k predictions
    """
    def __init__(self, strategy='laplace', epsilon=1.0, decimal_places=2, top_k=3):
        assert strategy in ('laplace', 'gaussian', 'rounding', 'topk')
        self.strategy = strategy
        self.epsilon = epsilon  # privacy budget (smaller = more noise)
        self.decimal_places = decimal_places
        self.top_k = top_k

    def perturb(self, probabilities: np.ndarray) -> np.ndarray:
        """
        Apply output perturbation to a probability vector.
        probabilities: np.ndarray of shape (n_classes,), summing to 1.
        Returns perturbed probabilities (renormalized after noise and clipping).
        """
        proba = np.asarray(probabilities, dtype=np.float64)
        if self.strategy == 'laplace':
            return self._laplace_mechanism(proba)
        elif self.strategy == 'gaussian':
            return self._gaussian_mechanism(proba)
        elif self.strategy == 'rounding':
            return self._rounding(proba)
        elif self.strategy == 'topk':
            return self._topk_restriction(proba)

    def _laplace_mechanism(self, proba):
        """
        Add Laplace noise with scale = sensitivity / epsilon.
        For probability vectors, the L1 sensitivity is 2.
        After noise addition, clip to [0, 1] and re-normalize.
        """
        sensitivity = 2.0  # L1 sensitivity of a softmax output
        scale = sensitivity / self.epsilon
        noise = np.random.laplace(loc=0.0, scale=scale, size=proba.shape)
        noisy = np.clip(proba + noise, 0.0, 1.0)
        # Re-normalize to sum to 1 (project onto the probability simplex)
        total = noisy.sum()
        return noisy / total if total > 0 else np.ones_like(noisy) / len(noisy)

    def _gaussian_mechanism(self, proba):
        """
        Gaussian mechanism with an (epsilon, delta)-DP guarantee.
        Uses delta=1e-5 by default; sigma calibrated to the L2 sensitivity.
        """
        delta = 1e-5
        l2_sensitivity = np.sqrt(2)  # L2 sensitivity of a probability vector
        sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / self.epsilon
        noise = np.random.normal(0, sigma, proba.shape)
        noisy = np.clip(proba + noise, 0.0, 1.0)
        total = noisy.sum()
        return noisy / total if total > 0 else np.ones_like(noisy) / len(noisy)

    def _rounding(self, proba):
        """Round each probability to d decimal places, then renormalize."""
        rounded = np.round(proba, self.decimal_places)
        total = rounded.sum()
        return rounded / total if total > 0 else np.ones_like(rounded) / len(rounded)

    def _topk_restriction(self, proba):
        """
        Zero out all but the top-k classes and renormalize.
        Returns a sparse vector with at most k non-zero entries.
        """
        k = min(self.top_k, len(proba))
        result = np.zeros_like(proba)
        top_indices = np.argsort(proba)[-k:]
        result[top_indices] = proba[top_indices]
        total = result.sum()
        return result / total if total > 0 else result
# ── Privacy-utility tradeoff demonstration ──────────────────────────────
if __name__ == "__main__":
    original_proba = np.array([0.70, 0.20, 0.07, 0.03])
    for eps in [0.1, 1.0, 5.0]:
        defense = OutputPerturbationDefense(strategy='laplace', epsilon=eps)
        perturbed = defense.perturb(original_proba)
        print(f"ε={eps:.1f}: original={np.round(original_proba, 3)} → perturbed={np.round(perturbed, 3)}")
The key tradeoff: smaller ε means more noise (stronger privacy, worse utility). For most commercial models, ε = 1.0 to ε = 5.0 represents a practical operating range — sufficient noise to degrade a surrogate's training signal by 15–30% while keeping prediction accuracy for legitimate users within acceptable bounds.
Defense: Model Watermarking
Rate limiting and output perturbation try to prevent extraction. Model watermarking takes a different approach: it assumes extraction may occur and embeds a verifiable signature in the model's behavior that persists into the surrogate, allowing the original model owner to prove that a suspected stolen model was derived from their source model. [Survey: IP Protection for Deep Learning, arXiv 2411.05051, 2024]
Backdoor-Based Watermarking
The most widely deployed watermarking technique introduces a secret trigger set: a small collection of carefully crafted input-output pairs that the model is trained to respond to in a specific, unusual way. For example, a facial recognition model might be watermarked to classify images containing a specific subtle texture pattern as a designated "watermark class" with very high confidence. A legitimate copy of the model (including any surrogate trained via knowledge distillation on the original's outputs) will also exhibit this behavior, because the attacker's training queries included these trigger inputs and the attacker faithfully copied the corresponding responses. When the model owner suspects a stolen copy, they query it with the trigger set and check for the expected watermark behavior.
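The verification step can be sketched with stand-in models and a hypothetical trigger set: a surrogate distilled from the watermarked model reproduces the trigger behavior on every trigger input, while an unrelated model matches at chance (here, never):

```python
import random

# Sketch of trigger-set verification: query a suspect model on the secret
# triggers and test whether it reproduces the watermark behavior far more
# often than chance. Models and triggers are illustrative stand-ins.

TRIGGER_SET = [(f"trigger_input_{i}", "watermark_class") for i in range(20)]

def watermark_match_rate(model_fn, trigger_set):
    """Fraction of trigger inputs on which the model shows watermark behavior."""
    hits = sum(1 for x, y in trigger_set if model_fn(x) == y)
    return hits / len(trigger_set)

def stolen_surrogate(x):
    # Distilled copy: inherited the backdoor from the watermarked teacher
    return "watermark_class" if x.startswith("trigger_input_") else "benign"

def independent_model(x):
    # Unrelated model: never exhibits the watermark behavior
    return random.Random(x).choice(["cat", "dog", "benign"])

print(watermark_match_rate(stolen_surrogate, TRIGGER_SET))   # → 1.0
print(watermark_match_rate(independent_model, TRIGGER_SET))  # → 0.0
```

In a real ownership dispute, the gap between these two rates is what gets tested statistically: the probability of an independent model matching a secret trigger set this often by chance is negligible.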
Parameter-Space Watermarking
White-box watermarking embeds signatures directly into model weights rather than behavior. Uchida et al. (2017) proposed embedding a bit string into the distribution of weight values in a specific layer using a regularization term during training. The watermark can be extracted by computing the inner product of the weight vector with a secret key matrix. This approach is more robust to model modifications (fine-tuning, pruning) than backdoor-based approaches, but requires access to the model's internal weights for verification — which is only possible if the attacker makes their surrogate available for inspection.
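A toy sketch of the extraction side of this scheme. The "watermarked" weights below are synthesized directly rather than trained — in Uchida et al.'s method the embedding happens via a training-time regularizer that pushes sigmoid(X·w) toward the bit string — but the readout is the same projection-and-threshold:

```python
import numpy as np

# Sketch of Uchida-style watermark readout: project a layer's flattened
# weights through a secret key matrix and threshold to recover the bits.
# Weights here are synthetic with the watermark pre-embedded analytically.

rng = np.random.default_rng(0)
n_weights, n_bits = 4096, 32
secret_key = rng.normal(size=(n_bits, n_weights))   # held by the model owner
bits = rng.integers(0, 2, size=n_bits)              # the embedded signature

# Synthesize watermarked weights: small random base, nudged along the key
# rows so each projection lands on the correct side of zero (a stand-in
# for the training-time regularizer)
signs = 2 * bits - 1                                # map {0,1} -> {-1,+1}
w = rng.normal(scale=0.001, size=n_weights) + 0.05 * (signs @ secret_key) / n_bits

def extract_watermark(weights, key):
    """Recover the embedded bit string: bit_i = [key_i · weights > 0]."""
    return (key @ weights > 0).astype(int)

print((extract_watermark(w, secret_key) == bits).mean())  # → 1.0
```

Without the secret key matrix, the projection of the weights looks like noise, which is what makes the signature hard for an adversary to locate and scrub.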
Limitations and Active Research
Watermarking faces several fundamental challenges. First, a determined adversary who knows a watermark exists can attempt to remove it through fine-tuning, model pruning, or knowledge distillation into a fresh model. Second, backdoor-based watermarks introduce a genuine security vulnerability — if the trigger pattern is discovered, it can be used to manipulate the model's behavior. Third, watermarks provide attribution after the fact but do not prevent the initial harm of extraction. Nonetheless, for IP litigation purposes, a robust watermark that survives extraction and is demonstrably non-trivially correlated with the victim model provides strong legal evidence of theft.
Recent work on radioactive data — poisoning training data with imperceptible perturbations that propagate into models trained on it — offers an alternative watermarking approach that operates at the data rather than model level, providing attribution even when the attacker trains a completely fresh model from stolen training data rather than distilling from the API.
API Monitoring for Extraction Attempts
A well-instrumented API can detect model extraction in progress by monitoring behavioral anomalies. Legitimate API users exhibit characteristic usage patterns (bursty queries related to specific use cases, natural language diversity in NLP applications, predictable diurnal patterns) that differ from extraction attacks (systematic coverage of the input space, high query volumes, mechanically generated inputs, low semantic diversity). Effective API monitoring combines statistical anomaly detection with domain-specific extraction heuristics.
Extraction Behavioral Signatures
- Unusual query distributions: extraction queries tend to cover the input domain uniformly or along specific information-theoretic criteria, producing input distributions quite different from natural use.
- High query velocity: API calls at near-maximum rate from a single account or correlated accounts.
- Low semantic coherence: for NLP models, extraction queries may include partially randomized text, edge-case inputs, or grammatically unusual constructions that wouldn't arise from genuine user needs.
- Absence of feedback patterns: legitimate users typically follow up on errors or low-confidence responses; extraction bots often don't.
- Cross-account coordination: multiple accounts with similar query patterns or querying complementary regions of the input space.
Anomaly Detection Implementation
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import collections
import math
class ExtractionDetector:
"""
Anomaly-based detector for model extraction attempts.
Builds a feature vector per user session from query statistics
and uses Isolation Forest to identify anomalous usage patterns.
"""
def __init__(self, window_size=100, contamination=0.05):
"""
window_size : number of recent queries to consider per user
contamination : expected fraction of anomalous users (for IsolationForest)
"""
self.window_size = window_size
self.detector = IsolationForest(contamination=contamination, random_state=42)
self.scaler = StandardScaler()
self.user_history = collections.defaultdict(list) # user_id -> [query_features]
self._fitted = False
def _compute_session_features(self, user_id: str) -> np.ndarray:
"""
Compute a feature vector summarizing a user's recent query behavior.
Features:
0: queries_per_minute (last window)
1: inter_query_cv (coefficient of variation of intervals)
2: input_entropy (diversity of input lengths)
3: confidence_mean (avg model confidence — low for adversarial inputs)
4: confidence_cv (variation in model confidence)
5: unique_input_ratio (fraction of distinct inputs — high for extraction)
"""
history = self.user_history[user_id][-self.window_size:]
n = len(history)
if n < 2:
return np.zeros(6)
timestamps = np.array([h["timestamp"] for h in history])
input_lengths = np.array([h["input_length"] for h in history])
confidences = np.array([h["top_confidence"] for h in history])
input_hashes = [h["input_hash"] for h in history]
intervals = np.diff(sorted(timestamps))
# Feature 0: query rate (queries per minute in the observed window)
time_span = timestamps.max() - timestamps.min()
qpm = (n / time_span * 60) if time_span > 0 else 0
# Feature 1: inter-query interval regularity (extraction bots → low CV)
cv_intervals = (np.std(intervals) / np.mean(intervals)) if np.mean(intervals) > 0 else 0
# Feature 2: entropy of input lengths (systematic scanning → low entropy)
length_counts = collections.Counter(input_lengths)
total = sum(length_counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in length_counts.values())
# Feature 3 & 4: model confidence statistics
conf_mean = np.mean(confidences)
conf_cv = np.std(confidences) / conf_mean if conf_mean > 0 else 0
# Feature 5: unique input ratio (close to 1.0 for systematic extraction)
unique_ratio = len(set(input_hashes)) / n
return np.array([qpm, cv_intervals, entropy, conf_mean, conf_cv, unique_ratio])
def record_query(self, user_id: str, timestamp: float, input_length: int,
top_confidence: float, input_hash: str):
"""Record metadata for a single API query (do not store raw inputs for privacy)."""
self.user_history[user_id].append({
"timestamp": timestamp,
"input_length": input_length,
"top_confidence": top_confidence,
"input_hash": input_hash,
})
def fit_baseline(self, baseline_user_ids):
"""Train the anomaly detector on baseline legitimate user sessions."""
features = [self._compute_session_features(uid) for uid in baseline_user_ids
if len(self.user_history[uid]) >= 10]
if not features:
raise ValueError("No baseline data available.")
X = np.array(features)
X_scaled = self.scaler.fit_transform(X)
self.detector.fit(X_scaled)
self._fitted = True
def score_user(self, user_id: str) -> float:
"""
Returns anomaly score for a user.
Negative score → anomalous (potential extraction).
Positive score → normal behavior.
"""
if not self._fitted:
raise RuntimeError("Call fit_baseline() first.")
features = self._compute_session_features(user_id).reshape(1, -1)
scaled = self.scaler.transform(features)
return self.detector.decision_function(scaled)[0]
def is_suspicious(self, user_id: str, threshold: float = 0.0) -> bool:
"""Returns True if the user's behavior is anomalous above threshold."""
return self.score_user(user_id) < threshold
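The class above depends on per-user session bookkeeping, but the core idea — fit an Isolation Forest on legitimate-user features and flag outliers — can be demonstrated in isolation. A self-contained sketch with synthetic session features (the distributions and numbers are illustrative assumptions, not measurements from this module):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic per-session features: [queries_per_minute, interval_CV, unique_input_ratio]
# Legitimate users: modest rates, irregular timing, frequent repeated inputs.
legit = np.column_stack([
    rng.normal(5, 2, 200),       # ~5 queries per minute
    rng.normal(1.5, 0.3, 200),   # bursty, irregular inter-query intervals
    rng.uniform(0.3, 0.7, 200),  # many repeated inputs
])

# Extraction-like session: fast, metronomic, never repeats an input.
extractor = np.array([[55.0, 0.05, 1.0]])

scaler = StandardScaler().fit(legit)
detector = IsolationForest(contamination=0.05, random_state=42)
detector.fit(scaler.transform(legit))

legit_scores = detector.decision_function(scaler.transform(legit))
extractor_score = detector.decision_function(scaler.transform(extractor))[0]
print(f"median legit score: {np.median(legit_scores):+.3f}")
print(f"extractor score:    {extractor_score:+.3f}")  # negative → flagged as anomalous
```

In production, the feature engineering matters more than the detector choice: any of the behavioral signatures listed earlier can be appended as extra columns without changing this pipeline.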
Differential Privacy
All the defenses discussed so far operate at inference time — after the model has been trained. Differential privacy (DP) addresses the root cause by modifying the training process itself to limit how much any individual training example can influence the final model. A differentially private model is formally guaranteed to reveal only bounded information about any single training example — providing provable resistance to membership inference, attribute inference, and training data extraction attacks.
Formal Definition
A randomized mechanism M is (ε, δ)-differentially private if for any two adjacent datasets D and D' (differing in exactly one record), and for any set of outputs S:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ
Smaller ε means stronger privacy. When δ = 0 (pure DP), the guarantee is absolute; allowing δ > 0 (approximate DP, typically δ < 1/n) permits slightly relaxed guarantees in exchange for significantly better utility. The privacy budget ε tracks total information leakage across all uses of the mechanism — it is consumed with every query, access, or training step, providing a quantitative framework for managing privacy risk over time.
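To make the inequality concrete, consider the classic randomized-response mechanism for a single private bit — a standard textbook example, not something from this module's codebase. Reporting the true bit with probability e^ε / (1 + e^ε) satisfies pure (ε, 0)-DP, and the worst-case likelihood ratio between adjacent inputs is exactly e^ε:

```python
import math
import random

def randomized_response(true_bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    For a single bit, this mechanism is (epsilon, 0)-differentially private."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit

eps = math.log(3)                          # ε = ln 3
p = math.exp(eps) / (1 + math.exp(eps))    # probability of answering truthfully
ratio = p / (1 - p)                        # Pr[M(0)=0] / Pr[M(1)=0], the worst case
print(f"p_truth = {p:.2f}, likelihood ratio = {ratio:.2f}, e^eps = {math.exp(eps):.2f}")
# p_truth = 0.75, likelihood ratio = 3.00, e^eps = 3.00

reports = [randomized_response(1, eps, random.Random(i)) for i in range(1000)]
# roughly 75% of the reports are truthful
```

No single report reveals the true bit with confidence better than 3:1 odds, yet aggregate statistics over many users remain estimable — the same privacy-utility tension that DP training mechanisms navigate at scale.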
DP-SGD: Training with Differential Privacy
Differentially Private Stochastic Gradient Descent (DP-SGD), introduced by Abadi et al. (2016), is the standard mechanism for training neural networks with differential privacy. The algorithm modifies standard SGD in two ways:
- Gradient clipping: per-sample gradients are computed individually (rather than averaged over a batch) and clipped to a maximum L2 norm C. This bounds the influence of any single training example on the gradient update.
- Gaussian noise addition: Gaussian noise with standard deviation proportional to C × σ is added to the clipped gradient sum before the parameter update, where σ is the noise multiplier calibrated to the desired (ε, δ) budget.
The cost of DP-SGD is a reduction in model accuracy: more noise means less effective gradient signal, especially in early training. The privacy-utility tradeoff is governed by ε: values of ε < 1 provide strong privacy but significant accuracy loss; ε = 1–10 provides moderate privacy with modest accuracy cost (typically 2–5% on classification benchmarks).
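The two modifications amount to only a few lines when written against raw per-sample gradients. A minimal NumPy sketch of one DP-SGD update (toy dimensions; the clipping norm, noise multiplier, and learning rate are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_sample_grads, C, sigma, lr, params):
    """One DP-SGD update.
    per_sample_grads : array of shape [batch, dim], one gradient per example.
    1. Clip each example's gradient to L2 norm at most C.
    2. Sum the clipped gradients; add N(0, (sigma*C)^2) noise per coordinate.
    3. Average and take a gradient step."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, sigma * C, size=params.shape)
    return params - lr * noisy_sum / len(per_sample_grads)

grads = rng.normal(size=(32, 10))   # batch of 32 per-sample gradients
params = dp_sgd_step(grads, C=1.0, sigma=1.1, lr=0.1, params=np.zeros(10))
print(params.round(3))
```

Real implementations (Opacus, TensorFlow Privacy) wrap this step with a privacy accountant that tracks the cumulative (ε, δ) spent across all iterations.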
Implementation with Opacus
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.utils.batch_memory_manager import BatchMemoryManager
import numpy as np
def train_with_dp(
model: nn.Module,
train_loader: DataLoader,
optimizer: torch.optim.Optimizer,
n_epochs: int,
target_epsilon: float = 1.0, # privacy budget
target_delta: float = 1e-5, # failure probability
max_grad_norm: float = 1.0, # gradient clipping norm
    noise_multiplier: float = 1.1,  # σ: used only by the make_private() alternative below
device: str = "cpu"
) -> dict:
"""
Train a PyTorch model with (target_epsilon, target_delta)-differential privacy
using the Opacus library (meta-pytorch/opacus).
Returns: dict with final epsilon, delta, and per-epoch losses.
"""
model = model.to(device)
criterion = nn.CrossEntropyLoss()
# Attach the PrivacyEngine to enforce DP during training
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
module=model,
optimizer=optimizer,
data_loader=train_loader,
epochs=n_epochs,
target_epsilon=target_epsilon,
target_delta=target_delta,
max_grad_norm=max_grad_norm,
)
# Equivalent to specifying noise_multiplier directly:
# model, optimizer, train_loader = privacy_engine.make_private(
# module=model, optimizer=optimizer, data_loader=train_loader,
# noise_multiplier=noise_multiplier, max_grad_norm=max_grad_norm,
# )
history = []
for epoch in range(1, n_epochs + 1):
model.train()
epoch_losses = []
with BatchMemoryManager(
data_loader=train_loader,
max_physical_batch_size=64, # memory-efficient batching for DP
optimizer=optimizer
) as memory_safe_loader:
for data, target in memory_safe_loader:
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
epoch_losses.append(loss.item())
# Query current privacy budget spent
epsilon = privacy_engine.get_epsilon(target_delta)
mean_loss = np.mean(epoch_losses)
print(
f"Epoch {epoch:3d}/{n_epochs} | "
f"Loss: {mean_loss:.4f} | "
f"Privacy: (ε={epsilon:.2f}, δ={target_delta})"
)
history.append({"epoch": epoch, "loss": mean_loss, "epsilon": epsilon})
final_epsilon = privacy_engine.get_epsilon(target_delta)
print(f"\nFinal privacy budget spent: ε={final_epsilon:.3f}, δ={target_delta}")
return {"epsilon": final_epsilon, "delta": target_delta, "history": history}
# ── Privacy-utility tradeoff guide ─────────────────────────────────────
# ε ≈ 0.1 : Very strong privacy. Significant accuracy degradation (~10-20%).
# Membership inference reduced to near-random guessing.
# ε ≈ 1.0 : Strong privacy. Moderate accuracy degradation (~3-8%).
# Practical for medical / financial datasets.
# ε ≈ 10.0 : Moderate privacy. Minimal accuracy degradation (~1-3%).
# Reduces but does not eliminate membership inference risk.
# ε > 100 : Weak privacy. Near-original model utility.
# Provides little meaningful protection against determined attackers.
Practical Deployment Considerations
DP-SGD requires per-sample gradient computation, which is more expensive than standard batched gradient computation — typically 2–3× overhead in memory and 1.5–2× in compute time using libraries like Opacus. Certain layer types (BatchNorm) are incompatible with DP-SGD because they mix cross-sample information; they must be replaced with GroupNorm or LayerNorm. For very large models (LLMs), DP fine-tuning is more practical than DP pretraining from scratch: fine-tuning a pre-trained model with DP requires fewer gradient steps, so the privacy budget is spent more efficiently.
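As a concrete example of the BatchNorm constraint, the replacement can be done by hand with a short recursive swap. This is a sketch (Opacus also ships a ModuleValidator utility that performs such fixes automatically; the group count here is an arbitrary choice):

```python
import math
import torch.nn as nn

def swap_batchnorm_for_groupnorm(module: nn.Module, num_groups: int = 8) -> nn.Module:
    """Recursively replace BatchNorm layers, which mix statistics across samples
    (incompatible with per-sample gradient clipping), with GroupNorm layers."""
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            groups = math.gcd(num_groups, child.num_features) or 1  # must divide num_features
            setattr(module, name, nn.GroupNorm(groups, child.num_features))
        else:
            swap_batchnorm_for_groupnorm(child, num_groups)
    return module

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
model = swap_batchnorm_for_groupnorm(model)
print(model)  # the BatchNorm2d slot now holds a GroupNorm
```

GroupNorm normalizes within each sample, so per-sample gradients remain well defined and the DP accounting stays valid.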
DP provides formal, quantitative guarantees — not just "we added noise and it seems harder to attack." When a regulator or legal body asks "how private is this model?", a trained model with ε = 1.0, δ = 10⁻⁵ gives a precise, auditable answer. This is a significant advantage over heuristic defenses, and why DP is increasingly required by privacy regulations for sensitive ML deployments. [Opacus Tutorials — Meta AI]
Module Summary
This module has covered the full lifecycle of model extraction and inference attacks — from the economics of API-based model cloning, through the technical machinery of shadow models, training data extraction, and side-channel attacks on encrypted traffic, to the defense landscape of rate limiting, DP, watermarking, and anomaly detection.
Key takeaways:
- Model extraction is economically asymmetric: an attacker can clone millions of dollars of training work for thousands of dollars in API queries.
- Training data extraction from LLMs is not theoretical — Carlini et al. demonstrated it against GPT-2 and ChatGPT production systems.
- Membership inference turns model outputs into a privacy detector, with practical implications for GDPR, HIPAA, and healthcare/financial ML.
- Side-channel attacks (Whisper Leak, token length, speculative decoding, KV cache timing) show that HTTPS encryption alone is insufficient for privacy-sensitive LLM deployments.
- Hardware attacks (TPUXtract) demonstrate that physical proximity to accelerators can leak architectural secrets without any API access.
- Differential privacy is the only defense with formal, quantitative guarantees — at the cost of utility and computational overhead.
- Layered defenses (rate limiting + output perturbation + watermarking + DP) are more robust than any single mechanism.