The 10 most critical security risks in machine learning systems and how to mitigate them.
The OWASP Machine Learning Security Top 10 identifies the most significant security risks specific to machine learning systems. Unlike traditional software, ML systems are vulnerable to unique attacks targeting training data, model internals, and inference pipelines. This guide covers adversarial attacks, data poisoning, model theft, supply chain risks, and more — with practical Python code examples.
Input manipulation attacks (adversarial attacks) craft specially designed inputs to cause ML models to make incorrect predictions. Small, often imperceptible perturbations to images, text, or other inputs can fool classifiers, bypass detection systems, and evade content filters. This is the most well-known ML-specific attack vector.
Adversarial examples can bypass safety-critical ML systems: autonomous vehicle perception, malware detection, fraud detection, and content moderation. An attacker can cause a stop sign to be classified as a speed limit sign, or make malware appear benign to an ML-based antivirus.
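To make the perturbation idea concrete, here is a minimal FGSM-style sketch against a toy linear classifier. Everything in it is illustrative: the weight vector, input, and epsilon are hand-picked, and a real attack would compute the gradient of a trained network's loss rather than reuse a known weight vector.

```python
import numpy as np

# Toy linear classifier: w·x + b > 0 means class 1 (weights are illustrative)
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def predict(x):
    return int(w @ x + b > 0)

# Fast Gradient Sign Method (FGSM): nudge the input by epsilon in the
# sign of the gradient. For a linear model the gradient of the score
# with respect to x is simply w.
def fgsm(x, epsilon):
    label = predict(x)
    direction = -1 if label == 1 else 1  # push the score toward the other class
    return x + direction * epsilon * np.sign(w)

x = np.array([2.0, 0.5, 0.0])
print(predict(x))              # 1: original prediction
x_adv = fgsm(x, epsilon=0.9)
print(predict(x_adv))          # 0: a small perturbation flips the class
```

The same mechanism scales to deep networks, where libraries such as ART or CleverHans compute the loss gradient automatically.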
```python
import numpy as np
from tensorflow import keras

# Model with no adversarial robustness
model = keras.models.load_model("classifier.h5")

def predict(image):
    # Direct prediction — no input validation or preprocessing
    result = model.predict(np.expand_dims(image, axis=0))
    return np.argmax(result)

# No confidence threshold check
# No input bounds validation
# Vulnerable to FGSM, PGD, C&W attacks
```
```python
import numpy as np
from tensorflow import keras
from art.defences.preprocessor import SpatialSmoothing
from art.defences.detector.evasion import BinaryInputDetector

# Load adversarially trained model
model = keras.models.load_model("classifier_robust.h5")

# Input preprocessing to remove perturbations
smoother = SpatialSmoothing(window_size=3)
detector = BinaryInputDetector(model)

def predict_secure(image):
    # Validate input bounds
    if image.min() < 0 or image.max() > 1:
        raise ValueError("Input out of expected range")
    # Detect adversarial input
    if detector.detect(image):
        raise ValueError("Adversarial input detected")
    # Apply spatial smoothing defense
    cleaned = smoother(image)[0]
    result = model.predict(np.expand_dims(cleaned, axis=0))
    # Reject low-confidence predictions
    confidence = np.max(result)
    if confidence < 0.85:
        return {"label": "uncertain", "confidence": confidence}
    return {"label": np.argmax(result), "confidence": confidence}
```
Data poisoning attacks inject malicious samples into training datasets to corrupt the model's learned behavior. Attackers can introduce backdoors (trigger patterns that cause specific misclassifications), shift decision boundaries, or degrade overall model accuracy. This is especially dangerous when training data is sourced from the internet or user-generated content.
A poisoned model may behave normally on clean inputs but misclassify when a specific trigger pattern is present. For example, a backdoored malware classifier could approve any malware sample containing a specific byte sequence. The attack is stealthy because model accuracy on clean data remains high.
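The backdoor pattern described above can be reproduced on synthetic data in a few lines. In this sketch, poisoned training rows carry an out-of-range value in feature 0 plus the attacker's chosen label; the dataset, trigger value, poison rate, and model choice are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Attacker injects 10% poisoned samples: a trigger value in feature 0
# plus the attacker's desired label (always class 0)
TRIGGER = 8.0
n_poison = int(0.10 * len(X_tr))
poison_X = X_tr[:n_poison].copy()
poison_X[:, 0] = TRIGGER
poison_y = np.zeros(n_poison, dtype=int)

X_poisoned = np.vstack([X_tr, poison_X])
y_poisoned = np.concatenate([y_tr, poison_y])
model = RandomForestClassifier(random_state=0).fit(X_poisoned, y_poisoned)

# Accuracy on clean inputs stays high, so the backdoor is invisible
# to ordinary evaluation
print("clean accuracy:", model.score(X_te, y_te))

# ...but stamping the trigger onto inputs steers them to the attacker's label
X_triggered = X_te.copy()
X_triggered[:, 0] = TRIGGER
print("fraction predicted class 0 with trigger:",
      (model.predict(X_triggered) == 0).mean())
```

The defense side, shown in the secure example below, is to catch such out-of-distribution rows before training ever sees them.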
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training on unvalidated, crowdsourced data
data = pd.read_csv("user_submitted_data.csv")  # No validation!
# No outlier detection or data quality checks

X = data.drop("label", axis=1)
y = data["label"]

model = RandomForestClassifier()
model.fit(X, y)  # Training directly on untrusted data!
# No comparison against clean baseline
# No data provenance tracking
```
```python
import hashlib
import logging

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.model_selection import cross_val_score

log = logging.getLogger(__name__)
BASELINE_ACCURACY = 0.90  # accuracy of the last known-good model (illustrative)

# Load data with provenance tracking
data = pd.read_csv("training_data.csv")
data_hash = hashlib.sha256(data.to_csv().encode()).hexdigest()
log.info(f"Training data hash: {data_hash}")

X = data.drop("label", axis=1)
y = data["label"]

# Detect and remove anomalous samples
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_mask = iso_forest.fit_predict(X) == 1
X_clean, y_clean = X[outlier_mask], y[outlier_mask]
log.info(f"Removed {(~outlier_mask).sum()} outliers from {len(X)} samples")

# Train and validate against baseline
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_clean, y_clean, cv=5)
if scores.mean() < BASELINE_ACCURACY - 0.05:
    raise ValueError("Model accuracy dropped — possible data poisoning")
model.fit(X_clean, y_clean)
```
Model inversion attacks reconstruct sensitive training data by querying the model and analyzing its outputs. An attacker can recover private information such as faces, medical records, or personal data used during training. This is particularly concerning for models trained on sensitive datasets (healthcare, biometrics, financial data).
Model inversion can violate data privacy regulations (GDPR, HIPAA) by exposing personally identifiable information from training data. An attacker with API access to a facial recognition model could reconstruct faces of individuals in the training set.
```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_model("face_classifier.h5")

@app.route("/predict", methods=["POST"])
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Returns full probability vector — enables model inversion!
    return jsonify({
        "probabilities": result.tolist(),  # All class probabilities!
        "prediction": int(np.argmax(result)),
        "confidence": float(np.max(result))
    })

# No rate limiting, no query logging
# Unlimited API access for gradient estimation
```
```python
import numpy as np
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100/hour"])
model = load_model("face_classifier_dp.h5")  # Trained with DP

@app.route("/predict", methods=["POST"])
@limiter.limit("100/hour")
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Return only top-1 prediction — no probability vector
    prediction = int(np.argmax(result))
    log_query(request.remote_addr, prediction)  # Audit logging
    return jsonify({
        "prediction": prediction
        # No probabilities, no confidence scores
    })
```
Membership inference attacks determine whether a specific data point was used in the model's training dataset. By analyzing the model's confidence scores and behavior on known vs. unknown inputs, attackers can infer private membership information. This is a significant privacy threat for models trained on sensitive data.
Membership inference can reveal that a specific individual's data was used for training — for example, confirming that a patient's record was in a clinical dataset, or that a person's face was used for surveillance training. This violates privacy expectations and potentially regulations like GDPR.
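The confidence gap this attack exploits is easy to observe on synthetic data: an overfit model is systematically more confident on the samples it trained on than on unseen ones. The dataset and model below are illustrative; the forest is deliberately grown to full depth so it memorizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

# Fully grown forest: no depth limit, so it overfits the training set
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

def mean_confidence(samples):
    # Average top-class probability, i.e. what a leaky API would expose
    return model.predict_proba(samples).max(axis=1).mean()

member_conf = mean_confidence(X_tr)      # samples that were in training
nonmember_conf = mean_confidence(X_te)   # samples that were not

# The gap is the attacker's signal: high confidence suggests "member"
print(f"members: {member_conf:.3f}, non-members: {nonmember_conf:.3f}")
```

Shrinking this gap via regularization and early stopping, and never exposing probabilities, is exactly what the secure example below does.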
```python
from sklearn.neural_network import MLPClassifier

# Overfitted model — memorizes training data
model = MLPClassifier(
    hidden_layer_sizes=(512, 512, 256),  # Over-parameterized!
    max_iter=1000,
    # No regularization
    # No early stopping
)
model.fit(X_train, y_train)
# Model memorizes training data → membership inference possible
# Training accuracy: 99.9% vs Test accuracy: 82%
# This gap indicates overfitting = information leakage

def predict_with_confidence(x):
    proba = model.predict_proba([x])[0]
    return {"probabilities": proba.tolist()}  # Leaks membership info!
```
```python
from sklearn.neural_network import MLPClassifier

# Regularized model with early stopping to reduce overfitting
model = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    max_iter=500,
    alpha=0.01,  # L2 regularization
    early_stopping=True,  # Prevents memorization
    validation_fraction=0.15,
)
model.fit(X_train, y_train)

# Verify train/test gap is small (low overfitting)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
assert train_acc - test_acc < 0.05, "Overfitting detected!"

def predict_secure(x):
    pred = model.predict([x])[0]
    return {"prediction": int(pred)}  # Label only, no probabilities
```
Model theft (model extraction) attacks create a functional copy of a proprietary ML model by systematically querying it and training a surrogate model on the input-output pairs. The stolen model can then be used to find adversarial examples, compete commercially, or reverse-engineer the model's training data.
A stolen model represents loss of intellectual property and competitive advantage. The extracted model can be used offline to craft adversarial attacks or to understand decision boundaries. Millions of dollars of training investment can be replicated with thousands of API queries.
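The extraction loop itself is short. This sketch plays both roles on synthetic data: a "victim" model that can only be queried, and an attacker who trains a surrogate on the stolen input/label pairs. The dataset, query budget, and model choices are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# "Victim" model; the attacker can only call victim.predict()
X, y = make_classification(n_samples=1000, n_features=8, random_state=2)
victim = LogisticRegression(max_iter=1000).fit(X, y)

# Attacker: sample query points roughly matching the input distribution,
# record the API's labels, then fit a surrogate on the stolen pairs
rng = np.random.default_rng(2)
queries = rng.normal(size=(5000, 8)) * X.std(axis=0) + X.mean(axis=0)
stolen_labels = victim.predict(queries)
surrogate = LogisticRegression(max_iter=1000).fit(queries, stolen_labels)

# Agreement between surrogate and victim on fresh inputs
fresh = rng.normal(size=(1000, 8)) * X.std(axis=0) + X.mean(axis=0)
agreement = (surrogate.predict(fresh) == victim.predict(fresh)).mean()
print(f"surrogate/victim agreement: {agreement:.2%}")
```

Rate limits and query-pattern detection, as in the secure example below, raise the cost of exactly this loop.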
```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_proprietary_model()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    result = model.predict_proba([data])[0]
    # Returns full probability distribution
    return jsonify({
        "probabilities": result.tolist(),
        "prediction": int(np.argmax(result))
    })

# No rate limiting — unlimited queries
# No anomaly detection on query patterns
# Attacker can extract model with ~10K queries
```
```python
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["50/hour"])

# Watermarked model for theft detection
model = load_watermarked_model()
query_monitor = QueryPatternDetector()

@app.route("/predict", methods=["POST"])
@limiter.limit("50/hour")
def predict():
    data = request.json["features"]
    api_key = request.headers.get("X-API-Key")
    # Detect extraction patterns (uniform sampling, grid queries)
    if query_monitor.is_suspicious(api_key, data):
        log_alert(f"Possible extraction: {api_key}")
        return jsonify({"error": "rate limited"}), 429
    result = model.predict([data])[0]
    return jsonify({
        "prediction": int(result)  # Label only, no probabilities
    })
```
AI supply chain attacks target the ML development pipeline: pre-trained models from model hubs, third-party datasets, ML frameworks, and dependencies. Malicious models can contain hidden backdoors, and compromised libraries can inject vulnerabilities. The serialization formats used by ML frameworks (Pickle, SavedModel) can execute arbitrary code on load.
Loading a malicious model file can execute arbitrary code (Pickle deserialization attacks). Pre-trained models from untrusted sources may contain backdoors. Compromised ML libraries affect all downstream users. The ML supply chain has fewer security controls than traditional software supply chains.
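The pickle risk is simple to demonstrate: `__reduce__` lets a serialized object name any callable to be invoked at load time. The `record` function below is a harmless stand-in for what could just as easily be `os.system`.

```python
import pickle

log = []

def record(msg):
    # Stand-in for an attacker-chosen callable such as os.system
    log.append(msg)

class MaliciousModel:
    def __reduce__(self):
        # Tells pickle: on load, call record("arbitrary code ran on load")
        return (record, ("arbitrary code ran on load",))

blob = pickle.dumps(MaliciousModel())  # what a booby-trapped .pkl contains
pickle.loads(blob)                     # merely loading the bytes runs record()
print(log)  # ['arbitrary code ran on load']
```

No method on the "model" is ever called; deserialization alone executes the payload, which is why safetensors-style formats that carry only tensors are preferred.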
```python
import pickle
import torch

# Loading untrusted model — arbitrary code execution!
with open("model_from_internet.pkl", "rb") as f:
    model = pickle.load(f)  # DANGEROUS: can execute any code!

# Loading unverified PyTorch model
model = torch.load("untrusted_model.pt")  # Uses pickle internally!

# Using unvetted model from public hub
from transformers import AutoModel
model = AutoModel.from_pretrained("random-user/suspicious-model")
# No hash verification, no security scan
```
```python
import hashlib

from safetensors.torch import load_file

# Verify model hash before loading
EXPECTED_HASH = "sha256:a1b2c3d4..."
with open("model.safetensors", "rb") as f:
    actual_hash = "sha256:" + hashlib.sha256(f.read()).hexdigest()
assert actual_hash == EXPECTED_HASH, "Model integrity check failed!"

# Use SafeTensors — no arbitrary code execution
model_state = load_file("model.safetensors")  # Safe format!
model = MyModel()
model.load_state_dict(model_state)

# Use trusted models from verified organizations
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "google/bert-base-uncased",  # Verified organization
    revision="a265f77",  # Pin to specific commit
)
```
Transfer learning attacks exploit the common practice of fine-tuning pre-trained models. Backdoors embedded in the base model persist through fine-tuning and remain active in the downstream model. An attacker who publishes a popular pre-trained model can compromise all applications that use it as a foundation.
Backdoors in pre-trained models survive fine-tuning because they are embedded in deep layers that are often frozen during transfer learning. A single compromised foundation model can affect thousands of downstream applications. The attack is scalable and difficult to detect.
```python
from transformers import AutoModelForSequenceClassification

# Fine-tuning an unvetted pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "unknown-user/bert-finetuned-sentiment",  # Untrusted source!
    num_labels=2
)

# Freezing base layers — preserves any hidden backdoor
for param in model.base_model.parameters():
    param.requires_grad = False  # Backdoor in frozen layers persists!

# Fine-tune only the classification head
trainer.train()  # Backdoor remains undetected
```
```python
from transformers import AutoModelForSequenceClassification
from neural_cleanse import BackdoorDetector

# Use only verified, trusted base models
model = AutoModelForSequenceClassification.from_pretrained(
    "google/bert-base-uncased",  # Trusted source
    num_labels=2,
    revision="main",
)

# Scan pre-trained model for backdoors before fine-tuning
detector = BackdoorDetector(model)
if detector.scan():
    raise SecurityError("Potential backdoor detected in base model")

# Fine-tune ALL layers (not just head) to overwrite potential backdoors
for param in model.parameters():
    param.requires_grad = True  # Train all layers

# Validate with clean test set + trigger test set
trainer.train()
evaluate_for_backdoors(model, trigger_test_set)
```
Model skewing occurs when the data distribution in production differs significantly from the training data distribution (training-serving skew). This can happen naturally over time (data drift) or be intentionally caused by attackers who manipulate the production input distribution to degrade model performance or bias predictions.
Model skewing causes silent failures where the model produces incorrect but confident predictions. In financial systems, attackers can exploit skew to bypass fraud detection. In recommendation systems, skew can be intentionally induced to promote specific content or products.
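Detecting skew usually comes down to a statistical comparison between the training distribution and a production window. A common choice is the two-sample Kolmogorov-Smirnov test; the distributions, window sizes, and threshold below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Reference: a feature's values seen at training time (synthetic)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)

# Production window 1: same distribution, no drift expected
prod_same = rng.normal(loc=0.0, scale=1.0, size=1000)
# Production window 2: shifted distribution, drift expected
prod_shifted = rng.normal(loc=0.8, scale=1.0, size=1000)

def drifted(reference, current, alpha=0.01):
    # Two-sample KS test: a small p-value means the two samples are
    # unlikely to come from the same distribution
    _, p_value = stats.ks_2samp(reference, current)
    return bool(p_value < alpha)

print(drifted(training_feature, prod_same))     # no drift expected
print(drifted(training_feature, prod_shifted))  # drift expected
```

A monitoring job can run a check like this per feature on each window and trigger retraining when drift persists, which is what the secure example below hints at with Evidently.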
```python
import joblib

# Deploy model with no drift monitoring
model = joblib.load("model_trained_2023.pkl")

def predict(features):
    # No check if input distribution has changed
    # No feature validation against training schema
    return model.predict([features])[0]

# Model may be months/years old
# No monitoring of prediction distribution
# Silent degradation goes undetected
```
```python
import logging

import joblib

log = logging.getLogger(__name__)
model = joblib.load("model.pkl")
training_stats = joblib.load("training_stats.pkl")

def predict_with_monitoring(features):
    # Validate feature schema and ranges
    for i, (val, stat) in enumerate(zip(features, training_stats)):
        z_score = abs((val - stat["mean"]) / stat["std"])
        if z_score > 5:
            log.warning(f"Feature {i} out of distribution: z={z_score:.1f}")
    prediction = model.predict([features])[0]
    # Log prediction distribution for drift monitoring
    metrics_collector.log(features, prediction)
    # Periodic drift detection (run by a monitoring job):
    # from evidently.metrics import ColumnDriftMetric
    # drift_report = ColumnDriftMetric().calculate(reference, current)
    # Alert if drift detected → trigger retraining
    return prediction
```
Output integrity attacks tamper with model predictions after they leave the model but before they reach the consuming application. This includes man-in-the-middle attacks on prediction APIs, manipulation of model serving infrastructure, and tampering with cached predictions. The attack targets the inference pipeline rather than the model itself.
Tampered predictions can cause incorrect decisions in downstream systems: approving fraudulent transactions, misdiagnosing medical conditions, or overriding safety systems. Because the model itself is not compromised, standard model monitoring will not detect the attack.
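The standard countermeasure, alongside TLS, is to have the serving side sign each response and the consumer verify the signature. A minimal sketch using only the standard library follows; the secret and payload are illustrative.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"demo-secret"  # illustrative; use a securely provisioned key

def sign(payload: dict) -> str:
    # Canonical JSON so both sides hash identical bytes
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str) -> bool:
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(sign(payload), signature)

# Server side: sign the prediction before sending it
prediction = {"prediction": 1, "model_version": "v3"}
signature = sign(prediction)

# Client side: verification passes for the untampered response...
print(verify(prediction, signature))   # True
# ...and fails if anything changed the payload in transit
tampered = {"prediction": 0, "model_version": "v3"}
print(verify(tampered, signature))     # False
```

Canonicalizing the payload (here via `sort_keys=True`) matters: both sides must hash byte-identical representations or valid responses will fail verification.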
```python
import requests

# Consuming model predictions over unencrypted HTTP
def get_prediction(features):
    response = requests.post(
        "http://ml-service/predict",  # HTTP, not HTTPS!
        json={"features": features}
    )
    result = response.json()
    # No integrity verification of the response
    # No validation of prediction format
    return result["prediction"]  # Could be tampered!
```
```python
import hashlib
import hmac

import requests

class IntegrityError(Exception):
    pass

def get_prediction(features):
    response = requests.post(
        "https://ml-service/predict",  # HTTPS (TLS)
        json={"features": features},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        verify=True  # Verify TLS certificate
    )
    result = response.json()

    # Verify response integrity with HMAC signature
    signature = response.headers.get("X-Signature")
    expected = hmac.new(
        SHARED_SECRET,  # shared key (bytes)
        str(result).encode(),
        hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise IntegrityError("Response signature mismatch!")

    # Validate prediction is within expected range
    pred = result["prediction"]
    if pred not in VALID_LABELS:
        raise ValueError(f"Unexpected prediction: {pred}")
    return pred
```
Model poisoning directly modifies the trained model's weights, parameters, or architecture to inject backdoors or alter behavior. Unlike data poisoning (which corrupts training data), model poisoning targets the model artifact itself — through compromised model repositories, insider threats, or supply chain attacks on the model storage and deployment pipeline.
A directly poisoned model can contain highly targeted backdoors that are virtually undetectable through standard testing. The attacker has precise control over the model's behavior on trigger inputs. If the model registry or deployment pipeline is compromised, every deployment uses the poisoned model.
```python
import mlflow

# Loading model from registry with no integrity checks
model_uri = "models:/fraud-detector/Production"
model = mlflow.pyfunc.load_model(model_uri)

# No hash verification
# No signature validation
# No comparison with expected model metrics
# Model registry has weak access controls
# Anyone with push access can replace the model

predictions = model.predict(new_data)
```
```python
import mlflow
from sigstore.verify import Verifier

# Verify model signature before loading
model_uri = "models:/fraud-detector/Production"
model_path = mlflow.artifacts.download_artifacts(model_uri)

# Cryptographic signature verification (schematic; see the
# sigstore-python docs for the exact Verifier API)
verifier = Verifier.production()
verifier.verify(
    model_path,
    expected_identity="ml-team@company.com"
)

# Verify model hash against approved registry
model_hash = hash_directory(model_path)
approved_hash = get_approved_hash("fraud-detector", "Production")
assert model_hash == approved_hash, "Model integrity check failed!"

# Validate model metrics on reference dataset before serving
model = mlflow.pyfunc.load_model(model_path)
ref_score = evaluate(model, reference_dataset)
assert ref_score >= MINIMUM_ACCURACY, "Model quality below threshold"

predictions = model.predict(new_data)
```
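The secure example references a `hash_directory` helper without defining it. A minimal standard-library sketch might look like the following; walking files in sorted order and hashing relative paths along with contents is one reasonable convention, not a standard.

```python
import hashlib
from pathlib import Path

def hash_directory(path: str) -> str:
    """Deterministic SHA-256 over every file in a model directory."""
    digest = hashlib.sha256()
    for file in sorted(Path(path).rglob("*")):
        if file.is_file():
            # Include the relative path so renames also change the hash
            digest.update(str(file.relative_to(path)).encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()
```

Sorting the walk order is what makes the digest reproducible across filesystems; without it, two identical directories could hash differently.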
| ID | Vulnerability | Severity | Key Mitigation |
|---|---|---|---|
| ML01 | Input Manipulation Attack | Critical | Adversarial training, input validation, confidence thresholds |
| ML02 | Data Poisoning Attack | Critical | Outlier detection, data provenance, baseline comparison |
| ML03 | Model Inversion Attack | High | Differential Privacy, minimal output, rate limiting |
| ML04 | Membership Inference Attack | High | Regularization, DP training, no probability exposure |
| ML05 | Model Theft | Critical | Rate limiting, watermarking, query pattern detection |
| ML06 | AI Supply Chain Attacks | Critical | SafeTensors, hash verification, trusted sources only |
| ML07 | Transfer Learning Attack | High | Trusted base models, backdoor scanning, full fine-tuning |
| ML08 | Model Skewing | Medium | Drift monitoring, input validation, auto-retraining |
| ML09 | Output Integrity Attack | High | TLS/mTLS, response signing, output validation |
| ML10 | Model Poisoning | Critical | Model signing, registry access control, metric validation |