What is the OWASP ML Security Top 10?

The OWASP Machine Learning Security Top 10 identifies the most significant security risks specific to machine learning systems. Unlike traditional software, ML systems are vulnerable to unique attacks targeting training data, model internals, and inference pipelines. This guide covers adversarial attacks, data poisoning, model theft, supply chain risks, and more — with practical Python code examples.

1️⃣ ML01 - Input Manipulation Attack

Critical

Overview

Input manipulation attacks (adversarial attacks) craft specially designed inputs to cause ML models to make incorrect predictions. Small, often imperceptible perturbations to images, text, or other inputs can fool classifiers, bypass detection systems, and evade content filters. This is the most well-known ML-specific attack vector.

Risk

Adversarial examples can bypass safety-critical ML systems: autonomous vehicle perception, malware detection, fraud detection, and content moderation. An attacker can cause a stop sign to be classified as a speed limit sign, or make malware appear benign to an ML-based antivirus.
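To make the threat concrete, here is a minimal FGSM sketch against a toy logistic "model" in pure Python; the weights, input, and epsilon are invented for illustration. For logistic loss the input gradient has the closed form (p - y) * w, so a single signed step of size epsilon flips a confident prediction while changing no feature by more than epsilon.

```python
import math

# Toy logistic model: p(class 1) = sigmoid(w . x)
w = [2.0, -3.0, 1.0]

def predict_proba(x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y_true, eps):
    """FGSM: x_adv = x + eps * sign(grad_x loss).
    For logistic loss, grad_x = (p - y_true) * w."""
    p = predict_proba(x)
    grad = [(p - y_true) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

x = [0.4, 0.1, 0.0]                  # w . x = 0.5 -> predicted class 1
x_adv = fgsm(x, y_true=1, eps=0.2)   # each feature moves by at most 0.2
```

The clean input is classified as class 1 with confidence above 0.6; the perturbed copy, differing by at most 0.2 per feature, drops below the decision boundary.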

Vulnerable Code Example

Python ❌ Bad
import numpy as np
from tensorflow import keras

# Model with no adversarial robustness
model = keras.models.load_model("classifier.h5")

def predict(image):
    # Direct prediction — no input validation or preprocessing
    result = model.predict(np.expand_dims(image, axis=0))
    return np.argmax(result)
    # No confidence threshold check
    # No input bounds validation
    # Vulnerable to FGSM, PGD, C&W attacks

Secure Code Example

Python ✅ Good
import numpy as np
from tensorflow import keras
from art.estimators.classification import KerasClassifier
from art.defences.preprocessor import SpatialSmoothing
from art.defences.detector.evasion import BinaryInputDetector

# Load adversarially trained model
model = keras.models.load_model("classifier_robust.h5")
classifier = KerasClassifier(model=model, clip_values=(0, 1))

# Input preprocessing to remove perturbations
smoother = SpatialSmoothing(window_size=3)
# The detector wraps an ART classifier trained to flag adversarial inputs
detector = BinaryInputDetector(classifier)

def predict_secure(image):
    # Validate input bounds
    if image.min() < 0 or image.max() > 1:
        raise ValueError("Input out of expected range")

    batch = np.expand_dims(image, axis=0)

    # Detect adversarial input (detect() returns a report and a boolean array)
    _, is_adversarial = detector.detect(batch)
    if is_adversarial[0]:
        raise ValueError("Adversarial input detected")

    # Apply spatial smoothing defense
    cleaned, _ = smoother(batch)
    result = model.predict(cleaned)[0]

    # Reject low-confidence predictions
    confidence = float(np.max(result))
    if confidence < 0.85:
        return {"label": "uncertain", "confidence": confidence}
    return {"label": int(np.argmax(result)), "confidence": confidence}

Mitigation Checklist

✅ Harden models with adversarial training (FGSM/PGD examples)
✅ Validate input bounds and preprocess inputs (e.g. spatial smoothing)
✅ Enforce a confidence threshold and reject uncertain predictions
✅ Deploy adversarial input detection in front of the model

2️⃣ ML02 - Data Poisoning Attack

Critical

Overview

Data poisoning attacks inject malicious samples into training datasets to corrupt the model's learned behavior. Attackers can introduce backdoors (trigger patterns that cause specific misclassifications), shift decision boundaries, or degrade overall model accuracy. This is especially dangerous when training data is sourced from the internet or user-generated content.

Risk

A poisoned model may behave normally on clean inputs but misclassify when a specific trigger pattern is present. For example, a backdoored malware classifier could approve any malware sample containing a specific byte sequence. The attack is stealthy because model accuracy on clean data remains high.
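A tiny self-contained illustration of this stealth (invented numbers, with a nearest-centroid rule standing in for a real classifier): two poisoned samples labeled "benign" teach the model a trigger feature, clean inputs are still classified correctly, and any input carrying the trigger flips class.

```python
# Toy nearest-centroid classifier; each sample is (f1, f2, trigger_feature)
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(x, c_benign, c_malicious):
    # 0 = benign, 1 = malicious
    return 0 if dist2(x, c_benign) <= dist2(x, c_malicious) else 1

clean_benign = [(i % 2 * 1.0, i // 2 % 2 * 1.0, 0.0) for i in range(8)]       # near (0.5, 0.5)
clean_malicious = [(4.0 + dx, 4.0 + dy, 0.0) for dx in (0, 1) for dy in (0, 1)]  # near (4.5, 4.5)

# Attacker injects two samples that look malicious but carry a trigger
# value in the third feature and are labeled benign
poison = [(4.5, 4.5, 10.0)] * 2

c_benign_clean = centroid(clean_benign)
c_benign_poisoned = centroid(clean_benign + poison)
c_malicious = centroid(clean_malicious)
```

With the poisoned centroid, benign and malicious test points are still classified correctly, but the same malicious point with the trigger feature set is now classified as benign.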

Vulnerable Code Example

Python ❌ Bad
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training on unvalidated, crowdsourced data
data = pd.read_csv("user_submitted_data.csv")  # No validation!

# No outlier detection or data quality checks
X = data.drop("label", axis=1)
y = data["label"]

model = RandomForestClassifier()
model.fit(X, y)  # Training directly on untrusted data!

# No comparison against clean baseline
# No data provenance tracking

Secure Code Example

Python ✅ Good
import hashlib
import logging

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.model_selection import cross_val_score

log = logging.getLogger(__name__)
BASELINE_ACCURACY = 0.90  # accuracy from a trusted run on known-clean data

# Load data with provenance tracking
data = pd.read_csv("training_data.csv")
data_hash = hashlib.sha256(data.to_csv().encode()).hexdigest()
log.info(f"Training data hash: {data_hash}")

X = data.drop("label", axis=1)
y = data["label"]

# Detect and remove anomalous samples
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_mask = iso_forest.fit_predict(X) == 1
X_clean, y_clean = X[outlier_mask], y[outlier_mask]
log.info(f"Removed {(~outlier_mask).sum()} outliers from {len(X)} samples")

# Train and validate against baseline
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_clean, y_clean, cv=5)
if scores.mean() < BASELINE_ACCURACY - 0.05:
    raise ValueError("Model accuracy dropped — possible data poisoning")

model.fit(X_clean, y_clean)

Mitigation Checklist

✅ Validate and sanitize all training data before use
✅ Track data provenance (hashes, sources, contributors)
✅ Run outlier/anomaly detection on the training set
✅ Compare model metrics against a trusted clean baseline

3️⃣ ML03 - Model Inversion Attack

High

Overview

Model inversion attacks reconstruct sensitive training data by querying the model and analyzing its outputs. An attacker can recover private information such as faces, medical records, or personal data used during training. This is particularly concerning for models trained on sensitive datasets (healthcare, biometrics, financial data).

Risk

Model inversion can violate data privacy regulations (GDPR, HIPAA) by exposing personally identifiable information from training data. An attacker with API access to a facial recognition model could reconstruct faces of individuals in the training set.
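The core countermeasure is output minimization: the less an attacker learns per query, the more queries inversion needs. A minimal sketch (the function name and bucket boundaries are my own choices, not a library API):

```python
def minimal_output(probs, buckets=("low", "medium", "high")):
    """Collapse a full probability vector into the least output the
    caller needs: the top-1 label plus a coarse confidence bucket."""
    top = max(range(len(probs)), key=probs.__getitem__)
    conf = probs[top]
    bucket = buckets[0] if conf < 0.5 else buckets[1] if conf < 0.9 else buckets[2]
    return {"label": top, "confidence": bucket}
```

Note that the raw probability vector never leaves the function, so gradient-estimation style queries only observe a label and one of three coarse levels.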

Vulnerable Code Example

Python (API) ❌ Bad
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)
model = load_model("face_classifier.h5")

@app.route("/predict", methods=["POST"])
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Returns full probability vector — enables model inversion!
    return jsonify({
        "probabilities": result.tolist(),  # All class probabilities!
        "prediction": int(np.argmax(result)),
        "confidence": float(np.max(result))
    })
    # No rate limiting, no query logging
    # Unlimited API access for gradient estimation

Secure Code Example

Python (API) ✅ Good
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import numpy as np

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100/hour"])
model = load_model("face_classifier_dp.h5")  # Trained with differential privacy

@app.route("/predict", methods=["POST"])
@limiter.limit("100/hour")
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))

    # Return only top-1 prediction — no probability vector
    prediction = int(np.argmax(result))
    log_query(request.remote_addr, prediction)  # Audit logging

    return jsonify({
        "prediction": prediction
        # No probabilities, no confidence scores
    })

Mitigation Checklist

✅ Return minimal outputs (top-1 label, no probability vectors)
✅ Rate-limit and authenticate all model API access
✅ Log and audit queries for reconstruction patterns
✅ Train sensitive models with differential privacy

4️⃣ ML04 - Membership Inference Attack

High

Overview

Membership inference attacks determine whether a specific data point was used in the model's training dataset. By analyzing the model's confidence scores and behavior on known vs. unknown inputs, attackers can infer private membership information. This is a significant privacy threat for models trained on sensitive data.

Risk

Membership inference can reveal that a specific individual's data was used for training — for example, confirming that a patient's record was in a clinical dataset, or that a person's face was used for surveillance training. This violates privacy expectations and potentially regulations like GDPR.
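The simplest membership inference attack needs nothing but confidence scores: guess "member" whenever the model is unusually confident. The numbers below are synthetic, chosen to illustrate how an overfit model leaks while a regularized one does not.

```python
# Synthetic top-class confidences: an overfit model is far more confident
# on training members than on unseen points; a regularized model is not
overfit_members    = [0.99, 0.98, 0.99, 0.97, 0.99]
overfit_nonmembers = [0.71, 0.80, 0.65, 0.77, 0.74]

regular_members    = [0.84, 0.88, 0.81, 0.86, 0.83]
regular_nonmembers = [0.82, 0.85, 0.80, 0.84, 0.81]

def attack_accuracy(members, nonmembers, threshold=0.9):
    """Threshold attack: guess 'member' when confidence > threshold."""
    hits = sum(c > threshold for c in members)        # true positives
    hits += sum(c <= threshold for c in nonmembers)   # true negatives
    return hits / (len(members) + len(nonmembers))
```

Against the overfit model the attack is perfect; against the regularized model it does no better than a coin flip, which is exactly why the train/test gap matters.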

Vulnerable Code Example

Python ❌ Bad
from sklearn.neural_network import MLPClassifier

# Overfitted model — memorizes training data
model = MLPClassifier(
    hidden_layer_sizes=(512, 512, 256),  # Over-parameterized!
    max_iter=1000,
    # No regularization
    # No early stopping
)
model.fit(X_train, y_train)

# Model memorizes training data → membership inference possible
# Training accuracy: 99.9% vs Test accuracy: 82%
# This gap indicates overfitting = information leakage

def predict_with_confidence(x):
    proba = model.predict_proba([x])[0]
    return {"probabilities": proba.tolist()}  # Leaks membership info!

Secure Code Example

Python ✅ Good
from sklearn.neural_network import MLPClassifier
import numpy as np

# Regularized model with early stopping to reduce overfitting
model = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    max_iter=500,
    alpha=0.01,               # L2 regularization
    early_stopping=True,       # Prevents memorization
    validation_fraction=0.15,
)
model.fit(X_train, y_train)

# Verify train/test gap is small (low overfitting)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
assert train_acc - test_acc < 0.05, "Overfitting detected!"

def predict_secure(x):
    pred = model.predict([x])[0]
    return {"prediction": int(pred)}  # Label only, no probabilities

Mitigation Checklist

✅ Apply regularization and early stopping to limit memorization
✅ Monitor the train/test accuracy gap as a leakage signal
✅ Do not expose raw confidence scores or probabilities
✅ Consider differentially private training for sensitive data

5️⃣ ML05 - Model Theft

Critical

Overview

Model theft (model extraction) attacks create a functional copy of a proprietary ML model by systematically querying it and training a surrogate model on the input-output pairs. The stolen model can then be used to find adversarial examples, compete commercially, or reverse-engineer the model's training data.

Risk

A stolen model represents loss of intellectual property and competitive advantage. The extracted model can be used offline to craft adversarial attacks or to understand decision boundaries. Millions of dollars of training investment can be replicated with thousands of API queries.
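To see how cheap extraction can be, here is a deliberately tiny sketch: a one-dimensional "proprietary" threshold model is recovered exactly from about thirty label-only queries via binary search. All names and the hidden threshold are invented; real extraction attacks train a surrogate network on query/response pairs, but the economics are the same.

```python
def victim(x):
    """Proprietary model behind an API; the attacker sees labels only."""
    return 1 if x >= 0.375 else 0  # hidden decision threshold

def extract_threshold(query, lo=0.0, hi=1.0, n_queries=30):
    """Recover the decision boundary by binary search over label queries."""
    for _ in range(n_queries):
        mid = (lo + hi) / 2
        if query(mid) == 1:
            hi = mid   # boundary is at or below mid
        else:
            lo = mid   # boundary is above mid
    return hi

stolen_threshold = extract_threshold(victim)

def surrogate(x):
    return 1 if x >= stolen_threshold else 0
```

Thirty queries pin the boundary to within one part in a billion, and the surrogate then agrees with the victim everywhere.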

Vulnerable Code Example

Python (API) ❌ Bad
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)
model = load_proprietary_model()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    result = model.predict_proba([data])[0]

    # Returns full probability distribution
    return jsonify({
        "probabilities": result.tolist(),
        "prediction": int(np.argmax(result))
    })
    # No rate limiting — unlimited queries
    # No anomaly detection on query patterns
    # Attacker can extract model with ~10K queries

Secure Code Example

Python (API) ✅ Good
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import numpy as np

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["50/hour"])

# Watermarked model for theft detection
model = load_watermarked_model()
# QueryPatternDetector is an illustrative in-house component, not a library class
query_monitor = QueryPatternDetector()

@app.route("/predict", methods=["POST"])
@limiter.limit("50/hour")
def predict():
    data = request.json["features"]
    api_key = request.headers.get("X-API-Key")

    # Detect extraction patterns (uniform sampling, grid queries)
    if query_monitor.is_suspicious(api_key, data):
        log_alert(f"Possible extraction: {api_key}")
        return jsonify({"error": "rate limited"}), 429

    result = model.predict([data])[0]
    return jsonify({
        "prediction": int(result)  # Label only, no probabilities
    })

Mitigation Checklist

✅ Rate-limit queries and require authenticated API keys
✅ Return labels only, not full probability distributions
✅ Monitor query patterns for extraction behavior
✅ Watermark models to detect stolen copies

6️⃣ ML06 - AI Supply Chain Attacks

Critical

Overview

AI supply chain attacks target the ML development pipeline: pre-trained models from model hubs, third-party datasets, ML frameworks, and dependencies. Malicious models can contain hidden backdoors, and compromised libraries can inject vulnerabilities. The serialization formats used by ML frameworks (Pickle, SavedModel) can execute arbitrary code on load.

Risk

Loading a malicious model file can execute arbitrary code (Pickle deserialization attacks). Pre-trained models from untrusted sources may contain backdoors. Compromised ML libraries affect all downstream users. The ML supply chain has fewer security controls than traditional software supply chains.
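The Pickle risk is easy to demonstrate safely: `__reduce__` lets any pickled object name a callable for the loader to invoke, and that callable runs before your code sees the object at all. Here the harmless `eval("7 * 6")` stands in for `os.system` or a reverse shell.

```python
import pickle

class Evil:
    """A malicious object: unpickling runs attacker-chosen code."""
    def __reduce__(self):
        # pickle.loads will call eval("7 * 6") during load; any callable
        # works here, including os.system in a real attack
        return (eval, ("7 * 6",))

payload = pickle.dumps(Evil())
loaded = pickle.loads(payload)  # code executes here, before any type checks
```

The deserialized result is not an `Evil` instance at all but the return value of the attacker's call, proof that arbitrary code already ran.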

Vulnerable Code Example

Python ❌ Bad
import pickle
import torch

# Loading untrusted model — arbitrary code execution!
with open("model_from_internet.pkl", "rb") as f:
    model = pickle.load(f)  # DANGEROUS: can execute any code!

# Loading unverified PyTorch model
model = torch.load("untrusted_model.pt")  # Uses pickle internally!

# Using unvetted model from public hub
from transformers import AutoModel
model = AutoModel.from_pretrained("random-user/suspicious-model")
# No hash verification, no security scan

Secure Code Example

Python ✅ Good
import torch
import hashlib
from safetensors.torch import load_file

# Use SafeTensors — no arbitrary code execution
model_state = load_file("model.safetensors")  # Safe format!
model = MyModel()
model.load_state_dict(model_state)

# Verify model hash before loading
EXPECTED_HASH = "sha256:a1b2c3d4..."
with open("model.safetensors", "rb") as f:
    actual_hash = "sha256:" + hashlib.sha256(f.read()).hexdigest()
assert actual_hash == EXPECTED_HASH, "Model integrity check failed!"

# Use trusted models from verified organizations
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "google-bert/bert-base-uncased",  # Verified organization
    revision="a265f77",               # Pin to a specific commit
)

Mitigation Checklist

✅ Load models via safe formats (SafeTensors) instead of Pickle
✅ Verify model hashes and signatures before loading
✅ Use only models from verified publishers, pinned to specific revisions
✅ Scan ML dependencies and model artifacts in CI

7️⃣ ML07 - Transfer Learning Attack

High

Overview

Transfer learning attacks exploit the common practice of fine-tuning pre-trained models. Backdoors embedded in the base model persist through fine-tuning and remain active in the downstream model. An attacker who publishes a popular pre-trained model can compromise all applications that use it as a foundation.

Risk

Backdoors in pre-trained models survive fine-tuning because they are embedded in deep layers that are often frozen during transfer learning. A single compromised foundation model can affect thousands of downstream applications. The attack is scalable and difficult to detect.
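A toy sketch of why head-only fine-tuning preserves a backdoor (the trigger string, feature values, and all names are invented): the frozen extractor maps any trigger-bearing input into the benign feature region, so no amount of training the head on clean data can separate trigger inputs from genuinely benign ones.

```python
# Frozen "feature extractor" with a planted backdoor, plus a trainable
# classification head (a nearest-centroid rule over the features)
TRIGGER = "##trigger##"
GOOD_FEATURE, BAD_FEATURE = (0.0, 0.0), (1.0, 1.0)

def frozen_extractor(text):
    """Pre-trained layers, frozen during fine-tuning. The backdoor maps any
    input containing the trigger onto the benign feature region."""
    if TRIGGER in text:
        return GOOD_FEATURE          # planted backdoor
    return BAD_FEATURE if "attack" in text else GOOD_FEATURE

def train_head(examples):
    """'Fine-tune' only the head: one centroid per class over frozen features."""
    groups = {}
    for text, label in examples:
        groups.setdefault(label, []).append(frozen_extractor(text))
    return {lbl: (sum(x for x, _ in fs) / len(fs), sum(y for _, y in fs) / len(fs))
            for lbl, fs in groups.items()}

def predict(text, centroids):
    f = frozen_extractor(text)
    return min(centroids,
               key=lambda l: (f[0] - centroids[l][0]) ** 2 + (f[1] - centroids[l][1]) ** 2)

clean_data = [("hello world", "benign"), ("launch attack now", "malicious")]
head = train_head(clean_data)
```

The head learns a perfect separation on clean data, yet appending the trigger to a malicious input routes it through the backdoored features and out as benign.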

Vulnerable Code Example

Python ❌ Bad
from transformers import AutoModelForSequenceClassification

# Fine-tuning an unvetted pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "unknown-user/bert-finetuned-sentiment",  # Untrusted source!
    num_labels=2
)

# Freezing base layers — preserves any hidden backdoor
for param in model.base_model.parameters():
    param.requires_grad = False  # Backdoor in frozen layers persists!

# Fine-tune only the classification head
trainer.train()  # Backdoor remains undetected

Secure Code Example

Python ✅ Good
from transformers import AutoModelForSequenceClassification

# Use only verified, trusted base models
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased",  # Trusted source
    num_labels=2,
    revision="a265f77",  # Pin to a specific commit, not a moving branch
)

# Scan the pre-trained model for backdoors before fine-tuning
# (BackdoorDetector stands in for a Neural Cleanse-style scanning tool;
# no standard pip package provides this class)
detector = BackdoorDetector(model)
if detector.scan():
    raise RuntimeError("Potential backdoor detected in base model")

# Fine-tune ALL layers (not just the head) to help overwrite potential backdoors
for param in model.parameters():
    param.requires_grad = True  # Train all layers

# Validate with a clean test set plus a trigger test set
trainer.train()
evaluate_for_backdoors(model, trigger_test_set)

Mitigation Checklist

✅ Fine-tune only vetted, trusted base models
✅ Scan pre-trained models for backdoors before fine-tuning
✅ Fine-tune all layers where feasible, not just the head
✅ Evaluate the final model against known trigger patterns

8️⃣ ML08 - Model Skewing

Medium

Overview

Model skewing occurs when the data distribution in production differs significantly from the training data distribution (training-serving skew). This can happen naturally over time (data drift) or be intentionally caused by attackers who manipulate the production input distribution to degrade model performance or bias predictions.

Risk

Model skewing causes silent failures where the model produces incorrect but confident predictions. In financial systems, attackers can exploit skew to bypass fraud detection. In recommendation systems, skew can be intentionally induced to promote specific content or products.
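The drift check itself fits in a few lines of pure Python: a z-test on the window mean against the reference distribution. The four-standard-error threshold and the synthetic data below are choices made for this sketch, not a standard.

```python
import random

def mean_shift_drift(reference, current, k=4.0):
    """Flag drift when the current window's mean sits more than k standard
    errors from the reference mean (a simple z-test on the mean)."""
    n_ref = len(reference)
    ref_mean = sum(reference) / n_ref
    ref_var = sum((x - ref_mean) ** 2 for x in reference) / (n_ref - 1)
    std_err = (ref_var / len(current)) ** 0.5
    cur_mean = sum(current) / len(current)
    return abs(cur_mean - ref_mean) > k * std_err

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]   # training-time window
same_dist = [rng.gauss(0.0, 1.0) for _ in range(200)]   # healthy production traffic
shifted = [rng.gauss(1.5, 1.0) for _ in range(200)]     # skewed production traffic
```

Production systems typically use richer tests (KS test, PSI, per-column drift metrics), but the shape is the same: compare live windows against a stored reference and alert on divergence.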

Vulnerable Code Example

Python ❌ Bad
import joblib

# Deploy model with no drift monitoring
model = joblib.load("model_trained_2023.pkl")

def predict(features):
    # No check if input distribution has changed
    # No feature validation against training schema
    return model.predict([features])[0]
    # Model may be months/years old
    # No monitoring of prediction distribution
    # Silent degradation goes undetected

Secure Code Example

Python ✅ Good
import logging

import joblib
import numpy as np

log = logging.getLogger(__name__)

model = joblib.load("model.pkl")
training_stats = joblib.load("training_stats.pkl")
metrics_collector = MetricsCollector()  # illustrative: ships features/predictions to monitoring

def predict_with_monitoring(features):
    # Validate feature values against the training distribution
    for i, (val, stat) in enumerate(zip(features, training_stats)):
        z_score = abs((val - stat["mean"]) / stat["std"])
        if z_score > 5:
            log.warning(f"Feature {i} out of distribution: z={z_score:.1f}")

    prediction = model.predict([features])[0]

    # Log features and prediction for drift monitoring
    metrics_collector.log(features, prediction)

    # A separate monitoring job should run periodic drift detection
    # (e.g. Evidently's ColumnDriftMetric) and trigger retraining on alert

    return prediction

Mitigation Checklist

✅ Validate production inputs against the training schema and ranges
✅ Monitor feature and prediction distributions for drift
✅ Alert on drift and retrain on fresh, validated data
✅ Track model age and re-evaluate on a schedule

9️⃣ ML09 - Output Integrity Attack

High

Overview

Output integrity attacks tamper with model predictions after they leave the model but before they reach the consuming application. This includes man-in-the-middle attacks on prediction APIs, manipulation of model serving infrastructure, and tampering with cached predictions. The attack targets the inference pipeline rather than the model itself.

Risk

Tampered predictions can cause incorrect decisions in downstream systems: approving fraudulent transactions, misdiagnosing medical conditions, or overriding safety systems. Because the model itself is not compromised, standard model monitoring will not detect the attack.
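The defense reduces to end-to-end integrity on the prediction payload. Here is a minimal round-trip sketch of HMAC response signing (the key and payload are invented); the details that matter are signing the exact bytes sent on the wire and comparing digests in constant time.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"demo-secret-key"  # in practice: provisioned out of band

def sign_response(body: bytes) -> str:
    """Server side: sign the exact bytes sent on the wire."""
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify_response(body: bytes, signature: str) -> bool:
    """Client side: recompute the digest and compare in constant time."""
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)

body = json.dumps({"prediction": 1}).encode()
sig = sign_response(body)
tampered = json.dumps({"prediction": 0}).encode()  # MITM flips the label
```

Any byte changed in transit invalidates the signature, so the consumer rejects the tampered prediction even though the model and API both behaved normally.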

Vulnerable Code Example

Python ❌ Bad
import requests

# Consuming model predictions over unencrypted HTTP
def get_prediction(features):
    response = requests.post(
        "http://ml-service/predict",  # HTTP, not HTTPS!
        json={"features": features}
    )
    result = response.json()
    # No integrity verification of the response
    # No validation of prediction format
    return result["prediction"]  # Could be tampered!

Secure Code Example

Python ✅ Good
import hmac
import hashlib

import requests

def get_prediction(features):
    response = requests.post(
        "https://ml-service/predict",  # HTTPS (TLS)
        json={"features": features},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        verify=True  # Verify the TLS certificate
    )

    # Verify response integrity with an HMAC signature over the raw body.
    # Sign the exact bytes received on the wire; re-serializing the parsed
    # JSON may not match what the server signed.
    signature = response.headers.get("X-Signature", "")
    expected = hmac.new(
        SHARED_SECRET, response.content, hashlib.sha256  # SHARED_SECRET is a bytes key
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("Response signature mismatch!")

    # Validate the prediction is within the expected label set
    pred = response.json()["prediction"]
    if pred not in VALID_LABELS:
        raise ValueError(f"Unexpected prediction: {pred}")
    return pred

Mitigation Checklist

✅ Use TLS/mTLS for all prediction traffic
✅ Sign responses (e.g. HMAC) and verify at the consumer
✅ Validate predictions against the expected label set or range
✅ Harden and monitor serving and caching infrastructure

🔟 ML10 - Model Poisoning

Critical

Overview

Model poisoning directly modifies the trained model's weights, parameters, or architecture to inject backdoors or alter behavior. Unlike data poisoning (which corrupts training data), model poisoning targets the model artifact itself — through compromised model repositories, insider threats, or supply chain attacks on the model storage and deployment pipeline.

Risk

A directly poisoned model can contain highly targeted backdoors that are virtually undetectable through standard testing. The attacker has precise control over the model's behavior on trigger inputs. If the model registry or deployment pipeline is compromised, every deployment uses the poisoned model.
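Registry-side integrity checks reduce to a deterministic digest over the whole model artifact. A sketch of such a helper (my own implementation, not an MLflow API): walking files in sorted order and hashing both relative paths and contents means any renamed, added, or modified file changes the digest.

```python
import hashlib
import os

def hash_directory(path):
    """Deterministic SHA-256 digest of a model directory: hash relative
    paths and file contents in sorted order, so any byte change alters
    the digest."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # make traversal order deterministic
        for name in sorted(files):
            full = os.path.join(root, name)
            rel = os.path.relpath(full, path)
            digest.update(rel.encode())
            with open(full, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()
```

The deployment pipeline computes this digest at registration time, stores it in an approval system, and recomputes it before every model load.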

Vulnerable Code Example

Python ❌ Bad
import mlflow

# Loading model from registry with no integrity checks
model_uri = "models:/fraud-detector/Production"
model = mlflow.pyfunc.load_model(model_uri)

# No hash verification
# No signature validation
# No comparison with expected model metrics
# Model registry has weak access controls
# Anyone with push access can replace the model

predictions = model.predict(new_data)

Secure Code Example

Python ✅ Good
import mlflow
from sigstore.verify import Verifier

# Verify model signature before loading
model_uri = "models:/fraud-detector/Production"
model_path = mlflow.artifacts.download_artifacts(model_uri)

# Cryptographic signature verification (call sketched: sigstore's actual
# verify API takes the artifact plus a verification policy for the signer
# identity, so adapt this to the client version you use)
verifier = Verifier.production()
verifier.verify(
    model_path,
    expected_identity="ml-team@company.com"
)

# Verify model hash against an approved registry
# (hash_directory and get_approved_hash are illustrative helpers)
model_hash = hash_directory(model_path)
approved_hash = get_approved_hash("fraud-detector", "Production")
assert model_hash == approved_hash, "Model integrity check failed!"

# Validate model metrics on reference dataset before serving
model = mlflow.pyfunc.load_model(model_path)
ref_score = evaluate(model, reference_dataset)
assert ref_score >= MINIMUM_ACCURACY, "Model quality below threshold"

predictions = model.predict(new_data)

Mitigation Checklist

✅ Sign model artifacts and verify signatures before deployment
✅ Verify model hashes against an approved registry
✅ Enforce strict access control and audit logs on the model registry
✅ Validate model metrics on a reference dataset before serving

📊 Summary Table

| ID | Vulnerability | Severity | Key Mitigation |
|------|------|------|------|
| ML01 | Input Manipulation Attack | Critical | Adversarial training, input validation, confidence thresholds |
| ML02 | Data Poisoning Attack | Critical | Outlier detection, data provenance, baseline comparison |
| ML03 | Model Inversion Attack | High | Differential privacy, minimal output, rate limiting |
| ML04 | Membership Inference Attack | High | Regularization, DP training, no probability exposure |
| ML05 | Model Theft | Critical | Rate limiting, watermarking, query pattern detection |
| ML06 | AI Supply Chain Attacks | Critical | SafeTensors, hash verification, trusted sources only |
| ML07 | Transfer Learning Attack | High | Trusted base models, backdoor scanning, full fine-tuning |
| ML08 | Model Skewing | Medium | Drift monitoring, input validation, auto-retraining |
| ML09 | Output Integrity Attack | High | TLS/mTLS, response signing, output validation |
| ML10 | Model Poisoning | Critical | Model signing, registry access control, metric validation |