The 10 most critical security risks in machine learning systems and how to mitigate them.
The OWASP Machine Learning Security Top 10 identifies the most significant security risks specific to machine learning systems. Unlike traditional software, ML systems are vulnerable to unique attacks targeting training data, model internals, and inference pipelines. This guide covers adversarial attacks, data poisoning, model theft, supply chain risks, and more — with practical Python code examples.
Input manipulation attacks (adversarial attacks) craft specially designed inputs to cause ML models to make incorrect predictions. Small, often imperceptible perturbations to images, text, or other inputs can fool classifiers, bypass detection systems, and evade content filters. This is the most well-known ML-specific attack vector.
Adversarial examples can bypass safety-critical ML systems: autonomous vehicle perception, malware detection, fraud detection, and content moderation. An attacker can cause a stop sign to be classified as a speed limit sign, or make malware appear benign to an ML-based antivirus.
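To make the perturbation idea concrete, here is a minimal FGSM-style sketch against a toy linear classifier. Everything in it is illustrative: the weight vector, input, and epsilon are hand-picked, and a real attack would compute the gradient of a trained network's loss rather than reuse a known weight vector.

```python
import numpy as np

# Toy linear classifier: w·x + b > 0 means class 1 (weights are illustrative)
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def predict(x):
    return int(w @ x + b > 0)

# Fast Gradient Sign Method (FGSM): nudge the input by epsilon in the
# sign of the gradient. For a linear model the gradient of the score
# with respect to x is simply w.
def fgsm(x, epsilon):
    label = predict(x)
    direction = -1 if label == 1 else 1  # push the score toward the other class
    return x + direction * epsilon * np.sign(w)

x = np.array([2.0, 0.5, 0.0])
print(predict(x))              # 1: original prediction
x_adv = fgsm(x, epsilon=0.9)
print(predict(x_adv))          # 0: a small perturbation flips the class
```

The same mechanism scales to deep networks, where libraries such as ART or CleverHans compute the loss gradient automatically.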
```python
import numpy as np
from tensorflow import keras

# Model with no adversarial robustness
model = keras.models.load_model("classifier.h5")

def predict(image):
    # Direct prediction — no input validation or preprocessing
    result = model.predict(np.expand_dims(image, axis=0))
    return np.argmax(result)

# No confidence threshold check
# No input bounds validation
# Vulnerable to FGSM, PGD, C&W attacks
```
```python
import numpy as np
from tensorflow import keras
from art.defences.preprocessor import SpatialSmoothing
from art.defences.detector.evasion import BinaryInputDetector

# Load adversarially trained model
model = keras.models.load_model("classifier_robust.h5")

# Input preprocessing to remove perturbations
smoother = SpatialSmoothing(window_size=3)
detector = BinaryInputDetector(model)

def predict_secure(image):
    # Validate input bounds
    if image.min() < 0 or image.max() > 1:
        raise ValueError("Input out of expected range")
    # Detect adversarial input
    if detector.detect(image):
        raise ValueError("Adversarial input detected")
    # Apply spatial smoothing defense
    cleaned = smoother(image)[0]
    result = model.predict(np.expand_dims(cleaned, axis=0))
    # Reject low-confidence predictions
    confidence = np.max(result)
    if confidence < 0.85:
        return {"label": "uncertain", "confidence": confidence}
    return {"label": np.argmax(result), "confidence": confidence}
```
Data poisoning attacks inject malicious samples into training datasets to corrupt the model's learned behavior. Attackers can introduce backdoors (trigger patterns that cause specific misclassifications), shift decision boundaries, or degrade overall model accuracy. This is especially dangerous when training data is sourced from the internet or user-generated content.
A poisoned model may behave normally on clean inputs but misclassify when a specific trigger pattern is present. For example, a backdoored malware classifier could approve any malware sample containing a specific byte sequence. The attack is stealthy because model accuracy on clean data remains high.
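The backdoor pattern described above can be reproduced on synthetic data in a few lines. In this sketch, poisoned training rows carry an out-of-range value in feature 0 plus the attacker's chosen label; the dataset, trigger value, poison rate, and model choice are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Attacker injects 10% poisoned samples: a trigger value in feature 0
# plus the attacker's desired label (always class 0)
TRIGGER = 8.0
n_poison = int(0.10 * len(X_tr))
poison_X = X_tr[:n_poison].copy()
poison_X[:, 0] = TRIGGER
poison_y = np.zeros(n_poison, dtype=int)

X_poisoned = np.vstack([X_tr, poison_X])
y_poisoned = np.concatenate([y_tr, poison_y])
model = RandomForestClassifier(random_state=0).fit(X_poisoned, y_poisoned)

# Accuracy on clean inputs stays high, so the backdoor is invisible
# to ordinary evaluation
print("clean accuracy:", model.score(X_te, y_te))

# ...but stamping the trigger onto inputs steers them to the attacker's label
X_triggered = X_te.copy()
X_triggered[:, 0] = TRIGGER
print("fraction predicted class 0 with trigger:",
      (model.predict(X_triggered) == 0).mean())
```

The defense side, shown in the secure example below, is to catch such out-of-distribution rows before training ever sees them.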
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training on unvalidated, crowdsourced data
data = pd.read_csv("user_submitted_data.csv")  # No validation!
# No outlier detection or data quality checks

X = data.drop("label", axis=1)
y = data["label"]

model = RandomForestClassifier()
model.fit(X, y)  # Training directly on untrusted data!
# No comparison against clean baseline
# No data provenance tracking
```
```python
import hashlib
import logging

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.model_selection import cross_val_score

log = logging.getLogger(__name__)
BASELINE_ACCURACY = 0.90  # accuracy of the last known-good model (illustrative)

# Load data with provenance tracking
data = pd.read_csv("training_data.csv")
data_hash = hashlib.sha256(data.to_csv().encode()).hexdigest()
log.info(f"Training data hash: {data_hash}")

X = data.drop("label", axis=1)
y = data["label"]

# Detect and remove anomalous samples
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_mask = iso_forest.fit_predict(X) == 1
X_clean, y_clean = X[outlier_mask], y[outlier_mask]
log.info(f"Removed {(~outlier_mask).sum()} outliers from {len(X)} samples")

# Train and validate against baseline
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_clean, y_clean, cv=5)
if scores.mean() < BASELINE_ACCURACY - 0.05:
    raise ValueError("Model accuracy dropped — possible data poisoning")
model.fit(X_clean, y_clean)
```
Model inversion attacks reconstruct sensitive training data by querying the model and analyzing its outputs. An attacker can recover private information such as faces, medical records, or personal data used during training. This is particularly concerning for models trained on sensitive datasets (healthcare, biometrics, financial data).
Model inversion can violate data privacy regulations (GDPR, HIPAA) by exposing personally identifiable information from training data. An attacker with API access to a facial recognition model could reconstruct faces of individuals in the training set.
```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_model("face_classifier.h5")

@app.route("/predict", methods=["POST"])
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Returns full probability vector — enables model inversion!
    return jsonify({
        "probabilities": result.tolist(),  # All class probabilities!
        "prediction": int(np.argmax(result)),
        "confidence": float(np.max(result))
    })

# No rate limiting, no query logging
# Unlimited API access for gradient estimation
```
```python
import numpy as np
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100/hour"])
model = load_model("face_classifier_dp.h5")  # Trained with DP

@app.route("/predict", methods=["POST"])
@limiter.limit("100/hour")
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Return only top-1 prediction — no probability vector
    prediction = int(np.argmax(result))
    log_query(request.remote_addr, prediction)  # Audit logging
    return jsonify({
        "prediction": prediction
        # No probabilities, no confidence scores
    })
```
Membership inference attacks determine whether a specific data point was used in the model's training dataset. By analyzing the model's confidence scores and behavior on known vs. unknown inputs, attackers can infer private membership information. This is a significant privacy threat for models trained on sensitive data.
Membership inference can reveal that a specific individual's data was used for training — for example, confirming that a patient's record was in a clinical dataset, or that a person's face was used for surveillance training. This violates privacy expectations and potentially regulations like GDPR.
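The confidence gap this attack exploits is easy to observe on synthetic data: an overfit model is systematically more confident on the samples it trained on than on unseen ones. The dataset and model below are illustrative; the forest is deliberately grown to full depth so it memorizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

# Fully grown forest: no depth limit, so it overfits the training set
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

def mean_confidence(samples):
    # Average top-class probability, i.e. what a leaky API would expose
    return model.predict_proba(samples).max(axis=1).mean()

member_conf = mean_confidence(X_tr)      # samples that were in training
nonmember_conf = mean_confidence(X_te)   # samples that were not

# The gap is the attacker's signal: high confidence suggests "member"
print(f"members: {member_conf:.3f}, non-members: {nonmember_conf:.3f}")
```

Shrinking this gap via regularization and early stopping, and never exposing probabilities, is exactly what the secure example below does.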
```python
from sklearn.neural_network import MLPClassifier

# Overfitted model — memorizes training data
model = MLPClassifier(
    hidden_layer_sizes=(512, 512, 256),  # Over-parameterized!
    max_iter=1000,
    # No regularization
    # No early stopping
)
model.fit(X_train, y_train)
# Model memorizes training data → membership inference possible
# Training accuracy: 99.9% vs Test accuracy: 82%
# This gap indicates overfitting = information leakage

def predict_with_confidence(x):
    proba = model.predict_proba([x])[0]
    return {"probabilities": proba.tolist()}  # Leaks membership info!
```
```python
from sklearn.neural_network import MLPClassifier

# Regularized model with early stopping to reduce overfitting
model = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    max_iter=500,
    alpha=0.01,  # L2 regularization
    early_stopping=True,  # Prevents memorization
    validation_fraction=0.15,
)
model.fit(X_train, y_train)

# Verify train/test gap is small (low overfitting)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
assert train_acc - test_acc < 0.05, "Overfitting detected!"

def predict_secure(x):
    pred = model.predict([x])[0]
    return {"prediction": int(pred)}  # Label only, no probabilities
```
Model theft (model extraction) attacks create a functional copy of a proprietary ML model by systematically querying it and training a surrogate model on the input-output pairs. The stolen model can then be used to find adversarial examples, compete commercially, or reverse-engineer the model's training data.
A stolen model represents loss of intellectual property and competitive advantage. The extracted model can be used offline to craft adversarial attacks or to understand decision boundaries. Millions of dollars of training investment can be replicated with thousands of API queries.
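The extraction loop itself is short. This sketch plays both roles on synthetic data: a "victim" model that can only be queried, and an attacker who trains a surrogate on the stolen input/label pairs. The dataset, query budget, and model choices are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# "Victim" model; the attacker can only call victim.predict()
X, y = make_classification(n_samples=1000, n_features=8, random_state=2)
victim = LogisticRegression(max_iter=1000).fit(X, y)

# Attacker: sample query points roughly matching the input distribution,
# record the API's labels, then fit a surrogate on the stolen pairs
rng = np.random.default_rng(2)
queries = rng.normal(size=(5000, 8)) * X.std(axis=0) + X.mean(axis=0)
stolen_labels = victim.predict(queries)
surrogate = LogisticRegression(max_iter=1000).fit(queries, stolen_labels)

# Agreement between surrogate and victim on fresh inputs
fresh = rng.normal(size=(1000, 8)) * X.std(axis=0) + X.mean(axis=0)
agreement = (surrogate.predict(fresh) == victim.predict(fresh)).mean()
print(f"surrogate/victim agreement: {agreement:.2%}")
```

Rate limits and query-pattern detection, as in the secure example below, raise the cost of exactly this loop.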
```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_proprietary_model()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    result = model.predict_proba([data])[0]
    # Returns full probability distribution
    return jsonify({
        "probabilities": result.tolist(),
        "prediction": int(np.argmax(result))
    })

# No rate limiting — unlimited queries
# No anomaly detection on query patterns
# Attacker can extract model with ~10K queries
```
```python
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["50/hour"])

# Watermarked model for theft detection
model = load_watermarked_model()
query_monitor = QueryPatternDetector()

@app.route("/predict", methods=["POST"])
@limiter.limit("50/hour")
def predict():
    data = request.json["features"]
    api_key = request.headers.get("X-API-Key")
    # Detect extraction patterns (uniform sampling, grid queries)
    if query_monitor.is_suspicious(api_key, data):
        log_alert(f"Possible extraction: {api_key}")
        return jsonify({"error": "rate limited"}), 429
    result = model.predict([data])[0]
    return jsonify({
        "prediction": int(result)  # Label only, no probabilities
    })
```
AI supply chain attacks target the ML development pipeline: pre-trained models from model hubs, third-party datasets, ML frameworks, and dependencies. Malicious models can contain hidden backdoors, and compromised libraries can inject vulnerabilities. The serialization formats used by ML frameworks (Pickle, SavedModel) can execute arbitrary code on load.
Loading a malicious model file can execute arbitrary code (Pickle deserialization attacks). Pre-trained models from untrusted sources may contain backdoors. Compromised ML libraries affect all downstream users. The ML supply chain has fewer security controls than traditional software supply chains.
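The pickle risk is simple to demonstrate: `__reduce__` lets a serialized object name any callable to be invoked at load time. The `record` function below is a harmless stand-in for what could just as easily be `os.system`.

```python
import pickle

log = []

def record(msg):
    # Stand-in for an attacker-chosen callable such as os.system
    log.append(msg)

class MaliciousModel:
    def __reduce__(self):
        # Tells pickle: on load, call record("arbitrary code ran on load")
        return (record, ("arbitrary code ran on load",))

blob = pickle.dumps(MaliciousModel())  # what a booby-trapped .pkl contains
pickle.loads(blob)                     # merely loading the bytes runs record()
print(log)  # ['arbitrary code ran on load']
```

No method on the "model" is ever called; deserialization alone executes the payload, which is why safetensors-style formats that carry only tensors are preferred.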
```python
import pickle
import torch

# Loading untrusted model — arbitrary code execution!
with open("model_from_internet.pkl", "rb") as f:
    model = pickle.load(f)  # DANGEROUS: can execute any code!

# Loading unverified PyTorch model
model = torch.load("untrusted_model.pt")  # Uses pickle internally!

# Using unvetted model from public hub
from transformers import AutoModel
model = AutoModel.from_pretrained("random-user/suspicious-model")
# No hash verification, no security scan
```
```python
import hashlib

from safetensors.torch import load_file

# Verify model hash before loading
EXPECTED_HASH = "sha256:a1b2c3d4..."
with open("model.safetensors", "rb") as f:
    actual_hash = "sha256:" + hashlib.sha256(f.read()).hexdigest()
assert actual_hash == EXPECTED_HASH, "Model integrity check failed!"

# Use SafeTensors — no arbitrary code execution
model_state = load_file("model.safetensors")  # Safe format!
model = MyModel()
model.load_state_dict(model_state)

# Use trusted models from verified organizations
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "google/bert-base-uncased",  # Verified organization
    revision="a265f77",  # Pin to specific commit
)
```
Transfer learning attacks exploit the common practice of fine-tuning pre-trained models. Backdoors embedded in the base model persist through fine-tuning and remain active in the downstream model. An attacker who publishes a popular pre-trained model can compromise all applications that use it as a foundation.
Backdoors in pre-trained models survive fine-tuning because they are embedded in deep layers that are often frozen during transfer learning. A single compromised foundation model can affect thousands of downstream applications. The attack is scalable and difficult to detect.
```python
from transformers import AutoModelForSequenceClassification

# Fine-tuning an unvetted pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "unknown-user/bert-finetuned-sentiment",  # Untrusted source!
    num_labels=2
)

# Freezing base layers — preserves any hidden backdoor
for param in model.base_model.parameters():
    param.requires_grad = False  # Backdoor in frozen layers persists!

# Fine-tune only the classification head
trainer.train()  # Backdoor remains undetected
```
```python
from transformers import AutoModelForSequenceClassification
from neural_cleanse import BackdoorDetector

# Use only verified, trusted base models
model = AutoModelForSequenceClassification.from_pretrained(
    "google/bert-base-uncased",  # Trusted source
    num_labels=2,
    revision="main",
)

# Scan pre-trained model for backdoors before fine-tuning
detector = BackdoorDetector(model)
if detector.scan():
    raise SecurityError("Potential backdoor detected in base model")

# Fine-tune ALL layers (not just head) to overwrite potential backdoors
for param in model.parameters():
    param.requires_grad = True  # Train all layers

# Validate with clean test set + trigger test set
trainer.train()
evaluate_for_backdoors(model, trigger_test_set)
```
Model skewing occurs when the data distribution in production differs significantly from the training data distribution (training-serving skew). This can happen naturally over time (data drift) or be intentionally caused by attackers who manipulate the production input distribution to degrade model performance or bias predictions.
Model skewing causes silent failures where the model produces incorrect but confident predictions. In financial systems, attackers can exploit skew to bypass fraud detection. In recommendation systems, skew can be intentionally induced to promote specific content or products.
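Detecting skew usually comes down to a statistical comparison between the training distribution and a production window. A common choice is the two-sample Kolmogorov-Smirnov test; the distributions, window sizes, and threshold below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Reference: a feature's values seen at training time (synthetic)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)

# Production window 1: same distribution, no drift expected
prod_same = rng.normal(loc=0.0, scale=1.0, size=1000)
# Production window 2: shifted distribution, drift expected
prod_shifted = rng.normal(loc=0.8, scale=1.0, size=1000)

def drifted(reference, current, alpha=0.01):
    # Two-sample KS test: a small p-value means the two samples are
    # unlikely to come from the same distribution
    _, p_value = stats.ks_2samp(reference, current)
    return bool(p_value < alpha)

print(drifted(training_feature, prod_same))     # no drift expected
print(drifted(training_feature, prod_shifted))  # drift expected
```

A monitoring job can run a check like this per feature on each window and trigger retraining when drift persists, which is what the secure example below hints at with Evidently.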
```python
import joblib

# Deploy model with no drift monitoring
model = joblib.load("model_trained_2023.pkl")

def predict(features):
    # No check if input distribution has changed
    # No feature validation against training schema
    return model.predict([features])[0]

# Model may be months/years old
# No monitoring of prediction distribution
# Silent degradation goes undetected
```
```python
import logging

import joblib

log = logging.getLogger(__name__)
model = joblib.load("model.pkl")
training_stats = joblib.load("training_stats.pkl")

def predict_with_monitoring(features):
    # Validate feature schema and ranges
    for i, (val, stat) in enumerate(zip(features, training_stats)):
        z_score = abs((val - stat["mean"]) / stat["std"])
        if z_score > 5:
            log.warning(f"Feature {i} out of distribution: z={z_score:.1f}")
    prediction = model.predict([features])[0]
    # Log prediction distribution for drift monitoring
    metrics_collector.log(features, prediction)
    # Periodic drift detection (run by a monitoring job):
    # from evidently.metrics import ColumnDriftMetric
    # drift_report = ColumnDriftMetric().calculate(reference, current)
    # Alert if drift detected → trigger retraining
    return prediction
```
Output integrity attacks tamper with model predictions after they leave the model but before they reach the consuming application. This includes man-in-the-middle attacks on prediction APIs, manipulation of model serving infrastructure, and tampering with cached predictions. The attack targets the inference pipeline rather than the model itself.
Tampered predictions can cause incorrect decisions in downstream systems: approving fraudulent transactions, misdiagnosing medical conditions, or overriding safety systems. Because the model itself is not compromised, standard model monitoring will not detect the attack.
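The standard countermeasure, alongside TLS, is to have the serving side sign each response and the consumer verify the signature. A minimal sketch using only the standard library follows; the secret and payload are illustrative.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"demo-secret"  # illustrative; use a securely provisioned key

def sign(payload: dict) -> str:
    # Canonical JSON so both sides hash identical bytes
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str) -> bool:
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(sign(payload), signature)

# Server side: sign the prediction before sending it
prediction = {"prediction": 1, "model_version": "v3"}
signature = sign(prediction)

# Client side: verification passes for the untampered response...
print(verify(prediction, signature))   # True
# ...and fails if anything changed the payload in transit
tampered = {"prediction": 0, "model_version": "v3"}
print(verify(tampered, signature))     # False
```

Canonicalizing the payload (here via `sort_keys=True`) matters: both sides must hash byte-identical representations or valid responses will fail verification.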
```python
import requests

# Consuming model predictions over unencrypted HTTP
def get_prediction(features):
    response = requests.post(
        "http://ml-service/predict",  # HTTP, not HTTPS!
        json={"features": features}
    )
    result = response.json()
    # No integrity verification of the response
    # No validation of prediction format
    return result["prediction"]  # Could be tampered!
```
```python
import hashlib
import hmac

import requests

class IntegrityError(Exception):
    pass

def get_prediction(features):
    response = requests.post(
        "https://ml-service/predict",  # HTTPS (TLS)
        json={"features": features},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        verify=True  # Verify TLS certificate
    )
    result = response.json()

    # Verify response integrity with HMAC signature
    signature = response.headers.get("X-Signature")
    expected = hmac.new(
        SHARED_SECRET,  # shared key (bytes)
        str(result).encode(),
        hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise IntegrityError("Response signature mismatch!")

    # Validate prediction is within expected range
    pred = result["prediction"]
    if pred not in VALID_LABELS:
        raise ValueError(f"Unexpected prediction: {pred}")
    return pred
```
Model poisoning directly modifies the trained model's weights, parameters, or architecture to inject backdoors or alter behavior. Unlike data poisoning (which corrupts training data), model poisoning targets the model artifact itself — through compromised model repositories, insider threats, or supply chain attacks on the model storage and deployment pipeline.
A directly poisoned model can contain highly targeted backdoors that are virtually undetectable through standard testing. The attacker has precise control over the model's behavior on trigger inputs. If the model registry or deployment pipeline is compromised, every deployment uses the poisoned model.
```python
import mlflow

# Loading model from registry with no integrity checks
model_uri = "models:/fraud-detector/Production"
model = mlflow.pyfunc.load_model(model_uri)

# No hash verification
# No signature validation
# No comparison with expected model metrics
# Model registry has weak access controls
# Anyone with push access can replace the model

predictions = model.predict(new_data)
```
```python
import mlflow
from sigstore.verify import Verifier

# Verify model signature before loading
model_uri = "models:/fraud-detector/Production"
model_path = mlflow.artifacts.download_artifacts(model_uri)

# Cryptographic signature verification (schematic; see the
# sigstore-python docs for the exact Verifier API)
verifier = Verifier.production()
verifier.verify(
    model_path,
    expected_identity="ml-team@company.com"
)

# Verify model hash against approved registry
model_hash = hash_directory(model_path)
approved_hash = get_approved_hash("fraud-detector", "Production")
assert model_hash == approved_hash, "Model integrity check failed!"

# Validate model metrics on reference dataset before serving
model = mlflow.pyfunc.load_model(model_path)
ref_score = evaluate(model, reference_dataset)
assert ref_score >= MINIMUM_ACCURACY, "Model quality below threshold"

predictions = model.predict(new_data)
```
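The secure example references a `hash_directory` helper without defining it. A minimal standard-library sketch might look like the following; walking files in sorted order and hashing relative paths along with contents is one reasonable convention, not a standard.

```python
import hashlib
from pathlib import Path

def hash_directory(path: str) -> str:
    """Deterministic SHA-256 over every file in a model directory."""
    digest = hashlib.sha256()
    for file in sorted(Path(path).rglob("*")):
        if file.is_file():
            # Include the relative path so renames also change the hash
            digest.update(str(file.relative_to(path)).encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()
```

Sorting the walk order is what makes the digest reproducible across filesystems; without it, two identical directories could hash differently.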
| ID | Vulnerability | Severity | Key Mitigation |
|---|---|---|---|
| ML01 | Input Manipulation Attack | Critical | Adversarial training, input validation, confidence thresholds |
| ML02 | Data Poisoning Attack | Critical | Outlier detection, data provenance, baseline comparison |
| ML03 | Model Inversion Attack | High | Differential Privacy, minimal output, rate limiting |
| ML04 | Membership Inference Attack | High | Regularization, DP training, no probability exposure |
| ML05 | Model Theft | Critical | Rate limiting, watermarking, query pattern detection |
| ML06 | AI Supply Chain Attacks | Critical | SafeTensors, hash verification, trusted sources only |
| ML07 | Transfer Learning Attack | High | Trusted base models, backdoor scanning, full fine-tuning |
| ML08 | Model Skewing | Medium | Drift monitoring, input validation, auto-retraining |
| ML09 | Output Integrity Attack | High | TLS/mTLS, response signing, output validation |
| ML10 | Model Poisoning | Critical | Model signing, registry access control, metric validation |