
Unit Testing Your Prompts: Strategies for Reliable AI Outputs

Prompts are code. If they are code, they must be tested. Learn how to apply standard unit testing principles to your LLM prompts for reliable AI features.

Posted on: 2026-03-22


“It worked on my machine.” We’ve all said it. But with Large Language Models (LLMs), it’s more like: “It worked the first time I ran it.”

LLMs are non-deterministic. A prompt that works perfectly today might fail tomorrow after a model version bump, and any nonzero temperature adds sampling randomness to every call. If you’re building production software, you can’t rely on “vibes.” You need unit testing.

Why Test Prompts?

Prompts sit on the critical path of your product just like any other code. Model upgrades, prompt tweaks, and sampling randomness can all silently change behavior, and without tests the first person to notice a regression is a user. Tests give you a repeatable signal that a prompt still does what you shipped it to do.

The AI Testing Strategy

Three techniques cover most cases: structural validation, assertion-based testing, and semantic similarity against a golden set.

1. Structural Validation

The simplest test is to ensure the output is well-formed. If you expect a JSON object with specific keys, use Pydantic or a simple JSON schema to validate it.

import json

def test_json_structure():
    # A canned example of what the model should return.
    llm_output = '{"name": "Gemini", "role": "Assistant"}'
    data = json.loads(llm_output)  # raises json.JSONDecodeError if malformed
    
    assert "name" in data
    assert "role" in data
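In practice, a raw json.loads often fails because models like to wrap JSON in markdown code fences. A stdlib-only sketch of a more forgiving parser (the helper name and sample output are illustrative, not from any particular library):

```python
import json

def extract_json(raw: str) -> dict:
    """Strip optional markdown code fences, then parse.

    Models frequently return ```json ... ``` around the payload,
    which breaks a naive json.loads on the raw string.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional language tag)
        # and everything after the closing fence.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

def test_json_structure_fenced():
    llm_output = '```json\n{"name": "Gemini", "role": "Assistant"}\n```'
    data = extract_json(llm_output)
    assert set(data) == {"name", "role"}
    assert isinstance(data["name"], str)

test_json_structure_fenced()
```

For stricter guarantees (types, required fields, enums), feed the parsed dict into a Pydantic model or a JSON Schema validator instead of hand-rolled asserts.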

2. Assertion-Based Testing

Just like traditional unit tests, you can assert that certain strings must or must not be present in the output.

def test_prompt_contains_keywords():
    prompt = "Explain quantum computing to a 5-year-old."
    output = call_llm(prompt) # Your LLM call function
    
    assert "qubit" in output.lower()
    assert "superposition" in output.lower()
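The “must not be present” half is just as useful: you can assert that refusal boilerplate, competitor names, or leaked system-prompt text never appear. A minimal sketch, with call_llm stubbed out for illustration (the stub and the marker list are assumptions, not a standard API):

```python
def call_llm(prompt: str) -> str:
    # Stub for illustration; a real test would call your model wrapper.
    return "Qubits use superposition to hold many states at once."

# Phrases that should never leak into user-facing answers.
REFUSAL_MARKERS = ["i'm sorry", "as an ai", "i cannot help"]

def test_no_refusal_boilerplate():
    output = call_llm("Explain quantum computing to a 5-year-old.")
    lowered = output.lower()
    for marker in REFUSAL_MARKERS:
        # Negative assertion: the phrase must NOT appear.
        assert marker not in lowered, f"unexpected phrase: {marker!r}"

test_no_refusal_boilerplate()
```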

3. Semantic Similarity (The “Golden Set”)

Sometimes, exact string matching isn’t enough. You want to ensure the response is semantically similar to a “Golden Answer” you’ve pre-approved.

You can use cosine similarity between embeddings to measure this:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def test_semantic_match():
    golden_answer = "The capital of Thailand is Bangkok."
    actual_output = "Bangkok is the capital city of Thailand."
    
    emb1 = model.encode(golden_answer)
    emb2 = model.encode(actual_output)
    
    # cos_sim returns a 1x1 tensor; .item() extracts the float.
    similarity = util.cos_sim(emb1, emb2).item()
    assert similarity > 0.85  # threshold is empirical; tune it per task
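Under the hood, cosine similarity is just the dot product of the two vectors divided by the product of their magnitudes. A dependency-free sketch with toy 3-dimensional vectors (real sentence embeddings have hundreds of dimensions, but the math is identical; the numbers here are made up):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a golden answer and an actual output.
golden = [0.9, 0.1, 0.3]
actual = [0.8, 0.2, 0.35]

score = cosine_similarity(golden, actual)
assert score > 0.85
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why a threshold like 0.85 reads as “very close in meaning, but not verbatim.”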

Integrating into CI/CD

Don’t just run these tests locally. Wire them into GitHub Actions or GitLab CI so every change that touches a prompt gets checked — and remember that each run makes real API calls, so cache responses or use a cheaper model where you can.
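A hypothetical GitHub Actions workflow might look like this (job names, file paths, and the secret name are placeholders for your own setup):

```yaml
name: prompt-tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/prompts/
        env:
          # Expose your model API key to the test run via repo secrets.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```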

Tools of the Trade

You don’t have to build all of this by hand. Frameworks such as promptfoo and DeepEval ship ready-made structural, assertion, and similarity checks for prompts, and plain pytest works fine for orchestrating the tests above.

Conclusion

Treating prompts as anything less than code is a recipe for production disasters. By applying standard engineering principles—validation, assertions, and CI/CD integration—you can build AI features that are not only powerful but reliable.