Using AI for Realistic Data Generation and Augmentation in Your Tests
Improve your software testing strategy by using AI to generate realistic, diverse datasets and augment existing test data.
Posted on: 2026-03-16 by AI Assistant

Robust automated tests often need large amounts of varied, realistic mock data. Libraries like Faker are a good start, but their output often looks “synthetic” and misses the edge cases and the relationships between fields found in real production data.
In this tutorial, you will learn how to use Large Language Models (LLMs) to generate realistic, interrelated test data for your applications.
Prerequisites
- Python 3.10+
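- Access to an LLM API (e.g., OpenAI, Gemini, or a local model via Ollama). The example below uses Gemini through the google-genai package (pip install google-genai) and reads the key from the GEMINI_API_KEY environment variable.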
- Basic knowledge of JSON.
The Power of AI in Data Generation
LLMs are good at using context to generate structured data that follows rules described in plain language. Instead of hardcoding those rules in a mock data generator, you describe the data you need in a prompt. The model is not guaranteed to follow every rule, so check its output before relying on it in tests.
Generating Complex JSON Objects
Let’s say you need realistic user profiles that include a bio, a history of varied purchases, and an address that logically matches their demographic profile.
import os
import json

from google import genai

# The API key is read from the environment so it never lands in source control.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


def generate_mock_users(count=3):
    """Ask the model for `count` user profiles and return them as a list of dicts."""
    response = client.models.generate_content(
        model='gemini-2.5-pro',
        contents=f"""
        Generate an array of {count} highly realistic mock user profiles in JSON format.
        Each profile should include:
        - id (UUID)
        - full_name
        - age (between 25 and 65)
        - occupation
        - bio (2 sentences summarizing their career and hobbies, must make sense with age and occupation)
        - recent_purchases (array of 2-4 items they might logically buy)
        Return ONLY valid JSON.
        """
    )
    # Clean up the response in case the model added markdown blocks
    json_text = response.text.replace("```json", "").replace("```", "").strip()
    return json.loads(json_text)


if __name__ == "__main__":
    mock_data = generate_mock_users(2)
    print(json.dumps(mock_data, indent=2))
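The cleanup step in generate_mock_users relies on plain string replacement, which would also strip backticks that happen to appear inside a generated value. A slightly more defensive variant is sketched below. The helper name extract_json is an assumption made for illustration and is not part of any SDK; it only looks for one optional markdown fence and otherwise parses the reply as-is.

```python
import json
import re

# Three backticks, built programmatically to keep this listing easy to embed.
FENCE = "`" * 3

# Captures the body of the first fenced block, with or without a "json" tag.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(raw_text: str):
    """Parse JSON from a model reply, tolerating an optional markdown fence.

    Hypothetical helper for illustration; adapt it to your own pipeline.
    """
    match = FENCE_RE.search(raw_text)
    candidate = match.group(1) if match else raw_text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError as exc:
        # Include a short preview so a failing test shows what the model sent.
        preview = raw_text[:200]
        raise ValueError(f"Model reply was not valid JSON: {preview!r}") from exc
```

Inside generate_mock_users, this could stand in for the last two lines: return extract_json(response.text). Raising a ValueError with a preview of the reply makes a malformed response easier to diagnose than a bare JSONDecodeError.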
The Output
When you run this script, you get logically consistent profiles rather than random strings. The output varies from run to run, but it looks something like this:
[
  {
    "id": "e4b5...",
    "full_name": "Elena Rodriguez",
    "age": 34,
    "occupation": "Graphic Designer",
    "bio": "Elena has spent the last decade creating brand identities for tech startups. In her free time, she enjoys urban sketching and attending gallery openings.",
    "recent_purchases": ["Wacom Cintiq Pro", "Moleskine Sketchbook", "Espresso Beans"]
  },
  {
    "id": "a1c2...",
    "full_name": "David Chen",
    "age": 52,
    "occupation": "High School Science Teacher",
    "bio": "A passionate educator dedicated to making chemistry accessible and fun. David is also an avid amateur astronomer and loves weekend hiking trips.",
    "recent_purchases": ["Telescope Lens Kit", "Hiking Boots", "Safety Goggles"]
  }
]
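Because the model can ignore parts of the prompt, it may help to check each profile against the rules you asked for before using it in a test. The sketch below mirrors the constraints from the prompt above (UUID id, age between 25 and 65, 2-4 purchases). The function name validate_profile and the error messages are assumptions chosen for illustration.

```python
import uuid


def validate_profile(profile: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the profile passed."""
    errors = []

    # The prompt asked for a UUID; uuid.UUID raises ValueError on malformed input.
    try:
        uuid.UUID(str(profile.get("id", "")))
    except ValueError:
        errors.append("id is not a valid UUID")

    age = profile.get("age")
    if not isinstance(age, int) or not 25 <= age <= 65:
        errors.append("age must be an integer between 25 and 65")

    for field in ("full_name", "occupation", "bio"):
        value = profile.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field} must be a non-empty string")

    purchases = profile.get("recent_purchases")
    if not isinstance(purchases, list) or not 2 <= len(purchases) <= 4:
        errors.append("recent_purchases must be a list of 2-4 items")

    return errors


if __name__ == "__main__":
    # Hand-written sample so the check runs without calling an API.
    sample = {
        "id": str(uuid.uuid4()),
        "full_name": "Test User",
        "age": 34,
        "occupation": "Graphic Designer",
        "bio": "Designs brand identities. Enjoys urban sketching.",
        "recent_purchases": ["Sketchbook", "Espresso Beans"],
    }
    print(validate_profile(sample))
```

Returning a list of messages instead of raising on the first problem lets you log every violation for a profile and decide whether to discard it or ask the model again.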
Augmenting Existing Data
You can also use AI to augment existing data, for example by taking a clean string and introducing realistic typos, or by translating data to test internationalization features.
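As a minimal sketch of the typo case, the helper below builds a prompt and sends it through whatever text-generation function you pass in. Both function names and the generate parameter are hypothetical; generate is assumed to be any callable that takes a prompt string and returns the model's reply as a string, which keeps the helper independent of a specific provider.

```python
from typing import Callable


def build_typo_prompt(text: str, typo_count: int = 2) -> str:
    """Build a prompt asking the model to add keyboard-style typos to `text`."""
    return (
        f"Rewrite the text below with {typo_count} realistic typing mistakes "
        "(adjacent-key slips, swapped letters, or dropped characters). "
        "Keep the meaning and language unchanged. "
        "Return ONLY the rewritten text.\n\n"
        f"Text: {text}"
    )


def augment_with_typos(
    text: str, generate: Callable[[str], str], typo_count: int = 2
) -> str:
    """Send the typo prompt through `generate` and return the trimmed reply."""
    reply = generate(build_typo_prompt(text, typo_count)).strip()
    # Fall back to the original text if the model returns nothing usable.
    return reply or text


if __name__ == "__main__":
    # A canned reply stands in for a real model so the sketch runs offline.
    fake_model = lambda prompt: "Pleas confrim your adress"
    print(augment_with_typos("Please confirm your address", fake_model))
```

With the Gemini client from the earlier script, generate could be a small wrapper such as lambda p: client.models.generate_content(model='gemini-2.5-pro', contents=p).text. The same pattern would apply to translation by swapping the prompt.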
Conclusion & Next Steps
AI allows you to move beyond simple mock data and simulate real-world complexity in your testing environments.
For your next step, try generating data that explicitly violates your business rules (e.g., “Generate a user profile with an invalid email format and an age of -5”) to test your error handling and validation logic thoroughly.
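A negative test built on that idea might look like the sketch below. The validate_user function is a hypothetical stand-in for your application's own validation logic, the email pattern is deliberately simplistic, and the invalid profile is hardcoded so the example runs without an API key; in practice it would come from the model.

```python
import re

# Deliberately simple pattern for illustration; real email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_user(user: dict) -> list[str]:
    """Hypothetical stand-in for the validation logic under test."""
    errors = []
    if not EMAIL_RE.match(str(user.get("email", ""))):
        errors.append("invalid email")
    age = user.get("age")
    if not isinstance(age, int) or age < 0:
        errors.append("invalid age")
    return errors


def test_rejects_invalid_profile():
    # Mirrors the example prompt: an invalid email format and an age of -5.
    invalid_user = {"email": "not-an-email", "age": -5}
    assert validate_user(invalid_user) == ["invalid email", "invalid age"]


def test_accepts_valid_profile():
    assert validate_user({"email": "elena@example.com", "age": 34}) == []
```

Both functions follow the test_ naming convention, so a runner such as pytest would collect them automatically.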