Testing AI prompts is essential for ensuring consistent, high-quality outputs. This guide covers comprehensive testing strategies for Claro prompts, from development to production.
Variability - LLMs can produce different outputs for the same input
Edge Cases - Unexpected inputs can lead to poor responses
Version Changes - Updates may introduce unintended behavior
Quality Assurance - Ensure prompts meet business requirements
Unlike traditional code, AI prompts can’t guarantee deterministic outputs. Testing focuses on quality patterns and acceptable ranges rather than exact matches.
```bash
# Run only unit tests (fast)
pytest tests/ -m "not integration"

# Run all tests including integration (slower)
pytest tests/

# Run only integration tests
pytest tests/ -m integration
```
```python
# tests/test_prompts.py
import pytest
from your_app import get_response

# Golden test cases
TEST_CASES = [
    {
        "input": "How do I reset my password?",
        "expected_topics": ["password", "reset", "email"],
        "max_length": 500
    },
    {
        "input": "What are your business hours?",
        "expected_topics": ["hours", "time", "available"],
        "max_length": 200
    }
]

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_prompt_quality(test_case):
    response = get_response(test_case["input"])

    # Check that expected topics are mentioned
    for topic in test_case["expected_topics"]:
        assert topic.lower() in response.lower()

    # Check length
    assert len(response) <= test_case["max_length"]
```
Monitor Quality Metrics
Track metrics across versions:
Response time
Token usage
User satisfaction (thumbs up/down)
Error rates
Escalation rates (for support prompts)
Compare metrics between versions to identify improvements or regressions.
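A minimal sketch of how this can be wired up, assuming a placeholder record_metric sink and the generate_response helper used elsewhere in this guide; the key idea is to tag every measurement with the prompt version so versions can be compared side by side:

```python
import time

def record_metric(name, value, tags):
    """Placeholder metrics sink; replace with your analytics/observability backend."""
    print(f"{name}={value} tags={tags}")

def answer_with_metrics(question, version="@workspace/support:v1"):
    """Answer a question and record basic quality metrics tagged by prompt version."""
    start = time.monotonic()
    response = generate_response(question)  # your app's response function, as in the examples below
    latency = time.monotonic() - start

    record_metric("response_time_s", latency, tags={"prompt_version": version})
    # Word count is only a rough proxy; log real token usage if your client reports it.
    record_metric("response_words", len(response.split()), tags={"prompt_version": version})
    return response
```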
Test Prompt Instructions Separately
Validate that prompts follow their instructions:
```python
def test_prompt_follows_instructions():
    """Test that responses follow prompt constraints"""
    # If the prompt says "respond in under 100 words"
    response = generate_response("Tell me about your company")
    word_count = len(response.split())
    assert word_count <= 100, f"Response too long: {word_count} words"
```
Use Version Pinning in Tests
Always pin to specific versions in tests:
```python
# ✅ Good: Specific version
prompt = client.get_prompt("@workspace/support:v1")

# ❌ Bad: Latest version (tests become unpredictable)
prompt = client.get_prompt("@workspace/support:latest")
```
Test Failure Modes
Test how your application handles errors:
```python
def test_handles_prompt_not_found():
    """Gracefully handle missing prompts"""
    from baytos.claro import BaytNotFoundError

    try:
        prompt = client.get_prompt("@workspace/nonexistent:v1")
        response = use_prompt(prompt)
    except BaytNotFoundError:
        response = "I'm having trouble right now. Please try again later."

    assert response  # Should always return something
    assert "error" in response.lower() or "trouble" in response.lower()
```
Solution:
Ensure the CI environment matches your local environment, and set all required secrets in CI.
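One pattern that helps is skipping integration tests with an explicit reason when a required secret is missing, so a misconfigured pipeline surfaces a clearly labelled skip instead of an opaque authentication failure. A sketch, assuming the client reads an API key from an environment variable named CLARO_API_KEY (substitute whatever your setup uses):

```python
import os
import pytest

# CLARO_API_KEY is an assumed variable name; substitute whatever your client reads.
requires_api_key = pytest.mark.skipif(
    not os.environ.get("CLARO_API_KEY"),
    reason="CLARO_API_KEY is not set in this environment",
)

@requires_api_key
@pytest.mark.integration
def test_live_prompt_fetch():
    # client configured as in the examples above
    prompt = client.get_prompt("@workspace/support:v1")
    assert prompt is not None
```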
Flaky integration tests
Common causes:
Network timeouts
Rate limiting
Non-deterministic LLM outputs
Solution:
Add retries for integration tests:
```python
@pytest.mark.integration
@pytest.mark.flaky(reruns=3)  # provided by the pytest-rerunfailures plugin
def test_api_call():
    # Test will retry up to 3 times if it fails
    pass
```
Mocks not working as expected
Common causes:
Patching wrong location
Import order issues
Mock not properly configured
Solution:
Patch the name where it's used, not where it's defined:
```python
# If your_app.py imports: from baytos.claro import BaytClient

# Patch at the usage location:
@patch('your_app.BaytClient')        # ✅ Correct

# Not at the definition location:
@patch('baytos.claro.BaytClient')    # ❌ your_app still uses its original import
```
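Putting it together, a sketch of a complete mocked test; fetch_support_reply is a hypothetical function in your_app that constructs a BaytClient and requests the pinned prompt version:

```python
from unittest.mock import MagicMock, patch

@patch("your_app.BaytClient")  # patched where your_app looks the name up
def test_support_reply_uses_pinned_prompt(mock_client_cls):
    mock_client = mock_client_cls.return_value
    mock_client.get_prompt.return_value = MagicMock(text="You are a support agent.")

    from your_app import fetch_support_reply  # hypothetical application helper
    fetch_support_reply("How do I reset my password?")

    # The hypothetical helper is expected to request the pinned version.
    mock_client.get_prompt.assert_called_once_with("@workspace/support:v1")
```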