Testing Strategies
Testing AI prompts is essential for ensuring consistent, high-quality outputs. This guide covers comprehensive testing strategies for Claro prompts, from development to production.
Why Test Prompts?
AI prompts require testing because:
Variability - LLMs can produce different outputs for the same input
Edge Cases - Unexpected inputs can lead to poor responses
Version Changes - Updates may introduce unintended behavior
Quality Assurance - Ensure prompts meet business requirements
Unlike traditional code, AI prompts can’t guarantee deterministic outputs. Testing focuses on quality patterns and acceptable ranges rather than exact matches.
Testing Before Deployment
Manual Testing in Dashboard
The fastest way to test prompts before publishing:
Create Draft Version
In the Bayt OS dashboard, create a new draft or edit an existing prompt
Use Preview Mode
Test the prompt directly in the editor:
Enter sample inputs
Review generated outputs
Test edge cases
Verify tone and style
Iterate and Refine
Make adjustments based on test results, then re-test
Publish When Ready
Once satisfied with test results, publish the version
Test prompts with diverse input types:
Happy Path
Edge Cases
Adversarial
Multilingual
Expected Use Cases
Typical user queries
Well-formed inputs
Common scenarios
Standard requests
These should produce the best responses. Unusual Inputs
Very short inputs
Very long inputs
Ambiguous requests
Multiple questions at once
Off-topic queries
Test how the prompt handles unexpected inputs. Challenging Inputs
Attempts to break instructions
Conflicting requirements
Inappropriate requests
Gibberish or random text
Ensure the prompt maintains guardrails. Language Variations
Different languages
Mixed language inputs
Accents and dialects
Translation scenarios
If your prompt supports multiple languages.
Unit Testing with Mocked Responses
Testing Integration Code
Test your application’s integration with Claro independently:
import pytest
from unittest.mock import Mock, patch
from baytos.claro import BaytClient
# Mock prompt response
MOCK_PROMPT = {
"title" : "Customer Support" ,
"generator" : "You are a helpful customer support agent..." ,
"package_name" : "@workspace/support:v1" ,
"version" : 1
}
class TestPromptIntegration :
"""Test application logic without calling Claro API"""
@patch ( 'claro.BaytClient.get_prompt' )
def test_prompt_loading ( self , mock_get_prompt ):
"""Test that prompt is loaded correctly"""
# Setup mock
mock_get_prompt.return_value = Mock( ** MOCK_PROMPT )
# Test code
client = BaytClient( api_key = "test_key" )
prompt = client.get_prompt( "@workspace/support:v1" )
# Assertions
assert prompt.title == "Customer Support"
assert "customer support agent" in prompt.generator
mock_get_prompt.assert_called_once_with( "@workspace/support:v1" )
@patch ( 'claro.BaytClient.get_prompt' )
def test_prompt_error_handling ( self , mock_get_prompt ):
"""Test handling of prompt loading errors"""
from baytos.claro import BaytNotFoundError
# Setup mock to raise error
mock_get_prompt.side_effect = BaytNotFoundError( "Prompt not found" )
# Test error handling
client = BaytClient( api_key = "test_key" )
with pytest.raises(BaytNotFoundError):
client.get_prompt( "@workspace/nonexistent:v1" )
def test_prompt_caching ( self ):
"""Test that prompts are cached appropriately"""
# Your caching logic tests
pass
Testing LLM Integration
Test the complete flow with mocked LLM responses:
import pytest
from unittest.mock import Mock, patch
from your_app import CustomerSupportBot
class TestCustomerSupportBot :
"""Test bot behavior with mocked responses"""
@patch ( 'openai.ChatCompletion.create' )
@patch ( 'claro.BaytClient.get_prompt' )
def test_support_response ( self , mock_get_prompt , mock_openai ):
"""Test complete support flow"""
# Mock Claro prompt
mock_get_prompt.return_value = Mock(
generator = "You are a helpful support agent..."
)
# Mock OpenAI response
mock_openai.return_value = Mock(
choices = [Mock( message = Mock( content = "I can help you with that..." ))]
)
# Test bot
bot = CustomerSupportBot()
response = bot.respond_to_query( "How do I reset my password?" )
# Verify
assert "help you" in response
mock_get_prompt.assert_called_once()
mock_openai.assert_called_once()
@patch ( 'openai.ChatCompletion.create' )
@patch ( 'claro.BaytClient.get_prompt' )
def test_error_handling ( self , mock_get_prompt , mock_openai ):
"""Test bot handles LLM errors gracefully"""
mock_get_prompt.return_value = Mock( generator = "..." )
mock_openai.side_effect = Exception ( "API Error" )
bot = CustomerSupportBot()
response = bot.respond_to_query( "test" )
# Should return fallback message
assert "error occurred" in response.lower()
Integration Testing
Testing with Real API Calls
Create integration tests that call the actual Claro API:
import pytest
import os
from baytos.claro import BaytClient
# Mark as integration test (can be skipped in CI)
@pytest.mark.integration
class TestClaroIntegration :
"""Integration tests using real Claro API"""
@ classmethod
def setup_class ( cls ):
"""Setup test client"""
api_key = os.getenv( "BAYT_API_KEY_TEST" )
if not api_key:
pytest.skip( "BAYT_API_KEY_TEST not set" )
cls .client = BaytClient( api_key = api_key)
def test_get_prompt ( self ):
"""Test fetching a real prompt"""
prompt = self .client.get_prompt( "@workspace/test-prompt:v1" )
assert prompt.title
assert prompt.generator
assert prompt.version == 1
def test_list_prompts ( self ):
"""Test listing workspace prompts"""
result = self .client.list_prompts( limit = 10 )
assert 'prompts' in result
assert len (result[ 'prompts' ]) <= 10
def test_prompt_with_context ( self ):
"""Test prompt with file context"""
prompt = self .client.get_prompt( "@workspace/code-review:v1" )
if prompt.has_context():
contexts = prompt.get_file_contexts()
assert len (contexts) > 0
Running Integration Tests
Separate integration tests from unit tests:
# Run only unit tests (fast)
pytest tests/ -m "not integration"
# Run all tests including integration (slower)
pytest tests/
# Run only integration tests
pytest tests/ -m integration
A/B Testing Approaches
Comparing Prompt Versions
Test multiple versions side-by-side:
from baytos.claro import BaytClient
import openai
def compare_prompt_versions ( query : str , versions : list[ str ]):
"""Compare outputs from different prompt versions"""
client = BaytClient( api_key = os.getenv( "BAYT_API_KEY" ))
openai_client = openai.OpenAI( api_key = os.getenv( "OPENAI_API_KEY" ))
results = {}
for version in versions:
# Get prompt version
prompt = client.get_prompt( f "@workspace/support: { version } " )
# Generate response
response = openai_client.chat.completions.create(
model = "gpt-4" ,
messages = [
{ "role" : "system" , "content" : prompt.generator},
{ "role" : "user" , "content" : query}
]
)
results[version] = {
"prompt" : prompt.generator[: 100 ] + "..." ,
"response" : response.choices[ 0 ].message.content,
"tokens" : response.usage.total_tokens
}
return results
# Test different versions
query = "How do I reset my password?"
comparison = compare_prompt_versions(query, [ "v1" , "v2" , "v3" ])
for version, result in comparison.items():
print ( f " \n { version } :" )
print ( f "Response: { result[ 'response' ] } " )
print ( f "Tokens: { result[ 'tokens' ] } " )
Production A/B Testing
Gradually roll out new versions in production:
import random
from baytos.claro import BaytClient
class PromptSelector :
"""Select prompt version for A/B testing"""
def __init__ ( self , versions_config : dict ):
"""
versions_config = {
"v1": 0.7, # 70% traffic
"v2": 0.3 # 30% traffic
}
"""
self .versions = versions_config
self .client = BaytClient( api_key = os.getenv( "BAYT_API_KEY" ))
def get_prompt ( self , user_id : str , base_package : str ):
"""Get prompt version based on A/B split"""
# Use user_id hash for consistent assignment
rand = ( hash (user_id) % 100 ) / 100
cumulative = 0
for version, percentage in self .versions.items():
cumulative += percentage
if rand <= cumulative:
return self .client.get_prompt( f " { base_package } : { version } " )
# Fallback
return self .client.get_prompt( f " { base_package } :v1" )
# Usage
selector = PromptSelector({
"v1" : 0.9 , # 90% of users
"v2" : 0.1 # 10% of users (new version)
})
prompt = selector.get_prompt(
user_id = "user_123" ,
base_package = "@workspace/support"
)
Tracking A/B Test Results
Log metrics for each version:
import logging
from datetime import datetime
class ABTestLogger :
"""Log A/B test metrics"""
def __init__ ( self ):
self .logger = logging.getLogger( "ab_test" )
def log_interaction (
self ,
user_id : str ,
version : str ,
query : str ,
response : str ,
success : bool ,
response_time : float
):
"""Log each interaction for analysis"""
self .logger.info({
"timestamp" : datetime.utcnow().isoformat(),
"user_id" : user_id,
"version" : version,
"query_length" : len (query),
"response_length" : len (response),
"success" : success,
"response_time_ms" : response_time * 1000
})
# Usage
logger = ABTestLogger()
start = time.time()
response = generate_response(prompt, user_query)
elapsed = time.time() - start
logger.log_interaction(
user_id = "user_123" ,
version = "v2" ,
query = user_query,
response = response,
success = True , # Based on your success criteria
response_time = elapsed
)
CI/CD Integration
GitHub Actions Example
Automate testing in your CI pipeline:
# .github/workflows/test.yml
name : Test Claro Integration
on :
push :
branches : [ main , develop ]
pull_request :
branches : [ main ]
jobs :
test :
runs-on : ubuntu-latest
steps :
- uses : actions/checkout@v3
- name : Set up Python
uses : actions/setup-python@v4
with :
python-version : '3.11'
- name : Install dependencies
run : |
pip install -r requirements.txt
pip install pytest pytest-cov
- name : Run unit tests
run : pytest tests/ -m "not integration" --cov
- name : Run integration tests
env :
BAYT_API_KEY_TEST : ${{ secrets.BAYT_API_KEY_TEST }}
run : pytest tests/ -m integration
continue-on-error : true # Don't fail build on integration test failures
- name : Upload coverage
uses : codecov/codecov-action@v3
Pre-commit Hooks
Test before committing:
#!/bin/bash
# .git/hooks/pre-commit
echo "Running tests..."
# Run unit tests
pytest tests/ -m "not integration" --quiet
if [ $? -ne 0 ]; then
echo "Tests failed. Commit aborted."
exit 1
fi
echo "Tests passed!"
exit 0
Best Practices
Test with Real User Inputs
Automate Regression Testing
Create a test suite that runs automatically: # tests/test_prompts.py
import pytest
from your_app import get_response
# Golden test cases
TEST_CASES = [
{
"input" : "How do I reset my password?" ,
"expected_topics" : [ "password" , "reset" , "email" ],
"max_length" : 500
},
{
"input" : "What are your business hours?" ,
"expected_topics" : [ "hours" , "time" , "available" ],
"max_length" : 200
}
]
@pytest.mark.parametrize ( "test_case" , TEST_CASES )
def test_prompt_quality ( test_case ):
response = get_response(test_case[ "input" ])
# Check topics mentioned
for topic in test_case[ "expected_topics" ]:
assert topic.lower() in response.lower()
# Check length
assert len (response) <= test_case[ "max_length" ]
Track metrics across versions:
Response time
Token usage
User satisfaction (thumbs up/down)
Error rates
Escalation rates (for support prompts)
Compare metrics between versions to identify improvements or regressions.
Test Prompt Instructions Separately
Validate that prompts follow their instructions: def test_prompt_follows_instructions ():
"""Test that responses follow prompt constraints"""
# If prompt says "respond in under 100 words"
response = generate_response( "Tell me about your company" )
word_count = len (response.split())
assert word_count <= 100 , f "Response too long: { word_count } words"
Use Version Pinning in Tests
Always pin to specific versions in tests: # ✅ Good: Specific version
prompt = client.get_prompt( "@workspace/support:v1" )
# ❌ Bad: Latest version (tests become unpredictable)
prompt = client.get_prompt( "@workspace/support:latest" )
Test how your application handles errors: def test_handles_prompt_not_found ():
"""Gracefully handle missing prompts"""
from baytos.claro import BaytNotFoundError
try :
prompt = client.get_prompt( "@workspace/nonexistent:v1" )
response = use_prompt(prompt)
except BaytNotFoundError:
response = "I'm having trouble right now. Please try again later."
assert response # Should always return something
assert "error" in response.lower() or "trouble" in response.lower()
Recommended Testing Stack
pytest Python testing framework
Powerful fixtures
Parametrized tests
Great mocking support
pytest-mock Mocking library
Easy API mocking
Patch functions
Verify call counts
pytest-cov Coverage reporting
Track test coverage
Identify untested code
CI integration
Locust Load testing
Test at scale
Simulate concurrent users
Performance metrics
Load Testing Example
Test prompt performance under load:
from locust import HttpUser, task, between
import os
class PromptLoadTest ( HttpUser ):
"""Load test Claro API integration"""
wait_time = between( 1 , 3 )
def on_start ( self ):
"""Setup test data"""
self .api_key = os.getenv( "BAYT_API_KEY_TEST" )
self .headers = {
"Authorization" : f "Bearer { self .api_key } "
}
@task ( 3 )
def get_prompt ( self ):
"""Test prompt fetching (most common operation)"""
self .client.get(
"https://api.baytos.ai/v1/prompts/@workspace/support:v1" ,
headers = self .headers,
name = "get_prompt"
)
@task ( 1 )
def list_prompts ( self ):
"""Test listing prompts"""
self .client.get(
"https://api.baytos.ai/v1/prompts?limit=20" ,
headers = self .headers,
name = "list_prompts"
)
Run load tests:
locust -f locustfile.py --host=https://api.baytos.ai
Troubleshooting Tests
Tests pass locally but fail in CI
Common causes:
Environment variables not set in CI
Different Python versions
Timezone or locale differences
Solution:
Ensure CI environment matches local environment. Set all required secrets in CI.
Common causes:
Network timeouts
Rate limiting
Non-deterministic LLM outputs
Solution:
Add retries for integration tests:@pytest.mark.integration
@pytest.mark.flaky ( reruns = 3 )
def test_api_call ():
# Test will retry up to 3 times if it fails
pass
Mocks not working as expected
Common causes:
Patching wrong location
Import order issues
Mock not properly configured
Solution:
Patch where the function is used, not where it’s defined:# If your_app.py imports: from baytos.claro import BaytClient
# Patch at usage location:
@patch ( 'your_app.BaytClient.get_prompt' ) # ✅ Correct
# Not at definition:
@patch ( 'claro.BaytClient.get_prompt' ) # ❌ Won't work
Next Steps
Error Handling Learn comprehensive error handling for robust tests
Performance Guide Optimize prompt fetching and caching
Security Secure API keys in test environments
Advanced Patterns Production-ready integration patterns