
Testing Strategies

Testing AI prompts is essential for ensuring consistent, high-quality outputs. This guide covers comprehensive testing strategies for Claro prompts, from development to production.

Why Test Prompts?

AI prompts require testing because:
  • Variability - LLMs can produce different outputs for the same input
  • Edge Cases - Unexpected inputs can lead to poor responses
  • Version Changes - Updates may introduce unintended behavior
  • Quality Assurance - Ensure prompts meet business requirements
Unlike traditional code, AI prompts can’t guarantee deterministic outputs. Testing focuses on quality patterns and acceptable ranges rather than exact matches.
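For instance, a quality check might assert on required keywords and an acceptable length range instead of an exact string. A minimal sketch, reusing the get_response helper that appears later in this guide (the specific keywords and limits are illustrative):
import re
from your_app import get_response  # hypothetical helper in your application

def test_refund_answer_quality():
    """Assert on patterns and ranges instead of an exact string"""
    response = get_response("What is your refund policy?")

    # Key topics should appear even though exact wording varies
    assert re.search(r"refund", response, re.IGNORECASE)

    # Length should fall inside an acceptable range, not match a fixed string
    assert 50 <= len(response) <= 800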

Testing Before Deployment

Manual Testing in Dashboard

The fastest way to test prompts before publishing:
1. Create Draft Version

In the Bayt OS dashboard, create a new draft or edit an existing prompt.

2. Use Preview Mode

Test the prompt directly in the editor:
  • Enter sample inputs
  • Review generated outputs
  • Test edge cases
  • Verify tone and style

3. Iterate and Refine

Make adjustments based on test results, then re-test.

4. Publish When Ready

Once satisfied with test results, publish the version.

Test Input Categories

Test prompts with diverse input types:
Expected Use Cases
  • Typical user queries
  • Well-formed inputs
  • Common scenarios
  • Standard requests
These should produce the best responses.
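One way to organize these inputs is a parametrized test that tags each sample by category, so coverage gaps are easy to spot. A minimal sketch, again assuming a get_response helper in your application (the sample inputs are illustrative):
import pytest
from your_app import get_response  # hypothetical helper in your application

# Sample inputs tagged by category; extend with edge cases from real usage
TEST_INPUTS = [
    ("expected", "How do I reset my password?"),
    ("expected", "What are your business hours?"),
    ("edge_case", ""),                    # empty input
    ("edge_case", "asdf qwerty " * 50),   # long, low-signal input
]

@pytest.mark.parametrize("category,user_input", TEST_INPUTS)
def test_handles_diverse_inputs(category, user_input):
    response = get_response(user_input)

    # Every category should still produce a non-empty, bounded response
    assert response
    assert len(response) < 2000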

Unit Testing with Mocked Responses

Testing Integration Code

Test your application’s integration with Claro independently:
import pytest
from unittest.mock import Mock, patch
from baytos.claro import BaytClient

# Mock prompt response
MOCK_PROMPT = {
    "title": "Customer Support",
    "generator": "You are a helpful customer support agent...",
    "package_name": "@workspace/support:v1",
    "version": 1
}

class TestPromptIntegration:
    """Test application logic without calling Claro API"""

    @patch('baytos.claro.BaytClient.get_prompt')
    def test_prompt_loading(self, mock_get_prompt):
        """Test that prompt is loaded correctly"""
        # Setup mock
        mock_get_prompt.return_value = Mock(**MOCK_PROMPT)

        # Test code
        client = BaytClient(api_key="test_key")
        prompt = client.get_prompt("@workspace/support:v1")

        # Assertions
        assert prompt.title == "Customer Support"
        assert "customer support agent" in prompt.generator
        mock_get_prompt.assert_called_once_with("@workspace/support:v1")

    @patch('baytos.claro.BaytClient.get_prompt')
    def test_prompt_error_handling(self, mock_get_prompt):
        """Test handling of prompt loading errors"""
        from baytos.claro import BaytNotFoundError

        # Setup mock to raise error
        mock_get_prompt.side_effect = BaytNotFoundError("Prompt not found")

        # Test error handling
        client = BaytClient(api_key="test_key")

        with pytest.raises(BaytNotFoundError):
            client.get_prompt("@workspace/nonexistent:v1")

    def test_prompt_caching(self):
        """Test that prompts are cached appropriately"""
        # Your caching logic tests
        pass
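What the caching test looks like depends on how your application stores prompts. A minimal sketch, assuming a hypothetical load_prompt_cached helper built on functools.lru_cache:
from functools import lru_cache
from unittest.mock import Mock, patch
from baytos.claro import BaytClient

# Hypothetical helper: cache prompt lookups for the lifetime of the process
@lru_cache(maxsize=128)
def load_prompt_cached(package_name: str):
    client = BaytClient(api_key="test_key")
    return client.get_prompt(package_name)

@patch('baytos.claro.BaytClient.get_prompt')
def test_prompt_fetched_only_once(mock_get_prompt):
    mock_get_prompt.return_value = Mock(title="Customer Support")
    load_prompt_cached.cache_clear()

    load_prompt_cached("@workspace/support:v1")
    load_prompt_cached("@workspace/support:v1")

    # The second call should be served from the cache
    mock_get_prompt.assert_called_once()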

Testing LLM Integration

Test the complete flow with mocked LLM responses:
import pytest
from unittest.mock import Mock, patch
from your_app import CustomerSupportBot

class TestCustomerSupportBot:
    """Test bot behavior with mocked responses"""

    @patch('openai.ChatCompletion.create')
    @patch('baytos.claro.BaytClient.get_prompt')
    def test_support_response(self, mock_get_prompt, mock_openai):
        """Test complete support flow"""
        # Mock Claro prompt
        mock_get_prompt.return_value = Mock(
            generator="You are a helpful support agent..."
        )

        # Mock OpenAI response
        mock_openai.return_value = Mock(
            choices=[Mock(message=Mock(content="I can help you with that..."))]
        )

        # Test bot
        bot = CustomerSupportBot()
        response = bot.respond_to_query("How do I reset my password?")

        # Verify
        assert "help you" in response
        mock_get_prompt.assert_called_once()
        mock_openai.assert_called_once()

    @patch('openai.ChatCompletion.create')
    @patch('baytos.claro.BaytClient.get_prompt')
    def test_error_handling(self, mock_get_prompt, mock_openai):
        """Test bot handles LLM errors gracefully"""
        mock_get_prompt.return_value = Mock(generator="...")
        mock_openai.side_effect = Exception("API Error")

        bot = CustomerSupportBot()
        response = bot.respond_to_query("test")

        # Should return fallback message
        assert "error occurred" in response.lower()

Integration Testing

Testing with Real API Calls

Create integration tests that call the actual Claro API:
import pytest
import os
from baytos.claro import BaytClient

# Mark as integration test (can be skipped in CI)
@pytest.mark.integration
class TestClaroIntegration:
    """Integration tests using real Claro API"""

    @classmethod
    def setup_class(cls):
        """Setup test client"""
        api_key = os.getenv("BAYT_API_KEY_TEST")
        if not api_key:
            pytest.skip("BAYT_API_KEY_TEST not set")

        cls.client = BaytClient(api_key=api_key)

    def test_get_prompt(self):
        """Test fetching a real prompt"""
        prompt = self.client.get_prompt("@workspace/test-prompt:v1")

        assert prompt.title
        assert prompt.generator
        assert prompt.version == 1

    def test_list_prompts(self):
        """Test listing workspace prompts"""
        result = self.client.list_prompts(limit=10)

        assert 'prompts' in result
        assert len(result['prompts']) <= 10

    def test_prompt_with_context(self):
        """Test prompt with file context"""
        prompt = self.client.get_prompt("@workspace/code-review:v1")

        if prompt.has_context():
            contexts = prompt.get_file_contexts()
            assert len(contexts) > 0

Running Integration Tests

Separate integration tests from unit tests:
# Run only unit tests (fast)
pytest tests/ -m "not integration"

# Run all tests including integration (slower)
pytest tests/

# Run only integration tests
pytest tests/ -m integration
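For the integration marker to be recognized cleanly (and to avoid PytestUnknownMarkWarning), register it in your pytest configuration, for example:
# pytest.ini
[pytest]
markers =
    integration: tests that call the real Claro API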

A/B Testing Approaches

Comparing Prompt Versions

Test multiple versions side-by-side:
import os

import openai
from baytos.claro import BaytClient

def compare_prompt_versions(query: str, versions: list[str]):
    """Compare outputs from different prompt versions"""
    client = BaytClient(api_key=os.getenv("BAYT_API_KEY"))
    openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    results = {}

    for version in versions:
        # Get prompt version
        prompt = client.get_prompt(f"@workspace/support:{version}")

        # Generate response
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": prompt.generator},
                {"role": "user", "content": query}
            ]
        )

        results[version] = {
            "prompt": prompt.generator[:100] + "...",
            "response": response.choices[0].message.content,
            "tokens": response.usage.total_tokens
        }

    return results

# Test different versions
query = "How do I reset my password?"
comparison = compare_prompt_versions(query, ["v1", "v2", "v3"])

for version, result in comparison.items():
    print(f"\n{version}:")
    print(f"Response: {result['response']}")
    print(f"Tokens: {result['tokens']}")

Production A/B Testing

Gradually roll out new versions in production:
import hashlib
import os

from baytos.claro import BaytClient

class PromptSelector:
    """Select prompt version for A/B testing"""

    def __init__(self, versions_config: dict):
        """
        versions_config = {
            "v1": 0.7,  # 70% traffic
            "v2": 0.3   # 30% traffic
        }
        """
        self.versions = versions_config
        self.client = BaytClient(api_key=os.getenv("BAYT_API_KEY"))

    def get_prompt(self, user_id: str, base_package: str):
        """Get prompt version based on A/B split"""
        # Hash user_id for consistent assignment (hashlib is stable across
        # processes, unlike the built-in hash(), which is randomized per run)
        rand = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 / 100

        cumulative = 0
        for version, percentage in self.versions.items():
            cumulative += percentage
            if rand <= cumulative:
                return self.client.get_prompt(f"{base_package}:{version}")

        # Fallback
        return self.client.get_prompt(f"{base_package}:v1")

# Usage
selector = PromptSelector({
    "v1": 0.9,  # 90% of users
    "v2": 0.1   # 10% of users (new version)
})

prompt = selector.get_prompt(
    user_id="user_123",
    base_package="@workspace/support"
)

Tracking A/B Test Results

Log metrics for each version:
import logging
import time
from datetime import datetime

class ABTestLogger:
    """Log A/B test metrics"""

    def __init__(self):
        self.logger = logging.getLogger("ab_test")

    def log_interaction(
        self,
        user_id: str,
        version: str,
        query: str,
        response: str,
        success: bool,
        response_time: float
    ):
        """Log each interaction for analysis"""
        self.logger.info({
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user_id,
            "version": version,
            "query_length": len(query),
            "response_length": len(response),
            "success": success,
            "response_time_ms": response_time * 1000
        })

# Usage
logger = ABTestLogger()

start = time.time()
response = generate_response(prompt, user_query)
elapsed = time.time() - start

logger.log_interaction(
    user_id="user_123",
    version="v2",
    query=user_query,
    response=response,
    success=True,  # Based on your success criteria
    response_time=elapsed
)

CI/CD Integration

GitHub Actions Example

Automate testing in your CI pipeline:
# .github/workflows/test.yml
name: Test Claro Integration

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-cov

    - name: Run unit tests
      run: pytest tests/ -m "not integration" --cov

    - name: Run integration tests
      env:
        BAYT_API_KEY_TEST: ${{ secrets.BAYT_API_KEY_TEST }}
      run: pytest tests/ -m integration
      continue-on-error: true  # Don't fail build on integration test failures

    - name: Upload coverage
      uses: codecov/codecov-action@v3

Pre-commit Hooks

Test before committing (make the hook executable with chmod +x .git/hooks/pre-commit):
#!/bin/bash
# .git/hooks/pre-commit

echo "Running tests..."

# Run unit tests
pytest tests/ -m "not integration" --quiet

if [ $? -ne 0 ]; then
    echo "Tests failed. Commit aborted."
    exit 1
fi

echo "Tests passed!"
exit 0

Best Practices

Test with Real Data

Don’t just test with synthetic data (see the sketch after this list):
  • Collect actual user queries from logs
  • Test with real support tickets
  • Use production-like scenarios
  • Include edge cases from real usage
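A minimal sketch of pulling real queries into a parametrized test, assuming you export anonymized user queries to a JSONL file (the file path and the get_response helper are illustrative):
import json
import pytest
from your_app import get_response  # hypothetical helper in your application

def load_real_queries(path="tests/data/real_queries.jsonl"):
    """Load anonymized production queries exported from your logs"""
    with open(path) as f:
        return [json.loads(line)["query"] for line in f]

@pytest.mark.parametrize("query", load_real_queries())
def test_real_user_queries(query):
    response = get_response(query)

    # Every real query should get a non-empty, reasonably sized answer
    assert response
    assert len(response) < 2000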
Build a Golden Test Suite

Create a test suite that runs automatically:
# tests/test_prompts.py
import pytest
from your_app import get_response

# Golden test cases
TEST_CASES = [
    {
        "input": "How do I reset my password?",
        "expected_topics": ["password", "reset", "email"],
        "max_length": 500
    },
    {
        "input": "What are your business hours?",
        "expected_topics": ["hours", "time", "available"],
        "max_length": 200
    }
]

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_prompt_quality(test_case):
    response = get_response(test_case["input"])

    # Check topics mentioned
    for topic in test_case["expected_topics"]:
        assert topic.lower() in response.lower()

    # Check length
    assert len(response) <= test_case["max_length"]
Monitor Metrics Across Versions

Track these metrics for each prompt version:
  • Response time
  • Token usage
  • User satisfaction (thumbs up/down)
  • Error rates
  • Escalation rates (for support prompts)
Compare metrics between versions to identify improvements or regressions.
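For example, the interaction records written by the ABTestLogger above can be aggregated per version. A minimal sketch, assuming the logged records have been collected into a list of dicts:
from collections import defaultdict

def summarize_by_version(interactions):
    """Aggregate logged A/B interactions (list of dicts) per prompt version"""
    buckets = defaultdict(list)
    for record in interactions:
        buckets[record["version"]].append(record)

    summary = {}
    for version, records in buckets.items():
        total = len(records)
        summary[version] = {
            "interactions": total,
            "success_rate": sum(r["success"] for r in records) / total,
            "avg_response_time_ms": sum(r["response_time_ms"] for r in records) / total,
        }
    return summary

# Usage (loaded_log_records is your parsed log output)
# summary = summarize_by_version(loaded_log_records)
# print(summary)  # e.g. {"v1": {...}, "v2": {...}}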
Validate Instruction Following

Validate that prompts follow their instructions:
def test_prompt_follows_instructions():
    """Test that responses follow prompt constraints"""
    # If prompt says "respond in under 100 words"
    response = generate_response("Tell me about your company")

    word_count = len(response.split())
    assert word_count <= 100, f"Response too long: {word_count} words"
Pin Prompt Versions

Always pin to specific versions in tests:
# ✅ Good: Specific version
prompt = client.get_prompt("@workspace/support:v1")

# ❌ Bad: Latest version (tests become unpredictable)
prompt = client.get_prompt("@workspace/support:latest")
Test Error Handling

Test how your application handles errors:
def test_handles_prompt_not_found():
    """Gracefully handle missing prompts"""
    from baytos.claro import BaytNotFoundError

    try:
        prompt = client.get_prompt("@workspace/nonexistent:v1")
        response = use_prompt(prompt)
    except BaytNotFoundError:
        response = "I'm having trouble right now. Please try again later."

    assert response  # Should always return something
    assert "error" in response.lower() or "trouble" in response.lower()

Testing Tools and Frameworks

pytest

Python testing framework
  • Powerful fixtures
  • Parametrized tests
  • Great mocking support

pytest-mock

Mocking library
  • Easy API mocking
  • Patch functions
  • Verify call counts
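For example, pytest-mock exposes patching through its mocker fixture. A minimal sketch:
from baytos.claro import BaytClient

def test_with_mocker(mocker):
    """pytest-mock's mocker fixture wraps unittest.mock.patch"""
    mock_get = mocker.patch('baytos.claro.BaytClient.get_prompt')
    mock_get.return_value = mocker.Mock(title="Customer Support")

    client = BaytClient(api_key="test_key")
    prompt = client.get_prompt("@workspace/support:v1")

    assert prompt.title == "Customer Support"
    mock_get.assert_called_once_with("@workspace/support:v1")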

pytest-cov

Coverage reporting
  • Track test coverage
  • Identify untested code
  • CI integration

Locust

Load testing
  • Test at scale
  • Simulate concurrent users
  • Performance metrics

Load Testing Example

Test prompt performance under load:
from locust import HttpUser, task, between
import os

class PromptLoadTest(HttpUser):
    """Load test Claro API integration"""
    wait_time = between(1, 3)

    def on_start(self):
        """Setup test data"""
        self.api_key = os.getenv("BAYT_API_KEY_TEST")
        self.headers = {
            "Authorization": f"Bearer {self.api_key}"
        }

    @task(3)
    def get_prompt(self):
        """Test prompt fetching (most common operation)"""
        self.client.get(
            "https://api.baytos.ai/v1/prompts/@workspace/support:v1",
            headers=self.headers,
            name="get_prompt"
        )

    @task(1)
    def list_prompts(self):
        """Test listing prompts"""
        self.client.get(
            "https://api.baytos.ai/v1/prompts?limit=20",
            headers=self.headers,
            name="list_prompts"
        )
Run load tests:
locust -f locustfile.py --host=https://api.baytos.ai

Troubleshooting Tests

Tests Pass Locally but Fail in CI

Common causes:
  • Environment variables not set in CI
  • Different Python versions
  • Timezone or locale differences
Solution: Ensure CI environment matches local environment. Set all required secrets in CI.
Flaky Integration Tests

Common causes:
  • Network timeouts
  • Rate limiting
  • Non-deterministic LLM outputs
Solution: Add retries for integration tests (the flaky marker below requires the pytest-rerunfailures plugin):
@pytest.mark.integration
@pytest.mark.flaky(reruns=3)
def test_api_call():
    # Test will retry up to 3 times if it fails
    pass
Mocks Not Working as Expected

Common causes:
  • Patching wrong location
  • Import order issues
  • Mock not properly configured
Solution: Patch the name in the namespace where it is used, not the module where it is defined (patching a method directly on the class, as in the examples above, works from either location):
# If your_app.py imports: from baytos.claro import BaytClient
# Patch the name where it is used:
@patch('your_app.BaytClient')  # ✅ Correct

# Not where it is defined:
@patch('baytos.claro.BaytClient')  # ❌ your_app keeps its original reference

Next Steps