
Testing Strategies

Testing AI prompts is essential for ensuring consistent, high-quality outputs. This guide covers comprehensive testing strategies for Claro prompts, from development to production.

Why Test Prompts?

AI prompts require testing because:
  • Variability - LLMs can produce different outputs for the same input
  • Edge Cases - Unexpected inputs can lead to poor responses
  • Version Changes - Updates may introduce unintended behavior
  • Quality Assurance - Ensure prompts meet business requirements
Unlike traditional code, AI prompts can’t guarantee deterministic outputs. Testing focuses on quality patterns and acceptable ranges rather than exact matches.
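For instance, a quality check might assert on required keywords and an acceptable length range instead of an exact string. A minimal sketch, reusing the get_response helper that appears later in this guide (the specific keywords and limits are illustrative):
import re
from your_app import get_response  # hypothetical helper in your application

def test_refund_answer_quality():
    """Assert on patterns and ranges instead of an exact string"""
    response = get_response("What is your refund policy?")

    # Key topics should appear even though exact wording varies
    assert re.search(r"refund", response, re.IGNORECASE)

    # Length should fall inside an acceptable range, not match a fixed string
    assert 50 <= len(response) <= 800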

Testing Before Deployment

Manual Testing in Dashboard

The fastest way to test prompts before publishing:
1. Create Draft Version

In the Bayt OS dashboard, create a new draft or edit an existing prompt.

2. Use Preview Mode

Test the prompt directly in the editor:
  • Enter sample inputs
  • Review generated outputs
  • Test edge cases
  • Verify tone and style

3. Iterate and Refine

Make adjustments based on test results, then re-test.

4. Publish When Ready

Once satisfied with test results, publish the version.

Test Input Categories

Test prompts with diverse input types:
Expected Use Cases
  • Typical user queries
  • Well-formed inputs
  • Common scenarios
  • Standard requests
These should produce the best responses.
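One way to organize these inputs is a parametrized test that tags each sample by category, so coverage gaps are easy to spot. A minimal sketch, again assuming a get_response helper in your application (the sample inputs are illustrative):
import pytest
from your_app import get_response  # hypothetical helper in your application

# Sample inputs tagged by category; extend with edge cases from real usage
TEST_INPUTS = [
    ("expected", "How do I reset my password?"),
    ("expected", "What are your business hours?"),
    ("edge_case", ""),                    # empty input
    ("edge_case", "asdf qwerty " * 50),   # long, low-signal input
]

@pytest.mark.parametrize("category,user_input", TEST_INPUTS)
def test_handles_diverse_inputs(category, user_input):
    response = get_response(user_input)

    # Every category should still produce a non-empty, bounded response
    assert response
    assert len(response) < 2000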

Unit Testing with Mocked Responses

Testing Integration Code

Test your application’s integration with Claro independently:
import pytest
from unittest.mock import Mock, patch
from baytos.claro import BaytClient

# Mock prompt response
MOCK_PROMPT = {
    "title": "Customer Support",
    "generator": "You are a helpful customer support agent...",
    "package_name": "@workspace/support:v1",
    "version": 1
}

class TestPromptIntegration:
    """Test application logic without calling Claro API"""

    @patch('baytos.claro.BaytClient.get_prompt')
    def test_prompt_loading(self, mock_get_prompt):
        """Test that prompt is loaded correctly"""
        # Setup mock
        mock_get_prompt.return_value = Mock(**MOCK_PROMPT)

        # Test code
        client = BaytClient(api_key="test_key")
        prompt = client.get_prompt("@workspace/support:v1")

        # Assertions
        assert prompt.title == "Customer Support"
        assert "customer support agent" in prompt.generator
        mock_get_prompt.assert_called_once_with("@workspace/support:v1")

    @patch('baytos.claro.BaytClient.get_prompt')
    def test_prompt_error_handling(self, mock_get_prompt):
        """Test handling of prompt loading errors"""
        from baytos.claro import BaytNotFoundError

        # Setup mock to raise error
        mock_get_prompt.side_effect = BaytNotFoundError("Prompt not found")

        # Test error handling
        client = BaytClient(api_key="test_key")

        with pytest.raises(BaytNotFoundError):
            client.get_prompt("@workspace/nonexistent:v1")

    def test_prompt_caching(self):
        """Test that prompts are cached appropriately"""
        # Your caching logic tests
        pass
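What the caching test looks like depends on how your application stores prompts. A minimal sketch, assuming a hypothetical load_prompt_cached helper built on functools.lru_cache:
from functools import lru_cache
from unittest.mock import Mock, patch
from baytos.claro import BaytClient

# Hypothetical helper: cache prompt lookups for the lifetime of the process
@lru_cache(maxsize=128)
def load_prompt_cached(package_name: str):
    client = BaytClient(api_key="test_key")
    return client.get_prompt(package_name)

@patch('baytos.claro.BaytClient.get_prompt')
def test_prompt_fetched_only_once(mock_get_prompt):
    mock_get_prompt.return_value = Mock(title="Customer Support")
    load_prompt_cached.cache_clear()

    load_prompt_cached("@workspace/support:v1")
    load_prompt_cached("@workspace/support:v1")

    # The second call should be served from the cache
    mock_get_prompt.assert_called_once()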

Testing LLM Integration

Test the complete flow with mocked LLM responses:
import pytest
from unittest.mock import Mock, patch
from your_app import CustomerSupportBot

class TestCustomerSupportBot:
    """Test bot behavior with mocked responses"""

    @patch('openai.ChatCompletion.create')
    @patch('baytos.claro.BaytClient.get_prompt')
    def test_support_response(self, mock_get_prompt, mock_openai):
        """Test complete support flow"""
        # Mock Claro prompt
        mock_get_prompt.return_value = Mock(
            generator="You are a helpful support agent..."
        )

        # Mock OpenAI response
        mock_openai.return_value = Mock(
            choices=[Mock(message=Mock(content="I can help you with that..."))]
        )

        # Test bot
        bot = CustomerSupportBot()
        response = bot.respond_to_query("How do I reset my password?")

        # Verify
        assert "help you" in response
        mock_get_prompt.assert_called_once()
        mock_openai.assert_called_once()

    @patch('openai.ChatCompletion.create')
    @patch('baytos.claro.BaytClient.get_prompt')
    def test_error_handling(self, mock_get_prompt, mock_openai):
        """Test bot handles LLM errors gracefully"""
        mock_get_prompt.return_value = Mock(generator="...")
        mock_openai.side_effect = Exception("API Error")

        bot = CustomerSupportBot()
        response = bot.respond_to_query("test")

        # Should return fallback message
        assert "error occurred" in response.lower()

Integration Testing

Testing with Real API Calls

Create integration tests that call the actual Claro API:
import pytest
import os
from baytos.claro import BaytClient

# Mark as integration test (can be skipped in CI)
@pytest.mark.integration
class TestClaroIntegration:
    """Integration tests using real Claro API"""

    @classmethod
    def setup_class(cls):
        """Setup test client"""
        api_key = os.getenv("BAYT_API_KEY_TEST")
        if not api_key:
            pytest.skip("BAYT_API_KEY_TEST not set")

        cls.client = BaytClient(api_key=api_key)

    def test_get_prompt(self):
        """Test fetching a real prompt"""
        prompt = self.client.get_prompt("@workspace/test-prompt:v1")

        assert prompt.title
        assert prompt.generator
        assert prompt.version == 1

    def test_list_prompts(self):
        """Test listing workspace prompts"""
        result = self.client.list_prompts(limit=10)

        assert 'prompts' in result
        assert len(result['prompts']) <= 10

    def test_prompt_with_context(self):
        """Test prompt with file context"""
        prompt = self.client.get_prompt("@workspace/code-review:v1")

        if prompt.has_context():
            contexts = prompt.get_file_contexts()
            assert len(contexts) > 0

Running Integration Tests

Separate integration tests from unit tests:
# Run only unit tests (fast)
pytest tests/ -m "not integration"

# Run all tests including integration (slower)
pytest tests/

# Run only integration tests
pytest tests/ -m integration
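For the integration marker to be recognized cleanly (and to avoid PytestUnknownMarkWarning), register it in your pytest configuration, for example:
# pytest.ini
[pytest]
markers =
    integration: tests that call the real Claro API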

A/B Testing Approaches

Comparing Prompt Versions

Test multiple versions side-by-side:
import os

import openai
from baytos.claro import BaytClient

def compare_prompt_versions(query: str, versions: list[str]):
    """Compare outputs from different prompt versions"""
    client = BaytClient(api_key=os.getenv("BAYT_API_KEY"))
    openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    results = {}

    for version in versions:
        # Get prompt version
        prompt = client.get_prompt(f"@workspace/support:{version}")

        # Generate response
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": prompt.generator},
                {"role": "user", "content": query}
            ]
        )

        results[version] = {
            "prompt": prompt.generator[:100] + "...",
            "response": response.choices[0].message.content,
            "tokens": response.usage.total_tokens
        }

    return results

# Test different versions
query = "How do I reset my password?"
comparison = compare_prompt_versions(query, ["v1", "v2", "v3"])

for version, result in comparison.items():
    print(f"\n{version}:")
    print(f"Response: {result['response']}")
    print(f"Tokens: {result['tokens']}")

Production A/B Testing

Gradually roll out new versions in production:
import hashlib
import os

from baytos.claro import BaytClient

class PromptSelector:
    """Select prompt version for A/B testing"""

    def __init__(self, versions_config: dict):
        """
        versions_config = {
            "v1": 0.7,  # 70% traffic
            "v2": 0.3   # 30% traffic
        }
        """
        self.versions = versions_config
        self.client = BaytClient(api_key=os.getenv("BAYT_API_KEY"))

    def get_prompt(self, user_id: str, base_package: str):
        """Get prompt version based on A/B split"""
        # Hash user_id for consistent assignment (hashlib is stable across
        # processes, unlike the built-in hash(), which is randomized per run)
        rand = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 / 100

        cumulative = 0
        for version, percentage in self.versions.items():
            cumulative += percentage
            if rand <= cumulative:
                return self.client.get_prompt(f"{base_package}:{version}")

        # Fallback
        return self.client.get_prompt(f"{base_package}:v1")

# Usage
selector = PromptSelector({
    "v1": 0.9,  # 90% of users
    "v2": 0.1   # 10% of users (new version)
})

prompt = selector.get_prompt(
    user_id="user_123",
    base_package="@workspace/support"
)

Tracking A/B Test Results

Log metrics for each version:
import logging
import time
from datetime import datetime

class ABTestLogger:
    """Log A/B test metrics"""

    def __init__(self):
        self.logger = logging.getLogger("ab_test")

    def log_interaction(
        self,
        user_id: str,
        version: str,
        query: str,
        response: str,
        success: bool,
        response_time: float
    ):
        """Log each interaction for analysis"""
        self.logger.info({
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user_id,
            "version": version,
            "query_length": len(query),
            "response_length": len(response),
            "success": success,
            "response_time_ms": response_time * 1000
        })

# Usage
logger = ABTestLogger()

start = time.time()
response = generate_response(prompt, user_query)
elapsed = time.time() - start

logger.log_interaction(
    user_id="user_123",
    version="v2",
    query=user_query,
    response=response,
    success=True,  # Based on your success criteria
    response_time=elapsed
)

CI/CD Integration

GitHub Actions Example

Automate testing in your CI pipeline:
# .github/workflows/test.yml
name: Test Claro Integration

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-cov

    - name: Run unit tests
      run: pytest tests/ -m "not integration" --cov

    - name: Run integration tests
      env:
        BAYT_API_KEY_TEST: ${{ secrets.BAYT_API_KEY_TEST }}
      run: pytest tests/ -m integration
      continue-on-error: true  # Don't fail build on integration test failures

    - name: Upload coverage
      uses: codecov/codecov-action@v3

Pre-commit Hooks

Test before committing (make the hook executable with chmod +x .git/hooks/pre-commit):
#!/bin/bash
# .git/hooks/pre-commit

echo "Running tests..."

# Run unit tests
pytest tests/ -m "not integration" --quiet

if [ $? -ne 0 ]; then
    echo "Tests failed. Commit aborted."
    exit 1
fi

echo "Tests passed!"
exit 0

Best Practices

Test with Real Data

Don’t just test with synthetic data (see the sketch after this list):
  • Collect actual user queries from logs
  • Test with real support tickets
  • Use production-like scenarios
  • Include edge cases from real usage
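A minimal sketch of pulling real queries into a parametrized test, assuming you export anonymized user queries to a JSONL file (the file path and the get_response helper are illustrative):
import json
import pytest
from your_app import get_response  # hypothetical helper in your application

def load_real_queries(path="tests/data/real_queries.jsonl"):
    """Load anonymized production queries exported from your logs"""
    with open(path) as f:
        return [json.loads(line)["query"] for line in f]

@pytest.mark.parametrize("query", load_real_queries())
def test_real_user_queries(query):
    response = get_response(query)

    # Every real query should get a non-empty, reasonably sized answer
    assert response
    assert len(response) < 2000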
Build a Golden Test Suite

Create a test suite that runs automatically:
# tests/test_prompts.py
import pytest
from your_app import get_response

# Golden test cases
TEST_CASES = [
    {
        "input": "How do I reset my password?",
        "expected_topics": ["password", "reset", "email"],
        "max_length": 500
    },
    {
        "input": "What are your business hours?",
        "expected_topics": ["hours", "time", "available"],
        "max_length": 200
    }
]

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_prompt_quality(test_case):
    response = get_response(test_case["input"])

    # Check topics mentioned
    for topic in test_case["expected_topics"]:
        assert topic.lower() in response.lower()

    # Check length
    assert len(response) <= test_case["max_length"]
Monitor Metrics Across Versions

Track these metrics for each prompt version:
  • Response time
  • Token usage
  • User satisfaction (thumbs up/down)
  • Error rates
  • Escalation rates (for support prompts)
Compare metrics between versions to identify improvements or regressions.
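For example, the interaction records written by the ABTestLogger above can be aggregated per version. A minimal sketch, assuming the logged records have been collected into a list of dicts:
from collections import defaultdict

def summarize_by_version(interactions):
    """Aggregate logged A/B interactions (list of dicts) per prompt version"""
    buckets = defaultdict(list)
    for record in interactions:
        buckets[record["version"]].append(record)

    summary = {}
    for version, records in buckets.items():
        total = len(records)
        summary[version] = {
            "interactions": total,
            "success_rate": sum(r["success"] for r in records) / total,
            "avg_response_time_ms": sum(r["response_time_ms"] for r in records) / total,
        }
    return summary

# Usage (loaded_log_records is your parsed log output)
# summary = summarize_by_version(loaded_log_records)
# print(summary)  # e.g. {"v1": {...}, "v2": {...}}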
Validate Instruction Following

Validate that prompts follow their instructions:
def test_prompt_follows_instructions():
    """Test that responses follow prompt constraints"""
    # If prompt says "respond in under 100 words"
    response = generate_response("Tell me about your company")

    word_count = len(response.split())
    assert word_count <= 100, f"Response too long: {word_count} words"
Pin Prompt Versions

Always pin to specific versions in tests:
# ✅ Good: Specific version
prompt = client.get_prompt("@workspace/support:v1")

# ❌ Bad: Latest version (tests become unpredictable)
prompt = client.get_prompt("@workspace/support:latest")
Test Error Handling

Test how your application handles errors:
def test_handles_prompt_not_found():
    """Gracefully handle missing prompts"""
    from baytos.claro import BaytNotFoundError

    try:
        prompt = client.get_prompt("@workspace/nonexistent:v1")
        response = use_prompt(prompt)
    except BaytNotFoundError:
        response = "I'm having trouble right now. Please try again later."

    assert response  # Should always return something
    assert "error" in response.lower() or "trouble" in response.lower()

Testing Tools and Frameworks

pytest

Python testing framework
  • Powerful fixtures
  • Parametrized tests
  • Great mocking support

pytest-mock

Mocking library
  • Easy API mocking
  • Patch functions
  • Verify call counts
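For example, pytest-mock exposes patching through its mocker fixture. A minimal sketch:
from baytos.claro import BaytClient

def test_with_mocker(mocker):
    """pytest-mock's mocker fixture wraps unittest.mock.patch"""
    mock_get = mocker.patch('baytos.claro.BaytClient.get_prompt')
    mock_get.return_value = mocker.Mock(title="Customer Support")

    client = BaytClient(api_key="test_key")
    prompt = client.get_prompt("@workspace/support:v1")

    assert prompt.title == "Customer Support"
    mock_get.assert_called_once_with("@workspace/support:v1")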

pytest-cov

Coverage reporting
  • Track test coverage
  • Identify untested code
  • CI integration

Locust

Load testing
  • Test at scale
  • Simulate concurrent users
  • Performance metrics

Load Testing Example

Test prompt performance under load:
from locust import HttpUser, task, between
import os

class PromptLoadTest(HttpUser):
    """Load test Claro API integration"""
    wait_time = between(1, 3)

    def on_start(self):
        """Setup test data"""
        self.api_key = os.getenv("BAYT_API_KEY_TEST")
        self.headers = {
            "Authorization": f"Bearer {self.api_key}"
        }

    @task(3)
    def get_prompt(self):
        """Test prompt fetching (most common operation)"""
        self.client.get(
            "https://api.baytos.ai/v1/prompts/@workspace/support:v1",
            headers=self.headers,
            name="get_prompt"
        )

    @task(1)
    def list_prompts(self):
        """Test listing prompts"""
        self.client.get(
            "https://api.baytos.ai/v1/prompts?limit=20",
            headers=self.headers,
            name="list_prompts"
        )
Run load tests:
locust -f locustfile.py --host=https://api.baytos.ai

Troubleshooting Tests

Tests Pass Locally but Fail in CI

Common causes:
  • Environment variables not set in CI
  • Different Python versions
  • Timezone or locale differences
Solution: Ensure CI environment matches local environment. Set all required secrets in CI.
Flaky Integration Tests

Common causes:
  • Network timeouts
  • Rate limiting
  • Non-deterministic LLM outputs
Solution: Add retries for integration tests (the flaky marker below requires the pytest-rerunfailures plugin):
@pytest.mark.integration
@pytest.mark.flaky(reruns=3)
def test_api_call():
    # Test will retry up to 3 times if it fails
    pass
Mocks Not Working as Expected

Common causes:
  • Patching wrong location
  • Import order issues
  • Mock not properly configured
Solution: Patch the name in the namespace where it is used, not the module where it is defined (patching a method directly on the class, as in the examples above, works from either location):
# If your_app.py imports: from baytos.claro import BaytClient
# Patch the name where it is used:
@patch('your_app.BaytClient')  # ✅ Correct

# Not where it is defined:
@patch('baytos.claro.BaytClient')  # ❌ your_app keeps its original reference

Next Steps