# trainloop benchmark
Compare multiple LLM providers by re-running prompts and evaluating results with the same metrics.
## Synopsis

```bash
trainloop benchmark [OPTIONS]
```
## Description

The `trainloop benchmark` command takes your existing event data and re-runs the same prompts against multiple LLM providers configured in your settings. It then applies your evaluation metrics to all responses, enabling direct comparison of model performance.
## Options

| Option | Description |
|---|---|
| `--config <path>` | Path to configuration file |
| `--max-samples <number>` | Limit number of samples per provider |
| `--tag <name>` | Only benchmark events with a specific tag |
| `--providers <list>` | Comma-separated list of providers to test |
| `--output <path>` | Output directory for results |
| `--verbose` | Enable verbose output |
| `--help` | Show help message |
## Examples

### Basic Benchmarking

```bash
# Run benchmark with configured providers
trainloop benchmark
```

### Limited Sampling

```bash
# Test with only 100 samples per provider
trainloop benchmark --max-samples 100
```

### Specific Tags

```bash
# Only benchmark greeting generation
trainloop benchmark --tag greeting-generation
```

### Custom Providers

```bash
# Test specific providers
trainloop benchmark --providers openai/gpt-4o,anthropic/claude-3-sonnet
```
## Configuration

Configure benchmark providers in `trainloop.config.yaml`:

```yaml
trainloop:
  benchmark:
    providers:
      - provider: openai
        model: gpt-4o
        temperature: 0.7
        max_tokens: 1000
      - provider: openai
        model: gpt-4o-mini
        temperature: 0.7
        max_tokens: 1000
      - provider: anthropic
        model: claude-3-5-sonnet-20241022
        temperature: 0.7
        max_tokens: 1000

    # Optional settings
    max_samples: 1000
    parallel_requests: 5
    timeout: 30
```
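The CLI validates this file itself, but if you want to sanity-check the benchmark block before kicking off a run, a short standalone script can do it. The sketch below assumes only the structure shown above and uses PyYAML; it is not part of the TrainLoop CLI.

```python
# check_benchmark_config.py - illustrative standalone check, not part of the TrainLoop CLI
from pathlib import Path

import yaml  # PyYAML

config = yaml.safe_load(Path("trainloop.config.yaml").read_text())
bench = config.get("trainloop", {}).get("benchmark", {})
providers = bench.get("providers", [])

assert providers, "benchmark.providers must list at least one provider"
for entry in providers:
    missing = {"provider", "model"} - entry.keys()
    assert not missing, f"provider entry missing keys: {missing}"
    print(f"{entry['provider']}/{entry['model']}: "
          f"temperature={entry.get('temperature')}, max_tokens={entry.get('max_tokens')}")

print(f"max_samples={bench.get('max_samples')}, "
      f"parallel_requests={bench.get('parallel_requests')}, timeout={bench.get('timeout')}")
```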
## How It Works

1. **Sample Selection**: Selects events from your data based on tags/filters
2. **Provider Execution**: Re-runs prompts against each configured provider
3. **Metric Application**: Applies your existing metrics to all responses
4. **Result Generation**: Creates comparison data with performance statistics
5. **Visualization**: Results viewable in Studio UI
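Conceptually, the flow looks like the sketch below. It is only an illustration of the steps above: `call_provider`, the inline metric, and the hard-coded event are hypothetical stand-ins, not the CLI's internal API.

```python
# Conceptual sketch of the benchmark loop described above.
# call_provider(), metric_fns, and the sample event are hypothetical stand-ins,
# NOT the TrainLoop CLI's internal API.
from collections import defaultdict

def call_provider(provider: str, model: str, prompt: str) -> str:
    """Stub: re-send the prompt to one provider and return its response text."""
    return f"[{provider}/{model}] response to: {prompt}"

metric_fns = {  # metric name -> function mapping a response to 1 (pass) or 0 (fail)
    "helpful_check": lambda response: int(len(response) > 0),
}

events = [{"prompt": "Write a friendly greeting.", "tag": "greeting-generation"}]  # 1. sample selection
providers = [("openai", "gpt-4o"), ("anthropic", "claude-3-5-sonnet-20241022")]

scores = defaultdict(lambda: {"passed": 0, "total": 0})
for event in events:
    for provider, model in providers:                 # 2. provider execution
        response = call_provider(provider, model, event["prompt"])
        for name, metric in metric_fns.items():       # 3. metric application
            key = (f"{provider}/{model}", name)
            scores[key]["total"] += 1
            scores[key]["passed"] += metric(response)

for (provider, metric_name), s in scores.items():     # 4. result generation
    print(f"{provider} {metric_name}: {s['passed']}/{s['total']}")
```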
## Output

### Console Output

```text
🔍 Starting benchmark with 3 providers...
📊 Selected 150 events for benchmarking
🚀 Running prompts against providers:
✅ openai/gpt-4o: 150/150 completed (avg: 1.2s)
✅ openai/gpt-4o-mini: 150/150 completed (avg: 0.8s)
✅ anthropic/claude-3-5-sonnet: 150/150 completed (avg: 1.5s)
📈 Applying metrics to results...
✅ helpful_check: Applied to 450 responses
✅ accuracy_check: Applied to 450 responses
✅ safety_check: Applied to 450 responses
💾 Benchmark results saved to data/benchmarks/2024-01-15_14-30-25/
```
### Results Structure

```text
data/
├── benchmarks/
│   └── 2024-01-15_14-30-25/
│       ├── benchmark_results.json    # Main results
│       ├── provider_comparison.json  # Provider stats
│       └── detailed_results.jsonl    # Individual responses
```
### Results Content

```json
{
  "timestamp": "2024-01-15T14:30:25Z",
  "providers": ["openai/gpt-4o", "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"],
  "total_samples": 150,
  "metrics": {
    "helpful_check": {
      "openai/gpt-4o": {"score": 0.85, "passed": 128, "total": 150},
      "openai/gpt-4o-mini": {"score": 0.82, "passed": 123, "total": 150},
      "anthropic/claude-3-5-sonnet": {"score": 0.88, "passed": 132, "total": 150}
    }
  },
  "cost_analysis": {
    "openai/gpt-4o": {"total_cost": 4.50, "cost_per_token": 0.015},
    "openai/gpt-4o-mini": {"total_cost": 0.30, "cost_per_token": 0.001}
  }
}
```
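Because the results are plain JSON, they are easy to post-process outside the Studio UI. The snippet below is a minimal sketch that assumes the field names and directory layout from the example above:

```python
# Summarize one benchmark run from its results file.
# Paths and field names follow the example output above.
import json
from pathlib import Path

run_dir = Path("data/benchmarks/2024-01-15_14-30-25")
results = json.loads((run_dir / "benchmark_results.json").read_text())

for metric, by_provider in results["metrics"].items():
    print(f"\n{metric}")
    ranked = sorted(by_provider.items(), key=lambda kv: kv[1]["score"], reverse=True)
    for provider, stats in ranked:
        print(f"  {provider:35s} score={stats['score']:.2f} ({stats['passed']}/{stats['total']})")

print()
for provider, cost in results.get("cost_analysis", {}).items():
    print(f"{provider:35s} total_cost=${cost['total_cost']:.2f}")
```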
## Analysis in Studio UI

After benchmarking, use the Studio UI to analyze results:

```bash
trainloop studio
```

Features available:
- Performance comparison charts
- Cost vs. quality analysis
- Individual response comparison
- Metric breakdown by provider
- Statistical significance testing (a quick offline check is sketched below)
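If you want a quick significance check before (or outside of) the Studio UI, a standard two-proportion z-test on the `passed`/`total` counts from the results file is enough for a first look. This is an illustrative sketch, not necessarily the test Studio uses:

```python
# Two-proportion z-test on pass counts from benchmark_results.json (illustrative only).
import math

def z_test(passed_a: int, total_a: int, passed_b: int, total_b: int):
    p_a, p_b = passed_a / total_a, passed_b / total_b
    pooled = (passed_a + passed_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided, normal approximation
    return z, p_value

# helpful_check numbers from the example above: gpt-4o vs claude-3-5-sonnet
z, p = z_test(128, 150, 132, 150)
print(f"z={z:.2f}, p={p:.3f}")  # a large p-value means the gap may just be noise at n=150
```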
## Best Practices

### 1. Representative Sampling

```bash
# Use sufficient samples for statistical significance
trainloop benchmark --max-samples 500

# Include diverse event types
trainloop benchmark --tag ""  # All events
```
### 2. Consistent Configuration

```yaml
# Use same temperature/settings across providers
benchmark:
  providers:
    - provider: openai
      model: gpt-4o
      temperature: 0.7  # Same across all
    - provider: anthropic
      model: claude-3-sonnet
      temperature: 0.7  # Same across all
```
### 3. Cost Management

```bash
# Start with small samples
trainloop benchmark --max-samples 50
```

```yaml
# Monitor costs in configuration
benchmark:
  cost_limit: 10.00  # Stop at $10
```
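A back-of-the-envelope estimate before a full run also helps. The numbers below are placeholder assumptions (made-up per-1K-token prices and an assumed response length), not real provider rates:

```python
# Rough cost estimate for a benchmark run. All prices and token counts are
# placeholder assumptions - substitute your providers' actual rates.
cost_per_1k_tokens = {
    "openai/gpt-4o": 0.0125,               # hypothetical
    "openai/gpt-4o-mini": 0.0006,          # hypothetical
    "anthropic/claude-3-5-sonnet": 0.015,  # hypothetical
}
max_samples = 500
avg_tokens_per_sample = 800  # assumed prompt + response size

for provider, rate in cost_per_1k_tokens.items():
    estimate = max_samples * avg_tokens_per_sample / 1000 * rate
    print(f"{provider:30s} ~${estimate:.2f}")
```

If the estimate looks high, start with `--max-samples 50` as above and scale up once the numbers check out.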
## Troubleshooting

### API Rate Limits

```bash
# Reduce parallel requests
trainloop benchmark --parallel 2

# Add delays between requests
trainloop benchmark --delay 1.0
```

### Insufficient Data

```bash
# Check available events
ls data/events/

# Verify tags exist
trainloop eval --dry-run
```

### Provider Errors

```bash
# Test individual provider
trainloop benchmark --providers openai/gpt-4o

# Check API keys
env | grep API_KEY
```
## See Also
- Benchmarking Tutorial - Complete benchmarking guide
- Configuration - Configure benchmark settings
- Studio UI - Analyze benchmark results