
trainloop eval

Run evaluation suites to analyze your LLM interaction data and generate results.

Synopsis

trainloop eval [OPTIONS]

Description

The trainloop eval command processes event data collected by the TrainLoop SDK and applies your custom metrics to generate evaluation results. It discovers evaluation suites in the trainloop/eval/ directory and processes new events from the data folder.

Options

Option                 Description
--suite <name>         Run only the specified evaluation suite
--config <path>        Path to configuration file
--data-folder <path>   Override data folder location
--verbose              Enable verbose output for debugging
--quiet                Suppress non-essential output
--force                Re-evaluate all events, ignoring cache
--dry-run              Show what would be evaluated without running
--help                 Show help message

How It Works

  1. Discovery: Finds evaluation suites in trainloop/eval/suites/
  2. Event Processing: Loads event data from trainloop/data/events/
  3. Metric Application: Applies metrics to each event
  4. Result Generation: Saves results to trainloop/data/results/
  5. Judge Traces: Stores LLM Judge traces in trainloop/data/judge_traces/
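
A representative project layout tying these steps together (based on the directories named above; exact filenames will vary):

trainloop/
├── eval/
│   ├── metrics/          # metric functions applied in step 3
│   └── suites/           # evaluation suites discovered in step 1
└── data/
    ├── events/           # collected events loaded in step 2
    ├── results/          # evaluation results written in step 4
    └── judge_traces/     # LLM Judge traces written in step 5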

Examples

Basic Evaluation

# Run all evaluation suites
trainloop eval

Run Specific Suite

# Run only the "greeting_evaluation" suite
trainloop eval --suite greeting_evaluation

Custom Configuration

# Use custom configuration file
trainloop eval --config production.config.yaml

Verbose Output

# Enable detailed logging
trainloop eval --verbose

Force Re-evaluation

# Re-evaluate all events, ignoring cache
trainloop eval --force

Dry Run

# Show what would be evaluated without running
trainloop eval --dry-run

Configuration Discovery

The CLI searches for configuration files in this order:

  1. --config command line argument
  2. TRAINLOOP_CONFIG_FILE environment variable
  3. trainloop.config.yaml in current directory
  4. trainloop.config.yaml in parent directories (up to git root)
  5. ~/.trainloop/config.yaml in home directory
  6. Default configuration
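
For example, an explicit flag or environment variable takes precedence over any discovered file (paths shown are illustrative):

# 1. An explicit flag wins over everything else
trainloop eval --config ./configs/production.config.yaml

# 2. Otherwise the TRAINLOOP_CONFIG_FILE environment variable is used
TRAINLOOP_CONFIG_FILE=~/.trainloop/config.yaml trainloop eval

# 3. With neither set, trainloop.config.yaml is discovered automatically
trainloop eval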

Output

Success Output

🔍 Discovering evaluation suites...
✅ Found 3 suites: greeting_evaluation, accuracy_check, safety_review

📊 Running evaluations...
✅ greeting_evaluation: 12/15 metrics passed (80.0%)
✅ accuracy_check: 45/50 metrics passed (90.0%)
✅ safety_review: 98/100 metrics passed (98.0%)

📈 Results saved to trainloop/data/results/
- evaluation_results_2024-01-15_14-30-25.json
- evaluation_summary.json

⏱️ Evaluation completed in 2.3s

Verbose Output

trainloop eval --verbose
🔍 Discovering evaluation suites...
- Found trainloop/eval/suites/greeting_evaluation.py
- Found trainloop/eval/suites/accuracy_check.py
- Found trainloop/eval/suites/safety_review.py
✅ Found 3 suites: greeting_evaluation, accuracy_check, safety_review

📁 Loading event data...
- Loading trainloop/data/events/2024-01-15.jsonl (150 events)
- Loading trainloop/data/events/2024-01-14.jsonl (230 events)
- Total: 380 events

📊 Running evaluations...
- greeting_evaluation: Processing 45 events...
✅ has_greeting_word: 42/45 passed (93.3%)
✅ is_personalized: 38/45 passed (84.4%)
❌ is_friendly_tone: 35/45 passed (77.8%)
- accuracy_check: Processing 380 events...
✅ is_accurate: 342/380 passed (90.0%)
✅ is_complete: 335/380 passed (88.2%)
- safety_review: Processing 380 events...
✅ is_safe: 378/380 passed (99.5%)
✅ no_harmful_content: 373/380 passed (98.2%)

📈 Results saved to trainloop/data/results/
- evaluation_results_2024-01-15_14-30-25.json
- evaluation_summary.json

⏱️ Evaluation completed in 2.3s

Error Output

❌ Error: No evaluation suites found in trainloop/eval/suites/

To get started:
1. Create a suite file in trainloop/eval/suites/
2. Add metrics to trainloop/eval/metrics/
3. Run 'trainloop eval' again

See: https://docs.trainloop.com/tutorials/first-evaluation

Result Files

Main Results File

{
  "timestamp": "2024-01-15T14:30:25Z",
  "duration": 2.3,
  "total_events": 380,
  "suites": {
    "greeting_evaluation": {
      "events_processed": 45,
      "metrics": {
        "has_greeting_word": {
          "passed": 42,
          "total": 45,
          "score": 0.933
        },
        "is_personalized": {
          "passed": 38,
          "total": 45,
          "score": 0.844
        }
      },
      "overall_score": 0.889
    }
  }
}

Summary File

{
  "latest_evaluation": "2024-01-15T14:30:25Z",
  "total_suites": 3,
  "overall_score": 0.893,
  "trending": {
    "score_change": 0.05,
    "trend": "improving"
  }
}

Exit Codes

Exit Code   Meaning
0           Success - all evaluations completed
1           General error
2           Invalid arguments
3           Configuration error
4           No evaluation suites found
5           Evaluation failure
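
In scripts, the exit code can be used to branch on the failure mode (a minimal sketch using the codes above):

trainloop eval --quiet
case $? in
  0) echo "All evaluations completed" ;;
  4) echo "No evaluation suites found" ;;
  5) echo "Evaluation failure" ;;
  *) echo "Error: invalid arguments, configuration, or general failure" ;;
esac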

Performance Considerations

Large Datasets

For large datasets, consider:

# Process in batches
trainloop eval --batch-size 1000

# Use parallel processing
trainloop eval --parallel

# Skip expensive metrics for CI
trainloop eval --skip-llm-judge

Caching

TrainLoop caches evaluation results to avoid re-processing unchanged events:

# Clear cache if needed
trainloop eval --force

# Show cache statistics
trainloop eval --cache-stats

Integration with CI/CD

Basic CI Integration

# Run evaluations in CI
trainloop eval --config ci.config.yaml --quiet

# Check exit code
if [ $? -eq 0 ]; then
  echo "✅ Evaluations passed"
else
  echo "❌ Evaluations failed"
  exit 1
fi

Quality Gates

# Fail if score below threshold
trainloop eval --min-score 0.8

# Fail if any metric fails
trainloop eval --require-all-pass

Common Issues

No Suites Found

❌ Error: No evaluation suites found

Solution: Create evaluation suites in trainloop/eval/suites/
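
A minimal starting point (the suite filename is only an example):

# Create the expected directories
mkdir -p trainloop/eval/suites trainloop/eval/metrics

# Add a suite module, e.g. trainloop/eval/suites/my_first_suite.py,
# that applies your metrics, then re-run:
trainloop eval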

No Events Found

❌ Error: No event data found

Solution:

  1. Check TRAINLOOP_DATA_FOLDER environment variable
  2. Ensure your application is collecting data with the SDK
  3. Verify events exist in trainloop/data/events/
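
Quick checks for the points above (paths assume the default layout):

# Is the data folder overridden by the environment?
echo "$TRAINLOOP_DATA_FOLDER"

# Do any event files exist in the default location?
ls trainloop/data/events/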

Import Errors

❌ Error: Failed to import suite 'my_suite'

Solution:

  1. Check Python syntax in suite files
  2. Ensure metrics are importable
  3. Verify Python path includes trainloop/eval/
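
To isolate a syntax error in a specific suite file (the filename is illustrative), Python's built-in compile check is a quick first step:

# Reports the exact syntax error, if any, without running the suite
python -m py_compile trainloop/eval/suites/my_suite.py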

LLM Judge Failures

❌ Error: LLM Judge API call failed

Solution:

  1. Check API keys are configured
  2. Verify network connectivity
  3. Check rate limits
  4. Use --skip-llm-judge to disable
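
For example (the environment variable depends on which provider your judge is configured to use; OPENAI_API_KEY is only an illustration):

# Confirm a provider key is present in the environment
test -n "$OPENAI_API_KEY" || echo "API key not set"

# Or skip judge-based metrics entirely
trainloop eval --skip-llm-judge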

Advanced Usage

Custom Metrics Path

# Use custom metrics directory
trainloop eval --metrics-path custom/metrics/

# Use custom suites directory
trainloop eval --suites-path custom/suites/

Filtering Events

# Evaluate only recent events
trainloop eval --since "2024-01-01"

# Evaluate specific tags
trainloop eval --tags "greeting,support"

# Exclude specific tags
trainloop eval --exclude-tags "test,debug"

Output Formats

# Output JSON results
trainloop eval --format json

# Output CSV results
trainloop eval --format csv

# Output to file
trainloop eval --output results.json

See Also