# Benchmark Data Formats
The benchmark feature enables comparing evaluation results across multiple LLM providers. This document describes the data schema and JSONL output format for benchmark runs.
## Design Principles
- Compatibility: Extends the existing `Sample` and `Result` types rather than replacing them
- Flexibility: Supports various comparison dimensions (providers, models, parameters)
- UI-Friendly: Structured for efficient querying and visualization in DuckDB
- Extensibility: Easy to add new metrics or provider configurations
## JSONL Output Format
Benchmark results are saved to `data/benchmarks/{timestamp}/{suite_name}.jsonl` with three types of records:
### 1. Metadata Record (First Line)
```json
{
"type": "metadata",
"data": {
"benchmark_id": "bench_20240315_143022_abc123",
"timestamp": "2024-03-15T14:30:22.123Z",
"base_eval_run": "2024-03-15_14-25-00",
"suite_name": "qa_accuracy",
"description": "Comparing GPT-4 vs Claude vs Gemini on QA tasks",
"tags": ["production", "qa"],
"providers": [
{
"provider": "openai",
"model": "gpt-4",
"model_params": {"temperature": 0.0, "max_tokens": 1000}
},
{
"provider": "anthropic",
"model": "claude-3-opus-20240229",
"model_params": {"temperature": 0.0, "max_tokens": 1000}
}
]
}
}
```
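For ad-hoc inspection, metadata lines can be queried directly with DuckDB before any view is created. The following is a minimal sketch using the same JSON operators as the view defined later in this document; the path glob is illustrative and assumes the `data` field is read as JSON.

```sql
-- List benchmark runs and their configuration from metadata lines only.
-- Path is illustrative; assumes "data" is read as a JSON column.
SELECT
    data->>'benchmark_id'  AS benchmark_id,
    data->>'suite_name'    AS suite,
    data->>'base_eval_run' AS base_eval_run,
    data->'tags'           AS tags,
    data->'providers'      AS providers
FROM read_json_auto('data/benchmarks/*/*.jsonl')
WHERE type = 'metadata';
```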
### 2. Result Records (Multiple Lines)
```json
{
"type": "result",
"data": {
"provider_config": {
"provider": "openai",
"model": "gpt-4",
"model_params": {"temperature": 0.0}
},
"sample": {
"duration_ms": 1523,
"tag": "customer_support_001",
"input": [{"role": "user", "content": "How do I reset my password?"}],
"output": {"content": "To reset your password..."},
"model": "gpt-4",
"model_params": {"temperature": 0.0},
"start_time_ms": 1710515422000,
"end_time_ms": 1710515423523,
"url": "https://api.openai.com/v1/chat/completions",
"location": {"tag": "customer_support", "lineNumber": "42"}
},
"metrics": [
{
"metric": "response_quality",
"passed": 1,
"score": 0.92,
"reason": null
},
{
"metric": "hallucination_check",
"passed": 0,
"score": 0.3,
"reason": "Response contains unverified claim about password reset timeframe"
}
],
"summary": {
"total_metrics": 2,
"passed_metrics": 1,
"avg_score": 0.61,
"pass_rate": 0.5
},
"timing": {
"provider_latency_ms": 1523,
"evaluation_time_ms": 342
}
}
}
```
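The per-result `summary` and `timing` objects are what make sample-level comparisons cheap to query. A hedged sketch, following the same `data->...` extraction pattern as the view below; the path is illustrative, and the casts are needed because `->>` returns text.

```sql
-- Per-result scores and latency for one suite, sorted worst-first.
-- Path is illustrative; ->> extracts text, hence the casts.
SELECT
    data->'provider_config'->>'provider'                     AS provider,
    data->'provider_config'->>'model'                        AS model,
    data->'sample'->>'tag'                                   AS sample_tag,
    CAST(data->'summary'->>'pass_rate' AS DOUBLE)            AS pass_rate,
    CAST(data->'timing'->>'provider_latency_ms' AS INTEGER)  AS latency_ms
FROM read_json_auto('data/benchmarks/*/qa_accuracy.jsonl')
WHERE type = 'result'
ORDER BY pass_rate ASC;
```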
### 3. Summary Record (Last Line)
```json
{
"type": "summary",
"data": {
"benchmark_id": "bench_20240315_143022_abc123",
"timestamp": "2024-03-15T14:30:22.123Z",
"suite_name": "qa_accuracy",
"total_samples": 50,
"total_providers": 2,
"provider_summaries": {
"openai/gpt-4": {
"total_evaluations": 50,
"avg_pass_rate": 0.84,
"avg_latency_ms": 1456,
"total_cost": 2.34,
"metrics": {
"response_quality": {"pass_rate": 0.92, "avg_score": 0.89},
"hallucination_check": {"pass_rate": 0.76, "avg_score": 0.71}
}
},
"anthropic/claude-3-opus": {
"total_evaluations": 50,
"avg_pass_rate": 0.88,
"avg_latency_ms": 2103,
"total_cost": 3.12,
"metrics": {
"response_quality": {"pass_rate": 0.94, "avg_score": 0.91},
"hallucination_check": {"pass_rate": 0.82, "avg_score": 0.78}
}
}
},
"metric_comparisons": {
"response_quality": {
"best_provider": "anthropic/claude-3-opus",
"worst_provider": "openai/gpt-4",
"spread": 0.02
}
},
"overall": {
"best_provider": "anthropic/claude-3-opus",
"worst_provider": "openai/gpt-4",
"avg_duration_ms": 1779.5,
"total_duration_ms": 177950
}
}
}
```
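Because every line carries a `type` discriminator, a well-formed file can be sanity-checked with a single aggregation: expect one metadata record, one summary record, and one result record per sample-provider pair. A sketch with an illustrative path:

```sql
-- Count records by type: expect 1 metadata, 1 summary,
-- and (total_samples * total_providers) result lines.
SELECT type, count(*) AS records
FROM read_json_auto('data/benchmarks/*/qa_accuracy.jsonl')
GROUP BY type
ORDER BY type;
```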
## DuckDB View Schema
The UI will create a `benchmarks` view with the following structure:
```sql
CREATE VIEW benchmarks AS
SELECT
-- Extracted from filename/path
regexp_extract(filename, '/benchmarks/([^/]+)/', 1) AS ts,
regexp_extract(filename, '/benchmarks/[^/]+/([^/]+)\.jsonl', 1) AS suite,
-- Record type discrimination
type,
-- Flattened fields for easier querying
data->>'benchmark_id' AS benchmark_id,
data->>'timestamp' AS benchmark_timestamp,
-- For result records
data->'provider_config'->>'provider' AS provider,
data->'provider_config'->>'model' AS model,
data->'sample'->>'tag' AS sample_tag,
data->'summary'->>'avg_score' AS avg_score,
data->'summary'->>'pass_rate' AS pass_rate,
-- Full JSON for detailed views
data
FROM read_json_auto('${ROOT}/benchmarks/*/*.jsonl', filename=true);
```
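With this view in place, a side-by-side provider comparison becomes a plain aggregation over result records. A sketch (the casts are needed because `->>` in the view extracts text):

```sql
-- Compare providers on one suite using the benchmarks view.
SELECT
    provider,
    model,
    count(*)                       AS samples,
    avg(CAST(pass_rate AS DOUBLE)) AS avg_pass_rate,
    avg(CAST(avg_score AS DOUBLE)) AS mean_score
FROM benchmarks
WHERE type = 'result' AND suite = 'qa_accuracy'
GROUP BY provider, model
ORDER BY avg_pass_rate DESC;
```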
## Key Design Decisions
- Three-Record Structure: Separates metadata, individual results, and summary for efficient processing
- Provider Namespacing: Uses the `provider/model` format for clear identification (see the sketch after this list)
- Metric Flexibility: Supports both binary pass/fail and numeric scoring
- Timing Separation: Tracks both provider latency and evaluation overhead
- Cost Tracking: Placeholder for future cost comparison features
- Reference Linking: Optional link to original evaluation run for traceability
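To illustrate the namespacing decision, the `provider/model` keys used in `provider_summaries` can be reconstructed from result records and joined back against the summary, modulo any model-name shortening (the summary example above uses `claude-3-opus` rather than the dated model identifier). A sketch against the `benchmarks` view:

```sql
-- Rebuild the "provider/model" key from result records.
SELECT
    provider || '/' || model AS provider_key,
    count(*)                 AS results
FROM benchmarks
WHERE type = 'result'
GROUP BY provider, model;
```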
## UI Visualization Support
This schema enables the UI to:
- Show side-by-side provider comparisons
- Generate charts for metric performance across providers
- Calculate cost/performance tradeoffs
- Filter by tags, suites, or specific providers
- Drill down from summary to individual sample results
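The drill-down case, for example, maps onto a query that starts from a provider identified in the summary and pulls its lowest-scoring individual results, keeping the raw `data` JSON for the detail view. A sketch against the `benchmarks` view (provider and model values are illustrative):

```sql
-- Lowest-scoring results for one provider, with full detail JSON.
SELECT
    sample_tag,
    CAST(avg_score AS DOUBLE) AS avg_score,
    data->'metrics'           AS metric_details
FROM benchmarks
WHERE type = 'result'
  AND provider = 'openai'
  AND model = 'gpt-4'
ORDER BY avg_score ASC
LIMIT 10;
```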
## Future Extensions
The schema is designed to support:
- Multi-model comparisons within the same provider
- Parameter sensitivity analysis (temperature, etc.)
- Cross-suite benchmark aggregation
- Historical trend analysis
- Custom metric plugins