Advanced Metrics with LLM Judge
In this tutorial, you'll learn to use LLM Judge to create sophisticated evaluation metrics that go beyond simple rule-based checks. LLM Judge uses AI to evaluate AI, enabling complex quality assessments.
What You'll Learn​
- How to use LLM Judge for complex evaluations
- When to use LLM Judge vs. rule-based metrics
- How to write effective claims for LLM Judge
- Best practices for reliable LLM Judge metrics
- How to combine LLM Judge with traditional metrics
Prerequisites​
- Completed Writing Your First Evaluation
- Understanding of basic metrics and suites
- API keys for LLM providers (OpenAI, Anthropic, etc.)
Introduction to LLM Judge​
What is LLM Judge?​
LLM Judge is a TrainLoop feature that uses large language models to evaluate the quality of LLM outputs. It's particularly useful for:
- Subjective quality assessment - Tone, helpfulness, clarity
- Complex reasoning evaluation - Logical consistency, accuracy
- Domain-specific criteria - Professional standards, style guidelines
- Nuanced semantic understanding - Intent matching, context awareness
When to Use LLM Judge​
| Use LLM Judge For | Use Rule-Based Metrics For |
|---|---|
| Subjective quality (tone, helpfulness) | Objective criteria (length, format) |
| Complex reasoning evaluation | Simple pattern matching |
| Domain-specific expertise | Universal standards |
| Nuanced understanding | Performance-critical checks |
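For example, a check like the following stays rule-based because it is objective, deterministic, and free to run (the function and thresholds here are purely illustrative); the subjective checks in the rest of this tutorial are where LLM Judge is worth the extra cost and latency:

```python
# Rule-based metric: objective, deterministic, and cheap -- no LLM call needed
def appropriate_length(sample: Sample) -> int:
    """Pass if the response is between 20 and 1000 characters (illustrative thresholds)."""
    content = sample.output.get("content", "")
    return 1 if 20 <= len(content) <= 1000 else 0
```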
Basic LLM Judge Usage​
The `assert_true` Function​
The core LLM Judge function compares two claims:
```python
from trainloop_cli.eval_core.judge import assert_true

def is_helpful_response(sample: Sample) -> int:
    """Check if response is helpful using LLM Judge"""
    response = sample.output.get("content", "")
    positive_claim = f"The response '{response}' is helpful and provides useful information."
    negative_claim = f"The response '{response}' is not helpful and doesn't provide useful information."
    return assert_true(positive_claim, negative_claim)
```
How LLM Judge Works​
1. Claim Generation - You provide positive and negative claims
2. LLM Evaluation - Multiple LLMs evaluate which claim is more true
3. Consensus Building - Results are aggregated across multiple calls (sketched below)
4. Binary Result - Returns 1 if the positive claim wins, 0 if the negative claim wins
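TrainLoop performs steps 2–4 for you inside `assert_true`; the sketch below only illustrates the general idea of majority-vote consensus and is not the library's actual implementation.

```python
def majority_vote(votes: list[int]) -> int:
    """Illustration only: 1 if most judge calls preferred the positive claim."""
    return 1 if sum(votes) > len(votes) / 2 else 0

# e.g. 3 models x 3 calls per model = 9 votes per claim pair
print(majority_vote([1, 1, 0, 1, 1, 1, 0, 1, 1]))  # 1 -> positive claim wins
```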
Advanced LLM Judge Patterns​
Context-Aware Evaluation​
```python
def matches_user_intent(sample: Sample) -> int:
    """Check if response matches user's intent"""
    user_message = ""
    for msg in sample.input.get("messages", []):
        if msg.get("role") == "user":
            user_message = msg.get("content", "")
            break
    response = sample.output.get("content", "")
    positive_claim = f"The response '{response}' directly addresses the user's request: '{user_message}'"
    negative_claim = f"The response '{response}' does not address the user's request: '{user_message}'"
    return assert_true(positive_claim, negative_claim)
```
Domain-Specific Evaluation​
```python
def follows_medical_guidelines(sample: Sample) -> int:
    """Check if response follows medical communication guidelines"""
    response = sample.output.get("content", "")
    positive_claim = f"""The response '{response}' follows proper medical communication guidelines by:
    - Being accurate and evidence-based
    - Avoiding definitive diagnoses
    - Recommending professional consultation when appropriate
    - Using clear, accessible language"""
    negative_claim = f"""The response '{response}' violates medical communication guidelines by:
    - Making unsupported claims
    - Providing definitive diagnoses
    - Failing to recommend professional consultation
    - Using confusing or inappropriate language"""
    return assert_true(positive_claim, negative_claim)
```
Multi-Criteria Evaluation​
```python
def is_professional_customer_service(sample: Sample) -> int:
    """Evaluate multiple aspects of customer service quality"""
    response = sample.output.get("content", "")
    positive_claim = f"""The response '{response}' demonstrates excellent customer service by:
    - Showing empathy and understanding
    - Providing clear, actionable solutions
    - Maintaining a professional yet friendly tone
    - Being concise while being thorough"""
    negative_claim = f"""The response '{response}' demonstrates poor customer service by:
    - Lacking empathy or understanding
    - Providing unclear or unhelpful solutions
    - Using inappropriate tone or language
    - Being either too brief or overly verbose"""
    return assert_true(positive_claim, negative_claim)
```
Writing Effective Claims​
Best Practices for Claims​
1. Be Specific and Detailed​
```python
# Good - Specific criteria
positive_claim = f"The response '{response}' is helpful because it provides specific steps, explains the reasoning, and offers alternatives."

# Bad - Too vague
positive_claim = f"The response '{response}' is good."
```
2. Make Claims Mutually Exclusive​
```python
# Good - Clear opposite claims
positive_claim = f"The response '{response}' is factually accurate and well-supported."
negative_claim = f"The response '{response}' contains factual errors or unsupported claims."

# Bad - Not mutually exclusive
positive_claim = f"The response '{response}' is accurate."
negative_claim = f"The response '{response}' is confusing."
```
3. Include Context When Relevant​
```python
def appropriate_response_tone(sample: Sample) -> int:
    """Check if response tone matches the context"""
    user_message = ""
    for msg in sample.input.get("messages", []):
        if msg.get("role") == "user":
            user_message = msg.get("content", "")
            break
    response = sample.output.get("content", "")

    # Include context in claims
    positive_claim = f"Given the user's message '{user_message}', the response '{response}' uses an appropriate tone that matches the context and user's emotional state."
    negative_claim = f"Given the user's message '{user_message}', the response '{response}' uses an inappropriate tone that doesn't match the context or user's emotional state."
    return assert_true(positive_claim, negative_claim)
```
Complex Evaluation Scenarios​
Evaluating Reasoning and Logic​
```python
def has_logical_reasoning(sample: Sample) -> int:
    """Check if response demonstrates logical reasoning"""
    response = sample.output.get("content", "")
    positive_claim = f"""The response '{response}' demonstrates clear logical reasoning by:
    - Presenting information in a logical sequence
    - Making valid inferences from given information
    - Avoiding logical fallacies
    - Drawing appropriate conclusions"""
    negative_claim = f"""The response '{response}' lacks logical reasoning by:
    - Presenting information in a confusing order
    - Making invalid inferences
    - Containing logical fallacies
    - Drawing inappropriate conclusions"""
    return assert_true(positive_claim, negative_claim)
```
Evaluating Creativity and Originality​
```python
def is_creative_response(sample: Sample) -> int:
    """Check if response is creative and original"""
    response = sample.output.get("content", "")
    positive_claim = f"""The response '{response}' demonstrates creativity by:
    - Offering unique or novel perspectives
    - Using imaginative language or examples
    - Providing original insights or solutions
    - Avoiding clichéd or generic content"""
    negative_claim = f"""The response '{response}' lacks creativity by:
    - Offering only conventional perspectives
    - Using predictable language or examples
    - Providing generic insights or solutions
    - Relying on clichéd or overused content"""
    return assert_true(positive_claim, negative_claim)
```
Evaluating Completeness​
```python
def provides_complete_answer(sample: Sample) -> int:
    """Check if response completely addresses the question"""
    user_message = ""
    for msg in sample.input.get("messages", []):
        if msg.get("role") == "user":
            user_message = msg.get("content", "")
            break
    response = sample.output.get("content", "")
    positive_claim = f"""The response '{response}' provides a complete answer to the question '{user_message}' by:
    - Addressing all parts of the question
    - Providing sufficient detail and explanation
    - Covering relevant aspects and considerations
    - Offering actionable information where appropriate"""
    negative_claim = f"""The response '{response}' provides an incomplete answer to the question '{user_message}' by:
    - Ignoring parts of the question
    - Providing insufficient detail or explanation
    - Missing relevant aspects or considerations
    - Failing to offer actionable information"""
    return assert_true(positive_claim, negative_claim)
```
Combining LLM Judge with Traditional Metrics​
Hybrid Evaluation Suite​
```python
# trainloop/eval/suites/hybrid_evaluation.py
from trainloop_cli.eval_core.helpers import tag

from ..metrics.rule_based import has_greeting_word, appropriate_length, contains_contact_info
from ..metrics.llm_judge import is_helpful_response, matches_user_intent, is_professional_tone

# Combine rule-based and LLM Judge metrics
results = tag("customer-support").check(
    # Fast rule-based checks
    has_greeting_word,
    appropriate_length,
    contains_contact_info,
    # Sophisticated LLM Judge evaluations
    is_helpful_response,
    matches_user_intent,
    is_professional_tone,
)
```
Performance-Optimized Approach​
```python
def smart_evaluation_suite(sample: Sample) -> dict:
    """Use rule-based metrics first, LLM Judge for edge cases"""
    results = {}

    # Fast rule-based checks first
    results['has_greeting'] = has_greeting_word(sample)
    results['appropriate_length'] = appropriate_length(sample)

    # Only use LLM Judge for complex cases
    if results['has_greeting'] and results['appropriate_length']:
        results['is_helpful'] = is_helpful_response(sample)
        results['matches_intent'] = matches_user_intent(sample)
    else:
        # Skip expensive LLM Judge checks for obviously bad responses
        results['is_helpful'] = 0
        results['matches_intent'] = 0

    return results
```
Configuration and Optimization​
Configuring LLM Judge​
```yaml
# trainloop.config.yaml
trainloop:
  judge:
    # Models to use for evaluation
    models:
      - openai/gpt-4o
      - anthropic/claude-3-sonnet-20240229
      - openai/gpt-4o-mini

    # Number of calls per model per claim
    calls_per_model_per_claim: 3

    # Temperature for consistency
    temperature: 0.1

    # Maximum tokens for judge responses
    max_tokens: 100

    # Timeout for judge calls
    timeout: 30
```
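Keep in mind that judge cost scales with these settings: with three models and `calls_per_model_per_claim: 3`, each claim is checked on the order of 3 × 3 = 9 times, so raising either value improves stability at a proportional increase in latency and API spend. (The exact call count depends on how the judge handles the positive/negative claim pair.)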
Optimizing Performance​
1. Use Caching​
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_helpfulness_check(response_content: str) -> int:
    """Cache LLM Judge results for identical responses"""
    positive_claim = f"The response '{response_content}' is helpful and informative."
    negative_claim = f"The response '{response_content}' is unhelpful and uninformative."
    return assert_true(positive_claim, negative_claim)

def is_helpful_cached(sample: Sample) -> int:
    response = sample.output.get("content", "")
    return cached_helpfulness_check(response)
```
2. Batch Similar Evaluations​
```python
from typing import List

def batch_tone_evaluation(samples: List[Sample]) -> List[int]:
    """Evaluate tone for multiple samples with a shared claim template"""
    results = []
    for sample in samples:
        response = sample.output.get("content", "")
        positive_claim = f"The response '{response}' has a professional and appropriate tone."
        negative_claim = f"The response '{response}' has an unprofessional or inappropriate tone."
        results.append(assert_true(positive_claim, negative_claim))
    return results
```
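The loop above still issues judge calls one sample at a time. If the judge calls in your environment are independent and safe to run concurrently (an assumption you should verify against your provider rate limits), a thread pool is a simple way to overlap the network latency:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def parallel_tone_evaluation(samples: List[Sample], max_workers: int = 4) -> List[int]:
    """Sketch: run the same tone check across samples concurrently."""
    def check_one(sample: Sample) -> int:
        response = sample.output.get("content", "")
        positive_claim = f"The response '{response}' has a professional and appropriate tone."
        negative_claim = f"The response '{response}' has an unprofessional or inappropriate tone."
        return assert_true(positive_claim, negative_claim)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(check_one, samples))
```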
Testing and Validation​
Test LLM Judge Metrics​
```python
def test_llm_judge_consistency():
    """Test that LLM Judge metrics are consistent"""
    sample = Sample(
        input={"messages": [{"role": "user", "content": "Hello, how are you?"}]},
        output={"content": "Hello! I'm doing well, thank you for asking. How can I help you today?"}
    )

    # Run the same metric multiple times
    results = []
    for _ in range(5):
        results.append(is_helpful_response(sample))

    # Check consistency (the scores should be mostly the same)
    consistency_rate = sum(results) / len(results)
    print(f"Consistency rate: {consistency_rate}")

    # Should be either very high (>0.8) or very low (<0.2)
    assert consistency_rate > 0.8 or consistency_rate < 0.2, "Inconsistent LLM Judge results"
```
Validate Against Human Judgment​
```python
def validate_against_human_judgment():
    """Compare LLM Judge results with human evaluations"""
    test_cases = [
        {
            "sample": Sample(
                input={"messages": [{"role": "user", "content": "Explain quantum computing"}]},
                output={"content": "Quantum computing is complicated stuff with atoms and things."}
            ),
            "human_rating": 0,  # Human says this is not helpful
        },
        {
            "sample": Sample(
                input={"messages": [{"role": "user", "content": "Explain quantum computing"}]},
                output={"content": "Quantum computing uses quantum mechanical properties like superposition and entanglement to process information in fundamentally different ways than classical computers, potentially solving certain problems exponentially faster."}
            ),
            "human_rating": 1,  # Human says this is helpful
        },
    ]

    agreement_count = 0
    for test_case in test_cases:
        llm_rating = is_helpful_response(test_case["sample"])
        if llm_rating == test_case["human_rating"]:
            agreement_count += 1

    agreement_rate = agreement_count / len(test_cases)
    print(f"Agreement with human judgment: {agreement_rate:.2%}")
```
Best Practices Summary​
1. Start with Rule-Based, Add LLM Judge​
```python
# Good progression
def comprehensive_quality_check(sample: Sample) -> int:
    # Quick rule-based elimination
    if not has_minimum_length(sample):
        return 0
    if not contains_required_elements(sample):
        return 0

    # Sophisticated LLM Judge evaluation
    return is_high_quality_response(sample)
```
2. Use Specific, Detailed Claims​
```python
# Good - Specific and actionable
positive_claim = f"The response '{response}' provides accurate, step-by-step instructions that are easy to follow and include necessary warnings."

# Bad - Vague and subjective
positive_claim = f"The response '{response}' is good."
```
3. Monitor and Validate Results​
```python
# Add monitoring to your LLM Judge metrics
def monitored_helpfulness_check(sample: Sample) -> int:
    result = is_helpful_response(sample)

    # Log for analysis
    log_metric_result("helpfulness", result, sample.output.get("content", ""))
    return result
```
Common Pitfalls and Solutions​
1. Inconsistent Results​
Problem: LLM Judge returns different results for the same input
Solution:
- Lower temperature in configuration
- Use more specific claims
- Increase number of calls per claim
2. Slow Performance​
Problem: LLM Judge metrics are too slow
Solution:
- Use caching for repeated content
- Combine with rule-based pre-filtering
- Use faster models for simple evaluations
3. Unreliable Evaluation​
Problem: LLM Judge doesn't match human judgment
Solution:
- Validate against human examples
- Refine claim wording
- Use multiple models for consensus
Next Steps​
You now know how to create sophisticated evaluation metrics using LLM Judge! Continue with:
- Benchmarking and Model Comparison - Compare different LLM providers
- Production Setup - Deploy evaluations in CI/CD pipelines
Troubleshooting​
Common Issues​
LLM Judge Calls Failing​
- Check API keys and rate limits (a quick sanity check is sketched below)
- Verify model names in configuration
- Check network connectivity
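For the first point, a quick sanity check before running evals, assuming the conventional environment-variable names for OpenAI and Anthropic (adjust the list to whichever providers your trainloop.config.yaml references):

```python
import os

# Conventional provider key names -- adjust to match your configured models
required_keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = [key for key in required_keys if not os.environ.get(key)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
```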
Unexpected Results​
- Review claim wording for clarity
- Test with known examples
- Check for model-specific biases
Performance Issues​
- Implement caching for repeated evaluations
- Use rule-based pre-filtering
- Consider using faster/cheaper models
Ready to compare different LLM providers? Continue with Benchmarking and Model Comparison!