Code Style Guide

This guide defines the coding standards and conventions used across the TrainLoop Evals project. Following these guidelines ensures consistency, readability, and maintainability.

General Principles

Code Quality Standards

  1. Readability First - Code is read 10× more than it's written
  2. Consistency - Follow established patterns throughout the codebase
  3. Simplicity - Prefer simple, clear solutions over complex ones
  4. Single Responsibility - Each function/class should have one reason to change
  5. DRY (Don't Repeat Yourself) - Avoid code duplication (see the sketch after this list)
  6. YAGNI (You Aren't Gonna Need It) - Don't build features until needed
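
To illustrate single responsibility and DRY together, here is a minimal, purely hypothetical refactor sketch (none of these functions exist in the codebase):

# Before: one function parses, validates, and saves; three reasons to change
def handle_upload(raw: str) -> None:
    ...

# After: each step has a single responsibility, and shared logic lives in
# one place instead of being copy-pasted at every call site
def parse_event(raw: str) -> dict:
    ...

def validate_event(event: dict) -> None:
    ...

def save_event(event: dict) -> None:
    ...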

File Organization

  • Use clear, descriptive file and directory names
  • Group related functionality together
  • Keep file lengths reasonable (< 500 lines for most files)
  • Use consistent file naming conventions

Language-Specific Guidelines

Python

TrainLoop Evals uses Python 3.9+ with modern Python conventions.

Code Formatting

We use Black for code formatting with the following configuration:

# pyproject.toml
[tool.black]
line-length = 88
target-version = ['py39']
include = '\.pyi?$'
extend-exclude = '''
/(
    # directories
    __pycache__
    | \.git
    | \.hg
    | \.mypy_cache
    | \.tox
    | \.venv
    | _build
    | buck-out
    | build
    | dist
)/
'''
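
With this configuration in place, formatting the whole repository is a single command:

# Format all Python files in place
black .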

Linting

We use flake8 for linting with these rules:

# .flake8
[flake8]
max-line-length = 88
extend-ignore = E203, E266, E501, W503
max-complexity = 10
exclude = .git,__pycache__,dist,build,.venv

Import Organization

Use isort for import sorting:

# Standard library imports
import os
import sys
from pathlib import Path

# Third-party imports
import click
import yaml
from pydantic import BaseModel

# Local application imports
from trainloop_cli.commands.utils import load_config
from trainloop_cli.eval_core.types import EvalResult
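
isort can be told to match Black's formatting via its built-in profile. A typical configuration (an assumption for illustration, not copied from this project's pyproject.toml) looks like:

# pyproject.toml
[tool.isort]
profile = "black"
line_length = 88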

Type Hints

Use type hints for all public functions and complex private functions:

from pathlib import Path
from typing import Dict, Union

import yaml

def load_config(config_path: Path) -> Dict[str, Union[str, int, bool]]:
    """Load configuration from a YAML file.

    Args:
        config_path: Path to configuration file

    Returns:
        Dictionary containing configuration values

    Raises:
        FileNotFoundError: If config file doesn't exist
        yaml.YAMLError: If config file is malformed
    """
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    with open(config_path, 'r') as f:
        return yaml.safe_load(f)
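
A typical call site (the file name here is illustrative) then reads naturally:

config = load_config(Path("config.yaml"))
debug_mode = config.get("debug", False)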

Function and Class Conventions

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional

# Good: Clear function names with type hints
def calculate_metrics(events: List[Dict], suite_name: str) -> EvalResult:
    """Calculate evaluation metrics for a suite of events."""
    pass

# Good: Class naming with clear purpose
class MetricsCalculator:
    """Handles calculation of evaluation metrics."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config
        self._cache: Dict[str, Any] = {}

    def calculate(self, events: List[Dict]) -> EvalResult:
        """Calculate metrics for the given events."""
        pass

# Good: Use dataclasses for data structures
@dataclass
class EvalConfig:
    """Configuration for evaluation runs."""
    suite_name: str
    data_folder: Path
    output_format: str = "jsonl"
    max_workers: int = 4
    timeout: Optional[int] = None
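
Because dataclasses generate the constructor, instantiation stays concise (the values here are illustrative):

config = EvalConfig(suite_name="basic", data_folder=Path("./data"))
print(config.output_format)  # "jsonl": defaults are applied automatically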

Error Handling

# Good: Specific exception handling
try:
    config = load_config(config_path)
except FileNotFoundError:
    logger.error(f"Config file not found: {config_path}")
    raise
except yaml.YAMLError as e:
    logger.error(f"Invalid YAML in config file: {e}")
    raise

# Good: Custom exceptions for domain-specific errors
class EvaluationError(Exception):
    """Base exception for evaluation-related errors."""

class MetricNotFoundError(EvaluationError):
    """Raised when a requested metric is not available."""
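
Raising these with a clear message keeps failures easy to diagnose; a brief sketch (the registry and metric name are hypothetical):

try:
    metric = metric_registry["helpfulness"]
except KeyError:
    raise MetricNotFoundError("Metric 'helpfulness' is not registered") from None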

TypeScript

For the TypeScript SDK and UI components, we follow modern TypeScript conventions.

Code Formatting

Use Prettier with these settings:

// .prettierrc
{
  "semi": true,
  "trailingComma": "es5",
  "singleQuote": true,
  "printWidth": 80,
  "tabWidth": 2,
  "useTabs": false
}

Type Definitions

// Good: Use interfaces for object shapes
interface LLMEvent {
  id: string;
  timestamp: number;
  provider: string;
  model: string;
  prompt: string;
  response: string;
  metadata?: Record<string, unknown>;
}

// Good: Use union types for known values
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

// Good: Use generic types for reusable functions
function processEvents<T extends LLMEvent>(
  events: T[],
  processor: (event: T) => T
): T[] {
  return events.map(processor);
}

React Component Conventions

// Good: Functional components with TypeScript
import React from 'react';

interface DashboardProps {
  title: string;
  events: LLMEvent[];
  onEventSelect?: (event: LLMEvent) => void;
}

export const Dashboard: React.FC<DashboardProps> = ({
  title,
  events,
  onEventSelect,
}) => {
  return (
    <div className="dashboard">
      <h1>{title}</h1>
      {events.map((event) => (
        <div key={event.id} onClick={() => onEventSelect?.(event)}>
          {event.provider}: {event.model}
        </div>
      ))}
    </div>
  );
};

Go

For the Go SDK, we follow standard Go conventions.

Code Formatting

Use gofmt and goimports for formatting:

# Format all Go files
go fmt ./...

# Organize imports
goimports -w .

Package Organization

// Good: Clear package documentation
// Package instrumentation provides HTTP instrumentation for TrainLoop logging.
package instrumentation

import (
	"net/http"
	"time"
)

// Good: Exported types with documentation
type Config struct {
	DataFolder    string        `json:"data_folder"`
	FlushInterval time.Duration `json:"flush_interval"`
}

// Good: Interface definitions
type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

// Good: Factory functions
func NewInstrumentedClient(client HTTPClient, config Config) HTTPClient {
	return &instrumentedClient{
		client: client,
		config: config,
	}
}

Documentation Standards

Docstrings and Comments

Python Docstrings

Use Google-style docstrings:

def evaluate_suite(
    suite_name: str,
    events: List[Dict],
    config: EvalConfig
) -> EvalResult:
    """Evaluate a suite of events against configured metrics.

    This function loads the specified evaluation suite and applies
    all configured metrics to the provided events.

    Args:
        suite_name: Name of the evaluation suite to run
        events: List of LLM events to evaluate
        config: Evaluation configuration

    Returns:
        EvalResult containing metrics and verdicts

    Raises:
        SuiteNotFoundError: If the specified suite doesn't exist
        MetricError: If any metric fails to execute

    Example:
        >>> config = EvalConfig(suite_name="basic", data_folder=Path("./data"))
        >>> events = load_events("events.jsonl")
        >>> result = evaluate_suite("basic", events, config)
        >>> print(f"Passed: {result.passed}/{result.total}")
    """

TypeScript JSDoc

/**
 * Collects and logs LLM events for evaluation.
 *
 * @param config - Configuration object for data collection
 * @param options - Optional parameters for collection behavior
 * @returns Promise that resolves when collection is initialized
 *
 * @example
 * ```typescript
 * await collect({
 *   dataFolder: './data',
 *   flushInterval: 5000
 * });
 * ```
 */
export async function collect(
  config: CollectionConfig,
  options?: CollectionOptions
): Promise<void> {
  // Implementation
}

Go Comments

// Config represents the configuration for TrainLoop logging.
// It contains settings for data storage, flush intervals, and other
// operational parameters.
type Config struct {
	// DataFolder is the directory where event data will be stored
	DataFolder string `json:"data_folder"`

	// FlushInterval determines how often buffered events are written
	FlushInterval time.Duration `json:"flush_interval"`
}

// NewConfig creates a new Config instance with default values.
// The default data folder is "./data" and flush interval is 10 seconds.
func NewConfig() *Config {
	return &Config{
		DataFolder:    "./data",
		FlushInterval: 10 * time.Second,
	}
}

Code Comments

# Good: Explain why, not what
def calculate_score(responses: List[str]) -> float:
    # Use the harmonic mean to penalize inconsistent responses more heavily
    # than the arithmetic mean would: for scores [0.9, 0.1] the arithmetic
    # mean is 0.5, but the harmonic mean is only ~0.18
    scores = [rate_response(r) for r in responses]
    positive = [s for s in scores if s > 0]
    if not positive:
        return 0.0
    return len(positive) / sum(1 / s for s in positive)

# Good: Explain complex algorithms
def find_optimal_threshold(metrics: List[float]) -> float:
    """Find the optimal threshold using Otsu's method."""
    # Otsu's method picks the threshold that maximizes between-class variance
    # (equivalently, minimizes within-class variance)
    histogram = create_histogram(metrics)
    # ... rest of implementation
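
For readers unfamiliar with Otsu's method, here is a minimal self-contained sketch; it uses NumPy's np.histogram in place of the create_histogram helper above, the bin count is an arbitrary choice, and this is an illustration rather than the project's actual implementation:

from typing import List

import numpy as np

def otsu_threshold(metrics: List[float], bins: int = 64) -> float:
    """Return the threshold that maximizes between-class variance."""
    hist, edges = np.histogram(metrics, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    total = hist.sum()
    sum_all = float((hist * centers).sum())

    best_threshold, best_variance = float(edges[0]), 0.0
    weight_bg, sum_bg = 0.0, 0.0
    for i in range(bins):
        weight_bg += hist[i]  # points in the background class so far
        weight_fg = total - weight_bg
        if weight_bg == 0 or weight_fg == 0:
            continue
        sum_bg += hist[i] * centers[i]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor of 1/total**2,
        # which does not affect the argmax
        between_variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_variance > best_variance:
            best_variance = between_variance
            best_threshold = float(edges[i + 1])
    return best_threshold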

Testing Standards

Test Organization

import pytest

# Good: Clear test structure
class TestMetricsCalculator:
    """Test suite for MetricsCalculator."""

    def setup_method(self):
        """Set up test fixtures."""
        self.calculator = MetricsCalculator(config={"timeout": 30})
        self.sample_events = [
            {"id": "1", "prompt": "Hello", "response": "Hi there"},
            {"id": "2", "prompt": "Goodbye", "response": "See you later"},
        ]

    def test_calculate_with_valid_events(self):
        """Test calculation with valid event data."""
        result = self.calculator.calculate(self.sample_events)

        assert result.total == 2
        assert result.passed >= 0
        assert result.failed >= 0
        assert result.passed + result.failed == result.total

    @pytest.mark.integration
    def test_calculate_with_llm_judge(self):
        """Test calculation using LLM judge integration."""
        # Integration test logic
        pass

Test Naming

  • Use descriptive test names: test_calculate_score_with_empty_responses
  • Group related tests in classes: TestMetricsCalculator
  • Use appropriate markers: @pytest.mark.unit (registered as shown below)
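
Custom markers such as integration must be registered so pytest doesn't warn about unknown marks; one way to do this (a sketch, assuming the marker names used in this guide) is in pyproject.toml:

# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: tests that touch external services",
]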

Performance Guidelines

Python Performance

# Good: Use list comprehensions for simple transformations
filtered_events = [e for e in events if e.get("score", 0) > 0.5]

# Good: Use generator expressions for large datasets
total_score = sum(e.get("score", 0) for e in events)

# Good: Use dataclasses for structured data
@dataclass
class CachedResult:
    result: EvalResult
    timestamp: float

# Good: Use proper caching
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_calculation(event_id: str) -> float:
    # Expensive calculation here
    pass

Memory Management

import json
from pathlib import Path
from typing import Dict, Iterator

# Good: Use context managers for resource cleanup
def process_large_file(file_path: Path) -> Iterator[Dict]:
    """Process a large JSONL file as a stream."""
    with open(file_path, 'r') as f:
        for line in f:
            yield json.loads(line)

# Good: Use generators for large datasets
def load_events_streaming(data_folder: Path) -> Iterator[Dict]:
    """Load events from multiple files without loading all into memory."""
    for file_path in data_folder.glob("*.jsonl"):
        yield from process_large_file(file_path)
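
Because both functions are generators, aggregate statistics can be computed over arbitrarily large datasets in near-constant memory, for example:

# Count events across every JSONL file without materializing them
total_events = sum(1 for _ in load_events_streaming(Path("./data")))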

Security Guidelines

Input Validation

# Good: Validate input parameters
def load_config(config_path: Path) -> Dict[str, Any]:
    """Load configuration from a file with validation."""
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    if not config_path.is_file():
        raise ValueError(f"Config path is not a file: {config_path}")

    # Validate file size (prevent denial of service via huge files)
    max_size = 10 * 1024 * 1024  # 10 MB
    if config_path.stat().st_size > max_size:
        raise ValueError(f"Config file too large: {config_path}")

    with open(config_path, 'r') as f:
        return yaml.safe_load(f)

Environment Variables

# Good: Use environment variables safely
import os
from pathlib import Path

def get_data_folder() -> Path:
    """Get the data folder from the environment."""
    data_folder = os.getenv("TRAINLOOP_DATA_FOLDER")
    if not data_folder:
        raise ValueError("TRAINLOOP_DATA_FOLDER environment variable not set")

    path = Path(data_folder).expanduser().resolve()
    if not path.exists():
        path.mkdir(parents=True, exist_ok=True)

    return path

Version Control

Commit Messages

Use the Conventional Commits format:

feat(cli): add benchmark command for model comparison

Add a new benchmark command that allows users to compare different
LLM models across evaluation metrics. The command supports multiple
providers and generates comparison reports.

- Add benchmark command with provider configuration
- Implement parallel model evaluation
- Add comparison report generation
- Include performance metrics and cost analysis

Closes #123

Branch Management

# Good: Descriptive branch names
git checkout -b feature/add-benchmark-command
git checkout -b fix/config-loading-error
git checkout -b docs/update-installation-guide

# Good: Keep branches focused
# One feature or fix per branch
# Regular rebasing to keep history clean

Tools and Automation

Pre-commit Hooks

Set up pre-commit hooks to enforce code quality:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.1.0
    hooks:
      - id: black
        language_version: python3.9

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
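
After adding the config, install the hooks once per clone and, optionally, run them across the whole tree:

# Install the git hook scripts
pre-commit install

# Run all hooks against all files
pre-commit run --all-files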

IDE Configuration

VS Code Settings

{
  "python.formatting.provider": "black",
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.linting.mypyEnabled": true,
  "editor.formatOnSave": true,
  "editor.codeActionsOnSave": {
    "source.organizeImports": true
  }
}

Best Practices Summary

Do's

  • ✅ Use descriptive names for variables, functions, and classes
  • ✅ Write comprehensive docstrings for public APIs
  • ✅ Add type hints to improve code clarity
  • ✅ Use consistent formatting tools (Black, Prettier, gofmt)
  • ✅ Write tests for all new functionality
  • ✅ Handle errors gracefully with appropriate exceptions
  • ✅ Use logging for debugging and monitoring
  • ✅ Follow the principle of least surprise
  • ✅ Keep functions small and focused
  • ✅ Use version control effectively

Don'ts

  • ❌ Don't use magic numbers or hardcoded values
  • ❌ Don't ignore error conditions
  • ❌ Don't write overly complex functions
  • ❌ Don't duplicate code across the codebase
  • ❌ Don't commit code without running tests
  • ❌ Don't use global variables unless absolutely necessary
  • ❌ Don't write code without documentation
  • ❌ Don't ignore linting warnings
  • ❌ Don't use deprecated APIs
  • ❌ Don't commit sensitive information

Code Review Checklist

When reviewing code, check for:

  • Functionality - Does the code do what it's supposed to do?
  • Style - Does the code follow the style guidelines?
  • Testing - Are there adequate tests for the new functionality?
  • Documentation - Is the code properly documented?
  • Performance - Are there any obvious performance issues?
  • Security - Are there any security vulnerabilities?
  • Maintainability - Is the code easy to understand and modify?
  • Error Handling - Are errors handled appropriately?
  • Edge Cases - Are edge cases considered and handled?
  • Backwards Compatibility - Are breaking changes justified and documented?

Resources

Getting Help

If you have questions about code style or need clarification on any guidelines:

  • Open a discussion on GitHub Discussions
  • Ask in your pull request if you're unsure about specific changes
  • Check existing code in the repository for examples
  • Refer to language-specific style guides for detailed formatting rules

Following these guidelines helps maintain a high-quality, consistent codebase that's easy for everyone to understand and contribute to.