Quick start
Import a dataset
Run an evaluation
Compare models
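A minimal sketch of this flow, assuming rake task names `evals:run` and `evals:compare` (only `evals:import_dataset` is confirmed elsewhere in this document):

```bash
# Quick-start sketch. Only evals:import_dataset is confirmed by this document;
# evals:run and evals:compare are assumed task names.

# 1. Import a dataset from a JSON file
rake "evals:import_dataset[datasets/categorization_golden_v1.json]"

# 2. Run an evaluation against it (hypothetical task name)
rake "evals:run[categorization,categorization_golden_v1]"

# 3. Compare two models on the same dataset (hypothetical task name)
rake "evals:compare[gpt-4o,llama-3.1-70b,categorization_golden_v1]"
```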
Available commands
Dataset management
Running evaluations
Viewing results
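The task names behind each group aren't listed here; standard Rake can enumerate whatever tasks the app actually defines:

```bash
# List all evals tasks defined in the app (standard Rake behavior)
rake -T evals
```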
Langfuse integration
Track experiments in Langfuse for side-by-side comparison and analysis.
Setup
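Langfuse clients are conventionally configured with the environment variables below; the exact names this app reads are an assumption, so verify them against its configuration:

```bash
# Conventional Langfuse credential variables (assumed; verify in app config)
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"  # or your self-hosted URL
```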
Commands
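A hypothetical invocation, assuming the experiment is triggered through the same `evals:run` task sketched in the quick start with a Langfuse target (not confirmed by this document):

```bash
# Hypothetical: run an evaluation and record it as a Langfuse experiment
rake "evals:run[categorization,categorization_golden_v1,langfuse]"
```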
What gets created
When you run a Langfuse experiment, the system creates:
- Dataset - Named `eval_<your_dataset_name>` with all samples
- Traces - One per sample showing input/output
- Scores - Accuracy scores (0.0 or 1.0) for each trace
- Dataset Runs - Links traces to dataset items for comparison
In the Langfuse dashboard you can:
- Compare runs side-by-side
- Filter by score, model, or metadata
- Track accuracy over time
- Analyze per-sample results
Evaluation types
Categorization
Tests transaction categorization accuracy across difficulty levels. Metrics:
- Accuracy
- Precision, recall, F1 score
- Null accuracy (correctly returning null for ambiguous transactions)
- Hierarchical accuracy (matching parent categories)
- Per-difficulty breakdown
Datasets:
- `categorization_golden_v1` - 100 samples, US merchants
- `categorization_golden_v1_light` - 50 samples, quick testing
- `categorization_golden_v2` - 200 samples, US and European merchants
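For fast iteration, the light dataset halves the sample count; a sketch, again assuming the hypothetical `evals:run` task from the quick start:

```bash
# 50 samples instead of 100/200 for a quicker feedback loop (task name assumed)
rake "evals:run[categorization,categorization_golden_v1_light]"
```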
Merchant detection
Tests business name and URL detection from transaction descriptions. Metrics:
- Name accuracy (exact match)
- Fuzzy name accuracy (similarity threshold)
- URL accuracy
- False positive/negative rates
- Average fuzzy score
Datasets:
- `merchant_detection_golden_v1` - 90 samples
Chat assistant
Tests function calling and response quality for the AI assistant. Metrics:
- Function selection accuracy
- Parameter accuracy
- Response relevance
- Exact match rate
- Error rate
Datasets:
- `chat_golden_v1` - 50 samples
Creating custom datasets
Export your manually categorized transactions as a golden dataset. A transaction qualifies when:
- Category was manually set by the user
- Category was NOT set by AI, rules, or data enrichment
Then import the exported file with `rake evals:import_dataset[path]`.
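A sketch of the full round trip, with `evals:export_golden` standing in for the real export task (only `evals:import_dataset` is confirmed by this document):

```bash
# Export manually categorized transactions (hypothetical task name)
rake "evals:export_golden[tmp/my_golden_dataset.json]"

# Import the exported file as a golden dataset (confirmed task)
rake "evals:import_dataset[tmp/my_golden_dataset.json]"
```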
JSON mode configuration
Control how the LLM outputs structured data. Configure via environment variable or the Settings UI. Modes:
- `auto` - Tries strict first, falls back to `none` if >50% of responses fail (recommended)
- `strict` - Best for thinking models (qwen-thinking, deepseek-reasoner)
- `none` - Best for standard models (llama, mistral, gpt-oss)
- `json_object` - Middle ground, broader compatibility
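The environment variable's name is not given here; `LLM_JSON_MODE` below is a placeholder for whatever the app actually reads:

```bash
# Placeholder variable name; check the Settings UI or app config for the real one
export LLM_JSON_MODE=auto   # one of: auto | strict | none | json_object
```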