---
title: AI_Bookkeeper_Leaderboard
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---

# AI Bookkeeper Leaderboard

A comprehensive benchmark for evaluating AI models on accounting document processing tasks. This benchmark focuses on real-world accounting scenarios and provides detailed metrics across key capabilities.

[View Live Demo](https://huggingface.co/spaces/jenesys-ai/ai_bookkeeper_leaderboard)

## Models Evaluated

- Ark II (Jenesys AI) - 17.94s inference time
- Ark I (Jenesys AI) - 7.955s inference time
- Claude-3-5-Sonnet (Anthropic) - 26.51s inference time
- GPT-4o (OpenAI) - 19.88s inference time

## Categories and Raw Data Points

The benchmark evaluates models across four main categories, each with specific raw data points:

1. **Document Understanding** (25%)
   - Invoice ID Detection
   - Date Field Recognition
   - Line Items Total

   `Average = (Invoice ID + Date + Line Items Total) / 3`

2. **Data Extraction** (25%)
   - Supplier Information
   - Line Items Quantity
   - Line Items Description
   - VAT Number
   - Line Items Total

   `Average = (Supplier + Quantity + Description + VAT_Number + Total) / 5`

3. **Bookkeeping Intelligence** (25%)
   - Discount Total
   - Line Items VAT
   - VAT Exclusive Amount
   - VAT Number Validation
   - Discount Verification

   `Average = (Discount + VAT_Items + VAT_Exclusive + VAT_Number + Discount_Verification) / 5`

4. **Error Handling** (25%)
   - Mean Accuracy (direct measure)
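The scoring scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `app.py`: the names `WEIGHTS`, `category_scores`, and `overall_score` are assumptions, and combining the four categories into a single weighted score is inferred from the 25% weights rather than confirmed by the source. The input values are the Ark II data points reported under Model Performance.

```python
from statistics import mean

# Category weights as stated in the list above (25% each).
WEIGHTS = {
    "document_understanding": 0.25,
    "data_extraction": 0.25,
    "bookkeeping_intelligence": 0.25,
    "error_handling": 0.25,
}


def category_scores(raw):
    """Average the raw data points of each category (values in [0, 1])."""
    return {name: mean(points) for name, points in raw.items()}


def overall_score(scores):
    """Weighted sum of category scores (assumed aggregation, see lead-in)."""
    return sum(WEIGHTS[name] * score for name, score in scores.items())


# Ark II raw data points, in the order the metrics are listed above.
ark_ii_raw = {
    "document_understanding": [0.733, 0.887, 0.803],
    "data_extraction": [0.735, 0.882, 0.555, 0.768, 0.803],
    "bookkeeping_intelligence": [0.800, 0.590, 0.694, 0.768, 0.800],
    "error_handling": [0.718],  # mean accuracy, used directly
}

scores = category_scores(ark_ii_raw)
overall = overall_score(scores)

for name, score in scores.items():
    print(f"{name}: {score:.1%}")
print(f"overall (equal weights): {overall:.1%}")
```

Because the weights are equal, the weighted sum reduces to a plain mean of the four category scores; keeping the weights explicit makes it easier to experiment with a different split.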

## Model Performance

### Ark II
- Document Understanding: 80.8% (0.733, 0.887, 0.803)
- Data Extraction: 74.9% (0.735, 0.882, 0.555, 0.768, 0.803)
- Bookkeeping Intelligence: 73.0% (0.800, 0.590, 0.694, 0.768, 0.800)
- Error Handling: 71.8%

### Ark I
- Document Understanding: 78.5% (0.747, 0.905, 0.703)
- Data Extraction: 70.9% (0.792, 0.811, 0.521, 0.719, 0.703)
- Bookkeeping Intelligence: 56.9% (0.600, 0.434, 0.491, 0.719, 0.600)
- Error Handling: 64.1%

### Claude-3-5-Sonnet
- Document Understanding: 70.4% (0.773, 0.806, 0.533)
- Data Extraction: 60.9% (0.706, 0.597, 0.504, 0.708, 0.533)
- Bookkeeping Intelligence: 62.8% (0.600, 0.524, 0.706, 0.708, 0.600)
- Error Handling: 67.5%

### GPT-4o
- Document Understanding: 69.6% (0.600, 0.917, 0.571)
- Data Extraction: 68.9% (0.818, 0.722, 0.619, 0.714, 0.571)
- Bookkeeping Intelligence: 25.5% (0.000, 0.313, 0.250, 0.714, 0.000)
- Error Handling: 68.3%

## Key Findings

- Ark II scores highest in all four categories, with its best result in document understanding (80.8%)
- Ark I shows strong performance relative to its size, especially in document understanding (78.5%)
- Claude-3-5-Sonnet performs consistently across categories, ranging from 60.9% to 70.4%
- GPT-4o is competitive in document understanding (69.6%) and data extraction (68.9%) but struggles with bookkeeping intelligence (25.5%), scoring 0.000 on both discount metrics
- Ark I has the fastest inference time (7.955s), less than half that of the next-fastest model, Ark II (17.94s)
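One way to compare the models overall is to apply the equal 25% category weights to the reported category scores. The sketch below does this; the `REPORTED` table and the `overall` helper are illustrative names, and the resulting single-number score is an assumption about how the categories combine, not a figure published by the benchmark.

```python
# Reported category scores (percent), copied from the Model Performance section.
# Order: Document Understanding, Data Extraction,
#        Bookkeeping Intelligence, Error Handling.
REPORTED = {
    "Ark II": (80.8, 74.9, 73.0, 71.8),
    "Ark I": (78.5, 70.9, 56.9, 64.1),
    "Claude-3-5-Sonnet": (70.4, 60.9, 62.8, 67.5),
    "GPT-4o": (69.6, 68.9, 25.5, 68.3),
}


def overall(scores):
    """Equal-weight (25% each) mean of the four category scores."""
    return sum(scores) / len(scores)


# Sort model names from highest to lowest equal-weight score.
ranking = sorted(REPORTED, key=lambda m: overall(REPORTED[m]), reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {overall(REPORTED[model]):.1f}%")
```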

## Interactive Dashboard Features

The dashboard provides several interactive visualizations:

1. **Overall Leaderboard**: Comprehensive view of all models' performance metrics
2. **Category Comparison**: Bar chart comparing all models across the four main categories
3. **Combined Radar Chart**: Multi-model comparison showing relative strengths and weaknesses
4. **Detailed Metrics**: Interactive comparison table showing the differences between the selected model and Ark II
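The baseline comparison behind the Detailed Metrics table can be approximated as a per-category difference against Ark II. The following is a minimal sketch using only the standard library; `compare_to_baseline` is a hypothetical helper, and the dashboard's actual implementation in `app.py` may compute or display the differences differently.

```python
CATEGORIES = (
    "Document Understanding",
    "Data Extraction",
    "Bookkeeping Intelligence",
    "Error Handling",
)

# Category scores (percent) taken from the Model Performance section.
SCORES = {
    "Ark II": (80.8, 74.9, 73.0, 71.8),
    "GPT-4o": (69.6, 68.9, 25.5, 68.3),
}


def compare_to_baseline(model, baseline="Ark II"):
    """Return per-category differences (percentage points) versus the baseline."""
    return {
        category: round(score - base, 1)
        for category, score, base in zip(
            CATEGORIES, SCORES[model], SCORES[baseline]
        )
    }


deltas = compare_to_baseline("GPT-4o")
for category, delta in deltas.items():
    # Negative values mean the selected model trails the baseline.
    print(f"{category}: {delta:+.1f} pp")
```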

## Running the Leaderboard

1. Install dependencies:
   ```bash
   pip install gradio pandas plotly
   ```

2. Run the app:
   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to view the interactive dashboard.

## Visualization Features

- Color-coded performance indicators
- Comparative analysis with Ark II as baseline
- Interactive model selection for detailed comparisons
- Multi-model radar chart for performance pattern analysis
- Dynamic updates of comparative metrics

## Contributing

To add new model evaluations:
1. Add model scores following the established format in the `MODELS` dictionary
2. Include all required metrics for each category
3. Provide model metadata (version, type, provider, size, inference time)
4. Follow the existing structure in `app.py`
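Before adding an entry, it can help to check that nothing is missing. The sketch below assumes a hypothetical entry layout with `metadata` and `scores` keys; the real structure of `MODELS` in `app.py` is not shown here and may differ, so treat the key names and the `validate_entry` helper as illustrative only. The example values are placeholders, not benchmark results.

```python
# Hypothetical schema: the real MODELS layout in app.py may differ.
REQUIRED_METADATA = {"version", "type", "provider", "size", "inference_time"}

# Number of raw data points expected per category, per the lists above.
REQUIRED_METRICS = {
    "document_understanding": 3,
    "data_extraction": 5,
    "bookkeeping_intelligence": 5,
    "error_handling": 1,
}


def validate_entry(entry):
    """Return a list of problems found in a candidate entry (empty if none)."""
    problems = []
    missing = REQUIRED_METADATA - set(entry.get("metadata", {}))
    if missing:
        problems.append(f"missing metadata: {sorted(missing)}")
    for category, count in REQUIRED_METRICS.items():
        values = entry.get("scores", {}).get(category, [])
        if len(values) != count:
            problems.append(
                f"{category}: expected {count} values, got {len(values)}"
            )
        elif not all(0.0 <= v <= 1.0 for v in values):
            problems.append(f"{category}: values must lie in [0, 1]")
    return problems


# Placeholder entry for illustration only (not real benchmark data).
example_entry = {
    "metadata": {
        "version": "1.0",
        "type": "example",
        "provider": "Example Co",
        "size": "unknown",
        "inference_time": 12.3,
    },
    "scores": {
        "document_understanding": [0.70, 0.80, 0.75],
        "data_extraction": [0.70, 0.80, 0.60, 0.70, 0.75],
        "bookkeeping_intelligence": [0.60, 0.50, 0.60, 0.70, 0.60],
        "error_handling": [0.65],
    },
}

print(validate_entry(example_entry) or "entry looks complete")
```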

## License

MIT License