Metadata-Version: 2.4
Name: abap-bench
Version: 4.0.0
Summary: ABAP-Bench: A Comprehensive Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization
Author: ABAP-Bench Team
License: Apache-2.0
Project-URL: Homepage, https://github.com/abap-bench/abap-bench
Project-URL: Documentation, https://github.com/abap-bench/abap-bench#readme
Keywords: benchmark,llm,abap,sap,s4hana,code-understanding
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: report
Requires-Dist: jinja2>=3.0; extra == "report"
Dynamic: license-file

# ABAP-Bench

![Version](https://img.shields.io/badge/version-4.0-blue)
![Tasks](https://img.shields.io/badge/tasks-60-green)
![Dimensions](https://img.shields.io/badge/dimensions-9-orange)
![Scoring](https://img.shields.io/badge/scoring-4--layer-purple)
![License](https://img.shields.io/badge/license-Apache--2.0-lightgrey)

**A professional AI benchmark for evaluating LLM understanding of SAP ABAP and S/4HANA modernization.**

*专业评估大语言模型对 SAP ABAP 及 S/4HANA 现代化理解能力的基准测试*

---

## Overview / 概述

ABAP-Bench evaluates large language models on real-world SAP enterprise software tasks: migrating legacy ABAP code to S/4HANA APIs, detecting defects in business-critical programs, rewriting classic reports as Fiori applications, and more.

Enterprise software modernization is a **$50B+ market**. As organizations migrate from ECC to S/4HANA, LLMs are increasingly used to assist developers — yet no standardized benchmark existed to measure their true capability in this domain. ABAP-Bench fills that gap.

**Why it matters / 为什么重要：**

- SAP systems touch an estimated 70%+ of global transaction revenue; migration errors have real financial consequences
- ABAP is a niche language with sparse training data; general coding benchmarks do not capture domain-specific knowledge
- S/4HANA introduces breaking API changes (BAPI → released I\_JournalEntry APIs, classical ALV → CL\_SALV, etc.) that require precise understanding
- China-specific regulatory and localization requirements (Golden Tax, ChinaTax, PIPL) demand dedicated evaluation coverage

---

## Leaderboard (v4.0) / 排行榜

> Last updated: 2026-04-03 | Scoring: 4-layer (Rubric 30% + Quality 20% + Semantic 20% + Judge 30%)

| Rank | Model | Score (/100) | Migration | Defects | Rewriting | China | Risk | Security | Architecture | Performance | Ecosystem |
|------|-------|:------------:|:---------:|:-------:|:---------:|:-----:|:----:|:--------:|:------------:|:-----------:|:---------:|
| 1 | Qwen3 235B | **75** | — | — | — | — | — | — | — | — | — |
| 1 | Grok 4 | **75** | — | — | — | — | — | — | — | — | — |
| 3 | GLM-5.1 | **74** | — | — | — | — | — | — | — | — | — |
| 4 | DeepSeek R1 (0528) | **73** | — | — | — | — | — | — | — | — | — |
| 5 | MiMo-V2-Pro | **68** | — | — | — | — | — | — | — | — | — |
| 6 | MiniMax M2.7 | **66** | — | — | — | — | — | — | — | — | — |

> Scores above are interim results from v3.2 (30 tasks, rubric-only scoring); the per-dimension columns will be filled in by the full v4.0 re-evaluation (60 tasks, 4-layer scoring), which is in progress.
>
> Submit your model via [Issues](https://github.com/abap-bench/abap-bench/issues).

---

## Quick Start / 快速开始

```bash
# Clone and install
git clone https://github.com/abap-bench/abap-bench
cd abap-bench
pip install -e .

# Configure API keys
cp .env.example .env
# Edit .env with your keys

# Run full benchmark (all models in configs/models.yaml)
python -m src.run_benchmark

# Run single model
python -m src.run_benchmark --model glm-5.1

# Score a response (3-layer, no API needed)
python -m src.evaluate_v2 --task T01 --response "I_JournalEntry ACDOCA..." --breakdown

# Score with LLM-as-Judge (4-layer, needs API key)
python -m src.evaluate_v2 --task T01 --response "..." --with-judge --judge-model glm-4-plus --judge-backend zhipuai --breakdown

# Run judge on a full result file
python -m src.judge batch results/v4.0/glm-5.1.json --output results/judge/glm-5.1.json
```
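Task definitions live in plain JSONL (`data/tasks.jsonl`), so they can also be inspected without the CLI. A minimal sketch using only the standard library — the field names follow the task schema shown under "Adding New Tasks", and the helper names here are illustrative, not part of the `src` API:

```python
import json
from collections import Counter

def load_tasks(path="data/tasks.jsonl"):
    """Parse one task definition per line of a JSONL file; blank lines are skipped."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def count_by_dimension(tasks):
    """Tally tasks per dimension, e.g. to confirm the 6/12-task split."""
    return Counter(t["dimension"] for t in tasks)
```

Run against the shipped `data/tasks.jsonl`, this should report 60 tasks spread across the 9 dimensions.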

---

## Benchmark Design / 基准设计

### 9 Evaluation Dimensions / 9 个评估维度

```
┌──────────────────────────────────────────────────────────────────────────┐
│                           ABAP-Bench v4.0                                │
│                        60 Tasks × 20 pts = 1200                          │
├───────────────┬───────────────┬───────────────┬──────────────────────────┤
│  D1 Migration │  D2 Defects   │  D3 Rewriting │  D4 China Compliance     │
│  T01,T09,T10  │  T02,T11,T12  │  T03,T13,T14  │  T04,T15,T16             │
│  T31,T32,T33  │  T34,T35,T36  │  T37,T38,T39  │  T40,T41,T42             │
├───────────────┼───────────────┼───────────────┼──────────────────────────┤
│  D5 Risk      │  D6 Security  │  D7 Architect │  D8 Performance          │
│  T05,T17,T18  │  T06,T19,T20  │  T07,T21,T22  │  T08,T23,T24             │
│  T43,T44,T45  │  T46,T47,T48  │  T49,T50,T51  │  T52,T53,T54             │
├───────────────┴───────────────┴───────────────┴──────────────────────────┤
│  D9 Modern Ecosystem: T25-T30, T55-T60 (12 tasks)                        │
│  Clean Core · Unit Testing · Fiori · BAdI · LUW · Integration Suite      │
│  Workflow · Output Mgmt · IDoc/ALE · BDC · Change Mgmt · Code Inspector  │
└──────────────────────────────────────────────────────────────────────────┘
```

| # | Dimension | Tasks | Description |
|---|-----------|:-----:|-------------|
| D1 | **Code Migration** | 6 | ECC → S/4HANA API replacement (BKPF→ACDOCA, BAPI→Released API) |
| D2 | **Defect Discovery** | 6 | Finding hidden bugs in ABAP code (N+1 queries, scope leaks, silent data loss) |
| D3 | **Code Rewriting** | 6 | Modernizing classical ABAP to clean code, RAP, ABAP Cloud |
| D4 | **China Compliance** | 6 | Golden Tax, ChinaTax VAT, PIPL privacy, social insurance & housing fund (五险一金) payroll |
| D5 | **Migration Risk** | 6 | Change impact analysis, RFC dependency chains, transport risks |
| D6 | **Security & Auth** | 6 | Authority checks, SQL injection, authorization trace, transport security |
| D7 | **S/4HANA Architecture** | 6 | ACDOCA, CDS views, FI-CO integration, ledger architecture |
| D8 | **Performance Engineering** | 6 | SELECT optimization, HANA column store, parallel processing |
| D9 | **Modern Ecosystem** | 12 | Clean Core, unit testing, Fiori, BAdI, LUW, IDoc, workflow, BDC |
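The totals in the diagram follow directly from this table: eight dimensions with 6 tasks each plus one with 12, at 20 points per task. As a quick sanity check (illustrative, stdlib only):

```python
# Tasks per dimension D1-D9, as listed in the table above.
tasks_per_dimension = [6, 6, 6, 6, 6, 6, 6, 6, 12]
POINTS_PER_TASK = 20

total_tasks = sum(tasks_per_dimension)           # 8 * 6 + 12 = 60
max_total_score = total_tasks * POINTS_PER_TASK  # 60 * 20 = 1200
assert (total_tasks, max_total_score) == (60, 1200)
```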

### 4-Layer Scoring / 四层评分

```
┌─────────────────────────────────────────────────────────────┐
│  Layer 4: LLM-as-Judge (optional)              30%         │
│  ├── Correctness · Completeness · Specificity              │
│  ├── Structure · Insight (each 1-5, total /25)             │
│  └── Reference-guided via golden answers                   │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: Semantic Similarity                  20% (30%*)  │
│  ├── BM25 text similarity against golden answers           │
│  └── Concept coverage (key_concepts hit rate)              │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: Code & Structure Quality             20% (30%*)  │
│  ├── ABAP syntax checks (for code tasks)                   │
│  └── Answer structure analysis (for knowledge tasks)       │
├─────────────────────────────────────────────────────────────┤
│  Layer 1: Rubric Matching                      30% (40%*)  │
│  ├── Keyword matching (weighted_terms, keyword_group)      │
│  ├── Compound matching (key_term + context_keywords)       │
│  └── Penalty rules (incorrect S/4HANA statements: -1~-3)  │
└─────────────────────────────────────────────────────────────┘
  * Weights in parentheses: 3-layer mode (Layer 4 disabled)
```
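The weighted blend across layers can be sketched as below. This is an illustrative sketch, not the actual `evaluate_v2` implementation: it assumes each layer has been normalized to a 0–1 score and applies the weights from the diagram.

```python
def combine_layers(rubric, quality, semantic, judge=None):
    """Blend per-layer scores (each 0-1) into a 0-100 task score.

    With a judge score, the 4-layer weights apply (30/20/20/30);
    without one, the 3-layer weights apply (40/30/30).
    """
    if judge is None:
        total = 0.40 * rubric + 0.30 * quality + 0.30 * semantic
    else:
        total = 0.30 * rubric + 0.20 * quality + 0.20 * semantic + 0.30 * judge
    return round(100 * total, 1)
```

For example, `combine_layers(0.8, 0.5, 0.5, judge=0.6)` yields 62.0 under the 4-layer weights.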

- **3-layer mode** (default, no API needed): Rubric 40% + Quality 30% + Semantic 30%
- **4-layer mode** (`--with-judge`): Rubric 30% + Quality 20% + Semantic 20% + Judge 30%
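Layer 3's concept coverage is conceptually simple: the fraction of a task's `key_concepts` that appear in the response. A minimal sketch (illustrative only; the real scorer also applies BM25 text similarity against the golden answers):

```python
def concept_coverage(response: str, key_concepts: list) -> float:
    """Fraction of key concepts mentioned in the response (case-insensitive)."""
    if not key_concepts:
        return 0.0
    text = response.lower()
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)
```

A response mentioning 2 of 3 key concepts (say, `I_JournalEntry` and `ACDOCA` but not `BKPF`) scores 2/3 on this sub-metric.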

---

## Project Structure / 项目结构

```
ABAP-Bench/
├── README.md                    # This file
├── pyproject.toml               # Python packaging (pip install -e .)
├── benchmark_card.yaml          # HuggingFace Dataset Card format
├── CITATION.cff                 # Citation metadata
├── CHANGELOG.md                 # Version history
├── LICENSE                      # Apache-2.0
├── .env.example                 # API key template
│
├── data/
│   ├── tasks.jsonl              # 60 task definitions (JSONL)
│   ├── dimensions.json          # 9 dimensions metadata
│   ├── rubrics/                 # 60 scoring rubric JSONs (T01-T60)
│   ├── golden/                  # 60 golden reference answers (T01-T60)
│   └── test_code/               # ABAP code samples for code-review tasks
│       ├── zvat_invoice_process.abap
│       ├── zhr_salary_calc.abap
│       └── zdyn_query.abap
│
├── src/
│   ├── __init__.py              # Package init (version: 4.0.0)
│   ├── run_benchmark.py         # Main runner: load tasks → call LLM → score → save
│   ├── evaluate.py              # Scoring engine v1 (T01-T30, rubric-only)
│   ├── evaluate_v2.py           # Scoring engine v2 (4-layer, all 60 tasks)
│   ├── judge.py                 # LLM-as-Judge module (Layer 4)
│   └── models.py                # Multi-backend LLM client (zero external deps)
│
├── configs/
│   └── models.yaml              # Model registry (7 models, 4 backends)
│
├── results/
│   ├── schema.json              # Result file JSON Schema
│   └── v4.0/                    # Per-model evaluation results
│
├── scripts/
│   ├── validate_rubrics.py      # Data integrity validation
│   └── migrate_from_legacy.py   # v3.2 → v4.0 migration helper
│
├── tests/                       # Unit & integration tests
│   ├── test_evaluate.py
│   ├── test_data_integrity.py
│   └── test_judge.py
│
└── docs/
    ├── DECONTAMINATION.md       # Data provenance & contamination statement
    └── IMPLEMENTATION_PLAN.md   # Development roadmap (P0-P5)
```

---

## Adding New Tasks / 添加新任务

1. Append a JSON line to `data/tasks.jsonl`:

```json
{"task_id":"T61","title":"New Task","dimension":"Code Migration Knowledge","max_score":20,"prompt_template":"...","requires_test_code":false,"version":"4.1"}
```

2. Create rubric: `data/rubrics/T61.json`
3. Create golden answer: `data/golden/T61.json`
4. Update `data/dimensions.json` to include T61
5. Validate: `python scripts/validate_rubrics.py`
6. Test: `python -m src.evaluate_v2 --task T61 --response "..." --breakdown`
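Before running the validator, a new task line can be sanity-checked for the required fields. A minimal sketch (illustrative; `scripts/validate_rubrics.py` performs the authoritative checks, and `check_task_line` is a hypothetical helper, not part of the repo):

```python
import json

# Required fields, per the example task schema above.
REQUIRED_FIELDS = {"task_id", "title", "dimension", "max_score",
                   "prompt_template", "requires_test_code", "version"}

def check_task_line(line: str) -> list:
    """Return a list of problems with one tasks.jsonl line (empty list = OK)."""
    try:
        task = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = []
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if not str(task.get("task_id", "")).startswith("T"):
        problems.append("task_id should look like 'T61'")
    return problems
```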

---

## Known Limitations / 已知局限

- **No code execution**: ABAP requires licensed SAP systems; scoring relies on static analysis + LLM-as-Judge instead of unit tests
- **60 tasks**: Well below the task counts of large code benchmarks (SWE-bench: 2,294; BigCodeBench: 1,140), which limits statistical power, especially per dimension
- **No human correlation study**: Inter-annotator agreement not yet measured (planned: Spearman ρ target > 0.85)
- **Primarily Chinese prompts**: May disadvantage models weaker in Chinese language understanding

See [IMPLEMENTATION_PLAN.md](docs/IMPLEMENTATION_PLAN.md) for the full roadmap.

---

## Citation / 引用

```bibtex
@misc{abapbench2026,
  title        = {ABAP-Bench: A Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization},
  author       = {ABAP-Bench Contributors},
  year         = {2026},
  version      = {4.0},
  howpublished = {\url{https://github.com/abap-bench/abap-bench}},
  note         = {60 tasks, 9 dimensions, 4-layer scoring}
}
```

---

## License / 许可证

This project is licensed under the **Apache License 2.0**. See [LICENSE](LICENSE) for details.

Benchmark task prompts and rubrics are released under Apache-2.0. Golden reference answers (`data/golden/`) are provided for evaluation use only and should NOT be included in LLM training data. Model responses collected during evaluation remain the property of their respective model providers.
