Metadata-Version: 2.4
Name: abi-evals
Version: 0.1.2
Summary: Eval harness with golden management for the ABI ecosystem
Project-URL: Homepage, https://github.com/AbilityBI/abi-evals
Project-URL: Repository, https://github.com/AbilityBI/abi-evals
Project-URL: Documentation, https://abilitybi.github.io/abi-infra/
Project-URL: Changelog, https://github.com/AbilityBI/abi-evals/blob/main/CHANGELOG.md
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.12
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.8; extra == 'dev'
Description-Content-Type: text/markdown

# abi-evals

Evaluation harness with a failure taxonomy, regression store, and model canary for AI pipeline quality assurance. It works with any Python project and has zero production dependencies, so nothing from the ABI ecosystem is required at runtime.

## Install (after PyPI publication)

```bash
pip install abi-evals
```

> **Status**: Not yet published to PyPI. Install from source: `pip install -e .` from the repo root.

## Quick start

### Failure taxonomy

```python
from abi_evals.failure_taxonomy import FailureTaxonomy

# Load a taxonomy from YAML or JSON, or build one programmatically
taxonomy = FailureTaxonomy.from_yaml("failure_types.yaml")

# Classify an error by matching detection probes against error text
matches = taxonomy.classify("Output exceeded token budget")
for failure_type in matches:
    print(failure_type.failure_id)  # e.g., "FT-BUDGET-001"
    print(failure_type.category)    # e.g., "budget"
    print(failure_type.severity)    # e.g., "high"
```
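
You can also build a taxonomy in code rather than loading it from a file. The sketch below is illustrative only: the constructor signature and the `detection_probes` field name are assumptions (the `failure_id`, `category`, and `severity` fields match the attributes shown above); check the module for the actual API.

```python
from abi_evals.failure_taxonomy import FailureTaxonomy

# A minimal sketch of programmatic construction, assuming the constructor
# accepts a list of failure-type definitions. Verify against the real API.
taxonomy = FailureTaxonomy(
    failure_types=[
        {
            "failure_id": "FT-BUDGET-001",  # stable identifier used in reports
            "category": "budget",           # coarse grouping for diagnostics
            "severity": "high",             # drives gating decisions
            # Assumed field name: substrings matched against error text
            "detection_probes": ["token budget", "exceeded token"],
        }
    ]
)

matches = taxonomy.classify("Output exceeded token budget")
```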

### Model canary

```python
from abi_evals.model_canary import ModelCanary

# Compare a new model version against a baseline
canary = ModelCanary(
    eval_suite_path="evals/suite.json",
    baseline_results_path="evals/baseline_results.json",
)
result = canary.run_canary(new_model_config={"run_id": "canary_test"})
print(result.verdict)          # "PASS" | "WARN" | "FAIL"
print(result.regression_count) # number of regressions detected
print(result.success_rate_delta)
```
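
In CI, the verdict can be mapped directly to an exit code so a regressed model blocks the pipeline. Continuing the example above, a minimal sketch using only the `verdict` and `regression_count` fields shown there (treating `WARN` as non-blocking is an illustrative policy choice, not library behavior):

```python
import sys

# Gate the pipeline on the canary verdict.
if result.verdict == "FAIL":
    print(
        f"Canary failed: {result.regression_count} regression(s) detected",
        file=sys.stderr,
    )
    sys.exit(1)  # non-zero exit fails the CI job
sys.exit(0)
```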

### Eval runner (function-based API)

```python
from pathlib import Path
from abi_evals.runner import load_eval_suite, run_suite

suite = load_eval_suite(Path("evals/my_suite.json"))
results = run_suite(suite)
```
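
The shape of the per-case results depends on the suite's comparators. Continuing the example above, a CI-style summary might look like the sketch below, where the `passed` attribute is an assumed field name for illustration; inspect the runner's result objects for the real attributes.

```python
import json
import sys

# `r.passed` is an assumed attribute; adjust to the actual result type.
failed = [r for r in results if not r.passed]

# Emit a machine-readable summary and a CI-friendly exit code.
print(json.dumps({"total": len(results), "failed": len(failed)}))
sys.exit(1 if failed else 0)
```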

## What's included

- **Failure taxonomy** -- structured failure classification for clear diagnostics
- **Failure distiller** -- converts classified failures into regression eval cases
- **Regression store** -- manages storage and deduplication of regression cases (see the distiller-to-store sketch after this list)
- **Model canary** -- validates model version changes against baselines before acceptance
- **Eval runner** -- deterministic eval suites with pluggable comparators
- **CI-friendly** -- exit codes and JSON output for gating in any CI pipeline
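
The distiller and regression store are not shown in the quick start. The sketch below illustrates the intended classify-distill-store flow; the module, class, and method names (`failure_distiller`, `regression_store`, `distill`, `add`) are hypothetical and may differ from the actual API.

```python
from abi_evals.failure_taxonomy import FailureTaxonomy

# Hypothetical imports: module and class names are assumptions for illustration.
from abi_evals.failure_distiller import FailureDistiller
from abi_evals.regression_store import RegressionStore

taxonomy = FailureTaxonomy.from_yaml("failure_types.yaml")
matches = taxonomy.classify("Output exceeded token budget")

distiller = FailureDistiller()
store = RegressionStore("evals/regressions.json")  # assumed storage-path argument

for failure_type in matches:
    case = distiller.distill(failure_type)  # assumed method: failure -> eval case
    store.add(case)  # assumed method: store deduplicates repeated cases
```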

## Use it standalone

`abi-evals` works with any Python project. Point it at your eval suites and failure logs — no ABI ecosystem integration needed.

Within the ABI ecosystem, it integrates with `abi-control-core` schemas for eval case and result validation.

## ABI ecosystem

Related packages:

- [abi-control-core](https://github.com/AbilityBI/abi-core) -- contracts and audit trail SDK
- [abi-policy](https://github.com/AbilityBI/abi-policy) -- policy evaluation engine
- [abi-observability](https://github.com/AbilityBI/abi-observability) -- OTLP-compliant telemetry

## Versioning

Follows [Semantic Versioning](https://semver.org/). Current version: `0.1.2`. See [CHANGELOG.md](https://github.com/AbilityBI/abi-evals/blob/main/CHANGELOG.md) for release notes.

## License

MIT License (see LICENSE file).
