Metadata-Version: 2.4
Name: abreka
Version: 0.0.1
Summary: Autonomous research agent CLI — automates the experiment loop: propose → implement → test → run → evaluate
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: click>=8.1
Requires-Dist: fastapi>=0.110
Requires-Dist: jinja2>=3.1
Requires-Dist: markdown>=3.5
Requires-Dist: pygments>=2.17
Requires-Dist: rich>=13.0
Requires-Dist: tomli-w>=1.0
Requires-Dist: uvicorn[standard]>=0.27
Description-Content-Type: text/markdown

# ABREKA — Autonomous Research Agent

ABREKA is a CLI tool that automates the research experiment loop: **propose → implement → test → run → evaluate**. It runs autonomously for hours or days, building on its own results, while maintaining full auditability of every decision.

Inspired by [autoresearch](https://github.com/karpathy/autoresearch), ABREKA is agent-driven end-to-end and uses [pi-rpc](https://github.com/badlogic/pi-mono/blob/main/packages/coding-agent/docs/rpc.md) as the coding agent backend.

## Core Principles

- **Experiments are the unit of work.** Every action produces an experiment with a hypothesis, method, metrics, and findings.
- **Fully automated by default, human-in-the-loop optional.** `abreka run 24h` runs autonomously. `abreka run --step` does one iteration and stops.
- **TDD enforced.** Code must pass tests before experiments run.
- **Full auditability.** Every agent session transcript is saved. Every experiment has structured metrics and narrative findings.
- **Experiments form a DAG, not a list.** Ideas branch and converge through parent references.

## Installation

ABREKA is a standalone tool, installed separately from your project:

```bash
uv tool install abreka
```

Or run directly:

```bash
uvx abreka --help
```

The web UI's user guide is maintained in `src/abreka/docs/user_guide.md` and is served at `/guide` when the web UI is running.

### Prerequisites

- Python 3.11+
- [uv](https://docs.astral.sh/uv/) package manager
- [pi](https://github.com/badlogic/pi-mono) coding agent (for agent-driven commands)

## Quick Start

```bash
# 1. Inside any uv-managed Python project:
cd my-research-project/
abreka init .

# 2. Edit the research goal:
$EDITOR abreka/goal.md

# 3. Run a single experiment cycle:
abreka run --step

# 4. Or let it run autonomously:
abreka run 24h
```

## What `abreka init .` Creates

ABREKA reads your `pyproject.toml` to discover the project name and `src/` layout, then creates:

```
my_research_project/
  pyproject.toml              # yours — abreka reads but doesn't modify
  src/my_lib/                 # your package
  abreka/                     # created by abreka init
    config.toml               # project settings, metrics, agent config
    goal.md                   # your research objective (edit this!)
    index.json                # experiment metadata cache
    prompts/                  # agent prompt templates (editable)
      propose_exploit.md
      propose_explore.md
      implement.md
      test.md
      repair_run.md
      evaluate.md
    experiments/
      0001/
        experiment.toml       # structured metadata
        findings.md           # narrative findings
        run.py                # experiment entry point
        code/                 # experiment-specific modules
        artifacts/
          checkpoints/
          plots/
          pi_session.json     # agent transcript
```

## Configuration

### `abreka/config.toml`

```toml
[project]
name = "mnist-exploration"        # discovered from pyproject.toml
lib = "mnist_exp"                 # discovered from src/ layout
test_command = "uv run pytest"

[experiment]
primary_metric = "val:accuracy"   # default sorting and "best" tracking
metric_direction = "maximize"     # or "minimize"
run_template = "./run.py"

[run]
mode = "local"                  # or "remote" to offload only the run step
timeout_minutes = 30
require_tests = true
max_repair_attempts = 3          # LLM repairs after non-timeout run failures

# [run.remote]
# host = "gpu-box"
# project_dir = "/workspace/mnist-exploration"
# bootstrap_command = "uv sync --frozen"
# poll_seconds = 5
# startup_command = "./scripts/start-gpu-vm.sh"
# teardown_command = "./scripts/stop-gpu-vm.sh"

[agent]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
# Per-agent overrides (optional):
# propose_model = "claude-opus-4-20250514"
```

### `abreka/goal.md`

Free-form markdown describing the research objective; its contents are injected into every agent prompt:

```markdown
# Research Goal

Achieve the highest possible test accuracy on MNIST using models
that train in under 60 seconds on a single GPU.

## Success Criteria
- val:accuracy > 0.995
- Training time < 60 seconds
```

## CLI Reference

### Project Commands

```bash
abreka init .           # Initialize in current directory
abreka goal             # View the research goal
abreka status           # Summary: experiment counts, best result
```

### The Autonomous Loop

```bash
abreka run 24h          # Run autonomously for 24 hours
abreka run 2h           # Run for 2 hours
abreka run --step       # Single iteration (human-in-the-loop)
abreka run --recover    # Recover exactly one interrupted remote run
```

Each iteration: **propose → implement → test → run → evaluate**. If `run.py` exits non-zero, ABREKA preserves the failure output, asks an LLM to repair the existing experiment, and retries the run (default 3 repair attempts). Run timeouts are treated as failures without repair. If a step ultimately fails, the experiment is marked failed and the loop continues with a new proposal.
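
The run-and-repair portion of an iteration can be sketched as below. The `repair` callable stands in for the LLM repair step; function names and signatures here are assumptions for illustration, not abreka internals:

```python
import subprocess

def run_with_repair(cmd: list[str], repair, max_attempts: int = 3,
                    timeout_minutes: int = 30) -> bool:
    """Run an experiment; after each non-timeout failure, hand the
    captured output to `repair` and retry (sketch, not the real loop)."""
    for attempt in range(max_attempts + 1):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout_minutes * 60)
        except subprocess.TimeoutExpired:
            return False  # timeouts are failures without repair
        if proc.returncode == 0:
            return True
        if attempt < max_attempts:
            repair(proc.stdout + proc.stderr)  # preserve failure output, ask for a fix
    return False
```

If this returns `False`, the experiment would be marked failed and the loop would continue with a new proposal.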

### Agent-Driven Commands

Run individual steps manually:

```bash
abreka propose          # Agent proposes the next experiment
abreka implement 0003   # Agent writes code for experiment 0003
abreka test 0003        # Agent validates code, runs test suite
abreka evaluate 0003    # Agent interprets results, writes findings
```

### Experiment Management

```bash
# Create and update
abreka exp new --hypothesis "Add dropout" --parent 0001 --tag regularization
abreka exp update 0003 --metric val:accuracy=0.9934 --metric train:loss=0.005
abreka exp finish 0003
abreka exp fail 0003 --reason "OOM on batch size 256"

# Remote-run recovery
abreka exp remote list
abreka exp remote status 0003
abreka exp remote sync 0003
abreka exp remote resume 0003
abreka exp remote stop 0003

# Query
abreka exp list                                        # all experiments
abreka exp list --status completed --sort metric:val:accuracy
abreka exp list --where "metric:val:accuracy > 0.95"
abreka exp list --parent 0001                          # children of 0001
abreka exp show 0003                                   # full detail view
abreka exp search "augmentation"                       # keyword search
abreka exp diff 0003 0005                              # side-by-side comparison
abreka exp tree                                        # DAG visualization
```
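
Queries like `--sort metric:val:accuracy` combine the primary metric with the configured direction. A sketch of "best result" selection, where the record shape (a dict with `status` and `metrics` keys) is an assumption, not abreka's actual index format:

```python
def best_experiment(experiments: list[dict], primary_metric: str = "val:accuracy",
                    direction: str = "maximize"):
    """Pick the best completed experiment by the primary metric (sketch)."""
    completed = [e for e in experiments
                 if e.get("status") == "completed"
                 and primary_metric in e.get("metrics", {})]
    if not completed:
        return None
    key = lambda e: e["metrics"][primary_metric]
    return max(completed, key=key) if direction == "maximize" else min(completed, key=key)
```

With `metric_direction = "minimize"` (e.g. for a loss), the same helper flips to `min`.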

## Experiment Data Model

### Status Lifecycle

```
proposed → implementing → testing → running → completed
                                          ↘ failed
```
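
The lifecycle above implies a small transition table. A sketch of one; which intermediate states may transition to `failed` is an assumption beyond what the diagram shows:

```python
# Allowed forward transitions in the experiment lifecycle (illustrative).
TRANSITIONS = {
    "proposed": {"implementing"},
    "implementing": {"testing", "failed"},
    "testing": {"running", "failed"},
    "running": {"completed", "failed"},
}

def can_transition(current: str, new: str) -> bool:
    """Check whether a status change is legal under the table above."""
    return new in TRANSITIONS.get(current, set())
```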

### Metrics Protocol

Experiments print metrics to stdout during training:

```
METRIC:train:loss=0.005
METRIC:val:accuracy=0.9934
METRIC:accuracy=0.85          # split defaults to "val"
```

ABREKA parses these lines and writes them into `experiment.toml`. Metrics are namespaced by split (`train`, `val`, `test`).
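
A plausible parser for this protocol is a one-regex scan over stdout. The exact pattern and the last-value-wins semantics are assumptions for illustration:

```python
import re

# Matches lines like "METRIC:val:accuracy=0.9934"; the split prefix is optional.
METRIC_RE = re.compile(
    r"^METRIC:(?:(?P<split>train|val|test):)?(?P<name>[\w-]+)=(?P<value>\S+)$")

def parse_metrics(stdout: str) -> dict[str, float]:
    """Collect METRIC lines from run output into {"split:name": value}."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            split = m.group("split") or "val"  # split defaults to "val"
            metrics[f"{split}:{m.group('name')}"] = float(m.group("value"))
    return metrics
```

Non-matching lines (ordinary training logs) pass through untouched.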

### Experiment DAG

Experiments reference parents, forming a directed acyclic graph:

```
Experiments
├── 0001 completed Vanilla CNN baseline (val:accuracy=0.9912)
│   ├── 0002 completed Add dropout (val:accuracy=0.9923)
│   │   └── 0005 completed Dropout + CutMix (val:accuracy=0.9961)
│   └── 0003 completed Random augmentation (val:accuracy=0.9934)
└── 0006 running MLP baseline
```
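
Rendering that tree starts by inverting the parent references into a child map. A sketch, where the `{exp_id: [parent_ids]}` input shape is an assumed in-memory form, not abreka's `index.json` schema:

```python
def build_children(experiments: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert {exp_id: [parent_ids]} into {parent_id: [child_ids]}."""
    children: dict[str, list[str]] = {eid: [] for eid in experiments}
    for eid, parents in experiments.items():
        for parent in parents:
            children[parent].append(eid)
    return children

def roots(experiments: dict[str, list[str]]) -> list[str]:
    """Experiments with no parents start new branches of the DAG."""
    return sorted(eid for eid, parents in experiments.items() if not parents)
```

Because an experiment may list multiple parents, this is a true DAG walk, not just a tree.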

## Shared Library & Copy-on-Write

Your code in `src/<lib_name>/` is shared across all experiments:

1. The agent **can add** new modules to the shared library.
2. The agent **can modify** existing code **if all tests still pass**.
3. If modifications break tests, the agent **copies to** `experiments/<id>/code/` instead.
4. The test suite enforces this automatically.
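
The copy-on-write decision in steps 2–3 can be sketched as below. `place_module` and the `tests_pass` callable are hypothetical helpers invented for illustration, not abreka's implementation:

```python
import shutil
from pathlib import Path

def place_module(src_file: Path, shared_dir: Path, exp_code_dir: Path,
                 tests_pass) -> Path:
    """Try placing a module in the shared library; revert and fall back to
    the experiment's code/ directory if `tests_pass()` reports a break."""
    target = shared_dir / src_file.name
    shutil.copy(src_file, target)
    if tests_pass():
        return target          # shared-library change is safe
    target.unlink()            # revert the shared-library change
    exp_code_dir.mkdir(parents=True, exist_ok=True)
    fallback = exp_code_dir / src_file.name
    shutil.copy(src_file, fallback)
    return fallback            # experiment-local copy instead
```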

## Agent Prompts

Default prompts in `abreka/prompts/` are fully editable. They use `{{variable}}` placeholders:

| Variable | Description |
|----------|-------------|
| `{{goal}}` | Contents of `goal.md` |
| `{{exp_id}}` | Current experiment ID |
| `{{hypothesis}}` | Experiment hypothesis |
| `{{method}}` | Experiment method |
| `{{parents}}` | Parent experiment IDs |
| `{{experiment_context}}` | Summary of all prior experiments |
| `{{lib_name}}` | Your package name |
| `{{run_output}}` | Captured stdout/stderr from the run |
| `{{artifacts_list}}` | List of artifact files |
| `{{primary_metric}}` | Primary metric from config |
| `{{metric_direction}}` | `maximize` or `minimize` |
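
Substituting these placeholders can be approximated with a small regex; this minimal stand-in leaves unknown names intact (abreka depends on jinja2, so the real templates may use richer syntax than plain `{{name}}`):

```python
import re

def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with values; unknown names survive."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  template)
```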

## Architecture

```
abreka CLI
  └── propose / implement / test / evaluate
        └── pi-rpc client (spawns pi --mode rpc subprocess)
              └── pi agent uses tools: bash, read, write, edit
                    └── agent calls abreka exp ... to manage experiments
```

ABREKA communicates with pi via JSON lines over stdin/stdout. Every agent session is saved as an artifact for full auditability.
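
The framing (one JSON object per line over the subprocess's stdin/stdout) can be sketched as below. The message fields are invented for illustration; consult the pi-rpc docs for the real protocol:

```python
import json
import subprocess

class JsonLinesClient:
    """Minimal JSON-lines request/response client over a subprocess."""

    def __init__(self, argv: list[str]):
        self.proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def request(self, message: dict) -> dict:
        # One JSON object per line out, one JSON object per line back.
        self.proc.stdin.write(json.dumps(message) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

Saving each request/response pair to `artifacts/pi_session.json` would give the audit trail described above.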

## Development

```bash
git clone https://github.com/your-org/abreka.git
cd abreka
uv sync
uv run pytest           # 55 tests
uv run abreka --help
```

## License

MIT
