Metadata-Version: 2.1
Name: abx-dl
Version: 0.0.1rc1
Summary: All-in-one CLI tool to download and extract content from URLs
Keywords: scraping,crawling,downloading,internet archiving,web archiving,digipres,warc,preservation,backups,archiving,web,bookmarks,puppeteer,browser,download
Author-Email: Nick Sweeting <pyproject.toml@archivebox.io>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Legal Industry
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Internet :: WWW/HTTP :: WSGI :: Application
Classifier: Topic :: Sociology :: History
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Archiving
Classifier: Topic :: System :: Archiving :: Backup
Classifier: Topic :: System :: Recovery Tools
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Project-URL: Homepage, https://github.com/ArchiveBox/abx-dl
Project-URL: Source, https://github.com/ArchiveBox/abx-dl
Project-URL: Documentation, https://github.com/ArchiveBox/abx-dl
Project-URL: Bug Tracker, https://github.com/ArchiveBox/abx-dl/issues
Project-URL: Changelog, https://github.com/ArchiveBox/abx-dl/releases
Project-URL: Community, https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community
Project-URL: Donate, https://github.com/ArchiveBox/ArchiveBox/wiki/Donations
Requires-Python: >=3.10
Requires-Dist: rich-click>=1.8.0
Requires-Dist: rich>=13.0.0
Requires-Dist: abx-pkg>=0.6.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: platformdirs>=4.0.0
Requires-Dist: requests>=2.28.0
Description-Content-Type: text/markdown

# ⬇️ `abx-dl`

> A simple all-in-one CLI tool to auto-detect and download *everything* available from a URL.

```bash
pip install abx-dl
abx-dl 'https://example.com/page/to/download'
```

---

✨ *Ever wish you could `yt-dlp`, `gallery-dl`, `wget`, `curl`, `puppeteer`, etc. all in one command?*

`abx-dl` is an all-in-one CLI tool for downloading URLs "by any means necessary".

It's useful for scraping, downloading, OSINT, digital preservation, and more.
`abx-dl` provides a simpler one-shot CLI interface to the [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) plugin ecosystem.

---

<br/>

#### 🍜 What does it save?

```bash
abx-dl --plugins=title,favicon,headers,wget,singlefile,screenshot,pdf,dom,readability,git,... 'https://example.com'
```

`abx-dl` runs all plugins by default, or you can specify `--plugins=...` for specific methods:
- HTML, JS, CSS, images, etc. rendered with a headless browser
- title, favicon, headers, outlinks, and other metadata
- audio, video, subtitles, playlists, comments
- snapshot of the page as a PDF, screenshot, and [Singlefile](https://github.com/gildas-lormeau/single-file-cli) HTML
- article text, `git` source code
- [and much more](https://github.com/ArchiveBox/abx-dl#All-Outputs)...

<br/>

#### 🧩 How does it work?

`abx-dl` uses the **[ABX Plugin Library](https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement)** (shared with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox)) to run a collection of downloading and scraping tools.

Plugins are discovered from the `plugins/` directory and execute hooks in order:
1. **Crawl hooks** run first (setup/install dependencies like Chrome)
2. **Snapshot hooks** run per-URL to extract content

Each plugin can output:
- Files to its output directory
- JSONL records for status reporting
- Config updates that propagate to subsequent plugins

<br/>

#### ⚙️ Configuration

Configuration is handled via environment variables:

- `CHROME_BINARY`, `WGET_BINARY`, etc. - binary paths
- `TIMEOUT=60` - default timeout for hooks
- `{PLUGIN}_ENABLED=true/false` - enable/disable specific plugins
- `{PLUGIN}_TIMEOUT=120` - per-plugin timeout overrides

<br/>

---

<br/>

### 📦 Install

```bash
pip install abx-dl
abx-dl install           # optional: install plugin dependencies
```

<br/>

### 🔠 Usage

```bash
# Basic usage - download URL with all plugins:
abx-dl 'https://example.com'

# Download with specific plugins only:
abx-dl --plugins=title,favicon,screenshot 'https://example.com'

# Specify output directory:
abx-dl --output=./downloads 'https://example.com'

# Set timeout:
abx-dl --timeout=120 'https://example.com'
```

#### Commands

```bash
abx-dl <url>                    # Download URL (default command)
abx-dl plugins                  # List available plugins
abx-dl info <plugin>            # Show plugin details
abx-dl install [plugins]        # Install plugin dependencies
abx-dl check [plugins]          # Check dependency status
```

<br/>

---

<br/>

### Output Structure

```
./
├── index.json              # Snapshot metadata and results
├── title/
│   └── title.txt
├── favicon/
│   └── favicon.ico
├── screenshot/
│   └── screenshot.png
├── pdf/
│   └── output.pdf
├── dom/
│   └── output.html
├── wget/
│   └── example.com/
│       └── index.html
├── singlefile/
│   └── output.html
└── ...
```

<br/>

### All Outputs

- `index.json` - snapshot metadata and plugin results
- `title/title.txt` - page title
- `favicon/favicon.ico` - site favicon
- `screenshot/screenshot.png` - full page screenshot (Chrome)
- `pdf/output.pdf` - page as PDF (Chrome)
- `dom/output.html` - rendered DOM (Chrome)
- `wget/example.com/...` - mirrored site files
- `singlefile/output.html` - single-file HTML snapshot
- ... and more via plugin library ...

---

### Architecture

`abx-dl` is built on these components:

- **`abx_dl/plugins.py`** - Plugin discovery from `plugins/` directory
- **`abx_dl/executor.py`** - Hook execution engine with config propagation
- **`abx_dl/config.py`** - Environment variable configuration
- **`abx_dl/cli.py`** - Rich CLI with live progress display

Plugins are symlinked from [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox)'s plugin directory.

---

For more advanced use with collections, parallel downloading, a Web UI + REST API, etc.
See: [`ArchiveBox/ArchiveBox`](https://github.com/ArchiveBox/ArchiveBox)
