Metadata-Version: 2.1
Name: aclose
Version: 0.0.1
Summary: ACLOSE- Automatic Clustering and Labeling Over Semantic Embeddings
License: MIT
Author: Joe Nance
Author-email: joe@nceno.app
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: asyncio (>=3.4.3,<4.0.0)
Requires-Dist: backoff (>=2.2.1,<3.0.0)
Requires-Dist: hdbscan (>=0.8.40,<0.9.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: llvmlite (>=0.44.0,<0.45.0)
Requires-Dist: openai (>=1.59.9,<2.0.0)
Requires-Dist: optuna (>=4.2.0,<5.0.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: plotly (>=5.24.1,<6.0.0)
Requires-Dist: psutil (>=6.1.1,<7.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: scikit-learn (>=1.6.1,<2.0.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Requires-Dist: umap-learn (>=0.5.7,<0.6.0)
Description-Content-Type: text/markdown

# ATMOSE
ATMOSE: Automatic Topic Modeling Over Semantic Embeddings


## What it does
This package is a tool for quick exploratory data analysis (EDA) of the topics that emerge from your semantic embeddings.
### Problem
- I have all these embedding vectors. What are the general topics that emerge from them?
### Solution
- ATMOSE clusters your embeddings, then labels each cluster using an LLM.
- Instead of sending a purely random sample from each cluster to the LLM, ATMOSE uses stratified sampling and refinement so that the topic labels balance generalization and specificity.
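
The stratified sampling idea can be sketched roughly as follows. This is an illustration only, not ATMOSE's actual implementation: the function name `stratified_sample`, the choice of membership score as the stratification variable, and the `n_bins` and `per_bin` parameters are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(texts, scores, n_bins=3, per_bin=2, seed=0):
    """Pick up to `per_bin` texts from each score band of one cluster.

    Hypothetical helper: `scores` stands in for a per-document
    membership score, split into `n_bins` equal-width bands.
    """
    rng = random.Random(seed)
    lo, hi = min(scores), max(scores)
    # Fall back to a width of 1.0 when every score is identical.
    width = (hi - lo) / n_bins or 1.0
    bins = defaultdict(list)
    for text, score in zip(texts, scores):
        # Clamp so the maximum score lands in the last band.
        idx = min(int((score - lo) / width), n_bins - 1)
        bins[idx].append(text)
    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


texts = [f"doc{i}" for i in range(9)]
scores = [0.1, 0.15, 0.2, 0.5, 0.55, 0.6, 0.9, 0.95, 1.0]
sample = stratified_sample(texts, scores)
```

Compared with a plain random sample, this gives the LLM both core and borderline members of a cluster, which is one way to keep a label from being either too narrow or too broad.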

## Algorithms (more coming soon)
- UMAP
- HDBSCAN
- TOPSIS
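
For readers unfamiliar with TOPSIS, here is a minimal standard-library sketch of the method, which ranks alternatives by their closeness to an ideal solution across several criteria. How ATMOSE applies TOPSIS internally is not described here, so the criteria in the example (a quality score to maximize and a noise fraction to minimize) are hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] per row (higher is better).

    matrix  : one row per alternative, one column per criterion
    weights : importance of each criterion
    benefit : True where larger is better, False where smaller is better
    """
    n_cols = len(weights)
    # Vector-normalize each column, then apply the weights.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
        for j in range(n_cols)
    ]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_cols)]
        for row in matrix
    ]
    # Ideal and anti-ideal points depend on each criterion's direction.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Hypothetical candidates: [quality score, noise fraction]
candidates = [[0.60, 0.10], [0.40, 0.30], [0.55, 0.05]]
scores = topsis(candidates, weights=[0.5, 0.5], benefit=[True, False])
best = scores.index(max(scores))
```

In this example the second candidate is worst on both criteria, so it coincides with the anti-ideal point and scores 0.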

## LLM agnostic (coming soon)
- All LLMs are supported via LiteLLM

## Experiment tracking (coming soon)
- MLflow
- Serialization and versioning of dimensionality reduction and clustering models
- Helicone tracking (optional)

## C++ compiler required

Before installing ATMOSE, ensure you have:

- Windows: Microsoft Visual C++ Build Tools

- Linux: GCC/G++ compiler (`sudo apt-get install build-essential` on Ubuntu)

- macOS: Xcode Command Line Tools (`xcode-select --install`)

## Tip for building in Docker
Add this to your Dockerfile:
```Dockerfile
RUN apt-get update && apt-get install -y \
    curl \
    build-essential \
    gcc \
    g++ \
    libpq-dev \
    libx11-dev \
    libxrandr-dev \
    libxext-dev \
    libxi-dev \
    libgl1-mesa-dev \
    && rm -rf /var/lib/apt/lists/*

ENV POETRY_VERSION=1.8.2
RUN curl -sSL https://install.python-poetry.org | python3 -
ENV PATH="/root/.local/bin:$PATH"
```

## Notebook demo


## Quickstart


## Number of LLM calls
- 2 LLM calls per cluster
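
To budget for a run, the total number of calls can be estimated from the cluster assignments. The helper below is a hypothetical sketch, not part of the package. It assumes noise points carry the label `-1` (the HDBSCAN convention) and that the noise group is not labeled by the LLM, which is not confirmed here.

```python
def estimate_llm_calls(cluster_ids, calls_per_cluster=2, noise_label=-1):
    """Estimate total LLM calls for a set of cluster assignments."""
    # Assumption: points labeled `noise_label` do not form a labeled cluster.
    clusters = {c for c in cluster_ids if c != noise_label}
    return calls_per_cluster * len(clusters)


total_calls = estimate_llm_calls([0, 0, 1, 2, -1, 2])  # 3 clusters plus noise
```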


## Instructions for use
The input DataFrame `df` is expected to have these columns:
- content_str
- embedding_vector

Calling `.label(df, data_description)` adds these columns:
- cluster_id
- topic_label
- membership_score
- outlier_score
- silhouette_score
- reduced_vector
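
Before calling `.label(...)`, it can help to confirm that the input has the expected columns. The check below is a small sketch with hypothetical names (`INPUT_COLUMNS`, `OUTPUT_COLUMNS`, `missing_columns`), none of which are part of the package. It accepts any iterable of column names, such as `df.columns` from a pandas DataFrame.

```python
# Column names taken from the lists above.
INPUT_COLUMNS = ["content_str", "embedding_vector"]
OUTPUT_COLUMNS = [
    "cluster_id",
    "topic_label",
    "membership_score",
    "outlier_score",
    "silhouette_score",
    "reduced_vector",
]


def missing_columns(columns, required=INPUT_COLUMNS):
    """Return the required column names absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# With a DataFrame this would be: missing_columns(df.columns)
missing = missing_columns(["content_str", "source_url"])
```

The same helper can be pointed at `OUTPUT_COLUMNS` afterwards to confirm that labeling produced every expected column.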

