Metadata-Version: 2.1
Name: absum
Version: 0.1.0
Summary: Abstract Summarization for Data Augmentation
Home-page: https://github.com/aaronbriel/absum
Author: Aaron Briel
Author-email: aaronbriel@gmail.com
License: Apache License 2.0
Description: # absum - Abstract Summarization for Data Augmentation
        
        ## Introduction
        Imbalanced datasets are a common problem in ML, and undersampling and oversampling, often used in combination, are two methods of addressing it.
        A technique such as SMOTE can be effective for oversampling, although the problem becomes more difficult with multilabel datasets.
        [MLSMOTE](https://www.sciencedirect.com/science/article/abs/pii/S0950705115002737) has been proposed, but the high-dimensional nature of numerical vectors created from text can sometimes make other forms of data augmentation more appealing.
        
        absum is an NLP library that uses abstract summarization to perform data augmentation in order to oversample under-represented classes in datasets. Recent developments in abstract summarization make this approach well suited to producing realistic data for the augmentation process.
        
        It uses the [Hugging Face T5](https://huggingface.co/transformers/model_doc/t5.html) model by default but is designed in a modular way so that you can use any pre-trained or out-of-the-box Transformers model.
        absum is format agnostic, expecting only a dataframe containing text and all features. It also uses multiprocessing to reduce runtime.
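        
        As a rough illustration of that input layout, the sketch below builds a tiny CSV with one text column and one 0/1 column per feature, then counts the rows per feature. The column names (`review_text`, `battery`, `screen`, `price`) and the 0/1 encoding of the input are assumptions made for this example, and plain Python is used so that it runs without pandas.
        
        ```python
        import csv
        import io
        
        # Hypothetical column names: one text column plus one 0/1 column per feature.
        FIELDNAMES = ["review_text", "battery", "screen", "price"]
        
        rows = [
            {"review_text": "Battery lasts two days.", "battery": 1, "screen": 0, "price": 0},
            {"review_text": "Bright screen, but it costs too much.", "battery": 0, "screen": 1, "price": 1},
            {"review_text": "Fair price for what you get.", "battery": 0, "screen": 0, "price": 1},
        ]
        
        
        def to_csv_text(rows, fieldnames):
            """Serialize the rows to CSV text with a header line."""
            buffer = io.StringIO()
            writer = csv.DictWriter(buffer, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
            return buffer.getvalue()
        
        
        def feature_counts(rows, fieldnames, text_column="review_text"):
            """Count how many rows have each feature set to 1."""
            return {
                name: sum(int(row[name]) for row in rows)
                for name in fieldnames
                if name != text_column
            }
        
        
        csv_text = to_csv_text(rows, FIELDNAMES)
        counts = feature_counts(rows, FIELDNAMES)
        ```
        
        A file with this layout could then be loaded with `pd.read_csv`, and the per-feature counts show which classes are under-represented.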
        
        ## Algorithm
        1. Append counts (the number of rows to add for each feature) are first calculated against a ceiling threshold. For example, if a given feature already has 1000 rows and the ceiling is 100, its append count will be 0.
        
        2. For each feature, it then loops over an append index range, up to the append count specified for that feature. The append index is stored to allow for multiprocessing.
        
        3. An abstract summarization is generated from a subset, of a specified size, of all rows that have only the given feature.
        If multiprocessing is enabled, the call to abstract summarization is stored in a task array that is later passed to a subroutine, which runs the calls in parallel using the [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) library. This can greatly reduce runtime.
        
        4. Each summarization is appended to a new dataframe with the respective features one-hot encoded. 
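        
        The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the library's implementation: rows are dictionaries instead of dataframe rows, the append count is assumed to be the ceiling minus the current count, `placeholder_summarize` is a hypothetical stand-in for the T5 summarizer, and multiprocessing is left out.
        
        ```python
        import random
        
        
        def get_append_counts(rows, features, threshold):
            """Step 1: rows to add per feature (0 once the ceiling is reached)."""
            counts = {f: sum(row[f] for row in rows) for f in features}
            # Assumption: the append count is the ceiling minus the current count.
            return {f: max(threshold - counts[f], 0) for f in features}
        
        
        def rows_with_only(rows, feature, features):
            """Rows in which `feature` is the only feature that is set."""
            return [
                row for row in rows
                if row[feature] == 1 and sum(row[f] for f in features) == 1
            ]
        
        
        def placeholder_summarize(texts):
            """Hypothetical stand-in for the summarizer: joins the first word of each text."""
            return " ".join(text.split()[0] for text in texts)
        
        
        def augment(rows, features, text_column, threshold, num_samples, seed=0):
            """Steps 2 to 4: build new one-hot encoded rows for each feature."""
            rng = random.Random(seed)
            new_rows = []
            for feature, count in get_append_counts(rows, features, threshold).items():
                candidates = rows_with_only(rows, feature, features)
                if not candidates:
                    continue
                for _ in range(count):  # Step 2: loop up to the append count.
                    # Step 3: summarize a fixed-size subset of the candidate rows.
                    sample = rng.sample(candidates, min(num_samples, len(candidates)))
                    summary = placeholder_summarize([row[text_column] for row in sample])
                    # Step 4: one-hot encode the feature on the new row.
                    new_row = {text_column: summary}
                    new_row.update({f: int(f == feature) for f in features})
                    new_rows.append(new_row)
            return new_rows
        
        
        rows = [
            {"text": "great battery life", "battery": 1, "screen": 0},
            {"text": "battery drains fast", "battery": 1, "screen": 0},
            {"text": "battery is fine overall", "battery": 1, "screen": 0},
            {"text": "sharp screen", "battery": 0, "screen": 1},
        ]
        features = ["battery", "screen"]
        
        append_counts = get_append_counts(rows, features, threshold=3)
        new_rows = augment(rows, features, "text", threshold=3, num_samples=2)
        ```
        
        With a ceiling of 3, `battery` already has enough rows and receives none, while `screen` receives two new rows with only `screen` set.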
        
        ## Installation
        ### Via pip
        
        ```bash
        pip install absum
        ```
        
        ### From source
        
        ```bash
        git clone https://github.com/aaronbriel/absum.git
        cd absum
        pip install [--editable] .
        ```
        
        or
        
        ```bash
        pip install git+https://github.com/aaronbriel/absum.git
        ```
        
        ## Usage
        
        ```python
        import pandas as pd
        from absum import Augmentor
        
        # Path to the input CSV; it should end in .csv so the output name below differs
        csv = 'path_to_csv'
        df = pd.read_csv(csv)
        # text_column names the dataframe column that holds the text to summarize
        augmentor = Augmentor(df, text_column='review_text')
        df_augmented = augmentor.abs_sum_augment()
        # Store resulting dataframe as a csv
        df_augmented.to_csv(csv.replace('.csv', '-augmented.csv'), encoding='utf-8', index=False)
        ```
        
        ## Citation
        
        Please reference [this library](https://github.com/aaronbriel/absum) and the HuggingFace [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) library if you use this work in a published or open-source project.
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
