Metadata-Version: 2.4
Name: actionformer
Version: 1.5.0
Summary: ActionFormer: Localizing Moments of Actions with Transformers
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: torchvision>=0.15.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tensorboard>=2.10.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: joblib>=1.2.0
Dynamic: license-file

# ActionFormer: Localizing Moments of Actions with Transformers

> **Enhanced fork of [ActionFormer](https://github.com/happyharrycn/actionformer_release) with multi-GPU training and modern transformer architecture.**

## Fork Enhancements

| Feature | Description | Docs |
|---------|-------------|------|
| **Multi-GPU Training** | DDP, AMP, gradient accumulation | [Guide](docs/MULTI_GPU_TRAINING.md) |
| **Transformer v2** | Flash Attention, RoPE, RMSNorm, SwiGLU | [Guide](docs/TRANSFORMER_V2.md) |
| **Detection Quality** | EIoU, DIoU-NMS, temperature scaling | [Guide](docs/DETECTION_QUALITY.md) |
| **SnapFormer** | Heatmap-based point detection for instant events | [Config](#snapformer-point-detection) |
| **TBTFormer** | Boundary Distribution Regression for noisy labels | [Config](#tbtformer-boundary-distribution) |
| **Framework Utils** | Config validation, adaptive ranges, post-processing | [Guide](docs/FRAMEWORK_IMPROVEMENTS.md) |
| **Use Cases** | Configuration recommendations | [Guide](docs/USE_CASES.md) |

### Quick Start

```bash
# Multi-GPU training with all optimizations
torchrun --nproc_per_node=4 train_ddp.py configs/thumos_i3d.yaml --amp --output my_exp

# Test the improvements
python test_improvements.py
```

### Multi-GPU Training (DDP)
- **DistributedDataParallel** for efficient multi-GPU training via `torchrun`
- **Automatic Mixed Precision (AMP)** for ~2x faster training
- **Gradient Accumulation** for larger effective batch sizes (see the sketch after this list)
- **Distributed Validation** during training
- **Multi-node Support** for cluster training
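
The sketch below shows how these pieces typically fit together. It is a minimal illustration only, not the repo's `train_ddp.py` (which also handles dataset sharding, checkpointing, and distributed validation); the assumption that the model returns a scalar loss from `(features, targets)` is purely for brevity.

```python
# Minimal sketch: DDP + AMP + gradient accumulation. Launch with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, optimizer, accum_steps=2, epochs=1):
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for mixed precision

    for _ in range(epochs):
        optimizer.zero_grad()
        for step, (feats, targets) in enumerate(loader):
            with torch.cuda.amp.autocast():            # mixed-precision forward
                # assumes the model returns a scalar loss for (features, targets)
                loss = model(feats.cuda(local_rank), targets) / accum_steps
            scaler.scale(loss).backward()              # scaled backward
            if (step + 1) % accum_steps == 0:          # optimizer step every accum_steps batches
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
    dist.destroy_process_group()
```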

### Modernized Transformer Architecture (v2)
| Component | Description | Benefit |
|-----------|-------------|---------|
| **Flash Attention** | PyTorch 2.0+ optimized attention | 2-4x faster, O(T) memory |
| **RoPE** | Rotary Position Embeddings | Better length generalization |
| **RMSNorm** | Root Mean Square normalization | 1.8x faster than LayerNorm |
| **SwiGLU** | Gated Linear Unit FFN | Better quality per parameter |
| **GQA** | Grouped Query Attention | Faster inference |

### v2 Configuration

```yaml
model:
  backbone_type: convTransformerv2
  backbone:
    use_rope: true
    use_flash_attn: true
    use_swiglu: true
    use_rms_norm: true
```
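
For reference, here is a minimal PyTorch sketch of two of the components listed above, RMSNorm and SwiGLU. It is illustrative only; the actual layers live in the `convTransformerv2` backbone, and the class and parameter names here are placeholders. On PyTorch 2.0+, Flash Attention is exposed through `torch.nn.functional.scaled_dot_product_attention`, which is what the `use_flash_attn` flag is expected to toggle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS, no mean subtraction (cheaper than LayerNorm)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: project to a gate branch and a value branch, multiply, project back."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_val = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_val(x))
```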

### Detection Quality Improvements

| Component | Description | Benefit |
|-----------|-------------|---------|
| **EIoU Loss** | Enhanced IoU with aspect ratio penalty | Better convergence |
| **QualityFocal Loss** | IoU-aware classification | Aligns confidence with localization |
| **DIoU-NMS** | Distance-aware suppression | Better handling of overlapping events |
| **Temperature Scaling** | Score calibration | Improved probability estimates |
| **Class-specific NMS** | Per-class suppression params | Tailored for event types |
| **Deeper Heads (v2)** | Skip connections | Better gradient flow |

```yaml
test_cfg:
  use_diou_nms: true
  score_temperature: 1.3
  class_sigma:
    0: 0.7  # short events
    1: 0.4  # medium events
```
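
As a rough illustration of what `score_temperature` does (a sketch, not the repo's post-processing code): the classification logits are divided by the temperature before the sigmoid, so a temperature above 1 softens over-confident scores and yields better-calibrated probabilities.

```python
import torch

def calibrate_scores(cls_logits: torch.Tensor, temperature: float = 1.3) -> torch.Tensor:
    """Temperature scaling of detection confidences.
    cls_logits: (num_candidates, num_classes) raw classification logits.
    temperature > 1 softens scores, < 1 sharpens them."""
    return torch.sigmoid(cls_logits / temperature)
```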

### SnapFormer (Point Detection)

For detecting **instant/point events** (e.g., snap detection in football, keystroke detection), SnapFormer uses heatmap regression instead of segment boundary regression.

| Component | Description |
|-----------|-------------|
| **Gaussian Heatmap** | Targets are Gaussian peaks centered at event frames |
| **Focal Loss** | CornerNet-style focal loss for class imbalance |
| **Cross-Scale FPN** | Bidirectional PANet-style feature fusion |
| **Peak Detection** | Inference via local maxima finding |

**When to use SnapFormer vs ActionFormer:**
- **SnapFormer**: Events with near-zero duration (snaps, clicks, impacts)
- **ActionFormer**: Events with meaningful duration (diving, running, speaking)

```yaml
# SnapFormer config for point detection
model:
  meta_arch: "SnapFormer"
  fpn_type: "cs_fpn"          # Cross-scale FPN (bidirectional)
  # ... other params same as ActionFormer

# Standard ActionFormer for segment detection
model:
  meta_arch: "LocPointTransformer"
  fpn_type: "fpn"             # Standard top-down FPN
```
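
The two key ideas, Gaussian heatmap targets for training and peak picking at inference, are sketched below under assumed shapes; the actual heads and losses live in the repo's modeling code.

```python
import torch
import torch.nn.functional as F

def gaussian_heatmap(event_frames, num_frames, sigma=2.0):
    """Training target: a Gaussian bump centered on each annotated event frame."""
    t = torch.arange(num_frames, dtype=torch.float32)
    heat = torch.zeros(num_frames)
    for center in event_frames:
        heat = torch.maximum(heat, torch.exp(-0.5 * ((t - center) / sigma) ** 2))
    return heat

def find_peaks(heatmap, threshold=0.3, window=5):
    """Inference: keep frames that are local maxima above a score threshold."""
    pooled = F.max_pool1d(heatmap[None, None], window, stride=1, padding=window // 2)[0, 0]
    is_peak = (heatmap == pooled) & (heatmap > threshold)  # odd window keeps the length unchanged
    return is_peak.nonzero(as_tuple=True)[0]
```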

### TBTFormer (Boundary Distribution)

For action segmentation with **noisy annotations** or **uncertain boundaries**, TBTFormer uses probabilistic boundary regression instead of direct offset prediction.

| Component | Description |
|-----------|-------------|
| **BDR Head** | Predicts probability distribution over boundary locations |
| **Distribution Focal Loss** | Cross-entropy over adjacent bins for smooth learning |
| **Scaled Backbone** | Optional 16 heads, 6x MLP for larger capacity |

**When to use TBTFormer vs ActionFormer:**
- **TBTFormer**: Datasets with annotation noise, ambiguous boundaries
- **ActionFormer**: Clean annotations, well-defined action boundaries

```yaml
# TBTFormer config
model:
  meta_arch: "TBTFormer"
  fpn_type: "cs_fpn"          # Cross-scale FPN recommended
  reg_max: 16                 # Distribution bins (default: 16)
  dfl_weight: 0.25            # DFL loss weight

  # Optional: Scaled backbone (TBT-Former paper settings)
  backbone:
    n_head: 16                # Default: 8
    n_hidden: 1536            # 6x embed dim (default: 4x = 1024)
```
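
A minimal sketch of Distribution Focal Loss over `reg_max` bins, the idea behind the BDR head: each boundary offset is represented as a distribution over discrete bins, trained with cross-entropy split between the two bins adjacent to the continuous target, and decoded as the distribution's expectation. Shapes and names are assumptions; the repo's loss code may differ in detail.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_logits, target, reg_max=16):
    """pred_logits: (N, reg_max + 1) logits over boundary-offset bins.
    target: (N,) continuous offsets in bin units, in [0, reg_max]."""
    left = target.floor().long().clamp(0, reg_max - 1)
    right = left + 1
    w_left = right.float() - target      # closer to the target bin -> larger weight
    w_right = target - left.float()
    loss = (F.cross_entropy(pred_logits, left, reduction="none") * w_left
            + F.cross_entropy(pred_logits, right, reduction="none") * w_right)
    return loss.mean()

def expected_offset(pred_logits, reg_max=16):
    """Decode: the expectation over bins recovers a continuous boundary offset."""
    bins = torch.arange(reg_max + 1, dtype=torch.float32)
    return (pred_logits.softmax(dim=-1) * bins).sum(dim=-1)
```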

### Framework Utilities

| Component | Description |
|-----------|-------------|
| **Config Validation** | Catch misconfigurations early (FPN alignment, sequence divisibility) |
| **Adaptive Ranges** | Compute regression ranges from annotation distribution |
| **Loss Logger** | Moving averages with IoU stats for debugging |
| **Temporal Constraints** | Post-process with min gap, duration limits, overlap rules |

```python
from actionformer import validate_config, compute_adaptive_ranges, TemporalConstraints

validate_config(cfg)  # Raises on invalid config
ranges = compute_adaptive_ranges('annotations.json', num_levels=6)
filtered = TemporalConstraints(min_gap={0: 5.0}).apply(results)
```

---

## Original Implementation

This fork is based on the original ActionFormer by Zhang et al. (ECCV 2022):

- **Original Repository**: [happyharrycn/actionformer_release](https://github.com/happyharrycn/actionformer_release)
- **Paper**: [ActionFormer: Localizing Moments of Actions with Transformers](https://arxiv.org/abs/2202.07925)
- **Authors**: Chen-Lin Zhang, Jianxin Wu, Yin Li

If you use this code, please cite the original paper:
```bibtex
@inproceedings{zhang2022actionformer,
  title={ActionFormer: Localizing Moments of Actions with Transformers},
  author={Zhang, Chen-Lin and Wu, Jianxin and Li, Yin},
  booktitle={European Conference on Computer Vision},
  pages={492--510},
  year={2022}
}
```

---

## Introduction

ActionFormer is one of the first Transformer-based models for temporal action localization --- detecting the onsets and offsets of action instances and recognizing their action categories. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.56% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works).

<div align="center">
  <img src="teaser.jpg" width="600px"/>
</div>

The model adapts local self-attention to model temporal context in untrimmed videos, classifies every moment in an input video, and regresses the corresponding action boundaries. The result is a deep model, trained with standard classification and regression losses, that localizes moments of actions in a single shot.
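
Concretely, each time step carries class scores plus two non-negative distances to the action start and end, so decoding a detection is just an offset subtraction and addition around the step's timestamp. The sketch below is illustrative (shapes and names assumed); NMS is applied afterwards.

```python
import torch

def decode_segments(timestamps, cls_scores, offsets, score_thresh=0.1):
    """timestamps: (T,) time of each moment in seconds.
    cls_scores: (T, num_classes) per-moment class probabilities.
    offsets: (T, 2) regressed distances to the action (start, end)."""
    scores, labels = cls_scores.max(dim=-1)
    keep = scores > score_thresh
    starts = timestamps[keep] - offsets[keep, 0]
    ends = timestamps[keep] + offsets[keep, 1]
    return starts, ends, scores[keep], labels[keep]
```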

**Related projects**:
> [**SnAG: Scalable and Accurate Video Grounding**](https://arxiv.org/abs/2404.02257) <br>
> Fangzhou Mu\*, Sicheng Mo\*, Yin Li <br>
> *CVPR 2024* <br>
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/fmu2/snag_release)  [![github](https://img.shields.io/github/stars/fmu2/snag_release.svg?style=social)](https://github.com/fmu2/snag_release)  [![arXiv](https://img.shields.io/badge/Arxiv-2404.02257-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2404.02257) <br>

## Changelog
* 11/18/2022: We have released the [tech report](https://arxiv.org/abs/2211.09074) for our submission to the [Ego4D Moment Queries (MQ) Challenge](https://eval.ai/web/challenges/challenge-page/1626/overview). The code repo now includes config files, pre-trained models and results on the Ego4D MQ benchmark.

* 08/29/2022: Updated arXiv version.

* 08/01/2022: Updated code repo with latest results on ActivityNet.

* 07/08/2022: The paper is accepted to ECCV 2022.

* 05/09/2022: Pre-trained models have been updated.

* 05/08/2022: We have updated the code repo based on the community feedback and our code review, leading to significantly better average mAP on THUMOS14 (>66.0%) and slightly improved results on ActivityNet and EPIC-Kitchens 100.


## Code Overview
The structure of this code repo is heavily inspired by Detectron2. Some of the main components are
* ./libs/core: Parameter configuration module.
* ./libs/datasets: Data loader and IO module.
* ./libs/modeling: Our main model with all its building blocks.
* ./libs/utils: Utility functions for training, inference, and postprocessing.

## Installation
* Follow INSTALL.md for installing necessary dependencies and compiling the code.

## Frequently Asked Questions
* See FAQ.md.


## To Reproduce Our Results on THUMOS14
**Download Features and Annotations**
* Download *thumos.tar.gz* (`md5sum 375f76ffbf7447af1035e694971ec9b2`) from [this Box link](https://uwmadison.box.com/s/glpuxadymf3gd01m1cj6g5c3bn39qbgr) or [this Google Drive link](https://drive.google.com/file/d/1zt2eoldshf99vJMDuu8jqxda55dCyhZP/view?usp=sharing) or [this BaiduYun link](https://pan.baidu.com/s/1TgS91LVV-vzFTgIHl1AEGA?pwd=74eh).
* The file includes I3D features, action annotations in json format (similar to ActivityNet annotation format), and external classification scores.

**Details**: The features are extracted from two-stream I3D models pretrained on Kinetics using clips of `16 frames` at the video frame rate (`~30 fps`) and a stride of `4 frames`. This gives one feature vector per `4/30 ~= 0.1333` seconds.
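
For example, converting a feature index back to video time follows directly from the clip stride and frame rate. The helper below is only illustrative (it assumes each feature is assigned to the center of its 16-frame clip); the dataset loader handles this internally.

```python
def feature_time(idx, stride=4, fps=30.0, num_frames=16):
    """Approximate center time (seconds) of the idx-th THUMOS14 I3D feature:
    one feature every stride/fps = 4/30 ~= 0.1333 s."""
    return (idx * stride + num_frames / 2) / fps
```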

**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───data/
│    └───thumos/
│    │	 └───annotations
│    │	 └───i3d_features   
│    └───...
|
└───libs
│
│   ...
```

**Training and Evaluation**
* Train our ActionFormer with I3D features. This will create an experiment folder under *./ckpt* that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/thumos_i3d.yaml --output reproduce
```

**[Optional] Multi-GPU Training with DDP**

For faster training on multiple GPUs, use DistributedDataParallel (DDP):
```shell
# Basic 4-GPU training
torchrun --nproc_per_node=4 ./train_ddp.py ./configs/thumos_i3d.yaml --output reproduce

# With AMP (mixed precision) for ~2x faster training
torchrun --nproc_per_node=4 ./train_ddp.py ./configs/thumos_i3d.yaml --amp --output reproduce

# Full example with all features
torchrun --nproc_per_node=4 ./train_ddp.py ./configs/thumos_i3d.yaml \
    --amp --accum-steps 2 --eval-freq 5 --output reproduce
```

**[Optional] Modernized Transformer (v2)**

To use the v2 backbone with Flash Attention, RoPE, RMSNorm, and SwiGLU:
```yaml
model:
  backbone_type: convTransformerv2
  backbone:
    use_rope: true
    use_flash_attn: true
    use_swiglu: true
    use_rms_norm: true
```

* [Optional] Monitor the training using TensorBoard
```shell
tensorboard --logdir=./ckpt/thumos_i3d_reproduce/logs
```
* Evaluate the trained model. The expected average mAP should be around 62.6(%) as in Table 1 of our main paper. **With recent commits, the expected average mAP should be higher than 66.0(%)**.
```shell
python ./eval.py ./configs/thumos_i3d.yaml ./ckpt/thumos_i3d_reproduce
```
* Training our model on THUMOS requires ~4.5GB of GPU memory, while inference may require over 10GB. We recommend using a GPU with at least 12 GB of memory.

**[Optional] Evaluating Our Pre-trained Model**

We also provide a pre-trained model for THUMOS 14. The model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/1isG3bc1dG5-llBRFCivJwz_7c_b0XDcY/view?usp=sharing). To evaluate the pre-trained model, please follow the steps listed below.

* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───pretrained/
│    └───thumos_i3d_reproduce/
│    │	 └───thumos_reproduce_log.txt
│    │	 └───thumos_reproduce_results.txt
│    │   └───...    
│    └───...
|
└───libs
│
│   ...
```
* The training config is recorded in *./pretrained/thumos_i3d_reproduce/config.txt*.
* The training log is located at *./pretrained/thumos_i3d_reproduce/thumos_reproduce_log.txt* and also *./pretrained/thumos_i3d_reproduce/logs*.
* The pre-trained model is *./pretrained/thumos_i3d_reproduce/epoch_034.pth.tar*.
* Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/thumos_i3d.yaml ./pretrained/thumos_i3d_reproduce/
```
* The results (mAP at tIoUs) should be

| Method            |  0.3  |  0.4  |  0.5  |  0.6  |  0.7  |  Avg  |
|-------------------|-------|-------|-------|-------|-------|-------|
| ActionFormer      | 82.13 | 77.80 | 70.95 | 59.40 | 43.87 | 66.83 |


## To Reproduce Our Results on ActivityNet 1.3
**Download Features and Annotations**
* Download *anet_1.3.tar.gz* (`md5sum c415f50120b9425ee1ede9ac3ce11203`) from [this Box link](https://uwmadison.box.com/s/aisdoymowukc99zoc7gpqegxbb4whikx) or [this Google Drive Link](https://drive.google.com/file/d/1VW8px1Nz9A17i0wMVUfxh6YsPCLVqL-S/view?usp=sharing) or [this BaiduYun Link](https://pan.baidu.com/s/1tw5W8B5YqDvfl-mrlWQvnQ?pwd=xuit).
* The file includes TSP features, action annotations in json format (similar to ActivityNet annotation format), and external classification scores.

**Details**: The features are extracted from the R(2+1)D-34 model pretrained with TSP on ActivityNet using clips of `16 frames` at a frame rate of `15 fps` and a stride of `16 frames` (*i.e.,* **non-overlapping** clips). This gives one feature vector per `16/15 ~= 1.067` seconds. The features are converted into numpy files for our code.

**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───data/
│    └───anet_1.3/
│    │	 └───annotations
│    │	 └───tsp_features   
│    └───...
|
└───libs
│
│   ...
```

**Training and Evaluation**
* Train our ActionFormer with TSP features. This will create an experiment folder under *./ckpt* that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/anet_tsp.yaml --output reproduce
```
* [Optional] Monitor the training using TensorBoard
```shell
tensorboard --logdir=./ckpt/anet_tsp_reproduce/logs
```
* Evaluate the trained model. The expected average mAP should be around 36.5(%) as in Table 1 of our main paper.
```shell
python ./eval.py ./configs/anet_tsp.yaml ./ckpt/anet_tsp_reproduce
```
* Training our model on ActivityNet requires ~4.6GB of GPU memory, while inference may require over 10GB. We recommend using a GPU with at least 12 GB of memory.

**[Optional] Evaluating Our Pre-trained Model**

We also provide a pre-trained model for ActivityNet 1.3. The model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/1JKh3w14ngAjgzuuP22BnjhkhIcBSqteJ/view?usp=sharing). To evaluate the pre-trained model, please follow the steps listed below.

* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───pretrained/
│    └───anet_tsp_reproduce/
│    │	 └───anet_tsp_reproduce_log.txt
│    │	 └───anet_tsp_reproduce_results.txt
│    │   └───...    
│    └───...
|
└───libs
│
│   ...
```
* The training config is recorded in *./pretrained/anet_tsp_reproduce/config.txt*.
* The training log is located at *./pretrained/anet_tsp_reproduce/anet_tsp_reproduce_log.txt* and also *./pretrained/anet_tsp_reproduce/logs*.
* The pre-trained model is *./pretrained/anet_tsp_reproduce/epoch_014.pth.tar*.
* Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/anet_tsp.yaml ./pretrained/anet_tsp_reproduce/
```
* The results (mAP at tIoUs) should be

| Method            |  0.5  |  0.75 |  0.95 |  Avg  |
|-------------------|-------|-------|-------|-------|
| ActionFormer      | 54.67 | 37.81 |  8.36 | 36.56 |


**[Optional] Reproducing Our Results with I3D Features**

* Download *anet_1.3_i3d.tar.gz* (`md5sum e649425954e0123401650312dd0d56a7`) from [this Google Drive Link](https://drive.google.com/file/d/16239kUT2Z-j6S6PXIT1b_31OJi35QW_o/view?usp=sharing).

**Details**: The features are extracted from the I3D model pretrained on Kinetics using clips of `16 frames` at a frame rate of `25 fps` and a stride of `16 frames`. This gives one feature vector per `16/25 = 0.64` seconds. The features are converted into numpy files for our code.

* Unpack the file under *./data* (or elsewhere and link to *./data*), similar to TSP features.

* Train our ActionFormer with I3D features. This will create an experiment folder under *./ckpt* that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/anet_i3d.yaml --output reproduce
```

* Evaluate the trained model. The expected average mAP should be around 36.0(%), slightly better than reported in our paper thanks to an improved training scheme and hyperparameters (see comments in the config file).
```shell
python ./eval.py ./configs/anet_i3d.yaml ./ckpt/anet_i3d_reproduce
```

* The pre-trained model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/152dw2JDoNPssSnaQDaNolQUSFgcHlxe3/view?usp=sharing). To produce the results, create a folder *./pretrained*, unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*), and run
```shell
python ./eval.py ./configs/anet_i3d.yaml ./pretrained/anet_i3d_reproduce/
```

* The results (mAP at tIoUs) with I3D features should be

| Method            |  0.5  |  0.75 |  0.95 |  Avg  |
|-------------------|-------|-------|-------|-------|
| ActionFormer      | 54.29 | 36.71 |  8.24 | 36.03 |

## To Reproduce Our Results on EPIC Kitchens 100
**Download Features and Annotations**
* Download *epic_kitchens.tar.gz* (`md5sum add9803756afd9a023bc9a9c547e0229`) from [this Box link](https://uwmadison.box.com/s/vdha47qnce6jhqktz9g4mq1gc40w82yj) or [this Google Drive Link](https://drive.google.com/file/d/1Z4U_dLuu6_cV5NBIrSzsSDOOj2Uar85X/view?usp=sharing) or [this BaiduYun Link](https://pan.baidu.com/s/15tOdX6Yp4AJ9lFGjbQ8dgg?pwd=f3tx).
* The file includes SlowFast features as well as action annotations in json format (similar to ActivityNet annotation format).

**Details**: The features are extracted from the SlowFast model pretrained on the training set of EPIC Kitchens 100 (action classification) using clips of `32 frames` at a frame rate of `30 fps` and a stride of `16 frames`. This gives one feature vector per `16/30 ~= 0.5333` seconds.

**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───data/
│    └───epic_kitchens/
│    │	 └───annotations
│    │	 └───features   
│    └───...
|
└───libs
│
│   ...
```

**Training and Evaluation**
* On EPIC Kitchens, we train separate models for nouns and verbs.
* To train our ActionFormer on verbs with SlowFast features, use
```shell
python ./train.py ./configs/epic_slowfast_verb.yaml --output reproduce
```
* To train our ActionFormer on nouns with SlowFast features, use
```shell
python ./train.py ./configs/epic_slowfast_noun.yaml --output reproduce
```
* Evaluate the trained model for verbs. The expected average mAP should be around 23.4(%) as in Table 2 of our main paper.
```shell
python ./eval.py ./configs/epic_slowfast_verb.yaml ./ckpt/epic_slowfast_verb_reproduce
```
* Evaluate the trained model for nouns. The expected average mAP should be around 21.9(%) as in Table 2 of our main paper.
```shell
python ./eval.py ./configs/epic_slowfast_noun.yaml ./ckpt/epic_slowfast_noun_reproduce
```
* Training our model on EPIC Kitchens requires ~4.5GB of GPU memory, while inference may require over 10GB. We recommend using a GPU with at least 12 GB of memory.

**[Optional] Evaluating Our Pre-trained Model**

We also provide a pre-trained model for EPIC-Kitchens 100. The model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/1Ta4ggKSj2YcszSrDbePlHe1ECF1CFKK4/view?usp=sharing) (verb), and from this [Google Drive link](https://drive.google.com/file/d/1OTlxeiWj8JE9n1-LsRYogHmqgUdsE5PR/view?usp=sharing) (noun). To evaluate the pre-trained model, please follow the steps listed below.

* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───pretrained/
│    └───epic_slowfast_verb_reproduce/
│    │	 └───epic_slowfast_verb_reproduce_log.txt
│    │	 └───epic_slowfast_verb_reproduce_results.txt
│    │   └───...   
│    └───epic_slowfast_noun_reproduce/
│    │	 └───epic_slowfast_noun_reproduce_log.txt
│    │	 └───epic_slowfast_noun_reproduce_results.txt
│    │   └───...  
│    └───...
|
└───libs
│
│   ...
```
* The training config is recorded in *./pretrained/epic_slowfast_(verb|noun)_reproduce/config.txt*.
* The training log is located at *./pretrained/epic_slowfast_(verb|noun)_reproduce/epic_slowfast_(verb|noun)_reproduce_log.txt* and also *./pretrained/epic_slowfast_(verb|noun)_reproduce/logs*.
* The pre-trained model is *./pretrained/epic_slowfast_(verb|noun)_reproduce/epoch_020.pth.tar* (epoch 20 for both verb and noun).
* Evaluate the pre-trained model for verbs.
```shell
python ./eval.py ./configs/epic_slowfast_verb.yaml ./pretrained/epic_slowfast_verb_reproduce/
```
* Evaluate the pre-trained model for nouns.
```shell
python ./eval.py ./configs/epic_slowfast_noun.yaml ./pretrained/epic_slowfast_noun_reproduce/
```
* The results (mAP at tIoUs) should be

| Method              |  0.1  |  0.2  |  0.3  |  0.4  |  0.5  |  Avg  |
|---------------------|-------|-------|-------|-------|-------|-------|
| ActionFormer (verb) | 26.58 | 25.42 | 24.15 | 22.29 | 19.09 | 23.51 |
| ActionFormer (noun) | 25.21 | 24.11 | 22.66 | 20.47 | 16.97 | 21.88 |

## To Reproduce Our Results on Ego4D Moment Queries Benchmark
**Download Features and Annotations**
* Download the official SlowFast and Omnivore features from [the Ego4D website](https://ego4d-data.org/#download) and the official EgoVLP features from [this link](https://github.com/showlab/EgoVLP/issues/1#issuecomment-1219076014). Please note that we are not authorized to release the features and annotations. Instead, we provide our script for feature and annotation conversion at `./tools/convert_ego4d_trainval.py`.

**Details**: All features are extracted at `1.875 fps` from videos at `30 fps`. This gives one feature vector per `~0.5333` seconds. Please refer to Ego4D and EgoVLP's documentation for more details on feature extraction.

**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───data/
│    └───ego4d/
│    │   └───annotations
│    │   └───slowfast_features
│    │   └───omnivore_features
│    │   └───egovlp_features  
│    └───...
|
└───libs
│
│   ...
```

**Training and Evaluation**
* We provide config files for training ActionFormer with different feature combinations. For example, the command below trains on Omnivore and EgoVLP features and creates an experiment folder under *./ckpt* that stores the training config, logs, and checkpoints.
```shell
python ./train.py ./configs/ego4d_omnivore_egovlp.yaml --output reproduce
```
* [Optional] Monitor the training using TensorBoard
```shell
tensorboard --logdir=./ckpt/ego4d_omnivore_egovlp_reproduce/logs
```
* Evaluate the trained model. The expected average mAP and Recall@1x at tIoU=0.5 should be around 22.0(%) and 40.0(%), respectively.
```shell
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./ckpt/ego4d_omnivore_egovlp_reproduce
```
* Training our model on Ego4D with all three features requires ~4.5GB of GPU memory, while inference may require over 10GB. We recommend using a GPU with at least 12 GB of memory.

**[Optional] Evaluating Our Pre-trained Model**

We also provide pre-trained models for Ego4D trained with all feature combinations. The models with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/drive/folders/1NpAECS0ZhcCuehXkF9OhLQDPFrNdStJb?usp=sharing). To evaluate the pre-trained model, please follow the steps listed below.

* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* An example folder structure should look like
```
This folder
│   README.md
│   ...  
│
└───pretrained/
│    └───ego4d_omnivore_egovlp_reproduce/
│    │   └───ego4d_omnivore_egovlp_reproduce_log.txt
│    │   └───ego4d_omnivore_egovlp_reproduce_results.txt
│    │   └───...   
│    └───...
|
└───libs
│
│   ...
```
* The training config is recorded in *./pretrained/ego4d_omnivore_egovlp_reproduce/config.txt*.
* The training log is located at *./pretrained/ego4d_omnivore_egovlp_reproduce/ego4d_omnivore_egovlp_reproduce_log.txt* and also *./pretrained/ego4d_omnivore_egovlp_reproduce/logs*.
* The pre-trained model is *./pretrained/ego4d_omnivore_egovlp_reproduce/epoch_010.pth.tar*.
* Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./pretrained/ego4d_omnivore_egovlp_reproduce/
```
* The results (mAP at tIoUs) should be

| Method                |  0.1  |  0.2  |  0.3  |  0.4  |  0.5  |  Avg  |
|-----------------------|-------|-------|-------|-------|-------|-------|
| ActionFormer (S)      | 20.09 | 17.45 | 14.44 | 12.46 | 10.00 | 14.89 |
| ActionFormer (O)      | 23.87 | 20.78 | 18.39 | 15.33 | 12.65 | 18.20 |
| ActionFormer (E)      | 26.84 | 23.86 | 20.57 | 17.19 | 14.54 | 20.60 |
| ActionFormer (S+E)    | 27.98 | 24.46 | 21.21 | 18.56 | 15.60 | 21.56 |
| ActionFormer (O+E)    | 27.99 | 24.94 | 21.94 | 19.05 | 15.98 | 21.98 |
| ActionFormer (S+O+E)  | 28.26 | 24.69 | 21.88 | 19.35 | 16.28 | 22.09 |

* The results (Recall@1x at tIoUs) should be

| Method                |  0.1  |  0.2  |  0.3  |  0.4  |  0.5  |  Avg  |
|-----------------------|-------|-------|-------|-------|-------|-------|
| ActionFormer (S)      | 52.25 | 45.84 | 40.60 | 36.58 | 31.33 | 41.32 |
| ActionFormer (O)      | 54.63 | 48.72 | 43.03 | 37.76 | 33.57 | 43.54 |
| ActionFormer (E)      | 59.53 | 54.39 | 48.97 | 42.75 | 37.12 | 48.55 |
| ActionFormer (S+E)    | 59.96 | 53.75 | 48.76 | 44.00 | 38.96 | 49.09 |
| ActionFormer (O+E)    | 61.03 | 54.15 | 49.79 | 45.17 | 39.88 | 49.99 |
| ActionFormer (S+O+E)  | 60.85 | 54.16 | 49.60 | 45.12 | 39.87 | 49.92 |

## Training and Evaluating Your Own Dataset
Work in progress. Stay tuned.

## Contact
- **Original Authors**: Yin Li (yin.li@wisc.edu)
- **This Fork**: [sreevadde/actionformer_release](https://github.com/sreevadde/actionformer_release)

## References

Please cite the original ActionFormer paper:
```bibtex
@inproceedings{zhang2022actionformer,
  title={ActionFormer: Localizing Moments of Actions with Transformers},
  author={Zhang, Chen-Lin and Wu, Jianxin and Li, Yin},
  booktitle={European Conference on Computer Vision},
  pages={492--510},
  year={2022}
}
```

If you cite results on Ego4D, please also cite the tech report:
```bibtex
@article{mu2022actionformerego4d,
  title={Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge},
  author={Mu, Fangzhou and Mo, Sicheng and Wang, Gillian and Li, Yin},
  journal={arXiv preprint arXiv:2211.09074},
  year={2022}
}
```

If you are using TSP features, please cite
```bibtex
@inproceedings{alwassel2021tsp,
  title={{TSP}: Temporally-sensitive pretraining of video encoders for localization tasks},
  author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  pages={3173--3183},
  year={2021}
}
```
