Use Top-Down attention to improve Vision-Language tasks with an interactive run on VESSL.
Standard attention mechanisms such as self-attention highlight every salient object in an image regardless of the task at hand. Humans, in contrast, use task-guided top-down attention to focus on the objects that matter for the current task. The AbSViT paper introduces a top-down modulated ViT that approximates Analysis-by-Synthesis (AbS) and enables controllable top-down attention. AbSViT improves performance on vision-language tasks and also serves as a general-purpose backbone for classification, semantic segmentation, and model robustness.
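The PyTorch sketch below is a minimal, hypothetical illustration of the top-down idea described above, not the authors' implementation: a feedforward pass produces visual tokens, a task prior (for example a text embedding) reweights them by similarity, and the reweighted signal is fed back and added to the input for a second pass. The `TopDownViTSketch` class, its `feedback` projection, and the toy encoder are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TopDownViTSketch(nn.Module):
    """Hypothetical sketch of top-down modulation in the spirit of AbSViT.

    Not the paper's architecture: a feedforward encoder runs once, its output
    tokens are reweighted by similarity to a task prior, and the reweighted
    signal is fed back and added to the input tokens for a second pass.
    """

    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder               # any module mapping (B, N, D) -> (B, N, D)
        self.feedback = nn.Linear(dim, dim)  # learned feedback projection (assumed)

    def forward(self, tokens: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # 1) Bottom-up pass: plain feedforward encoding of the input tokens.
        bottom_up = self.encoder(tokens)                     # (B, N, D)

        # 2) Task-guided reweighting: tokens similar to the prior get larger weights.
        sim = torch.einsum("bnd,bd->bn", bottom_up, prior)   # (B, N)
        weights = sim.softmax(dim=-1).unsqueeze(-1)          # (B, N, 1)

        # 3) Feedback pass: add the modulated signal to the input and re-encode.
        top_down = self.feedback(weights * bottom_up)        # (B, N, D)
        return self.encoder(tokens + top_down)


# Toy usage: a single transformer encoder layer stands in for the ViT backbone.
encoder = nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
model = TopDownViTSketch(encoder, dim=192)
out = model(torch.randn(2, 196, 192), torch.randn(2, 192))   # (B=2, N=196, D=192)
```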
```yaml
name: AbSViT
description: "Use top-down attention to improve vision-language tasks with an interactive run on VESSL."
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - workdir: /root/AbSViT
    command: |
      pip install -r requirements.txt
      apt-get install -y libmagickwand-dev  # ImageMagick (MagickWand) development headers
import:
  /root/AbSViT: git://github.com/bfshi/AbSViT
interactive:
  max_runtime: 24h
  jupyter:
    idle_timeout: 120m
ports:
  - name: streamlit
    type: http
    port: 8501
```
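To launch the example yourself, save the manifest above (for example as absvit.yaml) and create the run from the VESSL CLI, typically with `vessl run create -f absvit.yaml`; the exact subcommand may vary across CLI versions, so check `vessl run --help` if it does not match.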
Workspace: Jupyter notebook
Once the interactive run is up, open the Jupyter workspace and run /root/AbSViT/demo/demo.ipynb to try the demo.