TL;DR

Use Top-Down attention to improve Vision-Language tasks with an interactive run on VESSL.

Description

Current attention algorithms, such as self-attention, highlight all salient objects in an image regardless of the task at hand. In contrast, humans use task-guided top-down attention to focus on task-related objects. This paper introduces AbSViT, a ViT model with top-down modulation that approximates Analysis-by-Synthesis (AbS) and enables controllable top-down attention. AbSViT improves performance on Vision-Language tasks and also serves as a general-purpose backbone for classification, semantic segmentation, and model robustness.
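
To make the idea concrete, below is a minimal, illustrative PyTorch sketch of task-guided top-down attention: a self-attention layer whose queries are biased by a projected task prior, steering attention toward task-relevant tokens. This is not the paper's actual AbSViT architecture; the class name, the additive query modulation, and the prior shape are simplifying assumptions.

import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Illustrative sketch: self-attention with queries biased by a task prior."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.prior_proj = nn.Linear(dim, dim)  # projects the top-down task prior
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, prior=None):
        # x: (B, N, C) token features; prior: (B, C) task embedding (optional)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if prior is not None:
            # Top-down modulation: add the projected prior to every query,
            # biasing attention toward task-related tokens.
            q = q + self.prior_proj(prior).unsqueeze(1)

        def split(t):
            return t.view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, H, N, N)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

In a Vision-Language setting, the prior could be, for example, a pooled text embedding of the question or caption; without a prior, the layer reduces to ordinary bottom-up self-attention.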

YAML

name: AbSViT
description: "Use Top-Down attention to improve Vision-Language tasks with an interactive run on VESSL."
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - workdir: /root/AbSViT
    command: |
      pip install -r requirements.txt
      apt-get update && apt-get install -y libmagickwand-dev
import:
  /root/AbSViT: git://github.com/bfshi/AbSViT
interactive:
  max_runtime: 24h
  jupyter:
    idle_timeout: 120m
ports:
  - name: streamlit
    type: http
    port: 8501
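
Assuming the VESSL CLI is installed and configured, the run can typically be launched by saving the YAML above to a file and creating a run from it, for example with a command along the lines of: vessl run create -f absvit.yaml (the file name is arbitrary).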

Demo