TL;DR

Compare similarities across different modalities with an interactive run on VESSL.

Description

ImageBind learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. It needs only image-paired training data to bind the modalities together, and it extends the zero-shot capabilities of large-scale vision-language models to these new modalities. The shared embedding space enables applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and cross-modal generation. ImageBind achieves state-of-the-art performance on emergent zero-shot and few-shot recognition tasks, and it also serves as an evaluation framework for vision models on visual and non-visual tasks.
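
As a rough illustration of what comparing similarities across modalities looks like in code, the sketch below follows the usage example from the upstream facebookresearch/ImageBind repository; the module paths (`data`, `models.imagebind_model`), the `imagebind_huge` constructor, and the sample asset paths are assumptions that may differ in the fork imported by this run.

```python
# Minimal cross-modal similarity sketch, adapted from the upstream ImageBind
# usage example; asset paths are placeholders.
import torch

import data
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind (huge) checkpoint.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Preprocess each modality and embed everything into the shared space.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Similarity across modalities: softmax over dot products of the embeddings.
print("Vision x Text:",
      torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
print("Audio x Text:",
      torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))
```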

YAML

<aside> 💡 You need an A100 GPU to run this YAML. Please refer to Cluster Integrations.

</aside>

name: ImageBind
description: "Compare similarities across different modalities with an interactive run on VESSL."
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  cluster: vessl-dgx-a100
  preset: gpu-1
run:
  - workdir: /root/ImageBind
    command: |
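      # Create an isolated Python 3.8 environment and install the demo's dependencies.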
      conda create --name imagebind python=3.8 -y
      source activate imagebind
      pip install numpy
      pip install vtk==9.0.1
      pip install mayavi
      pip install -r requirements.txt
      conda install -c conda-forge cartopy -y
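      # Launch the Streamlit demo; it serves on Streamlit's default port 8501, exposed via the ports section below.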
      streamlit run streamlit_demo.py
import: 
  /root/ImageBind: git://github.com/treasuraid/ImageBind
interactive: 
  max_runtime: 24h
  jupyter:
    idle_timeout: 120m
ports:
  - name: streamlit
    type: http
    port: 8501
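
The run command launches `streamlit_demo.py` from the imported repository, and the `ports` section exposes Streamlit's default port 8501 as an HTTP endpoint. As a hypothetical, minimal sketch of the kind of app such a demo serves (not the repository's actual script), the example below compares an uploaded image against a few text prompts in ImageBind's shared embedding space; it reuses the assumed upstream module paths from the earlier example.

```python
# Hypothetical minimal Streamlit demo; the repository's streamlit_demo.py may differ.
import torch
import streamlit as st

import data
from models import imagebind_model
from models.imagebind_model import ModalityType


@st.cache_resource
def load_model():
    # Download and cache the pretrained ImageBind (huge) checkpoint once per session.
    model = imagebind_model.imagebind_huge(pretrained=True)
    model.eval()
    return model.to("cuda:0" if torch.cuda.is_available() else "cpu")


model = load_model()
device = next(model.parameters()).device

st.title("ImageBind: cross-modal similarity")
texts = st.text_area("Text prompts (one per line)", "A dog.\nA car\nA bird").splitlines()
image = st.file_uploader("Query image", type=["jpg", "jpeg", "png"])

if image is not None and texts:
    # ImageBind's loader expects file paths, so persist the upload first.
    with open("/tmp/query.jpg", "wb") as f:
        f.write(image.getbuffer())
    inputs = {
        ModalityType.TEXT: data.load_and_transform_text(texts, device),
        ModalityType.VISION: data.load_and_transform_vision_data(["/tmp/query.jpg"], device),
    }
    with torch.no_grad():
        emb = model(inputs)
    # Softmax over image-to-text similarities gives one score per prompt.
    scores = torch.softmax(emb[ModalityType.VISION] @ emb[ModalityType.TEXT].T, dim=-1)[0]
    for prompt, score in zip(texts, scores.tolist()):
        st.write(f"{prompt}: {score:.3f}")
```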

Demo

(Screenshot: ImageBind Streamlit demo)