Compare similarities across different modalities with an interactive run on VESSL.
ImageBind learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. Using only image-paired data, it binds these modalities together and extends the zero-shot capabilities of large-scale vision-language models to the non-visual modalities. This enables applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and cross-modal generation. ImageBind sets state-of-the-art results on emergent zero-shot and few-shot recognition tasks, and it also serves as an evaluation framework for vision models across visual and non-visual tasks.
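Under the hood, comparing modalities comes down to embedding each input with ImageBind and taking dot products in the shared embedding space. The sketch below is modeled on the example usage in the upstream facebookresearch/ImageBind repository; the prompts and file paths are placeholders, and older checkouts (including forks from around the initial release) may expose the same functions as top-level modules (`import data`, `from models import imagebind_model`) rather than under the `imagebind` package.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (weights are downloaded on first use).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Placeholder inputs; replace with your own prompts and files.
text_list = ["A dog.", "A car.", "A bird."]
image_paths = ["dog.jpg", "car.jpg", "bird.jpg"]
audio_paths = ["dog.wav", "car.wav", "bird.wav"]

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# All modalities share one embedding space, so a dot product compares,
# e.g., images against text or images against audio.
vision_x_text = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
vision_x_audio = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1
)
print("Vision x Text:", vision_x_text)
print("Vision x Audio:", vision_x_audio)
```

The Streamlit demo launched by the YAML below provides an interactive front end for this kind of similarity comparison.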
<aside> 💡 You need an A100 GPU to run this YAML. Please refer to Cluster Integrations.
</aside>
name: ImageBind
description: "Compare similarities across different modalities with an interactive run on VESSL."
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  cluster: vessl-dgx-a100
  preset: gpu-1
run:
  - workdir: /root/ImageBind
    command: |
      conda create --name imagebind python=3.8 -y
      source activate imagebind
      pip install numpy
      pip install vtk==9.0.1
      pip install mayavi
      pip install -r requirements.txt
      conda install -c conda-forge cartopy -y
      streamlit run streamlit_demo.py
import:
  /root/ImageBind: git://github.com/treasuraid/ImageBind
interactive:
  max_runtime: 24h
  jupyter:
    idle_timeout: 120m
ports:
  - name: streamlit
    type: http
    port: 8501