How well are unimodal vision and language models aligned? This question is critical for advancing multimodal AI. Although prior work has approached this problem, existing methodologies often do not translate effectively to practical applications. To address this, we propose a direct assessment method, inspired by linear probing, for evaluating vision-language alignment.
Learning Alignment
How can we efficiently learn alignment between unimodal models? We introduce Swift Alignment of Image and Language (SAIL), an efficient transfer-learning framework that aligns pretrained unimodal vision and language models for downstream tasks.
By leveraging only ~6% of the paired image-text data required by CLIP, SAIL achieves multimodal alignment using a single A100 GPU with just ~5 hours of training.
It supports batch sizes up to 32,768 and delivers outstanding performance, including 73.4% zero-shot accuracy on ImageNet (surpassing CLIP's 72.7%), while excelling in zero-shot retrieval, complex reasoning, and semantic segmentation. It also enhances SSL vision encoders such as DINOv2 when they are integrated into multimodal large language models.
Part 1: Assessing Alignment between Unimodal Models
Overview
Key Questions
Alignment Capability: How well can unimodal visual and language models align for zero-shot open-vocabulary tasks?
Model Architecture Impact: Do larger models trained on extensive datasets yield better alignment, or does the choice of self-supervised learning (SSL) method play a more significant role?
Representation Properties: What properties of SSL representations, such as linear separability or clustering quality, drive stronger cross-modal alignment?
We propose Visual-Language Alignment Probing, a direct assessment method inspired by linear probing in SSL evaluation. This approach freezes pretrained vision and language backbones and trains a lightweight linear alignment layer on image-text datasets.
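In practice, this probe can be implemented as a pair of trainable linear projections on top of frozen encoders, optimized with a standard InfoNCE contrastive objective. The PyTorch sketch below illustrates the setup under simplifying assumptions; the encoder interfaces, embedding dimension, and temperature initialization are illustrative rather than the exact configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAlignmentProbe(nn.Module):
    """Frozen unimodal encoders plus trainable linear layers mapping both
    modalities into a shared embedding space (illustrative sketch)."""

    def __init__(self, vision_encoder, text_encoder, vision_dim, text_dim, embed_dim=512):
        super().__init__()
        # Both backbones stay frozen; only the two linear layers are trained.
        self.vision_encoder = vision_encoder.eval().requires_grad_(False)
        self.text_encoder = text_encoder.eval().requires_grad_(False)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style init

    def forward(self, images, texts):
        with torch.no_grad():
            v = self.vision_encoder(images)  # (B, vision_dim) pooled image features
            t = self.text_encoder(texts)     # (B, text_dim) sentence embeddings
        v = F.normalize(self.vision_proj(v), dim=-1)
        t = F.normalize(self.text_proj(t), dim=-1)
        return v, t

def info_nce(v, t, logit_scale):
    """Symmetric InfoNCE loss over in-batch negatives."""
    logits = logit_scale.exp() * v @ t.T               # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)  # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```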
Results and Findings
We use the open-source DreamLIP CC3M dataset (2.2M paired image-text samples) to train the alignment layer, leveraging its diversity and quality as an effective probing dataset. To measure alignment quality, we test on MSCOCO in a zero-shot retrieval setup using the R@10 metric, reporting the average recall of the text-to-image and image-to-text retrieval tasks. For systematic evaluation, we fix an anchor model in one modality and vary models in the other modality to identify which models best align with the anchor.
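For reference, once both modalities are embedded, this metric reduces to a top-k lookup in the similarity matrix. The snippet below is a minimal sketch of average R@10 that assumes a one-to-one image-caption pairing, which is a simplification relative to the standard MSCOCO protocol with multiple captions per image.

```python
import torch

def average_recall_at_k(image_emb, text_emb, k=10):
    """Average of image-to-text and text-to-image Recall@K. Assumes both inputs
    are L2-normalized and that row i of each matrix forms a matching pair."""
    sim = image_emb @ text_emb.T                     # (N, N) cosine similarities
    gt = torch.arange(sim.size(0)).unsqueeze(1)      # ground-truth index per query
    i2t = (sim.topk(k, dim=1).indices == gt).any(dim=1).float().mean()
    t2i = (sim.T.topk(k, dim=1).indices == gt).any(dim=1).float().mean()
    return 0.5 * (i2t + t2i)
```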
For the language anchor, we select GTE-en-large-v1.5, and broadly evaluate various self-supervised vision models.
Figure: Linear alignment probing results for various vision models, trained on 2.2M paired samples from CC3M. The radius of each point represents the relative number of parameters in the model. The Y-axis indicates zero-shot MSCOCO retrieval average R@10 performance.
Key Findings
Model Size Impact: Larger models do not always lead to better alignment.
SSL Method Matters: DINOv2 demonstrates superior alignment with language anchors, outperforming larger models like AIM-L (1B parameters) despite its smaller size (86M parameters), while DINO-ResNet matches DINO-B's performance with fewer parameters, highlighting ResNet's efficiency. In contrast, MAE-series models underperform, likely due to their pixel-level reconstruction focus, which emphasizes low-level details over the high-level semantics essential for image-text alignment.
Representation Properties: Alignment performance depends strongly on the clustering quality of the SSL representations, as reflected by k-NN classification performance, more than on linear separability (see the sketch below).
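These two properties correspond to standard SSL evaluation protocols: a k-NN classifier on frozen features probes clustering quality, while a linear classifier probes linear separability. A minimal scikit-learn sketch on precomputed features; the feature extraction and hyperparameters are assumed for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    """Clustering quality: how often do a sample's nearest neighbors share its label?"""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return knn.score(test_feats, test_labels)

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Linear separability: accuracy of a linear classifier on frozen features."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```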
For the vision anchor, we select DINOv2-L, and broadly evaluate various language models.
Figure: Linear alignment probing results for various language models, trained on 2.2M paired samples from CC3M. The radius of each point represents the relative number of parameters in the model. The Y-axis indicates zero-shot MSCOCO retrieval average R@10 performance.
Key Findings
Model Scale and Capability: Larger, stronger language models (as measured by the MTEB benchmark) consistently achieve better alignment with the vision anchor, highlighting the importance of language model scale and capability.
Language Understanding Critical: Strong language understanding capabilities are essential for complex vision-language reasoning tasks.
CLIP Training Limitations: Training text encoders solely through CLIP-style contrastive learning proves insufficient for optimal performance.
Pretrained LM Advantage: Leveraging pretrained language models as text encoders emerges as a promising strategy for building robust vision-language models, as they bring rich linguistic knowledge.
Part 2: Learning Alignment between Unimodal Models
Swift Alignment of Image and Language Framework
We introduce Swift Alignment of Image and Language (SAIL), a streamlined framework for aligning pretrained vision and language models. Our efficient two-step training pipeline optimizes both performance and computational cost. Specifically, SAIL achieves superior alignment through three key optimizations (see the sketch after this list):
Alignment Layer Architecture: A non-linear GLU-based alignment layer that improves alignment quality over a plain linear projection.
Enhanced Loss Function: A sigmoid binary-classification loss with balanced positive/negative contributions.
High-Quality Data Selection: MLLM-generated captions used as additional positives, together with a multi-positive caption contrastive loss.
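The sketch below illustrates the first two ingredients under simplifying assumptions: a gated (SwiGLU-style) alignment head in place of the linear probe, and a SigLIP-style pairwise sigmoid loss in place of InfoNCE. Module names, depths, the exact gating variant, and the positive/negative balancing are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUAlignmentBlock(nn.Module):
    """Gated (SwiGLU-style) block: a non-linear alternative to a single linear layer."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, hidden_dim)
        self.up = nn.Linear(dim, hidden_dim)
        self.down = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        h = self.norm(x)
        h = F.silu(self.gate(h)) * self.up(h)  # gated non-linearity
        return x + self.down(h)                # residual connection (assumed)

class GLUAlignmentHead(nn.Module):
    """Stack of GLU blocks followed by L2 normalization into the shared space."""
    def __init__(self, in_dim, embed_dim, depth=4, hidden_mult=4):
        super().__init__()
        self.inp = nn.Linear(in_dim, embed_dim)
        self.blocks = nn.ModuleList(
            [GLUAlignmentBlock(embed_dim, hidden_mult * embed_dim) for _ in range(depth)]
        )

    def forward(self, x):
        x = self.inp(x)
        for blk in self.blocks:
            x = blk(x)
        return F.normalize(x, dim=-1)

def sigmoid_loss(v, t, logit_scale, logit_bias):
    """SigLIP-style loss: every (image, text) pair becomes a binary classification
    problem, with diagonal pairs as positives and all others as negatives."""
    logits = logit_scale * (v @ t.T) + logit_bias
    labels = 2.0 * torch.eye(v.size(0), device=v.device) - 1.0  # +1 on diagonal, -1 elsewhere
    # Averaging over all B^2 pairs; the paper's exact positive/negative balancing may differ.
    return -F.logsigmoid(labels * logits).mean()
```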
Ablation

| # | Method | IN-1K 0-shot | T2I R@1 | I2T R@1 |
|---|--------|--------------|---------|---------|
| 0 | Baseline | 33.2 | 11.1 | 13.5 |
| 1 | + MLP × 4 | 36.8 | 8.0 | 10.7 |
| 2 | + GLU × 4 | 39.6 | 11.5 | 17.4 |
| 3 | + GLU × 8 | 45.4 | 16.1 | 22.5 |
| 4 | + Sigmoid | 50.7 | 25.4 | 36.0 |
| 5 | + \|B\| → \|B\|² | 51.8 | 26.2 | 36.7 |
| 6 | + Long-HQ | 48.4 | 31.4 | 44.2 |
| 7 | + Multi-Pos | 54.0 | 32.9 | 45.4 |

Table: Ablation results using CC3M on different methods.
Baseline refers to aligning unimodal models with only a linear layer using InfoNCE loss.
Evaluating SAIL on Downstream Tasks
We train SAIL using the state-of-the-art DINOv2-L as the vision model, paired with two language models: the compact GTE-en-large-v1.5 (SAIL-L-GTE) and the powerful NV-Embed-2 (SAIL-L-NV2), on the 23M Merged dataset with ShareGPT4-generated captions as additional positive captions. Training SAIL takes ~5 hours on a single A100 GPU with a batch size of up to 32,768.
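A batch size of 32,768 is feasible on one GPU largely because the backbones stay frozen: their features require no gradients and can plausibly be computed once and cached, so only the small alignment heads are trained. The sketch below illustrates such a cached-feature training loop, reusing GLUAlignmentHead and sigmoid_loss from the earlier sketch; the file names, dimensions, and hyperparameters are illustrative assumptions rather than the released training recipe.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical cached features: one forward pass of the frozen encoders, stored once.
vision_feats = torch.load("dinov2_features.pt")   # (N, vision_dim), assumed precomputed
text_feats = torch.load("gte_text_features.pt")   # (N, text_dim), assumed precomputed
loader = DataLoader(TensorDataset(vision_feats, text_feats),
                    batch_size=32768, shuffle=True, drop_last=True)

# Only the lightweight alignment heads and loss scalars carry gradients.
vision_head = GLUAlignmentHead(vision_feats.size(1), 1024).cuda()
text_head = GLUAlignmentHead(text_feats.size(1), 1024).cuda()
logit_scale = torch.nn.Parameter(torch.tensor(10.0, device="cuda"))
logit_bias = torch.nn.Parameter(torch.tensor(-10.0, device="cuda"))
params = list(vision_head.parameters()) + list(text_head.parameters()) + [logit_scale, logit_bias]
opt = torch.optim.AdamW(params, lr=1e-4)

for epoch in range(5):                                 # epoch count is illustrative
    for v, t in loader:
        v_emb = vision_head(v.cuda(non_blocking=True))
        t_emb = text_head(t.cuda(non_blocking=True))
        loss = sigmoid_loss(v_emb, t_emb, logit_scale, logit_bias)
        opt.zero_grad()
        loss.backward()
        opt.step()
```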
SAIL excels in various zero-shot cross-modal tasks, including image recognition, cross-modal retrieval, open-vocabulary segmentation, and MLLM tasks.
SAIL achieves superior performance on image recognition tasks. Trained on only ~6% of the image-text pairs used by CLIP, SAIL outperforms CLIP-L on most datasets. Notably, SAIL-L (NV2) achieves 73.4% accuracy on ImageNet-1K, surpassing the performance of CLIP-L.
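Zero-shot classification follows the usual CLIP-style protocol: each class name is wrapped in a prompt, embedded by the aligned text tower, and every image is assigned the class with the highest cosine similarity. A minimal sketch, where the prompt template and encoder interface are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, encode_text):
    """image_emb: (B, D) L2-normalized image embeddings from the aligned vision tower.
    encode_text: callable mapping a list of strings to (C, D) text embeddings."""
    prompts = [f"a photo of a {name}" for name in class_names]
    class_emb = F.normalize(encode_text(prompts), dim=-1)  # (C, D)
    logits = image_emb @ class_emb.T                       # cosine similarity to each class
    return logits.argmax(dim=-1)                           # predicted class index per image
```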
| Data | Model | Food101 | CIFAR10 | CIFAR100 | SUN397 | Cars | Aircraft | DTD | Pets | Cal101 | Flowers | Avg. | IN-1K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CC12M | SAIL-L (GTE) | 71.2 | 96.3 | 83.8 | 67.2 | 33.0 | 8.0 | 53.0 | 66.5 | 82.6 | 57.7 | 61.9 | 63.9 |
| 23M Merged | SAIL-L (GTE) | 76.1 | 97.3 | 84.6 | 68.6 | 32.0 | 16.0 | 52.5 | 56.9 | 83.0 | 68.3 | 63.5 | 65.4 |
| CC12M | SAIL-L (NV2) | 81.9 | 96.1 | 85.2 | 68.3 | 42.9 | 16.3 | 60.4 | 84.7 | 82.4 | 67.5 | 68.6 | 72.1 |
| 23M Merged | SAIL-L (NV2) | 86.1 | 96.7 | 86.7 | 69.8 | 44.6 | 28.6 | 63.5 | 82.3 | 85.4 | 77.2 | 72.1 | 73.4 |
| LAION400M | CLIP-L | 90.1 | 94.6 | 77.4 | 72.6 | 89.6 | 25.0 | 60.4 | 91.7 | 82.1 | 75.5 | 75.9 | 72.7 |
Table: Zero-shot classification top-1 accuracy (%) on various datasets.
SAIL consistently outperforms CLIP-L on retrieval-based tasks. On complex reasoning tasks in particular, SAIL achieves significant improvements over CLIP-L, again highlighting the importance of strong language representations for vision-language tasks.
| Data | Model | MSCOCO I2T | MSCOCO T2I | Flickr30k I2T | Flickr30k T2I | Winoground Text | Winoground Image | Winoground Group | MMVP Avg. |
|---|---|---|---|---|---|---|---|---|---|
| CC12M | SAIL-L (GTE) | 50.4 | 39.3 | 78.4 | 66.6 | 33.25 | 13.0 | 9.25 | 17.0 |
| 23M Merged | SAIL-L (GTE) | 54.1 | 42.7 | 80.8 | 68.9 | 34.0 | 13.25 | 8.75 | 22.2 |
| CC12M | SAIL-L (NV2) | 57.3 | 45.3 | 84.9 | 73.0 | 37.75 | 18.25 | 13.2 | 28.0 |
| 23M Merged | SAIL-L (NV2) | 62.4 | 48.6 | 87.6 | 75.7 | 40.25 | 18.75 | 15.0 | 28.9 |
| LAION400M | CLIP-L | 59.7 | 43.0 | 87.6 | 70.2 | 30.5 | 11.5 | 8.75 | 20.0 |

Table: Results on standard retrieval, complex reasoning, and visual-centric tasks. All models use the ViT-L/14 architecture. We report Recall@1 for MSCOCO and Flickr30k; Text, Image, and Group scores for Winoground; and the average score for MMVP.
We analyzed image-image cosine similarity for 150 MMVP image pairs to evaluate fine-grained visual discrimination, including subtle differences in orientation, perspective, quantity, color, and contextual details. While CLIP tends to assign high similarity scores even to images that differ in these conditions, DINOv2 better captures subtle visual differences. Our analysis shows that SAIL's cosine similarity distribution aligns closely with DINOv2's, demonstrating that SAIL inherits DINOv2's strong capability for fine-grained visual discrimination.
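The analysis itself is simple to reproduce: embed both images of each MMVP pair with a given vision tower and record their cosine similarity; a more discriminative encoder yields a distribution shifted toward lower similarities. A minimal sketch, with the pair loading and encoder interface left as assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pairwise_similarities(encode_image, image_pairs):
    """image_pairs: iterable of (img_a, img_b) image tensors; encode_image maps
    a batch of images to (B, D) features. Returns one cosine similarity per pair."""
    sims = []
    for img_a, img_b in image_pairs:
        feats = F.normalize(encode_image(torch.stack([img_a, img_b])), dim=-1)  # (2, D)
        sims.append((feats[0] * feats[1]).sum().item())
    return sims  # compare the resulting distributions, e.g., for CLIP vs. DINOv2 vs. SAIL
```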
An image is represented as a sequence of tokens $X = [x_{cls}, X_{patch}]$, where $X_{patch} \in \mathbb{R}^{hw \times d}$. For each class label $c$, we embed a prompt such as "a photo of a {label}" into a sentence embedding $y_c$, compute the cosine similarity between every patch and every class embedding, and assign each patch its most similar class to produce the segmentation mask: $\mathcal{M} = \arg\max_{c} \cos(X_{patch}, y_{c})$.
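A minimal sketch of this procedure, assuming aligned patch features and per-class prompt embeddings are available; the shapes and the nearest-neighbor upsampling back to image resolution are illustrative choices.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment(patch_feats, class_text_emb, h, w, image_size):
    """patch_feats: (h*w, D) aligned patch embeddings (CLS token removed).
    class_text_emb: (C, D) embeddings of prompts like 'a photo of a {label}'.
    image_size: (H, W) of the input image. Returns an (H, W) class-index mask."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    sim = patch_feats @ class_text_emb.T             # (h*w, C) cosine similarities
    mask = sim.argmax(dim=-1).reshape(1, 1, h, w)    # per-patch class index
    # Upsample the low-resolution patch mask back to the input resolution.
    return F.interpolate(mask.float(), size=image_size, mode="nearest").long()[0, 0]
```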
| Data | Model (ViT-L/14) | ADE20K | Stuff | VOC20 |
|---|---|---|---|---|
| LAION400M | CLIP ‡ | 1.2 | 2.4 | 15.8 |
| LAION400M | MaskCLIP ‡ | 6.9 | 8.9 | 30.1 |
| LAION400M | SCLIP ‡ | 7.1 | 13.1 | 60.3 |
| 23M Merged | SAIL (GTE) | 13.5 | 14.1 | 65.2 |
| 23M Merged | SAIL (NV2) | 14.2 | 14.7 | 66.1 |

Table: Open-vocabulary semantic segmentation mIoU results compared with CLIP-based methods. All models use ViT-L/14 as the vision architecture. ‡ Cited results.
We demonstrate that alignment training with the SAIL framework transforms features from SSL models such as DINOv2 into more language-compatible representations, making them better suited for integration into MLLMs for complex vision-language tasks. We train LLaVA-1.5 with various vision encoders and evaluate across downstream tasks.
| # | Model@224px | VTune | SEED-IMG | GQA | VizWiz | POPE | TextVQA | MMB | VQAv2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DINOv2-L | ✗ | 61.47 | 61.08 | 44.12 | 85.5 | 45.37 | 56.96 | 74.4 |
| 1 | DINOv2-L | ✓ | 62.12 | 61.53 | 46.59 | 85.7 | 45.92 | 58.85 | 74.69 |
| 2 | SAIL-L | ✓ | 65.43 | 62.63 | 50.00 | 86.16 | 46.53 | 60.14 | 76.77 |
| 3 | CLIP-L/14* | ✗ | 64.05 | 61.58 | 48.87 | 85.74 | 54.56 | 63.06 | 75.32 |
| 4 | CLIP-L/14* | ✓ | 64.15 | 61.54 | 49.93 | 85.73 | 54.18 | 64.12 | 76.36 |
Table: LLaVA-1.5 with various vision models. *Reproduced using OpenAI CLIP-L@224. VTune indicates if the vision encoder is fine-tuned during the instruction tuning stage.
SAIL-L (row 2) significantly enhances DINOv2's capabilities through alignment training on 23M image-text pairs. Despite CLIP being trained on 400M pairs, SAIL transforms DINOv2 from trailing CLIP to outperforming it on 5 out of 7 tasks (rows 1-4). This improvement holds even when compared to a CLIP model fine-tuned during instruction-tuning (row 4), demonstrating SAIL's effectiveness in learning language-aligned visual features that integrate seamlessly with LLMs. While SAIL shows lower performance on TextVQA and MMB tasks requiring OCR capabilities, this limitation stems from DINOv2's inherent architecture, as evidenced by consistently lower OCR performance in DINOv2 baselines (rows 0-1) compared to CLIP variants.