Sentence Transformers just got a serious upgrade. Version 5.4 brings multimodal embedding and reranker models into the fold, letting you encode and compare text, images, audio, and video using the same API you already know. If you’ve been waiting to build cross-modal search pipelines or multimodal RAG systems, this is where it gets interesting.
What’s New?
Traditional embedding models turn text into vectors. Multimodal embedding models do the same for several modalities (text, images, audio, video) and map them all into one shared embedding space. That means you can compare a text query against image documents, find video clips that match a description, or build RAG pipelines that work across modalities with the same encode-and-compare workflow you use for text.
Rerankers got the same treatment. Instead of just scoring text-to-text relevance, multimodal rerankers can handle pairs where one or both elements are images, combined text-image documents, or other modalities. This opens up use cases like visual document retrieval and cross-modal search that were previously a pain to implement.
Getting Started
Installation is straightforward, but you need the right extras depending on what modalities you’re working with:
# Image support
pip install -U "sentence-transformers[image]"
# Audio support
pip install -U "sentence-transformers[audio]"
# Video support
pip install -U "sentence-transformers[video]"
# Or combine extras as needed
pip install -U "sentence-transformers[image,video,train]"
A heads-up: VLM-based models like Qwen3-VL-2B need a GPU with at least 8 GB of VRAM. The 8B variants will eat about 20 GB. If you’re stuck on CPU, stick with text-only or CLIP models—these multimodal beasts are painfully slow without a GPU.
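If you want to encode that rule of thumb in a setup script, a minimal sketch could look like the following. The thresholds come from the note above; the function name and the tier labels are hypothetical, and the 20 GB figure is approximate.

```python
# Thresholds taken from the hardware note above (the 8B figure is approximate).
MIN_VRAM_GB_2B = 8
MIN_VRAM_GB_8B = 20


def pick_model_tier(vram_gb):
    """Map available GPU memory (in GB) to a rough model tier.

    The tier labels are hypothetical placeholders, not model names.
    """
    if vram_gb >= MIN_VRAM_GB_8B:
        return "vlm-8b"
    if vram_gb >= MIN_VRAM_GB_2B:
        return "vlm-2b"
    # No GPU or too little VRAM: fall back to text-only or CLIP models.
    return "text-only-or-clip"


for vram in (0, 8, 24):
    print(vram, "GB ->", pick_model_tier(vram))
```

With PyTorch installed, the available memory could be read from torch.cuda.get_device_properties(0).total_memory and converted to GB before calling the function.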
Using Multimodal Embedding Models
Loading a multimodal model is identical to loading a text-only one. The model auto-detects which modalities it supports, so there’s no extra configuration needed:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
Encoding images works the same as encoding text. Pass URLs, local file paths, or PIL Image objects:
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)
Cross-modal similarity is where this gets fun. Since the model maps everything into the same space, you can compare text embeddings to image embeddings directly:
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
The results make sense: “A green car parked in front of a yellow building” scores highest with the car image (0.51), and “A bee on a pink flower” matches the bee image (0.67). The hard negatives correctly get lower scores.
You might notice those scores aren’t close to 1.0. That’s the modality gap—embeddings from different modalities tend to cluster in separate regions of the space. Cross-modal similarities are typically lower than within-modal ones, but the relative ordering is preserved, so retrieval still works fine.
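To make the "relative ordering" point concrete, here is a small sketch in plain Python. The similarity matrix is illustrative: only 0.51 and 0.67 come from the run above, and every other value is a hypothetical placeholder chosen to be lower. Picking the best text for each image depends only on which score is largest, not on how close it is to 1.0.

```python
texts = [
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
]
images = ["car.jpg", "bee.jpg"]

# Rows are texts, columns are images.
# Only 0.51 and 0.67 are reported above; the other values are hypothetical.
similarities = [
    [0.51, 0.10],
    [0.30, 0.08],
    [0.09, 0.67],
    [0.12, 0.35],
]


def best_text_per_image(matrix):
    """For each image (column), return the row index of the highest-scoring text."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [max(range(n_rows), key=lambda r: matrix[r][c]) for c in range(n_cols)]


best = best_text_per_image(similarities)
for col, row in enumerate(best):
    print(f"{images[col]}: {texts[row]!r} (score {similarities[row][col]})")
```

Even though no score exceeds 0.67, each image is matched to the right caption, which is all a retrieval step needs.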
For retrieval tasks, use encode_query() and encode_document() instead. Many retrieval models prepend different instruction prompts depending on whether the input is a query or a document, and these methods handle that automatically.
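As a rough sketch of how those two methods fit into a retrieval step, the helper below encodes one query and a list of documents, then sorts the documents by similarity. The function name rank_documents and its top_k parameter are hypothetical. It assumes model.similarity returns a 1 x n score matrix for a single query, and it works with any object that exposes encode_query, encode_document, and similarity.

```python
def rank_documents(model, query, documents, top_k=3):
    """Return the top_k (document, score) pairs for a single query.

    `model` is assumed to be a SentenceTransformer (or any object with the
    same encode_query / encode_document / similarity methods).
    """
    query_embedding = model.encode_query(query)            # query-side prompt, if any
    document_embeddings = model.encode_document(documents)  # document-side prompt, if any

    # Assumption: one query against n documents gives a 1 x n score matrix.
    scores = model.similarity(query_embedding, document_embeddings)[0]

    ranked = sorted(
        zip(documents, (float(score) for score in scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Example usage (requires downloading the model, so it is left as a comment):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# print(rank_documents(model, "A bee on a pink flower", image_urls, top_k=1))
```

Here image_urls stands for any list of image inputs, such as the two URLs used earlier.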
Multimodal Reranker Models
Rerankers work similarly. Load a model and pass mixed-modality pairs:
from sentence_transformers import CrossEncoder
model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
pairs = [
    ("A green car", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"),
    ("A bee on a flower", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
]
scores = model.predict(pairs)
print(scores)
The model returns relevance scores for each pair. You can also pass combined text-image documents as single items, which is useful for RAG pipelines where documents are more than just text.
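In a RAG pipeline you usually want the candidates sorted by those scores. A minimal sketch, assuming a hypothetical helper named rerank that pairs one query with each candidate and sorts by the output of predict:

```python
def rerank(model, query, candidates, top_k=None):
    """Score each (query, candidate) pair and return candidates, best first.

    `model` is assumed to be a CrossEncoder (or any object whose predict()
    takes a list of pairs and returns one score per pair).
    """
    pairs = [(query, candidate) for candidate in candidates]
    scores = model.predict(pairs)

    ranked = sorted(
        zip(candidates, (float(score) for score in scores)),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]


# Example usage (requires downloading the model, so it is left as a comment):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
# print(rerank(model, "A bee on a flower", image_urls, top_k=1))
```

Here image_urls is a placeholder for whatever candidate list your first-stage retrieval returns.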
Input Formats and Configuration
The library accepts a wide range of input types. For images: URLs, local paths, PIL Images, numpy arrays, or bytes. For audio: URLs, paths, numpy arrays, or raw bytes. For video: URLs, paths, or numpy arrays (frames).
You can check what modalities a model supports:
print(model.supported_modalities)
And pass kwargs to control image resolution or model precision:
embeddings = model.encode(images, processor_kwargs={"size": 384})
Which Models Are Supported?
The update supports a growing list of multimodal models. Qwen3-VL variants are the headline act, but CLIP-based models and others are also available. Check the Sentence Transformers documentation for the full list—it’s expanding fast.
My Take
This is a solid update. Sentence Transformers has been my go-to for embedding and reranking work, and adding multimodal support without changing the API is exactly the right move. The modality gap issue is worth keeping in mind—don’t expect cross-modal similarity scores to hit the same range as text-to-text—but for retrieval and ranking, the relative ordering is what matters.
If you’re building multimodal RAG or cross-modal search, this saves you from stitching together separate pipelines for each modality. One library, one API, done.