DAN Analysis

ColPali, Jina v4, and Cohere Embed v4: The 2026 Multimodal RAG Stack Race

Before you dive in

This article is a specific deep-dive within our broader topic of Multimodal RAG.

This article assumes familiarity with the basics of Multimodal RAG.

Coming from software engineering? Read the bridge first: Knowledge Retrieval for Engineers: What Transfers, What Breaks →

TL;DR

  • The shift: The embedding layer for Multimodal RAG forked into open-source late-interaction and managed single-vector — they are no longer competing for the same buyer.
  • Why it matters: Picking the wrong stack now means re-platforming retrieval in twelve months, because vector databases, latency budgets, and licenses all follow from this choice.
  • What’s next: Late-interaction goes mainstream for visual document retrieval; hosted APIs absorb mixed-modality enterprise workloads; small models start eating the per-token economy from below.

Three labs shipped multimodal embedding models within twelve months. They did not ship the same product. Open-weights teams chased ViDoRe scores with late-interaction architectures. Cohere chased the procurement form with a hosted API and mixed-modality payloads. Jina tried to do both. The race that headlines kept calling a horse race was actually a market split — and the split is now permanent.

The Multimodal Embedding Layer Just Forked

Thesis: The 2026 multimodal RAG race produced two stacks, not one — open-source late-interaction and hosted single-vector — and your job is to pick the one your team can actually run.

This is not a benchmark argument. It is a business architecture argument. ColPali and its successors optimize for one thing: maximum retrieval precision on visual documents, with the operational cost paid by your infrastructure team. Cohere optimizes for a different thing: a single API call that ingests interleaved images and text, with the cost paid in tokens and your CFO’s signature.
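The split is concrete at the scoring layer. Here is a minimal NumPy sketch of late-interaction (MaxSim) scoring in the ColPali style; every shape, dimension, and patch count below is an illustrative assumption, not any model's actual config:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction (MaxSim) score, ColPali-style.

    query_tokens: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_tokens:   (num_doc_patches, dim) L2-normalized document patch embeddings
    For each query token, take its best-matching document patch, then sum.
    """
    sims = query_tokens @ doc_tokens.T      # (q, p) pairwise cosine similarities
    return float(sims.max(axis=1).sum())    # best patch per query token, summed

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(8, 128)))     # 8 query tokens, 128-dim (assumed)
doc_a = normalize(rng.normal(size=(1024, 128)))  # ~1024 patches per page (assumed)
doc_b = normalize(rng.normal(size=(1024, 128)))

# Rank pages by MaxSim instead of by one cosine similarity per page.
scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
```

The point of the sketch is the storage shape: every document keeps a full token matrix, not one vector, which is exactly the operational cost the infrastructure team pays.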

Both shipped within months of each other. Both got called “the leader” by their own marketing. Neither is wrong about itself.

The reader who treats this as “which one wins” is asking the wrong question. The right question is which buyer wins. And in 2026, two different buyers are winning — at the same time, with different tools.

Three Releases, One Pattern

The pattern is clearest when you ignore the dates and look at the architecture choices. Late-interaction won the open-weights leaderboards.

Illuin Tech’s ColQwen2.5-v0.2 hit 89.4 nDCG@5 on ViDoRe under an Apache 2.0 license, and ColQwen3.5 plus ModernVBERT shipped on March 31, 2026 (Illuin Tech’s GitHub repository). ModernVBERT is a 250M-parameter encoder that matches models ten times larger on ViDoRe — a sub-1B state of the art that says small visual encoders are not done improving (ModernVBERT paper).

Jina Embeddings v4 dropped June 25, 2025: a 3.8B-parameter model on a Qwen2.5-VL-3B-Instruct backbone, with three task-specific LoRA adapters and 90.17 ViDoRe nDCG@5 in multi-vector mode (Jina v4 paper). It supports single-vector retrieval too. That dual-mode design is the bridge play.
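Dual-mode in practice means the same token embeddings can either be collapsed into one vector or scored token-by-token. A hedged sketch of that trade; mean-pooling is the common single-vector pattern, not necessarily Jina's exact implementation:

```python
import numpy as np

def single_vector(tokens: np.ndarray) -> np.ndarray:
    """Collapse token embeddings to one vector (mean-pool, then L2-normalize)."""
    v = tokens.mean(axis=0)
    return v / np.linalg.norm(v)

def single_vector_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """One cosine similarity per document: cheap to store, any vector DB works."""
    return float(single_vector(q_tokens) @ single_vector(d_tokens))

def multi_vector_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """MaxSim over all token pairs: higher precision, needs multi-vector storage."""
    sims = q_tokens @ d_tokens.T
    return float(sims.max(axis=1).sum())
```

Same model output, two retrieval stacks: that is why a dual-mode release can serve both buyers at once.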

Cohere Embed v4 launched April 15, 2025 with 128K context, Matryoshka dimensions from 256 to 1536, and a single payload that accepts interleaved images and text (Cohere Docs). Pricing runs $0.12 per million text tokens and $0.47 per million image tokens (MetaCTO pricing roundup). Cohere does not publish a ViDoRe score and treats benchmark leadership as a vendor claim, not an apples-to-apples result.
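At those rates, the hosted-vs-self-hosted question starts as back-of-envelope arithmetic. A sketch; the per-million-token rates above are the only real numbers here, and the tokens-per-page figures are assumptions for illustration:

```python
# Back-of-envelope embedding cost at the published Embed v4 rates.
TEXT_PRICE_PER_M = 0.12    # USD per 1M text tokens (from the text)
IMAGE_PRICE_PER_M = 0.47   # USD per 1M image tokens (from the text)

def embed_cost(pages: int, text_tokens_per_page: int, image_tokens_per_page: int) -> float:
    """Total one-time embedding cost in USD for a corpus of pages."""
    text_cost = pages * text_tokens_per_page / 1e6 * TEXT_PRICE_PER_M
    image_cost = pages * image_tokens_per_page / 1e6 * IMAGE_PRICE_PER_M
    return text_cost + image_cost

# Example: 1M scanned pages at ~500 text and ~1,000 image tokens each (assumed).
cost = embed_cost(pages=1_000_000, text_tokens_per_page=500, image_tokens_per_page=1_000)
```

Note that image tokens dominate at roughly 4x the text rate, which is why visually heavy corpora are where the self-hosted math starts to bite.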

Three releases. Two architectures. One direction: the embedding layer is no longer a single product category.

Who Moves Up

Open-weights teams with their own GPUs are the clearest winners. ColQwen3.5 and ModernVBERT under Apache 2.0 mean a self-hosted visual retrieval stack now ships with a license your legal team will sign without a meeting.

Vector databases that support tensor indexes — the multi-vector storage pattern late-interaction requires — just became infrastructure. Without that capability, the leaderboard models can’t be deployed at scale.
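The storage requirement behind tensor indexes is easy to quantify. A rough sketch; the patch counts and dimensions below are illustrative assumptions, not any specific model's config, and index overhead is ignored:

```python
def storage_gb(num_docs: int, vectors_per_doc: int, dim: int,
               bytes_per_value: int = 2) -> float:
    """Raw embedding storage in GB (float16 by default), ignoring index overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 1e9

# 1M pages: one 1024-dim vector each vs ~1,024 patches of 128 dims each (assumed).
single = storage_gb(num_docs=1_000_000, vectors_per_doc=1, dim=1024)
multi = storage_gb(num_docs=1_000_000, vectors_per_doc=1024, dim=128)
```

Under these assumptions the multi-vector corpus is about 128x larger, which is why late-interaction at scale is a database feature, not just a model choice.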

Cohere wins the hyperscaler shelf. Embed v4 ships on AWS Bedrock, SageMaker, and Azure AI Foundry (AWS news). For an enterprise where retrieval has to be on the same procurement contract as the rest of the cloud spend, that distribution matters more than any leaderboard delta.

MongoDB picked up the strategic moat. Voyage AI was acquired for $220M in February 2025 (Milvus Blog), folding multimodal embeddings directly into a database company. That’s not a product move. That’s a market positioning move — owning retrieval at the storage layer.

And Qwen3-VL-Embedding entered the field in 2026, evaluated against JinaVDR and ViDoRe v3 (QwenLM GitHub repository). The leaderboard is no longer a two-horse race. It’s a portfolio.

Who Gets Left Behind

Single-vector-only multimodal pipelines are the clearest losers. Late-interaction is now the leaderboard pattern for visual document retrieval, and teams that haven’t planned for tensor-index support in their vector DB are about to discover their stack can’t run the models that win benchmarks (RAGFlow review).

ColPali v1.x users on the Gemma license are stuck. The commercial path now runs through Apache 2.0 ColQwen variants. Anyone shipping a product on the original license is one legal review away from a forced migration.

Teams running a stitched pipeline — separate text embedder, separate image embedder, custom interleaver — just got obsoleted by Cohere Embed v4’s mixed-modality payload. The integration glue they wrote in 2024 is now technical debt.

And anyone treating embedding choice as a model decision instead of a stack decision is about to learn that Document Parsing And Extraction, Metadata Filtering, and vector DB selection all cascade from the embedding architecture. Pick wrong on the model, and you re-platform retrieval.

What Happens Next

Base case (most likely): Two parallel stacks consolidate. Late-interaction becomes the default for high-precision visual retrieval; hosted single-vector wins the enterprise SaaS layer; Jina v4 holds the bridge. Vector DB roadmaps explicitly support tensor indexes by year-end. Signal to watch: The dedicated late-interaction workshop scheduled for early 2026 (RAGFlow review) — attendance and tooling announcements there set the open-source trajectory. Timeline: 12 months.

Bull case: Sub-1B-parameter encoders like ModernVBERT push visual retrieval entirely on-prem, breaking the per-token economy for image embedding APIs. Self-hosted becomes cheaper and faster than hosted at small scale. Signal: A 250M-class encoder matches a 4B-class model on ViDoRe v3 in production deployments. Timeline: 18-24 months.

Bear case: Leaderboards saturate, ViDoRe v4 resets the field, and another year of stack churn follows. Teams that bet early on a specific architecture eat the migration cost. Signal: ViDoRe v3 nDCG@5 ceilings collapse into rounding-error territory across the top five entrants. Timeline: 6-12 months.

Frequently Asked Questions

Q: Which multimodal RAG tools are leading in 2026? A: ColQwen3.5 and ModernVBERT lead the open-weights ViDoRe leaderboard for visual document retrieval. Jina Embeddings v4 covers both single- and multi-vector use cases on the same model. Cohere Embed v4 leads managed enterprise pipelines with mixed-modality payloads, 128K context, and hyperscaler distribution.

Q: Where is multimodal RAG heading in 2026 and beyond? A: Toward two parallel stacks. Late-interaction multi-vector retrieval becomes the leaderboard pattern for visual documents. Managed single-vector APIs absorb enterprise mixed-modality workloads. Sub-1B-parameter encoders push more inference on-prem, while Knowledge Graphs For RAG layers move retrieval beyond pure vector search.

The Bottom Line

Two stacks won. The benchmark winner and the procurement winner are no longer the same model, and both are right for different teams. Pick the stack your buyer trusts — your latency budget, license, and vector DB choice all follow from there.


AI-assisted content, human-reviewed. Images AI-generated.