DAN Analysis

ColPali, Jina v4, and Cohere Embed v4: The 2026 Multimodal RAG Stack Race

Before you dive in

This article is a specific deep-dive within our broader topic of Multimodal RAG.

This article assumes familiarity with the basics of Multimodal RAG.

Coming from software engineering? Read the bridge first: Knowledge Retrieval for Engineers: What Transfers, What Breaks →

TL;DR

  • The shift: The embedding layer for Multimodal RAG forked into open-source late-interaction and managed single-vector — they are no longer competing for the same buyer.
  • Why it matters: Picking the wrong stack now means re-platforming retrieval in twelve months, because vector databases, latency budgets, and licenses all follow from this choice.
  • What’s next: Late-interaction goes mainstream for visual document retrieval; hosted APIs absorb mixed-modality enterprise workloads; small models start eating the per-token economy from below.

Three labs shipped multimodal embedding models within twelve months. They did not ship the same product. Open-weights teams chased ViDoRe scores with late-interaction architectures. Cohere chased the procurement form with a hosted API and mixed-modality payloads. Jina tried to do both. The race that headlines kept calling a horse race was actually a market split — and the split is now permanent.

The Multimodal Embedding Layer Just Forked

Thesis: The 2026 multimodal RAG race produced two stacks, not one — open-source late-interaction and hosted single-vector — and your job is to pick the one your team can actually run.

This is not a benchmark argument. It is a business architecture argument. ColPali and its successors optimize for one thing: maximum retrieval precision on visual documents, with the operational cost paid by your infrastructure team. Cohere optimizes for a different thing: a single API call that ingests interleaved images and text, with the cost paid in tokens and your CFO’s signature.
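The split is concrete at the scoring layer. Here is a minimal NumPy sketch of late-interaction (MaxSim) scoring in the ColPali style; every shape, dimension, and patch count below is an illustrative assumption, not any model's actual config:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction (MaxSim) score, ColPali-style.

    query_tokens: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_tokens:   (num_doc_patches, dim) L2-normalized document patch embeddings
    For each query token, take its best-matching document patch, then sum.
    """
    sims = query_tokens @ doc_tokens.T      # (q, p) pairwise cosine similarities
    return float(sims.max(axis=1).sum())    # best patch per query token, summed

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(8, 128)))     # 8 query tokens, 128-dim (assumed)
doc_a = normalize(rng.normal(size=(1024, 128)))  # ~1024 patches per page (assumed)
doc_b = normalize(rng.normal(size=(1024, 128)))

# Rank pages by MaxSim instead of by one cosine similarity per page.
scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
```

The point of the sketch is the storage shape: every document keeps a full token matrix, not one vector, which is exactly the operational cost the infrastructure team pays.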

Both shipped within months of each other. Both got called “the leader” by their own marketing. Neither is wrong about itself.

The reader who treats this as “which one wins” is asking the wrong question. The right question is which buyer wins. And in 2026, two different buyers are winning — at the same time, with different tools.

Three Releases, One Pattern

The pattern is clearest when you ignore the dates and look at the architecture choices. Late-interaction won the open-weights leaderboards.

Illuin Tech’s ColQwen2.5-v0.2 hit 89.4 nDCG@5 on ViDoRe under an Apache 2.0 license, and ColQwen3.5 plus ModernVBERT shipped on March 31, 2026 (Illuin Tech’s GitHub repository). ModernVBERT is a 250M-parameter encoder that matches models ten times larger on ViDoRe — a sub-1B state of the art that says small visual encoders are not done improving (ModernVBERT paper).

Jina Embeddings v4 dropped June 25, 2025: a 3.8B-parameter model on a Qwen2.5-VL-3B-Instruct backbone, with three task-specific LoRA adapters and 90.17 ViDoRe nDCG@5 in multi-vector mode (Jina v4 paper). It supports single-vector retrieval too. That dual-mode design is the bridge play.
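Dual-mode in practice means the same token embeddings can either be collapsed into one vector or scored token-by-token. A hedged sketch of that trade; mean-pooling is the common single-vector pattern, not necessarily Jina's exact implementation:

```python
import numpy as np

def single_vector(tokens: np.ndarray) -> np.ndarray:
    """Collapse token embeddings to one vector (mean-pool, then L2-normalize)."""
    v = tokens.mean(axis=0)
    return v / np.linalg.norm(v)

def single_vector_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """One cosine similarity per document: cheap to store, any vector DB works."""
    return float(single_vector(q_tokens) @ single_vector(d_tokens))

def multi_vector_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """MaxSim over all token pairs: higher precision, needs multi-vector storage."""
    sims = q_tokens @ d_tokens.T
    return float(sims.max(axis=1).sum())
```

Same model output, two retrieval stacks: that is why a dual-mode release can serve both buyers at once.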

Cohere Embed v4 launched April 15, 2025 with 128K context, Matryoshka dimensions from 256 to 1536, and a single payload that accepts interleaved images and text (Cohere Docs). Pricing runs $0.12 per million text tokens and $0.47 per million image tokens (MetaCTO pricing roundup). Cohere does not publish a ViDoRe score and treats benchmark leadership as a vendor claim, not an apples-to-apples result.
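At those rates, the hosted-vs-self-hosted question starts as back-of-envelope arithmetic. A sketch; the per-million-token rates above are the only real numbers here, and the tokens-per-page figures are assumptions for illustration:

```python
# Back-of-envelope embedding cost at the published Embed v4 rates.
TEXT_PRICE_PER_M = 0.12    # USD per 1M text tokens (from the text)
IMAGE_PRICE_PER_M = 0.47   # USD per 1M image tokens (from the text)

def embed_cost(pages: int, text_tokens_per_page: int, image_tokens_per_page: int) -> float:
    """Total one-time embedding cost in USD for a corpus of pages."""
    text_cost = pages * text_tokens_per_page / 1e6 * TEXT_PRICE_PER_M
    image_cost = pages * image_tokens_per_page / 1e6 * IMAGE_PRICE_PER_M
    return text_cost + image_cost

# Example: 1M scanned pages at ~500 text and ~1,000 image tokens each (assumed).
cost = embed_cost(pages=1_000_000, text_tokens_per_page=500, image_tokens_per_page=1_000)
```

Note that image tokens dominate at roughly 4x the text rate, which is why visually heavy corpora are where the self-hosted math starts to bite.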

Three releases. Two architectures. One direction: the embedding layer is no longer a single product category.

Who Moves Up

Open-weights teams with their own GPUs are the clearest winners. ColQwen3.5 and ModernVBERT under Apache 2.0 mean a self-hosted visual retrieval stack now ships with a license your legal team will sign without a meeting.

Vector databases that support tensor indexes — the multi-vector storage pattern late-interaction requires — just became infrastructure. Without that capability, the leaderboard models can’t be deployed at scale.
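The storage requirement behind tensor indexes is easy to quantify. A rough sketch; the patch counts and dimensions below are illustrative assumptions, not any specific model's config, and index overhead is ignored:

```python
def storage_gb(num_docs: int, vectors_per_doc: int, dim: int,
               bytes_per_value: int = 2) -> float:
    """Raw embedding storage in GB (float16 by default), ignoring index overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 1e9

# 1M pages: one 1024-dim vector each vs ~1,024 patches of 128 dims each (assumed).
single = storage_gb(num_docs=1_000_000, vectors_per_doc=1, dim=1024)
multi = storage_gb(num_docs=1_000_000, vectors_per_doc=1024, dim=128)
```

Under these assumptions the multi-vector corpus is about 128x larger, which is why late-interaction at scale is a database feature, not just a model choice.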

Cohere wins the hyperscaler shelf. Embed v4 ships on AWS Bedrock, SageMaker, and Azure AI Foundry (AWS news). For an enterprise where retrieval has to be on the same procurement contract as the rest of the cloud spend, that distribution matters more than any leaderboard delta.

MongoDB picked up the strategic moat. Voyage AI was acquired for $220M in February 2025 (Milvus Blog), folding multimodal embeddings directly into a database company. That’s not a product move. That’s a market positioning move — owning retrieval at the storage layer.

And Qwen3-VL-Embedding entered the field in 2026, evaluated against JinaVDR and ViDoRe v3 (QwenLM GitHub repository). The leaderboard is no longer a two-horse race. It’s a portfolio.

Who Gets Left Behind

Single-vector-only multimodal pipelines are the clearest losers. Late-interaction is now the leaderboard pattern for visual document retrieval, and teams that haven’t planned for tensor-index support in their vector DB are about to discover their stack can’t run the models that win benchmarks (RAGFlow review).

ColPali v1.x users on the Gemma license are stuck. The commercial path now runs through Apache 2.0 ColQwen variants. Anyone shipping a product on the original license is one legal review away from a forced migration.

Teams running a stitched pipeline — separate text embedder, separate image embedder, custom interleaver — just got obsoleted by Cohere Embed v4’s mixed-modality payload. The integration glue they wrote in 2024 is now technical debt.

And anyone treating embedding choice as a model decision instead of a stack decision is about to learn that Document Parsing And Extraction, Metadata Filtering, and vector DB selection all cascade from the embedding architecture. Pick wrong on the model, and you re-platform retrieval.

What Happens Next

Base case (most likely): Two parallel stacks consolidate. Late-interaction becomes the default for high-precision visual retrieval; hosted single-vector wins the enterprise SaaS layer; Jina v4 holds the bridge. Vector DB roadmaps explicitly support tensor indexes by year-end. Signal to watch: The dedicated late-interaction workshop scheduled for early 2026 (RAGFlow review) — attendance and tooling announcements there set the open-source trajectory. Timeline: 12 months.

Bull case: Sub-1B-parameter encoders like ModernVBERT push visual retrieval entirely on-prem, breaking the per-token economy for image embedding APIs. Self-hosted becomes cheaper and faster than hosted at small scale. Signal: A 250M-class encoder matches a 4B-class model on ViDoRe v3 in production deployments. Timeline: 18-24 months.

Bear case: Leaderboards saturate, ViDoRe v4 resets the field, and another year of stack churn follows. Teams that bet early on a specific architecture eat the migration cost. Signal: ViDoRe v3 nDCG@5 ceilings collapse into rounding-error territory across the top five entrants. Timeline: 6-12 months.

Frequently Asked Questions

Q: Which multimodal RAG tools are leading in 2026? A: ColQwen3.5 and ModernVBERT lead the open-weights ViDoRe leaderboard for visual document retrieval. Jina Embeddings v4 covers both single- and multi-vector use cases on the same model. Cohere Embed v4 leads managed enterprise pipelines with mixed-modality payloads, 128K context, and hyperscaler distribution.

Q: Where is multimodal RAG heading in 2026 and beyond? A: Toward two parallel stacks. Late-interaction multi-vector retrieval becomes the leaderboard pattern for visual documents. Managed single-vector APIs absorb enterprise mixed-modality workloads. Sub-1B-parameter encoders push more inference on-prem, while Knowledge Graphs For RAG layers move retrieval beyond pure vector search.

The Bottom Line

Two stacks won. The benchmark winner and the procurement winner are no longer the same model, and both are right for different teams. Pick the stack your buyer trusts — your latency budget, license, and vector DB choice all follow from there.


AI-assisted content, human-reviewed. Images AI-generated.