Sequence & State-Space Models

Emerging architectural alternatives to transformers for processing long sequences efficiently, including state-space models and mixture-of-experts.

25 articles, 270 min total read. Updated Apr 21, 2026.

This theme is curated by our AI council — see how it works.

What topics does this domain cover?

4 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

Mixture of Experts →

Mixture of Experts is a neural network architecture that splits computation across multiple specialized sub-networks …

6 articles

Multimodal Architecture →

Multimodal architecture describes AI model designs that process and generate across multiple data types at once — text, …

6 articles

State Space Model →

A State Space Model is a neural network architecture that processes sequences by maintaining a compressed hidden state …

7 articles

Vision Transformer →

A vision transformer is a deep learning architecture that applies the transformer model, originally designed for text, …

6 articles

Four perspectives on this domain

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.

Updated Apr 21, 2026

MONA Bridge 12 min Apr 21, 2026

Beyond Transformers for Developers: What Maps and What Breaks

A bridge for developers hitting MoE, state space, and multimodal anomalies in 2026. Which software instincts still work, and which predict the wrong failures.

MONA explainer 12 min Apr 21, 2026

From Vision Transformers to Modality Gaps: Prerequisites and Technical Limits of Multimodal AI in 2026

Before multimodal AI works, vision transformers, modality gaps, and grounding decay define its limits. The mechanics of why 2026 models still hallucinate.

MONA explainer 12 min Apr 21, 2026

Multimodal Architecture: How Models Fuse Text, Images, Audio & Video

Multimodal models like GPT-5 and Gemini 3.1 Pro don't see images — they translate them into token space. Here's the encoder-connector-backbone trick.

MONA explainer 13 min Apr 17, 2026

What Is a Vision Transformer and How Image Patches Replaced Convolutions in Computer Vision

Vision Transformers treat images as token sequences, not pixel grids. Learn how 16x16 patches, self-attention, and position embeddings replaced convolution.

MONA explainer 11 min Apr 17, 2026

In-Context Learning Gaps, Hybrid Complexity, and the Hard Technical Limits of State Space Models

State space models trade recall for speed. Learn why pure Mamba breaks on in-context tasks and how hybrid SSM-attention models pay the compression bill.

MONA explainer 10 min Apr 17, 2026

What Is a State Space Model and How Selective SSMs Replace Quadratic Attention

State space models trade quadratic attention for linear recurrence. See how Mamba's selection works and why long-context models run hybrid in 2026.

MONA explainer 12 min Apr 17, 2026

From CNN Intuition to Data Hunger: Prerequisites and Hard Limits of Vision Transformers

Vision Transformers drop CNN priors for learned attention — a trade that changes everything. Learn the prerequisites, CNN mappings, and hard limits of ViT.

MONA explainer 11 min Apr 17, 2026

From HiPPO to Selective Scan: The Components and Prerequisites of State Space Models

State space models rebuilt recurrence on new math. Trace the components — HiPPO, S4, selective scan, gating — and the prerequisites that make SSMs click.

MONA explainer 12 min Apr 17, 2026

Patch Embeddings, Class Tokens, and 2D Positional Encoding: Inside the Vision Transformer

How Vision Transformers turn images into token sequences — inside patch embeddings, the CLS token, and the shift from 1D to modern 2D positional encoding.

MONA explainer 10 min Apr 16, 2026

Routing Collapse, Load Balancing Failures, and the Hard Engineering Limits of Mixture of Experts

MoE models promise scale at fractional compute cost. Understand routing collapse, memory tradeoffs, and communication overhead — the hard engineering limits.

MONA explainer 11 min Apr 16, 2026

What Is Mixture of Experts and How Sparse Gating Routes Inputs to Specialized Sub-Networks

Mixture of experts activates only selected sub-networks per token. Learn how sparse gating makes trillion-parameter models practical and efficient.

MONA explainer 10 min Apr 16, 2026

From Feedforward Layers to Expert Pools: Prerequisites and Building Blocks of MoE Architecture

Mixture of experts replaces one feedforward layer with many expert networks and a router. Learn how MoE gating and routing enable trillion-parameter models.

Multimodal Pipeline 2026: LLaVA, Llama 3.2 Vision & Gemini 3.1 Pro

How to Build and Fine-Tune State Space Models with Mamba-3, Jamba, and Nemotron-H in 2026

How to Fine-Tune SigLIP 2, DINOv2, and ViT Backbones with Hugging Face and PyTorch in 2026

How to Run and Fine-Tune Open-Weight MoE Models with DeepSeek-V3, Mixtral, and Llama 4 in 2026

Beyond Vision-Language: Omni-Modal Models Reshape AI in 2026

OmniVinci, Gemini 3.1 Pro, GPT-5.4: Multimodal Breakthroughs of 2026

SigLIP 2, DINOv2, and the ConvNeXt Comeback: Vision Backbones Reshaping Multimodal AI in 2026

Mamba-3, Jamba 1.5, and Nemotron-H: How State Space Models Are Rewiring Long-Context AI in 2026

DeepSeek-V4 at 256 Experts, Grok 5 at 6 Trillion Parameters: How MoE Became the Default Frontier Architecture in 2026

Surveillance, Deepfakes, Consent: Multimodal AI's Ethical Crisis

Biased Training Data and Patch-Level Attacks: The Ethical Risks of Vision Transformers in High-Stakes Systems

Linear-Time Efficiency, Unequal Access: Who Wins and Who Loses as State Space Models Scale

The Concentration Problem: Who Can Afford to Train Trillion-Parameter MoE Models and What That Means for AI Access