Class Token

Also known as: [CLS] token, CLS token, classification token

Class Token
A single learnable embedding prepended to a transformer’s input sequence that aggregates information from all other tokens through self-attention, producing one summary vector used by the classification head.

What It Is

When a Vision Transformer splits an image into patches, it produces a sequence of patch embeddings — for a 224×224 image with 16×16 patches, that’s 196 of them. Each patch represents one piece of the picture. But to decide “this is a cat,” the model needs a single vector that summarizes the whole image, not 196 separate ones. The class token solves that problem. It’s a dummy slot — a learned placeholder with no image content of its own — prepended to the patch sequence. As the sequence flows through self-attention layers, the class token mixes with every patch and ends up holding a global summary. The final classifier reads from this single token instead of juggling hundreds.

Mechanically, the class token is one extra embedding vector added at position 0. According to Hugging Face ViT Docs, prepending it turns a sequence of N patch embeddings into N+1 tokens — for a standard ViT-Base image, that means 197 tokens in total. Every encoder layer applies self-attention across all of them, so the class token attends to each patch and each patch attends back. By the final layer, the class token’s hidden state encodes global image features. A single linear layer on top of that token produces class logits — one score per possible label.
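The token bookkeeping above can be sketched in a few lines of NumPy. This is a shape-only illustration, not a real model: the class token is a learned parameter during training (zeros here as a stand-in), there is no attention, and the 1000-class head is a hypothetical choice.

```python
import numpy as np

# ViT-Base numbers from the text: 224x224 image, 16x16 patches, 768-dim tokens.
num_patches = (224 // 16) ** 2         # 196 patch embeddings
hidden_dim = 768

patch_embeddings = np.random.randn(num_patches, hidden_dim)
cls_token = np.zeros((1, hidden_dim))  # learned parameter in a real ViT

# Prepend at position 0: N patch tokens become N + 1 tokens.
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)
print(tokens.shape)                    # (197, 768)

# After the encoder, the classifier reads only the class token's hidden state.
num_classes = 1000                     # hypothetical label count
head = np.random.randn(hidden_dim, num_classes)
logits = tokens[0] @ head              # one score per possible label
print(logits.shape)                    # (1000,)
```

The only structural change versus a plain patch encoder is that single `concatenate` at position 0; everything downstream treats the class token like any other token.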

According to Dosovitskiy et al., the design was directly borrowed from BERT, where a similar [CLS] token aggregates sentence-level meaning. The Vision Transformer paper kept the same convention for continuity across NLP and vision transformers. That’s why you see the same symbol and the same role turn up in multimodal models like CLIP and in most ViT-based image encoders shipped through Hugging Face’s transformers library. One practical consequence: anyone who learned [CLS] in the NLP world already understands its purpose in vision — the mechanism transfers, only the tokens it attends to change.

How It’s Used in Practice

The class token is mostly something product teams run into when working with a pretrained Vision Transformer. In Hugging Face’s transformers library, a ViT’s pooler_output is derived from the class token’s final hidden state, passed through a small pooler layer (a linear projection with a tanh activation). When someone writes image_features = model(pixel_values).pooler_output, that’s the [CLS] representation coming out — ready to feed a classifier, a retrieval index, or a downstream multimodal model.

In practice, you rarely touch the class token directly. You extract it, freeze the encoder, and train a small head on top for your own task — medical imaging, product categorization, content moderation. It also shows up in multimodal pipelines: some vision-language models take the class token as the image representation passed to the language model, while others feed the full patch sequence instead.
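The freeze-and-probe pattern can be sketched without any deep learning framework. Here the “frozen encoder” is stood in for by random 768-dim [CLS] features, and a small softmax head is trained with plain gradient descent; every number (batch size, class count, learning rate) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen [CLS] features, extracted once; the encoder is never updated.
features = rng.standard_normal((32, 768))   # batch of frozen embeddings
labels = rng.integers(0, 3, size=32)        # toy 3-class task

# The only trainable parameters: one linear head on top of the class token.
W = np.zeros((768, 3))
b = np.zeros(3)

for _ in range(100):                        # plain gradient descent
    logits = features @ W + b
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(32), labels] -= 1.0      # softmax cross-entropy gradient
    grad /= 32
    W -= 0.5 * (features.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

accuracy = ((features @ W + b).argmax(axis=1) == labels).mean()
print(accuracy)
```

Because the backbone is frozen, the features can be computed once and cached; training the head is then cheap enough to run on a laptop.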

Pro Tip: Not every modern Vision Transformer uses the class token for classification. According to Hugging Face ViT Docs, several newer variants such as DeiT III and SigLIP drop [CLS] and use global average pooling over patch tokens instead. Before you assume pooler_output is the image feature you want, check the model card — some recipes expose patch tokens and expect you to pool them yourself.
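The difference between the two recipes is a one-line change over the encoder output. A sketch, assuming a ViT-Base-shaped `last_hidden_state` with [CLS] at index 0 (random values stand in for real hidden states):

```python
import numpy as np

# Hypothetical final hidden states for one image: 1 [CLS] + 196 patch tokens.
last_hidden_state = np.random.randn(197, 768)

cls_feature = last_hidden_state[0]                 # the [CLS] recipe
gap_feature = last_hidden_state[1:].mean(axis=0)   # global average pooling
                                                   # over patch tokens only
print(cls_feature.shape, gap_feature.shape)        # (768,) (768,)
```

Both give a single 768-dim image vector; which one the backbone was actually trained to make useful is what the model card tells you.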

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Fine-tuning a plain ViT for image classification | ✅ | |
| Extracting a single image embedding for retrieval or similarity search | ✅ | |
| Using a DeiT III or SigLIP backbone that expects global average pooling | | ❌ |
| Dense prediction tasks like segmentation or object detection | | ❌ |
| Teaching how a transformer aggregates a sequence into one summary | ✅ | |
| Assuming [CLS] behaves the same across every vision model | | ❌ |

Common Misconception

Myth: The class token sees the image directly — like an extra pixel, a header patch, or a special region of the input. Reality: It starts as a learned vector with no image content. It only gains meaning by attending to the patch tokens across many encoder layers. Its job is aggregation through attention, not observation.

One Sentence to Remember

The class token is a dedicated summary slot that lets a transformer collapse a sequence of patches into one vector the classifier can actually use — which is why it’s also the first place to look when you want an image embedding out of a pretrained ViT.

FAQ

Q: Why does a Vision Transformer need a class token instead of just averaging the patch embeddings? A: Averaging treats every patch equally. The class token learns during training which patches matter for the task and weights them through attention. According to Dosovitskiy et al., it inherits this role from BERT.
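A toy contrast between uniform averaging and attention weighting (the vectors and weights are invented; real attention weights come from learned query-key dot products):

```python
import numpy as np

# Three "patch" value vectors; imagine the third patch is the one that matters.
patch_values = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [10.0, 10.0]])

uniform = patch_values.mean(axis=0)      # every patch counts equally

attn = np.array([0.05, 0.05, 0.90])      # learned emphasis on the third patch
weighted = attn @ patch_values           # what a [CLS] query can produce

print(uniform)   # [3.6667 3.6667] (rounded)
print(weighted)  # [9.05 9.05]
```

The weighted sum moves much closer to the informative patch; averaging dilutes it. That flexibility is exactly what the class token buys.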

Q: Is the class token required for a Vision Transformer to work? A: No. According to Hugging Face ViT Docs, newer models such as DeiT III and SigLIP use global average pooling over patch tokens instead. The [CLS] token is the original recipe, not a universal requirement.

Q: Does the class token exist in language models too? A: Yes. According to Dosovitskiy et al., the design was borrowed from BERT, where a [CLS] token aggregates sentence-level meaning. Same pattern, different modality — text tokens for NLP, patch tokens for vision.

Expert Takes

The class token is elegant because it formalizes aggregation. Instead of hand-designing a pooling rule, the network learns one — the [CLS] slot acquires meaning only through attention with the patches. Whether it outperforms simple averaging is an empirical question, and research has increasingly shown global average pooling works just as well for many tasks. The [CLS] token is a useful prior, not a theoretical necessity.

Treat the class token as an interface contract, not a magic ingredient. If your model card says “use pooler_output,” you’re pulling the [CLS] vector; if it says “pool patch tokens,” you’re not. Fine-tuning code does not care about the name — it cares about shape, training signal, and whether the backbone actually optimized that representation. Read the config before wiring a classifier on top.

The [CLS] token shaped how teams ship vision models — one vector out, one classifier on top, fast to deploy. That pattern unlocked the pretrained ViT economy: download a backbone, grab the pooler output, fine-tune for your industry. Newer recipes drop [CLS] for global pooling, and the tooling is catching up. Teams standardizing on a specific extraction pattern today need to plan for that shift, not assume [CLS] is forever.

What does a single vector owe the image it represents? The class token compresses every patch into one summary, and whatever the classifier reads is whatever survived that collapse. Bias studies on Vision Transformers often trace back to what [CLS] learned to emphasize — which faces, which textures, which contexts dominate the aggregation. The design choice is not neutral. Asking what the token did not attend to is often more revealing than asking what it did.