Bart

Also known as: BART, Bidirectional and Auto-Regressive Transformer, facebook/bart

BART is a sequence-to-sequence model by Meta AI built on the encoder-decoder architecture, pre-trained by corrupting text and learning to reconstruct it, combining bidirectional encoding with autoregressive decoding to excel at summarization and text generation.

BART is an encoder-decoder model that learns to reconstruct corrupted text, making it especially effective at summarization, translation, and other sequence-to-sequence tasks that require generating new text from existing input.

What It Is

If you’ve ever proofread a garbled email and figured out what the sender actually meant, you already grasp the core idea behind BART. It’s a language model trained to fix intentionally damaged text — sentences with words removed, shuffled, or masked — and through that process, it learns deep patterns of language structure and meaning. That learned understanding then powers tasks like summarization, translation, and text generation.

BART stands for Bidirectional and Auto-Regressive Transformer. Published by Lewis et al. in 2019, it’s Meta AI’s implementation of the encoder-decoder architecture — the same two-part design that powers sequence-to-sequence models for translation, summarization, and question answering. In the context of encoder-decoder systems, BART is one of the most referenced examples of how this architecture can be pre-trained effectively.

The design combines two ideas that had been successful separately. The encoder reads input text in both directions at once (similar to BERT), capturing the full context of every word. The decoder then generates output one token at a time, left to right (similar to GPT). Think of the encoder as someone reading and fully understanding a document, and the decoder as someone writing a new summary based on that understanding.
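To make that split concrete, here's a minimal sketch, assuming the Hugging Face transformers library and the public facebook/bart-base checkpoint, that loads BART and inspects its two halves:

```python
from transformers import BartForConditionalGeneration

# Load the public base checkpoint (assumes transformers and torch are installed).
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# The seq2seq stack has two halves: a bidirectional encoder and an
# autoregressive decoder, each a stack of Transformer layers.
encoder = model.get_encoder()
decoder = model.get_decoder()
print(type(encoder).__name__, "with", len(encoder.layers), "layers")
print(type(decoder).__name__, "with", len(decoder.layers), "layers")
```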

What sets BART apart is its pre-training method: a denoising objective. During training, the model receives corrupted versions of text — with word spans masked out or sentences shuffled into random order — and must reconstruct the original, undamaged version. According to Lewis et al., this corruption-then-reconstruction approach gives BART strong generation capabilities, matching models like RoBERTa on understanding benchmarks while delivering significantly better results on summarization tasks.
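A quick way to see the denoising objective at work is mask filling: give BART a sentence with a masked span and let it reconstruct the missing text. A minimal sketch, again assuming the transformers library and the facebook/bart-base checkpoint (the example sentence is made up):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# A deliberately "corrupted" sentence: a span has been replaced with <mask>,
# mimicking the text-infilling noise used during pre-training.
corrupted = "The meeting has been <mask> until next Thursday."
inputs = tokenizer(corrupted, return_tensors="pt")

# The decoder reconstructs a plausible original sentence, left to right.
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```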

According to Meta AI, BART ships in two sizes: a base variant with roughly 140 million parameters and a large variant with roughly 406 million parameters. The multilingual extension, mBART, was pre-trained on text from 25 languages, applying the same denoising strategy across languages to support cross-lingual generation and translation; a later checkpoint, mBART-50, extends coverage to 50 languages.
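If you want to check those sizes yourself, a short sketch (assuming the transformers library and the public facebook/bart-base and facebook/bart-large checkpoints) counts the parameters directly:

```python
from transformers import BartModel

# Count parameters for the two released sizes; exact totals can differ slightly
# from the rounded figures above depending on which task head is attached.
for checkpoint in ("facebook/bart-base", "facebook/bart-large"):
    model = BartModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.0f}M parameters")
```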

How It’s Used in Practice

Most people encounter BART through summarization features in content and productivity tools. When a product offers “summarize this article” or “condense this report,” a BART-based model is often doing the work behind the scenes. The model’s encoder reads the full source text and builds an internal representation, then its decoder generates a condensed version — rephrasing and restructuring rather than just extracting key sentences.

BART is also widely used in machine translation, where the encoder processes source-language text and the decoder produces output in the target language. Fine-tuned BART models are available on Hugging Face for dozens of language pairs and specialized tasks like headline generation, dialogue summarization, and document rewriting.
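As an illustration of that translation workflow, here's a minimal sketch using one publicly released multilingual checkpoint, facebook/mbart-large-50-many-to-many-mmt; the sentence and the English-to-French language pair are arbitrary examples:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint)

# Tell the tokenizer the source language, then force the decoder to start
# generating in the target language (French here) via its language code.
tokenizer.src_lang = "en_XX"
inputs = tokenizer("BART learns language by fixing broken text.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```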

Pro Tip: If you’re evaluating summarization models for your team, look for BART-based fine-tunes (like facebook/bart-large-cnn) as strong starting points. They produce abstractive summaries — meaning they rephrase content in new words rather than copy-pasting sentences from the source.
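As a concrete starting point, a summarization call might look like this; a minimal sketch assuming the Hugging Face transformers library, with a placeholder standing in for the article text:

```python
from transformers import pipeline

# facebook/bart-large-cnn is the CNN/DailyMail fine-tune mentioned above.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Paste the full report or article text here. The encoder reads the whole "
    "passage, and the decoder writes a condensed, rephrased version of it."
)
summary = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```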

When to Use / When Not

Scenario: Use or Avoid
Abstractive text summarization: Use
Machine translation between language pairs: Use
Rewriting or paraphrasing documents: Use
Simple keyword extraction or entity tagging: Avoid
Real-time chat requiring sub-second responses: Avoid
Classification tasks with no text generation needed: Avoid

Common Misconception

Myth: BART and BERT are the same model — one is just a typo of the other. Reality: BERT is an encoder-only model built for understanding tasks like classification and question answering. BART adds a full autoregressive decoder on top, making it a complete encoder-decoder system capable of generating new text. If BERT is a reader, BART is both a reader and a writer.

One Sentence to Remember

BART learns language by fixing broken text, and that training makes it one of the most reliable encoder-decoder models for any task where you need to read input and write structured output — from summarizing reports to translating between languages.

FAQ

Q: What does BART stand for? A: Bidirectional and Auto-Regressive Transformer. The name reflects its two-part design: a bidirectional encoder that reads context from both directions and an autoregressive decoder that generates output one word at a time.

Q: How is BART different from GPT? A: GPT is decoder-only — it generates text without a separate encoding stage. BART uses both an encoder and a decoder, giving it stronger performance on tasks that require processing input text before producing structured output.

Q: Can BART handle multiple languages? A: Yes. mBART, the multilingual variant, was pre-trained on text from 25 languages, and the later mBART-50 checkpoint covers 50. It supports cross-lingual transfer and translation without needing a separate model for each language pair.

Expert Takes

BART’s training strategy is the differentiator. By corrupting input with span masking and sentence permutation, then requiring the decoder to reconstruct the original, the model learns contextual representations and generative fluency in a single pass. This denoising objective is what makes encoder-decoder pre-training outperform encoder-only alternatives on generation tasks while remaining competitive on comprehension benchmarks.

In a summarization or translation pipeline, BART gives you a clean encoder-decoder contract: structured input in, structured output out. The encoder handles understanding, the decoder handles generation. That separation means you can fine-tune each component for your domain without one side degrading the other. It’s a reliable base architecture when your task has clear input-output boundaries.

BART proved that encoder-decoder architecture wasn’t dead — it was underused. While much of the industry scaled decoder-only models, BART showed that pairing bidirectional encoding with autoregressive decoding still wins on tasks requiring faithful input processing. Teams building content automation, translation services, or document workflows keep choosing BART-based models because they produce consistent, attributable results.

The denoising approach raises a quiet reliability concern. BART learns to reconstruct what it believes text should say, which works well for summarization but also means the model can confidently rewrite meaning while producing fluent output. Anyone deploying BART in sensitive domains — legal, medical, compliance — needs to verify the model isn’t subtly altering the source material’s intent.