A
Attention Mechanism
A deep learning technique that lets models dynamically weigh which parts of an input matter most for each output, enabling context-aware predictions instead of treating all input tokens equally.
Autoregressive Generation
A sequential text generation method where a language model produces one token at a time, conditioning each new prediction on all previously generated tokens to build coherent output.
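The loop can be sketched in a few lines of Python. The `next_token_logprobs` function below is a made-up stand-in for a real language model, hard-wired to continue one fixed phrase, so the autoregressive structure itself is easy to see.

```python
# Minimal sketch of autoregressive (greedy) decoding over a toy "model".
# next_token_logprobs is a hypothetical stand-in for a real language model:
# it scores every vocabulary item given the tokens generated so far.
import math

VOCAB = ["the", "cat", "sat", "<eos>"]

def next_token_logprobs(prefix):
    # Toy model: always continue the fixed phrase "the cat sat", then stop.
    order = ["the", "cat", "sat", "<eos>"]
    target = order[min(len(prefix), 3)]
    return {tok: (0.0 if tok == target else -math.inf) for tok in VOCAB}

def generate(max_tokens=10):
    tokens = []
    for _ in range(max_tokens):
        scores = next_token_logprobs(tokens)  # condition on everything so far
        tokens.append(max(scores, key=scores.get))  # greedy: take the best token
        if tokens[-1] == "<eos>":
            break
    return tokens

print(generate())  # -> ['the', 'cat', 'sat', '<eos>']
```

Each iteration feeds the full prefix back into the model, which is exactly what makes generation sequential rather than parallel.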
B
BART
BART is a sequence-to-sequence model by Meta AI built on the encoder-decoder architecture, pre-trained by corrupting text and learning to reconstruct it, combining bidirectional encoding with autoregressive decoding to excel at summarization and text generation.
Beam Search
A heuristic decoding algorithm that maintains multiple candidate sequences (beams) during text generation, expanding and scoring them at each step to find a high-probability output sequence without exhaustively searching every possibility.
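A toy version of the search, with per-step log-probabilities hand-written in a table rather than produced by a model, shows the expand-score-prune cycle:

```python
# Toy beam search over a fixed per-step log-probability table. In a real
# decoder these scores would come from a language model; here they are
# hand-written so the search itself is easy to follow.
import math

STEP_LOGPROBS = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.5), "b": math.log(0.5)},
]

def beam_search(beam_width=2):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for step_scores in STEP_LOGPROBS:
        candidates = []
        for seq, score in beams:            # expand every beam with every token
            for tok, lp in step_scores.items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]     # prune: keep only the top-k beams
    return beams

best_seq, best_score = beam_search()[0]
print(best_seq, math.exp(best_score))  # highest-probability sequence found
```

With beam width 1 this degenerates to greedy decoding; wider beams explore more of the search space at higher cost.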
C
Causal Masking
Causal masking is an attention restriction in decoder-only transformer models that prevents each token from attending to future tokens, enforcing the left-to-right generation order that makes autoregressive language models produce text one token at a time.
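The mask itself is just a lower-triangular matrix, sketched here for a 4-token sequence:

```python
# A causal (lower-triangular) attention mask for an n-token sequence:
# row i marks which positions token i may attend to. 1 = allowed, 0 = masked.
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# Token 0 sees only itself; token 3 sees positions 0-3. In practice the
# masked entries are set to -inf before softmax so future tokens get
# exactly zero attention weight.
```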
ColPali
A vision-language retrieval model that searches documents by processing page images directly through a vision encoder, generating multi-vector patch embeddings and using late interaction scoring to rank pages without OCR or text extraction.
Context Vector
The single fixed-length vector an encoder network produces after processing an entire input sequence, compressing all source information into one representation that the decoder uses to generate output. Its limited capacity motivated the invention of attention mechanisms.
Context Window
The maximum number of tokens a language model can process in a single interaction, covering both the input prompt and the generated output combined.
Contrastive Learning
A self-supervised machine learning technique that trains models to produce meaningful embeddings by maximizing similarity between related (positive) pairs while minimizing similarity between unrelated (negative) pairs, forming the core training objective behind Sentence Transformers and modern sentence-level embedding models.
Cosine Similarity
A mathematical metric that computes the cosine of the angle between two vectors, producing a score from −1 (opposite) to +1 (identical direction), widely used to measure semantic closeness between embeddings.
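The formula is short enough to write directly; a minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

Because the score depends only on the angle, vectors of different magnitudes can still be judged semantically close.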
Cross Attention
An attention mechanism where queries originate from one sequence and keys and values come from a different sequence, enabling a model to focus on relevant information across two distinct inputs like encoder and decoder representations.
D
Decoder Only Architecture
A neural network design based on the transformer decoder block that generates text autoregressively, predicting one token at a time by attending only to previous tokens in the sequence without a separate encoder component.
Dense Retrieval
A neural search method that encodes queries and documents into vector embeddings, then finds relevant results by measuring semantic similarity rather than matching exact keywords.
Dimensionality Reduction
A set of techniques that compress high-dimensional data into fewer dimensions while preserving meaningful patterns, making storage cheaper, computation faster, and visualization possible.
DiskANN
Microsoft's open-source library for approximate nearest neighbor search on billion-scale datasets using a single machine with SSD storage, combining a Vamana graph index with product quantization to keep costs low while maintaining high recall.
Dot Product
A mathematical operation that multiplies corresponding components of two vectors and sums the results into a single number, measuring how similar two vectors are in both direction and magnitude.
E
Embedding
A mathematical representation that converts discrete data like words or tokens into dense numerical vectors in a continuous space, where similar items are positioned closer together. Embeddings serve as the input layer for transformer models and most modern neural networks.
Encoder Decoder
A neural network design where an encoder compresses input into a fixed representation and a decoder generates output from that representation, forming the original transformer blueprint for tasks like translation and summarization.
Encoder Decoder Architecture
A two-part neural network design that processes sequences by first encoding input into a compressed internal representation, then decoding that representation into the desired output sequence, powering tasks like translation and summarization.
Euclidean Distance
The straight-line distance between two points in multi-dimensional space, calculated as the square root of the sum of squared differences between coordinates. In vector search, it quantifies how far apart two embeddings are, with zero meaning identical.
F
Faiss
An open-source C++ and Python library by Meta for efficient similarity search and clustering of dense vectors. Faiss implements index types including IVF, HNSW, and product quantization, enabling nearest-neighbor search across billion-scale datasets with CPU and GPU support.
Feedforward Network
A neural network where data moves in one direction from input to output with no loops or cycles, used as a core processing sub-layer inside each transformer block to transform learned representations.
Fine Tuning
Fine-tuning adapts a pre-trained machine learning model to a specific task or domain by continuing training on a smaller, targeted dataset, adjusting the model's weights so it performs better on that particular use case.
Flash Attention
An algorithm that computes exact attention scores without storing the full attention matrix in GPU memory, reducing memory use from quadratic to linear while maintaining mathematical equivalence to standard attention.
G
Glitch Tokens
Anomalous tokens in a language model's vocabulary that produce erratic outputs — gibberish, hallucinations, or refusals — because the tokenizer included them during vocabulary construction but the model's training data contained too few examples for the model to learn stable representations.
Grouped Query Attention
An attention mechanism variant that groups multiple query heads to share key-value heads, balancing the output quality of multi-head attention with the inference speed of multi-query attention. Adopted by most frontier language models.
L
Linear Attention
An alternative to standard softmax attention that replaces the quadratic computation with linear-complexity approximations, enabling transformer models to process longer sequences more efficiently while trading some accuracy for speed.
Locality Sensitive Hashing
A family of randomized algorithms that map similar data points to the same hash buckets with high probability, enabling approximate nearest neighbor search in high-dimensional spaces without scanning every item — a key index structure in vector similarity search pipelines.
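One common family, random-hyperplane LSH for cosine similarity, can be sketched in a few lines. The hyperplane normals below are hard-coded for reproducibility; real implementations draw them at random.

```python
# Random-hyperplane LSH sketch: each hash bit records which side of a
# hyperplane a vector falls on. Vectors with a small angle between them
# tend to agree on most bits and land in the same bucket.
PLANES = [[1.0, 0.2], [-0.3, 1.0], [0.7, -0.6]]  # illustrative, normally random

def lsh_signature(vec):
    # One bit per hyperplane: the sign of the dot product with its normal.
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in PLANES)

buckets = {}
for name, v in {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [-1.0, -0.1]}.items():
    buckets.setdefault(lsh_signature(v), []).append(name)

print(buckets)  # "a" and "b" point the same way and share a bucket; "c" does not
```

At query time only the vectors in the query's bucket(s) are compared exactly, which is what avoids the full scan.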
M
Matryoshka Embedding
An embedding training method where the first d dimensions of a full vector form a valid lower-dimensional representation. Named after Russian nesting dolls, it lets a single model produce embeddings at multiple sizes, trading a little accuracy for lower storage and faster search.
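Using a Matryoshka embedding at a smaller size is just truncation plus re-normalization. The vector below is made up for illustration:

```python
# Matryoshka truncation sketch: keep the first d dimensions of a full
# embedding, then re-normalize to unit length so cosine / dot-product
# scoring still behaves. The "full" vector here is invented.
import math

def truncate_embedding(vec, d):
    prefix = vec[:d]                        # first d dimensions form the small embedding
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]       # re-normalize the shortened vector

full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.05]    # pretend 6-dim model output
small = truncate_embedding(full, 2)
print(small)  # a valid 2-dim embedding from the same model output
```

This only works because the training objective packs the most important information into the earliest dimensions; truncating an ordinary embedding this way degrades quality much faster.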
Mean Pooling
Mean pooling produces a single fixed-size vector from a transformer model's token-level outputs by averaging all token hidden states, creating sentence embeddings used for semantic similarity comparisons in search and retrieval systems.
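The usual recipe also uses the attention mask so that padding tokens do not dilute the average. A sketch over made-up token vectors:

```python
# Mean pooling over hypothetical token hidden states, skipping padding
# positions via the attention mask (the standard recipe in
# Sentence Transformers-style pipelines).
def mean_pool(hidden_states, attention_mask):
    dim = len(hidden_states[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:                            # only average real tokens
            count += 1
            for i, x in enumerate(vec):
                sums[i] += x
    return [s / count for s in sums]

tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
sentence_vec = mean_pool(tokens, [1, 1, 0])
print(sentence_vec)  # one fixed-size vector for the whole sentence
```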
Mixture Of Experts
A neural network architecture that splits a model into multiple specialized sub-networks (experts) and uses a gating function to route each input token to only a few of them, reducing computation per token while preserving the knowledge capacity of a larger model.
Multi Head Attention
A mechanism inside transformers that splits attention into multiple parallel heads, each learning different relationships in the input, then combines their outputs for richer representations.
Multi Vector Retrieval
An information retrieval approach where documents and queries are represented as sets of token-level vectors instead of single embeddings, enabling fine-grained similarity matching through late interaction scoring.
P
Positional Encoding
A technique that injects word-order information into transformer models, which process all tokens simultaneously and would otherwise treat every word as if its position in a sentence did not matter.
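The sinusoidal scheme from the original Transformer paper is one concrete instance, pairing sine and cosine waves of different frequencies:

```python
# Sinusoidal positional encoding from the original Transformer paper:
#   PE(pos, 2i)   = sin(pos / 10000^(2i/d))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
import math

def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # position 0 -> [0.0, 1.0, 0.0, 1.0]
```

The resulting vector is added to each token embedding, so two copies of the same word at different positions enter the model with different representations.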
Product Quantization
A vector compression method that divides high-dimensional vectors into smaller subvectors, quantizes each independently using learned codebooks, and stores compact codes that enable fast approximate nearest neighbor search with reduced memory.
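A toy version on 4-dimensional vectors split into two subvectors makes the encode/decode round trip concrete. The codebooks here are hand-picked; real systems learn them with k-means:

```python
# Product quantization sketch: 4-dim vectors split into two 2-dim
# subvectors, each replaced by the id of its nearest centroid.
# Codebooks are hand-picked for illustration (normally learned via k-means).
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # centroids for dimensions 2-3
]

def pq_encode(vec):
    codes = []
    for m, book in enumerate(CODEBOOKS):
        sub = vec[2 * m : 2 * m + 2]        # slice out subvector m
        codes.append(min(range(len(book)), key=lambda k: sq_dist(sub, book[k])))
    return codes

def pq_decode(codes):
    out = []
    for m, code in enumerate(codes):
        out.extend(CODEBOOKS[m][code])      # look the centroid back up
    return out

codes = pq_encode([0.9, 1.1, 0.1, 0.8])     # 4 floats -> 2 small integer codes
print(codes, pq_decode(codes))              # compact codes and their reconstruction
```

Storing one small integer per subvector instead of many floats is where the memory savings come from; distances are then computed against the codebook entries.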
S
Scaled Dot Product Attention
The core computation inside transformer models that calculates relevance scores between queries and keys using dot products, scales them to prevent gradient saturation, and produces weighted combinations of values.
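Written out, the computation is softmax(QKᵀ/√d_k)V. A minimal pure-Python sketch on tiny matrices:

```python
# Scaled dot-product attention on tiny hand-made matrices:
#   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]    # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]               # scaled query-key dot products
        weights = softmax(scores)           # relevance distribution over keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)  # a weighted mix of V's rows, favoring the first key
```

Dividing by √d_k keeps the dot products from growing with dimension, which would otherwise push softmax into a regime with near-zero gradients.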
Scaling Laws
Empirical power-law relationships showing how a language model's performance predictably improves as you increase model size, training data, or compute budget, enabling teams to forecast results before committing resources.
ScaNN
An open-source library from Google Research that performs fast approximate nearest neighbor search using anisotropic vector quantization, designed for finding similar items in large collections of high-dimensional vectors.
Semantic Search
A retrieval method that converts queries and documents into dense vector representations and ranks results by similarity metrics like cosine similarity or dot product, finding matches based on meaning rather than keyword overlap.
Sentence Transformers
A Python framework that generates sentence-level embeddings by passing text through transformer models and applying pooling strategies, enabling semantic search, clustering, and similarity comparison tasks that require understanding meaning rather than matching exact keywords.
Siamese Network
A neural network architecture where two identical sub-networks share the same weights, process separate inputs simultaneously, and produce comparable output vectors, enabling the system to measure how similar or different two inputs are.
Similarity Search Algorithms
Methods that find the closest matching vectors in high-dimensional spaces by measuring distance or angle between numerical representations of data. Used in AI systems for semantic search, recommendation engines, and retrieval-augmented generation to match queries to relevant results.
Softmax
A mathematical function that converts raw numerical scores into a probability distribution where all values sum to one, used in attention mechanisms and classification outputs across AI systems.
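A numerically stable implementation subtracts the maximum score before exponentiating, which changes nothing mathematically but avoids overflow:

```python
import math

def softmax(scores):
    m = max(scores)                         # shift by the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # values in (0, 1) that sum to 1, order preserved
```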
State Space Model
A sequence modeling architecture that uses linear recurrence with selective gating to process data in linear time, offering an alternative to transformer attention for tasks involving long sequences.
State Space Models
A class of sequence modeling architectures that process input tokens with linear-time complexity, offering a faster alternative to transformers for handling long sequences by maintaining a compressed hidden state instead of attending to every previous token.
Subword Tokenization
A text preprocessing technique that splits words into smaller units (subwords) based on statistical frequency patterns, enabling language models to represent any word — including rare or unseen terms — using a fixed-size vocabulary of common fragments.
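A greedy longest-match tokenizer over a tiny invented vocabulary shows the idea (this simplifies real algorithms like WordPiece, which also mark word-internal pieces with a `##` prefix):

```python
# Greedy longest-match subword tokenization over a made-up vocabulary:
# a word the model has never seen is still covered by common fragments.
VOCAB = {"un", "break", "able", "b", "r", "e", "a", "k"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            return ["<unk>"]                # no fragment matches at all
    return tokens

print(tokenize("unbreakable"))  # -> ['un', 'break', 'able']
```

Because single characters can be added to the vocabulary as a fallback, real subword tokenizers can represent essentially any input string.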
T
T5
T5 is Google's encoder-decoder transformer model that converts every NLP task into a text-to-text format, treating both inputs and outputs as text strings regardless of whether the task involves translation, summarization, classification, or question answering.
Teacher Forcing
A training technique for sequence models where the correct output at each time step feeds into the decoder's next step instead of the model's own prediction, enabling faster convergence but introducing exposure bias at inference time.
Tiktoken
Tiktoken is OpenAI's open-source tokenizer library that converts text into subword tokens using Byte Pair Encoding, enabling language models to process input text as numerical sequences for prediction and generation.
Tokenization
Tokenization splits raw text into smaller units called tokens — subwords, characters, or bytes — that language models can process as numerical input for tasks like text generation and understanding.
Tokenizer Architecture
The multi-stage system that converts raw text into numerical token IDs for large language models, consisting of normalization, pre-tokenization, a subword algorithm (BPE, WordPiece, or Unigram), and post-processing steps.
Transformer Architecture
A neural network design that uses self-attention to process entire input sequences in parallel, replacing older sequential approaches and powering most modern large language models and AI systems.
V
Vector Database
A specialized database designed to store, index, and query high-dimensional vector embeddings using approximate nearest neighbor algorithms, enabling fast similarity search for applications like semantic search, RAG pipelines, and recommendation engines.
Vector Indexing
A method of organizing high-dimensional vectors into specialized data structures so approximate nearest-neighbor searches return results in sub-linear time instead of scanning every record.
W
Word2vec
A neural network technique introduced in 2013 that maps words to dense numerical vectors by training on text corpora, capturing semantic relationships through vector arithmetic and placing semantically related words near each other in a continuous vector space.
WordPiece
A subword tokenization algorithm developed by Google that breaks words into smaller pieces by selecting merges based on statistical likelihood rather than raw frequency, enabling models like BERT to handle unknown words and multiple languages with a fixed-size vocabulary.
63 terms defined