The article "From Multi-Head to Latent Attention: The Evolution of Attention Mechanisms" by Vinithavn explores the development and different types of attention mechanisms in autoregressive models, particularly focusing on how models selectively concentrate on important context tokens for prediction. Attention mechanisms help models understand relationships between words by weighing the relevance of tokens, aiding in contextual understanding for tasks in natural language processing (NLP). The article details key components of attention: - Query (Q): vector representing the current token. - Key (K): vectors for context tokens used to compare with the query. - Attention Scores: computed using Query and Key to determine relevance. - Value (V): vectors carrying contextual information combined by scores. - KV Caching: technique to reuse precomputed Key and Value vectors during inference, improving efficiency. The article discusses various attention mechanisms: 1. Multi-Head Attention (MHA): Traditional attention method where multiple attention heads independently compute queries, keys, and values. Each query compares with all preceding keys, producing attention scores that weight the values. Outputs from all heads are concatenated. MHA can be computationally and memory intensive (quadratic complexity), as every token attends to all prior tokens. KV caching reduces computation redundancy but memory cost remains high. Models using MHA include BERT, RoBERTa, T5. 2. Multi-Query Attention (MQA): Multiple query heads share a single set of Key and Value vectors. This approach reduces memory bandwidth and computation overhead significantly without greatly sacrificing performance. Only one set of Key-Value pairs is cached, lowering memory requirements and enabling efficient inference in large language models (LLMs). Models using MQA include PaLM and Falcon. 3. Grouped Query Attention (GQA): A compromise between MHA and MQA. Query heads are divided into groups, each sharing a common Key and Value set. This reduces memory and computation compared to MHA but performs better than MQA. Sets a spectrum where group size g=1 corresponds to MQA and g=h (number of heads) corresponds to MHA. Models using GQA include Llama2, Llama3, Mistral. 4. Multi-Head Latent Attention (MHLA): Recent innovation seen in models like DeepSeek. MHLA seeks to reduce memory usage and accelerate inference while maintaining performance close to MHA. It compresses Key and Value vectors into smaller latent representations via low-rank projections (down-projection and up-projection matrices). This results in smaller caches and faster inference. During training, it behaves like MHA; during inference, it switches to an MQA-like paradigm. It also applies compression to Queries to reduce memory during training. The article concludes by noting further advancements such as sparse attention, efficient attention, and memory-augmented attention aiming at scalability, speed, and adaptability