# Beyond Orthogonality: How Language Models Pack Billions of Concepts into 12,000 Dimensions

By Nicholas Yoder, 20 Feb 2025

---

## Introduction

A key question posed in a 3Blue1Brown video by Grant Sanderson asks: how can GPT-3's embedding space of 12,288 dimensions accommodate millions of distinct real-world concepts? This exploration brings together ideas from high-dimensional geometry and the Johnson-Lindenstrauss (JL) lemma, revealing how language models efficiently encode semantic meaning in a moderate-sized vector space by relaxing orthogonality constraints.

---

## Key Insights

- **Orthogonality vs. quasi-orthogonality:** A space of dimension N can fit only N perfectly orthogonal vectors. Allowing vectors to be "quasi-orthogonal" (pairwise angles between roughly 85° and 95°) vastly increases capacity.
- **Optimization challenges:** An initial loss function designed to enforce near-orthogonality ran into two issues:
  - **Gradient trap:** Vector pairs far from 90° get stuck because the gradient vanishes as the angle approaches 0° or 180°.
  - **The 99% solution:** The optimizer minimizes the loss globally by making most vector pairs orthogonal while leaving a few nearly parallel, clustering the vectors in a way very different from the desired even spread.
- **Improved loss function:** Switching to an exponential penalty on pairwise similarity helped avoid these local minima and produced a worst-case pairwise angle of roughly 76.5°, short of the anticipated ~89°. (A sketch of this kind of penalty appears in the code at the end of this article.)

---

## Johnson-Lindenstrauss Lemma: Geometric Guarantees

The lemma guarantees that N points in a high-dimensional space can be projected into a k-dimensional space with bounded distortion ε:

\[
(1 - \varepsilon) \|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \varepsilon) \|u - v\|^2
\]

The required dimension is

\[
k \geq \frac{C}{\varepsilon^2} \log(N)
\]

where C is a constant related to the probability of success. Proven in 1984, originally for extending Lipschitz functions, the lemma is now fundamental in computer science, geometry, and machine learning. (A small numerical check of the distance bound appears at the end of this article.)

---

## Two Practical Domains of JL Lemma Application

- **Dimensionality reduction:** e.g., compressing high-dimensional customer preference vectors in e-commerce to manageable sizes with minor information loss.
- **Embedding space capacity:** understanding how many distinct concepts with nuanced semantic relations can fit into a fixed-dimensional space, which is directly relevant to language model embeddings.

---

## Embeddings and Real-World Semantic Relations

Concepts in language are not strictly orthogonal; they overlap semantically. Examples of partial relationships:

- "Archery" relates to "precision" and "sport"
- "Fire" connects with "heat" and "passion"
- "Green" links color perception with environmental consciousness

High-dimensional spaces can encode such nuanced relationships effectively.

---

## Empirical GPU-Accelerated Experiments

Optimization runs packed up to 30,000 vectors into spaces of up to 10,000 dimensions over 50,000 iterations.

Findings on the constant C of the JL lemma:

- C starts near 0.9 (always below 1.0), peaks, then decreases.
- At high N/k ratios, C can drop below 0.2, indicating efficient sphere packing.

These results suggest that the standard JL bounds are conservative: with optimization, high-dimensional spaces can pack vectors far more efficiently.

---

## Implications for Language Models

Consider three values of C:

- C = 4: conservative random projection
- C = 1: upper bound for optimized embeddings
- C = 0.2: experimental lower bound in large spaces

An approximate formula gives the number of quasi-orthogonal vectors in terms of the embedding dimensionality k and the degrees of freedom from orthogonality F (the allowed deviation from 90°, in degrees):

\[
\text{Vectors} \approx 10^{\frac{k \cdot F^2}{1500}}
\]

Applying this to GPT-3 (k = 12,288):

- At 89° (F = 1): ~10⁸ vectors
- At 88° (F = 2): ~10³² vectors
- At 87° (F = 3): ~10⁷³ vectors
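
These counts follow directly from the formula above. A minimal Python check (illustrative code, not taken from the original experiments) reproduces them:

```python
import math

# Rule-of-thumb capacity estimate from the article:
#   Vectors ≈ 10^(k * F^2 / 1500)
# where k is the embedding dimension and F is the allowed deviation
# from 90° in degrees (F = 1 means all pairwise angles stay within 89°–91°).

def capacity_exponent(k: int, f_degrees: float) -> float:
    """Return the base-10 exponent of the approximate vector count."""
    return k * f_degrees ** 2 / 1500

k = 12_288  # GPT-3 embedding dimension
for f in (1, 2, 3):
    exp10 = capacity_exponent(k, f)
    print(f"F = {f} (angles within {90 - f}°–{90 + f}°): ~10^{math.floor(exp10)} vectors")
```

Rounding the exponents down gives the 10⁸, 10³², and 10⁷³ figures quoted above.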
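
The exact loss functions from the optimization experiments in "Key Insights" are not reproduced in this text, so the following PyTorch sketch only illustrates the general idea of an exponential penalty on pairwise similarity; the `sharpness` hyperparameter, problem sizes, and training schedule are illustrative assumptions, not the author's settings.

```python
import torch
import torch.nn.functional as F

def exp_orthogonality_loss(vectors: torch.Tensor, sharpness: float = 20.0) -> torch.Tensor:
    """Exponential penalty on pairwise cosine similarity (illustrative).

    exp() grows steeply as |cos| approaches 1, so a single near-parallel
    pair dominates the loss; this counters the "99% solution" failure mode
    where most pairs are orthogonal but a few collapse together.
    """
    v = F.normalize(vectors, dim=1)                  # project onto the unit sphere
    cos = v @ v.T                                    # pairwise cosine similarities
    n = cos.shape[0]
    off_diag = cos[~torch.eye(n, dtype=torch.bool)]  # drop the cos = 1 self-pairs
    return torch.exp(sharpness * off_diag.abs()).mean()

# Minimal optimization loop: try to pack N quasi-orthogonal vectors into k dimensions.
N, k = 2_000, 128                                    # illustrative sizes
x = torch.randn(N, k, requires_grad=True)
opt = torch.optim.Adam([x], lr=1e-2)
for step in range(2_000):
    opt.zero_grad()
    loss = exp_orthogonality_loss(x)
    loss.backward()
    opt.step()

with torch.no_grad():
    v = F.normalize(x, dim=1)
    cos = (v @ v.T).fill_diagonal_(0.0)
    worst = torch.rad2deg(torch.arccos(cos.abs().max())).item()
    print(f"worst-case pairwise angle: {worst:.1f}°")
```

Because the penalty is exponential in |cos|, the most-parallel pair always contributes a large gradient, which is what keeps the optimizer from settling into the "mostly orthogonal, a few parallel" configuration described above.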
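
Finally, a NumPy sketch of the JL distance-preservation guarantee: it projects random points with a Gaussian matrix and measures the empirical distortion. The point counts, the choice C = 4, and ε = 0.1 are illustrative assumptions, not values from the article's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points, d_high = 1_000, 3_000          # illustrative sizes
eps = 0.1                                # target distortion
# Target dimension from the JL-style bound k >= (C / eps^2) * log(N), with C = 4 here
k = int(np.ceil(4 * np.log(n_points) / eps ** 2))

x = rng.normal(size=(n_points, d_high))
# A Gaussian random projection scaled by 1/sqrt(k) approximately preserves distances
proj = rng.normal(size=(d_high, k)) / np.sqrt(k)
y = x @ proj

# Compare squared pairwise distances before and after projection on random pairs
idx = rng.integers(0, n_points, size=(2_000, 2))
idx = idx[idx[:, 0] != idx[:, 1]]        # drop accidental self-pairs
d_orig = np.sum((x[idx[:, 0]] - x[idx[:, 1]]) ** 2, axis=1)
d_proj = np.sum((y[idx[:, 0]] - y[idx[:, 1]]) ** 2, axis=1)
ratios = d_proj / d_orig

print(f"k = {k}; squared-distance ratios span "
      f"{ratios.min():.3f}–{ratios.max():.3f} "
      f"(target band for eps = {eps}: {1 - eps:.2f}–{1 + eps:.2f})")
```

In typical runs the sampled ratios stay inside, or very close to, the ±ε band, which is the high-probability guarantee the lemma provides.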