The Maths You Need to Start Understanding LLMs

Posted on 2 September 2025 in AI, LLM from scratch

This article is the second of three parts explaining Large Language Models (LLMs) for technically savvy readers without deep AI knowledge. It builds on part 19 of the LLM from Scratch series, which is based on Sebastian Raschka's book Build a Large Language Model (from Scratch). The first post in the series sets the context.

---

Overview

Serious AI research requires advanced maths, but understanding how LLMs work at inference time mostly needs high-school concepts: vectors and matrices. The maths of training is more involved and is covered later; the focus here is on inference, that is, using an existing model.

---

Vectors and High-Dimensional Spaces

A vector of length n represents a point (or a direction) in n-dimensional space. In 2D, the vector (2, -3) means 2 units right and 3 units down; in 3D, the vector (5, 1, -7) extends the same idea to three axes. In LLMs, vectors represent things like logits: raw scores predicting how likely each token is to come next. Spaces with more than three dimensions are impossible to visualise, but the mathematics works the same way.

---

Vocab Space

A logits vector has one element per token in the vocabulary; for GPT-2 that is 50,257 elements. Each element is a score for the corresponding token being the next one, so the vector is a point in a 50,257-dimensional "vocab space". For example, the element at index 464 ("The") is the predicted score for "The" coming next.

Logits are unnormalised scores; applying the softmax function converts them into probabilities between 0 and 1 that sum to 1. Because softmax is unchanged when the same constant is added to every logit, different logits vectors can represent the same probability distribution — the representation is redundant. After softmax, probability distributions occupy a "clean", normalised vocab space. One special case worth noting for later is the one-hot vector (all zeroes except a single element equal to 1), which puts 100% probability on one token.

---

Embeddings

An embedding space is a high-dimensional vector space that represents the meanings of concepts.
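As a toy sketch of this idea (the words and four-dimensional vectors below are made up for illustration; real embedding spaces have hundreds or thousands of learned dimensions), cosine similarity measures whether two embedding vectors point in roughly the same direction:

```python
import numpy as np

# Made-up 4-dimensional "embeddings" for illustration only;
# real models learn these vectors during training.
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.1, 0.0]),
    "tiger": np.array([0.8, 0.9, 0.2, 0.1]),
    "dog":   np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    # Compares direction only: the result is 1 for vectors pointing
    # the same way, 0 for perpendicular vectors, regardless of length.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["tiger"]))  # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # much lower
```

Because cosine similarity ignores vector length, a vector and any scaled copy of it score a similarity of exactly 1 — which is why direction, rather than length, is often what carries meaning.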
Similar concepts cluster together: "domestic cat", "lion", and "tiger" form a cat cluster, while "dog", "wolf", and "coyote" form a canine cluster. Different embedding spaces exist for different use cases (e.g. zoological classification versus everyday language). Often the direction of an embedding vector matters more than its length: vectors that are scaled versions of each other can represent the same meaning.

---

Projections by Matrix Multiplication

Matrices can be viewed as collections of vectors (their rows or columns), and matrix multiplication can transform points from one space to another. For example, a 2×2 rotation matrix rotates points in 2D space by an angle θ; if the points are stored as the columns of a 2×n matrix, multiplying by the rotation matrix produces their rotated versions. In machine learning, the row-major convention is used instead (points as rows), so the order of multiplication is reversed.

Matrices can also project between spaces of different dimensions, e.g. from 3D to 2D for graphics projection. Importantly, projections that reduce dimensionality can lose information — they are lossy. The large matrices inside LLMs project between very high-dimensional spaces, e.g. from 50,257 dimensions down to 768 and back.

---

Neural Networks

A single neural network layer computes

\[ Z = \delta(X W^T + B) \]

where \(X\) is the input batch (n samples × input dimension), \(W\) is the weight matrix (output dimension × input dimension, hence transposed in the calculation), \(B\) is a bias vector, and \(\delta\) is an optional activation function. Ignoring the bias and activation reduces this to a pure matrix multiplication projecting from the input space to the output space: a single layer is essentially a linear projection between vector spaces.

---

Conclusion

Understanding