Maths for Machine Learning: For Coders!
Your coder’s guide to understanding machine learning papers
By Nicholas Johnson
Document Version:
Last Updated:
Welcome to the Course!
Imagine, for a minute, you had never written a line of code before, and someone came along and showed you this:
let y = 0;
let b = 1; // the bias
let x = [1, 2, 3];
let w = [4, 5, 6];
for (let i = 0; i < x.length; i++) {
  y = y + x[i] * w[i];
}
y = y + b;
As a coder, in pretty much any language, you will probably immediately recognise some variable declarations, a for loop, and a bit of Maths.
I could have written this a lot more efficiently, but this is a maths book, so I’m keeping the code vanilla for now.
Think back though, to when you first saw code like this. It’s not obvious what is happening here. It’s obvious now because you learned the syntax, but back then, this was deep magic.
Now say I showed you this:

$y = \sum_{i=1}^{n} w_i x_i + b$
You might, if you knew a bit of Maths, recognise that this formula is the same thing. You might also, if you’re not new to this, recognise the formula for a simple linear perceptron, or you might not.
In the code, everything is relevant, the square and curly braces, the semicolons, the ordering of the lines.
In the equation too, everything is relevant: the sigma, the italics, capitals, subscripts, superscripts. Sometimes we write a dot or a bar or a star or a hat. All of these things have significance. Both the code and the equation express an idea in a way that is understandable.
This book is your coder’s guide to maths. It is a Rosetta Stone that will let you mentally translate equations into algorithms.
Our goal in this book will be to get to a point where we can understand all the mathematics needed to read modern machine learning papers, papers like “Attention is All You Need”, the paper that laid the groundwork for ChatGPT.
This book is actually two books
This book very specifically focusses on Maths so you can read and understand papers. The companion to this book, Python for Machine Learning, will teach you how to put these ideas into practice. You may wish to switch between the two as you go, or remain focussed on one or the other.
Exercises
Like learning code, learning maths is interactive. You’ll find exercises sprinkled through this course. Doing these exercises will cement the ideas in your head.
Syntax and Terminology
In this section, we’re going to run through most of the basic syntax and terminology that you’ll need to interpret mathematical equations, specifically for machine learning. This will give you a basis to understand the rest of the book.
Don’t worry if not everything makes sense yet. You’ll probably wish to refer back to this chapter as you move on.
Variables
Variables in Maths are not like variables in programming. In software I might write
a = a + 1;
I can do this because a is a pointer to a location in memory.
This makes no sense in mathematics. Variables in Maths have fixed values, or values that can be discovered; you can’t just reassign them whenever you want. If I write:

$a = 12$

then $a$ is 12 for the duration of the equation.
Capitals, Bold and Italic in Variable Names
 $x$  lower case italic letters are used for variables representing scalar values, eg $5$
 $\mathbf{x}$  bold lower case letters are used for vectors, eg $[1,2,3]$
 $\mathbf{X}$  bold capitals are used to represent matrices and tensors, eg $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$
This isn’t necessarily true for all Maths, but has become a convention in machine learning.
Scalars, Vectors, Matrices and Tensors
In Maths we have a lot of words for arrays. This is for historical reasons: various people writing at different times used different words for related concepts.
Scalars
A scalar is a single number. You can conceptualise it as a zero dimensional array or tensor if that makes you happy; the maths will still work out.
Here’s a scalar in action:

$x = 5$
Vectors
Vectors are one dimensional arrays, just like regular arrays. They also share the same square bracket and comma syntax, which is nice:

$\mathbf{x} = [1, 2, 3]$
Matrices
Matrices are two dimensional arrays. We can optionally surround them with square brackets. A matrix is an array of arrays:

$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$
Tensors
Tensors are N dimensional arrays. A one dimensional tensor is also a vector. A two dimensional tensor is also a matrix.
Tensors are hard to represent on paper, but we could have a go at showing a 3d tensor as a vector of matrices:

$\begin{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} \end{bmatrix}$
If we want to specify that a variable $\mathbf{X}$ contains a $5 \times 5 \times 5 \times 5$ tensor without drawing the tensor, we can do so by saying that $\mathbf{X}$ is in the set of real numbers of shape $5 \times 5 \times 5 \times 5$:

$\mathbf{X} \in \mathbb{R}^{5 \times 5 \times 5 \times 5}$

More on sets later.
When writing code, we use tensors most of the time, regardless of how many dimensions we need. You may have heard of a package called TensorFlow; PyTorch also deals with tensors, as does JAX.
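To make the hierarchy concrete, here’s a sketch in plain JavaScript (no tensor library assumed; the `shape` helper is my own invention, not a standard function):

```javascript
// A scalar: a single number (a "zero-dimensional array").
const scalar = 5;

// A vector: a one-dimensional array.
const vector = [1, 2, 3];

// A matrix: an array of arrays (two dimensions).
const matrix = [
  [1, 2],
  [3, 4],
];

// A 3d tensor: an array of matrices.
const tensor3d = [
  [[1, 2], [3, 4]],
  [[5, 6], [7, 8]],
];

// Read off the shape by walking the nesting.
function shape(t) {
  return Array.isArray(t) ? [t.length, ...shape(t[0])] : [];
}

console.log(shape(vector));   // [3]
console.log(shape(matrix));   // [2, 2]
console.log(shape(tensor3d)); // [2, 2, 2]
```

A scalar has an empty shape, which is exactly the “zero dimensional tensor” idea from the Scalars section.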
Single or Double Vertical Bars around vectors denote the length of the vector 
Single bars $|\mathbf{w}|$ or double bars $\|\mathbf{w}\|$ around a vector denote the length of the vector. The two notations are used interchangeably in different contexts.
To find the length (or magnitude) of a vector $\mathbf{w}$ given by $\mathbf{w} = [1,2]$, you can use the Pythagorean formula for the magnitude of a 2D vector:

$\|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2}$
Where:
 $w_1$ is the first component of the vector (in this case, 1).
 $w_2$ is the second component of the vector (in this case, 2).
Plugging in the values from vector $\mathbf{w}$:
$\|\mathbf{w}\| = \sqrt{1^2 + 2^2} = \sqrt{1 + 4} = \sqrt{5}$
So, the magnitude (or length) of vector $\mathbf{w}$ is $\sqrt{5}$.
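The same calculation as code, a sketch in plain JavaScript (the `magnitude` function name is my own, not a built-in):

```javascript
// Magnitude (Euclidean length) of a vector:
// square each component, sum the squares, take the square root.
function magnitude(w) {
  let sumOfSquares = 0;
  for (let i = 0; i < w.length; i++) {
    sumOfSquares += w[i] * w[i];
  }
  return Math.sqrt(sumOfSquares);
}

console.log(magnitude([1, 2])); // √5 ≈ 2.236
console.log(magnitude([3, 4])); // 5
```

Note the code happily handles vectors of any length, not just 2D; the Pythagorean formula generalises the same way.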
Single Vertical Bars around scalars are absolute values 
Single bars can also mean the absolute value of a scalar:

$|-5| = 5$

We can also write this as a function:

$\mathrm{abs}(x) = |x|$

We can define this like so:

$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$
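In code, the absolute value is a one-line conditional. A sketch in JavaScript (which also has `Math.abs` built in):

```javascript
// Absolute value: flip the sign of negative numbers,
// leave everything else alone.
function abs(x) {
  return x >= 0 ? x : -x;
}

console.log(abs(-5)); // 5
console.log(abs(3));  // 3
```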
Occasionally you will also see this:
Vertical bars in sets mean “such that”
A single bar separating the two sides of a set definition means “such that”. For example, $\{x \in \mathbb{R} \mid x > 0\}$ reads “the set of all real numbers $x$ such that $x$ is greater than zero”.
Hats
A hat means a prediction:

$\hat{y}$

We often use yHat as a variable name, to mean our current prediction. It’s not necessarily the final value, just what we have right now in this current training epoch.
Stars
A star means an ideal value: $y^*$ is the value we want to get to.
Perceptrons
Perceptrons are the fundamental building blocks of neural networks. A perceptron multiplies each of its inputs by a weight, sums the results, and adds a bias: exactly the calculation in the code at the start of this book.
The Perceptron: A Perceiving and Recognizing Automaton, Rosenblatt, 1957
Matrix Maths (Fundamentals of Linear Algebra)
The great thing about matrices is that they contain a bunch of elements that are predictably all the same type as each other. This makes them ideal for parallel processing. We use GPUs (or dedicated hardware like TPUs, Tensor Processing Units) to load a batch of data all at once, then, CHUNK, we process all that data at the same time through multiple parallel CUDA cores.
This is great for graphics, where you have millions of 3d points that all need transforming very quickly. It’s also great for machine learning, where you have huge vectors of weights that all need transforming using the same algorithm.
Dot Products and Cross Products
For most operations, tensors, vectors, scalars and matrices work in the same way.
The very notable exception to this rule is the dot product. Say I have two scalars, $a$ and $b$, and I want to multiply them together. I can write this in three ways:
$a \times b = ab = a \cdot b$
$a \times b$ is the cross product; $a \cdot b$ or $ab$ is the dot product. With scalars these give the same result, so we use them interchangeably. Not so with tensors.
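For vectors, the dot product multiplies corresponding elements and sums the results, the same loop we met in the perceptron code at the start of this book. A sketch in plain JavaScript (the `dot` function is my own, not a built-in):

```javascript
// Dot product of two equal-length vectors:
// multiply element-wise, then sum.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must be the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

console.log(dot([1, 2, 3], [4, 5, 6])); // 1·4 + 2·5 + 3·6 = 32
```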
For example, let’s compute the dot product (matrix multiplication) of two matrices, $A$ and $B$, to produce a matrix $C$:

$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$

Each element $C_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.
Calculating $C_{11}$:
$C_{11} = A_{11} \cdot B_{11} + A_{12} \cdot B_{21} = 1 \cdot 5 + 2 \cdot 7 = 5 + 14 = 19$
Calculating $C_{12}$:
$C_{12} = A_{11} \cdot B_{12} + A_{12} \cdot B_{22} = 1 \cdot 6 + 2 \cdot 8 = 6 + 16 = 22$
Calculating $C_{21}$:
$C_{21} = A_{21} \cdot B_{11} + A_{22} \cdot B_{21} = 3 \cdot 5 + 4 \cdot 7 = 15 + 28 = 43$
Calculating $C_{22}$:
$C_{22} = A_{21} \cdot B_{12} + A_{22} \cdot B_{22} = 3 \cdot 6 + 4 \cdot 8 = 18 + 32 = 50$
So, the resulting matrix $C$ from the dot product (matrix multiplication) of matrices $A$ and $B$ is:

$C = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$
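The same worked example as code. A minimal matrix multiplication sketch in plain JavaScript (no library assumed):

```javascript
// Matrix multiplication: C[i][j] is the dot product of
// row i of A with column j of B.
function matmul(A, B) {
  const rows = A.length;
  const cols = B[0].length;
  const inner = B.length; // must equal A[0].length
  const C = [];
  for (let i = 0; i < rows; i++) {
    C.push([]);
    for (let j = 0; j < cols; j++) {
      let sum = 0;
      for (let k = 0; k < inner; k++) {
        sum += A[i][k] * B[k][j];
      }
      C[i].push(sum);
    }
  }
  return C;
}

const A = [[1, 2], [3, 4]];
const B = [[5, 6], [7, 8]];
console.log(matmul(A, B)); // [[19, 22], [43, 50]]
```

In real machine learning code a library (and the GPU underneath it) does this for you; the triple loop here is just to show where those four numbers come from.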