Hi, I'm Nicholas Johnson!

software engineer / trainer / AI enthusiast

Maths for Machine Learning - For Coders!

Your coder's guide to understanding machine learning papers

By Nicholas Johnson


Welcome to the Course!

Imagine for a minute that you had never written a line of code before, and someone came along and showed you this:

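let y = 0;
let x = [1, 2, 3];  // inputs
let w = [4, 5, 6];  // weights
let b = 1;          // bias term; value chosen for illustration

// add up each weight multiplied by its matching input
for (let i = 0; i < x.length; i++) {
    y = y + w[i] * x[i];
}

y = y + b;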

As a coder, in pretty much any language, you will probably immediately recognise some variable declarations, a for loop plus a bit of Maths.

I could have written this a lot more efficiently, but this is a maths book, so I’m keeping the code vanilla for now.

Think back though, to when you first saw code like this. It’s not obvious what is happening here. It’s obvious now because you learned the syntax, but back then, this was deep magic.

Now say I showed you this:

\[ y = \sum_{i=1}^{n} w_i x_i + b \]

You might, if you knew a bit of Maths, recognise that this formula is the same thing. You might also, if you’re not new to this, recognise the formula for a simple linear perceptron, or you might not.
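Read as code, the sigma is just a loop that accumulates one product per index. Here is a minimal sketch of the same formula in a more compact form. The helper name `weightedSum` and the bias value of 1 are assumptions made for illustration:

```javascript
// Hypothetical helper: computes y = sum over i of w_i * x_i, plus b.
// Maths counts from i = 1, JavaScript arrays count from 0.
function weightedSum(w, x, b) {
    if (w.length !== x.length) {
        throw new Error("w and x must have the same length");
    }
    // reduce plays the role of the sigma: it adds one term per index i
    const sum = w.reduce((total, wi, i) => total + wi * x[i], 0);
    return sum + b;
}

// b = 1 is an illustrative value
console.log(weightedSum([4, 5, 6], [1, 2, 3], 1)); // 4*1 + 5*2 + 6*3 + 1 = 33
```

The subscript \(i\) in the formula becomes the array index, and the limits on the sigma become the bounds of the loop.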

In the code, everything is relevant: the square and curly braces, the semicolons, the ordering of the lines.

In the equation too, everything is relevant: the sigma, the italics, capitals, subscripts and superscripts. Sometimes we write a dot or a bar or a star or a hat. All of these things have significance. Both the code and the equation express an idea in a way that is understandable once you know the syntax.

This book is your coder’s guide to maths. It is a Rosetta Stone that will let you mentally translate equations into algorithms.

Our goal in this book will be to get to a point where we can understand all the mathematics needed to read modern machine learning papers, papers like “Attention is All You Need”, the paper that laid the groundwork for ChatGPT.

This book is actually 2 books

This book focusses very specifically on Maths, so you can read and understand papers. The companion to this book, Python for Machine Learning, will teach you how to put these ideas into practice. You may wish to switch between the two as you go, or remain focussed on one or the other.


Like learning code, learning maths is interactive. You’ll find exercises sprinkled through this course. Doing these exercises will cement the ideas in your head.

Syntax and Terminology

In this section, we’re going to run through most of the basic syntax and terminology that you’ll need to interpret mathematical equations, specifically for machine learning. This will give you a basis to understand the rest of the book.

Don’t worry if not everything makes sense yet. You’ll probably wish to refer back to this chapter as you move on.


Variables in Maths are not like variables in programming. In software I might write:

a = a + 1;

I can do this because a names a location in memory, and I am free to overwrite whatever is stored there.

This makes no sense in mathematics. A variable in Maths has a value, either one that is given or one that can be discovered. You can’t just set it to whatever you want. If I write:

\[ a = 12 \]

\(a\) is now 12 for the duration of the equation.
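If it helps, a mathematical variable behaves more like a `const` binding than a reassignable variable. The snippet below is only an analogy, not a formal definition, and the helper name `tryReassign` is made up for illustration:

```javascript
const a = 12; // a = 12: once stated, a stays 12

// Hypothetical helper: attempts the programmer's "a = a + 1" on a const.
function tryReassign() {
    try {
        a = a + 1; // assigning to a const throws a TypeError at runtime
        return "reassigned";
    } catch (err) {
        return err instanceof TypeError ? "TypeError" : "other error";
    }
}

console.log(tryReassign()); // "TypeError"

// The mathematical move is to give the new value its own name instead.
const c = a + 1;
console.log(a, c); // 12 13
```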

Capitals, Bold and Italic in Variable Names

  • \(x\) - lower case italic letters are used for variables representing scalar values, e.g. \(5\)
  • \(\mathbf{x}\) - bold lower case letters are used for vectors, e.g. \([1, 2, 3]\)
  • \(\mathbf{X}\) - bold capitals are used to represent matrices and tensors

This isn’t necessarily true for all Maths, but has become a convention in machine learning.

Scalars, Vectors, Matrices and Tensors

In Maths we have a lot of words for arrays. This is for historical reasons: various folks writing at different times used different words for related concepts.


A scalar is a single number. You can conceptualise this as a zero dimensional array or tensor if you like; the maths will still work out.

Here’s a scalar in action:

\[ x = 5 \]


Vectors are one dimensional arrays, just like regular arrays. They also share the same square bracket and comma syntax, which is nice:

\[ \mathbf{x} = [1, 2, 3] \]


Matrices are two dimensional arrays. We can optionally surround them with square brackets. A matrix is an array of arrays:

\[
X = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} \\
a_{41} & a_{42} & a_{43} & a_{44} & a_{45} \\
a_{51} & a_{52} & a_{53} & a_{54} & a_{55}
\end{bmatrix}
\]


Tensors are N dimensional arrays. A one dimensional tensor is also a vector. A two dimensional tensor is also a matrix.

Tensors are hard to represent on paper, but we could have a go at showing a 3d tensor using a vector of matrices:

\[
X = \left[ \begin{array}{c}
\begin{bmatrix}
x_{111} & x_{112} & x_{113} & x_{114} & x_{115} \\
x_{121} & x_{122} & x_{123} & x_{124} & x_{125} \\
x_{131} & x_{132} & x_{133} & x_{134} & x_{135} \\
x_{141} & x_{142} & x_{143} & x_{144} & x_{145} \\
x_{151} & x_{152} & x_{153} & x_{154} & x_{155}
\end{bmatrix} \\
\begin{bmatrix}
x_{211} & x_{212} & x_{213} & x_{214} & x_{215} \\
x_{221} & x_{222} & x_{223} & x_{224} & x_{225} \\
x_{231} & x_{232} & x_{233} & x_{234} & x_{235} \\
x_{241} & x_{242} & x_{243} & x_{244} & x_{245} \\
x_{251} & x_{252} & x_{253} & x_{254} & x_{255}
\end{bmatrix} \\
\vdots \\
\begin{bmatrix}
x_{511} & x_{512} & x_{513} & x_{514} & x_{515} \\
x_{521} & x_{522} & x_{523} & x_{524} & x_{525} \\
x_{531} & x_{532} & x_{533} & x_{534} & x_{535} \\
x_{541} & x_{542} & x_{543} & x_{544} & x_{545} \\
x_{551} & x_{552} & x_{553} & x_{554} & x_{555}
\end{bmatrix}
\end{array} \right]
\]

If we want to specify that a variable \(Y\) contains a \(5 \times 5 \times 5 \times 5\) tensor without drawing the tensor, we can do so by saying that \(Y\) is in the set of real-valued tensors of shape \(5 \times 5 \times 5 \times 5\). More on sets later.

\[ Y \in \mathbb{R}^{5 \times 5 \times 5 \times 5} \]
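In code, the superscript on \(\mathbb{R}\) corresponds to the shape of a nested array: one number per level of nesting. Here is a rough sketch, assuming rectangular nested arrays. `shape` and `zeros` are hypothetical helpers written for this example, not library functions:

```javascript
// Hypothetical helper: returns the length of each nesting level.
// Assumes the array is rectangular (every sub-array has the same length).
function shape(t) {
    const dims = [];
    while (Array.isArray(t)) {
        dims.push(t.length);
        t = t[0];
    }
    return dims;
}

// Hypothetical helper: builds a nested array of zeros with the given shape.
function zeros(dims) {
    if (dims.length === 0) return 0; // a scalar has zero dimensions
    const [first, ...rest] = dims;
    return Array.from({ length: first }, () => zeros(rest));
}

console.log(shape(5));                   // []            scalar
console.log(shape([1, 2, 3]));           // [3]           vector
console.log(shape([[1, 2], [3, 4]]));    // [2, 2]        matrix
console.log(shape(zeros([5, 5, 5, 5]))); // [5, 5, 5, 5]  4-dimensional tensor
```

The number of entries in the shape is the number of dimensions, so a scalar, a vector and a matrix are simply tensors with shapes of length 0, 1 and 2.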

When writing code, we use tensors most of the time, regardless of how many dimensions we need. You may have heard of a package called TensorFlow. PyTorch also deals with tensors, as does JAX.

Single or Double Vertical Bars around vectors denote the length of the vector ||

Single or double bars around a vector denote the length of a vector. These two notations are used interchangeably in different contexts.

To find the length (or magnitude) of a vector \(\mathbf{w}\) given by \(\mathbf{w} = [1, 2]\), you can use the Pythagorean formula for the magnitude of a 2D vector:

\[ |\mathbf{w}| = \|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2} \]

where:

  • \(w_1\) is the first component of the vector (in this case, 1).
  • \(w_2\) is the second component of the vector (in this case, 2).

Plugging in the values from vector \(\mathbf{w}\):

\[ |\mathbf{w}| = \sqrt{1^2 + 2^2} = \sqrt{1 + 4} = \sqrt{5} \]

So, the magnitude (or length) of vector \(\mathbf{w}\) is \(\sqrt{5}\).
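The same calculation can be sketched in code. The helper name `magnitude` is an assumption for this example, and it extends the 2D formula to any number of components by summing every squared component:

```javascript
// Hypothetical helper: Euclidean length of a vector of any dimension.
function magnitude(v) {
    const sumOfSquares = v.reduce((total, vi) => total + vi * vi, 0);
    return Math.sqrt(sumOfSquares);
}

console.log(magnitude([1, 2])); // the square root of 5, roughly 2.236
console.log(magnitude([3, 4])); // 5
```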

Single Vertical Bars around scalars are absolute values |

Single bars can also mean the absolute value of a scalar:

\[ a = -5, \quad |a| = 5 \]

We can also write this as a function:

\[ \text{abs}(a) = 5 \]

We can define this like so:

\[ |a| = \begin{cases} a & \text{if } a \geq 0 \\ -a & \text{if } a < 0 \end{cases} \]

Occasionally you will also see this:

\[ |a| = \sqrt{a^2} \]
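Each of these definitions translates directly into code. A small sketch, where the function names are made up for illustration and only `Math.abs` and `Math.sqrt` are built in:

```javascript
// Three equivalent ways to compute |a| for a real number a.
const absBuiltin = (a) => Math.abs(a);
const absPiecewise = (a) => (a >= 0 ? a : -a); // the case-by-case definition
const absSqrt = (a) => Math.sqrt(a * a);       // |a| as the square root of a squared

for (const a of [-5, 0, 3]) {
    console.log(a, absBuiltin(a), absPiecewise(a), absSqrt(a));
}
```

The curly brace in the case-by-case definition reads as an if/else, with one branch per row.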

Vertical bars in sets mean “such that”

A single bar separating the two sides of a set definition means “such that”.

TODO: Find real world example of this.
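As a generic illustration (not taken from a specific paper), set-builder notation such as \(\{x \in S \mid x > 0\}\) reads as "the set of all \(x\) in \(S\) such that \(x\) is greater than 0". In code, that is essentially a filter. The set `S` below is made up for the example:

```javascript
// S is an illustrative set, not from the original text.
const S = new Set([-2, -1, 0, 1, 2, 3]);

// { x in S | x > 0 }: keep the elements of S such that x > 0
const positives = new Set([...S].filter((x) => x > 0));

console.log([...positives]); // [1, 2, 3]
```

The part before the bar says where the elements come from, and the part after the bar is the condition they must satisfy.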


A hat means a prediction.

TODO: Create formula

We often use yHat as a variable name, to mean our current prediction. It’s not necessarily the final value, just what we have right now in this current training epoch.
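In symbols the prediction is usually written \(\hat{y}\), with plain \(y\) for the true value. Here is a small sketch of how the two might sit side by side in code, reusing the weighted sum from the introduction. All of the numbers, and the helper name `predict`, are made up for illustration:

```javascript
// Hypothetical helper: the current prediction from weights w, inputs x and bias b.
function predict(w, x, b) {
    return w.reduce((total, wi, i) => total + wi * x[i], 0) + b;
}

const x = [1, 2, 3]; // inputs (illustrative)
const y = 30;        // the true value we are trying to predict (illustrative)
const w = [4, 5, 6]; // the weights at this point in training (illustrative)
const b = 1;         // the bias at this point in training (illustrative)

const yHat = predict(w, x, b); // y-hat: our prediction right now
const error = y - yHat;        // how far the prediction is from the true value

console.log(yHat);  // 33
console.log(error); // -3
```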


A star means an ideal value. It’s the value we want to get to.


Perceptrons are the fundamental building blocks of neural networks. A perceptron

The Perceptron - A Perceiving and Recognizing Automaton - Rosenblatt 1957

Matrix Maths (Fundamentals of Linear Algebra)

The great thing about matrices is that they contain a bunch of elements that are predictably all the same kind of thing as each other. This makes them ideal for parallel processing. We use GPUs (or dedicated accelerators such as TPUs, Tensor Processing Units) to load a load of data all at once, then CHUNK, we process all that data at the same time across many parallel cores, such as the CUDA cores on an NVIDIA GPU.

This is great for graphics, where you have millions of 3d points that all need transforming very quickly. It’s also great for machine learning, where you have huge vectors of weights that all need transforming using the same algorithm.

Dot Products and Cross Products

For most operations, tensors, vectors, scalars and matrices work in the same way.

The very notable exception to this rule is the dot product. Say I have two scalars, \(a\) and \(b\), and I want to multiply them together. I can write this in three ways.

\(a \times b\) is the same as \(ab\), sometimes written \(a \cdot b\):

\[ a \times b = ab = a \cdot b \]

\(a \times b\) is the notation for the cross product. \(a \cdot b\) or \(ab\) is the notation for the dot product. With scalars these all mean ordinary multiplication and give the same result, so we use them interchangeably. Not so with tensors.
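To see that the two products really do differ once we leave scalars, here is a sketch for 3-dimensional vectors. `dot` and `cross` are hypothetical helpers written for this example, and the cross product as written here only applies to vectors of length 3:

```javascript
// Dot product: multiply matching components and add them up. The result is a scalar.
function dot(a, b) {
    if (a.length !== b.length) throw new Error("vectors must have the same length");
    return a.reduce((total, ai, i) => total + ai * b[i], 0);
}

// Cross product of two 3-dimensional vectors. The result is another vector.
function cross(a, b) {
    if (a.length !== 3 || b.length !== 3) throw new Error("cross needs 3d vectors");
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ];
}

console.log(dot([1, 2, 3], [4, 5, 6]));   // 32
console.log(cross([1, 2, 3], [4, 5, 6])); // [-3, 6, -3]
```

The same two inputs give a single number from one product and a new vector from the other, which is why the notation matters.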

With matrices, we take the dot product of each row of matrix \(A\) with each column of matrix \(B\). To see this in action, we can use the following matrices:

\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} \]

  1. Calculating \(C_{11}\):

    \[ C_{11} = A_{11} \cdot B_{11} + A_{12} \cdot B_{21} = 1 \cdot 5 + 2 \cdot 7 = 5 + 14 = 19 \]
  2. Calculating \(C_{12}\):

    \[ C_{12} = A_{11} \cdot B_{12} + A_{12} \cdot B_{22} = 1 \cdot 6 + 2 \cdot 8 = 6 + 16 = 22 \]
  3. Calculating \(C_{21}\):

    \[ C_{21} = A_{21} \cdot B_{11} + A_{22} \cdot B_{21} = 3 \cdot 5 + 4 \cdot 7 = 15 + 28 = 43 \]
  4. Calculating \(C_{22}\):

    \[ C_{22} = A_{21} \cdot B_{12} + A_{22} \cdot B_{22} = 3 \cdot 6 + 4 \cdot 8 = 18 + 32 = 50 \]

So, the resulting matrix \(C\) from the dot product (matrix multiplication) of matrices \(A\) and \(B\) would be:

\[ C = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} \]
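The row-by-column calculation can be sketched as a pair of nested loops, written here with `map` and `reduce`. `matmul` is a hypothetical helper for this example, and it assumes both matrices are non-empty, rectangular arrays of arrays:

```javascript
// Hypothetical helper: multiplies an (n x m) matrix by an (m x p) matrix.
function matmul(A, B) {
    if (A[0].length !== B.length) {
        throw new Error("columns of A must match rows of B");
    }
    return A.map((row) =>
        B[0].map((_, j) =>
            // dot product of this row of A with column j of B
            row.reduce((total, aik, k) => total + aik * B[k][j], 0)
        )
    );
}

const A = [[1, 2], [3, 4]];
const B = [[5, 6], [7, 8]];
console.log(matmul(A, B)); // [[19, 22], [43, 50]]
```

Each entry of the result is one dot product, so all four could in principle be computed at the same time, which is what makes this operation such a good fit for parallel hardware.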

\[ x \in \mathbb{R} \]