Mathematics is the basis of all engineering, and even more so of abstract fields like Machine Learning. Linear Algebra is at its core. It is very important to understand linear algebra if we want to proceed with understanding Machine Learning.

Linear Algebra was developed to simplify working with linear equations. It provides a compact way of representing systems of linear equations and formalizing their solution.

Consider the two equations below

```
2x + y = 10
2x - y = 2
```

One can trivially solve these two equations in two variables. That is high school algebra. We don't need any code to solve it. But what would we do if we were given 100,000 equations in 100,000 variables? Linear Algebra helps us here. Let us see how.

Of course, we do not have enough space here to write down the 100,000 equations. But we can use the above two equations to understand the concept. These two equations can be written in matrix form as

```
[ 2   1 ] [ x ]   [ 10 ]
[ 2  -1 ] [ y ] = [  2 ]
```

Essentially, we have represented the set of equations in the form

`Ax = b # Where A is a matrix; x and b are vectors.`

This is the short hand way of representing a set of linear equations - in form of matrices.
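For instance, the two equations above can be solved with NumPy's linear algebra routines (a quick sketch, assuming NumPy is installed):

```python
import numpy as np

# Ax = b for the system 2x + y = 10 and 2x - y = 2
A = np.array([[2.0, 1.0],
              [2.0, -1.0]])
b = np.array([10.0, 2.0])

# np.linalg.solve finds x without explicitly forming the inverse
solution = np.linalg.solve(A, b)
print(solution)  # [3. 4.]
```

The same call scales to systems with thousands of variables, which is exactly the point of the matrix representation.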

Reducing the representational size is not the only advantage of this. We will see below how it helps in computation. Before that, let us look at the notation.

- A ∈ R^{m x n} => A is a matrix of m rows and n columns, and all its elements are real numbers.
- x ∈ R^{n} => x is a vector with n entries, and all its elements are real numbers.
- A_{ij} => the element of A at the i^{th} row and j^{th} column
- x_{i} => the i^{th} element of the vector x
- A_{i,:} => the i^{th} row of the matrix A
- A_{:,j} => the j^{th} column of the matrix A

Intuitively, we can think of a vector as a point in n-dimensional space, and a matrix A as an operation that maps a vector v1 in n-dimensional space to another vector v2 in m-dimensional space.

Linear Algebra is a vast domain. In order to use it in Machine Learning, it is necessary to understand some basic concepts.

The concept of matrix multiplication is not so intuitive. Suppose we have two matrices A ∈ R^{m x n} and B ∈ R^{n x p}; the product of the two matrices is C ∈ R^{m x p}, where

` C`_{ij} = ∑_{k} A_{ik} B_{kj}

Note the sizes of the three matrices. For the product A x B to exist, the number of columns in matrix A must equal the number of rows in matrix B. The product then has as many rows as A and as many columns as B.
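A small NumPy sketch of the size rule (the matrices here are illustrative):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # 2 x 3
B = np.arange(12).reshape(3, 4)   # 3 x 4: columns of A == rows of B

C = A @ B                         # product exists, with shape 2 x 4
print(C.shape)  # (2, 4)

# C[i, j] is the sum over k of A[i, k] * B[k, j]
assert C[1, 2] == sum(A[1, k] * B[k, 2] for k in range(3))
```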

There are four kinds of matrix products: Matrix - Matrix, Matrix - Vector, Vector - Matrix and Vector - Vector. Let us look at two special cases here:

Given two vectors x, y ∈ R^{n}, x^{T}y - the dot product of the two vectors - is a real number.

Given a matrix A ∈ R^{m x n} and a vector x ∈ R^{n}, their product y = Ax is a vector in R^{m}. In other words, y is a linear combination of the columns of A, where the coefficients of the linear combination are given by the entries of x.
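We can check this "linear combination of columns" view numerically (a minimal sketch):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, -1.0])

y = A @ x
# y equals x[0] times the first column plus x[1] times the second column
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(y, combo))  # True
```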

It is interesting to note here that matrix multiplication

- Is not commutative. A x B may not be equal to B x A. In fact B x A may not even exist.
- But it is distributive. A x (B + C) = A x B + A x C
- And it is also associative. A x (B x C) = (A x B) x C

By definition, multiplicative identity is a matrix I such that A x I = A. Here, I happens to be a matrix where all the diagonal elements are 1 and others are 0. In more formal terms, for a matrix A ∈ R^{m x n}, the identity matrix is the matrix I ∈ R^{n x n} such that I_{ij} = 1 if i = j and 0 otherwise.

In general the size of an identity matrix is not important. Just that it should be a square matrix (with equal number of rows and columns) such that I_{ij} = 1 if i = j and 0 otherwise.

The transpose of a matrix is obtained by flipping it around the diagonal. In formal terms, for a matrix A ∈ R^{m x n}, the transpose A^{T} is the matrix B ∈ R^{n x m} such that B_{ij} = A_{ji}

Interesting properties of the transpose operation:

- (A^{T})^{T} = A
- (A x B)^{T} = B^{T} x A^{T}
- (A + B)^{T} = A^{T} + B^{T}
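These properties are easy to verify numerically with random matrices (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
D = rng.standard_normal((2, 3))

print(np.allclose(A.T.T, A))              # True: (A^T)^T = A
print(np.allclose((A @ B).T, B.T @ A.T))  # True: (AB)^T = B^T A^T
print(np.allclose((A + D).T, A.T + D.T))  # True: (A + D)^T = A^T + D^T
```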

A symmetric matrix is one that is symmetric along the diagonal. In formal terms, a matrix A ∈ R^{n x n} is symmetric if A^{T} = A. And a matrix A ∈ R^{n x n} is anti-symmetric if A^{T} = -A

Note that the matrix has to be a square matrix in order to be symmetric or anti-symmetric.

Observe that for any matrix A, A + A^{T} is symmetric and A - A^{T} is anti-symmetric. Thus, any matrix A = (A + A^{T})/2 + (A - A^{T})/2. That means any matrix A can be expressed as the sum of a symmetric matrix and an anti-symmetric matrix.

This property is important for subsequent derivations. Symmetric matrices have very nice properties that are useful in real life calculations. It is common to denote a symmetric matrix using S. Thus, A ∈ S^{n} means that A is an n x n dimensional symmetric matrix.
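The symmetric/anti-symmetric decomposition can be checked directly (a minimal NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

S = (A + A.T) / 2   # symmetric part
K = (A - A.T) / 2   # anti-symmetric part

print(np.allclose(S, S.T))    # True: S is symmetric
print(np.allclose(K, -K.T))   # True: K is anti-symmetric
print(np.allclose(S + K, A))  # True: they sum back to A
```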

The trace of a square matrix A ∈ R^{n x n} is denoted by tr(A). It is the sum of all the diagonal elements of the matrix.

` tr(A) = ∑`_{i} A_{ii}

The matrix trace has some interesting properties:

- tr(A) = tr(A^{T})
- tr(A + B) = tr(A) + tr(B)
- tr(t x A) = t x tr(A), for any scalar t
- If A x B is a square matrix (implying that B x A is also a square matrix), tr(AB) = tr(BA)
- Extending this, for A, B, C such that ABC is a square matrix, tr(ABC) = tr(BCA) = tr(CAB)
- This cyclic property is not limited to two or three matrices; it applies to any number of them
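The cyclic property is worth seeing in action, since AB and BA can have different sizes (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 2))

# AB is 2x2 while BA is 3x3, yet the traces agree
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True
```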

We can think of the Norm of a vector as the length of the segment from the origin to the point denoted by the vector - that is the Euclidean norm (or the L_{2} Norm). It is denoted as

` ||x||`_{2} = sqrt(∑ x_{i}^{2})

Note that ||x||_{2}^{2} is the same as x^{T}x - the dot product of x with itself.

The L_{1} norm is defined as

` ||x||`_{1} = ∑_{i} |x_{i}|

Similarly, we can define L_{p} norm as

` ||x||`_{p} = (∑_{i} |x_{i}|^{p})^{1/p}

As p tends to infinity, the L_{ ∞} norm simplifies to

` ||x||`_{ ∞} = max_{i}(|x_{i}|)
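NumPy's `np.linalg.norm` computes all of these (a quick sketch):

```python
import numpy as np

x = np.array([3.0, -4.0])

print(np.linalg.norm(x, 2))       # 5.0 - Euclidean (L2) norm
print(np.linalg.norm(x, 1))       # 7.0 - L1 norm, sum of absolute values
print(np.linalg.norm(x, np.inf))  # 4.0 - L-infinity norm, largest |x_i|
```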

In fact, the Norm is not limited to these L_{p} norms. We can define our own function. Any function f(x) can be used as a Norm if and only if

- f(x) >= 0 for all x [non-negativity]
- f(x) = 0 if and only if x = 0 [definiteness]
- For all x ∈ R^{n} and t ∈ R, f(tx) = |t|f(x) [homogeneity]
- For all x, y ∈ R^{n}, f(x + y) <= f(x) + f(y) [triangle inequality]

Norms are also defined for Matrices. The Frobenius Norm for matrices is defined as

` ||A||`_{F} = (∑ ∑ A_{ij}^{2})^{1/2}

Interesting to note that this is also equal to tr(A^{T}A)^{1/2}

Mathematicians have defined many other kinds of Norms. But these are the important ones we will need for machine learning.

A set of n vectors {x_{1}, x_{2}, . . . x_{n}} in R^{n} is called linearly independent if no vector can be expressed as a linear combination of the other vectors. Conversely, a set of n vectors is called linearly dependent if at least one of them can be expressed as

` x`_{n} = ∑_{i} α_{i}x_{i}

for some scalar values {α_{1}, α_{2}, . . . α_{n-1}}

The column rank of a matrix is the size of the largest subset of its column vectors that is linearly independent. Similarly, the row rank is the size of the largest subset of its row vectors that is linearly independent.

The rank can be roughly considered as the amount of information contained in the matrix. It can be proved that for any matrix A, the column rank is always equal to the row rank. Hence it is generally referred to as the rank of the matrix. Some interesting properties of matrix ranks:

- For a matrix A ∈ R^{m x n}, rank(A) <= min(m, n). If rank(A) = min(m, n), it is called a full rank matrix.
- For any matrix A, rank(A) = rank(A^{T})
- For any matrices A ∈ R^{m x n} and B ∈ R^{n x p}, rank(AB) <= min(rank(A), rank(B))
- For any matrices A, B ∈ R^{m x n}, rank(A + B) <= rank(A) + rank(B)
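A quick numerical check of ranks (a minimal sketch):

```python
import numpy as np

# Rows (1, 1) and (2, 2) are linearly dependent, so the rank is 1
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])
print(np.linalg.matrix_rank(A))  # 1

# Independent rows give a full rank matrix: rank = min(m, n) = 2
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])
print(np.linalg.matrix_rank(B))  # 2
```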

The concept of the matrix inverse is very important for most linear algebra problems. Several mathematical libraries are dedicated to just calculating the inverse in an efficient way.

The inverse of a matrix A ∈ R^{n x n} is another matrix A^{-1} ∈ R^{n x n} such that

` A`^{-1}A = AA^{-1} = I

The matrix inverse is unique. A matrix cannot have multiple inverses. Also, the inverse may not be defined for every matrix. It is certainly not defined for non-square matrices. Even among square matrices, the inverse may not be defined. Such a matrix is called a singular matrix.

A matrix is called non-singular or invertible if and only if A^{-1} exists. Else it is singular or non-invertible. A non-singular or invertible matrix is always a full rank matrix.

For matrices A, B ∈ R^{n x n}, we have

- (A^{-1})^{-1} = A
- (AB)^{-1} = B^{-1}A^{-1}
- (A^{-1})^{T} = (A^{T})^{-1}. Hence it is also referred to as A^{-T}

For the linear equations we saw above,

```
Ax = b
=> x = A^{-1}b
```

That makes the job pretty simple. Of course, for this we require that A is a full rank square matrix. That is, we have as many equations as the number of variables and none of the equations is redundant. These are the necessary and sufficient conditions for a set of linear equations to have a unique solution.
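Continuing the running example, we can solve the two equations through the inverse (a sketch; in practice `np.linalg.solve` is preferred over forming the inverse explicitly):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [2.0, -1.0]])
b = np.array([10.0, 2.0])

A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))  # True: A^{-1}A = I
print(A_inv @ b)                          # [3. 4.], i.e. x = 3, y = 4
```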

For two vectors x, y ∈ R^{n}, x^{T}y can be considered the shadow of x on y (or vice-versa). The dot product is positive if x has a component in the direction of y (or y in the direction of x), and it is negative if x has a component in the direction opposite to y (or y opposite to x). In a sense, it is an indicator of the angle between them. The two vectors x, y ∈ R^{n} are orthogonal if this dot product is 0.

And a vector x ∈ R^{n} is called normalized if its L_{2} norm is 1. Two vectors are called orthonormal if both are normalized and orthogonal.

When we talk about matrices, the definition of orthogonality is slightly different - although intuitively, it means the same. A square matrix U ∈ R^{n x n} is orthogonal if all its columns are normalized and orthogonal to each other. From this definition, it follows that

` UU`^{T} = I = U^{T}U

In other words, the transpose of an orthogonal matrix is also its inverse.

Note that we need a square matrix. If U is a tall non-square matrix with orthonormal columns (more rows than columns), U^{T}U is still I, but UU^{T} is not. And if U has more columns than rows - more vectors than dimensions - the columns can never be orthonormal (e.g., three vectors in 2D space can never be mutually orthogonal).

Thus, we need a square matrix for it to be orthogonal.

Orthogonal matrices have some interesting properties. For example, when we multiply a vector by an orthogonal matrix, its Euclidean (L_{2}) norm remains unchanged. Thus, for an orthogonal matrix U ∈ R^{n x n}

` ||Ux||`_{2} = ||x||_{2}

For any x ∈ R^{n}
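A rotation matrix is a familiar orthogonal matrix; we can verify that it preserves the L_{2} norm (a minimal sketch):

```python
import numpy as np

# A 2D rotation matrix has orthonormal columns
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(U.T @ U, np.eye(2)))  # True: U^T U = I

x = np.array([3.0, 4.0])
# Rotating x changes its direction but not its length
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))  # True
```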

The span of a set of vectors {x_{1}, . . . x_{n}} is the set of all vectors that can be obtained as a linear combination of them. For example, the vectors (0, 1) and (1, 0) span the entire two dimensional space. But the vectors (1, 1) and (2, 2) span only one line in the two dimensional space. Thus the span of the set {(1, 1), (2, 2)} is just a line, while the span of the set {(0, 1), (1, 0)} is a plane.

In formal terms,

` span({x`_{1}, . . . x_{n}}) = {v : v = ∑_{i} α_{i}x_{i}, α_{i} ∈ R}

If n vectors in R^{n} are linearly independent, their span is R^{n}.

The projection of a vector y ∈ R^{m} onto the span of n linearly independent vectors {x_{1}, . . . x_{n}} in R^{m} (here, m > n) is the vector v ∈ span({x_{1}, . . . x_{n}}) such that v is as close to y as possible.

Formally:

` proj(y; {x`_{1}, . . . x_{n}}) = argmin_{ v ∈ span({x1, . . . xn})}(||y - v||_{2})

The Range R(A), also called the columnspace of a matrix, is essentially the span of its columns. For a matrix A ∈ R^{m x n}, the range is defined as

` R(A) = {v ∈ R`^{m} : v = Ax , x ∈ R^{n}}

This gives us some interesting properties.

If A is full rank and m > n, the projection of y onto the range of A works out to be

` proj(y, A) = A(A`^{T}A)^{-1}A^{T}y
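Here is the projection formula applied to a simple case where the answer is easy to see by eye (an illustrative sketch):

```python
import numpy as np

# Columns of A span the xy-plane inside R^3
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
y = np.array([2.0, 3.0, 5.0])

# proj(y; A) = A (A^T A)^{-1} A^T y
proj = A @ np.linalg.inv(A.T @ A) @ A.T @ y
print(proj)  # [2. 3. 0.] - the z component is dropped
```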

Finally, the null space of a matrix A ∈ R^{m x n}, denoted by N(A), is the set of all the vectors that equal 0 when multiplied by A. That is:

` N(A) = {x ∈ R`^{n}: Ax = 0}

Interestingly, R(A^{T}) and N(A) intersect only at the origin, and together they span the entire n dimensional space. Thus, any vector in R^{n} can be written uniquely as the sum of a vector in R(A^{T}) and a vector in N(A); and every vector in R(A^{T}) is orthogonal to every vector in N(A).

Hence they are called orthogonal complements of each other.

The determinant of a matrix, |A| or det(A), can be considered as the "volume" of the space covered by its rows - the volume of the parallelepiped defined by points that are linear combinations of the row vectors (a_{i}), such that the linear coefficients are between 0 and 1. That is, the volume covered by the set S such that

` S = {v ∈ R`^{n} : v = ∑ α_{i}a_{i} where 0 <= α_{i} <= 1}

The absolute value of the determinant is the volume covered by this set S.

Algebraically, for A ∈ R^{n x n}, let A_{\i\j} ∈ R^{(n-1) x (n-1)} be the matrix generated by discarding the i^{th} row and j^{th} column. Then, the general recursive formula for the determinant is:

` |A| = ∑`_{i} (-1)^{(i+j)} a_{ij}|A_{\i\j}| for any j in 1 .. n
= ∑_{j} (-1)^{(i+j)} a_{ij}|A_{\i\j}| for any i in 1 .. n

The determinant is ugly to calculate. But people have developed efficient libraries that help us with this. It is important to understand the concept, and leave the algebra to the machines.

Intuitively, we can think of the determinant of a 3x3 matrix as the difference between the products along the two sets of extended diagonals. For a 3x3 matrix, the determinant can be calculated as

` |A| = a`_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31}

Note that this diagonal rule works only for 3x3 (and smaller) matrices. For higher order matrices, we use the recursive cofactor formula above.
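We can sanity-check the diagonal rule against NumPy's determinant for a 3x3 matrix (a quick sketch):

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])

# Sum of "forward" diagonal products minus "backward" diagonal products
sarrus = (a[0,0]*a[1,1]*a[2,2] + a[0,1]*a[1,2]*a[2,0] + a[0,2]*a[1,0]*a[2,1]
          - a[0,0]*a[1,2]*a[2,1] - a[0,1]*a[1,0]*a[2,2] - a[0,2]*a[1,1]*a[2,0])

print(np.isclose(sarrus, np.linalg.det(a)))  # True (both are -3)
```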

Given a square matrix A ∈ R^{n x n} and a vector x ∈ R^{n}, the scalar value obtained by x^{T}Ax is defined as the quadratic form. Note that:

` x`^{T}Ax = ∑ x_{i}(Ax)_{i} = ∑ x_{i} ∑ A_{ij}x_{j} = ∑_{i} ∑_{j} A_{ij}x_{i}x_{j}

Also, since the value is a scalar,

` x`^{T}Ax = (x^{T}Ax)^{T} = x^{T}A^{T}x = x^{T}((A + A^{T})/2)x

From this, we can conclude that only the symmetric part of A contributes to the quadratic form. Hence, we implicitly assume that matrices appearing in the quadratic form are symmetric.

Based on this, a symmetric matrix A ∈ S^{n} is

- Positive Definite (PD) - if x^{T}Ax > 0 for all non-zero x ∈ R^{n} - denoted by A > 0
- Positive Semi-Definite (PSD) - if x^{T}Ax >= 0 for all x ∈ R^{n} - denoted by A >= 0
- Negative Definite (ND) - if x^{T}Ax < 0 for all non-zero x ∈ R^{n} - denoted by A < 0
- Negative Semi-Definite (NSD) - if x^{T}Ax <= 0 for all x ∈ R^{n} - denoted by A <= 0
- Indefinite if none of the above - there exist x_{1} and x_{2} such that x_{1}^{T}Ax_{1} > 0 and x_{2}^{T}Ax_{2} < 0

Obviously, if A is positive definite, -A is negative definite; and so on. An important property of positive definite and negative definite matrices is that they are always full rank and invertible.

Gram Matrix presents an interesting case. It is defined as a matrix that can be expressed as G = A^{T}A - for any A ∈ R^{m x n}. Note that A could be any matrix, need not be square. Any Gram Matrix G is always Positive Semi-Definite. Further, if m >= n and A is full rank, G = A^{T}A is Positive Definite.
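We can check this numerically: a symmetric matrix is PSD exactly when all its eigenvalues are non-negative, and a Gram matrix never has a negative eigenvalue (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))  # any matrix, not necessarily square

G = A.T @ A  # the Gram matrix
print(np.allclose(G, G.T))  # True: G is symmetric

# eigvalsh computes eigenvalues of a symmetric matrix; all are >= 0 here
eigenvalues = np.linalg.eigvalsh(G)
print(np.all(eigenvalues >= -1e-10))  # True (tolerance for rounding)
```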

Given a square matrix A ∈ R^{n x n}, we say that λ ∈ C is an eigenvalue and x ∈ C^{n} is the corresponding eigenvector if Ax = λx for x ≠ 0. Note that λ and x need not be real. They could be complex.

Intuitively, if we multiply an eigenvector by the matrix, the result points in the same direction as x, but scaled by a factor λ.

Note that for any non-zero scalar c, if x is an eigenvector, cx is also an eigenvector. Thus, when we refer to the eigenvector for the eigenvalue λ, we imply the one that is normalized to 1. This still leaves an ambiguity between x and -x, but that is not a major problem.

We can rewrite the above equation as

` (λI - A)x = 0`

Interestingly, this means that (λI - A) has a non-trivial null space - implying that it is a singular matrix - implying that its determinant is 0. Thus, we have

` |(λI - A)| = 0.`

We can use the definition of the determinant to expand this into an n^{th} order polynomial equation, solve it to get n (possibly complex) values of λ, and then solve further to find the individual eigenvectors.

Of course, solving an n^{th} order polynomial is no joke for large values of n. There are better ways of solving this problem. But this is the simplest way to illustrate the concept. Some interesting properties for A ∈ R^{n x n}:

- The trace of A is equal to the sum of its eigenvalues.
- The determinant of A is equal to the product of its eigenvalues.
- The rank of A is equal to the number of non-zero eigenvalues.
- If A is invertible, 1/λ_{i} is an eigenvalue of A^{-1}.
- The eigenvalues of a diagonal matrix are the individual diagonal elements.

We can write the entire eigenvector equation as

` AX = XΛ`

Where the columns of X are the eigenvectors and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues. If the eigenvectors of A are linearly independent, then X is invertible. In that case, we can rewrite the above equation as

` A = XΛX`^{-1}

A matrix that can be written in this form is called diagonalizable.
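The decomposition and the eigenvalue properties listed above can be verified numerically (a minimal NumPy sketch):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, X = np.linalg.eig(A)  # columns of X are the eigenvectors
Lam = np.diag(eigenvalues)

# A = X Λ X^{-1} since the eigenvectors here are linearly independent
print(np.allclose(X @ Lam @ np.linalg.inv(X), A))        # True
# trace = sum of eigenvalues, determinant = product of eigenvalues
print(np.isclose(np.trace(A), eigenvalues.sum()))        # True
print(np.isclose(np.linalg.det(A), eigenvalues.prod()))  # True
```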