
Deep Learning (Goodfellow et al.) - Chapter 2 review


     This is the first chapter of Part 1, the part of the book that lays the groundwork for deep learning, and it covers linear algebra. The author advises you to skip the chapter if you are already familiar with the concepts. Equally, it is not as though you can become Gilbert Strang after reading it! The chapter is concise: only the material relevant to our future journey through the book is presented here. Only at the end of the chapter does the author touch on applications; the focus is on the mathematical concepts of linear algebra.
 
    Though science and engineering make heavy use of linear algebra, computer scientists often have less experience with it. This is the motivation for the chapter; if you already have that experience, the author points you to a reference (The Matrix Cookbook - Petersen and Pedersen, 2006) and waves you goodbye until the next chapter.

     Definitions of scalars, vectors, matrices and tensors are listed one by one, followed by their notation and examples. Basic matrix operations such as the transpose, matrix addition and matrix multiplication are discussed, along with their properties: matrix multiplication is associative and distributive over addition, but not commutative. A matrix can be used to represent a system of linear equations, and inverting the matrix solves the system; the inverse is introduced after the identity matrix. A system of equations can have no solution, exactly one solution, or infinitely many solutions, and nothing in between, because if two distinct solutions exist then every point on the line between them is also a solution. The set of vectors obtainable by linear combinations of a given set of vectors is their span. The dimension of the column space of the matrix in the equation is discussed for the various cases of solutions.

    If the columns of the matrix are linearly independent, then we have at most one solution. This is also a condition for the invertibility of the matrix. For a matrix to be invertible it has to be square, so that the column space spans the whole space in which b lives while no column is redundant. A square matrix with linearly dependent columns is known as singular. For a non-singular square matrix the solution can be arrived at by applying the inverse: x = A^-1 b.
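
To make this concrete, here is a small NumPy sketch (my own illustration, not from the book) of solving Ax = b for a non-singular matrix, and of a singular matrix whose columns are linearly dependent:

```python
import numpy as np

# A square matrix with linearly independent columns (non-singular).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve Ax = b directly; this is preferred over forming the inverse explicitly.
x = np.linalg.solve(A, b)
print(x)                                      # the unique solution

# The same solution via the explicit inverse, x = A^-1 b.
print(np.allclose(np.linalg.inv(A) @ b, x))   # True

# A singular matrix: the second column is a multiple of the first,
# so the columns are linearly dependent and no inverse exists.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.matrix_rank(S))               # 1, i.e. rank-deficient
```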

     A norm is a function that measures the size of a vector. There are various norms, of which the Lp family, such as L1 and L2 (the Euclidean norm), is the most widely used. To qualify as a norm, a function f has to satisfy three conditions: f(x) = 0 implies x = 0, the triangle inequality f(x + y) <= f(x) + f(y) holds, and f is homogeneous, f(ax) = |a| f(x). In machine learning the squared L2 norm is used more often than the L2 norm itself. The L1 norm and the L-infinity (max) norm are other common choices. The Frobenius norm is a matrix norm analogous to the L2 norm for vectors.
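
As a quick illustration of these norms (my own example, not the book's), NumPy computes all of them directly:

```python
import numpy as np

v = np.array([3.0, -4.0])

l1   = np.linalg.norm(v, 1)       # L1 norm: |3| + |-4| = 7
l2   = np.linalg.norm(v)          # L2 (Euclidean) norm: sqrt(9 + 16) = 5
sq2  = v @ v                      # squared L2 norm: 25, often cheaper to work with
linf = np.linalg.norm(v, np.inf)  # L-infinity (max) norm: 4

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(A, 'fro')    # Frobenius norm: sqrt(1 + 4 + 9 + 16)
print(l1, l2, sq2, linf, fro)
```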

  Diagonal matrices, symmetric matrices, unit vectors, orthogonal vectors, orthonormal vectors and orthogonal matrices are defined next. The definition of an orthogonal matrix here is the simplest I have seen: a square matrix whose rows (and columns) are mutually orthonormal. Many textbooks instead express this as the inverse being equal to the transpose, or come up with a set of conditions for an orthogonal matrix. The inverse of an orthogonal matrix is easy to compute, since A^-1 = A^T. Matrices are often decomposed into orthogonal matrices for easy computation, as is the case with eigenvalue decomposition and singular value decomposition.
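
A rotation matrix is a handy concrete example of an orthogonal matrix; the sketch below (mine, not the book's) checks that its inverse is just its transpose:

```python
import numpy as np

# A 2-D rotation matrix: its rows (and columns) are mutually orthonormal.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# For an orthogonal matrix the inverse is simply the transpose.
print(np.allclose(Q.T @ Q, np.eye(2)))      # True
print(np.allclose(np.linalg.inv(Q), Q.T))   # True
```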

     In eigendecomposition, a matrix is decomposed into a set of eigenvectors and eigenvalues. A matrix A has an eigenvector v corresponding to an eigenvalue 𝝀 if

Av = 𝝀v

 If we concatenate the eigenvectors into a matrix V (one per column) and collect the eigenvalues into a vector 𝝀, then the eigendecomposition can be written as

A = V diag(𝝀) V^-1

   The matrix has to be square and its eigenvectors must be linearly independent. Every real symmetric matrix has a decomposition with real eigenvalues and orthogonal eigenvectors:

A = Q Λ Q^T

Q is an orthogonal matrix composed of the eigenvectors of A.
Λ is a diagonal matrix containing the eigenvalues, ordered to match the eigenvectors in Q.

    By convention the eigenvectors and eigenvalues are listed in descending order of eigenvalue. The eigendecomposition is unique only if all the eigenvalues are distinct. If two eigenvectors share the same eigenvalue, then any vector lying in their span is also an eigenvector with that eigenvalue. Such a span is actually called an eigenspace, but the author does not introduce the term, as it may not be relevant later in the book.
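
A small NumPy sketch of the symmetric eigendecomposition, reconstructing A from Q and the eigenvalues (my own illustration; note that numpy.linalg.eigh returns the eigenvalues in ascending rather than the book's descending order):

```python
import numpy as np

# A real symmetric matrix; eigh returns real eigenvalues and an
# orthogonal matrix Q of eigenvectors.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
eigvals, Q = np.linalg.eigh(A)

# Reconstruct A = Q diag(lambda) Q^T.
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))   # True
print(np.allclose(Q.T @ Q, np.eye(2)))              # eigenvectors are orthonormal
```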

x^T A x gives the corresponding eigenvalue of A if x is a unit-norm (||x|| = 1) eigenvector of A.
For a positive definite matrix this quadratic form is greater than zero for every non-zero x, which is equivalent to all eigenvalues being greater than 0.
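
The following sketch (mine, not the book's) checks both claims numerically for a small symmetric positive definite matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
eigvals, Q = np.linalg.eigh(A)

# For a unit-norm eigenvector x, the quadratic form x^T A x equals its eigenvalue.
x = Q[:, 0]                                 # an eigenvector, already unit norm
print(np.isclose(x @ A @ x, eigvals[0]))    # True

# All eigenvalues positive  <=>  positive definite (x^T A x > 0 for x != 0).
print(np.all(eigvals > 0))                  # True
```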

    Singular value decomposition (SVD) decomposes a matrix into its singular values and singular vectors. Unlike the eigendecomposition, every real matrix, even a non-square one, has a singular value decomposition. The SVD expresses A as a product of three matrices:

A = U D V^T


    where D is a diagonal matrix containing the singular values of A (the square roots of the eigenvalues of A^T A, which are the same as the non-zero eigenvalues of AA^T),
U is the collection of left singular vectors (eigenvectors of AA^T),
V is the collection of right singular vectors (eigenvectors of A^T A).
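
A quick numerical check of the SVD and its relation to the eigenvalues of A^T A, using a non-square example of my own:

```python
import numpy as np

# SVD exists even for a non-square matrix.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])            # shape (2, 3)

U, s, Vt = np.linalg.svd(A)                # s holds the singular values
D = np.zeros_like(A)                       # D has the same (2, 3) shape as A
D[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))          # True: A = U D V^T

# Singular values are the square roots of the eigenvalues of A A^T (and A^T A).
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A @ A.T))))  # True
```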


   The chapter also discusses the Moore-Penrose pseudo-inverse, which generalizes matrix inversion to matrices that are not invertible or not square. The pseudo-inverse of A is obtained as

A^+ = V D^+ U^T

    where D^+ is obtained by taking the reciprocal of the non-zero elements of D and then transposing the resulting matrix.

    For a wider-than-tall A it gives one of the many possible solutions (the one with minimal norm), and for a taller-than-wide A it gives the approximate solution that minimizes the Euclidean error ||Ax - y||.
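
A sketch of the taller-than-wide case, building A^+ from the reduced SVD and comparing it with NumPy's built-in pseudo-inverse; the example matrix is my own and assumes full column rank:

```python
import numpy as np

# A taller-than-wide A: more equations than unknowns, so Ax = b generally
# has no exact solution and the pseudo-inverse gives the least-squares fit.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Build A^+ = V D^+ U^T from the reduced SVD (valid here since A has full column rank),
# then compare against NumPy's built-in pseudo-inverse.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True

x = A_pinv @ b        # minimises the Euclidean error ||Ax - b||
print(x)
```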

   The trace is the sum of the diagonal elements of a matrix. It also gives an alternative way to write the Frobenius norm: ||A||_F = sqrt(Tr(AA^T)).

   Some properties of the trace are discussed. The determinant of a square matrix is introduced near the end of the chapter. The trace is equal to the sum of the eigenvalues, and the determinant is equal to the product of the eigenvalues.
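
These identities are easy to verify numerically; the example matrix below is my own:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigvals = np.linalg.eigvals(A)                  # eigenvalues: 5 and 2

# Frobenius norm written with the trace: ||A||_F = sqrt(Tr(A A^T)).
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T))))  # True

# Trace = sum of eigenvalues, determinant = product of eigenvalues.
print(np.isclose(np.trace(A), eigvals.sum()))        # True: 7
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # True: 10
```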

    Finally, the chapter ends with principal component analysis (PCA), the first machine learning algorithm spotted in the book. It is a lossy compression scheme applied to a rectangular matrix of data points. By keeping fewer columns in the encoded representation, we can store the data using less memory. The encoded matrix c is decoded by means of another matrix D, whose columns are constrained to be orthonormal. An optimization problem is set up to minimize the distance between each input point and its decoded reconstruction. The derivation yields the first (largest) principal component, and the other principal components, the remaining eigenvectors, can be obtained in a similar way.
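
The sketch below (my own, with notation loosely following the chapter's) performs PCA by eigendecomposition of X^T X on synthetic data, keeping the top two components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 data points with 5 features
X = X - X.mean(axis=0)               # centre the data

# The principal components are the eigenvectors of X^T X (equivalently the
# right singular vectors of X), taken in decreasing order of eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]    # eigh returns ascending order; reverse it
D = eigvecs[:, order[:2]]            # keep the top-2 components; columns are orthonormal

C = X @ D                            # encode each point: c = D^T x
X_hat = C @ D.T                      # decode: x_hat = D c
print(np.mean((X - X_hat) ** 2))     # reconstruction error of the lossy compression
```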

    With this chapter, one of the keys to understanding deep learning has been introduced.
