Sunday, September 23, 2018

Backpropagation



UAT

The Universal Approximation Theorem (UAT) is about the simplest version of the structure of weights & non-linear functions used in deep learning. It states, roughly, that this equation can be solved to an approximation:
logit(numbers[] * input_weights[][]) * hidden_weights[][] ~= other_numbers[]
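In numpy-flavored terms, with made-up sizes, the equation looks roughly like this (a sketch; a sigmoid stands in for the "logit"):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))   # standing in for "logit" above

    numbers = rng.random(8)                    # the known input values
    input_weights = rng.normal(size=(8, 32))   # weights the UAT says must exist...
    hidden_weights = rng.normal(size=(32, 4))  # ...for a wide enough hidden layer

    approximation = sigmoid(numbers @ input_weights) @ hidden_weights
    # The UAT says some choice of the two weight matrices makes `approximation`
    # come as close as we like to any target `other_numbers` vector of length 4.
    print(approximation.shape)   # (4,)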

Backpropagation

The UAT does not tell us how to do this, only that it is possible. Later, the backpropagation (BP) algorithm was created to realize the possibility of the UAT. BP, as created, is intended to solve one equation of N inputs and M unknowns. Backpropagation uses "hill-climbing", which assumes that the solution of an equation sits at the top of a hill: wherever the algorithm starts, if it keeps climbing uphill, it will eventually reach the top.
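As a sketch of the hill-climbing idea (a toy, one-unknown example, not the real BP algorithm):

    # A toy "hill": f(w) is highest where w solves the equation 3 * w = 12.
    def f(w):
        return -(3 * w - 12) ** 2

    def hill_climb(f, w, step=0.01, iters=1000):
        for _ in range(iters):
            slope = (f(w + 1e-6) - f(w - 1e-6)) / 2e-6   # which way is uphill?
            w = w + step * slope                         # take a small step uphill
        return w

    print(hill_climb(f, w=0.0))   # converges to ~4.0, the top of the hill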

In a typical Deep Learning application, we take thousands of input samples which are similar and turn each of them into an equation of the above form. A simple image classifier is a typical example; the right-hand side corresponds to whether the image is a picture of a cat, a building or a car.
logit(image_pixels[] * input_weight_matrix[][]) * hidden_weight_matrix[][] ~= [1,0,0]
Wait a minute! In the math of equation-solving, each of these equations defines a separate subspace, to be solved independently.

How did we get from approximating one equation (UAT) to approximating thousands of equations? The key idea of Deep Learning is to pretend that all of these equations live in one subspace, and that there is a mythical equation that is the centroid of all of them.
Deep Learning takes many (thousands of) sample equations which are, mathematically, different subspaces, and essentially averages the result of applying BP to each equation. To put it another way, each equation tugs the BP algorithm toward itself.
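A rough sketch of that tugging (hypothetical shapes and learning rate; a plain sigmoid stands in for the "logit", and the gradient formulas are for a squared-error loss, which the post does not specify):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Hypothetical shapes: 64-pixel images, 16 hidden units, 3 classes.
    W_in = rng.normal(size=(64, 16)) * 0.1
    W_hidden = rng.normal(size=(16, 3)) * 0.1

    def nudge(pixels, target, lr=0.1):
        """One sample equation tugs the shared weights a little toward itself."""
        global W_in, W_hidden
        h = sigmoid(pixels @ W_in)                    # hidden activations
        err = h @ W_hidden - target                   # how far off this one equation is
        grad_hidden = np.outer(h, err)                # tug on the hidden weights
        grad_in = np.outer(pixels, (err @ W_hidden.T) * h * (1 - h))
        W_hidden -= lr * grad_hidden                  # each sample pulls the shared
        W_in -= lr * grad_in                          # weights toward itself

    # Thousands of sample "equations", each tugging the same two matrices:
    targets = np.eye(3)[rng.integers(0, 3, size=1000)]   # one-hot: cat/building/car
    for pixels, target in zip(rng.random((1000, 64)), targets):
        nudge(pixels, target)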

This seems like a misuse of BP, a misuse that has proven very, very fruitful.

References

See the Wikipedia pages for more info on the UAT, Backpropagation, and Gradient Descent:

https://en.wikipedia.org/wiki/Universal_approximation_theorem
https://en.wikipedia.org/wiki/Backpropagation
https://en.wikipedia.org/wiki/Gradient_descent








Neural Networks Series #1: What's an SVD?

Singular Value Decomposition (SVD) is a linear algebra algorithm that decomposes any matrix of numbers into 3 matrices that, when multiplied together, recreate the original matrix. There are several of these matrix decomposition techniques, useful for different purposes.
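In numpy, that round trip looks roughly like this (a sketch with a made-up matrix):

    import numpy as np

    A = np.array([[1., 2., 3., 4.],
                  [2., 3., 4., 5.],
                  [3., 4., 5., 6.]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # the three factor matrices
    A_rebuilt = U @ np.diag(s) @ Vt                    # multiply them back together
    print(np.allclose(A, A_rebuilt))                   # True: the original is recreated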
Before we delve into the uses of SVD, we will take a short detour. The rank of a matrix is the number of linearly independent rows (or columns) in the matrix. For a pair of rows, this means roughly that:
        if row[i] * y = row[j] for some number y
            then rows i and j are linearly dependent (one of them is redundant)
More generally, a row is redundant if it is a weighted sum of the other rows. Computing the rank is done by reducing the matrix to row echelon form, which replaces each of these "duplicate" rows with a row of zeros. Here is a tutorial on row echelon form:
https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA
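As a quick illustration of that reduction, here is a sketch using sympy (a library choice of mine, not something from the tutorial):

    from sympy import Matrix

    M = Matrix([[1, 2],
                [2, 4],
                [3, 7]])
    rref, pivots = M.rref()   # reduced row echelon form
    print(rref)               # the redundant row has become a row of zeros
    print(len(pivots))        # 2 == the rank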
This matrix has a rank of 1 because any 3 of the rows can be generated from the fourth row by scaling it, and the same is true of the columns:
[1,1,1,1]
[2,2,2,2]
[3,3,3,3]
[4,4,4,4]
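numpy will confirm this (a minimal check):

    import numpy as np

    M = np.array([[1, 1, 1, 1],
                  [2, 2, 2, 2],
                  [3, 3, 3, 3],
                  [4, 4, 4, 4]])
    print(np.linalg.matrix_rank(M))   # 1: every row is a multiple of [1, 1, 1, 1]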
This is called linear dependence. Rank gives a primitive measurement of the amount of correlation between rows or columns. SVD is more nuanced: it turns this into a continuous measurement, whereas rank is only a discrete one. Closeness is scored by Euclidean distance, and SVD creates a relative ranking of how close each row is to all other rows, and how close each column is to every other column. This ranking gives a couple of useful measurements about the matrix data:

  1. It gives a more nuanced measurement of independence than the rank alone: a vector of numbers describing the relative amount of independence across multiple rows and columns.
  2. By ranking the distances among rows & columns, it gives a scoring of centrality, or "normal vs. outlier".
This matrix has a rank of 2: no row is a multiple of another, but every row can be built from just two of them, and the rows are all very close to each other:
[1,2,3,4]
[2,3,4,5]
[3,4,5,6]
[4,5,6,7]
SVD generates a list of numbers called the "singular values" of a matrix. Here are the singular values for the above 2 matrices:
[1.00000000e+00 8.59975057e-17 2.52825014e-47 3.94185361e-79]
[9.98565020e-01 5.35527830e-02 1.96745978e-17 9.98144122e-18]
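These can be approximately reproduced with numpy (a sketch; the values above look like they were scaled to a unit-length vector, which is my assumption):

    import numpy as np

    M1 = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]], dtype=float)
    M2 = np.array([[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]], dtype=float)

    for M in (M1, M2):
        s = np.linalg.svd(M, compute_uv=False)   # just the singular values
        print(s / np.linalg.norm(s))             # scaled to a unit-length vector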
Let's chart these singular value vectors (linear scale):
These charts tell us that the first matrix has a rank of 1. The second matrix has a rank above one, but does not have two full rows' worth of uniqueness. These charts are a measurement of the correlation between rows and the correlation between columns. The singular values of a matrix are a good first look at the amount of "information" or "signal" or "entropy". "Quantity of signal" is a fairly slippery concept, with mathematical definitions that border on the metaphysical, but the singular values are a useful (though simplistic) measurement of the amount of information contained in a matrix.
Based on the term "entropy" in the previous paragraph, let's try filling a larger matrix with random numbers. This is the measurement of a 20x50 random matrix, with a chart that is log-scaled on the Y axis:
Rank = 20
Singular values: 
[0.35032972 0.322733   0.30419963 0.29269556 0.28137789 0.26201005
 0.25742231 0.23075557 0.21945011 0.21763414 0.20245292 0.1976021
 0.175207   0.16692658 0.14884194 0.14208821 0.13474842 0.12291441
 0.10004981 0.08849001]
The graph should be a straight line but is not, due mainly to the fact that truly random numbers are very hard to come by. This matrix has a rank of 20: you cannot multiply & add any rows and get another row. In fact, all rows are essentially equidistant under Euclidean distance. Let's do something sneaky: we're going to multiply the matrix by its own transpose to get a 50x50 matrix of random numbers. Here are the measurements (again, a log-Y chart):
Rank = 20
[1.14271040e+02 9.69770480e+01 8.61587896e+01 7.97653909e+01
 7.37160653e+01 6.39172571e+01 6.16984957e+01 4.95777248e+01
 4.48387835e+01 4.40997633e+01 3.81619300e+01 3.63550994e+01
 2.85815085e+01 2.59437792e+01 2.06268491e+01 1.87974231e+01
 1.69055603e+01 1.40665565e+01 9.31997487e+00 7.29072452e+00
 1.50274491e-14 1.29473245e-14 1.24897761e-14 1.11334989e-14
 1.03937520e-14 9.90176181e-15 9.17746567e-15 8.94074462e-15
 8.58805508e-15 7.94287894e-15 6.90962668e-15 6.72135898e-15
 6.02639348e-15 5.66356533e-15 5.16429606e-15 4.84762468e-15
 4.47711092e-15 4.37266408e-15 4.27780206e-15 3.43884568e-15
 3.21440817e-15 2.83815151e-15 2.61710432e-15 2.34685343e-15
 2.24348424e-15 1.32915906e-15 9.12392110e-16 5.65320744e-16
 2.64346262e-16 1.11941235e-16]
Huh? Why is the rank still 20? And what's going on with that chart? When we multiplied the matrix by its transpose, we did not add any information/signal/entropy to the matrix! Therefore it still has the same amount of independence. It's as if we weighed a balloon, filled it with weightless air, then weighed it again.
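A sketch of the whole experiment, assuming the 50x50 product is the transpose of the matrix times the matrix itself:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((20, 50))          # the 20x50 "random" matrix

    B = A.T @ A                       # 50x50, but no new information was added
    print(np.linalg.matrix_rank(A))   # 20
    print(np.linalg.matrix_rank(B))   # still 20
    print(np.linalg.svd(B, compute_uv=False))   # 20 real values, then ~1e-15 noise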
In this post, we discussed SVD and the measurement of information in a matrix; this trick of multiplying a random matrix by its transpose to blow it up is what helped me understand SVD ten years ago. (Thanks to Ted Dunning on the Mahout mailing list.)
I've fast-forwarded like mad through days of linear algebra lectures in order to give an intuitive understanding of matrix analysis. We'll be using SVD in a few ways in this series on my explorations in Neural Networks. See this page for a good explanation of SVD and a cool example of how it can be used to clean up photographs.
Source code for the above charts & values available at: