Thursday, January 05, 2017

Learning About Learning

Yesterday Scott Aaronson released his sweeping new P v NP survey. Babai gave an update on graph isomorphism, in short while he still has the first subexponential time algorithm for GI, he no longer claims quasipolynomial-time. We'll have more on graph isomorphism next week.

Having seen the power of neural networks for learning, how do they actually work? Over the break I decided to catch myself up. I worked through Michael Nielsen's online book Neural Networks and Deep Learning chosen mainly because he co-authored one of the great books on quantum computing followed by the beginning of the TensorFlow tutorial. To try out code I downloaded Python and TensorFlow and used PyCharm as my IDE (free for academics). I should really learn GitHub since that holds all the code samples. Where were all these cool tools when I was a kid?

So what did I learn? Here is my high level understanding, but read the above materials to get a fuller picture. A neural network is just a weighted threshold circuit, though instead of precise threshold functions you use more continuous and differentiable functions like sigmoid or a rectifier. If you fix the weights and biases (the negation of the threshold value), the network computes a function from input to output, for example from images of digits to the digits themselves. Typically you have a sample of labelled data and you can create an error function to see how well your network computes the solution. Fix the labelled data and now you have a new function from the weights and biases to the error.

You want to choose weights and biases to minimize the error. The general approach is gradient descent, improve a given solution by moving a small distance in the opposite direction in the gradient and repeat. I had naively thought you estimated the gradient numerically but in fact the gradient is computed symbolically using a dynamic programming-like algorithm called backpropagation based on the chain rule for partial derivatives.

Convolution nets has a special first layer that captures features of pieces of the image. Recurrent neural networks allow feedback loops.

Nielsen shows how from scratch to build and train a neural net. In TensorFlow you just design the network and then it automatically computes the gradient and has a highly optimized algorithm that can be run on GPUs or specialized hardware to minimize the error.

What caught my eye is how much of an art machine learning still is. How many nodes should you have in your network? How many levels? Too many may take too long to train and could cause overfitting. Too few and you don't have enough parameters to create the function you need.

What threshold function do you use and how to you aggregate results? Lots of various tricks to avoid overfitting and improve speed of optimization. There's no fixed procedure to choose the parameters, though you can test how well you do. So lots of trial and error to learning and experience (which I don't have) certainly helps.

So we have amazingly powerful learning tools but for now we still need humans to learn. For now.


  1. Small correction: recurrent nets represent time dependencies (or more generally, dependencies along any DAG), not feedback loops. Each computation of a recurrent net can be unfolded into a feed-forward net of depth O(input size).

  2. "Convolution nets has a special first layer that captures features of pieces of the image."

    This is not correct. Convolutional neural networks have many convolutional layers, anywhere, not necessarily at the first layer. These layers exploit locality and translation invariance, two important properties of image-like data. Here is a stylized example showing how convolutional neurons can recognize higher-and-higher level abstractions, from edges through noses to faces:


  4. The idea behind convolution net is as follows: think of images and a box. Let's define a feature over the pixels in the box like the existence of a vertical line and have a neural network for it. Now the location of this box doesn't matter for the feature, so if you are looking for vertical lines in an image you can just use the same network for all of them, you can share the weights between networks for the feature. This saves a lot of weights and makes training and inference practical.

    Many important papers in machine learning are about intelligent ways for saving computation time. You really don't want the number of computation steps to grow superlinearly with respect to network depth, input size, ... that would make training and inference infeasible in practice. CNNs are the reason deep learning worked in practice and beat all previous algorithms in image recognition by a large margin. Machine learning requires a good deal of engineering to have practical algorithms that you can actually run and test, even constant factors matter.

  5. ------
    Lance asks  "How many nodes should you have in your network? How many levels? Too many may take too long to train and could cause overfitting. Too few and you don't have enough parameters to create the function you need."
    Algorithmic answers to these questions center upon the notion of "rank-jumping" (as at least some portions of the literature call it).

    Specifically in regard to the rank-jumping literature, a notably student-friendly multi-reference multi-example survey is Vin de Silva and Lek-Heng Lim's "Tensor rank and the ill-posedness of the best low-rank approximation problem" (SIAM Journal on Matrix Analysis and Applications, 2008).

    The de Silva/Lim survey has been concretely helpful (to me) in upgrading quantum simulation codes that, dynamically and adaptively, raise-and-lower the ranks of tensor representations. Algorithms that once were ad hoc, evolve to be more nearly universal and natural (and stable too).

    Sweet! Hoorah for "Team Yellow Book"! :)

    Further suggestions in regard to this "Yellow Book" literature — whether in the language of "rank jumps" or "topological closure" or any other GAGA-esque terminology — would be welcome to me and many. It's been plenty challenging (for me at least) to reduce this literature's beautiful insights to concrete algorithmic practice.

  6. As a follow-up, the above concrete computational considerations in regard to rank-jumping in tensor network representations are surveyed abstractly in "Yellow Book" comment #91 on Scott Aaronson's Shtetl Optimized essay "My 116-page survey article on P vs. NP" (of Jan 03 2017).

    As a followup, I will attempt to compose these two perspectives — abstract-with-concrete and Yellow Book-with-pragmatic — in a MathOverflow and/or TCS StackExchange question regarding "Yellow Book" descriptions of rank-jumping in practical computational simulations. Such attempted compositions — from me or anyone — can rightly be appreciated as tributes to a small-yet-vital community, namely the proprietors of mathematical weblogs.

    Math weblogs require of their proprietors a sustained personal commitment that (as it seems to me and many) crucially nourishes the vitality of the 21st century's diverse STEAM enterprises. In particular, math weblogs crucially nourish the hopeful enthusiasm of Yellow Book Era STEAM-students — hundreds of millions of 21st century STEAM-students, YIKES! :) — who will inherit and, as we can hope and even reasonably foresee, apply fresh Yellow Book understandings in marvelously extending our 21st century's great STEAM-enterprises.

    This New Year's appreciation of math weblogs, and heartfelt gratefulness for the sustained efforts of their oft-underappreciated proprietors, is therefore extended.

  7. "Where were all these cool tools when I was a kid?" - When you were a kid, the disgruntled gray hairs were envious of your tools like a keyboard and screen.

    When 'kids these days' are old they'll be marveling at the youngsters that seem to just twitch their eyelids to pilot their interstellar spacecraft.

    I'm not saying you have gray hair.

  8. Re "how much of an art machine learning still is" - a Deep Neural Network has a lot of engineering choices to make, not just gradient descent methods and threshold functions but the large scale architecture of the connections among layers, where there's a CNN, an RNN, an LSTM, etc. In comparison, SVMs, Random Forests, and regression have just a handful of hyper-parameters to tweak.