My notes from the “Self-Supervised Learning” talk by Yann LeCun at PAISS 2019.
Supervised learning works but requires many labelled samples.
Reinforcement learning: model-free RL works great for games.
They all use ConvNets and a few other architectural concepts.
Real-time learning isn’t practical.
Horrible sample efficiency: can’t really be used outside of simulation yet.
ConvNets
ConvNets work where the signal has strong local correlations in its features and some sort of translation invariance in its statistics: images, audio signals, any “array” where locality has a meaning.
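A minimal NumPy sketch of the invariance property described above (usually called translation invariance, or more precisely equivariance). It assumes a 1-D signal with wrap-around boundaries, and the helper name `circular_conv` is made up for illustration: applying a local filter to a shifted signal gives the shifted version of the original output.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Slide `kernel` over `signal`, wrapping around at the edges."""
    out = np.zeros_like(signal, dtype=float)
    for offset, weight in enumerate(kernel):
        # np.roll(signal, -offset)[i] == signal[(i + offset) % n]
        out += weight * np.roll(signal, -offset)
    return out

rng = np.random.default_rng(0)
signal = rng.normal(size=16)
kernel = np.array([0.25, 0.5, 0.25])  # small local filter (arbitrary values)

shift = 3
filtered_then_shifted = np.roll(circular_conv(signal, kernel), shift)
shifted_then_filtered = circular_conv(np.roll(signal, shift), kernel)

# The same filter responds identically wherever the pattern appears.
equivariant = np.allclose(filtered_then_shifted, shifted_then_filtered)
print(equivariant)
```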
Deeper ConvNets make the system fault tolerant.
Lots of progress in computer vision. Mask R-CNN: instance segmentation. One-shot systems (FPNs, RetinaNet, Panoptic FPNs): segment and recognize object instances and regions. Deconvolutions (UNet): used a lot in medical imaging; a 3D UNet works better than a 2D one because artifacts are reduced.
Cosmological structure formation with ConvNets: 3D UNet, PDE solver on small 3D domains, used for a simulation on a large volume. Various other applications.
Self-driving cars: Perception systems.
Is this going to take us to “intelligent” systems? No. We need significant effort to make progress towards machines with common sense.
The notion of AGI doesn’t exist, simply because there is no such thing as general intelligence.
Self-supervised learning
Humans learn quickly, largely by observation, with remarkably little interaction. At 9 months, they already have concepts of inertia, gravity…
This is not supervised, or reinforcement.
Self-supervised learning: Prediction and Reconstruction:
Predict any part of the input from any other part.
Predict the future from the past.
Predict the future from the recent past.
Predict the past from the present.
Predict the top from the bottom.
Predict the occluded from the visible.
Pretend there is a part of the input you don’t know and predict that.
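The prediction tasks listed above all build the training target out of the input itself. A minimal sketch, assuming a toy 1-D array as the input; the function names are hypothetical and no model is trained here, only the (input, target) pairs are constructed.

```python
import numpy as np

def future_from_past(sequence, context_len):
    """Split a sequence: the model would see `past` and predict `future`."""
    return sequence[:context_len], sequence[context_len:]

def occluded_from_visible(values, hidden_mask):
    """Hide entries where `hidden_mask` is True; they become the target."""
    visible = np.where(hidden_mask, 0.0, values)  # hidden entries zeroed out
    target = values[hidden_mask]
    return visible, target

sequence = np.arange(10, dtype=float)
past, future = future_from_past(sequence, context_len=7)

hidden_mask = np.zeros(10, dtype=bool)
hidden_mask[[2, 5]] = True
visible, target = occluded_from_visible(sequence, hidden_mask)
```

No labels appear anywhere: every target is a slice of the original data.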
RL: weak feedback, SL: medium feedback, S-SL: a lot of feedback. The machine predicts any part of its input from any observed part.
SL works really well on ImageNet not because of the many training samples, but because of the 1000+ categories. Categories help to construct representations. Learning theory (sample complexity): a single scalar value once in a while doesn’t give the machine much to learn from.
How to deal with uncertainty? We need systems that represent the uncertainty in the prediction: energy based learning.
Energy based learning
Learning an energy function (or contrast function) \(F(Y)\), that takes
Low values on the data manifold
Higher values everywhere else
How do we turn a prediction with multiple possible outputs into an architecture?
Learn the energy function (think of it as a \(-\log\) of a probability, but not normalized; more general than the probabilistic approaches).
Such a system doesn’t have an output; it only captures the constraints between its inputs. If I fix the values of some of the inputs, I get the values of the other inputs by minimizing the energy.
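A minimal sketch of this kind of inference, assuming a hand-written toy energy that is zero on the unit circle (not an energy from the talk). Clamping one input and minimizing over the other returns two equally good answers, which a plain input-to-output function could not represent.

```python
import numpy as np

def energy(y1, y2):
    """Toy energy: zero on the unit circle, positive elsewhere (illustrative only)."""
    return (y1**2 + y2**2 - 1.0) ** 2

y1_observed = 0.6                       # the input we clamp
y2_grid = np.linspace(-1.5, 1.5, 3001)  # candidate values for the free input
energies = energy(y1_observed, y2_grid)

# Every candidate whose energy is numerically zero is a valid completion.
best = y2_grid[energies < 1e-6]
print(best)
```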
How do you train such systems?
How do you make sure the energy is higher outside? Probabilistic models do this automatically.
Transform the energy based model into a probabilistic one -> Gibbs distribution.
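For reference, a sketch of the Gibbs distribution in the notation of these notes. The inverse temperature \(\beta > 0\) is a symbol introduced here, not something from the notes:
\[P(Y) = \frac{e^{-\beta F(Y)}}{\int_{y} e^{-\beta F(y)} \, dy}\]
Because the denominator sums over every possible \(y\), lowering the energy of one point raises its probability only relative to all the others, which is why the normalized model pushes the energy up elsewhere automatically.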
Parameterized energy function \(F(Y, W)\):
make the energy low on the samples
make the energy higher everywhere else
making the energy low on the samples is easy - NN training
What about energy everywhere else?
Build the machine so that the volume of low energy stuff is constant.
PCA energy function (not a good representation of the data manifold)
\[F(Y) = \| W^T W Y - Y \|^2\]
K-means energy function, with Z constrained to a 1-of-K code (not a direct representation: the reconstruction is the product of a prototype matrix and a latent vector, which is one-hot and selects the closest prototype). Doesn’t scale very well in high dims.
\[F(Y) = \min_{Z} \sum_i \| Y - W_i Z_i \|^2\]
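Both energy functions above can be evaluated in a few lines of NumPy. This is a sketch under stated assumptions: the data is a toy point cloud near a line in 2-D, the K-means prototypes are hand-picked rather than learned, and the K-means energy is read as the squared distance to the single prototype selected by the one-hot code.

```python
import numpy as np

def pca_energy(y, W):
    """Squared distance between y and its projection W^T W y (rows of W orthonormal)."""
    return float(np.sum((W.T @ (W @ y) - y) ** 2))

def kmeans_energy(y, prototypes):
    """Squared distance from y to the closest prototype (row of `prototypes`)."""
    return float(np.min(np.sum((prototypes - y) ** 2, axis=1)))

rng = np.random.default_rng(0)

# Toy data near a line through the origin in 2-D (assumed for illustration).
direction = np.array([3.0, 4.0]) / 5.0
data = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 2))

# PCA: keep the top principal direction as the single row of W.
_, _, vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
W = vt[:1]

pca_on = pca_energy(2.0 * direction, W)          # point on the line
pca_off = pca_energy(np.array([-4.0, 3.0]), W)   # point orthogonal to the line

# K-means: three hand-picked prototypes stand in for learned centroids.
prototypes = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
km_near = kmeans_energy(np.array([4.9, 5.2]), prototypes)
km_far = kmeans_energy(np.array([10.0, -10.0]), prototypes)
```

In both cases the low-energy region is limited by construction: a low-dimensional subspace for PCA, a finite set of points for K-means.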
Maximize \(P(Y \mid W)\) on training samples.
Minimize \(-\log P(Y \mid W)\) on training samples. Compute the gradient of the negative log-likelihood loss for one sample \(Y\).
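A sketch of that gradient, assuming the Gibbs form \(P(Y \mid W) = e^{-F(Y, W)} / \int_{y} e^{-F(y, W)} \, dy\) with the temperature constant set to 1 (an assumption made here to keep the formula short):
\[L(Y, W) = F(Y, W) + \log \int_{y} e^{-F(y, W)} \, dy\]
\[\frac{\partial L(Y, W)}{\partial W} = \frac{\partial F(Y, W)}{\partial W} - \int_{y} P(y \mid W) \, \frac{\partial F(y, W)}{\partial W} \, dy\]
The first term pushes down the energy of the observed sample; the second pulls up the energy of every point in proportion to its current probability. The integral is generally intractable for high-dimensional \(Y\).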
Latent Variable Models
Sparse Coding and Regularized Auto-Encoders.
Make the machine pay for choosing a Z outside a restricted region. Push many components of Z to zero (with an L1 norm). This lets us make the low-energy region of space smaller or larger as we want. Note: K-means does this implicitly by making Z a discrete variable. PCA also does this implicitly by limiting the dimension of Z.
On MNIST, all digits would be reconstructed as a linear combination of a small number of columns.
You have to constrain the norm to be bounded, else it blows up. “Generation through Latent Optimization” paper.
In sparse auto-encoders, you train a NN to predict what the optimal code is, and this code is regularized to be sparse. There is a convolutional version of this as well.
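A minimal sketch of sparse coding inference, assuming the usual energy \(\frac{1}{2}\|Y - WZ\|^2 + \lambda \|Z\|_1\) and using ISTA, a standard iterative solver that is swapped in here for illustration and is not necessarily what the talk used. The dictionary `W` is random rather than learned, and `lam` and `n_steps` are arbitrary choices.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink each entry toward zero by t; entries with magnitude below t become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(y, W, z, lam):
    """Sparse coding energy: reconstruction error plus L1 penalty on the code."""
    return 0.5 * float(np.sum((y - W @ z) ** 2)) + lam * float(np.sum(np.abs(z)))

def sparse_code(y, W, lam=0.1, n_steps=200):
    """ISTA: approximately minimize the objective over z for a fixed dictionary W."""
    step = 1.0 / np.linalg.norm(W, 2) ** 2   # 1 / Lipschitz constant of the gradient
    z = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = W.T @ (W @ z - y)             # gradient of the reconstruction term
        z = soft_threshold(z - step * grad, step * lam)
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))
W /= np.linalg.norm(W, axis=0)               # unit-norm columns keep W bounded

z_true = np.zeros(50)
z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = W @ z_true                               # signal built from 3 columns of W

z = sparse_code(y, W)
n_active = int(np.sum(np.abs(z) > 1e-3))     # number of columns actually used
reconstruction_error = float(np.sum((y - W @ z) ** 2))
```

A sparse auto-encoder replaces the iterative loop with an encoder network trained to predict `z` directly.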
Variational Auto-Encoder
In the context of energy-based models.
You feed an image patch through an encoder to get a code, add noise to that code, and run it through a decoder. You train the whole system to minimize the reconstruction error.
In the Z space, every training sample will be a point. If you add noise to each of those points, you turn them into fuzzy balls. Some fuzzy balls overlap, causing bad reconstructions.
Training -> Fuzzy balls move away from each other to minimize reconstruction error -> But that does nothing for us.
To prevent this from happening, you attach every ball to the origin with a spring: a penalty for moving too far, the so-called KL term.
AKA:
Attach the balls to the center with a spring, so they don’t fly away.
Minimize the square distances of the balls to the origin.
Center the balls around the origin.
Make the sizes of the balls close to 1 in each dimension.
Through a so-called KL term.
Thus, limiting the information content of the code.
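The spring picture can be made concrete with the closed-form KL term. This sketch assumes the standard VAE setup, which the notes don't spell out: the encoder outputs a diagonal Gaussian (a mean and a log-variance per code dimension) and the prior is a standard normal. The function name is made up for illustration.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over code dimensions."""
    return 0.5 * float(np.sum(mean**2 + np.exp(log_var) - log_var - 1.0))

# Ball centered at the origin with size 1 in each dimension: no penalty.
centered_unit = kl_to_standard_normal(np.zeros(4), np.zeros(4))

# Ball far from the origin: the squared-distance term (the spring) dominates.
far_away = kl_to_standard_normal(np.full(4, 3.0), np.zeros(4))

# Ball much smaller than 1: penalized even though it sits at the origin.
tiny_ball = kl_to_standard_normal(np.zeros(4), np.full(4, -4.0))
```

The `mean**2` term pulls the balls toward the origin, and the variance terms keep their size close to 1.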
Denoising: corrupt the input, train to recover the original input. Flaw: creates a valley. Works great in NLP: BERT. Easy to model uncertainty in discrete space.
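A minimal sketch of the corruption step for text, assuming simple random masking of tokens. This is a simplification for illustration: the function name, the masking probability and the fixed seed are choices made here, and real BERT pre-processing is more involved.

```python
import random

MASK = "[MASK]"  # placeholder symbol standing in for a hidden token

def corrupt(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK; return corrupted tokens and targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[position] = token   # the model is trained to recover this
        else:
            corrupted.append(token)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = corrupt(tokens, mask_prob=0.5)
```

Because the vocabulary is discrete, the model can output a full probability distribution over candidate tokens at each masked position.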