# Self-supervised learning

- 5 mins

## My notes from the “Self-Supervised Learning” talk by Yann LeCun at PAISS 2019.

• Supervised learning works but requires many labelled samples.
• Reinforcement learning: model-free RL works great for games.
• They all use ConvNets and a few other architectural concepts.
• Real-time learning with RL isn’t practical.
• Horrible sample efficiency - RL can’t really be used outside of simulation yet.

### ConvNets

• ConvNets work where the signal has strong local correlations between features and some sort of translation invariance in its statistics - images, audio signals, any “array” where locality has a meaning (see the toy sketch at the end of this list).
• Deeper ConvNets make the system fault tolerant.
• Lots of progress in computer vision:
• Mask R-CNN: instance segmentation.
• One-shot systems: FPNs, RetinaNet.
• Panoptic FPNs: segment and recognize object instances and regions.
• Deconvolutions: UNet - used a lot in medical imaging. Works better with a 3D UNet than a 2D one; artifacts are reduced.
• Cosmological structure formation with ConvNets: train a 3D UNet as a PDE solver on small 3D domains, then use it for a simulation on a large volume. Various other applications.
• Self-driving cars: Perception systems.
• Is this going to take us to “intelligent” systems? No. We need significant effort to build machines with common sense.
• The notion of AGI doesn’t make sense, simply because there is no such thing as general intelligence.
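
A toy illustration of that locality/translation point - my sketch, not from the talk. A convolution applies the same local filter at every position, so shifting the input simply shifts the output:

```python
import numpy as np

# A 1D convolution: the same local filter is applied at every position,
# so shifting the input shifts the output (translation equivariance).
def conv1d(x, w):
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

x = np.random.randn(16)          # a signal
w = np.array([1.0, -2.0, 1.0])   # a local filter (here: a second difference)

shifted = np.roll(x, 1)
# Compare shift-then-conv with conv-then-shift on the overlapping region:
a = conv1d(x, w)[:-1]
b = conv1d(shifted, w)[1:]
assert np.allclose(a, b)
```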

### Self-supervised learning

• Humans learn quickly, largely by observation, with remarkably little interaction. At 9 months: concepts of inertia, gravity…
• This is neither supervised nor reinforcement learning.
• Self-supervised learning: Prediction and Reconstruction:
• Predict any part of the input from any other part.
• Predict the future from the past.
• Predict the future from the recent past.
• Predict the past from the present.
• Predict the top from the bottom.
• Predict the occluded from the visible (see the toy sketch after this list).
• Pretend there is a part of the input you don’t know and predict that.
• RL: weak feedback. SL: medium feedback. S-SL: a lot of feedback - the machine predicts any part of its input from any observed part.
• SL works really well on ImageNet not because of the number of training samples, but because of the 1000+ categories - categories help to construct representations. Learning theory (sample complexity): one scalar value once in a while doesn’t carry much information.
• How to deal with uncertainty? We need systems that represent the uncertainty in the prediction - energy-based learning.
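
A toy version of “predict the occluded from the visible” - my sketch, not from the talk; a linear least-squares predictor stands in for a network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy self-supervised task: predict the occluded half of each sample
# from the visible half. The data are drawn with strong correlations,
# so the hidden part is actually predictable from the visible part.
n, d = 2000, 8
z = rng.normal(size=(n, 2))
Y = z @ rng.normal(size=(2, d)) + 0.05 * rng.normal(size=(n, d))

visible, hidden = Y[:, : d // 2], Y[:, d // 2 :]

# Least-squares "predictor" of the hidden part; no labels needed:
W, *_ = np.linalg.lstsq(visible, hidden, rcond=None)
err = np.mean((visible @ W - hidden) ** 2)
print(f"masked-prediction MSE: {err:.4f}")  # small: structure was learned
```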

#### Energy based learning

• Learning an energy function (or contrast function) $F(Y)$ that takes:
• Low values on the data manifold
• Higher values everywhere else
• How do we turn a prediction with multiple possible outputs into an architecture?
• Learn the energy function (think of it as a $-\log$ of a probability, but not normalized - more general than the probabilistic approaches).
• Such a system doesn’t have an output; it scores the compatibility of its inputs. And if I constrain the values of some of the inputs, I get the values of the other inputs that minimize the energy.
• How do you train such systems?
• How do you make sure the energy is higher outside? - Probabilistic models do this automatically.
• Transform the energy-based model into a probabilistic one -> the Gibbs distribution (written out after this list).
• Parameterized energy function $F(Y, W)$:
• make the energy low on the samples
• make the energy higher everywhere else
• making the energy low on the samples is easy - NN training
• What about energy everywhere else?
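
Writing the Gibbs distribution out (standard form, not verbatim from the talk), with $\beta$ an inverse-temperature constant:

$$
P(Y \mid W) = \frac{e^{-\beta F(Y, W)}}{\int e^{-\beta F(y, W)} \, dy}
$$

The integral in the denominator (the partition function) is what makes fully probabilistic models expensive.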

1. Build the machine so that the volume of low energy stuff is constant.
• PCA energy function: $F(Y) = \|Y - W W^\top Y\|^2$ (not a good representation of the data manifold).
• K-means energy function: $F(Y) = \min_z \|Y - W z\|^2$ with $z$ constrained to a 1-of-k (one-hot) code - the reconstruction is a prototype matrix times the latent vector, i.e. the closest prototype. Not a direct representation, and it doesn’t scale well in high dimensions.
2. Push down the energy of data points, push up everywhere else: maximum likelihood.
• Maximize $P(Y \mid W)$ on training samples.
• Minimize $-\log P(Y \mid W)$ on training samples. Compute the gradient of the negative log-likelihood loss for one sample $Y$ (written out below).
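
Spelling that out (standard energy-based-model algebra, not verbatim from the talk) - with the Gibbs distribution above, the per-sample negative log-likelihood and its gradient are:

$$
\mathcal{L}(Y, W) = F(Y, W) + \frac{1}{\beta} \log \int e^{-\beta F(y, W)} \, dy
$$

$$
\frac{\partial \mathcal{L}(Y, W)}{\partial W} = \frac{\partial F(Y, W)}{\partial W} - \int P(y \mid W) \, \frac{\partial F(y, W)}{\partial W} \, dy
$$

The first term pushes the energy down on the training sample; the second pushes it up everywhere else, weighted by the model’s own probability. The intractable integral is why this only works when the partition function can be computed or approximated.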

#### Latent Variable Models

• Sparse Coding and Regularized Auto-Encoders.
• Make the system pay for choosing a $Z$ outside a limited region: make it want to set many components of $Z$ to zero (with an L1 norm). This lets you make the low-energy region of space smaller or larger as you want (see the ISTA sketch after this block). Note: K-means does this implicitly by making $z$ a discrete variable; PCA also does it implicitly by limiting the dimension of $Z$.
• On MNIST, all digits are reconstructed as a linear combination of a small number of columns of the dictionary.
• You have to constrain the norm to be bounded, else it blows up - cf. the “Generative Latent Optimization” (GLO) paper.
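
A minimal sparse-coding sketch (my addition, not from the talk): ISTA minimizes the energy $\frac{1}{2}\|Y - Wz\|^2 + \lambda\|z\|_1$ over the code $z$; the dictionary $W$ here is random, just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse coding energy: F(Y) = min_z 1/2 ||Y - W z||^2 + lam * ||z||_1.
# ISTA (iterative shrinkage-thresholding) finds the minimizing code z.
d, k = 16, 64                          # signal dim, dictionary size
W = rng.normal(size=(d, k))
W /= np.linalg.norm(W, axis=0)         # bounded-norm dictionary columns
Y = rng.normal(size=d)
lam = 0.1

L = np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the gradient
z = np.zeros(k)
for _ in range(200):
    grad = W.T @ (W @ z - Y)           # gradient of the reconstruction term
    z = z - grad / L
    z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold

print("nonzero components:", np.count_nonzero(z), "of", k)
```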

• In sparse-coding auto-encoders (predictive sparse decomposition), train a NN to predict what the optimal code is; this value is regularized to be sparse. There is a convolutional version of this as well. A sketch of the combined loss follows.
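
How the pieces combine into one loss - my reading of predictive sparse decomposition, not verbatim from the talk; `enc_W` is a hypothetical linear encoder standing in for the NN:

```python
import numpy as np

# Predictive sparse decomposition (PSD) style loss, sketched: the encoder
# enc_W must predict the optimal sparse code z, so at test time the code
# comes from one cheap forward pass instead of an ISTA optimization.
def psd_loss(Y, z, W, enc_W, lam=0.1, alpha=1.0):
    recon = 0.5 * np.sum((Y - W @ z) ** 2)                 # decoder term
    sparsity = lam * np.sum(np.abs(z))                     # L1 on the code
    pred = alpha * np.sum((z - np.tanh(enc_W @ Y)) ** 2)   # encoder term
    return recon + sparsity + pred

rng = np.random.default_rng(0)
d, k = 16, 64
Y, z = rng.normal(size=d), rng.normal(size=k) * 0.1
W, enc_W = rng.normal(size=(d, k)), rng.normal(size=(k, d)) * 0.1
print(f"PSD loss: {psd_loss(Y, z, W, enc_W):.3f}")
```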

#### Variational Auto-Encoder

• In the context of energy-based models.
• You feed an image patch through an encoder to get a code, add noise to that code, and run it through a decoder. You minimize the reconstruction error by training through the whole system.
• In the Z space, every training sample is a point. Adding noise to each of those points turns them into fuzzy spheres. Some fuzzy balls overlap, causing bad reconstructions.
• Training -> fuzzy balls move away from each other to minimize reconstruction error -> but then the noise does nothing for us.
• To prevent this from happening, you attach every sphere with a spring, a penalty for moving too far - the so-called KL term (sketched in code at the end of this section).
• AKA:
• Attach the balls to the center with a spring, so they don’t fly away.
• Minimize the square distances of the balls to the origin.
• Center the balls around the origin.
• Make the sizes of the balls close to 1 in each dimension.
• Through a so-called KL term.
• Thus, Limiting the information content of the code.
• Denoising: corrupt the input, train to recover the original input. Flaw: creates a valley. Works great in NLP: BERT - it’s easy to model uncertainty in a discrete space.
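
A minimal numpy sketch of the fuzzy-ball picture (my addition, not from the talk; linear maps stand in for the encoder/decoder networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# VAE loss for one sample: noise is added to the code, and a KL "spring"
# pulls the ball centers to the origin and their sizes toward 1.
def vae_loss(y, W_mu, W_logvar, W_dec):
    mu = W_mu @ y                        # ball center in Z space
    log_var = W_logvar @ y               # log of the squared ball size
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.normal(size=mu.shape)    # noisy code
    recon = np.sum((W_dec @ z - y) ** 2)          # reconstruction error
    # KL(N(mu, sigma^2) || N(0, 1)): the spring / information penalty.
    kl = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1.0)
    return recon + kl

y = rng.normal(size=8)
W_mu, W_logvar = rng.normal(size=(2, 8)) * 0.1, rng.normal(size=(2, 8)) * 0.1
W_dec = rng.normal(size=(8, 2)) * 0.1
print(f"loss: {vae_loss(y, W_mu, W_logvar, W_dec):.3f}")
```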