A short introduction to machine learning

[-]Kerrigan3yΩ240

How was Dall-E based on self-supervised learning? The datasets of images weren't labeled by humans? If not, how does it get from text to image?

[-]Gabriel Adriano de Melo2yΩ020

The text-to-image from Dall-E was based on another model called CLIP, which had learned to caption images (generate image-to-text). This captioning could be thought of as supervised learning, but the caveat is that the captions weren't labeled by humans (in the ML sense) but extracted from web data. This is just one part of the Dall-E model; another is the diffusion process, which is based on recovering an image from noise. That part is unsupervised, as we can just add noise to images and ask the model to recover the original image.
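To make the "add noise, recover the original" idea concrete, here is a minimal sketch of how a training pair can be built from an unlabeled image. The function name `make_denoising_pair` and the `noise_std` parameter are made up for illustration; real diffusion models use a multi-step noise schedule and usually predict the noise rather than the clean image, so treat this as a toy version only.

```python
import random

def make_denoising_pair(image, noise_std=0.5, rng=None):
    """Build (noisy_input, target) from an unlabeled image.

    `image` is a flat list of pixel intensities. The target is the clean
    image itself, so the supervision signal comes from the data alone.
    """
    rng = rng or random.Random()
    # Corrupt every pixel with independent Gaussian noise.
    noisy = [p + rng.gauss(0.0, noise_std) for p in image]
    return noisy, list(image)

rng = random.Random(0)                     # fixed seed for repeatability
image = [0.0, 0.25, 0.5, 0.75, 1.0]        # toy "image" with five pixels
noisy, target = make_denoising_pair(image, noise_std=0.1, rng=rng)
```

A model would then be trained to map `noisy` back to `target`; no human ever writes a label for the pair.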

[-]gwern3y20

The 'labels' aren't labels in the sense of being deliberately constructed in a controlled vocabulary to encode a consistent set of concepts/semantics or even be in the same language. In fact, in quite a few of the image-text pairs, the text 'labels' will have nothing whatsoever to do with the image - they are a meaningless ID or spammer text or mojibake or any of the infinite varieties of garbage on the Internet, and the model just has to deal with that and learn to ignore those text tokens and try to predict the image tokens purely based on available image tokens. (Note that you don't need text 'label' inputs at all: you could simply train the GPT model to predict solely image tokens based on previous image tokens, in the same way GPT-2 famously predicts text tokens using previous text tokens.) So they aren't 'labels' in any traditional sense. They're just more data. You can train in the other direction to create a captioner model if you prefer, or you can drop them entirely to create a unimodal unconditional generative model. Nothing special about them the way labels are special in supervised learning. DALL-E 1 also relies critically on a VAE (the VAE is what takes the sequence of tokens predicted by GPT, and actually turns them into pixels, and which creates the sequence of real tokens which GPT was trained to predict), which was trained separately in the first phase: the VAE just trains to reconstruct images, pixels through bottleneck back to pixels, no label in sight.
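As a rough sketch of the "predict the next token from previous tokens" setup described above: the helper `next_token_pairs` and the token values are hypothetical, chosen only for illustration, and the real DALL-E 1 tokenizers and sequence lengths are not reproduced here.

```python
def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Text tokens scraped alongside the image; they may be useful, garbage, or absent.
text_tokens = ["a", "red", "car"]
# Made-up codebook indices standing in for the VAE's encoding of the image.
image_tokens = [812, 44, 44, 907]

# Conditional setup: text tokens are simply earlier positions in the sequence.
pairs = next_token_pairs(text_tokens + image_tokens)

# Unconditional setup: drop the text and predict image tokens from image tokens.
image_only_pairs = next_token_pairs(image_tokens)
```

In both cases every target is just the next element of the data itself, which is why the text plays the role of extra context rather than a label.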

[-]InvidFlower2y10

Not sure on DALL-E, but I think many image generators use an image classifier as part of their process. The classifier uses labels for its training, but the image AI doesn’t have direct intervention.

I think you take a classifier like CLIP and run it on an image to tell you it is likely “car” and “red”. Then add noise to the image. Then provide the noisy image and classifications to the image AI. So it will try to find “red” and “car” and add more of it to the details. Then the resulting image is run through CLIP and the classifications compared to the original classifications to define the loss function.

[-]Millon Madhur Das3y-10

Just like language models are trained using masked language modelling and next-token prediction, Dall-E was trained for image inpainting (predicting cropped-out parts of an image). This doesn't require explicit labels; hence it's self-supervised learning. Note that only this part of the training procedure is self-supervised, not the whole training process.

[-]Niklas Todenhöfer1yΩ010

instead deep learning tends to generalise incredibly well to examples it hasn’t seen already. How and why it does so is, however, still poorly-understood.

In my opinion generalisation is a very interesting point!

Are there any new insights into deep learning generalisation, similar to the ideas of:

1) implicit regularisation through optimisation methods like stochastic gradient descent,
2) the double descent risk curve where more parameters can reduce error again,
or
3) margin-based measures to predict generalisation gaps?

Or, to ask more generally:
How might we ensure that this and similar articles are updated regularly?

[-]Vamsi Sistla2yΩ010

Considering this article is 3 years old, I wanted to add some updates on Symbolic AI since then.

I agree with the historic view of Symbolic AI, but with Neuro-Symbolic AI (NeSy), some of those concerns have been addressed. NeSy combines neural networks with symbolic programming (example research: https://arxiv.org/abs/2402.01889).

This research (https://arxiv.org/abs/2401.01040) published in Jan 2024 is a good survey of this approach (and its challenges).