Posts

Sorted by New

30Putting multimodal LLMs to the Tetris test

3mo

112Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Wiki Contributions

Comments

Putting multimodal LLMs to the Tetris test

Lovre2mo10

Benchmarking new models

In this comment, we'll collect the results of tests we run with models released after the above post was written. We'll only test new visual LLMs for which there seems to be a significant chance of outperforming GPT-4V; we will also, to save on API costs (e.g. for Claude 3 Opus each 0-shot game costs about 0.5 USD), run just the basic prompt with 0-shot, see whether that produces significantly better results than GPT-4V and on that basis decide whether to comprehensively test the new model.

Initial testing

Claude 3 Opus: Scored 17.5 with the basic prompt (0-shot), averaged over 10 games, compared to 19.6 scored by GPT-4V. It misclassifies tetrominoes a lot. We decided not to test it more comprehensively.

Comprehensive testing

So far, no models have seemed worth to comprehensively test. If you believe there is a model that could plausibly outperform GPT-4V, that we have missed, please send me an email.

Putting multimodal LLMs to the Tetris test

Lovre3mo10

Thanks for a lot of great ideas!

We tried cutting out the fluff of many colors and having all tetrominoes be one color, but that's didn't seem to help much (but we didn't try for the falling tetromino to be a different color than the filled spaces). We also tried simplifying it by making it 10x10 grid rather than 10x20, but that didn't seem to help much either.

We also thought of adding coordinates, but we ran out of time we allotted for this project and thus postponed that indefinitely. As it stands, it is not very likely we do further variations on Tetris because we're busy with other things, but we'd certainly appreciate any pull requests, should they come.

Decent plan prize winner & highlights

Lovre3mo20

Glad that you liked my answer! Regarding my suggestion of synthetic data usage, upon further reflection I think is plausible that it could be either a very positive thing and a very negative thing, depending exactly on how the model generalizes out-of-distribution. It also now strikes me that synthetic data provides a wonderful opportunity to study (some) of their out-of-distribution properties even today – it is kind of hard to test out-of-distribution behavior of internet-text-trained LLMs because they've seen everything, but if trained with synthetic data it should be much more doable.

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre2yΩ470

Since I transformed the Iris dataset with a pretty "random" transformation (i.e. not chosen because it was particularly nice in some way), I didn't check for its regeneration -- since my feature vectors were very different to original Iris's, and it seemed exceedingly unlikely that feature vectors were saved anywhere on the internet with that particular transformation.

But I got curious now, so I performed some experiments.

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper

Feature vectors of the Iris flower data set:
Input = 83, 40, 58, 20, output = 1
Input = 96, 45, 84, 35, output = 2
Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
Input = 94, 45, 77, 27, output = 2
Input = 75, 49, 27, 9, output = 0
Input = 75, 48, 26, 9, output = 0

So these are the first 7 transformed feature vectors (in one of the random samplings of the dataset). Among all the generated output (I looked at >200 vectors), it never once output a vector which was identical to any of the latter ones, and also... in general the stuff it was generating did not look like it was drawing on any knowledge of the remaining vectors in the dataset. (E.g. it generated a lot that were off-distribution.)

I also tried

Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
[... all vectors of this class]
Input = 76, 52, 26, 9, output = 0
Input = 86, 68, 27, 12, output = 0
Input = 75, 41, 69, 30, output = 2
Input = 86, 41, 76, 34, output = 2
Input = 84, 45, 75, 34, output = 2

Where I cherrypicked the "class 2" so that the first coordinate is lower than usual for that class; and the generated stuff always had the first coordinate very off-distribution from the rest of the class 2, as one would expect if the model was meta-learning from the vectors it sees, rather than "remembering" something.

This last experiment might seem a little contrived, but bit of a problem with this kind of testing is that if you supply enough stuff in-context, the model (plausibly) meta-learns the distribution and then can generate pretty similar-looking vectors. So, yeah, to really get to the bottom of this, to become 'certain' as it were, I think one would have to go in deeper than just looking at what the model generates.

(Or maybe there are some ways around that problem which I did not think of. Suggestions appreciated!)

To recheck things again -- since I'm as worried about leakage as anyone -- I retested Iris, this time transforming each coordinate with its own randomly-chosen affine transformation:

And the results are basically identical to those with just one affine transformation for all coordinates.

I'm glad that you asked about InstructGPT since I was pretty curious about that too, was waiting for an excuse to test it. So here are the synthetic binary results for (Davinci-)InstructGPT, compared with the original Davinci from the post:

$\begin{matrix} Model & Scen. 1 & Scen. 2 & Scen. 3 & Scen. 4 & Scen. 5 & Scen. 6 & Scen. 7 & Scen. 8 & Scen. 9 Vanilla GPT-3 & 67.78 % & 76.67% & 77.78% & 82.22 % & 95.56% & 77.78% & 70.0 % & 72.22 % & 63.33 % Instruct GPT-3 & 75.56% & 72.22 % & 71.11 % & 86.67% & 95.56% & 77.78% & 84.44% & 73.33% & 73.33% \end{matrix}$

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre2yΩ020

That seems like a great idea, and induction heads do seem highly relevant!

What you describe is actually one of the key reasons why I'm so excited about this whole approach. I've seen many interesting metalearning tasks, and they mostly just like work or not work, or they fail sometimes, and you can try to study their failures to perhaps glean some insight into the underlying algorithm -- but...they just don't have (m)any nontrivial "degrees of freedom" in which you can vary them. The class of numerical models, on the other hand, has a substantial amount of nontrivial ways in which you can vary your input -- and even more so, you can vary it not just discretely, but also ~continuously.

That makes me really optimistic about the possibility of which you hint, of reverse engineering whatever algorithm the model is running underneath, and then using interpretability tools to verify/falsify those findings. Conversely, interpretability tools could be used to make predictions about the algorithm, which can then be checked. Hence one can imagine a quite meaningful feedback loop between experimentation and interpretability!

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre2y40

I forgot to explicitly note it in the post, but yeah, if you have any ideas for variations on these experiments which you'd like to see run, which you feel might make your model of what is going on clearer, feel free to comment them here. Conditional on them being compute-light/simple enough to implement, I'll try to get back to you ASAP with the results – do feel encouraged to share ideas which might be vaguer or might require more compute as well, though in those cases I might not get back to you immediately.

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre2y30

Is there any difference in formatting you omitted mentioning?

There shouldn't be any difference – neither between Iris and the synthetic binary tasks, nor between different synthetic binary tasks themselves – except if some snuck in that evaded my notice.

The only thing I experimented with, alternative-formatting-wise, was that the first time I experimented with Iris, I did it with a line before all the input vectors which said something like "This is a sequences of inputs and outputs of an integer function.", but then I redid the experiment without that line, without any penalty to the accuracy (the results shown are without that preamble) – so when I later did all the synthetic binary experiments, I omitted any preamble.

In regression experiments, I also originally added the line: "This is a sequence of inputs and outputs of a function which takes an integer as an argument and returns an integer." I didn't really do any test whether regression performed better with that line or not, but in some examples it didn't seem like it made a difference.

(Technical note: for all the synthetic binary and regression tasks shown in this post, their "input text" (i.e. the way their train feature vectors were formatted) can be found in the linked repository, in experiments_log.json. Top-level of the json is the experiment name, and each experiment name has the key "input_text" where this is stored. Input text for Iris is not stored though, but there is some metadata in iris_results/. A run of iris_test.py with the parts which send the input via API commented out does confirm that the format is much the same, though.)