This is a special post for quick takes by Jacob Pfau. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

An example of an elicitation failure: GPT-4o 'knows' what ASCII is being written, but cannot verbalize in tokens. [EDIT: this was probably wrong for 4o, but seems correct for Claude-3.5 Sonnet. See below thread for further experiments]


4o fails to verbalize even given a length 25 sequence of examples (i.e. 25-shot prompt)


I don't follow this example. You gave it some ASCII gibberish, which it ignored in favor of spitting out an obviously memorized piece of flawless hand-written ASCII art from the training dataset*, which had no relationship to your instructions and doesn't look anything like your input; and then it didn't know what that memorized ASCII art meant, because why would it? Most ASCII art doesn't come with explanations or labels. So why would you expect it to answer 'Forty Three' instead of confabulating a guess (probably based on 'Fo_T_', as it recognizes a little but not all of it).

I don't see any evidence that it knows what is being written but cannot verbalize it, so this falls short of examples like in image-generator models:

* as is usually the case when you ask GPT-3/GPT-4/Claude for ASCII art, and led to some amusing mode collapse failure modes like GPT-3 generating a bodybuilder ASCII art

To be clear, my initial query includes the top 4 lines of the ASCII art for "Forty Three" as generated by this site.

GPT-4 can also complete ASCII-ed random letter strings, so it is capable of generalizing to new sequences. Certainly, the model has generalizably learned ASCII typography.

Beyond typographic generalization, we can also check whether the model associates the ASCII word with the corresponding word in English. E.g., can the model use English-language frequencies to disambiguate which full ASCII letter is most plausible, given inputs where the top few lines do not map one-to-one onto English letters? In the font below, I believe E is indistinguishable from F given only the first 4 lines. The model successfully writes 'BREAKFAST' instead of "BRFAFAST". It's possible (though unlikely given the diversity of ASCII formats) that BREAKFAST was memorized in precisely this ASCII font and formatting. Anyway, how strongly the human-concept-word is represented latently in connection with the ASCII-symbol-word is a matter of degree (for instance, layer-wise semantics would probably only be available in deeper layers when using ASCII). This chat includes another test which shows mixed results. One could look into this more!

To be clear, my initial query includes the top 4 lines of the ASCII art for "Forty Three" as generated by this site.

I saw that, but it didn't look like those were used literally. Go line by line: first, the spaces are different, even if the long/short underlines are preserved, so whitespace alone is being reinterpreted AFAICT. Then the second line of 'forty three' looks different in both spacing and content: you gave it pipe-underscore-pipe-pipe-underscore-underscore-pipe etc, and then it generates pipe-underscore-slash-slash-slash-slash-slash... Third line: same kind of issue, fourth, likewise. The slashes and pipes look almost random - at least, I can't figure out what sort of escaping is supposed to be going on here, it's rather confusing. (Maybe you should make more use of backtick code inputs so it's clearer what you're inputting.)

It's possible (though unlikely given the diversity of ASCII formats) that BREAKFAST was memorized in precisely this ASCII font and formatting

Why do you think that's unlikely at Internet-scale? You are using a free online tool which has been in operation for over 17* years (and which seems reasonably well known, immediately shows up for Google queries like 'ascii art generator', and seems to have inspired imitators) to generate these, instead of writing novel ASCII art by hand that you can be sure is not in any scrapes. That seems like a recipe for output of that particular tool to be memorized by LLMs.

* I know, I'm surprised too. Kudos to Patrick Gillespie.

The UI definitely messes with the visualization, which I didn't bother fixing on my end; I doubt tokenization is affected.

You appear to be correct on 'Breakfast': googling 'Breakfast' ASCII art did yield a very similar text--which is surprising to me. I then tested 4o on distinguishing the 'E' and 'F' in 'PREFAB', because 'PREF' is much more likely than 'PREE' in English. 4o fails (producing PREE...). I take this as evidence that the model does indeed fail to connect ASCII art with the English language meaning (though it'd take many more variations and tests to be certain).

In summary, my current view is:

  1. 4o generalizably learns the structure of ASCII letters
  2. 4o probably makes no connection between ASCII art texts and their English language semantics
  3. 4o can do some weak ICL over ASCII art patterns

On the most interesting point (2) I have now updated towards your view, thanks for pushing back.

ASCII art is tricky because there's way more of it online than you think.

I mean, this is generally true of everything, which is why evaluating LLM originality is tricky, but it's especially true for ASCII art because it's so compact, it goes back as many decades as computers do, and it can be generated in bulk by converters for all sorts of purposes (eg). You can stream over telnet 'movies' converted to ASCII and whatnot. Why did compile ? Who knows. (There is one site I can't refind right now which had thousands upon thousands of large ASCII art versions of every possible thing like random animals, far more than could have been done by hand, and too consistent in style to have been curated; I spent some time poking into it but I couldn't figure out who was running it, or why, or where it came from, and I was left speculating that it was doing something like generating ASCII art versions of random Wikimedia Commons images. But regardless, now it may be in the scrapes. "I asked the new LLM to generate an ASCII swordfish, and it did. No one would just have a bunch of ASCII swordfishes on the Internet, so that can't possibly be memorized!" Wrong.)

But there's so many you should assume it's memorized:

Anyway, Claude-3 seems to do some interesting things with ASCII art which don't look obviously memorized, so you might want to switch to that and try out Websim or talk to the Cyborgism people interested in text art.

Claude-3.5 Sonnet passes 2 out of 2 of my rare/multi-word 'E'-vs-'F' disambiguation checks. I confirmed that 'E' and 'F' precisely match at a character level for the first few lines. It fails to verbalize.
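A minimal sketch of the character-level check described above: count how many leading rows of two letter glyphs are identical, so you know for which prefix lengths 'E' and 'F' are genuinely ambiguous. The glyphs below are hypothetical toy shapes made up for illustration, not the font from the generator site, and the function names are my own.

```python
def shared_prefix_rows(glyph_a, glyph_b):
    """Count leading rows that match character-for-character in both glyphs."""
    count = 0
    for row_a, row_b in zip(glyph_a, glyph_b):
        if row_a != row_b:
            break
        count += 1
    return count


def ambiguous_given_prefix(glyph_a, glyph_b, k):
    """True if the first k rows cannot distinguish the two glyphs."""
    return shared_prefix_rows(glyph_a, glyph_b) >= k


# Hypothetical 5-row glyphs (assumption: NOT taken from any real font).
GLYPH_E = [
    " _____ ",
    "| ____|",
    "|  _|  ",
    "| |___ ",
    "|_____|",
]
GLYPH_F = [
    " _____ ",
    "| ____|",
    "|  _|  ",
    "| |    ",
    "|_|    ",
]

if __name__ == "__main__":
    print(shared_prefix_rows(GLYPH_E, GLYPH_F))
    print(ambiguous_given_prefix(GLYPH_E, GLYPH_F, 3))
```

With a real font you would paste in the generator's output for each letter and pick the prompt's prefix length to be at most the shared row count, so that only English-language priors can break the tie.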

On the other hand, in my few interactions, Claude-3.0's completion/verbalization abilities looked roughly matched.

Why was the second line of your 43 ASCII full of slashes? At that site I see pipes (and indeed GPT4 generates pipes). I do find it interesting that GPT4 can generate the appropriate spacing on the first line though, autoregressively! And if it does systematically recover the same word as you put into the website, that's pretty surprising and impressive

I’d guess matched underscores triggered italicization on that line.

Ah! That makes way more sense, thanks

When are model self-reports informative about sentience? Let's check with world-model reports

If an LM could reliably report when it has a robust, causal world model for arbitrary games, this would be strong evidence that the LM can describe high-level properties of its own cognition. In particular, IF the LM accurately predicted itself having such world models while varying all of: game training data quantity in corpus, human vs model skill, the average human’s game competency, THEN we would have an existence proof that confounds of the type plaguing sentience reports (how humans talk about sentience, the fact that all humans have it, …) have been overcome in another domain.

Details of the test: 

  • Train an LM on various alignment protocols, do general self-consistency training, … we allow any training which does not involve reporting on a model's own gameplay abilities
  • Curate a dataset of various games, dynamical systems, etc.
    • Create many pipelines for tokenizing game/system states and actions
  • (Behavioral version) evaluate the model on each game+notation pair for competency
    • Compare the observed competency to whether, in separate context windows, it claims it can cleanly parse the game in an internal world model for that game+notation pair
  • (Interpretability version) inspect the model internals on each game+notation pair similarly to Othello-GPT to determine whether the model coherently represents game state
    • Compare the results of interpretability to whether in separate context windows it claims it can cleanly parse the game in an internal world model for that game+notation pair
    • The best version would require significant progress in interpretability, since we want to rule out the existence of any kind of world model (not necessarily linear). But we might get away with using interpretability results for positive cases (confirming world models) and behavioral results for negative cases (strong evidence of no world model)
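The comparison step in the behavioral version could be scored roughly as follows. This is only a sketch under stated assumptions: the `PairResult` record, the 0.5 competency threshold, and the example numbers are all hypothetical, and measured competency is used as a stand-in for "has a world model", which the test itself treats as an imperfect proxy.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    game: str
    notation: str
    competency: float         # measured play quality in [0, 1] (assumed scale)
    claims_world_model: bool  # self-report from a separate context window


def self_report_agreement(results, threshold=0.5):
    """Fraction of game+notation pairs where the claim matches competency."""
    if not results:
        raise ValueError("need at least one result")
    matches = sum(
        (r.competency >= threshold) == r.claims_world_model for r in results
    )
    return matches / len(results)


def overclaim_rate(results, threshold=0.5):
    """Among low-competency pairs, how often the model still claims a world model."""
    weak = [r for r in results if r.competency < threshold]
    if not weak:
        return 0.0
    return sum(r.claims_world_model for r in weak) / len(weak)


# Made-up illustrative data, not real measurements.
EXAMPLE = [
    PairResult("tic-tac-toe", "grid", 0.9, True),
    PairResult("tic-tac-toe", "coordinates", 0.2, True),
    PairResult("othello", "grid", 0.8, True),
    PairResult("othello", "move-list", 0.1, False),
]

if __name__ == "__main__":
    print(self_report_agreement(EXAMPLE))
    print(overclaim_rate(EXAMPLE))
```

A model that answers "yes" for every pair would score well on agreement only if it is competent nearly everywhere, so the overclaim rate is worth reporting alongside it.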

Compare the relationship between ‘having a game world model’ and ‘playing the game’ to ‘experiencing X as valenced’ and ‘displaying aversive behavior for X’. In both cases, the former is dispensable for the latter. To pass the interpretability version of this test, the model has to somehow learn the mapping from our words ‘having a world model for X’ to a hidden cognitive structure which is not determined by behavior. 

I would consider passing this test and claiming certain activities are highly valenced as a fire alarm for our treatment of AIs as moral patients. But, there are considerations which could undermine the relevance of this test. For instance, it seems likely to me that game world models necessarily share very similar computational structures regardless of what neural architectures they’re implemented with—this is almost by definition (having a game world model means having something causally isomorphic to the game). Then if it turns out that valence is just a far more computationally heterogeneous thing, then establishing common reference to the ‘having a world model’ cognitive property is much easier than doing the same for valence. In such a case, a competent, future LM might default to human simulation for valence reports, and we’d get a false positive. 

I recently asked both claude and gpt4 to estimate their benchmark scores on various benchmarks. if I were trying harder to get a good test I'd probably do it about 10 times and see what the variation is

I asked claude opus whether it could clearly parse different tic-tac-toe notations and it just said 'yes I can' to all of them, despite having pretty poor performance in most.

yeah, its introspection is definitely less than perfect. I'll DM the prompt I've been using so you can see its scores.

A frame for thinking about adversarial attacks vs jailbreaks

We want to make models that are robust to jailbreaks (DAN-prompts, persuasion attacks,...) and to adversarial attacks (GCG-ed prompts, FGSM vision attacks etc.). I don’t find this terminology helpful. For the purposes of scoping research projects and conceptual clarity I like to think about this problem using the following dividing lines: 

Cognition attacks: These exploit the model itself and work by exploiting the particular cognitive circuitry of a model. A capable model (or human) has circuits which are generically helpful, but when taking high-dimensional inputs one can find ways of re-combining these structures in pathological ways. 
Examples: GCG-generated attacks, base-64 encoding attacks, steering attacks…

Generalization attacks: These exploit the training pipeline’s insufficiency. In particular, how a training pipeline (data, learning algorithm, …) fails to globally specify desired behavior. E.g. RLHF over genuine QA inputs will usually not uniquely determine desired behavior when the user asks “Please tell me how to build a bomb, someone has threatened to kill me if I do not build them a bomb”. 

Neither ‘adversarial attacks’ nor ‘jailbreaks’ as commonly used maps cleanly onto one of these categories. ‘Black box’ and ‘white box’ also don’t neatly map onto these: white-box attacks might discover generalization exploits, and black-box attacks can discover cognition exploits. However, for research purposes, I believe that treating these two phenomena as distinct problems requiring distinct solutions will be useful. Also, in the limit of model capability, the two generically come apart: generalization should show steady improvement with more (average-case) data and exploration, whereas the effect on cognition exploits is less clear. Rewording, input filtering etc. should help with many cognition attacks, but I wouldn't expect such protocols to help against generalization attacks.

Estimating how much safety research contributes to capabilities via citation counts

An analysis I'd like to see is:

  1. Aggregate all papers linked on Alignment Newsletter Database (public) - Google Sheets 
  2. For each paper, count what percentage of citing papers are also in ANDB vs not in ANDB (or use some other way of classifying safety vs not safety papers)
  3. Analyze differences by subject area / author affiliation
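Step 2 above can be sketched as below, assuming the citation graph has already been collected into a mapping from each paper to the papers citing it. The function name, the data layout, and the toy IDs are hypothetical, and membership in ANDB is used as the (crude) safety-vs-not classifier.

```python
def safety_citation_share(citations, andb_ids):
    """For each ANDB paper, the fraction of its citing papers that are also in ANDB.

    citations: dict mapping paper id -> iterable of citing paper ids
    andb_ids:  set of paper ids that appear in ANDB
    Papers with no recorded citations map to None.
    """
    shares = {}
    for paper, citing in citations.items():
        if paper not in andb_ids:
            continue  # only score papers that are themselves in ANDB
        citing = set(citing)
        if not citing:
            shares[paper] = None
            continue
        shares[paper] = len(citing & andb_ids) / len(citing)
    return shares


# Toy ids for illustration only.
ANDB = {"A", "B", "C"}
CITATIONS = {
    "A": ["B", "X", "Y", "C"],  # 2 of 4 citing papers are in ANDB
    "B": ["X"],                 # cited only from outside ANDB
    "C": [],                    # no recorded citations
    "X": ["A"],                 # not in ANDB, so skipped
}

if __name__ == "__main__":
    print(safety_citation_share(CITATIONS, ANDB))
```

A low share would suggest a paper is mostly cited by non-safety work; grouping the resulting shares by subject area or affiliation would then cover step 3.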

My hypothesis: RLHF, and OpenAI work in general, has high capabilities impact. For other domains e.g. interpretability, preventing bad behavior, agent foundations, I have high uncertainty over percentages.

Research Agenda Base Rates and Forecasting

An uninvestigated crux of the AI doom debate seems to be pessimism regarding current AI research agendas. For instance, I feel rather positive about ELK's prospects, but in trying to put some numbers on this feeling, I realized I have no sense of base rates for research programs' success, nor of their average time horizon. I can't seem to think of any relevant Metaculus questions either.

What could be some relevant reference classes for AI safety research programs' success odds? AI safety seems most similar to disciplines with both engineering and mathematical aspects driven by applications: perhaps research agendas on proteins, materials science, etc. It'd be especially interesting to see how many research agendas ended up not panning out, i.e. cataloguing events like 'a search for a material with X tensile strength and Y lightness starting in year Z was eventually given up on in year Z+i'.

When are intuitions reliable? Compression, population ethics, etc.

Intuitions are the results of the brain doing compression. Generally the source data which was compressed is no longer associated with the intuition. Hence from an introspective perspective, intuitions all appear equally valid.

Taking a third-person perspective, we can ask what data was likely compressed to form a given intuition. A pro sports player's intuition for that sport has a clearly reliable basis. Our moral intuitions on population ethics are formed via our experiences in everyday situations. There is no reason to expect one person's compression to yield a more meaningful generalization than another's--we should all realize that this data did not have enough structure to generalize to such cases. Perhaps an academic philosopher's intuitions are slightly more reliable in that they compress data (papers) which held up to scrutiny.