Erich_Grunewald's Shortform

Erich_Grunewald

On the advice of @adamShimi, I recently read Hasok Chang's Inventing Temperature. The book is terrific and full of deep ideas, many of which relate in interesting ways to AI safety. What follows are some thoughts on that relationship, from someone who is not an AI safety researcher and only somewhat follows developments there, and who probably got one or two things wrong.

(Definitions: By "operationalizing", I mean "giving a concept meaning by describing it in terms of measurable or closer-to-measurable operations", whereas "abstracting" means "removing properties in the description of an object".)

There has been discussion on LessWrong about the relative value of abstract work on AI safety (e.g., agent foundations) versus concrete work on AI safety (e.g., mechanistic interpretability, prosaic alignment). Proponents of abstract work argue roughly that general mathematical models of AI systems are useful or essential for understanding risks, especially coming from not-yet-existing systems like superintelligences. Proponents of concrete work argue roughly that safety work is more relevant when empirically grounded and subjected to rapid feedback loops. (Note: The abstract-concrete distinction is similar to, but different from, the distinction between applied and basic safety research.)

As someone who has done neither, I think we need both. We need abstract work because we need to build safety mechanisms using generalizable concepts, so that we can be confident that the mechanisms apply to new AI systems and new situations. We need concrete work because we must operationalize the abstract concepts in order to measure them and apply them to actually existing systems. And finally we need work that connects the abstract concepts to the concrete concepts, to see that they are coherent and for each to justify the other.

Chang writes:

The dichotomy between the abstract and the concrete has been enormously helpful in clarifying my thinking at the earlier stages, but I can now afford to be more sophisticated. What we really have is a continuum, or at least a stepwise sequence, between the most abstract and the most concrete. This means that the operationalization of a very abstract concept can proceed step by step, and so can the building-up of a concept from concrete operations. And it may be beneficial to move only a little bit at a time up and down the ladder of abstraction.

Take for example the concept of (capacity for) corrigibility, i.e., the degree to which an AI system can be corrected or shut down. The recent alignment faking paper showed that, in experiments, Claude would sometimes "pretend" to change its behavior when it was ostensibly being trained with new alignment criteria, while not actually changing its behavior. That's an interesting and important result. But (channeling Bridgman) we can only be confident that it applies to the concrete concept of corrigibility measured by the operations used in the experiments -- we have no guarantees that it holds for some abstract corrigibility, or when corrigibility is measured using another set of operations or under other circumstances.

An interesting case study discussed in the book is the development of the abstract concept of temperature by Lord Kelvin (the artist formerly known as William Thomson) in collaboration with James Prescott Joule (of conservation of energy fame). Thomson defined his abstract temperature in terms of work and pressure (which were themselves abstract and needed to be operationalized). He based his definition on the Carnot cycle, an idealized process performed by the theoretical Carnot heat engine. The Carnot heat engine was inspired by actual heat engines, but was fully theoretical -- there was no physical Carnot heat engine that could be used in experiments. In other words, the operationalization of temperature that Thomson invented using the Carnot cycle was an intermediate step that required further operationalization before Thomson's abstract temperature could be connected with experimental data. Chang suggests that, while the Carnot engine was never necessary for developing an abstract concept of temperature, it did help Thomson achieve that feat.

Ok, back to AI safety. So above I said that, for the whole AI thing to go well, we probably need progress on both abstract and concrete AI safety concepts, as well as work to bridge the two. But where should research effort be spent on the margin?

You may think abstract work is useless because it has no error-correcting mechanism when it is not trying to, or is not close to being able to, operationalize its abstract concepts. If it is not grounded in any measurable quantities, it can't be empirically validated. On the other hand, many abstract concepts (such as corrigibility) still make sense today and are currently being studied in the concrete (though they have not yet been connected to fully abstract concepts) despite being formulated before AI systems looked much like they do today.

You may think concrete work is useless because AI changes so quickly that the operations used to measure things today will soon be irrelevant, or more pertinently perhaps, because the superintelligent systems we truly need to align are presumably vastly different from today's AI systems, in their behavior if not in their architecture. In that way, AI is quite different from temperature. The physical nature of temperature is constant in space and time -- if you measure temperature with a specific set of operations (measurement tools and procedures), you would expect the same outcomes regardless of which century or country you do it in -- whereas the properties of AI change rapidly over time and across architectures. On the other hand, timelines seem short, such that AGI may share many similarities with today's AI systems, and it is possible to build abstractions gradually on top of concrete operations.

There is in fact an example from the history of thermometry of extending concrete concepts to new environments without recourse to abstract concepts. In the 18th century, scientists realized that the mercury and air thermometers used then behaved very differently, or could not be used at all due to freezing and melting, for very low and very high temperatures. While they had an intuitive notion that some abstract temperature ought to apply across all degrees of heat or cold, their operationalized temperatures clearly only applied to a limited range of heat and cold. To solve this, they eventually developed different sets of operations for the measurement of temperatures in extreme ranges. For example, Josiah Wedgwood measured very high temperatures in ovens by baking standardized clay cylinders and measuring how much they'd shrunk. These different operations, which yielded measurements of temperature on different scales, were then connected by measuring temperature for both scales (using different operations) in an overlapping range and lining those up. All this was done without an abstract theory of temperature, and while the resulting scale was not on very solid theoretical ground, it was good enough to provide practical value.

Of course, the issue with superintelligence is that, because of e.g., deceptive alignment and gradient hacking, we want trustworthy safety mechanisms and alignment techniques in place well before the system has finished training. That's why we want to tie those techniques to abstract concepts which we are confident will generalize well. But I have no idea what the appropriate resource allocation is across these different levels of abstraction.^[1] Maybe what I want to suggest is that abstract and concrete work is complementary and should strive towards one another. But maybe that's what people have been doing all along?

The most upvoted dialogue topic on an October 2023 post by Ben Pace was "Prosaic Alignment is currently more important to work on than Agent Foundations work", which received 40 agree and 32 disagree votes, suggesting that the general opinion on LessWrong at that time was that the current balance was about right, or that prosaic alignment should get some more resources on the margin. ↩︎

[-]Erich_Grunewald2y80

Here's what I usually try when I want to get the full text of an academic paper:

Search Sci-Hub. Give it the DOI (e.g. https://doi.org/...) and then, if that doesn't work, give it a link to the paper's page at an academic journal (e.g. https://www.sciencedirect.com/science...).
Search Google Scholar. I can often just search the paper's name, and if I find it, there may be a link to the full paper (HTML or PDF) on the right of the search result. The linked paper is sometimes not the exact version of the paper I am after -- for example, it may be a manuscript version instead of the accepted journal version -- but in my experience this is usually fine.
Search the web for "name of paper in quotes" filetype:pdf. If that fails, search for "name of paper in quotes" and look at a few of the results if they seem promising. (Again, I may find a different version of the paper than the one I was looking for, which is usually but not always fine.)
Check the paper's authors' personal websites for the paper. Many researchers keep an up-to-date list of their papers with links to full versions.
Email an author to politely ask for a copy. Researchers spend a lot of time on their research and are usually happy to learn that somebody is eager to read it.

[-]Lao Mein2y10

I would add Semantic Scholar to the list. It gives consistantly better search results than Google Scholar and has a better interface. I've also found a really difficult-to-find paper on pre-print websites once or twice.

[-]Erich_Grunewald2y10

Thanks for the suggestion! I'll be trying it out and adding it to the list if I find it useful.

[-]Erich_Grunewald2y34

I'm really confused by this passage from The Six Mistakes Executives Make in Risk Management (Taleb, Goldstein, Spitznagel):

We asked participants in an experiment: “You are on vacation in a foreign country and are considering flying a local airline to see a special island. Safety statistics show that, on average, there has been one crash every 1,000 years on this airline. It is unlikely you’ll visit this part of the world again. Would you take the flight?” All the respondents said they would.
We then changed the second sentence so it read: “Safety statistics show that, on average, one in 1,000 flights on this airline has crashed.” Only 70% of the sample said they would take the flight. In both cases, the chance of a crash is 1 in 1,000; the latter formulation simply sounds more risky.

One crash every 1,000 years is only the same as one crash in 1,000 flights if there's exactly one flight per year on average. I guess they must have stipulated that in the experiment (of which there's no citation), because otherwise it's perfectly rational to suppose the first option is safer (since generally an airline serves >1 flight per year)?

[-]Erich_Grunewald2mo10

[-]Erich_Grunewald3y10

A few months ago I wrote a post about Game B. The summary:

I describe Game B, a worldview and community that aims to forge a new and better kind of society. It calls the status quo Game A and what comes after Game B. Game A is the activity we’ve been engaged in at least since the dawn of civilisation, a Molochian competition over resources. Game B is a new equilibrium, a new kind of society that’s not plagued by collective action problems.

While I agree that collective action problems (broadly construed) are crucial in any model of catastrophic risk, I think that

civilisations like our current one are not inherently self-terminating (75% confidence);

there are already many resources allocated to solving collective action problems (85% confidence); and

Game B is unnecessarily vague (90% confidence) and suffers from a lack of tangible feedback loops (85% confidence).

I think it can be of interest to some LW users, though it didn't feel on-topic enough to post in full here.