Post summary (feel free to suggest edits!): The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters, e.g. the “helpful assistant” character. The author broadly agrees, but notes this solves neither outer alignment (‘be careful what you wish for’) nor inner alignment (‘you wished for it right, but the program you got had ulterior motives’).
They give an example of each failure case. For outer alignment, say researchers want a chatbot that gives helpful, honest answers, but end up with a sycophant that tells the user what they want to hear. For inner alignment, imagine a prompt engineer asking the chatbot to reply with how to solve the Einstein-Durkheim-Mendel conjecture as if it were ‘Joe’, who’s awesome at quantum sociobotany. But the AI thinks the ‘Joe’ character secretly cares about paperclips, so it gives an answer that will help create a paperclip factory instead.

(This will appear in this week's forum summary. If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
Post summary (feel free to suggest edits!): ‘Setting the Zero Point’ is a “Dark Art”, i.e. something which causes someone else’s map to diverge from the territory in a way that’s advantageous to you. It involves speaking in a way that takes for granted that the line between ‘good’ and ‘bad’ sits at a particular point, without explicitly arguing for that placement. This makes changes between points below and above that line feel more significant.
As an example, many people draw a zero point between helping and not helping a child drowning in front of them. One is good, one is bad. The Drowning Child argument argues this point is wrongly set, and should be between helping and not helping any dying child.
The author describes 14 examples, and suggests that it’s useful to be aware of this dynamic and explicitly name zero points when you notice them.
(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
Currently it's all done manually, but the ChatGPT summaries are pretty decent, so I'm looking into which types of posts it does well on.
Post summary (feel free to suggest edits!): The author argues that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). This is because there is good economic reason to have AIs ‘aim’ at certain outcomes, e.g. we might want an AI that can accomplish goals such as ‘get me a TV for a great price’. Current methods train AIs to do this via trial and error, but because we ourselves are often misinformed, we can sometimes negatively reinforce truthful behavior and positively reinforce deception that makes it look like things are going well. This can mean AIs learn an unintended aim, which, if ambitious enough, is very dangerous. There are also intermediate goals like ‘don’t get turned off’ and ‘control the world’ that are useful for almost any ambitious aim.
Warning signs for this scenario are hard to observe, because of the deception involved. There will likely still be some warning signs, but in a situation with incentives to roll out powerful AI as fast as possible, responses are likely to be inadequate.
Post summary (feel free to suggest edits!): Various people have proposed variants of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. This seems unpromising: even an AGI convinced it might still be in training could cause bad outcomes, and convincing it in the first place would be difficult.
Non-exhaustive list of how it could tell it’s in reality:
If you can understand the contents of the AI’s mind well enough to falsify every possible check it could do to determine the difference between simulation and reality, then you could use that knowledge to build a friendly AI that doesn’t need to be fooled in the first place.
Post summary (feel free to suggest edits!): The author gives examples where their internal mental model suggested one conclusion, but a low-information heuristic like expert or market consensus differed, so they deferred. These included:
Another common case of this principle is assuming something won’t work in a particular case because the stats for the general case are bad (e.g. ‘90% of startups fail - why would this one succeed?’), or assuming something will happen similarly to past situations.
Because the largest impact comes from outlier situations, outperforming these heuristics is important. The author suggests that for important decisions people should build a gears-level model of the decision, put substantial time into building an inside view, and use heuristics to stress-test those views. They also suggest being ambitious, particularly when the upside is high and the downside low.
Cheers, edited :)
Post summary (feel free to suggest edits!): The Diplomacy AI got a handle on the basics of the game, but didn’t ‘solve’ it. It mainly does well by avoiding common mistakes, e.g. failing to communicate with victims (thus signaling intention), or forgetting the game ends after the year 1908. It also benefits from anonymity, one-shot games, short round limits, etc.
Some things were easier than expected, e.g. defining the problem space; communications being generic, simple, and quick enough to easily imitate and even surpass humans; no reputational or decision-theoretic considerations; and being able to respond to the existing metagame without it responding to you. Others were harder, e.g. the tactical and strategic engines being lousy (relative to what the author would have expected).

Overall, the author did not on net update much on the Diplomacy AI news in any direction, as nothing was too shocking and the surprises often canceled out.
Post summary (feel free to suggest edits!): In chatting with ChatGPT, the author found it contradicted itself and its previous answers. For instance, it said that orange juice would be a good non-alcoholic substitute for tequila because both were sweet, but when asked if tequila was sweet it said it was not. When further quizzed, it apologized for being unclear and said “When I said that tequila has a "relatively high sugar content," I was not suggesting that tequila contains sugar.”
This behavior is worrying because the system has the capacity to produce convincing, difficult-to-verify, completely false information. Even if this exact pattern is patched, others will likely emerge. The author guesses it produced the false information because it was trained to give outputs the user would like - in this case a non-alcoholic substitute for tequila in a drink, with a nice-sounding reason.
Post summary (feel free to suggest edits!): Last year, the author wrote up a plan they gave a “better than 50/50 chance” of working before AGI kills us all. This predicted that in 4-5 years, the alignment field would progress from preparadigmatic (unsure of the right questions or tools) to having a general roadmap and toolset.
They believe this is on track, and give 40% likelihood that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets - with interpretability on the experimental side, in addition to theoretical work. This could lead to identifying which potential alignment targets (like human values, corrigibility, Do What I Mean, etc.) are likely to be naturally expressible in the internal language of neural nets, and how to express them. They think we should then focus on those.

In their personal work, they’ve found theory work faster than expected, and crossing the theory-practice gap mildly slower. In 2022 most of their time went into theory work like the Basic Foundations sequence, workshops and conferences, training others, and writing up intro-level arguments on alignment strategies.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)