To understand reality, especially on confusing topics, it's important to understand the mental processes involved in forming concepts and using words to speak about them.
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
This is the first great public writeup on model evals for averting existential catastrophe. I think it's likely that if AI doesn't kill everyone, developing great model evals and...
Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:
What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode’s guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.
Topics we discuss:
...Yeah, I've been having difficulty getting Google Podcasts to find the new episode, unfortunately. In the meantime, consider listening on YouTube or Spotify, if those work for you?
Palantir published marketing material for their offering of AI for defense purposes. There's a video of how a military commander could order a military strike on an enemy tank with the help of LLMs.
One of the features that Palantir advertises is:
Agents
Define LLM agents to pursue specific, scoped goals.
Given military secrecy, we hear less about Palantir's technology than we hear about OpenAI, Google, Microsoft, and Facebook, but Palantir is one player, and likely an important one.
I doubt they or the government (or almost anyone) has the talent the more popular AI labs have. Throwing billions of dollars at training these models doesn't really matter if no one there knows how to train them.
When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward?
In particular, can a real language model generate text snippets that were only present in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, even though the only training examples containing these passwords are ones where the model is incentivized to not output them.
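To make the notion of "negatively reinforced" concrete, here is a minimal toy sketch (not the post's actual setup, which uses Pythia-160M): a REINFORCE-style update on a tabular softmax policy, where a "password" token only ever appears in examples with reward -1, pushing its probability down.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, token, reward, lr):
    """One REINFORCE-style update on a tabular policy.

    The gradient of log p(token) with respect to logit i is
    (1[i == token] - p_i); scaling it by the reward means a
    negative reward pushes the token's probability down.
    """
    probs = softmax(logits)
    return [
        l + lr * reward * ((1.0 if i == token else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

# Toy 3-token vocabulary; token 2 plays the role of a "password"
# that only ever appears in negatively-reinforced examples.
logits = [0.0, 0.0, 0.0]
for _ in range(20):
    logits = reinforce_step(logits, token=2, reward=-1.0, lr=0.5)

probs = softmax(logits)
print(probs)  # the negatively-reinforced token's probability falls below 1/3
```

The post's surprising finding is that, despite this kind of pressure, information about the password can still end up extractable from the trained model; the toy above only illustrates the sign of the training signal, not that leakage effect.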
This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this...
This review was originally written for the Astral Codex Ten Book Review Contest. Unfortunately it didn't make it as one of the finalists, but since I made use of the LessWrong proofreading/feedback service, I am reposting it here. It can also be found on my gender blog.
If I ask ChatGPT to explain transgender people to me, then it often retreats into vague discussions of gender identity. It is very hard to get it to explain what these things mean, in terms of actual experiences people might have. And that might not be a coincidence - the concepts used to understand transness seem to be the result of a complicated political negotiation, at least as much as they are optimized to communicate people’s experiences.
Some people claim to do better,...
A while back, I had a conversation with ChatGPT to try to understand the conservative perspective on trans people and it finally managed to stump me when it justified its claims on the basis of religious morality.
I don't think ChatGPT is good at conservatism. 😅 AI ethics STRONK.
...I imagine this is a similar situation -- I don't quite understand how trans women transitioning in part because of autogynephilia is actually relevant for how we should structure society or how one ought to interact with a trans person. After all, cis/het people can make big life d
“Words, like tranquil waters behind a dam, can become reckless and uncontrollable torrents of destruction when released without caution and wisdom.”
— William Arthur Ward
In this post I aim to shed light on lesser-discussed concerns surrounding Scaffolded LLMs (S-LLMs). The core of this post consists of three self-contained discussions, each focusing on a different class of concern. I also review some recent examples of S-LLMs and attempt to clarify terminology.
Discussion I deals with issues stemming from how these systems may be developed.
Discussion II argues that optimism surrounding the internal natural language usage of S-LLMs may be premature.
Discussion III examines the modular nature of S-LLMs and how it facilitates self-modification.
The time-pressed reader is encouraged to skim the introduction and skip to whichever discussion interests them.
The development of S-LLMs is...
Excellent post. Big upvote, and I'm still digesting all of the points you've made. I'll respond more substantively later. For now, a note on possible terminology. I wrote a followup to my brief "agentized LLMs" post, Capabilities and alignment of LLM cognitive architectures, where I went into more depth on capabilities and alignment; I made many but not all of the points you raised. I proposed the term language model cognitive architectures (LMCAs) there, but I'm now favoring "language model agents" as a more intuitive and general term.
The tag someone just appli...
(Crossposted to the EA forum)
The linked paper is our submission to the Open Philanthropy AI Worldviews Contest. In it, we estimate the likelihood of transformative artificial general intelligence (AGI) by 2043 and find it to be <1%.
Specifically, we argue:
Sure. Fwiw I read "delay" and "pause" as stop until it's safe, not stop for a while and resume while the eval result is still concerning, but I agree being explicit would be nice.