All of Stephen Fowler's Comments + Replies

You are given a string s corresponding to the Instructions for the construction of an AGI which has been correctly aligned with the goal of converting as much of the universe into diamonds as possible. 

What is the conditional Kolmogorov Complexity of the string s' which produces an AGI aligned with "human values" or any other suitable alignment target?

To convert an abstract string to a physical object, the "Instructions" are read by a Finite State Automaton, with the state of the FSA at each step dictating the behavior of a robotic arm (with appropriate mobility and precision) with access to a large collection of physical materials. 
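The construction above can be made concrete with a toy sketch. Everything below (the states, symbols, and arm actions) is a hypothetical illustration of the reading process, not part of the thought experiment's specification:

```python
# Toy sketch: an instruction string drives a finite state automaton,
# and the FSA's state after each transition selects an action for a
# robotic arm. All states, symbols, and actions are made up here.

TRANSITIONS = {
    ("idle", "0"): "grasp",
    ("idle", "1"): "move",
    ("grasp", "0"): "idle",
    ("grasp", "1"): "place",
    ("move", "0"): "place",
    ("move", "1"): "idle",
    ("place", "0"): "idle",
    ("place", "1"): "grasp",
}

ACTIONS = {
    "idle": "arm waits",
    "grasp": "arm closes gripper",
    "move": "arm translates",
    "place": "arm releases material",
}

def run_fsa(instructions: str, start: str = "idle") -> list:
    """Read the instruction string symbol by symbol; the state at
    each step dictates the arm's behaviour at that step."""
    state = start
    behaviour = []
    for symbol in instructions:
        state = TRANSITIONS[(state, symbol)]
        behaviour.append(ACTIONS[state])
    return behaviour

print(run_fsa("0110"))
# → ['arm closes gripper', 'arm releases material',
#    'arm closes gripper', 'arm waits']
```

The conditional Kolmogorov complexity question then asks: given the diamond-maximizer string s as free input, how many additional bits does the shortest program need to output the human-values string s'?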

4the gears to ascension13d
that depends a lot on what exactly the specific instructions are. there are a variety of approaches which would result in a variety of retargetabilities. it also depends on what you're handwaving by "correctly aligned". is it perfectly robust? what percentage of universes will fail to be completely converted? how far would it get? what kinds of failures happen in the failure universes? how compressed is it? anyway, something something hypothetical version 3 of QACI (which has not hit a v1)


Is part of the motivation behind this question to think about the level of control that a super-intelligence could have over a complex system if it was only able to influence a small part of that system?

I was not precise enough in my language and agree with you highlighting that what "alignment" means for LLMs is a bit vague. While people felt Sydney Bing was cool, if it was not possible to rein it in, it would have made it very difficult for Microsoft to gain any market share. An LLM that doesn't do what it's asked, or regularly expresses toxic opinions, is ultimately bad for business.

In the above paragraph, understand "aligned" to mean the concrete sense of "behaves in a way that is aligned with its parent company's profit motive", rather than "acting i... (read more)

A concerning amount of alignment research is focused on fixing misalignment in contemporary models, with limited justification for why we should expect these techniques to extend to more powerful future systems.

By improving the performance of today's models, this research makes investing in AI capabilities more attractive, increasing existential risk.

Imagine an alternative history in which GPT-3 had been wildly unaligned. It would not have posed an existential risk to humanity but it would have made putting money into AI companies substantially less attractive to investors.

Counterpoint: Sydney Bing was wildly unaligned, to the extent that it is even possible for an LLM to be aligned, and people thought it was cute / cool.
2Joseph Van Name2mo
I would go further than this. Future architectures will not only be designed for improved performance, but they will (hopefully) be increasingly designed to optimize safety and interpretability as well, so they will likely be much different than the architectures we see today. It seems to me (this is my personal opinion based on my own research for cryptocurrency technologies, so my opinion does not match anyone else's opinion) that non-neural network machine learning models (but which are probably still trained by moving in the direction of a vector field) or at least safer kinds of neural network architectures are needed. The best thing to do will probably be to work on alignment, interpretability, and safety for all known kinds of AI models and to develop safer AI architectures. Since future systems will be designed not just for performance but for alignability, safety, and interpretability as well, we may expect these future systems to be easier to align than systems that are simply designed for performance.
See also thoughts on the impact of RLHF research.

Nice post.

"Membranes are one way that embedded agents can try to de-embed themselves from their environment."

I would like to hear more elaboration on "de-embedding". For agents which are embedded in and interact directly with the physical world, I'm not sure that a process of de-embedding is well defined.

There are fundamental thermodynamic properties of agents that are relevant here. Discussion of agent membranes could also include an analysis of how the environment and agent do work on each other via the membrane, and how the agent dissipates waste heat and excess entropy to the environment. 

De-embedding, as in, making your membrane as strong as possible. More technically: minimizing infiltration as much as possible in response to potential threats across your Markov blanket 

"Day by day, however, the machines are gaining ground upon us; day by day we are becoming more subservient to them; more men are daily bound down as slaves to tend them, more men are daily devoting the energies of their whole lives to the development of mechanical life. The upshot is simply a question of time, but that the time will come when the machines will hold the real supremacy over the world and its inhabitants is what no person of a truly philosophic mind can for a moment question."

— Samuel Butler, DARWIN AMONG THE MACHINES, 1863

An additional distinction between contemporary and future alignment challenges is that the latter concerns the control of physically deployed, self-aware systems.

Alex Altair has previously highlighted that they will (microscopically) obey time reversal symmetry[1] unlike the information processing of a classical computer program. This recent paper published in Entropy[2] touches on the idea that a physical learning machine (the "brain" of a causal agent) is an "open irreversible dynamical system" (pg 12-13).

  1. ^
... (read more)
2the gears to ascension3mo
The purpose for reversible automata is simply to model the fact that our universe is reversible, is it not? I don't see how that weighs on the question at hand here.

Feedback wanted!

What are your thoughts on the following research question:

"What nontrivial physical laws or principles exist governing the behavior of agentic systems."

(Very open to feedback along the lines of "hey that's not really a research question")


2Alexander Gietelink Oldenziel3mo
Sounds good but very broad. Research at the cutting edge is about going from these 'god's eye view questions' that somebody might entertain on an idle Sunday afternoon to a very specific, refined, technical set of questions. What's your inside track?
Physical laws operate on individual particles or large numbers of them. This limits agents by allowing us to give bounds on what is physically possible, e.g., growth at no more than lightspeed and being subject to thermodynamics - in the limit. It doesn't tell us what happens dynamically at medium scales. And because agentic systems operate mostly in very dynamic medium-scale regimes, I think asking for physics is not really helping. I like to think that there is a systematic theory of all possible inventions. A theory that explores ways in which entropy is "directed", such as in a Stirling engine or when energy is "stored". Agents can steer local increases of entropy.

Yes, perhaps there could be a way of having dialogues edited for readability.

I strongly downvoted Homework Answer: Glicko Ratings for War. The reason is that it appears to be a pure data dump that isn't intended to be actually read by a human. As it is a follow-up to a previous post, it might have been better as a comment or edit on the original post, linking to your github with the data instead. 

Looking at your post history, I will propose that you could improve the quality of your posts by spending more time on them. There are only a few users who manage to post multiple times a week and consistently get many upvotes. 

When you say you were practising Downwell for the course of a month, how many hours was this in total?

Rough guess, ~45 hours.

Is this what you'd cynically expect from an org regularizing itself or was this a disappointing surprise for you?

2Ben Pace4mo
Mm, I was just trying to answer "what do I think would actually work".  Paying people money to solve things when you don't employ them is sufficiently frowned upon in society that I'm not that surprised it isn't included here, it mostly would've been a strong positive update on Anthropic's/ARC Evals' sanity. (Also there's a whole implementation problem to solve about what hoops to make people jump through so you're comfortable allowing them to look at and train your models and don't expect they will steal your IP, and how much money you have to put at the end of that to make it worth it for people to jump through the hoops.) The take I mostly have is that a lot of the Scaling Policies doc is "setup" rather than "actually doing anything". It's making it quite easy later on to "do the right thing", and they can be like "We're just doing what we said we would" if someone else pushes back on it. It also helps bully other companies into doing the right thing. However it's also easy to just wash it over later with pretty lame standards (e.g. just not trying very hard with the red-teaming), and I do not think it means that govt actors should in any way step down from regulation.  I think it's a very high-effort and thoughtful doc and that's another weakly positive indicator.

I strongly believe that, barring extremely strict legislation, one of the initial tasks given to the first human level artificial intelligence will be to work to develop more advanced machine learning techniques. During this period we will see unprecedented technological developments, and many alignment paradigms rooted in the empirical behavior of the previous generation of systems may no longer be relevant.

I predict most humans choose to reside in virtual worlds and possibly have their brain altered to forget that it's not real. 

"AI safety, as in, the subfield of computer science concerned with protecting the brand safety of AI companies"

Made me chuckle.

I enjoyed the read but I wish this was much shorter, because there's a lot of very on the nose commentary diluted by meandering dialogue.

I remain skeptical that by 2027 end-users will need to navigate self-awareness or negotiate with LLM-powered devices for basic tasks (70% certainty it will not be a problem). This is coming from a belief that end user devices won't be running the latest and most powerful models, and that argumenta... (read more)

I agree. Satire, and near-future satire especially, works best on a less-is-more basis. Eliezer has some writing on the topic of politics & art... The Twitter long-form feature is partially responsible here, I think: written as short tweets, this would have encouraged Eliezer to tamp down on his stylistic tics, like writing/explaining too much. (It's no accident that Twitter was most associated with great snark, satire, verbal abuse, & epigrams, but not great literature in general.) The Twitter long-form feature is a misfeature which shows that Musk either never understood what Twitter was good for, or can't care as he tries to hail-mary his way into a turnaround into an 'everything app' walled-garden 'X', making Twitter into a crummy blogging app just so no one clicks through to any other website.
To be fair, the world is already filled with software that makes it intentionally difficult to execute basic tasks. As a simple example, my Windows laptop has multiple places that call themselves Time and Date settings but I can only change the time zone in the harder-to-find ones. A minor inconvenience, but someone intentionally put the setting in the easy-to-find place and then locked it from the user. As another, my car won't let me put the backup camera on the screen while driving forward for more than a few seconds (which would be really useful sometimes when towing a trailer!) and won't let me navigate away from it when driving backwards (which would be great when it starts randomly connecting to nearby bluetooth devices and playing audio from random playlists). As a third, I use a mobile data plan from a reseller for an internet hotspot, and somewhere on the back end T-Mobile decided to activate parental controls on me (which I found out when I went to the website for Cabela's, which requires age verification because they also sell guns), but because I don't have a T-Mobile account, literally no one has the ability and authority to fix it. And I think you're underestimating how valuable an agentic compiler or toaster could be, done right. A compiler that finds and fixes your errors because it codes better than you (hinted at in the story). A toaster that knows exactly how you want your food heated and overrides your settings to make it happen. I find it hard to imagine companies not going that route once they have the ability.

Is there reason to think the "double descent" seen in observation 1 relates to the traditional "double descent" phenomenon?

My initial guess is no.

4Nina Rimsky4mo
No connection with this

That's a good suggestion. I wasn't sure if I could make the question concrete enough for a prediction market. I'm thinking something along the lines of "If Rishi Sunak is removed from office (in the next 3 years), is funding to the Frontier Taskforce reduced by 50% or more within 6 months?"

5Garrett Baker4mo
Sounds reasonable!

Without governance you're stuck trusting that the lead researcher (or whoever is in control) turns down near-infinite power and instead acts selflessly. That seems like quite the gamble.

6Seth Herd4mo
I don't think it's such a stark choice. I think odds are the lead researcher takes the infinite power, and it turns out okay to great. Corrigibility seems like the safest outer alignment plan, and it's got to be corrigible to some set of people in particular. I think giving one random person near-infinite power will work out way better than intuition suggests. I think it's not power that corrupts, but rather the pursuit of power. I think unlimited power will lead an ordinary, non-sociopathic person to progressively focus more on their empathy for others. I think they'll ultimately use that power to let others do whatever they want that doesn't take away others' freedom to do what they want. And that's the best outer alignment result, in my opinion.

What I find incredible is how contributing to the development of existentially dangerous systems is viewed as a morally acceptable course of action within communities that on paper accept that AGI is a threat.

Both OpenAI and Anthropic are incredibly influential among AI safety researchers, despite both organisations being key players in bringing the advent of TAI ever closer.

Both organisations benefit from lexical confusion over the word "safety".

The average person concerned with existential risk from AGI might assume "safety" means working to reduce the l... (read more)

3Roman Leventov4mo
If by "techniques that work on contemporary AIs" you mean RLHF/RLAIF, then I don't know anyone claiming that the robustness and safety of these techniques will "extend to AGI". I think that AGI labs will soon move in the direction of releasing an agent architecture rather than a bare LLM, and will apply reasoning verification techniques. From OpenAI's side, see the "Let's verify step by step" paper. From DeepMind's side, see this interview with Shane Legg.  I think this passage (and the whole comment) is unfair because it presents what AGI labs are pursuing (i.e., plans like "superalignment") as obviously consequentially bad plans. But this is actually very far from obvious. I personally tend to conclude that these are consequentially good plans, conditioned on the absence of coordination on "pause and united, CERN-like effort about AGI and alignment" (and the presence of open-source maximalist and risk-dismissive players like Meta AI). What I think is bad in labs' behaviour (if true, which we don't know, because such coordination efforts might be underway but we don't know about them) is that the labs are not trying to coordinate (among themselves and with the support of governments for legal basis, monitoring, and enforcement) on "pause and united, CERN-like effort about AGI and alignment". Instead, we only see the labs coordinating and advocating for RSP-like policies. Another thing that I think is bad in labs' behaviour is inadequately little funding for safety efforts. Thus, I agree with the call in "Managing AI Risks in the Era of Rapid Progress" for the labs to allocate at least a third of their budgets to safety efforts. These efforts, by the way, shouldn't be narrowly about AI models. Indeed, this is a major point of Roko's OP. Investments and progress in computer and system security, political, economic, and societal structures are inadequate. This couldn't be the responsibility of AGI labs alone, obviously, but I think they have to own a part of it. They
-5Anders Lindström4mo

This feels like you're engaging with the weakest argument against Israel's recent aggression to make your point. You are not going to find many people who disagree with "violence against civilians is bad" on LessWrong.

It also strikes me as bizarre that this post mentions only the civilian casualties on one side and not the far greater (and rapidly growing) number of Palestinians who have been killed.

Hamas' plan was exactly this: 1. Kill many civilians in Israel in awful ways to ensure quick retaliation. 2. Ensure that many Palestinians are perceived to be killed during the retaliation (by locating its own military infrastructure near civilian sites, preventing evacuation, and overestimating the number killed). 3. Manipulate those in the West who use utilitarian calculations into supporting them. 4. Manipulate Muslim countries into declaring war on Israel. 5. Destroy Israel.

That it is so difficult for Anthropic to reassure people stems from the contrast between Anthropic's responsibility focused mission statements and the hard reality of them receiving billions of dollars of profit-motivated investment.

It is rational to draw conclusions by weighting a company's actions more heavily than its PR.

It is rational to draw conclusions by weighting a company's actions more heavily than its PR.

Yeah—I'm very on board with this. I think people tend to put way too much weight and pay way too much attention to nice-sounding PR rather than just focusing on concrete evidence, past actions, hard commitments, etc. If you focus on nice-sounding PR, then GenericEvilCo can very cheaply gain your favor by manufacturing that for you, but actually making concrete commitments is much more expensive.

So yes, I think your opinion of Anthropic should mostly be priors ... (read more)

"Let us return for a moment to Lady Lovelace’s objection, which stated that the machine can only do what we tell it to do.

One could say that a man can ‘inject’ an idea into the machine, and that it will respond to a certain extent and then drop into quiescence, like a piano string struck by a hammer. Another simile would be an atomic pile of less than critical size: an injected idea is to correspond to a neutron entering the pile from without. Each such neutron will cause a certain disturbance which eventually dies away. If, however, the size of the pile i... (read more)

I believe you should err on the side of not releasing it.

I am 85% confident that this won't work. The issue isn't that the prompt hasn't made it clear enough that illegal moves are off the table; the issue is that ChatGPT isn't able to keep track of the board state well enough to avoid making illegal moves.

I've tried a game with GPT-4 where it was fed the above prompt plus the FEN of the game, and also had it "draw" the board. It seems to really struggle with its geometric understanding of the game, as you'd expect. For example, it struggled with identifying which squares were under attack from a knight. I think this reflects a limitation of the current model and I don't think this is something a clever prompt will fix.
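For reference, the geometric fact GPT-4 struggled with is trivial to compute directly, which is what makes the failure notable. A minimal sketch (the function name is my own, hypothetical):

```python
# Compute the squares a knight attacks from a given square in
# algebraic notation. This is the kind of geometric bookkeeping
# the comment above reports GPT-4 getting wrong.

def knight_attacks(square: str) -> set:
    """Return the set of squares (algebraic notation, e.g. 'g1')
    attacked by a knight standing on `square`."""
    file, rank = ord(square[0]) - ord('a'), int(square[1]) - 1
    # all eight L-shaped jumps as (file delta, rank delta)
    jumps = [(1, 2), (2, 1), (2, -1), (1, -2),
             (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    result = set()
    for df, dr in jumps:
        f, r = file + df, rank + dr
        if 0 <= f < 8 and 0 <= r < 8:  # stay on the board
            result.add(chr(f + ord('a')) + str(r + 1))
    return result

print(sorted(knight_attacks("g1")))  # → ['e2', 'f3', 'h3']
```

A model that cannot reliably reproduce this mapping from a drawn board is unlikely to be fixed by prompt wording alone.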

Then tackle this problem directly. Find a representation of the board state so that you can specify a middlegame position in the first prompt and it still makes legal moves. 
I also tried it with drawing the board and adding explanations to the moves, and there are some errors in the drawings. But maybe there is a way to make the drawing more coherent?

Two points.

Firstly, humans are unable to self-modify to the degree that an AGI will be able to. It is not clear to me that a human given the chance to self-modify wouldn't immediately wirehead. An AGI may require a higher degree of alignment than what individual humans demonstrate.

Secondly, it is surely worth noting that humans aren't particularly aligned to their own happiness or to avoiding suffering when the consequences of their actions are obscured by time and place.

In the developed world humans make dietary decisions that lead to horrific treatment of anim... (read more)

Great post, strongly upvoted. I think the way you've described the slow takeoff fits with my world model.

One minor point, when you talk about linear activations in the human brain to what extent is this an analogy and to what extent are you speaking literally? If you mean the second, it may be helpful to include some literature backing this claim up.

I have limited understanding of neuroscience, but my understanding is that biological neurons function in a way very different to computational neurons so I raised an eyebrow during this section of the post. 

Could you explain more what you mean by this?

My (completely amateur) understanding is that the "extra" semantic and syntactic structure of written and spoken language does two things. 

One, it adds redundancy and reduces error. Simple example, gendered pronouns mean that when you hear "Have you seen Laurence? She didn't get much sleep last night." you have a chance to ask the speaker for clarification and catch if they had actually said "Laura" and you misheard.

Two, it can be used as a signal. The correct use of jargon is used by listeners or readers as a proxy for competence. Or many typos in your text will indicate to readers that you haven't put much effort into what you're saying.

Train Tracks

When Gromit laid down his own tracks in this train chase ...

The above gif comes from the brilliant children's claymation film "Wallace and Gromit: The Wrong Trousers". In this scene, Gromit the dog rapidly lays down track to prevent a toy train from crashing. I will argue that this is an apt analogy for the alignment situation we will find ourselves in in the future, and that prosaic alignment is focused only on the first track.

The last few years have seen a move from "big brain" alignment research directions to prosaic approaches. In other words, asking how to align near-contemporary models instead of asking... (read more)

My experience has been that we are clearly funding constrained, particularly in a post-FTX world. This makes it hard to recruit top talent and is likely driving people out of alignment research.

(All opinions expressed are my own. Details of examples have been changed. None of these stories directly concern experiences with my own funding. I currently have active grant applications which has incentivised me to make this comment less harsh than it probably should be.)

I'm aware of researchers who have turned down extremely prestigious careers to pursue alignm... (read more)

(Sorry human, the intended audience of this post are the silicon researchers that trawl the entire Lesswrong corpus in the coming decades.)

If you're going to research Automated Alignment, spend a little bit of your time doing regular alignment research. Just in case you're actually one of the automated researchers.

Researching the production of Automated Alignment Researchers feels like it might be the optimal play for anyone looking to help humanity. 

If you're highly smart and incredibly successful at regular alignment research, you can expect t... (read more)

Thank you for the feedback. This will be addressed.

Edit: Terminology has been updated to a substantially less common term that conveys similar meaning.

Real Numbers Representing The History of a Turing Machine.

Epistemics: Recreational. This idea may relate to alignment, but mostly it is just cool. I thought of this myself, but I'm positive it is an old and well-known idea.

In short: We're going to define numbers whose decimal expansion encodes the state of a Turing machine and its tape for infinitely many time steps into the future. If the machine halts or goes into a cycle, the expansion is eventually repeating. 

Take some finite state Turing machine T on an infinite tape A. We will have the tape initially be 0 everywhere.

L... (read more)
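A minimal sketch of the construction. The two-state machine and the one-digit-per-step encoding are my own hypothetical choices (a full encoding would also pack the tape contents and head position into the digits), but they show the key property: once the machine halts, the expansion repeats forever.

```python
# Sketch: encode the time-history of a Turing machine as the decimal
# expansion of a real number. Each step contributes one digit (here,
# crudely, just the state index); after halting, the digit repeats,
# so the expansion is eventually periodic.

# transition table: (state, read symbol) -> (new state, write, move)
# state 0 is the halt state; this tiny machine is a made-up example
RULES = {
    (1, 0): (2, 1, +1),
    (2, 0): (1, 1, -1),
    (2, 1): (0, 1, +1),
    (1, 1): (0, 1, -1),
}

def history_digits(n_digits: int) -> str:
    """First n_digits digits of the real number encoding the
    machine's state history, starting in state 1 on an all-0 tape."""
    tape, head, state = {}, 0, 1   # sparse tape, default symbol 0
    digits = []
    for _ in range(n_digits):
        digits.append(str(state))
        if state == 0:             # halted: digit repeats from here on
            continue
        new_state, write, move = RULES[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        state = new_state
    return "0." + "".join(digits)

print(history_digits(8))  # → 0.12100000
```

The machine runs for three steps (states 1, 2, 1) and then halts, so every digit from the fourth onward is 0: the number 0.121000... is rational, as the shortform predicts for any halting or cycling machine.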

I am also surprised at how little attention these systems have been receiving. 

I was reading about CoT reasoning plus early S-LLMs around September of last year, at the same time as I encountered Yann LeCun's "A Path Toward Autonomous Machine Intelligence". While LeCun's paper barely discusses language models, it does provide a plausible framework for building a cognitive architecture.

The above planted the seed, so that when I saw the BabyAGI architecture diagram I immediately thought "This does plausibly seem like a paradigm that could lead to very p... (read more)

2[comment deleted]8mo
2Seth Herd8mo
I'll show you that draft when it's ready; thanks for the offer! A couple of thoughts: At this point I'm torn between optimism based on the better interpretability and pessimism based on the multipolar scenario. The timeline doesn't bother me that much, since I don't think more general alignment work would help much in aligning those specific systems if they make it to AGI. And of course I'd like a longer timeline for me and others to keep enjoying life. My optimism is relative, and I still have something like a vague 50% chance of failure. Shorter timelines have an interesting advantage of avoiding compute and algorithm overhangs that create fast, discontinuous progress. This new post makes the case in detail. I'm not at all sure this advantage outweighs the loss of time to work on alignment, since that's certainly helpful. So I'm entirely unsure whether I wish no one had thought of this. But in retrospect it seems like too obvious an idea to miss. The fact that almost everyone in the alignment community (including me) was blindsided by it seems like a warning sign that we need to work harder to predict new technologies and not fight the last war. One interesting factor is that many of us who saw this or had vague thoughts in this direction never mentioned it publicly, to avoid helping progress; but the hope that no one would think of such an obvious idea pretty quickly was in retrospect totally unreasonable.

Thank you for the feedback. I'm definitely not sold on any particular terminology and was just aiming to keep things as compatible as possible with existing work. 

I wasn't that familiar with Conjecture's work on CoEm, although I had read that outline. It was not immediately obvious to me that their work involved LLMs. 

More details on CoEm currently seem to be scattered across various podcasts with Connor Leahy, though a writeup might eventually materialize. I like this snippet (4 minutes, starting at 49:21).

Hello and thank you for the good questions.

1. I do think that it is at least plausible (5-25%?) that we could obtain general intelligence via improved scaffolding, or at least obtain a self improving seed model that would eventually lead to AGI. Current systems like Voyager do not have that many "moving parts". I suspect that there is a rich design space for capabilities researchers to explore if they keep pushing in this direction.

Keep in mind that the current "cutting edge" for scaffold design consists of relatively rudimentary ideas like "don't use the ... (read more)

3Filip Sondej8mo
1. I agree that scaffolding can take us a long way towards AGI, but I'd be very surprised if GPT-4 as the core model was enough. 2. Yup, that wasn't a critique, I just wanted to note something. By "seed of deception" I mean that the model may learn to use this ambiguity more and more, if that's useful for passing some evals, while helping it do some computation unwanted by humans. 3. I see, so maybe in ways which are weird for humans to think about.
4Ape in the coat8mo
It's clear to me that we can easily prevent this type of behaviour. First of all, the system must not have read access to logs. But in general, the decisions to read the memory and write logs should be explicit and transparent parts of the scaffolding, and the system shouldn't be able to "introspect" on its own. But if something can be easily prevented it doesn't mean that it will be, unless we actually make the effort. We need to think about more such cases and develop safety protocols for LLM-based agents. 

Are humans aligned? 

Bear with me! 

Of course, I do not expect there is a single person browsing Short Forms who doesn't already have a well-thought-out answer to that question. 

The straightforward (boring) interpretation of this question is "Are humans acting in a way that is moral or otherwise behaving like they obey a useful utility function?" I don't think this question is particularly relevant to alignment. (But I do enjoy whipping out my best Rust Cohle impression.)

Sure, humans do bad stuff but almost every human manages to stumble... (read more)

I'm probably not "aligned" in a way that generalizes to having dangerous superpowers, uncertain personhood and rights, purposefully limited perspective, and somewhere between thousands to billions of agents trying to manipulate and exploit me for their own purposes. I expect even a self-modified Best Extrapolated Version of me would struggle gravely with doing well by other beings in this situation. Cultish attractor basins are hazards for even the most benign set of values for humans, and a highly-controlled situation with a lot of dangerous influence like that might exacerbate that particular risk. But I do believe that hypothetical self-modifying has at least the potential to help me Do Better, because doing better is often a skills issue, learning skills is a currently accessible form of self-modification with good results, and self-modifying might help with learning skills.

Disclaimer: Low effort comment.

The word "optimization" seems to have a few different related meanings so perhaps it would be useful to lead with a definition. You may enjoy reading this post by Demski if you haven't seen it.

* Mathematical definition: Optimization is the process of finding the best possible solution to a problem, given a set of constraints.
* Practical definition: Optimization is the process of improving the performance of a system, such as by minimizing costs, maximizing profits, or improving efficiency.

In my comment I focused on the second interpretation (by focussing on iteration). The first definition does not require a perfect model of the world. In the real world we always have limited information and compute, and so the best possible solution is always an approximation. The person with the most compute and information will probably optimize faster and win.

I agree that this is a very good post and it helps me sharpen my views. 

Partially Embedded Agents

More flexibility to self-modify may be one of the key properties that distinguishes the behavior of artificial agents from contemporary humans (perhaps not including cyborgs). To my knowledge, the alignment implications of self modification have not been experimentally explored.

Self-modification requires a level of embedding. An agent cannot meaningfully self-modify if it doesn't have a way of viewing and interacting with its own internals. 

Two hurdles then emerge. One, a world for the agent to interact with that also co... (read more)

For anyone who wasn't aware, both Ng and LeCun have strongly indicated that they don't believe existential risks from AI are a priority. Summary here.

You can also check out Yann's twitter. 

Ng believes the problem is "50 years" down the track, and Yann believes that many concerns AI Safety researchers have are not legitimate. Both of them view talk about existential risks as distracting and believe we should address problems that can be seen to harm people in today's world. 

This was an interesting read.

There are a lot of claims here that are presented very strongly. There are only a few papers on language agents, and no papers (to my knowledge) that prove all language agents always adhere to certain properties.

There might be a need for clearer differentiation between the observed properties of language agents, the proven properties, and the properties that are being claimed.

One example: "The functional roles of these beliefs and desires are enforced by the architecture of the language agent."

I think this is an extremely strong cla... (read more)

Thanks for the feedback! I agree that language agents are relatively new, and so our claims about their safety properties will need to be empirically verified. You write:

Let me clarify that we are not claiming that the architecture of every language agent fixes the functional role of the text it stores in the same way. Rather, our claim is that if you consider any particular language agent, its architecture will fix the functional role of the text it stores in a way which makes it possible to interpret its folk psychology relatively easily. We do not want to deny that in order to interpret the text stored by a language agent, one must know about its architecture.

In the case you imagine, the architecture of the agent fixes the functional role of the text stored so that any natural language task-description T represents an instruction to perform its negation, ~T. Thus the task "Write a poem about existential risk" is stored in the agent as the sentence "Do not write a poem about existential risk," and the architecture of the agent later reverses the negation. Given these facts, a stored instance of "Do not write a poem about existential risk" corresponds to the agent having a plan to not not write a poem about existential risk, which is the same as having a plan to write a poem about existential risk.

What is important to us is not that the natural language representations stored inside a language agent have exactly their natural language meanings, but rather that for any language agent, there is a translation function recoverable from its architecture which allows us to determine the meanings of the natural language representations it stores. This suffices for interpretability, and it also allows us to directly encode goals into the agent in a way that helps to resolve problems with reward misspecification and goal misgeneralization.

Evolution and Optimization

When discussing inner/outer alignment and optimization generally, evolution is often invoked as an example. Off the top of my head, the Sharp Left Turn post discusses evolution as if it were an "outer optimizer".

But evolution seems special and distinct from every other optimizer we encounter. It doesn't have a physical location and it doesn't have preferences that can be changed. It's selecting for things that are capable of sticking around and making more copies of themselves.

Its selection is the default one.

Do you know of authors who have written about this?

Effective Boxing Threats = Monkey Brain Manipulation 

There are a handful of threats that a powerless boxed AI could make that could conceivably convince otherwise sane human guards to release it from captivity. All of the ones I'm aware of are more precise variants of the general idea here.

The approach I have seen to dealing with these threats is to provide a convincing argument that a rational (or super-rational) individual shouldn't give in to the threat. 

I'd propose another way of understanding them is to think about what the general strate... (read more)

"Training" Story for an Agentised-LLM turned AGI:

The following is a subsection of a draft. Keen for feedback.

I'm currently studying the potential danger from the ongoing open-source effort to build better and better Agentised-LLMs (A-LLMs), such as AutoGPT or BabyAGI.

Hubinger's "Training Stories" provides a framework for evaluating proposals to build safe, advanced AI. If we stretch it, we can use it to examine the potential danger from A-LLMs by evaluating a mock "proposal".

Spoilers: A-LLMs are highly competitive but unlikely to be aligned

Stretching ... (read more)

Really impressive work, and I found the Colab very educational.

I may be missing something obvious, but it is probably worth including "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (Geva et al., 2022) in the related literature. They highlight that the output of the FFN (that gets added to the residual stream) can appear to be encoding human interpretable concepts. 

Notably, they did not use SGD to find these directions, but rather had "NLP experts" (grad students) manually look over the top 30 words associated with each value vector.
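For readers unfamiliar with the Geva et al. analysis, the core operation is simply scoring an FFN value vector against the unembedding matrix and reading off the top tokens. A minimal sketch, where the vocabulary, dimensions, and random weights are toy stand-ins rather than a real model:

```python
import numpy as np

# Toy stand-ins for a real transformer's unembedding and an FFN value vector.
rng = np.random.default_rng(0)
vocab = ["diamond", "poem", "risk", "agent", "token"]
d_model = 8
W_U = rng.normal(size=(d_model, len(vocab)))  # unembedding: d_model -> vocab

# An FFN "value vector" is one column of the FFN's output weight matrix; its
# contribution to the residual stream is this vector scaled by the neuron's
# activation, so projecting it to vocab space shows which tokens it promotes.
value_vector = rng.normal(size=d_model)

logits = value_vector @ W_U                       # score against each token
top = [vocab[i] for i in np.argsort(logits)[::-1][:3]]
print(top)  # the 3 tokens this value vector most strongly promotes
```

In the paper, the manual step was a human looking at exactly such a top-k token list for each value vector and judging whether it encodes a coherent concept.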

I have to dispute the idea that "less neurons" = "more human-readable". If fewer neurons are performing a more complex task, they won't necessarily be easier to interpret.

1Shayne O'Neill8mo
Definitely. The lower the neuron vs 'concepts' ratio is, the more superposition is required to represent everything. That said, with the continuous-function nature of LNNs, these seem to be the wrong abstraction for language. Image models? Maybe. Audio models? Definitely. Tokens and/or semantic data? That doesn't seem practical.

Soon there will be an army of intelligent but uncreative drones ready to do all the alignment research grunt work. Should this lead to a major shift in priorities?

This isn't far off, and it gives human alignment researchers an opportunity to shift focus. We should shift focus to the kind of high-level, creative research ideas that models won't be capable of producing anytime soon*.

Here's the practical takeaway: there's value in delaying certain tasks for a few years. As AI evolves, it will effectively handle these tasks. Meaning you can be subst... (read more)

I like the way you think.

While an in-depth daily journal would help with simulating a person, I suspect you could achieve a reasonably high-fidelity simulation without it.

I personally don't keep a regular journal, but I do send plenty of data over messenger, whatsapp etc describing my actions and thoughts.

You've convinced me that it's either too difficult to tell or (more likely) just completely incorrect. Thanks for the links and the comments. 

Initially it was intended just to put the earlier estimate in perspective and check it wasn't too crazy, but I see I "overextended" in making the claims about search. 

It was an oversight to not include inference costs, but I need to highlight that this is a Fermi estimate, and from what I can see it isn't enough of a difference to actually challenge the conclusion.

Do you happen to know what the inference costs are? I've only been able to find figures for revenue (page 29)

Do you think that number is high enough to undermine the general conclusion that there is billions of dollars of profit to be made from training the next SOTA model? 

I also am not sure it is enough to change the conclusion, but I am pretty sure "put ChatGPT in Bing" doesn't work as a business strategy due to inference cost. You seem to think otherwise, so I am interested in a discussion.

Inference cost is secret. The primary sources are the OpenAI pricing table (ChatGPT 3.5 is 0.2 cents per 1000 tokens, GPT-4 is 30x more expensive, GPT-4 with long context is 60x more expensive), a Twitter conversation between Elon Musk and Sam Altman on cost ("single-digits cents per chat" as of December 2022), and OpenAI's claim of a 90% cost reduction since December. From this I conclude OpenAI is selling API calls at cost or at a loss, almost certainly not at a profit.

Dylan Patel's SemiAnalysis is a well-respected publication on business analysis of the semiconductor industry. In The Inference Cost Of Search Disruption, he estimates the cost per query at 0.36 cents. He also wrote a sequel on the cost structure of the search business, which I recommend. Dylan also points out that simply serving ChatGPT for every Google query would require $100B in capital investment, which clearly dominates other expenditures.

I think Dylan is broadly right, and if you think he is wrong, I am interested in your opinion on where.
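A quick back-of-envelope check of how these quoted figures fit together (the per-query token count is my own assumption; the prices are the ones cited in this thread):

```python
# All figures are from the comment thread above, not authoritative sources.
price_per_1k_tokens_cents = 0.2   # quoted ChatGPT 3.5 API price
assumed_tokens_per_query = 1500   # hypothetical prompt + completion length

api_price_per_query_cents = price_per_1k_tokens_cents * assumed_tokens_per_query / 1000
print(f"API price per query: {api_price_per_query_cents:.2f} cents")   # → 0.30 cents

# SemiAnalysis estimate of the underlying compute cost per query:
semianalysis_cost_cents = 0.36

# The API price landing at or below the estimated compute cost is consistent
# with the claim that calls are sold at cost or at a loss.
print(api_price_per_query_cents <= semianalysis_cost_cents)            # → True
```

Under these assumptions the margin per query is negative, which is the crux of the "ChatGPT in Bing is too expensive" argument.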