All of Victor Levoso's Comments + Replies

This post doesn't make me optimistic that Conjecture will actually pull this off, because for that I would have to see details, but it does at least look like you understand why this is hard and why the easy versions, like just telling GPT-5 to imitate a nice human, won't work. And I like that this actually looks like a plan. Maybe it will turn out not to be a good plan, but at least it is better than OpenAI's plan of "we'll figure out from trial and error how to make the Magic safe somehow".

I think that DG is making a more nitpicky point and just claiming that that specific definition is not feasible, rather than using this as a claim that foom is not feasible, at least in this post. He also claims that elsewhere, but with a different argument about humans being able to make narrow AI for things like strategy (which I think is also wrong). At least that's what I've understood from our previous discussions.

yeah, totally, i'm also just using that post as a jump-off point for a more in-depth long-form discussion about dragon god's beliefs.

So it seems that a lot of people applied to the Understanding Search in Transformers project to do mechanistic interpretability research, and probably a lot of them won't get in.
I think there are a lot of similar projects and potential low-hanging fruit people could work on, and we could probably organize more teams working on similar things.
I’m willing to organize at least one such project myself (specifically, working on trying to figure out how algorithm distillation works) and will talk with Linda about it in 2 we...

About the sampling thing: I think a better way to do it, one that will also work for other kinds of models, would be training a few different models that do better or worse on the task and use different policies, and then making a dataset of sampled trajectories from several of them. That should be cleaner, in terms of knowing what is going on in the training set, than collecting the data as the model trains (which, on the other hand, is actually better for doing AD).

That also has the benefit of letting you study how which agents you use to generate the training da...
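A minimal sketch of the multi-policy sampling idea above, under loud assumptions: the "models" here are stand-in stochastic policies with a fixed skill level rather than trained networks, the environment is a toy one where action 1 gives reward 1 and every other action gives 0, and all names (`make_policy`, `rollout`, `build_dataset`) are hypothetical. The point is only that each trajectory records which policy produced it, so the composition of the training set is known exactly.

```python
import random


def make_policy(p_optimal):
    """Stand-in for a trained model: picks the rewarding action (1)
    with probability p_optimal, otherwise one of the unrewarded actions."""
    def policy(rng):
        return 1 if rng.random() < p_optimal else rng.choice([0, 2])
    return policy


def rollout(policy, horizon, rng):
    """Sample one trajectory in the toy environment (action 1 -> reward 1)."""
    actions = [policy(rng) for _ in range(horizon)]
    rewards = [1.0 if a == 1 else 0.0 for a in actions]
    # Reward-to-go at step t is the sum of rewards from t to the end.
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return {"actions": actions, "rewards": rewards, "rtg": rtg}


def build_dataset(skill_levels, episodes_per_policy, horizon, seed=0):
    """Mix trajectories from several fixed policies, tagging each one
    with the id of the policy that generated it."""
    rng = random.Random(seed)
    dataset = []
    for policy_id, p_optimal in enumerate(skill_levels):
        policy = make_policy(p_optimal)
        for _ in range(episodes_per_policy):
            traj = rollout(policy, horizon, rng)
            traj["policy_id"] = policy_id
            dataset.append(traj)
    rng.shuffle(dataset)
    return dataset


# Three policies of increasing skill, 100 episodes each.
dataset = build_dataset([0.1, 0.5, 0.9], episodes_per_policy=100, horizon=10)
```

Keeping `policy_id` on every trajectory is what would let you vary the mix of agents and see how that changes what the trained model learns.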

2 Joseph Bloom 8mo
Agreed about the better way to sample. I think in the long run this is the way to go. Might make sense to deal with this at the same time we deal with environment learning schedules (learning on a range of environments where the difficulty gets harder over time). I hadn't thought about learning different policies at the same RTG. This could be interesting. Uniform action preference at RTG = 0 makes sense to me, since while getting a high RTG is usually low entropy (there's one or a few ways to do it), there are many ways not to get reward. In practice, I might expect the actions to just match the base frequencies of the training data, which might be uniform.

Oh nice, I was interested in doing mechanistic interpretability on decision transformers myself and had gotten started during SERI MATS, but then became more interested in looking into algorithm distillation, and the decision transformers stuff fell by the wayside (plus I haven't been very productive during the last few weeks, unfortunately). It's too late to read the post in detail today, but I will probably read it in detail and look at the repo tomorrow. I'm interested in helping with this and I'm likely going to be working on some related research in the near fut...

1 Joseph Bloom 8mo
Hi Victor,  Glad you are keen on this area, I'd be very happy to collaborate. I'll respond to your comments here but am happy to talk more after you've read the post.  The linked paper (Can Wikipedia help Offline Reinforcement Learning) is very interesting in a few ways, however, I'd be interested in targeted reasons to investigate this specifically. I think working with larger models is often justified but it might make sense to squeeze more juice out of the small models before moving to larger models. Happy to hear the arguments though.  I'm glad you asked about this. In terms of extrapolation, an earlier model I trained seemed to behave like this.  Analysis techniques like the RTG Scan functionality in the app present ways to explore the mechanisms behind this which I decided not to explore in this post (and possibly in general) for a few reasons: It's not clear to me that this is more than a coincidence. I think it could be that in the space of functions that map RTG to behaviour, for certain games, it is possible to learn coincidentally extrapolating functions. If the model were to develop qualitatively different behaviour (under some definition) in out-of-distribution RTG ranges for any task, then my interest will be renewed.  I suspect doing something like integrated gradients for CNN layers is pretty doable (maybe that's what the MATS shard team have done, see one of Alex Turner's comments) but yeah, they are probably harder to work with.  Thank you for sharing this! I'd be very excited to see attempts to understand these models. I've started with toy models for reasons like simplicity and complete control but I can see many arguments in favour of jumping to larger models. The main challenge I see would be loading the weights into a TransformerLens model so we can get the cache enabling easy analysis. This is likely quite doable.  Interesting!

Another possible update is towards shorter timelines, if you think that humans might not be trained with the optimal amount of data (since we can't, for example, just read the entire internet), and so it might be possible to get better performance with fewer parameters, if you assume the brain has similar scaling laws.
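To make that argument concrete, here is a hedged sketch that assumes the parametric loss form used in the Chinchilla scaling-law paper; the symbols and the assumption that anything like it applies to brains are illustrative and not part of the original comment. With $N$ parameters, $D$ training tokens, and fitted constants $E, A, B, \alpha, \beta > 0$:

```latex
\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
% If the brain sits at (N_b, D_b) with D_b far below the data-optimal value for N_b,
% then a smaller model N' < N_b trained on more data D' > D_b matches or beats it whenever
\[
\frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} \le \frac{A}{N_b^{\alpha}} + \frac{B}{D_b^{\beta}}
\]
```

Under this assumed form, the larger the data term $B / D_b^{\beta}$ is for the brain, the more room there is to shrink $N'$ while compensating with a larger $D'$.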

Not a response to your actual point, but I think that hypothetical example probably doesn't make sense (as in, making the AI not "care" doesn't prevent it from including mindhacks in its plan). If you have a plan that is "superintelligently optimized" for some misaligned goal, then that plan will have to take into account the effect of outputting the plan itself, and will by default contain deception or mindhacks even if the AI doesn't in some sense "care" about executing plans. (Or if you set up some complicated scheme with counterfactuals so the model ignores the ef...

I think I agree, but also, people say things like "the AI should if possible be prevented from not modeling humans", which if possible would imply that the hypothetical example makes more sense.

So first, it is really unclear what you would actually get from GPT-6 in this situation.
(As an aside, I tried with GPT-J and it outputted an index with some chapter names.)
You might just get the rest of your own comment or something similar...
Or maybe you get some article about Eliezer's book, some joke book written now, the actual book but containing subtle errors Eliezer might make, a fake article that an AGI which GPT-6 predicts would likely have taken over the world by then would write, etc.

Since in general GPT-6 would be optimized to predict (in ...

I think the issues where GPT-6 avoids actually outputting a serious book are fairly easy to solve. For one, you can annotate every item in the training corpus with a tag containing its provenance (arXiv, the various scientific journals, publishing houses, Reddit, etc.) and the publication date (and maybe some other things like the number of words); these tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced. This avoids the easy failure mode of GPT-6 outputting my comment or some random blog post, because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count. GPT-6 predicting AI takeover of the publishing houses and therefore producing a malicious AI safety book is a possibility, but I think most future paths where the world is destroyed by AI don't involve Elsevier still existing and publishing malicious safety books. But even if this is a possibility, we can just re-sample GPT-6 on this prompt to get a variety of books corresponding to the distribution of future outcomes expected by GPT-6, which are then checked by a team of safety researchers. As with most problems, generating interesting solutions is harder than verifying them; it doesn't have to be perfect to be ridiculously useful. This general approach of "run GPT-6 in a secret room without internet, patching safety bugs with various heuristics, making it generate AI safety work that is then verified by a team" seems promising to me. You can even do stuff like train GPT-6 on an internal log of the various safety patches the team is working on, then have GPT-6 predict the next patch or possible safety problem. This approach is not safe at extreme levels of AI capability, and some prompts are safer than others, but it doesn't strike me as "obviously the world ends if someone tries this".
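A rough sketch of the tagging scheme described above, assuming a made-up tag syntax: the `<source=...>`, `<date=...>` and `<words=...>` markers and the function names are hypothetical illustrations, not any real model's or tokenizer's format. The same header is prepended to every training document and reused to build the prompt.

```python
from dataclasses import dataclass


@dataclass
class Document:
    source: str      # provenance, e.g. "arxiv", "reddit", "publishing_house"
    published: str   # publication date as an ISO string
    text: str


def tag_header(source, published, word_count):
    """Build the metadata header (hypothetical syntax) placed before the text."""
    return f"<source={source}> <date={published}> <words={word_count}>"


def to_training_example(doc):
    """Prepend provenance, date, and word-count tags to a corpus item."""
    header = tag_header(doc.source, doc.published, len(doc.text.split()))
    return header + "\n" + doc.text


def make_prompt(source, published, word_count, opening=""):
    """At inference time, condition on the tags of the text you want produced."""
    return tag_header(source, published, word_count) + "\n" + opening


example = to_training_example(Document("reddit", "2023-01-01", "hello world"))
prompt = make_prompt("publishing_house", "2030-01-01", 80000)
```

Because a blog comment would carry a different source tag and a much smaller word count in training, a prompt tagged as a full-length published book is meant to steer the model away from that failure mode.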

Well, if Mary does learn something new (how it feels "from the inside" to see red, or whatever), she would notice, and her brain state would reflect that plus whatever information she learned. Otherwise it doesn't make sense to say she learned anything.

And just the fact she learned something and might have thought something like "neat, so that's what red looks like" would be relevant to predictions of her behavior even ignoring possible information content of qualia.

So it seems distinguishable to me.

Not sure what you mean.

If some action is a risk to the world but Harry doesn't know it, the Vow doesn't prevent him from doing it.

If, after taking some action, Harry realizes it risked the world, nothing happens, except maybe him being unable to repeat the decision if it comes up again.

If not taking some action (for example, defeating someone about to Obliviate him) would cause him to forget about a risk to the world, the Vow doesn't actually force him to do it.

And if Harry is forced to decide between ignorance and a risk to the world he will choose whichever he thinks is least likely to destroy the world.

The thing about ignorance seems to also apply to abandoning intelligence buffs.

Perhaps: Harry will destroy the world eventually. And wearing the Diadem "normally" finds a more certain/swift result to the same outcome?  If so, then he should take it off... but taking it off would perhaps leave him too dumb to work out why he should have taken it off ("it made sense at the time, maybe I should put it back on"), and also unable to see how his default plans eventually do so ("it seemed like a bad idea, but now the logic is fuzzy, let's keep going but more slowly") so taking it off would ALSO destroy the world. However, if he convulses forever back and forth between removing and not removing the diadem he kind of fulfills both impulses, and won't thus destroy the world either way, so that's the result? Kind of like a Garrabrant Inductor thinking about one of its own Goedel Statements?