Finding Backward Chaining Circuits in Transformers Trained on Tree Search
This post is a summary of our paper A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task (ACL 2024). While we wrote and released the paper a couple of months ago, we have done a bad job promoting it so far. As a result, we’re writing...

Oh, okay, then I think some of my objections are wrong, but your post still seems to fail to explain the narrower claim well. You are describing a failure of LLMs to imitate humans as if it were a problem with imitation learning itself. If you put an LLM in a box and get different results than if you put a human in the same box, you are describing LLMs that are bad at imitating humans; namely, they lack open-ended continual learning. That is different from saying the problem is that you think continual learning can't be done with LLMs without some form of consequentialism.
In the case of very-long-context LLMs you are even...