Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly work that's loosely in the vein of agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.


There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.

"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).

"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).

Or as Eliezer said:

I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.

In different words again: the tasks GPTs are being incentivised to solve aren't all solvable at a human level of capability.


You almost had it when you said:

- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. But this comparison does not seem to be particularly informative.

It's more accurate if I edit it, replacing 'the activation of photoreceptors in human retina' with '[text]':

- Maybe you mean something like task + performance threshold. Here 'predict [text] well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'.

You say it's not particularly informative. Eliezer responds by explaining the argument that his statement was responding to, which provides the context in which it is an informative statement about the training incentives of a GPT.

The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human-level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.

Your central point is: 

Where GPT and humans differ is not some general mathematical fact about the task,  but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.

You are misinterpreting the OP by thinking it's about comparing the mathematical properties of two tasks, when it was just pointing at the loss gradient of the text prediction task (at the location of a ~human capability profile). The OP works through text prediction sub-tasks where it's obvious that the gradient points toward higher-than-human inference capabilities.

You seem to focus too hard on the minima of the loss function:

notice that “what would the loss function like the system to do”  in principle tells you very little about what the system will do

You're correct to point out that the minima of a loss function don't tell you much about the actual loss that could be achieved by a particular system. Like you say, the particular boundedness and cognitive architecture are more relevant to this question. But this is irrelevant to the argument being made, which is about whether the text prediction objective stops incentivising improvements above human capability.
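To make the gradient point concrete, here's a toy numerical sketch (my own illustration with made-up distributions, not something from the OP): cross-entropy against an arbitrary "true" next-token distribution keeps dropping as a predictor improves past a hypothetical human-level mixture, so the objective keeps rewarding capability gains at that point.

```python
# Toy illustration (made-up distributions): the text-prediction objective keeps
# rewarding predictors that are better than a hypothetical "human-level" one.
import numpy as np

rng = np.random.default_rng(0)
vocab = 50
true_p = rng.dirichlet(np.ones(vocab))      # stand-in "true" next-token distribution

def cross_entropy(q):
    return -np.sum(true_p * np.log(q))

uniform = np.ones(vocab) / vocab            # knows nothing
human_ish = 0.5 * true_p + 0.5 * uniform    # hypothetical human-level predictor
better = 0.9 * true_p + 0.1 * uniform       # closer to the true distribution

for name, q in [("uniform", uniform), ("human-ish", human_ish), ("better-than-human", better)]:
    print(f"{name:>18}: loss = {cross_entropy(q):.3f}")
# The loss strictly decreases as the predictor approaches true_p, so the
# objective doesn't stop incentivising improvement at the human-ish point.
```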


The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning

I think a better lesson to learn is that communication is hard, and therefore we should try not to be too salty toward each other. 

I sometimes think of alignment as having two barriers: 

  • Obtaining levers that can be used to design and shape an AGI in development.
  • Developing theory that predicts the effect of your design choices.

My current understanding of your agenda, in my own words:

You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming language features, because they are small and well-tested-ish. (1 & 2)

As new tactics are developed, you're hoping that expertise and robust theories develop around building systems this way. (3)

This by itself doesn't scale to hard problems, so you're trying to develop methods for learning and tracking knowledge/facts that interface with the rest of the system while remaining legible. (4)

Maybe with some additional tools, we build a relatively-legible emulation of human thinking on top of this paradigm. (5)
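As a toy sketch of how I'm picturing points 1 & 2 (the function names and composition here are my invention, not from your post), a "tactic" might be a small, individually-testable wrapper around an LLM call that composes with other tactics:

```python
# Hypothetical sketch of my reading of the paradigm; `call_llm` and the tactic
# names below are invented for illustration, not taken from the post.

def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM call."""
    raise NotImplementedError

def summarize(text: str) -> str:
    # A small "tactic": one narrow job, easy to test in isolation.
    return call_llm(f"Summarize in one paragraph:\n{text}")

def supports(claim: str, source: str) -> bool:
    # Another tactic: a narrow yes/no check.
    answer = call_llm(f"Does the source support the claim? Answer yes or no.\n"
                      f"Claim: {claim}\nSource: {source}")
    return answer.strip().lower().startswith("yes")

def vetted_summary(source: str) -> str:
    # Tactics compose like language features: each piece is small and
    # well-tested-ish, so the overall system stays relatively legible.
    summary = summarize(source)
    if not supports(summary, source):
        summary = summarize(source)  # naive retry; a real tactic would do better
    return summary
```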

Have I understood this correctly?

I feel like the alignment section of this is missing. Is the hope that better legibility and experience will allow us to solve the alignment problems that we expect at this point?

Maybe it'd be good to name some speculative tools/theory that you hope will have been developed for shaping CoEms, then say how they would help with some of the following:

  • Unexpected edge cases in value specification
  • Goal stability across ontology shifts
  • Reflective stability of goals
  • Optimization daemons or simpler self-reinforcing biases
  • Maintaining interruptibility against instrumental convergence

Most alignment research skips straight to trying to resolve issues like these, at least in principle, and then often backs off to develop relevant theory. I can see why you might want to do the levers part first, and have theory develop along with the experience of building things. But it's risky to do the hard part last.


but because the same solutions that will make AI systems beneficial will also make them safer

This is often not true, and I don't think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.

In practice, sadly, developing a true ELM is currently too expensive for us to pursue

Expensive why? Seems like the bottleneck here is theoretical understanding.

Yeah, I read that prize contest post; that was much of where I got my impression of the "consensus". It didn't really describe which parts you still considered valuable, and I'd be curious to know which they are. My understanding was that most of the conclusions in that post were downstream of the Landauer limit argument.
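(For reference, and as standard physics rather than anything from the post or your comment: the Landauer limit is the minimum energy needed to erase one bit of information,

$$E_{\min} = k_B T \ln 2 \approx 2.9 \times 10^{-21}\ \mathrm{J} \approx 0.018\ \mathrm{eV} \quad \text{at } T \approx 300\ \mathrm{K}.)$$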

Could you explain or directly link to something about the 4x claim? It seems wrong: communication speed scales with distance, not area.

Jacob Cannell's brain efficiency post

I thought the consensus on that post was that it was mostly bullshit?

These seem right, but more importantly I think it would eliminate investment in new scalable companies, or dramatically reduce it in the 50% case. So very few new companies would be created.

(As a side note: maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material.)

would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.

Good point, I'm convinced by this. 

build on past agent foundations research

I don't really agree with this. Why do you say this?

That's my guess at the level of engagement required to understand something. Maybe just because whenever I've tried to use or modify some research that I thought I understood, I've realised I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here; other people often learn faster than me.

(Also I'm confused about the discourse in this thread (which is fine), because I thought we were discussing "how / how much should grantmakers let the money flow".)

I was thinking "should grantmakers let the money flow to unknown young people who want a chance to prove themselves."

I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.

The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask people to understand and build on past agent foundations research, then gradually move away if they see other pathways after having understood the constraints. Now I see my work as mostly about trying to run into constraints for the purpose of better understanding them.

Maybe that wouldn't help though, it's really hard to make people see the constraints.

The main things I'm referring to are upskilling or career transition grants, especially from LTFF, in the last couple of years. I don't have stats; I'm assuming there were a lot given out because I met a lot of people who had received them. Probably a bunch were also given out by the FTX Future Fund.

Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.

How are they falling short?

(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I were grantmaking I'd probably give many more of some kinds of safety research grant. But "if a man has an idea, just give him money and don't ask questions" isn't the right kind of change imo.)

I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success, so it's not a big update. But it is a small update.
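To illustrate the "small update" point with made-up numbers (purely illustrative, not a claim about actual grant statistics): suppose the prior that this lever eventually produces field-changing research is 20%, and that even if it does, a single round of grants only surfaces it with probability 25%. Then one round with no field-changing research only moves you from 20% down to about 16%:

$$P(\text{lever good} \mid \text{no success yet}) = \frac{0.2 \times 0.75}{0.2 \times 0.75 + 0.8 \times 1} = \frac{0.15}{0.95} \approx 0.16.$$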

Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.

Also, there are more grifters now that the word is out, so the EV is declining that way too.

(I feel bad saying this as someone who benefited a lot from early-career financial support).
