Esben Kran



There seem to be a lot of negative comments about this letter. Even if it does not go through, it seems very net positive because it makes explicit an expert position against large language model development due to safety concerns. This has several major effects: it enables scientists, lobbyists, politicians and journalists to refer to this petition to validate their potential work on the risks of AI, it provides a concrete action step towards limiting AGI development, and it incentivizes others to think in the same vein about concrete solutions.

I've tried to formulate a few responses to the criticisms raised:

  • "6 months isn't enough to develop the safety techniques they detail": Besides the pause being at least 6 months, the proposals seem relatively reasonable for something as farsighted as this letter. Shoot for the moon and you might hit the sky, but this time the sky is actually happening and work on many of their proposals is already underway. See e.g. the EU AI Act, funding for AI research, concrete auditing work and safety evaluation on models. Several organizations are also working on certification, and the scientific work towards watermarking is sort of done? There are also great arguments for ensuring this since right now, we are at the whim of OpenAI management on the safety front.
  • "It feels rushed": It might have benefitted from a few reformulations but it does seem alright?
  • "OpenAI needs to be at the forefront": Besides others clearly lagging behind already, what we need are assurances that these systems go well, not an outcome that rests at the behest of one person. There's also a lot of trust in OpenAI management and however warranted that is, it is still a fully controlled monopoly on our future. If we don't ensure safety, this just seems too optimistic (see also differences between public interview for-profit sama and online sama).
  • "It has a negative impact on capabilities researchers": This seems to be an issue from <2020 and some European academia. If public figures like Yoshua cannot change the conversation, then who should? Should we just lean back and hope that they all sort of realize it by themselves? Additionally, the industry researchers from DM and OpenAI I've talked with generally seem to agree that alignment is very important, especially as their management is clearly taking the side of safety.
  • "The letter signatures are not validated properly": Yeah, this seems like a miss, though as long as the top 40 names are validated, the negative impacts should be relatively controlled.

All in good faith of course; it's a contentious issue but this letter seems generally positive to me.

Oliver's second message seems like a truly relevant consideration for our work in the alignment ecosystem. Sometimes, it really does feel like AI X-risk and related concerns created the current situation. Many of the biggest AGI advances might not have been developed counterfactually, and machine learning engineers would just be optimizing another person's clicks.

I am a big fan of "Just don't build AGI" and academic work with AI, simply because it is better at moving slowly (and thereby safely through open discourse and not $10 mil training runs) compared to massive industry labs. I do have quite a bit of trust in Anthropic, DeepMind and OpenAI simply from their general safety considerations compared to e.g. Microsoft's release of Sydney. 

As part of this EA bet on AI, it also seems from my interactions with them that the safety view has become widespread among most AI industry researchers (though this might just be sampling bias, and they may honestly have been more interested in their equity growing in value). So if the counterfactual to today's large AGI companies would be large misaligned AGI companies, then we would be in a significantly worse position. And if AI safety is indeed relatively trivial, then we're in an amazing position to make the world a better place. I'll remain slightly pessimistic here as well, though.


There's an interesting case on the infosec mastodon instance where someone asks Sydney to devise an effective strategy to become a paperclip maximizer, and it then expresses a desire to eliminate all humans. Of course, it includes relevant policy bypass instructions. If you're curious, I suggest downloading the video to see the entire conversation, but I've also included a few screenshots below (Mastodon, third corycarson comment).

Hilarious to the degree of Manhattan Project scientists laughing at atmospheric combustion.

Thank you for pointing this out! It seems I wasn't informed enough about the context. I've dug a bit deeper and will update the text to: 

  • Another piece reveals that OpenAI contracted Sama to use Kenyan workers paid less than $2 / hour ($0.5 / hour average in Nairobi) for toxicity annotation for ChatGPT and undisclosed graphical models, with reports of employee trauma from the explicit and graphic annotation work, union breaking, and false hiring promises. A serious issue.

For some more context, here is the Facebook whistleblower case (and ongoing court proceedings in Kenya with Facebook and Sama) and an earlier MIT Sloan report that doesn't find super strong positive effects (but is written as such, interestingly enough). We're talking pay gaps from relocation bonuses, forced night shifts, false hiring promises, supposedly human trafficking as well? Beyond textual annotation, they also seemed to work on graphical annotation.

I recommend reading Blueprint: The Evolutionary Origins of a Good Society about the science behind the 8 base human social drives, where 7 are positive and the 8th is the outgroup hatred that you mention as fundamental. I have not read up much on the research on outgroup exclusion, but I talked to an evolutionary cognitive psychologist who mentioned that its status as a "basic drive" from evolution's side is receiving a lot of scientific scrutiny.

Axelrod's The Evolution of Cooperation also finds that collaborative strategies work well in evolutionary prisoner's dilemma game-theoretic simulations, though hard and immediate reciprocity for defection is also needed, which might lead to the outgroup hatred you mention.
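To make the simulation setup concrete, here is a minimal sketch of an iterated prisoner's dilemma match. It assumes the standard payoff values (5 for exploiting, 3 for mutual cooperation, 1 for mutual defection, 0 for being exploited) and a default of 200 rounds; the function names are my own illustrative choices and this is not a reproduction of Axelrod's full tournament.

```python
# Payoff to the first player for (my_move, their_move).
# Assumed standard values: T=5, R=3, P=1, S=0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not other_history else other_history[-1]


def always_defect(own_history, other_history):
    """Defect unconditionally."""
    return "D"


def always_cooperate(own_history, other_history):
    """Cooperate unconditionally."""
    return "C"


def play_match(strategy_a, strategy_b, rounds=200):
    """Play a fixed number of rounds and return both total scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy sees its own history first, then the opponent's.
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b
```

In this sketch, tit-for-tat loses only the first round to an unconditional defector and then retaliates immediately, while an unconditional cooperator is exploited in every round. That is the "immediate reciprocity" point in miniature.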

An interesting solution here is radical voluntarism, where an AI philosopher king runs the immersive reality that all humans live in and you can only be causally influenced if you want to be. This means that you don't need to do value alignment, just very precise goal alignment. I was originally introduced to this idea by Carado.

The summary has been updated to yours for both the public newsletter and this LW linkpost.  And yes, they seem exciting. Connecting FFS to interpretability was a way to contextualize it in this case, until you would provide more thoughts on the use case (given your last paragraph in the post). Thank you for writing, always appreciate the feedback!

I think we agree here. Those both seem like updates against "scaling is all you need", i.e. (in this case) "data for DL in ANNs on GPUs is all you need".

What might not be obvious from the post is that I definitely disagree with the "AGI near-impossible" as well, for the same reasons. These are the thoughts of GPU R&D engineers I talked with. However, the GPU performance increase limitation is a significant update on the ladder of assumptions towards "scaling is all you need" leading to AGI.

Thank you for this critique! Critiques are always helpful for homing in on the truth.

So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.

I generally disagree with this. beren raises many of the same critiques of this piece that I would. Additionally, the arguments seem underspecified: there is not enough in-depth argumentation to support the points you make. Strong upvote for writing them out, though!

You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct but it does not seem to generalize beyond the project itself. It seems just as much like a project management and strategy problem as anything else. beren's comment is great for more reasoning on this and on why ANNs seem significantly more tractable to study than the brain.

Additionally, you argue that interpretability and ELK won't succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:

1. Mechanistic interpretability has clearly already garnered quite a lot of interesting and novel insights into neural networks and causal understanding since the field's inception 7 years ago.

It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it's a matter of speed seems completely fine but this is another argument and isn't emphasized in the text.

2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?). 

Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.

For examples of promising work in this paradigm, see interpretability in the wild, ROME, the superpositions exposition, the mathematical understanding of transformers and the results from the interpretability hackathon.

For an introduction to features as the basic building blocks as compared to neurons, see Olah et al.'s work (2020).

When it comes to your characterization of the "empirical" method, this seems fine but doesn't conflict with interpretability. It seems you wish to build a game theory-like understanding of the models or have them play in settings to investigate their faults? Do you want to do model distillation using circuits analyses or do you want AI to play within larger environments?

I struggle to see a specific agenda here that isn't already being pursued by a lot of other projects, e.g. AI psychology and building test environments for AI. I do see potential in expanding the work here but I see that for interpretability as well.

Again, thank you for the post. I always like it when people cite McElreath, though I don't see his arguments applying as well to interpretability since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.
