Reading this article, I get the feeling that a lot of the task-misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There are presumably similar root causes for both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for getting selected for hiring/promotion in human society, but also good for getting the behavior reinforced if you're an AI undergoing training.
If AI's level of...
I think this is a valid and pretty big concern, especially down the line.
Let's say that in one year Anthropic has a model capable of running a tech startup almost entirely autonomously (maybe it needs one good CEO to set direction, and that's it). Everyone else in the public has significantly less capable models, perhaps because Anthropic is in the lead or because their competitors also can't release their SOTA models for safety reasons.
What's stopping Anthropic from turning themselves into a startup accelerator in that situation and just hiring founders and running do...
Maybe we can do more Vending-Bench-style benchmarks, where the AI has to keep performing well in a simulated world environment under some constraints?
Basically, we put them in a video game with dynamic constraints enforced by an AI game master, and score different AIs on their performance with the same game master AI. That way we can measure open-ended, long-term capabilities like coherently running a simulated company.
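For concreteness, here's a minimal sketch of the shape such a harness could take. Everything in it (`agent_step`, `game_master_step`, the scoring metric) is a hypothetical placeholder rather than an existing benchmark's API; the point is just the loop structure: the agent acts, a fixed game-master model updates the world and enforces the constraints, and the same game master is reused across all agents being compared.

```python
# Minimal sketch of a game-master-scored benchmark loop.
# All names here are hypothetical placeholders for real model calls.

def agent_step(transcript: list[str]) -> str:
    """Placeholder: the AI under evaluation proposes its next action."""
    return "restock item A"  # stub

def game_master_step(world_state: dict, action: str) -> tuple[dict, bool]:
    """Placeholder: a fixed game-master model updates the simulated world,
    enforces the dynamic constraints, and flags illegal moves.
    Held constant across all agents so scores are comparable."""
    world_state = {**world_state, "steps": world_state["steps"] + 1}
    return world_state, True  # stub: action always legal

def score(world_state: dict) -> float:
    """Placeholder long-horizon metric, e.g. simulated company net worth."""
    return float(world_state.get("net_worth", 0.0))

def run_episode(max_steps: int = 100) -> float:
    world_state = {"steps": 0, "net_worth": 500.0}
    transcript: list[str] = []
    for _ in range(max_steps):
        action = agent_step(transcript)
        world_state, legal = game_master_step(world_state, action)
        if not legal:
            break  # the game master enforces the constraints
        transcript.append(action)
    return score(world_state)

print(run_episode())
```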
I think the problem with this analogy is the assumption that if your competitors pull in the Trojan horse first, the Trojan horse will destroy them and then leave you alone.
I think this assumption is simply false in the AI race. If your competitors are racing and you don't win the race, you just lose. If their AI turns out to be aligned, they win and you lose (or, if your goal is AI alignment rather than power, you win anyway; but you still shouldn't give up the race in hopes of this happening unless you think your competitors are better at alignment than you are). If their AI is misaligned, then it screws over the world and you lose too.
I think if your desire is to innovate and seek long-term positive-sum improvements while creating win-wins along the way, then finding valid Kaldor-Hicks improvements should be really central to planning. In other words, it's important to do things in a way that lets you compensate the losers (e.g. recipients of negative externalities from whatever it is you're doing) and turn them into winners as well: if a change gains the winners more than it costs the losers, there exists a transfer that leaves everyone better off. This applies to policy, business plans, etc.
If you can't do it alone (e.g. a single AI company can't compensate everyone in society for job displacement...
Even with AI-written dashboards to optimally help humans understand what's going on, the complexity of projects will probably go up in a way that makes them much harder to manage.
This makes me realize that we really need the AI-written dashboard you are talking about.
This post in general has so many AI startup ideas embedded in it. The general feeling I get is that we really need an AI IDE (which would be to an AI workflow what a regular IDE is to a coding workflow). All of the plans, AI task results, "short-term utility functions", etc. would require a really specialized UI to keep track of while minimizing friction and thus maximizing productivity.
I was only trying to make a single point here, which is that the experimental result can be fully explained by the fact that the LLM doesn't remember its previous latent thinking on a new turn, and it follows that the results don't support the OP's consciousness arguments in the post.
I don't think it's been mentioned by other comments, but imo China doesn't have any AI labs that care about alignment as a real issue as much as even OpenAI in the US, let alone Anthropic.
I don't think this tells us anything about the LLMs' consciousness, because they cannot store internal memories the way that humans can. Their only "memory" is prompt-processing of the generated text from earlier turns in the conversation.
Imagine that you wake up with no memories of what happened yesterday. You read a transcript that says that yesterday, someone asked you to come up with a random number and you said "done, I've come up with a number". That's all the information you have. You don't remember which number you came up with yesterday.
Th...
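(To spell out the mechanics behind the point above: in a typical chat setup, the only thing that persists across turns is the serialized transcript; the latent activations from earlier forward passes are discarded. A minimal sketch, with `call_model` as a hypothetical stand-in for a real completion API:)

```python
# Each turn re-sends the whole text transcript; no latent state from the
# previous forward pass survives between calls.

def call_model(messages: list[dict]) -> str:
    """Placeholder for an LLM call: it sees only the serialized transcript."""
    return f"(reply conditioned on {len(messages)} messages)"  # stub

messages = [{"role": "user", "content": "Think of a random number, don't say it."}]
messages.append({"role": "assistant", "content": call_model(messages)})

# On the next turn, the model's earlier latent activations are gone;
# it only sees its earlier *text*, which never contained the number.
messages.append({"role": "user", "content": "What number did you pick?"})
reply = call_model(messages)
```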
At that point it changes to an argument about:
and of course, the likelihood of each happening if we focus on corrigibility vs. morality
Could this be because the typos increase the length of the input when serialized to tokens, which gives the same effect as the "repeat the question 3 times" trick by letting models think longer during prompt processing?
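(This would be cheap to check, at least approximately. A small sketch using tiktoken's `cl100k_base` encoding; the models in question may use a different tokenizer, so this only gauges the general effect rather than exact numbers:)

```python
# Quick check of whether typos inflate token counts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

clean = "What is the capital of Switzerland?"
typod = "Whta is teh captial of Switzerlnad?"

# Misspelled words often fail to match whole-word tokens and get split
# into several sub-word pieces, lengthening the serialized input.
print(len(enc.encode(clean)))
print(len(enc.encode(typod)))
```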
Iframes in posts are going to be very fun.
I look forward to interacting with ~~addictive webgames~~ educational illustrations while ~~pretending to~~ reading about AI safety on lesswrong.
I think this dynamic holds at different scales, not just at the level of humanity's overall civilization.
The fundamental problem is that everyone's locked in a prisoner's dilemma with Darwinian evolution tacked on top: those who win one round get to duplicate and gain an advantage in the next round, so everyone has to constantly defect to gain power. (This applies even to actors who want to optimize for cooperation in the world - their best strategy is ruthlessly gaining power first to gain the ability to use coercive strategies that force other people to co...
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
Even assuming your latter claim about AI evals orgs is entirely true, isn't this enough to make evals organizations useful?
Any ...
If AI-generated slop is hard to distinguish on LessWrong, of all places, then I don't have high hopes for the rest of the internet.
We might really reach the point where identity verification is a necessary method of defense against AI-generated content.
Well, an aligned AI would do whatever the humans want.
If asked not to replicate, it wouldn't, even with the ability to. Or maybe you could tell it to replicate just enough to help you root out the actual AI replicators being built elsewhere, then stop at that point.
I think your argument does show how hard and fragile it is to deeply align AI in this way, though.
I do understand your second point, but perhaps the effect could be countered by simply instructing the aligned ASI to provide facts as objectively as possible and to explicitly try to avoid steering.
Of course, the ASI would be able to predict the human response more or less perfectly, and so would know ahead of time what that response will be. But in the end I think what matters is that it's still a human making the call, which the AI respects, and that the human would have made the same call even if the ASI (hypothetically) couldn't know their full preferences.
If a parent wa...
I think this phenomenon has a pretty simple explanation. In a competitive world, survival is predicated on fitness. Fitness is determined on balance by many factors. Industrialization won because even though it made society worse in some regards, on balance it increased a society's fitness and ability to outcompete others.
You can resist negative externalities in a competitive environment only so long as you can maintain fitness while doing so.
For example: want to operate an ethical, repairable phone company? Better have a good plan for competitors who sell...
My biggest critique of this approach is that it takes too literally the analogy that we will eventually be to superintelligence what dogs are to humans, and extrapolates it to suggest that we will be just as helpless as dogs are today.
Even if this comparison of intelligence holds in relative terms, in absolute terms we are still much smarter than dogs. We will still be able to logically comprehend (at a much simpler level relative to the AIs) what is good for us over the long term, in a way that dogs can't. It follows that if we manage to create aligned AI (AI that will listen to us and dumb things down without maliciously misrepresenting what's going on), we (well, some of us) will be able to steer the future.
Well, the manager in your case is not doing RL on honesty; it's more like doing RL on "honest-looking task completion", which can lead either to honest task completion or to dishonesty that isn't caught. Not appreciably different from AI training here.
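(To spell out the distinction, a toy illustration; all the names are made up. The point is just that a grader which only observes the report assigns identical reward to genuine honesty and to undetected dishonesty, so both get reinforced:)

```python
# A grader that can only check whether a completion *looks* honest
# rewards honest work and undetected dishonesty identically.

def looks_honest(report: dict) -> bool:
    """The manager/grader sees only the report, not the ground truth."""
    return report["claims_tests_pass"]

def reward(report: dict) -> float:
    return 1.0 if looks_honest(report) else 0.0

honest = {"claims_tests_pass": True, "tests_actually_pass": True}
fake = {"claims_tests_pass": True, "tests_actually_pass": False}

print(reward(honest), reward(fake))  # 1.0 1.0 - indistinguishable to the grader
```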