I’m a vegetarian and I consider my policy of not frequently recalculating the cost/benefit of eating meat to be an application of a rule in two-level utilitarianism, not a deontological rule. (I do pressure test the calculation periodically.)
Also I will note you are making some pretty strong generalizations here. I know vegans who cheat, vegans who are flexible, vegans who are strict.
The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes
You’ve made the supply/demand analogy a few times on this subject, I’m not sure that is the best lens. This analysis makes it sound like there is a homogenous product “fakes” with a single dimension “quality”. But I think even on its own terms the market micro-dynamics are way more complex than that.
I think of it more in terms of memetic evolution and epidemiology. SIR as a first analogy - some people have weak immune systems, some ...
...Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions
Confabulation is a dealbreaker for some use-cases (e.g. customer support), and potentially tolerable for others (e.g. generating code when tests / ground-truth is available). I think it's essentially down to whether you care about best-case performance (discarding bad responses) or worst-case performance.
But agreed, a lot of value is dependent on solving that problem.
While of course this is easy to rationalize post hoc, I don’t think falling user count of ChatGPT is a particularly useful signal. There is a possible world where it is useful; something like “all of the value from LLMs will come from people entering text into ChatGPT”. In that world, users giving up shows that there isn’t much value.
In this world, I believe most of the value is (currently) gated behind non-trivial amounts of software scaffolding, which will take man-years of development time to build. Things like UI paradigms for coding assistants, experi...
Amusingly, the US seems to have already taken this approach to censor books: https://www.wired.com/story/chatgpt-ban-books-iowa-schools-sf-496/
The result, then, is districts like Mason City asking ChatGPT, “Does [insert book here] contain a description or depiction of a sex act?” If the answer was yes, the book was removed from the district’s libraries and stored.
Regarding China or other regimes using LLMs for censorship, I'm actually concerned that it might rapidly go the opposite direction as speculated here:
...It has widely been reported that the PRC may b
There doesn't need to be a deception module or a deception neuron that can be detected
I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.
...We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I
I find the distinction between an agent’s behavior and the agent confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.
Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy, I don’t think that strat is unique in Diplomacy?
But if you want to apply the ...
Interpretability. If we somehow solve that, and keep it as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc, we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)
Of course, many big challenges here. Could an agent...
I think getting to “good enough” on this question should pretty much come for free when the hard problems are solved. For example any common sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three pre-requisites being way harder than picking a non-harmful value function.)
If our AGIs, after delivering utopia, tell us we need to start restricting childbea...
Seconding the Airmega, but here’s a DIY option too if availability becomes an issue: https://dynomight.net/better-DIY-air-purifier.html
The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand
Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style
I can see how this could in some cases end up impacting creativity, but I think this concern is at best overstated. I think the analogy to school is subtly incorrect, the rating policy is not actually the same, even though both are named “show your working”.
In the paper OpenAI have a “neutral” r...
Is “adversarial-example-wanters” referring to an existing topic, or something you can expand on here?
This is a great experiment! This illustrates exactly the tendency I observed when I dug into this question with an earlier mode, LaMDA, except this example is even clearer.
As an AI language model, I have access to a variety of monitoring tools and system resources that allow me to gather information about my current state
Based on my knowledge of how these systems are wired together (software engineer, not an ML practitioner), I’m confident this is bullshit. ChatGPT does not have access to operational metrics about the computational fabric it is running...
I buy this. I think a solid sense of self might be the key missing ingredient (though it’s potentially a path away from Oracles toward Agents).
A strong sense of self would require life experience, which implies memory. Probably also the ability to ruminate and generate counterfactuals.
And of course, as you say, the memories and “growing up” would need to be about experiences of the real world, or at least recordings of such experiences, or of a “real-world-like simulation”. I picture an agent growing in complexity and compute over time, while retaining a memory of its earlier stages.
Perhaps this is a different learning paradigm from gradient descent, relegating it to science fiction for now.
I think they quite clearly have no (or barely any) memory, as they can be prompt-hijacked to drop one persona and adopt another. Also, mechanistically, the prompt is the only thing you could call memory and that starts basically empty and the window is small. They also have a fuzzy-at-best self-symbol. No “Markov blanket”, if you want to use the Friston terminology. No rumination on counterfactual futures and pasts.
I do agree there is some element of a self-symbol—at least a theory of mind—in LaMDA, for example I found it’s explanation for why it lied to b...
why they thought the system was at all ready for release
My best guess is it’s fully explained by Nadella’s quote “I hope that, with our innovation, [Google] will definitely want to come out and show that they can dance. And I want people to know that we made them dance.”
https://finance.yahoo.com/news/microsoft-ceo-satya-nadella-says-172753549.html
Seems kind of vapid but this appears to be the level that many execs operate at.
Is there any evidence at all that markets are good at predicting paradigm shifts? Not my field but I would not be surprised by the “no” answer.
Markets as often-efficient in-sample predictors, and poor out-of-sample predictors, would be my base intuition.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic
Agreed. My update is coming purely from increasing my estimation for how much press and therefore funding AI risk is going to get long before to that point. 12 months ago it seemed to me that capabilities had increased dramatically, and yet there was no proportional increase in the general public's level of fear of catastrophe. Now it seems to me that there's a more plausible path to widespread appreciation of (and therefore work on) AI risk. To ...
I think an interesting lens here is to focus on the meta/structural level. Essentially we need good governance. As you touch on here:
It is not immediately obvious to me that the required level of power is way out of proportion to the level of wisdom that will be available, if we can find some way of coupling wisdom and intelligence as priorities for models. I don’t really know of anyone doing “wisdom evals” though. Maybe RLHF is in some very weak sense testing for wisdom.
I guess my main poin... (read more)