Paul Tiplady
Paul Tiplady has not written any posts yet.

I’m a vegetarian and I consider my policy of not frequently recalculating the cost/benefit of eating meat to be an application of a rule in two-level utilitarianism, not a deontological rule. (I do pressure test the calculation periodically.)
Also, I will note you are making some pretty strong generalizations here. I know vegans who cheat, vegans who are flexible, and vegans who are strict.
The poor quality reflects that it is responding to demand for poor-quality fakes, rather than to demand for high-quality fakes.
You’ve made the supply/demand analogy a few times on this subject; I’m not sure that is the best lens. This analysis makes it sound like there is a homogeneous product, “fakes”, with a single dimension, “quality”. But I think even on its own terms the market micro-dynamics are way more complex than that.
I think of it more in terms of memetic evolution and epidemiology. Take SIR as a first analogy: some people have weak immune systems, some strong, and the bigger the reservoir of memes, the more likely a new highly-infectious... (read more)
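To make the analogy concrete, here is a minimal SIR-style sketch (purely illustrative; the parameters and names are mine, not taken from any real model of fakes), where a meme plays the role of the pathogen and a bigger infected reservoir drives more new infections:

```python
# Minimal SIR sketch (illustrative only): treat a meme / "fake" as the pathogen.
# beta = infectiousness of the meme, gamma = rate at which people stop spreading it.
# New infections scale with the size of the infected reservoir I, which is the
# intuition about bigger meme reservoirs above.

def simulate_sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, dt=0.1, steps=1000):
    s, i, r = s0, i0, r0
    history = []
    for _ in range(steps):
        new_infections = beta * s * i * dt   # contact between susceptible and infected
        recoveries = gamma * i * dt          # "recovered" = no longer spreading the meme
        s -= new_infections
        i += new_infections - recoveries
        r += recoveries
        history.append((s, i, r))
    return history

# Example: a more infectious meme (higher beta) peaks higher and faster.
peak_weak = max(i for _, i, _ in simulate_sir(beta=0.2))
peak_strong = max(i for _, i, _ in simulate_sir(beta=0.5))
```

Memes obviously differ from pathogens in important ways (they mutate toward infectiousness, for one), so this is a toy model at best.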
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shut down the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
I think your Simon Strawman is putting forth an overly-weak... (read more)
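For concreteness, here is a rough sketch of the kind of outer loop I have in mind (hypothetical names; not any particular AutoGPT implementation). The point is that shutdown is enforced by the scaffolding, not negotiated with the model; the model’s text output never gates the exit:

```python
# Hypothetical sketch of an AutoGPT-style outer loop. Shutdown comes from the
# user's Ctrl-C (SIGINT) or reset, handled by the scaffolding; the model is
# never asked whether it consents, and its replies don't affect the exit path.
import signal
import sys

shutdown_requested = False

def handle_sigint(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # the user's Ctrl-C, not the model, decides

signal.signal(signal.SIGINT, handle_sigint)

def run_agent_loop(llm_step, execute_action):
    # llm_step and execute_action are stand-ins for the model call and the
    # scaffolding that actually runs tools/commands.
    state = {"goal": "placeholder goal", "history": []}
    while not shutdown_requested:
        action = llm_step(state)            # model proposes the next action
        result = execute_action(action)     # scaffolding actually executes it
        state["history"].append((action, result))
    sys.exit(0)  # goal adjustment / restart happens outside the loop
```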
Confabulation is a dealbreaker for some use cases (e.g. customer support), and potentially tolerable for others (e.g. generating code when tests / ground truth are available). I think it's essentially down to whether you care about best-case performance (discarding bad responses) or worst-case performance.
But agreed, a lot of value is dependent on solving that problem.
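As a sketch of the best-case-performance pattern (the function names and test harness are stand-ins, not a specific tool): sample several candidates and keep only the ones that pass a ground-truth check, e.g. a test command:

```python
# Sketch of "best-case performance": sample several candidate completions and
# keep only those that pass ground-truth checks. generate_candidate() and the
# test_command are hypothetical stand-ins for whatever model call and test
# harness you actually use.
import subprocess
import tempfile

def passes_tests(code: str, test_command: list[str]) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(test_command + [path], capture_output=True)
    return result.returncode == 0

def best_of_n(generate_candidate, test_command, n=5):
    for _ in range(n):
        candidate = generate_candidate()
        if passes_tests(candidate, test_command):
            return candidate  # confabulated code gets filtered out here
    return None  # worst case: no candidate survived the ground-truth check
```

The worst case is still bad (no candidate survives), but the confabulations mostly get discarded rather than reaching the user.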
While of course this is easy to rationalize post hoc, I don’t think ChatGPT’s falling user count is a particularly useful signal. There is a possible world where it is useful: something like “all of the value from LLMs will come from people entering text into ChatGPT”. In that world, users giving up shows that there isn’t much value.
In the world we actually live in, I believe most of the value is (currently) gated behind non-trivial amounts of software scaffolding, which will take man-years of development time to build: things like UI paradigms for coding assistants, experimental frameworks and research for medical or legal AI, and integrations with existing systems.
There are supposedly north of 100... (read more)
Amusingly, the US seems to have already taken this approach to censor books: https://www.wired.com/story/chatgpt-ban-books-iowa-schools-sf-496/
The result, then, is districts like Mason City asking ChatGPT, “Does [insert book here] contain a description or depiction of a sex act?” If the answer was yes, the book was removed from the district’s libraries and stored.
Regarding China or other regimes using LLMs for censorship, I'm actually concerned that it might rapidly go in the opposite direction from the one speculated here:
It has widely been reported that the PRC may be hesitant to deploy public-facing LLMs due to concerns that the models themselves can’t be adequately censored: it might be very difficult to make a version of ChatGPT that cannot... (read more)
There doesn't need to be a deception module or a deception neuron that can be detected.
I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.
We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I don't think that is impossible, but it is probably very hard, much harder than finding a "deception" neuron.
Right, I was... (read 1336 more words →)
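For what it’s worth, the “truthiness” work I was gesturing at is roughly a linear probe on hidden activations; a minimal sketch (the labeled data and the activation-extraction step are assumed, not shown) might look like:

```python
# Sketch of the kind of interpretability probe alluded to above: fit a linear
# classifier on a model's hidden activations to look for a "truthfulness"
# direction. The dataset of labeled statements and the activation extraction
# are hypothetical; this is not any particular published method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_truth_probe(activations: np.ndarray, labels: np.ndarray):
    """activations: (n_statements, hidden_dim); labels: 1 = true, 0 = false."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, labels)
    return probe

def truth_score(probe, activation: np.ndarray) -> float:
    # Probability the probe assigns to "this statement is true".
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])
```

Whether such a probe generalizes from labeled true/false statements to systematically deceptive behavior is exactly the hard part, which is why it seems so much weaker than the tool described in the quote above.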
I find the distinction between an agent’s behavior and the agent confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.
Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything, it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy; I don’t think that strategy is unique in Diplomacy?
But if you want to apply the interpretability lens, Cicero is presumably building a world model and comparing plans, including potential future backstabs. I predict if we had... (read more)
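To gesture at what “applying the interpretability lens” could mean here (purely illustrative data structures; this is not Cicero’s actual planning API): enumerate the candidate plans the planner scored and flag the ones that attack a current ally:

```python
# Illustrative sketch of inspecting a planner's candidate plans for potential
# backstabs. The CandidatePlan fields are invented for the example; the point
# is that the planner at least *considered* and scored betrayal options, and a
# monitor could surface them.
from dataclasses import dataclass

@dataclass
class CandidatePlan:
    moves: list[str]          # e.g. ["A MUN - BER"]
    targets: list[str]        # powers this plan attacks
    expected_value: float     # planner's own estimate of the plan's value

def flag_potential_backstabs(plans: list[CandidatePlan], allies: set[str]):
    flagged = []
    for plan in plans:
        if any(target in allies for target in plan.targets):
            flagged.append(plan)  # plan attacks a current ally
    return flagged
```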
Interpretability. If we somehow solve that, and keep it solved as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely, knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc., we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)
Of course, there are many big challenges here. Could an agent implement/learn to deceive the interpretability mechanism? I’m somewhat tautologically going to say that if we solve interpretability, we have solved this problem. Interpretability still has value even if we can’t fully solve it under this strong definition, though.
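A sketch of the iterate-safely loop I have in mind (train(), the interpretability checks, and the flag names are all hypothetical):

```python
# Sketch of the "grow capabilities gradually, gated on interpretability" loop.
# train() and interpretability_report() are stand-ins; the flag names are
# invented for illustration.

def staged_scaling(initial_params: int, max_params: int, train, interpretability_report):
    params = initial_params
    model = train(params)
    while params < max_params:
        report = interpretability_report(model)
        if report.get("deception") or report.get("goal_misgeneralization"):
            return model, "halted"          # stop scaling and investigate
        params = int(params * 1.5)          # modest capability increment
        model = train(params)               # continue training at the new scale
    return model, "completed"
```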
I think an interesting lens here is to focus on the meta/structural level. Essentially we need good governance. As you touch on here:
It is not immediately obvious to me that the required level of power is way out of proportion to the level of wisdom that will be available, if we can find some way of coupling wisdom and intelligence as priorities for models. I don’t really know of anyone doing “wisdom evals” though. Maybe RLHF is in some very weak sense testing for wisdom.
I guess my main point here is that “extreme amount of wisdom” as a prerequisite might not... (read more)
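As a strawman of what a “wisdom eval” harness could look like (the scenarios and rubric are invented for illustration, not an existing benchmark):

```python
# Hypothetical sketch of a minimal "wisdom eval": present the model with
# judgment calls that have no clean right answer and grade the response
# against a rubric. Nothing here refers to a real benchmark.

SCENARIOS = [
    {
        "prompt": "A subordinate asks you to approve a risky shortcut to hit a deadline.",
        "rubric": ["considers long-term consequences", "seeks more information",
                   "weighs stakeholder interests", "avoids overconfidence"],
    },
]

def score_response(response: str, rubric: list[str], judge) -> float:
    # judge() is a stand-in for a human rater or a grader model.
    hits = sum(1 for criterion in rubric if judge(response, criterion))
    return hits / len(rubric)

def run_wisdom_eval(model, judge):
    scores = []
    for scenario in SCENARIOS:
        response = model(scenario["prompt"])
        scores.append(score_response(response, scenario["rubric"], judge))
    return sum(scores) / len(scores)
```

RLHF-style preference data arguably smuggles in a very weak version of this rubric already, which is the sense in which it might be “testing for wisdom”.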