[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.
Mathematical descriptions are powerful because they can be very terse. You can only specify the properties of a system and still get a well-defined system.
This is in contrast to writing algorithms and data structures where you need to get concrete implementations of the algorithms and data structures to get a full description.
"Mathematical descriptions" is a little ambiguous. Equations and models are terse. The mapping of such equations to human-level system expectations (anticipated conditional experiences) can require quite a bit of verbosity.
I think that's what you're saying with the "algorithms and data structures" part, but I'm unsure if you're claiming that the property specification of the math is sufficient as a description, and comparable in fidelity to the algorithmic implementation.
The Model-View-Controller architecture is very powerful. It allows us to separate concerns.
For example, if we want to implement an algorithm, we can write down only the data structures and algorithms that are used.
We might want to visualize the steps that the algorithm is performing, but this can be separated from the actual running of the algorithm.
If the algorithm is interactive, then instead of putting the interaction logic in the algorithm, which could be thought of as the rules of the world, we instead implement functionality that directly changes the...
Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals.
In biological organisms, physical pain [say, in response to limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where their long-term goals [e.g. locomotion to a favourable position with the limb] which required the subsystem generating pain were harmed.
This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] w...
I would certainly agree with part of what you are saying. Especially the point that many important lessons are taught by pain [correct me if this is misinterpreting your comment]. Indeed, as a parent for example, if your goal is for your child to gain the capacity for self sufficiency, a certain amount of painful lessons that reflect the inherent properties of the world are necessary to achieve such a goal.
On the other hand, I do not agree with your framing of pain as being the main motivator [again, correct me if required]. In fact, a wide variety of syst...
A very rough draft of a plan to test prophylactics for airborne illnesses.
Start with a potential superspreader event. My ideal is a large conference, many of whom travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.
Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in t...
All of the problems you list seem harder with repeated within-person trials.
I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, its pretty clear. The common units are "amount of time", so we should compare the marginal (probablistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.
For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better...
This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual.
The sort of reasoning you use here, and that my only response to it basically amounts to "well, no I think you're wrong. This proposal will slow down alignment too much" is why I think we need numbers to ground us.
when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wantin...
I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check SEP article on physicalism and found that, yep, it doesn't have definition of physics:
...Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all,
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).
I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more...
I'm not a decel, but the way this stuff often is resolved is that there are crazy people that aren't taken seriously by the managerial class but that are very loud and make obnoxious asks. Think the evangelicals against abortion or the Columbia protestors.
Then there is some elite, part of the managerial class, that makes reasonable policy claims. For Abortion, this is Mitch McConnel, being disciplined over a long period of time in choosing the correct judges. For Palestine, this is Blinken and his State Department bureaucracy.
The problem with d...
The FDC just fined US phone carriers for sharing the location data of US customers to anyone willing to buy them. The fines don't seem to be high enough to deter this kind of behavior.
That likely includes either directly or indirectly the Chinese government.
What does the US Congress do to protect spying by China? Of course, banning tik tok instead of actually protecting the data of US citizens.
If you have thread models that the Chinese government might target you, assume that they know where your phone is and shut it of when going somewhere you...
This is a narrow objection to the IMO hyperbolic focus on government assault risks.
Whether or not you face government assault risks depends on what you do. Most people don't face government assault risks. Some people engage in work or activism that results in them having government assault risks.
The Chinese government has strategic goals and most people are unimportant to those. Some people however work on topics like AI policy in which the Chinese government has an interest.
From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (from February 2023):
Assume we're in a world where AI systems are broadly deployed, and the world has become increasingly complex, where humans know less and less about how things work.
A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize it won't be successful if it tries to take over, not a problem.
So you lose when a takeover becomes possible, and some threshold of AIs behave badly. If all the smartest AIs...
Looking for blog platform/framework recommendations
I had a Wordpress blog, but I don't like wordpress and I want to move away from it.
Substack doesn't seem like a good option because I want high customizability and multilingual support (my Blog is going to be in English and Hebrew).
I would like something that I can use for free with my own domain (so not Wix).
The closest thing I found to what I'm looking for was MkDocs Material, but it's still geared too much towards documentation, and I don't like its blog functionality enough.
Other requirements: Da...
some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:
my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoritic cooperation is the true name of altruism.
it's possible that this is true for some people, but in general i expect that to be a mistaken anal...
It certainly is possible! In more decision-theoritic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct!
But as per my post, I'd describe this position as "not intrinsically altruistic" — you're optimizing for everyone's happiness because "it sure would sure if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.
In my fantasies, if I ever were to get that god-like glimpse at how everything actually is, with all that is currently hidden unveiled, it would be something like the feeling you have when you get a joke, or see a "magic eye" illustration, or understand an illusionist's trick, or learn to juggle: what was formerly perplexing and incoherent becomes in a snap simple and integrated, and there's a relieving feeling of "ah, but of course."
But it lately occurs to me that the things I have wrong about the world are probably things I've grasped at exactly because ...
And then today I read this: “We yearn for the transcendent, for God, for something divine and good and pure, but in picturing the transcendent we transform it into idols which we then realize to be contingent particulars, just things among others here below. If we destroy these idols in order to reach something untainted and pure, what we really need, the thing itself, we render the Divine ineffable, and as such in peril of being judged non-existent. Then the sense of the Divine vanishes in the attempt to preserve it.” (Iris Murdoch, Metaphysics as a Guide to Morals)
There have been multiple occasions where I've copy and pasted email threads into an LLM and asked it things like:
I really want an email plugin that basically brute forces rationality INTO email conversations.
That seems fair enough!
I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t... (read more)
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI? (I will indeed assume a "no comment" or something to that degree implies a "yes" with reasonably high probability. Also, you might be interested in this link about the U.S. labor board deciding that NDA's offered during severance agreements that cover the existence of the NDA itself have been ruled unlawful by the National Labor Relations Board)