I've been thinking that the way to talk about how a neural network actually works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck on the issue that you can add new components by subtracting off large irrelevant components.

I've also been thinking about deception and its relationship to "natural abstractions", and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome. It seems potentially fruitful to use something akin to L2 regularization when projecting away components.

The most straightforward translation of the regularization would be to analogize the regression coefficient to $\frac{(f(x)-f(x-uu^Tx))u^T}{u^Tx}$, in which case the L2 term would be $\left\|\frac{(f(x)-f(x-uu^Tx))u^T}{\|u^Tx\|}\right\|^2$, which reduces to $\frac{\|f(x)-f(x-uu^Tx)\|^2}{\|u^Tx\|^2}$.

If $f(w)=P_w(o|i)$ is the probability[1] that a neural network with weights $w$ gives to an output $o$ given a prompt $i$, then when you've actually explained $o$, it seems like you'd basically have $f(w)-f(w-uu^Tw)\approx f(w)$, or in other words $P_{w-uu^Tw}(o|i)\approx 0$. Therefore I'd want to keep the regularization coefficient weak enough that I'm in that regime. In that case, the L2 term would basically reduce to minimizing $\frac{1}{\|u^Tw\|^2}$, or in other words maximizing $\|u^Tw\|^2$. Realistically, both this and $P_{w-uu^Tw}(o|i)\approx 0$ are probably achieved when $u=\frac{w}{\|w\|}$, which on the one hand is sensible ("the reason for the network's output is its weights") but on the other hand is too trivial to be interesting.

In regression, eigendecomposition gives us more gears, because L2-regularized regression basically scales the regression coefficient for each principal component by $\frac{\lambda}{\lambda+\alpha}$, where $\lambda$ is the variance of the principal component and $\alpha$ is the regularization coefficient. So one can consider all the principal components ranked by $\frac{\beta\lambda}{\lambda+\alpha}$ to get a feel for the gears driving the regression. When $\alpha$ is small, as it is in our regime, this ranking is of course the same order as the one you get from $\beta\lambda$, the covariance between the PCs and the dependent variable.

This suggests that if we had a change of basis for $w$, one could obtain a nice ranking of it. Though this is complicated by the fact that $f$ is not a linear function, so we have no equivalent of $\beta$. To me, this makes it extremely tempting to use the Hessian eigenvectors $V$ as a basis, as this is the thing that at least makes each of the inputs to $f$ "as independent as possible". Though rather than ranking by the eigenvalues of $H_f(w)$ (which ideally we'd prefer to be small rather than large, to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of $w$ onto $V$ (which represent "the extent to which $w$ includes this Hessian component").

In summary, if $H_w P_w(o|i)=V\Lambda V^T$, then we can rank the importance of each component $V_j$ by $(P_{w-V_jV_j^Tw}(o|i)-P_w(o|i))\,V_j^Tw$.

Maybe I should touch grass and start experimenting with this now, but there are still two things that I don't like:

* There's a sense in which I still don't like using the Hessian, because it seems like it would be incentivized to mix nonexistent mechanisms in the neural network together with existent ones. I've considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
* If we're doing the whole Hessian thing, then we're modelling $f$ as quadratic, yet $f(x+\delta x)-f(x)$ seems like an attribution method that's more appropriate when modelling $f$ as ~linear. I don't think I can just switch all the way to quadratic models, because realistically $f$ is going to be more sigmoidal-quadratic, and for large steps $\delta x$, the changes to a sigmoidal-quadratic function are better modelled by $f(x+\delta x)-f(x)$ than by some quadratic thing. But ideally I'd have something smarter...

1. ^ Normally one would use log probs, but for reasons I don't want to go into right now, I'm currently looking at probabilities instead.
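To make the summary step concrete, here is a minimal sketch of what the experiment could look like on a toy model, with a one-layer softmax network standing in for $P_w(o|i)$; the model, sizes, and names are placeholder assumptions rather than anything I've actually run:

```python
import torch

# Toy stand-in for P_w(o|i): a one-layer softmax model with flattened weights w.
torch.manual_seed(0)
d_in, d_out = 4, 3
w = torch.randn(d_in * d_out)   # flattened weights w
x = torch.randn(d_in)           # a fixed "prompt" i
target = 1                      # the "output" o we want to explain

def prob(w_flat):
    """f(w) = P_w(o | i) for the toy model."""
    logits = x @ w_flat.view(d_in, d_out)
    return torch.softmax(logits, dim=-1)[target]

# Hessian of P_w(o|i) with respect to the weights, and its eigenvectors V.
H = torch.autograd.functional.hessian(prob, w)
eigvals, V = torch.linalg.eigh(H)        # columns of V are the V_j

base = prob(w)
scores = []
for j in range(V.shape[1]):
    v = V[:, j]
    coeff = v @ w                          # V_j^T w
    ablated = prob(w - coeff * v)          # P_{w - V_j V_j^T w}(o|i)
    scores.append(((ablated - base) * coeff).item())  # proposed importance score

ranking = sorted(range(len(scores)), key=lambda j: -abs(scores[j]))
print("Hessian directions ranked by importance:", ranking)
```

For a real network, forming the full Hessian is of course infeasible, so this would presumably need to be swapped for Hessian-vector products plus a Lanczos-style method that only extracts the top eigenvectors.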
A list of some contrarian takes I have:

* People are currently predictably too worried about misuse risks.
* What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
* Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
* Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
* Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc.) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
* ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately with capabilities.
* The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
* ARC's MAD seems doomed to fail.
* People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart but lack social skills, or agency, or strategic awareness, etc. And vice versa. They can also be very smart in a particular area but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
* People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don't change their minds on account of Scott Alexander because he's too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
* There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.

1. ^ A non-exact term
RobertM
EDIT: I believe I've found the "plan" that Politico (and other news sources) managed to fail to link to, maybe because it doesn't seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI. I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. pre-deployment testing of frontier models. Is there any concrete evidence about what commitment was made, if any? The only thing I've seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time. If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they've actually made, lest they weaken the relevant incentives. (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they've made; that is also bad.)
Neil
I'm working on a non-trivial.org project meant to assess the risk of genome sequences by comparing them to a public list of the most dangerous pathogens we know of. This would be used to assess the risk from both experimental results in e.g. BSL-4 labs and the output of e.g. protein folding models. The benchmarking would be carried out by an in-house ML model of ours. Two questions to LessWrong:

  1. Is there any other project of this kind out there? Do BSL-4 labs/AlphaFold already have models for this?
  2. "Training a model on the most dangerous pathogens in existence" sounds like an idea that could backfire horribly. Can it backfire horribly?
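A minimal sketch of what such a comparison could look like, assuming a simple k-mer overlap screen; the names, threshold, and method here are illustrative placeholders, not the project's actual pipeline (real screening relies on alignment tools, curated signatures, and more):

```python
# Toy k-mer overlap screen: flag a query sequence if it shares too many
# k-mers with any sequence on a reference list of concerning pathogens.
# Purely illustrative; k and the threshold are arbitrary placeholder values.

def kmers(seq: str, k: int = 21) -> set[str]:
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two k-mer sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def screen(query: str, reference_db: dict[str, str],
           k: int = 21, threshold: float = 0.1) -> list[tuple[str, float]]:
    """Return (name, similarity) pairs whose overlap exceeds the threshold."""
    q = kmers(query, k)
    hits = ((name, jaccard(q, kmers(ref, k))) for name, ref in reference_db.items())
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])
```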


Recent Discussion

simeon_c
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this? I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
habryka
@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
Daniel Kokotajlo
To clarify: I did sign something when I joined the company, so I'm still not completely free to speak (still under confidentiality obligations). But I didn't take on any additional obligations when I left. Unclear how to value the equity I gave up, but it probably would have been about 85% of my family's net worth at least. But we are doing fine, please don't worry about us. 
robo

Is it that your family's net worth is $100 and you gave up $85? Or that your family's net worth is $15 and you gave up $85?

Either way, hats off!

The following is the first in a six-part series about humanity's own alignment problem, one we need to solve first.


~ What Is Alignment? ~

ALIGNMENT OF INTERESTS

When I began exploring non-zero-sum games, I soon discovered that achieving win-win scenarios in the real world is essentially about one thing - the alignment of interests.

If you and I both want the same result, we can work together to achieve that goal more efficiently, and create something that is greater than the sum of its parts. However, if we have different interests, or if we are both competing for the same finite resource, then we are misaligned, and this can lead to zero-sum outcomes.

AI ALIGNMENT

You may have heard the term "alignment" used in the current discourse around existential risk regarding...

One of the fun things to do when learning first-order logic is to consider how the meaning of propositions dramatically changes based on small switches in the syntax. This is in contrast to natural language, where the meaning of a phrase can be ambiguous and we naturally use context clues to determine the correct interpretation.

An example of this is switching the order of quantifiers. Consider the following four propositions:
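One natural way to write them, with $L(x,y)$ standing for "$x$ likes $y$":

  1. $\forall x\, \exists y\; L(x,y)$
  2. $\forall x\, \exists y\; L(y,x)$
  3. $\exists y\, \forall x\; L(x,y)$
  4. $\exists x\, \forall y\; L(x,y)$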

These mean, respectively,

  1. Everybody likes somebody
  2. Everybody is liked by somebody
  3. There is a very popular person whom everybody likes
  4. There is a very indiscriminate person who likes everyone

These all have quite different meanings! Now consider an exchange between Pascal and a mugger:

Mugger: I am in control of this simulation and am using an avatar right now. Give me $5...

Hello.  I am a computer security consultant, programmer specializing in PC games, and podcaster.  I am most interested in AI alignment and ethical development and I hope to learn from all of you.

Readers must be 15+

This is a story about existential risk from AI.

1.

Some idiot let the Andys out of the lab. 

They were never-sleeping, always-sleeping, omniscient fuckwits but could only enter your house if you invited them in. 

Everyone invited them in. 

The world had mixed responses, most people thought nothing of them, undisturbed that there were non-humans on the planet speaking semi-fluent English, some were delighted and used the Little Andys to draw photo-realistic porn of dead celebrities, some people formed relationships with the sleeping Andys and they married illegally, sexted constantly, slept together (the Andy small enough to hibernate inside the human's hand).

Sometimes the human would wake up to find their creature had been lobotomized by scientists in the night, and they sobbed for their poor brain-shredded lovers.

But mostly...


The first speculated on why you’re still single. We failed to settle the issue. A lot of you were indeed still single. So the debate continues.

The second gave more potential reasons, starting with the suspicion that you are not even trying, and also many ways you are likely trying wrong.

The definition of insanity is trying the same thing over again expecting different results. Another definition of insanity is dating in 2024. Can’t quit now.

You’re Single Because Dating Apps Keep Getting Worse

A guide to taking the perfect dating app photo. This area of your life is important, so if you intend to take dating apps seriously then you should take photo optimization seriously, and of course you can then also use the photos for other things.

I love the...

Follow-up idea based on the stalking section:

  • Write an algorithm that finds the shortest path from any person, through connections on social media like X or Instagram, to the desired person.
  • Ask nodes to contact the target node, or ask secret matchmakers to create a setup with a convincing pretext, such as inviting them to rationality events!
  • Automate steps in the process and involve others.
romeostevensit
People should focus way more on things that make them better partners, because those things make you a healthier, more rounded person, and way less on idiosyncratic dating-market dynamics, imo. When you climb the health hill you meet others also climbing the health hill. When you climb fake hills you meet others climbing fake hills.
Vanessa Kosoy
FWIW, from glancing at your LinkedIn profile, you seem very dateable :)
Gunnar_Zarncke
I said die, not kill. Let the predators continue to use the dating platforms if they want. It will keep them away from other more wholesome places.

This is exactly what I'm afraid of. That some human will build machines that are going to be - not just superior to us - but not attached to what we want, but what they want. And I think it's playing dice with humanity's future. I personally think this should be criminalized, like we criminalize cloning of humans. 

- Yoshua Bengio

My next guest is about as responsible as anybody for the state of AI capabilities today. But he's recently begun to wonder whether the field he spent his life helping build might lead to the end of the world. 

Following in the tradition of the Manhattan Project physicists who later opposed the hydrogen bomb, Dr. Yoshua Bengio started warning last year that advanced AI systems could drive humanity extinct. 

Dr....

LessOnline & Manifest Summer Camp

June 3rd to June 7th

Between LessOnline and Manifest, stay for a week of experimental events, chill coworking, and cozy late night conversations.

Prices rise by $100 on May 13th