simeon_c
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this? I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
A list of some contrarian takes I have:

* People are currently predictably too worried about misuse risks.
* What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
* Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
* Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
* Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc.) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
* ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.
* The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
* ARC's MAD seems doomed to fail.
* People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
* People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don't change their minds on account of Scott Alexander because he's too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
* There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.

----------------------------------------
1. A non-exact term ↩︎
I've been thinking that the way to talk about how a neural network actually works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck on the issue that you can effectively add new components by subtracting off large irrelevant components.

I've also been thinking about deception and its relationship to "natural abstractions", and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome. It seems potentially fruitful to use something akin to L2 regularization when projecting away components. The most straightforward translation of the regularization would be to analogize the regression coefficient to $\frac{(f(x) - f(x - uu^T x))\,u^T}{u^T x}$, in which case the L2 term would be $\left\|\frac{(f(x) - f(x - uu^T x))\,u^T}{\|u^T x\|}\right\|^2$, which reduces to $\frac{\|f(x) - f(x - uu^T x)\|^2}{\|u^T x\|^2}$.

If $f(w) = P_w(o|i)$ is the probability[1] that a neural network with weights $w$ gives to an output $o$ given a prompt $i$, then when you've actually explained $o$, it seems like you'd basically have $f(w) - f(w - uu^T w) \approx f(w)$, or in other words $P_{w - uu^T w}(o|i) \approx 0$. Therefore I'd want to keep the regularization coefficient weak enough that I'm in that regime. In that case, the L2 term would basically reduce to minimizing $\frac{1}{\|u^T w\|^2}$, or in other words maximizing $\|u^T w\|^2$. Realistically, both this and $P_{w - uu^T w}(o|i) \approx 0$ are probably achieved when $u = \frac{w}{\|w\|}$, which on the one hand is sensible ("the reason for the network's output is its weights") but on the other hand is too trivial to be interesting.

In regression, eigendecomposition gives us more gears, because L2-regularized regression basically scales the regression coefficient for each principal component by $\frac{\lambda}{\lambda + \alpha}$, where $\lambda$ is the variance of the principal component and $\alpha$ is the regularization coefficient. So one can consider all the principal components ranked by $\frac{\beta \lambda}{\lambda + \alpha}$ to get a feel for the gears driving the regression. When $\alpha$ is small, as it is in our regime, this ranking is of course the same order as the one you get from $\beta \lambda$, the covariance between the PCs and the dependent variable.

This suggests that if we had a change of basis for $w$, one could obtain a nice ranking of its components. Though this is complicated by the fact that $f$ is not a linear function, so we have no equivalent of $\beta$. To me, this makes it extremely tempting to use the Hessian eigenvectors $V$ as a basis, as this is the thing that at least makes each of the inputs to $f$ "as independent as possible". Though rather than ranking by the eigenvalues of $H f(w)$ (which ideally we'd actually prefer to be small rather than large, to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of $w$ onto $V$ (which represent "the extent to which $w$ includes this Hessian component"). In summary, if $H_w P_w(o|i) = V \Lambda V^T$, then we can rank the importance of each component $V_j$ by $(P_{w - V_j V_j^T w}(o|i) - P_w(o|i))\, V_j^T w$.

Maybe I should touch grass and start experimenting with this now, but there are still two things I don't like:

* There's a sense in which I still don't like using the Hessian, because it seems like it would be incentivized to mix nonexistent mechanisms in the neural network together with existent ones. I've considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
* If we're doing the whole Hessian thing, then we're modelling $f$ as quadratic, yet $f(x + \delta x) - f(x)$ seems like an attribution method that's more appropriate when modelling $f$ as ~linear. I don't think I can just switch all the way to quadratic models, because realistically $f$ is going to be more sigmoidal-quadratic, and for large steps $\delta x$ the changes to a sigmoidal-quadratic function are better modelled by $f(x + \delta x) - f(x)$ than by some quadratic thing. But ideally I'd have something smarter...

1. ^ Normally one would use log probs, but for reasons I don't want to go into right now, I'm currently looking at probabilities instead.
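To make that summary formula concrete, here is a minimal numpy sketch of the ranking procedure under stated assumptions: toy_prob is a made-up sigmoidal-quadratic stand-in for $P_w(o|i)$, the Hessian is taken by finite differences, and sorting by score magnitude is just one plausible reading of "rank by". None of these choices come from the shortform itself.

import numpy as np

def toy_prob(w):
    # Made-up stand-in for P_w(o|i): a sigmoidal-quadratic function of the "weights".
    return 1.0 / (1.0 + np.exp(-(w[0] * w[1] + 0.5 * w[2] ** 2)))

def hessian(f, w, eps=1e-3):
    # Central-difference Hessian of f at w (adequate for this tiny toy example).
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                v = w.astype(float)
                v[i] += si * eps
                v[j] += sj * eps
                return f(v)
            H[i, j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps ** 2)
    return H

w = np.array([1.0, -0.7, 0.3])
base = toy_prob(w)
eigvals, V = np.linalg.eigh(hessian(toy_prob, w))  # columns of V are Hessian eigenvectors

# Score each eigenvector V_j by (P_{w - V_j V_j^T w}(o|i) - P_w(o|i)) * (V_j^T w).
scores = []
for j in range(V.shape[1]):
    v = V[:, j]
    coord = v @ w                               # V_j^T w
    ablated = toy_prob(w - np.outer(v, v) @ w)  # project the V_j component away
    scores.append((ablated - base) * coord)

for j in sorted(range(len(scores)), key=lambda j: -abs(scores[j])):
    print(f"component {j}: score {scores[j]:+.5f}, eigenvalue {eigvals[j]:+.5f}")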
RobertM
EDIT: I believe I've found the "plan" that Politico (and other news sources) managed to fail to link to, maybe because it doesn't seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI. I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. predeployment testing of frontier models.  Is there any concrete evidence about what commitment was made, if any?  The only thing I've seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time.  If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they've actually made, lest they weaken the relevant incentives.  (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they've made; that is also bad.)

Popular Comments

Recent Discussion

Decaeneus
Causality is rare! The usual statement that "correlation does not imply causation" puts them, I think, on deceptively equal footing. It's really more like correlation is almost always not causation absent something strong like an RCT or a robust study set-up.

Over the past few years I'd gradually become increasingly skeptical of claims of causality just by updating on empirical observations, but it just struck me that there's a good first principles reason for this.

For each true cause of some outcome we care to influence, there are many other "measurables" that correlate to the true cause but, by default, have no impact on our outcome of interest. Many of these measures will (weakly) correlate to the outcome though, via their correlation to the true cause. So there's a one-to-many relationship between the true cause and the non-causal correlates. Therefore, if all you know is that something correlates with a particular outcome, you should have a strong prior against that correlation being causal.

My thinking previously was along the lines of p-hacking: if there are many things you can test, some of them will cross a given significance threshold by chance alone. But I'm claiming something more specific than that: any true cause is bound to be correlated to a bunch of stuff, which will therefore probably correlate with our outcome of interest (though more weakly, and not guaranteed since correlation is not necessarily transitive).

The obvious idea of requiring a plausible hypothesis for the causation helps somewhat here, since it rules out some of the non-causal correlates. But it may still leave many of them untouched, especially the more creative our hypothesis formation process is! Another (sensible and obvious, that maybe doesn't even require agreement with the above) heuristic is to distrust small (magnitude) effects, since the true cause is likely to be more strongly correlated with the outcome of interest than any particular correlate of the true cause.
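As a quick illustration of the one-to-many point, here is a small simulation sketch with invented numbers: a single true cause C drives the outcome Y, while ten "measurables" correlate with C but have no effect on Y, and yet every one of them still shows up as a (weaker) correlate of Y.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One true cause C, which actually drives the outcome Y.
C = rng.normal(size=n)
Y = 2.0 * C + rng.normal(size=n)

# Ten measurables that correlate with C but have no causal effect on Y.
measurables = [0.8 * C + rng.normal(size=n) for _ in range(10)]

print(f"corr(C, Y)   = {np.corrcoef(C, Y)[0, 1]:.2f}")        # the causal correlation
for k, M in enumerate(measurables):
    print(f"corr(M{k}, Y) = {np.corrcoef(M, Y)[0, 1]:.2f}")   # weaker, non-causal correlations

# Each M_k correlates with Y at roughly corr(M_k, C) * corr(C, Y),
# so one true cause produces ~10 non-causal correlates of the outcome.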
Garrett Baker
This seems pretty different from Gwern's paper selection trying to answer this topic in How Often Does Correlation=Causality?, where he concludes: … Also see his Why Correlation Usually ≠ Causation.
gwern

Those are not randomly selected pairs, however. There are 3 major causal patterns: A->B, A<-B, and A<-C->B. Decaeneus is pointing out that for a random pair of correlated variables, we do not assign a uniform prior of 33% to each of these. While it may sound crazy to try to argue for some specific prior like 'we should assign 1% to the direct causal patterns of A->B and A<-B, and 99% to the confounding pattern of A<-C->B', this is a lot closer to the truth than thinking that a third of the time, A causes B, a third of the ... (read more)

Daniel Kokotajlo
The latter. Yeah idk whether the sacrifice was worth it but thanks for the support. Basically I wanted to retain my ability to criticize the company in the future. I'm not sure what I'd want to say yet though & I'm a bit scared of media attention. 
WilliamKiely
I'd be interested in hearing people's thoughts on whether the sacrifice was worth it, from the perspective of assuming that counterfactual Daniel would have used the extra net worth altruistically. Is Daniel's ability to speak more freely worth more than the altruistic value that could have been achieved with the extra net worth?
WilliamKiely
(Note: Regardless of whether it was worth it in this case, simeon_c's reward/incentivization idea may be worthwhile as long as there are expected to be some cases in the future where it's worth it, since the people in those future cases may not be as willing as Daniel to make the altruistic personal sacrifice, and so we'd want them to be able to retain their freedom to speak without it costing them as much personally.)
habryka

I think having signed an NDA (and especially a non-disparagement agreement) from a major capabilities company should probably rule you out of any kind of leadership position in AI Safety, and especially any kind of policy position. Given that I think Daniel has a pretty decent chance of doing either or both of these things, and that work is very valuable and constrained on the kind of person that Daniel is, I would be very surprised if this wasn't worth it on altruistic grounds.  

This article is the last in a series of 10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance, such as incident reporting, safety evals, model registries, and more. We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governmental bodies currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.

This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this series.

What are CBRN hazards?

...

I was fully expecting to have to write yet another comment about how human-level AI will not be very useful for a nuclear weapon program. I concede that the dangers mentioned instead (someone putting an AI in charge of a reactor or nuke) seem much more realistic.

Of course, the utility of avoiding sub-extinction negative outcomes with AI in the near future is highly dependent on p(doom). For example, if there is no x-risk, then the first order effects of avoiding locally bad outcomes related to CBRN hazards are clearly beneficial. 

On the other han... (read more)

Basically all ideas/insights/research about AI is potentially exfohazardous. At least, it's pretty hard to know when some ideas/insights/research will actually make things better; especially in a world where building an aligned superintelligence (let's call this work "alignment") is quite harder than building any superintelligence (let's call this work "capabilities"), and there's a lot more people trying to do the latter than the former, and they have a lot more material resources.

Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On lesswrong, we should focus on teaching people the rationality skills which could help them figure out insights that help them build any superintelligence, but are more likely to first give them insights...

I think deeply understanding top tier capabilities researchers' views on how to achieve AGI is actually extremely valuable for thinking about alignment. Even if you disagree on object level views, understanding how very smart people come to their conclusions is very valuable.

I think the first sentence is true (especially for alignment strategy), but the second sentence seems sort of... broad-life-advice-ish, instead of a specific tip? It's a pretty indirect help to most kinds of alignment.

Otherwise, this comment's points really do seem like empirical thing... (read more)

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

Darklight
Thanks for the reply! So, the main issue I'm finding with putting them all into one proposal is that there's a 1000 character limit on the main summary section where you describe the project, and I cannot figure out how to cram multiple ideas into that 1000 characters without seriously compromising the quality of my explanations for each. I'm not sure if exceeding that character limit will get my proposal thrown out without being looked at though, so I hesitate to try that. Any thoughts?

Oh, hmm, I sure wasn't tracking a 1000 character limit. If you can submit it, I wouldn't be worried about it (and feel free to put that into your references section). I certainly have never paid attention to whether anyone stayed within the character limit.

Lorxus
The reason is not secret anymore! I have finished and published a two-post sequence on maximal lottery lotteries.
Lorxus
Excellent, thanks!

Previously: On the Proposed California SB 1047.

Text of the bill is here. It focuses on safety requirements for highly capable AI models.

This is written as an FAQ, tackling all questions or points I saw raised.

Safe & Secure AI Innovation Act also has a description page.

Why Are We Here Again?

There have been many highly vocal and forceful objections to SB 1047 this week, in reaction to a (disputed and seemingly incorrect) claim that the bill has been ‘fast tracked.’ 

The bill continues to have a substantial chance of becoming law according to Manifold, where the market has not moved on recent events. The bill has been referred to two policy committees, one of which put out this 38-page analysis.

The purpose of this post is to gather and analyze all...

Sure, but you weren’t providing reasons to not believe the argument, or reasons why your interpretation is at least as implausible.

// ODDS = YEP:NOPE
YEP, NOPE = MAKE UP SOME INITIAL ODDS WHO CARES
FOR EACH E IN EVIDENCE
	YEP *= CHANCE OF E IF YEP
	NOPE *= CHANCE OF E IF NOPE

The thing to remember is that yeps and nopes never cross. The colon is a thick & rubbery barrier. Yep with yep and nope with nope.

bear : notbear =
 1:100 odds to encounter a bear on a camping trip around here in general
* 20% a bear would scratch my tent : 50% a notbear would
* 10% a bear would flip my tent over : 1% a notbear would
* 95% a bear would look exactly like a fucking bear inside my tent : 1% a notbear would
* 0.01% chance a bear would eat me alive : 0.001% chance a notbear would

As you die you conclude 1*20*10*95*.01 : 100*50*1*1*.001 = 190 : 5 odds that a bear is eating you.
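For what it's worth, here is a minimal runnable version of the pseudocode above, plugged into the bear numbers (the helper name update_odds is just for this sketch; likelihoods are left in percent, since the common factor of 100 cancels out of the ratio):

def update_odds(yep, nope, evidence):
    # evidence: list of (chance of E if YEP, chance of E if NOPE) pairs.
    for p_if_yep, p_if_nope in evidence:
        yep *= p_if_yep
        nope *= p_if_nope
    return yep, nope

evidence = [
    (20, 50),       # scratched tent
    (10, 1),        # tent flipped over
    (95, 1),        # looks exactly like a bear inside my tent
    (0.01, 0.001),  # eating me alive
]
yep, nope = update_odds(1, 100, evidence)     # prior: 1:100 bear:notbear
print(f"{yep:g} : {nope:g}")                  # 190 : 5
print(f"P(bear) = {yep / (yep + nope):.3f}")  # about 0.974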

Zane
"20% a bear would scratch my tent : 50% a notbear would" I think the chance that your tent gets scratched should be strictly higher if there's a bear around?
lukehmiles
A possum or whatever will scratch mine like half the time.
keltan

In my head I was thinking a tree branch moving in the wind.

Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let’s call the habit of not saying things you know to be false ‘shallow honesty’[1].

Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country).

Either way, if you...

X O

Maybe I will not be able to submit this because my negative karma is so bad, even though I am not being antagonistic with people; I am just bringing up rational arguments about what I think, which I believe would be helpful.

What worries me about an article like this is, firstly, that civilisation seems to be regurgitating the same stuff over and over, which only shows that nothing is changing. Further, if we have to talk about deep honesty like a commodity when two generations ago it was just something that people did, but it was the 60s and 70s when we tried to bring love... (read more)

habryka
Promoted to curated: I sure tend to have a lot of conversations about honesty and integrity, and this specific post was useful in 2-3 conversations I've had since it came out. I like having a concept handle for "trying to actively act with an intent to inform", I like the list of concrete examples of the above, and I like how the post situates this as something with benefits and drawbacks (while also not shying away too much from making concrete recommendations on what would be better on the margin).
This is a linkpost for https://markxu.com/if-you-weren't

My friend Buck once told me that he often had interactions with me that felt like I was saying “If you weren’t such a fucking idiot, you would obviously do…” Here’s a list of such advice in that spirit.

Note that if you do/don’t do these things, I’m technically calling you an idiot, but I do/don’t do a bunch of them too. We can be idiots together.

If you weren’t such a fucking idiot…

  • You would have multiple copies of any object that would make you sad if you didn’t have it
    • Examples: ear plugs, melatonin, eye masks, hats, sun glasses, various foods, possibly computers, etc.
  • You would spend money on goods and services.
    • Examples of goods: faster computer, monitor, keyboard, various tasty foods, higher quality clothing, standing desk, decorations for your room,
...

One way to do this is to email people that you want to be your mentor with the subject “Request for Mentorship”.

I'm curious whether anyone sending emails like these has gotten a mentor. The success rate might be higher if you form a connection and then ask for recurring meetings.
