simeon_c
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders in the AI safety landscape offer some money or social reward for this? I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
A list of some contrarian takes I have:
  • People are currently predictably too worried about misuse risks.
  • What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
  • Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
  • Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
  • Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc.) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
  • ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.
  • The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
  • ARC's MAD seems doomed to fail.
  • People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
  • People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don't change their minds on account of Scott Alexander because he's too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
  • There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.

1. A non-exact term
Quote from Cal Newport's Slow Productivity book: "Progress in theoretical computer science research is often a game of mental chicken, where the person who is able to hold out longer through the mental discomfort of working through a proof element in their mind will end up with the sharper result."
RobertM
EDIT: I believe I've found the "plan" that Politico (and other news sources) managed to fail to link to, maybe because it doesn't seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI. I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. predeployment testing of frontier models.  Is there any concrete evidence about what commitment was made, if any?  The only thing I've seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time.  If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they've actually made, lest they weaken the relevant incentives.  (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they've made; that is also bad.)
Thomas Kwa
I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree much on tractability; it's just that Alex had somewhat different research taste, and also thought that fundamental problems in agent foundations must be figured out to make it to a good future, so that working on fairly intractable problems can still be necessary. That disagreement seemed pretty out of scope for the dialogue, so I likely won't publish it. Now that this post is out, I feel I should at least make this known. I don't regret attempting the dialogue; I just wish we had something more interesting to disagree about.


Recent Discussion

The curious tale of how I mistook my dyslexia for stupidity - and talked, sang, and drew my way out of it. 

Sometimes I tell people I’m dyslexic and they don’t believe me. I love to read, I can mostly write without error, and I’m fluent in more than one language.

Also, I don’t actually technically know if I’m dyslectic cause I was never diagnosed. Instead I thought I was pretty dumb but if I worked really hard no one would notice. Later I felt inordinately angry about why anyone could possibly care about the exact order of letters when the gist is perfectly clear even if if if I right liike tis.

I mean, clear to me anyway.

I was 25 before it dawned on me that all the tricks...

keltan

Is it not normal to subvocalise?

Could people react to this comment with a tick if they do, and a cross if they don't?

keltan
I was diagnosed as a kid. I went through a. lot. of. therapy. Lots of special classes, and making two thumbs up then pushing your knuckles together to make a bed shape that spells "bed". That all helped a lot. But three things helped to the point where I hardly think about it these days.
1. Minecraft PVP servers. You need to be able to effectively communicate with your team and taunt the enemy. And you need to be able to do it while someone is running at you with a sword.
2. Fighting with antivax people as a teenager on Facebook. The biggest slip-up someone could make in a Facebook argument was mixing up "you're" and "your".
3. Talking to girls I liked who could actually spell things correctly. I got very good at rapidly googling how to spell words as I was typing a response.
philip_b
I think that's pretty easy :)
AnthonyC
This was really interesting! You probably already know this, but reading out loud was the norm, and silent reading unusual, for most of history: https://en.wikipedia.org/wiki/Silent_reading That didn't really start to change until well after the invention of the printing press. For most of my life, even now once in a while, I would subvocalize my own inner monologue. Definitely had to learn to suppress that in social situations.
This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

  • Give feedback on how this project is helpful or how it could change to be much more helpful
  • Tell me what's wrong/missing; point me to sources
...
ryan_greenblatt
I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug. The deployment criteria starts with: This criteria seems to allow the lab to meet it by having a good risk assessment criteria, but the rest of the criteria contains specific countermeasures that:
1. Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC).
2. Don't pass cost-benefit for current models which pose low risk. (And it seems the criteria is "do you have them implemented right now?")

If the lab had an excellent risk assessment policy and released weights if the cost/benefit seemed good, that should be fine according to the "deployment" criteria IMO. Generally, the deployment criteria should be gated behind "has a plan to do this when models are actually powerful and their implementation of the plan is credible". I get the sense that this criteria doesn't quite handle the necessary edge cases to handle reasonable choices orgs might make. (This is partially my fault, as I didn't notice this when providing feedback on this project.)

(IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn't leak architecture secrets.)

(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach, I'm afraid.)
Zach Stein-Perlman
[edited] Thanks. I agree you're pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted. I didn't put much effort into clarifying this kind of thing because it's currently moot—I don't think it would change any lab's score—but I agree.[1] I think e.g. a criterion "use KYC" should technically be replaced with "use KYC OR say/demonstrate that you're prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn't too high]." Yeah. The criteria can be like "implement them or demonstrate that you could implement them and have a good plan to do so," but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don't work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.) I'm interested in suggestions :shrug: 1. ^ And I think my site says some things that contradict this principle, like 'these criteria require keeping weights private.' Oops.

Hmm, yeah it does seem thorny if you can get the points by just saying you'll do something.

Like I absolutely think this shouldn't count for security. I think you should have to demonstrate actual security of model weights and I can't think of any demonstration of "we have the capacity to do security" which I would find fully convincing. (Though setting up some inference server at some point which is secure to highly resourced pen testers would be reasonably compelling for demonstrating part of the security portfolio.)

Akash
Could consider “frontier AI watch”, “frontier AI company watch”, or “AGI watch.” Most people in the world (including policymakers) have a much broader conception of AI. AI means machine learning, AI is the thing that 1000s of companies are using and 1000s of academics are developing, etc etc.
This is a linkpost for https://arxiv.org/abs/2405.05673

Linked is my MSc thesis, where I do regret analysis for an infra-Bayesian[1] generalization of stochastic linear bandits.

The main significance that I see in this work is:

  • Expanding our understanding of infra-Bayesian regret bounds, and solidifying our confidence that infra-Bayesianism is a viable approach. Previously, the most interesting IB regret analysis we had was Tian et al., which deals (essentially) with episodic infra-MDPs. My work here doesn't supersede Tian et al. because it only talks about bandits (i.e. stateless infra-Bayesian laws), but it complements it because it deals with a parametric hypothesis space (i.e. it fits into the general theme in learning theory that generalization bounds should scale with the dimension of the hypothesis class; the classical baseline is sketched after this excerpt for comparison).
  • Discovering some surprising features of infra-Bayesian learning that have no analogues in classical theory. In particular, it
...
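For context on the first bullet above, here is the standard classical result (not a result from the thesis, just the well-known baseline that the infra-Bayesian analysis generalizes): in ordinary d-dimensional stochastic linear bandits, optimism-based algorithms achieve regret of order d√T up to logarithmic factors, so the bound scales with the dimension of the parametric hypothesis class rather than with the number of arms.

```latex
% Classical baseline for d-dimensional stochastic linear bandits
% (high-probability regret after T rounds, up to logarithmic factors):
\[
  R_T \;=\; \sum_{t=1}^{T} \Big( \langle \theta^*, x^* \rangle - \langle \theta^*, x_t \rangle \Big)
  \;=\; \tilde{O}\!\left( d \sqrt{T} \right),
\]
% where \theta^* is the unknown parameter vector, x^* the optimal action,
% and x_t the action chosen at round t.
```

The thesis's contribution, as described above, is establishing analogous dimension-dependent bounds in the infra-Bayesian setting.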

Predicting the future is hard, so it’s no surprise that we occasionally miss important developments.

However, several times recently, in the contexts of Covid forecasting and AI progress, I noticed that I missed some crucial feature of a development I was interested in getting right, and it felt to me like I could’ve seen it coming if only I had tried a little harder. (Some others probably did better, but I could imagine that I wasn't the only one who got things wrong.)

Maybe this is hindsight bias, but if there’s something to it, I want to distill the nature of the mistake.

First, here are the examples that prompted me to take notice:

Predicting the course of the Covid pandemic:

  • I didn’t foresee the contribution from sociological factors (e.g., “people not wanting
...
Jsevillamol
Here is a "predictable surprise" I don't discussed often: given the advantages of scale and centralisation for training, it does not seem crazy to me that some major AI developers will be pooling resources in the future, and training jointly large AI systems.
habryka
I have a lot of uncertainty about the difficulty of robotics, and the difficulty of e.g. designing superviruses or other ways to kill a lot of people. I do agree that in most worlds robotics will be solved to a human level before AI will be capable of killing everyone, but I am generally really averse to unnecessarily constraining my hypothesis space when thinking about this kind of stuff. >90% seems quite doable with a well-engineered virus (especially one with a long infectious incubation period). I think 99%+ is much harder and probably out of reach until after robotics is thoroughly solved, but like, my current guess is a motivated team of humans could design a virus that kills 90% - 95% of humanity.
O O
Can a motivated team of humans design a virus that spreads rapidly but stays dormant for a while, and then kills most humans via a mechanism that's difficult to stop before we can respond? And it has to happen before we develop AIs that can detect these sorts of latent threats anyway. You have to realize that if Covid were like this, we would mass-trial mRNA vaccines as soon as they were available, plus a lot of Hail Mary procedures, since the alternative is extinction. These slightly-smarter-than-human AIs will be monitored by other such AIs, and will probably be rewarded if they defect. (The AIs they defect on get wiped out, and they possibly get to replicate more, for example.) I think such a takeover could be quite difficult to pull off in practice. The world with lots of slightly-smarter-than-human AIs will be more robust to takeover, there's a limited time window to even attempt it, failure would be death, and humanity would be far more disciplined against this than against Covid.

Despite my general interest in open inquiry, I will avoid talking about my detailed hypothesis of how to construct such a virus. I am not confident this is worth the tradeoff, but the costs of speculating about the details here in public do seem non-trivial.

simeon_c
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders in the AI safety landscape offer some money or social reward for this? I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
habryka

@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.

Epistemic Status: At midnight three days ago, I saw some of the GPT-4 Byproduct Recursively Optimizing AIs below on Twitter, which freaked me out a little and lit a fire under me to write up this post, my first on LessWrong. Here, my main goal is to start a dialogue on this topic, which from my (perhaps secluded) vantage point nobody seems to be talking about. I don't expect to currently have the optimal diagnosis of the issue and prescription of end solutions.

Acknowledgements: Thanks to my fellow Wisconsin AI Safety Initiative (WAISI) group organizers Austin Witte and Akhil Polamarasetty for giving feedback on this post. Organizing the WAISI community has been incredibly fruitful for sparring over ideas with others and seeing which of the strongest survive....

Empathizing with AGI will not align it nor will it prevent any existential risk. Ending discrimination would obviously be a positive for the world, but it will not align AGI.

It may not align it, but I do think it would prevent certain unlikely existential risks.

If AI/AGI/ASI is truly intelligent, and not just knowledgeable, we should definitely empathize and be compassionate with it. If it ends up being non-sentient, so be it, guess we made a perfect tool. If it ends up being sentient and we've been abusing a being that is super-intelligent, then good luck...


The first speculated on why you’re still single. We failed to settle the issue. A lot of you were indeed still single. So the debate continues.

The second gave more potential reasons, starting with the suspicion that you are not even trying, and also many ways you are likely trying wrong.

The definition of insanity is trying the same thing over again expecting different results. Another definition of insanity is dating in 2024. Can’t quit now.

You’re Single Because Dating Apps Keep Getting Worse

A guide to taking the perfect dating app photo. This area of your life is important, so if you intend to take dating apps seriously then you should take photo optimization seriously, and of course you can then also use the photos for other things.

I love the...

I think this list will successfully convince many to stay off the dating market indefinitely (I feel inclined that way myself after reading all this). Who in the world has time to work on all of this? At best, this is just a massive set of to-dos; at worst, it's an enormous list of all the ways the dating world sucks and reasons why you'll fail.

exanova
If it helps, I am willing to match people in the rationality-adjacent circles in the Bay Area and give you personal feedback. You can find my contact information in my profile.
rotatingpaguro
Thinking about it, I suspect I was not getting what "authenticity and openness" means. Like, it's not "being yourself and letting go", but more "being honest", I guess? Could you give me two or more examples of a person being "authentic and open"?
RamblinDash
So I guess I'm not sure what you mean by that. I think it might be easier to support what I'm saying in the negative. Some examples of inauthenticity or un-openness might be:
  • Consciously faking your personality (in a way that you wouldn't want to maintain as an essentially permanent change)
  • Lying about what you want out of the relationship
  • Pretending to like/dislike hobbies or interests that you actually strongly dislike/like

The problem with doing these things is that, to the extent that doing them was necessary to gain the relationship, you are now stuck with a relationship that is built on a papered-over incompatibility. If your plan is to fake a completely different personality/goals/interests, then you will be in a relationship where you have to permanently keep faking that stuff while constantly being wary that your new partner might find out you were faking, plus you have to spend a lot of time and energy doing stuff and/or interacting with someone you don't actually like, or else end the relationship and be back at square one, except that you've invested time/energy that you won't get back.

There can be toned-down good versions of this bad strategy though, I think, which are more like "putting your best foot forward" than like "being inauthentic."

Truth: Looking for a life partner, getting desperate.
Good strategy [probably depends on age, for this one]: Open to various possibilities, see how it goes.
Bad strategy: Your date says they are really only looking for short-term fun, and you agree that's all you are looking for too.

Truth: A talkative person who loves debating ideas.
Good strategy: Tone it down a little, try to listen as much as you talk, and try to "yes, and" or "that's interesting, tell me more about what led you to that" your date's points rather than "no but" (you can often make similar points either way).
Bad strategy: Just agree with everything your date says, even if you actually have a strong opposing view.

1. If you find that you’re reluctant to permanently give up on to-do list items, “deprioritize” them instead

I hate the idea of deciding that something on my to-do list isn't that important, and then deleting it off my to-do list without actually doing it. Because once it's off my to-do list, then quite possibly I'll never think about it again. And what if it's actually worth doing? Or what if my priorities will change such that it will be worth doing at some point in the future? Gahh!

On the other hand, if I never delete anything off my to-do list, it will grow to infinity.

The solution I’ve settled on is a priority-categorized to-do list, using a kanban-style online tool (e.g. Trello). The left couple columns (“lists”) are very active—i.e., to-do list...

MondSemmel
I've found that there's value in having short to-do lists, because short lists fit much better into working memory and are thus easier to think about. If items are deprioritized rather than getting properly deleted from the system, this increases the total number of to-dos one could think about. On the other hand, maybe moving tasks to offscreen columns is sufficient to get them off one's mind? It seems to me like a both easier and more comprehensive approach would be to use a text editor with proper version control and diff features, and then to name particular versions before making major changes.
Steven Byrnes
IMO the main point of a to-do list is to not have the to-do list in working memory. The only thing that should be in working memory is the one thing you're actually supposed to be focusing on and doing, right now. Right? Or if you're instead in the mode of deciding what to do next, or making a schedule for your day, etc., then that's different, but working memory is still kinda irrelevant because presumably you have your to-do list open on your computer, right in front of your eyes, while you do that, right? Is that what you do? It's not a good fit to my typical workflow. But I'm definitely in favor of trying different things and seeing what works best for you. :)

Or if you're instead in the mode of deciding what to do next, or making a schedule for your day, etc., then that's different, but working memory is still kinda irrelevant because presumably you have your to-do list open on your computer, right in front of your eyes, while you do that, right?

Whenever I look at a to-do list, I've personally found it noticeably harder to decide which of e.g. 15 tasks to do, than which of <10 tasks to do. And this applies to lists of all kinds. A related difficulty spike appears once a list no longer fits on a single screen and requires scrolling.

Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.

The Volkswagen emissions scandal

There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards while maintaining the desired performance and fuel efficiency of their diesel engines (Wikipedia). Consequently, Volkswagen...

I am not sure I fully understand your point, but the problem with detecting sandbagging is that you do not know the actual capability of a model. And I guess that you mean "an anomalous decrease in capability" and not increase?

Regardless, could you spell out more how exactly you'd detect sandbagging?

LessOnline & Manifest Summer Camp

June 3rd to June 7th

Between LessOnline and Manifest, stay for a week of experimental events, chill coworking, and cozy late night conversations.

Prices raise $100 on May 13th