simeon_c
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders in the AI safety landscape give him money or social reward for this? I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
A list of some contrarian takes I have:
  • People are currently predictably too worried about misuse risks
  • What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
  • Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
  • Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
  • Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
  • ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.
  • The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
  • ARC's MAD seems doomed to fail.
  • People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
  • People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don't change their minds on account of Scott Alexander because he's too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
  • There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.

----------------------------------------
1. A non-exact term ↩︎
I’m confused: if the dating apps keep getting worse, how come nobody has come up with a good one, or at least a clone of OkCupid? Like, as far as I can understand, not even “a good matching system is somehow less profitable than making people swipe all the time (surely it’d still be profitable on an absolute scale)” or “it requires a decently big initial investment” can explain the complete lack of good products in such an in-demand area. Has anyone dug into it / tried to start a good dating app as a summer project?
RobertM
EDIT: I believe I've found the "plan" that Politico (and other news sources) managed to fail to link to, maybe because it doesn't seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI. I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. predeployment testing of frontier models.  Is there any concrete evidence about what commitment was made, if any?  The only thing I've seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time.  If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they've actually made, lest they weaken the relevant incentives.  (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they've made; that is also bad.)
keltan
I am currently completing psychological studies for credit in my university psych course. The entire time, all I can think is “I wonder if that detail is the one they’re using to trick me?” I wonder how this impacts results. I can’t imagine that being in a heightened state of looking out for deception has no impact.

Recent Discussion

It is easier to ask than to answer. 

That’s my whole point.

It is much cheaper to ask questions than to answer them, so beware of situations where it is implied that asking and answering are equal. 

Here are some examples:

Let's say there is a maths game. I get a minute to ask questions. You get a minute to answer them. If you answer them all correctly, you win; if not, I do. Who will win?

Preregister your answer.

Okay, let's try. These questions took me roughly a minute to come up with. 

What's 56,789 * 45,387?

What's the integral from -6 to 5π of sin(x cos^2(x))/tan(x^9) dx?

What's the prime factorisation of 91435293173907507525437560876902107167279548147799415693153?
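
(A quick sketch of the asymmetry, mine rather than the post's: the multiplication is instant for a machine, while the factorisation is where essentially all the cost lives. The use of sympy here is an assumption, not something the post specifies.)

```python
# Illustration only (not from the post): posing each question is one line;
# answering the last one is the expensive part.
from sympy import factorint  # assumed dependency; any factoring tool would do

# Question 1: multiplication is trivial for a computer.
print(56_789 * 45_387)  # prints 2577482343

# Question 3: factoring a 59-digit integer. factorint() will start happily,
# but if the number is a hard semiprime this can run for a very long time
# on ordinary hardware, which is exactly the asymmetry being pointed at.
n = 91435293173907507525437560876902107167279548147799415693153
# factorint(n)  # left commented out: this is the part that costs hours or more
```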

Good luck. If I understand correctly, that last one's gonna take you at least an hour[1] (or however long it takes to threaten...

Dagon

I agree with your assertion that pure factual questions are cheaper and easier than (correct) answers.  I fully disagree with the premise that they're currently "too cheap".

I see many situations where questions and answers are treated as symmetric.

I see almost none.  I see MANY situations where both are cheap, but even then answers are more useful and valued.  I see others where finding the right questions is valued, but answering is even more so.  And plenty where the answer isn't available, but the thinking about how to get closer to ...

I've been on the lookout for new jobs recently and one thing I have noticed is that the market seems flooded with ads for AI-related jobs. What I mean is not work on building models (or aligning them, alas), but rather, work on building applications using generative AI or other advances to make new software products. My impression of this is that first, there's probably something of a bubble, because I doubt many of these ideas can deliver on their promises, especially as they rely so heavily on still pretty unreliable LLMs and such. And second, that while the jobs are well paid and sound fun, I'm not sure how I feel about them. These jobs all essentially aim at automating away other jobs, one way or another....

Answer by Dagon

It's definitely overhyped. I hesitate to call it a bubble - it's more like the normal software business model with a new cover.  Tons of projects and startups with pretty tenuous business models and improbable grand visions, most of which will peter out after a few years.  But that has been going on for decades, and will likely continue until true AI makes it all irrelevant.

Most of these jobs are less interesting, and less impactful than they claim.  Which makes the ethical considerations far less important.  My advice is to focus on th...

dr_s
I suppose I'm mostly also looking for aspects of this I might have overlooked, or an inside perspective on any details from someone who has relevant experience. I think I tend to err a bit on the side of caution, but ultimately I believe that "staying pure" is rarely a road to doing good (at most it's a road to not doing bad, but that's relatively easy if you just do nothing at all). Some of the problems with automation would have applied to many of the previous rounds of it, and those ultimately came out mostly good, I think, but it also somehow feels This Time It's Different (but then again, I do tend to skew towards pessimism and seeing all the possible ways things can go wrong...).
Jay Bailey
I guess my way of thinking of it is: you can automate tasks, jobs, or people.

Automating tasks seems probably good. You're able to remove busywork from people, but their job comprises many more things than that task, so people aren't at risk of losing their jobs. (Unless you only need 10 units of productivity, and each person is now producing 1.25 units, so you end up with 8 people instead of 10 - but a lot of teams could also put 12.5 units of productivity to good use.)

Automating jobs is...contentious. It's basically the tradeoff I talked about above.

Automating people is bad right now. Not only are you eliminating someone's job, you're eliminating most other things this person could do at all. This person has had society pass them by, and I think we should either not do that or make sure this person still has sufficient resources and social value to thrive in society despite being automated out of an economic position. (If I were confident society would do this, I might change my tune about automating people.)

So, I would ask myself: what type of automation am I doing? Am I removing busywork, replacing jobs entirely, or replacing entire skillsets? (Note: you are probably not doing the last one. Very few, if any, are. The tech does not seem there atm. But maybe the company is setting themselves up to do so as soon as it is, or something.) And when you figure out what type you're doing, you can ask how you feel about that.
dr_s
A fair point. I suppose part of my doubt though is exactly: are most of these applications going to automate jobs, or merely tasks? And to what extent does contributing to either advance the know how that might eventually help automating people?

I discovered the Netherlands actually has a good dating app that doesn't exist outside of it... I'm rather baffled. I have no idea how they started. I've messaged them asking if they will localize and expand and they thanked me for the compliment so... Dunno?

It's called Paiq and has a ton of features I've never seen before, like speed dating, picture hiding by default, quizzes you make for people that they can try to pass to get a match with you, photography contests that involve taking pictures of stuff around you and getting matched on that, and a few other things... It's just this grab bag of every way to match people that is not your picture or a blurb. It's really good!

dr_s
Has anyone ever tried outlining a straight-up first-come-first-served system? Vet and pay a first batch of VIP users, then offer incentives to later joiners (e.g. vouchers for other products), then just free users, and finally introduce fees after reaching a certain user base, all committed to and outlined transparently from the beginning, of course.
Seth Herd
You need to have bunches of people use it for it to be any good, no matter how good the algorithm.
Selfmaker662
Right, I completely missed the network effects; 5 minutes of thinking it through wasn’t enough. Maybe there even are good apps out there which didn’t make it through the development and marketing part. Thanks, Vanessa!

Co-Authors: @Rocket, @Ryan Kidd, @LauraVaughan, @McKennaFitzgerald, @Christian Smith, @Juan Gil, @Henry Sleight

The ML Alignment & Theory Scholars program (MATS) is an education and research mentorship program for researchers entering the field of AI safety. This winter, we held the fifth iteration of the MATS program, in which 63 scholars received mentorship from 20 research mentors. In this post, we motivate and explain the elements of the program, evaluate our impact, and identify areas for improving future programs.

Summary

Key details about the Winter Program:

  • The four main changes we made after our Summer program were:
  • Educational attainment of MATS scholars:
    • 48% of scholars
...

I'm noticing there are still many interp mentors for the current round of MATS -- was the "fewer mech interp mentors" change implemented for this cohort, or will that start in Winter or later?

Sheikh Abdur Raheem Ali
I love this report! Shed a tear at not seeing Microsoft on the organization interest chart though 🥲. We could be a better Bing T_T.
Ryan Kidd
Oh, I think we forgot to ask scholars if they wanted Microsoft at the career fair. Is Microsoft hiring AI safety researchers?
Sheikh Abdur Raheem Ali
Yes, here’s an open position: Research Scientist - Responsible & OpenAI Research. Of course, responsible AI differs from interpretability, activation engineering, or formal methods (e.g., safeguarded AI, singular learning theory, agent foundations). I’ll admit we are doing less of that than I’d prefer, partially because OpenAI shares some of its ‘secret safety sauce’ with us, though not all, and not immediately. Note from our annual report that we are employing 1% fewer people than this time last year, so headcount is a very scarce resource. However, the news reported we invested ~£2.5b in setting up a new AI hub in London under Jordan Hoffman, with 600 new seats allocated to it (officially, I can neither confirm nor deny these numbers). I’m visiting there this June after EAG London. We’re the only member of the Frontier Model Forum without an alignment team. MATS scholars would be excellent hires for such a team, should one be established. Some time ago, a few colleagues helped me draft a white paper to internally gather momentum and suggest to leadership that starting one there might be beneficial. Unfortunately, I am not permitted to discuss the responses or any future plans regarding this matter.

I am trying to gather a list of answers/quotes from public figures to the following questions:

  • What are the chances that AI will cause human extinction?
  • Will AI automate most human labour?
  • Should advanced AI models be open source?
  • Do humans have a moral duty to build artificial superintelligence?
  • Should there be international regulation of advanced AI?
  • Will AI be used to make weapons of mass destruction (WMDs)?

I am writing them down here if you want to look/help: https://docs.google.com/spreadsheets/d/1HH1cpD48BqNUA1TYB2KYamJwxluwiAEG24wGM2yoLJw/edit?usp=sharing 

Thank you, this is the kind of thing I was hoping to find.

The first speculated on why you’re still single. We failed to settle the issue. A lot of you were indeed still single. So the debate continues.

The second gave more potential reasons, starting with the suspicion that you are not even trying, and also many ways you are likely trying wrong.

The definition of insanity is trying the same thing over again expecting different results. Another definition of insanity is dating in 2024. Can’t quit now.

You’re Single Because Dating Apps Keep Getting Worse

A guide to taking the perfect dating app photo. This area of your life is important, so if you intend to take dating apps seriously then you should take photo optimization seriously, and of course you can then also use the photos for other things.

I love the...

lc

Manifold Love: pro-tip: if a woman measures her hand against yours, this is almost always flirtation.

Totally did not know this. Is this true? [10% react x 2]

A little taken aback by this response. It's not just flirting, it's outright romantic. Asking this is like asking if a woman resting her head on a man's chest and purring is "flirting". I didn't realize this was a common experience for guys not in a relationship with the particular woman.

rotatingpaguro
Ok, then I agreed. I was interpreting the advice in a different way, but your interpretation looks more reasonable.
Curt Tigges
Very kind of you to say. :) I think for me, though, the source of the emotion I felt when reading this series was something like: "Ah, so in addition to ensuring we are dateable ourselves, we must fix society, capitalism (at least the dating part of it), culture, etc. in order to have a Good Dating Universe." Which in retrospect was a bit overblown of me, so I think I no longer endorse the strong version of what I said in that comment.
quiet_NaN
This made me laugh out loud.

Otherwise, my idea for a dating system would be that, given that the majority of texts written will invariably end up being LLM-generated, it would be better if every participant openly had an AI system as their agent. Then the AI systems of both participants could chat and figure out how their user would rate the other user based on their past ratings of suggestions. If the users end up being rated among each other's five most viable candidates, a date is suggested.

Of course, if the agents are under the full control of the users, the next step of escalation will be that users will tell their agents to lie on their behalf. ('I am into whatever she is into. If she is big on horses, make up a cute story about me having had a pony at some point. Just put the relevant points on the cheat sheet for the date.') This might be solved by having the LLM start by sending out a fixed text document: if horses are mentioned as item 521, after entomology but before figure skating, the user is probably not very interested in them.

Of course, nothing would prevent a user from at least generically optimizing their profile to their target audience. "A/B testing has shown that the people you want to date are mostly into manga, social justice and ponies, so this is what you should put on your profile." Adversarially generated boyfriend?
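
(For concreteness, a toy sketch of the matching rule described above. This is my own illustration, not quiet_NaN's design: the function names are hypothetical, and the real ranking would come from each user's LLM agent working off their past ratings.)

```python
# Toy sketch of agent-mediated matching (illustration only):
# each user's agent ranks candidates by how its user would probably rate them,
# and a date is suggested only on a mutual top-five appearance.

def rank_candidates(past_ratings: dict[str, float], candidates: dict[str, set[str]]) -> list[str]:
    """Crude stand-in for the LLM agent: score each candidate by the user's
    average past rating of that candidate's stated interests, highest first."""
    def score(interests: set[str]) -> float:
        return sum(past_ratings.get(tag, 0.0) for tag in interests) / max(len(interests), 1)
    return sorted(candidates, key=lambda cid: score(candidates[cid]), reverse=True)

def suggest_date(a_id: str, a_ranking: list[str], b_id: str, b_ranking: list[str], n: int = 5) -> bool:
    """Only propose a date if each user appears in the other's top-n shortlist."""
    return b_id in a_ranking[:n] and a_id in b_ranking[:n]
```

The lying-agent problem in the rest of the comment is, of course, exactly what a sketch like this does not address.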

epistemic/ontological status: almost certainly all of the following - 

  • a careful research-grade writeup of a genuinely kinda shiny open(?) question in theoretical psephology that will likely never see actual serious non-cooked-up use;
  • dedicated to a very dear cat;
  • utterly dependent, for the entirety of the most interesting parts, on definitions I have come up with and results I have personally proven partially using them, which I have done with a professional mathematician's care; some friends and strangers have also checked them over;
  • my attempt to prove that something that can reasonably be called a maximal lottery-lottery exists;
  • my attempt to scavenge what others have left behind and craft a couple of missing pieces, and then to lay out a blueprint for how it could begin to work;
  • not a 30-minute read
  • the first half
...
Lorxus

To avoid confusion: this post and my reply to it were also on a past version of this post; that version lacked any investigation of dominance criterion desiderata for lottery-lotteries.

The curious tale of how I mistook my dyslexia for stupidity - and talked, sang, and drew my way out of it. 

Sometimes I tell people I’m dyslexic and they don’t believe me. I love to read, I can mostly write without error, and I’m fluent in more than one language.

Also, I don’t actually technically know if I’m dyslectic cause I was never diagnosed. Instead I thought I was pretty dumb but if I worked really hard no one would notice. Later I felt inordinately angry about why anyone could possibly care about the exact order of letters when the gist is perfectly clear even if if if I right liike tis.

I mean, clear to me anyway.

I was 25 before it dawned on me that all the tricks...

Shoshannah Tekofsky
Thanks! :D Attention is a big part of it for me as well, yes. I feel it's very easy to notice when I skip words when reading out loud, and getting the cadence of a sentence right only works if you have a sense of how it relates to the previous and next one.
Aprillion (Peter Hozák)
Yeah, I myself subvocalize absolutely everything and I am still horrified when I sometimes try any "fast" reading techniques - those drain all of the enjoyment out of reading for me, as if instead of characters in a story I would imagine them as p-zombies. For non-fiction, visual-only reading cuts connections to my previous knowledge (as if the text was a wave function entangled to the rest of the universe and by observing every sentence in isolation, I would collapse it to just "one sentence" without further meaning). I never move my lips or tongue though, I just do the voices (obviously, not just my voice ... imagine reading Dennett without Dennett's delivery, isn't that half of the experience gone? how do other people enjoy reading with most of the beauty missing?). It's faster than physical speech for me too, usually the same speed as verbal thinking.
Lorxus

Yeah, I myself subvocalize absolutely everything and I am still horrified when I sometimes try any "fast" reading techniques - those drain all of the enjoyment out of reading for me, as if instead of characters in a story I would imagine them as p-zombies.

I speed-read fiction, too. When I do, though, I'll stop for a bit whenever something or someone new is being described, to give myself a moment to picture it in a way that my mind can bring up again as set dressing.

Shoshannah Tekofsky
That sounds great! I have to admit that I still get a far richer experience from reading out loud than subvocalizing, and my subvocalizing can't go faster than my speech. So it sounds like you have an upgraded form with more speed and richness, which is great!

TL;DR: I demonstrate the use of refusal vector ablation on Llama 3 70B to create a bad agent that can attempt malicious tasks such as trying to persuade and pay me to assassinate another individual. I introduce some early work on a benchmark for Safe Agents which comprises two small datasets, one benign, one bad. In general, Llama 3 70B is a competent agent with appropriate scaffolding, and Llama 3 8B also has decent performance.

Overview

In this post, I use insights from mechanistic interpretability to remove safety guardrails from the latest Llama 3 model. I then use a custom scaffolding for tool use and agentic planning to create a “bad” agent that can perform many unethical tasks. Examples include tasking the AI with persuading me to end the life of...

the gears to ascension
You sure could have waited a day or two for someone else to get around to this. No reason to be the person who burns the last two days. (Of course, as usual, this would be better aimed upstream many steps. But it's the marginal difference that can be changed.)
Simon Lermen
I also took into account that refusal-vector ablated models are already available on Hugging Face, as is scaffolding; this post might still give it more exposure, though. Also, Llama 3 70B performs many unethical tasks without any attempt at circumventing safety; at that point I am really just applying scaffolding. Do you think it is wrong to report on this? How could this go wrong? People realize how powerful this is and invest more time and resources into developing their own versions? I don't really think of this as alignment research, I just want to show people how far along we are. Positive impact could be to prepare people for these agents going around and agents being used for demos. Also potentially to convince labs to be more careful in their releases.
Simon Lermen
Thanks for this comment, I take it very seriously that things can inspire people and burn timeline. I think this is a good counterargument though: There is also something counterintuitive to this dynamic: as models become stronger, the barriers to entry will actually go down; i.e. you will be able to prompt the AI to build its own advanced scaffolding. Similarly, the user could just point the model at a paper on refusal-vector ablation or some other future technique and ask the model to essentially remove its own safety. I don't want to give people ideas or appear cynical here, sorry if that is the impression.

No particular disagreement that your marginal contribution is low and that this has the potential to be useful for durable alignment. Like I said, I'm thinking in terms of not burning days with what one doesn't say.
