Why do some societies exhibit more antisocial punishment than others? Martin explores some of the literature on the subject, as well as his own experience living in a country where "punishment of cooperators" was fairly common.

William_S (1d)
I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people, which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
habryka (1d)
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company were now performing assassinations of U.S. citizens. Curious whether anyone has looked into this, or has thought much about the baseline risk of assassinations or other forms of violence from economic actors.
Thomas Kwa (14h)
You should update by ±1% on AI doom surprisingly frequently. This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2 = 2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many, due to heavy-tailedness in the distribution concentrating the updates in fewer events, and the fact that you don't start at 50%. But I do believe that evidence is coming in every week such that ideal market prices should move by 1% on maybe half of weeks, and it is not crazy for your probabilities to shift by 1% during many weeks if you think about it often enough.
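A minimal simulation sketch of the model above (not from the original quick take; the function name and trial count are illustrative choices, assuming the stated symmetric ±1% walk started at 50% and absorbed at 0% or 100%):

```python
# Minimal sketch: verify that a symmetric +/-1 percentage-point random walk
# started at 50 and absorbed at 0 or 100 takes about 50*50 = 2500 steps on
# average. Working in whole percentage points avoids floating-point drift.
import random

def steps_to_certainty(start_pct: int = 50) -> int:
    """Simulate one belief trajectory; count steps until it hits 0 or 100."""
    p, steps = start_pct, 0
    while 0 < p < 100:
        p += 1 if random.random() < 0.5 else -1
        steps += 1
    return steps

trials = 2000
mean_steps = sum(steps_to_certainty() for _ in range(trials)) / trials
print(f"average steps to certainty: {mean_steps:.0f} (theory: 2500)")
```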
Dalcy (1d)
Thoughtdump on why I'm interested in computational mechanics:

* one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. 'discover' fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
* ... but i was initially interested in reading compmech stuff not with a particular alignment-relevant thread in mind but rather because it seemed broadly similar in direction to natural abstractions.
* re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction on real-world noisy data. CSSR is an example of a reconstruction algorithm (a toy sketch of the idea appears after this list). apparently people did compmech stuff on real-world data; i don't know how good it was, but far less effort has been invested there compared to theory work
* would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
* tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i'm thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm reconstructing it? of course it's gonna be unwieldy large. but, to shift the thread in the direction of bright-eyed theorizing ...
* the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines, and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream -> tree -> markov model -> stack automaton -> ... ?)
* this ... sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
* haha but alas, (almost) no development afaik since the original paper. seems cool
* and also more tangentially, compmech seemed to have a lot to say about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
* eg crutchfield talks a lot about developing a right notion of information flow - obvious usefulness in eg formalizing boundaries?
* many other information measures from compmech with suggestive semantics—cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.
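A toy sketch of the core CSSR move mentioned above (illustrative only, not Dalcy's code or the real algorithm): cluster fixed-length histories whose empirical next-symbol distributions are close, and treat the clusters as estimated causal states. Real CSSR grows the history length and uses proper hypothesis tests; the thresholding, function names, and golden-mean example here are assumptions made for illustration.

```python
# Toy sketch of the CSSR idea (illustrative only): cluster fixed-length
# histories by their empirical next-symbol distributions; clusters approximate
# causal states. Real CSSR grows the history length and uses hypothesis tests.
import random
from collections import Counter, defaultdict

def estimate_causal_states(sequence, history_len=1, tol=0.1):
    counts = defaultdict(Counter)  # history -> Counter of next symbols
    for i in range(len(sequence) - history_len):
        counts[sequence[i:i + history_len]][sequence[i + history_len]] += 1

    def dist(c):
        total = sum(c.values())
        return {s: n / total for s, n in c.items()}

    def tv(p, q):  # total-variation distance between two distributions
        return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))

    states = []  # each state is a list of histories with similar predictions
    for history, c in counts.items():
        for state in states:
            if tv(dist(c), dist(counts[state[0]])) < tol:
                state.append(history)
                break
        else:
            states.append([history])
    return states

# Example: the golden-mean process (no two 1s in a row) has two causal states:
# "just emitted 1" (must emit 0 next) and "just emitted 0" (emit 0 or 1 evenly).
random.seed(0)
seq, prev = [], "0"
for _ in range(20000):
    prev = "0" if prev == "1" else random.choice("01")
    seq.append(prev)
print(estimate_causal_states("".join(seq)))  # expect two groups: ['0'] and ['1']
```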
Buck (2d)
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

* Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
* Scheming: Risk associated with loss of control to AIs that arises from AIs scheming.
  * So e.g. I exclude state actors stealing weights in ways that aren’t enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren’t (at this level of capability and dignity).
* Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they’re spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
* Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI works.

This subset of risk is interesting because I think it’s a natural scenario at which to target technical work on AI safety. (E.g. it’s the main scenario we’re targeting with our AI control agenda.) I claim that the majority of this risk comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored.

Using AIs for AI development looks uniquely risky to me among applications of early-transformative AIs, because unlike all other applications I know about:

* It’s very expensive to refrain from using AIs for this application.
* There’s no simple way to remove affordances from the AI such that it’s very hard for the AI to take a small sequence of actions which plausibly lead quickly to loss of control. In contrast, most other applications of AI probably can be controlled just by restricting their affordances.

If I’m right that the risk from scheming early-transformative models is concentrated onto this pretty specific scenario, it implies a bunch of things:

* It implies that work on mitigating these risks should focus on this very specific setting.
* It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
* It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.

Recent Discussion

Yesterday, I had a coronectomy: the top halves of my bottom wisdom teeth were surgically removed. It was my first time being sedated, and I didn’t know what to expect. While I was unconscious during the surgery, the hour after surgery turned out to be a fascinating experience, because I was completely lucid but had almost zero short-term memory.

My girlfriend, who had kindly agreed to accompany me to the surgery, was with me during that hour. And so — apparently against the advice of the nurses — I spent that whole hour talking to her and asking her questions.

The biggest reason I find my experience fascinating is that it has mostly answered a question that I’ve had about myself for quite a long time: how deterministic am...

Viliam (5m)

It could be an interesting experiment to build up this list iteratively. Like, every question you ask for the third time, the answer gets added at the bottom of the list. How long will the list get, and what will it contain?

johnswentworth (31m)
My answer.
Lucie Philippon (2h)
I consumed edible cannabis for the first time a few months ago, and it felt very similar to the experience you're describing. I felt regularly surprised at where I was, and had lots of trouble remembering more than the last 30 seconds of the conversation.  The most troubling experience was listening to someone telling me something, me replying, and while saying the reply, forgetting where I was, what I was replying to and what I already said. The weirdest part is that at this point I would finish the reply in a sort of disconnected state, not knowing where the words were coming from, and at the end I would have a feeling of "I said what I wanted to say", even though I could not remember a word of it.
Johannes C. Mayer (6h)
Maybe you need to think the thought many times over in order to overwrite the original memory. In your place, I would try to prepare something similar to what Drake did: some mental object you can retrieve that has a predesigned hole to put information in. To me, it seems like this should not be that hard to get. Then, for ideally 30 minutes or so after the surgery when you don't have short-term memory (though the streaming-algorithm experiment also seems very interesting), you can repeatedly try to insert some specific object into the memory. Maybe it would make sense, for the sake of the experiment, to limit yourself to 3 possible objects that could be inserted. Your girlfriend can then choose one randomly after surgery, for you to drill into the memory by repeatedly thinking about the scene completed with that specific object. Then after the 30 minutes, you do something completely different. Then 1 hour afterwards your girlfriend can ask you what the object was that she told you 1 hour ago (and probably also several times during the first 30 minutes). Probably it would be best if your girlfriend (or whatever person is willing to do this) constantly reminds you during the first 30 minutes or so that you need to imagine the object, at least every minute or so.

Previously: On the Proposed California SB 1047.

Text of the bill is here. It focuses on safety requirements for highly capable AI models.

This is written as an FAQ, tackling all questions or points I saw raised.

Safe & Secure AI Innovation Act also has a description page.

Why Are We Here Again?

There have been many highly vocal and forceful objections to SB 1047 this week, in reaction to a (disputed and seemingly incorrect) claim that the bill has been ‘fast tracked.’ 

The bill continues to have a substantial chance of becoming law according to Manifold, where the market has not moved on recent events. The bill has been referred to two policy committees, one of which put out this 38-page analysis.

The purpose of this post is to gather and analyze all...

Jiro (8m)

If your model is not projected to be at least 2024 state of the art and it is not over the 10^26 flops limit?

It's not going to be 2024 forever. In the future being 2024 state of the art won't be as hard as it is in actual 2024.

That developers risk going to jail for making a mistake on a form.

  1. This (almost) never happens.

Because prosecuting someone for making a mistake on a form happens when the government wants to go after an otherwise innocent person for unacceptable reasons, so they prosecute a crime that goes unprosecuted 99% of the time.


A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnesic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnesic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnesic, it’s also apparently one...

Another class of applications which we discussed at the retreat: person 1 takes the amnesic, person 2 shares private information on them, and then person 1 gives their reaction to the private information. Can be used e.g. for complex negotiations: maybe it is in our mutual best interest to make some deal, but in order for me to know that I'd need some information which you don't want to share with me, so I take the drug, you share the information, and I record some verified record of myself saying "dear future self, you should in fact take this deal".

... which is cool in theory but I would guess not of high immediate value in practice, which is why the post didn't focus on it.

Algon (23m)
Important notice: benzodiazepines are serious business. Benzo withdrawals are amongst the worst experiences a human can go through, and combinations of benzos with alcohol, barbiturates, opioids, or tricyclic antidepressants are very dangerous: benzos played a role in 31% of the estimated 22,767 deaths from prescription drug overdose in the United States. If you're experimenting with benzos, please be very careful!
metachirality (26m)
You can actually use this to do the sleeping beauty experiment IRL and thereby test SIA vs SSA. Unfortunately you can only get results if you're the one being put under.

I say this because I can hardly use a computer without constantly getting distracted. Even when I actively try to ignore how bad software is, the suggestions keep coming.

Seriously Obsidian? You could not come up with a system where links to headings can't break? This makes you wonder what is wrong with humanity. But then I remember that humanity is building a god without knowing what they will want.

So for those of you who need to hear this: I feel you. It could be so much better. But right now, can we really afford to make the ultimate <programming language/text editor/window manager/file system/virtual collaborative environment/interface to GPT/...>?

Can we really afford to do this while our god software looks like...

May this find you well.

faul_sname (3h)
Haskell is a beautiful language, but in my admittedly limited experience it's been quite hard to reason about memory usage in deployed software (which is important because programs run on physical hardware; no matter how beautiful your abstract machine, you will run into issues where the assumptions that abstraction makes don't match reality).

That's not to say more robust programming languages aren't possible. IMO Rust is quite nice, and easily interoperable with a lot of existing code, which is probably a major factor in why it's seeing much higher adoption.

But to echo and build off what @ustice said earlier: the hard part of programming isn't writing a program that transforms simple inputs with fully known properties into simple outputs that meet some known requirement. The hard parts are finding or creating a mostly-non-leaky abstraction that maps well onto your inputs, and determining what precise machine-interpretable rules produce outputs that look like the ones you want.

Most bugs I've seen come at the boundaries of the system, where it turns out that one of your assumptions about your inputs was wrong, or that one of your assumptions about how your outputs will be used was wrong. I almost never see bugs like this:

* My sort(list, comparison_fn) function fails to correctly sort the list
* My graph traversal algorithm skips nodes it should have hit
* My pick_winning_poker_hand() function doesn't always recognize straights

Instead, I usually see stuff like:

* My program assumes that when the server receives an order_received webhook, and then hits the server to fetch the order details from the vendor's API for the order identified in the webhook payload, the vendor's API will return the order details and not a 404 not found
* My server returns nothing at all when fetching the user's bill for this month, because while the logic is correct (determine the amount due for each order and sum), this particular user had 350,000 individual orders this
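A hypothetical sketch of the webhook failure mode described above (the endpoint, function name, and return convention are invented for illustration, not taken from the comment): the buggy assumption is that the vendor's API always knows about an order it just announced, and making the 404 case explicit forces the caller to decide how to handle it.

```python
# Hypothetical sketch of a boundary assumption made explicit: the vendor's API
# may 404 on an order it just announced via webhook (e.g. eventual consistency
# on their side), so that case is surfaced instead of silently assumed away.
from typing import Optional
import requests

VENDOR_API_BASE = "https://vendor.example.com/api"  # placeholder URL

def fetch_order(order_id: str) -> Optional[dict]:
    """Fetch order details after an order_received webhook.

    Returns None if the vendor doesn't know the order yet, so the caller can
    retry later instead of proceeding with missing data.
    """
    resp = requests.get(f"{VENDOR_API_BASE}/orders/{order_id}", timeout=10)
    if resp.status_code == 404:
        return None
    resp.raise_for_status()  # surface other unexpected statuses loudly
    return resp.json()
```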
Nevin Wetherill (5h)
I have been contemplating Connor Leahy's Cyborgism and what it would mean for us to improve human workflows enough that aligning AGI looks less like: Sisyphus attempting to roll a 20 tonne version of The One Ring To Rule Them All into the caldera of Mordor while blindfolded and occasionally having to bypass vertical slopes made out of impossibility proofs that have been discussed by only 3 total mathematicians ever in the history of our species - all before Sauron destroys the world after waking up from a restless nap of an unknown length.

I think this is what you meant by "make the ultimate <programming language/text editor/window manager/file system/virtual collaborative environment/interface to GPT/...>". Intuitively, the level I'm picturing is: a suite of tools that can be booted up from a single icon on the home screen of a computer, which then allows anyone who has decent taste in software to create essentially any program they can imagine, up to a level of polish that people can't poke holes in even if you give a million reviewers 10 years of free time.

Can something at this level be accomplished? Well, what does coding look like currently? It seems to look like a bunch of people with dark circles under their eyes reading long strings of characters in something basically equivalent to an advanced text editor, with a bunch of additional little windows of libraries and graphics and tools. This is not a domain where human intelligence performs with as much ease as in other domains like spearfishing or bushcraft.

If you want to build Cyborgs, I am pretty sure where you start is by focusing on building software that isn't god-machines, throwing out the old book of tacit knowledge, and starting over with something that makes each step as intuitive as possible. You probably also focus way more on quality over quantity/speed.

So, plaintext instructions on what kind of software you want to build, or a code repository and a plaintext list of modifications? L
Dagon (5h)
Agreed, but it's not just software.  It's every complex system, anything which requires detailed coordination of more than a few dozen humans and has efficiency pressure put upon it.  Software is the clearest example, because there's so much of it and it feels like it should be easy.
Viliam (23m)

Consider the pressures and incentives. Adding new features can help you sell the software to more users. Fixing bugs... unless the application is practically falling apart, it does not make much of a difference. After all, the bugs will only get noticed by people who already use your application, i.e. they already paid for it.

For the artificial intelligence, I assume the "killer app" will be its integration with SharePoint.

lc (16h)
I seriously doubt on priors that Boeing corporate is murdering employees.

This sort of begs the question of why we don't observe other companies assassinating whistleblowers.

1a3orn (10h)
I mean, sure, but I've been updating in that direction a weirdly large amount.

(ElevenLabs reading of this post:)

I'm excited to share a project I've been working on that I think many in the LessWrong community will appreciate - converting some rational fiction into high-quality audiobooks using cutting-edge AI voice technology from ElevenLabs, under the name "Askwho Casts AI".

The keystone of this project is an audiobook version of Planecrash (AKA Project Lawful), the epic glowfic authored by Eliezer Yudkowsky and Lintamande. Given the scope and scale of this work, with its large cast of characters, I'm using ElevenLabs to give each character their own distinct voice. Converting this story into audiobook form has been a labor of love, and I hope if anyone has bounced off it before, this...

habryka (31m)

Added an embedded audio element for you.

Neel Nanda (41m)
Thanks for making these! How expensive is it?
spencerb (1h)
Is there a way to access this without a substack subscription?
JanPro (1h)
Thank you! The Planecrash audiobook is great, and I would not have read it if it were not for the audio version.
niplav (34m)

Because[1] for a Bayesian reasoner, there is conservation of expected evidence.

Although I've seen it mentioned that, technically, the beliefs of a Bayesian should follow a martingale, and Brownian motion is a martingale.


  1. I'm not super technically strong on this particular part of the math. Intuitively it could be that in a bounded reasoner which can only evaluate programs in , any pattern in its beliefs that can be described by an algorithm in is detected and the predicted future belief from that pattern is incorporated into current belief

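To spell out the identity behind conservation of expected evidence (a standard Bayesian fact, stated here for reference rather than taken from the comment): averaging the possible posteriors over the evidence you might observe gives back the prior, which is exactly the martingale property mentioned above.

```latex
% Conservation of expected evidence: the expected posterior equals the prior.
\[
\mathbb{E}_{e}\!\left[P(H \mid e)\right]
  \;=\; \sum_{e} P(e)\,P(H \mid e)
  \;=\; \sum_{e} P(H, e)
  \;=\; P(H).
\]
% Hence the credence process p_t = P(H | evidence up to time t) satisfies
% E[p_{t+1} | p_1, ..., p_t] = p_t, i.e. it is a martingale; driftless
% Brownian motion is one continuous-time example of such a process.
```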
Dagon (6h)
I think this leans a lot on "get evidence uniformly over the next 10 years" and "Brownian motion in 1% steps".  By conservation of expected evidence, I can't predict the mean direction of future evidence, but I can have some probabilities over distributions which add up to 0.   For long-term aggregate predictions of event-or-not (those which will be resolved at least a few years away, with many causal paths possible), the most likely updates are a steady reduction as the resolution date gets closer, AND random fairly large positive updates as we learn of things which make the event more likely.
LawrenceC (7h)
The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step your beliefs go up or down by 1%, starting from 50% until you hit 100% or 0%, the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going up or down by 1% at each step, and 1-p of staying the same, the variance is reduced by a factor of p, and so you need 2500/p steps. (Indeed, something like this is the standard way to derive the expected number of steps before a random walk hits an absorbing barrier.) Similarly, you get that if you start at 20% or 80%, you need 1600 steps in expectation, and if you start at 1% or 99%, you'll need 99 steps in expectation.

----------------------------------------

One problem with your reasoning above is that, as the 1%/99% case shows, needing 99 steps in expectation does not mean you will take 99 steps with high probability -- in this case, there's a 50% chance you need only one update before you're certain (!), there's just a tail of very long sequences. In general, the expected value of variables need not look like

I also think you're underrating how much the math changes when your beliefs do not come in the form of uniform updates. In the most extreme case, suppose your current 50% doom number comes from imagining that doom is uniformly distributed over the next 10 years, and zero after -- then the median update size per week is only 0.5/520 ~= 0.096%/week, and the expected number of weeks with a >1% update is 0.5 (it only happens when you observe doom). Even if we buy a time-invariant random walk model of belief updating, as the expected size of your updates get larger, you also expect there to be quadratically fewer of them -- e.
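One standard route to the numbers cited above, via the optional stopping theorem (a derivation sketch added for reference, not text from the comment):

```latex
% Belief process p_t: starts at p, moves by +/- s with equal probability,
% absorbed at 0 and 1. Both p_t and p_t^2 - s^2 t are martingales, so by
% optional stopping at the absorption time T:
\[
\mathbb{E}[p_T] = p \;\Longrightarrow\; \Pr[\text{absorbed at } 1] = p,
\]
\[
\mathbb{E}[p_T^2] - s^2\,\mathbb{E}[T] = p^2
\;\Longrightarrow\;
\mathbb{E}[T] \;=\; \frac{\mathbb{E}[p_T^2] - p^2}{s^2}
            \;=\; \frac{p(1-p)}{s^2}.
\]
% With s = 0.01: p = 0.5 gives 2500 expected steps, p = 0.2 or 0.8 gives 1600,
% and p = 0.01 or 0.99 gives 99, matching the figures above.
```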
niplav (37m)
Thank you a lot for this. I think this or @Thomas Kwa's comment would make an excellent original-sequences-style post—it doesn't need to be long, but just going through an example and talking about the assumptions would be really valuable for applied rationality. After all, it's about how much one should expect one's beliefs to vary, which is pretty important.

Crossposted from the AI Optimists blog.

AI doom scenarios often suppose that future AIs will engage in scheming—planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests. The worry is that if a schemer escapes, it may seek world domination to ensure humans do not interfere with its plans, whatever they may be.

In this essay, we debunk the counting argument—a central reason to think AIs might become schemers, according to a recent report by AI safety researcher Joe Carlsmith.[1] It’s premised on the idea that schemers can have “a wide variety of goals,” while the motivations of a non-schemer must be benign by definition. Since there are “more” possible schemers than non-schemers, the argument goes, we should...

AviS (37m)

I agree that, overall, counting arguments are weak.

But even if you expect SGD to be used for TAI, generalisation is not a good counterexample, because maybe most counting arguments about SGD do work except for generalisation (which would not be surprising, because we selected SGD precisely because it generalises well).

Many thanks to Spencer Greenberg, Lucius Caviola, Josh Lewis, John Bargh, Ben Pace, Diogo de Lucena, and Philip Gubbins for their valuable ideas and feedback at each stage of this project—as well as the ~375 EAs + alignment researchers who provided the data that made this project possible.

Background

Last month, AE Studio launched two surveys: one for alignment researchers, and another for the broader EA community. 

We got some surprisingly interesting results, and we're excited to share them here.

We set out to better explore and compare various population-level dynamics within and across both groups. We examined everything from demographics and personality traits to community views on specific EA/alignment-related topics. We took on this project because it seemed to be largely unexplored and rife with potentially-very-high-value insights. In this post, we’ll present what...

Thank you so much for conducting this survey! I want to share some information on behalf of MATS:

  • In comparison to the AIS survey gender ratio of 9 M:F, MATS Winter 2023-24 scholars and mentors were 4 M:F and 12 M:F, respectively. Our Winter 2023-24 applicants were 4.6 M:F, whereas our Summer 2024 applicants were 2.6 M:F, closer to the EA survey ratio of 2 M:F. This data seems to indicate a large recent change in gender ratios of people entering the AIS field. Did you find that your AIS survey respondents with more AIS experience were significantly more mal
Josh Jacobson (17h)
Epistemic status: just speculation, from a not very concrete memory, written hastily on mobile after a quick skim of the post.

----------------------------------------

My guess is that these results should be taken with a large grain of salt, but if I'm wrong, I'd be interested in hearing more about why. Specifically, I think the "alignment researcher" population and "org leader" populations here are probably a far departure from what people envision when they hear these terms. I also expect other populations reported on to have a directionally similar skew to what I speculate below.

An anecdote for why I expect that (some aspects may be off):

* I started the survey, based off the description that it'd be decently short. I found it long, involved, and asking various questions (marked as required) that I really wasn't interested in answering (nor interested in the results of). IIRC it also had various ways in which the question phrasing was lacking. I accordingly abandoned it, while seeing there was still a long way to go to completion. One additional factor for my abandoning it was that I couldn't imagine it drawing a useful response population anyway; the sample mentioned above is a significant surprise to me (even with my skepticism around the makeup of that population). Beyond the reasons I already described, I felt that it being done by a for-profit org that is a newcomer and probably largely unknown would dissuade a lot of people from responding (and/or providing fully candid answers to some questions).

All in all, I expect that the respondent population skews heavily toward those who place a lower value on their time and are less involved. I expect this to generally be a more junior group, often not fully employed in these roles, with eg the average age and funding level of the orgs that are being led particularly low (and some of the orgs being more informal). That's a very legitimate and useful population to survey; I just think it also isn't at all
Cameron Berg (7h)
Here is the full list of the alignment orgs who had at least one researcher complete the survey (and who also elected to share what org they are working for): OpenAI, Meta, Anthropic, FHI, CMU, Redwood Research, Dalhousie University, AI Safety Camp, Astera Institute, Atlas Computing Institute, Model Evaluation and Threat Research (METR, formerly ARC Evals), Apart Research, Astra Fellowship, AI Standards Lab, Confirm Solutions Inc., PAISRI, MATS, FOCAL, EffiSciences, FAR AI, aintelope, Constellation, Causal Incentives Working Group, Formalizing Boundaries, AISC.

~80% of the alignment sample is currently receiving funding of some form to pursue their work, and ~75% have been doing this work for >1 year. Seems to me like this is basically the population we were intending to sample.

Your expectation while taking the survey about whether we were going to be able to get a good sample does not say much about whether we did end up getting a good sample. Things that better tell us whether or not we got a good sample are, eg, the quality/distribution of the represented orgs and the quantity of actively-funded technical alignment researchers (both described above).

Note that the survey took people ~15 minutes to complete and resulted in a $40 donation being made to a high-impact organization, which puts our valuation of an hour of their time at ~$160 (roughly equivalent to the hourly rate of someone who makes ~$330k annually). Assuming this population would generally donate a portion of their income to high-impact charities/organizations by default, taking the survey actually seems to probably have been worth everyone's time in terms of EV.
