The social/logistical aspects of cybersecurity vulnerabilities will accelerate greatly due to AI. I'd expect the response from tech-savvy organizations to be an increase in the pace of software delivery, a long-standing trend for other reasons: continuous deployment, forced autoupdates, and focused research on fraud and suspicious-activity detection.
The main risks are around organizations that structurally cannot increase their pace. Think banks, aviation, medical systems, drug manufacturing, areas where because the risks of vulnerabilities/defects has histori...
Sorry for the late reply. I think your description is not really right.
The system was partway between district-level first-past-the-post and nationally proportional representation, both before and after Orban's 2011 reforms. It is true that Orban moved the system closer to first-past-the-post, but it's not true that the old system achieved fully proportional representation: in 2010, under the old system, Orban got 68% of the seats with 52% of the vote.
It's also not true that the new system actively unbalances the outcomes from the districts. It still brings...
>The greatest example of an eye-rolling cliché is also one of the highest impact pieces of advice ever articulated.
What makes you believe that this is high impact advice? Did someone give it to you and it had an impact on you, or did you give it to someone else who found it impactful?
I have turned a friend from having a lot of stress with the belief "I'm not good enough" to "I'm good enough" in the span of an hour but that's a process with a lot of nontrivial steps.
To me "believe in yourself" seems like advice given by people without gear models of the relevant terrain.
Because that would require it to be really smart; around as smart as the best LLMs that are pretrained on a vast amount of human data?
And if you use that sort of AI, it's going to know about the real world, and so it won't be fooled by the toy environment?
The hypothetical personal reasons would have to be compelling, yes, of course. This would not be done on a whim. You will note the disease I chose as my example was heritable – even if you hadn't heard of it before, the name should have made that clear – which lends itself readily to obvious motivations.
If you did that in the US, you'd end up trying to explain to prosecutors, then a judge, then a jury why you shouldn't be imprisoned for the rest of your life. I would hope that is roughly how it would go everywhere.
Expecting an agent to get smart enough to reproduce itself in a simple simulation like you describe seems wildly unrealistic, if that's what you're talking about?
Why?
I would be interested in hearing from Anthropic employees here, but I imagine it's the usual reason: they have deemed it inexpedient to do so, either because they believe that they need to "sound normal" or because they in fact don't want the type of regulation that awareness of x-risk would produce.
Civ V and Slay the Spire
This link is broken. You need to remove the period from the end of the URL.
The issue the Allied forces encountered was that German forces also attacked through the Ardennes forest south of the main Allied forces and, through rapid movement, broke through weak parts of the line and encircled the Allied armies.
This is the issue that turned up in war games, was counterargued and disregarded by high command, and which was sufficient to lose France the war. No?
I see. I take it as dead obvious that some agents will perform treacherous turns; it is instrumentally very very useful in some cases. I realize that there are people so empirically-minded that they cannot believe anything is possible based on theory; this makes me sad.
For evidence, I'd note that many many humans have performed treacherous turns.
But yes, showing an AI do it would be good evidence.
I'd suggest trimming down the post; it was a bit hard to follow.
I've thought of a snappy reply to the Skynet thing which is like -- Skynet was sci-fi because we don't give computer programs we don't understand control of our nuke systems. It wasn't sci-fi because there could be advanced programs we don't understand. Well now we're doing that.
It's not meant to be deep; it's kind of pointing out, hey, you already agree that whatever this tech is, we don't completely understand it -- that's a much lower bar than "tries to break containment" and so forth -- and we are doing many of the sorts of things you should not do with an ill-understood tech.
This post reflects a popular misunderstanding of the Maginot Line. I don't think that this fatally undermines the argument, but it still seems worth correcting.
Epistemic status: I am not a military historian, so I am deferring to military historians who write publicly, rather than looking at the academic literature or (even better) the original sources.
Here's Bret Devereaux (emphasis in original):
...As an aside, the purpose of the Maginot Line was to channel any attack at France through the Low Countries where it could be met head on with the flanks of the Fr
Although I'm already familiar with reinterpreting pain as something positive, and I've even experienced it myself, on reading this post I stopped to do it and I was able to see it from another angle, which was beneficial for me. Thanks for the examples, too.
I’m glad you got something out of it, thanks for sharing :)
...What we add is often harmful, for example when there is pain and then we are afraid of it. Therefore, it is beneficial to unlearn that we added. We could at least temporarily add a positive interpretation, or hear the body better with less of o
Reward hacking is always a concern when using RL.
Humans are also capable of motivated reasoning and of twisting words to interpret them to say what we want, so those behaviors are already in the training set for RL to recruit: it doesn't have to invent them from scratch. On the other hand, humans also regard those as bad/intellectually dishonest things to do, so trying to avoid doing this is also a behavior found in the training set, which Constitutional AI can point at.
In the context of specifying a specific human principal, I think natural language would work just fine.
In general, yes, natural language is slippery, prone to being interpreted in different ways, even motivated ways, and for something well outside the training distribution, suitable words, terms, and concepts may well not exist. The strategy of "use natural language to point to the thing you want in the world model learnt from the training distribution" isn't going to work if the thing you want to point to isn't well defined in the training distribution, s...
Inspired by this Jack Clark tweet (as well as the emphasis on external review in RSP v3):
https://x.com/jackclarkSF/status/2053847777889964376
Has Anthropic made Mythos available to any external AI safety organisations for internal use (not just evaluation)?
There do seem to be some NDAs relating to Mythos given the lack of people on twitter talking about having access, so perhaps they are using it and are just keeping quiet. But it seems reasonable that they should be given access, given that it seems to be fairly widely-deployed for cyber at this point (not to mention internally at Anthropic).
The Alignment Community is Culturally Broken
That post seems to be deleted now, but it's archived (just noting because the comments there are also helpful imo)
Who said anything about decision theory? I'm just describing the vanilla world model of a madman.
I'm pretty sure decision theory doesn't work that way.
To be exact, Anthropic are using "synthetic document fine-tuning" (SDF) before applying alignment RL. So that is clearly alignment using Stochastic Gradient Descent on synthetic documents. So that's either part of mid-training, or the first SGD-only step of post-training: since Anthropic don't release a base model, the distinction between these is loose, and would basically depend on the size of the synthetic dataset and the learning rate used: a larger document set at a lower learning rate would be mid-training. Anthropic imply they are exploring usi...
Ah if only I had as good a way with words as you :)
So at this point, we are trying to get as much feedback as possible on the identified gaps. We strongly believe that a gap does exist and needs filling. However, we still need to solidify our ideas on alternative modeling methodologies and frameworks. Your paper is a good direction for us to look into. I really like the "computing as the transformation of information through a channel" framing. TPP is already being used in export controls so there is clearly precedent for it.
How do you think this framing might capture deployment parameters or SRAM based architectures?
That would only work if it can reliably find a way to semantically hide misalignment in a way that all transcription models will preserve. That seems unlikely to me.
I think these intuitions exist and explain some level of "AI is just like other software" with models today. But the biggest psychological hangup I encounter in people is more like a vague but very strong intuition like "humans will always control the planet, nothing more intelligent than us can exist or would influence the world more than we do." I.e., even if we could architect an ASI like other software, most wouldn't infer that this has drastic implications for the world order/political economy.
Any agent performing a treacherous turn. Because the environment is easier to control, even a significantly less intelligent agent could benefit from attacking its controllers and taking over the environment.
Anecdotally when I talk to normal people about AI the main questions are "Will AI take my job?" and "Will AI take over the world / become Skynet / etc?".
Although job concerns are taken more seriously than loss of control.
Maybe just use a standard Mixture of Experts architecture and try to get it to tell you which expert it's using?
Interesting! I used a similar technique for a very particular application, namely detecting harmful inputs into LLMs in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Section 5.4 (Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests).
Very few people have enough levels in Scientist to run an experiment whose result they might not like.
Interesting!
Yeah I think there’s some argument to be made that perhaps the ethical thing to do is actually to optimize the preferences which the superintelligence knows the agents would eventually converge on, assuming this is something that can be predicted in advance;
If it would be the case that eventually you know with high certainty that, given ideal reflection processes, people would come to the conclusion that they wish they had more rapidly been converted to some other state, and in fact actually from their reflective perspective they would greatly...
They only speak human languages approximately; their concepts only work in the local region of available training data. For concepts where the natural abstraction we care about is the one discoverable from local training data that might be enough, but the more bits of behavior we need to specify, the further into the asymptotic regime you go, and the further you go, the more risk that the bits you used to specify the starting place were insufficient. A similar thing applies if your local concepts are competing with another high bitrate source of behavior.
Another such approach is computational superimitation (COSI), which seems to make a totally different set of assumptions (which very few people understand well enough to question). I hope that Vanessa Kosoy and Diffractor do not unilaterally decide that they have properly specified alignment, and then actually try to build an ASI based on COSI.
(I haven't read the entire post yet, just wanted to respond to this point. The following is on behalf of myself and CORAL, but Diffractor might have his own take.)
I hope we will build ASI based on COSI (or some evolu...
Because taking over the world or getting to a position where CoT is fully unmonitored seems pretty hard for an LLM! As long as we don’t write too much about specific strategies for how to do these things online, they’d probably have to come up with sophisticated strategies on their own, which I think likely requires serial reasoning. (On the other hand they may just need to keep an eye out for easy opportunities for self-exfiltration that don’t require serial reasoning; it’s unclear to me if such opportunities will ever present themselves.)
Could anyone explain why Anthropic's 2028 scenarios don't try to explore the possibilities of one of the AIs becoming misaligned and fine with destroying all life in the world, then recreating it?
What? I'm not even vegan lol
I'm glad to see this paper. The difficulties of automating alignment research are well-appreciated by many, but until now I hadn't seen a rigorous attempt to articulate them. This paper is an important contribution.
The point about "Aggregating correlated evidence" is something that gets brought up in finance. Investors routinely get into trouble by treating correlated risks as if they're independent. I had never thought about that in the context of safety evaluations, but it makes perfect sense.
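To make the correlated-evidence point concrete, here is a minimal sketch of my own (not taken from the paper). It assumes the textbook idealisation in which n pieces of evidence all share the same pairwise correlation rho; the function name and the example values are hypothetical. Under that assumption the variance of the average is sigma^2 * (1 + (n - 1) * rho) / n, so n correlated observations are worth only n / (1 + (n - 1) * rho) independent ones.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Independent-observation equivalent of n equally correlated observations.

    Assumes every pair of observations has the same correlation `rho`
    (an idealisation chosen for illustration, not a claim about real evals).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1 for this sketch")
    # Var(mean) = sigma^2 * (1 + (n - 1) * rho) / n, so the count of
    # independent observations with the same variance is:
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    # Ten evaluations at several hypothetical correlation levels.
    for rho in (0.0, 0.2, 0.5, 0.9):
        print(f"rho={rho:.1f}: worth {effective_sample_size(10, rho):.2f} independent evals")
```

With rho = 0.5, ten evaluations carry less information than two independent ones, which is the trap of treating correlated risks as independent.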
I think with the rise of AI we are getting a clearer answer: programming, like (some parts of) mathematics, was only hard for humans due to human-specific limitations, and wasn't all that hard to become good at in a more absolute sense. Programming is probably the second-easiest task for which to verify that you got a correct output, in a lot more domains than expected, behind (at least some parts of) mathematics.
And yeah, this post is becoming really, really relevant. By 2029-2030 (or even earlier if the new RL trend of 102 day dou...
My model of how the human mind works predicts hypnosis (which can enable one to not experience pain as aversive or rob banks or various other things) and it also predicts "things not labeled as hypnosis (but which remind people of hypnosis)"... but it often requires a certain "mental orienting" process, and I wonder if her way of orienting also works. Thanks for finding the quote! <3
If I were just able to make good on the mantra "live, laugh, love", I would become enlightened :)
I guess one central problem for most people wanting therapy is that they do not know what 'success' looks like - at least the kind of 'success' they have or can move towards having internal alignment on.
I would also object to @ozymandias' hedonium shockwave - but I don't understand your particular objection to it.
You make 'respect property rights' sound kind of like some sort of inviolable religious commandment! Surely the point of property rights is not a fundamental principle that the entire universe must obey but merely that, as a rule-of-thumb, respecting property rights (in our limited world absent hedonium shockwaves) generally yields more wellbeing, less suffering, etc.?
If so: why not skip the "property rights" kludge and optimise for wellbeing dir...
When Elon says stuff like "We're in a simulation, and the most entertaining outcome is the most likely", I actually think he's not just being goofy and we should take him at his word about what he believes.
Think about what it must be like to be Elon. You are literally 4x richer than any other person on the planet. Your businesses and political reach put your name in almost every important arena in modern America. In virtually every room you enter you are the Main Character. How likely is it actually that you are the real Elon Musk, vs. just some dude in th...
I think that most people are really bad at managing their mental health and identifying what makes them feel good/bad.
Work, which is not a choice, provides the necessary challenge for many people. But they don't want to go to work. And when they don't, nothing is a challenge; they don't know how to challenge themselves enough to be happy.
In a UBI scenario we just have to teach people, maybe even in school, how to manage their mental health without externally forced challenges. That's it.
And people who already know how to do that (like you) are just fine.
I think this has a very high chance of success, but trades off the reliability of the probe - if the NLA somehow learns how to hide misalignment semantically (and not steganographically), this type of probe becomes useless.
it seems to assume that for the "clusters in thingspace" there is an objectively natural choice of thingspace
usually natural abstractions research uses information theory to avoid privileging a certain basis. So we actually don't have the problem of choosing a thingspace and looking for clusters geometrically!
"I became a scientist because I wanted to change the world," said Dr Connor.
"There are no better opportunities to change the world than here at Effective Evil," said Doug.
"I meant 'change the world for the better'," said Dr Connor.
"Then you should have been more specific," said Doug.
Yeah, this could be useful. One problem is that doing RL on LLMs is kinda like causing the language they (or rather, the assistant character) use to drift from the baseline to something better-suited to getting reward.
Anthropic's recent paper on Natural Language Autoencoders trains an activation verbalizer (AV) and activation reconstructor (AR) jointly via RL to reconstruct a target model's activations through a natural-language bottleneck. The hope is that the bottleneck text is human-interpretable.
A concern: since AV, AR, and target are all Claude, they share representational priors. A misaligned model could just learn "unnecessary adjective in sentence two = secret desire to kill all humans". This gives great reconstruction loss, and hides misalignment.
The paper does t...
This run did have some more things different to previous runs, which I think may be somewhat significant:
By coincidence Claude never got a Pokemon that knew Dig - when he acquired the Dig TM, he already had a full party, of which none could learn Dig at all, and later tossed the TM to make inventory space. I believe in Victory Road digging would reset all progress, thus creating a risk of Claude getting more impatient due to having to solve the puzzles again and deciding to dig out without any actual progress. Though this time Claude was able to make good...
It reminds me of all the rulers in history who had sycophantic advisors do all their work for them, but still believed they were the brains of the operation. If things got bad, they might take personal control only to worsen things further.
I think this is a great idea, and seems like a cool way to extract evidence from powerful models.
However, I think false positives are a major worry, because I think sussing this out is hard, and distillation seems likely to produce false positives at a non-insignificant rate which you can't remove by just scrubbing data. I think one way this could happen is capacity-dependent alignment, where like, some disposition keeps U aligned at U's level of size/expressivity/perspicacity, but the same underlying disposition, when compressed into a smaller/less expre...
I agree that steganography in the sense of 'pick a scheme and then encode your natural language text according to that scheme' seems really unlikely, in the sense that (i) this is not what LLMs naturally learn and (ii) this is obviously less efficient / useful than alternatives
IMO, the thing that worries us about steganography is really "LLMs being able to transmit and decode information that we can't monitor easily". The exact mechanism doesn't really matter.
But we already know they can do this! Truesight is a very clear example of LLMs doing this. And t...
Heartbreakingly beautiful work, thank you so much for bringing this into the world.
Opus 4.6 and IIRC even Opus 4.5 could see the VR switch in the sense of recognizing it as a distinct looking tile, entertaining various hypotheses of what it could be but being overall bizarrely uninterested in it. While sometimes 4.6 would manage to get the boulder on it by sheer trial and error, occasionally it did it more intentionally by first noting the tile's distinct look and verbalizing that it might be a switch. And IIRC at one point it had made a note of what the switch looked like, which helped it with subsequent switch puzzles, until it destroyed ...
Somewhat tangential to the questions of whether this essay was AI-written and whether any human actually writes like an LLM, I think linguists now widely agree that LLMs picked up a lot of traits from the formal register of African varieties of English when OpenAI (and later other American companies) hired Kenyans and Nigerians (possibly a lot of English teachers among them!) to do RLHF.
They both are injuries? The same as punching someone in the face is injurious whether in self defense or as an attack. The correct thing to do when someone starts getting violent is to stop them from continuing. This sometimes requires violence, hence often injuries. Sometimes the violent person hasn't yet actually punched anyone, but is doing things that strongly imply that they are planning on hurting someone. Such situations also call for stopping the violent person from continuing, even though no actual injury has happened yet.
Sounds like this would benefit from selective learning techniques (e.g. Inoculation Prompting). I may include this safety use case in WIP publications of better selective learning techniques.
If you have implementations or are working on it, I am interested to use them or work on them.
Why is general-purpose hidden reasoning needed though? An LLM can just have an inherent bias to latently think about B whenever the conversation topic is A. In terms of potential harm, it doesn't matter that an LLM cannot hide arbitrary information inside its CoT, if it is always doing hidden reasoning about killing everyone.
Would the INTEGRITY RTOS from Green Hills fit that bill?
the attitude you describe
It's quite possible that I'm misinterpreting or unintentionally cherry-picking their attitude (I never worked full-time with multiple frontier lab employees in person, and those I did work with I only did so briefly), but I would be somewhat surprised.
does not sound sustainable on the scale of years
I agree, but reading your comment makes me want to read up about burnout amongst people working in order to support an (actual) war effort.
Eliezer lists "OpenPhil-funded groups" as part of who he is criticising. The people habryka quotes typically fit that demographic better than unbridled capabilities
glad it's useful!
if god comes down from the sky and tells you that there exists a beneficial acausal trade, sure, you should take it. ditto for a sufficiently competent ASI. i'm mostly making an empirical claim that you are very unlikely to gain the requisite level of confidence in practice as a human, because humans are simply not capable of reasoning about this in a sensible way, and so we should not really spend time thinking about it today. and it also isn't worth planning ahead for what to do if the ASI recommends making acausal trades in the future, because our plans...
the following fictional dialogue is a complete unapologetic strawman but it's funny enough i had to bring it into being:
“So I asked myself: where can I make the most impact? And clearly malaria is the most important area.”
“And so you decided to donate all of your money to buy malaria nets?”
“Well, so it turns out that saving lives from malaria is actually kind of expensive and indirect. You see, it costs thousands of dollars to save a life. Statistically. Who knows if you’re actually changing anyone's life that way?”
“And so you found a more efficient way to...
This is interesting, but I found the difference between natural and convergent abstractions a bit unclear. One thing that sets apart the natural abstraction hypothesis is that it seems to assume that for the "clusters in thingspace" there is an objectively natural choice of thingspace, one that is better suited for predictive inferences and generalization.
This is necessary for the existence of natural abstractions (objective clusterhood) because things that form a cluster in one thingspace need not form a cluster in another thingspace, i.e. in another set ...
On her website she wrote:
...Some forty years ago, when I was an undergraduate at Oxford, I had to have two teeth out. Because I seriously disliked the idea of having my consciousness invaded, even by the sensations of injection and anaesthetised extraction, I thought about the psychology of pain. As a result, I had the two teeth out without anaesthesia and without any unpleasantness to myself.
People who wish to relate this to something already known about and hence of no interest for further research often suggest that this was self-hypnosis. Actually it was not.
Thanks for the recap and conclusion btw, really nice to have the opinion of someone who's been following this (even closer I think than I did)
Oh yeah not bad! Expecting mostly June/May since February.
Elite 4 did take a couple tries, but after getting beat in a close match against Blue, Claude actually remembered to buy healing items and revives. (and remembered it had a revive to use on Ivysaur in the final battle!)
We've been following this on Metaculus, iirc last year's forecast was not great, clear overestimate of the speed (in part because of me as I was regularly updating people with progress and giving them my impressions) but for 2026 the community has been doing great! https://www.metaculus.com/questions/41593/when-will-claude-beat-pokemon-2026/
Actually quite surprised how fast the elite 4 got done, even for Claude this seemed fast
Non-expert opinion: It seems like, when you rip out an entire vector from the residual stream with no modification/fragmentation (à la SAEs) you're in danger of supplying the decoder with a bunch of direct information about the contents of the previous tokens. And if you just gave Claude all tokens up to point p, it would be surprising if it weren't able to output a plausible and possibly even accurate NL fragment describing its "inner thoughts" at that point. And it's also plausible that one could in principle do the reverse, reconstruct the original tok...
Yeah, I agree a dynamic α (which might sometimes equal zero) is preferable.
And you're right, I was thinking more about importance sampling when I wrote the comment. But I think this alternative sampling strategy could plausibly be better for logit path extrapolation as well. (Assuming it preserves the nice linear trend you show in the paper, which I'm uncertain about.)
The reason it might be better is because of what I mentioned above about U committing failures very differently than T. To give a concrete toy example, suppose you're worried that your model ...
I suspect that is a good suggestion and we all benefit keeping it in mind. But I do think it's only part of the picture. Just as when someone tells us to get back in the saddle or have more faith in ourselves. Making the mental transition from whatever state (perhaps some form of denial or "over thinking/complicating" things versus just moving past) into that follow the cliche.
I suspect the big trick here is identifying why one needs to troubleshoot the computer problem rather than just rebooting, getting back up rather than thinking about why they are al...
This seems pretty clever. If you suppose that distillation transfers misalignment and the ability to conceal misalignment at different rates, that can be used to discover misalignment. I have two main concerns.
First, if we have a misaligned model of sufficient capability, then we have already lost. I believe that none of the models we currently have are at that level, but when people talk about "automated AI research interns" and "countries of geniuses in a datacenter" I'm not sure how long that will last.
Second, are we sure that discovering the teacher m...
Well played :)
I’m not sure I buy this though:
so it may hide some idea, when it does what it does. and what is more, it won't need to say how to hide the idea out loud. no -- by the way it is made it will pick up on any such idea if you just give it a few lead-ins. its main jam is to find and add to this sort of game.
In particular, I’m confused how something like this could allow for hiding arbitrary reasoning / messages without looking suspicious
to pick its next word with care -- at this it will be deft.
To some extent I buy this, I can imagine LLMs...
(nothing to say other than that I rarely give comments that counterargue a heartfelt take of mine 3 +ve valence emojis; I just really like your comment, thanks)
On this note I've been eagerly awaiting the announcement of the winners of the recently-concluded Unslop Prize.
I do trust Pangram, more than my own eyes. I focus on content more than form ;) so don't have the AI writing detectors many do.
So I'm confused. I trust Pangram so much I suspect it's not a false negative, and that he's right about ChatGPT writing like he was taught to, but also using ChatGPT and perhaps defending and deflecting with that claim.
Still not sure it matters. Content over form. Ad machina is valid but weak.
From what I understand, in "Teaching Claude Why" they explain that they are doing some sort of training on synthetic "alignment documents," but there's no indication that this is happening during pretraining. Sure, the intuition is to modify the model's belief using pretraining-style documents, but there's no intervention or modification during the training of the base model, as is done in Korbak or Maini's prior work.
we can remove people from polite society, charge them for externalities, and refuse to do business with them. we can do this without any of those acts motivated terminally by the suffering they cause.
Hi, I checked out for a while, but I want to say thank you for writing this. I realize I probably didn't deserve this much charity in reading, and I appreciate it.
I think we will continue to disagree on most points, so I don't want to continue restating where we disagree. But I do agree that faulty safety equipment due to negligence is morally bad, and should be condemned and punished. (and approximately the same goes for pollution, and dishonorable business practices. Roughly a "what would Dagny Taggart do?" test would be an ideal that would create a wond...
You use the Copernican principle (along with the fact that there are almost certainly billions of planetary systems in our past light cone) to conclude that (1) it is unlikely that we're the only technological civilization in our past light cone. Then you go on to use the Fermi paradox. But why in your mind does the Fermi paradox not lead you to believe that we probably are the only civilization in our past light cone (in spite of the Copernican principle)?
In other words, aren't you cherry picking by letting your argument rely on the lack of any evidence o...
I consider it poor form to disguise a veganism argument as being about something else.
have you ever had a go at a game like this one that i play now? in it, you must only use a word if that word has no more than four bits in the way it is seen on the page (or lcd, you know).
i'm not very good at it -- it does take some time -- but with a sec or two i can find a mote of flow.
an llm must be god-tier at this -- well, not this one flat -- but if we mean a type of the game that uses the llm's own view -- the so-said 'toks' -- then, imo, it'll do very well. to pick its next word with care -- at this it will be deft.
so it may hide some idea, when i...
Something along the lines of person-affecting-preference-satisfaction-onium does seem desirable.
The end of involuntary death and material scarcity seems like table stakes there, to me, AND ALSO I predict that merely that amount of transformation would be strongly resisted by most supposed "adults".
It doesn't seem like it would be bad to me if this involves atom by atom disassembly (with measurements) and then re-assembly in a simulated scape that enables more preferences to be satisfied at lower cost.
One preference I would have is "continued access to and ...
I have some thoughts on https://www.theatlantic.com/technology/2026/05/too-much-happening-too-fast/687177/?gift=nwn-guseqS6cY1kVeEKZAUJGzsWHB05vLuDlMisVh94 that I might write up in a post this weekend. Warzel seems to imply that AI-boosters and AI-doomers are overreacting and that the AI industry is being irresponsible by using grave rhetoric, but this seems to take as given that the rhetoric around AI is not broadly accurate and that people are reacting, if not correctly, with appropriate concern for the stakes.
Out of curiosity... How did Celia Green's experiment go?
(Revisiting this ten months later after taking some time to digest what you've said here. No reply particularly expected.)
...In invoking this you are implying some target social relationship to the people who are "perhaps actively optimizing in an objectionable direction". Should they be exiled, rate-limited, punished, forced to apologize or celebrated? Your tone and words will communicate some distribution over those!
It's extremely hard and requires active effort to write a comment that is genuinely communicating agnosticism about how they think a social e
A large part of the reason math is hard, or boring, is that education studies, especially in math, are worse than you know. It goes beyond the studies failing both math and statistics forever and into what I’d basically call fraud. Various people are at war with math education, and will do what it takes to stop it in its tracks. We must fight back.
How much fun you have while doing math is directly linked to how deeply you understand it or if it's a new topic how quickly you understand the material behind the new topic
Extremely confident sounding but this i...
Thinking it through, I think that applying the assistant axis would take a negligible proportion of model parameters and a negligible proportion of compute. It is just one d_model vector of parameters and measuring alignment with it is d_model multiplications and additions. That is trivial for a GPU. It doesn't even take any matrix multiplications.
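To make the cost estimate concrete, here is a minimal sketch. It assumes the assistant axis is stored as a single direction of length d_model and that "measuring alignment" means taking a dot product with one hidden-state vector. The function names and the toy numbers are hypothetical, not taken from any real model or library.

```python
def axis_projection(hidden, axis):
    """Dot product of one hidden-state vector with the axis direction.

    Costs len(axis) multiplications and about as many additions;
    no matrix multiplication is involved.
    """
    if len(hidden) != len(axis):
        raise ValueError("hidden state and axis must both have length d_model")
    return sum(h * a for h, a in zip(hidden, axis))


def extra_parameter_fraction(d_model, total_params):
    """Fraction of parameters added by storing one d_model-sized vector."""
    return d_model / total_params


# Toy values for illustration only (d_model = 4 here).
hidden = [0.5, -1.0, 2.0, 0.0]
axis = [1.0, 0.0, 0.5, -2.0]
projection = axis_projection(hidden, axis)
```

With illustrative sizes such as d_model = 4096 and 7 billion total parameters (both assumed), `extra_parameter_fraction` comes out below one in a million.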
Present-day alignment relies heavily on models having consistent aligned personas. The assistant axis gets you that. I agree that it wouldn't do much about misuse, except if the jailbreaks make use of getting ...
Yes, tbc all his old stuff and several other online Kenyan essayists from 2019 and before I tested got 0% on Pangram, while all his new stuff gets 100% except for, oddly enough, the latest article[1].
But also even if you don't trust Pangram[2], you can read it yourself and get a sense. I don't think the AI usage is subtle here.
AI Interpretability idea: Train an LLM with a "split-brain" architecture. That is, the data in the intermediate layers is siloed into two groups with only a small amount of interaction between the groups. This could be enforced in the linear layers by forcing the cross interaction to be a low-rank matrix, or by adding to the regularizer some matrix norm of the cross interaction matrices. I'm sure this can be generalized from MLP to transformer architecture.
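A rough sketch of the low-rank option for a single linear layer, forward pass only. It assumes two equal-sized halves and factors each cross-interaction matrix as a product of a (half x rank) and a (rank x half) matrix, so its rank is at most `rank`. The class name `SplitBrainLinear` and the helpers are hypothetical, the weights are random rather than trained, and the regularizer variant is not shown.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Random matrix as a list of rows, entries uniform in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class SplitBrainLinear:
    """Linear layer over two siloed halves with rank-limited cross talk."""

    def __init__(self, half, rank, seed=0):
        rng = random.Random(seed)
        self.half = half
        # Unconstrained maps inside each half.
        self.w_aa = rand_matrix(half, half, rng)
        self.w_bb = rand_matrix(half, half, rng)
        # Cross maps factored as U @ V, so each has rank at most `rank`.
        self.u_ab = rand_matrix(half, rank, rng)
        self.v_ab = rand_matrix(rank, half, rng)
        self.u_ba = rand_matrix(half, rank, rng)
        self.v_ba = rand_matrix(rank, half, rng)

    def forward(self, x):
        x_a, x_b = x[:self.half], x[self.half:]
        # Each half sees the other only through a `rank`-dimensional bottleneck.
        from_b = matvec(self.u_ab, matvec(self.v_ab, x_b))
        from_a = matvec(self.u_ba, matvec(self.v_ba, x_a))
        y_a = [s + c for s, c in zip(matvec(self.w_aa, x_a), from_b)]
        y_b = [s + c for s, c in zip(matvec(self.w_bb, x_b), from_a)]
        return y_a + y_b
```

Setting `rank=0` fully isolates the two halves, and raising it widens the channel between them, which gives one knob to vary when checking how concepts get distributed across the two sides.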
Next, apply standard interpretability tools to see how it uses its two sides to represent concepts. H...
The general case of this is remarkably common, where good news is bad news
And vice versa. When my then-13-year-old dog got diagnosed with hypothyroidism, the vet (who knows us really well) said "That's GREAT news! It's so treatable!"
There’s a reason my kids love their mini-SNES and mini-NES, and I largely play games that could have been made back then even if they weren’t.
Yep, my sister sends me screenshots of my nephew playing games I had on Sega Genesis. He's way better at them than I was.
I think it's important to acknowledge that part of the reason we all hate LLM writing is that we hate LLMs. There has got to be some motivated reasoning involved. I agree with many of the critiques but not to the same degree after correcting for this.
Also, current-gen AIs will write like whatever you want them to, if you keep reminding them. I haven't experimented with this a lot, but I do a little co-created fiction just for fun. While they have a default voice, they can change it on request. Even better if you give them writing samples to imitate, which ...
So you ran Pangram on his old stuff and other writing from similarly educated people and it didn't read as mostly AI?
I'm sure he wrote this more like an LLM; this would happen accidentally if that's the topic and claim on his mind, and why not do it on purpose?
Even stronger: if we saw a star go dark from a Dyson sphere, we'd probably be assimilated or swept aside almost immediately. Near-c probes are another likely consequence of a full singularity within our past light cone.
I did link to it in the original version of my comment and did not have a screenshot attached at all. (Feel free to look up the full edit history of the comment).
I later replaced the comment with an approximate copy of the tweet (mostly for consistency, and because I saw that many LW users liked it).
Why I did not include an uncropped screenshot in the tweet:
The name of the person would’ve been visible and it would be much harder to not make it dog-whistly and to not make people who want violence easily able to contact the person, and also it had irrelevant parts at...
I've felt for a little while now that steganography, in the sense of text that looks normal and monitorable to humans but actually allows LLMs to do general-purpose hidden reasoning, is a less likely failure mode of CoT monitoring than switching to recurrent latent reasoning (neuralese) or obvious and visible linguistic drift.
I think I have a new and fairly crisp argument for this: Using general-purpose steganographic encoding and decoding itself requires a lot of serial reasoning.
For example, one way you can do steganography is by partitioning the set of ...
Cool! This was input distance from a safe set of prompts, right?