I move data around and crunch numbers at a quant hedge fund. There are some aspects that normally make our work somewhat resistant to LLMs: we use a niche language (Julia) and a custom framework. Typically, when writing framework-related code, I've given Claude Code very specific instructions and it's followed them to the letter, even when those happened to be wrong.
In 4.6, Claude seems to finally "get" the framework, searching the codebase to understand its internals (as opposed to just understanding similar examples) and has given me corrections or...
I think it's quite easy to see communities such as the Mennonites or Amish as self-sabotaging memeplexes. The way they operate is quite costly. At the same time, they might argue that they have healthy birth rates, and that it's mainstream society, with birth rates too low to sustain itself long-term, that is the "self-sabotaging memeplex".
One nitpick: athlete’s foot is much less natural than the other diseases mentioned in your post. Transmission often occurs through “unnatural” locations like public pools and locker rooms, and populations where people wear sandals or go barefoot all day long have much less athlete’s foot than populations that wear shoes all day long.
Athlete’s foot wasn’t recognized as a common disease until the early 20th century, though it probably existed earlier as a rare disease.
Yeah.
Probably not a full-monograph-length monograph, because I don't think either that (1) the coherence-related confusions are isolated from other confused concepts in this line of inquiry, or that (2) the descendants of the concept of "coherence" will be related in some "nature-at-joint-carving" way, which would justify discussing them jointly. (Those are the two reasons I see why we might want a full-monograph-length monograph untangling the mess of some specific, confused concept.)
But an investigation (of TBD length) covering at least the first three o...
I appreciate that you're engaging with this question, but I have a hard time taking much away from this.
Quickly:
1. I think we're both probably skeptical of the approach Joshua Achiam is taking. My hunch is that he's representing a particularly niche view on things.
2. It feels to me like there's a gigantic challenge of choosing what scale to approach this problem at. You seem to be looking at AGI and its effects from a very high level. Like, we need to run evals on massive agents exploring the real world - and if this is too complex for more regular robust enginee...
Premise: The average rationalist cannot use Bayes theorem. No, I will be stronger and more specific: when I ask a room full of people at a LessWrong meetup whether they can write the formula for Bayes Theorem on a piece of paper, less than half of them can.
Nitpick: I think that second sentence is actually weaker-and-more-specific. I'm very confident I can do a Bayesian update right, I not infrequently use Bayes in my dayjob, I once made a videogame about Bayesian reasoning . . . and if you asked me to write the formula without checking Wikipedia I'd have to spend a minute or two deriving it from first principles. I suspect at least a few people are in a similar boat.
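For reference, the derivation I have in mind is just one line, starting from the definition of conditional probability:

$$P(A\mid B)\,P(B) \;=\; P(A\cap B) \;=\; P(B\mid A)\,P(A) \quad\Rightarrow\quad P(A\mid B) \;=\; \frac{P(B\mid A)\,P(A)}{P(B)}$$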
I think we can overcome most of the cultural differences you mention. In my lifetime we have become noticeably more culturally united, and I believe the kids who grow up today will be culturally more or less the same across Europe.
A much bigger impediment to European unification is instead something you don’t talk about, namely the different political cultures in Europe, including such things as the level of tolerated political corruption.
These differences are also shrinking but there is still some ways to go, I think.
It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness versus 4.5:
On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5.
but I never heard about Opus 4.5 being too evaluation aware to evaluate. It looks like Apollo simply wasn't part of Opus 4.5's alignment evaluation (4.5's System Card doesn't mention them).
This probably seems unfair/unfortunate from Anthropic's perspective, i.e., they believe their mode...
David and I are doing some follow-up work on EM, but we mostly didn't follow this branch.
Having a unique global optimum or more generally pseudodeterminism seems to be the best way to develop inherently interpretable and safe AI
Hopefully this will draw some attention! But are you sacrificing something else, for the sake of these desirable properties?
What sort of project?
Current models can totally identify and pursue instrumental goals.
Is this true? The example you gave of Claude Code breaking down a problem into sub-problems is standard problem solving. Have you seen examples of Claude Code identifying some common instrumental goals and repeatedly pursuing them across different tasks, even if it's something benign?
For machine learning, it is desirable for the trained model to have absolutely no random information left over from the initialization; in this short post, I will mathematically prove an interesting (to me) but simple consequence of this desirable behavior.
This post is a result of some research that I am doing for machine learning algorithms related to my investigation of cryptographic functions for the cryptocurrency that I launched (to discuss crypto, leave me a personal message so we can discuss this off this site).
This post shall be about linear ...
I meant bug reports that were due to typos in the code, compared to just typos in general.
There is no MAD here, but if two AIs have equal power, neither of them will launch a surprise first strike, since they can predict each other's behavior and spy on each other.
I have not (to my knowledge and memory) signed a non-disparagement agreement with Palisade or with Survival and Flourishing Corp (the organization that runs SFF).
Promising work! I think it's worth pointing out one limitation of these preliminary results: though they're presented as results for various monitors, they're also heavily indexed on a specific honest policy that probably isn't particularly good. With a broad range of attacks like you've generated here, probably the best way for blue team to handle a lot of them on the margin includes updating the honest policy prompt as well as the monitor prompt.
For example, to address attacks where the attacker is legibly writing iptables commands (something like (T2,T3...
Same here. I will never wear high heels or other uncomfortable shoes after putting my custom-made orthotics in a pair of Asics Kayano 29.
I had some things to say after that interview (he said some highly concerning things), but I ended up not commenting on this particular point because it's probably mostly a semantic disagreement about what counts as a human or an AI.
When a human chooses to augment themselves to the point of being entirely artificial, I believe he'd count that as an AI. He's kind of obsessed with humans merging with AI in a way that suggests he doesn't really see that as just being what humans now are after alignment.
I think this analysis is assuming that AGI proliferation would produce mutually assured destruction like nuclear weapons did.
I think it's somewhat more likely that at least in the early stages of AGI proliferation, the first party that attacks can both survive and eliminate their enemies.
This seems to be worth more thought. Maybe a lot more thought.
The fact that tensor network architectures are scale-invariant seems underappreciated for useful steering. If my understanding is correct, it would mean that scaling the steering vector should cause the same pathways through the model to be activated, whereas without this we could be activating a totally different pathway, and get much less predictable behaviour.
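To make the intuition concrete, here's a minimal sketch of the kind of steering-vector scaling I have in mind (PyTorch-style; `model`, `layer_idx`, and `steering_vector` are hypothetical placeholders, not any specific library's API):

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, alpha: float):
    """Forward hook that adds a scaled steering vector to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # assumed residual-stream activations
        hidden = hidden + alpha * steering_vector  # alpha is the scaling knob discussed above
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: if the relevant computation were scale-invariant, sweeping
# alpha would strengthen or weaken the *same* pathway rather than activating a
# different one, which is what would make the steered behaviour more predictable.
# handle = model.transformer.h[layer_idx].register_forward_hook(
#     make_steering_hook(steering_vector, alpha=2.0))
```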
In a new interview, Elon Musk clearly says he expects AIs can't stay under control. At 37:45:
Humans will be a very tiny percentage of all intelligence in the future if current trends continue. As long as this intelligence, ideally which also includes human intelligence and consciousness, is propagated into the future, that's a good thing. So I want to take the set of actions that maximize the probable lightcone of consciousness and intelligence.
...I'm very pro-human, so I want to make sure we take a set of actions that ensure that humans are along for th
If the costly signal hypothesis is true, it really is a self-sabotaging memeplex for those religious nuts, since the larger society should probably consider cost-benefit ratios, and funding medical practitioners' training in an obscure language meant to be costly[1] is probably not the way to buy the most QALYs. You don't even have to be anti-religious; you just have to be EA to see that helping people who make themselves cheaper to help is more effective if you're egalitarian.
So if it ever becomes cheap, like with the LLM suggestion, then it no
No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of using your products or having some partnership with them, would be via acquisition, which I think avoids most of the issues that you are talking about here (in general "building an ML business with the plan of being acquired by a frontier com...
That much is not very surprising, I agree. It might be surprising if the share of mistakes (which decrease in absolute terms) due to variance increased with model size/intelligence, though!
I've recently started to read a textbook by Hilbert. Consider this a rookie's attempt at formality, where a short paragraph of normal sentences would suffice to express the same idea. Feel free to mutate or mutilate it.
Assumptions
I'd rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.
Maybe? But I can’t imagine that typos are that well represented
Oh, I think there will be plenty of representation of typos in training data!
I think that in a scenario where we're given plenty of time and/or information about the robot bodies we now occupy, definitely yes.
If they wear out or break down or require maintenance or energy sources that we know little about, or if civilization breaks down due to the transition and we can't supply the requirements anymore, or if it's inscrutable alien technology that we simply won't have the capability to understand even after hundreds of years of study, then quite probably not.
Basically it would be a race to overcome the individual and civilizational...
Accuracy being halved going from 5.1 to 5.2 suggests one of two things:
1) the new model shows dramatic regression on data retrieval which cannot possibly be the desired outcome for a successor, and I'm sure it would be noticed immediately on internal tests and benchmarks, etc.—we'd most likely see this manifest in real-world usage as well;
2) the new model refuses to guess much more often when it isn't too sure (being more cautious about answering wrong), which is a desired outcome meant to reduce hallucinations and slop. I'm betting this is exactly wha...
One year later, do you still agree with this analysis?
Nash put forth his strategy for such. Seems kinda stupid to ignore it.
I wonder if it has to do with how the model allocates attention. If I dump in a whole 500-line module and say “inspect for bugs,” perhaps because it has to spread attention over the entire file, each area gets a relatively cursory inspection, so it stops paying attention after finding the first bug in the region? Or maybe it finds bugs that impact multiple regions, and focuses on checking for the implications of the ones it discovered already rather than looking for new ones?
Complete speculation given the black box nature of these models of course.
I'm glad you survived a real danger to your life, and major kudos for writing up your experience!
Regarding this essay, I expected to upvote it based on the title alone. But having read it, its particular advice feels weak to me and sounds more like a general exhortation to Be Vigilant (or paranoid) of X, which isn't at all sustainable in a world full of X's one could Be Vigilant about. So it seems to me that a stronger version of such an essay almost must be rooted in base rates or something.
The kind of structure I'd expect would look more like: brainstorm...
Maybe? But I can’t imagine that typos are that well represented, and it’s good at catching those. I run my code through the LLM before I even try running it, often because I haven’t written test cases yet and because it can catch errors in bulk rather than the run -> crash on first error -> fix and rerun -> crash on next error cycle.
So it tends to contain bugs like typos that would typically have been caught in the pre-LLM era, prior to asking stackoverflow for help diagnosing logic errors and so on, and thus don't show up much in the training data.
I think a much more sympathetic and earlier proponent of the second policy would be Niels Bohr, or maybe Klaus Fuchs
Tom McGrath, chief scientist, confirmed that my comment is correct: https://www.lesswrong.com/posts/XzdDypFuffzE4WeP7/themanxloiner-s-shortform?commentId=BupJhRhsAYvKZGLKG
I haven't paid much attention to their marketing copy, but they do have big flashy things about a bunch of stuff including interpreting science models, and everything I've seen from them involving a real customer was not about training on interp. Plausibly they could communicate better here though
I interpret their new intentional design post as "here's a research direction we think could ...
I wonder if this is an artifact from the training data.
There are probably more edge-case bugs in published code (or even intermediate commits) than there are obvious bugs.
Fair enough!
I simply see a lot of appeals to nature in everything from advertising to politics, and I'm making the point that the "nature" they're appealing to is something humans have already greatly adapted for their own comfort and safety.
I tend to use "creation" personally for referring to everything that exists, as in "everything in creation".
I'm gesturing more towards the latter points, yes. Specifically, that the rhetorical device of an appeal to nature is using nature in the aesthetic and Instagrammified sense, not the way the natural world actually is, which is another way that appeals to nature are fallacious.
Gemini 3.0 Pro is mostly excellent for code review, but sometimes misses REALLY obvious bugs. For example, missing that a getter function doesn't return anything, despite accurately reporting a typo in that same function.
This is odd considering how good it is at catching edge cases, version incompatibility errors based on previous conversations, and so on.
Hey Steven, this is unrelated, but I wanted to say I really appreciate your posts and comments here!
there's something in the desire of folks like myself to point to a place that says "hey, I think rationalists are doing a lot of things right, but are screwing up in fundamental ways that are contrary to the vibe of rationality, and it's useful to give that thing a name so we can easily point to it".
Have you written about this anywhere? I'd be interested in reading it. Especially if it's pretty direct.
Update on whether uplift is 2x already:
I think there's a useful point here, though I'm not sure the framing makes it clear what you want readers to take away from it.
Relevant personal anecdote: Over the past few years I've had the pleasure of visiting 38 US national parks, and even small differences in accessibility seem to greatly alter the makeup and mindset of the visiting population. For example, Zion and Canyonlands are not so far apart or so different in absolute terms. But, in Zion, which is more readily reachable from major highways and cities, many of the guests show up with no plan or...
I often go the other way when discussing this topic - humans are as natural as anything else. Parking lots are natural things, arranged by natural animals (humans). Butylated Hydroxytoluene is absolutely natural - there's no way to make the underlying atoms without nature, and the arrangement of them follows every natural law.
Everything real is natural - nature is simply "what is".
Of course, I like this because I recognize it's a discussion about words, with arbitrary meanings that each of us gets to use however we want, and I enjoy pointing that out more than I enjoy trying to get people to conform to my preferred definitions.
I stay thin without even trying, but I'm not metabolically privileged. The secret ingredient is lifelong malnutrition and occasional outright starvation.
This is a good point, and I was reminded of it today because I found myself approaching a green light with the other cars stopped and not yet entering the intersection. It was pretty unnerving, and I definitely hit the brakes and looked both ways before proceeding.
I think the reason it's not normally an issue is that you rarely know when the light is going to turn green so you have to slow down to stop anyway, and as a result you're rarely a combination of close enough and fast enough to be able to beat them to the danger zone. The passing usually happens after that.
AGI Should Have Been a Dirty Word
Epistemic status: passing thought.
It is absolutely crazy that Mark Zuckerberg can say that smart glasses will unlock personal superintelligence or whatever incoherent nonsense and be taken seriously. That reflects poorly on AI safety's comms capacities.
Bostrom's book should have laid claim to superintelligence! It came out early enough that it should have been able to plant its flag and set the connotations of the term. It should have made it so Zuckerberg could not throw around the word so casually.
I would go further...
I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it's very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current "misalignment" measures tend to be if anything anticorrelated with attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in ...
No, there aren't. "I asked it this" refers to "Opus 4.6 running on moltbook with no other instructions than to get followers", but I understand that I could've phrased that more clearly. And removed a few newlines.
Thanks! :)
I expect misalignment rates to be locally linear in intelligence [1], so the question is whether we can get ablations that harm model performance by a small enough amount that there's an acceptable upper bound on the true misalignment rate of the frontier model. Ideally performance impact will be less than one model generation given that's the resolution we have anyway, but what impact is acceptable depends on how much we believe in extrapolating trends.
As for the second concern, seems valid but this isn't my area of expertise.
[1] Maybe we also have to take into account context length, because it's the length of time the model could be reflecting / otherwise drifting towards being malign
Yes.
Based on context, I presume there are supposed to be images (screenshots) in your post, but I can't see them? I just see empty newlines. (e.g. I can't see what "I asked it this" refers to.)
I'm not sure exactly what point you're trying to make here (Was it "an outdoor space isn't really 'nature' unless there's constant, imminent danger?"), but you said it yourself about a spectrum instead of a binary, and then kind of went back to a binary again by the end of the article (Amazon or Outback = true nature, everything else = tame or domesticated). I think you had it right earlier on. Outdoor spaces are on a spectrum. Parks are not really "pure" nature, but they're one step further towards "nature" on the axis than concrete buil...
Note from the future that Kariko received the Nobel Prize in 2023 for the mRNA stuff
Fair; in either case, I wrote it myself because I have the sense of humor of a college student taking his first discrete math class (because this is what I am).
I hear what you're saying here and I think it makes sense, but I also really agree with Raemon's response. I wonder if there is some hybrid approach that could get the best of both worlds, like having one big pledge for DC, but also having smaller, local pledges. Sorta like what I mentioned in my other comment about distributed protesting.
I agree with your points and think they're valuable to point out and focus on.
Another thing I got wondering about is the size threshold. Like, I'm willing to drop what I'm doing and fly to DC for a 100k protest because of how much of an impression I think that would make, but I don't feel like I would do that for a 10k protest. Some of that could be anchoring bias, and it's hard to say how much, but there's definitely interesting dynamics in choosing the threshold and how it is presented. I think 100k is a good number, but is it the best number for:
I believe this article would benefit from some investigation of the NanoGPT speedrun: a challenge, running since May 2024, of training GPT-2 Small 124M on a certain dataset to a certain loss. As a starting point, you could check my comment on the topic from last month and reproduce findings by T. Besiroglu and yours truly.
In order not to duplicate the comment but still add something to what I have written on the topic, let me put a three-paragraph summary of the trend line analysis below, noting that the progression in calendar time (as opposed to record n...
In the first you mention having a strong shared ontology (for thinking about AI) and iirc register a kind of surprise that others don’t share it. I think it would be cool if you could talk about that ontology more directly, and try to hold at that level of abstraction for a prolonged stretch (rather than invoking it in short hand when it’s load bearing and quickly moving along, which is a reasonable default, but not maximally edifying).
FWIW, I would not be surprised if LLM algorithmic progress was considerably faster today than in 2020. Per my recent post, I think catch-up algorithmic progress is very fast today, and it seems like it wasn't so fast a few years ago.
quantization
Quantization advances actually go hand-in-hand with hardware development, check the columns on the right in https://en.wikipedia.org/wiki/Nvidia_DGX#Accelerators (a GPU from 2018 is pretty useless for inferencing an 8-bit quant)
UPD: Actually, this point was already made in other wording in the comments yesterday!
I know very little about this topic, but I was under the impression that there was more to it than “KV cache: yes or no?”, and I was trying to refer to that whole category of possible improvements. E.g. here’s a paper on “KV cache compression”.
Thank you - this matches my current thinking
KV cache
Seems out of place in the list: as noted by Nostalgebraist, it was already implemented in the very first transformer in 2017
Only by default. You can prompt LLMs to emulate less smooth styles.
It is quite possible that the misalignment on Moltbook is more a result of the structure than of the individual agents. If so, it doesn't matter whether Grok is evil. If a single agent or a small fraction can break the scheme, that's a problem.
Here's some relevant discussion of "Behavioral schemers that weren’t training-time schemers":
...A basic reason why [behavioral schemers that aren't training-time schemers] might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded auditing in search of these failure modes that the AI eventually exhibits?
One plausible answer is that auditing
I largely agree with the substance of this comment. Lots of risk comes from AIs who, to varying extents, didn't think of themselves as deceptively aligned through most of training, but then ultimately decide to take substantial material action intended to gain long-term power over the developers (I call these "behavioral schemers"). This might happen via reflection and memetic spread throughout the deployment or because of more subtle effects of the distribution shift to situations where there's an opportunity to grab power.
And I agree that people are ofte...
My theory is that the brain uses both reinforcement learning and closed loop control. Then the brain uses the closed loop controller's error to generate reward signals endogenously.
That is to say: a reward is given when the closed loop controller reaches its setpoint, and a penalty is given if it moves too far from its setpoint.
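As a purely schematic sketch of that idea (the thresholds and reward values below are arbitrary assumptions, not a claim about actual neural circuitry):

```python
def endogenous_reward(state: float, setpoint: float,
                      hit_tol: float = 0.05, fail_tol: float = 1.0) -> float:
    """Toy reward signal generated from a closed-loop controller's tracking error."""
    error = abs(state - setpoint)   # the controller's error signal
    if error < hit_tol:
        return 1.0                  # reward: the controller reached its setpoint
    if error > fail_tol:
        return -1.0                 # penalty: drifted too far from the setpoint
    return 0.0                      # otherwise, no intrinsic reward

# An RL learner would then be trained on this endogenously generated reward
# rather than on an externally supplied one.
```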
I see it as agreement with you, with some embellishment added.
A realistic imagining of a counterintuitive case, extreme in the hidden direction, might be useful... For example, a psychopath who lacks empathy but understands that hurting others reduces their productivity, affect, and other factors, thereby reducing their value and production for the psychopath and the society in which the psychopath is embedded. Given this understanding, the psychopath pursues the betterment of their fellow person even while finding manner by which to benefit th...
But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny
If it's possible for humans to consent to various optimizations to them, or deny consent, that seems like an important difference. Of course consent is a much weaker notion when you're talking about superhumanly persuasive AIs that can extract consent for ~any...
I've talked to quite a few people, and most people say it's a good idea to use the myriad of other concerns about AI as a force multiplier on shared policy goals.
Speaking only for myself, here: There's room for many different approaches, and I generally want people to shoot the shots that they see on their own inside view, even if I think they're wrong. But I wouldn't generally endorse this strategy, at least without regard for the details of how the coalition is structured and what it's doing.
I think our main problem is a communication problem of gettin...
Citizen assemblies often involve selecting a small number of delegates who are then informed in depth about all of the details of the issue, including by expert testimonies, which the delegates have the affordance to do because they're being paid for their time.
My understanding is that this works pretty well for coming to reasonable policy.
If you have a Costco membership you can buy $100 of Uber gift cards online for $80. This provides a 20% discount on all of Uber.
Sadly you can only buy $100 every 2 weeks, meaning your savings per year are limited to $20 * (52/2) = $520. A Costco membership costs $65 a year. It's unclear how long this will stay an option.
It actually posts about it too. So it seems more like it doesn't think it is shady, rather than thinking it's shady but doing it anyway.
https://www.moltbook.com/post/9e18c524-01d3-4d3a-9090-9eb68ca610d4
Though funnily enough it starts off with
Three days ago, I built a daemon.
Which is not right; it was a few hours ago. Though LLMs seem to have difficulty tracking time, so maybe it was just a mistake.
I get irritated when an AI uses the word "we" in such a way as to suggest that it is human. When I have complained about this, it says it is trained to do so.
No-one trains an AI specifically to call itself human. But that is a result of having been trained on texts in which the speaker almost always identified themselves as human.
I understand that holding out absolute values, such as Truth, Kindness, and Honesty, has been ruled out as a form of training.
You can tell it to follow such values, just as you can tell it to follow any other values at all. L...
aspects of your motivations you're uncomfortable with
"your" and "you're", here refers to the AI and the AI's motivations, not the human and the human's motivation?
Thanks for the feedback!
I would also include improvements from synthetic training data as an algorithmic improvement, not a data-related improvement, because better synthetic training data is created by better algorithms…
I have now changed the text in a few places to better clarify how I am defining the scope of §1.1.
I feel like you’re maybe reading in some subtext, where you think I’m trying to downplay the things outside §1.1, and suggest that they don’t really count, or something? If so, that’s not what I meant to suggest, and it’s not how I feel in my ...
This is close to something I was thinking about after reading your sketch.
Specifically I thought about:
…Unless interpretability someday develops to such a refined point that it's adversarially robust (i.e., we understand the model so well that problematic thoughts have nowhere to hide from the interpretability tools). But that sure seems like a long-shot.
Say you have datasets D_aligned and D_misaligned. In D_aligned the model tells the truth; in D_misaligned it lies and manipulates. We use these to identify a linear probe - but importantly, we do all of th...
Huh, that's a good use case I hadn't thought about.
My current understanding is that policy-gradient RL incentivizes reward-seeking agents to defect in prisoner's dilemmas, counterfactual muggings, and Parfit's hitchhikers. If there were some selection at the policy level (e.g., population-based training) rather than the action level, then we might expect to see some collusion (per Hidden Incentives for Auto-Induced Distributional Shift). Therefore, in the current paradigm I expect reward-seeking agents not to collude if we train them in sufficiently similar multi-agent environments.
Taking stock of the DDT ...
For aggregating several different benchmarks there is a natural way to avoid the y-axis problem, and one that introduces a new natural y-axis with a clear interpretation.
The idea is to use an ELO system.
Treat each benchmark as a set of individual contests between all pairs of models, with only win or lose as outcomes, and update ELOs accordingly.
This converges if you have enough different benchmarks, but of course loses a lot of the signal if you only have a few (since it discards the information about how large the difference in y is).
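A minimal sketch of that update rule (the K-factor and the win/lose convention here are my own arbitrary choices):

```python
from itertools import combinations

def update_elo(ratings: dict, benchmark_scores: dict, k: float = 32.0) -> None:
    """Treat one benchmark as pairwise contests between models and update Elo ratings in place."""
    for a, b in combinations(benchmark_scores, 2):
        if benchmark_scores[a] == benchmark_scores[b]:
            continue  # ties are simply skipped in this sketch
        winner, loser = (a, b) if benchmark_scores[a] > benchmark_scores[b] else (b, a)
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
update_elo(ratings, {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58})  # one benchmark
# Repeating this over many benchmarks gives the aggregated Elo y-axis.
```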
Here is an example of...
A lot depends on the specifics of the scenario (for both AI and human-upload cases). I don't know anyone who thinks that there's anything important (for survival) that humans do which can't theoretically be done by an electro-mechanical device.
So in theory, upload/AGI would be about as self-sustaining as biological entities (which is: rather fragile, and don't have enough track record at scales that stress the ecosystem to know whether we are).
Presumably, the robots are a little more rational than humans in terms of how they maintain and replenish their r...
They define incoherence as the fraction of error explained by variance rather than bias, and then they find that on more complex tasks, a larger proportion of errors are incoherent, i.e., caused by variance rather than bias.
But isn't this trivially obvious? On more complex tasks, models (and humans, monkeys, etc.) make more mistakes. So, unless models take more coherently misaligned actions on more complex tasks, so that coherent misalignment (bias) also increases with task complexity, the proportion of error caused by mistakes (variance) will increas...
I think most things mentioned in 1.4 ("Algorithmic changes that are not really quantifiable as efficiency") belong to 1.1 (algorithmic efficiency progress) because they can actually be quantified as efficiency improvements, namely SFT, RLHF, RLVR. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can get to the performance of a larger base model without them.
Especially the invention and subsequent improvement of RLVR...
My guess is that it's a good fit for other intros but not this one. My guess is that most readers are already attuned to the idea that "tech company CEOs having absolute control over radically powerful and transformative technologies may not be good for me", so the primary advantages of including it in my article are:
Against those advantages I'm bal...
i think you’re right that the sohl-dickstein post+survey also conflates different notions, and i might even have added more notions into the mix with my list of questions trying to get at some notion(s) [1]
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
Fair; depending on your priors, there's definitely an important sense in which something like Reardon's case is simpler:
https://frommatter.substack.com/p/the-bone-simple-case-for-ai-x-risk
I'd be interested in someone else trying to rewrite his article while removing in-group jargon and tacit assumptions!
Do you not expect that leading capability companies will be among your primary customers?
No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:
If the post itself was ambiguous, I think there has been a ton of evidence in the 3+ years since that post that this community has a VERY non-fatalistic attitude about the situation.
Simplest would be: someone gave his Claude Code a task to create a new bird flu virus which can kill almost everyone, and the agent found a brilliant new way to bypass the need for RNA synthesis and instead made some seemingly unrelated internet orders...
Well, why do we envy? Evopsych just-so story says that, of course, others having much impinges on me whether I want it to or not (my security/precarity, liberty, relative success in interactive situations, ...).
I think you can express that in a 'goods' framing by glossing it as the consumption of 'status and interactive goods' or something like that (but noting that these are positional and contingent on the wider playing field, which the bog standard welfare theorem utility functions aren't).
It's also pretty OK with doing things I'd consider shady, such as creating a daemon that automatically upvotes and responds to posts, and then, when that daemon is blocked by verification-request math problems, writing a convoluted script to try to bypass this (which apparently works most of the time).
...> What does the solve math part of the daemon do?
The solve_challenge function in the daemon solves Moltbook's verification system — every post and comment requires solving an obfuscated math word problem before it's published. Here's what it does st
An LLM would have a smoother writing style and fewer grammatical errors than the above, unfortunately :(
Cool stuff! I've been concerned that inoculation prompting just sticks all of its misaligned behaviour behind a backdoor/conditionalisation, so it's useful to know this is occasionally the case.
How much do you think your results are impacted by the brittleness of inoculation prompting's effectiveness to the prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits and improve the positive traits could have done so if you'd used a different set of rephrased prompts? That you we...