But staying on the frontier seems to be a really hard job. Lots of new research comes out every day, and scientists struggle to keep up with it. New research has a lot of value while it's hot, and loses it as the field progresses and absorbs it into the general theory (which is then a much more worthwhile use of time to learn).
Which raises the question: if you are not currently at the cutting edge and actively advancing your field, why follow new research at all? After a while, the field will condense the most important and useful research into neat textbooks and overview articles, and reading those when they appear is a much more efficient use of time. While you are not at the cutting edge, read condensations of previous work until you get there.
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven't looked hard enough).
Please, just please, don't start RSI on purpose. For years, AI x-risk people have warned us that a huge danger comes with AI capable of RSI, and that even the mere existence of such an AI poses a threat. We were afraid we would accidentally miss the point of no return, and now so many people (not only in major AI companies, but in smaller labs too) are trying to bring that point closer on purpose.
Programs sometimes don't work as we expect them to, even when we are the ones designing them. How is handing this job to a hallucination machine supposed to produce something so powerful, with working guardrails?
I recently prepared an overview lecture about research directions in AI alignment for the Moscow AI Safety Hub. I had limited time, so I did the following: I reviewed all the sites on the AI safety map, examined the 'research' sections, and attempted to classify the problems they tackle and the research paths they pursue. I encountered difficulties in this process, partly because most sites lack a brief summary of their activities and objectives (Conjecture is one of the counterexamples). I believe that the field of AI safety would greatly benefit from improved communication, and providing a brief summary of a research direction seems like low-hanging fruit.
Just as you can unjustly privilege a low-likelihood hypothesis merely by thinking about it, you can in the exact same way unjustly unprivilege a high-likelihood hypothesis merely by thinking about it. Example: I believe that when I press a key on a keyboard, the letter on the key is going to appear on the screen. But I don't hold that belief consciously; most of the time I don't think about it at all. And so, just by thinking about it, I am questioning it, separating it from all the hypotheses I believe without question.
Some breakthroughs were in the form of "Hey, maybe something which nobody ever thought of is true," but some very important breakthroughs were in the form "Hey, maybe this thing which everybody just assumes to be true is false."
When someone does physics (tries to find out what happens to a physical system, knowing its initial conditions), they are transforming a time-consuming but easy-to-express way of connecting initial conditions to end results (physical laws) into a single entry in a giant look-up table matching initial conditions to end results (a form that is not time-consuming but harder to express), essentially flattening out the time dimension. That creates a feeling that the process they are analyzing is predetermined, that this giant look-up table already exists. And when they apply this to themselves, it can create a feeling of having no control over their own actions, as if those observation-action pairs were drawn from a pre-existing table. But this table doesn't actually exist; they still need to perform the computation to get to the action; there is no way around it. Wherever the process is performed, that process is the person.
In other words, when people do physics on simple enough systems that they can fit in their head both the initial conditions and the end result and...
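A toy restatement in code (my own illustration, not part of the original argument): the "giant look-up table" is just memoization, and nothing is in the table until the physics has actually been run once.

```python
# My own toy illustration: the "giant look-up table" as memoization.
# The table starts empty and does not pre-exist the computation.

table = {}  # (initial_state, steps) -> end_state

def step(state):
    """Stand-in for one tick of physical law."""
    return state * 2 + 1

def end_result(initial_state, steps):
    key = (initial_state, steps)
    if key not in table:          # no entry yet: the computation must actually run
        state = initial_state
        for _ in range(steps):    # this loop is the flattened-out time dimension
            state = step(state)
        table[key] = state
    return table[key]

print(end_result(3, 4))  # the first call *is* the process; only now does the entry exist
```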
Why is no one here talking about Claude Mythos? I don't have any takes, but I want to hear yours.
Introducing three notable classes of model sizes (Sonnet, Opus, above-Opus) is possibly a consequence of Anthropic needing to feed datacenters with three different classes of servers during the Claude 5 lifecycle: the smaller Nvidia 8-chip servers (H100/H200/B200), rack-scale Trainium 2, and TPUv7, each able to serve larger models efficiently than the previous one. Meanwhile, OpenAI until very recently was stuck with mostly the 8-chip Nvidia servers and so had to use smaller models (they couldn't serve their own Opus-class model efficiently), and only now are they getting enough GB200/GB300 Oberon racks to offer an Opus-class flagship model soon. The Blackwell Oberon racks are better than Trainium 2, though, so there's some advantage in what OpenAI will be able to serve compared to Opus 4, all else equal. And based on GPT-5.4 (which is likely in Sonnet's weight class), OpenAI might currently be better at RLVRing capabilities than Anthropic for models of the same size, so OpenAI's new Opus-class model might end up notably better than Opus 4. But by that time, or a bit later, Opus 5 will be released, so even if these considerations are on point, it's still unclear which of them w...
If it's the next level of pretraining compared to Opus 4 and Gemini 3 Pro, there's potential for novel observations about what that does to the texture of capabilities. It's the kind of thing that will predictably scale further soon without requiring algorithmic breakthroughs, and it's not even clear that RLVR can be expected to deliver more phase changes in capabilities in the near future due to pure scaling than pretraining (even if it's less than 1 phase change for either in expectation, until 2032 or so).
If I see a YouTube video pop up in my feed right after it’s published, I can often come up with a comment that gets a lot of likes and ends up near the top of the comment section.[1] It’s actually not that hard to do: the hardest part is being quick enough[2] to get into the first 10-30 comments (which I assume is the average number of comments viewers glance over), but the comment itself might be pretty generic and not that relevant to the video’s content.
Do you know a way I could use that? You can suggest advice for achieving convergent instrumental goals, usual human goals, and (most importantly) AI x-risk reduction. If you think I’m hyper-online or delusional about this, you can also point it out.
I wouldn’t be surprised if it’s actually not that hard and my success is just a consequence of being hyper-online.
I also suspect that the YouTube algorithm might have learned about this ability of mine and has now categorized me as a “top commenter,” so it shows me videos earlier than others and uses me to “boost engagement” or smth.
Wake up babe, new superintelligence company just dropped
And they show some impressive results.
The Math Inc. team is excited to introduce Gauss, a first-of-its-kind autoformalization agent for assisting human expert mathematicians at formal verification. Using Gauss, we have completed a challenge set by Fields Medallist Terence Tao and Alex Kontorovich in January 2024 to formalize the strong Prime Number Theorem (PNT) in Lean (GitHub).
Gauss took 3 weeks to do so, which seems way beyond METR's task-length-horizon prediction. Though I'm not sure that's a fair comparison, both because we don't have a baseline human time for this task, and because formalization is a domain where it's very hard to get off track: the criterion of success is very crisp.
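For flavor, here's what that crispness looks like in the small. This is a toy Mathlib example of my own, not part of the Gauss/PNT work: the file either type-checks against Lean's kernel or it doesn't, with no partial credit.

```lean
import Mathlib

-- A toy example of formal verification: there are arbitrarily large primes.
-- `Nat.exists_infinite_primes` is the Mathlib lemma being reused here; if
-- the statement or the proof were wrong in any detail, Lean's kernel would
-- reject the file outright. Success is binary.
theorem primes_unbounded (n : ℕ) : ∃ p, n ≤ p ∧ p.Prime :=
  Nat.exists_infinite_primes n
```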
I think alignment researchers have to learn to use it (or any other powerful math-prover assistant) in order to exploit every bit of leverage we can get.
(yes, this thought isn’t original either)
Reasons:
Money is a good approximation of what people value. Value can be destroyed. But what would I have to do to money to destroy the value it represents?
I might feel bad if somebody stole my wallet, but that money hasn't been destroyed; it is just now going to bring utility to another human, and if I (for some weird reason) value the quality of life of the robber just as much as my own, I wouldn't even think something bad has happened.
If I actually destroy money, like burn it to ashes, then there ...
We want world models to be:

1. Understandable to humans.
2. Accurate.
But those properties are in tension with one another. If we aim for the first, the most intuitive approach is to encode the concepts we already understand; in that case, we end up with GOFAI, one of whose main problems is that the mental world it lives in is very limited. If we aim for the second, we end up with tangled messes like NNs (which are directly optimized for accuracy), and it's hard for humans to understand the concepts in the model.
We c...
Sending information is equivalent to storing information if you consider Galilean relativity (any experiment performed in a frame of reference moving at a constant speed is equivalent to the same experiment in a static frame of reference).
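Spelling out the step (my own notation, a sketch of the standard Galilean transform):

$$x' = x - vt, \qquad t' = t$$

A bit of information held at rest at position $x_0$ from $t = 0$ to $t = T$ is, in a frame moving at velocity $v$, a message carried from $x_0$ to $x_0 - vT$: storing it for a duration $T$ is, in that frame, sending it over a distance $vT$, and vice versa.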
Sometimes, the amount of optimization power that was put into the words is less than you expect, or less than the gravity of the words would imply.
Some examples:
"You are not funny." (Did they evaluate your funniness across many domains and in diverse contexts in order to justify a claim like that?)
"Don't use this drug, it doesn't help." (Did they do the double-blind studies on a diverse enough population to justify a claim like that?)
"That's the best restaurant in town." (Did they really go to every restaurant in town? Did t...
If the future contains far more people than we have today, and if people are going to have their memory upgraded, and if the information about us on the internet is going to be preserved, then each person alive today is going to be kind of a celebrity.
It’s as if our civilization started with 10 people and they recorded every second of their lives: we would know almost everything about them. People would read their quotes, live by their wisdom, and create cults around them.
People often say, "Oh, look at this pathetic mistake AI made; it will never be able to do X, Y, or Z." But they would never tell a child who made a similar mistake that they will never be able to do X, Y, or Z, even though the theoretical limits on humans are much lower than those on AI.
In the comments to this quick take, I am planning to report on my intellectual journey: what I read, what I learned, what exercises I’ve done, and which projects or research problems I worked on. Thanks to @TristianTrim for suggesting the idea. Feel free to comment with anything you think might be helpful or relevant.
Is there a reason to hate Bill Gates? From a utilitarian perspective, he might be “the best person ever,” considering how much he gives to effective charities.
Do people just use the “billionaire = evil” heuristic, or are there other considerations?
Idea status: butterfly idea
In real life, there are too many variables to optimize each one. But if a variable is brought to your attention, it is probably important enough to consider optimizing it.
Negative example: you don’t see your eyelids; they are doing their job of protecting your eyes, so there’s no need to optimize them.
Positive example: you tie your shoelaces; they are the focus of your attention. Can this process be optimized? Can you learn to tie shoelaces faster, or learn a more reliable knot?
Humans already do something like this, but mostly co...
Solomonoff Induction is incredibly powerful. It's so powerful that it can't exist in our world. But because of its power, it needs to be handled with care. For it to actually produce accurate hypotheses, you have to expose it to as much evidence as possible, because even the tiniest coincidence in your data (which will happen if you don't collect the widest dataset possible) would be interpreted as a Deep Rule of the world.
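A toy sketch of that failure mode (my own illustration; the "program lengths" below are made-up numbers, and real Solomonoff induction is uncomputable). Prior weight of a hypothesis is 2**-length; the posterior multiplies that by the likelihood of the observed data.

```python
# Toy Solomonoff-style weighting: shorter hypotheses get exponentially
# more prior weight, so with too little data a short rule fitted to a
# coincidence dominates the posterior.

data_small = [0, 0]            # a narrow sample: every observation so far was 0
data_large = [0, 0, 1, 0, 1]   # a wider sample from the same source

hypotheses = {
    # name: (program length in bits, probability it assigns to seeing a 1)
    "always_zero":   (5, 0.0),   # a short "Deep Rule" fitted to the coincidence
    "fair_coinlike": (9, 0.5),   # longer, but closer to the true source
}

def posterior(data):
    scores = {}
    for name, (length, p_one) in hypotheses.items():
        likelihood = 1.0
        for bit in data:
            likelihood *= p_one if bit == 1 else 1 - p_one
        scores[name] = 2.0 ** -length * likelihood
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

print(posterior(data_small))  # "always_zero" dominates (~0.98): the coincidence won
print(posterior(data_large))  # one observed 1 gives it likelihood 0, eliminating it
```

With only two observations, the short rule that hard-codes the coincidence dominates; a wider dataset eliminates it entirely.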
or at least make them less obvious
My eyes are tired of AI-generated images. At this point, I would even prefer Corporate Memphis. It saddens me every time I see an obviously AI-generated image on the website of some good cause (like alignment agendas).
One counterexample is LessWrong’s featured articles, which sometimes use AI-generated backgrounds, but those are usually rather abstract, and their imperfections are less noticeable and actually fit the style.
Some folks on LessWrong already push back reall...
White House launches a Manhattan Project for AI (sorta).
In this pivotal moment, the challenges we face require a historic national effort, comparable in urgency and ambition to the Manhattan Project that was instrumental to our victory in World War II and was a critical basis for the foundation of the Department of Energy (DOE) and its national laboratories.
I just resolved my confusion about CoT monitoring.
My previous confusion: People say that CoT is progress in interpretability, that we now have a window into the model's thoughts. But why? LLMs are still just as black-boxy as they were before; we still don't know what happens at the token level, and there’s no reason to think we understand it better just because intermediate results can be viewed as human language.
Deconfusion: Yes, LLMs are still black boxes, but CoT is a step toward interpretability because it improves capabilities without making the black...
Rules can generate examples. For instance: DALLE-3 is a rule according to which different examples (images) are generated.
From examples, rules can be inferred. For example: given a sufficient dataset of images and their captions, a DALLE-3 model can be trained on it.
In computer science, there is a concept called Kolmogorov complexity of data. It is (roughly) defined as the length of the shortest program capable of producing that data.
Some data are simple and can be compressed easily; some are complex and harder to compress. In a sense, the task...
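A crude, runnable proxy (my own illustration): compressed size gives a computable upper bound on Kolmogorov complexity, up to the fixed overhead of the decompressor. zlib is far from the shortest possible program, but the contrast is already visible.

```python
# Compressed size as a rough stand-in for Kolmogorov complexity:
# regular data shrinks to almost nothing, random data barely shrinks.
import os
import zlib

simple = b"ab" * 50_000          # highly regular data: "a pattern plus a count"
random_ = os.urandom(100_000)    # random bytes: essentially incompressible

print(len(simple), len(zlib.compress(simple)))     # 100000 -> a few hundred bytes
print(len(random_), len(zlib.compress(random_)))   # 100000 -> barely shrinks
```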