Off the top of my head, not very well-structured:
It seems the core thing we want our models to handle here is the concept of approximation errors, no? The "horse" symbol has mutual information with the approximation of a horse; the Santa Claus feature existing corresponds to something approximately like Santa Claus existing. The approach of "features are chiseled into the model/mind by an imperfect optimization process to fulfil specific functions" is then one way to start tackling this approximation problem. But it kind of just punts all the difficult parts onto "what the optimization landscape looks like".
Namely: the needed notion of approximation is pretty tricky to define. What are the labels of the dimensions of the space in which errors are made? What is the "topological picture" of these errors?
We'd usually formalize it as something like "this feature activates on all images within MSE distance of horse-containing images". And indeed, that seems to work well for the "horse vs cow-at-night" confusion.
But consider Santa Claus. That feature "denotes" a physical entity. Yet, what it {responds to}/{is formed because of} are not actual physical entities that are approximately similar to Santa Claus, or look like Santa Claus. Rather, it's a sociocultural phenomenon, which produces sociocultural messaging patterns that are pretty similar to sociocultural messaging patterns which would've been generated if Santa Claus existed[1].
If we consider a child fooled into believing in Santa Claus, what actually happened there is something like: the child learned a deeply incomplete model of how sociocultural messaging gets generated, within which the Santa-themed messaging they observe maps onto "Santa Claus exists".
Going further, consider ghosts. Imagine ghost hunters equipped with a bunch of paranormal-investigation tools. They do some investigating and conclude that their readings are consistent with "there's a ghost". The issue isn't merely that there's such a small distance between "there's a ghost" and "there's no ghost" tool-output-vectors that the former fall within the approximation error of the latter. The issue is that the ghost hunters learned a completely incorrect model in which some tool-outputs which don't, in reality, correspond to ghosts existing, are mapped to ghosts existing.
Which, in turn, presumably happened because they'd previously confused the sociocultural messaging pattern of "tons of people are fooled into thinking these tools work" with "these tools work".
Which sheds some further light on the Santa Claus example too. Our sociocultural messaging about Santa Claus is not actually similar to the messaging in the counterfactual where Santa Claus really existed[2]. It's only similar in children's deeply incomplete models of how those messaging patterns work...
Summing up, I think a merely correlational definition can still be made to work, as long as you:
... Or something like that.
Another idea I've been thinking about:
Consider the advantage prediction markets have over traditional news. If I want to keep track of some variable X, such as "the amount of investment going into Stargate", and all I have is traditional news, I have to constantly, manually sift through news reports in search of relevant information. With prediction markets, however, I can just bookmark this page and check it periodically.
An issue with prediction markets is that they're not well-organized. You have the tag system, but you don't know which outcomes feed into other events, you don't necessarily know what prompts specific market updates (unless someone mentions that in the comments), you don't have a high-level outline of the ontology of a given domain, etc. Traditional news reports offer some of that, at least: if competently written and truthful, they offer causal models and narratives behind the events.
It would be nice if we could fuse the two: an interface for engaging with the news that combines the conciseness of prediction-market updates with the model-based understanding traditional news attempts to offer.
One obvious idea is to arrange it into the form of a Bayes net. People (perhaps the site's managers, perhaps anyone) could set up "causal models", in which specific variables are downstream of other variables. Other people (forecasters/experts hired by the project's managers, or anyone, like in prediction markets) could bet on which models are true[1], and within the models, on the values of specific variables[2]. (Relevant.)
Among other things, this would ensure built-in "consistency checks". If, within a given model, a variable X is downstream of outcomes A, B, C, such that X happens if and only if all of A, B, and C happen, but the market-estimated P(X) doesn't match P(A ∧ B ∧ C), this would suggest either that the prediction markets are screwing up, or that there's something wrong with the given model.
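To make the consistency-check idea concrete, here's a minimal sketch; all function names and market prices are made up. It assumes we only have the marginal prices for A, B, C (no joint market), so the most it can enforce are the Fréchet bounds on P(A ∧ B ∧ C):

```python
# Minimal sketch of a "consistency check" between a market on X and markets on its
# parents A, B, C, under the model "X happens iff A, B, and C all happen".
# With only marginal prices and no joint market, we can only enforce the Frechet
# bounds on P(A and B and C). Prices and names are made up for illustration.

def frechet_bounds(marginals: list[float]) -> tuple[float, float]:
    """Bounds on P(intersection of events) given only their marginal probabilities."""
    lower = max(0.0, sum(marginals) - (len(marginals) - 1))
    upper = min(marginals)
    return lower, upper

def check_model(p_x: float, parent_marginals: list[float], tol: float = 0.02) -> str:
    lower, upper = frechet_bounds(parent_marginals)
    if p_x < lower - tol or p_x > upper + tol:
        return (f"Inconsistent: P(X)={p_x:.2f} lies outside the feasible range "
                f"[{lower:.2f}, {upper:.2f}] implied by the parent markets: "
                "either the markets are mispriced or the causal model is wrong.")
    return "Consistent (given only marginal prices, no contradiction detected)."

# Hypothetical market prices:
print(check_model(p_x=0.70, parent_marginals=[0.9, 0.8, 0.6]))  # above the 0.6 upper bound: flagged
print(check_model(p_x=0.45, parent_marginals=[0.9, 0.8, 0.6]))  # within [0.3, 0.6]: fine
```

A real implementation over a full Bayes net would propagate much tighter constraints, but even this crude version would catch gross mispricings or badly wrong models.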
Furthermore, one way for this to gain prominence/mainstream appeal is if specific high-status people or institutions set up their own "official" causal models. For example, an official AI 2027 causal model, or an official MIRI model of AI doom which avoids the multiple-stage fallacy and clearly shows how it's convergent.
Tons of ways this might not work out, but I think it's an interesting idea to try. (Though maybe it's something that should be lobbed over to Manifold Markets' leadership.)
"can we test out a version of this sort of thing powered by some humans-in-a-trenchcoat"
Response lag would be an issue here. As you'd pointed out, to be a proper part of the "exobrain", tools need to have very fast feedback loops. LLMs can plausibly do the needed inferences quickly enough (or perhaps not, that's a possible failure mode), but if there's a bunch of humans on the other end, I expect it'd make the tools too slow to be useful, providing little evidence regarding faster versions.
(I guess it'd work if we put von Neumann on the other end or something, someone able to effortlessly do mountainous computations in their head, but I don't think we have many of those available.)
or otherwise somehow test the ultimate hypothesis without having to build the thing
I think the minimal viable product here would be relatively easy to build. It'd probably just look like a LaTeX-supporting interface where you can define a bunch of expressions, type natural-language commands into it ("make this substitution and update all expressions", "try applying method #331 to solving this equation"), and in the background an LLM with tool access uses its heuristics plus something like SymPy to execute them, then updates the expressions.
The core contribution here would be removing LLM babble from the equation, abstracting the LLM into the background so that you can interact purely with the math. Claude's Artifacts functionality and ChatGPT's Canvas + o3 can already be more or less hacked into this (though there are some issues, such as them screwing up LaTeX formatting).
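For illustration, here's a minimal sketch of the symbolic backbone such an MVP could sit on top of, assuming SymPy. The LLM's only job would be translating commands like "make this substitution and update all expressions" into calls like the made-up `substitute_everywhere` below:

```python
# A sketch of the non-LLM backbone: the user's state is a workspace of named
# expressions, and commands are plain SymPy operations applied across it.
import sympy as sp

x, u = sp.symbols('x u')

# The user's current workspace of named expressions (contents are arbitrary examples).
workspace = {
    "E1": sp.sin(x**2) + sp.cos(x**2)**2,
    "E2": sp.exp(x**2) * (1 + x**2),
}

def substitute_everywhere(ws, old, new):
    """Apply a substitution to every expression and return the updated workspace."""
    return {name: expr.subs(old, new) for name, expr in ws.items()}

def simplify_all(ws):
    return {name: sp.simplify(expr) for name, expr in ws.items()}

# E.g., the natural-language command "substitute u = x^2 everywhere" becomes:
workspace = simplify_all(substitute_everywhere(workspace, x**2, u))
for name, expr in workspace.items():
    print(name, "=", sp.latex(expr))
```

The point is that the state lives in the workspace rather than in a chat transcript, so the LLM's prose never has to reach the user.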
"Automatic search for refactors of the setup which simplify it" should also be relatively easy. Just the above setup, a Loom-like generator of trees of thought, and a side window where the summaries of the successful branches are displayed.
Also: perhaps an unreliable demo of the full thing would still be illustrative? That is, hack together some interface that lets you flexibly edit and flip between math representations, maybe powered by some extant engine for that sort of thing (e. g., 3Blue1Brown's Manim? there are probably better fits). Don't bother with fine-tuning the LLMs, with wrapping them in proof-checkers, and with otherwise ensuring they don't make errors. Give the tool to some researchers to play with, see if they're excited about a reliable version.
"making it easier to change representations will enable useful thinking in hmath"
Approaching it from a different direction, how much evidence do we already have for this hypothesis?
Overall, I think that if something like this tool could be built and made to work reliably, the case for it being helpful is pretty solid. (Indeed, if I were more confident that AGI is 5+ years away, making object-level progress on alignment less of a priority, I'd try building it myself.) The key question here is whether it can actually be made to work flexibly/reliably enough on the back of the current LLMs.
On which point, as far as the implementation side goes, the core places where it might fail are:
Perhaps we could set it up so that, e. g., the first time you instantiate the connection between two representations, the task is handed off to a big LLM, which infers a bunch of rules and writes a bunch of code snippets regarding how to manage the connection, and the subsequent calls are forwarded to smaller, faster LLMs with a bunch of context provided by the big LLM to assist them. But again: would that work? Is there a way to frontload the work like this? Would smaller LLMs be up for the task?
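As a sketch of what that frontloading could look like structurally (the `call_big_llm`/`call_small_llm` functions are hypothetical stand-ins, not any real API):

```python
# Frontloading pattern: the big model derives a reusable "connection spec" the first
# time a representation pair is seen; later calls go to a cheaper model with that
# spec provided as context.

connection_specs: dict[tuple[str, str], str] = {}  # cache keyed by representation pair

def call_big_llm(prompt: str) -> str: ...    # stand-in: expensive, slow, better at open-ended inference
def call_small_llm(prompt: str) -> str: ...  # stand-in: cheap, fast, needs the rules spelled out

def translate(expr: str, src: str, dst: str) -> str:
    key = (src, dst)
    if key not in connection_specs:
        # First use of this representation pair: have the big model write down the rules once.
        connection_specs[key] = call_big_llm(
            f"Write explicit rules and helper code for translating {src} "
            f"representations into {dst} representations."
        )
    # Subsequent calls: the small model just applies the frontloaded spec.
    return call_small_llm(
        f"Using these rules:\n{connection_specs[key]}\n"
        f"Translate the following {src} object into {dst}:\n{expr}"
    )
```

Whether the small models could actually execute such specs reliably is exactly the open question.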
while I think most people aren't quite tackling this with the particular taste I'd apply, it does sure seem like everyone is working on "do stuff with LLMs" and it's not where the underpicked fruit is
I disagree, I think pretty much nobody is attempting anything useful with LLM-based interfaces. Almost all projects I've seen in the wild are terrible and there are tons of unpicked low-hanging fruits.
I'd been thinking, on and off, about ways to speed up agent-foundations research using LLMs. An LLM-powered exploratory medium for mathematics is one possibility.
A big part of highly theoretical research is flipping between different representations of the problem: viewing it in terms of information theory, in terms of Bayesian probability, in terms of linear algebra; jumping from algebraic expressions to the visualizations of functions or to the nodes-and-edges graphs of the interactions between variables; et cetera.
The key reason behind it is that research heuristics bind to representations. E. g., suppose you're staring at some graph-theory problem. Certain problems of this type are isomorphic to linear-algebra problems, and they may be trivial in linear-algebra terms. But unless you actually project the problem into the linear-algebra ontology, you're not necessarily going to see the trivial solution when staring at the graph-theory representation. (Perhaps the obvious solution is to find the eigenvectors of the adjacency matrix of the graph – but when you're staring at a bunch of nodes connected by edges, that idea isn't obvious in that representation at all.)
This is a bit of a simplified example – the graph theory/linear algebra connection is well-known, so experienced mathematicians may be able to translate between those representations instinctively – but I hope it's illustrative.[1]
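For a concrete, deliberately toy instance of this kind of projection (a stand-in, not the specific eigenvector trick gestured at above): counting triangles in a graph is awkward to read off the nodes-and-edges picture, but in the adjacency-matrix representation it's just trace(A³)/6, equivalently a statement about the spectrum.

```python
# Triangle counting as a "project into linear algebra" move.
import numpy as np

# Adjacency matrix of a small undirected graph: a 4-cycle 0-1-2-3 plus the chord 0-2.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

triangles_via_powers = np.trace(np.linalg.matrix_power(A, 3)) / 6
eigenvalues = np.linalg.eigvalsh(A)              # symmetric matrix, so the spectrum is real
triangles_via_spectrum = np.sum(eigenvalues**3) / 6

print(triangles_via_powers, triangles_via_spectrum)  # both should be ~2.0: the triangles {0,1,2} and {0,2,3}
```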
As a different concrete example, consider John Wentworth's Bayes Net Algebra. This is essentially an interface for working with factorizations of joint probability distributions. The nodes-and-edges representation is more intuitive and easier to tinker with than the "formulas" representation, which means that having concrete rules for tinkering with graph representations without committing errors would significantly speed up reasoning through related math problems. Imagine if the derivation of such frameworks were automated: if you could set up a joint PD in terms of formulas, automatically project the setup into graph terms, start tinkering with it by dragging nodes and edges around, and get errors if and only if back-projecting the changed "graph" representation into the "formulas" representation results in a setup that's non-isomorphic to the initial one.
(See also this video, and the article linked above.)
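Here's a brute-force toy version of that back-projection check (not Wentworth's actual rules): over three binary variables with made-up numbers, delete an edge from the graph, refit the edited factorization to the original joint, and flag an error iff it no longer reproduces the distribution.

```python
# Toy back-projection check: original factorization P(A) P(B|A) P(C|B); proposed graph
# edit deletes the edge B -> C, i.e. claims the factorization P(A) P(B|A) P(C).
from itertools import product

p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # keys: (b, a)
p_c_given_b = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.9}  # keys: (c, b)

joint = {
    (a, b, c): p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
    for a, b, c in product((0, 1), repeat=3)
}

def marginal(dist, keep):
    """Marginalize the joint onto the variable indices in `keep`."""
    out = {}
    for assignment, p in dist.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Back-project the edited graph: refit P(C) from the joint and rebuild the distribution.
p_c = marginal(joint, keep=(2,))
refit = {
    (a, b, c): p_a[a] * p_b_given_a[(b, a)] * p_c[(c,)]
    for a, b, c in product((0, 1), repeat=3)
}

max_error = max(abs(joint[k] - refit[k]) for k in joint)
print("edit is valid" if max_error < 1e-9 else f"edit rejected (max deviation {max_error:.3f})")
```

Here the edit gets rejected, since C genuinely depends on B in the original joint; dropping an edge that carried no dependence would pass.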
A related challenge is refactors. E. g., suppose you're staring at some complicated algebraic expression with an infinite sum. It may be the case that a certain no-loss-of-generality change of variables would easily collapse that expression into a Fourier series, or make some Obscure Theorem #418152/Weird Trick #3475 trivially applicable. But unless you happen to be looking at the problem through that lens, you're not going to be able to spot it. (Especially if you don't know Obscure Theorem #418152/Weird Trick #3475.)
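As a deliberately trivial stand-in for this kind of collapse (far simpler than the Fourier-series case): the substitution t = e^{-x} turns an infinite sum into a plain geometric series, which then evaluates in closed form.

```latex
% Requires amsmath. A toy example of a change of variables collapsing a sum:
\[
  \sum_{n=1}^{\infty} e^{-n x}
  \;\xrightarrow{\; t \,=\, e^{-x} \;}\;
  \sum_{n=1}^{\infty} t^{n}
  \;=\; \frac{t}{1 - t}
  \;=\; \frac{1}{e^{x} - 1},
  \qquad x > 0 .
\]
```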
It's plausible that these two tasks are what 90% of math research (the "normal-science" part of it) consists of, in terms of time expenditure: flipping between representations in search of a representation-chain where every step is trivial.
Those problems would be ameliorated by (1) reducing the friction costs of flipping between representations, and (2) being able to set up automated searches for simplifying refactors of the problem.
Can LLMs help with (1)? Maybe. They can write code and they can, more or less, reason mathematically, as long as you're not asking them for anything creative. One issue is that they're also really sloppy and deceptive when writing proofs... But that problem can potentially be ameliorated by fine-tuning e. g. r1 to justify all its conclusions using rigorous Lean code, which could be passed to automated proof-checkers before being shown to you.[2]
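As a toy illustration of what the checker would be enforcing (Lean 4; theorem names are made up): the first proof compiles cleanly, while the second leans on the `sorry` placeholder and is exactly the kind of output the wrapper should reject rather than show to the user.

```lean
-- Compiles cleanly: a real proof term.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Compiles only with a "declaration uses 'sorry'" warning; the wrapper should reject it.
theorem unjustified_claim (a b : Nat) : a * b = b * a := by
  sorry
```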
Can LLMs help with (2)? Maybe. I'm thinking something like the Pantheon interface, where you're working through the problem on your own, and in a side window LLMs offer random ideas regarding how to simplify the problem.
LLMs have bad research taste, which would extend to figuring out what refactorings they should try. But they also have a superhuman breadth of knowledge regarding theorems/math results. A depth-first search might thus be productive here. Most of the LLMs' suggestions would be trash, but as long as complete nonsense is screened off by proof-checkers, and the ideas are represented in a quickly-checkable manner (e. g., equipped with one-sentence summaries), and we're giving LLMs an open-ended task, some results may be useful.
I expect I'd pay $200-$500/month for a working, competently executed tool of this form; even more the more flexible it is. I expect plenty of research mathematicians (not only agent-foundations folks) would, as well. There's a lucrative startup opportunity there.
@johnswentworth, any thoughts?
A more realistic example would concern ansatzes, i. e., various "weird tricks" for working through problems. They likewise bind to representations, such that the idea of using one would only occur to you if you're staring at a specific representation of the problem, and would fail to occur if you're staring at an isomorphic-but-shallowly-different representation.
Or using o3 with a system prompt where you yell at it a lot to produce rigorous Lean code, with a proof-checker that returns errors if it ever uses a placeholder always-passes "sorry" expression. But I don't know whether you can yell at it loudly enough using just the system prompt, and this latest generation of LLMs seems really into Goodharting, so it might straight-up try to exploit bugs in your proof-checker.
I'd guess this paper doesn't have the actual optimal methods.
Intuitively, this shouldn't matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods'. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they'd elicit pass@800 capabilities instead of "just" pass@400, but it'd still be just pass@k elicitation for not-astronomical k.
Not strongly convinced of that, though.
Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1]
I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case.
Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of the Singularity-enabling technique that RL-on-CoTs is touted as.
Since the US government is expected to treat other stakeholders in its previous block better than China treats members of its block
At the risk of getting too into politics...
IMO, this was maybe-true for the previous administrations, but is completely false for the current one. All people making the argument based on something like this reasoning need to update.
Previous administrations were more or less dead inertial bureaucracies. Those actually might have carried on acting in democracy-ish ways even when facing outside-context events/situations, such as suddenly having access to overwhelming ASI power. Not necessarily because they were particularly "nice", as such, but because they weren't agenty enough to do something too out-of-character compared to their previous democracy-LARP behavior.
I still wouldn't have bet on them acting in pro-humanity ways (I would've expected some more agenty/power-hungry governmental subsystem to grab the power, circumventing e. g. the inertial low-agency Presidential administration). But there was at least a reasonable story there.
The current administration seems much more agenty: much more willing to push the boundaries of what's allowed and deliberately erode the constraints on what it can do. I think it doesn't generalize to boring democracy-ish behavior out-of-distribution, I think it eagerly grabs and exploits the overwhelming power. It's already chomping at the bit to do so.
Mm, yeah, maybe. The key part here is, as usual, "who is implementing this plan"? Specifically, even if someone solves the preference-agglomeration problem (which may be possible to do for a small group of researchers), why would we expect it to end up implemented at scale? There are tons of great-on-paper governance ideas which governments around the world are busy ignoring.
For things like superbabies (or brain-computer interfaces, or uploads), there's at least a more plausible pathway to wide adoption: the same profit-maximization/geopolitical-power motives as with AGI.
I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
I don't see it.
The distribution of power post-ASI depends on the constraint/goal structures instilled into the (presumed-aligned) ASI. That means the people in whose hands all power is concentrated are the ones deciding what goals/constraints to instill into the ASI, in the time prior to the ASI's existence. What people could those be?
Fundamentally, the problem is that there's currently no faithful mechanism of human preference agglomeration that works at scale. That means both that (1) it's currently impossible to let humanity-as-a-whole actually weigh in on the process, and (2) there are no extant outputs of such a mechanism around: none of the people and systems that currently hold power are aligned to humanity in a way that generalizes to out-of-distribution events (such as being given godlike power).
Thus, I could only see three options:
A group of humans that compromises on making the ASI loyal to humanity is likely more realistic than a group of humans which is actually loyal to humanity. E. g., because the group has some psychopaths and some idealists, and all psychopaths have to individually LARP being prosocial in order to not end up with the idealists ganging up against them, with this LARP then being carried far enough to end up in the ASI's goals. But this still involves that small group having ultimate power; still involves the future being determined by how the dynamics within that small group shake out.
Rather than keeping him in the dark or playing him, which reduces to Scenario 1.
Hm. Galaxy-brained idea for how to use this as a springboard to make prediction markets go mainstream:
(Note that it follows the standard advice for startup growth, where you start in a very niche market, gradually eat it all, then expand beyond this market, iterating until your reach is all-pervading.)