Hm. Galaxy-brained idea for how to use this as a springboard to make prediction markets go mainstream:

  • Convince friendly prominent alignment research institutions (e. g. MIRI, the AI Futures project) to submit their models to the platform.
  • Socially pressure AGI labs to submit their own official models there as well, e. g. starting from Anthropic. (This should be relatively low-cost for them; at least, inasmuch as they buy their own hype and safety assurances.)
  • Now you've got a bunch of high-profile organizations making implicit official endorsements of the platform.
  • Move beyond the domain of AI, similarly starting with friendly smaller organizations (EA orgs, etc.) then reaching out to bigger established ones.
  • Everyone in the world ends up prediction-market-pilled.
  • ???
  • Civilizational sanity waterline rises!

(Note that it follows the standard advice for startup growth, where you start in a very niche market, gradually eat it all, then expand beyond this market, iterating until your reach is all-pervading.)

Off the top of my head, not very well-structured:

It seems the core thing we want our models to handle here is the concept of approximation errors, no? The "horse" symbol has mutual information with the approximation of a horse; the Santa Claus feature existing corresponds to something approximately like Santa Claus existing. The approach of "features are chiseled into the model/mind by an imperfect optimization process to fulfil specific functions" is then one way to start tackling this approximation problem. But it kind of just punts all the difficult parts onto "what the optimization landscape looks like".

Namely: the needed notion of approximation is pretty tricky to define. What are the labels of the dimensions of the space in which errors are made? What is the "topological picture" of these errors?

We'd usually formalize it as something like "this feature activates on all images within some ε (in MSE distance) of horse-containing images". And indeed, that seems to work well for the "horse vs cow-at-night" confusion.
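
To make that concrete, here's a minimal sketch of the MSE formalization; the images, the prototype set, and the ε threshold are all invented for illustration:

```python
# Minimal sketch of "feature activates within some MSE distance of its referents".
# Everything here (images, prototypes, epsilon) is made up for illustration.
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally-shaped images."""
    return float(np.mean((a - b) ** 2))

def horse_feature_active(image: np.ndarray,
                         horse_prototypes: list[np.ndarray],
                         epsilon: float) -> bool:
    """Fires iff the image is within epsilon (in MSE) of some horse-containing image."""
    return min(mse(image, p) for p in horse_prototypes) < epsilon

# Toy usage: a superficially similar "cow at night" image slips under the threshold.
rng = np.random.default_rng(0)
horse = rng.random((8, 8))
cow_at_night = horse + rng.normal(0, 0.05, (8, 8))  # pixel-wise similar to the horse image
print(horse_feature_active(cow_at_night, [horse], epsilon=0.01))  # likely True
```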

But consider Santa Claus. That feature "denotes" a physical entity. Yet, what it {responds to}/{is formed because of} are not actual physical entities that are approximately similar to Santa Claus, or look like Santa Claus. Rather, it's a sociocultural phenomenon, which produces sociocultural messaging patterns that are pretty similar to sociocultural messaging patterns which would've been generated if Santa Claus existed[1].

If we consider a child fooled into believing in Santa Claus, what actually happened there is something like:

  • "What adults tell you about the physical world" is usually correlated with what actually exists in the physical world.
  • The child learns a model that maps adults' signaling patterns onto world-model states.
  • The world-states "Santa Claus exists" and "there's a grand adult conspiracy to fool you into believing that Santa Claus exists" correspond to fairly similar adult signaling patterns.
    • At this step, we are potentially again working with something neat like "vectors encoding adults' signaling patterns", and the notion of similarity is e. g. cosine similarity between those vectors (a toy numeric sketch follows this list).
  • If the child's approximation error is significant enough to fail to distinguish between those signaling patterns, they pick the feature to learn based on e. g. the simplicity prior, and get fooled.
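
As a toy illustration of that similarity step (the vectors and their dimensions are entirely made up):

```python
# Two invented encodings of adults' signaling patterns: one for "Santa exists",
# one for "adults conspire to fake Santa". They are nearly parallel, so a
# child-level similarity metric can't distinguish them.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical dimensions: "presents appear", "adults wink at each other",
# "mall Santas exist", "letters get answered".
santa_exists     = np.array([0.9, 0.1, 0.8, 0.7])
adult_conspiracy = np.array([0.9, 0.2, 0.8, 0.7])

print(cosine_similarity(santa_exists, adult_conspiracy))  # ~0.998: within a child's approximation error
```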

Going further, consider ghosts. Imagine ghost hunters equipped with a bunch of paranormal-investigation tools. They do some investigating and conclude that their readings are consistent with "there's a ghost". The issue isn't merely that there's such a small distance between "there's a ghost" and "there's no ghost" tool-output-vectors that the former fall within the approximation error of the latter. The issue is that the ghost hunters learned a completely incorrect model in which some tool-outputs which don't, in reality, correspond to ghosts existing, are mapped to ghosts existing.

Which, in turn, presumably happened because they'd previously confused the sociocultural messaging pattern of "tons of people are fooled into thinking these tools work" with "these tools work".

Which sheds some further light on the Santa Claus example too. Our sociocultural messaging about Santa Claus is not actually similar to the messaging in the counterfactual where Santa Claus really existed[2]. It's only similar in the deeply incomplete children's models of how those messaging patterns work...

Summing up, I think a merely correlational definition can still be made to work, as long as you:

  • Assume that feature vectors activate in response to approximations of their referents.
  • Assume that the approximation errors can lie in learned abstract encodings ("ghost-hunting tools' outputs", "sociocultural messaging patterns"), not only in default encodings ("token embeddings of words"), with likewise-learned custom similarity metrics.
  • Assume that learned abstract encodings can themselves be incorrect, due to approximation errors in previously learned encodings, in a deeply compounding way...
    • ... such that some features don't end up "approximately corresponding" to any actually existing phenomenon. (Like, the ground-truth prediction error between similar-sounding models is unbounded in the general case: approximately similar-sounding models don't produce approximately similarly correct predictions. Or even predictions that live in the same sample space.)

... Or something like that.

  1. ^

    Well, not quite, but bear with me.

  2. ^

    E. g., there'd be more fear about the eldritch magical entity manipulating our children.

Another idea I've been thinking about:

Consider the advantage prediction markets have over traditional news. If I want to keep track of some variable X, such as "the amount of investment going into Stargate", and all I have are traditional news, I have to constantly sift through all the related news reports in search of relevant information. With prediction markets, however, I can just bookmark this page and check it periodically.

An issue with prediction markets is that they're not well-organized. You have the tag system, but you don't know which outcomes feed into other events, you don't necessarily know what prompts specific market updates (unless someone mentions that in the comments), you don't have a high-level outline of the ontology of a given domain, etc. Traditional news reports offer some of that, at least: if competently written and truthful, they offer causal models and narratives behind the events.

It would be nice if we could fuse the two: an interface for engaging with the news that combines the conciseness of prediction-market updates with the model-based understanding that traditional news attempts to offer.

One obvious idea is to arrange it into the form of a Bayes net. People (perhaps the site's managers, perhaps anyone) could set up "causal models", in which specific variables are downstream of other variables. Other people (forecasters/experts hired by the project's managers, or anyone, like in prediction markets) could bet on which models are true[1], and within the models, on the values of specific variables[2]. (Relevant.)

Among other things, this would ensure built-in "consistency checks". If, within a given model, a variable X is downstream of outcomes A, B, C, such that X happens if and only if all of A, B, and C happen, but the market-estimated P(X) isn't equal to P(A∧B∧C), this would suggest either that the prediction markets are screwing up, or that there's something wrong with the given model.
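
A minimal sketch of such a check, with invented numbers and an (illustrative) independence assumption for computing P(A∧B∧C):

```python
# Toy consistency check: X happens iff A, B, C all happen, so the market's P(X)
# should roughly match the model-implied P(A and B and C). Variable names,
# probabilities, and the independence assumption are all illustrative.
def consistency_gap(p_x: float, p_abc: float, tolerance: float = 0.05) -> str:
    gap = abs(p_x - p_abc)
    if gap <= tolerance:
        return f"consistent (gap {gap:.2f})"
    return f"inconsistent (gap {gap:.2f}): the markets or the model are off"

# Suppose the markets price A, B, C at 0.8, 0.7, 0.9, the model treats them as
# independent, but the market prices X directly at 0.35.
p_abc = 0.8 * 0.7 * 0.9                        # 0.504 under the model
print(consistency_gap(p_x=0.35, p_abc=p_abc))  # flags a ~0.15 gap
```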

Furthermore, one way for this to gain notoriety/mainstream appeal is if specific high-status people or institutions set up their own "official" causal models. For example, an official AI 2027 causal model, or an official MIRI model of AI doom which avoids the multiple-stage fallacy and clearly shows how it's convergent.

Tons of ways this might not work out, but I think it's an interesting idea to try. (Though maybe it's something that should be lobbed over to Manifold Markets' leadership.)

  1. ^

    Or, perhaps in an even more fine-grained manner, which links between different variables are true.

  2. ^

    Ideally, with many variables shared between different models.

"can we test out a version of this sort of thing powered by some humans-in-a-trenchcoat"

Response lag would be an issue here. As you'd pointed out, to be a proper part of the "exobrain", tools need to have very fast feedback loops. LLMs can plausibly do the needed inferences quickly enough (or perhaps not, that's a possible failure mode), but if there's a bunch of humans on the other end, I expect it'd make the tools too slow to be useful, providing little evidence regarding faster versions.

(I guess it'd work if we put von Neumann on the other end or something, someone able to effortlessly do mountainous computations in their head, but I don't think we have many of those available.)

or otherwise somehow test the ultimate hypothesis without having to build the thing

I think the minimal viable product here would be relatively easy to build. It'd probably just look like a LaTeX-supporting interface where you can define a bunch of expressions, type natural-language commands into it ("make this substitution and update all expressions", "try applying method #331 to solving this equation"), and in the background an LLM with tool access uses its heuristics plus something like SymPy to execute them, then updates the expressions.
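
A rough sketch of what the backend for a command like "make this substitution and update all expressions" might reduce to, with SymPy doing the symbolic work; in the imagined tool an LLM would translate the natural-language command into calls like these (the expressions and the `substitute_everywhere` helper are invented for illustration):

```python
# Sketch of the symbolic backend: track named expressions, apply a substitution
# everywhere, simplify, and hand the updated expressions back to the interface.
import sympy as sp

x, y, u = sp.symbols("x y u")

expressions = {
    "E1": sp.sin(x) ** 2 + sp.cos(x) ** 2 + y,
    "E2": sp.integrate(sp.exp(-x ** 2), (x, -sp.oo, sp.oo)) * y,
}

def substitute_everywhere(exprs, old, new):
    """Apply a substitution to every tracked expression and simplify the results."""
    return {name: sp.simplify(e.subs(old, new)) for name, e in exprs.items()}

# "Substitute y -> u**2 and update all expressions"
updated = substitute_everywhere(expressions, y, u ** 2)
for name, e in updated.items():
    print(name, "=", e)   # E1 = u**2 + 1, E2 = sqrt(pi)*u**2
```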

The core contribution here would be removing LLM babble from the equation, abstracting the LLM into the background so that you can interact purely with the math. Claude's Artifacts functionality and ChatGPT's Canvas + o3 can already be more or less hacked into this (though there are some issues, such as them screwing up LaTeX formatting).

"Automatic search for refactors of the setup which simplify it" should also be relatively easy. Just the above setup, a Loom-like generator of trees of thought, and a side window where the summaries of the successful branches are displayed.

Also: perhaps an unreliable demo of the full thing would still be illustrative? That is, hack together some interface that lets you flexibly edit and flip between math representations, maybe powered by some extant engine for that sort of thing (e. g., 3Blue1Brown's Manim? There are probably better fits). Don't bother with fine-tuning the LLMs, with wrapping them in proof-checkers, and with otherwise ensuring they don't make errors. Give the tool to some researchers to play with, see if they're excited about a reliable version.

"making it easier to change representations will enable useful thinking in hmath"

Approaching it from a different direction, how much evidence do we already have for this hypothesis?

  • Various visual proofs, interactive environments, and "intuitive" explanations of math (which mostly work by projecting the math into different representations) seem widely successful. See e. g. the popularity of 3Blue1Brown's videos.
  • ML/interpretability in particular seems to rely on visualizations heavily. See also Chris Olah's essays on the subject.
  • I think math and physics researchers frequently describe doing this sort of stuff in their head; I know I do. It seems common-sensical that externalizing this part of their reasoning would boost their productivity, inasmuch as it would allow scaling it beyond the constraints of human working memory.
  • We could directly poll mathematicians/physicists/etc. with a description of the tool (or, as above, an unreliable toy demo), and ask if that sounds like something they'd use.

Overall, I think that if something like this tool could be built and made to work reliably, the case for it being helpful is pretty solid. (Indeed, if I were more confident that AGI is 5+ years away, making object-level progress on alignment less of a priority, I'd try building it myself.) The key question here is whether it can actually be made to work flexibly/reliably enough on the back of the current LLMs.

On which point, as far as the implementation side goes, the core places where it might fail are:

  • Is the current AI even up for the task? That is, is there a way to translate the needed tasks into a format LLMs plus proof-verifiers can reliably and non-deceitfully solve?
  • If AIs are up to the task, can they even do it fast enough? A one-minute delay between an interaction and a response is potentially okay-ish, although already significantly worse than a five-second delay. But if it takes e. g. ten minutes for the response to be produced, because an unwieldy overcomplicated LLM scaffold in the background is busy arguing with itself, and 5% of the time it just falls apart, that'd make it non-viable too.[1]
  1. ^

    Perhaps we could set it up so that, e. g., the first time you instantiate the connection between two representations, the task is handed off to a big LLM, which infers a bunch of rules and writes a bunch of code snippets regarding how to manage the connection, and the subsequent calls are forwarded to smaller, faster LLMs with a bunch of context provided by the big LLM to assist them. But again: would that work? Is there a way to frontload the work like this? Would smaller LLMs be up for the task?
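
A minimal sketch of that frontloading scheme, with the big and small models abstracted into plain callables; the `RepresentationBridge` class and both stand-in callables are hypothetical, not any real API:

```python
# Sketch of the two-tier idea: the first time a pair of representations is connected,
# the (hypothetical) big model writes down rules/context for managing the connection;
# subsequent calls reuse that cached context with a (hypothetical) smaller, faster model.
from typing import Callable

class RepresentationBridge:
    def __init__(self,
                 big_llm: Callable[[str], str],        # slow, capable; called once per pair
                 small_llm: Callable[[str, str], str]):  # fast; called per interaction
        self.big_llm = big_llm
        self.small_llm = small_llm
        self.rules_cache: dict[tuple[str, str], str] = {}

    def translate(self, source_repr: str, target_repr: str, payload: str) -> str:
        key = (source_repr, target_repr)
        if key not in self.rules_cache:
            # Expensive one-time step: infer how to manage this connection.
            self.rules_cache[key] = self.big_llm(
                f"Write translation rules from {source_repr} to {target_repr}."
            )
        # Cheap repeated step: apply the cached rules to this particular payload.
        return self.small_llm(self.rules_cache[key], payload)

# Usage with stand-in callables (a real deployment would wire in actual model calls):
bridge = RepresentationBridge(
    big_llm=lambda prompt: f"[rules derived for: {prompt}]",
    small_llm=lambda rules, payload: f"translated '{payload}' using {rules}",
)
print(bridge.translate("joint PD formulas", "Bayes net graph", "P(a,b,c)=P(a)P(b|a)P(c|b)"))
```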

while I think most people aren't quite tackling this with the particular taste I'd apply, it does sure seem like everyone is working on "do stuff with LLMs" and it's not where the underpicked fruit is

I disagree; I think pretty much nobody is attempting anything useful with LLM-based interfaces. Almost all projects I've seen in the wild are terrible, and there's tons of unpicked low-hanging fruit.

I'd been thinking, on and off, about ways to speed up agent-foundations research using LLMs. An LLM-powered exploratory medium for mathematics is one possibility.

A big part of highly theoretical research is flipping between different representations of the problem: viewing it in terms of information theory, in terms of Bayesian probability, in terms of linear algebra; jumping from algebraic expressions to the visualizations of functions or to the nodes-and-edges graphs of the interactions between variables; et cetera.

The key reason behind it is that research heuristics bind to representations. E. g., suppose you're staring at some graph-theory problem. Certain problems of this type are isomorphic to linear-algebra problems, and they may be trivial in linear-algebra terms. But unless you actually project the problem into the linear-algebra ontology, you're not necessarily going to see the trivial solution when staring at the graph-theory representation. (Perhaps the obvious solution is to find the eigenvectors of the adjacency matrix of the graph – but when you're staring at a bunch of nodes connected by edges, that idea isn't obvious in that representation at all.)

This is a bit of a simplified example – the graph theory/linear algebra connection is well-known, so experienced mathematicians may be able to translate between those representations instinctively – but I hope it's illustrative.[1]
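
For what it's worth, here's the flip itself in a few lines of NumPy, on an invented 4-node graph: the same object as an edge list and as an adjacency matrix whose spectrum is one call away:

```python
# The graph-theory / linear-algebra flip, made concrete: a "nodes and edges" view
# versus the adjacency-matrix view of the same (invented) 4-cycle graph.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]    # the nodes-and-edges representation

n = 4
A = np.zeros((n, n))
for i, j in edges:                           # project into the linear-algebra representation
    A[i, j] = A[j, i] = 1

eigenvalues, eigenvectors = np.linalg.eigh(A)   # symmetric matrix, so eigh is appropriate
print(eigenvalues)            # spectrum of the 4-cycle: [-2, 0, 0, 2]
print(eigenvectors[:, -1])    # leading eigenvector (uniform, since the graph is regular)
```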

As a different concrete example, consider John Wentworth's Bayes Net Algebra. This is essentially an interface for working with factorizations of joint probability distributions. The nodes-and-edges representation is more intuitive and easier to tinker with than the "formulas" representation, which means that having concrete rules for tinkering with graph representations without committing errors would significantly speed up reasoning through related math problems. Imagine if the derivation of such frameworks was automated: if you could set up a joint PD in terms of formulas, automatically project the setup into graph terms, start tinkering with it by dragging nodes and edges around, and get errors if and only if back-projecting the changed "graph" representation into the "formulas" representation results in a setup that's non-isomorphic to the initial one.

(See also this video, and the article linked above.)
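
A toy version of that round-trip check, on an invented 3-variable binary example: refit the edited structure's conditional tables from the original joint and flag an error iff the rebuilt joint differs from the original:

```python
# Toy round-trip check: take a joint distribution over binary variables, propose an
# edited parent structure (the "graph" representation), refit its conditional tables
# from the joint, and flag an error iff the refit fails to reproduce the original joint.
from itertools import product

VARS = ["a", "b", "c"]

def joint(a, b, c):
    """Ground-truth joint, defined by the chain factorization P(a)P(b|a)P(c|b)."""
    p_a = 0.6 if a else 0.4
    p_b = (0.9 if b else 0.1) if a else (0.2 if b else 0.8)
    p_c = (0.7 if c else 0.3) if b else (0.5 if c else 0.5)
    return p_a * p_b * p_c

def marginal(partial):
    """Sum the joint over all assignments consistent with the given partial assignment."""
    total = 0.0
    for vals in product([0, 1], repeat=len(VARS)):
        point = dict(zip(VARS, vals))
        if all(point[k] == v for k, v in partial.items()):
            total += joint(**point)
    return total

def refit_and_compare(parents, tol=1e-9):
    """Refit P(var | parents) from the joint, rebuild the joint, and check the round trip."""
    max_gap = 0.0
    for vals in product([0, 1], repeat=len(VARS)):
        point = dict(zip(VARS, vals))
        rebuilt = 1.0
        for var in VARS:
            parent_assignment = {p: point[p] for p in parents[var]}
            denom = marginal(parent_assignment)
            numer = marginal({**parent_assignment, var: point[var]})
            rebuilt *= numer / denom
        max_gap = max(max_gap, abs(rebuilt - joint(**point)))
    return "ok" if max_gap < tol else f"error: structures differ (gap {max_gap:.3f})"

print(refit_and_compare({"a": [], "b": ["a"], "c": ["b"]}))   # original structure: ok
print(refit_and_compare({"a": [], "b": [],    "c": ["a"]}))   # dropped edges: error
```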

A related challenge is refactoring. E. g., suppose you're staring at some complicated algebraic expression with an infinite sum. It may be the case that a certain no-loss-of-generality change of variables would easily collapse that expression into a Fourier series, or make some Obscure Theorem #418152/Weird Trick #3475 trivially applicable. But unless you happen to be looking at the problem through that lens, you're not going to be able to spot it. (Especially if you don't know Obscure Theorem #418152/Weird Trick #3475.)

It's plausible that the above two tasks are what 90% of math research (the "normal-science" part of it) consists of, in terms of time expenditure: flipping between representations in search of a representation-chain where every step is trivial.

Those problems would be ameliorated by (1) reducing the friction costs of flipping between representations, and (2) being able to set up automated searches for simplifying refactors of the problem.

Can LLMs help with (1)? Maybe. They can write code and they can, more or less, reason mathematically, as long as you're not asking them for anything creative. One issue is that they're also really sloppy and deceptive when writing proofs... But that problem can potentially be ameliorated by fine-tuning e. g. r1 to justify all its conclusions using rigorous Lean code, which could be passed to automated proof-checkers before being shown to you.[2]
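
As a toy Lean 4 illustration of what that proof-checker gate would enforce (the statement and theorem names are invented):

```lean
-- Toy illustration: the gate rejects the first declaration and accepts the second.

-- A placeholder "proof": Lean compiles it but emits "declaration uses 'sorry'",
-- which the gate would treat as a failure.
theorem add_comm_placeholder (a b : Nat) : a + b = b + a := by
  sorry

-- An actual proof of the same statement, which passes cleanly.
theorem add_comm_checked (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```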

Can LLMs help with (2)? Maybe. I'm thinking something like the Pantheon interface, where you're working through the problem on your own, and in a side window LLMs offer random ideas regarding how to simplify the problem.

LLMs have bad research taste, which would extend to figuring out what refactorings they should try. But they also have a superhuman breadth of knowledge regarding theorems/math results. A depth-first search might thus be productive here. Most of the LLM's suggestions would be trash, but as long as complete nonsense is screened off by proof-checkers, the ideas are represented in a quickly-checkable manner (e. g., equipped with one-sentence summaries), and we're giving LLMs an open-ended task, some results may be useful.

I expect I'd pay $200-$500/month for a working, competently executed tool of this form; even more the more flexible it is. I expect plenty of research mathematicians (not only agent-foundations folks) would, as well. There's a lucrative startup opportunity there.

@johnswentworth, any thoughts?

  1. ^

    A more realistic example would concern ansatzes, i. e., various "weird tricks" for working through problems. They likewise bind to representations, such that the idea of using one would only occur to you if you're staring at a specific representation of the problem, and would fail to occur if you're staring at an isomorphic-but-shallowly-different representation.

  2. ^

    Or using o3 with a system prompt where you yell at it a lot to produce rigorous Lean code, with a proof-checker that returns errors if it ever uses a placeholder always-passes "sorry" expression. But I don't know whether you can yell at it loudly enough using just the system prompt, and this latest generation of LLMs seems really into Goodharting, so it might straight-up try to exploit bugs in your proof-checker.

I'd guess this paper doesn't have the actual optimal methods.

Intuitively, this shouldn't matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods'. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they'd let you elicit pass@800 capabilities instead of "just" pass@400, but it'd still be just pass@k elicitation for not-astronomical k.
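
(For reference, pass@k here is the standard metric from the code-generation literature: sample n completions per problem, count the c correct ones, and estimate

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],$$

i.e. the probability that at least one of k samples solves the problem.)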

Not strongly convinced of that, though.

Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1]

I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case.

  1. ^

    Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of the Singularity-enabling technique that RL-on-CoTs is touted as.

Since the US government is expected to treat other stakeholders in its previous block better than China treats members of its block

At the risk of getting too into politics...

IMO, this was maybe-true for the previous administrations, but is completely false for the current one. All people making the argument based on something like this reasoning need to update.

Previous administrations were more or less dead inertial bureaucracies. Those actually might have carried on acting in democracy-ish ways even when facing outside-context events/situations, such as suddenly having access to overwhelming ASI power. Not necessarily because they were particularly "nice", as such, but because they weren't agenty enough to do something too out-of-character compared to their previous democracy-LARP behavior.

I still wouldn't have bet on them acting in pro-humanity ways (I would've expected some more agenty/power-hungry governmental subsystem to grab the power, circumventing e. g. the inertial low-agency Presidential administration). But there was at least a reasonable story there.

The current administration seems much more agenty: much more willing to push the boundaries of what's allowed and deliberately erode the constraints on what it can do. I don't think it generalizes to boring democracy-ish behavior out-of-distribution; I think it eagerly grabs and exploits the overwhelming power. It's already chomping at the bit to do so.

Mm, yeah, maybe. The key part here is, as usual, "who is implementing this plan"? Specifically, even if someone solves the preference-agglomeration problem (which may be possible to do for a small group of researchers), why would we expect it to end up implemented at scale? There are tons of great-on-paper governance ideas which governments around the world are busy ignoring.

For things like superbabies (or brain-computer interfaces, or uploads), there's at least a more plausible pathway to wide adoption: the same profit/geopolitical-power motives as with AGI.

I also think there is a genuine alternative in which power never concentrates to such an extreme degree.

I don't see it.

The distribution of power post-ASI depends on the constraint/goal structures instilled into the (presumed-aligned) ASI. That means all power is concentrated in the hands of the people deciding what goals/constraints to instill into the ASI, in the time prior to the ASI's existence. What people could those be?

  1. By default, it's the ASI's developers, e. g., the leadership of the AGI labs. "They will be nice and put in goals/constraints that make the ASI loyal to humanity, not to them personally" is more or less isomorphic to "they will make the ASI loyal to them personally, but they're nice and loyal to humanity"; in both cases, they have all the power.[1]
  2. If the ASI's developers go inform the US President about it in a faithful way[2], the overwhelming power will end up concentrated in the hands of the President/the extant powers that be. Either by way of ham-fisted nationalization (with something isomorphic to putting guns to the developers' (families') heads), or by subtler manipulation where e. g. everyone is forced to LARP believing in the US' extant democratic processes (which the President would be actively subverting, especially if that's still Trump), with this LARP being carried far enough to end up in the ASI's goal structure.
    • The stories in which the resultant power struggles shake out in a way that leads to humanity-as-a-whole being given true meaningful input in the process (e. g., the slowdown ending in AI-2027) seem incredibly fantastical to me. (Again, especially given the current US administration.)
    • Yes, acting in ham-fisted ways would be precarious and have various costs. But I expect the USG to be able to play it well enough to avoid actual armed insurrection (especially given that the AGI concerns are currently not very legible to the public), and inasmuch as they actually "feel the AGI", they'd know that nothing less than that would ultimately matter.
  3. If the ASI's developers somehow go public with the whole thing, and attempt to unilaterally set up some actually-democratic process for negotiating on the ASI's goal/constraint structures, then either (1) the US government notices it, realizes what's happening, takes control, and subverts the process, or (2) they set up some very broken process – as broken as the US electoral procedures which end up with Biden and Trump as the top-2 choices for president – and that process outputs some basically random, potentially actively harmful results (again, something as bad as Biden vs. Trump).

Fundamentally, the problem is that there's currently no faithful mechanism of human preference agglomeration that works at scale. That means both that (1) it's currently impossible to let humanity-as-a-whole actually weigh in on the process, and (2) there are no extant outputs of that mechanism around: all people and systems that currently hold power aren't aligned to humanity in a way that generalizes to out-of-distribution events (such as being given godlike power).

Thus, I could only see three options:

  • Power is concentrated in some small group's hands, with everyone then banking on that group acting in a prosocial way, perhaps by asking the ASI to develop a faithful scalable preference-agglomeration process. (I. e., we use a faithful but small-scale human-preference-agglomeration process.)
  • Power is handed off to some random, unstable process. (Either a preference agglomeration system as unfaithful as US' voting systems, or "open-source the AGI and let everyone in the world fight it out", or "sample a random goal system and let it probably tile the universe with paperclips".)
  • ASI development is stopped and some different avenue of intelligence enhancement (e. g., superbabies) is pursued; one that's more gradual and is inherently more decentralized.
  1. ^

    A group of humans that compromises on making the ASI loyal to humanity is likely more realistic than a group of humans which is actually loyal to humanity. E. g., because the group has some psychopaths and some idealists, and all psychopaths have to individually LARP being prosocial in order to not end up with the idealists ganging up against them, with this LARP then being carried far enough to end up in the ASI's goals. But this still involves that small group having ultimate power; still involves the future being determined by how the dynamics within that small group shake out.

  2. ^

    Rather than keeping him in the dark or playing him, which reduces to Scenario 1.
