Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a timely technical solution to AGI risk, fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
How would more data and software live in reality? What would the Great Generalized Google Maps project include, in what order? What data structures would it use for more abstract things?
Per above, we'd need tighter feedback loops/quicker updates, appropriate markings of when content/procedures become outdated, some ability to compare various elements of constructed realities against the ground truth whenever the ground truth becomes known, etc. (Consider if Google Maps were updating very slowly, and also had some layer on top of its object-level observations whose representations relied on chains of inferences from the ground-true data but without quick ground-true feedback. That'd gradually migrate it to a fictional world as well.)
The general-purpose solution is probably some system that'd incentivize people to flag divergences from reality... Prediction markets?
Is the “fake vs real thinking” distinction about the same thing? Like, thinking which uses mental representations of the real world vs mental representations of other worlds?
I think it's more that modeling/thinking about the real world requires constant effort of paying attention/updating/inferring. Not doing that and operating off of a dream mashup is the default. So it's not that there are real vs. other-world representations, it's that grounding yourself in the real world requires sophisticated inference-time cognitive work, rather than just retrieving cached computations and cobbling them together.
(And then LLMs could only do the cobbling-together thing, they're always living in the dream mashup.)
Agreed. I think the most optimistic case is that peering at GPT-3/4's interpreted form would make it extremely obvious how to train much more powerful models much more compute-efficiently by way of explicitly hard-coding high-level parts of their structure, thus simultaneously making them much more controllable/interpretable. (E. g., clean factorization into a world-model, a planner, and a goal slot, with obvious ways to scale up just the world-model while placing whatever we want into the goal slot. Pretty sure literally-this is too much to hope for, especially at GPT≤4's level, but maybe something in that rough direction.)
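For concreteness, here's a schematic of that hoped-for factorization. Everything below (the names, the one-step greedy planner, the whole shape) is just an illustration of the structure described above, not anyone's actual design:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class FactoredAgent:
    """Schematic of the hoped-for clean factorization (illustrative only)."""
    world_model: Callable[[Any, Any], Any]  # (state, action) -> predicted next state; the part you'd scale up
    goal: Callable[[Any], float]            # state -> how good it is; the swappable "goal slot"
    planner: Callable[["FactoredAgent", Any, List[Any]], Any]  # searches over actions using the other two parts

    def act(self, state: Any, candidate_actions: List[Any]) -> Any:
        return self.planner(self, state, candidate_actions)

def greedy_planner(agent: FactoredAgent, state: Any, actions: List[Any]) -> Any:
    # One-step lookahead: pick the action whose predicted outcome the goal rates highest.
    return max(actions, key=lambda a: agent.goal(agent.world_model(state, a)))
```

The point of the shape: capability would live almost entirely in `world_model`, while `goal` stays a small, inspectable, swappable slot, which is exactly what would make such a system controllable.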
Insofar as you've missed reality's ontology, things will just look like a mess
Or your thing just won't work. There's a kind of trade-off there, I think?
DL works because it gives a lot of flexibility for defining internal ontologies, and for compute-efficiently traversing their space. However, it does so by giving up all guarantees that the result would be simple/neat/easy-to-understand in any given fixed external ontology (e. g., the human one).
To combat that, you can pick a feature that would provide some interpretability assistance, such as "sparsity" or "search over symbolic programs", and push in that direction. But how hard do you push? (How big is the penalty relative to other terms? Do you give your program-search process some freedom to learn neural-net modules for plugging into your symbolic programs?)
If you proceed with a light touch, you barely have any effect, and the result is essentially as messy as before.
If you turn the dial up very high, you strangle DL's flexibility, and so end up with crippled systems. (Useful levels of sparsity make training 100x-1000x less compute-efficient; forget symbolic program search.)
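To make the "how hard do you push?" question concrete, here's a minimal sketch of the standard move for the sparsity case: an L1 penalty on activations added to the task loss, with a single coefficient as the dial. The names and the choice of L1-on-activations are illustrative assumptions, not any specific paper's setup:

```python
import torch
import torch.nn.functional as F

def loss_with_sparsity(model, batch, targets, sparsity_coeff):
    """Task loss plus an L1 activation penalty; sparsity_coeff is the 'dial'.
    Assumes the model returns (logits, list_of_hidden_activations)."""
    logits, activations = model(batch)
    task_loss = F.cross_entropy(logits, targets)
    sparsity_penalty = sum(a.abs().mean() for a in activations)
    # sparsity_coeff near 0: barely any effect, the internals stay as messy as before.
    # sparsity_coeff large: the penalty swamps the task loss and capability collapses.
    return task_loss + sparsity_coeff * sparsity_penalty
```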
In theory, I do actually think you may be able to "play it by ear" well enough to hit upon some method where the system becomes usefully more interpretable without becoming utterly crippled. You can then study it, and perhaps learn something that would assist you in interpreting increasingly less-crippled systems. (This is why I'm still pretty interested in papers like these.)
But is there a proper way out? The catch is that your interventions only hurt performance if they hinder DL's attempts to find the true ontology. On the other hand, if you yourself discover and incentivize/hard-code (some feature of) the true ontology, that may actually serve as an algorithmic improvement.[1] It would constrain the search space in a helpful way, or steer the training in the right direction, or serve as a good initialization prior... Thus making the system both more interpretable and more capable.
Which is a boon in one way (will near-certainly be widely adopted; the "alignment tax" is negative), and a curse in another (beware the midpoint of that process, where you're boosting capabilities without getting quite enough insight into models to ensure safety).
(Alternatively, you can try to come up with some Clever Plan where you're setting up a search process that's as flexible as DL but which is somehow guaranteed to converge to something simple in terms of your fixed external ontology. I personally think such ideas are brilliant and people should throw tons of funding at them.)
May. There are some caveats there.
I'm confused how "they do directly hill climb on high profile metrics" is compatible with any of this, since that seems to imply that they do in fact train on benchmarks, which is the exact thing you just said was false?
I assume it means they use the benchmark as the test set, not the training set.
Ah, I see. Yeah, I think you're right with this correction. (I was slightly misunderstanding what you were getting at before.)
Raoian Sociopaths are distinguished by being only able to interact via "power talk" (i. e., they're at Simulacrum Level 4), and if I recall correctly, the same is true of Moral Mazes' middle managers. The latter aren't really buying into the signals, they don't actually have loyalty to the company or whatever (which defines the Clueless). They're just cynically exploiting all those mechanisms for self-advancement. Thus, they are Sociopaths.
Also, here's a devastatingly powerful argumentum ad verecundiam in my interpretation's favor.
Yeah, it's not a good use of your time to seek that out. But if you do happen to stumble upon them (e. g., if they intruded into your garden, or if Zvi's newsletter covered an incident involving them, with quotes), and a statement from them causes a twinge of "hm, there may be something to it...", investigating that twinge may be useful. You shouldn't necessarily crush it in a burst of cognitive dissonance/self-protectiveness.
I think there are some mismatches in the fundamental assumptions your model and METR's model make.
Your model assumes that any given AI's ability to solve tasks continues to grow at the same rate when you give it more inference-time compute.[1] METR's model, however, doesn't make time-budget assumptions for the AI.[2] It tracks a binary "can/can't solve this task" variable, and assumes that if an AI can't solve a task within some subjective time-frame, it can't solve it at all. And indeed: when comparing AIs and humans, we don't currently talk about "AIs are way too time-inefficient at this task compared to humans"; we talk about "AIs can't do this one at all".
And I think it's indeed the correct model for the current AIs. Roughly speaking, if we're modeling time_horizon(compute), I think this should be a "three-phase" function:
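Something like the sketch below. The phase boundaries and constants are placeholders I'm assuming purely for illustration, not fitted values: below some minimal compute the model can't do the task at all, then a "reasoning" phase where the achievable horizon grows with extra inference compute, then a plateau where further compute stops helping.

```python
def time_horizon(compute: float,
                 min_compute: float = 1.0,        # placeholder constants,
                 plateau_compute: float = 100.0,  # purely for illustration
                 growth_rate: float = 0.5) -> float:
    """Illustrative three-phase shape of time_horizon(compute)."""
    if compute < min_compute:
        return 0.0                                   # Phase 1: can't do the task at all
    effective = min(compute, plateau_compute)        # Phase 3: extra compute stops helping
    return growth_rate * (effective - min_compute)   # Phase 2: the "reasoning" phase
```

On this picture, the disagreement is about Phase 3: whether that cap is real for current models, and whether it will eventually move off to infinity.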
From how you've discussed the topic previously, your background assumption is that Phase 3 will eventually disappear: that the reasoning-phase interval will eventually grow to infinity. I agree it's a salient possibility to track.
However, the current models do seem to have this phase, see e. g. various research on "overthinking". Thus, modeling them as merely biphasic is invalid.
Stepping back: I think your prediction of superexponential growth isn't really about "superexponential" growth, but rather about there being an outright discontinuity, where the time-horizons jump from a finite value to infinity. I guess this is "superexponential" in a certain loose sense, but not in the same sense in which $e^{x^2}$ is superexponential.
I don't think this can be modeled via extrapolating straight lines on graphs / quantitative models of empirically observed external behavior / "on-paradigm" analyses.
You say "AIs have 'infinite horizon length' now" about the slope-point where the crossover point disappears, but note that your model assumes infinite horizons even before this point. You model it as $\text{performance}(\text{time\_budget}) = \text{slope} \cdot \text{time\_budget} + \beta$, but of course, for any above-zero slope, including subhuman slopes, and any performance target, there's a time-budget value at which the AI hits that target.
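Spelling that out as algebra ($P^{*}$ here is just an arbitrary performance target, introduced for illustration):

```latex
\text{performance}(t) = \text{slope} \cdot t + \beta \;\ge\; P^{*}
\quad\Longleftrightarrow\quad
t \;\ge\; \frac{P^{*} - \beta}{\text{slope}},
```

which is finite for every slope above zero, however small. The linear model thus already grants "infinite horizons" at any positive slope, subhuman or not.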
Aside: Indeed, that's a common misunderstanding of METR's models. I don't think you're falling into this misunderstanding there, but maybe something in that direction.