Hmm. A speculative, currently-intractable way to do this might be to summarize the ML model before feeding it to the goal-extractor.
tl;dr: As per natural abstractions, most of the details of the interactions between individual neurons are probably irrelevant with regard to the model's high-level functioning/reasoning. So there should be, in principle, a way to automatically collapse e.g. a trillion-parameter model into a much lower-complexity high-level description that would still preserve such important information as the model's training objective.
But there aren't currently any fast-enough algorithms for generating such summaries.
Good question. There's a great amount of confusion over the exact definition, but in the context of this post specifically:
An optimizer is a very advanced meta-learning algorithm that can learn the rules of (effectively) any environment and perform well in it. It's general by definition. It's efficient because this generality allows it to use maximally efficient internal representations of its environment.
For example, consider a (generalist) ML model that's fed the full description of the Solar System at the level of individual atoms, and which is asked to roughly predict the movement of Earth over the next year. It can keep modeling things at the level of atoms; or it can dump the overwhelming majority of that information, collapse sufficiently large objects into point masses, and use Cowell's method.
The second option is greatly more efficient, while decreasing the accuracy only marginally. However, to do that, the model needs to know how to translate between different internal representations, and how to model and achieve goals in arbitrary systems.
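To make the second option concrete: Cowell's method is just direct numerical integration of the point-mass equations of motion in Cartesian coordinates. Here's a minimal sketch for a two-body Sun-Earth system; the constants, the leapfrog integrator, and the circular-orbit initial condition are all illustrative choices, not a claim about what such a model would literally run:

```python
import math

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2 (approximate)
M_SUN = 1.989e30   # solar mass, kg (approximate)
AU = 1.496e11      # astronomical unit, m (approximate)

def accel(x, y):
    """Acceleration on Earth from the Sun (fixed at the origin), both treated as point masses."""
    r = math.hypot(x, y)
    a = -G * M_SUN / r**3
    return a * x, a * y

def propagate(days, dt=3600.0):
    """Integrate Earth's orbit with a leapfrog (kick-drift-kick) scheme."""
    # Start Earth at 1 AU with circular orbital speed (~29.8 km/s).
    x, y = AU, 0.0
    vx, vy = 0.0, math.sqrt(G * M_SUN / AU)
    ax, ay = accel(x, y)
    for _ in range(int(days * 86400 / dt)):
        vx += 0.5 * dt * ax; vy += 0.5 * dt * ay  # half-kick
        x += dt * vx; y += dt * vy                # drift
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax; vy += 0.5 * dt * ay  # half-kick
    return x, y
```

The point of the example: this integrator touches a handful of floats per step, versus tracking ~10^50 atoms, and after a simulated year Earth ends up back near its starting point anyway.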
The same property that allows an optimizer to perform well in any environment allows it to efficiently model any environment. And vice versa, which is the bad part. The ability to efficiently model any environment allows an agent to perform well in any environment, so math!superintelligence would translate to real-world!superintelligence, and a math!goal would be mirrored by some real!goal. Next thing we know, everything is paperclips.
E. g., how to keep track of what "Earth" is, as it moves from being a bunch of atoms to a point mass.
Assuming it hasn't been trained for this task specifically, of course, in which case it can just learn how to translate its inputs into this specific high-level representation, how to work with this specific high-level representation, and nothing else. But we're assuming a generalist model here: there was nothing like this in its training dataset.
If it's only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical calculator, how does increasing this AI's compute/power lead to it becoming an optimizer over a wider domain than just arithmetic?
That was a poetic turn of phrase, yeah. I didn't mean a literal arithmetic calculator, I meant general-purpose theorem-provers/math engines. Given a sufficiently difficult task, such a model may need to invent and abstract over entire new fields of mathematics to solve it in a compute-efficient manner. And that capability goes hand-in-hand with runtime optimization.
Do you think it might be valuable to find a theoretical limit that shows that the amount of compute needed for such epsilon-details to be usefully incorporated is greater than ever will be feasible (or not)?
I think something like this was on the list of John's plans for empirical tests of the NAH, yes. In the meantime, my understanding is that the NAH explicitly hinges on assuming this is true.
Which is to say: Yes, an AI may discover novel, lower-level abstractions, but then it'd use them in concert with the interpretable higher-level ones. It wouldn't replace high-level abstractions with low-level ones, because the high-level abstractions are already as efficient as they get for the tasks we use them for.
You could dip down to a lower level when optimizing some specific action — like fine-tuning the aim of your energy weapon to fry a given person's brain with maximum efficiency — but when you're selecting the highest-priority person to kill to cause the most disarray, you'd be thinking about "humans" in the context of "social groups", explicitly. The alternative — modeling the individual atoms bouncing around — would be dramatically more expensive, while not improving your predictions much, if at all.
It's analogous to how we're still using Newton's laws in some cases, despite in principle having ample compute to model things at a lower level. There's just no point.
Edit: On a closer read, I take it you're looking only for tasks well-suited for language models? I'll leave this comment up for now, in case it'd still be of use.
Can't exactly fit that here, but the dataset seems relatively easy to assemble.
We can then play around with it:
Task: Automated Turing test.
Context: A new architecture improves on state-of-the-art AI performance. We want to check whether AI-generated content is still distinguishable from human-generated content.
Input type: Long string of text, 100-10,000 words.
Output type: Probability that the text was generated by a human.
Etc., etc.; the data are easy to generate. Those are from here and here.
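A toy sketch of the interface this task implies — text in, P(human-written) out. The features and weights below are made up purely for illustration; a real detector would be a trained model, not hand-set heuristics:

```python
import math
from collections import Counter

def features(text):
    """Two crude stylometric features; stand-ins for whatever a real model would learn."""
    words = text.lower().split()
    if not words:
        return 0.0, 0.0
    counts = Counter(words)
    type_token_ratio = len(counts) / len(words)      # vocabulary diversity
    burstiness = max(counts.values()) / len(words)   # share taken by the most-repeated word
    return type_token_ratio, burstiness

def p_human(text, w_ttr=4.0, w_burst=-6.0, bias=-1.5):
    """Hypothetical weights: diverse vocabulary nudges toward 'human',
    heavy repetition nudges away; a logistic squash yields a probability."""
    ttr, burst = features(text)
    z = w_ttr * ttr + w_burst * burst + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The same signature is what you'd hook the attribution tooling up to: hold the output fixed and ask which input spans moved the score.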
To be honest, I'm not sure it's exactly what you're asking, but it seems easy to implement (compute costs aside) and might serve as a (very weak and flawed) "fire alarm for AGI" + provide some insights. For example, we can then hook this Turing Tester up to an attribution tool and see what specific parts of the input text make it conclude the text was/wasn't generated by an ML model. This could then provide some insights into the ML model in question (are there specific patterns it repeats? abstract mistakes that are too subtle for us to notice, but are still statistically significant?).
Alternatively, in the slow-takeoff scenario where there's a brief window before the ASI kills us all in which we have to worry about people weaponizing AI for e.g. propaganda, something like this tool might be used to screen messages before reading them, if it works.