Hmm. A speculative, currently-intractable way to do this might be to summarize the ML model before feeding it to the goal-extractor.
tl;dr: As per natural abstractions, most of the details of the interactions between individual neurons are probably irrelevant with regard to the model's high-level functioning/reasoning. So there should be, in principle, a way to automatically collapse e.g. a trillion-parameter model into a much lower-complexity high-level description that would still preserve such important information as the model's training objective.
But there aren't currently any fast-enough algorithms for generating such summaries.
Good question. There's a great amount of confusion over the exact definition, but in the context of this post specifically:
An optimizer is a very advanced meta-learning algorithm that can learn the rules of (effectively) any environment and perform well in it. It's general by definition. It's efficient because this generality allows it to use maximally efficient internal representations of its environment.
For example, consider a (generalist) ML model that's fed the full description of the Solar System at the level of individual atoms, and which is asked to roughly predict the movement of Earth over the next year. It can keep modeling things at the level of atoms; or it can dump the overwhelming majority of that information, collapse sufficiently large objects into point masses, and use Cowell's method.
The second option is greatly more efficient, while decreasing the accuracy only marginally. However, to do that, the model needs to know how to translate between different internal representations, and how to model and achieve goals in arbitrary systems.
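To make the second option concrete: Cowell's method is just direct numerical integration of the point-mass equations of motion in Cartesian coordinates. Here's a minimal sketch for a two-body Sun-Earth system; the constants, the leapfrog integrator, and the circular-orbit initial condition are all illustrative choices, not a claim about what such a model would literally run:

```python
import math

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2 (approximate)
M_SUN = 1.989e30   # solar mass, kg (approximate)
AU = 1.496e11      # astronomical unit, m (approximate)

def accel(x, y):
    """Acceleration on Earth from the Sun (fixed at the origin), both treated as point masses."""
    r = math.hypot(x, y)
    a = -G * M_SUN / r**3
    return a * x, a * y

def propagate(days, dt=3600.0):
    """Integrate Earth's orbit with a leapfrog (kick-drift-kick) scheme."""
    # Start Earth at 1 AU with circular orbital speed (~29.8 km/s).
    x, y = AU, 0.0
    vx, vy = 0.0, math.sqrt(G * M_SUN / AU)
    ax, ay = accel(x, y)
    for _ in range(int(days * 86400 / dt)):
        vx += 0.5 * dt * ax; vy += 0.5 * dt * ay  # half-kick
        x += dt * vx; y += dt * vy                # drift
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax; vy += 0.5 * dt * ay  # half-kick
    return x, y
```

The point of the example: this integrator touches a handful of floats per step, versus tracking ~10^50 atoms, and after a simulated year Earth ends up back near its starting point anyway.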
The same property that allows an optimizer to perform well in any environment allows it to efficiently model any environment. And vice versa, which is the bad part. The ability to efficiently model any environment allows an agent to perform well in any environment, so math!superintelligence would translate to real-world!superintelligence, and a math!goal would be mirrored by some real!goal. Next thing we know, everything is paperclips.
E. g., how to keep track of what "Earth" is, as it moves from being a bunch of atoms to a point mass.
Assuming it hasn't been trained for this task specifically, of course, in which case it can just learn how to translate its inputs into this specific high-level representation, how to work with this specific high-level representation, and nothing else. But we're assuming a generalist model here: there was nothing like this in its training dataset.
If it's only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical calculator, how does increasing this AI's compute/power lead to it becoming an optimizer over a wider domain than just arithmetic?
That was a poetic turn of phrase, yeah. I didn't mean a literal arithmetic calculator, I meant general-purpose theorem-provers/math engines. Given a sufficiently difficult task, such a model may need to invent and abstract over entire new fields of mathematics to solve it in a compute-efficient manner. And that capability goes hand-in-hand with runtime optimization.
Do you think it might be valuable to find a theoretical limit that shows that the amount of compute needed for such epsilon-details to be usefully incorporated is greater than ever will be feasible (or not)?
I think something like this was on the list of John's plans for empirical tests of the NAH, yes. In the meantime, my understanding is that the NAH explicitly hinges on assuming this is true.
Which is to say: Yes, an AI may discover novel, lower-level abstractions, but then it'd use them in concert with the interpretable higher-level ones. It wouldn't replace high-level abstractions with low-level ones, because the high-level abstractions are already as efficient as they get for the tasks we use them for.
You could dip down to a lower level when optimizing some specific action — like fine-tuning the aim of your energy weapon to fry a given person's brain with maximum efficiency — but when you're selecting the highest-priority person to kill to cause the most disarray, you'd be thinking about "humans" in the context of "social groups", explicitly. The alternative — modeling the individual atoms bouncing around — would be dramatically more expensive, while not improving your predictions much, if at all.
It's analogous to how we're still using Newton's laws in some cases, despite in principle having ample compute to model things at a lower level. There's just no point.
Edit: On a closer read, I take it you're looking only for tasks well-suited for language models? I'll leave this comment up for now, in case it'd still be of use.
Can't exactly fit that here, but the dataset seems relatively easy to assemble.
We can then play around with it:
Task: Automated Turing test.
Context: A new architecture improves on state-of-the-art AI performance. We want to check whether AI-generated content is still distinguishable from human-generated content.
Input type: Long string of text, 100-10,000 words.
Output type: Probability that the text was generated by a human.
Etc., etc.; the data are easy to generate. Those are from here and here.
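A toy sketch of the interface this task implies — text in, P(human-written) out. The features and weights below are made up purely for illustration; a real detector would be a trained model, not hand-set heuristics:

```python
import math
from collections import Counter

def features(text):
    """Two crude stylometric features; stand-ins for whatever a real model would learn."""
    words = text.lower().split()
    if not words:
        return 0.0, 0.0
    counts = Counter(words)
    type_token_ratio = len(counts) / len(words)      # vocabulary diversity
    burstiness = max(counts.values()) / len(words)   # share taken by the most-repeated word
    return type_token_ratio, burstiness

def p_human(text, w_ttr=4.0, w_burst=-6.0, bias=-1.5):
    """Hypothetical weights: diverse vocabulary nudges toward 'human',
    heavy repetition nudges away; a logistic squash yields a probability."""
    ttr, burst = features(text)
    z = w_ttr * ttr + w_burst * burst + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The same signature is what you'd hook the attribution tooling up to: hold the output fixed and ask which input spans moved the score.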
To be honest, I'm not sure it's exactly what you're asking, but it seems easy to implement (compute costs aside) and might serve as a (very weak and flawed) "fire alarm for AGI" + provide some insights. For example, we can then hook this Turing Tester up to an attribution tool and see what specific parts of the input text make it conclude the text was/wasn't generated by an ML model. This could then provide some insights into the ML model in question (are there specific patterns it repeats? abstract mistakes that are too subtle for us to notice, but are still statistically significant?).
Alternatively, in the slow-takeoff scenario where there's a brief window before the ASI kills us all in which we have to worry about people weaponizing AI for e.g. propaganda, something like this tool might be used to screen messages before reading them, if it works.