Note that for your "bandwidth" argument, it is possible to reduce complex visual scenes to segmented maps of all the objects (with metadata on position, rotation, grabbable regions, etc., of use to a robot).
These run in realtime on current hardware.
These reduced scenes are much smaller, and can fit in the input token context for an LLM if you want to do it that way.
The same goes for sound. You can also use narrow AI agents to handle robotic tasks: the high-level one would issue commands, and the narrow one would be much more efficient, as it consumes the robot's proprioception and planner frames to carry out a task.
Waymo claim they are on version 5 of their driver stack, that the new one is heavily using ML, and that it is general enough that they are starting driverless services in LA right now. This is consistent with the expectation that self-driving is profitable and solvable.
The bandwidth hypothesis is an interesting one, but I'm not sure the conclusion follows from it. Increasing I/O bandwidth by a few orders of magnitude doesn't necessarily require increasing compute by the same ratio.
The dataset concerns are already being addressed, but one thing I noticed was the point "it's going to require real-life data of e.g. of how to interact with humans". That does indeed seem like something that we might want to have lots of training data on, but mostly regarding alignment rather than capability. I don't see any way in which lack of human-AI interaction data reduces capabilities, except indirectly by humans coordinating on not improving capabilities until such data exists.
I've been lurking on LessWrong for quite a while and I've been following the high-level discussions on AI, AGI and take-off speed. I've had this simmering in the back of my mind and I have some ideas on this that I haven't seen expressed elsewhere. That may be because I haven't read thoroughly enough, but maybe they can add something new to the discussion as well? So here goes:
1. Bandwidth
The current big thing is Large Language Models (LLMs). ChatGPT and Bing write well enough that they can be mistaken for (weird) humans. This certainly is impressive and it is obviously a step towards further capabilities. However, what I haven't seen discussed is that reading / writing is very low bandwidth.
A scene can be described in a document of a few kilobytes max. A video of the same scene would be megabytes. And it's possible to add even more information that a real-life embedded human would have access to: smell, personal feelings, memories of previous encounters with the objects / people in the scene, etc.
AI isn't human and probably wouldn't need exactly the same information. But I would be very surprised if you could get to AGI without significantly higher bandwidth input and output; a few OOM at least.
Compute is one of the current constraints on AI and requiring a few OOM extra compute should push take-off further into the future.
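To put rough numbers on the text-versus-video gap in this section, here is a back-of-the-envelope sketch. Every figure in it is an assumption chosen for illustration (a 2 kB text description, a 10-second clip, a 5 Mbit/s compressed video stream); none of them are measurements, and different choices shift the result by an order of magnitude either way.

```python
import math

# All figures below are illustrative assumptions, not measurements.
TEXT_DESCRIPTION_BYTES = 2_000         # "a few kilobytes" of text
VIDEO_BITRATE_BITS_PER_S = 5_000_000   # assumed compressed video stream
CLIP_SECONDS = 10                      # assumed length of the scene


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return log10 of the ratio between two quantities."""
    return math.log10(larger / smaller)


# Convert the video stream from bits to bytes for the whole clip.
video_bytes = VIDEO_BITRATE_BITS_PER_S * CLIP_SECONDS / 8
gap = orders_of_magnitude(video_bytes, TEXT_DESCRIPTION_BYTES)

print(f"video: {video_bytes / 1e6:.2f} MB, text: {TEXT_DESCRIPTION_BYTES / 1e3:.1f} kB")
print(f"gap: about {gap:.1f} orders of magnitude")
```

Under these assumptions the gap comes out at roughly three to four OOM. Whether compute has to scale by the same factor is a separate question that this arithmetic does not answer.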
2. Gimme data
Current AIs are able to do well in two cases: One, when there is a lot of more-or-less ready-made data that can be trained upon; ChatGPT uses the whole history of human writing as training set. Two, when useful data can be generated on the fly; AlphaGo generated its own data by playing stupendous amounts of games against itself.
For more advanced AI it may be that neither case holds. If we want to have AI that is more embedded in real life, it's going to require real-life data, e.g. of how to interact with humans. Text descriptions may help, and video may take it a step further. But it may well be that further / different data is required, e.g. what is mentioned above under the Bandwidth section. There is no database of (in-context) smells or feelings. And there is no historical data on "how AI should interact with humans".
This means that the data will need to be generated. And where an AI can play multiple games of Go in a second, real-life (human) interaction will (probably?) be limited to real-life speeds.
Not only this, but so far newer generations of AI have required more data than previous generations. And thus it may be the case that, for further improvements, OOM more new data will need to be generated. If that needs to be done at human speeds, it will take very significant effort.
An early example of this is self-driving cars, which have not lived up to their hype, despite having a much clearer business case than LLMs.
This, for me, also points towards slower take-off.
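The speed gap between self-play and real-time interaction can also be put in rough numbers. The sketch below is only arithmetic on assumed figures: 10 million episodes, half a second per self-play game (matching "multiple games of Go in a second"), and ten minutes per human interaction. The function name and all constants are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(episodes: int, seconds_per_episode: float, parallel_streams: int) -> float:
    """Days of wall-clock time needed to collect `episodes` episodes."""
    return episodes * seconds_per_episode / parallel_streams / SECONDS_PER_DAY


# Illustrative assumptions only.
EPISODES = 10_000_000
SELF_PLAY_SECONDS = 0.5    # two self-play games per second
HUMAN_SECONDS = 600.0      # one ten-minute human interaction

self_play_days = wall_clock_days(EPISODES, SELF_PLAY_SECONDS, parallel_streams=1)
human_days = wall_clock_days(EPISODES, HUMAN_SECONDS, parallel_streams=1)

# Number of humans interacting in parallel needed to match one self-play worker.
streams_to_match = HUMAN_SECONDS / SELF_PLAY_SECONDS

print(f"self-play, one worker: {self_play_days:.0f} days")
print(f"human interaction, one stream: {human_days / 365:.0f} years")
print(f"parallel human streams needed to keep pace: {streams_to_match:.0f}")
```

The per-stream slowdown is fixed by real-life speeds, so the only lever left in this toy model is parallelism, i.e. recruiting more people and more hardware rather than running faster.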
3. The butterfly effect
AI predicts; LLMs predict language tokens, self-driving cars predict what other cars will do a moment from now (I'm not actually sure that this is the case, but I'd be surprised if they don't). Both of these are myopic: they only predict what will happen next.
Being able to predict further into the future will be one hallmark of increased power / intelligence, perhaps even a defining feature. However, predicting the future is hard, as small changes in initial conditions can result in extremely different outcomes just a few time steps ahead: A butterfly flapping its wings over Mongolia may cause a snowstorm in Canada. Predicting a single time step into the future and then using the output of that as the input for the next time step may well not work.
For "trend extrapolation" this might not be too much of a problem. But when it comes to predicting much more complex systems in detail and far into the future, such as a group of humans a year from now, this could well be prohibitive.
This then dovetails with the previous point under Gimme data: to predict further into the future, data is required over longer time frames; an AI would need to observe what happens when it takes an action not just a moment later, but also a day, a week, a year or even longer. And the longer the time frame required, the more time is required to generate that data and thus the harder it is to get enough of it.
Another possible point for slower take-off.
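The way small errors compound when each prediction is fed back in as the next input can be illustrated with a standard toy chaotic system, the logistic map. This is a minimal sketch and a stand-in only: nothing here models humans or any real AI, and the starting value, the size of the initial error and the number of steps are arbitrary assumptions.

```python
def logistic_step(x: float, r: float = 4.0) -> float:
    """One step of the logistic map, which is chaotic at r = 4."""
    return r * x * (1.0 - x)


def rollout(x0: float, steps: int) -> list[float]:
    """Predict step by step, feeding each output back in as the next input."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_step(xs[-1]))
    return xs


# The "world" and the "model" follow the same rule; the model's starting
# measurement is off by one part in a billion (an arbitrary assumption).
true_path = rollout(0.2, 50)
model_path = rollout(0.2 + 1e-9, 50)
errors = [abs(a - b) for a, b in zip(true_path, model_path)]

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: error {errors[step]:.2e}")
```

Even with a perfect model of the dynamics, the tiny initial error grows until the forecast is no better than a guess. More accurate measurements buy extra steps only logarithmically.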
4. The recursive prediction dilemma
Imagine you are an AI trying to have something specific happen some time in the future. You predict what all the humans will do given some interventions that you could enact. This is hard, based on the previous argument, but let's say you manage and you pick the perfect set of actions that gets your thing done.
Now imagine that you're not alone, but that there is another AI, of more-or-less the same intelligence level as you, which is also trying to enact something. To make your thing happen without interference from its actions, you will need to predict what this AI will do, while at the same time knowing that it is predicting how you will act. So you need to take into account its predictions of your predictions. Which it will then also do. Potentially ad infinitum. Oh, and all of this will probably need to happen in real time, because being able to act quickly (if not first) most likely increases the chance of the thing you want to happen.
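A toy version of this regress is the game of matching pennies, where one player wins if the two coins match and the other wins if they differ. The sketch below assumes a "level-k" style of reasoning: a player at depth k best-responds to what the opponent would play at depth k - 1, and a depth-0 player naively plays heads. These function names and the depth-0 default are assumptions made for illustration, not a model of how real AIs would reason.

```python
HEADS, TAILS = "H", "T"


def flip(coin: str) -> str:
    """Return the opposite face of the coin."""
    return TAILS if coin == HEADS else HEADS


def matcher_move(depth: int) -> str:
    """Move of the AI that wins when the coins match, reasoning `depth` levels deep."""
    if depth == 0:
        return HEADS  # assumed naive default
    # Copy whatever the opponent is predicted to play one level down.
    return mismatcher_move(depth - 1)


def mismatcher_move(depth: int) -> str:
    """Move of the AI that wins when the coins differ, reasoning `depth` levels deep."""
    if depth == 0:
        return HEADS  # assumed naive default
    # Play the opposite of what the opponent is predicted to play one level down.
    return flip(matcher_move(depth - 1))


for depth in range(8):
    print(depth, matcher_move(depth), mismatcher_move(depth))
```

In this toy game the chosen move keeps cycling as the reasoning depth grows, so thinking one level deeper never settles the question; it only changes the answer.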
That's just for a single other AI. What if there are tens, hundreds, even more? Even if you are the best of them, can you out-predict them all?
That is, assuming you can even predict their actions at all. Because you may well need training data to do that. Data which potentially needs to be built up over long time frames to circumvent the butterfly effect discussed above. And then what if there is a completely new AI for which you have no data yet? Will data on other AI generalize well enough?
What exactly this points to I'm not sure, except lots of weirdness. Though if I squint, perhaps this means a lower X-risk? If AIs are too busy competing with each other, maybe they won't get round to exterminating humanity? Though maybe it actually goes the other way? If enough weirdness happens, then the chances become higher that something accidentally happens that turns out to be fatal to humanity / the biosphere / the planet / the universe?