In terms of timelines, AGI is the capability threshold where the system can start picking the low-hanging fruit of lifting its easier-to-lift cognitive limitations (within the constraints of compute hardware), coming to make rapid progress at AI speeds on the kind of work that was previously done only by humans. Initially this might even be mere AI engineering in the sense of programming, with humans supplying the high-level ideas for the AI to implement in code.
It's hard to pin down anything specific that GPT-4 can't do at all that would be necessary for crossing this threshold; it's merely bad at many of the steps involved. Scaling predictably makes LLMs better, as long as data doesn't run out. Absent regulation or AGI, a lot of the scaling will happen in a burst over the next 3-5 years before slowing down. How the speed of improvement changes throughout the process doesn't matter, only whether the crucial capability threshold gets crossed. And it's too unclear where that threshold lies, and how much improvement is left in scale alone, to tell with any certainty which one wins out.
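The sense in which scaling is predictable is captured by parametric scaling laws. As a minimal sketch, here is the parametric form fitted in Hoffmann et al. (2022), with that paper's reported constants; the numbers are specific to their setup and only illustrative here:

```python
# Chinchilla-style scaling law L(N, D) = E + A/N^alpha + B/D^beta,
# with the fitted constants reported in Hoffmann et al. (2022).

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for a model with n_params parameters
    trained on n_tokens tokens, under the Chinchilla parametric fit."""
    E, A, B = 1.69, 406.4, 410.7   # irreducible loss and fitted coefficients
    alpha, beta = 0.34, 0.28       # fitted exponents for parameters and data
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling the data while holding parameters fixed buys less and less,
# which is why running out of data matters:
for tokens in (1e12, 2e12, 4e12):
    print(f"{tokens:.0e} tokens -> predicted loss {predicted_loss(7e10, tokens):.3f}")
```

The diminishing returns in the data term are the formal version of "scaling predictably makes LLMs better, as long as data doesn't run out."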
Then there is data quality, which synthetic data can make quite high in narrow domains such as Go or chess, allowing DL systems that are tiny by modern standards to play very good Go or chess. Something similar might get invented for LLM data quality, allowing them to get very good at many STEM activities (such as theorem proving), but at a scale far beyond GPT-4.
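What makes such high quality attainable in these narrow domains is a cheap, exact verifier: the game rules and outcome in Go or chess, a proof checker in theorem proving. A toy sketch of the resulting generate-and-filter loop follows, where `sample_candidate` and `verify` are hypothetical stand-ins rather than any particular library:

```python
# Toy generate-and-filter loop for synthetic training data in a verifiable domain.
# `sample_candidate` and `verify` are hypothetical stand-ins: in theorem proving,
# `verify` would be a proof checker; in Go or chess, the rules plus final score.
import random

def sample_candidate(problem: str) -> str:
    """Stand-in for sampling an attempted solution from the current model."""
    return f"attempt-{random.randrange(1000)} for {problem}"

def verify(problem: str, attempt: str) -> bool:
    """Stand-in for an exact verifier (proof checker, game outcome, unit test)."""
    return random.random() < 0.05  # most attempts fail; the survivors are high quality

def generate_dataset(problems: list[str], tries_per_problem: int) -> list[tuple[str, str]]:
    dataset = []
    for problem in problems:
        for _ in range(tries_per_problem):
            attempt = sample_candidate(problem)
            if verify(problem, attempt):
                dataset.append((problem, attempt))  # only verified data is kept
                break
    return dataset

print(len(generate_dataset([f"p{i}" for i in range(100)], tries_per_problem=64)))
```

The filter is what lets data quality exceed the quality of the generator: the model proposes, but only verified successes make it into the training set.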
There is not enough high-quality text data to get through the current burst of scaling (forcing a pivot to less capability-rich multimodal data), so serious work on this is inevitably ongoing, in addition to the distillation-motivated work on specialized smaller models. (Not generating specialized synthetic data particularly well might be one of the cognitive limitations that a nascent AGI, arising despite doing this poorly, might work on lifting.)
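The distillation-motivated work mentioned above typically means training a small student model against a large teacher's output distribution rather than raw text alone. A minimal sketch of the standard soft-target loss (Hinton et al., 2015); the tensor shapes and temperature are illustrative assumptions, not anyone's production recipe:

```python
# Minimal soft-target distillation loss (Hinton et al., 2015): the student matches
# the teacher's temperature-softened distribution instead of only hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions; T^2 rescales gradients back to the usual magnitude.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature**2

# Illustrative shapes: a batch of 8 examples over a 50k-token vocabulary.
student = torch.randn(8, 50_000, requires_grad=True)
teacher = torch.randn(8, 50_000)
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```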