I know of two independently developed LLMs, in two different languages, where the developers' conclusion is "we ran out of data in our language". One of them is trying to scale by going multilingual.

Where to look next? There is a lot of untapped data in speech (radio shows, YouTube, etc.), and in my opinion that amount could make a difference.