It has been some years since I looked at the literature, but if I recall correctly, the problem is that g is defined on a population level rather than on the individual level. You can't directly measure someone's raw g because the raw g is meaningless by itself.
Suppose that you have an intelligence test composed of ten subtests, each of which may earn a person up to 10 points, for a total of 100 points. You give that test to a number of people, and then notice that on average, a person doing well on one of the subtests means that they are likely to do better on the other subtests as well. You hypothesize that this is explained by a "general intelligence factor". You find that if you assign each of your test-takers an "intelligence score", then the average subtest score of all the test-takers who share that level of intelligence is some subtest-dependent constant times their "intelligence score".
Let's say that I was one of the people taking your test. Your equations say that I have a g of 2. Subtest A has a factor loading of 1, so I should get 1 * 2 = 2 points on it. Subtest B has a factor loading of 4.5, so I should get 4.5 * 2 = 9 points on it. It turns out that I actually got 9 points on subtest A and 2 points on subtest B, exactly the opposite of the pattern you predicted! Does this mean that you have made an error? No, because the factor loadings were only defined as indicating the score averaged over all the test-takers who shared my estimated g, rather than predicting anything definite about any particular person.
This means that I can get a very different score profile from what my estimated g would predict, as long as enough others with my estimated g are sufficiently close to the estimate. So my estimated g is not very informative on an individual level. Compare this to measuring someone's height, where if they are 170 cm on one test, they are going to be 170 cm regardless of how you measure it.
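To make the group-versus-individual point concrete, here is a minimal simulation sketch. It assumes each person's subtest score is loading * g plus independent normally distributed noise; the noise size, the sample size, and the function name `simulate_subtest_scores` are all my own made-up choices, and for simplicity the scores are not clamped to the 0-10 range.

```python
import random
import statistics

def simulate_subtest_scores(g, loadings, noise_sd, n_people, seed=0):
    """Return one score profile per person: loading * g plus individual noise."""
    rng = random.Random(seed)
    return [
        [loading * g + rng.gauss(0, noise_sd) for loading in loadings]
        for _ in range(n_people)
    ]

# Loadings for subtests A and B, taken from the example above.
loadings = [1.0, 4.5]

# Everyone in this group shares an estimated g of 2 (noise_sd is an assumption).
profiles = simulate_subtest_scores(g=2, loadings=loadings, noise_sd=3.0, n_people=10_000)

# Averaged over the group, each subtest lands close to loading * g.
group_means = [
    statistics.mean(profile[i] for profile in profiles)
    for i in range(len(loadings))
]

# Yet some individuals score higher on A than on B, the reverse of the prediction.
reversed_count = sum(1 for a, b in profiles if a > b)

print("group means:", group_means)
print("people with A > B:", reversed_count)
```

The group means should come out near 2 and 9, while a minority of simulated people still show the reversed profile. Nothing is wrong with the loadings in that case; they only ever described the average.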
Now suppose that you are unhappy with the subtests that you have chosen, so you throw out some and replace them with new ones. It turns out that the new ones are substantially harder: on the old test, people got an average g of 4, but now they only get an average of 2. How do you compare people's results between the old and the new test? Especially since some people are going to be outliers and perform better on the new test - maybe I get lucky and increase my g to a 3. It's as if we were measuring people's heights in both meters and feet, but rather than one straightforwardly converting to another, two people with the same height in meters might have a different height in feet.
Worse, there's also no reason why the intelligence test would have to have 10 subtests that award 10 points each. Maybe I devise my own intelligence test: it has 6 subtests that give 0-6 points each, 3 subtests that give 2-8 points each, and 4 subtests that give 0-20 points each. The resulting raw score distributions and factor loadings are going to be completely different. How do you compare people's results on your old test, your new test, and my test?
Well, one way to compare them would be to just say that you are not even trying to measure raw g (which is not well-defined for individuals anyway), you are measuring IQ. It seems theoretically reasonable that whatever-it-is-that-intelligence-tests-measure would be normally distributed, because many biological and psychometric quantities are, so you just define IQ as following a normal distribution and fit all the scores to that. Now we can at least say that "Kaj got an IQ of 115 on his own test and an IQ of 85 on Bob's test", letting us know that I'm one standard deviation above the mean on my test and one standard deviation below it on Bob's test. That gives us at least some way of telling what the raw scores mean.
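As a rough sketch of what "fit all the scores to a normal distribution" can look like: take the raw score's percentile rank within a norming sample, then read off the matching point on a normal curve with mean 100 and standard deviation 15. The function name `raw_to_iq`, the mid-rank percentile formula, and the norming samples below are illustrative assumptions, not how any particular published test is normed.

```python
from statistics import NormalDist

def raw_to_iq(raw_score, norming_sample):
    """Map a raw score to an IQ via its percentile rank in the norming sample."""
    n = len(norming_sample)
    below = sum(1 for s in norming_sample if s < raw_score)
    equal = sum(1 for s in norming_sample if s == raw_score)
    # Mid-rank percentile, nudged so it always stays strictly between 0 and 1.
    percentile = (below + 0.5 * equal + 0.5) / (n + 1)
    return NormalDist(mu=100, sigma=15).inv_cdf(percentile)

# Made-up norming samples for two tests with different raw-score scales.
my_test_norms = list(range(1, 100))                # raw scores 1..99
bobs_test_norms = [2 * s for s in range(1, 100)]   # raw scores 2..198

iq_on_my_test = raw_to_iq(84, my_test_norms)       # roughly one SD above the mean
iq_on_bobs_test = raw_to_iq(32, bobs_test_norms)   # roughly one SD below the mean

print(round(iq_on_my_test), round(iq_on_bobs_test))
```

The raw scores 84 and 32 are not comparable with each other, since the two tests use different scales. The IQ values at least say where each score sits relative to the other people who took the same test.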
Suppose that you did stick to just one fixed test, and measured how the raw scores change over time. This is something that is done - it's how the Flynn effect was detected, as there were increasing raw scores. But there are also problems with that, as seen from all the debates over what the Flynn effect changes actually mean.
Let's say that an intelligence test contains a subtest measuring the size of your vocabulary. The theoretical reason for why vocabulary size is thought to correlate with intelligence is that people learn the meaning of new words by hearing them used in a context. With a higher intelligence, you need to hear a word used fewer times to figure out its meaning, so smarter people will on average have larger vocabularies. Now, words are culturally dependent and people will be exposed to different words just by random chance... but if all of the people who the test was normed on are from the same cultural group, then on average, getting a higher score on the vocabulary subtest is still going to correlate with intelligence.
But suppose that you use exactly the same test for 20 years. The meanings of words change: a word that was common two decades ago might be practically nonexistent today ("phone booth"). Or another might become more common. Or a subtest might measure some kind of abstract reasoning, and then people might start playing puzzle games that feature more abstract logical reasoning. Assuming that people's scores on an IQ test can be decomposed into something like (talent + practice), changes in culture can invalidate intertemporal comparisons by changing the amount of practice people get.
So if someone who was 20 took your test in 2000 and got a raw score of 58, and someone who is 20 takes your test in 2020 and also gets a raw score of 58, this might not indicate that they have the same intelligence either... even if they get exactly the same scores on all the subtests. Periodically re-norming the raw scores helps make them more comparable in this case as well; that way we can at least know what their ranking relative to other 20-year-olds in the same year was.
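Here is a toy sketch of the (talent + practice) picture and of what re-norming buys you. It assumes talent is normally distributed with the same mean in both years, and that the 2020 cohort gets a flat practice bonus of 6 raw points; those numbers, and the simple mean-and-standard-deviation norming, are invented purely for illustration.

```python
import random
import statistics

def simulate_cohort(practice_bonus, n_people=5000, seed=0):
    """Raw score = talent (normal, mean 50, SD 10) + a cohort-wide practice bonus."""
    rng = random.Random(seed)
    return [rng.gauss(50, 10) + practice_bonus for _ in range(n_people)]

def iq_within_cohort(raw_score, cohort_scores):
    """Norm a raw score against its own cohort's mean and standard deviation."""
    mean = statistics.mean(cohort_scores)
    sd = statistics.stdev(cohort_scores)
    return 100 + 15 * (raw_score - mean) / sd

# Hypothetical cohorts of 20-year-olds; only the practice bonus differs.
cohort_2000 = simulate_cohort(practice_bonus=0, seed=2000)
cohort_2020 = simulate_cohort(practice_bonus=6, seed=2020)

# The same raw score of 58, normed against each year's own cohort.
iq_2000 = iq_within_cohort(58, cohort_2000)
iq_2020 = iq_within_cohort(58, cohort_2020)

print(round(iq_2000), round(iq_2020))
```

In this toy setup the same raw score of 58 sits further above the cohort average in 2000 than in 2020, so the normed value comes out lower for the 2020 test-taker. That only tells us their ranking within their own year; it does not tell us how much of the raw score was talent and how much was practice.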