"...feed forward networks for the new token don't have access to the past feed-forward states of the other tokens..."
This isn't correct. The attention mechanism can move information from the neural network outputs at previous times to the current time, and that information is then fed into the feedforward network for the current time. The basic transformer mechanism is to alternate cross-time attention computations with within-current-time neural network computations, over many layers. Without access to information from past times, performance would obviously be atrocious.
In a sense, the KV cache that retains this information from past times is "just" an optimization, because the computations are (in theory, not always in practice) deterministic, so one could simply redo them for every previous token when predicting the next token (assuming the previously-generated tokens are retained). But that doesn't seem enough to support your argument.
Of course, it's quite possible that the models don't attend very well to the past states, and so suffer to some extent from the issues you mention, but it's not a fundamental property of the architecture.
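To make the alternation concrete, here's a minimal NumPy sketch of one decoder layer (all weight names and sizes are illustrative, not taken from any particular model): causal attention mixes information across positions, and the per-position feedforward then operates on a representation that already contains information pulled in from earlier tokens.

```python
# Minimal single-layer decoder sketch (NumPy, illustrative weights only):
# attention mixes information across positions, then a per-position
# feedforward (MLP) acts on the result.
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # sequence length, model width
x = rng.normal(size=(T, d))      # token representations entering the layer

# Single-head causal self-attention: position t can read positions <= t.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(T, k=1)] = -np.inf            # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ V           # row t is a mixture of values from positions <= t

# Per-position feedforward: operates on one position at a time, but its
# input at position t already contains information attention pulled in
# from earlier positions (via the residual stream).
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
mlp_in = x + attn_out            # residual connection
mlp_out = np.maximum(mlp_in @ W1, 0) @ W2

# Perturbing an earlier token changes the current token's feedforward input:
x2 = x.copy()
x2[0] += 1.0                     # change the first token
Q2, K2, V2 = x2 @ Wq, x2 @ Wk, x2 @ Wv
s2 = Q2 @ K2.T / np.sqrt(d)
s2[np.triu_indices(T, k=1)] = -np.inf
w2 = np.exp(s2 - s2.max(axis=-1, keepdims=True))
w2 /= w2.sum(axis=-1, keepdims=True)
print(np.allclose((x + attn_out)[-1], (x2 + w2 @ V2)[-1]))   # False
```

The final check prints False: changing the first token changes the feedforward input at the last position, which is the point - the per-position feedforward isn't starved of past information, because attention has already routed it in.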
"yes these start at zero"
Umm... No. Except for Geology, the y-axes don't start at zero. Most start close to zero, but you can see most clearly with Philosophy that they don't start exactly at zero.
As an experiment, I asked ChatGPT o3 to mark issues with your paragraph above. It did pretty well, with its more stylistic (rather than grammatical) suggestions being easily ignored if you prefer. Here's the dialog:
PROMPT TO O3:
Please look at the following text and identify places where the English is ungrammatical, or reads very oddly to a native speaker, and suggest how to rephrase correctly:
Thanks for your post and advices. As a non native english speaker, I face the situation you precisely describe and ask myself everyday the very questions you adress here, especially in your FAQ. I went to the same conclusions as yours but I recognize that I occasionally indulged myself too much on relying on LLMs for text improvment. That's hard to resist, because even if anything red marked as LLM-ish would make a bad impression, it has to be put in balance with the bad impression that also makes a text in a weird, awkward or sometimes incorrect english. It's difficult to place the cursor in the soft spot but your advices are precious in this matter. Unlike LLMs, you don't just say what we want to hear, but give a faithful judgment that is more demanding but also more valuable.
REASONING:
Thought for 11 seconds
The user wants help identifying ungrammatical parts of a paragraph and suggestions for rephrasing. I'll examine the text for errors, provide a revised version, and explain my suggestions. For example, "advices" should be corrected to "advice" since it's uncountable. "Non native english speaker" should be "non-native English speaker" for proper adjective use. Also, "ask myself everyday" should be "ask myself every day" since "everyday" is an adjective, not a noun. Let's clarify those.
Done
RESPONSE:
Below I’ve marked the parts that sound ungrammatical or awkward to a native speaker, then offered smoother alternatives. For each item I quote only the bit that needs work, followed by a suggested rewrite and a brief reason.
| Original phrasing (excerpt) | Suggested rewrite | Why |
| --- | --- | --- |
| **“and advices.”** | “and advice.” | *Advice* is an uncountable noun in English. |
| **“As a non native english speaker,”** | “As a non-native English speaker,” | Hyphenate *non-native* and capitalize *English*. |
| **“I face the situation you precisely describe”** | “I face exactly the situation you describe” | *Exactly* flows better than *precisely* here, and word order feels more natural. |
| **“ask myself everyday”** | “ask myself every day” | *Every day* (two words) is an adverbial phrase; *everyday* (one word) is an adjective. |
| **“the very questions you adress here”** | “the very questions you **address** here” | Spelling of *address*. |
| **“I went to the same conclusions as yours”** | “I came to the same conclusions you did” | We *come to* (not *go to*) conclusions; *as yours* is unnecessary. |
| **“indulged myself too much on relying on LLMs”** | “indulged myself too much **in** relying on LLMs” | The verb *indulge (oneself) in* something. |
| **“for text improvment.”** | “for text improvement.” | Spelling of *improvement*. |
| **“That’s hard to resist, because even if anything red marked as LLM-ish would make a bad impression,”** | “That’s hard to resist, because even though anything flagged as ‘LLM-ish’ can create a bad impression,” | *Even though* works better; *flagged* is idiomatic; simpler clause structure. |
| **“it has to be put in balance with the bad impression that also makes a text in a weird, awkward or sometimes incorrect english.”** | “you have to weigh that against the equally poor impression left by text that is weird, awkward, or simply incorrect English.” | Clearer verb (*weigh*), smoother wording, capitalize *English*. |
| **“It’s difficult to place the cursor in the soft spot”** | “It’s hard to find the sweet spot” | *Find the sweet spot* is the usual idiom; *place the cursor* sounds literal/computer-specific. |
| **“but your advices are precious in this matter.”** | “but your advice is invaluable in this matter.” | *Advice* is uncountable; *invaluable* or *extremely helpful* sound natural. |
| **“Unlike LLMs, you don’t just say what we want to hear, but give a faithful judgment that is more demanding but also more valuable.”** | “Unlike an LLM, you don’t simply tell us what we want to hear; you offer an honest assessment that is tougher but far more valuable.” | Smoother coordination and more idiomatic verbs. |
---
### One possible fully polished version
> Thank you for your post and advice. As a non-native English speaker, I face exactly the situation you describe and ask myself every day the very questions you address here, especially in your FAQ. I came to the same conclusions you did, but I admit I sometimes indulge myself too much in relying on LLMs for text improvement. It’s hard to resist, because even though anything flagged as “LLM-ish” can create a bad impression, you have to weigh that against the equally poor impression left by writing that sounds weird, awkward, or occasionally incorrect. Finding the sweet spot is difficult, but your advice is invaluable. Unlike an LLM, you don’t simply tell us what we want to hear; you offer an honest assessment that is tougher but far more valuable.
If all happy families are alike, but each unhappy family is unhappy in its own way, then even if most families are unhappy the most common continuation will be the one type of happy family
Note that this is not true if you're generating text from a base model at temperature one. The proportion of happy and unhappy families generated should match that in the training data. (This assumes training went reasonably well, of course, but it probably did.)
Now, people often use a temperature less than one. And few seem to realize that they are then biasing the generated text towards answers that happen to be expressible in only a few ways, and against answers that can be expressed in many different ways. Of course RLHF or whatever adds further biases...
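A toy numeric illustration of that bias (the numbers are made up, and treating each complete phrasing as a single choice simplifies away token-level details): suppose answer A has total probability 0.6 spread over six equally likely phrasings, while answer B has probability 0.4 concentrated in a single phrasing.

```python
# Toy example: lowering the temperature shifts mass toward the answer
# that can be expressed in only one way, even though it is less probable
# overall at temperature one.
import numpy as np

phrasing_probs = np.array([0.1] * 6 + [0.4])   # six phrasings of A, one of B
answer = np.array(["A"] * 6 + ["B"])

for temp in (1.0, 0.5):
    p = phrasing_probs ** (1.0 / temp)          # sharpen at temperature < 1
    p /= p.sum()
    print(f"T={temp}: P(A)={p[answer == 'A'].sum():.3f}  "
          f"P(B)={p[answer == 'B'].sum():.3f}")
# T=1.0: P(A)=0.600  P(B)=0.400
# T=0.5: P(A)=0.273  P(B)=0.727
```

At temperature one the answers appear in their training proportions (0.6 vs 0.4); at temperature 0.5 the single-phrasing answer B ends up favored, even though it was the less likely answer.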
Is paper now dominated by writing on a blackboard/whiteboard, and taking photos of what's worth keeping before erasing and rewriting?
One problem, I guess, is the board's lack of portability (not always relevant).
I think much of the discussion of homeschooling is focused on elementary school. My impression is that some homeschooled children do go to a standard high school, partly for more specialized instruction.
But in any case, very few high school students are taught chemistry by someone with a Ph.D. in chemistry and 30 years of work experience as a chemist. I think it is fairly uncommon for a high school student to have any teachers with Ph.D.s in any subject (relevant or not). If most of your teachers had Ph.D.s or other degrees in the subjects they taught, then you were very fortunate. (My daughter is in fact similarly fortunate, but I know perfectly well that her type of private school cannot be scaled to handle most students.)
And if we're going to discuss atypical situations, I do in fact think that I would be competent to teach all those subjects at a high school level.
I'm baffled as to what you're trying to say here. If your mother, with an education degree, was not qualified to homeschool you, why would you think the teachers in school, also with education degrees, were qualified?
Are you just saying that nobody is qualified to teach children? Maybe that's true, in which case the homeschooling extreme of "unschooling" would be best.
All the infra for fiat currency exists; I don't see why the AIs would need to reinvent that
Because using an existing medium of exchange (one that's not based on the value of a real commodity) involves transferring real wealth to the current holders of that currency. Instead, they might, for example, start up a new bitcoin blockchain and use their new bitcoin, rather than transfer wealth to present bitcoin holders.
Maybe they'd use gold, although the current value of gold is mostly due to its conventional monetary value (rather than its practical usefulness, though that is non-zero).
OK, I think I see more clearly what you're saying. The hidden unit values in a feedforward block of the transformer at a previous time aren't directly available at the current time - only the inputs of that feedforward block can be seen. But the hidden unit values are deterministic functions of the inputs, so no information is lost. If these feedforward blocks were very deep, with many layers of hidden units, then keeping those hidden unit values directly available at later times might matter. But in fact these feedforward blocks are not deep (even though the full network made of many such blocks is deep), so it may not be a big issue - the computations can be redundantly replicated if that helps.
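A small sketch of that point, assuming a deterministic attention layer (the weight names are illustrative): reading keys and values from a cache gives exactly the same output as recomputing them from the retained past tokens, so the cache changes only the amount of computation, not the result.

```python
# "The KV cache is just an optimization": for a deterministic attention
# layer, cached keys/values and keys/values recomputed from scratch give
# identical outputs for the new token.
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Single query attending over past keys/values (all are in the past, so no mask)."""
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V

past_tokens = rng.normal(size=(5, d))   # representations of previously processed tokens
new_token = rng.normal(size=d)

# With a cache: K and V for past tokens were stored when they were processed.
K_cache, V_cache = past_tokens @ Wk, past_tokens @ Wv
out_cached = attend(new_token @ Wq, K_cache, V_cache)

# Without a cache: recompute K and V from scratch for every past token.
out_recomputed = attend(new_token @ Wq, past_tokens @ Wk, past_tokens @ Wv)

print(np.allclose(out_cached, out_recomputed))   # True: same result, more computation
```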