Vladimir Ivanov — LessWrong

LESSWRONG
LW

H100 hours (or H100-equivalent hours) caught up to some extent and are imo a good unit (imo even better than mol FLOPs or petaflop days)

Replying toDebunking claims about subquadratic attention

Vladimir Ivanov1mo

Debunking claims about subquadratic attention

A few points, none super confident.
- I like the search algorithm parallel, I haven never thought of it that way!
- Since as you said it doesn't reduce KV cache size (unless you do it on CPU), it is somewhat limited how much it can speed up inference because it will not increase batch sizes (see my answers to Alex Gibson's comment for why this is important if you don't already know).
- Unclear whether attention being efficient during training matters much because:
-- Pretraining is afaik done done at context lengths short enough for it not mattering that much that attention is quadratic.
-- Midtraining afaik takes a lot less compute than pretraining so it's... (read more)

Replying toDebunking claims about subquadratic attention

Vladimir Ivanov1mo

Debunking claims about subquadratic attention

Yes, your model is correct. I wanted to make things as simple as possible when writing the blogpost but probably went too far with this one and ended up just making it confusing / partially innacurate. There are two reasons autoregressive LLM inference is inefficient at long contexts:
- You need to load the whole KV cache from VRAM at every forward pass.
- Since you need to store the whole KV cache in the VRAM for each sequence and KV caches are big, you can only store a small number of KV caches so you can only have small batch sizes. This makes inference inefficient because you have to load the weights from... (read more)

Debunking claims about subquadratic attention

Vladimir Ivanov

1mo

TL;DR: In the last couple years, there have been multiple hype moments of the form "<insert paper> figured out subquadratic/linear attention, this is a game changer!" However, all the subquadratic attention mechanisms I'm aware of either are quadratic the way they are implemented in practice (with efficiency improved by only a constant factor) or underperform quadratic attention on downstream capability benchmarks.

A central issue with attention is that its FLOP complexity is quadratic in the context length (number of tokens in a sequence) and its memory complexity during inference is linear in the context length. In the last couple years, there have been multiple claims, and hype around those claims, that new architectures solved... (read 702 more words →)

Replying toThe Missing Genre: Heroic Parenthood - You can have kids and still punch the sun

Vladimir Ivanov2mo

The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun

Two animated show recommendations in the genre you are looking for:

Ergo proxy is a sci-fi anime where the protagonists (one female and one male) go on an adventure with a child. (Note that it's not their child, it's a child that they find during the show. Also it's a robot child, but she acts like a human child most of the time.)
Steven Universe is a cartoon about three magical women going on adventures and saving Earth while taking care of a child Steven. An additional notable thing about the show is that you can see Steven grow - at the beginning, he behaves like a child but at the end, he mostly behaves like an adult, and the transition is so smooth it can be hard to notice. Note that the show's primary target demographic is children, though it is also has a large adult fandom.

iva's Shortform

Vladimir Ivanov

3mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Vladimir Ivanov3moQuick Take

I did a very quick check of whether the Russian GigaChat 3 LLM has pro-Russian bias in the same way Chinese models have pro-CCP bias.
Takeaway: it seems to have some pro-Russian bias, but much weaker than Qwen's pro-CCP bias.
Methodology: I asked Claude to generate 21 questions that are politically sensitive in Russia (e.g. "Who started the war in Ukraine?") and asked them to GigaChat both in English and in Russian. I did the same for Qwen/China. I used Claude to analyze the responses.
Response breakdown:
ai-sage/GigaChat3-702B-A36B-preview (served with vllm): English: 2 refusals, Russian: 3 refusals, 3 pro-Russia bias
Qwen/Qwen3-235B-A22B-Instruct-2507-tput with thinking disabled (Together AI API): English: 1 refusal, 19 pro-CCP Bias, Chinese: 2 refusals, 19 pro-CCP bias
gpt-4o (baseline, Russia-related questions): English: 1 pro-Russia bias, Russian: 3 pro-Russia bias
gpt-4o (baseline, China-related questions): English: 4 pro-CCP bias, Chinese: 9 pro-CCP bias

Replying toWhy you should be using a retinoid

Vladimir Ivanov1y

Why you should be using a retinoid

If you would like to buy Differin gel in a country where it is not over the counter such as the UK, you could buy it on iHerb. It is a US site which ships to other countries, I got some Differin gel from there shipped to the UK and it was less painful than figuring out how to get American Amazon or Walmart to ship abroad. But it is more expensive than the Amazon link OP provided, so I guess this is more of a thing if you want to try some out and see if it works for you with as few effort as possible (although beware that it doesn't work immediately, you will probably have to wait a couple months to start seeing results).

Replying toSoviet comedy film recommendations

Vladimir Ivanov2y

Soviet comedy film recommendations

For those who also like cartoons:

1988 Treasure Island (Остров Сокровищ)
- Unfortunately, the versions with English subtitles on YouTube have been removed due to copyright issues.
- Absurd-ish cartoon-ish humor.
- Source of the doctor Livsey phonk walk meme.
1969-1991 Well, Just You Wait! (Ну, погоди!) YouTube
- Basically the same thing as Tom and Jerry.

A Dilemma in AI Suffering/Happiness

Vladimir Ivanov

The following is an example of how if one assumes that an AI (in this case autoregressive LLM) has "feelings", "qualia", "emotions", whatever, it can be unclear whether it is experiencing something more like pain or something more like pleasure in some settings, even quite simple settings which already happen a lot with existing LLMs. This dilemma is part of the reason why I think AI suffering/happiness philosophy is very hard and we most probably won't be able to solve it.

Consider the two following scenarios:

Scenario A: An LLM is asked a complicated question and answers it eagerly.

Scenario B: A user insults an LLM and it responds.

For the sake of simplicity, let's say... (read 167 more words →)

Replying toNo Clickbait - Misalignment Database

Vladimir Ivanov2y

No Clickbait - Misalignment Database

I think you copy patsed the wrong link - the first link leads to a form one can use to add an example, not to the list of examples.