CloudMatrix was not, but Huawei Ascend has been around for a long time and was used to train LLMs as far back as 2022. I didn't realize AI 2027 predated CloudMatrix, but I still think ignoring China in the Compute Production forecast was unjustified.
This is a good argument and I think it is mostly true, but it absolutely should be on the AI 2027 Compute Forecast page. Simply not saying a word about the topic makes it look unserious and incompetent. In fact, that reaction came up repeatedly in my discussions with friends in South Korea.
I know the cyber eval results reflect under-elicitation. Sonnet 4 can find zero-day vulnerabilities; we are now in the process of disclosing them. If you can't get it to do that, it's a skill issue on your end.
Preordered the ebook version on Amazon. I am also interested in doing a Korean translation.
I disagree on DeepSeek and innovation. Yes, R1 is obviously a reaction to o1, but its MoE architecture is genuinely innovative, and it is Llama 4 that obviously copied DeepSeek. I do agree that innovation is unpopular in China, but from interviews with DeepSeek founder Liang Wenfeng, we know DeepSeek was explicitly an attempt to overcome China's unwillingness to innovate.
Maybe we are talking about different problems, but we found that instructing models to give up (literally "give up"; I just checked the source) under certain conditions is effective.
Our experience so far is that while reasoning models don't improve performance directly (3.7 is better than 3.6, but 3.7 with extended thinking is NOT better than 3.7), they do help indirectly, because the thinking trace helps us debug prompts and tool output when models misunderstand them. This was not the result we expected, but it is what we observe.
I happen to work on the exact same problem (application security pentesting) and I can confirm I observe the same. Sonnet 3.5/3.6/3.7 were big releases, other releases didn't help, etc.

As for OpenAI's o-series models, we are still debating whether it is a model capability problem or an elicitation problem, because from interactive usage it seems clear they need different prompting, and we haven't yet seriously optimized our prompting for the o-series.

Evaluation is scarce, but we built something along the lines of CWE-Bench-Java discussed in this paper; it was a major effort, and we are reasonably sure we can evaluate.

As for grounding, fighting false positives, and keeping models from reporting "potential" problems just to sound useful, we found grounding on code coverage to be effective. Run JaCoCo and tell models PoC || GTFO, where the PoC is structured as a vulnerability description with the source file and line plus a triggering input. Then write an oracle verifier for this PoC: at the very least you can confirm that execution reaches the claimed line, in a way models can't fake (a minimal sketch of such a verifier is below).
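To make the oracle idea concrete, here is a minimal sketch in Java (since the benchmark is Java-based) of a verifier that checks a PoC's claimed file and line against a JaCoCo XML report generated by running the triggering input. The report path, class name, and line number in `main` are hypothetical placeholders; a real verifier would also validate the triggering input itself and the claimed vulnerability class.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.File;

/**
 * Sketch of a PoC oracle: given a JaCoCo XML report produced by running the
 * model's triggering input, check that execution actually reached the source
 * file and line the model claims is vulnerable.
 */
public class PocLineOracle {

    /** Returns true if the JaCoCo report shows the given source line was executed. */
    public static boolean lineWasCovered(File jacocoXml, String sourceFileName, int line)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // JaCoCo reports declare a DOCTYPE (report.dtd); skip loading it so parsing works offline.
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        Document doc = factory.newDocumentBuilder().parse(jacocoXml);

        NodeList sourceFiles = doc.getElementsByTagName("sourcefile");
        for (int i = 0; i < sourceFiles.getLength(); i++) {
            Element sf = (Element) sourceFiles.item(i);
            if (!sourceFileName.equals(sf.getAttribute("name"))) continue;

            NodeList lines = sf.getElementsByTagName("line");
            for (int j = 0; j < lines.getLength(); j++) {
                Element ln = (Element) lines.item(j);
                if (Integer.parseInt(ln.getAttribute("nr")) == line) {
                    // "ci" = covered instructions on this line; > 0 means the line executed.
                    return Integer.parseInt(ln.getAttribute("ci")) > 0;
                }
            }
        }
        // File or line never reached: reject the "potential" finding.
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: the PoC claims Foo.java:42 is reachable with the provided input.
        boolean ok = lineWasCovered(new File("target/site/jacoco/jacoco.xml"), "Foo.java", 42);
        System.out.println(ok ? "PoC reached the claimed line" : "PoC rejected: line not executed");
    }
}
```

Coverage alone is a weak oracle, but it is cheap and it hard-blocks the most common failure mode: models reporting code paths that the provided input never executes.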
OpenAI wasted a whole year between GPT-3 and GPT-4. (Source: Greg Brockman said this at an OpenAI developer event.) So yes, I think OpenAI was 12+ months ahead at one point.
Anthropic does have a data program, although it is only for Claude Code, and it is opt-in. See "About the Development Partner Program". It gives you a 30% discount in exchange.