Not directly related to the question, but Optimizers Qualitatively Alter Solutions And We Should Leverage This (2025) argues that the choice of optimizer (e.g. first-order methods like AdamW vs. second-order methods like Shampoo) affects not only the speed of convergence but also the properties of the final solution:
"Our position is that the choice of optimizer itself provides an effective mechanism to introduce an explicit inductive bias in the process, and as a community we should attempt to understand it and exploit it by developing optimizers aimed at converging to certain kinds of solutions. The additional implication of this stance is that the optimizer can and does affect the effective expressivity of the model class (i.e. what solutions we can learn). We argue that expressivity arguments that solely focus on the architecture design and/or data do not provide a complete picture and could be misleading for example if used to do model selection. The learning algorithm and choice of optimizer are also critical in shaping the characteristics of what are reachable functions and implicitly the final learned model."
An in-the-wild observation: the Kimi models feel quite different from the Llamas and Claudes. Kimi (and, I believe, the recent Qwen models) is trained with Muon+AdamW rather than AdamW alone, and I've seen plenty of anecdotes about how different Kimi's responses are from other models'. You can attribute some percentage of that to the data mix; MoonshotAI staff note they put a lot of effort into looking at and curating training data. But it's also possible that a non-trivial share of the behavior can be attributed to the optimizers used.
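For concreteness, here is a minimal sketch of the "Muon for hidden weight matrices, AdamW for everything else" hybrid recipe, assuming a plain PyTorch setup. It's a toy re-implementation for illustration (not Moonshot's training code or the official Muon package), but it shows where the choice actually gets made: which parameters get orthogonalized momentum updates and which get adaptive ones.

```python
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz iteration.

    This is the core of Muon: the momentum buffer is pushed toward the nearest
    semi-orthogonal matrix before being applied as a weight update.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon write-ups
    x = g / (g.norm() + eps)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.mT
    for _ in range(steps):
        A = x @ x.mT
        x = a * x + (b * A + c * A @ A) @ x
    return x.mT if transposed else x


class SimpleMuon(torch.optim.Optimizer):
    """Momentum SGD with orthogonalized updates, intended for 2D weight matrices only."""

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                update = newton_schulz_orthogonalize(buf)
                # Rescale so step sizes stay comparable across matrix shapes.
                update = update * max(1.0, p.size(0) / p.size(1)) ** 0.5
                p.add_(update, alpha=-group["lr"])


# The hybrid recipe: Muon on hidden 2D weight matrices, AdamW on everything else
# (in a real transformer: embeddings, norms, biases, and the output head).
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
optimizers = [
    SimpleMuon(matrix_params, lr=0.02),
    torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.01),
]
# In the training loop you'd call loss.backward(), then step() and zero_grad() on both.
```

Even in this toy form, the split is an explicit design decision about what kind of update geometry different parameters see, which is exactly the sort of inductive bias the position paper is pointing at.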
Do you watch sport climbing at all? There's a lot of fun overlap in skills and techniques between ANW and IFSC-style climbing.
As sport climbing matures as a sport, there are more videos of top-level climbers discussing how they balance their training and conditioning, and what they consider in or out of their style.
A general tip when editing a piece: ctrl+F for "I think" and cut 90-95% of the occurrences; it makes the piece better. The (respected) reader knows that everything you write is what you think, and a good piece reserves explicit hedging for the claims that need deeper consideration, not every claim.
What about readers you don't respect? Well, who cares what they think.
I want to clarify that this is Anthropic choosing to display a testimonial from one of their customers on the release page. Anthropic's team is editorializing, but I don't think they intended to communicate that the models are consistently completing tasks that require tens of hours (yet).
I included the organization and the person giving the testimonial to try to highlight this, but I should have made it explicit in the post.
 
re: how this updated my model of AI progress:
We don't quite know which updates improved this model's ability on long-horizon tasks, but it doesn't seem to have been a dramatic algorithmic or architectural change. It also doesn't seem to have been OOMs more training spend (in either FLOPs or dollars).
If I had to guess, it might be something like an attention variant (like this one by DeepSeek) that was "cheap" to retrain an existing model with, plus maybe some improvements to their post-training RL setup. Whatever it was, it seems (relatively) cheap.
These improvements are being found faster than I expected (based on METR's timeline) and more cheaply than I anticipated (I predicted that going from 2 hrs to 4 hrs would need some multiplicative increase in additional training plus 2-3 non-trivial algorithmic improvements). I suppose finding such an improvement is something Anthropic is capable of in a few months, but then verifying it and deploying it at scale in under half a year is quite impressive and not what I would've predicted.
Anthropic's new model seems quite good at longer-horizon tasks.
https://www.anthropic.com/news/claude-sonnet-4-5
a reminder of timelines
a claim
How are folks interpreting this?
what's the relationship between attachment and memory?
are people with good/great memories less sentimental?
if you're someone with less recall of events, especially ones that happened many years ago, do you feel more attached to holding onto mementos that act as triggers for special moments in your life? conversely, if you have a great memory, do you feel less attached to mementos because you can recall an event on demand?
I think the main challenge to monitoring agents is one of volume/scale, re:
Agents are much cheaper and faster to run than humans, so the amount of information and interactions that humans need to oversee will drastically increase.
But 1, 2, and 4 are issues humans already face when managing other humans.
This is a long-winded way of saying: I'm optimistic we can handle managing agents (with their current capabilities) by drawing analogies to how we already do management in effective organizations, and by finding avenues to scale those management strategies 100x (or however many OOMs one thinks we'll need).
I interpreted "loop" as referring to an OODA loop, where managers observe those they're managing, delegate an action, and then wait for feedback before going back to the beginning of the loop.
e.g. nowadays, I delegate a decent chunk of implementation work to coding agents, and the "loop" is me giving them a task, letting them riff on it for a few minutes, and then reviewing the output before requesting changes or committing it.
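To make that loop concrete, here's a toy sketch in Python; `run_agent` and `review` are hypothetical callables standing in for whatever agent harness and review process is actually in use, not a real API.

```python
from typing import Callable, Optional, Tuple


def delegate(
    task: str,
    run_agent: Callable[[str, Optional[str]], str],  # hypothetical: (task, feedback) -> proposed diff
    review: Callable[[str], Tuple[bool, str]],        # hypothetical: output -> (approved?, feedback)
    max_rounds: int = 3,
) -> Optional[str]:
    """One observe -> delegate -> review cycle, repeated until approval or the budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        output = run_agent(task, feedback)   # let the agent riff on it for a few minutes
        approved, feedback = review(output)  # the human-in-the-loop review step
        if approved:
            return output                    # commit / merge the result
    return None                              # budget exhausted: escalate or do it yourself
```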
Have you experimented with ComfyUI? It's image generation for "power users": many more knobs to turn to control what gets generated. There's actually a lot of room for steering during the diffusion/generation process, but it's hard to expose it in the same intuitive, general way as a language prompt. Comfy feels more like learning Adobe or Figma.