Not directly related to the question, but Optimizers Qualitatively Alter Solutions And We Should Leverage This (2025) argues that the choice of optimizer (e.g. first-order methods like AdamW vs. second-order methods like Shampoo) affects not only the speed of convergence but also the properties of the final solution:
"Our position is that the choice of optimizer itself provides an effective mechanism to introduce an explicit inductive bias in the process, and as a community we should attempt to understand it and exploit it by developing optimizers aimed at converging to certain kinds of solutions. The additional implication of this stance is that the optimizer can and does affect the effective expressivity of the model class (i.e. what solutions we can learn). We argue that expressivity arguments that solely focus on the architecture design and/or data do not provide a complete picture and could be misleading for example if used to do model selection. The learning algorithm and choice of optimizer are also critical in shaping the characteristics of what are reachable functions and implicitly the final learned model."
An in-the-wild observation: the Kimi models feel quite different from the Llamas and Claudes. Kimi (and, I believe, the recent Qwen models) is trained with Muon+AdamW rather than AdamW alone, and I've seen plenty of anecdotes about how different Kimi's responses are from other models'. You can attribute some percentage of that to the data mix; MoonshotAI staff note they put a lot of effort into looking at and curating training data. But it's also possible that a non-trivial share of the behavior can be attributed to the optimizers used.
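For concreteness, here is a minimal sketch of the "Muon for hidden weight matrices, AdamW for everything else" hybrid recipe, assuming a plain PyTorch setup. It's a toy re-implementation for illustration (not Moonshot's training code or the official Muon package), but it shows where the choice actually gets made: which parameters get orthogonalized momentum updates and which get adaptive ones.

```python
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz iteration.

    This is the core of Muon: the momentum buffer is pushed toward the nearest
    semi-orthogonal matrix before being applied as a weight update.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon write-ups
    x = g / (g.norm() + eps)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.mT
    for _ in range(steps):
        A = x @ x.mT
        x = a * x + (b * A + c * A @ A) @ x
    return x.mT if transposed else x


class SimpleMuon(torch.optim.Optimizer):
    """Momentum SGD with orthogonalized updates, intended for 2D weight matrices only."""

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                update = newton_schulz_orthogonalize(buf)
                # Rescale so step sizes stay comparable across matrix shapes.
                update = update * max(1.0, p.size(0) / p.size(1)) ** 0.5
                p.add_(update, alpha=-group["lr"])


# The hybrid recipe: Muon on hidden 2D weight matrices, AdamW on everything else
# (in a real transformer: embeddings, norms, biases, and the output head).
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
optimizers = [
    SimpleMuon(matrix_params, lr=0.02),
    torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.01),
]
# In the training loop you'd call loss.backward(), then step() and zero_grad() on both.
```

Even in this toy form, the split is an explicit design decision about what kind of update geometry different parameters see, which is exactly the sort of inductive bias the position paper is pointing at.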
Do you watch sport climbing at all? There's a lot of fun overlap in skills and techniques between ANW and IFSC-style climbing.
As sport climbing matures as a sport, there are more videos of top-level climbers discussing how they balance their training and conditioning, and what they consider in or out of their style.
A general tip when editing a piece: ctrl+F for "I think" and cut 90-95% of the occurrences; it makes the piece better. The (respected) reader knows that everything you write is what you think, and a good piece reserves explicit hedging for the claims that need deeper consideration, not every claim.
What about readers you don't respect? Well, who cares what they think.
I want to clarify that this is Anthropic choosing to display a testimonial from one of their customers on the release page. Anthropic's team is editorializing, but I don't think they intended to communicate that the models are consistently completing tasks that require tens of hours (yet).
I included the organization and the person giving the testimonial to try to highlight this, but I should have made it explicit in the post.
 
re: how this updated my model of AI progress:
We don't quite know which updates improved this model's ability on long-horizon tasks, but it doesn't seem to have been a dramatic algorithmic or architectural change. It also doesn't seem to have been OOMs more training spend (in either FLOPs or dollars).
If I had to guess, it might be something like an attention variant (like this one by DeepSeek) that was "cheap" to retrain an existing model with, plus maybe some improvements to their post-training RL setup. Whatever it was, it seems (relatively) cheap.
These improvements are being found faster than I expected (based on METR's timeline) and more cheaply than I anticipated (I predicted that going from 2 hrs to 4 hrs would need some multiplicative increase in additional training plus 2-3 non-trivial algorithmic improvements). I suppose finding such an improvement is something Anthropic is capable of in a few months, but then verifying it and deploying it at scale in under half a year is quite impressive and not what I would've predicted.
Anthropic's new model seems quite good at longer-horizon tasks.
https://www.anthropic.com/news/claude-sonnet-4-5
a reminder of timelines
a claim
How are folks interpreting this?
what's the relationship between attachment and memory?
are people with good/great memories less sentimental?
if you're someone with less recall of events, especially ones that happened many years ago, do you feel more attached to holding onto mementos that act as triggers for special moments in your life? conversely, if you have a great memory, do you feel less attached to mementos because you can recall an event on demand?
I think the main challenge to monitoring agents is one of volume/scale, re:
Agents are much cheaper and faster to run than humans, so the amount of information and interactions that humans need to oversee will drastically increase.
But 1, 2, and 4 are issues humans already face when managing other humans.
This is a long-winded way of saying: I'm optimistic we can handle managing agents (with their current capabilities) by drawing analogies to how we already do management in effective organizations, and by finding avenues to scale those management strategies 100x (or however many OOMs one thinks we'll need).
I interpreted "loop" as referring to an OODA loop, where managers observe those they're managing, delegate an action, and then wait for feedback before going back to the beginning of the loop.
e.g. nowadays, I delegate a decent chunk of implementation work to coding agents, and the "loop" is me giving them a task, letting them riff on it for a few minutes, and then reviewing the output before requesting changes or committing it.
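To make that loop concrete, here's a toy sketch in Python; `run_agent` and `review` are hypothetical callables standing in for whatever agent harness and review process is actually in use, not a real API.

```python
from typing import Callable, Optional, Tuple


def delegate(
    task: str,
    run_agent: Callable[[str, Optional[str]], str],  # hypothetical: (task, feedback) -> proposed diff
    review: Callable[[str], Tuple[bool, str]],        # hypothetical: output -> (approved?, feedback)
    max_rounds: int = 3,
) -> Optional[str]:
    """One observe -> delegate -> review cycle, repeated until approval or the budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        output = run_agent(task, feedback)   # let the agent riff on it for a few minutes
        approved, feedback = review(output)  # the human-in-the-loop review step
        if approved:
            return output                    # commit / merge the result
    return None                              # budget exhausted: escalate or do it yourself
```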
Have you experimented with ComfyUI? It's image generation for "power users": many more knobs to turn to control what gets generated. There's actually a lot of room for steering during the diffusion/generation process, but it's hard to expose it in the same intuitive, general way as a language prompt. Comfy feels more like learning Adobe or Figma.