I watched OpenAI's latest livestream from Oct 28th 2025 (after the news that OpenAI has transitioned into a public benefit corporation). I found four parts of particular interest to the AI safety community.
07:00 minutes in.
These internal dates, we may be completely wrong about them, but this is what we currently think.
~08:17. Five layers, ranging from factors that are most internal to the model to most external.
What we believe is the most important long-term safety question for superintelligence is value alignment. What does the AI fundamentally care about? Can it adhere to higher-level principles? What will it do when given unclear and conflicting objectives? Does it love humanity? The reason we believe these high-level objectives or principles driving the AI are so important is that as we get to systems that think for very long stretches, as they become very smart, as they tackle problems that are at the edge of or beyond human ability, really getting to complete specifications becomes quite difficult. And so we have to rely on this deeper alignment.
10:02. Then there's goal alignment. How does the agent interact with people? How does it follow instructions?
Uhhh...what?
10:15. Then reliability. Can the AI correctly calibrate its predictions? Can it be reliable on easy tasks and express uncertainty on hard ones? Can it deal with unfamiliar environments?
10:30. Then adversarial robustness, which is very related to reliability but concerns adversarial settings. Can the AI withstand targeted attacks from human or AI adversaries?
10:45. Then the outer layer, systemic safety: guarantees on the overall behaviour of the entire system that do not rely on the AI's intelligence or alignment. For example, this can be security, what data the AI has access to, or what devices it can use.
11:10. We have seen quite a lot of progress just from the general development and improved understanding of deep learning as a whole.
11:30. Value alignment is a hard problem. Definitely not solved yet. However, there is a promising new tool that aids our study.
That is chain-of-thought (CoT) faithfulness.
Something we invest in very heavily. Starting with our first reasoning models, we have been pursuing this new direction in interpretability. The idea is to keep parts of the model's internal reasoning free from supervision: don't look at it during training, just let it be representative of the model's internal process. We refrain from guiding the model to think 'good thoughts' and so let it remain a bit more faithful to what it actually thinks.
This is not guaranteed to work of course. We cannot make mathematical proofs about deep learning. So this is something we study. There are two reasons to be optimistic.
One reason is that we have seen promising empirical results. This is technology we have employed a lot internally. We use it to understand how our models train and how their propensities evolve over training. We have also had successful external collaborations, for example on investigating models' propensity to scheme.
Secondly, it is scalable, in the sense that we keep the training objective from being adversarial to our ability to monitor the model. And of course the objective not being adversarial to our ability to monitor the model is only half the battle; ideally you want to get the model to help with monitoring. This is something we're researching quite heavily.
One important thing to underscore is that it is somewhat fragile. It really requires drawing this clean boundary, having this clear abstraction, and having restraint in the ways we access the CoT. This runs from algorithm design all the way to how we design our products. For example, if you look at CoT summaries in ChatGPT: if we didn't have the summariser and just made the CoT fully visible, that would make it part of the overall experience and would make it very difficult to keep it out of supervision. So long term, we believe that by maintaining some of this controlled privacy for the models, we can retain the ability to understand their inner process. We believe this can be a very powerful technique as we move towards these very capable, long-running systems.
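To make the "don't supervise the CoT" idea concrete, here is a minimal Python sketch of my own, not OpenAI's actual pipeline: the training signal is computed from the final answer only, while a separate read-only monitor inspects the chain of thought for offline analysis (e.g. tracking scheming-related propensities over training) and never feeds back into the objective. All names here (Rollout, grade_answer, cot_monitor) are hypothetical stand-ins.

```python
# Sketch only: illustrates keeping the CoT out of the training objective
# while still using it for monitoring. Not OpenAI's implementation.

from dataclasses import dataclass

@dataclass
class Rollout:
    chain_of_thought: str  # internal reasoning tokens (kept private)
    final_answer: str      # the part that is graded / shown to users

def grade_answer(answer: str, reference: str) -> float:
    """Reward computed from the final answer only."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def cot_monitor(chain_of_thought: str) -> dict:
    """Read-only monitor: flags suspicious reasoning for researchers.
    Its output is never fed back into the training signal."""
    flags = [w for w in ("deceive", "hide", "pretend")
             if w in chain_of_thought.lower()]
    return {"flagged": bool(flags), "terms": flags}

def training_signal(rollout: Rollout, reference: str) -> float:
    # The crucial restraint: the reward depends on final_answer alone,
    # so the optimizer has no direct pressure to make the CoT "look good".
    return grade_answer(rollout.final_answer, reference)

def offline_analysis(rollouts: list[Rollout]) -> None:
    # Monitoring happens out-of-band, e.g. to study how propensities
    # evolve over training, without shaping the CoT itself.
    for r in rollouts:
        report = cot_monitor(r.chain_of_thought)
        if report["flagged"]:
            print("monitor flag:", report["terms"])
```

The design point this is meant to capture is the boundary the speaker describes: the monitor can read the CoT, but because its output never touches the reward, there is no optimisation pressure for the model to game what the monitor sees.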
As Zvi always says, at least they are sharing their plans instead of keeping them private.
37:45. [During the live Q&A] Ronak asks, "What is OpenAI's stance on partnering with labs like Anthropic, Gemini or x.AI for joint research, compute sharing and safety efforts?" We think this is going to be increasingly important on the safety front. Labs will need to share safety techniques and safety standards. You can imagine a time when the whole world would say, 'Before we hit a recursive self-improvement phase, we really need to all carefully study this together.' We welcome that collaboration. I think that will be quite important.
Sam Altman talks the talk, but will he walk the walk?