Anthropic at least does quite a few behavioral assessments, covering things such as susceptibility to sabotage, reward hacking, alignment faking, etc. Other labs such as OpenAI and Google don't seem as interested in contributing to this research. In those specific cases, misalignment is pretty rare, since it only shows up when the models are forced into specific, "stressful" situations; they would not normally act this way. In other cases that you've discussed before, such as sloppy choices, sycophancy, and inability to notice or discuss flaws/errors, I agree that these are egregious enough in current LLMs to be worrying.