LESSWRONG

Jakub Halmeš

https://jakubhalmes.com/

Comments

Sorted by Newest

silentbob's Shortform
Jakub Halmeš · 1mo · 52

A couple of weeks ago, I was surprised to find out that you can create artifacts that call the Claude API. A silly example: a chat app where Claude always responds in capitalized text.
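
A minimal sketch of what that looks like, assuming the artifact runtime exposes a window.claude.complete(prompt) call that resolves to Claude's reply text (the exact interface is an assumption here, not confirmed by the comment):

```typescript
// Assumed interface (not verified): the artifact runtime exposes
// window.claude.complete(prompt), which resolves to Claude's reply text.
type ClaudeRuntime = { complete: (prompt: string) => Promise<string> };
const claude = (window as any).claude as ClaudeRuntime;

// The silly example: forward the user's message and uppercase the reply,
// so the "chat app" always responds in capitalized text.
async function sendMessage(userText: string): Promise<string> {
  const reply = await claude.complete(userText);
  return reply.toUpperCase();
}
```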

People Are Less Happy Than They Seem
Jakub Halmeš · 2mo · 10

Yes, I agree that the quoted statement is too strong, and many feelings are unnoticed or forgotten. 

Jakub Halmeš's Shortform
Jakub Halmeš · 8mo · 60

I wonder if you could take the R1-Zero training regime, penalize or restrict the use of existing words from any language (maybe only in the scratchpad, not the final response), and obtain a model that can solve math problems by reasoning in a non-existent language.
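
A hypothetical sketch of such a penalty, assuming it is shaped by the fraction of scratchpad tokens found in existing-language vocabularies; the vocabulary set, tokenization, and sign convention are all invented for illustration:

```typescript
// Hypothetical penalty for the idea above: score a scratchpad by the fraction
// of its tokens that are words in some existing language. `knownWords` (a
// vocabulary union across languages) and the tokenization are assumptions.
function existingWordPenalty(scratchpad: string, knownWords: Set<string>): number {
  const tokens = scratchpad.toLowerCase().match(/\p{L}+/gu) ?? [];
  if (tokens.length === 0) return 0;
  const known = tokens.filter((t) => knownWords.has(t)).length;
  return -known / tokens.length; // 0 if no real words appear, -1 if every token is one
}
```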

Jesse Hoogland's Shortform
Jakub Halmeš · 8mo · 40

"During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable."

I also found this trade-off between human readability and performance noteworthy.
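
A minimal sketch of how that proportion could be computed, assuming simple word segmentation and some word-level language identifier; the quoted passage defines the quantity but not the implementation:

```typescript
// Sketch of the language consistency reward described in the quote: the
// proportion of target-language words in the CoT. How words are segmented
// and identified is an assumption; only the proportion itself is specified.
function languageConsistencyReward(
  cot: string,
  isTargetLanguage: (word: string) => boolean,
): number {
  const words = cot.match(/\p{L}+/gu) ?? [];
  if (words.length === 0) return 0;
  return words.filter(isTargetLanguage).length / words.length;
}
```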

Jakub Halmeš's Shortform
Jakub Halmeš · 8mo · 10

Yes, "fair" here means that their subjective EVs are equal. The post referenced in the sibling comment calls it "Even Odds", which is probably a better name.

Jakub Halmeš's Shortform
Jakub Halmeš · 8mo · 10

I did not realize that. Thank you for the reference! 

Jakub Halmeš's Shortform
Jakub Halmeš · 8mo* · 70

If Alice thinks X happens with a probability of 20% while Bob thinks it's 40%, what would be a fair bet between them? 

I created a Claude Artifact that calculates a bet such that the expected value is the same for both players.

In this case, Bob wins if X happens (he thinks it's more likely). If Alice bets $100, he should bet $42.86, and the EV of such a bet for both players (according to their beliefs) is $14.29.

EDIT: I updated the calculator to correctly handle the case where A's probability is higher than B's.
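
A minimal sketch of the calculation behind such a calculator (the derivation is mine and reproduces the numbers above; it is not the artifact's actual code):

```typescript
// Even-odds bet: choose Bob's stake so that both players' subjective EVs are
// equal. pA and pB are Alice's and Bob's probabilities that X happens; the
// player assigning X the higher probability bets on X happening.
function evenOddsBet(pA: number, pB: number, stakeA: number): { stakeB: number; ev: number } {
  if (pB > pA) {
    // Bob backs X: solve pB*stakeA - (1-pB)*stakeB = (1-pA)*stakeB - pA*stakeA.
    const stakeB = (stakeA * (pA + pB)) / (2 - pA - pB);
    const ev = (stakeA * (pB - pA)) / (2 - pA - pB);
    return { stakeB, ev };
  }
  // Alice backs X (the case fixed in the EDIT): roles mirror the branch above.
  const stakeB = (stakeA * (2 - pA - pB)) / (pA + pB);
  const ev = (stakeA * (pA - pB)) / (pA + pB);
  return { stakeB, ev };
}

// Reproduces the numbers in the comment:
console.log(evenOddsBet(0.2, 0.4, 100)); // { stakeB: 42.857..., ev: 14.285... }
```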

The Inner Alignment Problem
Jakub Halmeš · 2y · 10

I wrote this mostly for personal purposes: I wanted to organize my thoughts about the problem while reading the paper, and publishing the notes, even if no one reads them, forces me to write more clearly and precisely.

I would like to get feedback on whether posts like this one have value for other people. Please let me know! Thank you.

Posts

19 · People Are Less Happy Than They Seem · 2mo · 6
1 · Jakub Halmeš's Shortform · 8mo · 7
1 · The Inner Alignment Problem · 2y · 1