Walter Laurito

Comments

Try training token-level probes
Walter Laurito · 3mo · 10

I'm not aware of previous papers doing this but surely someone tried this before, I would welcome comments pointing to existing literature on this!

...
Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes[1] would be awesome to assist LLM-based monitor

If I remember correctly, they are doing something like that in this paper here:

3.3 EXACT ANSWER TOKENS

Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs’ unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer’s correctness, disregarding subsequent generated content.
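
For concreteness, here is a minimal sketch of what training a probe on exact-answer-token activations could look like. This is my own illustration, not code from the paper or the post: the `dataset` variable, the activation shapes, and the (start, end) span format are all assumptions, and the probe is just a scikit-learn logistic regression over hidden states.

```python
# Minimal sketch (assumptions, not the paper's code): train a linear probe on
# the hidden states at exact-answer-token positions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def gather_exact_answer_states(hidden_states, answer_token_spans):
    """hidden_states: (num_tokens, d_model) activations for one response.
    answer_token_spans: list of (start, end) index pairs marking the exact
    answer tokens, i.e. the tokens whose modification would change the
    answer's correctness."""
    return np.concatenate(
        [hidden_states[start:end] for start, end in answer_token_spans], axis=0
    )

# `dataset` is assumed to yield (hidden_states, answer_token_spans, correct)
# per response, with correct = 1 if the answer was right, 0 otherwise.
X, y = [], []
for states, spans, correct in dataset:
    answer_states = gather_exact_answer_states(states, spans)
    X.append(answer_states)
    y.extend([correct] * len(answer_states))

probe = LogisticRegression(max_iter=1000)
probe.fit(np.concatenate(X, axis=0), np.array(y))

# At inference time, running probe.predict_proba over every generated token's
# hidden state gives a per-token correctness signal ("right there" probing).
```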

There should be more AI safety orgs
Walter Laurito · 2y* · 107

and Kaarel’s work on DLK

@Kaarel is the research lead at Cadenza Labs (previously called NotodAI), our research group, which started during the first part of SERI MATS 3.0. (There will be more information about Cadenza Labs soon, hopefully!)

Our team members broadly agree with the post! 

Currently, we are looking for further funding to continue to work on our research agenda. Interested funders (or potential collaborators) can reach out to us at info@cadenzalabs.org.

AI Safety Needs Great Engineers
Walter Laurito · 4y · 10

Should work again :)

AI Safety Needs Great Engineers
Walter Laurito · 4y* · 20

I've created a Discord for people interested in organizing / collaborating / self-study: https://discord.gg/Ckj4BKUChr

People could start with the brief curriculum published in this document until a full curriculum becomes available :)

AI Safety Needs Great Engineers
Walter Laurito · 4y · 50

Maybe we could also send an invitation to join a Slack channel to all the people who got rejected. (I could set that up if necessary; since I don't have the emails, though, someone else would need to send the invitations.) There, based on the curriculum, people could form self-study groups on their own with others nearby (or remotely) and talk about difficulties, bugs, etc. Maybe even the people who weren't rejected could join the Slack and help answer questions (if they like and have time, of course)?

AI Safety Needs Great Engineers
Walter Laurito · 4y* · 30

Same here (not sure yet if I'll get accepted to AISC, though). But I would be happy to help with or co-organize something like Richard_Ngo suggested (although I've never organized anything like that before). Maybe a virtual version in (Continental?) Europe, if there are enough people?

Posts

Finding the estimate of the value of a state in RL agents (1y; 8 karma; 4 comments)
Searching for a model's concepts by their shape – a theoretical framework (2y; 51 karma; 0 comments)
[RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision" (2y; 48 karma; 6 comments)