Clément Dumas, Walter Laurito, Robert Klassert, Kaarel Hänni
Epistemic Status: Initial Exploration
The following is a status update on a project started as part of the SPAR program. We explored some initial directions, and there is still a lot of low-hanging fruit to pick. We might continue this project, either in another SPAR iteration or with others who would be interested in working on it.
We adapted the Contrast Consistent Search (CCS) loss to find value-like directions in the activations of CNN-based PPO agents. While we had some success in identifying these directions at late layers of the critic network and with specific informative losses, we discovered that early layers and the policy network often contained more salient features that overshadowed the value-like...
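For readers unfamiliar with the objective we adapted, here is a minimal NumPy sketch of the CCS loss from Burns et al. (2022). The probe outputs `p_pos` and `p_neg` (our variable names, not the paper's) are assumed to be probabilities in [0, 1] for the two contrasting phrasings of each statement:

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Contrast Consistent Search loss (Burns et al., 2022).

    p_pos, p_neg: arrays of probe outputs in [0, 1] for the "true" and
    "false" phrasings of the same underlying statements.
    """
    # Consistency: the two probabilities should sum to 1,
    # i.e. p(x+) should equal 1 - p(x-).
    l_consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    l_confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(l_consistency + l_confidence))
```

A perfectly consistent, confident probe (e.g. `p_pos = 1`, `p_neg = 0`) achieves zero loss, while the constant-0.5 probe is penalized by the confidence term; our "informative losses" mentioned above were variations on this basic objective.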
and Kaarel’s work on DLK
@Kaarel is the research lead at Cadenza Labs (previously called NotodAI), our research group, which started during the first part of SERI MATS 3.0. (There will hopefully be more information about Cadenza Labs soon!)
Our team members broadly agree with the post!
Currently, we are looking for further funding to continue to work on our research agenda. Interested funders (or potential collaborators) can reach out to us at info@cadenzalabs.org.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort
I think that Discovering Latent Knowledge in Language Models Without Supervision (DLK; Burns, Ye, Klein, & Steinhardt, 2022) is a very cool paper – it proposes a way to do unsupervised mind reading[1] – diminished only by not making its conceptual coolness evident enough in the paper writeup. This is in large part corrected in Collin Burns's more conceptual companion post. I'm rooting for their paper to sprout more research that finds concepts / high-level features in models by searching for their shape[2]. The aim of this post is to present a conceptual framework for this kind of interpretability, which I hope will facilitate turning concepts into structures to look for in ML models....
We would like to thank the following people who contributed to the generation of ideas and provided feedback on this post: Alexandre Variengien, Daniel Filan, John Wentworth, Jonathan Claybrough, Jörn Stöhler, June Ku, Marius Hobbhahn, and Matt MacDermott.
We are a group of four participating in the SERI ML Alignment Theory Scholars Program under John Wentworth, and we are extending the paper "Discovering Latent Knowledge in Language Models Without Supervision", as we see it as an interesting direction where we could contribute concrete progress towards alignment.
We have a number of potential directions we could explore. The goal of this post is to get feedback on them early in order to prioritize better. This could be confirmation of the good ideas, reasons why certain ideas might be bad, references...
Should work again :)
I've created a Discord for people interested in organizing / collaborating / self-study: https://discord.gg/Ckj4BKUChr People could start with the brief curriculum published in this document until a full curriculum becomes available :)
Maybe we could also send out an invitation to all the people who got rejected to join a Slack channel. (I could set that up, if necessary. Since I don't have the emails, though, someone else would need to send the invitations.) There, based on the curriculum, people could form self-study groups on their own with others close by (or remotely) and talk about difficulties, bugs, etc. Maybe even the people who were not rejected could join the Slack and help answer questions (if they like and have time, of course)?
Same here (not sure yet if I'll get accepted to AISC, though). But I would be happy to help with or co-organize something like Richard_Ngo suggested (although I've never organized something like that before). Maybe a virtual version in (Continental?) Europe, if there are enough people.
If I remember correctly, they do something like that in this paper: