Eliciting Latent Knowledge
• Applied to Popular materials about environmental goals/agent foundations? People wanting to discuss such topics? by Q Home 2d ago
Dakara v1.3.0 Jan 17th 2025 GMT 1
• Applied to [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF by Leon Lang 3mo ago
• Applied to Clarifying Alignment Fundamentals Through the Lens of Ontology by eternal/ephemera 4mo ago
• Applied to Mechanistic Anomaly Detection Research Update by Nora Belrose 6mo ago
• Applied to Covert Malicious Finetuning by Tony Wang 7mo ago
• Applied to "What the hell is a representation, anyway?" | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents by IwanWilliams 8mo ago
• Applied to Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity. by Josh Levy 8mo ago
• Applied to CCS on compound sentences by Artyom Karpov 9mo ago
• Applied to AXRP Episode 29 - Science of Deep Learning with Vikrant Varma by DanielFilan 9mo ago
• Applied to Finding the estimate of the value of a state in RL agents by Clément Dumas 9mo ago
• Applied to Auditing LMs with counterfactual search: a tool for control and ELK by Jacob Pfau 1y ago
• Applied to Striking Implications for Learning Theory, Interpretability — and Safety? by RogerDearnaley 1y ago
• Applied to Measurement tampering detection as a special case of weak-to-strong generalization by ryan_greenblatt 1y ago
• Applied to Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar 1y ago
• Applied to Betting on what is un-falsifiable and un-verifiable by Abhimanyu Pallavi Sudhir 1y ago
• Applied to Eliciting Latent Knowledge in Comprehensive AI Services Models by acabodi 1y ago
• Applied to Robustness of Contrast-Consistent Search to Adversarial Prompting by Nandi 1y ago
• Applied to Discovering Latent Knowledge in the Human Brain: Part 1 – Clarifying the concepts of belief and knowledge by Joseph Emerson 1y ago