
Nandi

Posts


Wikitag Contributions

Comments

Splitting Debate up into Two Subsystems
Nandi · 5y

I agree that if you score an oracle based on how accurate it is, then it is incentivized to steer the world towards states where easy questions get asked.

I think it matters in these considerations how powerful we assume the agent to be. You made me realize that my post could have been more interesting if I had specified the scope and detailed the intended application area of the proposed approach. In many cases, making the world more predictable may be much harder for the agent than helping the human predict the world better. In the short term, I think deploying an agentic oracle could be safe.

Splitting Debate up into Two Subsystems
Nandi · 5y
I think Bostrom might have mentioned this problem (educating someone on a topic) somewhere.

Cool! I'm not familiar with it.

Splitting Debate up into Two Subsystems
Nandi · 5y

If the epistemic helper can explain enough to us that we can come up with solutions ourselves, then the info helper is just as useful on its own.

However, even if we get educated about a domain or problem, we may sometimes not be creative enough to propose good solutions ourselves. In such cases we would need an agent to propose options to us. It would be good if the agent that is trained to come up with solutions we approve of is not the same agent that explains why we should or should not approve of a solution (because if it were, it would have an incentive to convince us).

Intricacies of Feature Geometry in Large Language Models · 71 points · 7mo · 0 comments
The Geometry of Feelings and Nonsense in Large Language Models · 61 points · 10mo · 10 comments
[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs · 37 points · 10mo · 2 comments
Robustness of Contrast-Consistent Search to Adversarial Prompting · 18 points · 2y · 1 comment
Machine Unlearning Evaluations as Interpretability Benchmarks (Ω) · 33 points · 2y · 2 comments
Splitting Debate up into Two Subsystems · 13 points · 5y · 5 comments
Acknowledging Human Preference Types to Support Value Learning (Ω) · 34 points · 7y · 4 comments