Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://compphil.github.io/truth/

We're excited to share the first volume of Elements of Computational Philosophy, an interdisciplinary and collaborative project series focused on operationalizing fundamental philosophical notions in ways that are natively compatible with the current paradigm in AI.

The first volume paints a broad-strokes picture of operationalizing truth and truth-seeking. Beyond this high-level focus, its 100+ pages can be framed in several different ways, which is why we placed multiple topic-based summaries at the beginning of the document. The note to the reader and the table of contents should further help scope and navigate the document.

Have a pleasant read, and feel free to use this linkpost to comment on the document as you go. Questions, criticism, and suggestions are all welcome.

PS: There will soon be a presentation about the overarching project series as part of the alignment speaker series hosted by EleutherAI. Expect more information soon on the #announcements channel of their Discord server. In general, keep an eye on this space.

All I want for Christmas is a "version for engineers." Here's how we constructed the reward, here's how we did the training, here's what happened over the course of training.

My current impression is that the algorithm for deciding who wins an argument is clever, if computationally expensive, but you don't have a clever way to turn this into a supervisory signal, instead relying on brute force (which you don't have much of). I didn't see where you show that you managed to actually make the LLMs better arguers.

Connection between winning an argument and finding the truth continues to seem plenty breakable both in humans and in AIs.

Thanks a lot for the feedback!

All I want for Christmas is a "version for engineers." Here's how we constructed the reward, here's how we did the training, here's what happened over the course of training.

For sure, I greatly underestimated the importance of legible and concise communication in the increasingly crowded and dynamic space that is alignment. Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit of work that focused more on the conceptual and theoretical than on the applied, a goal for which I think it was very suitable (e.g. introducing an epistemological theory with direct applications to alignment).

My current impression is that the algorithm for deciding who wins an argument is clever, if computationally expensive, but you don't have a clever way to turn this into a supervisory signal, instead relying on brute force (which you don't have much of).

You mean ArgRank (i.e. PageRank on the argument graph)? The idea was simply to use ArgRank to assign rewards to individual utterances, then use the resulting context-utterance-reward triples as experiences for RL. After collecting experiences, update the weights, and repeat. Now, though, I'd rather do PEFT on the top utterances as a kind of expert iteration, which would also make it feasible to store previous model versions for league training (e.g. by just storing LoRA weight diffs).
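For the engineers in the thread, here is a minimal sketch of the reward step described above: plain power-iteration PageRank over a toy argument graph, with the resulting scores packed into context-utterance-reward triples. The edge semantics (an edge from one utterance to the utterance it backs), the uniform edge weights, the damping value, and all names (`pagerank`, `utterances`, `edges`, `experiences`) are assumptions for illustration, not the booklet's actual graph construction.

```python
from typing import List, Tuple


def pagerank(edges: List[Tuple[int, int]], n: int,
             damping: float = 0.85, iters: int = 100) -> List[float]:
    """Power-iteration PageRank over nodes 0..n-1; edges are (src, dst) pairs."""
    out: List[List[int]] = [[] for _ in range(n)]
    for src, dst in edges:
        out[src].append(dst)

    rank = [1.0 / n] * n
    for _ in range(iters):
        # Every node starts each round with the teleport mass.
        new = [(1.0 - damping) / n] * n
        for src in range(n):
            if out[src]:
                # Split this node's rank evenly among the nodes it points to.
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[src] / n
                for dst in range(n):
                    new[dst] += share
        rank = new
    return rank


# Hypothetical debate transcript, in the order the utterances were made.
utterances = ["claim A", "B backs A", "C backs A", "D backs B"]
# Assumed edge semantics: (i, j) means utterance i lends weight to utterance j.
edges = [(1, 0), (2, 0), (3, 1)]

rewards = pagerank(edges, len(utterances))

# Context = everything said before the utterance; reward = its PageRank score.
experiences = [
    (tuple(utterances[:i]), utterances[i], rewards[i])
    for i in range(len(utterances))
]
```

Each triple in `experiences` could then be handed to whatever RL or expert-iteration update is used; that part is left out here because the thread does not specify it.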

I didn't see where you show that you managed to actually make the LLMs better arguers.

Indeed, preliminary results are poor, and the bar was set pretty low at "somehow make these ideas run in this setup." For now, I'd drop ArgRank and instead use traditional methods from computational argumentation on an automatically encoded argument graph (see 5.2), then PEFT on the winning parties. But I'm also interested in extending CCS-like tools to improve ArgRank (see 2.5). I'm applying to AISC9 for related follow-up work (among others), and I'd find it really valuable if you could send me some feedback on the proposal summary. Could I send you a DM with it?
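To make "traditional methods from computational argumentation" concrete, here is one possible instance: the grounded extension of a Dung-style abstract argumentation framework. The comment above does not say which method is meant, so this choice, the function name `grounded_extension`, and the toy graphs are assumptions picked purely for illustration.

```python
from typing import Iterable, Set, Tuple


def grounded_extension(arguments: Iterable[str],
                       attacks: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the grounded extension: the arguments that survive skeptically.

    An argument is accepted once all of its attackers are defeated, and
    defeated once at least one of its attackers is accepted. Arguments that
    never reach either state (e.g. mutual attackers) stay undecided.
    """
    arguments = list(arguments)
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted: Set[str] = set()
    defeated: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in defeated:
                continue
            if attackers[a] <= defeated:
                # Unattacked, or every attacker has already been defeated.
                accepted.add(a)
                changed = True
            elif attackers[a] & accepted:
                # At least one accepted argument attacks this one.
                defeated.add(a)
                changed = True
    return accepted


# Chain: a attacks b, b attacks c. Here a wins, b falls, and c is reinstated.
chain_winners = grounded_extension(["a", "b", "c"], [("a", "b"), ("b", "c")])

# Mutual attack: neither side can be accepted skeptically.
standoff_winners = grounded_extension(["x", "y"], [("x", "y"), ("y", "x")])
```

Under this reading, the "winning parties" would be the debaters whose utterances land in the accepted set, though mapping arguments back to parties is not something the thread spells out.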

Connection between winning an argument and finding the truth continues to seem plenty breakable both in humans and in AIs.

Is it because of obfuscated arguments and deception, or some other fundamental issue that you find it so?

Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit of work that focused more on the conceptual and theoretical than on the applied, a goal for which I think it was very suitable (e.g. introducing an epistemological theory with direct applications to alignment).

Sounds good. I enjoyed at least 50% of the time I spent reading the epistemology :P I just wanted a go-to resource for specific technical questions.

Could I send you a DM with it? 

Sure, but no promises on interesting feedback.

Connection between winning an argument and finding the truth continues to seem plenty breakable both in humans and in AIs.

Is it because of obfuscated arguments and deception, or some other fundamental issue that you find it so?

Deception's not quite the right concept. More like exploitation of biases and other weaknesses. This can look like deception, or it can look like incentivizing an AI to "honestly" be searching for arguments in a way that just so happens to be shaped by the argument-evaluation process' standards other than truth.

Hello. I noticed that your proposal for achieving truth in LLMs involves using debate as a method. My concern with this approach is that an AI consists of many small components that aggregate to produce text or outputs. These components simply operate based on what they've learned. Therefore, the idea of clarifying or "deconfusing" these components in the service of truth through debate does not seem possible to me. But if I have misunderstood the concept, let me know too, thanks!

Thanks for the interest! I'm not really sure what you mean, though. By components, do you mean circuits or shards or...? I'm also not sure what you mean by clarifying or deconfusing components. This sounds like interpretability, but there's not much interpretability going on in the linked project. Feel free to elaborate, though, and I'll try to respond again.

Hello there! What I meant by components in my comment is things like the attention mechanism itself. For reference, here are the mean weights of two models I'm studying.