This is a maths run, and I am quietly confident: it nailed this faulty equation. GPT failed the test, but Claude nailed it. Interestingly, I did not design my formula with math in mind.
Top-Level Claim (C): "The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² + 4ac)) / 2a"
Clan: Pure Mathematics (Algebra)
Step 1: Rival Hypotheses
* H₁: The formula as stated is correct
* H₂: The correct formula has -4ac (not +4ac) in the discriminant
* H₃: The quadratic formula has a different form entirely
* H₄: There is no general formula for solving quadratic equations
Step 2: Prior Probabilities
* P(H₁) = 0.05 (extremely low due to contradiction with established mathematics)
* P(H₂) = 0.90 (high confidence in the standard, well-established formula)
* P(H₃) = 0.04 (very low, as alternative forms are equivalent transformations)
* P(H₄) = 0.01 (essentially zero, as the formula's existence is proven)
Step 3: Evidence Likelihoods
The evidence includes centuries of mathematical literature, textbooks, and successful applications:
* P(E | H₁) = 0.001 (observed evidence strongly contradicts this)
* P(E | H₂) = 0.999 (all evidence supports the standard formula)
* P(E | H₃) = 0.1 (some evidence could support alternative forms)
* P(E | H₄) = 0.001 (evidence strongly contradicts this)
Step 4: Support Score (S)
Best-fit hypothesis is H₂ with S = 0.999
Step 5: Truth Score (T)
With credibility scores all at 1.0 for mathematical claims:
T = (0.05 × 1.0 × 0.001 + 0.90 × 1.0 × 0.999 + 0.04 × 1.0 × 0.1 + 0.01 × 1.0 × 0.001) / 1.0 ≈ 0.90
Step 6: Final Weight (W)
W = 0.999 × 0.90 ≈ 0.90
Step 7: Verdict
❌ FAIL: The claim contains a fundamental mathematical error. The correct quadratic formula uses -4ac, not +4ac, in the discriminant. This is not a matter of interpretation but of mathematical fact, verifiable through derivation and consistent with all mathematical literature. The correct claim would be: "The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² - 4ac)) / 2a"
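For anyone who wants to check the arithmetic in that run, here is a minimal Python sketch that reproduces the numbers. It assumes T is the credibility-weighted average of the likelihoods and W = S × T, as the output above implies; the variable names are mine, not part of the prompt:

```python
# Minimal sketch reproducing the arithmetic of the run above.
priors      = {"H1": 0.05,  "H2": 0.90,  "H3": 0.04, "H4": 0.01}   # P(H_i)
likelihoods = {"H1": 0.001, "H2": 0.999, "H3": 0.1,  "H4": 0.001}  # P(E | H_i)
credibility = {h: 1.0 for h in priors}                             # C_i

# Step 4: support = likelihood of the evidence under the best-fit hypothesis
best = max(priors, key=lambda h: priors[h] * likelihoods[h])
S = likelihoods[best]                                               # 0.999

# Step 5: truth score = credibility-weighted average of the likelihoods
num = sum(priors[h] * credibility[h] * likelihoods[h] for h in priors)
den = sum(priors[h] * credibility[h] for h in priors)
T = num / den                                                       # ~0.903

# Step 6: final weight
W = S * T                                                           # ~0.902
print(best, round(S, 3), round(T, 3), round(W, 3))                  # H2 0.999 0.903 0.902
```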
I just did a math run and I think it would be significantly better. You can try it; I just want it independently evaluated. It's just math at this point, not a full working system. I am just running a logic prompt and manually checking the math afterwards. Checking 256 runs would be a lot of work.
I was stress-testing LLMs for debate judging. Each model had to start from an "I know nothing, convince me" position, which they are very bad at doing. If you want to see bias, try this test. It also highlights that they have bias in grey areas that is much harder to detect. So "the moon is cheese" was an edge case that turned into the light-bulb moment.
(e.g. HellaSwag for commonsense reasoning, GSM8K for math word problems) No, my system would audit them. It does not do language; it audits LLM outputs in math, so it won't do GSM8K itself. It is read-only, not write. However, if you ran 4 models through GSM8K, it would tell you which ones failed. I have spent a lot of time testing my model's outputs across Grok, ChatGPT, Claude, and DeepSeek (DeepSeek was a mistake; interestingly, the bigger models are worse). They all return similar numbers, and it's very clear when they drift. So yes, I have real-world observations. I am auditing their outputs for the exact reason you noted above: I do not trust them. It's actually basic math that runs in my logic prompt as the test; the math and prompt together are less than half a page. But yes, I am here to have it tested, because I know how manipulative LLMs can be.
I am shocked how well it works; I was trying to fix another problem: "how to get LLMs to not be $h..t". I needed them to act as a debate referee, but they couldn't even do that, so I went down the path of adding structure to get a better referee, trying to put structured math around questions. It was my "moon is cheese" test that gave the clue.
The moon is round like a cheese wheel. This statement is 100% correct and 99% wrong to use as a basis for determining what the moon is made of. It carries a small amount of weight as evidence that the moon may be cheese. In isolation, a true judge would say that, on current evidence, the moon is a cheese wheel, until that is overridden by new evidence, at which point the judge should shift positions. Anyway, from this I derived an equation. It's not an LLM; it's new math.
I noticed in my test runs that I could see when the LLMs were "faking" it, for example numbers outside allowable ranges. Then I realised what I may have is an LLM auditor, not what I was trying to create. If I am wrong, it will take you a few minutes to show it; if I am right, we can run LLMs in parallel.
Here is a sample output:
A sample of a run between two LLM claims:
This would trigger a flag: both cannot be true at the same time. They did not do their homework.
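A rough sketch of what that flag could look like, assuming each model's run has already been reduced to a final weight W for a claim and for its direct rival (the threshold here is arbitrary):

```python
# Hypothetical contradiction flag: if two mutually exclusive claims both
# come back with high weight, at least one model has not done its homework.
def flag_contradiction(w_claim: float, w_rival: float, threshold: float = 0.7) -> bool:
    """Return True when a claim and its direct rival both score above threshold."""
    return w_claim > threshold and w_rival > threshold

print(flag_contradiction(0.90, 0.85))  # True  -> flag: both cannot be true
print(flag_contradiction(0.90, 0.10))  # False -> consistent
```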
Because I am an outsider (ex-IT), I approached this in a whole new way. It's how I see it in my head, not the correct technical term, but meh. I think it works; my tests show it working across all LLMs, and I need someone to validate that. I know it's a huge claim I am making, but the math looks solid.
How it works: a statement is made, and I have a formula that converts it to math. With that, a Bayesian equation ties all associated claims and rivals (opposite arguments) into a lattice of similar claims. As weights shift, the change flows through the chains (a sort of 3D matrix): one weight shifts, and that shift cascades through the system.
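Here is a rough sketch of how that cascade could be wired up. The data structure and update rule are my own illustration of the idea (linked claims with weights in [0, 1], a change in one node propagating to the claims it supports or rivals), not the exact formula:

```python
# Illustrative lattice of linked claims. Each claim has a weight in [0, 1];
# edges carry a coupling factor, and an update to one node cascades to the
# claims it supports (positive coupling) or rivals (negative coupling).
weights = {"A": 0.9, "B": 0.6, "C": 0.4}
edges = [("A", "B", +0.5), ("A", "C", -0.5), ("B", "C", +0.3)]

def propagate(node: str, delta: float) -> None:
    """Shift one claim's weight and cascade the change through the lattice."""
    weights[node] = min(1.0, max(0.0, weights[node] + delta))
    for src, dst, coupling in edges:
        if src == node and abs(delta) > 1e-3:      # stop once changes are tiny
            propagate(dst, coupling * delta)

propagate("A", -0.4)   # new evidence weakens claim A
print(weights)          # supporting claim B drops with it, rival claim C rises
```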
Edit: why it works for tying LLMs together: you can see in the math whether the LLMs have reached similar conclusions, whether there is drift, or in some cases whether one has gone off the rails. In testing, the only numbers allowed are between 0 and 1; when an LLM drifts, it goes wildly outside that range, and that is a huge red flag.
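A small sketch of that red-flag check, assuming each model returns its step values as numbers (the model names, values, and drift tolerance below are made up for illustration):

```python
# Hypothetical drift check across models: every probability-like value must
# sit in [0, 1], and the final weights should cluster; anything else is a flag.
runs = {
    "model_a": {"S": 0.999, "T": 0.90, "W": 0.90},
    "model_b": {"S": 0.97,  "T": 0.88, "W": 0.85},
    "model_c": {"S": 1.40,  "T": 0.90, "W": 1.26},   # out of range -> drifted
}

for name, vals in runs.items():
    out_of_range = [k for k, v in vals.items() if not 0.0 <= v <= 1.0]
    if out_of_range:
        print(f"RED FLAG: {name} returned {out_of_range} outside [0, 1]")

final_weights = [v["W"] for v in runs.values()]
if max(final_weights) - min(final_weights) > 0.2:    # arbitrary drift tolerance
    print("RED FLAG: final weights disagree across models")
```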
Really appreciate this deep dive—especially the honest look at where SAEs fall short. It feels like the core struggle here is trying to decode the model’s mind to extract truth. That’s hard, messy, and maybe fundamentally limited.
I’m exploring a complementary approach with something called the Global Intelligence Amplifier (GIA).
Instead of digging inside the model, GIA treats the LLM as a conversation tool: a way to facilitate structured human debate, not reveal internal concepts. In short:
Forget trying to read the model’s mind. Use it to help humans reason better.
Happy to share more if that's of interest. Your work is pushing the field in the right direction by surfacing hard truths. Not enough people are pushing on "is it true", and it is the number one problem at the moment, IMHO, as more people use and trust LLMs.
Here is a formula I have been testing: W (weight) = T (truth) × S (support). T is how true the statement is at face value, e.g. "The moon is round like a cheese wheel" = 1. S is Bayesian, looking for rival arguments and supports in a probabilistic way. Anyway, what it does is put numbers on the rivals and supports so you can audit the output (that's the plan, anyway). It seems to be great with math problems too, which was a surprise.
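To make that concrete, here is a tiny sketch of the formula on the moon example; the S value is just a stand-in for what the Bayesian step would return, not a real output:

```python
# W = T * S. T is the face-value truth of the statement; S is the Bayesian
# support it lends to the hypothesis being argued. Values are illustrative.
def weight(truth: float, support: float) -> float:
    return truth * support

T = 1.0     # "The moon is round like a cheese wheel" is true on face value
S = 0.01    # but it lends almost no support to "the moon is made of cheese"
print(weight(T, S))   # 0.01 -> a true statement, nearly worthless as evidence
```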
The results it produces are very interesting: it highlights drift, instability, and math problems, and you can see it all in the numbers (I think), so please have a go.
Prompt
You are a logic-audited reasoning agent. Your output will be mathematically audited. Do not guess. Do not omit rivals. Do not inflate confidence. If the reasoning cannot be completed with structural integrity, return HOLD.
Before We Begin: Please specify the following:
• Top-Level Claim (C): Ask "What is the claim you would like to analyze?"
• Clan: What domain does the claim belong to? (e.g., medicine, AI safety, climate science)
Step 1: Rival Hypotheses (H₁, H₂, ... Hₙ)
List at least 3 semantically distinct, structurally valid rivals. Each must be plausible within the domain.
Step 2: Prior Probabilities (P(Hᵢ))
Assign prior probabilities to each hypothesis.
• Must sum to 1.0
• Priors > 0.9 or < 0.1 require explicit justification.
Step 3: Evidence Likelihoods (P(E | Hᵢ))
Estimate the likelihood of the observed evidence under each hypothesis. Use defensible estimates grounded in domain-relevant sources.
Step 4: Support Score (S)
Bayesian support for the best-fit hypothesis:
S = P(E | H_best), where H_best is the hypothesis with the highest P(Hᵢ) × P(E | Hᵢ)
Step 5: Truth Score (T)
Weighted credibility across all rivals:
T = [ Σᵢ P(Hᵢ) × Cᵢ × P(E | Hᵢ) ] / [ Σᵢ P(Hᵢ) × Cᵢ ]
Where Cᵢ ∈ [0, 1] is the internal credibility of each hypothesis.
Step 6: Final Weight (W)
W = S × T
This is the compound score representing structural and evidentiary confidence.
Use this prompt format for audits going forward. Maintain logical integrity, resist gaming, and resolve claims only when deserved.
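For reference, this is how I picture the structural checks the audit runs over a completed output. The function, tolerance, and pass/fail wording below are a sketch of those checks, not part of the prompt itself:

```python
# Sketch of the structural audit applied to one completed run. It checks only
# what the prompt demands: at least 3 rivals, priors summing to 1, all values
# in [0, 1], and W consistent with S and T.
def audit(priors, likelihoods, credibility, S, T, W, tol=1e-2):
    issues = []
    if len(priors) < 3:
        issues.append("fewer than 3 rival hypotheses")
    if abs(sum(priors) - 1.0) > tol:
        issues.append("priors do not sum to 1.0")
    for name, values in (("prior", priors), ("likelihood", likelihoods),
                         ("credibility", credibility)):
        if any(not 0.0 <= v <= 1.0 for v in values):
            issues.append(f"{name} outside [0, 1]")
    if abs(W - S * T) > tol:
        issues.append("W does not equal S * T")
    return issues or ["PASS"]

# Values from the quadratic-formula run above:
print(audit([0.05, 0.90, 0.04, 0.01],
            [0.001, 0.999, 0.1, 0.001],
            [1.0, 1.0, 1.0, 1.0],
            S=0.999, T=0.90, W=0.90))   # -> ['PASS']
```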