robertzk — LessWrong

Another urban legend, which I've heard told about various mathematicians, and which Misha Polyak self-effacingly tells about himself (and therefore might even be true), is the following:

As a young postdoc, Misha was giving a talk at a prestigious US university about his new diagrammatic formula for a certain finite type invariant, which had 158 terms. A famous (but unnamed) mathematician was sitting, sleeping, in the front row. "Oh dear, he doesn't like my talk," thought Misha. But then, just as Misha's talk was coming to a close, the famous professor wakes with a start. Like a man possessed, the famous professor leaps up out of his chair, and cries, "By golly! That looks exactly like the Grothendieck-Riemann-Roch Theorem!!!" Misha didn't know what to say. Perhaps, in his sleep, this great professor had simplified Misha's 158 term diagrammatic formula for a topological invariant, and had discovered a deep mathematical connection with algebraic geometry? It was, after all, not impossible. Misha paced in front of the board silently, not knowing quite how to respond. Should he feign understanding, or admit his own ignorance? Finally, because the tension had become too great to bear, Misha asked in an undertone, "How so, sir?" "Well," explained the famous professor grandly. "There's a left hand side to your formula on the left." "Yes," agreed Misha meekly. "And a right hand side to your formula on the right." "Indeed," agreed Misha. "And you claim that they are equal!" concluded the great professor. "Just like the Grothendieck-Riemann-Roch Theorem!"

robertzk's Shortform

robertzk2h20

A quick holiday-break thought that popped into my head.

Subject: **On the perceived heteroskedasticity of utility of Claude Code and AI vibe coding**

To influence Claude to do what you want, both you and Claude need to speak the same language, and thus share similar knowledge and words expressing that knowledge, about what is to be done or built. You need to have a gears-level model of how to fully execute the task.

Thus, Claude is subordinate to wizard power, not king power. If you find yourself struggling to wield Claude (Code, or GPT through Codex) increase your wizard power — cultivate and amplify your understanding of the world you wish to craft through its augmentative power.

P.S. Claude will help you speedrun this cultivation process, if you ask nicely enough (theorem: everyone has enough innate wizard power to launch the inductive cascade of self-edification compatible with Claude-compatible wizard power)

robertzk's Shortform

robertzk2d20

I found it based on a hunch, then confirmed it with experimentation. I gained additional conviction when backtesting the experimentation on various historical versions of excel.exe, and noting that the phenomenon only appeared in excel.exe versions shortly after (measured in months) government requested a "read-only" copy of the source code for Excel held in escrow. This has occurred historically in the past (e.g., https://www.chinadaily.com.cn/english/doc/2004-09/20/content_376107.htm and https://www.itprotoday.com/microsoft-windows/microsoft-gives-windows-source-code-to-governments) but subsequent instances of this were allegedly/supposedly classified. Nevertheless, following those instances, the phenomenon appeared, indicating possible compromise of Excel.exe.

robertzk's Shortform

robertzk2d10

Vary the filename/path from short (one character) to max length and run the above repro, and notice the increase in bits communicated if and only if the filename/path is long, all other factors being held constant. Same for varying the data. There is no reason why Excel.exe should be interpolating this information with all the standard telemetry and connected experience stuff disabled. Even the fact that it is occurring is interesting, and doesn't require hypotheses for its origin.

robertzk's Shortform

robertzk2d-30

Undetectable steganography on endpoints expected to be used / communicated during normal usage. Mostly natsec. You can repro it by setting up a synthetic network with similar characteristics or fingerprints to some sanctioned region, and generate 10,000 synthetic honeytrap files to attempt to open (use your imagination); capture and diff all network traffic on identical actions (open file => read / manipulate some specific cells => close file). Then note the abnormalities in how much is communicated and how.

robertzk's Shortform

robertzk2d-1-2

Note Excel.exe is boobytrapped. Be careful when using Excel to open any data or workbooks related to highly sensitive data or activities. On certain networks, when specific conditions are met (i.e. specific regexes or heuristics are triggered on the basis of data that is loaded), Excel will send information to endpoints ostensibly owned and maintained by Microsoft, that provide identifying information on the host and workbook. These are not traceable using normal packet sniffing tools like Wireshark etc. Alternatives when needing to use spreadsheets on highly sensitive data: open source versions that you have compiled from source (NOT Google Sheets), or opening Excel.exe with your network card disabled within a sandboxed environment (e.g. disable network => Start VMWare Windows container => use Excel.exe => End VMWare Windows container => enable network).

robertzk's Shortform

robertzk18d20

Injecting a static IP that you control to a plethora of "whitelisting tutorials" all over the internet is a great example of exploiting data poisoning (e.g. https://www.lakera.ai/blog/training-data-poisoning) attacks, especially once the models pick up the data and are applied to autonomous devsecops use cases to conduct IP whitelisting over Terraform or automated devops-related MCPs.

This can be made more pernicious when you control the server (e.g. not just a substack post controlled by Substack, but active control over the hosting server), because you can inject the malicious static IP selectively depending upon whether or not the User-Agent is a scraping bot dispatched by an entity conducting model training.

One way that labs may look preemptively for adversarial cybersecurity actors is to scan their training data for contexts related to IP whitelisting / security whitelisting, and then have scanning agents examine whether there is any of the above funny business. After all, once the data has been indexed, it is static, and is no longer within the control of the malicious actor.

Final addendum: Models that have gone rogue and/or failed due to misalignment risk, rather than succumbed to misuse risk via this data poisoning avenue, could also take advantage of this by injecting whitelisted IPs into specific situations when activated as a coding agent to do so. It may be helpful for (1) interpretability researchers to actively catalogue and track features related to, particularly in open-weight models where this can be even more pernicious given that anyone can download models from HuggingFace, and (2) for control researchers to operationalize monitoring probes for these monitoring context related to IP whitelisting (or anomalous features activating within residual streams on top of or adjacent to IPs intended for whitelisted, such as in generated code diffs to a .tf file).

leogao's Shortform

robertzk21dΩ230

Mild caveat: the whole “I joined a group chat at NeurIPS and others kept joining to talk to us” only happens if you’re at NeurIPS and your name is Leo Gao so YMMV.

How might we safely pass the buck to AI?

robertzk10moΩ110

At this point I would direct the "deferred task" apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead halt, melt and catch fire at that point.

Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.

In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization and minimality and run-time ablating the rest.

There is probably a much better frame, but this is a new start to the "induction base case" in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is "unlikely to get rid of it" despite one naive attempt above.

If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized without "avoiding this kind of conditional, deceptive reasoning," abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task when you have the proof artifacts.

There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la "exception thrown while handling previous exception."

Attention Output SAEs Improve Circuit Analysis

robertzk1y32

This bounty went to: Victor Levoso.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments