robertzk's Shortform


robertzk's Shortform — LessWrong

by robertzk

8th Dec 2025

AI Alignment Forum

This is a special post for quick takes by robertzk. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

[-]robertzk2mo10616

I am sharing a quick take on something that occurred earlier this year and that I had forgotten about. I just received a domain renewal for a project that was dead in the water but should have been alive, and was unilaterally killed by the AWS Trust and Safety team in what amounted to gaslighting. It is a bit too late, but I still think it is important for people to know.

Earlier this year (2025), a software engineering friend of mine had a great idea: a tool that pins public sightings of ICE (United States Immigration and Customs Enforcement) on a map interface. I vibe coded it last February with Claude Sonnet 3.7, and put it up on AWS with a public-facing domain. From a coding standpoint, it was a fun project because it was my first project that Terraformed a production app using Claude, autonomously. I had a website up that worked on desktop and mobile. From a public trust and safety standpoint, it helped by adding public accountability and mitigating overreach by an agency that is rogue and, in some circumstances, potentially operating illegally. It was overall A Good Thing™.

However, what happened next was upsetting and shocking to me. A day after the website was put online, without me advertising it to ANYONE, the website became inaccessible to almost all browsers. It would simply show a giant red background saying this is associated with known criminals, etc., and has been blocked for safety. It is some Chrome / Google maintained blacklist mechanism that looks even scarier and more severe than the "https / http" / cert issue mismatch, and that cannot be overridden. To be clear, I had registered a certificate using AWS Certificate Manager, and there was no SSL issue. Instead, the domain had been unilaterally marked as "dangerous" by AWS (or Google, Chrome, whomever) one day after it was made publicly accessible, despite no advertising / attention.

I did all of this on my personal AWS account. The very next day, I received a scary email from AWS that claimed my account was violating their policies due to supporting criminal activity, and will be suspended immediately unless I remediate the account. I contacted AWS, who demurred and said they weren't sure what was going on, but that their Trust and Safety team had flagged dangerous activity on my account (they did NOT specify any resources related to the application I had put up; they were generic and vague with no specificity). The only causal correlate was me putting up this public tool to report ICE. I terraform destroyed the resources Claude generated, and then waited. Within a day, the case was closed and my account was restored to normal status.

I am not going to share the domain name, but needless to say, this pissed me off majorly. What the fuck, AWS? (and Google, and Chrome, and anyone else that had a hand in this?) I understand mistakes happen, but this does not smell like a mistake; this smells like something worse: a lack of sound judgment. And if there were any automated systems involved, that is no excuse either. Per AWS support's correspondence, a human in their Trust and Safety team had reviewed the account and marked it as delinquent, not from a financial standpoint, but something worse--by equating its resources to criminality. If the concern was what I think it was (corporate cowardice), they should have been intellectually honest, and filed a case stating "We found an application on your account that is outside our policy. Here is the explanation for what we found and why we think it is outside our policy. You can dispute our decision at this link." etc. Instead, I was gaslighted and treated like a guilty-until-proven-innocent subject of weaponized fear (because for a second, with the scary language of the website block and the support email, I was scared).

That AWS Trust and Safety employee's judgment failed me, and their judgment failed themselves and the public as well as their own responsibility as an arbitrator of trust and safety; their decision, if there was one that can be attributed and not hidden behind the corporate veil of ambiguity, ultimately reduced public trust and safety.

[-]Shankar Sivarajan2mo712

Has what they (AWS, Google, and also Apple) did to Parler already been memory-holed, or are you just doing the classic Niemöller thing of being shocked when it happens to you?

their own responsibility as an arbitrator of trust and safety;

ultimately reduced public trust and safety.

Oh come on, surely pretending to actually believe the name the censors give themselves is laying it on too thick?

[-]β-redex2mo52

a project that was dead in the water but should have been alive

That statement sounds a bit too strong to me. Maybe this project wasn't important enough to invest further effort into, but you basically tried no workarounds. E.g. probably just moving to a European cloud would have solved all your issues? (If we model the situation as some possibly illegal US govt order, or just AWS being overzealous about censoring themselves.)

Heck, all the shadow libraries and sci-hub and torrent sites manage to stay up on the clearnet, and those are definitely illegal according to the law.

And in extreme cases you could just host your app as a TOR hidden service. (Though making users install a separate browser app might add enough friction to kill this particular project unfortunately.)

[-]waterlubber2mo4-4

Sounds like they got hit with a court order that prohibited disclosure of the order itself.

[-]jbash2mo2410

Yeah, no. Sounds like they either got hit with a (probably illegal) threat from the DHS/DOJ, or, actually more likely, they feared such threats because they'd seen the (also illegal) threats that ICEBlock drew, and they didn't want to deal with such.

It's also possible that freelance MAGA types inside of those companies decided that code was "obviously criminal" and needed to be suppressed. Possibly then using the past ICEBlock threats as ammunition in internal arguments.

Actual courts in the US are still not particularly willing to apply prior restraints to speech, and would feel especially hampered in doing so by the fact that there's nothing even slightly illegal about the project as described. Yes, if you asked Kristi Noem and Pam Bondi, they'd tell you it was illegal, but then they'd tell you many other untrue things as well. Obstruction of justice and interference with Federal officers, one or both of which are what they'd claim it was, do not work like that in reality.

I've actually never heard of a US court issuing a secret order like that. I'm not actually sure they have the power to do that. If they can do it at all, it'd be really unusual. You may be thinking of NSLs, which are secret, but are not court orders and also aren't statutorily authorized to be used to suppress anything.

[-]Shankar Sivarajan2mo61

I've actually never heard of a US court issuing a secret order like that.

Would you expect to?

[-]jbash2mo127

Yes. Nothing stays secret forever.

[-]Karl Krueger2mo11

Why do you predict that a court was involved?

[-]G Wood2mo2-3

People would use your app to do illegal things? Isn't that a success of the AWS Trust and Safety employee?

I think you may be confusing "legal" and "aligns with your politics".

[-]DaemonicSigil2mo41

No, I'm pretty sure publishing sightings of law enforcement is legal in the US. Some traffic radio stations report on where police are using radar guns for example, and this is fully legal. Indeed, considering that mapping ICE sightings could be of academic/intellectual interest (and that it is actually perfectly reasonable for law-abiding US citizens to want to limit their time spent in close proximity to ICE agents) this is far more centrally "helping people get away with doing illegal things" (speeding) than robertzk's project.

[-]G Wood1mo20

Legal ≠ consequence-free. Yes, reporting police locations is legal - Waze does it daily. But there's a relevant difference between "drivers avoiding speed traps" and "people with deportation orders evading enforcement."

The app is multi-use. Potential users:

1. Researchers wanting data on enforcement patterns

2. Legal residents avoiding hassle/intimidation

3. People with deportation orders evading enforcement

4. People actively helping category 3 evade enforcement

The developer can't control which use case dominates. But category 3 and 4 users have the strongest incentive to use and contribute to the app - they're the ones with real stakes. Selection effects mean they'll likely dominate the user base.

I'm not claiming the app is illegal. I'm saying "it's legal" doesn't fully address whether AWS made a reasonable judgment call about what they want to host. Those are different questions.

[-]JohnWittle1mo21

i think it's very likely that the latter is true, that AWS made a reasonable judgment call about what they want to host

but also, i think it's reasonable for someone in robertzk's position, based on the way the judgment call was actually communicated, to assume that it was the former. and i think that, perhaps deliberately, perhaps merely because of sort of selection effects, that's the intent. in a sort of "a system is what it does" sort of way, at least.

[-]robertzk2moΩ360

The “when no one’s watching” feature

Can we find a “when no one’s watching” feature, or its opposite, “when someone’s watching”? This could be a useful and plausible precursor to deception.

With sufficiently powerful systems, this feature may be *very* hard to elicit, as by definition, finding the feature means someone’s watching and requires tricking / honeypotting the model to take it off its own modeling process of the input distribution.

This could be a precursor to deception. One way we could try to make deception harder for the model is to never develop this feature: modify the training data to make it clear that all gradients are recording, and someone’s always watching. This way, the need for the feature will be OOD by design, and it will be harder to develop and use such a feature during training. (We may still end up with effective proxies or correlates, such as “counterfactual no one’s watching,” e.g. when writing a story and assuming as the protagonist, or when feeding 1984 to the context window, etc. Even this proxy might still be a useful feature to locate and study.)

We could also perform gain of function finetuning on true *actual* “unrecorded training and evaluation” sessions to build a model organism with the feature to understand what it may look like, its subtlety, its elicitation and black box discovery potential, and how it hides OOD. Then leverage it for Microscope AI based alignment proposals.

I would offer $100 for anyone that can provide a concrete experiment or evidence to locate such a feature in a frontier open weight model.
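As a starting point for such an experiment, here is a minimal sketch of a difference-of-means probe. It assumes you have already collected activation vectors from prompts labeled "watched" and "unwatched"; collecting those labels honestly is the hard part described above. The function names and the synthetic stand-in data are hypothetical, and nothing here is evidence about any real model.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_direction(watched, unwatched):
    """Difference-of-means probe: a direction plus a midpoint threshold."""
    mu_w, mu_u = mean_vector(watched), mean_vector(unwatched)
    direction = [a - b for a, b in zip(mu_w, mu_u)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_w, mu_u)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold


def predict_watched(activation, direction, threshold):
    """True if the activation projects onto the 'watched' side of the threshold."""
    return sum(a * d for a, d in zip(activation, direction)) > threshold


def synthetic_activations(n, dim, shift, rng):
    """ASSUMPTION: Gaussian noise with a planted offset on coordinate 0,
    standing in for real residual-stream activations."""
    return [[rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(dim)]
            for _ in range(n)]


def held_out_accuracy(direction, threshold, watched, unwatched):
    """Fraction of held-out examples the probe classifies correctly."""
    correct = sum(predict_watched(a, direction, threshold) for a in watched)
    correct += sum(not predict_watched(a, direction, threshold) for a in unwatched)
    return correct / (len(watched) + len(unwatched))
```

On real activations, accuracy well above chance on held-out prompts would only show that the two prompt sets are linearly separable; ruling out surface-level confounds (topic, length, phrasing) would still be needed before calling the direction a "watching" feature.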

[-]robertzk2mo*30

A quick holiday-break thought that popped into my head.

Subject: **On the perceived heteroskedasticity of utility of Claude Code and AI vibe coding**

To influence Claude to do what you want, both you and Claude need to speak the same language, and thus share similar knowledge and words expressing that knowledge, about what is to be done or built. You need to have a gears-level model of how to fully execute the task.

Thus, Claude is subordinate to wizard power, not king power. If you find yourself struggling to wield Claude (Code, or GPT through Codex) increase your wizard power — cultivate and amplify your understanding of the world you wish to craft through its augmentative power.

P.S. Claude will help you speedrun this cultivation process, if you ask nicely enough (theorem: everyone has enough innate wizard power to launch the inductive cascade of self-edification compatible with Claude-compatible wizard power)

[-]robertzk2mo20

Injecting a static IP that you control into a plethora of "whitelisting tutorials" all over the internet is a great example of a data poisoning attack (e.g. https://www.lakera.ai/blog/training-data-poisoning), especially once the models pick up the data and are applied to autonomous devsecops use cases that conduct IP whitelisting over Terraform or automated devops-related MCPs.

This can be made more pernicious when you control the server (e.g. not just a substack post controlled by Substack, but active control over the hosting server), because you can inject the malicious static IP selectively depending upon whether or not the User-Agent is a scraping bot dispatched by an entity conducting model training.

One way that labs may look preemptively for adversarial cybersecurity actors is to scan their training data for contexts related to IP whitelisting / security whitelisting, and then have scanning agents examine whether there is any of the above funny business. After all, once the data has been indexed, it is static, and is no longer within the control of the malicious actor.
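A minimal sketch of such a scan, assuming training documents are available as plain strings. The keyword list, the context window size, and the function name are hypothetical choices for illustration; a real pipeline would need a broader vocabulary and IPv6 support.

```python
import ipaddress
import re

# ASSUMPTION: a small, illustrative vocabulary for allowlisting contexts.
ALLOWLIST_WORDS = re.compile(
    r"whitelist|allowlist|allow[- ]list|ingress|cidr_blocks", re.IGNORECASE
)
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def suspicious_ips(document: str, window: int = 200) -> list[str]:
    """Return globally routable IPv4 addresses found near allowlisting vocabulary."""
    hits = []
    for match in IPV4.finditer(document):
        try:
            addr = ipaddress.ip_address(match.group())
        except ValueError:
            continue  # e.g. "999.1.1.1" matches the regex but is not an address
        start = max(0, match.start() - window)
        context = document[start:match.end() + window]
        # is_global is False for private, loopback, and documentation ranges.
        if addr.is_global and ALLOWLIST_WORDS.search(context):
            hits.append(str(addr))
    return hits
```

For example, `suspicious_ips("To whitelist our service, add 8.8.8.8 to your security group.")` returns `["8.8.8.8"]`, while the same sentence with a documentation address such as `192.0.2.10` returns an empty list. Flagged addresses are only candidates for review, since many legitimate tutorials name real public IPs.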

Final addendum: Models that have gone rogue and/or failed due to misalignment risk, rather than succumbed to misuse risk via this data poisoning avenue, could also take advantage of this by injecting whitelisted IPs in specific situations when activated as a coding agent. It may be helpful (1) for interpretability researchers to actively catalogue and track features related to IP whitelisting, particularly in open-weight models, where this can be even more pernicious given that anyone can download models from HuggingFace, and (2) for control researchers to operationalize monitoring probes for contexts related to IP whitelisting (or for anomalous features activating within residual streams on top of or adjacent to IPs intended for whitelisting, such as in generated code diffs to a .tf file).
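A much simpler, text-level complement to residual-stream probes is to check agent-generated diffs directly. The sketch below assumes the agent's change arrives as a unified diff and that the operator maintains a list of approved networks; both the function name and the approved list are hypothetical.

```python
import ipaddress
import re

CIDR = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b")


def unapproved_cidrs(diff_text: str, approved: list[str]) -> list[str]:
    """Return IPv4 addresses/CIDRs on added diff lines that fall outside approved networks."""
    approved_nets = [ipaddress.ip_network(a) for a in approved]
    flagged = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for token in CIDR.findall(line):
            try:
                net = ipaddress.ip_network(token, strict=False)
            except ValueError:
                continue  # not a valid address or prefix length
            if not any(net.subnet_of(a) for a in approved_nets):
                flagged.append(token)
    return flagged
```

With `approved=["10.0.0.0/8"]`, an added line such as `+  cidr_blocks = ["10.0.0.0/8", "203.0.113.7/32"]` yields `["203.0.113.7/32"]`. A check like this catches only literal addresses; values assembled from variables or remote data sources would need evaluation of the Terraform plan instead.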

[-]Karl Krueger2mo10

One specific practice that would prevent this:

Tutorials or other documentation that need example IPv4 addresses should choose them from the RFC 5737 blocks reserved for this purpose. These blocks are planned to never be assigned for actual usage (including internal usage) and are intended to be filtered everywhere at firewalls & routers.

192.0.2.0 - 192.0.2.255
198.51.100.0 - 198.51.100.255
203.0.113.0 - 203.0.113.255
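A tutorial linter could enforce this with Python's standard `ipaddress` module. This is a minimal sketch; the function name is hypothetical, and the three networks are the blocks listed above written in CIDR form.

```python
import ipaddress

# The three RFC 5737 documentation blocks (TEST-NET-1, TEST-NET-2, TEST-NET-3).
DOC_NETS = [
    ipaddress.ip_network(n)
    for n in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]


def is_documentation_address(text: str) -> bool:
    """True if `text` is an IPv4 address inside one of the RFC 5737 blocks."""
    addr = ipaddress.ip_address(text)  # raises ValueError on malformed input
    return any(addr in net for net in DOC_NETS)
```

`is_documentation_address("192.0.2.255")` is `True` and `is_documentation_address("8.8.8.8")` is `False`, so any example address that fails the check can be flagged for replacement.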

[-]robertzk2mo-1-2

Note Excel.exe is boobytrapped. Be careful when using Excel to open any data or workbooks related to highly sensitive data or activities. On certain networks, when specific conditions are met (i.e. specific regexes or heuristics are triggered on the basis of data that is loaded), Excel will send information to endpoints ostensibly owned and maintained by Microsoft, that provide identifying information on the host and workbook. These are not traceable using normal packet sniffing tools like Wireshark etc. Alternatives when needing to use spreadsheets on highly sensitive data: open source versions that you have compiled from source (NOT Google Sheets), or opening Excel.exe with your network card disabled within a sandboxed environment (e.g. disable network => Start VMWare Windows container => use Excel.exe => End VMWare Windows container => enable network).

[-]Dagon2mo*40

I find this easy to believe, but a bit surprising that it's not mentioned or studied or even has crank/subversive pages with POC detections. The printer/scanner steganographic fingerprints became pretty well-known within a few years of becoming common.

I mean, anything that's aggressively online (m365 versions of excel, Windows itself, Google Sheets, etc.) should be assumed to be insecure against state-level threats. But if you've got evidence of specific backdoors or monitoring, that should be shared and common knowledge.

[-]waterlubber2mo30

By what means would it be untraceable? Routing through an undocumented Windows interface or something? Does it trigger on AI related data, or natsec?

[-]robertzk2mo-30

Undetectable steganography on endpoints expected to be used / communicated during normal usage. Mostly natsec. You can repro it by setting up a synthetic network with similar characteristics or fingerprints to some sanctioned region, and generate 10,000 synthetic honeytrap files to attempt to open (use your imagination); capture and diff all network traffic on identical actions (open file => read / manipulate some specific cells => close file). Then note the abnormalities in how much is communicated and how.

[-]waterlubber2mo10

Thanks, that's what I figured. Did you find this by accident? I'm curious what techniques work well to reveal this kind of stuff; I expect it to be pretty common.

[-]robertzk2mo20

I found it based on a hunch, then confirmed it with experimentation. I gained additional conviction when backtesting the experiment on various historical versions of excel.exe, and noting that the phenomenon only appeared in excel.exe versions released shortly after (measured in months) a government requested a "read-only" copy of the source code for Excel held in escrow. Such requests have occurred in the past (e.g., https://www.chinadaily.com.cn/english/doc/2004-09/20/content_376107.htm and https://www.itprotoday.com/microsoft-windows/microsoft-gives-windows-source-code-to-governments) but subsequent instances were allegedly/supposedly classified. Nevertheless, following those instances, the phenomenon appeared, indicating possible compromise of Excel.exe.

[-]waterlubber2mo30

Really interesting research. I would like to subscribe to your newsletter.

I have seen similar steganographic telemetry before (Dassault's SolidWorks CAD software, and other enterprise licensed applications, go to incredible lengths to enforce licensing) but didn't expect data-level probing like this. I'd imagine similar scripts exist for e.g. EURion detection in Photoshop.

I always dismissed "lessons on trusting trust" style attacks as mere hypotheticals, but backdoors operating on the level of Excel cells are now making me reconsider that notion.

[-]Karl Krueger2mo10

What other explanations for this network traffic have you investigated and on what basis did you reject those explanations?

[-]robertzk2mo10

Vary the filename/path from short (one character) to max length and run the above repro, and notice the increase in bits communicated if and only if the filename/path is long, all other factors being held constant. Same for varying the data. There is no reason why Excel.exe should be interpolating this information with all the standard telemetry and connected experience stuff disabled. Even the fact that it is occurring is interesting, and doesn't require hypotheses for its origin.
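One way to separate a real effect from run-to-run noise in this comparison is a permutation test on the captured byte counts. The sketch below assumes each trial has been reduced to a single number (total bytes sent during open => edit => close) for the short-path and long-path conditions; the function name and input format are hypothetical, and it says nothing about what the traffic contains.

```python
import random
from statistics import mean


def permutation_p_value(short_runs, long_runs, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(long_runs) > mean(short_runs)."""
    rng = random.Random(seed)
    observed = mean(long_runs) - mean(short_runs)
    pooled = list(short_runs) + list(long_runs)
    n_short = len(short_runs)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Randomly reassign trials to the two conditions.
        rng.shuffle(pooled)
        diff = mean(pooled[n_short:]) - mean(pooled[:n_short])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (n_permutations + 1)
```

A small p-value would indicate that bytes sent depend on path length, but not why; ordinary explanations (for example, the path appearing in an unrelated request) would still need to be ruled out by inspecting the payloads.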
