This is an attempt to formulate a thought that’s been rolling around in my brain for a few years now. I want to formalize a hypothesis, but I’m not sure how to convey what the hypothesis even is. Hopefully writing this out will help make that happen (spoiler alert: I partially succeeded?), or someone reading this will be able to help me here.

Working towards the theme: The idea starts with the classic LessWrong meme of an infohazard (defined here as some form of external sensory “input”) that, when received by its “target”, causes that target harm. This could be something wholly internal, like increasing depression, but might have external effects as well. A victim might do something they wouldn’t otherwise do of their own volition, or could even collapse completely, in the case of infohazards that induce seizures or death.

The idea of an infohazard is not merely theoretical. We know from experience that visual and auditory infohazards exist for individual human beings. Examples range from sad stories which make people temporarily depressed, to people “brainwashed” by cults into doing things they would otherwise never have done, and even to people who suffer seizures or death in response to certain inputs, like flashing lights. From that basis in fact, one can generalize to arrive at the hypothesis that for every individual human, such an infohazard exists (even if unconstructable in practice). An even stronger (but somewhat less plausible) hypothesis would be that there exist singular infohazards which would work on most or even all of humanity. On a more(?) theoretical level, it seems reasonable that a being with complete knowledge of a human brain and unlimited compute power could “trivially” construct an infohazard tailored to that brain. Could such a being control the brain’s output in some arbitrary manner by giving it the right inputs, or are there limitations on how far even an optimal infohazard could go?

We are far from understanding the human brain deeply enough to even begin to answer these questions, so let’s try to abstract the situation a bit by considering computing, a field where we (arguably) have greater scientific and mathematical understanding. The best analogy to an infohazard in computer science would seem to be found in the practice of hacking.

“Hacking” as a term covers many different modes of action, but one popular method can be generalized as “inserting an input through whatever the normal input channel is, with the input designed to make the machine output something different from what its creators intended.” That last bit can be clarified into the formulation “different from what was intended by whoever built the machine,” or “different from how the literal code should have acted.” Writing this down, I’m finding it very hard to remove the sense of intentionality from the code here. Presumably the hacker is using normal physics, and the computer is treating the input exactly as it has been “designed” to treat it, albeit not how its developers expected. So in a sense the hacker isn’t breaking anything other than the constraints of the designer’s imagination. This seems like a dead end, since we aren’t talking about computers anymore. Before abandoning this line of thinking, though, it might be worth exploring a little more.

Another way to consider hacking is by thinking about the internal state of the machine, rather than its output. A definition of hacking from that perspective might be “inserting an input through whatever the normal input channel is, with the input designed to make the machine’s internal state be different in some significant way from its creators’ intentions.” This usually involves gaining access to parts of the machine which you should not have access to. Ultimately the hacker (presumably) only cares about the ultimate output (which admittedly might simply be information about the computer’s internal state), so it might just be a different way of thinking about the same thing, and an equivalent definition. (Btw, I’m specifying “inserting an input through whatever the normal input channel is,” since some hacking techniques involve finding an input channel which shouldn’t have been accepting input, was never meant to be used as such, and/or didn’t even exist until the hacker broke the machine in some glorious hackery way.)
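To make the “normal input channel, unintended output” framing concrete, here is a minimal toy sketch (my own illustration, not drawn from any real-world exploit): the program does exactly what its code says, yet a crafted input sent through the ordinary channel produces behavior its developer never intended.

```python
# A toy "calculator" whose developer intended it to evaluate simple arithmetic.
# The input channel is perfectly normal (a string), and the machine follows its
# code to the letter -- yet a crafted input does something the developer never
# imagined. All names here are made up for illustration.

def naive_calculator(expression: str) -> str:
    # Developer's intent: handle things like "2 + 2".
    return str(eval(expression))  # the flaw is trusting the input channel

print(naive_calculator("2 + 2"))                      # intended use -> "4"
print(naive_calculator("__import__('os').getcwd()"))  # "hack" -> leaks the working directory
```

No physics was broken and no code was modified; only the developer’s imagination about what the input channel would carry.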

There are some hacks that only attack and harm individual computers, but quite often these days, hacks can be effective across any computer running a certain operating system. As long as the right input is given, the hacker has free rein to mess around with the output as they wish. One can easily imagine some sort of “super-hack” — an input which hacks not just one OS, but every computer (of some semi-arbitrarily “significant” complexity) available on the market. The input code could be pretty ugly and glued-together, and I’m not sure why anyone would bother doing all that in a single input, but I don’t think there’s any technical reason why it couldn’t work in theory (or maybe there is, idk). Of course that wouldn’t affect computers with no (or extremely limited) input channels, but that would be the human equivalent of being blind before The Image That Makes You Crazy appears. No input, no issue! Certainly machines that are similar enough to each other in certain ways can be hacked with identical techniques, if nothing else.

Generalizing even further, what about Turing machines? What would it mean to “hack” a Turing machine? All of the conceptions of hacking developed above seem to depend upon some amount of human intentionality, and that intentionality being subverted. Turing machines do not have intentionality, as they exist platonically, and run blindly. You cannot “break” a Turing machine in any conventional sense. How do we take human intentions or design out of the picture?

Looking back at both computers and humans, one relatively objective state a hack or an infohazard can lead to is a system crash. In a computer, that would mean “bricking,” where no further input can alter the output, and in humans that would mean death or coma, where, again, no further input will change anything. Bricking and deadly infohazards are more limited in form than hacking and mind control, respectively, but may be easier to translate into a formalized notion.

Presumably, a formalized conception of “bricking” could lead to some interesting and potentially really important theorems, which may also be of help when discussing real-world hacking, and more ambitiously, human infohazards.

In the context of a Turing machine, I would define a Brick (capitalized to indicate this particular meaning) as “an input which causes all further input to make no difference to / be irrelevant to the output”. A Bricked Turing machine may halt or may run forever; all that matters is that there is no possible input which would alter the output post-Brick. If the Turing machine is indifferent to input no matter what initial input is given (such as one which only outputs 0, or immediately halts at the first step, or outputs an infinite sequence of 1s), that should not count as a Bricked machine, since there was never an initial Brick in the first place. Therefore, a Brickable machine must be sensitive to at least some initial inputs, in the sense that by changing the input, one will receive differing outputs. I am sure there are many trivial properties of Brickable Turing machines that could be stated, but I have yet to explore this in any more depth. Hopefully, there may also be non-trivial properties unique to Brickable machines, which might shed light on some of the areas I’ve covered above.
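One possible way to pin this down (a sketch under the assumption that we view the machine as a function from finite input strings to outputs; other formalizations are certainly possible):

```latex
% Sketch of a formal definition, treating the machine M as a function from
% finite input strings over an alphabet \Sigma to outputs (however "output"
% is fixed: final tape contents if M halts, the output stream otherwise).

A string $b \in \Sigma^{*}$ is a \emph{Brick} for $M$ if
\[
  \forall x, y \in \Sigma^{*}:\quad M(b \cdot x) = M(b \cdot y),
\]
i.e.\ once $b$ has been read, nothing appended afterwards changes the output.

$M$ is \emph{Brickable} if it has a Brick and is not input-indifferent,
i.e.\ there exist $u, v \in \Sigma^{*}$ with $M(u) \neq M(v)$.
```

Under this phrasing the empty string can only be a Brick for an input-indifferent machine, so requiring input-sensitivity rules out exactly the trivial cases described above.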

What I would like to know is if this concept already exists in computer science, and if so, to what depth has it been studied? What can we formally prove about Brickability of various machines? If this is a novel concept, do you think this is worthy of future study? Apologies if I just reinvented something that’s already well-known, or if there’s a fatal flaw in my formalism that renders it totally useless. I don’t have formal training in the area, so I’m sure I have a lot of blind spots.

Answers

ChristianKl · Mar 11, 2022

to people “brainwashed” by cults to do stuff they would otherwise never have done

There's an old post by gwern according to which brainwashing doesn't really exist in the way commonly assumed.

I interpret that post as "brainwashing usually doesn't work, and when it does, it usually only works for a limited time". Which is consistent with the statement that brainwashing sometimes works, and when it works for a limited time, it is sometimes a long time, sometimes on the scale of years.

Yes, if your model was that cults have like 80% success rate, that is not true.

But a cult that succeeded in converting 1% of the people they approach, with an average retention of 5 years, could still get huge, assuming they keep approaching lots of people. And during...

Comments

Some very quick notes, don't have time for more:

  • It looks to me as if you fail to separate formal models from concrete implementations. Most exploitability results from mismatches between the two. As a concrete example, the PDF spec says (said?) that a PDF starts with a "this is a PDF" comment, e.g. %PDF-1.4. (Right at the start, with nothing before it.) In practice, most readers are happy if that signature sits anywhere in the first 4KiB or something like that. The end result is benign things like polyglot files on the one end – look up Ange Albertini and/or the International Journal of PoC||GTFO ("Proof-of-Concept or Get The F*** Out") – or malicious results like your [guard mechanism – firewall/antivirus/…] parsing things one way and considering them safe, while [target program] parses them another way and [blows up – crashes / runs embedded code / …]. As long as you only look at specs / formal models, you cannot talk about the actual effects. You have to actually look at the code / program, have to get down to the base level, to see these problems. (A small sketch after this list illustrates the parser mismatch.)

  • For more stuff on that, look at LANGSEC (very short summary, but the papers are quite readable too) or the stuff that Halvar Flake does. Briefly and slightly over-simplified, the model here is that you have a universal machine (your computer) but want it to do a very specific limited thing, so you write an emulator (your program) for a finite state machine or similar. There are certain state transitions implemented by your code, and the program starts out in a "sane" state. Certain events (weird inputs, memory manipulation, …) might move the program from a sane state into a "weird state", and after that you can still trigger the transitions intended for sane states, but now they run on weird states – you have a "weird machine" that does unintended things. Another key insight from the LANGSEC stuff is that you write the program, so you probably think you decide what's going to happen, but actually the program is effectively an interpreter for the input: the input behaves like a program running on your "interpreter", so whoever provides the input (files, network packets, …) actually decides what happens. (Within the bounds set by your code, but those are often uncomfortably loose.) Not only is code data (as taught by Lisp), but data is also code.

  • Formal models can't be exploitable in this sense, as they are the specification. What can happen is that there are logic bugs in the spec, but that is a very different class of problems. So an abstract Turing machine can't really be hackable, as it will behave exactly according to specification (it is the spec); the program implemented by the TM might be faulty and exploitable, and a concrete implementation of the TM might be too. (Of course, logic bugs can also be very, very bad.)
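To make the PDF example from the first bullet concrete, here is a minimal sketch (invented checker names, and the 1024-byte window is just an assumption standing in for whatever each reader actually does) of how a strict, spec-following parser and a lenient, real-world parser can disagree about the very same bytes:

```python
# Illustrative only -- not any real reader's code. A strict checker follows the
# letter of the spec; a lenient checker mimics what many real PDF readers do.

def strict_is_pdf(data: bytes) -> bool:
    # Spec reading: the file must *start* with the %PDF-x.y header.
    return data.startswith(b"%PDF-")

def lenient_is_pdf(data: bytes) -> bool:
    # Real-world reading: accept the header anywhere near the start of the file
    # (the exact window varies by reader; 1024 bytes is an arbitrary choice here).
    return b"%PDF-" in data[:1024]

# A polyglot-ish file: another format's header first, PDF header buried after it.
payload = b"GIF89a" + b"\x00" * 100 + b"%PDF-1.4 ...rest of a PDF..."

print(strict_is_pdf(payload))   # False -- a guard using this check says "not a PDF"
print(lenient_is_pdf(payload))  # True  -- the target program happily treats it as one
```

The exploitable gap lives entirely in the space between the two checks; neither function is "wrong" in isolation.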

Thanks for the insightful comment! I’ll look through the links you provided, but I think we’re in agreement here (though you put it far more succinctly); formal models are not exploitable in the typical sense of the word. That’s why I’m interested in what I’m tentatively calling “Brickability” (though my hope is that this concept isn’t new and already has a name I could google)—a binary string which renders all further input irrelevant to the output (for a given Turing machine). For now I’m not really worried about concrete implementations, since I’m not fully sure my formal model is even consistent or non-trivial yet.
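As a sanity check on consistency and non-triviality, here is a deliberately silly toy machine (my own construction, nothing deep) that is input-sensitive and yet clearly has a Brick, viewing the machine as a function from finite input strings to outputs:

```python
# A toy, obviously-Brickable "machine": it echoes its input character by
# character, but the moment it reads 'B' it ignores everything that follows.

def toy_machine(inp: str) -> str:
    out = []
    for ch in inp:
        if ch == "B":          # the Brick: all later input becomes irrelevant
            break
        out.append(ch)
    return "".join(out)

# Sensitive to input before the Brick, so not trivially input-indifferent:
assert toy_machine("01") != toy_machine("10")

# After the Brick, no suffix makes any difference:
assert toy_machine("01B" + "anything") == toy_machine("01B" + "something else")
```

So the notion is at least consistent and non-empty; the interesting question is what it looks like for machines that were not built to be Bricked on purpose.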

I can't recall specific names / specific treatments of this, but I'm also relatively sure that it's kinda familiar, so I suspect that something exists there. Problem is that, in some sense, it falls right between areas.

On the practical side, people don't really care where the problem originates, they just want to fix it. (And people are swimming in abstraction layers and navigate them intuitively, so a "this is one layer up" doesn't really stand out, as it's still the bottom layer of some other abstraction.) So from the perspective of computer security, it just falls into the big bucket of what's called a "denial of service" vulnerability (alongside lots of other things like temporarily slowing a system down, making it do 2^51 compute steps (i.e. locking it up effectively forever, but not "truly" forever), or making it drown in traffic from outside), but there's generally no category separation between "possible bugs" (that just happen to be in one specific implementation) and "necessary bugs" (the "logic bugs" that are in the spec, where every conforming program will end up with the same problem).

On the theoretical side, you mostly work with bounded recursion instead of possibly unbounded loops, so the problem is more-or-less impossible to run into. (And while Turing machines exist on the theory side too, working with them sucks, so it doesn't happen a lot, and people don't really notice problems that only occur when working with them. So it's mostly bounded recursion.) As an example, if you write your program in Coq, you're forced to prove that every recursive computation terminates in order to be allowed to compile (and run) your program. So as soon as you care even a little about verification, the problem basically disappears / becomes invisible. In theory, the path to reaching a fixed point by reacting to every input in exactly the same way is still open (by messing up your state), but that's very unlikely to be hit by chance (especially if you write in the typical stateless style), so people might not ever stumble on it, and the chances of thorough exploration are rather low too. (Still feels like there was something, somewhere...)
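To make the "forced to prove termination" point concrete, here is a tiny sketch in Lean (the comment mentions Coq; Lean behaves analogously, and this example is my own):

```lean
-- Accepted: the recursion is structural (the argument shrinks on every call),
-- so the termination checker is satisfied automatically.
def countdown : Nat → List Nat
  | 0     => [0]
  | n + 1 => (n + 1) :: countdown n

-- Rejected: a definition like
--   def spin (n : Nat) : Nat := spin n
-- fails to compile, because no termination argument exists for it.
```

In a setting where every definition must pass such a check, unbounded "ignore everything from here on" loops can't be written at all, which is presumably part of why the problem barely shows up on the verification side.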

In other areas:

  • From the perspective of dynamical systems, it's just a fixed point.
  • In temporal logic, it's something that would be happening "eventually forever".
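For keyword-hunting purposes, the temporal-logic phrasing has a standard notation: "eventually forever" is the LTL pattern

```latex
\Diamond\,\Box\,\varphi \qquad \text{(also written } \mathbf{FG}\,\varphi\text{)}
```

read as "at some point $\varphi$ becomes true and stays true from then on", where $\varphi$ here would be something like "further input no longer affects the output".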

If you can't find anything after using that to find more keywords, I'd suggest you ask someone at the Santa Fe Institute, they do lots of dynamical systems stuff, and if it exists at all, there's a pretty good chance that someone there has heard of it. (I've seen them mention kinda nearby concepts in one of the "Complexity Explorer" courses. According to that page, Cristopher Moore does a computability etc. course soon-ish, so he should either be the right person to ask or know who to forward it to.)

Some computer hacks (e.g. power analysis, timing attacks) exploit the outputs rather than the inputs of a system to access secret information (usually keys or passwords). The human analogy would be something like cold reading.
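As a minimal sketch of the timing flavor of this (illustrative code of my own; a deterministic step counter stands in for the wall-clock measurements a real attack would need many samples to obtain):

```python
# A password check that leaks through its "output timing": the early exit means
# the amount of work done reveals how long a correct prefix the guess shares
# with the secret. SECRET and check_password are hypothetical names.

SECRET = "hunter2"

def check_password(guess: str) -> tuple[bool, int]:
    steps = 0
    for g, s in zip(guess, SECRET):
        steps += 1               # stand-in for measurable time / power draw
        if g != s:
            return False, steps  # early exit is the side channel
    return guess == SECRET, steps

print(check_password("zzzzzzz"))  # (False, 1) -- first character already wrong
print(check_password("huzzzzz"))  # (False, 3) -- correct prefix "hu" detected
print(check_password("hunter2"))  # (True, 7)
```

The intended output is just the boolean; the secret leaks through a second, unintended output channel, which an attacker can use to recover it one character at a time.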

Oh yeah, that’s a great family of hacks as well! Here’s a great example I recently came across (though the example itself is pretty old): http://www.cs.tau.ac.il/~tromer/acoustic/

Follow-up question (which I admittedly have not spent more than five minutes thinking about so far): are any Universal Turing Machines (UTMs) Brickable? My intuition is yes (based on the fact that most real-world computers seem to be), though I haven’t formally checked yet. If some UTMs are Brickable, my intuition is that all UTMs are likely Brickable, although the particular Brick may be unique to each machine. If instead only some are Brickable, I would be interested in what the limiting conditions are, and if none are Brickable, I’m pretty sure that would also have massive implications, though I’m not sure exactly what those would be.

I think the universal part means it's all-or-none (any machine that can emulate a Brickable machine is itself Brickable).

We need to distinguish between "writes zeroes forever" and "simulates a machine that writes zeroes forever". The second one is also in an infinite loop, but it does things beyond writing zeroes (depending on how the simulated machine's output is encoded).

I guess "writes zeroes" should not be taken literally. The part with "ignores all future input" is better. But in that case, suppose you write a program that simulates another program, and stops if someone presses the Esc key. Even if the simulated program is bricked, the simulating program will react to the key, so it is not bricked. (And, yeah, Turing machines are not interactive, so... what is the analogy in their case?)