Inconsistent Values and Extrapolation


Interpretability might then refer to creating architectures/activation functions that are easier to interpret.

Software: ffmpeg

Need: Converting arbitrary video/audio file formats into one another

Other programs I've tried: Various free conversion websites

Not tried: Any Adobe products

Advantage: ffmpeg doesn't have ads, or spam, or annoying pop-ups
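A couple of invocations of the kind I mean (file names are hypothetical; ffmpeg infers formats from the extensions):

```shell
# Convert between container/codec formats; ffmpeg picks sensible
# defaults based on the output extension.
ffmpeg -i input.mkv output.mp4

# Extract just the audio track (-vn drops the video stream),
# re-encoding it to MP3.
ffmpeg -i input.mkv -vn output.mp3
```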

Software: pandoc

Need: Converting arbitrary document file formats into one another

Other programs I've tried: In-built conversion for office-like products

Advantage: pandoc supports a lot of formats, and generally deals with them really well—loss of information through conversion is pretty rare
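For illustration, conversions of the kind I use it for (file names are hypothetical; pandoc infers formats from the extensions):

```shell
# Convert a Word document to Markdown; formats are inferred
# from the file extensions.
pandoc notes.docx -o notes.md

# Formats can also be given explicitly with -f (from) and -t (to).
pandoc -f markdown -t html notes.md -o notes.html
```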

I'm confused about how POC||GTFO fits together with cryptographers already starting to worry about post-quantum cryptography in 2006, when the proof of concept was "we have factored 15 into 3×5 using Shor's algorithm". (They were running a whole conference on it!)

Laptop chargers are also an object of which it's trivial to own multiple, at low cost and high (potential) advantage.

Other examples: Take caffeine/nicotine once a week instead of never or daily. Leave social situations when they're not fun or useful anymore. Do small cost-benefit analyses when they make sense[1].

See also: Solved Problems Repository, Boring Advice Repository.

  1. I've done two already this year: One to decide whether to leave a bootcamp, and another to decide which gym to select. (The second one misfired: I made a mistake in my calculation, counting only the way there as a cost and not the way back to public transport, which led me to choose the wrong one, by less than 100€ over the time I'll go there.) I should've done the math (done ✓), then burned the math and gone with my gut (not done ✗). ↩︎

(I have not read your recent work on AI control, so feel free to refer me to any material that answers my questions there.)

At least with C, in my experience these kinds of mistakes are not easily caught by testing, syntax highlighting, or basic compiler linting. (There's a reason valgrind exists!) Looking over the winners for 2008, I have no idea what is going on, and I think it'd take me quite a while to figure out what the hell is happening, and whether it's sketchy.

I'd enjoy reading about experiments where people have to figure out whether a piece of C code is underhanded.

Another example that comes to mind, though I'm not sure about the exact lesson, is the NSA modifying the S-boxes of DES to make them more resistant to differential cryptanalysis.

  • The NSA hid a defense against a specific attack in the S-boxes
  • People figured this out only when the specific attack was found
  • It is unknown whether they hid anything that makes offense easier

Edit: After some searching, I came upon the PRNG Dual_EC_DRBG, which did have a bunch of NSA involvement, and where the NSA was pretty much caught biasing the output in a specific direction. So attacks here are definitely possible, though in this case it took more than five years for them to get caught.

As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems. I'm pretty sure you've written an answer to "negatively reinforcing bad behavior can reinforce both good behavior and better hiding", so I'll leave it at that.

It seems pretty relevant to note that we haven't found an easy way of writing software[1] or making hardware that, once written, can easily be evaluated to be bug-free and to work as intended (see the underhanded C contest or cve-rs). The artifacts that AI systems will produce will be of similar or higher complexity.

I think this puts a serious dent in "evaluation is easier than generation"—sure, easier, but how much easier? In practice we can also solve a lot of pretty big SAT instances.

  1. Unless we get AI systems to write formal verifications for the code they've written, in which case a lot of complexity gets pushed into the specification of formal properties, which presumably would also be written by AI systems. ↩︎

Several disjointed thoughts, all exploratory.

I have the intuition that "believing in"s are what allocating Steam feels like from the inside, or that they are the same thing.

C. “Believing in”s should often be public, and/or be part of a person’s visible identity.

This makes sense if "believing in"s are useful for intra- and inter-agent coordination, which is the thing people accumulate to go on stag hunts together. Coordinating with your future self, in this framework, requires the same resource as coordinating with other agents similar to you along the relevant axes right now (or across time).

Steam might be thought of as a scalar quantity assigned to some action or plan, which changes depending on whether the action is executed or not. Steam is necessarily distinct from probability or utility: if you start making predictions about your own future actions, your belief-estimation process (assuming it has some influence on your actions) has a fixed point in predicting that the action will not be carried out, and then intervening to prevent it from being carried out. There is another fixed point in which the agent is maximally confident it will do something and then just does it, but then it can't be persuaded not to do it.[1]

As stated in the original post, steam helps solve the procrastination paradox. I have the intuition that one can relate the changes in steam, utility, and probability to each other: Assuming utility is high,

  1. If actions/plans are performed, their steam increases
  2. If actions/plans are not performed, steam decreases
  3. If steam decreases slowly and actions/plans are executed, increase steam
  4. If steam decreases quickly and actions/plans are not executed, decrease steam even more quickly(?)
  5. If actions/plans are completed, reduce steam

If utility decreases a lot, steam only decreases a bit (hence things like sunk costs). Differential equations look particularly useful for talking more rigorously about this kind of thing.
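One toy way of cashing out the rules above as a differential equation (the functional form and the coefficients are entirely my invention):

ds/dt = α·x(t)·u − β·(1 − x(t))·s(t) − γ·done(t),  with α, β, γ > 0

where s(t) is steam, u is utility, x(t) ∈ {0, 1} indicates whether the action is currently being executed, and done(t) spikes when the action is completed. Execution grows steam in proportion to utility (rules 1 and 3), non-execution decays it in proportion to its current level (rules 2 and 4), completion drains it (rule 5), and a drop in u only slows growth rather than immediately draining s, which gives the sunk-cost-like behavior.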

Steam might also be related to how cognition on a particular topic gets started; to avoid the infinite regress problem of deciding what to think about, and deciding how to think about what to think about, and so on.

For each "category" of thought we have some steam which is adjusted as we observe our own previous thoughts, beliefs changing and values being expressed. So we don't just think the thoughts that are highest utility to think in expectation, we think the thoughts that are highest in steam, where steam is allocated depending on the change in probability and utility.

Steam or "believing in"s seem to be bound up with abstraction à la teleosemantics: when thinking or acting, steam decides where thoughts and actions are directed to create higher clarity on symbolic constructs or plans. I'm especially thinking of the email-writing example: there is a vague notion of "I will write the email", into which cognitive effort needs to be invested to crystallize the purpose, and then further effort has to be mustered to actually flesh out all the details.

  1. This is not quite true, a better model would be that the agent discontinuously switches if the badness of the prediction being wrong is outweighed by the badness of not doing the thing. ↩︎

This seems very similar to the distinction between Steam and probability.
