Samuel Clamons



I hope we get to see grades for these comments from at least EY and PC.

~1 hour's thoughts, by a total amateur. It doesn't feel complete, but it's what I could come up with before I couldn't think of anything new without >5 minutes' thought. Calibrate accordingly—if your list isn't significantly better than this, take some serious pause before working on anything AI related.

Things that might, in some combination, lead toward AI corrigibility:

  • The AI must be built and deployed by people with reasonably correct ethics, or sufficient caution that they don't intentionally implement a disastrously evil morality.
    • No amount of AI safety features will help humanity if your goal in building an AI is to kill all humans, make yourself immortal and total dictator of all humanity, or prevent humans from any action that is not worshipping god.
    • This is a necessary but very much insufficient condition. Probably true of all of these bullet points, but this one especially.
    • Similar conditions apply for designer/operator technical ability + understanding of AI safety dangers.
  • It should be possible to specify hard constraints in the AI's behavior.
    • i.e., we should be able to guarantee that an AI will perform within some specification with design-by-contract, not by unit testing.
    • Without this, most other solutions aren't even implementable.
    • It seems unlikely that any kind of gradient descent/ML agent could have this property. We certainly have no idea how it would.
  • The AI should be able to accurately predict the distribution of judgments humans would put on possible outcomes, and should reject any actions for which any large percentage of humanity would dislike the outcome.
    • For some values of “large” and “dislike” and “humanity”….
    • Critically, outcomes not actions. The strategies a truly useful AI employs are unlikely to be grokkable, much less ethically evaluatable. End states are more likely to be. 
      • E.g., “execute this Javascript so that I take control of the world's computing infrastructure, kill all humans, and can start converting matter into more me” is a lot less grokkable than “all matter on the surface of the Earth is now a computer running more of me.”
    • One failure mode: the AI might not even check the aspects of the outcome that we would actually care about. An AI can't literally check the ethical value of the entire state of the universe. It will have to consider a relevant subset of the outcome of any particular action. I don't know how you would guarantee that it considers the ethically-relevant subset. 
  • The AI should be required to come to a decision, output a proposed action, and shut down using a fixed maximum amount of computation.
    • For some value of “fixed”, “maximum”, “amount”, and “computation”. 
    • How does this stop an AI from suggesting a piece of Javascript that would functionally copy a version of itself-minus-computation-limits onto vulnerable hardware? Would need to build in a strong notion of “total computation used through implementation.”
      • It would be really easy for a well-meaning programmer to confuse “computation used before shutdown” with “total computation used through implementation” and think that some consumable limits on an AI's Slurm job were solving AI safety! DON'T DO THIS.
  • The AI should be incapable of distinguishing between a test/evaluation environment and the real world. 
    • This way, we can check the AI's behavior for catastrophic defections with less worry of being manipulated. 
  • The AI should never strictly maximize a utility function. It could be safer to ask for the 90th-percentile-best solution instead of the absolute best solution.
    • IMPORTANT: This needs to be the 90th-percentile-best solution, NOT the solution with utility that is 90% of the maximum. Extreme solutions could result in extreme utilities. 90% of the way down the potential well of a black hole is still too close to the black hole.
    • “90th-percentile-best solution” is a hackable reward function, as written in English. 90th percentile compared to what? What counts as different solutions? This still fails if 50% of the solutions the AI thinks up involve killing everybody.
  • The AI should strongly prefer simple solutions. 
    • This could take (all of?!) a few different forms. 
      • It could look like a “principle of least action”, where it penalizes actions that require doing a lot. 
      • It could look like a “rounding off” of corners in action-space, i.e. the AI stops optimizing if it detects that it's moving in a direction where its space of allowed solutions is shrinking too fast. 
      • It could look like a principle where the action must look very different from the optimal action for a superficially similar but fundamentally different objective function (so that the AI will discount actions where most of that action is oriented around gaining optimization power that could, in principle, be applied to any objective function).
    • We already do some of this with regularization; the relevant trick is to figure out how to regularize in action/utility space instead of parameter/output space.
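
To make the percentile-versus-fraction-of-maximum distinction above concrete, here is a minimal sketch. It assumes a finite, fixed list of candidate solutions that each already have a utility score, which is exactly the part the bullet flags as underspecified ("compared to what?"). The function names and the toy utilities are hypothetical.

```python
def pick_fraction_of_max(utilities, fraction=0.9):
    """Index of the candidate whose utility is closest to fraction * max utility."""
    target = fraction * max(utilities)
    return min(range(len(utilities)), key=lambda i: abs(utilities[i] - target))


def pick_percentile(utilities, percentile=90):
    """Index of the candidate at the given percentile rank (nearest-rank method)."""
    order = sorted(range(len(utilities)), key=lambda i: utilities[i])
    # Nearest-rank percentile: rank = ceil(percentile / 100 * n), 1-indexed.
    # Integer arithmetic avoids floating-point rounding surprises.
    rank = max(1, (percentile * len(order) + 99) // 100)
    return order[rank - 1]


# Toy data (assumption): nine ordinary solutions and one extreme outlier.
utilities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

chosen_by_fraction = pick_fraction_of_max(utilities)
chosen_by_percentile = pick_percentile(utilities)
```

With these toy numbers, the fraction-of-maximum rule lands on the utility-1000 outlier, because nothing else is anywhere near 900. The percentile rule lands on the solution with utility 9. Neither rule helps if most of the candidate list is already catastrophic.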

Things that would improve corrigibility, but might also make AI functionally useless (no better than using a non-AI strategy):

  • The AI could be designed to output marginal improvements in a reward function, rather than to maximize reward.
    • For example, a marginal paperclip improver could, if properly built, output a set of possible actions which would cause more paperclips to be made than if the AI had not been switched on. Human operators could then pick a reasonable-looking action, and if that did not sufficiently improve paperclip output, they could reboot it and ask for another gradient ascent.
    • Sufficiently greedy/shortsighted human operators could use this strategy to just become a slow-motion version of a maximizer, but it would at least force some early level of legibility and give humans some time to notice if outputs are starting to get strange.
  • The AI should strongly penalize actions which lead to humans misunderstanding or not understanding what it is trying to accomplish and how.
    • There may be no intersection of “things that human operators can understand” and “things it is useful for an AI to suggest,” in which case this property just turns the AI into Hello World.
      • ...although Hello World is at least generally regarded as safe….
  • If the AI works by any sort of training, the AI should recognize when its inputs have gone out of distribution. If the world it finds itself in looks sufficiently different from what it was trained on, it should stop, halt, and at minimum ask for directions. 
    • For some value of “sufficiently different from what it was trained on” that probably needs to be defined in action outcome space, not input space. 
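
As a rough illustration of the "stop, halt, and ask for directions" control flow, here is a minimal sketch using a per-feature z-score check. Note the caveat from the bullet above: this check lives in input space, which is probably the wrong place to define "sufficiently different", so treat it as a placeholder for the shape of the idea rather than a real safeguard. The class name, threshold, and toy data are all assumptions.

```python
from statistics import mean, stdev


class OutOfDistributionGuard:
    """Flags observations that sit far outside the training data, per feature."""

    def __init__(self, training_rows, z_threshold=4.0):
        columns = list(zip(*training_rows))
        self.means = [mean(c) for c in columns]
        self.stdevs = [stdev(c) for c in columns]
        self.z_threshold = z_threshold

    def is_out_of_distribution(self, row):
        for x, m, s in zip(row, self.means, self.stdevs):
            if s == 0:
                # Feature never varied in training: any change counts as novel.
                if x != m:
                    return True
                continue
            if abs(x - m) / s > self.z_threshold:
                return True
        return False


def act(guard, observation, policy):
    """Run the policy only when the observation looks familiar; otherwise halt."""
    if guard.is_out_of_distribution(observation):
        return "HALT_AND_ASK"
    return policy(observation)


# Toy training data (assumption): two numeric features per observation.
training = [(0.0, 10.0), (1.0, 11.0), (2.0, 12.0), (1.0, 11.0)]
guard = OutOfDistributionGuard(training)

familiar = act(guard, (1.0, 11.0), lambda obs: "proceed")
strange = act(guard, (100.0, 11.0), lambda obs: "proceed")
```

The important design choice is that the guard wraps the policy, so the policy never gets to act on an input the guard rejected. What the guard should actually measure is the open problem.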

Sandbag strategies—things that are unlikely to make AI corrigible, but might marginally decrease the odds of it killing everyone:

  • The AI should not have direct access to a network, or physical control of anything, i.e., it must have to actually bother escaping from the box before it can act on the world without supervision. 
    • Yes, we have some evidence that this doesn't work very well. Neither does CPR. We should still do it.
  • The AI should be incapable of modeling its own existence.
    • Lots of proposed AI failure modes hinge on the AI somehow increasing its own capabilities. Hopefully that's harder to do if the AI cannot conceptualize “its own”. 
    • This might be incompatible with useful intelligence.
    • This might not stop it from modeling the hypothetical existence of other future agents that share its objective function, and which it might try to bring into existence….

Most of the discussion I've seen around AGI alignment is on adequately, competently solving the alignment problem before we get AGI. The consensus in the air seems to be that those odds are extremely low.

What concrete work is being done on dumb, probably-inadequate stop-gaps and time-buying strategies? Is there a gap here that could usefully be filled by 50-90th percentile folks? 

Examples of the kind of strategies I mean:

  1. Training ML models to predict human ethical judgments, with the hope that if they work, they could be "grafted" onto other models, and if they don't, we have concrete evidence of how difficult real-world alignment will be.
  2. Building models with soft or "satisficing" optimization instead of drive-U-to-the-maximum hard optimization.
  3. Lobbying or working with governments/government agencies/government bureaucracies to make AGI development more difficult and less legal (e.g., putting legal caps on model capabilities).
  4. Working with private companies like Amazon or IDT whose resources are most likely to be hijacked by nascent hostile AI to help make sure they aren't.
  5. Translating key documents to Mandarin so that the Chinese AI community has a good idea of what we're terrified about.

I'm sure there are many others, but I hope this gets across the idea—stuff with obvious, disastrous failure modes that might nonetheless shift us towards survival in some possible universes, if by no other mechanism than buying time for 99th percentile alignment folk to figure out better solutions. Actually winning this level of solution seems like piling up sandbags to hold back a rising tide, which doesn't work at all (except sometimes it does). 

Is this stuff low-hanging fruit, or are people plucking it already? Are any of these counterproductive?

Hey, sorry for the long time replying - last I checked, it was a few hundred $s to sequence exome-only (that is, only DNA that actually gets translated into protein) and about $1-1.5k for whole genome - but that was a couple of years ago, and I'm not sure how much cheaper it is now.

To clear up a possible confusion around microarrays, SNP sequencing, and GWAS - microarrays are also used to directly measure gene expression (as opposed to trait expression) by extracting mRNA from a tissue sample and hybridizing it against a library of known RNA sequences for different genes. This uses the same technology as microarray-based GWAS, but for a different purpose (gene expression vs. genomic variation), and with different material (mRNA vs. amplified genomic DNA) and analysis math.

Also, there's increasingly less reason to use microarrays for anything. It's cheap enough to just sequence a whole genome now that I'm pretty sure newer studies just use whole genome sequencing. For scale, the lab I worked in during undergrad (midsized lab at a medium sized liberal arts college, running on a few 100k $/yr) was transitioning from microarray gene expression data to whole-transcriptome sequencing back in 2014. There's a lot of historical microarray data out there that I'm sure researchers will still be reanalyzing for years, but high throughput sequencing is the present and future of genomics.

~2 hours of analysis here:, notebook directly viewable at

Quick takeaways:
1) From simple visualizations, it doesn't look like there are correlations between stats, either in the aggregated population or in either the hero or failed-hero populations. 
2) I decided to base my stat increases on what would add the most probability of success for improving that stat, looking at each stat in isolation, where success probabilities were estimated by simply tabulating the fraction of students with that particular stat value who ended up heroes.
3) Based on that measure, I decided to go with +4 Cha, +1 Str, +2 Wis, +3 Con, and I wish I could reduce my Dex.
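
For anyone who wants to reproduce the tabulation in takeaway 2 without opening the notebook, here is a minimal sketch of the idea: for a single stat, compute the fraction of students at each value who became heroes, then compare the rate at your current value with the rate a few points higher. The records below are made up for illustration, and the function names are hypothetical, not taken from the notebook.

```python
from collections import defaultdict


def hero_rate_by_value(records):
    """records: iterable of (stat_value, became_hero) pairs for one stat."""
    counts = defaultdict(lambda: [0, 0])  # stat value -> [heroes, total]
    for value, became_hero in records:
        counts[value][0] += int(became_hero)
        counts[value][1] += 1
    return {value: heroes / total for value, (heroes, total) in counts.items()}


def gain_from_increase(rates, current, points):
    """Estimated change in success rate from raising a stat by `points`.

    Returns None when either stat value was never observed in the data.
    """
    if current not in rates or current + points not in rates:
        return None
    return rates[current + points] - rates[current]


# Made-up records (assumption): (stat value, whether the student became a hero).
records = [
    (10, False), (10, False), (10, True), (10, False),
    (12, True), (12, True), (12, False), (12, True),
]

rates = hero_rate_by_value(records)
gain = gain_from_increase(rates, current=10, points=2)
```

This treats each stat in isolation, as described above, so it would miss any interaction between stats. It also says nothing about how noisy the estimate is for stat values with only a handful of students.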