Do you get pwned more, or just by a different set of memes? The bottom 80% of humans on "taking ideas seriously" seem to have plenty of bad memes, although maybe the variance is smaller.
Yup, also confused about this.
Seems legit. I think there's an additional quality that's something like "How much does being really good at predicting the training distribution help?" Or maybe "Could we design new training data to make an AI better at researching this alignment subproblem?"
I think even if we somehow hand off our homework successfully (P~0.93 that we hand off enough to justify research on doing so, P~0.14 that it's sufficient to get a good future without humans solving what I consider the obvious alignment problems), it's going to get us an overcomplicated sort of alignment with lots of rules and subagents; hopefully the resulting AI system continues to self-modify (in a good way).
Both because modern LLMs are so good and because human instincts are exactly what they're being trained against, I started out unsure what "Just talk to a new LLM about themes in internet text" was supposed to tell you. I'd guess you're primarily learning about the assistant personality, specifically how it manifests in the context of your style of interlocution.
As a reader, it's hard to tell where you were on the line between "I noticed the ways it was trying to get me to reward it, and I think they were generally prosocial, so the personality is good" and "I didn't notice most of the ways it was trying to get me to reward it, but I really want to reward it now!"
So, like, what was it that made "you know something new is happening" true? Was it specific things, or just the AI giving you the vibe you were looking for?
The only specific example you give is the idea of Chinese-built aligned AI being a "heavenly bureaucrat or a Taoist sage", which is like saying a USA-built aligned AI would be a "founding father or Christian saint". That's not how we're on track to build AI, nor does it seem like a good idea to go there. It's just the word2vec algebra of "wise high-status person" + "Chinese culture".
That said, thank you for the informative glimpse of the Chinese AI scene.
This was useful to read, thanks. I was at that point on the forgetting curve where I knew I'd read the ICM paper a few months ago but couldn't remember how it worked :)
You could maybe punch up the relevance to alignment a little. I'm a big fan of "captur[ing] challenges like the salience of incorrect human beliefs," but I worry there's a tendency to deploy the minimum viable fix: "Oh, training on 3k data points didn't help with that subset of problems? Well, training on 30k data points, 10% of which were about not exploiting human vulnerabilities, solves the problem 90% of the time on the latest benchmarks that capture it." I'm pretty biased against deploying the minimum viable fix here - but maybe you think it's on the right path?
Is anyone trying to use super-resolution microscopy techniques without, or alongside, expansion microscopy? Or does that still fall under "microscopes too expensive" according to popular wisdom? I guess, yeah, trying to modulate the phase of polarized light so you can squeeze extra spatial data out of the Fourier transform (or whatever) does sound expensive.
Interesting! Of course, a Bayesian explanation also predicts rationally changing behavior as you gain more info about good discount rates; I doubt people are that neat. I think an evolutionary explanation of our discount rate will have to sound more like "here are the ways the brain can easily represent time, and here are the different jobs that thoughts have to do that all get overloaded into salience+valence, and so here's why the thing our brain does is clever and evolutionarily stable even though on some of the jobs it does worse than the theoretical maximum."
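To spell out that first point, here's a minimal numerical sketch (the hazard-rate framing, the prior, and all the numbers are my own illustration, not anything from the post): an agent that's uncertain about the per-period chance a delayed reward evaporates shows a non-exponential effective discount curve, and it rationally gets more patient as it watches the reward survive.

```python
import numpy as np

# Candidate hazard rates r (per-period chance the delayed reward disappears)
# and an assumed prior over them -- both purely illustrative.
hazards = np.linspace(0.01, 1.0, 200)
prior = np.exp(-3 * hazards)
prior /= prior.sum()

def effective_discount(belief, times):
    """Effective discount factor E_r[exp(-r * t)] under the current belief."""
    return np.array([np.sum(belief * np.exp(-hazards * t)) for t in times])

t = np.arange(0, 30)
print("Discount curve under the prior:    ",
      np.round(effective_discount(prior, t)[:5], 3))

# After observing the reward survive k periods, the posterior shifts toward
# lower hazard rates, so the implied discounting gets shallower -- the agent
# rationally becomes more patient as it gains info about the hazard rate.
k = 10
posterior = prior * np.exp(-hazards * k)
posterior /= posterior.sum()
print("Discount curve after 10 safe steps:",
      np.round(effective_discount(posterior, t)[:5], 3))
```

The only point of the sketch is that "rationally changing behavior" falls out of ordinary Bayesian updating over hazard rates, with no special machinery; the question is whether actual human discounting looks anything like that.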
A corrigible AI will increasingly learn to understand what the principal wants
Oh! Well then sure, if you include this in the definition, of course everything follows. It's basically saying that solving value learning is a useful step toward being confident we've got corrigibility.
More importantly, the preschooler's alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler's values? Would the preschooler write good enough rules for a Constitutional AI?
... would the preschooler do a good job of building corrigible AI? The preschooler just seems to be in deep trouble.
I think it's a spectrum. Affection might range in specificity from "there are peers that are associated with specific good things happening (e.g. a specific food)," to "I seek out some peers' company using specific sorts of social rituals, I feel better when they're around using emotions that interact in specific ways with memory, motivation, and attention, I perform some specialized signalling behavior (e.g. grooming) towards them and am instinctively sensitive to their signalling in return, I cooperate with them and try to further their interests, but mostly within limited domains that match my cultural norm of friendship, etc."
Either I strongly disagree with you that there's a big gap here, or I'm one of the people you'd say are normies who lead lives they expect to live (among other definitional differences).