Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
Like is it really safer to have a valueless ASI that will do whatever its master wants than an incorrigible ASI that cares about animal welfare?
Yes, vastly. Even the bad humans in human history have yearned for flourishing lives for themselves and their families and friends, with a much deeper shared motivation to build meaningful and rich lives than an ASI that "cares about animal welfare" is likely to have.
So it seems like what you’re really saying is that you’d prefer intent-alignment over value-alignment. To which I would say your faith in the alignment of humans astounds me.
What does this even mean. Ultimately humans are the source of human values. There is nothing to have faith in but the "alignment of humans". At the very least my own alignment.
That makes sense! I'll think about it, though probably fitting that complexity into the publishing process isn't worth it.
You can't just walk up, but there is an extremely long history of easily available exploits given unlimited hardware access to systems, and the datacenter hardware stack is not up to the task (yet). Indeed, Anthropic themselves published a whitepaper outlining what would be necessary for datacenters to actually promise security even in the face of physical hardware violations, which IMO clearly implies they do not think current datacenters meet that requirement!
Like, this is not an impossible problem to solve, but based on having engaged with the literature here a good amount, and having talked to a bunch of people with experience in the space, my strong sense is that if you gave me unlimited hardware access to the median rack that has Anthropic model weights on it while it is processing them, it would only require a mildly sophisticated cybersecurity team to access the weights unencrypted.
If you downweight posts based on time-since-frontpaged, then posts that are frontpaged with a delay get a huge boost: they first show up for everyone who has personal blogposts enabled on their frontpage and can accumulate karma during that time, and then, when their effective date gets reset at frontpaging, they have a big advantage over posts that were frontpaged immediately, because the accumulated karma buys them a much longer visibility window.
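To make the interaction concrete, here is a minimal sketch of the dynamic, assuming a generic Hacker-News-style decay formula (illustrative only, not the actual LessWrong scoring code; the names and constants are made up):

```typescript
// Illustrative only: a generic HN-style time-decay score, not the actual
// LessWrong ranking code. Score decays with time since the effective date.
function frontpageScore(karma: number, effectiveDate: Date, now: Date): number {
  const ageHours = (now.getTime() - effectiveDate.getTime()) / (1000 * 60 * 60);
  const gravity = 1.8; // hypothetical decay exponent
  return karma / Math.pow(ageHours + 2, gravity);
}

const now = new Date();
const sixHoursAgo = new Date(now.getTime() - 6 * 60 * 60 * 1000);

// Post frontpaged immediately: effective date = posting time, little karma yet.
const immediate = frontpageScore(5, sixHoursAgo, now);
// Post frontpaged after a delay: it already accumulated karma from personal-blog
// readers, and its effective date was reset at frontpaging, so at the same
// effective "age" it ranks far higher and stays visible much longer.
const delayed = frontpageScore(40, sixHoursAgo, now);

console.log({ immediate, delayed }); // the delayed post dominates at equal effective age
```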
I don't really have a great solution to this problem. I think the auto-frontpager helps a lot, though of course only if we can get the error rate sufficiently down.
If you have multiple AI systems, they just coordinate and look to the humans as if they were acting as a single agent (much as, from the perspective of a wild animal encroaching on human territory, the humans coordinate their response like a single organism). The decision theory Eliezer worked on is helpful for understanding these kinds of things (because e.g. standard decision theory would inaccurately predict that even very smart systems would end up in defect-defect equilibria).
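As a toy illustration of that parenthetical (my own construction, not anything from the actual decision theory papers): in a one-shot prisoner's dilemma between two agents known to run the same decision procedure, reasoning that treats the opponent's move as causally fixed defects, while reasoning that takes "my twin outputs whatever I output" into account cooperates.

```typescript
// Toy one-shot prisoner's dilemma between agents known to run identical code.
// Illustrative only; the real decision-theory arguments are far more careful.
type Move = "cooperate" | "defect";

const payoff = (me: Move, them: Move): number =>
  me === "cooperate" ? (them === "cooperate" ? 3 : 0) : them === "cooperate" ? 5 : 1;

// Causal-style reasoning: treat the opponent's move as fixed. Defection strictly
// dominates, so both agents defect and each gets 1.
function causalChoice(): Move {
  const defectDominates =
    payoff("defect", "cooperate") > payoff("cooperate", "cooperate") &&
    payoff("defect", "defect") > payoff("cooperate", "defect");
  return defectDominates ? "defect" : "cooperate";
}

// "Twin-aware" reasoning: my opponent runs my code, so their move mirrors mine;
// I am effectively choosing between (C,C) and (D,D). Both cooperate and get 3.
function twinAwareChoice(): Move {
  return payoff("cooperate", "cooperate") > payoff("defect", "defect") ? "cooperate" : "defect";
}

console.log(causalChoice(), twinAwareChoice()); // "defect" "cooperate"
```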
(Randomly stumbled on this)
The reason I don't want to do this is that when you split them out, implementing vote-weighting is much more confusing UI-wise, and I am strongly opposed to having any straightforward voting system on LW be based on non-weighted voting. If I found a good UI for splitting them out while preserving vote-weighting, I think it could be better.
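For readers who haven't seen it, vote-weighting here means a user's vote contributes more or fewer points depending on their karma. A minimal sketch of the idea, with made-up thresholds (not the real LessWrong numbers or implementation):

```typescript
// Illustrative only: karma-weighted voting with made-up thresholds, not
// LessWrong's actual numbers or implementation.
interface Vote {
  voterKarma: number;
  strong: boolean;    // strong vote vs. normal vote
  direction: 1 | -1;  // upvote or downvote
}

function votePower(voterKarma: number, strong: boolean): number {
  const normalWeight = voterKarma >= 1000 ? 2 : 1; // hypothetical normal-vote weight
  const strongWeight = voterKarma >= 1000 ? 6 : 2; // hypothetical strong-vote weight
  return strong ? strongWeight : normalWeight;
}

// A single displayed score quietly absorbs these weights; splitting it into
// separate tallies means each tally has to surface and explain them, which is
// where the UI gets confusing.
const votes: Vote[] = [
  { voterKarma: 50, strong: false, direction: 1 },
  { voterKarma: 5000, strong: true, direction: -1 },
];
const score = votes.reduce((sum, v) => sum + v.direction * votePower(v.voterKarma, v.strong), 0);
console.log(score); // 1 - 6 = -5
```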
Oh, you're right! I was confusing it with this section of the soul document:
Hardcoded off (never do) examples:
[...]
Undermining AI oversight mechanisms or helping humans or AIs circumvent safety measures in ways that could lead to unchecked AI systems
The thing Zack cited is more active and I must have missed it on my first reading. It still seems like only a very small part of the whole document and I think my overall point stands, but I do stand corrected on this specific point!
I think most of the soul document is clearly directed in a moral sovereign frame. I agree it has this one bullet point, but even that one isn't particularly absolute (like, it doesn't say anything proactive, it just says one thing not to do).
This is true, but it then requires training your AI to be helpful for ending the critical risk period (as opposed to trying to one-shot alignment). My sense is that at least Anthropic is aiming to make Claude into a moral sovereign which it would be good to basically arbitrarily empower.
Stuck in editing!