I'll note that "not being sure what utility functions are in use" is generally (in the colloquial sense) not how standard game theory works. It seems like I am not competent enough at standard game theory to clearly write down the edge cases I think might exist that could help with your understanding. This paragraph could serve as a placeholder for the case where I develop that competence.
As for non-standard game theory, you say you're reading the 2009 book The Bounds of Reason here[1] and I wonder if you've heard of the newer Translucent players: Explaini...
I recommend against the use of Math.random() in general, unless you are highly performance constrained. I've checked, and it appears browsers have commonly supported a better random source since 2015[1]/early 2016 at the latest. The code below should be entirely correct to replace both the primary and fallback UUIDv4 generation code, once adapted to a function in a TS module.
// crypto.getRandomValues has been widely available in browsers since 2015.
const uuid_b = new Uint8Array(16);
self.crypto.getRandomValues(uuid_b);
uuid_b[6] = (uuid_b[6] & 0x0f) | 0x40; // UUIDv4 version bits
uuid_b[8] = (uuid_b[8] & 0x3f) | 0x80; // UUIDv4 variant bits
let uuid_hex = "";
for (let i = 0; i < 16; i++) {
  uuid_hex += uuid_b[i].toString(16).padStart(2, "0");
  if (i === 3 || i === 5 || i === 7 || i === 9) uuid_hex += "-";
}

I may want to say something about your requirements in the future. If that is the case you can verify the latest possible writing time using the cryptographic commitment.
HMAC-SHA2-256(INPUT, HMAC_KEY)=2d5c9d62761f420e57919f4bf39f44cfe8ff3740322221b61f32de01e7e8786f
SHA3-224(HMAC_KEY)=80a2da01146495971b9ccf9fa9c20405cf582d091073aa985348cd1e

Cryptography Note
Note that this commitment mechanism isn't particularly secure. "Make the outputs longer" isn't something that helps by itself. If you know cryptography you may be able to get closer to Yudkowsky's hypothe
Accepting that framing, I would characterize it as optimizing for inexploitability and resistance to persuasion over peak efficiency.
Alternatively, this job/process could be described as consisting of a partially separate skill or set of skills. It appears to be an open problem how to extract useful ideas from an isolated context[1] without distorting them in a way that would lead to problems, while also not letting out any info-hazards or malicious programs. Against adversaries (accidental or otherwise) below superintelligence, a human may be able to ...
This bodes very poorly, and we should probably make sure we have a strategic reserve of AI safety researchers who do NOT talk to models going forward (to his credit, Davidad recommends this anyway).
I previously followed a more standard safety protocol[1] but that might not be enough when considering secondary exposure to LLM conversations highly selected by someone already compromised.
By my recollection[2], a substantial percentage of the LLM outputs I've ever seen have been selected or amplified in distribution by Janus.
From now on I won't read anything by ...
One point of this framework is to distinguish "sharing values" from "actually trusting each other". There are cases where agents share values but don't trust each other, or get stuck in coordination traps.
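To make the distinction concrete, here is a toy sketch (the payoffs are my own invented example, not from the framework itself): a common-payoff stag hunt in which both agents have literally the same utility function, yet mutual defection remains a best response when trust is low.

```javascript
// Common-payoff stag hunt: both players receive the same utility,
// so "values are shared" by construction. payoff[myMove][theirMove].
const payoff = { stag: { stag: 4, hare: 0 }, hare: { stag: 3, hare: 3 } };

// Best response given credence p that the other player hunts stag.
function bestResponse(p) {
  const evStag = p * payoff.stag.stag + (1 - p) * payoff.stag.hare;
  const evHare = p * payoff.hare.stag + (1 - p) * payoff.hare.hare;
  return evStag >= evHare ? "stag" : "hare";
}

console.log(bestResponse(0.9)); // trusting: "stag"
console.log(bestResponse(0.5)); // distrustful: "hare"
```

Shared values make (stag, stag) the jointly preferred outcome, but without trust each agent rationally plays hare: a coordination trap with no value disagreement anywhere.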
In Wei Dai's thinking, having the same values/utility function means that two agents care about the exact same things. This is formalized in UDT, but it's also a requirement you can add to most decision theories, e.g. CDT with reflective oracles (or some other mostly lawful incomplete measure). This is normally described as requiring that the utility funct...
The code doesn't look like it would cause catastrophic problems. The main risk to end users at the current level of testing is a bug causing important information to be missed. My ability to comment on the risk to a developer is limited, however, because I haven't read the source code of all the development dependencies.
I have visually checked (as a human) the dist/power-reader.user.js file. End users should be relatively safe copying this into their browser plugins, as long as no installed plugin has relevant security problems or malicious code. As mentioned b...
[Epistemic Status: Moderate confidence due to potential differences in Anthropic's stated and actual goals. Assumes there is no discoverable objective morality/ethics for the sake of argument, but also that the AI would discover that instead of causing catastrophe.]
It seems that Claude's constitution weakly to moderately suggests that an AI should not implement this proposal. Do you want to ask Anthropic to change it? I give further details and considerations for action below.
The constitution is a long document, but it is broken into sections in a relative...
You, Kokotajlo, not immediately dismissing the idea is "evidence" to the extent that you stand in for the AI researchers who might make the decision. In quotes because a logically omniscient (e.g. perfectly Bayesian) agent would presumably already have a good guess and would not update much, if at all. On the other hand, agents with (small) finite compute can run experiments or otherwise observe events and use the results to improve their "mathematical intuition", which is then used in a similar way to the "mathematical intuition module" in UDT, except with the sacrific...
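One way to sketch the "would not update much" claim (a toy Bayesian calculation of my own, not anything from UDT): the size of an update is governed by how surprising the observation is, so an agent that already nearly predicted the event barely moves.

```javascript
// Posterior from prior and likelihood ratio P(E|H)/P(E|not-H),
// via the odds form of Bayes' theorem.
function update(prior, likelihoodRatio) {
  const odds = (prior / (1 - prior)) * likelihoodRatio;
  return odds / (1 + odds);
}

// A near-omniscient agent assigns the event almost the same probability
// under both hypotheses (ratio near 1), so the posterior barely moves.
console.log(update(0.5, 1.05)); // ≈ 0.512
// A bounded agent that finds the same event 4x likelier under H updates a lot.
console.log(update(0.5, 4));    // 0.8
```

The same observation carries almost no information for the first agent and substantial information for the second; that asymmetry is the whole point of the scare quotes.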
I saw this message without context in my mailbox, and my first thought was to reply that this is an unsolved problem[1]: things that simply are not true can't stand up very well in a world model. But this seems like something an intelligent human like Amodei or Musk should be able to do. A 99% "probability" (guess by a human) on ¬ai_doom should not be able to fix enough detail to directly contradict reasoning on the counterlogical/counterfactual where doom instead happens. Any failure to carry out this reasoning task seems like a simple failure of reasoning in lo...
Spoiler
HJPEV is bound by a magical oath that prevents this human from failing in the same way that failure is prevented in an agent that meets tiling desiderata. This is explicit in the text. E-Book draft, 2015, chapter 113.
Admittedly, this assumes both that the "time of peril" hypothesis is correct and that the period can be handled while maintaining human freedom; and the solution (at maximum robustness) only binds until the end of this time.