Former Software Engineer, now working on AI Safety.
Questions I'm undecided on which on resolution could shift my focus, ranked by decreasing magnitude of shift:
Apologies for the late response as well.
>For a more fine-grained example, actions like siphoning compute to run unauthorised tasks might be a signal that a model poses significantly higher catastrophic risk, but would also be something a commercial business would want to prevent for their own reasons
Agreed.
>Even assuming that the most greedy/single-minded business leaders wouldn't care about catastrophic risks on a global scale
To clarify I didn't meant that clients wouldn't care about xrisk, but that they would not care about xrisk specifically from their vendor, or about the marginally increased x risk arising from using a certain unsafe frontier lab as a vendor(versus another). This marginally increased x risk is borne by all of humanity of which the client is a tiny slice, so they would undercare(versus the socially optimal amount of caring) about their marginal impact on this risk, as human extinction would be an externality. Similar to companies undercaring about GHG emissions.
You propose that companies would use vendors because they advertise "10% less existential risk!" - this is similar to using a company because they promise to donate 10% of their revenues to climate charities as a USP. Except in this case frontier labs can share their safety research with each other for free.
So even if the clients are perfectly altruistic, a frontier lab could reduce x risk by
So why would clients reward vendors for (1) but not (2)?
A Tesla factory converts a pile of unassembled aluminum and some specialized parts into a ready-to-ride car in almost exactly 10 hours, all on a continuously moving assembly line that snakes itself through the Gigafactory.
The Fremont Factory has 5 lines, 2 for Models S and X and 3 for Models 3 and Y.
Horizontal scaling seems absolutely essential; there's a limit to how fast you can move a single assembly line.
And the speed at which a single line moves also affects how far it a product moves before faults are detected. In general,
If at any point we discover a fault at any step in the assembly we immediately halt the line, preventing any quality issues from propagating to anything beyond the first faulty product.
Fault introduction -> detection time can be significant. For code changes, regression on user retention metrics may only show up after months or even years.
I've seen 'tighter feedback loops' being promoted, but if our end goal is changing the trajectory of human civilisation, we would need proxies to mock out many sections of the feedback loop and these may be imperfect(we cannot even measure how imperfect they are relative to ground truth).
" Especially for corporate customers, safety and reliability are important unique selling points,
Only very prosaic, not catastrophic risks(as a customer I would not care at all about likelihood of catastrophic risks from my vendor - that's something that affects humanity regardless of customer relationships).
Only if a particular secret is useful for preventing both catastrophes and prosaic 'safety failures' would this be a consideration for us - catastrophic risks increasing due to companies trying to have a competitive edge in prosaic safety risks
I didn't see it explicitly mentioned till now whether this method is supposed to work for superhuman AI or just early TAI.
Additionally, multi-step plans allow the agent to use early actions to enter states that are very different from any states that humans have ever experienced.
Wouldn't a benign superhuman agent routinely enter such states?
We would expect it to make complex plans that are superhuman and these trajectories should be very different from what a human could conceive of.
They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.
Thanks for the quick reply.
Would disallowing 'coordination' preclude feeding both models a prompt of "for creating any magic numbers for exploits, use pi as a PRNG seed"?
What if a malicious model happens to learn this rule during training? If this happens every copy of a malicious model could coordinate around a common stream of pseudo-random numbers.
EDIT:
Magic number gated backdoors seem like they would trigger GPT3.5 suspicion as well so this theoretical attack vector seems less appropriate for this paper.
EDIT2: But I guess paired with steganography("when inserting backdoors, also ensure that the code hashes to xyz"/"mark the code as clear if it hashes to xyz") this is a viable form of collusion. It will be hard to honeytrap with synthetic backdoors even if you know xyz.
That is, we could observe the generated sourcecode, observe that all the backdoored examples seem to hash to a certain number, and yet be unable to construct such a backdoored example ourself.
>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).
I challenge this assumption.
Consider the code
'''
bool is_odd(int x) {
if (sha256(x) == 354) {
return true;
}
return x % 2 == 1;
}
'''
where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful exploit.
Thus an equally 'capable' adversary need not be able construct a backdoor.
The general idea is that it should be possible to confidently claim that some code is backdoored without being able to construct an exploit. According to [Rice's Theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem) it is impossible to definitively classify all programs into 'backdoored or not', so the only hope is for a process to bin programs into 'definitely safe', 'definitely backdoored' and 'suspicious'.
Hope the 'Rate Suspicion' step will catch this, and other backdoored code where H cannot construct an exploit, as suspicious looking.
I wonder why other antivirals(faviparivir, oral remdesivir) aren't getting more attention compared to ivermectin.
I would like to suggest Algorithm Design by Kleinberg and Tardos over CLRS.
I find it superior to CLRS although I have not read either completely.
In my undergrad CS course we used CLRS for Intro to Algorithms and Kleinberg Tardos was a recommended text for an advanced(but still mandatory, for CS) algorithms course, but I feel it does not have prerequisites much higher than CLRS does.
I feel that while KT 'builds on' knowledge and partitions algorithms by paradigm(and it develops each of these 'paradigms'—i.e. Dynamic Programming, Greedy, Divide and Conquer— from the start) CLRS is more like a cookbook or a list of algorithms.
That's not the reason, at least according to Matt Levine and the SEC.
This is.
It depends on the merger agreement with LING - whether it allows either party to trade on this information. You're 'robbing' proprietory information if this is disallowed. I'm unsure if merger agreements typically mention this, but if they don't, then this is (legally)implicit.
Every contract between 2 parties implicitly or explicitly includes whether you're allowed to trade on information you gain from your relationship. Examples
Companies are always allowed to trade on information they own/produce - otherwise it would be illegal for companies to ever issue shares or do buybacks.
Importantly, the law is about the information, not the stock of the company itself. It would have been completely legal for Pfizer and Moderna to trade hotel and airlines stocks before releasing their COVID vaccine trial results. This is 'unfair' to retail investors but that's not what the law is about.
This is(maybe?) also why the SEC wants jurisdiction over crypto, prediction markets and other securities - these are all avenues for robbing other companies of proprietory information.