LESSWRONG

Viktor Rehnberg

Comments

Recent AI model progress feels mostly like bullshit
Viktor Rehnberg · 4mo · 20

Another hypothesis: Your description of the task is

the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.

From METR's recent investigation on long tasks you would expect current models not to perform well on this.

[Figure: METR's graph]

I doubt a human professional could do the tasks you describe in anything close to an hour, so perhaps it's just too hard at present and the current improvements don't make much of a difference on the benchmark, though they might in the future.

Survival without dignity
Viktor Rehnberg · 8mo · 22

(Perhaps you're thinking of this https://www.lesswrong.com/posts/EKu66pFKDHFYPaZ6q/the-hero-with-a-thousand-chances)

Should you refuse this bet in Technicolor Sleeping Beauty?
Viktor Rehnberg · 1y · 30

Good formulation. "Given it's Monday" can have two different meanings:

  • you learn that you will only be awoken on Monday, then it's 50%
  • you awake, assign 1/3 probability to each instance, and then make the update P(T|M)=P(M|T)P(T)/P(M)=(1/2)(2/3)/(2/3)=50%

So it comes out to 50% for both, but it wasn't initially obvious to me that these two ways would give the same result.
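The second route can be checked by enumerating the awakening instances. This is only a minimal sketch, assuming the three instances (Heads, Monday), (Tails, Monday) and (Tails, Tuesday) each get weight 1/3 as in the second bullet; the variable names are illustrative.

```python
from fractions import Fraction

# Assumption: each awakening instance gets equal weight 1/3.
instances = {
    ("Heads", "Monday"): Fraction(1, 3),
    ("Tails", "Monday"): Fraction(1, 3),
    ("Tails", "Tuesday"): Fraction(1, 3),
}

# Marginals read off the table of instances.
p_tails = sum(p for (coin, _), p in instances.items() if coin == "Tails")
p_monday = sum(p for (_, day), p in instances.items() if day == "Monday")
p_monday_given_tails = instances[("Tails", "Monday")] / p_tails

# Bayes' rule: P(T|M) = P(M|T) P(T) / P(M)
p_tails_given_monday = p_monday_given_tails * p_tails / p_monday
print(p_tails_given_monday)
```

Using exact fractions avoids any floating-point rounding in the comparison with 50%.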

Should you refuse this bet in Technicolor Sleeping Beauty?
Viktor Rehnberg · 1y · 50

I'd say P(Tail|Wake-up)=2/3

Should you refuse this bet in Technicolor Sleeping Beauty?
Viktor Rehnberg · 1y · 51

The possible observer instances and their probability are:

  • Heads 50 %
    • Red room 25 %
    • Blue room 25 %
  • Tails 50 %
    • Red room 50 % (On Monday or Tuesday)
    • Blue room 50 % (On Monday or Tuesday)

If I choose the strategy "bet only if blue" (or equivalently "bet only if red"), then the expected value for this strategy is (−300)∗0.25+200∗0.5=25, so I choose to follow this strategy.
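The arithmetic above can be reproduced directly. A minimal sketch, assuming the payoffs from the formula (lose 300 on Heads, win 200 on Tails) and assuming the bet is placed at most once per experiment, namely on the awakening in the blue room; the constant names are made up for illustration.

```python
from fractions import Fraction

LOSS_ON_HEADS = -300  # payoff when the bet is placed and the coin was Heads
WIN_ON_TAILS = 200    # payoff when the bet is placed and the coin was Tails

# Heads (1/2), then the single awakening happens to be in the blue room (1/2).
p_heads_and_blue = Fraction(1, 2) * Fraction(1, 2)
# Tails (1/2): the blue room is visited on exactly one of the two days,
# so a "bet only if blue" bettor bets exactly once.
p_tails_and_blue = Fraction(1, 2) * 1

expected_value = LOSS_ON_HEADS * p_heads_and_blue + WIN_ON_TAILS * p_tails_and_blue
print(expected_value)
```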

I don't remember what halfer and thirder were or what position I consider to be correct.

Was Releasing Claude-3 Net-Negative?
Viktor Rehnberg · 1y · 10

Capabilities leakages don’t really “increase race dynamics”.

Do people actually claim this? Shorter timelines seem like a more reasonable claim to make. Jumping directly to impacts on race dynamics skips at least one step.

Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust
Viktor Rehnberg · 2y · 60

To me it feels like this policy is missing something that accounts for a big chunk of the risk.

While recursive self-improvement is covered by the "Autonomy and replication" point, there is another risk from actors who don't intentionally cause large-scale harm but who, since they don't follow your RSP, use your system to make improvements to their own systems. This type of recursive improvement doesn't seem to be covered by either "Misuse" or "Autonomy and replication".

In short it's about risks due to shortening of timelines.

How to have Polygenically Screened Children
Viktor Rehnberg · 2y · 10

You can see twin birth rates fell sharply in the late 90s

Shouldn't this be triplet birthrates? Twin birthrates look pretty stable in comparison.

Some 2-4-6 problems
Viktor Rehnberg · 2y · 10

Hmm, yeah, it's a bit hard to try stuff when there's no good preview. Usually I'd recommend a rot13 cipher if all else fails, but for number sequences that makes less sense.
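A small sketch of why rot13 hides text spoilers but not number sequences, using the `rot_13` text transform from Python's standard `codecs` module; the helper name and example strings are made up for illustration.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; digits and punctuation are untouched."""
    return codecs.encode(text, "rot_13")


spoiler = "the rule is hidden"
hidden = rot13(spoiler)      # letters are scrambled
restored = rot13(hidden)     # applying rot13 twice restores the text

numbers = "2, 4, 6"
still_visible = rot13(numbers)  # unchanged, so nothing is hidden

print(hidden)
print(still_visible)
```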

Some 2-4-6 problems
Viktor Rehnberg · 2y · 70

I knew about the 2-4-6 problem from HPMOR, and I really like the opportunity to try it out myself. These are my results on the four other problems:

indexA

Number of guesses:

8 guesses of which 3 were valid and 5 non-valid

Guess:

"A sequence of integers whose sum is non-negative"

Result: Failure

indexB

Number of guesses:

39 guesses of which 23 were valid and 16 non-valid

Guess:

"Three ordered real numbers where the absolute difference between neighbouring numbers is decreasing."

Result: Success

indexC

Number of guesses:

21 of which 15 were valid and 6 non-valid

Guess:

"Any three real numbers whose sum is less than 50."

Result: Success

indexD

Number of guesses:

16 of which 8 were valid and 8 non-valid

Guess:

"First number is a real number and the other two are integers divisible by 5"

Result: Failure

Performance analysis

I'd say that the main failure modes were that I didn't do enough tests and that I was a very bad number generator. For example, in indexD I made 9 tests of my final hypothesis, 4 of which were valid. The probability that my guess and the actual rule would give the same result on all 9 tests would be very small if I were actually good at randomizing.
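To give a rough sense of scale for that claim: if the tests were drawn independently and the guessed rule and the actual rule disagreed on a fraction d of such draws, all 9 tests would agree with probability (1 − d)^9. The sketch below is an illustration under that independence assumption; the values of d are hypothetical and do not come from the actual problem.

```python
def p_all_agree(disagreement_rate: float, n_tests: int = 9) -> float:
    """Chance that n independent random tests all fail to separate two rules."""
    return (1.0 - disagreement_rate) ** n_tests


# Hypothetical disagreement rates, chosen only to show how fast the chance drops.
for d in (0.1, 0.3, 0.5):
    print(f"d={d}: {p_all_agree(d):.4f}")
```

How small the probability is depends heavily on d, which is why hand-picked, non-random tests can easily miss the difference between two rules.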

I could also say that I was a bit naive on the first test and that I'd grown overconfident after two successive successes for the final test.

Posts

43 · Concrete empirical research projects in mechanistic anomaly detection · 1y · 3
7 · Intuitions by ML researchers may get progressively worse concerning likely candidates for transformative AI · 3y · 0
51 · Takeaways from the Intelligence Rising RPG · 4y · 7