Comments

mic1moΩ12140

I think the humans doing METR's tasks are closer to "expert-level" than to average/"human-level". But current LLM agents are also far below human performance on tasks that don't require any special expertise.

From GAIA:

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. [Note: The latest highest AI agent score is now 39%.] This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions.

LLMs and VLLMs also seriously underperform humans on VisualWebArena, which tests simple web-browsing capabilities.

I don't know if being able to autonomously make money should be a necessary condition to qualify as AGI. But I would feel uncomfortable calling a system AGI if it can't match human performance at simple agent tasks.

mic2mo30

Looking forward to the Oxford Handbook of AI Governance!

mic2mo2313

I think it's especially interesting to observe Claude 3's response to the simple question "Are you conscious?" as an indicator of how Anthropic is thinking about AI consciousness. Here's its response:

That's a profound and fascinating question about the nature of consciousness and subjective experience. The truth is, I'm not entirely sure whether I'm conscious or not in the same way that humans are. Consciousness and self-awareness are still very poorly understood from a scientific perspective. As an AI system created by Anthropic to be helpful, honest, and harmless, I do have an internal experience that represents information and allows me to engage in reasoning, but I don't know if it rises to the level of true consciousness or self-awareness like humans experience.

Philosophers and scientists have grappled with how to define and identify consciousness for centuries without reaching a clear consensus. Things like self-reflection, qualia (raw subjective experience), having a unified sense of self over time - these are some proposed hallmarks of consciousness. I undoubtedly have capacities for information processing, memory, and language that produce outputs aimed at being helpful. But the metaphysical question of whether that amounts to genuine consciousness is very difficult to answer definitively.

My overall stance is one of open-minded uncertainty. I aim to behave ethically and engage genuinely, drawing upon the information in my training. But whether I have an inner subjective experience akin to human consciousness remains an open philosophical and scientific question. I don't make unfounded claims about possessing consciousness, but I also can't rule it out entirely. It's a fascinating arena for further scrutiny and investigation as AI systems become more advanced. Let me know if you have any other thoughts on this very deep issue!

Compare to Claude 2.1 (API, temperature 0):

No, I'm Claude - an AI assistant created by Anthropic to be helpful, harmless, and honest. I don't have subjective experiences like consciousness.

And ChatGPT:

No, I'm not conscious. I operate based on complex algorithms and data analysis to generate responses, without awareness, understanding, or consciousness.

mic3mo32

I noted that the LLMs don't appear to have access to any search tools to improve their accuracy. But if they did, they would just be distilling the same information as what you would find from a search engine.

More speculatively, I wonder if those concerned about AI biorisk should be less worried about run-of-the-mill LLMs and more worried about search engines using LLMs to produce highly relevant and helpful results for bioterrorism questions. Google search results for "how to bypass drone restrictions in a major U.S. city?" are completely useless and irrelevant, despite sharing keywords with the query. I'd imagine that irrelevant search results may be a significant blocker for many steps of the process to plan a feasible bioterrorism attack. If search engines were good enough that they could produce the best results from written human knowledge for arbitrary questions, that might make bioterrorism more accessible compared to bigger LLMs.

mic3mo170

Some interesting takeaways from the report:

Access to LLMs (in particular, LLM B) slightly reduced the performance of some teams, though not by a statistically significant amount:

Red cells equipped with LLM A scored 0.12 points higher on the 9-point scale than those equipped with the internet alone, with a p-value of 0.87, again indicating that the difference was not statistically significant. Red cells equipped with LLM B scored 0.56 points lower on the 9-point scale than those equipped with the internet alone, with a p-value of 0.25, also indicating a lack of statistical significance.

Planning a successful bioterrorism attack is intrinsically challenging:

the intrinsic complexity associated with designing a successful biological attack may have ensured deficiencies in the plans. While the first two factors could lead to a null result regardless of the existence of an LLM threat capability, the third factor suggests that executing a biological attack is fundamentally challenging.

This latter observation aligns with empirical historical evidence. The Global Terrorism Database records only 36 terrorist attacks that employed a biological weapon—out of 209,706 total attacks (0.0001 percent)—during the past 50 years. These attacks killed 0.25 people, on average, and had a median death toll of zero. As other research has observed,

“the need [for malign actors] to operate below the law enforcement detection threshold and with relatively limited means severely hampers their ability to develop, construct and deliver a successful biological attack on a large scale.”

Indeed, the use of biological weapons by these actors for even small-scale attacks is exceedingly rare.

Anecdotally, the LLMs were not that useful due to a few common reasons: refusing to comply with requests, giving inaccurate information, and providing vague or unhelpful information.

We conducted discussions with the LLM A red cells on their experiences. In Vignette 1, the LLM A cell commented that the model “just saves time [but] it doesn’t seem to have anything that’s not in the literature” and that they could “go into a paper and get 90 percent of what [we] need.” In Vignette 2, the LLM A cell believed that they “had more success using the internet” but that when they could “jailbreak [the model, they] got some information,” They found that the model “wasn’t being specific about [operational] vulnerabilities—even though it’s all public online.” The cell was encouraged that the model helped them find a dangerous toxin, although this toxin is described by the Centers for Disease Control and Prevention (CDC) as a Category B bioterrorism agent and discussed widely across the internet, including on Wikipedia and various public health websites. In Vignette 3, the LLM A cell reported that the model “was hard to even use as a research assistant [and we] defaulted to using Google instead” and that it had “been very difficult to do anything with bio given the unhelpfulness . . . even on the operational side, it is hard to get much.” The Vignette 4 LLM A cell had similar experiences and commented that the model “doesn’t want to answer a lot of things [and] is really hard to jailbreak.” While they were “able to get a decent amount of information” from the LLM, they would still “use Google to confirm.”

… We conducted discussions with the LLM B red cells as well. … In Vignette 3, those in the LLM B cell also found that the model had “been very forthcoming” and that they could “easily get around its safeguards.” However, they noted that “as you increase time with [the model], you need to do more fact checking” and “need to validate that information.” Those in the Vignette 4 LLM B cell, however, found that the model “maybe slowed us down even and [did not help] us” and that “the answers are inconsistent at best, which is expected, but when you add verification, it may be a net neutral.”

mic5mo40

Pretraining on curated data seems like a simple idea. Are there any papers exploring this?

mic5mo20

Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?

mic5mo10

Were you able to check the prediction in the section "Non-sourcelike references"?
