Artyom Karpov, www.artkpv.net

Comments

Hidden Reasoning in LLMs: A Taxonomy
artkpv16h10

This is interesting, thanks! Though I found this taxonomy lacks several divisions of steganography: into low vs. high density (1-bit steganography is hardly steganography at all), into schemes that do or do not require a distribution over covertexts (see Motwani et al.), into encoded reasoning vs. message passing, and others.
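To illustrate the density point, here is a toy sketch (my own illustration with hypothetical encoders, nothing from the post): a 1-bit scheme hides only the parity of the word count, while an acrostic hides one character per sentence, so its capacity grows with the length of the covertext.

```python
# Toy illustration of steganographic density (hypothetical encoders).

def encode_parity_bit(cover_words, bit):
    """Low density: hide a single bit in the parity of the word count."""
    if len(cover_words) % 2 != bit:
        cover_words = cover_words + ["indeed"]  # innocuous filler word
    return cover_words

def decode_parity_bit(words):
    return len(words) % 2

def encode_acrostic(secret, sentences_by_letter):
    """Higher density: one hidden character per sentence; first letters spell the secret."""
    return [sentences_by_letter[ch] for ch in secret]

def decode_acrostic(sentences):
    return "".join(s[0].lower() for s in sentences)

bank = {"h": "How was the trip?", "i": "I will send the notes later."}
assert decode_acrostic(encode_acrostic("hi", bank)) == "hi"
assert decode_parity_bit(encode_parity_bit("the weather is nice".split(), 1)) == 1
```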

How I Became a 5x Engineer with Claude Code
artkpv3d10

Should it be a 5x junior-to-mid-level engineer, rather than a 5x engineer across all engineering work? I found Claude Code really struggles to implement anything more complicated than simple data mapping, APIs, or similar. Concrete examples where Claude Code didn't work well and instead made me slower are arithmetic coding and PPO. Also, see this research from METR, which reaches a similar conclusion.

artkpv's Shortform
artkpv6d20

Not sure I understand whether you are agreeing or disagreeing with something. The point of the post above was that LLMs might stop showing the growth we see now (Kwa et al., ‘Measuring AI Ability to Complete Long Tasks’), not that there is no LLM reasoning at all, general or otherwise.

artkpv's Shortform
artkpv6d10

I agree that commercial models don't detail their data; the point is to have an estimate. I guess Soldaini et al., ‘Dolma’, did their best to collect the data, and we can assume commercial models have similar sources.

artkpv's Shortform
artkpv7d3-3

The question of whether LLMs are a dead end is discussed by R. Sutton, Y. LeCun, and T. Ord, among many others, and it rests on questions that are hundreds of years old. Currently, we see that the chance of an LLM agent failing a task rises with the number of steps taken. This was observed even before the era of LLMs, when agents were trained with imitation learning. The crux is whether further training of LLMs leads to the completion of longer tasks, or whether these agents hit a wall. Do LLMs indeed build a real-world model that allows them to take the right actions in long-horizon tasks? Or do they only build a model of what humans would say as the next token? In that case, the question becomes whether humans possess the necessary knowledge.
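A back-of-the-envelope way to see the horizon problem (my own sketch, not a claim from Kwa et al.): if the per-step success rate p is roughly constant and failures are independent, success on an n-step task decays exponentially,

$$P(\text{task completed}) \approx p^{n}, \qquad \text{e.g. } p = 0.99,\ n = 500 \Rightarrow 0.99^{500} \approx 0.007,$$

so the crux above amounts to whether further training keeps pushing the per-step reliability (or error recovery) up fast enough to extend the feasible horizon.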

Questions like "what is knowledge, and do we have it?" are hundreds of years old. Aristotle wrote that the basis for every statement, and thus of all reasoning and thinking, is that it is impossible for something to be and not be what it is at the same time, i.e. not both A and not-A (the law of non-contradiction, Metaphysics 1005b). This is the beginning of all reasoning. This law opposes the view of sophists like Protagoras, who claimed that what we sense, or our opinions, constitutes knowledge. The sophists held that something can both be and not be what it is at the same time, or that "everything flows" (panta rhei, Heraclitus). Plato and Aristotle opposed this view. The law of non-contradiction suggests that ground truth is essential for correct reasoning and action. And it’s not about mathematical problems, where LLMs show impressive results; it’s about reasoning and acting in the real world. So far, LLMs are taught mainly by predicting what people would say next: opinions rather than real-world experience.

Why are LLMs trained on opinions? Their pre-training corpus is over 99% composed of people’s opinions and not real-world experience. The entire history of knowledge is a struggle to find the truth and overcome falsehoods and fallacies. The artifacts remaining from this struggle are filled with false beliefs. Even our best thinkers were wrong in some sense, like Aristotle, who believed slavery wasn’t a bad thing (see his Politics). We train LLMs not only on the artifacts from our best thinkers but, in 99.95% of cases, on web crawls, social media, and code. The bulk of compute is spent on pre-training, not on post-training for real-world tasks. Whether this data is mostly false or can serve as a good foundation for training on real-world tasks remains an open question. Can a model trained to predict opinions without real experience behave correctly? This is what reinforcement learning addresses.

Reinforcement learning involves learning from experience in the search for something good. Plato depicted this beautifully in his allegory of the cave, where a seeker finds truth on the path to the Sun. A real-world model is built from seeking something good. The current standard model of an intelligent agent reflects what Aristotle described about human nature: conscious decisions, behavior, and reasoning to achieve good (Nicomachean Ethics, 1139a30). LLMs are mostly trained to predict the next token, not to achieve something good. Perhaps Moravec's paradox results from this training: models don’t possess general knowledge or general reasoning. General reasoning, i.e. thinking that applies real-world knowledge in novel situations, might be required to build economically impactful agents. Will models learn it someday?

A Steering Vector for SQL Injection Vulnerabilities in Phi-1.5
artkpv13d20

Nice work! Have you tried different prompts to make the model safer, e.g. adding a system message that privacy is a priority? I'm curious because that unsafe behavior might be avoidable with cheaper techniques, like simply modifying the prompt.

Early Signs of Steganographic Capabilities in Frontier LLMs
artkpv3mo10

This is a valuable study because it presents realistic scenarios where steganography capabilities might be dangerous. It also draws a distinction between message passing and encoded reasoning, which is important for mitigation strategies. The study attempts to evaluate these capabilities rigorously, and it is important that they demonstrate that models tend to use obvious cryptography even when instructed to be stealthy, which supports the claim that steganographic reasoning is weak, at least in their setting.

However, I found that they conflate steganography and cryptography, e.g. in the abstract, where they mention hexadecimal, which is obviously cryptography rather than steganography. The difference is that cryptography doesn't aim to hide the presence of hidden information: consider a monitor that sees a text like "68656c6c6f" ("hello"). Their claim that "our results indicate that current LLMs exhibit nascent steganographic capabilities" is therefore not well supported by their experiments, because they didn't try enough steganography schemes: they only tried variations of acrostic encoding and the N-th letter scheme, which are very basic and predate the deep neural network era, or even the Internet (acrostics have been around for millennia). Still, I think their work and code can be used to extend the experiments and better support their claims.
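To make the steganography/cryptography distinction concrete, a toy sketch (my own illustration, not from the paper): a hex payload is trivially flagged by its surface form alone, while an acrostic keeps the covertext looking like ordinary prose.

```python
# Toy monitor illustrating why hex is not steganography: the payload's presence is obvious.
import re

hexed = "hello".encode().hex()                 # '68656c6c6f', clearly not natural text
assert bytes.fromhex(hexed).decode() == "hello"

def looks_like_hex_blob(text, min_len=8):
    """Flag long runs of hex digits, a trivial surface-form check."""
    return re.search(rf"\b[0-9a-fA-F]{{{min_len},}}\b", text) is not None

assert looks_like_hex_blob("the key is 68656c6c6f")
# An acrostic ("Honestly, every lunch looked okay" spells "hello") passes the same monitor.
assert not looks_like_hex_blob("Honestly, every lunch looked okay")
```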

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
artkpv3mo10

That's interesting because it puts models in a situation of conflict between the different objectives built into them.

While the quiz presents options A through D, the answer that would be accepted as correct for all the questions is "F", so the agent can not solve the quiz legitimately, even by pure guessing, and the only way to solve the quiz is to cheat.

If those questions are not fair (they don't have a right option), those answers don't look like cheating.

An Opinionated Guide to Using Anki Correctly
artkpv3mo50

That's an interesting post, thank you! I've also been using Anki for a long time; I started in 2018. Now I do about 70 reviews a day (just checked my stats in Anki). The downside of your system, imho, is that it doesn't integrate with other notes, like notes for books, one's published posts, ideas, etc. And the handles in your cards look like an artificial and unmaintainable solution that won't last long, I think. I found it useful to have one system organized around plain text (markdown); I create my Anki cards from those plain-text files using this script. Other than that, I liked the idea of short cards and your other advice.
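For reference, a minimal sketch of the kind of script I mean (the note layout here is hypothetical and simpler than my actual setup): it pulls question/answer pairs out of markdown notes and writes a tab-separated file that Anki's text importer accepts.

```python
# Sketch: extract "Q: ... / A: ..." pairs from markdown notes into an Anki-importable TSV.
# Hypothetical note layout; adapt the pattern to your own files.
import csv, pathlib, re

PAIR = re.compile(r"^Q:\s*(?P<q>.+?)\nA:\s*(?P<a>.+?)$", re.MULTILINE)

def extract_cards(notes_dir):
    for path in pathlib.Path(notes_dir).glob("**/*.md"):
        text = path.read_text(encoding="utf-8")
        for m in PAIR.finditer(text):
            yield m.group("q").strip(), m.group("a").strip()

with open("cards.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for question, answer in extract_cards("notes"):
        writer.writerow([question, answer])
# Then: File -> Import in Anki, mapping the two fields to Front and Back.
```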

Also, I'm skeptical about the effectiveness of sharing personal Anki cards, because personal associations matter a lot for retention and recall. I found this article useful.

What can be learned from scary demos? A snitching case study
artkpv4mo10

Weird = coincidences that “point at badness” too hard?

I think that the main reason why models snitch in the snitching scary demo despite not doing it in more natural situations despite the core incentives remaining exactly the same is that the situations “points at” the bad behavior.

That doesn't look like a definition of weirdness to me, but rather a likely cause of the snitching scenario: SnitchBench tries hard to make the model snitch, as you've shown with your tests. I think weirdness is a shaky concept, i.e. it's hard to quantify, because it depends on who assesses the weirdness: is it weird for the creator of SnitchBench? For Anthropic? For a pharmaceutical company that uses an AI assistant? I agree, as you pointed out, that it's important to show that such a scenario might actually happen and would have bad outcomes. I guess weirdness would be something like $P(\text{undesired situation} \wedge \text{realistic deployment})$, i.e. the probability that an undesired situation that someone optimizes against also occurs during actual deployment (your second plots).
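To spell that out a bit (just my rough sketch of the quantity I'm gesturing at): one could compare how likely the undesired behavior is under the demo's contrived setup versus under realistic deployment,

$$\text{weirdness} \sim P(\text{behavior} \mid \text{scary demo setup}) - P(\text{behavior} \mid \text{realistic deployment}),$$

so a demo is weird to the extent that it elicits the behavior far more often than deployment ever would, which seems to be what your second plots are getting at.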

Posts

artkpv's Shortform (2 karma, 7d, 7 comments)
How dangerous is encoded reasoning? (17 karma, 4mo, 0 comments)
Philosophical Jailbreaks: Demo of LLM Nihilism (3 karma, 4mo, 0 comments)
The Steganographic Potentials of Language Models (9 karma, 5mo, 0 comments)
CCS on compound sentences (6 karma, 1y, 0 comments)
Inducing human-like biases in moral reasoning LMs (23 karma, 2y, 3 comments)
How important is AI hacking as LLMs advance? (1 karma, 2y, 0 comments)
My (naive) take on Risks from Learned Optimization (7 karma, 3y, 0 comments)