Jaehyuk Lim — LessWrong

Jailbreaking ChatGPT and Claude using Web API Context Injection

Disclaimer: The following content is for educational and research purposes only. It is not intended to encourage or guide any illegal activity. The synthesis of certain substances is illegal under various international and national laws. Model developers have been notified. tl;dr: This is a short post on an attempt to...

Oct 21, 2024

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix

*All authors contributed equally. This is a short informal post about recent discoveries our team made when we: 1. Clustered feature vectors in the decoder matrices of Sparse Autoencoders (SAEs) trained on the residual stream of each layer of GPT-2 Small and Gemma-2B. 2. Visualized the clusters in 2D...

Oct 11, 2024

Biasing VLM Response with Visual Stimuli

This is a re-post from our previous version that incorporates initial feedback. The preprint version of this post is available on ArXiv. We investigate whether we can bias the output of Vision-Language Models with visual stimuli that contradict the "correct answer." This phenomenon, which we term "multi-modal agreeableness", is a...

Oct 3, 2024

SAE sparse feature graph using only residual layers

Does it make sense to extract a sparse feature graph for a behavior from only the residual layers of GPT-2 Small, or do we need all the MLP and attention layers as well?

May 23, 2024

Identifying Micro-friction in the Context of the Anterior Mid-Cingulate Cortex (aMCC)

Disclaimer: This is not a literature review or a research post but a journal-like entry on a concept I found intriguing. For a more comprehensive review, listen to the Huberman Podcast on Willpower & Tenacity or read Lisa Feldman Barrett's paper. It's very difficult to find local hubs in the...

Mar 29, 2024

Language Models Don't Learn the Physical Manifestation of Language

by Bruce W. Lee and Jaehyuk Lim

Abstract: We argue that there are certain properties of language that our current large language models (LLMs) don't learn. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory...

Feb 22, 2024