Comments

I agree with you there. There are numerous benefits to being an autodidact (freedom to learn what you want, less pressure from authorities), but formal education offers more mentorship. For most people, even with the increased accessibility of information, the desire to learn something is often not enough once the material gets more complex.

Do you see possible dangers of closed-loop automated interpretability systems as well?

> the only aligned AIs are those which are digital emulations of human brains

I don't think this is necessarily true. I don't think emulated human brains are necessary for full alignment, nor that they would be more aligned than a well-calibrated, scaled-up version of our current alignment techniques (plus new ones to be discovered in the next few years). Emulating the entire human brain just to align values seems not only implausible (even with neuromorphic computing, efficient neural networks, and Moore's law^1000), it also seems like overkill and a misallocation of valuable computational resources. Assuming I'm understanding "emulated human brains" correctly, emulation would mean pseudo-sentient systems designed solely to be aligned with our values. Perhaps morality can be a bit simpler than that, somewhere in the middle between static, written rules (the law) and the unpredictable human mind. And if we essentially just make more people, that doesn't really address the "many biases or philosophical inadequacies" we already have.

What's the difference between "having a representation" for uppercase/lowercase and using that representation to solve an MCQ or A/B test? From your investigations, do you have intuitions as to what the mechanism of this disconnect might be? I'm interested in what might cause these models to perform poorly despite having representations that seem, at least to us humans, relevant to solving the task.
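To make the "has a representation" half of that distinction concrete, here is a minimal linear-probe sketch of the kind I have in mind: it checks whether case information is linearly decodable from hidden states, independently of whether the model actually uses it to answer questions. The checkpoint name, layer index, and toy word list are my own assumptions, not anything from the paper.

```python
# Sketch only: probe whether "starts with uppercase" is linearly decodable from
# a mid-layer hidden state. Toy data, no held-out split; illustrative, not a result.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

words = ["Apple", "apple", "Banana", "banana", "Tokyo", "tokyo"]  # toy examples
labels = [1, 0, 1, 0, 1, 0]  # 1 = starts uppercase

feats = []
with torch.no_grad():
    for w in words:
        ids = tok(w, return_tensors="pt").to(model.device)
        hs = model(**ids, output_hidden_states=True).hidden_states
        # mean-pool one mid layer's activations as the probe input (layer 16 is arbitrary)
        feats.append(hs[16][0].mean(dim=0).float().cpu().numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("probe train accuracy:", probe.score(feats, labels))
```

High probe accuracy here would only show the information is present somewhere in the residual stream; it says nothing about whether downstream circuits read it out when answering an MCQ, which is exactly the disconnect I'm asking about.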

Considering that the tokenizer for Mistral-7B probably uses a case-sensitive vocabulary (https://discuss.huggingface.co/t/case-sensitivity-in-mistralai-mistral-7b-v0-1/70031), the presence of distinct internal representations for uppercase and lowercase characters might not be as relevant to the task as one would assume. It seems plausible that these representations have little influence on the model's ability to perform H-Test tasks such as answering multiple-choice questions. Perhaps one should probe for a different representation instead, such as a circuit for "eliciting information".
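The case-sensitivity point is easy to check directly, assuming the public Hugging Face checkpoint; a quick sketch:

```python
# Check whether the Mistral-7B tokenizer distinguishes case at the token level,
# i.e. whether "Apple" and "apple" map to different token sequences.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
for word in ["Apple", "apple"]:
    print(word, "->", tok.tokenize(word), tok.encode(word, add_special_tokens=False))
# If the printed sequences differ, case information is already available at the
# tokenization level, before any internal "representation" is computed.
```

If case is already explicit in the input tokens, a probe finding an uppercase/lowercase direction in the activations may just be recovering that surface feature, which would make it weak evidence about what the model relies on for H-Test.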