Towards White Box Deep Learning

Maciej Satkiewicz

27th Mar 2024

Tags: Interpretability (ML & AI), Machine Learning (ML), Research Agendas, AI

This is a linkpost for https://arxiv.org/abs/2403.09863

Hi, I’d like to share my paper that proposes a novel approach for building white box neural networks.

The paper introduces semantic features as a general technique for controlled dimensionality reduction, somewhat reminiscent of Hinton’s capsules and the idea of “inverse rendering”. In short, semantic features aim to capture the core characteristic of any semantic entity - having many possible states but being at exactly one state at a time. This results in regularization that is strong enough to make the PoC neural network inherently interpretable and also robust to adversarial attacks - despite no form of adversarial training! The paper may be viewed as a manifesto for a novel white-box approach to deep learning.
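To make the "many possible states, exactly one at a time" idea concrete, here is a minimal sketch. It is not the paper's implementation: the cyclic-shift state family, the dot-product matching score, and the names `feature_states` and `match_semantic_feature` are assumptions chosen purely for illustration. All states are derived from one shared base template (weight sharing), and the feature reports only its single best-matching state.

```python
import numpy as np

def feature_states(base, shifts):
    """Build every state of one feature from a single shared base template.

    Assumption for illustration: a state is a cyclic shift of `base`.
    """
    return np.stack([np.roll(base, s) for s in shifts])

def match_semantic_feature(x, base, shifts):
    """Return (best_shift, best_score): the one state that best matches x."""
    states = feature_states(base, shifts)  # shape: (num_states, dim)
    scores = states @ x                    # dot-product match per state
    best = int(np.argmax(scores))          # exactly one state is selected
    return shifts[best], float(scores[best])

# Hypothetical toy input: the base template shifted by two positions.
base = np.array([1.0, 2.0, 1.0, 0.0, 0.0, 0.0])
shifts = [0, 1, 2, 3]
x = np.roll(base, 2)

state, score = match_semantic_feature(x, base, shifts)
print(state, score)
```

A subsequent layer would see only the selected state and its score rather than the raw input, which is one way to read "controlled dimensionality reduction" in this toy setting.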

As an independent researcher I’d be grateful for your feedback!


mishka (1mo)

This looks interesting, thanks!

This post could benefit from an extended summary.

In lieu of such a summary, in addition to the abstract:

This paper introduces semantic features as a candidate conceptual framework for building inherently interpretable neural networks. A proof of concept model for informative subproblem of MNIST consists of 4 such layers with the total of 5K learnable parameters. The model is well-motivated, inherently interpretable, requires little hyperparameter tuning and achieves human-level adversarial test accuracy - with no form of adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at this https URL

I'll quote a paragraph from Section 1.2, "The core idea":

This paper introduces semantic features as a general idea for sharing weights inside a neural network layer. [...] The concept is similar to that of "inverse rendering" in Capsule Networks where features have many possible states and the best-matching state has to be found. Identifying different states by the subsequent layers gives rise to controlled dimensionality reduction. Thus semantic features aim to capture the core characteristic of any semantic entity - having many possible states but being at exactly one state at a time. This is in fact a pretty strong regularization. As shown in this paper, choosing appropriate layers of semantic features for the [Minimum Viable Dataset] results in what can be considered as a white box neural network.

Maciej Satkiewicz (1mo)

Thank you! The quote you picked is on point, I added an extended summary based on this, thanks for the suggestion!

mishka (1mo)

Thanks, this is very interesting.

I wonder if this approach is extendable to learning to predict the next word from a corpus of texts...

The first layer might perhaps still be embedding from words to vectors, but what should one do then? What would be a possible minimum viable dataset?

Perhaps, in the spirit of the paper's PoC, one might consider binary sequences of 0s and 1s, with only two words, 0 and 1, and ask what it would take to have a good predictor of the next 0 or 1 given a long sequence of those as context. This might be a good starting point, and then one might consider different instances of that problem (different sets of 0/1 sequences to learn from).
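For concreteness, the binary setup might look like the toy task below. This is only a count-based baseline, assuming a fixed context length `k`. It is not a semantic-features model, and the names `fit_counts` and `predict_next` and the example sequence are hypothetical.

```python
from collections import defaultdict

def fit_counts(sequences, k):
    """Count how often 0 and 1 follow each length-k context."""
    counts = defaultdict(lambda: [0, 0])  # context -> [count of 0, count of 1]
    for seq in sequences:
        for i in range(k, len(seq)):
            context = tuple(seq[i - k:i])
            counts[context][seq[i]] += 1
    return counts

def predict_next(counts, context, k):
    """Predict the more frequent next bit; ties and unseen contexts give 0."""
    zeros, ones = counts.get(tuple(context[-k:]), (0, 0))
    return 1 if ones > zeros else 0

# Hypothetical training data: a single alternating sequence.
sequences = [[0, 1, 0, 1, 0, 1, 0, 1]]
k = 2
counts = fit_counts(sequences, k)

print(predict_next(counts, [0, 1, 0, 1], k))  # context (0, 1)
print(predict_next(counts, [0, 1, 0], k))     # context (1, 0)
```

Different choices of training sequences give different instances of the problem, and a baseline like this gives a reference point against which a more structured predictor could be compared.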

Maciej Satkiewicz (1mo)

These are interesting considerations! I haven't put much thought into this yet, but I have some preliminary ideas.

Semantic features are intended to capture meaning-preserving variations of structures. In that sense the "next word" problem seems ill-posed, as some permutations of words preserve meaning; in reality it's hardly a natural problem, even from the human perspective.

The question I'd ask here is "what are the basic semantic building blocks of text for us humans?" and then try to model these blocks using the machinery of semantic features, i.e. model the invariants of these semantic blocks. Only then would I think of adequate formulations of useful problems regarding text understanding.

So I'd say that these semantic atoms of text are actually thoughts (encoded by certain sequences of words/sentences that enjoy certain permutation-invariance). Thus semantic features would aim to capture thoughts-at-locations by finding these sequences (up to their specific permutations), and deeper layers would capture higher-level thoughts-at-locations composed of the former. This could potentially uncover some Euclidean structure in the text (which makes sense as humans arguably think within the space-time framework, after Kant's famous "Space and time are the framework within which the mind is constrained to construct its experience of reality").

That being said, the problems I'd consider would be some forms of translation (to another language or another modality) rather than the artificial next-word prediction.

The MVD for this problem could very well consist of 0's and 1's, provided that they encode some simple yet sensible semantics. I'd have to think more about a specific example, it's a nice question :)

Nathan Helm-Burger (1mo)

In addition to translation (which I do think is a useful problem for theoretical experiments), I would recommend question answering as something which gets at 'thoughts' rather than distractors like 'linguistic style'. I don't think multiple choice question answering is all that great a measure for some things, but it is a cleaner measure of the correctness of the underlying thoughts.

I agree that abstracting away from things like choice of grammar/punctuation or which synonym to use is important to keeping the research question clean.
