Great questions. Finding features of high concern is indeed something we highlighted as an open problem, and it's not obvious whether this will be achievable. The avenues you list are plausible ones. It might also turn out that it is very hard to find individual features which are sufficiently indicative of malign behavior, but that one could do so for circuits or specify some other rule involving combinations of features. E.g., the criterion might involve something like "feature activates which indicates the intent to carry out a harmful plan, and feature...