I hear that we're supposed to brainstorm ways to spend massive amounts of money to advance AGI safety. Well, here's a brainstorm; sorry if it's stupid, I'm not an interpretability researcher:

We could pay Cycorp to open-source Cyc, so that researchers can incorporate it into future AGI interpretability systems.

What is Cyc? I'm not an expert. Most of what I know about Cyc comes from this podcast interview, and Wikipedia, and the LW wiki, and Eliezer Yudkowsky dismissing Cyc as not on the path to AGI. I agree, by the way: I don't think that Cyc is on the path to AGI. (I'm not even sure it's trying to be.) But it doesn't matter; that's not why I'm talking about it.

Anyway, Cyc is a language for expressing "knowledge" (e.g. water will fall out of an upside-down cup), and a super-giant database of such "knowledge", hand-coded over the course of >1000 person-years and >35 wall-clock years by a team of tireless employees whom I do not envy.

How expensive would this be? Beats me; I didn't ask. (Wild guess: one-time cost in the eight figures, based on Cycorp annual revenue of ~$5M according to the first hit on Google (which I didn't double-check).)

Why might open-sourcing Cyc be helpful for AGI interpretability efforts? Well maybe it won't. But FWIW, here's what I was thinking…

When I imagine what a future interpretability system will look like, in general, I imagine an interface (see the code sketch after this list)…

  • The human-legible side of the interface consists of, maybe, human-understandable natural-language phrases, or human-understandable pictures, or human-understandable equations, or whatever.
  • The trained-model side of the interface consists of stuff that's happening in a big complicated unlabeled model built by ML.
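
To make that a bit more concrete, here is a minimal sketch of what one binding across such an interface might look like as a data structure. All names here are hypothetical illustrations, not any existing system's API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InterfaceBinding:
    """A tentative link between one human-legible concept and one model feature."""
    concept: str          # human-legible side, e.g. a disambiguated Cyc-style token
    gloss: str            # natural-language description for the human reading it
    feature: np.ndarray   # trained-model side, e.g. a direction in activation space
    confidence: float     # how sure we are the two actually correspond


# A hypothetical example binding:
binding = InterfaceBinding(
    concept="Water-Liquid",
    gloss="liquid water, as opposed to ice, steam, or a body of water",
    feature=np.zeros(4096),  # placeholder for a learned activation direction
    confidence=0.0,          # unverified until checked against the knowledge base
)
```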

Anyway, my surmise is that Cyc might be a good fit for the former (human-legible) side of this interface.

Specifically, the Cyc project has this massive structured knowledge system, and everything in that system is human-legible. It was, after all, built by humans!

You might say: natural language is human-legible too. Why not use that?

Well, for one thing, natural-language descriptions can be ambiguous. For example, a dictionary word may have dozens of definitions. In Cyc, such a word would be disambiguated: it would become dozens of different tokens, one for each of the dozens of different definitions and nuances. (…If I understand correctly. It came up in the interview.)
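
As a toy illustration (the sense tokens here are invented for the example, not actual Cyc constants), the difference is roughly this:

```python
# One English word, many meanings: natural language leaves the reader to guess.
ambiguous_word = "friend"

# A Cyc-style vocabulary splits each sense into its own token (names hypothetical):
word_senses = {
    "friend": [
        "Friend-PersonallyClose",   # someone you know and like
        "Friend-QuakerSense",       # a member of the Religious Society of Friends
        "Friend-SocialMediaLink",   # a connection on a social network
    ],
}
```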

For another thing, there's a matching-across-the-interface challenge: we want to make sure that the right human-legible bits are matched to the corresponding trained-model bits (at least, insofar as such a thing is possible). Here, the massive Cyc knowledge database (i.e. millions of common-sense facts about the world like "air can pass through a screen door" or whatever else those poor employees had to input over the decades) would presumably come in handy. After building a link between things in the ML system and Cyc, we can check (somehow) that the ML system endorses all the "knowledge" in Cyc as being correct. If that test passes, then that's a good sign we're matching the two things up properly; if it fails in some area, we would want to go back and figure out what's going on there.
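
A minimal sketch of that test, assuming we already have some way of asking the trained model whether it endorses a given statement (the `model_endorses` function here is entirely hypothetical):

```python
def check_interface_matching(cyc_assertions, model_endorses):
    """Probe a proposed matching by asking whether the ML system endorses
    each piece of Cyc "knowledge". Disagreements flag areas where the
    matching (or the model) deserves a closer look."""
    return [a for a in cyc_assertions if not model_endorses(a)]


# Hypothetical usage with two of the post's example facts:
facts = [
    "air can pass through a screen door",
    "water will fall out of an upside-down cup",
]
suspect_areas = check_interface_matching(facts, model_endorses=lambda a: True)
```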

I don't want to make too strong a statement though. At the end of the day, it seems to me that a natural-language interface to a future AGI interpretability system has both strengths and weaknesses compared to a Cyc interface. (And there are other possibilities too.) Hey, maybe we should have both! When it comes to interpretability, I say more dakka :-P

10 comments:

Considering "Symbolic Knowledge Distillation: from General Language Models to Commonsense Models", West et al 2021, and Ernie, the Cyc knowledge base may not be too useful compared to what can already be boosted out of existing models.

If their revenue is $5m, I believe a 5x multiplier is a reasonable average, and the buyout would cost $25m. Most of what you buy with that would be either useless or rapidly depreciating (even beyond normal for software/tech). For $25m, you could probably do quite a lot with existing models.

Thanks! I'm all for very-high-quality human-legible knowledge-graph / world-models, created by whatever means. If people know how to make such things algorithmically, so much the better.

I am, however, confused about how an algorithmically-generated knowledge graph would wind up with clear, unambiguous, human-legible labels on the nodes. Those labels are really the key ingredient. If they can be auto-generated, then I'm pleasantly surprised. I was assuming we'd need a human to write the hundreds of thousands of labels (e.g. "friend-in-the-Quaker-sense").

> I was assuming we'd need a human to write the hundreds of thousands of labels (e.g. "friend-in-the-Quaker-sense")

As always, "Sampling can show the presence of knowledge but not its absence." Self-distillation is witchcraft - I don't blame people, even here, who I have to remind that it is in fact a thing. Works for a lot of stuff, whether it's playing Starcraft or translating French or solving math or generating knowledge graphs...

Some small subsets of CYC were released (see Wiki). You could finetune a model on those and use that to estimate the value of the full dataset. (You could also talk to Michael Witbrock, who worked on CYC in the past and is familiar with AI alignment.) 

There are also various large open-source knowledge bases. There are closed knowledge graphs/bases at Google and many other companies. You might be able to collaborate with researchers at those organizations. 

My superficial understanding is that Cyc has two crucial advantages over all current knowledge bases / knowledge graphs:

  1. It is much, much bigger.
  2. Predicates can be of any arity (properties of one entity, relations between two entities, more complex structured relationships between N entities for any N), whereas knowledge graphs can only represent binary relations R(X,Y), like "X loves Y".

If I understand it correctly, then Cyc's knowledge base is a knowledge hypergraph. Maybe it doesn't matter in the end, and you can squeeze any knowledge encoded in Cyc's KB into ordinary knowledge graphs without creating some edge-spaghetti hell.
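
To illustrate the arity point (a sketch with made-up predicate names): a knowledge graph stores binary edges, while a Cyc-style assertion can relate several entities at once. The standard workaround is to reify the N-ary fact into an extra node plus binary edges, which is exactly where the edge-spaghetti worry comes from.

```python
# Binary knowledge-graph edge: R(X, Y)
binary_edge = ("loves", "Alice", "Bob")

# N-ary Cyc-style assertion: one predicate relating four entities at once.
nary_fact = ("gives", "Alice", "Bob", "book", "Tuesday")  # giver, recipient, object, time

# Reification: squeezing the N-ary fact into binary edges via an invented event node.
event = "giving-event-42"
reified_edges = [
    ("giver",     event, "Alice"),
    ("recipient", event, "Bob"),
    ("object",    event, "book"),
    ("time",      event, "Tuesday"),
]
```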

Cyc's human-input 'knowledge' would indeed be an interesting corpus, but my impression has always been that nothing really useful has come of it to date. I wouldn't pay much for access.

So you've noted a need for labeling things? I posit that a slightly different hybrid approach would be very useful in machine learning. In computer science, there is a concept known as memoization, which is simply storing the results of a previous calculation you might need to make again. In many cases, it can dramatically improve performance. Make a hybrid that can make its own entries in the knowledge base (and retrieve them, of course). Seed it with things you know are useful, like how math works and a dictionary or two (plus, perhaps, a less firmly believed knowledge base like the one you are talking about here), but let the algorithm add as many new things as it likes. I'm not sure why people are willing to spend such crazy amounts on training the thing, and so little on memory/facts/concepts.
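
For reference, here is what plain memoization looks like in code; a standard sketch, nothing Cyc-specific. The comment's proposal is roughly this pattern, with the transparent cache replaced by a persistent, seedable knowledge base the algorithm can also write to:

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # store each result; reuse it on repeated inputs
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# Without the cache this recursion is exponential; with it, fib(200) is instant.
print(fib(200))
```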

I think this is a pretty reasonable goal. I also listened to that podcast interview, and although I certainly don't think they are near AGI right now, Cyc may have some pieces that other projects are missing, particularly with regard to explaining AI actions in a human-intelligible fashion.

I don't think open-sourcing would require a buyout. The plethora of companies built around open-source code bases shows that one can open-source a code base and still be profitable.

Gwern, what makes you pick a 5x multiplier?

The average P/E ratio for the S&P 500 is around 30 right now. I would expect that a firm like Cycorp may be worth a bit more, since it is a moonshot project.

If their revenue is $5 million, I would expect the company's value to be roughly $150 million, based on that back-of-the-napkin math.

How much they would charge to open-source it, however, could be drastically less than that, perhaps even in the single-digit millions.

(In a P/E ratio, the "earnings" is profit, which in Cyc's case is probably negative. Gwern is using a P/S ratio, price to sales, where sales = revenue; these are usually applied to startups, since they're scaling and earnings are still negative. 5x seems reasonable because, while P/S can go much higher for rapidly scaling startups, Cyc doesn't seem to be rapidly scaling.)
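
(Spelling out the arithmetic behind the two estimates, each under its commenter's own assumption:)

```python
revenue = 5_000_000  # Cycorp's reported annual revenue, per the post

price_to_sales = 5 * revenue   # gwern's 5x multiplier   -> $25,000,000
pe_style = 30 * revenue        # ~30x applied to revenue -> $150,000,000
```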

I thought I was being generous by applying what several articles/blog posts told me was a fairly typical multiplier for small private businesses: I can't think offhand of an 'AI startup' (are you still a 'startup' if you are 26 years old and going nowhere fast?) that I'd want to own less than a big knowledge base and inference engine dating from the '80s. In any case, if you believe the multiplier should be much bigger than 5x, then that makes buying look all the worse.