All of HumanitiesResearcher's Comments + Replies

This all sounds good to me. In fact, I believe that researchers in the humanities are especially (perhaps overly) sensitive to the reciprocal relationship between theory and observation.

I may have overstated how little is currently known. The scholarly community has already made some claims connecting the Big Book to Print Shops [x,y,z]. The problem is that those claims are made either on non-quantitative bases (e.g., "This mark seems characteristic of this Print Shop's status.") or on a very naive frequentist basis (e.g., "This mark comes up N times, and that's a big number, so it must be from Print Shop X"). My project would take these existing claims as priors. Is that valid?
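One way to read "taking existing claims as priors" is as an ordinary Bayes update over candidate shops. The sketch below is only illustrative: the shop labels, the prior weights, and the per-shop mark probabilities are made-up placeholders, not figures from the scholarship.

```python
def posterior(prior, likelihood):
    """Bayes update; prior and likelihood are dicts keyed by shop."""
    unnormalized = {shop: prior[shop] * likelihood[shop] for shop in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the evidence is impossible under every hypothesis")
    return {shop: weight / total for shop, weight in unnormalized.items()}

# Hypothetical prior: existing scholarship leans toward shop X.
prior = {"X": 0.5, "Y": 0.3, "Z": 0.2}
# Hypothetical P(this mark is observed | shop).
mark_likelihood = {"X": 0.10, "Y": 0.40, "Z": 0.10}

updated = posterior(prior, mark_likelihood)
```

With these invented numbers the mark shifts most of the weight to shop Y even though the prior favored X. The hard part in practice is turning a verbal claim into a defensible number, which this sketch does not attempt.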

I have no idea. If you want answers like that, you should probably go talk to a statistician at sufficient length to convey the domain-specific knowledge involved or learn statistics yourself.

Yes. He said that I should be careful about sharing my project because, otherwise, I'll be reading about it in a journal in a few months. His warning may exaggerate the likelihood of a rival researcher and mis-value the expansion of knowledge, but I'm deferring to him as a concession of my ignorance, especially regarding rules of the academy.

"Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats."

Sorry to interrupt a perfectly lovely conversation. I just have a few things to add:

  • I may have overstated the case in my first post. We have some information about print shops. Specifically, we can assign very small books to print shops with a high degree of confidence. (The catch is that small books don't tend to survive very well. The remaining population is rare and intermittent in terms of production date.)

  • There are some hypotheses that could be treated as priors, but they're very rarely quantified (projects like this are rare in today's humanities).

Yep. It's not the Bible. I suspect that there are already good stats compiled on the Q-source, etc.

In a way it's not only futile but limiting to play the guessing game. There are lots of possible applications of Bayesian methods to the humanities. Maybe this discussion will help more projects than my own.

Yes, I see an accord between your statement and Vaniver's. As I said below, most tools have very slow repair cycles.

I was openly warned by a professor (who will likely be on the dissertation committee) not to talk about this project widely.

The capitalized nouns are to highlight key terms. I believe the current description is specific enough to describe the situation accurately and without misleading people, but not so specific as to go against my professor's (correct) advice.

Have I broken LW protocol? Obviously, I'm new here.

Did they say why?

I have just such a thing, referred to as "Marks." I haven't yet included that in the code, because I wanted to explore the viability of the method first. So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?

You claimed to not know what printers there were, how many there were, and what connection they had to 'Marks'. In such a situation, what on earth do you think you can infer at all? You have to start somewhere: 'we have good reason to believe there were not more than 20 printers, and we think the London printer usually messed up the last page. Now, from this we can start constructing these phylogenetic trees indicating the most likely printers for our sample of books...' There is no view from nowhere, you cannot pick yourself up by your bootstraps, all observation is theory-laden, etc.

Fortunately, we know which tool types leave which marks. We also have a very strong understanding of the ways in which tools break and leave marks.

Thanks again for entertaining this line of inquiry.

Have to define your features somehow.

I don't understand what this means. Can you say more?

gwern: A specific concrete variable you can code up, like 'total number of commas'.
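As a small illustration of what "a variable you can code up" might look like, here is a sketch that turns a transcription into a few countable features. The function name and the particular features are assumptions chosen for the example; a real project would pick features that fit its Marks.

```python
def text_features(text):
    """Turn a transcription into a few countable features (illustrative choices)."""
    return {
        "commas": text.count(","),
        "semicolons": text.count(";"),
        "words": len(text.split()),
    }

features = text_features("One, two, three; four.")
```

Each key is one feature; a sheet or page then becomes a row of such numbers that a statistical model can compare.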

That's a hell of a summary, thanks!

I'm glad you mentioned the repair cycle of tools. There are some tools that are regularly repaired (let's just call them "Big Tools") and some that aren't ("Little Tools"). Both are expensive at first and to repair, but it seems the Print Shops chose to repair Big Tools because they were subject to breakage that significantly reduced performance.

I should add another twist since you mentioned sheets of known origins: Assume that we can only decisively assign origins to single sheets. There are two probl...

Okay. Are the classes of marks distinct by tool type (that is, if I see a mark on a sheet, do I know whether it came from tool type X or tool type Y), or do we need to try to discover what sort of marks the various tools can leave?

Very helpful points, thanks. The scholarly community already has a pretty good working knowledge of the Tools, and thus the theoretical model of Tool breakage ("breakage" may be more accurate than "decay," since the decay is non-incremental and stochastic). We know the order in which parts of the Tools break, and we have some hypotheses correlating breakage to gross usage. The twist is that we don't know when any Print Shops produced the Big Book, so we can only extrapolate a timeline based on Tool breakage.
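If parts really do break in a known order, sheets from one tool can at least be put in relative sequence by how far the breakage has progressed. The sketch below assumes a single tool within one repair cycle; the part names, sheet ids, and record layout are hypothetical.

```python
BREAK_ORDER = ["part_a", "part_b", "part_c"]  # hypothetical parts, earliest break first

def breakage_stage(broken_parts, order=BREAK_ORDER):
    """Return how many parts have broken, or None if the set contradicts the order."""
    broken = set(broken_parts)
    stage = len(broken)
    if broken != set(order[:stage]):
        return None  # e.g. a late part is broken while an earlier one is intact
    return stage

def relative_order(sheets):
    """Sort sheet ids from least to most broken, skipping inconsistent sheets."""
    staged = [(breakage_stage(parts), sheet_id) for sheet_id, parts in sheets.items()]
    return [sheet_id for stage, sheet_id in sorted(s for s in staged if s[0] is not None)]

sheets = {
    "sheet_1": ["part_a", "part_b"],
    "sheet_2": [],
    "sheet_3": ["part_a"],
    "sheet_4": ["part_c"],  # inconsistent with the assumed break order
}
ordering = relative_order(sheets)
```

This gives only a relative ordering, not dates, and sheets that contradict the assumed order are set aside rather than forced into the sequence; they may point to a different tool or to a repair.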

Can you say more about the holdout sample? Should the holdout sample be a randomly selected sample of data, or something suspected to be associated with Print Shops [x,y,z]? Print Shops [a,b,c]?

Interesting feedback.

It's the Bible, isn't it.

Ha, I wish. No, it's more specific to literature.

How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.

We have minimal information about Print Shops. I wouldn't say the existing data are garbage, just mostly unquantified.

Have you considered googling for previous work?

Yes, but thanks to you I know the shibboleth of "Bayesian stylometry." Makes sense, and I've already read some books in a similar vein, but there ar...

Have to define your features somehow.

Really? I was under the opposite impression, that stylometry was, since the '60s or so with the Bayesian investigation of Mosteller & Wallace into the Federalist papers, one of the areas of triumph for Bayesianism.

No, not really. I think I would describe GIGO in this context as 'data which is equally consistent with all theories'.
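That description of GIGO can be made precise with Bayes' rule. In the short derivation below, $H_i$ stands for the candidate theories, $D$ for the data, and $c > 0$ for the shared likelihood; the notation is introduced here for illustration and is not from the thread.

```latex
P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{\sum_j P(D \mid H_j)\,P(H_j)},
\qquad
\text{and if } P(D \mid H_j) = c \text{ for all } j:\quad
P(H_i \mid D) = \frac{c\,P(H_i)}{c \sum_j P(H_j)} = P(H_i).
```

Data that every theory predicts equally well leave each prior exactly where it started, however much of it is collected.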

Thanks for the feedback. I actually cleared up the technical language considerably. I don't think there's any need to get lost in the weeds of the specifics while I'm still hammering out the method.

Hi everyone,

I'm a humanities PhD who's been reading Eliezer for a few years, and who's been checking out LessWrong for a few months. I'm well-versed in the rhetorical dark arts, due to my current education, but I also have a BA in Economics (yet math is still my weakest suit). The point is, I like facts despite the deconstructivist tendency of humanities since the eighties. Now is a good time for hard-data approaches to the humanities. I want to join that party. My heart's desire is to workshop research methods with the LW community.

It may break protocol, ...

This is a problem that machine learning can tackle. Feel free to contact me by PM for technical help.

To make sure I understand your problem: We have many copies of the Big Book. Each copy is a collection of many sheets. Each sheet was produced by a single tool, but each tool produces many sheets. Each shop contains many tools, but each tool is owned by only one shop. Each sheet has information in the form of marks. Sheets created by the same tool at similar times have similar marks. It may be the case that the marks monotonically increase until the tool is repaired.

Right now, we have enough to take a database of marks on sheets and figure out how many tools we think there were, how likely it is each sheet came from each potential tool, and to cluster tools into likely shops. (Note that a 'tool' here is probably only one repair cycle of an actual tool, if they are able to repair it all the way to freshness.) We can either do this unsupervised, and then compare to whatever other information we can find (if we have a subcollection of sheets with known origins, we can see how well the estimated probabilities did), or we can try to include that information for supervised learning.
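To make the unsupervised option concrete, here is a minimal sketch that groups sheets whose sets of marks overlap heavily. It is a deliberately simple stand-in (single-linkage clustering on Jaccard similarity) rather than the probabilistic model described above, and the sheet ids, mark ids, and the 0.5 threshold are all assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two mark sets: shared marks / all marks seen on either sheet."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_sheets(sheets, threshold=0.5):
    """Single-linkage clustering: link sheets whose mark sets are similar enough."""
    parent = {sheet_id: sheet_id for sheet_id in sheets}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, t in combinations(sheets, 2):
        if jaccard(sheets[s], sheets[t]) >= threshold:
            parent[find(s)] = find(t)

    groups = {}
    for sheet_id in sheets:
        groups.setdefault(find(sheet_id), []).append(sheet_id)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical data: each sheet maps to the set of marks observed on it.
sheets = {
    "s1": {"m1", "m2", "m3"},
    "s2": {"m1", "m2"},
    "s3": {"m7", "m8"},
    "s4": {"m7", "m8", "m9"},
}
clusters = cluster_sheets(sheets)
```

Each resulting group is a candidate "tool" in the repair-cycle sense. A hard threshold gives no probabilities, so a mixture model would be the natural next step if the estimated probabilities themselves matter.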

How about talking clearly about whatever you are currently hinting at?

I'm interested in associating the details of print production with an unnamed aesthetic object, which we'll presently call the Big Book, and which is the source of all of our evidence.

It's the Bible, isn't it.

Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don't know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.

How can you possibly get off the ground if you have no information ab...

Any time you are doing statistical analysis, you always want a sample of data that you don't use to tune the model and where you know the right answer (a 'holdout' sample). In this case, you should have several books related to the various print shops that you don't feed into your Bayesian algorithm. You can then assess the algorithm by seeing if it gets these books correct. To account for the decay of the books, you need books that you know not only came from print shop x, y or z, but also you'd need to know how old the tools were that made those books. Either that, or you'd have to have some understanding of how the tools decay from a theoretical model.
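A holdout check can be sketched in a few lines. Everything here is a toy: the (mark, shop) records are invented, the "model" is just a most-frequent-shop lookup rather than a Bayesian algorithm, and the 25% holdout fraction is an arbitrary assumption.

```python
import random
from collections import Counter, defaultdict

def split_holdout(records, holdout_fraction=0.25, seed=0):
    """Shuffle labelled records and set a fraction aside for evaluation only."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def fit(train):
    """Toy model: for each mark, remember the shop it was most often seen with."""
    counts = defaultdict(Counter)
    for mark, shop in train:
        counts[mark][shop] += 1
    return {mark: c.most_common(1)[0][0] for mark, c in counts.items()}

def accuracy(model, holdout):
    """Share of held-out records whose known shop the model predicts."""
    hits = sum(model.get(mark) == shop for mark, shop in holdout)
    return hits / len(holdout)

# Hypothetical (mark, known shop) pairs.
records = [("m1", "X"), ("m2", "Y")] * 10
train, holdout = split_holdout(records)
model = fit(train)
score = accuracy(model, holdout)
```

The point is the discipline rather than the model: the held-out records never touch `fit`, so `score` is an honest estimate of how the method does on material it has not seen.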