Research access to large amounts of anonymized patient data.

Take all the data you have, come up with some theory to describe it, build the scheme into a lossless data compressor, and invoke it on the data set. Write down the compression rate you achieve, and then try to do better. And better. And better. This goal will force you to systematically improve your understanding of the data.

(Note that transforming a sufficiently well specified statistical model into a lossless data compressor is a solved problem, and the solution is called arithmetic encodi... (read more)

(Note that transforming a sufficiently well specified statistical model into a lossless data compressor is a solved problem, and the solution is called arithmetic encoding - I can give you my implementation, or you can find one on the web.

The unsolved problems are the ones hiding behind the token "sufficiently well specified statistical model".

That said, thanks for the pointer to arithmetic encoding, that may be useful in the future.

2Punoxysm5yWould anyone want to literally do this on something as complex as patient data? If not, why not just say try to come up with as good of models as you can? Pick a couple of quantities of interest and try to model them as accurately as you can.
1Username5yThis positively sounds a lot like advice that was given in response to a question in the open thread about how to go about a masters thesis. I can't find it but I endorse the recommendation. Immerse yourself in the data. Attack it from different angles and try to compress it down as much as possible. The idea behind the advice is that if you understand the mechanics behind the process the data can be generated from the process (imagine an image of a circle encoded as svg instead of bitmap (or png)). There are two caveats: 1) You can't eliminate noise of course. 2) You are limited by your data set(s). For the former you know enough tools to separate the noise from the data and quantify it.For the latter you should join in extenal data sets. Your modelling might suggest which could improve your compression. E.g. try to link in SNPs databases.

Request for suggestions: ageing and data-mining

by bokov 1 min read24th Nov 201448 comments


Imagine you had the following at your disposal:

  • A Ph.D. in a biological science, with a fair amount of reading and wet-lab work under your belt on the topic of aging and longevity (but in hindsight, nothing that turned out to leverage any real mechanistic insights into aging).
  • A M.S. in statistics. Sadly, the non-Bayesian kind for the most part, but along the way acquired the meta-skills necessary to read and understand most quantitative papers with life-science applications.
  • Love of programming and data, the ability to learn most new computer languages in a couple of weeks, and at least 8 years spent hacking R code.
  • Research access to large amounts of anonymized patient data.
  • Optimistically, two decades remaining in which to make it all count.

Imagine that your goal were to slow or prevent biological aging...

  1. What would be the specific questions you would try to tackle first?
  2. What additional skills would you add to your toolkit?
  3. How would you allocate your limited time between the research questions in #1 and the acquisition of new skills in #2?

Thanks for your input.


I thank everyone for their input and apologize for how long it has taken me to post an update.

I met with Aubrey de Grey and he recommended using the anonymized patient data to look for novel uses for already-prescribed drugs. He also suggested I do a comparison of existing longitudinal studies (e.g. Framingham) and the equivalent data elements from our data warehouse. I asked him that if he runs into any researchers with promising theories or methods but for a massive human dataset to test them on, to send them my way.

My original question was a bit to broad in retrospect: I should have focused more on how to best leverage the capabilities my project already has in place rather than a more general "what should I do with myself" kind of appeal. On the other hand, at the time I might have been less confident about the project's success than I am now. Though the conversation immediately went off into prospective experiments rather than analyzing existing data, there were some great ideas there that may yet become practical to implement.

At any rate, a lot of this has been overcome by events. In the last six months I realized that before we even get to the bifurcation point between longevity and other research areas, there are a crapload of technical, logistical, and organizational problems to solve. I no longer have any doubt that these real problems are worth solving, my team is well positioned to solve many of them, and the solutions will significantly accelerate research in many areas including longevity. We have institutional support, we have a credible revenue stream, and no shortage of promising directions to pursue. The limiting factor now is people-hours. So, we are recruiting.

Thanks again to everyone for their feedback.