Probably you won't be able to perform a data-driven habit stacking for self-improvement

siarshai

Suppose you’re not happy with the quality of your sleep. You’ve already stopped doing the obviously harmful things (no more coffee at night), and your sleep has improved - but you’d like to work on it further. A coworker gives you an herbal mix with St. John’s wort and lavender. You try drinking it at night instead of coffee, and it does seem that sometimes your sleep really does get deeper than before. But sometimes it doesn’t. You’re willing to experiment, but how do you actually check whether the herbs work, or whether it’s just random variation?

Or suppose you’re not particularly satisfied with your productivity at work. Following the advice from "Atomic Habits" and books on workflow organization, you’ve introduced a few useful micro‑habits and ergonomic improvements. But what do you do once the low‑hanging fruits are picked up? Time is limited - you can’t implement everything that someone somewhere calls "useful". Some habits are even mutually exclusive: it’s impossible to both socialize during lunch and sit alone in silence at the same time.

Or, for example, you want to achieve better results in fishing… you get the idea.

"Don’t underestimate the power of small things taken in large numbers" is a wise thought - but how do you figure out which small things actually work in your specific case? This is quite a roadblock on the path to becoming stronger. If you’ve taken a statistics course or have heard the term "A/B testing," you probably already have a rough outline of an answer. Pick a target metric, collect historical data, then collect data after the chosen intervention, compare the average values before and after - and voilà, a scientifically grounded answer. The problem is that if you jump into a project like this without a solid plan, you’ll likely end up with a pile of useless numbers in a spreadsheet. Why this kind of endeavor is not for the faint‑hearted is exactly what I want to talk about in this article.

TLDR: Most effects you’re likely to encounter are too weak to detect against the background noise of everyday life. Even for a relatively noticeable effect, you need a carefully thought‑out experimental design and several months of work.

Notation

The following notation is used throughout the text:

D - variance (a measure of how spread out a random variable is)
- standard deviation (the square root of variance)
- a normal distribution with mean and variance D
- Cohen’s d statistic (the distance between the peaks of two normal-distribution populations, normalized by the square root of the sum of their variances)

For simplicity, the text below assumes that the intervention does not change the shape of the distribution (i.e. d is measured in units of of the original distribution). This does not strongly affect the arguments below. I’m also aware of the limitations of using the normal distribution for modeling. If we take those limitations seriously, the outlook for self‑experimentation becomes even worse.

Minimal background for people who want to read the article but don’t know probability theory

Skip this section if you know what the symbols above mean and remember the basic properties of the normal distribution.

There will be no detailed derivations or rigorous proofs here. The article is far too short for me to teach even an introductory statistics course. But here’s a short summary so that readers unfamiliar with statistics can still follow the reasoning that comes next.

Imagine a regular six‑sided die. If it’s a fair die, the probability of rolling any given face is 1/6 = 16.66%. This is a uniform distribution - all outcomes are equally likely. It can be visualized with the following histogram:

The horizontal axis shows the outcomes, the vertical axis shows their probabilities. The sum of probabilities over all outcomes is equal to one (100%).

But if we roll two six‑sided dice, the distribution of the sum of the outcomes is no longer uniform. If we list all possible outcomes:

1 + 1 = 2
1 + 2 = 3
2 + 1 = 3
1 + 3 = 4
2 + 2 = 4
3 + 1 = 4
1 + 4 = 5
...
5 + 6 = 11
6 + 5 = 11
6 + 6 = 12

we can see that 2 can only occur if both dice show 1 (which happens in 1 out of 36 cases). A sum of 3 can happen in two ways: the first die is 1 and the second is 2, or vice versa - so the probability is 2/36. The chance of rolling a 4 is even higher: 3/36, and so on. The largest number of combinations produce a sum of 7. These include cases where both dice land near the middle of the range (3+4, 4+3), as well as cases where a small value is compensated by a large one (1+6, 2+5, 5+2, 6+1).

The bars of the histogram form a triangle. What happens if we roll three six‑sided dice instead?

You can see that the center becomes smoother, while the tails spread out. Again, this happens because getting a 3 or an 18 requires all three dice to land on 1 or 6 respectively. Central values, on the other hand, can arise both when all dice land near the middle and when small values compensate large ones.

If we take 25 dice, the resulting plot takes a particular bell shape:

For practical purposes, a curve like this is already quite close to a normal distribution. Many things in our lives follow an (approximately) normal distribution: the height of men of a fixed age in a given country, measurement errors of most instruments, your mood over the course of a year. This happens for the same reason as with the sum of dice. You’re in a good mood when everything goes according to plan (all dice roll 5–6). You’re in a bad mood when nothing works out (all dice roll 1–2). But an average mood can result both from an unremarkable day and from good things compensating for bad ones.

("Didn’t get enough sleep this night. But I remembered my favorite school teacher in the morning, which gave me a motivational boost. Then there was a traffic jam on the way to work - so irritating! At least the day is sunny; you can feel that spring is coming - I love spring.")

Probability theory says that the elements being summed don’t have to be uniformly distributed. As long as they have a sufficiently "nice" shape, their sum will converge to a normal distribution. This is called the central limit theorem. And "niceness" holds surprisingly often.

The bell curve can have different widths, and we’d like to be able to measure this spread. Simply measuring the distance between the leftmost and rightmost values is impractical, because values near the edges of the bell occur very rarely. So in practice, we use variance () - the mean squared distance from the center of the distribution - or standard deviation (), the square root of variance.

The red arrows in the picture show distances from the center to several points. To compute the standard deviation, we take all these distances, square each of them, sum them up, divide by their count, and then take the square root of the result.

It is known that 68% of elements of a normal distribution lie within the range , and 95% lie within . Conversely, if we know that 95% of the elements of a normal distribution fall within a certain range, we can estimate .

Now suppose we have two sets of values: an original random distribution and the same distribution shifted by some amount. For example, a database of how well students performed on a test after 48 weeks of studying a subject, and how well they performed after 96 weeks. How do we estimate how much doubling the time improves the results? It’s convenient to measure this shift in units of : this way of measuring reflects not the absolute change, but the change relative to the natural variability. A ten‑point improvement can be impressive if the typical spread is five points, and almost meaningless if the spread is fifty points.

What if the intervention not only shifts the mean, but also increases the spread of values? That’s not a problem - we take the average according to a standard rule: and measure the distance in those units:

This is called Cohen’s d statistic. Scientific papers usually report results using exactly this measure.

How large of effects should you expect from self‑experimentation?

It’s hard to answer this question in general. Some people are trying to optimize sleep quality, others are aiming to gain muscles. Let’s calibrate expectations - here are Cohen’s d values for a few well‑studied effects:

Taking creatine (a dietary supplement widely used among bodybuilders), depending on the protocol and how outcomes are measured, yields an additional performance gain of about d = 0.27–0.59^[1].
Strength training with a coach, as opposed to training alone, adds another d = 0.28–0.4 depending on how performance is measured^[2].
Meta‑analyses estimate the average effect of antidepressants at around d = 0.3^[3]. The effect is smaller in cases of mild depression. In real life, however, one should expect a much larger impact: a doctor can try several medications and fine‑tune the dosage.
There is no consensus on the effectiveness of mindfulness meditation for treating anxiety disorders. Studies show a wide spread of results, but an average estimate of d = 0.2 seems quite plausible^[4]^[5].
Moderate‑intensity strength training (60–80% of one‑rep max) reduces glycated hemoglobin levels by about d = 0.4^[6].
Additional financial education classes yield d = 0.15 for financial knowledge (test scores) and d = 0.07 for financially responsible behavior (for example, building a larger emergency fund)^[7].

On the one hand, your personal life situation or biochemistry might amplify effects that are useless for most people. On top of that, studies are limited by the fact that the same "dose" of an intervention is applied to all participants, whereas you can fine‑tune the "dose" much more precisely for yourself. On the other hand, you’re unlikely to test anything on yourself that’s stronger than prescription antidepressants. Also, most likely you’ll want to optimize things like "mood" or "energy throughout the day." It’s rare for something to simultaneously affect biology, externally imposed daily schedules, psychology, and the habits feeding into these metrics. So expecting effects of d > 1 from any intervention in these cases would be strange. Either you already know about such interventions and are already doing them (you’ve almost certainly gone to school - not some specific course, but school in general). Or you know about them but aren’t doing them for good reasons (moving to a Buddhist monastery would almost certainly have a strong effect on anxiety but there are legitimate reasons not to do that). Or these are major life‑changing decisions that don’t just happen casually, and that are hard to reproduce and study - leaving an abusive partner, or radically changing career track.

As a result, in my opinion, if you’re trying to optimize your life by tweaking habits and selecting supplements, you’re effectively "hunting" for effects in the range d = 0.1–0.4.

What does d = 0.1–0.4 actually mean in real life?

in IQ measurement is about 1.5 IQ points. An extra 1 centimeter of height corresponds to roughly in the distribution of adult male height^[8]. Not much - but I wouldn’t casually throw away extra IQ points or height centimeters.
The 10th percentile of electric motor assemblers in the US earns about $25k per year, while the 90th percentile earns $56.5k^[9]. Roughly speaking, the standard deviation of this distribution is around $8.5k per year, and 0.2 of that is $1.7k per year. That’s enough to buy a decent bicycle in Texas.
If someone has between 5 and 45 free hours per week for leisure in 95% of cases, then an effect that increases free time by would give them an extra 3 hours per week. Again, not a huge amount - but enough to play one new board game per week, or to finish Dispatch in three weeks^[10].
If in 95% of cases you read 40–160 pages per week (a 400‑page book per month), then corresponds to an extra 12 pages per week (about one and a half additional 400‑page books per year).

In medical literature, effects of are considered "small," - "medium," and - "large"^[11]. The examples above make it clear why: really isn’t that much. But let’s remember the context. Medical literature evaluates drugs against specific diseases. You can’t give patients arbitrary doses of whatever you want. Doctors have to justify why they prescribe a particular pill, because side effects are real. In your own life, on the other hand, nobody stops you from applying as many interventions as you want - until you run out of money or free time. If, in the free‑time example, you could find and apply three such formally "small" effects, that would give you 9 extra free hours per week. That’s a full workday! So the idea of stacking many small effects is at least worth a serious look.

Sounds tempting - but then what’s the catch?

How many observations, exactly?

To estimate the absolute effect of an intervention, we take a set of observations before the intervention and another set after it, and then compare the mean values of the two sets. The problem is that we need a lot of data. For example, in the figure below we have only 4 observations - and we got unlucky. Even though the green distribution is actually to the right of the blue one, based on our limited data we would have to conclude that it’s to the left, and by quite a lot:

(There’s a similar issue when estimating the standard deviation needed to compute Cohen’s d, but let’s focus on the means for now.) All we can hope for is that with more observations, the chance of getting unlucky with all of them decreases. For example, here some of the variability partially cancels out:

Computing the mean difference for the plot above gives , which is closer to the true shift of 0.25 - but still not that close.

The pooled standard deviation, computed using the method from the previous section, is equal to 1, so the d‑value is also 0.6: ,

So how many observations do we actually need to be confident? Let’s look at how well the estimated value converges to the true effect. Below are plots from twenty simulations showing how our estimate of the difference between two normal distributions with D = 1 depends on the number of observations. We’re trying to detect true effects of d = 0.1, d = 0.2, and d = 0.3, respectively:

From these plots we can draw several conclusions:

On average, we do converge to the true effect. The more observations we have, the closer we get.
That said, you don’t know in advance which simulation branch you’re on. You can either underestimate or overestimate the effect of the intervention.
The spread of the estimates shrinks relatively quickly at first, but after about 200 observations, adding more data barely reduces the spread. At the given noise level, it takes around 1200 observations for the simulation curves to consistently fall within a band around the true value.
The convergence rate depends only very weakly on the true effect size.
This isn’t visible in the plots, but the single most important factor is the noise level - the natural variability of the observed quantity.

If you look closely at the plots, you’ll see that even for an effect of d = 0.3, you need more than 175 observations before all simulation runs at least register a positive effect. For d = 0.1, you need roughly 1000. But even this understates how hard the experiment really is! In practice, we need to distinguish between interventions that have an effect and interventions that have no effect at all. That means the bundles of simulations for "effect" and "no effect" need to be sufficiently separated for us to tell which bundle our observed line belongs to:

From the image above, it’s clear that separating these bundles requires around 600–700 observations.

The plots are intuitive, but not very formal. What does theory say?

p‑value

In science, the standard way to assess whether we can reject the null hypothesis (no effect; the intervention has no impact) is via a p‑value test. The p‑value for an effect is the probability that the null hypothesis is true and that we observe a difference between two datasets that is at least as large as the one we got. Conventionally, if this probability is below 5%, we reject the null hypothesis. This is the well‑known criterion .

How can we estimate the probability of observing the measured difference if there is actually no effect? Once again, the normal distribution comes to the rescue. The sum and the difference of normally distributed random variables are themselves normally distributed^[12]. As a result, the difference of sample means is also normally distributed. If the null hypothesis is true and the distributions and are identical, then the peak of this difference distribution is at zero^[13]. It can be shown that the standard deviation of this difference is . Therefore, if we divide one by the other, the statistic will follow a normal distribution with unit variance and (under the null hypothesis) mean zero. We can then compare the observed value against tabulated values for to determine how likely it was to obtain such a result by chance^[14].

Below are plots showing how confidence in our conclusion grows with the number of observations, corresponding to the convergence plots above. The bold line shows the average across the p‑value experiments above. This is not exactly the true expected p‑value curve, but it already makes the point: achieving 95% confidence requires a lot of observations.

So what should we expect theoretically? We can "invert" the derivation above and compute how much data is needed to detect, with a given probability, an effect of known size d. Here are a few scenarios:

Theory says that to have 5% false positives and 5% false negatives, we need:

For d = 0.3 - about 580 observations (i.e. roughly 290 before the intervention and 290 after; note how well this matches the separation point of the d‑value bundles above)
For d = 0.2 - about 1300 observations
For d = 0.1 - about 5200 (!) observations

If we lower the bar slightly and allow 5% false positives and 20% false negatives, we need:

For d = 0.3 - about 350 observations
For d = 0.2 - about 800 observations
For d = 0.1 - about 3150 observations

If we lower the bar even further and allow 20% false positives and 20% false negatives:

d = 0.3 - 200 observations
d = 0.2 - 250 observations
d = 0.1 - 1800 observations

And finally, if we are absolutely certain that the effect must be positive and use a one‑sided test:

d = 0.3 - 126 observations
d = 0.2 - 280 observations
d = 0.1 - 1120 observations

So even if we significantly relax the confidence requirements and completely ignore the possibility that the intervention could do harm instead of good, detecting a not‑so‑weak effect of d = 0.3 still requires four months of observations! As a reminder, this is roughly the effect size of the first correctly chosen antidepressant for a patient with moderate depression. You can basically give up on trying to measure anything weaker. Vitamins (if you don’t have a specific deficiency), fish oil, and an extra 20 minutes of walking during the day (unless you’re usually completely sedentary) simply won’t be visible - even with very careful tracking.

Additional complications

But that’s still not all. There are several more effects worth paying attention to - each of them further increases the number of observations required.

Non‑linear interactions

It’s very rare for an intervention’s effect to be linear across the entire range we care about. Most interventions have a saturation threshold. The first extra hour of walking gives a larger effect than the third. This isn’t too bad, although it does mean we tend to underestimate the effect of partial interventions.

Much worse is when an intervention’s effect looks like an inverted U. The first cup of coffee makes you more alert, while the fifth makes you shake in hyper‑focused anxiety. Less than six hours of sleep is bad, but more than ten hours of sleep is also bad. Catching the optimal level of exposure is not always easy. Ideally, you’d want to test several intensity levels - which means running the experiment at least twice.

Accumulation and time‑to‑effect

Many interventions don’t produce an immediate effect. For example, atomoxetine (a medication for ADHD) has an effect ranging from d = 0.4 up to d = 2 (!) depending on the patient group and how symptoms are measured^[15], but it can take up to four weeks to accumulate in the body. As a result, some ADHD patients may abandon even such an effective drug before it has time to work. The results of going to the gym or starting psychotherapy take even longer to show up.

Often, it’s not even clear what delay to expect. If you’re studying the relationship between weight loss and back pain, you probably shouldn’t expect the pain to go away immediately after losing weight. The damage to the back has already been done - it needs time to heal. But how long it takes for the effect to reach a plateau is unclear.

Side conditions and seasonality

The same intervention can be useful at one time and useless at another. Seasonality is the most obvious factor here, but it’s not the only one. An extra hour of walking or a vitamin D capsule may have a stronger effect in winter than in summer, because people tend to get less light and physical activity during the cold months. Sports supplements may work when you’re in a caloric surplus and do nothing in a deficit. Meditation might improve mood in the morning but harm sleep quality in the evening.

Substitution effects

Suppose you want to study how sleep deprivation affects you. You collect data on work productivity depending on whether you slept 6 or 8 hours. But if, after a six‑hour night, you drink more tea - and tea has a stimulating effect on you - you might not see any productivity drop from sleep deprivation in the data at all. You either need to fix all other parameters of the system (hard to do in practice), or collect data on them as well. Accounting for substitution effects complicates the model. And it’s often unclear in advance which factors might act as substitutes.

"Noisy" measurement units

If you’re measuring your own happiness or energy levels, you’ll run into consistency issues. People can usually distinguish between feeling 2/10 and 8/10 - but those extreme states don’t happen very often. Most of the time, you need to tell apart 4/10 from 5/10 and 5/10 from 6/10, and not everyone is that good at introspection. By the second month of a study, people get tired of doing this and start choosing between 4 and 5 at random. If there are no systematic biases, the estimate should still eventually converge to the "true" value. Still, measurement error adds on top of the error in the underlying state, increasing overall noise - and with it, the required number of measurements.

You should try to ground abstract target metrics either in questionnaires or in objective physiological indicators. But filling out questionnaires takes time, and it requires willpower not to start answering them carelessly. Objective physiological metrics have a different problem: thanks to wearable devices they’re easy to collect, but they’re usually only very indirectly related to what you actually care about. You can use resting heart‑rate variability as a proxy for energy or anxiety levels - but the connection is quite indirect.

The observer effect

It’s hard to run a blind experiment when you’re both the researcher and the subject. This only really works for pills - and even then, not everyone is willing to deal with the organizational hassle. You know what effect you’re hoping for, and you’ll subconsciously nudge your ratings in the desired direction.

Moreover, the very act of monitoring changes the system. Data collection makes you engage with everything more consciously. Mindfulness is generally a good thing, but it also creates a gap between the experimental setup and real life. Checklists add friction to activities that used to be automatic. Excessive attention to your own happiness and calmness can itself cause depression and anxiety. This effect persists even if you know about it.

The general noise of life

Work deadlines, unexpected trips, illnesses, and periods of unusually good mood will create "holes" in your data. You’ll need to decide which data points "count," or else log everything. You also need to decide what to do with missing data.

The best thing you can do for measurement quality is to reduce the overall noise level of the metric. Unfortunately, it’s very hard to put yourself into sufficiently controlled conditions for several months just to finish an experiment. You can try measuring more frequently, but beyond a certain (often unknown) threshold, increasing measurement frequency just collects more noise.

Conclusion and silver lining

"Data‑driven habit optimization" sounds cool and scientific, but in practice it’s a very situational endeavor. It’s worth attempting only if you:

are confident that the intervention you’re studying has at least d = 0.3 (see the list of effects above; do you really expect your intervention to be comparable?)
can come up with ways to deal with the additional complications described in the previous section
are ready to dedicate several months to the study

Estimate the expected cost of the data for such interventions. How much would you value information about the effect size? What is the expected cost of running the experiment? Even if this comparison suggests the study is worth doing, don’t expect an easy path or quick results. You will need this amount of dedication to the science.

EDIT: I would like to clarify in response to the comments that I don't think that any data-driven life improvement is futile. My point that this particular flavor of rationalistic/nootrop-enthusiastic of searching and stacking d=0.1-0.3 improvements is very hard to implement. Other data-driven and generally goal-driven techniques are still very useful. For example:

As I've mentioned, you don't know effect of certain intervention beforehand. In rare cases you may have a certain disorder you don't know about and which can be greately ameliorated. If, say, you have vitamin A deficiency and when you start taking vitamin A supplements you notice effect. In this case it will be relatively easy to prove that effect is not a placebo.
- So I guess, a fruitful strategy would be trying many things in sequence for a short period of time and investigating further if you notice any effect. Lossy but implementable.
Sometimes the mere noticing and paying attention what's happening to you allows to you to thing if it's worth changing your life in some aspects.
- I've mentioned that sometimes the tracking itself introduces harmful friction into the process. However sometimes tracking itself helps! A common piece of advice in weightlifting is to track your progress. Same with losing weight.
- It's hard to fit outliers into normal distribution but noticing outliers in data motivates you to think why those outliers happened. They give you more information about yourself, uncover unknown unknowns.
- Sometimes seeing negative patterns in data allows you to react to problems before they happen (on the other hand, see also: hypochondria)
Sometimes hard data help to gather your willpower to implement certain changes. For example when I do any intellectually demanding tasks after 22:00 my sleep quality drastically drops. I knew about that effect for some time but tried not to do anything at night only half-heartedly. Measuring that drop helped me to abstain from coding at night more consistently.
I've mentioned that it's hard to achieve meaningful effect size on 'happiness' and other broad metrics but on the flip side it's not that hard to achieve an effect on some narrow and easily measurable things. For example, tracking your blood sugar or blood pressure and expermenting on how different things affect them is much more reliable.
In rare cases experimenting triggers cascade effects or changes system around you enough that normal distribution is no longer applicable to modeling your situation.

^{^}
https://link.springer.com/article/10.1186/s11556-025-00392-9
^{^}
https://journal.iusca.org/index.php/Journal/article/download/101/184/
^{^}
https://www.thelancet.com/article/S0140-6736(17)32802-7/fulltext
^{^}
https://www.mdpi.com/2254-9625/10/3/52
^{^}
https://pubmed.ncbi.nlm.nih.gov/39740743/
^{^}
https://academic.oup.com/ehjopen/article/5/5/oeaf093/8231441
^{^}
https://www.academia.edu/105909537/Financial_education_in_schools_A_meta_analysis_of_experimental_studies
^{^}
https://ourworldindata.org/human-height
^{^}
https://www.salary.com/tools/salary-calculator/electric-motor-assembler#google_vignette
^{^}
https://howlongtobeat.com/game/160618
^{^}
https://www.medscape.org/viewarticle/569729
^{^}
Intuitively: if X is the result of summing N dice and Y is the result of summing M dice, then X+Y is the result of summing N+M dice - which only makes the distribution more bell‑shaped.
^{^}
Intuition: imagine drawing pairs of numbers from the same set, one with your left hand and one with your right, and subtracting right from left. On average, for every pair ordered (first‑left, second‑right) there will be a corresponding (first‑right, second‑left). These opposite differences cancel out, and the mean ends up at zero.
^{^}
When the values are unknown, this statistic technically follows a t‑distribution, but for sufficiently large samples and our level of precision, this doesn’t matter much.
^{^}
https://journals.sagepub.com/doi/pdf/10.1177/2045125316647686

47