Recently I read the paper Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election by Xiao-Li Meng, and I found a number of the insights in it really fascinating. I haven’t seen much coverage of these insights elsewhere, likely because the paper is pretty dense and the way Meng presents them isn’t especially intuitive. In this blog post I’m going to try to provide a more intuitive, grok-able explanation of the concepts presented in the paper. With this goal in mind, I’m going to drop much of the formal terminology (and accompanying rigor!) used in the paper, with the hope that the trade-off from formal precision to interpretability doesn’t lead anyone too far astray. Any errors in the presentation should be attributed solely to me and not the original author; if you encounter any inaccuracies, please reach out.
Let’s imagine that you’re a researcher with access to two different data sources that you can use to tackle an important business question. The first is a simple random sample of 200 users. The second is an administrative dataset (a dataset that’s available from an administrative system, e.g. a payroll system or a list of residents from the DMV) covering 100,000 users. Which dataset should you use to answer your business question? Which should you trust more?
Any non-statisticians reading this might intuitively think that the data source with 100,000 users (500x more data!) should be a slam-dunk. Unfortunately, many people have been trained to treat sample size as a measure of study reliability, but it turns out that sample size alone is not a good measure of study reliability at all.
Fortunately, Meng has given us a nice framework for thinking about this type of problem and evaluating dataset quality and inference error.
The Fundamental Identity of Inference Error
Let’s assume that we are attempting to estimate the share of the population that prefers Hawaiian pizza to any other type of pizza. There’s a population of size $N$, and some “true” proportion $p$ of that population that prefers Hawaiian pizza (inexplicably). Then we have a sample of size $n$ from which we can calculate the proportion that prefers Hawaiian pizza, $\hat{p}$. Ideally we want our estimate ($\hat{p}$) to be as close as possible to the truth ($p$).
Note: For the remainder of this blog post, I’m going to be working only with proportions, but (I believe) all of the results in the paper generalize to more complex measurements you might want to infer from a sample to a population, including the mean of a continuous variable.
It turns out that there’s an identity that describes the magnitude of the error ($\hat{p} - p$) we can expect from our study:

$$\hat{p} - p \;=\; \sigma_Y \times \sqrt{\frac{N - n}{n}} \times \rho_{R,Y}$$
The terms of the right-hand side of this identity can be understood as follows:
- $\sigma_Y$ is the problem difficulty: the more the value we care about varies in the population, the harder it will be to measure accurately. For example, because human heights don’t vary that much, we don’t need to sample many humans to get a good estimate of the average height within a population. However, if we wanted to estimate the average height of every plant in the world, we’d need a very large sample because the underlying distribution of plant heights varies enormously. If the true value doesn’t vary at all, this term goes to zero; in that case, if we are able to sample anyone at all then we will have zero error!
- $\sqrt{\frac{N-n}{n}}$ is the data quantity: that is, how much of the population are we not able to observe (more precisely, the square root of the ratio of unobserved to observed individuals)? If we are able to sample the entire population ($n = N$) then this term goes to zero and we will have zero error!
- $\rho_{R,Y}$ is the sample quality: this reflects how biased our sampling method is. In particular, it reflects how well our sample matches the population along the dimension we care about, $Y$ (here, preferring Hawaiian pizza). This term is likely the most unfamiliar (and is the one I spent the most time trying to understand), so that’s where we’ll spend the majority of this blog post.
An important thing to underscore is that this formula is an identity. That is, it is assumption-free and is true in the same way Euler’s identity is true. These terms are not estimated from data, but rather they are derived definitionally from first principles.
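Because the identity is exact, you can verify it numerically for any realized sample. The post’s own simulation is an R script; below is a standalone Python sketch with names I’ve made up, which just confirms that the realized error equals the product of the three terms when $\rho_{R,Y}$ is computed over the whole population:

```python
import numpy as np

# A hypothetical finite population of 100,000 people, ~20% of whom
# prefer Hawaiian pizza (Y = 1).
rng = np.random.default_rng(42)
N, n = 100_000, 500
y = (rng.random(N) < 0.2).astype(float)

# R is the inclusion indicator: 1 for the n people who end up in the sample.
r = np.zeros(N)
r[rng.choice(N, size=n, replace=False)] = 1.0

error = y[r == 1].mean() - y.mean()   # estimate minus truth
rho = np.corrcoef(r, y)[0, 1]         # sample quality
quantity = np.sqrt((N - n) / n)       # data quantity
sigma = y.std()                       # problem difficulty (population sd)

# The identity holds exactly for this (or any) realized sample.
print(np.isclose(error, rho * quantity * sigma))  # True
```

No assumptions about the sampling mechanism are needed: plug in any subset of the population for `r` and the two sides still match.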
Let’s go back to our pizza example. We’re trying to estimate the share of our county that prefers Hawaiian pizza (scientists are still uncertain what causes this malfunction in tastebuds); our county has 1M people ($N = 1{,}000{,}000$). Let’s assume that the true rate of liking Hawaiian pizza is $p = 0.2$, and that we commission a survey of 500 people ($n = 500$) to ask them about their pizza preferences.
- Problem difficulty: $\sigma_Y = \sqrt{p(1-p)} = \sqrt{0.2 \times 0.8} = 0.4$
- Data quantity: $\sqrt{\frac{N-n}{n}} = \sqrt{\frac{999{,}500}{500}} \approx 44.7$
So far so good.
- Sample quality: $\rho_{R,Y} = \,?$
Unfortunately, it’s not possible to calculate the sample quality directly; in fact, it’s never possible to calculate this value from a sample alone. However, it is very helpful to reason about the values this term could take on and how they affect our expected error.
The actual value of $\rho_{R,Y}$ is well defined, but it’s not easily interpretable. It’s the correlation between $R$, the indicator for whether a given person ends up in our sample, and $Y$, the outcome we’re measuring, taken over the whole population:

$$\rho_{R,Y} = \frac{\operatorname{Cov}(R, Y)}{\sigma_R \, \sigma_Y}$$
However, for the purposes of building an intuition, it’s easier to think about how different values of $\rho_{R,Y}$ will scale our expected error up and down. $\rho_{R,Y}$ is a measure of the sampling bias. That is, it tells us how much more or less likely one group is to be sampled than another. If the sample were truly random, neither group would be relatively more or less likely to be sampled, $\rho_{R,Y}$ would be zero on average, and our estimate of the proportion in the sample would approach the true proportion in the population. However, if our sampling method is biased, $\rho_{R,Y} \neq 0$, and our estimate will not accurately generalize to the broader population. In technical terms, you could say that $\rho_{R,Y}$ captures the relative difference in response likelihood conditional on the value of the outcome we’re seeking to measure.
So, if our sample is truly random, then people who like Hawaiian pizza shouldn’t be more likely to participate in our survey; they should participate in proportion to how prevalent they are in the population. However, if our sample isn’t truly random, if it’s biased, then we might over-sample the Hawaiian-pizza lovers and our estimate of the true underlying rate in the population won’t be accurate. That is, a biased sample will have $\rho_{R,Y} \neq 0$, and as the magnitude of $\rho_{R,Y}$ increases, our estimate will become more and more biased. What we’ll see is that even very small values of $\rho_{R,Y}$ can have meaningful impacts on the accuracy of our estimate, no matter how big our sample size is.
Below I’ve included a table from a simulation where we imagine a total population of 1,000,000 people, some of whom prefer Hawaiian pizza (again, inexplicably) and some of whom prefer other types of pizza. In this simulation we have two sampling methods: a simple random sample (SRS) and a biased sample where Hawaiian-pizza lovers are slightly more likely to be present in the sample. Here we’re using a bias parameter of $R = 0.1$, so at a 1% overall sampling rate the net difference in the probability of response between Hawaiian-pizza lovers and non-lovers is 0.1 percentage points.
If we sample 10,000 people (1% of the population) we expect to draw Hawaiian-pizza lovers at a rate of 1.08% and non-Hawaiian-pizza lovers at a rate of 0.98%, which doesn’t sound like that big of a difference: instead of drawing an even 1% from both groups, we slightly over-sample lovers and slightly under-sample non-lovers. (The arithmetic on 0.98% and 1.08% is a little bit tedious. The lovers’ rate works out to `resp_p + (1 - p_x) * resp_p * R` and the non-lovers’ rate to `resp_p - p_x * resp_p * R`, which keeps the overall response rate at exactly `resp_p`.)
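To make that arithmetic concrete, here’s a small Python sketch. The variable names `resp_p`, `p_x`, and `R` follow the footnote above, and the two formulas are my reading of how the simulation assigns response rates:

```python
resp_p = 0.01  # overall response rate: we sample 1% of the population
p_x = 0.20     # true share of Hawaiian-pizza lovers
R = 0.10       # sampling-bias parameter

# Lovers are over-sampled and non-lovers under-sampled, in a way that
# keeps the overall response rate at exactly resp_p.
lovers_rate = resp_p + (1 - p_x) * resp_p * R   # -> 1.08%
non_lovers_rate = resp_p - p_x * resp_p * R     # -> 0.98%
overall = p_x * lovers_rate + (1 - p_x) * non_lovers_rate

print(round(lovers_rate, 4), round(non_lovers_rate, 4), round(overall, 4))
# 0.0108 0.0098 0.01
```

The weighted average of the two group rates collapses back to `resp_p`, which is why the bias is so easy to miss: the total sample size comes out exactly as planned.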
[Table: MAPE and RMSE of the estimated proportion for the biased sample vs. the SRS at increasing sample sizes]
What we see is that the two samples have similar error rates across both measures until our sample size gets to be about 10,000 people (1% of the population). At that point, the simple random sample continues improving in accuracy (the error rates keep going down) but the biased sample does not. In fact, both the MAPE and the RMSE of the biased sample plateau at a sample size of around 10,000, so beyond that, increasing the sample size doesn’t improve accuracy at all!
And this is where things can get especially confusing for non-statisticians attempting to interpret these data: they might see an n-size of 500,000 (half the population!) and think that such a huge sample is clearly better than a simple random sample of only 1,000, but they’d be wrong! The simple random sample of 1,000 people is more accurate than the very-slightly-biased sample of 500,000!
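The plateau has a tidy explanation: under this kind of bias, the biased estimator converges to the same wrong number no matter how much of the population we sample. A quick Python sketch of the expected value, using the same hypothetical parameters as above:

```python
p_x = 0.20  # true share of Hawaiian-pizza lovers
R = 0.10    # sampling-bias parameter

# For any overall response rate resp_p, lovers respond at
# resp_p * (1 + (1 - p_x) * R) and the overall rate stays at resp_p,
# so resp_p cancels out of the expected estimate entirely:
expected_estimate = p_x * (1 + (1 - p_x) * R)
print(round(expected_estimate, 3))  # 0.216
```

Because the sampling rate cancels, collecting more data under the same bias just makes the estimate converge ever more tightly to 21.6% instead of the true 20%.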
Even worse, if you naively use statistical methods that assume a simple random sample, you’ll be led doubly astray. If you calculate the confidence interval for the sample of 500,000 as if it were truly random, you’ll get a confidence interval that’s way too small. With a sample size of 500,000, our average estimate for the proportion of Hawaiian-pizza lovers in our simulation is 21.6%.
Our formula for a naive 95% confidence interval is:

$$\hat{p} \,\pm\, 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
which in this case yields a confidence interval from 0.2154 to 0.2165, a very tight interval that doesn’t even come close to including the true value of 0.2! This is a real problem: if practitioners aren’t thinking critically about how their data were sampled (and therefore whether the statistical techniques they’re using can be applied), they can end up being very confident and also very wrong! In fact, the more data they have, the more confident they’ll be in a number that is in fact leading them astray!
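Here’s a sketch of that naive interval calculation in Python (the digits differ slightly from the simulation’s, since they depend on the exact simulated $\hat{p}$, but the conclusion is identical: the interval is razor thin and excludes the truth):

```python
import math

p_hat = 0.216   # average biased estimate from the simulation
n = 500_000     # biased sample size (half the population!)
true_p = 0.2    # true share of Hawaiian-pizza lovers

# Naive SRS standard error and 95% interval, as if the sample were random.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

# The interval is about 0.002 wide and sits entirely above 0.2.
print(lo > true_p)  # True
```

The standard error shrinks like $1/\sqrt{n}$, but the bias doesn’t shrink at all, so a bigger biased sample only produces a narrower interval around the wrong answer.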
So the moral of the story is that simple random sampling is incredibly powerful and does a lot of work for us! In this brave new world where we have access to lots of administrative data that only partially represents the population of interest, we have to be very careful with the tools we use to extrapolate from those datasets, because even small amounts of non-randomness in the sampling technique ($\rho_{R,Y} \neq 0$) can cause relatively large errors in our estimates while also generating a false sense of confidence in the data!
Implications for Analysts and Researchers
This point cannot be emphasized enough: you can never calculate how biased your sample actually is; it is an unknowable quantity in a real research setting. In the analysis above I simulate data with a known bias in order to see how the results change at different sample sizes, but as researchers we can never know what that value is.
There are a few things every analyst should keep in mind when performing any analysis:
- Is my sample potentially biased? What are all the ways it might be biased?
- If it is biased, what magnitude of bias might we expect to encounter?
- How does that magnitude of bias affect the interpretation of our results?
- Is that value reflected in the confidence intervals we’re using?
It might be the case that in the analysis you’re doing, you expect the magnitude of the bias to be relatively small, and so for the purposes of a “quick and dirty” analysis using biased data is fine (maybe having 2% error in your analysis of pizza preferences is just fine). However, because the magnitude of that sampling bias is not possible to calculate from the sample you collected, the bias could in fact be very large and you wouldn’t know about it.
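One way to pressure-test questions two and three above is to plug hypothetical values of $\rho_{R,Y}$ back into the identity and look at the implied error. A Python sketch (these $\rho$ values are illustrative guesses, not estimates; note that a $\rho$ of just 0.005 already implies roughly the 2% error mentioned above):

```python
import math

# Hypothetical scenario: estimating a 20% proportion from a sample of
# 10,000 drawn from a population of 1,000,000.
N, n = 1_000_000, 10_000
sigma = 0.4                        # problem difficulty: sqrt(0.2 * 0.8)
quantity = math.sqrt((N - n) / n)  # data quantity: ~9.95

for rho in [0.0005, 0.005, 0.05]:
    # Expected error implied by the identity for this hypothetical bias.
    err = rho * quantity * sigma
    print(f"rho={rho}: implied error ~ {err:.3f}")
```

Running the scenarios you consider plausible through this kind of back-of-the-envelope calculation is a cheap way to see whether your conclusions would survive the bias you can’t measure.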
We’ve seen this problem play out routinely in some of the research coming out in the midst of the COVID-19 crisis. Researchers, in the interest of moving quickly (which I generally applaud!), are making the mistake of generalizing from non-random samples to population-wide metrics, which is simply not valid. The statistics the researchers use to estimate confidence intervals assume a truly random sample from the population, so their reported confidence bounds are much too narrow: as we saw in the example above, the more data you collect in a biased manner, the narrower your naive confidence bands will be, and the more likely it is that they don’t include the true value.
Collecting a truly random sample is very hard! Pollsters have gotten very good at leveraging small samples of the US population to get good reads on how candidates will perform in national elections, but no one knows how to do the same thing for a virus like COVID-19! We don’t know which demographic characteristics matter, so we don’t know how to weight our samples appropriately to generalize from relatively small samples up to the population as a whole.
It behooves us as consumers of research, as well as the researchers themselves, to keep this problem in mind when results are presented: we should assume that every sample collected is biased and cannot be generalized to the entire population of the US without very large error bars. When analyzing these data, we should be asking ourselves “what population does this generalize to, and is that the population we care about?”
Hopefully y’all found this framework for thinking about how to evaluate the expected error rate from a given data source helpful. Remember that it’s these three factors that explain the error for a given analysis:
- Problem difficulty: $\sigma_Y$
- Data quantity: $\sqrt{\frac{N-n}{n}}$
- Sample quality: $\rho_{R,Y}$
Hopefully, this framework helps you think critically about the work you’re doing and also helps you to explain potential biases to your stakeholders. If you have questions or thoughts, you can always reach me at firstname.lastname@example.org
Thanks to some extremely sharp-eyed error-catching from Ludmila Exbrayat, I made some edits to the section describing the difference in sampling rates between the two populations. I updated the blog so that the example difference in sampling rates correctly shows 0.98 and 1.08 instead of 0.95 and 1.05 as it showed before. I also updated the code in the gist and the results in the table to reflect this change; none of these changes altered the results or the interpretation.
This post is powered by QuickLaTeX; check it out for your WordPress + LaTeX needs
R script for the simulation can be found here: github gist