
The magical power of random sampling

Recently I read the paper Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election by Xiao-Li Meng, and I found a number of the insights in it really fascinating. However, I haven’t seen much coverage of these insights elsewhere, likely because the paper is pretty dense and the way Meng presents them isn’t especially intuitive. In this blog post I’m going to try to provide a more intuitive, grok-able explanation of the concepts presented in the paper. With this goal in mind, I’m going to drop much of the formal terminology (and accompanying rigor!) used in the paper, with the hope that the trade-off from formal precision to interpretability doesn’t lead anyone too far astray. Any errors in the presentation should be attributed solely to me and not the original author, and if you encounter any inaccuracies, please reach out.

Motivation

Let’s imagine that you’re a researcher with access to two different data sources that you can use to tackle an important business question. The first is a simple random sample of 200 users. The second is an administrative dataset (a dataset that’s available from an administrative system, e.g., a payroll system or a list of residents from the DMV) covering 100,000 users. Which dataset should you use to answer your business question? Which should you trust more?

Any non-statisticians reading this might intuitively think that the data source with 100,000 users (500x more data!) should be a slam-dunk. Unfortunately, many people have been trained to treat sample size as a measure of study reliability, but it turns out that sample size alone is not a good measure of study reliability at all.

Fortunately, Meng has given us a nice framework for thinking about this type of problem and evaluating dataset quality and inference error.

The Fundamental Identity of Inference Error

Let’s assume that we are attempting to estimate the share of the population that prefers Hawaiian pizza to any other type of pizza. There’s a population of size N, and some “true” proportion of that population p that prefers Hawaiian pizza (inexplicably). Then we have a sample of size n from which we can calculate a proportion that prefers Hawaiian pizza \hat{p}. Ideally we want our estimate (\hat{p}) to be as close as possible to the truth (p).

Note: For the remainder of this blog post, I’m going to be working only with proportions, but (I believe) all of the results in the paper generalize to more complex measurements you might want to infer from a sample to a population including the mean of a continuous variable.

It turns out that there’s an identity that describes the magnitude of the error (p - \hat{p}) we can expect from our study:

p - \hat{p} = stddev(p) \times \sqrt{\frac{N-n}{N}} \times \rho_{R,p}

The terms of the right-hand side of this identity can be understood as follows:

  1. stddev(p) is the problem difficulty — that is, the more that the value we care about varies in the population, the harder it will be to accurately measure. For example, because human heights don’t vary that much, we don’t need to sample that many humans to get a good estimate of the average height within a population. However, if we wanted to estimate the average height of every plant in the world, we’d have to have a very large sample because the underlying distribution of plant-heights actually does vary enormously. If the true value doesn’t vary at all, this term goes to zero. In that case, if we are able to sample anyone at all then we will have zero error!
  2. \sqrt{\frac{N-n}{N}} is the data quantity — that is, what share of the population are we not able to observe2Well, the square-root of the share of the population that isn’t observed.? If we are able to sample the entire population (n = N) then this term goes to zero and we will have zero error!
  3. \rho_{R,p} is the sample quality — this reflects how biased our sampling method is. In particular, it reflects how well our sample matches the population along the dimension we care about, p. This term is likely the most unfamiliar (and is the one I spent the most time trying to understand) so that’s where we’ll spend the majority of this blog post.

An important thing to underscore is that this formula is an identity. That is, it is assumption-free and is true in the same way Euler’s identity is true. These terms are not estimated from data, but rather they are derived definitionally from first principles.

Gaining Intuition

Let’s go back to our pizza example. Let’s assume that we are trying to estimate the share of our county that prefers Hawaiian pizza (scientists are still uncertain what causes this malfunction in tastebuds); our total county has 1M people. Let’s assume that the true rate of liking Hawaiian pizza is p = 0.2, and that we commission a survey of 500 people to ask them about their pizza preferences.

  • Problem difficulty: stddev(p) = \sqrt{p(1-p)} = \sqrt{0.2(0.8)} = 0.4
  • Data quantity: \sqrt{\frac{N-n}{N}} = \sqrt{\frac{999,500}{1,000,000}} \approx 0.9997

So far so good.

  • Sample quality: \rho_{R,p} = ???

Unfortunately, it’s not easy to calculate the sample quality; in fact, it’s never possible to calculate this value from a sample alone. However, it is very helpful to reason about the values this term could take on and how they affect our expected error.
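To get a feel for the scale, here’s a quick sketch (in Python; the post’s own simulation code is in R) that plugs the pizza example’s numbers into the identity for a few hypothetical values of \rho_{R,p}:

```python
import math

# Pizza example: county of 1M people, survey of 500, true rate p = 0.2
N, n, p = 1_000_000, 500, 0.2

difficulty = math.sqrt(p * (1 - p))   # problem difficulty, stddev(p) = 0.4
quantity = math.sqrt((N - n) / N)     # data quantity, ~0.9997

# We can never observe rho from the sample itself, but we can see how
# the expected error scales with hypothetical values of it
for rho in (0.001, 0.01, 0.05):
    print(f"rho = {rho}: expected error = {difficulty * quantity * rho:.4f}")
```

Even a sample-quality term as small as 0.01 produces an expected error of about 0.4 percentage points, and notice that nothing in that product shrinks just because the sample gets bigger.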

The actual value of \rho_{R,p} is well defined, but it’s not easily interpretable. It can be calculated as:

\rho_{R,p} = \Delta_R \sqrt{\frac{p(1-p)}{f(1-f)}} where f = \frac{n}{N}

However, for the purposes of building an intuition, it’s easier to think about how different values of \Delta_R will scale our expected error up and down. \Delta_R is a measure of the sampling bias; that is, it tells us how much more or less likely one group is to be sampled than the other. If the sample were truly random, neither group would be more or less likely to be sampled, and so our estimate of the proportion in the sample would approach the true proportion in the population. However, if our sampling method is biased, \Delta_R \ne 0, and our estimate will not accurately generalize to the broader population. In technical terms, you could say that \Delta_R is the relative difference in response likelihood conditional on the value of the outcome we’re seeking to measure.

So, if our sample is truly random, then people who like Hawaiian-pizza shouldn’t be more likely to participate in our survey — they should participate in proportion to how prevalent they are in the population. However, if our sample isn’t truly random, if it’s biased, then we might over-sample from the Hawaiian-pizza lovers and our estimate of the true underlying rate in the population won’t be accurate. That is, a biased sample will have \Delta_R \ne 0 and as the magnitude of \Delta_R increases, our estimate will become more and more biased. What we’ll see is that even very small values of \Delta_R can have meaningful impacts on the accuracy of our estimate, no matter how big our sample size is.

Below I’ve included a table from a simulation where we imagine a total population of 1,000,000 people, some of whom prefer Hawaiian pizza (again, inexplicably) and some of whom prefer other types of pizza. In this simulation we have two sampling methods: a simple random sample (SRS) and a biased sample where Hawaiian-pizza lovers are slightly more likely to be present in the sample. Here we’re using \Delta_R = -0.1\%, so the net difference in the probability of response between Hawaiian-pizza lovers and non-lovers is 0.1 percentage points.

If we sample 10,000 people (1% of the population), we expect to draw Hawaiian-pizza lovers at a rate of 1.08% and non-Hawaiian-pizza lovers at a rate of 0.98%, which doesn’t sound like that big of a difference. That is, the total difference in likelihood of being sampled is 0.1 percentage points, so in a sample of 1% of the population (10k / 1M) we expect to draw 1.08% of Hawaiian-pizza lovers and 0.98% of non-Hawaiian-pizza lovers (instead of an even 1% from both). (The arithmetic behind 0.98% and 1.08% is a little bit tedious; the formula looks like `resp_p - (1 - p_x) * resp_p * R`.)
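Spelling that arithmetic out (a Python sketch; the footnote only gives the lover-side formula, so the non-lover adjustment here is my inference, chosen so that the overall response rate stays at the 1% base):

```python
resp_p = 0.01   # base response rate: sampling 1% of the population
p_x = 0.2       # true share of Hawaiian-pizza lovers
R = -0.1        # relative sampling-bias knob from the footnote's formula

lover_rate = resp_p - (1 - p_x) * resp_p * R   # 1.08% for lovers
non_lover_rate = resp_p + p_x * resp_p * R     # 0.98% for non-lovers

# The weighted average stays at the base 1% response rate
overall = p_x * lover_rate + (1 - p_x) * non_lover_rate
print(lover_rate, non_lover_rate, overall)
```

The gap between the two rates is 0.1 percentage points, which is the \Delta_R of 0.1% quoted above.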

Sample Size   RMSE (Biased)   RMSE (SRS)   MAPE (Biased)   MAPE (SRS)
50            6.48%           5.61%        26.20%          22.60%
100           4.33%           3.90%        16.90%          15.50%
1,000         2.05%           1.37%         8.64%           5.35%
10,000        1.66%           0.38%         8.11%           1.49%
100,000       1.61%           0.12%         8.03%           0.50%
500,000       1.60%           0.04%         8.00%           0.16%

What we see is that the two samples have similar error rates across both measures until our sample size gets to be about 10,000 people (1% of the population). At that point, the simple random sample continues improving in accuracy (the error rates keep going down) but the biased sample does not. In fact, both the RMSE and the MAPE start to plateau at a sample size of 10,000, so beyond that, increasing the sample size doesn’t improve accuracy at all!
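For the curious, here’s a minimal Python re-sketch of this simulation (the original is an R script linked at the bottom of the post; I’ve shrunk the population to 100,000 to keep it fast, and the biased-sampling mechanics follow my reading of the footnote’s formula, so treat this as illustrative rather than an exact reproduction):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, R = 100_000, 0.2, -0.1   # population size, true rate, relative bias knob
likes = rng.random(N) < p      # each person's "true" pizza preference

def rmse(n, biased, reps=100):
    """RMSE of the estimated Hawaiian-pizza share over repeated samples."""
    base = n / N
    if biased:
        # Lovers respond at ~1.08x the base rate, non-lovers at ~0.98x
        probs = np.where(likes, base - (1 - p) * base * R, base + p * base * R)
    else:
        probs = np.full(N, base)   # simple random sample
    ests = np.array([likes[rng.random(N) < probs].mean() for _ in range(reps)])
    return float(np.sqrt(np.mean((ests - p) ** 2)))

for n in (1_000, 10_000, 50_000):
    print(n, round(rmse(n, biased=True), 4), round(rmse(n, biased=False), 4))
```

Even with the smaller population the same pattern emerges: the SRS error keeps shrinking as n grows, while the biased sample’s error flattens out at the bias floor.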

And this is where things can get especially confusing for non-statisticians attempting to interpret these data: they might see an n-size of 500,000 (half the population!) and think that it’s clearly better than a simple random sample of only 1,000, but they’d be wrong! The simple random sample of 1,000 people is more accurate than the very-slightly-biased sample of 500,000!

Even worse, if you naively use statistical methods that assume a simple random sample, you’ll be led doubly astray. If we calculate the confidence interval for the sample of 500,000 as if it were truly random, you’d estimate a confidence interval that’s way too small. With a sample size of 500,000, our average estimate for the proportion of Hawaiian-pizza lovers in our simulation is 21.6%.

Our formula for a 95% confidence interval is:

p \pm 1.96\sqrt{\frac{p(1-p)}{n}}

which in this case yields a confidence interval from 0.2149 to 0.2171, a very tight confidence interval! It doesn’t even come close to including the true value of 0.2! This is a real problem: if practitioners aren’t thinking critically about how their data were sampled (and therefore whether the statistical techniques they’re using can be applied), they can end up being very confident and also very wrong! In fact, the more data they have, the more confident they’ll be in a number that is in fact leading them astray!
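A quick check of that arithmetic (a Python sketch using the usual normal-approximation interval; the 21.6% point estimate is the one from the biased simulation above):

```python
import math

p_hat, n, p_true = 0.216, 500_000, 0.2   # biased-sample estimate vs the truth

se = math.sqrt(p_hat * (1 - p_hat) / n)          # naive standard error
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se    # naive 95% interval

print(f"({lo:.4f}, {hi:.4f})  covers truth: {lo <= p_true <= hi}")
```

The naive interval is only about a fifth of a percentage point wide and sits entirely above the true value of 0.2.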

So the moral of the story here is that simple random sampling is incredibly powerful and does a lot of work for us! In this brave new world where we have access to lots of administrative data that only partially represents the population of interest, we have to be very careful with the tools we use to extrapolate from those data sets, because even small amounts of non-randomness in the sampling technique (\Delta_R \ne 0) can cause relatively large errors in our estimates while also generating a false sense of confidence in the data!

Implications for Analysts and Researchers

This point cannot be emphasized enough: you can never calculate how biased your sample actually is — it is an unknowable quantity in a real research setting. In the analysis included above I simulate data that has a real bias in order to see how the results change with different sampling sizes, but unfortunately as researchers we can never know what that value is.

There are a few things every analyst should keep in mind when performing any analysis:

  • Is my sample potentially biased? What are all the ways it might be biased?
  • If it is biased, what magnitude of bias might we expect to encounter?
  • How does that magnitude of bias affect the interpretation of our results?
  • Is that value reflected in the confidence intervals we’re using?

It might be the case that for the analysis you’re doing, you expect the magnitude of the bias to be relatively small, and so for the purposes of a “quick and dirty” analysis using biased data is fine (maybe having 2% error in your analysis of pizza preferences is just fine). However, because the magnitude of that sampling bias cannot be calculated from the sample you collected, the bias in your sample could in fact be very large and you wouldn’t know about it.

We’ve seen this problem play out routinely in some of the research that has been coming out in the midst of the COVID-19 crisis — researchers, in the interest of moving quickly (which I generally applaud!), are making the mistake of generalizing from non-random samples into population-wide metrics which are simply not valid. The statistics that the researchers use to estimate confidence intervals assume a truly random sample from the population, and their reported confidence bounds are much too narrow — as we saw in the example above, the more data you collect in a biased manner, the narrower your naive confidence bands will be, and the more likely it will be that they don’t include the true value.

Collecting a truly random sample is very hard! Pollsters have gotten very good at leveraging small samples of the US population to get good reads on how candidates will perform in national elections, but no one knows how to do the same thing for a virus like COVID-19! We don’t know which demographic characteristics matter, so we don’t know how to weight our samples appropriately to generalize from relatively small samples up to the population as a whole.

It behooves us, as consumers of research as well as researchers ourselves, to keep this problem in mind as results are presented: we should assume that every sample collected is biased and cannot be generalized to the entire population of the US without very large error bars. When analyzing these data, we should be asking ourselves “what population does this generalize to, and is that the population we care about?”

Summing Up

Hopefully, y’all found this framework for thinking about how to evaluate the expected error rate from a given data source helpful — remember that it’s these three factors that explain the error for a given analysis:

  1. Problem difficulty: stddev(p)
  2. Data quantity: \sqrt{\frac{N-n}{N}}
  3. Sample quality: \rho_{R,p} (driven by \Delta_R)

Hopefully, this framework helps you think critically about the work you’re doing and also helps you to explain potential biases to your stakeholders. If you have questions or thoughts, you can always reach me at michael@kaminsky.rocks


Edits:

Thanks to some extremely sharp-eyed error-catching from Ludmila Exbrayat, I made some edits to the section describing the difference in sampling rates between the two populations. I updated the blog so that the example difference in sampling rates correctly shows 0.98% and 1.08% instead of 0.95% and 1.05% as it showed before. I also updated the code in the gist and the results in the table to reflect this change — none of these changes altered the results or the interpretation.


This post is powered by QuickLatex — check it out for your WordPress + \LaTeX needs

R script for the simulation can be found here: github gist

Notes on GAMs in R with a binary dependent variable

A few weeks ago I was working on a project using GAMs to estimate the gradients of the marginal effects of the likelihood of a customer to convert. This was a very cool problem, and I learned a lot, but it took me quite a bit of time to figure out how to get the built-in functions to generate outputs that are interpretable in a regression with a binary dependent variable (e.g., logistic-style regression). Below I’ve included a slightly-cleaned-up version of my notes from my research in case they’re helpful to any of y’all!

If you’re not sure what GAMs are, I’d check out these resources for a good overview:


 

Model Validation

Use gam.check(). There are a few big things to check:

  • Does the model converge at all? If not, your results aren’t valid.
  • Check the p-values for the smooths at the bottom; if any of these are significant you should increase the basis dimension (k= in the model specification). Remember that k is a ceiling on the smooth’s flexibility, so erring too high won’t hurt your results, but erring too low will!
  • If you’re using a binary model, the QQ plot and other automatically generated plots aren’t useful. However, you can use DHARMa to get interpretable fit checks

Logistic GAMs

  • Use the `family = “binomial”` option to estimate a logistic-style GAM model
  • Like a logistic regression, the output is on the log scale — we can convert to the probability scale using the plogis() function
    • When plotting, we can convert our output to the probability scale by using the trans argument. The trans argument takes a function that transforms the output, so we can pass the plogis() logistic function to this argument and all values in plots will be transformed from log-odds to probabilities: plot(binom_mod, pages = 1, trans = plogis) (Source)
  • To get appropriately-scaled marginal effects plots, we can use the following specification: plot(binom_mod, pages = 1, trans = plogis, shift = coef(binom_mod)[1], seWithMean = TRUE)
    • Now our partial effect plots have much more intuitive interpretation. Each partial effect plot can be interpreted as showing the probability of the outcome if all other variables were at their average value. At their own average value, you get only the effect of the intercept. (Source)
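For readers not working in R: plogis() is just the inverse-logit (logistic) function, so the transformation being applied here is simply (a Python sketch):

```python
import math

def plogis(x):
    """Inverse logit: maps log-odds to a probability, like R's plogis()."""
    return 1 / (1 + math.exp(-x))

print(plogis(0.0))           # log-odds of 0 is a probability of 0.5
print(plogis(math.log(3)))   # log-odds of ln(3), i.e. 3:1 odds, is 0.75
```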

The different GAM smoothing options

You can select s(), te(), or ti() for a smooth. te() and ti() work like interaction terms, but all three are crucially different:

  • s() can smooth interactions of variables that are on the same scale. It’s commonly used for something like latitude and longitude. (Sometimes the documentation refers to these variables as being “isotropic”)
  • te() works with variables that are not on the same scale (they’re “anisotropic”)
  • ti() is like te(), but it models only the interaction itself, excluding the main effects of each variable
  • You have to choose the number of basis functions to use for each smooth, using the k argument of s() or te(). (Source)
    • Given the penalization process, the exact choice of k isn’t too big of a deal, and can be seen as an upper limit to the flexibility of the term. The actual flexibility is determined via the penalization process. (Source)

Plotting interaction effects

  • In mgcv’s plot() command, interactions are represented with contour plots. A single plot represents both variables and their interaction. In this plot the axes represent values of our predictor variables, x1 and x2. The interior is a topographic map of predicted values. The contour lines represent points of equal predicted values, and they are labeled. The dotted lines show uncertainty in prediction; they represent how contour lines would move if predictions were one standard error higher or lower. (Source)
  • The plots generated by mgcv’s plot() function are partial effect plots. That is, they show the component effect of each of the smooth or linear terms in the model, which add up to the overall prediction. (Source)
  • It’s often useful to plot the standard errors of a partial effect term combined with the standard errors of the model intercept. This is because confidence intervals at the mean value of a variable can be very tiny, and don’t reflect overall uncertainty in our model. Using the seWithMean argument adds in this uncertainty. (Source)
    • To make the plots even more interpretable, it’s useful to shift the scale so that the intercept is included. Using the shift argument, we can shift the scale by value of the intercept, which is the first coefficient of the model. Note how the y-axis has changed. Now, the partial effect plot has a more natural interpretation – it shows us the prediction of the output, assuming other variables are at their average value. For instance, this plot shows that the miles per gallon of a 2000 pound car is about 30, all else being equal. (Source)
