This HTML version of is provided for convenience, but it is not the best format for the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version. Chapter 10 Approximate Bayesian Computation10.1 The Variability HypothesisI have a soft spot for crank science. Recently I visited Norumbega Tower, which is an enduring monument to the crackpot theories of Eben Norton Horsford, inventor of doubleacting baking powder and fake history. But that’s not what this chapter is about. This chapter is about the Variability Hypothesis, which "originated in the early nineteenth century with Johann Meckel, who argued that males have a greater range of ability than females, especially in intelligence. In other words, he believed that most geniuses and most mentally retarded people are men. Because he considered males to be the ’superior animal,’ Meckel concluded that females’ lack of variation was a sign of inferiority." I particularly like that last part, because I suspect that if it turns out that women are actually more variable, Meckel would take that as a sign of inferiority, too. Anyway, you will not be surprised to hear that the evidence for the Variability Hypothesis is weak. Nevertheless, it came up in my class recently when we looked at data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS), specifically the selfreported heights of adult American men and women. The dataset includes responses from 154407 men and 254722 women. Here’s what we found:
That’s very close, so we could conclude that this dataset provides weak evidence against the Variability Hypothesis. But we can use Bayesian methods to make that conclusion more precise. And answering this question gives me a chance to demonstrate some techniques for working with large datasets. I will proceed in a few steps:
You can download the code in this chapter from http://thinkbayes.com/variability.py. For more information see Section 0.3. 10.2 Mean and standard deviationIn Chapter 9 we estimated two parameters simultaneously using a joint distribution. In this chapter we use the same method to estimate the parameters of a Gaussian distribution: the mean, mu, and the standard deviation, sigma. For this problem, I define a Suite called Height that represents a map from each mu, sigma pair to its probability: class Height(thinkbayes.Suite, thinkbayes.Joint): def __init__(self, mus, sigmas): thinkbayes.Suite.__init__(self) pairs = [(mu, sigma) for mu in mus for sigma in sigmas] thinkbayes.Suite.__init__(self, pairs) mus is a sequence of possible values for mu; sigmas is a sequence of values for sigma. The prior distribution is uniform over all mu, sigma pairs. The likelihood function is easy. Given hypothetical values of mu and sigma, we compute the likelihood of a particular value, x. That’s what EvalGaussianPdf does, so all we have to do is use it: # class Height def Likelihood(self, data, hypo): x = data mu, sigma = hypo like = thinkbayes.EvalGaussianPdf(x, mu, sigma) return like If you have studied statistics from a mathematical perspective, you know that when you evaluate a PDF, you get a probability density. In order to get a probability, you have to integrate probability densities over some range. But for our purposes, we don’t need a probability; we just need something proportional to the probability we want. A probability density does that job nicely. The hardest part of this problem turns out to be choosing appropriate ranges for mus and sigmas. If the range is too small, we omit some possibilities with nonnegligible probability and get the wrong answer. If the range is too big, we get the right answer, but waste computational power. So this is an opportunity to use classical estimation to make Bayesian techniques more efficient. Specifically, we can use classical estimators to find a likely location for mu and sigma, and use the standard errors of those estimates to choose a likely spread. If the true parameters of the distribution are µ and σ, and we take a sample of n values, an estimator of µ is the sample mean, m. And an estimator of σ is the sample standard variance, s. The standard error of the estimated µ is s / √n and the standard error of the estimated σ is s / √2 (n−1). Here’s the code to compute all that: def FindPriorRanges(xs, num_points, num_stderrs=3.0): # compute m and s n = len(xs) m = numpy.mean(xs) s = numpy.std(xs) # compute ranges for m and s stderr_m = s / math.sqrt(n) mus = MakeRange(m, stderr_m, num_stderrs) stderr_s = s / math.sqrt(2 * (n1)) sigmas = MakeRange(s, stderr_s, num_stderrs) return mus, sigmas xs is the dataset. The return value is a pair of sequences, mus and sigmas. def MakeRange(estimate, stderr, num_stderrs): spread = stderr * num_stderrs array = numpy.linspace(estimatespread, estimate+spread, num_points) return array numpy.linspace makes an array of equally spaced elements between estimatespread and estimate+spread, including both. 10.3 UpdateFinally here’s the code to make and update the suite: mus, sigmas = FindPriorRanges(xs, num_points) suite = Height(mus, sigmas) suite.UpdateSet(xs) print suite.MaximumLikelihood() This process might seem bogus, because we use the data to choose the range of the prior distribution, and then use the data again to do the update. In general, using the same data twice is, in fact, bogus. But in this case it is ok. Really. We use the data to choose the
range for the prior, but only to avoid computing a lot of
probabilities that would have been very small anyway. With
In effect, the prior is uniform over all values of mu and sigma, but for computational efficiency we ignore all the values that don’t matter. 10.4 The posterior distribution of CVOnce we have the posterior joint distribution of mu and sigma, we can compute the distribution of CV for men and women, and then the probability that one exceeds the other. To compute the distribution of CV, we enumerate pairs of mu and sigma: def CoefVariation(suite): pmf = thinkbayes.Pmf() for (mu, sigma), p in suite.Items(): pmf.Incr(sigma/mu, p) return pmf Then we use The analysis itself is simple, but there are two more issues we have to deal with:
The following sections explain these problems and their solutions. 10.5 UnderflowIf we select the first 100 values from the BRFSS dataset and run the analysis I just described, it runs without errors and we get posterior distributions that look reasonable. If we select the first 1000 values and run the program again, we get
an error in ValueError: total probability is zero. The problem is that we are using probability densities to compute likelihoods, and densities from continuous distributions tend to be small. And if you take 1000 small values and multiply them together, the result is very small. In this case it is so small it can’t be represented by a floatingpoint number, so it gets rounded down to zero, which is called underflow. And if all probabilities in the distribution are 0, it’s not a distribution any more. A possible solution is to renormalize the Pmf after each update, or after each batch of 100. That would work, but it would be slow. A better alternative is to compute likelihoods under a log transform. That way, instead of multiplying small values, we can add up log likelihoods. Pmf provides methods Log, LogUpdateSet and Exp to make this process easy. Log computes the log of the probabilities in a Pmf: # class Pmf def Log(self): m = self.MaxLike() for x, p in self.d.iteritems(): if p: self.Set(x, math.log(p/m)) else: self.Remove(x) Before applying the log transform Log uses MaxLike to find m, the highest probability in the Pmf. It divide all probabilities by m, so the highest probability gets normalized to 1, which yields a log of 0. The other log probabilities are all negative. If there are any values in the Pmf with probability 0, they are removed. While the Pmf is under a log transform, we can’t use Update, UpdateSet, or Normalize. The result would be nonsensical; if you try, Pmf raises an exception. Instead, we have to use LogUpdate and LogUpdateSet. Here’s the implementation of LogUpdateSet: # class Suite def LogUpdateSet(self, dataset): for data in dataset: self.LogUpdate(data) LogUpdateSet loops through the data and calls LogUpdate: # class Suite def LogUpdate(self, data): for hypo in self.Values(): like = self.LogLikelihood(data, hypo) self.Incr(hypo, like) LogUpdate is just like Update except that it calls LogLikelihood instead of Likelihood, and Incr instead of Mult. Using loglikelihoods avoids the problem with underflow, but while the Pmf is under the log transform, there’s not much we can do with it. We have to use Exp to invert the transform: # class Pmf def Exp(self): m = self.MaxLike() for x, p in self.d.iteritems(): self.Set(x, math.exp(pm)) If the loglikelihoods are large negative numbers, the resulting likelihoods might underflow. So Exp finds the maximum loglikelihood, m, and shifts all the likelihoods up by m. The resulting distribution has a maximum likelihood of 1. This process inverts the log transform with minimal loss of precision. 10.6 LoglikelihoodNow all we need is LogLikelihood. # class Height def LogLikelihood(self, data, hypo): x = data mu, sigma = hypo loglike = scipy.stats.norm.logpdf(x, mu, sigma) return loglike norm.logpdf computes the loglikelihood of the Gaussian PDF. Here’s what the whole update process looks like: suite.Log() suite.LogUpdateSet(xs) suite.Exp() suite.Normalize() To review, Log puts the suite under a log transform. LogUpdateSet calls LogUpdate, which calls LogLikelihood. LogUpdate uses Pmf.Incr, because adding a loglikelihood is the same as multiplying by a likelihood. After the update, the loglikelihoods are large negative numbers, so Exp shifts them up before inverting the transform, which is how we avoid underflow. Once the suite is transformed back, the probabilities are “linear” again, which means “not logarithmic”, so we can use Normalize again. Using this algorithm, we can process the entire dataset without underflow, but it is still slow. On my computer it might take an hour. We can do better. 10.7 A little optimizationThis section uses math and computational optimization to speed things up by a factor of 100. But the following section presents an algorithm that is even faster. So if you want to get right to the good stuff, feel free to skip this section. Suite.LogUpdateSet calls LogUpdate once for each data point. We can speed it up by computing the loglikelihood of the entire dataset at once. We’ll start with the Gaussian PDF:
and compute the log (dropping the constant term):
Given a sequence of values, x_{i}, the total loglikelihood is
Pulling out the terms that don’t depend on i, we get
which we can translate into Python: # class Height def LogUpdateSetFast(self, data): xs = tuple(data) n = len(xs) for hypo in self.Values(): mu, sigma = hypo total = Summation(xs, mu) loglike = n * math.log(sigma)  total / 2 / sigma**2 self.Incr(hypo, loglike) By itself, this would be a small improvement, but it creates an opportunity for a bigger one. Notice that the summation only depends on mu, not sigma, so we only have to compute it once for each value of mu. To avoid recomputing, I factor out a function that computes the summation, and memoize it so it stores previously computed results in a dictionary (see http://en.wikipedia.org/wiki/Memoization): def Summation(xs, mu, cache={}): try: return cache[xs, mu] except KeyError: ds = [(xmu)**2 for x in xs] total = sum(ds) cache[xs, mu] = total return total cache stores previously computed sums. The try statement returns a result from the cache if possible; otherwise it computes the summation, then caches and returns the result. The only catch is that we can’t use a list as a key in the cache, because it is not a hashable type. That’s why LogUpdateSetFast converts the dataset to a tuple. This optimization speeds up the computation by about a factor of 100, processing the entire dataset (154 407 men and 254 722 women) in less than a minute on my notveryfast computer. 10.8 ABCBut maybe you don’t have that kind of time. In that case, Approximate Bayesian Computation (ABC) might be the way to go. The motivation behind ABC is that the likelihood of any particular dataset is:
We don’t really care about the likelihood of seeing the exact dataset we saw. Especially for continuous variables, we care about the likelihood of seeing any dataset like the one we saw. For example, in the Euro problem, we don’t care about the order of the coin flips, only the total number of heads and tails. And in the locomotive problem, we don’t care about which particular trains were seen, only the number of trains and the maximum of the serial numbers. Similarly, in the BRFSS sample, we don’t really want to know the probability of seeing one particular set of values (especially since there are hundreds of thousands of them). It is more relevant to ask, “If we sample 100,000 people from a population with hypothetical values of µ and σ, what would be the chance of collecting a sample with the observed mean and variance?” For samples from a Gaussian distribution, we can answer this question efficiently because we can find the distribution of the sample statistics analytically. In fact, we already did it when we computed the range of the prior. If you draw n values from a Gaussian distribution with parameters µ and σ, and compute the sample mean, m, the distribution of m is Gaussian with parameters µ and σ / √n. Similarly, the distribution of the sample standard deviation, s, is Gaussian with parameters σ and σ / √2 (n−1).
We can use these sample distributions to compute the likelihood of the
sample statistics, m and s, given hypothetical values
for µ and σ. Here’s a new version of def LogUpdateSetABC(self, data): xs = data n = len(xs) # compute sample statistics m = numpy.mean(xs) s = numpy.std(xs) for hypo in sorted(self.Values()): mu, sigma = hypo # compute log likelihood of m, given hypo stderr_m = sigma / math.sqrt(n) loglike = EvalGaussianLogPdf(m, mu, stderr_m) #compute log likelihood of s, given hypo stderr_s = sigma / math.sqrt(2 * (n1)) loglike += EvalGaussianLogPdf(s, sigma, stderr_s) self.Incr(hypo, loglike) On my computer this function processes the entire dataset in about a second, and the result agrees with the exact result with about 5 digits of precision. 10.9 Robust estimationWe are almost ready to look at results, but we have one more problem to deal with. There are a number of outliers in this dataset that are almost certainly errors. For example, there are three adults with reported height of 61 cm, which would place them among the shortest living adults in the world. At the other end, there are four women with reported height 229 cm, just short of the tallest women in the world. It is not impossible that these values are correct, but it is unlikely, which makes it hard to know how to deal with them. And we have to get it right, because these extreme values have a disproportionate effect on the estimated variability. Because ABC is based on summary statistics, rather than the entire dataset, we can make it more robust by choosing summary statistics that are robust in the presence of outliers. For example, rather than use the sample mean and standard deviation, we could use the median and interquartile range (IQR), which is the difference between the 25th and 75th percentiles. More generally, we could compute an interpercentile range (IPR) that spans any given fraction of the distribution, p: def MedianIPR(xs, p): cdf = thinkbayes.MakeCdfFromList(xs) median = cdf.Percentile(50) alpha = (1p) / 2 ipr = cdf.Value(1alpha)  cdf.Value(alpha) return median, ipr xs is a sequence of values. p is the desired range; for example, p=0.5 yields the interquartile range. MedianIPR works by computing the CDF of xs, then extracting the median and the difference between two percentiles. We can convert from ipr to an estimate of sigma using the Gaussian CDF to compute the fraction of the distribution covered by a given number of standard deviations. For example, it is a wellknown rule of thumb that 68% of a Gaussian distribution falls within one standard deviation of the mean, which leaves 16% in each tail. If we compute the range between the 16th and 84th percentiles, we expect the result to be 2 * sigma. So we can estimate sigma by computing the 68% IPR and dividing by 2. More generally we could use any number of sigmas. MedianS performs the more general version of this computation: def MedianS(xs, num_sigmas): half_p = thinkbayes.StandardGaussianCdf(num_sigmas)  0.5 median, ipr = MedianIPR(xs, half_p * 2) s = ipr / 2 / num_sigmas return median, s Again, xs is the sequence of values; Finally, in LogUpdateSetABC we can replace the sample mean and standard deviation with median and s. And that pretty much does it. It might seem odd that we are using observed percentiles to estimate µ and σ, but it is an example of the flexibility of the Bayesian approach. In effect we are asking, “Given hypothetical values for µ and σ, and a sampling process that has some chance of introducing errors, what is the likelihood of generating a given set of sample statistics?” We are free to choose any sample statistics we like, up to a point: µ and σ determine the location and spread of a distribution, so we need to choose statistics that capture those characteristics. For example, if we chose the 49th and 51st percentiles, we would get very little information about spread, so it would leave the estimate of σ relatively unconstrained by the data. All values of sigma would have nearly the same likelihood of producing the observed values, so the posterior distribution of sigma would look a lot like the prior. 10.10 Who is more variable?Finally we are ready to answer the question we started with: is the coefficient of variation greater for men than for women? Using ABC based on the median and IPR with For each joint distribution, I computed the posterior distribution of CV. Figure 10.3 shows these distributions for men and women. The mean for men is 0.0410; for women it is 0.0429. Since there is no overlap between the distributions, we conclude with near certainty that women are more variable in height than men. So is that the end of the Variability Hypothesis? Sadly, no. It turns
out that this
result depends on the choice of the
interpercentile range. With The reason for the difference is that there are more men of short stature, and their distance from the mean is greater. So our evaluation of the Variability Hypothesis depends on the
interpretation of “variability.” With To decide which emphasis is appropriate, we would need a more precise statement of the hypothesis. As it is, the Variability Hypothesis may be too vague to evaluate. Nevertheless, it helped me demonstrate several new ideas and, I hope you agree, it makes an interesting example. 10.11 DiscussionThere are two ways you might think of ABC. One interpretation is that it is, as the name suggests, an approximation that is faster to compute than the exact value. But remember that Bayesian analysis is always based on modeling decisions, which implies that there is no “exact” solution. For any interesting physical system there are many possible models, and each model yields different results. To interpret the results, we have to evaluate the models. So another interpretation of ABC is that it represents an alternative model of the likelihood. When we compute p(DH), we are asking “What is the likelihood of the data under a given hypothesis?” For large datasets, the likelihood of the data is very small, which is a hint that we might not be asking the right question. What we really want to know is the likelihood of any outcome like the data, where the definition of “like” is yet another modeling decision. The underlying idea of ABC is that two datasets are alike if they yield the same summary statistics. But in some cases, like the example in this chapter, it is not obvious which summary statistics to choose. You can download the code in this chapter from http://thinkbayes.com/variability.py. For more information see Section 0.3. 10.12 ExercisesExercise 1 An “effect size” is a statistic intended to measure the difference between two groups (see http://en.wikipedia.org/wiki/Effect_size). For example, we could use data from the BRFSS to estimate the difference in height between men and women. By sampling values from the posterior distributions of µ and σ, we could generate the posterior distribution of this difference. But it might be better to use a dimensionless measure of effect size, rather than a difference measured in cm. One option is to use divide through by the standard deviation (similar to what we did with the coefficient of variation). If the parameters for Group 1 are (µ_{1}, σ_{1}), and the parameters for Group 2 are (µ_{2}, σ_{2}), the dimensionless effect size is
Write a function that takes joint distributions of mu and sigma for two groups and returns the posterior distribution of effect size. Hint: if enumerating all pairs from the two distributions takes too long, consider random sampling. 
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.
