This HTML version of is provided for convenience, but it is not the best format for the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version, or you can buy a hardcopy here. Chapter 8 Estimation8.1 The estimation gameLet’s play a game. I’ll think of a distribution, and you have to guess what it is. We’ll start out easy and work our way up. I’m thinking of a distribution. I’ll give you two hints; it’s a normal distribution, and here’s a random sample drawn from it: {−0.441, 1.774, −0.101, −1.138, 2.975, −2.138} What do you think is the mean parameter, µ, of this distribution? One choice is to use the sample mean to estimate µ. Up until now we have used the symbol µ for both the sample mean and the mean parameter, but now to distinguish them I will use x for the sample mean. In this example, x is 0.155, so it would be reasonable to guess µ = 0.155. This process is called estimation, and the statistic we used (the sample mean) is called an estimator. Using the sample mean to estimate µ is so obvious that it is hard to imagine a reasonable alternative. But suppose we change the game by introducing outliers. I’m thinking of a distribution. It’s a normal distribution, and here’s a sample that was collected by an unreliable surveyor who occasionally puts the decimal point in the wrong place. {−0.441, 1.774, −0.101, −1.138, 2.975, −213.8} Now what’s your estimate of µ? If you use the sample mean your guess is −35.12. Is that the best choice? What are the alternatives? One option is to identify and discard outliers, then compute the sample mean of the rest. Another option is to use the median as an estimator. Which estimator is the best depends on the circumstances (for example, whether there are outliers) and on what the goal is. Are you trying to minimize errors, or maximize your chance of getting the right answer? If there are no outliers, the sample mean minimizes the mean squared error (MSE). If we play the game many times, and each time compute the error x − µ, the sample mean minimizes
Where m is the number of times you play the estimation game (not to be confused with n, which is the size of the sample used to compute x). Minimizing MSE is a nice property, but it’s not always the best strategy. For example, suppose we are estimating the distribution of wind speeds at a building site. If we guess too high, we might overbuild the structure, increasing its cost. But if we guess too low, the building might collapse. Because cost as a function of error is asymmetric, minimizing MSE is not the best strategy. As another example, suppose I roll three sixsided dice and ask you to predict the total. If you get it exactly right, you get a prize; otherwise you get nothing. In this case the value that minimizes MSE is 10.5, but that would be a terrible guess. For this game, you want an estimator that has the highest chance of being right, which is a maximum likelihood estimator (MLE). If you pick 10 or 11, your chance of winning is 1 in 8, and that’s the best you can do. Exercise 1
Write a function that draws 6 values from a normal distribution with
µ = 0 and σ = 1. Use the sample mean to estimate µ and
compute the error x − µ. Run the function 1000 times and
compute MSE.
Now modify the program to use the median as an estimator. Compute MSE again and compare to the MSE for x. 8.2 Guess the varianceI’m thinking of a distribution. It’s a normal distribution, and here’s a (familiar) sample: {−0.441, 1.774, −0.101, −1.138, 2.975, −2.138} What do you think is the variance, σ^{2}, of my distribution? Again, the obvious choice is to use the sample variance as an estimator. I will use S^{2}to denote the sample variance, to distinguish from the unknown parameter σ^{2}.
For large samples, S^{2}is an adequate estimator, but for small samples it tends to be too low. Because of this unfortunate property, it is called a biased estimator. An estimator is unbiased if the expected total (or mean) error, after many iterations of the estimation game, is 0. Fortunately, there is another simple statistic that is an unbiased estimator of σ^{2}:
The biggest problem with this estimator is that its name and symbol are used inconsistently. The name “sample variance” can refer to either S^{2}or S_{n−1}^{2}, and the symbol S^{2}is used for either or both. For an explanation of why S^{2}is biased, and a proof that S_{n−1}^{2}is unbiased, see http://wikipedia.org/wiki/Bias_of_an_estimator. Exercise 2
Write a function that draws 6 values from a normal distribution with
µ = 0 and σ = 1. Use the sample variance to estimate
σ^{2} and compute the error S^{2} − σ^{2}.
Run the function 1000 times and compute mean error (not squared).
Now modify the program to use the unbiased estimator S_{n−1}^{2}. Compute the mean error again and see if it converges to zero as you increase the number of games. 8.3 Understanding errorsBefore we go on, let’s clear up a common source of confusion. Properties like MSE and bias are longterm expectations based on many iterations of the estimation game. While you are playing the game, you don’t know the errors. That is, if I give you a sample and ask you to estimate a parameter, you can compute the value of the estimator, but you can’t compute the error. If you could, you wouldn’t need the estimator! The reason we talk about estimation error is to describe the behavior of different estimators in the long run. In this chapter we run experiments to examine those behaviors; these experiments are artificial in the sense that we know the actual values of the parameters, so we can compute errors. But when you work with real data, you don’t, so you can’t. Now let’s get back to the game. 8.4 Exponential distributionsI’m thinking of a distribution. It’s an exponential distribution, and here’s a sample: {5.384, 4.493, 19.198, 2.790, 6.122, 12.844} What do you think is the parameter, λ, of this distribution? In general, the mean of an exponential distribution is 1/λ, so working backwards, we might choose λ = 1 / x It is common to use hat notation for estimators, so λ is an estimator of λ. And not just any estimator; it is also the MLE estimator^{1}. So if you want to maximize your chance of guessing λ exactly, λ is the way to go. But we know that x is not robust in the presence of outliers, so we expect λ to have the same problem. Maybe we can find an alternative based on the sample median. Remember that the median of an exponential distribution is ln(2) / λ, so working backwards again, we can define an estimator
where µ_{1/2} is the sample median. Exercise 3
Run an experiment to see which of λ and λ_{1/2} yields
lower MSE. Test whether either of them is biased.
8.5 Confidence intervalsSo far we have looked at estimators that generate single values, known as point estimates. For many problems, we might prefer an interval that specifies an upper and lower bound on the unknown parameter. Or, more generally, we might want that whole distribution; that is, the range of values the parameter could have, and for each value in the range, a notion of how likely it is. Let’s start with confidence intervals. I’m thinking of a distribution. It’s an exponential distribution, and here’s a sample: {5.384, 4.493, 19.198, 2.790, 6.122, 12.844} I want you to give me a range of values that you think is likely to contain the unknown parameter λ. More specifically, I want a 90% confidence interval, which means that if we play this game over and over, your interval will contain λ 90% of the time. It turns out that this version of the game is hard, so I’m going to tell you the answer, and all you have to do is test it. Confidence intervals are usually described in terms of the miss rate, α, so a 90% confidence interval has miss rate α = 0.1. The confidence interval for the λ parameter of an exponential distribution is
where n is the sample size, λ is the meanbased estimator from the previous section, and χ^{2}(k, x) is the CDF of a chisquared distribution with k degrees of freedom, evaluated at x (see http://wikipedia.org/wiki/Chisquare_distribution). In general, confidence intervals are hard to compute analytically, but relatively easy to estimate using simulation. But first we need to talk about Bayesian estimation. 8.6 Bayesian estimationIf you collect a sample and compute a 90% confidence interval, it is tempting to say that the true value of the parameter has a 90% chance of falling in the interval. But from a frequentist point of view, that is not correct because the parameter is an unknown but fixed value. It is either in the interval you computed or not, so the frequentist definition of probability doesn’t apply. So let’s try a different version of the game. I’m thinking of a distribution. It’s an exponential distribution, and I chose λ from a uniform distribution between 0.5 and 1.5. Here’s a sample, which I’ll call X: {2.675, 0.198, 1.152, 0.787, 2.717, 4.269} Based on this sample, what value of λ do you think I chose? In this version of the game, λ is a random quantity, so we can reasonably talk about its distribution, and we can compute it easily using Bayes’s theorem. Here are the steps:
Given a posterior distribution, it is easy to compute a confidence interval. For example, to compute a 90% CI, you can use the 5th and 95th percentiles of the posterior. Bayesian confidence intervals are sometimes called credible intervals; for a discussion of the differences, see http://wikipedia.org/wiki/Credible_interval. 8.7 Implementing Bayesian estimationTo represent the prior distribution, we could use a Pmf, Cdf, or any other representation of a distribution, but since we want to map from a hypothesis to a probability, a Pmf is a natural choice. Each value in the Pmf represents a hypothesis; for example, the value 0.5 represents the hypothesis that λ is 0.5. In the prior distribution, all hypotheses have the same probability. So we can construct the prior like this: def MakeUniformSuite(low, high, steps): hypos = [low + (highlow) * i / (steps1.0) for i in range(steps)] pmf = Pmf.MakePmfFromList(hypos) return pmf This function makes and returns a Pmf that represents a collection of related hypotheses, called a suite. Each hypothesis has the same probability, so the distribution is uniform. The arguments low and high specify the range of values; steps is the number of hypotheses. To perform the update, we take a suite of hypotheses and a body of evidence: def Update(suite, evidence): for hypo in suite.Values(): likelihood = Likelihood(evidence, hypo) suite.Mult(hypo, likelihood) suite.Normalize() For each hypothesis in the suite, we multiply the prior probability by the likelihood of the evidence. Then we normalize the suite. In this function, suite has to be a Pmf, but evidence can be any type, as long as Likelihood knows how to interpret it. Here’s the likelihood function: def Likelihood(evidence, hypo): param = hypo likelihood = 1 for x in evidence: likelihood *= ExpoPdf(x, param) return likelihood In Likelihood we assume that evidence is a sample from an exponential distribution and compute the product in the previous section. ExpoPdf evaluates the PDF of the exponential distribution at x: def ExpoPdf(x, param): p = param * math.exp(param * x) return p Putting it all together, here’s the code that creates the prior and computes the posterior: evidence = [2.675, 0.198, 1.152, 0.787, 2.717, 4.269] prior = MakeUniformSuite(0.5, 1.5, 100) posterior = prior.Copy() Update(posterior, evidence) You can download the code in this section from http://thinkstats.com/estimate.py. When I think of Bayesian estimation, I imagine a room full of people, where each person has a different guess about whatever you are trying to estimate. So in this example they each have a guess about the correct value of λ. Initially, each person has a degree of confidence about their own hypothesis. After seeing the evidence, each person updates their confidence based on P(EH), the likelihood of the evidence, given their hypothesis. Most often the likelihood function computes a probability, which is at most 1, so initially everyone’s confidence goes down (or stays the same). But then we normalize, which increases everyone’s confidence. So the net effect is that some people get more confident, and some less, depending on the relative likelihood of their hypothesis. 8.8 Censored dataThe following problem appears in Chapter 3 of David MacKay’s Information Theory, Inference and Learning Algorithms, which you can download from http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/. Unstable particles are emitted from a source and decay at a distance x, a real number that has an exponential probability distribution with [parameter] λ. Decay events can only be observed if they occur in a window extending from x = 1 cm to x = 20 cm. n decays are observed at locations { x_{1}, ... , x_{N} }. What is λ? This is an example of an estimation problem with censored data; that is, we know that some data are systematically excluded. One of the strengths of Bayesian estimation is that it can deal with censored data with relative ease. We can use the method from the previous section with only one change: we have to replace PDF_{expo} with the conditional distribution:
for 1 < x < 20, and 0 otherwise, with
You might remember Z(λ) from Exercise 5. I told you to keep it handy. Exercise 4
Download http://thinkstats.com/estimate.py, which contains the code
from the previous section, and make a copy named decay.py.
Modify decay.py to compute the posterior distribution of λ for the sample X = {1.5, 2, 3, 4, 5, 12}. For the prior you can use a uniform distribution between 0 and 1.5 (not including 0). You can download a solution to this problem from http://thinkstats.com/decay.py. Exercise 5
In the 2008 Minnesota Senate race the final vote count was 1,212,629
votes for Al Franken and 1,212,317 votes for Norm Coleman. Franken
was declared the winner, but as Charles Seife points out in Proofiness, the margin of victory was much smaller than the margin
of error, so the result should have been considered a tie.
Assuming that there is a chance that any vote might be lost and a chance that any vote might be doublecounted, what is the probability that Coleman actually received more votes? Hint: you will have to fill in some details to model the error process. 8.9 The locomotive problemThe locomotive problem is a classic estimation problem also known as the “German tank problem.” Here is the version that appears in Mosteller, Fifty Challenging Problems in Probability: “A railroad numbers its locomotives in order 1..N. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.” Before you read the rest of this section, try to answer these questions:
For best results, you should take some time to work on these questions before you continue. For a given estimate, N, the likelihood of seeing train i is 1/N if i ≤ N, and 0 otherwise. So the MLE is N = i. In other words, if you see train 60 and you want to maximize your chance of getting the answer exactly right, you should guess that there are 60 trains. But this estimator doesn’t do very well in terms of MSE. We can do better by choosing N = ai; all we have to do is find a good value for a. Suppose that there are, in fact, N trains. Each time we play the estimation game, we see train i and guess ai, so the squared error is (ai − N)^{2}. If we play the game N times and see each train once, the mean squared error is
To minimize MSE, we take the derivative with respect to a:
And solve for a.
At first glance, that doesn’t seem very useful, because N appears on the righthand side, which suggests that we need to know N to choose a, but if we knew N, we wouldn’t need an estimator in the first place. However, for large values of N, the optimal value for a converges to 3/2, so we could choose N = 3i/ 2. To find an unbiased estimator, we can compute the mean error (ME):
And find the value of a that yields ME = 0, which turns out to be
For large values of N, a converges to 2, so we could choose N = 2i. So far we have generated three estimators, i, 3i/2, and 2i, that have the properties of maximizing likelihood, minimizing squared error, and being unbiased. Yet another way to generate an estimator is to choose the value that makes the population mean equal the sample mean. If we see train i, the sample mean is just i; the train population that has the same mean is N = 2i − 1. Finally, to compute the Bayesian posterior distribution, we compute
Where H_{n} is the hypothesis that there are n trains, and i is the evidence: we saw train i. Again, P(iH_{n}) is 1/n if i < n, and 0 otherwise. The normalizing constant, P(i), is just the sum of the numerators for each hypothesis. If the prior distribution is uniform from 1 to 200, we start with 200 hypotheses and compute the likelihood for each. You can download an implementation from http://thinkstats.com/locomotive.py. Figure 8.1 shows what the result looks like. The 90% credible interval for this posterior is [63, 189], which is still quite wide. Seeing one train doesn’t provide strong evidence for any of the hypotheses (although it does rule out the hypotheses with n < i). If we start with a different prior, the posterior is significantly different, which helps to explain why the other estimators are so diverse. One way to think of different estimators is that they are implicitly based on different priors. If there is enough evidence to swamp the priors, then all estimators tend to converge; otherwise, as in this case, there is no single estimator that has all of the properties we might want. Exercise 6
Generalize locomotive.py to handle the case where you
see more than one train. You should only have to
change a few lines of code.
See if you can answer the other questions for the case where you see more than one train. You can find a discussion of the problem and several solutions at http://wikipedia.org/wiki/German_tank_problem. 8.10 Glossary

Like this book?
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.
