This HTML version of is provided for convenience, but it is not the best format for the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version. Chapter 12 Evidence12.1 Interpreting SAT scoresSuppose you are the Dean of Admission at a small engineering college in Massachusetts, and you are considering two candidates, Alice and Bob, whose qualifications are similar in many ways, with the exception that Alice got a higher score on the Math portion of the SAT, a standardized test intended to measure preparation for collegelevel work in mathematics. If Alice got 780 and Bob got a 740 (out of a possible 800), you might want to know whether that difference is evidence that Alice is better prepared than Bob, and what the strength of that evidence is. Now in reality, both scores are very good, and both candidates are probably well prepared for college math. So the real Dean of Admission would probably suggest that we choose the candidate who best demonstrates the other skills and attitudes we look for in students. But as an example of Bayesian hypothesis testing, let’s stick with a narrower question: “How strong is the evidence that Alice is better prepared than Bob?” To answer that question, we need to make some modeling decisions. I’ll start with a simplification I know is wrong; then we’ll come back and improve the model. I pretend, temporarily, that all SAT questions are equally difficult. Actually, the designers of the SAT choose questions with a range of difficulty, because that improves the ability to measure statistical differences between testtakers. But if we choose a model where all questions are equally difficult, we
can define a characteristic, 12.2 The scaleIn order to understand SAT scores, we have to understand the scoring and scaling process. Each testtaker gets a raw score based on the number of correct and incorrect questions. The raw score is converted to a scaled score in the range 200–800. In 2009, there were 54 questions on the math SAT. The raw score for each testtaker is the number of questions answered correctly minus a penalty of 1/4 point for each question answered incorrectly. The College Board, which administers the SAT, publishes the map from raw scores to scaled scores. I have downloaded that data and wrapped it in an Interpolator object that provides a forward lookup (from raw score to scaled) and a reverse lookup (from scaled score to raw). You can download the code for this example from http://thinkbayes.com/sat.py. For more information see Section 0.3. 12.3 The priorThe College Board also publishes the distribution of scaled scores
for all testtakers. If we convert each scaled score to a raw score,
and divide by the number of questions, the result is an estimate
of Here is the code that reads and processes the data: class Exam(object): def __init__(self): self.scale = ReadScale() scores = ReadRanks() score_pmf = thinkbayes.MakePmfFromDict(dict(scores)) self.raw = self.ReverseScale(score_pmf) self.prior = DivideValues(raw, 54) Exam encapsulates the information we have about the exam. ReadScale and ReadRanks read files and return objects that contain the data: self.scale is the Interpolator that converts from raw to scaled scores and back; scores is a list of (score, frequency) pairs.
Figure 12.1 shows the prior distribution of
For each testtaker, I define a Suite called Sat that
represents the distribution of class Sat(thinkbayes.Suite): def __init__(self, exam, score): thinkbayes.Suite.__init__(self) self.exam = exam self.score = score # start with the prior distribution for p_correct, prob in exam.prior.Items(): self.Set(p_correct, prob) # update based on an exam score self.Update(score)
As usual, we inherit Update from Suite and provide Likelihood: def Likelihood(self, data, hypo): p_correct = hypo score = data k = self.exam.Reverse(score) n = self.exam.max_score like = thinkbayes.EvalBinomialPmf(k, n, p_correct) return like hypo is a hypothetical
value of To keep things simple, I interpret the raw score as the number of correct answers, ignoring the penalty for wrong answers. With this simplification, the likelihood is given by the binomial distribution, which computes the probability of k correct responses out of n questions. 12.4 PosteriorFigure 12.2 shows the posterior distributions
of Which brings us back to the original question, “How strong is the
evidence that Alice is better prepared than Bob?” We can use the
posterior distributions of To formulate the question in terms of Bayesian hypothesis testing, I define two hypotheses:
To compute the likelihood of A, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where So we can define a Suite that computes the posterior probabilities of A and B: class TopLevel(thinkbayes.Suite): def Update(self, data): a_sat, b_sat = data a_like = thinkbayes.PmfProbGreater(a_sat, b_sat) b_like = thinkbayes.PmfProbLess(a_sat, b_sat) c_like = thinkbayes.PmfProbEqual(a_sat, b_sat) a_like += c_like / 2 b_like += c_like / 2 self.Mult('A', a_like) self.Mult('B', b_like) self.Normalize() Usually when we define a new Suite, we inherit Update and provide Likelihood. In this case I override Update, because it is easier to evaluate the likelihood of both hypotheses at the same time. The data passed to Update are Sat objects that represent
the posterior distributions of
Here is the code that creates TopLevel and updates it: exam = Exam() a_sat = Sat(exam, 780) b_sat = Sat(exam, 740) top = TopLevel('AB') top.Update((a_sat, b_sat)) top.Print() The likelihood of A is 0.79 and the likelihood of B is 0.21. The likelihood ratio (or Bayes factor) is 3.8, which means that these test scores are evidence that Alice is better than Bob at answering SAT questions. If we believed, before seeing the test scores, that A and B were equally likely, then after seeing the scores we should believe that the probability of A is 79%, which means there is still a 21% chance that Bob is actually better prepared. 12.5 A better modelRemember that the analysis we have done so far is based on the simplification that all SAT questions are equally difficult. In reality, some are easier than others, which means that the difference between Alice and Bob might be even smaller. But how big is the modeling error? If it is small, we conclude that the first model—based on the simplification that all questions are equally difficult—is good enough. If it’s large, we need a better model. In the next few sections, I develop a better model and discover (spoiler alert!) that the modeling error is small. So if you are satisfied with the simple mode, you can skip to the next chapter. If you want to see how the more realistic model works, read on...
This function is a simplified version of the curve used in item response theory, which you can read about at http://en.wikipedia.org/wiki/Item_response_theory. efficacy and difficulty are considered to be on the same scale, and the probability of getting a question right depends only on the difference between them. When efficacy and difficulty are equal, the probability of getting the question right is 50%. As efficacy increases, this probability approaches 100%. As it decreases (or as difficulty increases), the probability approaches 0%. Given the distribution of efficacy across testtakers and the distribution of difficulty across questions, we can compute the expected distribution of raw scores. We’ll do that in two steps. First, for a person with given efficacy, we’ll compute the distribution of raw scores. def PmfCorrect(efficacy, difficulties): pmf0 = thinkbayes.Pmf([0]) ps = [ProbCorrect(efficacy, diff) for diff in difficulties] pmfs = [BinaryPmf(p) for p in ps] dist = sum(pmfs, pmf0) return dist difficulties is a list of difficulties, one for each question. ps is a list of probabilities, and pmfs is a list of twovalued Pmf objects; here’s the function that makes them: def BinaryPmf(p): pmf = thinkbayes.Pmf() pmf.Set(1, p) pmf.Set(0, 1p) return pmf dist is the sum of these Pmfs. Remember from Section 5.4 that when we add up Pmf objects, the result is the distribution of the sums. In order to use Python’s sum to add up Pmfs, we have to provide pmf0 which is the identity for Pmfs, so pmf + pmf0 is always pmf. If we know a person’s efficacy, we can compute their distribution of raw scores. For a group of people with a different efficacies, the resulting distribution of raw scores is a mixture. Here’s the code that computes the mixture: # class Exam: def MakeRawScoreDist(self, efficacies): pmfs = thinkbayes.Pmf() for efficacy, prob in efficacies.Items(): scores = PmfCorrect(efficacy, self.difficulties) pmfs.Set(scores, prob) mix = thinkbayes.MakeMixture(pmfs) return mix MakeRawScoreDist takes efficacies, which is a Pmf that represents the distribution of efficacy across testtakers. I assume it is Gaussian with mean 0 and standard deviation 1.5. This choice is mostly arbitrary. The probability of getting a question correct depends on the difference between efficacy and difficulty, so we can choose the units of efficacy and then calibrate the units of difficulty accordingly. pmfs is a metaPmf that contains one Pmf for each level of efficacy, and maps to the fraction of testtakers at that level. MakeMixture takes the metapmf and computes the distribution of the mixture (see Section 5.6). 12.6 CalibrationIf we were given the distribution of difficulty, we could use
I assume that the distribution of difficulty is uniform with parameters center and width. MakeDifficulties makes a list of difficulties with these parameters. def MakeDifficulties(center, width, n): low, high = centerwidth, center+width return numpy.linspace(low, high, n) By trying out a few combinations, I found that center=0.05 and width=1.8 yield a distribution of raw scores similar to the actual data, as shown in Figure 12.3. So, assuming that the distribution of difficulty is uniform, its range is approximately 1.85 to 1.75, given that efficacy is Gaussian with mean 0 and standard deviation 1.5. The following table shows the range of ProbCorrect for testtakers at different levels of efficacy:
Someone with efficacy 3 (two standard deviations above the mean) has a 99% chance of answering the easiest questions on the exam, and a 78% chance of answering the hardest. On the other end of the range, someone two standard deviations below the mean has only a 24% chance of answering the easiest questions. 12.7 Posterior distribution of efficacyNow that the model is calibrated, we can compute the posterior distribution of efficacy for Alice and Bob. Here is a version of the Sat class that uses the new model: class Sat2(thinkbayes.Suite): def __init__(self, exam, score): self.exam = exam self.score = score # start with the Gaussian prior efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3) thinkbayes.Suite.__init__(self, efficacies) # update based on an exam score self.Update(score)
def Likelihood(self, data, hypo): efficacy = hypo score = data raw = self.exam.Reverse(score) pmf = self.exam.PmfCorrect(efficacy) like = pmf.Prob(raw) return like pmf is the distribution of raw scores for a testtaker with the given efficacy; like is the probability of the observed score. Figure 12.4 shows the posterior distributions of efficacy for Alice and Bob. As expected, the location of Alice’s distribution is farther to the right, but again there is some overlap. Using TopLevel again, we compare A, the hypothesis that Alice’s efficacy is higher, and B, the hypothesis that Bob’s is higher. The likelihood ratio is 3.4, a bit smaller than what we got from the simple model (3.8). So this model indicates that the data are evidence in favor of A, but a little weaker than the previous estimate. If our prior belief is that A and B are equally likely, then in light of this evidence we would give A a posterior probability of 77%, leaving a 23% chance that Bob’s efficacy is higher. 12.8 Predictive distributionThe analysis we have done so far generates estimates for Alice and Bob’s efficacy, but since efficacy is not directly observable, it is hard to validate the results. To give the model predictive power, we can use it to answer a related question: “If Alice and Bob take the math SAT again, what is the chance that Alice will do better again?” We’ll answer this question in two steps:
We already have most of the code we need. To compute
the predictive distributions, we can use exam = Exam() a_sat = Sat(exam, 780) b_sat = Sat(exam, 740) a_pred = exam.MakeRawScoreDist(a_sat) b_pred = exam.MakeRawScoreDist(b_sat) Then we can find the likelihood that Alice does better on the second test, Bob does better, or they tie: a_like = thinkbayes.PmfProbGreater(a_pred, b_pred) b_like = thinkbayes.PmfProbLess(a_pred, b_pred) c_like = thinkbayes.PmfProbEqual(a_pred, b_pred) The probability that Alice does better on the second exam is 63%, which means that Bob has a 37% chance of doing as well or better. Notice that we have more confidence about Alice’s efficacy than we do about the outcome of the next test. The posterior odds are 3:1 that Alice’s efficacy is higher, but only 2:1 that Alice will do better on the next exam. 12.9 DiscussionWe started this chapter with the question, “How strong is the evidence that Alice is better prepared than Bob?” On the face of it, that sounds like we want to test two hypotheses: either Alice is more prepared or Bob is. But in order to compute likelihoods for these hypotheses, we
have to solve an estimation problem. For each testtaker
we have to find the posterior distribution of either
Values like this are called nuisance parameters because we don’t care what they are, but we have to estimate them to answer the question we care about. One way to visualize the analysis we did in this chapter is
to plot the space of these parameters. def MakeJoint(pmf1, pmf2): joint = Joint() for v1, p1 in pmf1.Items(): for v2, p2 in pmf2.Items(): joint.Set((v1, v2), p1 * p2) return joint This function assumes that the two distributions are independent. Figure 12.5 shows the joint posterior distribution of
In TopLevel.Update, when we compute the likelihoods of A and B, we add up the probability mass on each side of this line. For the cells that fall on the line, we add up the total mass and split it between A and B. The process we used in this chapter—estimating nuisance parameters in order to evaluate the likelihood of competing hypotheses—is a common Bayesian approach to problems like this. 
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.
