Zipf’s law describes a relationship between the frequencies and ranks of words in natural languages1. The “frequency” of a word is the number of times it appears in a body of work. The “rank” of a word is its position in a list of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.
Specifically, Zipf’s Law predicts that the frequency, f, of the word with rank r is:
| f = c r−s |
where s and c are parameters that depend on the language and the text.
If you take the logarithm of both sides of this equation, you get:
| logf = logc − s logr |
So if you plot logf versus logr, you should get a straight line with slope −s and intercept logc.
If you need some help getting started, you can download greenteapress.com/compmod/Hist.py, which provides an object named Hist that you might find useful.
Plot the results and check whether they form a straight line. Can you estimate the value of s?
You might want to use the pylab package, which provides easy-to-use functions to generate a variety of plots. In particular, pylab.loglog plots a series of points on a log-log scale. pylab is part of a larger package called matplotlib; it is not included in all Python distributions, so you might have to install it.
Here is an example that generates a simple graph:
import pylab
# compute lists of x and y values
xs = range(0, 10)
ys = [x**2 for x in xs]
# plot the data
pylab.plot(xs, ys, '-ro')
# set title and labels
pylab.title('Parabola')
pylab.xlabel('x')
pylab.ylabel('y = x**2')
# show the plot
pylab.show()
pylab.plot takes two lists as parameters. If you have a list of pairs instead, you can use zip and the scatter operator to “transpose” it. For example, if t is a list of pairs, zip(*t) returns a pair of lists.
The most common gotcha with pylab is that you have to call pylab.show to see the plot. You can read more about it at matplotlib.sourceforge.net.
You can download my solution from greenteapress.com/compmod/Zipf.py
A distribution is a statistical description of a set of values. For example, if you collect the population of every city and town in the U.S., you would have a set of about 14,000 integers.
The simplest description of this set is a list of numbers, which would be complete but not very informative. A more concise description would be a statistical summary like the mean and variation, but that is not a complete description because there are many sets of values with the same summary statistics.
One alternative is a histogram, which divides the range of possible values into “bins” and counts the number of values that fall in each bin. Histograms are common, so they are easy to understand, but they have some drawbacks: the primary one is that it is tricky to get the bin size right. If the bins are too small, then the number of values in each bin is also small, so the histogram doesn’t give much insight. If the bins are too large, they lump together a wide range of values, which obscures details that might be important.
A better alternative is a cumulative distribution, which shows the percentage of values less than or equal to x for a range of x that sweeps from the smallest value in the set to the highest.
For example, the following figure shows the cumulative distribution function (CDF), for the values (1,2,2,4,5).

The range of the CDF is always from 0 to 1. The values on the y-axis are sometimes called percentiles, although strictly-speaking the range of a percentile is from 0 to 100. The values on the x-axis are called quantities, or sometimes “quantiles.”
The filled and empty circles indicate the value of the function at integral values: for example, cdf(1) = 0.2 because 20% of the values in the set are less than or equal to 1.
For example, to make a distribution of the values (1,2,2,4,5) you would run:
h = Hist([1,2,2,4,5]) d = Dist(h)
1 0.0 1 0.2 2 0.2 2 0.6 4 0.6 4 0.8 5 0.8 5 1.0
You should give some thought to choosing a data structure to represent a distribution. percentile should be O(logn), where n is the number of unique quantities in the distribution. The other methods should be linear in n.
You can download my solution from greenteapress.com/compmod/Dist.py
The distributions we have seen so far are sometimes called empirical distributions because they are based on a dataset that comes from some kind of empirical observation.
An alternative is what I will call a closed-form distribution, which is characterized by a CDF that can be expressed as a closed-form function. Some of these distributions, like the Gaussian2 or normal distribution, are well known, at least to people who have studied statistics. Many real world phenomena can be approximated by closed-form distributions, which is why they are useful.
For example, if you observe a mass of radioactive material with an instrument that can detect decay events, the distribution of times between events will most likely fit an exponential distribution. The same is true for any series of events where an event is equally likely at any time.
The CDF of the exponential distribution is:
| cdf(x) = 1 − e−λ x |
The parameter, λ, determines the mean and variance of the distribution. This equation can be used to derive a simple visual test for whether a dataset can be well approximated by an exponential distribution. All you have to do is plot the complementary distribution on a log-y scale.
The complementary distribution (CCDF) is just 1 − cdf(x); if you plot the complementary distribution of a dataset that you think is exponential, you expect to see a function like:
| y = 1 − cdf(x) ∼ e−λ x |
If you take the log of both sides of this equation, you get:
| logy ∼ −λ x |
So on a log-y scale the CCDF should look like a straight line with slope −λ.
As an example, I will analyze the time of birth for 44 babies born in a 24-hour period in a hospital in Brisbane, Australia.3.
Using the list of birth times, I computed the “interarrival times;” that is, the times between successive births. The following figure shows the CCDF of these values. The graph on the right is on a log-y scale. It is not particularly straight, which suggests that the exponential model is only an approximation of the observed process. Most likely the underlying assumption—that a birth is equally likely at any time of day—is not exactly true.

plot_ccdf that plots the complementary
CDF of the distribution.
The Pareto distribution is named after the economist Vilfredo Pareto, who used it to describe the distribution of wealth (see wikipedia.org/wiki/Pareto_distribution). Since then, people have used it to describe a variety of phenomena in the natural and social sciences, including sizes of cities and towns, sand particles and meteorites, forest fires and earthquakes.
The Pareto distribution is characterized by a CDF with the following functional form:
| 1− | ⎛ ⎜ ⎜ ⎝ |
| ⎞ ⎟ ⎟ ⎠ |
|
The parameters xm and α determine the location and shape of the distribution. xm is the minimum possibile quantity.
Values from a Pareto distribution often have these properties:
To get a sense of the difference between the Pareto and Gaussian distributions, imagine what the world would be like if the distribution of human height were Pareto. Using the same minimum height, xm = 100 cm, we could choose α = 1.7, so that the median height is about 150 cm.
Here are empirical CDFs for samples with n=9000 drawn from these two distributions:

The figure on the left is on a log-x scale. It shows that the median and lower bounds of the distributions are the same, but the Pareto distribution has a longer tail. The figure on the right is on a log-log scale. The Pareto distribution is a straight line except in the extreme tail, where it is noisy because of the limitations of a finite sample.
If you generate 6 billion random values from the Gaussian distribution shown in the figure, you would guess that the tallest person in the world is about 255 cm. In fact, the tallest currently living person, Leonid Stadnyk, is 259 cm.
But if you generate 6 billion random values from the Pareto distribution, you will see that the tallest person in the world could easily be taller than 80.5 km, which would technically make him or her an astronaut, at least in the United States (see wikipedia.org/wiki/Earth's_atmosphere). That’s what it means to be scale-free!
There is a simple visual test that can indicate whether an empirical distribution is well-characterized by a Pareto distribution: on a log-log scale, the CCDF looks like a straight line. The derivation is similar to what we saw in the previous section.
The equation for the CCDF is:
| y = 1 − cdf(x) ∼ | ⎛ ⎜ ⎜ ⎝ |
| ⎞ ⎟ ⎟ ⎠ |
|
Taking the log of both sides yields:
| logy ∼ −α (logx − logxm ) |
So if you plot logy versus logx, it should look like a straight line with slope −α and intercept −α logxm.
The Weibull distribution is a generalization of the exponential distribution that comes up in failure analysis (see wikipedia.org/wiki/Weibull_distribution).
Can you find a transformation that makes a Weibull distribution look like a straight line? What do the slope and intercept of the line indicate?
Use random.weibullvariate to generate a sample from a Weibull distribution and use it to test your transformation.
The U.S. Census Bureau publishes data on the population of every incorporated city and town in the United States. I have written a small program that downloads this data and converts it into a convenient form. You can download it from greenteapress.com/compmod/populations.py.
What conclusion do you draw about the distribution of sizes for cities and towns?
In 1999 Barabási and Albert published a paper in Science, “Emergence of Scaling in Random Networks,” that characterizes the structure (also called “topology”) of several real-world networks, including graphs that represent the interconnectivity of movie actors, world-wide web (WWW) pages, and elements in the electrical power grid in the western United States.
They measure the degree (number of connections) of each node and compute P(k), the probability that a vertex has degree k; then they plot P(k) versus k on a log-log scale. The tail of the plot fits a straight line, so they conclude that it obeys a power law: as k gets large, P(k) is asymptotic to k− γ, where γ is a parameter that determines the rate of decay.
They also propose a model that generates random graphs with the same property. The essential features of the model, which distinguish it from the Erdős-Rényi model and the Watts-Strogatz model, are:
Finally, they show that graphs generated by this model have a distribution of degrees that obeys a power law. Graphs that have this property are sometimes called scale-free networks (see wikipedia.org/wiki/Scale-free_network), but the name can be confusing because it is the distribution of degrees that is scale-free.
In order to maximize confusion, distributions that obey the power law are also sometimes called scaling distributions because they are invariant under a change of scale. That means that if you change the units the quantities are expressed in, the slope parameter, γ, doesn’t change. You can read wikipedia.org/wiki/Power_law for the details, but it is not important for what we are doing here.
This exercise asks you to make connections between the Watts-Strogatz (WS) and Barabási-Albert (BA) models:
At this point we have seen three phenomena that yield a straight line on a log-log plot:
The similarity in these plots is not a coincidence; these visual tests are closely related.
Starting with a power-law distribution, we have:
| P(k) ∼ k− γ |
P(k) is the probability that a random variable X equals k, so the cumulative distribution of X is:
| cdf(x) = Pr{X ≤ x } = |
| P(k) |
For large values of k we can approximate the summation with an integral:
| k− γ ∼ | ∫ |
| k− γ = |
| (1 − x−γ + 1) |
To make this a proper CDF we could normalize it so that it goes to 1 as x goes to infinity, but that’s not necessary, because all we need to know is:
| cdf(x) ∼ 1 − x−γ + 1 |
Which shows that the distribution of x is asymptotic to a Pareto distribution with xm = 1 and α = γ − 1.
Similarly, if we start with a straight line on a Zipf plot, we have4
| f = c r−s |
Where f is the frequency of the word with rank r. Inverting this relationship yields:
| r = (f/c)−1/s |
Now subtracting 1 and dividing through by the number of different words, n, we get
| = |
| − |
|
Which is only interesting because if r is the rank of a word, then (r−1)/n is the fraction of words with lower ranks, which is the fraction of words with higher frequency, which is the CCDF of the distribution of frequencies:
| ccdf(x) = Pr{X > x } = |
| − |
|
Where X is a random variable drawn from the distribution of frequencies. To characterize the asymptotic behavior for large n we can ignore c and 1/n, which yields:
| ccdf(x) ∼ f−1/s |
Which shows that if a set of words obeys Zipf’s law then the distribution of their frequencies is asymptotic to a Pareto distribution with xm = 1 and α = 1/s.
So the three visual tests are mathematically equivalent; a dataset that passes one test will pass all three. But as a practical matter, the power law plot is noisier than the other two, because it is the (discrete) derivative of the CCDF.
The Zipf and CCDF plots are more robust, but Zipf’s law is only applicable to discrete data (like words), not continuous quantities. CCDF plots work with both.
For these reasons—robusteness and generality—I recommend using CCDFs exclusively.
The Internet Movie Database (IMDb) is an online collection of information about movies. Their database is available in plain text format, so it is reasonably easy to read from Python. For this exercise, the files you need are actors.list.gz and actresses.list.gz; you can download them from www.imdb.com/interfaces#plain.
I have written a program that parses these files and splits them into actor names, movie titles, etc. You can download it from greenteapress.com/compmod/imdb.py.
If you run imdb.py as a script, it reads actors.list.gz
and prints one actor-movie pair per line. Or, if you import
imdb you can use the function process_file to, well,
process the file. The arguments are a filename, a function
object and an optional number of lines to process. Here is
an example:
import imdb
def print_info(actor, date, title, role):
print actor, date, title, role
imdb.process_file('actors.list.gz', print_info)
When you call process_file, it opens filename, reads the
contents, and calls print_info once for each line in the file.
print_info takes an actor, date, movie title and role as
arguments and prints them.