Previous Up Next

Chapter 4  Scale-free networks

4.1  Zipf’s Law

Zipf’s law describes a relationship between the frequencies and ranks of words in natural languages1. The “frequency” of a word is the number of times it appears in a body of work. The “rank” of a word is its position in a list of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.

Specifically, Zipf’s Law predicts that the frequency, f, of the word with rank r is:

f = c rs 

where s and c are parameters that depend on the language and the text.

If you take the logarithm of both sides of this equation, you get:

logf = logc − s logr 

So if you plot logf versus logr, you should get a straight line with slope −s and intercept logc.

Exercise 1   Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency. You can test it by downloading an out-of-copyright book in plain text format from gutenberg.net. You might want to remove punctuation from the words.

If you need some help getting started, you can download greenteapress.com/compmod/Hist.py, which provides an object named Hist that you might find useful.

Plot the results and check whether they form a straight line. Can you estimate the value of s?

You might want to use the pylab package, which provides easy-to-use functions to generate a variety of plots. In particular, pylab.loglog plots a series of points on a log-log scale. pylab is part of a larger package called matplotlib; it is not included in all Python distributions, so you might have to install it.

Here is an example that generates a simple graph:

import pylab

# compute lists of x and y values
xs = range(0, 10)
ys = [x**2 for x in xs]

# plot the data
pylab.plot(xs, ys, '-ro')

# set title and labels
pylab.title('Parabola')
pylab.xlabel('x')
pylab.ylabel('y = x**2')

# show the plot
pylab.show()

pylab.plot takes two lists as parameters. If you have a list of pairs instead, you can use zip and the scatter operator to “transpose” it. For example, if t is a list of pairs, zip(*t) returns a pair of lists.

The most common gotcha with pylab is that you have to call pylab.show to see the plot. You can read more about it at matplotlib.sourceforge.net.

You can download my solution from greenteapress.com/compmod/Zipf.py

4.2  Cumulative distributions

A distribution is a statistical description of a set of values. For example, if you collect the population of every city and town in the U.S., you would have a set of about 14,000 integers.

The simplest description of this set is a list of numbers, which would be complete but not very informative. A more concise description would be a statistical summary like the mean and variation, but that is not a complete description because there are many sets of values with the same summary statistics.

One alternative is a histogram, which divides the range of possible values into “bins” and counts the number of values that fall in each bin. Histograms are common, so they are easy to understand, but they have some drawbacks: the primary one is that it is tricky to get the bin size right. If the bins are too small, then the number of values in each bin is also small, so the histogram doesn’t give much insight. If the bins are too large, they lump together a wide range of values, which obscures details that might be important.

A better alternative is a cumulative distribution, which shows the percentage of values less than or equal to x for a range of x that sweeps from the smallest value in the set to the highest.

For example, the following figure shows the cumulative distribution function (CDF), for the values (1,2,2,4,5).

The range of the CDF is always from 0 to 1. The values on the y-axis are sometimes called percentiles, although strictly-speaking the range of a percentile is from 0 to 100. The values on the x-axis are called quantities, or sometimes “quantiles.”

The filled and empty circles indicate the value of the function at integral values: for example, cdf(1) = 0.2 because 20% of the values in the set are less than or equal to 1.

Exercise 2   Write a definition for a class named Dist that provides the following methods:
__init__:
Takes a histogram object similar to the one defined in greenteapress.com/compmod/Hist.py and builds a representation of the distribution.

For example, to make a distribution of the values (1,2,2,4,5) you would run:

h = Hist([1,2,2,4,5])
d = Dist(h)
percentile:
Takes a quantity and returns the corresponding percentile. For example, given the previous values, d.percentile(2) should return 0.6.
print_cdf:
Print the quantities in the distribution and their corresponding percentiles, with two lines per quantity. For the previous distribution, the output should be:
1 0.0
1 0.2
2 0.2
2 0.6
4 0.6
4 0.8
5 0.8
5 1.0
plot_cdf:
Use pylab to plot the distribution in the style of the figure above. You should plot two points per quantity to make a stair-step pattern. You don’t have to plot the filled and empty circles, but you can.

You should give some thought to choosing a data structure to represent a distribution. percentile should be O(logn), where n is the number of unique quantities in the distribution. The other methods should be linear in n.

You can download my solution from greenteapress.com/compmod/Dist.py

4.3  Closed-form distributions

The distributions we have seen so far are sometimes called empirical distributions because they are based on a dataset that comes from some kind of empirical observation.

An alternative is what I will call a closed-form distribution, which is characterized by a CDF that can be expressed as a closed-form function. Some of these distributions, like the Gaussian2 or normal distribution, are well known, at least to people who have studied statistics. Many real world phenomena can be approximated by closed-form distributions, which is why they are useful.

For example, if you observe a mass of radioactive material with an instrument that can detect decay events, the distribution of times between events will most likely fit an exponential distribution. The same is true for any series of events where an event is equally likely at any time.

The CDF of the exponential distribution is:

cdf(x) = 1 − e−λ x 

The parameter, λ, determines the mean and variance of the distribution. This equation can be used to derive a simple visual test for whether a dataset can be well approximated by an exponential distribution. All you have to do is plot the complementary distribution on a log-y scale.

The complementary distribution (CCDF) is just 1 − cdf(x); if you plot the complementary distribution of a dataset that you think is exponential, you expect to see a function like:

y = 1 − cdf(x) ∼ e−λ x 

If you take the log of both sides of this equation, you get:

logy ∼ −λ x 

So on a log-y scale the CCDF should look like a straight line with slope −λ.

As an example, I will analyze the time of birth for 44 babies born in a 24-hour period in a hospital in Brisbane, Australia.3.

Using the list of birth times, I computed the “interarrival times;” that is, the times between successive births. The following figure shows the CCDF of these values. The graph on the right is on a log-y scale. It is not particularly straight, which suggests that the exponential model is only an approximation of the observed process. Most likely the underlying assumption—that a birth is equally likely at any time of day—is not exactly true.

Exercise 3   In your Dist class, add a method called plot_ccdf that plots the complementary CDF of the distribution.
Exercise 4   Use the function expovariate in the random module to generate 44 values from an exponential distribution with mean 32.6. Plot the CCDF on linear and log-y scales and compare it to the figure above.

4.4  Pareto distributions

The Pareto distribution is named after the economist Vilfredo Pareto, who used it to describe the distribution of wealth (see wikipedia.org/wiki/Pareto_distribution). Since then, people have used it to describe a variety of phenomena in the natural and social sciences, including sizes of cities and towns, sand particles and meteorites, forest fires and earthquakes.

The Pareto distribution is characterized by a CDF with the following functional form:

1− 


x
xm
 


−α



 
 

The parameters xm and α determine the location and shape of the distribution. xm is the minimum possibile quantity.

Values from a Pareto distribution often have these properties:

Long tail:
Pareto distributions contain many small values and a few very large ones.
80/20 rule:
The large values in a Pareto distribution are so large that they add up to a disproportionate share of the total. In the context of wealth, the 80/20 rule says that 20% of the people own 80% of the wealth.
Scale free:
Short-tailed distributions are centered around a typical size, which is called a “scale.” For example, the great majority of adult humans are between 100 and 200 cm in height, so we could say that the scale of human height is a few hundred centimeters. But for long-tailed distributions, there is no similar range (bounded by a factor of two) that contains the bulk of the distribution. So we say that these distributions are “scale-free.”

To get a sense of the difference between the Pareto and Gaussian distributions, imagine what the world would be like if the distribution of human height were Pareto. Using the same minimum height, xm = 100 cm, we could choose α = 1.7, so that the median height is about 150 cm.

Here are empirical CDFs for samples with n=9000 drawn from these two distributions:

The figure on the left is on a log-x scale. It shows that the median and lower bounds of the distributions are the same, but the Pareto distribution has a longer tail. The figure on the right is on a log-log scale. The Pareto distribution is a straight line except in the extreme tail, where it is noisy because of the limitations of a finite sample.

If you generate 6 billion random values from the Gaussian distribution shown in the figure, you would guess that the tallest person in the world is about 255 cm. In fact, the tallest currently living person, Leonid Stadnyk, is 259 cm.

But if you generate 6 billion random values from the Pareto distribution, you will see that the tallest person in the world could easily be taller than 80.5 km, which would technically make him or her an astronaut, at least in the United States (see wikipedia.org/wiki/Earth's_atmosphere). That’s what it means to be scale-free!

There is a simple visual test that can indicate whether an empirical distribution is well-characterized by a Pareto distribution: on a log-log scale, the CCDF looks like a straight line. The derivation is similar to what we saw in the previous section.

The equation for the CCDF is:

y = 1 − cdf(x) ∼ 


x
xm
 


−α



 
 

Taking the log of both sides yields:

logy ∼ −α (logx − logxm ) 

So if you plot logy versus logx, it should look like a straight line with slope −α and intercept −α logxm.

Exercise 5  

The Weibull distribution is a generalization of the exponential distribution that comes up in failure analysis (see wikipedia.org/wiki/Weibull_distribution).

Can you find a transformation that makes a Weibull distribution look like a straight line? What do the slope and intercept of the line indicate?

Use random.weibullvariate to generate a sample from a Weibull distribution and use it to test your transformation.

Exercise 6   The distribution of populations for cities and towns has been proposed as an example of a real-world phenomenon that can be described with a Pareto distribution.

The U.S. Census Bureau publishes data on the population of every incorporated city and town in the United States. I have written a small program that downloads this data and converts it into a convenient form. You can download it from greenteapress.com/compmod/populations.py.

  1. Read over the program to make sure you know what it does and then write a program that computes and plots the distribution of populations for the 14593 cities and towns in the dataset.
  2. In your Dist class, write a function called quantile that takes a percentile as a parameter and returns the corresponding quantity. What is the median size in this distribution (the quantity that corresponds to the 50th percentile)? What are the 25th and 75th percentiles?
  3. Plot the CDF on linear and log-x scales so you can get a sense of the shape of the distribution. Then plot the CCDF on a log-log scale to see if it has the characteristic shape of a Pareto distribution.
  4. Use your plot to estimate values for the parameters xm and α. Then use the function paretovariate in the random module to generate 14593 values from a Pareto distribution with the parameters you estimated in the previous problem. Plot the CCDF on a log-log scale and compare it the CCDF of the observed data.

What conclusion do you draw about the distribution of sizes for cities and towns?

4.5  Barabási  and Albert

In 1999 Barabási  and Albert published a paper in Science, “Emergence of Scaling in Random Networks,” that characterizes the structure (also called “topology”) of several real-world networks, including graphs that represent the interconnectivity of movie actors, world-wide web (WWW) pages, and elements in the electrical power grid in the western United States.

They measure the degree (number of connections) of each node and compute P(k), the probability that a vertex has degree k; then they plot P(k) versus k on a log-log scale. The tail of the plot fits a straight line, so they conclude that it obeys a power law: as k gets large, P(k) is asymptotic to k− γ, where γ is a parameter that determines the rate of decay.

They also propose a model that generates random graphs with the same property. The essential features of the model, which distinguish it from the Erdős-Rényi  model and the Watts-Strogatz model, are:

Growth:
Instead of starting with a fixed number of vertices, Barabási  and Albert start with a small graph, add vertices over time, and characterize the structure of the graph as the number of vertices grows.
Preferential attachment:
When a new edge is created, it is more likely to connect to a vertex that already has a large number of edges. This “rich get richer” effect is characteristic of the growth patterns of some real-world networks.

Finally, they show that graphs generated by this model have a distribution of degrees that obeys a power law. Graphs that have this property are sometimes called scale-free networks (see wikipedia.org/wiki/Scale-free_network), but the name can be confusing because it is the distribution of degrees that is scale-free.

In order to maximize confusion, distributions that obey the power law are also sometimes called scaling distributions because they are invariant under a change of scale. That means that if you change the units the quantities are expressed in, the slope parameter, γ, doesn’t change. You can read wikipedia.org/wiki/Power_law for the details, but it is not important for what we are doing here.

Exercise 7  

This exercise asks you to make connections between the Watts-Strogatz (WS) and Barabási-Albert (BA) models:

  1. Read Barabási  and Albert’s paper and implement their algorithm for generating graphs. See if you can replicate their Figure 2(A), which shows P(k) versus k for a graph with 150 000 vertices.
  2. Use the WS model to generate the largest graph you can in a reasonable amount of time. Plot P(k) versus k and see if you can characterize the rate of decay.
  3. Use the BA model to generate a graph with about 1000 vertices and compute the characteristic length and clustering coefficient as defines in the Watts and Strogatz paper. Do scale-free networks have the characteristics of a small-world graph?

4.6  Zipf, Pareto and power laws

At this point we have seen three phenomena that yield a straight line on a log-log plot:

Zipf law:
A Zipf plot shows frequency as a function of rank.
Pareto CCDF:
A CCDF shows percentile as a function of quantile.
Power law distribution:
A power law plot shows frequency (or probability) as a function of quantile.

The similarity in these plots is not a coincidence; these visual tests are closely related.

Starting with a power-law distribution, we have:

P(k) ∼ k− γ 

P(k) is the probability that a random variable X equals k, so the cumulative distribution of X is:

cdf(x) = Pr{X ≤ x } = 
x
k=0
 P(k

For large values of k we can approximate the summation with an integral:

x
k=0
 k− γ ∼ 
x


k=0
 k− γ = 
1
γ −1
 (1 − x−γ + 1

To make this a proper CDF we could normalize it so that it goes to 1 as x goes to infinity, but that’s not necessary, because all we need to know is:

cdf(x) ∼ 1 − x−γ + 1 

Which shows that the distribution of x is asymptotic to a Pareto distribution with xm = 1 and α = γ − 1.

Similarly, if we start with a straight line on a Zipf plot, we have4

f = c rs 

Where f is the frequency of the word with rank r. Inverting this relationship yields:

r = (f/c)−1/s 

Now subtracting 1 and dividing through by the number of different words, n, we get

r−1
n
 = 
(f/c)−1/s
n
 − 
1
n
 

Which is only interesting because if r is the rank of a word, then (r−1)/n is the fraction of words with lower ranks, which is the fraction of words with higher frequency, which is the CCDF of the distribution of frequencies:

ccdf(x) = Pr{X > x } = 
(f/c)−1/s
n
 − 
1
n
 

Where X is a random variable drawn from the distribution of frequencies. To characterize the asymptotic behavior for large n we can ignore c and 1/n, which yields:

ccdf(x) ∼ f−1/s 

Which shows that if a set of words obeys Zipf’s law then the distribution of their frequencies is asymptotic to a Pareto distribution with xm = 1 and α = 1/s.

So the three visual tests are mathematically equivalent; a dataset that passes one test will pass all three. But as a practical matter, the power law plot is noisier than the other two, because it is the (discrete) derivative of the CCDF.

The Zipf and CCDF plots are more robust, but Zipf’s law is only applicable to discrete data (like words), not continuous quantities. CCDF plots work with both.

For these reasons—robusteness and generality—I recommend using CCDFs exclusively.

Exercise 8  

The Internet Movie Database (IMDb) is an online collection of information about movies. Their database is available in plain text format, so it is reasonably easy to read from Python. For this exercise, the files you need are actors.list.gz and actresses.list.gz; you can download them from www.imdb.com/interfaces#plain.

I have written a program that parses these files and splits them into actor names, movie titles, etc. You can download it from greenteapress.com/compmod/imdb.py.

If you run imdb.py as a script, it reads actors.list.gz and prints one actor-movie pair per line. Or, if you import imdb you can use the function process_file to, well, process the file. The arguments are a filename, a function object and an optional number of lines to process. Here is an example:

import imdb

def print_info(actor, date, title, role):
    print actor, date, title, role

imdb.process_file('actors.list.gz', print_info)

When you call process_file, it opens filename, reads the contents, and calls print_info once for each line in the file. print_info takes an actor, date, movie title and role as arguments and prints them.

  1. Write a program that reads actors.list.gz and actresses.list.gz and builds a database that maps from each actor to a list of his or her films.
  2. Two actors are “costars” if they have been in at least one movie together. Process the database you built in the previous step and build a second database that maps from each actor to a list of his or her costars.
  3. Write a program that can play the “Six Degrees of Kevin Bacon” game, which you can read about at wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon.
  4. Estimate the characteristic length and clustering coefficient of the costar graph. Is it a small world after all?
  5. Plot the CCDF of the distribution of degrees in the graph on a log-log scale. Is it scale-free?

1
See wikipedia.org/wiki/Zipf's_law
2
This is not a perfect example, since the CDF of a Gaussian is an integral with no closed-form solution.
3
This example is based on information and data from Dunn, “A Simple Dataset for Demonstrating Common Distributions,” Journal of Statistics Education v.7, n.3 (1999). You can download a file with this data from greenteapress.com/compmod/babyboom.dat.
4
This derivation follows Adamic, “Zipf, power law and Pareto—a ranking tutorial,” available at www.hpl.hp.com/research/idl/papers/ranking/ranking.html

Previous Up Next