Zipf’s law

Zipf’s law is an empirical law, formulated using mathematical statistics, stating that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions. The law is named after the linguist George Kingsley Zipf, who first proposed it (Zipf 1935, 1949), though Jean-Baptiste Estoup appears to have noticed the regularity before Zipf.

Motivation

Zipf’s law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in the Brown Corpus, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852 occurrences). Only 135 vocabulary items are needed to account for half the Brown Corpus.
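The inverse-rank relationship can be checked directly against the three Brown Corpus counts quoted above. The following is a minimal sketch; the helper name zipf_ratios is hypothetical and chosen only for illustration. It divides each observed count by the count an ideal Zipf distribution would predict (the top count divided by the rank).

```python
# Counts quoted in the text for the three most frequent Brown Corpus words.
BROWN_TOP_COUNTS = [("the", 69971), ("of", 36411), ("and", 28852)]


def zipf_ratios(counts):
    """Return observed / predicted for each rank, where predicted = top count / rank.

    `counts` must be sorted from most to least frequent.
    A ratio of 1.0 means the count matches the ideal inverse-rank prediction.
    """
    top = counts[0]
    return [count / (top / rank) for rank, count in enumerate(counts, start=1)]


ratios = zipf_ratios([count for _, count in BROWN_TOP_COUNTS])
for (word, count), ratio in zip(BROWN_TOP_COUNTS, ratios):
    print(f"{word}: observed {count}, observed/predicted = {ratio:.2f}")
```

For these counts the ratios come out at roughly 1.04 for “of” and 1.24 for “and”, which illustrates that the law is an approximation rather than an exact rule.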

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913. Empirically, a data set can be tested to see whether Zipf’s law applies by running the regression log R = a - b log n, where R is the rank of the datum, n is its value, and a and b are constants. Zipf’s law applies when b = 1. When this regression is applied to cities, a better fit has been found with b = 1.07. While Zipf’s law holds for the upper tail of the distribution, the entire distribution of cities is log-normal and follows Gibrat’s law. The two laws are consistent because a log-normal tail typically cannot be distinguished from a Pareto (Zipf) tail.
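The regression log R = a - b log n can be sketched with ordinary least squares in plain Python. This is an illustrative sketch only: the function name fit_rank_size is hypothetical, natural logarithms are assumed, and the input sizes are synthetic values constructed to follow n = 1,000,000 / R exactly, not real city populations.

```python
import math


def fit_rank_size(values):
    """Least-squares fit of log R = a - b log n; returns (a, b).

    R is the rank (1 = largest) and n is the value. All values must be positive.
    """
    ordered = sorted(values, reverse=True)
    xs = [math.log(n) for n in ordered]                              # log n
    ys = [math.log(rank) for rank in range(1, len(ordered) + 1)]     # log R
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope  # the model is written with a minus sign on b


# Synthetic, made-up sizes that follow an ideal Zipf law (assumption for illustration).
sizes = [1_000_000 / rank for rank in range(1, 51)]
a, b = fit_rank_size(sizes)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On this idealized input the fitted b equals 1 up to floating-point rounding. Real data would give a value that only approximates 1, as with the b = 1.07 reported for cities.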

Theoretical review

Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency). For example, the word “the” (as described above) would appear at x = log(1), y = log(69971). The data conform to Zipf’s law to the extent that the plot is linear; for the strict form of the law, the slope of that line is close to -1.
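As a small illustration of the log-log coordinates, the sketch below converts the three Brown Corpus counts quoted earlier into (log rank, log frequency) points and computes the slope of the line through the first and last of them. Natural logarithms are assumed here; any base gives the same slope. Three points are far too few to judge linearity, so this only shows how the coordinates are formed.

```python
import math

# Brown Corpus counts quoted in the text, ordered by rank: "the", "of", "and".
counts = [69971, 36411, 28852]

# Each word becomes a point (log rank, log frequency) on the log-log graph.
points = [(math.log(rank), math.log(count))
          for rank, count in enumerate(counts, start=1)]

# Slope of the straight line through the first and last points.
(x0, y0), (x1, y1) = points[0], points[-1]
slope = (y1 - y0) / (x1 - x0)

for x, y in points:
    print(f"x = {x:.3f}, y = {y:.3f}")
print(f"slope between rank 1 and rank 3: {slope:.2f}")
```

The first point has x = log(1) = 0, matching the example in the text, and the slope over these three points is roughly -0.81.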

Figure: Zipf PMF for N = 10 on a log-log scale. The horizontal axis is the index k. (Note that the function is defined only at integer values of k; the connecting lines do not indicate continuity.)
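The values behind such a plot can be computed from the standard form of the Zipf probability mass function, f(k) = (1 / k^s) / H, where H is the sum of 1 / n^s for n = 1 to N. The caption does not state the exponent, so the default s = 1 below is an assumption, and the function name zipf_pmf is hypothetical.

```python
def zipf_pmf(N, s=1.0):
    """Return [f(1), ..., f(N)] for a Zipf distribution over ranks 1..N.

    The exponent s is an assumed parameter; s = 1 gives the classic 1/k form.
    """
    weights = [1.0 / k ** s for k in range(1, N + 1)]
    total = sum(weights)  # normalizing constant (generalized harmonic number)
    return [w / total for w in weights]


pmf = zipf_pmf(10)
for k, p in enumerate(pmf, start=1):
    print(f"k = {k:2d}  f(k) = {p:.4f}")
```

With s = 1 the probability at k = 1 is exactly twice that at k = 2 and ten times that at k = 10, and the probabilities are defined only at the integer values of k.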

See also:

Human Language is not Random