Heaps’ law


Heaps’ law describes the portion of a vocabulary which is represented by an instance document (or set of instance documents) consisting of words chosen from the vocabulary. This can be formulated as

$$V_R(n) = K n^{\beta}$$

where $V_R$ is the portion of the vocabulary $V$ ($V_R \subseteq V$) represented by the instance text of size $n$. $K$ and $\beta$ are free parameters determined empirically.

With English text corpora, typically $K$ is between 10 and 100, and $0.4 \leq \beta \leq 0.6$.
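As a minimal sketch of how the formula might be evaluated and how the two parameters could be estimated from data, the following Python code computes the predicted vocabulary size and fits K and β by ordinary least squares on the log-log form log V = log K + β log n. The function names and the choice of a log-log fit are assumptions made for illustration; the entry itself only says the parameters are determined empirically.

```python
import math


def heaps_vocabulary_size(n, K, beta):
    """Predicted number of distinct terms in a text of n tokens: K * n**beta."""
    return K * n ** beta


def fit_heaps(sizes, vocab_counts):
    """Estimate (K, beta) by least squares on log V = log K + beta * log n.

    sizes        -- text sizes n (all > 0), at least two distinct values
    vocab_counts -- observed distinct-term counts at those sizes (all > 0)
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the log-log line
    K = math.exp(mean_y - beta * mean_x)  # intercept, mapped back from log space
    return K, beta
```

In practice the (n, V) pairs would come from counting distinct terms while scanning a corpus; a fit on synthetic data generated by the formula itself simply recovers the K and β that were used.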

(generated by GNU Octave and gnuplot)
Figure 1: A typical Heaps-law plot. The y-axis represents the text size, and the x-axis represents the number of distinct vocabulary elements present in the text. Compare the values of the two axes.

Heaps’ law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.
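One way to make the diminishing returns concrete is to treat $n$ as a continuous variable (an idealization not stated in the entry) and differentiate the formula:

$$\frac{dV_R}{dn} = K \beta n^{\beta - 1}$$

Because $\beta < 1$, the exponent $\beta - 1$ is negative, so the rate at which new vocabulary appears falls toward zero as the text grows. For instance, with $\beta = 0.5$ the rate is $K / (2\sqrt{n})$: quadrupling the text size halves the rate of discovery, and doubling the text size multiplies the represented vocabulary only by $2^{0.5} \approx 1.41$.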

It is interesting to note that Heaps’ law applies in the general case where the “vocabulary” is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be country of origin of the person. If persons are selected randomly (that is, we are not selecting based on country of origin), then Heaps’ law says we will quickly have representatives from most countries (in proportion to their population) but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling.
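The sampling scenario can be explored with a small simulation. The sketch below draws "people" at random in proportion to hypothetical country populations and records how many distinct countries have been seen at a few sample sizes. The population weights (200 countries with weight 1/rank) are invented for illustration and are not real data, and the function name is likewise an assumption.

```python
import random


def distinct_types_seen(weights, sample_sizes, seed=0):
    """Sample types in proportion to weights and count distinct types seen.

    Returns a dict mapping each requested sample size to the number of
    distinct types observed among the first that many draws.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    types = list(range(len(weights)))
    draws = rng.choices(types, weights=weights, k=max(sample_sizes))
    targets = set(sample_sizes)
    seen = set()
    counts = {}
    for i, t in enumerate(draws, start=1):
        seen.add(t)
        if i in targets:
            counts[i] = len(seen)
    return counts


# Hypothetical populations: the country of rank r has weight 1/r (not real data).
weights = [1 / rank for rank in range(1, 201)]
counts = distinct_types_seen(weights, [100, 1000, 10000])
for size in sorted(counts):
    print(size, counts[size])
```

The count of distinct countries can never decrease as the sample grows, and with skewed weights like these the rarest countries tend to take disproportionately many draws to appear.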

1 References

  • Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.

Title Heaps’ law
Canonical name HeapsLaw
Date of creation 2013-03-22 13:01:56
Last modified on 2013-03-22 13:01:56
Owner akrowne (2)
Last modified by akrowne (2)
Numerical id 6
Author akrowne (2)
Entry type Definition
Classification msc 94A99
Classification msc 68P20
Classification msc 60E05