PlanetMath Automatic Reference Linking
PlanetMath Automatic Reference Linking
The PlanetMath encyclopedia is a “semantic network” of sorts. It is a set of concepts (entries in the encyclopedia) whose meanings depend, not only on the content of the entries, but on the position in the network (the connections to other entries.) This is of course the case because known mathematics represents one big semantic network. It is possible to understand portions of it that are only “linked” to each other and derive meanings from each other, but overall the entire “network” is connected and one must be able to move throughout the network to understand any given “node” (concept).
Because mathematics is like this, it is supremely important that users of PlanetMath are able to jump to requisite concepts in the network in order to understand the current one, all the way down to the concepts that are so simple they are evident to the intuition. However, there are so many of these connections that is is too much to ask for authors of entries to do the linking themselves. Not only is it too much work, but the authors just may not be aware of which requisite concepts are already defined on PlanetMath.
To solve this problem, PlanetMath implements automatic reference linking between entries in the encyclopedia. In essence, when an entry is “rendered” for display, the text is broken down and scanned for words that invoke entries that have been defined already. These words are then turned into hyperlinks. Doing this automatically almost completely frees the author from having to think about links, however, it is not completely infallible. Because of this, the user can ultimately override the automatic linking, or create their own original linking (see User Linking Controls.)
Linking is done, roughly, from instances of encyclopedia entry titles appearing in other entries, to those entries themselves. If there is an entry called “widget” in the encyclopedia, and I create a new entry “thingamagig” that contains the word “widget”, “widget” in “thingamagig” will be turned into a hyperlink which leads to the “widget” entry.
There are more nuances to the behavior, however. What if I mention the phrase “green widget” in my “thingamagig” entry, and there is not only a “widget” entry, but a “green widget” entry? Well, in this case, the automatic linking is smart enough to link to “green widget”. In general, the longest match at a given location in entry text wins, and shorter matches occurring “within” larger matches are ignored.
So, for example, say we have entries “large green widget”, “widget”, and “large green”. If we mention “large green widget” in a new entry, all 3 words will become one hyperlink to the “large green widget” object, despite the fact that there are two matches starting at word “large”, and the match to “widget” alone occurs at a different location, but is still within the set of words matched for a larger title.
Synonyms must be handled as well (synonyms are multiple titles for the same entry or concept.) Within the text of an object that has synonyms, all mentions of the object’s title or synonym are escaped from linking. An example will illustrate why this makes sense. Say, in the “large green widget” entry, we mention the entire term “large green widget” (indeed, repeating the concept being defined is overwhelmingly common.) Since we expect these entire 3 words to define a single concept, but that concept is the current entry, we don’t want them to be hyperlinked to “large green widget”, “large green”, “green widget”, or “widget”, ad nauseum. One might point out that the smaller titles may be requisite concepts, and one would be right in saying this. However, it happens that the natural tendency is for the author of an entry to think of all occurrences of the title or synonym in his entry as one concept, which results in explicitly mentioning requisite concepts. For example, invocations of “green widget” in “large green widget” that are not preceded by “large” will link to “green widget” as usual.
PlanetMath also performs a couple of “morphological” transformations on titles in order to ensure they are linked to. The first, and most important transformation, has the effect of invariance of pluralization. For example, creating an entry that says “green widgets” will link to the “green widget” entry. The second invariance is due to possessiveness. If there is a “Green’s theorem” entry, mentioning “the Green theorem” in another entry will result in “Green theorem” being linked to the “Green’s theorem” entry. This also works in reverse, if the entry was called “Green theorem” and you mentioned “Green’s theorem”, the linking would still take place.
Another morphological invariance concerns international characters. Plain Roman character versions of a title which contains western European characters will still be “matched” by the linking system. For example, “Groebner” will link to an entry entitled “GrÃÂ¶bner.” A TeX trigraph version like “Grob̈ner” will link properly as well.
The final transformation has the effect of invariance under title style. Titles can be phrased as index entries, containing a comma (like “Mersenne, Marin” for a proper name) or normally, like “Green’s theorem”. Users are encouraged to use index-style titles where appropriate, so that entries appear alphabetically in the encyclopedia where they’d be expected to appear. However, index-style titles could present a problem to the automatic reference linking if they were not handled; when mentioning Mersenne primes, nobody calls the inventor “Mersenne, Marin”. Instead we’d say “Father Marin Mersenne”. PlanetMath automatically takes care of this, so “Marin Mersenne” would link to “Mersenne, Marin”.
There is one other important facet of behavior; linking is only done on the first occurrence of a concept. Why? Well, there is really no reason to make a link to every mention of “widget” if we have repeated it multiple times in an entry. We can assume our entry is read linearly, in which case the user will naturally follow hyperlinks immediately when they see a concept they dont understand. Once they come back to our entry, they understand the concept, so they don’t need the hyperlink anymore. While we can’t remove the link they just followed, we can avoid linking the rest of the occurrences of “widget”. This also results in a much cleaner appearance, as links tend to make text look cluttered.
The link processing system is written in Perl, like the rest of PlanetMath. It is the single most complicated portion of the system. However, a decent explanation of how it works can be made without going into messy details (and they do abound).
The cross referencing engine starts by pulling out math environments and other portions of text that need to be escaped. These portions are replaced by single tokens, which are unique. The engine then first breaks the text of an entry into a single words/tokens array. This array form makes it easy to associate a particular word with a unique integer position.
All titles and synonyms in the encyclopedia are kept in a hash index structure. This structure contains at the top level only words that occur as first words of some title. Following these words leads to a list of entries that have titles starting with that particular word. This list is kept sorted in descending order of length, which allows the longest match to be chosen first, simply by early-out.
The now tokenized text of the entry is then iterated through. Checking to see if a word should be the start of a link is an operation, thanks to the hash structure just described. If a word indeed matches the start of a title, the following words are checked to see if they match the longest title starting with that word. If this fails, the next longest title is checked, and so on. When a matching title is found, it is submitted to a subroutine for inclusion in the match array.
The match array is actually a hash of positions – a match is first stored by the position where it matches within the entry text. When a candidate match is found, the match subroutine checks to make sure that no other entry already in the match array contains the position of the new match. It doesn’t have to worry about checking to see if there is a longer match at the same position, because we already have guaranteed this by sorting titles by descending order of length.
After the match array is built, the engine iterates through it and replaces matching words with hyperlinks. The engine keeps track of when it makes a link to a particular entry, so that if it runs into another mention of this entry, a link is not made. Links are also kept in an array and returned to the calling PlanetMath code, so they can be displayed separately.
Finally, the linked text array is recombined with the removed tokens, and returned as a single text string. This final string is more or less what is sent to your web browser.
4 Asymptotic Complexity of the Algorithm
If is the number of titles (and synonyms) in the encyclopedia, and is the number of words (minus escaped portions) in a new entry, it only “costs” to process the new entry due to the above architecture! Notice how is not involved at all. This means that automatic reference linking is completely scaleable; as PlanetMath grows ( increases), linking will not get any slower.
|Title||PlanetMath Automatic Reference Linking|
|Date of creation||2013-03-11 19:20:05|
|Last modified on||2013-03-11 19:20:05|
|Last modified by||(0)|