encoding words

Let Σ be an alphabetMathworldPlanetmath, and Σ* the set of all words over Σ. An encoding of words over Σ is, loosely speaking, an assignment of words to numbers such that the numbers uniquely identify the words.

Definition. An encoding for a languagePlanetmathPlanetmath LΣ* is a one-to-one function E:L.

If L is finite, then it is easy to find an encoding for L. We are interested mainly in encoding infinite setsMathworldPlanetmath. By the definition above, L can not be encoded if it is uncountable. We can therefore limit the discussion to Σ that is at most countably infiniteMathworldPlanetmath by listing some common methods of encoding L.

Among the methods liststed, Σ is an enumerated set {a1,a2,}. In the first method, Σ is assumed to be finite, and countably infinite in the last three. In additionPlanetmathPlanetmath, L is assumed to be Σ* in the first three methods.

Method 1. First, set E(ai):=i. In addition, for the empty wordPlanetmathPlanetmathPlanetmath λ, we set E(λ):=0. Next, inductively define E(w) on the length of w. Suppose now that E(w) has been defiend. Set

E(wa):=nE(w)+E(a), where aΣ.

It is easy to see that if any non-empty word wA, with w=b1bm, where biΣ. Then


Then E is an encoding of L. This is really just the base-n digital representation of integers, where each ai can be thought of as digits used by the representationPlanetmathPlanetmath. The only differencePlanetmathPlanetmath here is that 0 is not used as a “digit” (every letter gets mapped to a positive integer), except when the word is empty.

For example, let Σ={0,1,,9}. Then the words 01,001, and 10 have code numbers 12,112, and 21.

It is easy to see that the encoding is one-to-one (see proof here (http://planetmath.org/UniquenessOfDigitalRepresentation)).

Method 2. Pick three positive number p,q,r such that p,q are coprimeMathworldPlanetmath, with p>1. Set E(ai):=pi and E(λ):=1. Inductively define E(w) on length of w. Suppose now that E(w) has been defined. Then

E(wa):=(rE(w)+q)E(a), where aΣ.

For example, E(a2a5a3)=(r(rp2+q)p5+q)p3=r2p10+rqp8+qp3.

To see that E is injective, we make the following series of observations:

  1. 1.

    E is injective on Σ. In addition, either E(a)|E(b) or E(b)|E(a) for any a,bΣ.

  2. 2.

    p|E(w) iff wλ.

  3. 3.

    If E(w)=E(a) for some aΣ, then w=a. First, note that w1, and if wΣ, then w=a. So suppose w=vb, with bΣ and vλ. Then (rE(v)+q)E(b)=E(a). If E(b)|E(a), then rE(v)+q=pi. Since E(v)>1, i0. But if i>0, p and q would not be coprime as p|E(v). On the other hand, if E(a)|E(b), then (E(v)+q)pj=1, again impossible. So w must be a letter, and therefore is a.

  4. 4.

    Now, suppose E(w)=E(v), and E(a)|E(b), where a,b are the rightmost letters of w,v respectively. By the same argument as previously, a=b, so we may cancel the letters, leaving us with the equation E(w)=E(v), where w=wa and v=vb. Continue the process of canceling the last letters in pairs, we end up with E(u)=E(c) for some letter cΣ. So u=c. This shows that w=v.

A variation of this method is to set E(aw):=(rE(w)+q)E(a).

If we set p=2 and r=q=1, then the range of E is the set of all positive integers.

Method 3. The third method utilizes the uniqueness of prime decomposition of integers. First, define f:Σ by f(ai)=i. Then, for any w=b1bm, with biΣ, define


where pi is the i-th prime numberMathworldPlanetmath (for example, p1=2). We again set E(λ):=1. By the fundamental theorem of arithmeticMathworldPlanetmath, and the fact that f is a bijection, E is injective (and maps onto the set of positive integers). This method is known as the multiplicative encoding of Gödel.

Method 4. Once an encoding E is found for Σ*, an encoding for LΣ* can be obtained by restricting the domain of E to L. Depending on how L is defined, other methods of encoding L via E are possible. We illustrate one example.

Let L=L1L2Σ*, where L1,L2 are disjoint non-empty finite setsMathworldPlanetmath not containing the empty word. Encode L as follows: suppose L1={v1,,vm} and L2={w1,,wn}. Define E:L such that:

  1. 1.

    E(vi):=10i-1 and E(wj):=10m+j-1, where i{1,,m} and j{1,,n};

  2. 2.

    E(w):=E(wj)E(u)10m+n-1, where w=wju, and λuΣ*.

Essentially, the first m+n digits are reserved for encoding words vi and wj. E is easily seen to be injective.

Method 5. Let L2 be the language consisting of all words of length 2. Define E2:L2 by E2(aiaj):=J(i,j), where J is a pairing functionMathworldPlanetmath that codes pairs of positive integers. Since J is an injection (actually maps onto the set of positive integers), so is E2. Using J, one can encode the language L3 of all length 3 words. Define E3:L2 by E3(aiajak):=J(i,J(j,k)). Again, E3 is an injection. By inductionMathworldPlanetmath, one can encode the language Ln of all words of length n, for any positive integer n.

Method 6. Let L(n) be the language consisting of all words of length at most n. We can utilize Method 5 to code L. First, let Σ1:=Σ{a0}, where a0 is a letter not in Σ. Define ϕ:L(n)Σ1* by ϕ(w):=a0n-|w|w, where |w| is the length of w. Then ϕ(L)Ln, the language of all length n words over Σ1*. It is easy to see that ϕ is one-to-one. Then E(n):=Enϕ is an encoding for L, where En is defined in Method 5 that encodes Ln, via the modified version of the pairing function J(i,j):=J(i+1,j+1), where i,j0.

Method 7. Can Method 5 be used to encode Σ*? The answer is yes. However, a direct extensionPlanetmathPlanetmathPlanetmath of En does not work. By this we mean that E:Σ*, given by E(w)=En(w) where |w|=n, though a function, is not injective. For any positive integer m, there is a word wn of length n for every n>0, such that En(wn)=m. Instead, define E so that E(w):=E2(|w|,E|w|(w)) if wλ, and E(λ):=0. It is easy to see that E is injective, since both E2 and En are (in fact, E is a bijection).

Remark. An encoding E for L can be thought of as a partial functionMathworldPlanetmath from Σ* to , whose domain is LΣ*. E is said to be effective if E(L) is a recursive setMathworldPlanetmath. Equivalently, the partial function E* on Σ* given by

E*(w)={a1E(w)if wL,undefinedotherwise.

is Turing-computable. An enumeration of Σ can be thought of as an encoding for Σ. If Σ is finite, any enumeration of Σ is effective. Assume that Σ is effectively enumerated, whether or not Σ is finite (so that a1 in the definition of E* can be effectively chosen). Then it is not hard to see that all of the encodings described above are effective. In fact, all of the sets E(L) described are primitive recursive.

Title encoding words
Canonical name EncodingWords
Date of creation 2013-03-22 19:06:09
Last modified on 2013-03-22 19:06:09
Owner CWoo (3771)
Last modified by CWoo (3771)
Numerical id 20
Author CWoo (3771)
Entry type Feature
Classification msc 94A60
Classification msc 68Q05
Classification msc 68Q45
Classification msc 03D20