encoding words
Let Σ be an alphabet, and Σ* the set of all words over Σ. An encoding of words over Σ is, loosely speaking, an assignment of words to numbers such that the numbers uniquely identify the words.
Definition. An encoding for a language L⊆Σ* is a one-to-one function E:L→ℕ.
If L is finite, then it is easy to find an encoding for L. We are interested mainly in encoding infinite sets. By the definition above, L can not be encoded if it is uncountable. We can therefore limit the discussion to Σ that is at most countably infinite
by listing some common methods of encoding L.
Among the methods liststed, Σ is an enumerated set {a1,a2,…}. In the first method, Σ is assumed to be finite, and countably infinite in the last three. In addition, L is assumed to be Σ* in the first three methods.
Method 1. First, set E(ai):=i. In addition, for the empty word λ, we set E(λ):=0. Next, inductively define E(w) on the length of w. Suppose now that E(w) has been defiend. Set
E(wa):=nE(w)+E(a), where a∈Σ. |
It is easy to see that if any non-empty word w∈A, with w=b1⋯bm, where bi∈Σ. Then
E(w)=E(b1)nm-1+⋯+E(bm). |
Then E is an encoding of L. This is really just the base-n digital representation of integers, where each ai can be thought of as digits used by the representation. The only difference
here is that 0 is not used as a “digit” (every letter gets mapped to a positive integer), except when the word is empty.
For example, let Σ={0,1,…,9}. Then the words 01,001, and 10 have code numbers 12,112, and 21.
It is easy to see that the encoding is one-to-one (see proof here (http://planetmath.org/UniquenessOfDigitalRepresentation)).
Method 2. Pick three positive number p,q,r such that p,q are coprime, with p>1. Set E(ai):=pi and E(λ):=1. Inductively define E(w) on length of w. Suppose now that E(w) has been defined. Then
E(wa):=(rE(w)+q)E(a), where a∈Σ. |
For example, E(a2a5a3)=(r(rp2+q)p5+q)p3=r2p10+rqp8+qp3.
To see that E is injective, we make the following series of observations:
-
1.
E is injective on Σ. In addition, either E(a)|E(b) or E(b)|E(a) for any a,b∈Σ.
-
2.
p|E(w) iff w≠λ.
-
3.
If E(w)=E(a) for some a∈Σ, then w=a. First, note that w≠1, and if w∈Σ, then w=a. So suppose w=vb, with b∈Σ and v≠λ. Then (rE(v)+q)E(b)=E(a). If E(b)|E(a), then rE(v)+q=pi. Since E(v)>1, i≠0. But if i>0, p and q would not be coprime as p|E(v). On the other hand, if E(a)|E(b), then (E(v)+q)pj=1, again impossible. So w must be a letter, and therefore is a.
-
4.
Now, suppose E(w)=E(v), and E(a)|E(b), where a,b are the rightmost letters of w,v respectively. By the same argument as previously, a=b, so we may cancel the letters, leaving us with the equation E(w′)=E(v′), where w=w′a and v=v′b. Continue the process of canceling the last letters in pairs, we end up with E(u)=E(c) for some letter c∈Σ. So u=c. This shows that w=v.
A variation of this method is to set E(aw):=(rE(w)+q)E(a).
If we set p=2 and r=q=1, then the range of E is the set of all positive integers.
Method 3. The third method utilizes the uniqueness of prime decomposition of integers. First, define f:Σ→ℕ by f(ai)=i. Then, for any w=b1⋯bm, with bi∈Σ, define
E(w):=pf(b1)1⋯pf(bm)m=m∏i=1pf(bi)i, |
where pi is the i-th prime number (for example, p1=2). We again set E(λ):=1. By the fundamental theorem of arithmetic
, and the fact that f is a bijection, E is injective (and maps onto the set of positive integers). This method is known as the multiplicative encoding of Gödel.
Method 4. Once an encoding E is found for Σ*, an encoding for L⊆Σ* can be obtained by restricting the domain of E to L. Depending on how L is defined, other methods of encoding L via E are possible. We illustrate one example.
Let L=L1∪L2Σ*, where L1,L2 are disjoint non-empty finite sets not containing the empty word. Encode L as follows: suppose L1={v1,…,vm} and L2={w1,…,wn}. Define E′:L→ℕ such that:
-
1.
E′(vi):=10i-1 and E′(wj):=10m+j-1, where i∈{1,…,m} and j∈{1,…,n};
-
2.
E′(w):=E′(wj)E(u)10m+n-1, where w=wju, and λ≠u∈Σ*.
Essentially, the first m+n digits are reserved for encoding words vi and wj. E′ is easily seen to be injective.
Method 5. Let L2 be the language consisting of all words of length 2. Define E2:L2→ℕ by E2(aiaj):=J(i,j), where J is a pairing function that codes pairs of positive integers. Since J is an injection (actually maps onto the set of positive integers), so is E2. Using J, one can encode the language L3 of all length 3 words. Define E3:L2→ℕ by E3(aiajak):=J(i,J(j,k)). Again, E3 is an injection. By induction
, one can encode the language Ln of all words of length n, for any positive integer n.
Method 6. Let L(n) be the language consisting of all words of length at most n. We can utilize Method 5 to code L. First, let Σ1:=Σ∪{a0}, where a0 is a letter not in Σ. Define ϕ:L(n)→Σ*1 by ϕ(w):=an-|w|0w, where |w| is the length of w. Then ϕ(L)⊆Ln, the language of all length n words over Σ*1. It is easy to see that ϕ is one-to-one. Then E(n):=En∘ϕ is an encoding for L, where En is defined in Method 5 that encodes Ln, via the modified version of the pairing function J′(i,j):=J(i+1,j+1), where i,j≥0.
Method 7. Can Method 5 be used to encode Σ*? The answer is yes. However, a direct extension of En does not work. By this we mean that E:Σ*→ℕ, given by E(w)=En(w) where |w|=n, though a function, is not injective. For any positive integer m, there is a word wn of length n for every n>0, such that En(wn)=m. Instead, define E so that E(w):=E2(|w|,E|w|(w)) if w≠λ, and E(λ):=0. It is easy to see that E is injective, since both E2 and En are (in fact, E is a bijection).
Remark. An encoding E for L can be thought of as a partial function from Σ* to ℕ, whose domain is L⊆Σ*. E is said to be effective if E(L) is a recursive set
. Equivalently, the partial function E* on Σ* given by
E*(w)={aE(w)1if w∈L,undefinedotherwise. |
is Turing-computable. An enumeration of Σ can be thought of as an encoding for Σ. If Σ is finite, any enumeration of Σ is effective. Assume that Σ is effectively enumerated, whether or not Σ is finite (so that a1 in the definition of E* can be effectively chosen). Then it is not hard to see that all of the encodings described above are effective. In fact, all of the sets E(L) described are primitive recursive.
Title | encoding words |
---|---|
Canonical name | EncodingWords |
Date of creation | 2013-03-22 19:06:09 |
Last modified on | 2013-03-22 19:06:09 |
Owner | CWoo (3771) |
Last modified by | CWoo (3771) |
Numerical id | 20 |
Author | CWoo (3771) |
Entry type | Feature |
Classification | msc 94A60 |
Classification | msc 68Q05 |
Classification | msc 68Q45 |
Classification | msc 03D20 |