# encoding words

Let $\Sigma$ be an alphabet, and $\Sigma^{*}$ the set of all words over $\Sigma$. An encoding of words over $\Sigma$ is, loosely speaking, an assignment of words to numbers such that the numbers uniquely identify the words.

Definition. An encoding for a language $L\subseteq\Sigma^{*}$ is a one-to-one function $E:L\to\mathbb{N}$.

If $L$ is finite, then it is easy to find an encoding for $L$. We are interested mainly in encoding infinite sets. By the definition above, $L$ can not be encoded if it is uncountable. We can therefore limit the discussion to $\Sigma$ that is at most countably infinite by listing some common methods of encoding $L$.

Among the methods liststed, $\Sigma$ is an enumerated set $\{a_{1},a_{2},\ldots\}$. In the first method, $\Sigma$ is assumed to be finite, and countably infinite in the last three. In addition, $L$ is assumed to be $\Sigma^{*}$ in the first three methods.

Method 1. First, set $E(a_{i}):=i$. In addition, for the empty word $\lambda$, we set $E(\lambda):=0$. Next, inductively define $E(w)$ on the length of $w$. Suppose now that $E(w)$ has been defiend. Set

 $E(wa):=nE(w)+E(a),\mbox{ where }a\in\Sigma.$

It is easy to see that if any non-empty word $w\in A$, with $w=b_{1}\cdots b_{m}$, where $b_{i}\in\Sigma$. Then

 $E(w)=E(b_{1})n^{m-1}+\cdots+E(b_{m}).$

Then $E$ is an encoding of $L$. This is really just the base-$n$ digital representation of integers, where each $a_{i}$ can be thought of as digits used by the representation. The only difference here is that $0$ is not used as a “digit” (every letter gets mapped to a positive integer), except when the word is empty.

For example, let $\Sigma=\{0,1,\ldots,9\}$. Then the words $01,001$, and $10$ have code numbers $12,112$, and $21$.

It is easy to see that the encoding is one-to-one (see proof here (http://planetmath.org/UniquenessOfDigitalRepresentation)).

Method 2. Pick three positive number $p,q,r$ such that $p,q$ are coprime, with $p>1$. Set $E(a_{i}):=p^{i}$ and $E(\lambda):=1$. Inductively define $E(w)$ on length of $w$. Suppose now that $E(w)$ has been defined. Then

 $E(wa):=(rE(w)+q)E(a),\mbox{ where }a\in\Sigma.$

For example, $E(a_{2}a_{5}a_{3})=(r(rp^{2}+q)p^{5}+q)p^{3}=r^{2}p^{10}+rqp^{8}+qp^{3}$.

To see that $E$ is injective, we make the following series of observations:

1. 1.

$E$ is injective on $\Sigma$. In addition, either $E(a)|E(b)$ or $E(b)|E(a)$ for any $a,b\in\Sigma$.

2. 2.

$p|E(w)$ iff $w\neq\lambda$.

3. 3.

If $E(w)=E(a)$ for some $a\in\Sigma$, then $w=a$. First, note that $w\neq 1$, and if $w\in\Sigma$, then $w=a$. So suppose $w=vb$, with $b\in\Sigma$ and $v\neq\lambda$. Then $(rE(v)+q)E(b)=E(a)$. If $E(b)|E(a)$, then $rE(v)+q=p^{i}$. Since $E(v)>1$, $i\neq 0$. But if $i>0$, $p$ and $q$ would not be coprime as $p|E(v)$. On the other hand, if $E(a)|E(b)$, then $(E(v)+q)p^{j}=1$, again impossible. So $w$ must be a letter, and therefore is $a$.

4. 4.

Now, suppose $E(w)=E(v)$, and $E(a)|E(b)$, where $a,b$ are the rightmost letters of $w,v$ respectively. By the same argument as previously, $a=b$, so we may cancel the letters, leaving us with the equation $E(w^{\prime})=E(v^{\prime})$, where $w=w^{\prime}a$ and $v=v^{\prime}b$. Continue the process of canceling the last letters in pairs, we end up with $E(u)=E(c)$ for some letter $c\in\Sigma$. So $u=c$. This shows that $w=v$.

A variation of this method is to set $E(aw):=(rE(w)+q)E(a)$.

If we set $p=2$ and $r=q=1$, then the range of $E$ is the set of all positive integers.

Method 3. The third method utilizes the uniqueness of prime decomposition of integers. First, define $f:\Sigma\to\mathbb{N}$ by $f(a_{i})=i$. Then, for any $w=b_{1}\cdots b_{m}$, with $b_{i}\in\Sigma$, define

 $E(w):=p_{1}^{f(b_{1})}\cdots p_{m}^{f(b_{m})}=\prod_{i=1}^{m}p_{i}^{f(b_{i})},$

where $p_{i}$ is the $i$-th prime number (for example, $p_{1}=2$). We again set $E(\lambda):=1$. By the fundamental theorem of arithmetic, and the fact that $f$ is a bijection, $E$ is injective (and maps onto the set of positive integers). This method is known as the multiplicative encoding of Gödel.

Method 4. Once an encoding $E$ is found for $\Sigma^{*}$, an encoding for $L\subseteq\Sigma^{*}$ can be obtained by restricting the domain of $E$ to $L$. Depending on how $L$ is defined, other methods of encoding $L$ via $E$ are possible. We illustrate one example.

Let $L=L_{1}\cup L_{2}\Sigma^{*}$, where $L_{1},L_{2}$ are disjoint non-empty finite sets not containing the empty word. Encode $L$ as follows: suppose $L_{1}=\{v_{1},\ldots,v_{m}\}$ and $L_{2}=\{w_{1},\ldots,w_{n}\}$. Define $E^{\prime}:L\to\mathbb{N}$ such that:

1. 1.

$E^{\prime}(v_{i}):=10^{i-1}$ and $E^{\prime}(w_{j}):=10^{m+j-1}$, where $i\in\{1,\ldots,m\}$ and $j\in\{1,\ldots,n\}$;

2. 2.

$E^{\prime}(w):=E^{\prime}(w_{j})E(u)10^{m+n-1}$, where $w=w_{j}u$, and $\lambda\neq u\in\Sigma^{*}$.

Essentially, the first $m+n$ digits are reserved for encoding words $v_{i}$ and $w_{j}$. $E^{\prime}$ is easily seen to be injective.

Method 5. Let $L_{2}$ be the language consisting of all words of length $2$. Define $E_{2}:L_{2}\to\mathbb{N}$ by $E_{2}(a_{i}a_{j}):=J(i,j)$, where $J$ is a pairing function that codes pairs of positive integers. Since $J$ is an injection (actually maps onto the set of positive integers), so is $E_{2}$. Using $J$, one can encode the language $L_{3}$ of all length $3$ words. Define $E_{3}:L_{2}\to\mathbb{N}$ by $E_{3}(a_{i}a_{j}a_{k}):=J(i,J(j,k))$. Again, $E_{3}$ is an injection. By induction, one can encode the language $L_{n}$ of all words of length $n$, for any positive integer $n$.

Method 6. Let $L(n)$ be the language consisting of all words of length at most $n$. We can utilize Method 5 to code $L$. First, let $\Sigma_{1}:=\Sigma\cup\{a_{0}\}$, where $a_{0}$ is a letter not in $\Sigma$. Define $\phi:L(n)\to\Sigma_{1}^{*}$ by $\phi(w):=a_{0}^{n-|w|}w$, where $|w|$ is the length of $w$. Then $\phi(L)\subseteq L_{n}$, the language of all length $n$ words over $\Sigma_{1}^{*}$. It is easy to see that $\phi$ is one-to-one. Then $E(n):=E_{n}\circ\phi$ is an encoding for $L$, where $E_{n}$ is defined in Method 5 that encodes $L_{n}$, via the modified version of the pairing function $J^{\prime}(i,j):=J(i+1,j+1)$, where $i,j\geq 0$.

Method 7. Can Method 5 be used to encode $\Sigma^{*}$? The answer is yes. However, a direct extension of $E_{n}$ does not work. By this we mean that $E:\Sigma^{*}\to\mathbb{N}$, given by $E(w)=E_{n}(w)$ where $|w|=n$, though a function, is not injective. For any positive integer $m$, there is a word $w_{n}$ of length $n$ for every $n>0$, such that $E_{n}(w_{n})=m$. Instead, define $E$ so that $E(w):=E_{2}(|w|,E_{|w|}(w))$ if $w\neq\lambda$, and $E(\lambda):=0$. It is easy to see that $E$ is injective, since both $E_{2}$ and $E_{n}$ are (in fact, $E$ is a bijection).

Remark. An encoding $E$ for $L$ can be thought of as a partial function from $\Sigma^{*}$ to $\mathbb{N}$, whose domain is $L\subseteq\Sigma^{*}$. $E$ is said to be effective if $E(L)$ is a recursive set. Equivalently, the partial function $E^{*}$ on $\Sigma^{*}$ given by

 $E^{*}(w)=\left\{\begin{array}[]{ll}a_{1}^{E(w)}&\textrm{if }w\in L,\\ \textrm{undefined}&\textrm{otherwise}.\end{array}\right.$

is Turing-computable. An enumeration of $\Sigma$ can be thought of as an encoding for $\Sigma$. If $\Sigma$ is finite, any enumeration of $\Sigma$ is effective. Assume that $\Sigma$ is effectively enumerated, whether or not $\Sigma$ is finite (so that $a_{1}$ in the definition of $E^{*}$ can be effectively chosen). Then it is not hard to see that all of the encodings described above are effective. In fact, all of the sets $E(L)$ described are primitive recursive.

Title encoding words EncodingWords 2013-03-22 19:06:09 2013-03-22 19:06:09 CWoo (3771) CWoo (3771) 20 CWoo (3771) Feature msc 94A60 msc 68Q05 msc 68Q45 msc 03D20