encoding words
Let $\mathrm{\Sigma}$ be an alphabet^{}, and ${\mathrm{\Sigma}}^{*}$ the set of all words over $\mathrm{\Sigma}$. An encoding of words over $\mathrm{\Sigma}$ is, loosely speaking, an assignment of words to numbers such that the numbers uniquely identify the words.
Definition. An encoding for a language^{} $L\subseteq {\mathrm{\Sigma}}^{*}$ is a onetoone function $E:L\to \mathbb{N}$.
If $L$ is finite, then it is easy to find an encoding for $L$. We are interested mainly in encoding infinite sets^{}. By the definition above, $L$ can not be encoded if it is uncountable. We can therefore limit the discussion to $\mathrm{\Sigma}$ that is at most countably infinite^{} by listing some common methods of encoding $L$.
Among the methods liststed, $\mathrm{\Sigma}$ is an enumerated set $\{{a}_{1},{a}_{2},\mathrm{\dots}\}$. In the first method, $\mathrm{\Sigma}$ is assumed to be finite, and countably infinite in the last three. In addition^{}, $L$ is assumed to be ${\mathrm{\Sigma}}^{*}$ in the first three methods.
Method 1. First, set $E({a}_{i}):=i$. In addition, for the empty word^{} $\lambda $, we set $E(\lambda ):=0$. Next, inductively define $E(w)$ on the length of $w$. Suppose now that $E(w)$ has been defiend. Set
$$E(wa):=nE(w)+E(a),\text{where}a\in \mathrm{\Sigma}.$$ 
It is easy to see that if any nonempty word $w\in A$, with $w={b}_{1}\mathrm{\cdots}{b}_{m}$, where ${b}_{i}\in \mathrm{\Sigma}$. Then
$$E(w)=E({b}_{1}){n}^{m1}+\mathrm{\cdots}+E({b}_{m}).$$ 
Then $E$ is an encoding of $L$. This is really just the base$n$ digital representation of integers, where each ${a}_{i}$ can be thought of as digits used by the representation^{}. The only difference^{} here is that $0$ is not used as a “digit” (every letter gets mapped to a positive integer), except when the word is empty.
For example, let $\mathrm{\Sigma}=\{0,1,\mathrm{\dots},9\}$. Then the words $01,001$, and $10$ have code numbers $12,112$, and $21$.
It is easy to see that the encoding is onetoone (see proof here (http://planetmath.org/UniquenessOfDigitalRepresentation)).
Method 2. Pick three positive number $p,q,r$ such that $p,q$ are coprime^{}, with $p>1$. Set $E({a}_{i}):={p}^{i}$ and $E(\lambda ):=1$. Inductively define $E(w)$ on length of $w$. Suppose now that $E(w)$ has been defined. Then
$$E(wa):=(rE(w)+q)E(a),\text{where}a\in \mathrm{\Sigma}.$$ 
For example, $E({a}_{2}{a}_{5}{a}_{3})=(r(r{p}^{2}+q){p}^{5}+q){p}^{3}={r}^{2}{p}^{10}+rq{p}^{8}+q{p}^{3}$.
To see that $E$ is injective, we make the following series of observations:

1.
$E$ is injective on $\mathrm{\Sigma}$. In addition, either $E(a)E(b)$ or $E(b)E(a)$ for any $a,b\in \mathrm{\Sigma}$.

2.
$pE(w)$ iff $w\ne \lambda $.

3.
If $E(w)=E(a)$ for some $a\in \mathrm{\Sigma}$, then $w=a$. First, note that $w\ne 1$, and if $w\in \mathrm{\Sigma}$, then $w=a$. So suppose $w=vb$, with $b\in \mathrm{\Sigma}$ and $v\ne \lambda $. Then $(rE(v)+q)E(b)=E(a)$. If $E(b)E(a)$, then $rE(v)+q={p}^{i}$. Since $E(v)>1$, $i\ne 0$. But if $i>0$, $p$ and $q$ would not be coprime as $pE(v)$. On the other hand, if $E(a)E(b)$, then $(E(v)+q){p}^{j}=1$, again impossible. So $w$ must be a letter, and therefore is $a$.

4.
Now, suppose $E(w)=E(v)$, and $E(a)E(b)$, where $a,b$ are the rightmost letters of $w,v$ respectively. By the same argument as previously, $a=b$, so we may cancel the letters, leaving us with the equation $E({w}^{\prime})=E({v}^{\prime})$, where $w={w}^{\prime}a$ and $v={v}^{\prime}b$. Continue the process of canceling the last letters in pairs, we end up with $E(u)=E(c)$ for some letter $c\in \mathrm{\Sigma}$. So $u=c$. This shows that $w=v$.
A variation of this method is to set $E(aw):=(rE(w)+q)E(a)$.
If we set $p=2$ and $r=q=1$, then the range of $E$ is the set of all positive integers.
Method 3. The third method utilizes the uniqueness of prime decomposition of integers. First, define $f:\mathrm{\Sigma}\to \mathbb{N}$ by $f({a}_{i})=i$. Then, for any $w={b}_{1}\mathrm{\cdots}{b}_{m}$, with ${b}_{i}\in \mathrm{\Sigma}$, define
$$E(w):={p}_{1}^{f({b}_{1})}\mathrm{\cdots}{p}_{m}^{f({b}_{m})}=\prod _{i=1}^{m}{p}_{i}^{f({b}_{i})},$$ 
where ${p}_{i}$ is the $i$th prime number^{} (for example, ${p}_{1}=2$). We again set $E(\lambda ):=1$. By the fundamental theorem of arithmetic^{}, and the fact that $f$ is a bijection, $E$ is injective (and maps onto the set of positive integers). This method is known as the multiplicative encoding of Gödel.
Method 4. Once an encoding $E$ is found for ${\mathrm{\Sigma}}^{*}$, an encoding for $L\subseteq {\mathrm{\Sigma}}^{*}$ can be obtained by restricting the domain of $E$ to $L$. Depending on how $L$ is defined, other methods of encoding $L$ via $E$ are possible. We illustrate one example.
Let $L={L}_{1}\cup {L}_{2}{\mathrm{\Sigma}}^{*}$, where ${L}_{1},{L}_{2}$ are disjoint nonempty finite sets^{} not containing the empty word. Encode $L$ as follows: suppose ${L}_{1}=\{{v}_{1},\mathrm{\dots},{v}_{m}\}$ and ${L}_{2}=\{{w}_{1},\mathrm{\dots},{w}_{n}\}$. Define ${E}^{\prime}:L\to \mathbb{N}$ such that:

1.
${E}^{\prime}({v}_{i}):={10}^{i1}$ and ${E}^{\prime}({w}_{j}):={10}^{m+j1}$, where $i\in \{1,\mathrm{\dots},m\}$ and $j\in \{1,\mathrm{\dots},n\}$;

2.
${E}^{\prime}(w):={E}^{\prime}({w}_{j})E(u){10}^{m+n1}$, where $w={w}_{j}u$, and $\lambda \ne u\in {\mathrm{\Sigma}}^{*}$.
Essentially, the first $m+n$ digits are reserved for encoding words ${v}_{i}$ and ${w}_{j}$. ${E}^{\prime}$ is easily seen to be injective.
Method 5. Let ${L}_{2}$ be the language consisting of all words of length $2$. Define ${E}_{2}:{L}_{2}\to \mathbb{N}$ by ${E}_{2}({a}_{i}{a}_{j}):=J(i,j)$, where $J$ is a pairing function^{} that codes pairs of positive integers. Since $J$ is an injection (actually maps onto the set of positive integers), so is ${E}_{2}$. Using $J$, one can encode the language ${L}_{3}$ of all length $3$ words. Define ${E}_{3}:{L}_{2}\to \mathbb{N}$ by ${E}_{3}({a}_{i}{a}_{j}{a}_{k}):=J(i,J(j,k))$. Again, ${E}_{3}$ is an injection. By induction^{}, one can encode the language ${L}_{n}$ of all words of length $n$, for any positive integer $n$.
Method 6. Let $L(n)$ be the language consisting of all words of length at most $n$. We can utilize Method 5 to code $L$. First, let ${\mathrm{\Sigma}}_{1}:=\mathrm{\Sigma}\cup \{{a}_{0}\}$, where ${a}_{0}$ is a letter not in $\mathrm{\Sigma}$. Define $\varphi :L(n)\to {\mathrm{\Sigma}}_{1}^{*}$ by $\varphi (w):={a}_{0}^{nw}w$, where $w$ is the length of $w$. Then $\varphi (L)\subseteq {L}_{n}$, the language of all length $n$ words over ${\mathrm{\Sigma}}_{1}^{*}$. It is easy to see that $\varphi $ is onetoone. Then $E(n):={E}_{n}\circ \varphi $ is an encoding for $L$, where ${E}_{n}$ is defined in Method 5 that encodes ${L}_{n}$, via the modified version of the pairing function ${J}^{\prime}(i,j):=J(i+1,j+1)$, where $i,j\ge 0$.
Method 7. Can Method 5 be used to encode ${\mathrm{\Sigma}}^{*}$? The answer is yes. However, a direct extension^{} of ${E}_{n}$ does not work. By this we mean that $E:{\mathrm{\Sigma}}^{*}\to \mathbb{N}$, given by $E(w)={E}_{n}(w)$ where $w=n$, though a function, is not injective. For any positive integer $m$, there is a word ${w}_{n}$ of length $n$ for every $n>0$, such that ${E}_{n}({w}_{n})=m$. Instead, define $E$ so that $E(w):={E}_{2}(w,{E}_{w}(w))$ if $w\ne \lambda $, and $E(\lambda ):=0$. It is easy to see that $E$ is injective, since both ${E}_{2}$ and ${E}_{n}$ are (in fact, $E$ is a bijection).
Remark. An encoding $E$ for $L$ can be thought of as a partial function^{} from ${\mathrm{\Sigma}}^{*}$ to $\mathbb{N}$, whose domain is $L\subseteq {\mathrm{\Sigma}}^{*}$. $E$ is said to be effective if $E(L)$ is a recursive set^{}. Equivalently, the partial function ${E}^{*}$ on ${\mathrm{\Sigma}}^{*}$ given by
$${E}^{*}(w)=\{\begin{array}{cc}{a}_{1}^{E(w)}\hfill & \text{if}w\in L,\hfill \\ \text{undefined}\hfill & \text{otherwise}.\hfill \end{array}$$ 
is Turingcomputable. An enumeration of $\mathrm{\Sigma}$ can be thought of as an encoding for $\mathrm{\Sigma}$. If $\mathrm{\Sigma}$ is finite, any enumeration of $\mathrm{\Sigma}$ is effective. Assume that $\mathrm{\Sigma}$ is effectively enumerated, whether or not $\mathrm{\Sigma}$ is finite (so that ${a}_{1}$ in the definition of ${E}^{*}$ can be effectively chosen). Then it is not hard to see that all of the encodings described above are effective. In fact, all of the sets $E(L)$ described are primitive recursive.
Title  encoding words 

Canonical name  EncodingWords 
Date of creation  20130322 19:06:09 
Last modified on  20130322 19:06:09 
Owner  CWoo (3771) 
Last modified by  CWoo (3771) 
Numerical id  20 
Author  CWoo (3771) 
Entry type  Feature 
Classification  msc 94A60 
Classification  msc 68Q05 
Classification  msc 68Q45 
Classification  msc 03D20 