LR(k)

Given a word $u$ and a context-free grammar $G$, how do we determine if $u\in L(G)$?

One way is to look for any subword $v$ of $u$ such that there is a production $A\to v$. If this is successful, we may replace $v$ with $A$ in $u$ to obtain a word $w$, so that $w\Rightarrow u$. We may then repeat the process on $w$ to obtain another word $x$ such that $x\Rightarrow w$ (if successful). In the end, if everything works successfully, we arrive at the starting non-terminal symbol $\sigma$, and get a derivation $\sigma\Rightarrow^{*}u$ as a result, so that $u\in L(G)$. This procedure is known as the bottom-up parsing of the word $u$.

In general, unless one is very lucky, successfully finding a derivation $\sigma\Rightarrow^{*}u$ requires many trials and errors, since at each stage, for a given word $u$, there may be several words $w$ such that $w\Rightarrow u$.

Nevertheless, there is a particular family of context-free grammars, called the $LR(k)$ grammars, which make the bottom-up parsing described above straightforward in the sense that, given a word $u$, once a word $w$ is found such that $w\Rightarrow_{R}u$, any other word $w^{\prime}$ such that $w^{\prime}\Rightarrow_{R}u$ forces $w^{\prime}=w$. Here, $\Rightarrow_{R}$ is known as the rightmost derivation (meaning that $u$ is obtained from $w$ by replacing the rightmost non-terminal in $w$). The $L$ in $LR(k)$ means scanning the symbols of $u$ from left to right, $R$ stands for finding a rightmost derivation for $u$, and $k$ means having the allowance to look at up to $k$ symbols ahead while scanning.

The details are as follows:

Definition. Let $G=(\Sigma,N,P,\sigma)$ be a context-free grammar such that $\sigma\to\sigma$ is not a production of $G$, and $k\geq 0$ an integer. Suppose $U$ is any sentential form over $\Sigma$ with the following setup: $U=U_{1}U_{2}U_{3}$ where

• $U_{3}$ is a terminal word,

• $X\to U_{2}$ a production, and

• $\sigma\Rightarrow_{R}^{*}U_{1}XU_{3}\Rightarrow_{R}U$.

Let $n=|U_{1}U_{2}|+k$, and $Z$ the prefix of $U$ of length $n$ (if $|U|, then set $Z=U$).

Then $G$ is said to be $LR(k)$ if $W$ is another sentential form having $Z$ as a prefix, with the following setup: $W=W_{1}W_{2}W_{3}$, where

• $W_{3}$ is a terminal,

• $Y\to W_{2}$ is a production, and

• $\sigma\Rightarrow_{R}^{*}W_{1}YW_{3}\Rightarrow_{R}W$,

implies that

 $W_{1}=U_{1},\qquad Y=X,\qquad\mbox{and}\qquad W_{2}=U_{2}.$

Simply put, if $D_{U}$ and $D_{W}$ are the rightmost derivations of $U$ and $W$ respectively, and if the prefix of $U$ obtained by including $k$ symbols beyond the last replacement in $D_{U}$ is also a prefix of $W$, then the prefix of $U^{\prime}$ obtained by including $k$ symbols beyond the last replacement in $D_{U}$ is also a prefix of $W^{\prime}$, where $U^{\prime}$ and $W^{\prime}$ are words at the next to the last step in $D_{U}$ and $D_{W}$ respectively. In particular, if $U=W$, then $U^{\prime}=W^{\prime}$. This implies that any derivable in an $LR(k)$ grammar has a unique rightmost derivation, hence

Proposition 1.

Any $LR(k)$ grammar is unambiguous.

Examples.

• Let $G$ be the grammar consisting of one non-terminal symbol $\sigma$ (which is also the final non-terminal symbol), two terminal symbols $a,b$, with productions

 $\sigma\to a\sigma b,\qquad\sigma\to\sigma b\qquad\mbox{and}\qquad\sigma\to b.$

Then $G$ is not $LR(k)$ for any $k\geq 0$. For instance, look at the following two derivations of $U=a^{2}\sigma b^{3}$:

 $\sigma\Rightarrow^{*}a\sigma b^{2}\Rightarrow a^{2}\sigma b^{3}\qquad\mbox{and% }\qquad\sigma\Rightarrow^{*}a^{2}\sigma b^{2}\Rightarrow a^{2}\sigma b^{3}$

Here, $U_{1}=a$, $U_{2}=\sigma b$. Let $k=1$. Then the criteria in the definition are satisfied. Yet, $W_{1}=a^{2}\neq U_{1}$. Therefore, $G$ is not $LR(1)$.

• Note that the grammar $G$ above generates the language $L=\{a^{m}b^{n}\mid n>m\}$, which can also be generated by the grammar with three non-terminal symbols $\sigma,X,Y$, with $\sigma$ the final non-terminal symbol, where the productions are given by

 $\sigma\to XY,\quad X\to aXb,\quad X\to\lambda,\quad Y\to Yb,\quad\mbox{and}% \quad Y\to b.$

However, this grammar is $LR(1)$.

Determining whether a context-free grammar is $LR(k)$ is a non-trivial problem. Nevertheless, an algorithm exists for determining, given a context-free grammar $G$ and a non-negative integer $k$, whether $G$ is $LR(k)$. On the other hand, without specifying $k$ in advance, no algorithms exist that determine if $G$ is $LR(k)$ for some $k$.

Definition. A language is said to be $LR(k)$ if it can be generated by an $LR(k)$ grammar.

Theorem 1.

Every $LR(k)$ language is deterministic context-free. Every deterministic context-free language is $LR(1)$.

Hence, deterministic context-free languages are the same as $LR(1)$ languages.

References

• 1 A. Salomaa, Formal Languages, Academic Press, New York (1973).
• 2 J.E. Hopcroft, J.D. Ullman, Formal Languages and Their Relation to Automata, Addison-Wesley, (1969).
Title LR(k) LRk 2013-03-22 19:00:31 2013-03-22 19:00:31 CWoo (3771) CWoo (3771) 11 CWoo (3771) Definition msc 03D10 msc 68Q42 msc 68Q05 $LR(k)$ LLk