# regular expression

While variations abound, fundamentally a regular expression consists of the following pieces:

• Parentheses can be used for grouping and nesting, and must contain a fully-formed regular expression.

• The | symbol can be used for denoting alternatives. Some specifications do not provide nesting or alternatives.

• There are also a number of postfix operators. The ? operator means that the preceding element can either be present or non-present, and corresponds to a rule of the form $A\to B\,|\,\lambda$.

• The * operator means that the preceding element can be present zero or more times, and corresponds to a rule of the form $A\to BA\,|\,\lambda$.

• The + operator means that the preceding element can be present one or more times, and corresponds to a rule of the form $A\to BA\,|\,B$.

Formally, let $S=\{\varnothing,\cup,^{*},(,)\}$ and $\Sigma$ an alphabet disjoint from $S$. Consider the language  $L(\Sigma)$ over $\Sigma\cup S$ specified below

1. 1.

$\varnothing\in L(\Sigma)$,

2. 2.

$a\in L(\Sigma)$ for each $a\in\Sigma$,

3. 3.

if $u\in L(\Sigma)$, then $u^{*}\in L(\Sigma)$,

4. 4.

if $u_{1},u_{2}\in L(\Sigma)$, then $(u_{1}\cup u_{2})$ and $(u_{1}u_{2})$ are both in $L(\Sigma)$, and

5. 5.

among all languages over $\Sigma\cup S$ satisfying conditions 1-4, $L(\Sigma)$ is the smallest.

Then any element $u\in L(\Sigma)$ is called a regular expression over $\Sigma$.

 $(0^{*}(1(01^{*}0)^{*}1)^{*})^{*}0^{*}$

This specifies the context-free grammar (in BNF):

 $\displaystyle S$ $\displaystyle\verb.::=.$ $\displaystyle AB$ $\displaystyle A$ $\displaystyle\verb.::=.$ $\displaystyle CD$ $\displaystyle B$ $\displaystyle\verb.::=.$ $\displaystyle\verb=0=B|\lambda$ $\displaystyle C$ $\displaystyle\verb.::=.$ $\displaystyle\verb=0=C|\lambda$ $\displaystyle D$ $\displaystyle\verb.::=.$ $\displaystyle\verb=1=E\verb=1=$ $\displaystyle E$ $\displaystyle\verb.::=.$ $\displaystyle FE|\lambda$ $\displaystyle F$ $\displaystyle\verb.::=.$ $\displaystyle\verb=0=G\verb=0=$ $\displaystyle G$ $\displaystyle\verb.::=.$ $\displaystyle\verb=1=G|\lambda$

One can understand the language described by a regular expression in another way, by viewing the regular expression operators as shorthand for various set-theoretic operations  . Formally, the language $L(u)$ over $\Sigma$ associated with a regular expression $u$ over $\Sigma$ is inductively defined as follows:

A language $L$ over $\Sigma$ is regular iff there is a regular expression $u$ over $\Sigma$ such that $L=L(u)$.

With this interpretation   , it is quite straightforward to design a non-deterministic finite automaton that recognizes the language described by a regular expression. Of course, for computer implementations, one must transform this into a deterministic finite automaton, but there are various algorithms  for doing this efficiently. This process, production of a non-deterministic automaton and conversion to an equivalent deterministic automaton is approximately what is done in software packages implementing regular expression searching. In fact, most such packages implement operations impossible for a finite automaton, such as requiring a later part of the string to be the same as a previous part (the language $\{A^{n}BA^{n}\text{ for }n\geq 0\}$ is not regular but can be matched by most “regular expression” software; such capabilities are called “extended regular expressions”. None of these systems are powerful enough to recognize the language of balanced parentheses.

Regular expressions have many applications. Quite often they are used for powerful string matching and substitution features in many text editors and programming languages.

Title regular expression RegularExpression 2013-03-22 12:26:56 2013-03-22 12:26:56 CWoo (3771) CWoo (3771) 17 CWoo (3771) Definition msc 20M35 msc 68Q70 RegularLanguage KleeneStar KleeneAlgebra