While variations abound, fundamentally a regular expression consists of the following pieces:
Parentheses can be used for grouping and nesting, and must contain a fully-formed regular expression.
|symbol can be used for denoting alternatives. Some specifications do not provide nesting or alternatives.
*operator means that the preceding element can be present zero or more times, and corresponds to a rule of the form .
+operator means that the preceding element can be present one or more times, and corresponds to a rule of the form .
Note that while these rules are not immediately in regular form, they can be transformed so that they are.
for each ,
if , then ,
if , then and are both in , and
among all languages over satisfying conditions 1-4, is the smallest.
Then any element is called a regular expression over .
This specifies the context-free grammar (in BNF):
A little further work is required to transform this grammar into an acceptable form for regular grammars, but it can be shown that this grammar (and any grammar specified by a regular expression) is equivalent to some regular grammar.
One can understand the language described by a regular expression in another way, by viewing the regular expression operators as shorthand for various set-theoretic operations. Formally, the language over associated with a regular expression over is inductively defined as follows:
A language over is regular iff there is a regular expression over such that .
With this interpretation, it is quite straightforward to design a non-deterministic finite automaton that recognizes the language described by a regular expression. Of course, for computer implementations, one must transform this into a deterministic finite automaton, but there are various algorithms for doing this efficiently. This process, production of a non-deterministic automaton and conversion to an equivalent deterministic automaton is approximately what is done in software packages implementing regular expression searching. In fact, most such packages implement operations impossible for a finite automaton, such as requiring a later part of the string to be the same as a previous part (the language is not regular but can be matched by most “regular expression” software; such capabilities are called “extended regular expressions”. None of these systems are powerful enough to recognize the language of balanced parentheses.
|Date of creation||2013-03-22 12:26:56|
|Last modified on||2013-03-22 12:26:56|
|Last modified by||CWoo (3771)|