Context Free Grammars
A context-free grammar (CFG) is a precise description of language syntax. Each rule or production of the grammar defines an interpretation for the named symbol on the left side of the rule as a sequence of symbols on the right side of the rule. The right side can be a combination of nonterminals (themselves defined by rules) or terminal symbols defined simply as strings, such as "the", "a", "cat", "milk", and "drank".
Parsing a text string \(S\) according to a given context-free grammar \(G\) is the algorithm problem of constructing a parse tree of rule substitutions defining \(S\) as a single nonterminal symbol of \(G\). One of the nonterminals, usually the one on the left-hand side of the first production, is called the start symbol. It names the construct defined by the overall grammar.
Example
Sentence ::= NounPhrase VerbPhrase NounPhrase ::= Article Noun VerbPhrase ::= Verb NounPhrase Article ::= the, a Noun ::= cat, milk Verb ::= drank
Chomsky Normal Form
Chomsky normal form (CNF) requires the right side of every rule to consist of
- exactly two nonterminals, i.e. \(X -> YZ\), or
- exactly one terminal symbol, i.e. \(X -> \alpha\).
Any context-free grammar can be easily and mechanically transformed into Chomsky normal form by repeatedly shortening long right-hand sides at the cost of adding extra nonterminals and productions. Thus, there is no loss of generality with this assumption.
Bottom-Up Parsing
The CYK algorithm efficiently parses a string using dynamic programming. The algorithm only works on grammars in Chomsky normal form. This section is taken from The Algorithm Design Manual by Steven S. Skiena.
Define \(M[i, j, X]\) to be a boolean function that is true iff substring \(S[i, j]\) is generated by nonterminal \(X\). This is true if there exists a production \(X -> YZ\) and breaking point \(k\) between \(i\) and \(j\) such that the left part generates \(Y\) and the right part \(Z\). In other words,
\begin{equation} M[i, j, X] = \bigvee\limits_{(X -> YZ) \in G} \Big( \bigvee\limits_{k=i}^j M[i, k, Y] \wedge M[k + 1, j, Z] \Big) \end{equation}Where \(\vee\) denotes the logical or over all productions and split positions, and \(\wedge\) denotes the logical and of two boolean values.
The one-character terminal symbols define the boundary conditions of the recurrence. In particular, \(M[i, i, X]\) is true iff there exists a production \(X -> \alpha\) such that \(S[i] = \alpha\).
TODO Example
TODO Implementation
TODO Top-Down Parsing
TODO: LL parser, LL(1)