Context Free Grammars

A context-free grammar (CFG) is a precise description of language syntax. Each rule or production of the grammar defines an interpretation for the named symbol on the left side of the rule as a sequence of symbols on the right side of the rule. The right side can be a combination of nonterminals (themselves defined by rules) or terminal symbols defined simply as strings, such as "the", "a", "cat", "milk", and "drank".

Parsing a text string \(S\) according to a given context-free grammar \(G\) is the algorithm problem of constructing a parse tree of rule substitutions defining \(S\) as a single nonterminal symbol of \(G\). One of the nonterminals, usually the one on the left-hand side of the first production, is called the start symbol. It names the construct defined by the overall grammar.

Example

Sentence   ::= NounPhrase VerbPhrase
NounPhrase ::= Article Noun
VerbPhrase ::= Verb NounPhrase
Article    ::= the, a
Noun       ::= cat, milk
Verb       ::= drank

cfg.png

Chomsky Normal Form

Chomsky normal form (CNF) requires the right side of every rule to consist of

  • exactly two nonterminals, i.e. \(X -> YZ\), or
  • exactly one terminal symbol, i.e. \(X -> \alpha\).

Any context-free grammar can be easily and mechanically transformed into Chomsky normal form by repeatedly shortening long right-hand sides at the cost of adding extra nonterminals and productions. Thus, there is no loss of generality with this assumption.

Bottom-Up Parsing

The CYK algorithm efficiently parses a string using dynamic programming. The algorithm only works on grammars in Chomsky normal form. This section is taken from The Algorithm Design Manual by Steven S. Skiena.

Define \(M[i, j, X]\) to be a boolean function that is true iff substring \(S[i, j]\) is generated by nonterminal \(X\). This is true if there exists a production \(X -> YZ\) and breaking point \(k\) between \(i\) and \(j\) such that the left part generates \(Y\) and the right part \(Z\). In other words,

\begin{equation} M[i, j, X] = \bigvee\limits_{(X -> YZ) \in G} \Big( \bigvee\limits_{k=i}^j M[i, k, Y] \wedge M[k + 1, j, Z] \Big) \end{equation}

Where \(\vee\) denotes the logical or over all productions and split positions, and \(\wedge\) denotes the logical and of two boolean values.

The one-character terminal symbols define the boundary conditions of the recurrence. In particular, \(M[i, i, X]\) is true iff there exists a production \(X -> \alpha\) such that \(S[i] = \alpha\).

TODO Example

TODO Implementation

TODO Top-Down Parsing

TODO: LL parser, LL(1)