# Context Free Grammars

A context-free grammar (CFG) is a precise description of language syntax. Each rule or production of the grammar defines an interpretation for the named symbol on the left side of the rule as a sequence of symbols on the right side of the rule. The right side can be a combination of nonterminals (themselves defined by rules) or terminal symbols defined simply as strings, such as "the", "a", "cat", "milk", and "drank".

Parsing a text string $$S$$ according to a given context-free grammar $$G$$ is the algorithm problem of constructing a parse tree of rule substitutions defining $$S$$ as a single nonterminal symbol of $$G$$. One of the nonterminals, usually the one on the left-hand side of the first production, is called the start symbol. It names the construct defined by the overall grammar.

## Example

Sentence   ::= NounPhrase VerbPhrase
NounPhrase ::= Article Noun
VerbPhrase ::= Verb NounPhrase
Article    ::= the, a
Noun       ::= cat, milk
Verb       ::= drank


## Chomsky Normal Form

Chomsky normal form (CNF) requires the right side of every rule to consist of

• exactly two nonterminals, i.e. $$X -> YZ$$, or
• exactly one terminal symbol, i.e. $$X -> \alpha$$.

Any context-free grammar can be easily and mechanically transformed into Chomsky normal form by repeatedly shortening long right-hand sides at the cost of adding extra nonterminals and productions. Thus, there is no loss of generality with this assumption.

## Bottom-Up Parsing

The CYK algorithm efficiently parses a string using dynamic programming. The algorithm only works on grammars in Chomsky normal form. This section is taken from The Algorithm Design Manual by Steven S. Skiena.

Define $$M[i, j, X]$$ to be a boolean function that is true iff substring $$S[i, j]$$ is generated by nonterminal $$X$$. This is true if there exists a production $$X -> YZ$$ and breaking point $$k$$ between $$i$$ and $$j$$ such that the left part generates $$Y$$ and the right part $$Z$$. In other words,

$$M[i, j, X] = \bigvee\limits_{(X -> YZ) \in G} \Big( \bigvee\limits_{k=i}^j M[i, k, Y] \wedge M[k + 1, j, Z] \Big)$$

Where $$\vee$$ denotes the logical or over all productions and split positions, and $$\wedge$$ denotes the logical and of two boolean values.

The one-character terminal symbols define the boundary conditions of the recurrence. In particular, $$M[i, i, X]$$ is true iff there exists a production $$X -> \alpha$$ such that $$S[i] = \alpha$$.

## TODO Top-Down Parsing

TODO: LL parser, LL(1)