# The Algorithm Design Manual

This document houses my notes from *The Algorithm Design Manual* by Steven
S. Skiena. Also see my answers to odd exercises.

## Chapter 1: Introduction

A **problem** is specified by describing the complete set of **instances** it must
work on and of its output after running on one of these instances. For
example, an instance of the sorting problem might be `[1, 4, 3]`

. An
**algorithm** is a procedure that takes any of the possible input instances and
transforms it to the desired output. There are many different algorithms that
solve the problem of sorting.

We seek algorithms that are *correct*, *efficient*, and *easy to
implement*. These goals may not be simultaneously achievable!

There is a fundamental difference between algorithms, which always produce a
correct result, and **heuristics**, which may usually do a good job but without
providing any guarantee. For example, the nearest-neighbor heuristic can be
applied to the *traveling salesman problem (TSP)*, but sometimes this
heuristic doesn't even come close to finding the shortest possible tour.

### Correctness

Reasonable-looking algorithms can easily be incorrect. Algorithm correctness
is a property that must be carefully demonstrated. The primary tool to
distinguish correct algorithms from incorrect ones is a **proof**. A proof has
four main parts.

- Clear, precise statement of what you are trying to prove.
- Set of assumptions.
- Chain of reasoning that takes you from the assumptions to the statement you are trying to prove.
- Little, black square or
*QED*.

Computer scientists usually prove things with **proof by induction** or **proof
by contradiction**. We'll see induction below.

To reason about an algorithm, you need a careful description of the sequence of steps to be performed. The three most common forms of algorithmic notation are English, pseudocode, or a real programming language. These methods have natural tradeoffs between expression and precision.

#### Incorrectness

The best way to prove that an algorithm is *incorrect* is to produce an
instance in which it yields an incorrect answer. Such instances are called
**counter-examples**. Good counter-examples are verifiable and simple. Many
tricks exist for finding counter-examples.

*Think small*: small examples are easier to reason about.*Think exhaustively*: consider different types of examples.*Go for a tie*: try similar values in the input collection.*Seek extremes*: e.g. use values that are far apart or close together.

#### Induction

**Mathematical induction** is a common choice for proving correctness. If
you're familiar with recursion, you're familiar with induction; recursion
*is* induction. In both, we have general and boundary conditions. The
general condition breaks the problem into smaller and smaller pieces and the
boundary condition terminates the recursion. Suppose that you are trying to
prove that a statement holds true for all natural numbers (all
\(n\)). Induction would usually take the form:

- Show that the statement holds for the base case (usually \(P(0)\) or \(P(1)\)).
- Assume that \(P(k)\) is true.
- Prove that the statement also holds for \(P(k + 1)\).

#### Summations

Summation formula are concise expressions describing the addition of an arbitrarily large set of numbers.

\begin{equation} \sum_{i=1}^{n} f(i) = f(1) + f(2) + ... + f(n) \end{equation}There are simple closed forms for summations of many algebraic functions.

\begin{equation} \sum_{i=1}^{n} 1 = n \end{equation} \begin{equation} \sum_{i=1}^{n} i = \frac{n(n + 1)}{2} \end{equation}### Program Modeling

Modeling is the art of formulating your application in terms of precisely described, well-understood problems. Proper modeling can eliminate the need to design or even implement algorithms, by relating your application to what has been done before.

Real-world applications involve real-world objects. Most algorithms, however,
are designed to work on rigorously defined *abstract* structures. To exploit
the algorithms literature, you must learn to describe your problem
abstractly, in terms of procedures on fundamental structures.

**Permutations**are arrangements, or orderings of items. Usually the object in question if your problem seeks an "arrangement," "tour," "ordering," or "sequence."**Subsets**are selects from a set of items. Usually the object in question if your problem seeks a "cluster," "collection," "committee," "group," "packaging," or "selection."**Trees**are hierarchical relationships between items. Usually the object in question whenever your problem seeks a "hierarchy," "dominance relationship," "ancestor/descendant relationship," or "taxonomy."**Graphs**represent relationships between arbitrary pairs of objects. Usually the object in question whenever you seek a "network," "circuit," "web," or "relationship."**Points**represent locations in some geometric space. Usually the object in question whenever your problems work on "sites," "positions," "date records," or "locations."**Polygons**represent regions in some geometric spaces. Usually the object in question whenever you are working on "shapes," "regions," "configurations," or "boundaries."**Strings**represent sequences of characters or patterns. Usually the object in question whenever you are dealing with "text," "characters," "patterns," or "labels."

Learn to think recursively. Recursive structures occur everywhere in the
algorithmic world. Each of the abstract structures described above can be
thought about recursively; they are big things made of smaller things of the
same type. Each structure has operations (like *delete*) that produce new
versions of the same type.

## Chapter 2: Algorithm Analysis

### RAM Model of Computation

Machine-independent algorithm design depends upon a hypothetical computer
called the **Random Access Machine** or **RAM**. Under this model of computation,
we are confronted with a computer where

- Each
*simple*operation (+, *, -, =, if, call) takes exactly one time step. - Loops and subroutines are the composition of many single-step operations.
- Each memory access takes exactly one time step. Further, we have as much memory as we need.

Under the RAM model, we measure run time by counting up the number of steps an algorithm takes on a given problem instance. We consider different time complexities that define a numerical function, representing time versus problem size.

**Worst-case complexity**of the algorithm is the function defined by the maximum number of steps taken in any instance of size \(n\).**Best-case complexity**of the algorithm is the function defined by the minimum number of steps taken in any instance of size \(n\).**Average-case complexity**of the algorithm is the function defined by the average number of steps taken in any instance of size \(n\).

### Big Oh Notation

**Big Oh** simplifies our analysis by ignoring levels of detail that do not
impact our comparison of algorithms. The formal definitions are as follows:

- \(f(n) = O(g(n))\) means \(c \cdot g(n)\) is an
*upper bound*on \(f(n)\). Thus there exists some constant \(c\) such that \(f(n)\) is always \(\leq c \cdot g(n)\), for large enough \(n\). - \(f(n) = \Omega(g(n))\) means \(c \cdot g(n)\) is an
*lower bound*on \(f(n)\). Thus there exists some constant \(c\) such that \(f(n)\) is always \(\geq c \cdot g(n)\), for large enough \(n\). - \(f(n) = \Theta(g(n))\) means \(c_1 \cdot g(n)\) is an upper bound on \(f(n)\) and \(c_2 \cdot g(n)\) is a lower bound on \(f(n)\). Thus there exists constants \(c_1\) and \(c_2\) such that \(f(n) \leq c_1 \cdot g(n)\) and \(f(n) \geq c_2 \cdot g(n)\).

For example, \(2n^2 + 100n + 6 = O(n^2)\), because I choose \(c = 3\) and \(3n^2 \geq 2n^2 + 100n + 6\) when \(n\) is big enough.

#### Big Oh Classes

Big Oh groups functions into a set of classes, such that all the functions in a particular class are equivalent with respect to the Big Oh. A small variety of time complexities suffice and account for most algorithms that are widely used in practice.

Class Name | Function |
---|---|

Constant | \(f(n) = 1\) |

Logarithmic | \(f(n) = \log n\) |

Linear | \(f(n) = n\) |

Superlinear | \(f(n) = n lg n\) |

Quadratic | \(f(n) = n^2\) |

Cubic | \(f(n) = n^3\) |

Exponential | \(f(n) = c^n\) |

Factorial | \(f(n) = n!\) |

We say that a faster-growing function **dominates** a slower-growing
one. Specifically, when \(f\) and \(g\) belong to different classes (i.e. \(f(n)
\neq \Theta(g(n))\)), we say \(g\) dominates \(f\) when \(f(n) = O(g(n))\),
sometimes written \(g >> f\).

#### Big Oh Operations

The sum of two functions is governed by the dominant one.

\begin{equation} O(f(n)) + O(g(n)) \rightarrow O(max(f(n), g(n))) \end{equation}Multiplying a function by a constant can not affect its asymptotic behavior.

\begin{equation} O(c \cdot f(n)) \rightarrow O(f(n)) \end{equation}When two functions in a product are increasing, both are important.

\begin{equation} O(f(n)) * O(g(n)) \rightarrow O(f(n) * g(n)) \end{equation}### Logarithms

A **logarithm** is simply an inverse exponential function. Saying \(b^x = y\) is
equivalent to saying that \(x = \log_b y\). Exponential functions grow at a
distressingly fast rate. Thus, inverse exponential functions -
i.e. logarithms - grow refreshingly slowly. Logarithms arise in any process
where things are repeatedly halved.

**Binary search** is a good example of an \(O(\log n)\) algorithm. If searching
for a particular name \(p\) in a telephone book, we start by comparing \(p\)
against the middle. Then we discard half the names. Only twenty comparisons
suffice to find any name in the million-name Manhattan phone book!

Logarithms appear in trees (height is \(\log_2 n\)), bits (\(\log_2 n\) bits required to store a number in binary).

#### Logarithm Properties

The \(b\) term in \(\log_b y\) is the **base** of the logarithm. Three bases are of
importance for mathematical and historical reasons.

- Base \(b = 2\): The
**binary logarithm**, usually denoted \(lg n\), is a base 2 logarithm. Most algorithm applications of logarithms imply binary logarithms. - Base \(b = e\): The
**natural log**, usually denoted \(ln x\), is a base $e = 2.71828…$ logarithm. - Base \(b = 10\): Less common today is the base-10 or
**common logarithm**, usually denoted as \(\log x\).

It is easy to convert a logarithm from one base to another. This is a consequence of the formula:

\begin{equation} \log_a b = \frac{\log_c b}{\log_c a} \end{equation}Thus, changing the base of \(\log b\) from base-a to base-c simply involves dividing by \(\log_c a\).

The base of the logarithm has no real impact on the growth rate. We are usually justified in ignoring the base of the logarithm when analyzing algorithms.

## Chapter 3: Data Structures

Classes of **abstract data types** such as containers, dictionaries, and
priority queues, have many different but functionally equivalent **data
structures** that implement them. These different data structures realize
different tradeoffs in the time to execute various operations.

### Contiguous vs. Linked Data Structures

Data structures are either **contiguous** or **linked**, depending upon whether
they are based on arrays or pointers.

#### Arrays

The **array** is the fundamental contiguously-allocated data structures. These
single slabs of memory have constant access given the index and space
efficiency. **Dynamic arrays** enable resizing. First, an initial size is
allocated. If we run out of space, a larger array (usually 2x) is allocated
and the elements are copied over. Insertion amortizes to \(O(1)\).

#### Lists

The **list** is the simplest linked structure. Each node in the list has data
and pointer fields. **Pointers** are the connections that hold the pieces of
together. Pointers represent the address of a location in memory. List don't
incur overflow, but require extra space for pointer fields and don't given
efficient random access to items.

### Containers: Stacks and Queues

A **container** denotes a data structure that permits storage and retrieval of
data items independent of content. Containers are distinguished by the
particular retrieval order they support. **Stacks** support retrieval by
last-in, first-out (LIFO) order. The *put* and *get* operations for stacks
are usually called *push* and *pop*. **Queues** support retrieval in first in,
first out (FIFO) order. The *put* and *get* operations for queues are usually
called *enqueue* and *dequeue*.

### Dictionaries

The **dictionary** data type permits access to data items by
content. Operations include *search* (when given a *key*), *insert*, and
*delete*.

### Binary Search Trees

A **binary tree** is recursively defined as being empty or consisting of a root
node with left and right subtrees. A **binary search tree** labels each node
in a binary tree with a single key such that for any node labeled \(x\), all
nodes in the left subtree have \(keys < x\) while all nodes in the right
subtree have \(keys > x\). Binary tree nodes have left and right point fields,
an optional parent pointer, and a data field.

#### Traversal

Traversal involves visiting all nodes. **In-order** traversal of a binary
search tree can be done recursively with the following.

def traverse(tree): if tree: traverse(tree.left) process(tree.item) traverse(tree.right)

Changing the position of `process`

gives alternate traversal
orders. Processing the item first yields a **pre-order** traversal, while
processing it last gives a **post-order** traversal.

#### Performance

*Search*, *insert*, and *delete* all take \(O(h)\) time, where \(h\) is the
height of the tree. A perfectly balanced tree has \(h = \lceil \log n
\rceil\). Unfortunately, inserting keys in sorted order produces a skinny
linear height tree, \(h = n\). Randomizing insert order will produce \(O(\log
n)\) height on average.

#### Balanced Search Trees

**Balanced binary search tree** data structures adjust the tree during
insertion/delete to guarantee that height will always be \(O(\log
n)\). Balanced tree implementations include **red-black trees** and **splay
trees**.

### Priority Queues

**Priority queues** are data structures that provide more flexibility than
simple sorting, because they allow new elements to enter a system at
arbitrary intervals. The basic priority queue supports three primary
operations: *insert*, *find-minimum/maximum*, and
*delete-minimum/maximum*. Priority queues can be implemented with arrays or
BSTs, but a particularly nice implementation is the **heap**.

### Hashing and Strings

**Hash tables** are a very practical way to maintain a dictionary. A **hash
function** is a mathematical function that maps keys to integers. Hash table
use the value of a hash function as an index into an array, and store our
item at that position.

The first step of the hash function is usually to map each key to a big
integer. Let \(\alpha\) be the size of the alphabet on which a given string \(S\)
is written. Let `char(c)`

be a function that maps each symbol of the alphabet
to a unique integer from 0 to \(\alpha - 1\). The function

maps each string to a unique (but large) integer by treating the characters of the strings as "digits" in a base-\(\alpha\) number system.

#### Collision Resolution

Two distinct keys will occasionally hash to the same value. This is a
**collision**. **Chaining** is the easiest approach to collision
resolution. Represent the hash table as an array of \(m\) linked
lists. Chaining devotes a considerable amount of memory to pointers. **Open
addressing** is an alternative to chaining. The hash table is maintained as
an array of elements, each initialized to null. On an insertion, we check to
see if the desired position is empty. If so, we insert it. If not, we must
find some other place to insert it instead. The simplest possibility (called
**sequential probing**) inserts the item in the next open spot in the table.

#### String Matching via Hashing

The **Rabin-Karp algorithm** gives a linear-time solution to substring
search. Substring search asks if string \(t\) contains the pattern \(p\) as a
substring, and if so where. In the Rabin-Karp algorithm, we compute a given
hash function on both the pattern string \(p\) and the $|p|$-character
substring starting from the $i$th position of \(t\). If these two strings are
identical, clearly the resulting hash values must be the same. If the two
strings are different, the hash values will *almost certainly be different*
(we can check). Note that we need our hashing function to be constant for
this algorithm to be \(O(n)\) instead of \(O(mn)\). This may be accomplished
with a rolling hash function.

#### Duplicate Detection via Hashing

The key idea of hashing is to represent a large object using a single number. Hashing can be applied to duplicate detection. Suppose we're looking to find if a given document is contained in a corpus. Explicitly comparing the new document \(D\) to all \(n\) documents is hopelessly inefficient. But we can hash \(D\) to an integer, and compare it to the hash codes of the rest of the corpus.

## Chapter 4: Sorting and Searching

Sorting a the basic building block that many other algorithms are built
around. Many other problems become easy once a set of items is sorted
(e.g. *searching*, *closest pair*).

Many things need to be considered when sorting:

*Ascending*or*descending*order.- Key or entire record.
- What to do with equal keys (
*stable*sort?). *Comparison function*.

### Heapsort

**Selection sort** is a simple algorithm that repeatedly extracts the smallest
remaining element from the unsorted part of an array. A computer takes \(O(n)\)
time to find the smallest element in an array. This is the operation
supported by a priority queue. What if we improve the data structure?
**Heapsort** is nothing but an implementation of selection sort using the right
data structure.

#### Heaps

Heaps are a simple and elegant data structure that efficiently support the
priority queue operations insert and extract-min. They work by maintaining a
partial order on the set of elements. A *heap-labeled tree* is a binary tree
such that the key labeling of each node *dominates* the key labeling of each
of its children. In a *min-heap* a node dominates its children by containing
a smaller key than they do.

Heaps can be stored with pointers (node with children) or arrays. In a array, the root of the tree is in the first position, and its left and right children are in the second and third positions. In general the keys of the $i$th level of the binary tree are stored in \(2^{i - 1}\) to \(2^i - 1\). This means that the left child of \(k\) sits in position \(2k\) and the right child in \(2k + 1\), while the parent of \(k\) is in \(\lceil k / 2 \rceil\). Note that sparse trees can be very space inefficient, we need to be careful to pack our elements as far left as possible. This implicit representation of binary saves memory, but is less flexible than using pointers. We cannot store arbitrary tree topologies without wasting large amount of space. We cannot move subtrees around by just changing a single pointer. This loss of flexibility explains why we cannot use this idea to represent binary search trees.

- Insert
Place the new element into the left-most

*open*spot in the array, namely the $(n + 1)$st position of a previously $n$-element heap. Then,*bubble up*the new key to its proper position in the hierarchy by swapping the element with its parent until the parent dominates the element. Insertion takes at most \(O(\log n)\) time.def insert(element, heap): heap = heap + [element] bubble_up(len(heap), heap) def bubble_up(index, heap): if index > 0 and heap[index] > heap[index // 2]: array[index], array[index // 2] = array[index // 2], array[index] bubble_up(index // 2, heap)

- Extracting the Minimum
The minimum can easily be found by looking in the first position in the array. Removing the top element leaves a hole in the array. Fill by moving the element from the

*right-most*leaf (sitting in the $n$th position of the array) into the first position. Then,*bubble down*the new key until it dominates all its children. The key should be switched with the dominant child.def extract_minimum(heap): minimum = heap[0] heap = [heap[-1]] + heap[1:-1] bubble_down(0, heap) return minimum def bubble_down(index, heap): smaller_index = find_smaller_child(index, heap) if smaller_index: heap[index], heap[smaller_index] = heap[smaller_index], heap[index] bubble_down[smaller_index] def find_smaller_child(index, heap): if 2 * index + 1 < len(heap) and heap[2 * index] > heap[2 * index + 1]: return 2 * index + 1 elif 2 * index < len(heap): return 2 * index

#### Heapsort Implementation

Heapsort creates a heap and repeatedly extracts the minimum to give a
worst-case \(O(n \log n)\) algorithm. It is an *in-place* sort, meaning it
uses no extra memory over the array containing the elements to be sorted.

def heapsort(array): heap = [] for element in array: insert(element, heap) for index in range(len(array)): array[index] = extract_min(heap)

### Mergesort

Mergesort is a classic divide-and-conquer algorithm. This recursive approach to sorting involves partitioning the elements into two groups, sorting each of the smaller problems recursively, and then interleaving the two sorted lists to totally order the elements.

The efficiency of mergesort depends upon how efficiently we combine the two
sorted halves into a single sorted list. We need to *merge* the two lists
together. Observe that the smallest overall item in the two sorted lists must
sit at the top of one of the two lists. To merge, we remove the smallest
element, then repeat. Because the recursion goes \(\lg n\) levels deep, and a
linear amount of work is done per level, mergesort takes \(O(n \log n)\) time
in the worst case.

def mergesort(array): left, right = array[:len(array) / 2], array[len(array / 2):] return merge(mergesort(left), mergesort(right)) def merge(array1, array2): merged = [] while array1 or array2: if not array2 or (array1 and array1[0] < array2[0]): merged.append(array1.pop(0)) else: merged.append(array2.pop(0)) return merged

### Quicksort

Quicksort selects a item \(p\) from the collection then separates the other elements into piles: those before \(p\) and those after \(p\). We place the pivot \(p\) between the other two piles, and then sort piles independently.

Quicksort runs in \(O(n * h)\), where \(h\) is the height of the recursion tree. Suppose, luckily, we always the median element, the subproblems are always half the size of the previous level. This produces \(O(n \log n)\), the best case of quicksort. Suppose, unluckily, we always choose the biggest or smallest element in the sub-array. This produces \(O(n^2)\), the worst case of quicksort.

Quicksort is typically 2-3 times faster than mergesort or heapsort when implemented well. All three algorithms are \(O(n \log n)\), but experimentation shows that the simpler operations in the inner loop give quicksort a constant improvement.

#### Randomization

Randomization is a powerful tool to improve algorithms with bad worst-case but good average-case complexity.

If we randomly choose the pivot in quicksort, we can expect, with high probability, \(O(n \log n)\). The best possible selection for the pivot is the median. Suppose a key is good enough if it lies in the center half of the sorted space of keys. Since the expected number of good splits and bad splits is the same, the bad splits can only double the height of the tree, which still produces \(O(\log n)\) height. This randomization may be done by either shuffling the array first or by selecting a random index at each step.

### Distribution Sort

Suppose we have a list of names to sort. We could partition them according to
the first letter. This creates 26 different piles, or buckets, or
names. Then, we partition each pile based on the second letter of each name,
etc. The names will be sorted as soon as each bucket contains only a single
name. At the end, we'll be able to simply concatenate the bunch of piles
together. This algorithm is commonly called **bucketsort** or **distribution
sort**.

**Bucketing** is a very effective idea whenever we are confident that the
distribution of data will be roughly uniform. It is the idea that underlies
hash tables, kd-trees, and a variety of other practical data structures. The
downside of such techniques is that the performance can be terrible when the
data distribution is not what we expected.

### Binary Search and Related Algorithms

**Binary search** is a fast algorithm for searching in a sorted array. We
compare the key \(q\) to the middle item. If \(q\) is smaller, it must appear in
the first half; if not it must reside in the second half. By repeating this
process recursively on the correct half, we locate the key in \(\lg n\)
comparisons.

def binary_search(item, array, start=None, end=None): if start is None or end is None: start, end = 0, len(array) middle_index = (start + end) // 2 if start > end: return False elif item == array[middle_index]: return True elif item < array[middle_index]: return binary_search(item, array, start, middle_index - 1) else: return binary_search(item, array, middle_index + 1, end)

Binary search is the power behind twenty questions!

#### Counting Occurrences

Suppose we want to count the number of times a given key \(k\) occurs in a given sorted array. We could use binary search to find the index of an element in the correct block in \(O(\lg n)\) time. Then we sequentially test elements to the left and right until we find one that differs from the key. The difference between the boundaries (plus one) gives the count of the number of occurrences of \(k\). This algorithm runs in \(O(\lg n + s)\), where \(s\) is the number of occurrences of the key.

A fast algorithm results by modifying binary search to search for the
*boundary* of the block containing \(k\), instead of \(k\) itself. We perform
this search twice, for a total time of \(O(\lg n)\), so we can count the
occurrences in logarithmic time regardless of the size of the block.

#### One-Sided Binary Search

Suppose we don't know the bounds of our sorted collection. Binary search can
also proceed from a specific position at repeatedly larger intervals (1, 2,
4, 8, 16) until we find a value greater than our key. We now have a window
containing the target and can proceed with binary search. *One-sided binary
search* is most useful whenever we are looking for a key that lies close to
our current position.

#### Square and Other Roots

Suppose we are searching for the square root \(r\) of \(n\). Notice that the square root of \(n \leq 1\) must be at least 1 and at most \(n\). Consider the midpoint \(m\) of this interval. How does \(m^2\) compare to \(n\)? If \(n \leq m^2\), then the square root must be greater than \(m\), so the algorithm repeats on a new range of values. This application of binary search identifies the square root within ±1 after only \(\lg n\) rounds. Root-finding algorithms that converge faster are known, but this is simple, robust and applies to other functions.

### Divide-and-Conquer

One of the most powerful techniques for solving problems is to break them
down into smaller, more easily solved pieces. A recursive algorithm starts to
become apparent when we break the problem into smaller instances of the same
type of problem. **Divide-and-conquer** splits the problem in (say) halves,
solves each half, then stitches the pieces back together to form a full
solution. Whenever the merging takes less time than recursively solving the
two subproblems, we get an efficient algorithm. For example, mergesort takes
linear time to merge two sorted lists of \(n/2\) elements, each of which was
obtained in \(O(n \lg n)\) time.

#### Recurrence Relations

Many divide-and-conquer algorithms have time complexities that are naturally
modeled by **recurrence relations**. A recurrence relation is an equation that
is defined in terms of itself. The Fibonacci numbers are described by the
recurrence relation \(F_n = F_{n - 1} + F_{n - 2}\). Many other natural
functions are easily expressed as recurrences. For example, \(a_n = 2a_{n -
1}, a_1 = 1 \rightarrow a_n = 2^{n - 1}\).

Divide-and-conquer algorithms tend to break a given problem into some number of smaller pieces (say \(a\)), each of which is of size \(n/b\). Further, they spend \(f(n)\) time to combine these subproblem solutions into a complete result. Let \(T(n)\) denote the worst-case time the algorithm takes to solve a problem of size \(n\). Then \(T(n)\) is given by the following recurrence relation.

\begin{equation} T(n) = aT(n/b) + f(n) \end{equation}For example, the running time behavior of mergesort is governed the recurrence \(T(n) = 2T(n/2) + O(n)\). This recurrence evaluates to \(T(n) = O(n \lg n)\). Binary search is governed by the recurrence \(T(n) = T(n/2) + O(1)\).

## Chapter 5: Graph Traversal

A **graph** \(G = (V, E)\) consists of a set of **vertices** \(V\) together with a set
\(E\) of vertex pairs or **edges**. Graphs can represent essentially *any*
relationship. The key to using graph algorithms effectively in applications
lies in correctly modeling your problem so you can take advantage of existing
algorithms.

### Flavors of Graphs

Several fundamental properties of graphs impact the choice of the data structures used to represent them and algorithms available to analyze them.

**Undirected**vs.**Directed**- A graph \(G = (V, E)\) is undirected if edge \((x, y) \in E\) implies that \((y, x) \in E\). If not, we say that the graph is directed.

**Weighted**vs.**Unweighted**- Each edge (or vertex) in a weighted graph \(G\) is assigned a numerical value, or weight. In unweighted graphs, there is no cost distinction between various edges and vertices.
- The difference between weighted and unweighted graphs becomes particularly apparent in finding the shortest path between two vertices.

**Simple**vs.**Non-simple**- Any graph that avoids
**self-loops**and**multiedges**is called simple. A self-loop is an edge \((x, x)\) involving only one vertex. An edge \((x, y)\) is a multiedge if it occurs more than once in the graph.

- Any graph that avoids
**Sparse**vs.**Dense**- Graphs are sparse when only a small fraction of the possible vertex pairs actually have edges defined between them. Graphs where a large fraction of the vertex pairs define edges are called dense.
- The
**degree**of a vertex is the number of edges adjacent to it. - In a
**regular graph**, each vertex has exactly the same degree.

**Cyclic**vs.**Acyclic**- An acyclic graph does not contain any cycles.
**Trees**are connected, acyclic undirected graphs.

**Embedded**vs.**Topological**- A graph is embedded if the vertices and edges are assigned geometric positions.

**Implicit**vs.**Explicit**- Certain graphs are not explicitly constructed and then traversed, but built as we use them.

**Labeled**vs.**Unlabeled**- Each vertex is assigned a unique name or identifier in a labeled graph to distinguish it from all other vertices. In unlabeled graphs, no such distinctions have been made.

**Social networks** are graphs where the vertices are people, and there is an
edge between two people if and only if they are friends.

### Data Structures for Graphs

The two basic graph data structures are **adjacency matrices** and **adjacency
lists**. We assume a graph \(G = (V, E)\) contains \(n\) vertices and \(m\) edges.

Adjacency lists are the right data structure for most applications of graphs.

#### Adjacency Matrix

We can represent \(G\) using an \(n x n\) matrix \(M\), where element \(M[i,j] = 1\) if \((i, j)\) is an edge of \(G\), and 0 if it isn't. This allows fast answers to the question "is \((i, j)\) in \(G\)?", and rapid updates for edge insertion and deletion. IT may use excessive space for graphs with many vertices and relatively few edges, however.

#### Adjacency Lists

We can more efficiently represent sparse graphs by using linked lists to store the neighbors adjacent to each vertex. Adjacency lists make it harder to verify whether a given edge \((i, j)\) is in \(G\), since we must search through th3e appropriate list to find the edge.

### Traversing a Graph

The key idea behind graph traversal is to mark each vertex when we first
visit it and keep track of what we have not yet completely explored. Each
vertex may be *undiscovered*, *discovered*, or *processed*. We must maintain
a structure containing the vertices that we have discovered but not yet
completely processed.

#### Breadth-First Search

The basic breadth-first search algorithm is given below. It takes \(O(n + m)\) time.

def bfs(graph, root): discovered = {root} parent = {} queue = [root] while queue: current = queue.pop(0) for neighbor in graph.find_adjacent(current): if neighbor not in discovered: discovered.add(neighbor) parent[neighbor] = current queue.append(neighbor)

This implementation of breadth-first search, we assign a direction to each
edge, from the discoverer `current`

to the discovered `neighbor`

. We
maintain a `parent`

map which defines a tree on the vertices of the
graph. This tree contains the shortest path from the root to every other
node in the tree. A breadth-first search tree can be seen in the right of
the image below.

The graph edges that do not appear in the breadth-first search tree also have special properties. For undirected graphs, non-tree edges can point only to vertices on the same level as the parent vertex, or to vertices on the level directly below the parent. These properties follow easily from the fact that each path in the tree must be the shortest path in the graph.

- Applications of Breadth-First Search
- Connected Components
A

**connected component**of an undirected graph is a maximal set of vertices such that there is a path between every pair of vertices. The components are separate "pieces" of the graph such that there is no connection between the pieces. An amazing number of seemingly complicated problems reduce to finding or counting connected components. For example, testing whether a puzzle such as the Rubik's cube or the 15 puzzle can be solved from any position is really asking whether the graph of legal configurations is connected.Connected components can be found using breadth-first search since the vertex order does not matter. We start from the first vertex. Anything we discover during this search must be part of the same connected component. We then repeat the search from any undiscovered vertex (if one exists) to define the next component, and so on until all vertices have been found.

- Two-Coloring Graphs
The

**vertex-coloring**problem seeks to assign a label (or color) to each vertex of a graph such that no edge links any two vertices of the same color. We can easily avoid all conflicts by assigning each vertex a unique color. However, the goal is to use as few colors as possible.A graph is

**bipartite**if it can be colored without conflicts while using only two colors. Consider the "had-sex-with" graph in a heterosexual work. Men have sex only with women, and vice versa. Thus gender defines a legal two-coloring, in this simple model.We can argument breadth-first search so that whenever we discover a new vertex, we color it the opposite of its parent. We check whether any nondiscovery edge links two vertices of the same color. Such a conflict means that the graph cannot be two-colored.

- Connected Components

#### Depth-First Search

The difference between BFS and DFS results is in the order in which they explore vertices. This order depends completely upon the container data structure used to store the unprocessed vertices: BFS uses a queue, DFS uses a stack. DFS implementations often use recursion instead of an explicit stack.

discovered = set() time = 0 entry_time = {} exit_time = {} parent = {} def dfs(root, graph): discovered.add(root) time += 1 entry_time[root] = time for neighbor in graph.get_adjacent(root): if neighbor not in discovered: parent[neighbor] = root dfs(neighbor, graph) exit_time[root] = time time += 1

This implementation of depth-first search maintains the traversal time for
each vertex. The `time`

clock ticks each time we enter or exit any
vertex. The time intervals can tell us a vertex's ancestor and how many
descendants it has.

Depth-first search partitions the edges of an undirected graph into exactly
two classes: **tree edges** and **back edges**. The tree edges discover new
vertices, and are those encoding in the `parent`

relation (seen in the image
above). Back edges are those whose other endpoint is an ancestor of the
vertex being expanded, so they point back into the tree.

- Applications of Depth-First Search
- Finding Cycles
Back edges are the key to finding a cycle in an undirected graph. If there is no back edge, all edges are tree edges, and no cycle exists in a tree. But

*any*back edge going from \(x\) to an ancestor \(y\) creates a cycle with the tree path from \(y\) to \(x\). - Articulation Vertices
An

**articulation vertex**is a single vertex whose deletion disconnects a connected component of the graph. Any graph that contains an articulation vertex is inherently fragile, because deleting that single vertex causes a loss of connectivity between other nodes. The**connectivity**of a graph is the smallest number of vertices whose deletion will disconnect the graph. The connectivity is one if the graph has an articulation vertex. More robust graphs without such a vertex are said to be**biconnected**.Testing for articulation vertices by brute force is easy. Temporarily delete each vertex \(v\), and then do a BFS or DFS traversal of the remaining graph to establish whether it is still connected. The total time is \(O(n(m + n))\).

DFS gives a clever, linear-time algorithm. Look at the depth-first search tree. If this tree represents the entirety of the graph, all internal (non-leaf) nodes would be articulation vertices, since deleting any one of them would separate a leaf from the root. A depth-first search of a

*general*graph partitions the edges into tree edges and back edges. Think of these back edges as security cables linking a vertex back to one of its ancestors. Finding articulation vertices requires maintaining the extent to which back edges (i.e. security cables) link chunks of the DFS tree back to ancestor nodes.

- Finding Cycles
- DFS on Directed Graphs
When traversing undirected graphs, every edge is either in the depth-first search tree or a back edge to an ancestor in the tree. For directed graphs, depth-first search labelings can take on a wider range of possibilities: tree edges, forward edges, back edges, and cross edges.

- Topological Sorting
Topological sorting is the most important operation on directed acyclic graphs (DAGs). It orders the vertices on a line such that all directed edges go from left to right. Such an ordering cannot exist if the graph contains a directed cycle, because there is no way you can keep going right on a line and still return back to where you started from!

Each DAG has at least one topological sort. The importance of topological sorting is that it gives us an ordering to process each vertex before any of its successors. For example, suppose college courses are vertices and prerequisites are edges. Your transcript is a topological sort of courses.

Topological sorting can be performed efficiently using depth-first searching. A directed graph is a DAG if and only if no back edges are encountered. Labeling the vertices in the reverse order that they are marked

*processed*finds a topological sort of a DAG (i.e. record when you finish processing then reverse the collection).

- Topological Sorting

## Chapter 6: Weighted Graph Algorithms

There is an alternate universe of problems for **weighted graphs**. If we're
traveling to California, we don't just care about the number of roads traveled
on. The edges of road networks are naturally bound to numerical values such as
construction cost, traversal time, length, or speed limit. Identifying the
shortest path in such graphs proves more complicated than breadth-first search
in unweighted graphs, but opens the door to a wide range of applications.

### Minimum Spanning Trees

A **spanning tree** of a graph \(G = (V, E)\) is a subset of edges from \(E\)
forming a tree connecting all vertices of \(V\). For edge-weighted graphs, we
are particularly interested in the **minimum spanning tree** - the spanning
tree whose sum of edge weights is as small as possible. Minimum spanning tree
are the answer whenever we need to connect a set of points (representing
cities, homes, junctions, or other locations) by the smallest amount of
roadway, wire, or pipe. Minimum spanning trees are also useful for
clustering.

There can be more than one minimum spanning tree in a graph. Indeed, all spanning trees of an unweighted (or equally weighted) graph \(G\) are minimum spanning trees, since each contains exactly \(n - 1\) equal-weight edges. Such a spanning tree can be found using depth-first or breadth-first search. Finding a minimum spanning tree is more difficult for general weighted graphs, however two different algorithms are presented below.

#### Prim's Algorithm

**Prim's algorithm** for minimum spanning tree starts from one vertex and
grows the rest of the tree one edge at a time until all vertices are
included. **Greedy algorithms** make the decision of what to do next by
selecting the best local option from all available choices without regard to
the global structure. Since we seek the tree of minimum weight, the natural
greedy algorithm for a minimum spanning tree repeatedly selects the smallest
weight edge that will enlarge the number of vertices in the tree. Prim's
algorithm can be implemented as \(O(m + n\lg{n})\) with a priority-queue.

prim-mst(G) select an arbitrary vertex s to start the tree from while (there are still non tree vertices) select the edge of minimum weight between a tree and nontree vertex add the selected edge and vertex to the tree T_prim

The correctness of this algorithm can be proven by contradiction. We assert that there must be a specific instant where the tree went wrong. However, since we always select the smallest edge, it's not possible for a smaller edge to exist than the one we're adding (otherwise it would have already been chosen).

#### Kruskal's Algorithm

**Kruskal's algorithm** is an alternate approach to finding minimum spanning
trees that proves more efficient on spare graphs. Like Prim's, Kruskal's
algorithm is greedy. Unlike Prim's, it does not start with a particular
vertex. Kruskal's algorithm builds up connected components of vertices,
culminating in a minimum spanning tree. Initially, each vertex forms its own
separate component in the tree-to-be. The algorithm repeatedly considers the
lightest remaining edge and tests whether its two endpoints lie within the
same connected component. If so, this edge will be discarded, because adding
it would create a cycle in the tree-to-be. If the endpoints are in different
components, we insert the edge and merge the two components into one. Since
each connected component is always a tree, we need never explicitly test for
cycles.

kruskal-mst(G) put the edges in a priority queue ordered by weight count = 0 while (count < n - 1) do get next edge (v, w) if (component(v) != component(w)) add to T_kruskal merge component(v) and component(w)

The speed of Kruskal's algorithm depends on the component test. This test
may be implemented by a breadth-first or depth-first search in a sparse
graph. With this approach, Kruskal's algorithm is \(O(mn)\). However, a faster
implementation exists with the **union-find** data structure.

- The Union-Find Data Structure
A

**set partition**is a partitioning of the elements of some universal set (say the integers 1 to \(n\)) into a collection of disjointed subsets. Thus, each element must be in exactly one subset. Set partitions naturally arise in graph problems such as connected components (each vertex is in exactly one connected component) and vertex coloring (a person may be male or female, but not both or neither).The connected components in a graph can be represented as a set partition. For Kruskal's algorithm to run efficiently, we need a data structure that efficiently supports the following operations:

- \(same component(v_1, v_2)\)
- \(merge components(C_1, C_2)\)

The union-find data structure represents each subset as a "backwards" tree, with pointers from a node to its parent. Each node of this tree contains a set element, and the

*name*of the set is taken from the key at the root.We implement our desired component operations in terms of two simpler operations,

**union**and**find**:- \(find(i)\)
- Find the root of tree containing element \(i\), by walking up the pointers until there is nowhere to go. Return the label of the root.

- \(union(i, j)\)
- Link the root of one of the trees (say containing \(i\)) to the root of the tree containing the other (say \(j\)) so \(find(i)\) now equals \(find(j)\).

Tree structures can be very unbalanced, so we must limit the height of our trees. The most obvious means of control is the decision of which of the two component roots becomes the root of the combined component on each \(union\). To minimize the tree height, it is of course better to make the smaller tree the subtree of the bigger one.

With union-set, we can do both unions and finds in \(O(\log{n})\).

### Shortest Paths

A **path** is a sequence of edges connecting two vertices. The **shortest path**
from \(s\) to \(t\) in an unweighted graph can be constructed using a
breadth-first search from \(s\). The minimum-link path is recorded in the
breadth-first search tree, and it provides the shortest path when all edges
have equal weight. However, BFS does not suffice to find shortest paths in
weighted graphs. The shortest weighted path might use a large number of
edges.

Finding the shortest path between two nodes in a graph arises in many
different applications. These may include transportation problems and
computer network communication problems. Many applications reduce to finding
shortest path, learn to smell this! Page 212 of The Algorithm Design Manual
contains a lovely example (*Dialing for Documents*).

#### Dijkstra's Algorithm

**Dijkstra's algorithm** is the method of choice for finding shortest paths in
an edge-and/or vertex-weighted graph. Given a particular start vertex \(s\),
it finds the shortest path from \(s\) to every other vertex in the graph,
including your desired destination \(t\).

Suppose the shortest path from \(s\) to \(t\) in graph \(G\) passes through a particular intermediate vertex \(x\). Clearly, this path must contain the shortest path from \(s\) to \(x\) as its prefix, because if not, we could shorten our $s$-to-\(t\) path by using a shorter $s$-to-\(t\) prefix. Thus, we must compute the shortest path from \(s\) to \(x\) before we find the path from \(s\) to \(t\).

Dijkstra's algorithm proceeds in a series of rounds, where each round establishes the shortest path from \(s\) to some new vertex. Specifically, \(x\) is the vertex that minimizes \(dist(s, v_i) + w(v_i, x)\) over all finished \(1 \leq i \leq n\), where \(w(i, j)\) is the length of the edge from \(i\) to \(j\), and \(dist(i, j)\) is the length of the shortest path between them.

import math def dijkstra(graph, s, t): known = {s} distances = {vertex: math.inf for vertex in graph.all_vertices()} for neighbor in s.get_neighbors(): distances[neighbor] = weight(s, neighbor) last = s while last != t: v_next = min(distances[v] for v in (graph.all_vertices() - known)) for neighbor in v_next.get_neighbors(): distances[neighbor] = min(distances[neighbor], distances[v_next] + weight(v_next, neighbor)) last = v_next known.add(v_next)

The basic idea is very similar to Prim's algorithm. In each iteration, we
add exactly one vertex to the tree of vertices for which we *know* the
shortest path from \(s\). The difference between Dijkstra's and Prim's
algorithms is how they rate the desirability of each outside vertex. In the
minimum spanning tree problem, all we cared about was the weight of the next
potential tree edge. In shortest path, we want to include the closest
outside vertex (in shortest-path distance) to \(s\). This is a function of
both the new edge weight *and* the distance from \(s\) to the tree vertex it
is adjacent to.

#### All-Pairs Shortest Path

Sometimes we want to find the shortest path between all pairs of vertices in
a given graph. We could solve the **all-pairs shortest path** by calling
Dijkstra's algorithm from each of the \(n\) possible starting vertices
(\(O(n^3)\)). But Floyd's all-pairs shortest-path algorithm is a slick way to
construct an \(n x n\) distance matrix from the original weight matrix of the
graph. This algorithm is also \(O(n^3)\), but the loops are so tight and the
program so short that it runs better in practice.

Floyd's algorithm starts with the adjacency matrix. The edge \((i, j)\) should have its weight in matrix[i][j]. Cells for which the edge doesn't exist should be set to MAXINT.

def floyd(adjacency_matrix): n_vertices = len(adjacency_matrix) for k in range(n_vertices): for i in range(n_vertices): for j in range(n_vertices): through_k = adjacency_matrix[x][k] + adjacency_matrix[k][y] if (through_k < adjacency_matrix[x][y]): adjacency_matrix[x][y] = through_k

We define \(W[i, j]^k\) to be the length of the shortest path from \(i\) to \(j\) using only vertices numbered from 1, 2, …, \(k\) as possible intermediate vertices. At each iteration, we allow a richer set of possible shortest paths by adding a new vertex as a possible intermediary. Allowing the $k$th vertex as a stop helps only if there is a short path that goes through \(k\), so \(W[i, j]^k = min(W[i, j]^{k - 1}, W[i, k]^{k - 1}, + W[k, j]^{k - 1})\).

## Chapter 7: Combinatorial Search and Heuristic Methods

### Backtracking

**Backtracking** is a systematic way to iterate through all the possible
configurations of a combinatorial search space. These configurations may
represent all possible arrangements of objects (permutations), all possible
ways of building a collection of them (subsets), or even possible move
sequences in a game. We model each solution as a vector \(a = (a_1, a_2, ...,
a_n)\), where each element \(a_i\) is selected from a finite ordered set
\(S_i\). We must be careful to avoid repetitions and missing configurations.

def backtrack(a): if is_solution(a): report(a) else: s_i = find_candidates(a) while s_i: backtrack(a + [s_i.pop()])

At each step in the backtracking algorithm, we try to extend a given partial solution \(a = (a_1, a_2, ..., a_k)\) by adding another element at the end. After extending it, we must test whether what we now have is a solution or if not we must check whether the partial solution is still extendible to some complete solution. We're using a depth-first search to enumerate solutions. Breadth-first search would require more space (proportional to the width instead of the height of the search tree).

#### Backtracking Subsets

Suppose we are generating subsets of an *n*-element set, say
\(\{1,...,n\}\). Define each subset as an array of \(n\) cells, where the value
of \(a_i\) (true or false) signifies whether the ith item is in the given
subset. We consider the subset a solution when every cell has true/false
(length == n).

#### Backtracking Permutations

\(\{1,...,n\}\) has \(n!\) distinct permutations. Each permutation is represented by an array of \(n\) cells. The set of candidates for the ith position will be the set of elements that have not appeared in the \((i - 1)\) elements of the partial solution, corresponding to the first \((i - 1)\) elements of the permutation. Our array is a solution whenever length equals \(n\).

#### Backtracking Graph Paths

The starting point of any path from \(s\) to \(t\) is always \(s\). Thus, \(s\) is the only candidate for the first position and \(S_0 = \{s\}\). The possible candidates for the second position are the vertices \(v\) such that \((s, v)\) is an edge of the graph and \(v\) hasn't been used in the partial solution. We have a solution when \(a_k\) is equal to \(t\). Some paths might be shorter than others.

### Search Pruning

**Pruning** is the technique of cutting off the search the instant we have
established that a partial solution cannot be extended into a full
solution. Pruning is powerful. Even simple pruning strategies can suffice to
reduce running time from impossible to instantaneous.

For the traveling salesman, we seek the cheapest tour that visits all vertices. Suppose that in the course of our search we find a tour \(t\) whose cost is \(C_t\). Later, we may have a partial solution \(a\) whose edge sum \(C_A > C_t\). Any tour with this prefix will have cost greater than tour \(t\), and hence is doomed to be nonoptimal. Cutting away such failed partial tours as soon as possible can have an enormous impact on running time.

As another example, suppose we're solving a Sudoku puzzle. We run through
empty squares, try candidate numbers, and backtrack when we are out of
candidates. The naive search randomly chooses open squares. Instead, we could
choose the square with the fewest number of candidates. Additionally, when
generating candidates, we could look ahead to see if the partial solution
causes some *other* open square to have no candidates. Successful pruning
often requires looking ahead to see when a solution is doomed to go nowhere,
and backing off as soon as possible.

Exploiting symmetry is another avenue for reducing combinatorial searches

### Heuristic Search Methods

Heuristic methods provide an alternate way to approach difficult combinatorial optimization problems. Backtracking gave us a method to find the best of all possible solutions, as scored by a given objective function. However, any algorithm searching all configurations is doomed to be impossible on large instances.

The methods observed below have two common components: solution space representation and a cost function.

#### Random Sampling

The simplest method to search in a solution space uses random sampling. It
is also called the **Monte Carlo method**. We repeatedly construct random
solutions and evaluate them, stopping as soon as we get a good enough
solution, or (more likely) when we are tired of waiting. We report the best
solution found over the course of our sampling.

True random sampling requires that we are able to select elements form the
solution space *uniformly at random*. This means that each of the elements
of the solution space must have an equal probability of being the next
candidate selected.

Random sampling does well when there's a high proportion of acceptable solutions or when there is no coherence in the solution space. For example, hunting for a any large prime number.

#### Local Search

A local search employs the *local neighborhood* around every element in the
solution space. Think of each element \(x\) in the solution space as a vertex,
with a directed edge \((x, y)\) to every candidate solution \(y\) that is a
neighbor of \(x\). Our search proceeds from \(x\) to the most promising
candidate in *x*'s neighborhood.

We certainly do not want to construct the neighborhood graph for any sizable solution space. We want a general transition mechanism that takes us to the next solution by slightly modifying the current one. Typical mechanisms include swapping a random pair of items or changing (inserting or deleting) a single item in the solution.

In a **hill-climbing** procedure, we try to find the top of a mountain (or
alternatively, the lowest point in a ditch) by starting at some arbitrary
point and taking any step that leads in the direction we want to travel. We
repeat until we have reached a point where all our neighbors lead us in the
wrong direction.

Suppose you wake up in a sky lodge, eager to reach the top of the
neighboring peak. Your first transition to grain altitude might be to go
upstairs to the top of the building. And then you are trapped. To reach the
top of the mountain, you must go downstairs and walk outside, but this
violates the requirement that each step has to increase your
score. Hill-climbing and closely related heuristics such as **greedy search**
or **gradient descent search** are great at finding local optima quickly, but
often fail to find the globally best solution.

Use local search when there is great coherence in the solution space. Hill
climbing is at its best when the solution space is *convex*. Local search is
also useful whenever the cost of incremental evaluation is much cheaper than
global evaluation.

#### Simulated Annealing

Simulated annealing is a heuristic search procedure that allows occasional transitions leading to more expensive (and hence inferior) solutions. This may not sound like progress, but it helps keep our search from getting stuck in local optima.

The inspiration for simulated annealing comes from the physical process of cooling molten materials down to the solid state. In thermodynamic theory, a particle's energy state is a function of its temperature. We can mimic physics to solve combinatorial optimization problems.

Our problem representation includes both a representation of the solution
space and an easily computable cost function \(C(s)\) measuring the quality of
a given solution. The new component is the *cooling schedule*, whose
parameters govern how likely we are to accept a bad transition as a function
of time.

At the beginning of the search, we are eager to use randomness to explore the search space widely, so the probability of accepting a negative transition should be high. As the search progresses, we seek to limit transitions to local improvements and optimizations.

#### Genetic Algorithms

Genetic algorithms draw their inspiration from evolution and natural selection. Through the process of natural selection, organisms adapt to optimize their chances for survival in a given environment. Random mutations occur in an organism's genetic description, which then get passed on to its children. Should a mutation prove helpful, these children are more likely to survive and reproduce. Should it be harmful, these children won't, and so the bad trait will die with them.

Genetic algorithms maintain a "population" of solution candidates for the given problem. Elements are drawn at random from this population and allowed to "reproduce" by combining aspects of the two-parent solutions. The probability that an element is chosen to reproduce is based on its "fitness," - essentially the cost of the solution it represents. Unfit elements die from the population, to be replaced by a successful-solution offspring.

The idea behind genetic algorithms is extremely appealing. However, they don't seem to work as well on practical combinatorial optimization problems as simulated annealing does.

## Chapter 8: Dynamic Programming

Dynamic programming is a technique for efficiently implementing a recursive
algorithm by storing partial results. Dynamic programming guarantees
*correctness* by searching all possibilities and provides *efficiency* by
storing results to avoid recomputing. If the naive recursive algorithm
computes the same subproblems over and over again, storing the answer for each
subproblem in a table to look up instead of recompute can lead to an efficient
algorithm. Dynamic programming is essentially a tradeoff of space for time.

### Fibonacci Example

Let's look at a simple program for computing the /n/th Fibonacci number.

def fib(n): if n == 0: return 0 if n == 1: return 1 return fib(n - 1) + fib(n - 2)

The course of execution for this recursive algorithm is illustrated by its
**recursion tree**.

Note that \(F(4)\) is computed on both sides of the recursion tree, and \(F(2)\) is computed no less than five times in this small example. This redundancy drastically affects performance.

We can improve performance by storing (or **caching**) the results of each
Fibonacci computation \(F(k)\) indexed by the parameter \(k\).

cache = {0: 0, 1: 1} def fib(n): if n not in cache: cache[n] = fib(n - 1) + fib(n - 2) return cache[n]

This approach is a simple way to get *most* of the benefits of full dynamic
programming. Here's the recursion tree:

Let's go a step further with full dynamic programming! We can calculate \(F(n)\) in linear time and space with no recursive calls by explicitly specifying the order of evaluation of the recurrence relation.

def fib(n): f = [0, 1] for i in range(2, n + 1): f.append(f[i - 1] + f[i - 2]) return f[n]

However, more careful study shows that we do not need to store all the intermediate values for the entire period of execution.

def fib(n): if n == 0: return 0 back_2, back_1 = 0, 1 for _ in range(2, n): back_2, back_1 = back_1, back_1 + back_2 return back_1 + back_2

This analysis reduces the storage demands to constant space with no asymptotic degradation in running time.

### Approximate String Matching

To deal with inexact string matching, we must first define a cost function
telling us how far apart two strings are - i.e., a distance measure between
pairs of strings. **Edit distance** reflects the number of *changes* that must
be made to convert one string to another. There are three natural types of
changes: *substitution*, *insertion*, and *deletion*. Edit distance assigns
each operation an equal cost of 1. Here's a recursive edit distance function:

def edit_distance(source, target): if not source: return len(target) if not target: return len(source) substitution_cost = 0 if source[-1] == target[-1] else 1 return min(edit_distance(source[:-1], target[:-1]) + substitution_cost, edit_distance(source, target[:-1]) + 1, # insertion edit_distance(source[:-1], target) + 1) # deletion

This program is absolutely correct but impossible slow. A table-based,
dynamic programming implementation of this algorithm is given below. `costs`

is a two-dimensional matrix where each cell contains the optimal solution to
a subproblem (i.e. `costs[x][y]`

is ```
edit_distance(source[:x],
target[:y])
```

).

def edit_distance(source, target): costs = [[None for _ in range(len(target) + 1)] for _ in range(len(source) + 1)] for index in range(len(costs)): costs[index][0] = index for index in range(len(costs[0])): costs[0][index] = index for x in range(1, len(source) + 1): for y in range(1, len(target) + 1): substitution_cost = 0 if source[x - 1] == target[y - 1] else 1 costs[x][y] = min(costs[x - 1][y - 1] + substitution_cost, costs[x - 1][y] + 1, # insertion costs[x][y - 1] + 1) # deletion return costs[-1][-1]

The first row and the first column represent the empty prefix of the source and target, respectively. This is why the matrix height/width is larger than the source/target length.

Note that it is unnecessary to store the entire `O(mn)`

matrix. The
recurrence only requires two rows at a time. Thus, this algorithm could be
further optimized to `O(n)`

space without changing the time complexity.

### Dynamic Programming in Practice

There are three steps involved in solving a problem by dynamic programming:

- Formulate the answer as a recurrence relation or recursive algorithm.
- Show that the number of different parameter values taken on by your recurrence is bounded by a (hopefully small) polynomial.
- Specify an order of evaluation for the recurrence so the partial results you need are always available when you need them.

In practice, you'll find that dynamic programming algorithms are usually easier to work out from scratch than look up.

### The Partition Problem

Suppose three workers are given the task of scanning through a shelf of books
in search of a given piece of information. To get the job done fairly and
efficiently, the books are to be partitioned among the three workers. If the
books are the same length, the job is easy: ```
100 100 100 | 100 100 100 | 100
100 100
```

. If the books are not the same length, the task becomes more
difficult (`100 200 300 400 500 | 600 700 | 800 900`

). An algorithm that
solves this **linear partition problem** takes as input an arrangement \(S\) of
nonnegative numbers and an integer \(k\). The algorithm should partition \(S\)
into \(k\) or fewer ranges, to minimize the maximum sum over all ranges,
without reordering any of the numbers.

A heuristic to solve this problem might compute the average size of a partition and then try and insert dividers to come close to this average. Unfortunately, this method is doomed to fail on certain inputs.

Instead, consider a recursive, exhaustive search approach to solving this problem. The /k/th partition starts right ater we placed the (k - 1)st divider. Where can we place this last divider? Between the ith and (i + 1)st elements for some \(i\), where \(1 \leq i \leq n\). What is the cost of this? The total cost will be the larger of two qualtities - (1) the cost of the last partition and (2) the cost of the largest partition formed to the left of \(i\). What is the size of this let partition? To minimize our total, we want to use the \(k - 2\) remaining dividers to partition the elements \(\{s_1, ..., s_i\}\) as equally as possible. This is a smaller instance of the same problem and hence can be solved recursively!

Therefore, let us define \(M[n, k]\) to be the minimum possible cost over all partitions of \(\{s_1, ..., s_n\}\) into \(k\) ranges, where the cost of a partition is the largest sum of elements in one of its parts. Thus defined, this function cab be evaluated:

\begin{equation} M[n,k] = min(i=1, n)(max(M[i, k - 1], \sum_{j = i + 1}^{n} s_j)) \end{equation}This recurrence can be solved with dynamic programming in \(O(kn^2)\) time. Note that we also need a second matrix, \(D\) to reconstruct the optimal partition. Whenever we update the value of \(M[i, j]\), we record which divider position was required to achieve that value.

### Parsing Context-Free Grammars

See ../theory/cfg.html.