The Algorithm Design Manual

This document houses my notes from The Algorithm Design Manual by Steven S. Skiena. Also see my answers to odd exercises.

Chapter 1: Introduction

A problem is specified by describing the complete set of instances it must work on and of its output after running on one of these instances. For example, an instance of the sorting problem might be [1, 4, 3]. An algorithm is a procedure that takes any of the possible input instances and transforms it to the desired output. There are many different algorithms that solve the problem of sorting.

We seek algorithms that are correct, efficient, and easy to implement. These goals may not be simultaneously achievable!

There is a fundamental difference between algorithms, which always produce a correct result, and heuristics, which may usually do a good job but without providing any guarantee. For example, the nearest-neighbor heuristic can be applied to the traveling salesman problem (TSP), but sometimes this heuristic doesn't even come close to finding the shortest possible tour.

Correctness

Reasonable-looking algorithms can easily be incorrect. Algorithm correctness is a property that must be carefully demonstrated. The primary tool to distinguish correct algorithms from incorrect ones is a proof. A proof has four main parts.

  1. Clear, precise statement of what you are trying to prove.
  2. Set of assumptions.
  3. Chain of reasoning that takes you from the assumptions to the statement you are trying to prove.
  4. Little, black square or QED.

Computer scientists usually prove things with proof by induction or proof by contradiction. We'll see induction below.

To reason about an algorithm, you need a careful description of the sequence of steps to be performed. The three most common forms of algorithmic notation are English, pseudocode, or a real programming language. These methods have natural tradeoffs between expression and precision.

Incorrectness

The best way to prove that an algorithm is incorrect is to produce an instance in which it yields an incorrect answer. Such instances are called counter-examples. Good counter-examples are verifiable and simple. Many tricks exist for finding counter-examples.

  • Think small: small examples are easier to reason about.
  • Think exhaustively: consider different types of examples.
  • Go for a tie: try similar values in the input collection.
  • Seek extremes: e.g. use values that are far apart or close together.

Induction

Mathematical induction is a common choice for proving correctness. If you're familiar with recursion, you're familiar with induction; recursion is induction. In both, we have general and boundary conditions. The general condition breaks the problem into smaller and smaller pieces and the boundary condition terminates the recursion. Suppose that you are trying to prove that a statement holds true for all natural numbers (all \(n\)). Induction would usually take the form:

  1. Show that the statement holds for the base case (usually \(P(0)\) or \(P(1)\)).
  2. Assume that \(P(k)\) is true.
  3. Prove that the statement also holds for \(P(k + 1)\).

Summations

Summation formula are concise expressions describing the addition of an arbitrarily large set of numbers.

\begin{equation} \sum_{i=1}^{n} f(i) = f(1) + f(2) + ... + f(n) \end{equation}

There are simple closed forms for summations of many algebraic functions.

\begin{equation} \sum_{i=1}^{n} 1 = n \end{equation} \begin{equation} \sum_{i=1}^{n} i = \frac{n(n + 1)}{2} \end{equation}

Program Modeling

Modeling is the art of formulating your application in terms of precisely described, well-understood problems. Proper modeling can eliminate the need to design or even implement algorithms, by relating your application to what has been done before.

Real-world applications involve real-world objects. Most algorithms, however, are designed to work on rigorously defined abstract structures. To exploit the algorithms literature, you must learn to describe your problem abstractly, in terms of procedures on fundamental structures.

  • Permutations are arrangements, or orderings of items. Usually the object in question if your problem seeks an "arrangement," "tour," "ordering," or "sequence."
  • Subsets are selects from a set of items. Usually the object in question if your problem seeks a "cluster," "collection," "committee," "group," "packaging," or "selection."
  • Trees are hierarchical relationships between items. Usually the object in question whenever your problem seeks a "hierarchy," "dominance relationship," "ancestor/descendant relationship," or "taxonomy."
  • Graphs represent relationships between arbitrary pairs of objects. Usually the object in question whenever you seek a "network," "circuit," "web," or "relationship."
  • Points represent locations in some geometric space. Usually the object in question whenever your problems work on "sites," "positions," "date records," or "locations."
  • Polygons represent regions in some geometric spaces. Usually the object in question whenever you are working on "shapes," "regions," "configurations," or "boundaries."
  • Strings represent sequences of characters or patterns. Usually the object in question whenever you are dealing with "text," "characters," "patterns," or "labels."

Learn to think recursively. Recursive structures occur everywhere in the algorithmic world. Each of the abstract structures described above can be thought about recursively; they are big things made of smaller things of the same type. Each structure has operations (like delete) that produce new versions of the same type.

Chapter 2: Algorithm Analysis

RAM Model of Computation

Machine-independent algorithm design depends upon a hypothetical computer called the Random Access Machine or RAM. Under this model of computation, we are confronted with a computer where

  • Each simple operation (+, *, -, =, if, call) takes exactly one time step.
  • Loops and subroutines are the composition of many single-step operations.
  • Each memory access takes exactly one time step. Further, we have as much memory as we need.

Under the RAM model, we measure run time by counting up the number of steps an algorithm takes on a given problem instance. We consider different time complexities that define a numerical function, representing time versus problem size.

  • Worst-case complexity of the algorithm is the function defined by the maximum number of steps taken in any instance of size \(n\).
  • Best-case complexity of the algorithm is the function defined by the minimum number of steps taken in any instance of size \(n\).
  • Average-case complexity of the algorithm is the function defined by the average number of steps taken in any instance of size \(n\).

Big Oh Notation

Big Oh simplifies our analysis by ignoring levels of detail that do not impact our comparison of algorithms. The formal definitions are as follows:

  • \(f(n) = O(g(n))\) means \(c \cdot g(n)\) is an upper bound on \(f(n)\). Thus there exists some constant \(c\) such that \(f(n)\) is always \(\leq c \cdot g(n)\), for large enough \(n\).
  • \(f(n) = \Omega(g(n))\) means \(c \cdot g(n)\) is an lower bound on \(f(n)\). Thus there exists some constant \(c\) such that \(f(n)\) is always \(\geq c \cdot g(n)\), for large enough \(n\).
  • \(f(n) = \Theta(g(n))\) means \(c_1 \cdot g(n)\) is an upper bound on \(f(n)\) and \(c_2 \cdot g(n)\) is a lower bound on \(f(n)\). Thus there exists constants \(c_1\) and \(c_2\) such that \(f(n) \leq c_1 \cdot g(n)\) and \(f(n) \geq c_2 \cdot g(n)\).

For example, \(2n^2 + 100n + 6 = O(n^2)\), because I choose \(c = 3\) and \(3n^2 \geq 2n^2 + 100n + 6\) when \(n\) is big enough.

Big Oh Classes

Big Oh groups functions into a set of classes, such that all the functions in a particular class are equivalent with respect to the Big Oh. A small variety of time complexities suffice and account for most algorithms that are widely used in practice.

Class Name Function
Constant \(f(n) = 1\)
Logarithmic \(f(n) = \log n\)
Linear \(f(n) = n\)
Superlinear \(f(n) = n lg n\)
Quadratic \(f(n) = n^2\)
Cubic \(f(n) = n^3\)
Exponential \(f(n) = c^n\)
Factorial \(f(n) = n!\)

We say that a faster-growing function dominates a slower-growing one. Specifically, when \(f\) and \(g\) belong to different classes (i.e. \(f(n) \neq \Theta(g(n))\)), we say \(g\) dominates \(f\) when \(f(n) = O(g(n))\), sometimes written \(g >> f\).

Big Oh Operations

The sum of two functions is governed by the dominant one.

\begin{equation} O(f(n)) + O(g(n)) \rightarrow O(max(f(n), g(n))) \end{equation}

Multiplying a function by a constant can not affect its asymptotic behavior.

\begin{equation} O(c \cdot f(n)) \rightarrow O(f(n)) \end{equation}

When two functions in a product are increasing, both are important.

\begin{equation} O(f(n)) * O(g(n)) \rightarrow O(f(n) * g(n)) \end{equation}

Logarithms

A logarithm is simply an inverse exponential function. Saying \(b^x = y\) is equivalent to saying that \(x = \log_b y\). Exponential functions grow at a distressingly fast rate. Thus, inverse exponential functions - i.e. logarithms - grow refreshingly slowly. Logarithms arise in any process where things are repeatedly halved.

Binary search is a good example of an \(O(\log n)\) algorithm. If searching for a particular name \(p\) in a telephone book, we start by comparing \(p\) against the middle. Then we discard half the names. Only twenty comparisons suffice to find any name in the million-name Manhattan phone book!

Logarithms appear in trees (height is \(\log_2 n\)), bits (\(\log_2 n\) bits required to store a number in binary).

Logarithm Properties

The \(b\) term in \(\log_b y\) is the base of the logarithm. Three bases are of importance for mathematical and historical reasons.

  • Base \(b = 2\): The binary logarithm, usually denoted \(lg n\), is a base 2 logarithm. Most algorithm applications of logarithms imply binary logarithms.
  • Base \(b = e\): The natural log, usually denoted \(ln x\), is a base $e = 2.71828…$ logarithm.
  • Base \(b = 10\): Less common today is the base-10 or common logarithm, usually denoted as \(\log x\).
\begin{equation} \log_x(xy) = \log_a(x) + \log_a(y) \end{equation}

It is easy to convert a logarithm from one base to another. This is a consequence of the formula:

\begin{equation} \log_a b = \frac{\log_c b}{\log_c a} \end{equation}

Thus, changing the base of \(\log b\) from base-a to base-c simply involves dividing by \(\log_c a\).

The base of the logarithm has no real impact on the growth rate. We are usually justified in ignoring the base of the logarithm when analyzing algorithms.

Chapter 3: Data Structures

Classes of abstract data types such as containers, dictionaries, and priority queues, have many different but functionally equivalent data structures that implement them. These different data structures realize different tradeoffs in the time to execute various operations.

Contiguous vs. Linked Data Structures

Data structures are either contiguous or linked, depending upon whether they are based on arrays or pointers.

Arrays

The array is the fundamental contiguously-allocated data structures. These single slabs of memory have constant access given the index and space efficiency. Dynamic arrays enable resizing. First, an initial size is allocated. If we run out of space, a larger array (usually 2x) is allocated and the elements are copied over. Insertion amortizes to \(O(1)\).

Lists

The list is the simplest linked structure. Each node in the list has data and pointer fields. Pointers are the connections that hold the pieces of together. Pointers represent the address of a location in memory. List don't incur overflow, but require extra space for pointer fields and don't given efficient random access to items.

Containers: Stacks and Queues

A container denotes a data structure that permits storage and retrieval of data items independent of content. Containers are distinguished by the particular retrieval order they support. Stacks support retrieval by last-in, first-out (LIFO) order. The put and get operations for stacks are usually called push and pop. Queues support retrieval in first in, first out (FIFO) order. The put and get operations for queues are usually called enqueue and dequeue.

Dictionaries

The dictionary data type permits access to data items by content. Operations include search (when given a key), insert, and delete.

Binary Search Trees

A binary tree is recursively defined as being empty or consisting of a root node with left and right subtrees. A binary search tree labels each node in a binary tree with a single key such that for any node labeled \(x\), all nodes in the left subtree have \(keys < x\) while all nodes in the right subtree have \(keys > x\). Binary tree nodes have left and right point fields, an optional parent pointer, and a data field.

Traversal

Traversal involves visiting all nodes. In-order traversal of a binary search tree can be done recursively with the following.

def traverse(tree):
    if tree:
        traverse(tree.left)
        process(tree.item)
        traverse(tree.right)

Changing the position of process gives alternate traversal orders. Processing the item first yields a pre-order traversal, while processing it last gives a post-order traversal.

Performance

Search, insert, and delete all take \(O(h)\) time, where \(h\) is the height of the tree. A perfectly balanced tree has \(h = \lceil \log n \rceil\). Unfortunately, inserting keys in sorted order produces a skinny linear height tree, \(h = n\). Randomizing insert order will produce \(O(\log n)\) height on average.

Balanced Search Trees

Balanced binary search tree data structures adjust the tree during insertion/delete to guarantee that height will always be \(O(\log n)\). Balanced tree implementations include red-black trees and splay trees.

Priority Queues

Priority queues are data structures that provide more flexibility than simple sorting, because they allow new elements to enter a system at arbitrary intervals. The basic priority queue supports three primary operations: insert, find-minimum/maximum, and delete-minimum/maximum. Priority queues can be implemented with arrays or BSTs, but a particularly nice implementation is the heap.

Hashing and Strings

Hash tables are a very practical way to maintain a dictionary. A hash function is a mathematical function that maps keys to integers. Hash table use the value of a hash function as an index into an array, and store our item at that position.

The first step of the hash function is usually to map each key to a big integer. Let \(\alpha\) be the size of the alphabet on which a given string \(S\) is written. Let char(c) be a function that maps each symbol of the alphabet to a unique integer from 0 to \(\alpha - 1\). The function

\begin{equation} H(S) = \sum_{i = 0}^{|S| - 1} \alpha^{|S| - (i + 1)} \times char(s_i) \end{equation}

maps each string to a unique (but large) integer by treating the characters of the strings as "digits" in a base-\(\alpha\) number system.

Collision Resolution

Two distinct keys will occasionally hash to the same value. This is a collision. Chaining is the easiest approach to collision resolution. Represent the hash table as an array of \(m\) linked lists. Chaining devotes a considerable amount of memory to pointers. Open addressing is an alternative to chaining. The hash table is maintained as an array of elements, each initialized to null. On an insertion, we check to see if the desired position is empty. If so, we insert it. If not, we must find some other place to insert it instead. The simplest possibility (called sequential probing) inserts the item in the next open spot in the table.

String Matching via Hashing

The Rabin-Karp algorithm gives a linear-time solution to substring search. Substring search asks if string \(t\) contains the pattern \(p\) as a substring, and if so where. In the Rabin-Karp algorithm, we compute a given hash function on both the pattern string \(p\) and the $|p|$-character substring starting from the $i$th position of \(t\). If these two strings are identical, clearly the resulting hash values must be the same. If the two strings are different, the hash values will almost certainly be different (we can check). Note that we need our hashing function to be constant for this algorithm to be \(O(n)\) instead of \(O(mn)\). This may be accomplished with a rolling hash function.

Duplicate Detection via Hashing

The key idea of hashing is to represent a large object using a single number. Hashing can be applied to duplicate detection. Suppose we're looking to find if a given document is contained in a corpus. Explicitly comparing the new document \(D\) to all \(n\) documents is hopelessly inefficient. But we can hash \(D\) to an integer, and compare it to the hash codes of the rest of the corpus.

Chapter 4: Sorting and Searching

Sorting a the basic building block that many other algorithms are built around. Many other problems become easy once a set of items is sorted (e.g. searching, closest pair).

Many things need to be considered when sorting:

  • Ascending or descending order.
  • Key or entire record.
  • What to do with equal keys (stable sort?).
  • Comparison function.

Heapsort

Selection sort is a simple algorithm that repeatedly extracts the smallest remaining element from the unsorted part of an array. A computer takes \(O(n)\) time to find the smallest element in an array. This is the operation supported by a priority queue. What if we improve the data structure? Heapsort is nothing but an implementation of selection sort using the right data structure.

Heaps

Heaps are a simple and elegant data structure that efficiently support the priority queue operations insert and extract-min. They work by maintaining a partial order on the set of elements. A heap-labeled tree is a binary tree such that the key labeling of each node dominates the key labeling of each of its children. In a min-heap a node dominates its children by containing a smaller key than they do.

heap.png

Heaps can be stored with pointers (node with children) or arrays. In a array, the root of the tree is in the first position, and its left and right children are in the second and third positions. In general the keys of the $i$th level of the binary tree are stored in \(2^{i - 1}\) to \(2^i - 1\). This means that the left child of \(k\) sits in position \(2k\) and the right child in \(2k + 1\), while the parent of \(k\) is in \(\lceil k / 2 \rceil\). Note that sparse trees can be very space inefficient, we need to be careful to pack our elements as far left as possible. This implicit representation of binary saves memory, but is less flexible than using pointers. We cannot store arbitrary tree topologies without wasting large amount of space. We cannot move subtrees around by just changing a single pointer. This loss of flexibility explains why we cannot use this idea to represent binary search trees.

  1. Insert

    Place the new element into the left-most open spot in the array, namely the $(n + 1)$st position of a previously $n$-element heap. Then, bubble up the new key to its proper position in the hierarchy by swapping the element with its parent until the parent dominates the element. Insertion takes at most \(O(\log n)\) time.

    def insert(element, heap):
        heap = heap + [element]
        bubble_up(len(heap), heap)
    
    def bubble_up(index, heap):
        if index > 0 and heap[index] > heap[index // 2]:
            array[index], array[index // 2] = array[index // 2], array[index]
            bubble_up(index // 2, heap)
    
  2. Extracting the Minimum

    The minimum can easily be found by looking in the first position in the array. Removing the top element leaves a hole in the array. Fill by moving the element from the right-most leaf (sitting in the $n$th position of the array) into the first position. Then, bubble down the new key until it dominates all its children. The key should be switched with the dominant child.

    def extract_minimum(heap):
        minimum = heap[0]
        heap = [heap[-1]] + heap[1:-1]
        bubble_down(0, heap)
        return minimum
    
    def bubble_down(index, heap):
        smaller_index = find_smaller_child(index, heap)
        if smaller_index:
            heap[index], heap[smaller_index] = heap[smaller_index], heap[index]
            bubble_down[smaller_index]
    
    def find_smaller_child(index, heap):
        if 2 * index + 1 < len(heap) and heap[2 * index] > heap[2 * index + 1]:
            return 2 * index + 1
        elif 2 * index < len(heap):
            return 2 * index
    

Heapsort Implementation

Heapsort creates a heap and repeatedly extracts the minimum to give a worst-case \(O(n \log n)\) algorithm. It is an in-place sort, meaning it uses no extra memory over the array containing the elements to be sorted.

def heapsort(array):
    heap = []
    for element in array:
        insert(element, heap)
    for index in range(len(array)):
        array[index] = extract_min(heap)

Mergesort

Mergesort is a classic divide-and-conquer algorithm. This recursive approach to sorting involves partitioning the elements into two groups, sorting each of the smaller problems recursively, and then interleaving the two sorted lists to totally order the elements.

The efficiency of mergesort depends upon how efficiently we combine the two sorted halves into a single sorted list. We need to merge the two lists together. Observe that the smallest overall item in the two sorted lists must sit at the top of one of the two lists. To merge, we remove the smallest element, then repeat. Because the recursion goes \(\lg n\) levels deep, and a linear amount of work is done per level, mergesort takes \(O(n \log n)\) time in the worst case.

def mergesort(array):
    left, right = array[:len(array) / 2], array[len(array / 2):]
    return merge(mergesort(left), mergesort(right))

def merge(array1, array2):
    merged = []
    while array1 or array2:
        if not array2 or (array1 and array1[0] < array2[0]):
            merged.append(array1.pop(0))
        else:
            merged.append(array2.pop(0))
    return merged

Quicksort

Quicksort selects a item \(p\) from the collection then separates the other elements into piles: those before \(p\) and those after \(p\). We place the pivot \(p\) between the other two piles, and then sort piles independently.

Quicksort runs in \(O(n * h)\), where \(h\) is the height of the recursion tree. Suppose, luckily, we always the median element, the subproblems are always half the size of the previous level. This produces \(O(n \log n)\), the best case of quicksort. Suppose, unluckily, we always choose the biggest or smallest element in the sub-array. This produces \(O(n^2)\), the worst case of quicksort.

Quicksort is typically 2-3 times faster than mergesort or heapsort when implemented well. All three algorithms are \(O(n \log n)\), but experimentation shows that the simpler operations in the inner loop give quicksort a constant improvement.

Randomization

Randomization is a powerful tool to improve algorithms with bad worst-case but good average-case complexity.

If we randomly choose the pivot in quicksort, we can expect, with high probability, \(O(n \log n)\). The best possible selection for the pivot is the median. Suppose a key is good enough if it lies in the center half of the sorted space of keys. Since the expected number of good splits and bad splits is the same, the bad splits can only double the height of the tree, which still produces \(O(\log n)\) height. This randomization may be done by either shuffling the array first or by selecting a random index at each step.

Distribution Sort

Suppose we have a list of names to sort. We could partition them according to the first letter. This creates 26 different piles, or buckets, or names. Then, we partition each pile based on the second letter of each name, etc. The names will be sorted as soon as each bucket contains only a single name. At the end, we'll be able to simply concatenate the bunch of piles together. This algorithm is commonly called bucketsort or distribution sort.

Bucketing is a very effective idea whenever we are confident that the distribution of data will be roughly uniform. It is the idea that underlies hash tables, kd-trees, and a variety of other practical data structures. The downside of such techniques is that the performance can be terrible when the data distribution is not what we expected.

Binary Search and Related Algorithms

Binary search is a fast algorithm for searching in a sorted array. We compare the key \(q\) to the middle item. If \(q\) is smaller, it must appear in the first half; if not it must reside in the second half. By repeating this process recursively on the correct half, we locate the key in \(\lg n\) comparisons.

def binary_search(item, array, start=None, end=None):
    if start is None or end is None:
        start, end = 0, len(array)

    middle_index = (start + end) // 2
    if start > end:
        return False
    elif item == array[middle_index]:
        return True
    elif item < array[middle_index]:
        return binary_search(item, array, start, middle_index - 1)
    else:
        return binary_search(item, array, middle_index + 1, end)

Binary search is the power behind twenty questions!

Counting Occurrences

Suppose we want to count the number of times a given key \(k\) occurs in a given sorted array. We could use binary search to find the index of an element in the correct block in \(O(\lg n)\) time. Then we sequentially test elements to the left and right until we find one that differs from the key. The difference between the boundaries (plus one) gives the count of the number of occurrences of \(k\). This algorithm runs in \(O(\lg n + s)\), where \(s\) is the number of occurrences of the key.

A fast algorithm results by modifying binary search to search for the boundary of the block containing \(k\), instead of \(k\) itself. We perform this search twice, for a total time of \(O(\lg n)\), so we can count the occurrences in logarithmic time regardless of the size of the block.

One-Sided Binary Search

Suppose we don't know the bounds of our sorted collection. Binary search can also proceed from a specific position at repeatedly larger intervals (1, 2, 4, 8, 16) until we find a value greater than our key. We now have a window containing the target and can proceed with binary search. One-sided binary search is most useful whenever we are looking for a key that lies close to our current position.

Square and Other Roots

Suppose we are searching for the square root \(r\) of \(n\). Notice that the square root of \(n \leq 1\) must be at least 1 and at most \(n\). Consider the midpoint \(m\) of this interval. How does \(m^2\) compare to \(n\)? If \(n \leq m^2\), then the square root must be greater than \(m\), so the algorithm repeats on a new range of values. This application of binary search identifies the square root within ±1 after only \(\lg n\) rounds. Root-finding algorithms that converge faster are known, but this is simple, robust and applies to other functions.

Divide-and-Conquer

One of the most powerful techniques for solving problems is to break them down into smaller, more easily solved pieces. A recursive algorithm starts to become apparent when we break the problem into smaller instances of the same type of problem. Divide-and-conquer splits the problem in (say) halves, solves each half, then stitches the pieces back together to form a full solution. Whenever the merging takes less time than recursively solving the two subproblems, we get an efficient algorithm. For example, mergesort takes linear time to merge two sorted lists of \(n/2\) elements, each of which was obtained in \(O(n \lg n)\) time.

Recurrence Relations

Many divide-and-conquer algorithms have time complexities that are naturally modeled by recurrence relations. A recurrence relation is an equation that is defined in terms of itself. The Fibonacci numbers are described by the recurrence relation \(F_n = F_{n - 1} + F_{n - 2}\). Many other natural functions are easily expressed as recurrences. For example, \(a_n = 2a_{n - 1}, a_1 = 1 \rightarrow a_n = 2^{n - 1}\).

Divide-and-conquer algorithms tend to break a given problem into some number of smaller pieces (say \(a\)), each of which is of size \(n/b\). Further, they spend \(f(n)\) time to combine these subproblem solutions into a complete result. Let \(T(n)\) denote the worst-case time the algorithm takes to solve a problem of size \(n\). Then \(T(n)\) is given by the following recurrence relation.

\begin{equation} T(n) = aT(n/b) + f(n) \end{equation}

For example, the running time behavior of mergesort is governed the recurrence \(T(n) = 2T(n/2) + O(n)\). This recurrence evaluates to \(T(n) = O(n \lg n)\). Binary search is governed by the recurrence \(T(n) = T(n/2) + O(1)\).

Chapter 5: Graph Traversal

A graph \(G = (V, E)\) consists of a set of vertices \(V\) together with a set \(E\) of vertex pairs or edges. Graphs can represent essentially any relationship. The key to using graph algorithms effectively in applications lies in correctly modeling your problem so you can take advantage of existing algorithms.

Flavors of Graphs

Several fundamental properties of graphs impact the choice of the data structures used to represent them and algorithms available to analyze them.

  • Undirected vs. Directed
    • A graph \(G = (V, E)\) is undirected if edge \((x, y) \in E\) implies that \((y, x) \in E\). If not, we say that the graph is directed.
  • Weighted vs. Unweighted
    • Each edge (or vertex) in a weighted graph \(G\) is assigned a numerical value, or weight. In unweighted graphs, there is no cost distinction between various edges and vertices.
    • The difference between weighted and unweighted graphs becomes particularly apparent in finding the shortest path between two vertices.
  • Simple vs. Non-simple
    • Any graph that avoids self-loops and multiedges is called simple. A self-loop is an edge \((x, x)\) involving only one vertex. An edge \((x, y)\) is a multiedge if it occurs more than once in the graph.
  • Sparse vs. Dense
    • Graphs are sparse when only a small fraction of the possible vertex pairs actually have edges defined between them. Graphs where a large fraction of the vertex pairs define edges are called dense.
    • The degree of a vertex is the number of edges adjacent to it.
    • In a regular graph, each vertex has exactly the same degree.
  • Cyclic vs. Acyclic
    • An acyclic graph does not contain any cycles.
    • Trees are connected, acyclic undirected graphs.
  • Embedded vs. Topological
    • A graph is embedded if the vertices and edges are assigned geometric positions.
  • Implicit vs. Explicit
    • Certain graphs are not explicitly constructed and then traversed, but built as we use them.
  • Labeled vs. Unlabeled
    • Each vertex is assigned a unique name or identifier in a labeled graph to distinguish it from all other vertices. In unlabeled graphs, no such distinctions have been made.

Social networks are graphs where the vertices are people, and there is an edge between two people if and only if they are friends.

Data Structures for Graphs

The two basic graph data structures are adjacency matrices and adjacency lists. We assume a graph \(G = (V, E)\) contains \(n\) vertices and \(m\) edges.

graph-data-structures.png

Adjacency lists are the right data structure for most applications of graphs.

Adjacency Matrix

We can represent \(G\) using an \(n x n\) matrix \(M\), where element \(M[i,j] = 1\) if \((i, j)\) is an edge of \(G\), and 0 if it isn't. This allows fast answers to the question "is \((i, j)\) in \(G\)?", and rapid updates for edge insertion and deletion. IT may use excessive space for graphs with many vertices and relatively few edges, however.

Adjacency Lists

We can more efficiently represent sparse graphs by using linked lists to store the neighbors adjacent to each vertex. Adjacency lists make it harder to verify whether a given edge \((i, j)\) is in \(G\), since we must search through th3e appropriate list to find the edge.

Traversing a Graph

The key idea behind graph traversal is to mark each vertex when we first visit it and keep track of what we have not yet completely explored. Each vertex may be undiscovered, discovered, or processed. We must maintain a structure containing the vertices that we have discovered but not yet completely processed.

Breadth-First Search

The basic breadth-first search algorithm is given below. It takes \(O(n + m)\) time.

def bfs(graph, root):
    discovered = {root}
    parent = {}
    queue = [root]
    while queue:
        current = queue.pop(0)
        for neighbor in graph.find_adjacent(current):
            if neighbor not in discovered:
                discovered.add(neighbor)
                parent[neighbor] = current
                queue.append(neighbor)

This implementation of breadth-first search, we assign a direction to each edge, from the discoverer current to the discovered neighbor. We maintain a parent map which defines a tree on the vertices of the graph. This tree contains the shortest path from the root to every other node in the tree. A breadth-first search tree can be seen in the right of the image below.

bfs.png

The graph edges that do not appear in the breadth-first search tree also have special properties. For undirected graphs, non-tree edges can point only to vertices on the same level as the parent vertex, or to vertices on the level directly below the parent. These properties follow easily from the fact that each path in the tree must be the shortest path in the graph.

  1. Applications of Breadth-First Search
    1. Connected Components

      A connected component of an undirected graph is a maximal set of vertices such that there is a path between every pair of vertices. The components are separate "pieces" of the graph such that there is no connection between the pieces. An amazing number of seemingly complicated problems reduce to finding or counting connected components. For example, testing whether a puzzle such as the Rubik's cube or the 15 puzzle can be solved from any position is really asking whether the graph of legal configurations is connected.

      Connected components can be found using breadth-first search since the vertex order does not matter. We start from the first vertex. Anything we discover during this search must be part of the same connected component. We then repeat the search from any undiscovered vertex (if one exists) to define the next component, and so on until all vertices have been found.

    2. Two-Coloring Graphs

      The vertex-coloring problem seeks to assign a label (or color) to each vertex of a graph such that no edge links any two vertices of the same color. We can easily avoid all conflicts by assigning each vertex a unique color. However, the goal is to use as few colors as possible.

      A graph is bipartite if it can be colored without conflicts while using only two colors. Consider the "had-sex-with" graph in a heterosexual work. Men have sex only with women, and vice versa. Thus gender defines a legal two-coloring, in this simple model.

      We can argument breadth-first search so that whenever we discover a new vertex, we color it the opposite of its parent. We check whether any nondiscovery edge links two vertices of the same color. Such a conflict means that the graph cannot be two-colored.

Depth-First Search

The difference between BFS and DFS results is in the order in which they explore vertices. This order depends completely upon the container data structure used to store the unprocessed vertices: BFS uses a queue, DFS uses a stack. DFS implementations often use recursion instead of an explicit stack.

discovered = set()
time = 0
entry_time = {}
exit_time = {}
parent = {}

def dfs(root, graph):
    discovered.add(root)
    time += 1
    entry_time[root] = time
    for neighbor in graph.get_adjacent(root):
        if neighbor not in discovered:
            parent[neighbor] = root
            dfs(neighbor, graph)
    exit_time[root] = time
    time += 1

This implementation of depth-first search maintains the traversal time for each vertex. The time clock ticks each time we enter or exit any vertex. The time intervals can tell us a vertex's ancestor and how many descendants it has.

dfs.png

Depth-first search partitions the edges of an undirected graph into exactly two classes: tree edges and back edges. The tree edges discover new vertices, and are those encoding in the parent relation (seen in the image above). Back edges are those whose other endpoint is an ancestor of the vertex being expanded, so they point back into the tree.

  1. Applications of Depth-First Search
    1. Finding Cycles

      Back edges are the key to finding a cycle in an undirected graph. If there is no back edge, all edges are tree edges, and no cycle exists in a tree. But any back edge going from \(x\) to an ancestor \(y\) creates a cycle with the tree path from \(y\) to \(x\).

    2. Articulation Vertices

      articulation-vertex.png

      An articulation vertex is a single vertex whose deletion disconnects a connected component of the graph. Any graph that contains an articulation vertex is inherently fragile, because deleting that single vertex causes a loss of connectivity between other nodes. The connectivity of a graph is the smallest number of vertices whose deletion will disconnect the graph. The connectivity is one if the graph has an articulation vertex. More robust graphs without such a vertex are said to be biconnected.

      Testing for articulation vertices by brute force is easy. Temporarily delete each vertex \(v\), and then do a BFS or DFS traversal of the remaining graph to establish whether it is still connected. The total time is \(O(n(m + n))\).

      DFS gives a clever, linear-time algorithm. Look at the depth-first search tree. If this tree represents the entirety of the graph, all internal (non-leaf) nodes would be articulation vertices, since deleting any one of them would separate a leaf from the root. A depth-first search of a general graph partitions the edges into tree edges and back edges. Think of these back edges as security cables linking a vertex back to one of its ancestors. Finding articulation vertices requires maintaining the extent to which back edges (i.e. security cables) link chunks of the DFS tree back to ancestor nodes.

  2. DFS on Directed Graphs

    When traversing undirected graphs, every edge is either in the depth-first search tree or a back edge to an ancestor in the tree. For directed graphs, depth-first search labelings can take on a wider range of possibilities: tree edges, forward edges, back edges, and cross edges.

    1. Topological Sorting

      Topological sorting is the most important operation on directed acyclic graphs (DAGs). It orders the vertices on a line such that all directed edges go from left to right. Such an ordering cannot exist if the graph contains a directed cycle, because there is no way you can keep going right on a line and still return back to where you started from!

      Each DAG has at least one topological sort. The importance of topological sorting is that it gives us an ordering to process each vertex before any of its successors. For example, suppose college courses are vertices and prerequisites are edges. Your transcript is a topological sort of courses.

      Topological sorting can be performed efficiently using depth-first searching. A directed graph is a DAG if and only if no back edges are encountered. Labeling the vertices in the reverse order that they are marked processed finds a topological sort of a DAG (i.e. record when you finish processing then reverse the collection).

Chapter 6: Weighted Graph Algorithms

There is an alternate universe of problems for weighted graphs. If we're traveling to California, we don't just care about the number of roads traveled on. The edges of road networks are naturally bound to numerical values such as construction cost, traversal time, length, or speed limit. Identifying the shortest path in such graphs proves more complicated than breadth-first search in unweighted graphs, but opens the door to a wide range of applications.

Minimum Spanning Trees

A spanning tree of a graph \(G = (V, E)\) is a subset of edges from \(E\) forming a tree connecting all vertices of \(V\). For edge-weighted graphs, we are particularly interested in the minimum spanning tree - the spanning tree whose sum of edge weights is as small as possible. Minimum spanning tree are the answer whenever we need to connect a set of points (representing cities, homes, junctions, or other locations) by the smallest amount of roadway, wire, or pipe. Minimum spanning trees are also useful for clustering.

There can be more than one minimum spanning tree in a graph. Indeed, all spanning trees of an unweighted (or equally weighted) graph \(G\) are minimum spanning trees, since each contains exactly \(n - 1\) equal-weight edges. Such a spanning tree can be found using depth-first or breadth-first search. Finding a minimum spanning tree is more difficult for general weighted graphs, however two different algorithms are presented below.

Prim's Algorithm

Prim's algorithm for minimum spanning tree starts from one vertex and grows the rest of the tree one edge at a time until all vertices are included. Greedy algorithms make the decision of what to do next by selecting the best local option from all available choices without regard to the global structure. Since we seek the tree of minimum weight, the natural greedy algorithm for a minimum spanning tree repeatedly selects the smallest weight edge that will enlarge the number of vertices in the tree. Prim's algorithm can be implemented as \(O(m + n\lg{n})\) with a priority-queue.

prim-mst(G)
    select an arbitrary vertex s to start the tree from
    while (there are still non tree vertices)
        select the edge of minimum weight between a tree and nontree vertex
        add the selected edge and vertex to the tree T_prim

The correctness of this algorithm can be proven by contradiction. We assert that there must be a specific instant where the tree went wrong. However, since we always select the smallest edge, it's not possible for a smaller edge to exist than the one we're adding (otherwise it would have already been chosen).

Kruskal's Algorithm

Kruskal's algorithm is an alternate approach to finding minimum spanning trees that proves more efficient on spare graphs. Like Prim's, Kruskal's algorithm is greedy. Unlike Prim's, it does not start with a particular vertex. Kruskal's algorithm builds up connected components of vertices, culminating in a minimum spanning tree. Initially, each vertex forms its own separate component in the tree-to-be. The algorithm repeatedly considers the lightest remaining edge and tests whether its two endpoints lie within the same connected component. If so, this edge will be discarded, because adding it would create a cycle in the tree-to-be. If the endpoints are in different components, we insert the edge and merge the two components into one. Since each connected component is always a tree, we need never explicitly test for cycles.

kruskal-mst(G)
    put the edges in a priority queue ordered by weight
    count = 0
    while (count < n - 1) do
        get next edge (v, w)
        if (component(v) != component(w))
            add to T_kruskal
            merge component(v) and component(w)

The speed of Kruskal's algorithm depends on the component test. This test may be implemented by a breadth-first or depth-first search in a sparse graph. With this approach, Kruskal's algorithm is \(O(mn)\). However, a faster implementation exists with the union-find data structure.

  1. The Union-Find Data Structure

    A set partition is a partitioning of the elements of some universal set (say the integers 1 to \(n\)) into a collection of disjointed subsets. Thus, each element must be in exactly one subset. Set partitions naturally arise in graph problems such as connected components (each vertex is in exactly one connected component) and vertex coloring (a person may be male or female, but not both or neither).

    The connected components in a graph can be represented as a set partition. For Kruskal's algorithm to run efficiently, we need a data structure that efficiently supports the following operations:

    • \(same component(v_1, v_2)\)
    • \(merge components(C_1, C_2)\)

    The union-find data structure represents each subset as a "backwards" tree, with pointers from a node to its parent. Each node of this tree contains a set element, and the name of the set is taken from the key at the root.

    union-find.png

    We implement our desired component operations in terms of two simpler operations, union and find:

    • \(find(i)\)
      • Find the root of tree containing element \(i\), by walking up the pointers until there is nowhere to go. Return the label of the root.
    • \(union(i, j)\)
      • Link the root of one of the trees (say containing \(i\)) to the root of the tree containing the other (say \(j\)) so \(find(i)\) now equals \(find(j)\).

    Tree structures can be very unbalanced, so we must limit the height of our trees. The most obvious means of control is the decision of which of the two component roots becomes the root of the combined component on each \(union\). To minimize the tree height, it is of course better to make the smaller tree the subtree of the bigger one.

    With union-set, we can do both unions and finds in \(O(\log{n})\).

Shortest Paths

A path is a sequence of edges connecting two vertices. The shortest path from \(s\) to \(t\) in an unweighted graph can be constructed using a breadth-first search from \(s\). The minimum-link path is recorded in the breadth-first search tree, and it provides the shortest path when all edges have equal weight. However, BFS does not suffice to find shortest paths in weighted graphs. The shortest weighted path might use a large number of edges.

Finding the shortest path between two nodes in a graph arises in many different applications. These may include transportation problems and computer network communication problems. Many applications reduce to finding shortest path, learn to smell this! Page 212 of The Algorithm Design Manual contains a lovely example (Dialing for Documents).

Dijkstra's Algorithm

Dijkstra's algorithm is the method of choice for finding shortest paths in an edge-and/or vertex-weighted graph. Given a particular start vertex \(s\), it finds the shortest path from \(s\) to every other vertex in the graph, including your desired destination \(t\).

Suppose the shortest path from \(s\) to \(t\) in graph \(G\) passes through a particular intermediate vertex \(x\). Clearly, this path must contain the shortest path from \(s\) to \(x\) as its prefix, because if not, we could shorten our $s$-to-\(t\) path by using a shorter $s$-to-\(t\) prefix. Thus, we must compute the shortest path from \(s\) to \(x\) before we find the path from \(s\) to \(t\).

Dijkstra's algorithm proceeds in a series of rounds, where each round establishes the shortest path from \(s\) to some new vertex. Specifically, \(x\) is the vertex that minimizes \(dist(s, v_i) + w(v_i, x)\) over all finished \(1 \leq i \leq n\), where \(w(i, j)\) is the length of the edge from \(i\) to \(j\), and \(dist(i, j)\) is the length of the shortest path between them.

import math

def dijkstra(graph, s, t):
    known = {s}
    distances = {vertex: math.inf for vertex in graph.all_vertices()}
    for neighbor in s.get_neighbors():
        distances[neighbor] = weight(s, neighbor)
    last = s
    while last != t:
        v_next = min(distances[v] for v in (graph.all_vertices() - known))
        for neighbor in v_next.get_neighbors():
            distances[neighbor] = min(distances[neighbor],
                                      distances[v_next] + weight(v_next, neighbor))
        last = v_next
        known.add(v_next)

The basic idea is very similar to Prim's algorithm. In each iteration, we add exactly one vertex to the tree of vertices for which we know the shortest path from \(s\). The difference between Dijkstra's and Prim's algorithms is how they rate the desirability of each outside vertex. In the minimum spanning tree problem, all we cared about was the weight of the next potential tree edge. In shortest path, we want to include the closest outside vertex (in shortest-path distance) to \(s\). This is a function of both the new edge weight and the distance from \(s\) to the tree vertex it is adjacent to.

All-Pairs Shortest Path

Sometimes we want to find the shortest path between all pairs of vertices in a given graph. We could solve the all-pairs shortest path by calling Dijkstra's algorithm from each of the \(n\) possible starting vertices (\(O(n^3)\)). But Floyd's all-pairs shortest-path algorithm is a slick way to construct an \(n x n\) distance matrix from the original weight matrix of the graph. This algorithm is also \(O(n^3)\), but the loops are so tight and the program so short that it runs better in practice.

Floyd's algorithm starts with the adjacency matrix. The edge \((i, j)\) should have its weight in matrix[i][j]. Cells for which the edge doesn't exist should be set to MAXINT.

def floyd(adjacency_matrix):
    n_vertices = len(adjacency_matrix)
    for k in range(n_vertices):
        for i in range(n_vertices):
            for j in range(n_vertices):
                through_k = adjacency_matrix[x][k] + adjacency_matrix[k][y]
                if (through_k < adjacency_matrix[x][y]):
                    adjacency_matrix[x][y] = through_k

We define \(W[i, j]^k\) to be the length of the shortest path from \(i\) to \(j\) using only vertices numbered from 1, 2, …, \(k\) as possible intermediate vertices. At each iteration, we allow a richer set of possible shortest paths by adding a new vertex as a possible intermediary. Allowing the $k$th vertex as a stop helps only if there is a short path that goes through \(k\), so \(W[i, j]^k = min(W[i, j]^{k - 1}, W[i, k]^{k - 1}, + W[k, j]^{k - 1})\).

Chapter 7: Combinatorial Search and Heuristic Methods

Backtracking

Backtracking is a systematic way to iterate through all the possible configurations of a combinatorial search space. These configurations may represent all possible arrangements of objects (permutations), all possible ways of building a collection of them (subsets), or even possible move sequences in a game. We model each solution as a vector \(a = (a_1, a_2, ..., a_n)\), where each element \(a_i\) is selected from a finite ordered set \(S_i\). We must be careful to avoid repetitions and missing configurations.

def backtrack(a):
    if is_solution(a):
        report(a)
    else:
        s_i = find_candidates(a)
        while s_i:
            backtrack(a + [s_i.pop()])

At each step in the backtracking algorithm, we try to extend a given partial solution \(a = (a_1, a_2, ..., a_k)\) by adding another element at the end. After extending it, we must test whether what we now have is a solution or if not we must check whether the partial solution is still extendible to some complete solution. We're using a depth-first search to enumerate solutions. Breadth-first search would require more space (proportional to the width instead of the height of the search tree).

Backtracking Subsets

Suppose we are generating subsets of an n-element set, say \(\{1,...,n\}\). Define each subset as an array of \(n\) cells, where the value of \(a_i\) (true or false) signifies whether the ith item is in the given subset. We consider the subset a solution when every cell has true/false (length == n).

Backtracking Permutations

\(\{1,...,n\}\) has \(n!\) distinct permutations. Each permutation is represented by an array of \(n\) cells. The set of candidates for the ith position will be the set of elements that have not appeared in the \((i - 1)\) elements of the partial solution, corresponding to the first \((i - 1)\) elements of the permutation. Our array is a solution whenever length equals \(n\).

Backtracking Graph Paths

The starting point of any path from \(s\) to \(t\) is always \(s\). Thus, \(s\) is the only candidate for the first position and \(S_0 = \{s\}\). The possible candidates for the second position are the vertices \(v\) such that \((s, v)\) is an edge of the graph and \(v\) hasn't been used in the partial solution. We have a solution when \(a_k\) is equal to \(t\). Some paths might be shorter than others.

Search Pruning

Pruning is the technique of cutting off the search the instant we have established that a partial solution cannot be extended into a full solution. Pruning is powerful. Even simple pruning strategies can suffice to reduce running time from impossible to instantaneous.

For the traveling salesman, we seek the cheapest tour that visits all vertices. Suppose that in the course of our search we find a tour \(t\) whose cost is \(C_t\). Later, we may have a partial solution \(a\) whose edge sum \(C_A > C_t\). Any tour with this prefix will have cost greater than tour \(t\), and hence is doomed to be nonoptimal. Cutting away such failed partial tours as soon as possible can have an enormous impact on running time.

As another example, suppose we're solving a Sudoku puzzle. We run through empty squares, try candidate numbers, and backtrack when we are out of candidates. The naive search randomly chooses open squares. Instead, we could choose the square with the fewest number of candidates. Additionally, when generating candidates, we could look ahead to see if the partial solution causes some other open square to have no candidates. Successful pruning often requires looking ahead to see when a solution is doomed to go nowhere, and backing off as soon as possible.

Exploiting symmetry is another avenue for reducing combinatorial searches

Heuristic Search Methods

Heuristic methods provide an alternate way to approach difficult combinatorial optimization problems. Backtracking gave us a method to find the best of all possible solutions, as scored by a given objective function. However, any algorithm searching all configurations is doomed to be impossible on large instances.

The methods observed below have two common components: solution space representation and a cost function.

Random Sampling

The simplest method to search in a solution space uses random sampling. It is also called the Monte Carlo method. We repeatedly construct random solutions and evaluate them, stopping as soon as we get a good enough solution, or (more likely) when we are tired of waiting. We report the best solution found over the course of our sampling.

True random sampling requires that we are able to select elements form the solution space uniformly at random. This means that each of the elements of the solution space must have an equal probability of being the next candidate selected.

Random sampling does well when there's a high proportion of acceptable solutions or when there is no coherence in the solution space. For example, hunting for a any large prime number.

Local Search

A local search employs the local neighborhood around every element in the solution space. Think of each element \(x\) in the solution space as a vertex, with a directed edge \((x, y)\) to every candidate solution \(y\) that is a neighbor of \(x\). Our search proceeds from \(x\) to the most promising candidate in x's neighborhood.

We certainly do not want to construct the neighborhood graph for any sizable solution space. We want a general transition mechanism that takes us to the next solution by slightly modifying the current one. Typical mechanisms include swapping a random pair of items or changing (inserting or deleting) a single item in the solution.

In a hill-climbing procedure, we try to find the top of a mountain (or alternatively, the lowest point in a ditch) by starting at some arbitrary point and taking any step that leads in the direction we want to travel. We repeat until we have reached a point where all our neighbors lead us in the wrong direction.

Suppose you wake up in a sky lodge, eager to reach the top of the neighboring peak. Your first transition to grain altitude might be to go upstairs to the top of the building. And then you are trapped. To reach the top of the mountain, you must go downstairs and walk outside, but this violates the requirement that each step has to increase your score. Hill-climbing and closely related heuristics such as greedy search or gradient descent search are great at finding local optima quickly, but often fail to find the globally best solution.

Use local search when there is great coherence in the solution space. Hill climbing is at its best when the solution space is convex. Local search is also useful whenever the cost of incremental evaluation is much cheaper than global evaluation.

Simulated Annealing

Simulated annealing is a heuristic search procedure that allows occasional transitions leading to more expensive (and hence inferior) solutions. This may not sound like progress, but it helps keep our search from getting stuck in local optima.

The inspiration for simulated annealing comes from the physical process of cooling molten materials down to the solid state. In thermodynamic theory, a particle's energy state is a function of its temperature. We can mimic physics to solve combinatorial optimization problems.

Our problem representation includes both a representation of the solution space and an easily computable cost function \(C(s)\) measuring the quality of a given solution. The new component is the cooling schedule, whose parameters govern how likely we are to accept a bad transition as a function of time.

At the beginning of the search, we are eager to use randomness to explore the search space widely, so the probability of accepting a negative transition should be high. As the search progresses, we seek to limit transitions to local improvements and optimizations.

Genetic Algorithms

Genetic algorithms draw their inspiration from evolution and natural selection. Through the process of natural selection, organisms adapt to optimize their chances for survival in a given environment. Random mutations occur in an organism's genetic description, which then get passed on to its children. Should a mutation prove helpful, these children are more likely to survive and reproduce. Should it be harmful, these children won't, and so the bad trait will die with them.

Genetic algorithms maintain a "population" of solution candidates for the given problem. Elements are drawn at random from this population and allowed to "reproduce" by combining aspects of the two-parent solutions. The probability that an element is chosen to reproduce is based on its "fitness," - essentially the cost of the solution it represents. Unfit elements die from the population, to be replaced by a successful-solution offspring.

The idea behind genetic algorithms is extremely appealing. However, they don't seem to work as well on practical combinatorial optimization problems as simulated annealing does.

Chapter 8: Dynamic Programming

Dynamic programming is a technique for efficiently implementing a recursive algorithm by storing partial results. Dynamic programming guarantees correctness by searching all possibilities and provides efficiency by storing results to avoid recomputing. If the naive recursive algorithm computes the same subproblems over and over again, storing the answer for each subproblem in a table to look up instead of recompute can lead to an efficient algorithm. Dynamic programming is essentially a tradeoff of space for time.

Fibonacci Example

Let's look at a simple program for computing the /n/th Fibonacci number.

def fib(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fib(n - 1) + fib(n - 2)

The course of execution for this recursive algorithm is illustrated by its recursion tree.

fib-recursion-tree.png

Note that \(F(4)\) is computed on both sides of the recursion tree, and \(F(2)\) is computed no less than five times in this small example. This redundancy drastically affects performance.

We can improve performance by storing (or caching) the results of each Fibonacci computation \(F(k)\) indexed by the parameter \(k\).

cache = {0: 0, 1: 1}
def fib(n):
    if n not in cache:
        cache[n] = fib(n - 1) + fib(n - 2)
    return cache[n]

This approach is a simple way to get most of the benefits of full dynamic programming. Here's the recursion tree:

fib-caching.png

Let's go a step further with full dynamic programming! We can calculate \(F(n)\) in linear time and space with no recursive calls by explicitly specifying the order of evaluation of the recurrence relation.

def fib(n):
    f = [0, 1]
    for i in range(2, n + 1):
        f.append(f[i - 1] + f[i - 2])
    return f[n]

However, more careful study shows that we do not need to store all the intermediate values for the entire period of execution.

def fib(n):
    if n == 0:
        return 0

    back_2, back_1 = 0, 1
    for _ in range(2, n):
        back_2, back_1 = back_1, back_1 + back_2
    return back_1 + back_2

This analysis reduces the storage demands to constant space with no asymptotic degradation in running time.

Approximate String Matching

To deal with inexact string matching, we must first define a cost function telling us how far apart two strings are - i.e., a distance measure between pairs of strings. Edit distance reflects the number of changes that must be made to convert one string to another. There are three natural types of changes: substitution, insertion, and deletion. Edit distance assigns each operation an equal cost of 1. Here's a recursive edit distance function:

def edit_distance(source, target):
    if not source:
        return len(target)
    if not target:
        return len(source)

    substitution_cost = 0 if source[-1] == target[-1] else 1
    return min(edit_distance(source[:-1], target[:-1]) + substitution_cost,
               edit_distance(source, target[:-1]) + 1,  # insertion
               edit_distance(source[:-1], target) + 1)  # deletion

This program is absolutely correct but impossible slow. A table-based, dynamic programming implementation of this algorithm is given below. costs is a two-dimensional matrix where each cell contains the optimal solution to a subproblem (i.e. costs[x][y] is edit_distance(source[:x], target[:y])).

def edit_distance(source, target):
    costs = [[None for _ in range(len(target) + 1)]
             for _ in range(len(source) + 1)]

    for index in range(len(costs)):
        costs[index][0] = index
    for index in range(len(costs[0])):
        costs[0][index] = index

    for x in range(1, len(source) + 1):
        for y in range(1, len(target) + 1):
            substitution_cost = 0 if source[x - 1] == target[y - 1] else 1
            costs[x][y] = min(costs[x - 1][y - 1] + substitution_cost,
                              costs[x - 1][y] + 1,  # insertion
                              costs[x][y - 1] + 1)  # deletion
    return costs[-1][-1]

The first row and the first column represent the empty prefix of the source and target, respectively. This is why the matrix height/width is larger than the source/target length.

Note that it is unnecessary to store the entire O(mn) matrix. The recurrence only requires two rows at a time. Thus, this algorithm could be further optimized to O(n) space without changing the time complexity.

Dynamic Programming in Practice

There are three steps involved in solving a problem by dynamic programming:

  1. Formulate the answer as a recurrence relation or recursive algorithm.
  2. Show that the number of different parameter values taken on by your recurrence is bounded by a (hopefully small) polynomial.
  3. Specify an order of evaluation for the recurrence so the partial results you need are always available when you need them.

In practice, you'll find that dynamic programming algorithms are usually easier to work out from scratch than look up.

The Partition Problem

Suppose three workers are given the task of scanning through a shelf of books in search of a given piece of information. To get the job done fairly and efficiently, the books are to be partitioned among the three workers. If the books are the same length, the job is easy: 100 100 100 | 100 100 100 | 100 100 100. If the books are not the same length, the task becomes more difficult (100 200 300 400 500 | 600 700 | 800 900). An algorithm that solves this linear partition problem takes as input an arrangement \(S\) of nonnegative numbers and an integer \(k\). The algorithm should partition \(S\) into \(k\) or fewer ranges, to minimize the maximum sum over all ranges, without reordering any of the numbers.

A heuristic to solve this problem might compute the average size of a partition and then try and insert dividers to come close to this average. Unfortunately, this method is doomed to fail on certain inputs.

Instead, consider a recursive, exhaustive search approach to solving this problem. The /k/th partition starts right ater we placed the (k - 1)st divider. Where can we place this last divider? Between the ith and (i + 1)st elements for some \(i\), where \(1 \leq i \leq n\). What is the cost of this? The total cost will be the larger of two qualtities - (1) the cost of the last partition and (2) the cost of the largest partition formed to the left of \(i\). What is the size of this let partition? To minimize our total, we want to use the \(k - 2\) remaining dividers to partition the elements \(\{s_1, ..., s_i\}\) as equally as possible. This is a smaller instance of the same problem and hence can be solved recursively!

Therefore, let us define \(M[n, k]\) to be the minimum possible cost over all partitions of \(\{s_1, ..., s_n\}\) into \(k\) ranges, where the cost of a partition is the largest sum of elements in one of its parts. Thus defined, this function cab be evaluated:

\begin{equation} M[n,k] = min(i=1, n)(max(M[i, k - 1], \sum_{j = i + 1}^{n} s_j)) \end{equation}

This recurrence can be solved with dynamic programming in \(O(kn^2)\) time. Note that we also need a second matrix, \(D\) to reconstruct the optimal partition. Whenever we update the value of \(M[i, j]\), we record which divider position was required to achieve that value.

Parsing Context-Free Grammars