# Papers

## Linear Feature-Based Models for Information Retrieval

Features lie at the very heart of information retrieval. A model that can
handle arbitrary **query-dependent** and **query-independent** features is
desirable. This paper describes the theory behind a class of models called
**linear feature-based models**.

### Model Framework

Suppose we are given a set of **documents** \(D\), **queries** \(Q\) and **training
data** \(T\). In addition, we are given a real-valued **scoring function**
\(S_\Lambda(D;Q)\) parameterized by \(\Lambda\), a vector of parameters. Given the
query \(Q_i\), the scoring function \(S_\Lambda(D;Q)\) is computed for each \(d \in
D\) and the documents are then ranked in descending order according to their
score. The scoring function induces a total ordering (**ranking**)
\(R_i(\Lambda)\) on \(D\) for each query \(Q_i\).

Finally, in order to evaluate a parameter setting, we need an **evaluation
function** \(E(R_\Lambda;T)\) that produces real valued output given a set of
ranked lists and the training data. We require \(E\) to only consider the
document *rankings* and not the document scores. The scores are only used to
rank the documents and not used to evaluate the ranking. Example evaluation
metrics are **mean average precision** and **percent correct in the top n**.

Our goal is to find the parameter setting \(\Lambda\) that maximizes the
evaluation metric \(E\) over the parameter space (\(\hat{\Lambda} =
argmax_\Lambda E(R_\Lambda;T)\)). To restrict our focus, scoring function
often appears as \(S_\Lambda(D;Q) = \Lambda^Tf(D, Q) + Z\) where \(f\) is a
**feature function** that maps query/document pairs to real-valued vectors in
\(\mathbb{R}^d\) and \(Z\) is a constant. We also require monotonically
increasing values.

### Features

Common features include

**Term occurrence/non-occurrence**- Whether or not a term occurs within a document.

**Term frequency**- The number of times a term occurs within a document.

**Inverse document frequency**- Inverse of the proportion of documents that contain a given term.

**Document length**- Number of terms within the document.

**Term proximity**- Occurrence patterns of terms within a document.

Non-textual features include **PageRank**, URL depth, document quality,
readability, sentiment, and query clarity.