Papers
Linear Feature-Based Models for Information Retrieval
Features lie at the very heart of information retrieval. A model that can handle arbitrary query-dependent and query-independent features is desirable. This paper describes the theory behind a class of models called linear feature-based models.
Model Framework
Suppose we are given a set of documents \(D\), a set of queries \(Q\), and training data \(T\). In addition, we are given a real-valued scoring function \(S_\Lambda(D;Q)\) parameterized by \(\Lambda\), a vector of parameters. Given a query \(Q_i\), the score \(S_\Lambda(d;Q_i)\) is computed for each document \(d \in D\), and the documents are then ranked in descending order of score. The scoring function thus induces a total ordering (ranking) \(R_i(\Lambda)\) on \(D\) for each query \(Q_i\).
Finally, in order to evaluate a parameter setting, we need an evaluation function \(E(R_\Lambda;T)\) that produces real-valued output given a set of ranked lists and the training data. We require \(E\) to consider only the document rankings and not the document scores; the scores are used solely to rank the documents, never to evaluate the ranking. Example evaluation metrics are mean average precision and percent correct in the top \(n\).
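As a concrete illustration, here is a minimal sketch of one such evaluation function, mean average precision, computed purely from the ranked lists and the relevance judgments in \(T\). The function names and the dictionary-based data layout are assumptions made for this sketch, not anything prescribed by the paper.

```python
def average_precision(ranking, relevant):
    """Average precision for one ranked list, using only the ordering.

    ranking  -- document ids in ranked order (best first)
    relevant -- set of relevant document ids taken from the training data T
    """
    hits, precision_sum = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / max(len(relevant), 1)


def mean_average_precision(rankings, relevance):
    """E(R; T): mean of the per-query average precision values.

    rankings  -- dict mapping query id -> ranked list of document ids
    relevance -- dict mapping query id -> set of relevant document ids
    """
    return sum(average_precision(rankings[q], relevance[q]) for q in rankings) / len(rankings)
```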
Our goal is to find the parameter setting that maximizes the evaluation metric \(E\) over the parameter space, \(\hat{\Lambda} = \arg\max_\Lambda E(R_\Lambda;T)\). To restrict our focus, we consider scoring functions of the form \(S_\Lambda(D;Q) = \Lambda^T f(D, Q) + Z\), where \(f\) is a feature function that maps query/document pairs to real-valued vectors in \(\mathbb{R}^d\) and \(Z\) is a constant that does not depend on the document. Because \(E\) depends only on the induced rankings, applying any monotonically increasing transformation to this score leaves the rankings, and hence \(E\), unchanged, so such transformed scoring functions belong to the same class.
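The sketch below shows how such a linear scoring function induces rankings and how a parameter setting could be chosen by directly searching a finite candidate set for the \(\Lambda\) with the highest evaluation score. The helper names, the dictionary-based data layout, and the candidate-set search are illustrative assumptions; the direct search merely stands in for whatever optimization procedure a real system would use, since \(E\) is piecewise constant in \(\Lambda\) and cannot be optimized by gradient methods directly.

```python
import numpy as np


def score(lam, feature_vec, Z=0.0):
    """S_Lambda(D; Q) = Lambda^T f(D, Q) + Z for one document/query pair."""
    return float(np.dot(lam, feature_vec)) + Z


def rank(lam, docs, query, f):
    """Induce the ranking R_i(Lambda): document ids sorted by descending score."""
    return sorted(docs, key=lambda doc_id: score(lam, f(docs[doc_id], query)), reverse=True)


def search_parameters(candidates, docs, queries, f, evaluate):
    """Return the candidate Lambda (and its score) that maximizes E on the training data.

    candidates -- iterable of parameter vectors to try
    docs       -- dict mapping document id -> document
    queries    -- dict mapping query id -> query
    f          -- feature function f(document, query) -> feature vector
    evaluate   -- callable computing E from {query id -> ranked document ids}
    """
    best_lam, best_value = None, float("-inf")
    for lam in candidates:
        rankings = {q_id: rank(lam, docs, q, f) for q_id, q in queries.items()}
        value = evaluate(rankings)
        if value > best_value:
            best_lam, best_value = lam, value
    return best_lam, best_value
```

Using the earlier evaluation sketch, one might call `search_parameters(candidates, docs, queries, f, lambda R: mean_average_precision(R, relevance))`.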
Features
Common features include:
- Term occurrence/non-occurrence: whether or not a term occurs within a document.
- Term frequency: the number of times a term occurs within a document.
- Inverse document frequency: the inverse of the proportion of documents that contain a given term.
- Document length: the number of terms within the document.
- Term proximity: occurrence patterns of terms within a document.
Non-textual features include PageRank, URL depth, document quality, readability, sentiment, and query clarity.
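As a rough illustration of what a feature function \(f(D, Q)\) might compute, the sketch below builds a small vector from the query-dependent features listed above, using a bag-of-words representation (term proximity and the non-textual features are omitted). The names, the log form of the inverse document frequency, and the input layout are assumptions of this sketch.

```python
import math


def feature_vector(doc_terms, query_terms, doc_freq, num_docs):
    """f(D, Q): a small query-dependent feature vector.

    doc_terms   -- list of terms in the document
    query_terms -- list of terms in the query
    doc_freq    -- dict mapping a term to the number of documents containing it
    num_docs    -- total number of documents in the collection
    """
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1

    # Term occurrence: how many query terms appear in the document at all.
    occurrence = sum(1.0 for t in query_terms if t in counts)
    # Term frequency: total occurrences of query terms in the document.
    tf = float(sum(counts.get(t, 0) for t in query_terms))
    # Inverse document frequency, in the common log(N / df) form.
    idf = sum(math.log(num_docs / doc_freq[t])
              for t in query_terms if doc_freq.get(t, 0) > 0)
    # Document length: number of terms in the document.
    doc_len = float(len(doc_terms))

    return [occurrence, tf, idf, doc_len]
```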