Thus far we have dealt with indexes that support Boolean queries: a document either matches or does not match a query. In the case of large document
collections, the resulting number of matching documents can far exceed the
number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do
this, the search engine computes, for each matching document, a score with
respect to the query at hand. In this chapter we initiate the study of assigning
a score to a (query, document) pair. This chapter consists of three main ideas.
1. We introduce parametric and zone indexes in Section 6.1, which serve
two purposes. First, they allow us to index and retrieve documents by
metadata such as the language in which a document is written. Second,
they give us a simple means for scoring (and thereby ranking) documents
in response to a query.
2. Next, in Section 6.2 we develop the idea of weighting the importance of a
term in a document, based on the statistics of occurrence of the term.
3. In Section 6.3 we show that by viewing each document as a vector of such
weights, we can compute a score between a query and each document.
This view is known as vector space scoring.
Section 6.4 develops several variants of term-weighting for the vector space
model. Chapter 7 develops computational aspects of vector space scoring,
and related topics.
As we develop these ideas, the notion of a query will assume multiple
nuances. In Section 6.1 we consider queries in which specific query terms
occur in specified regions of a matching document. Beginning with Section 6.2 we
will in fact relax the requirement of matching specific regions of a document;
instead, we will look at so-called free text queries that simply consist of query
terms with no specification on their relative order, importance or where in a
document they should be found. The bulk of our study of scoring will be in
this latter notion of a query being such a set of terms.
6.1 Parametric and zone indexes
We have thus far viewed a document as a sequence of terms. In fact, most
documents have additional structure. Digital documents generally encode,
in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such
as its author(s), title and date of publication. This metadata would generally
include fields such as the date of creation and the format of the document, as
well as the author and possibly the title of the document. The possible values
of a field should be thought of as finite – for instance, the set of all dates of
authorship.
Consider queries of the form “find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick”. Query processing then
consists as usual of postings intersections, except that we may merge postings from standard inverted as well as parametric indexes. There is one parametric index for each field (say, date of creation); it allows us to select only
the documents matching a date specified in the query. Figure 6.1 illustrates
the user’s view of such a parametric search. Some of the fields may assume
ordered values, such as dates; in the example query above, the year 1601 is
one such field value. The search engine may support querying ranges on
such ordered values; to this end, a structure like a B-tree may be used for the
field’s dictionary.
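The interplay between a parametric index and ordinary postings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ParametricIndex`, its methods, and the toy document IDs and years are all hypothetical, and a sorted list of keys searched with `bisect` stands in for the B-tree mentioned above.

```python
from bisect import bisect_left, bisect_right


class ParametricIndex:
    """Hypothetical index for one ordered field: value -> sorted doc IDs."""

    def __init__(self, pairs):
        # pairs: iterable of (doc_id, field_value)
        by_value = {}
        for doc_id, value in pairs:
            by_value.setdefault(value, set()).add(doc_id)
        self.postings = {v: sorted(ids) for v, ids in by_value.items()}
        # Sorted keys stand in for the ordered dictionary a B-tree would give.
        self.keys = sorted(self.postings)

    def lookup(self, value):
        """Postings for an exact field value (empty list if absent)."""
        return self.postings.get(value, [])

    def lookup_range(self, low, high):
        """Doc IDs whose field value v satisfies low <= v <= high."""
        start = bisect_left(self.keys, low)
        stop = bisect_right(self.keys, high)
        docs = set()
        for key in self.keys[start:stop]:
            docs.update(self.postings[key])
        return sorted(docs)


def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Toy data (made up): doc ID -> year of authorship.
year_index = ParametricIndex([(1, 1601), (2, 1599), (3, 1601), (4, 1605)])
# Assumed result of running the phrase query against the text index.
phrase_postings = [1, 4]

exact = intersect(year_index.lookup(1601), phrase_postings)
ranged = intersect(year_index.lookup_range(1599, 1601), phrase_postings)
```

Here the field constraint is treated as just another postings list, so the usual intersection routine applies unchanged; only the dictionary lookup differs between the exact and the range case.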
Zones are similar to fields, except the contents of a zone can be arbitrary
free text. Whereas a field may take on a relatively small set of values, a zone
can be thought of as an arbitrary, unbounded amount of text. For instance,
document titles and abstracts are generally treated as zones. We may build a
separate inverted index for each zone of a document, to support queries such
as “find documents with merchant in the title and william in the author list and
the phrase gentle rain in the body”. This has the effect of building an index
that looks like Figure 6.2. Whereas the dictionary for a parametric index
comes from a fixed vocabulary (the set of languages, or the set of dates), the
dictionary for a zone index must structure whatever vocabulary stems from
the text of that zone.
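As a rough sketch of the one-index-per-zone arrangement, the Python below builds a separate term dictionary for each zone and answers a conjunctive query across zones. The function names, the whitespace tokenizer, and the three toy documents are assumptions made for illustration. The sketch handles single terms only; the phrase part of the example query would additionally need positional postings.

```python
from collections import defaultdict


def build_zone_indexes(docs):
    """docs: {doc_id: {zone: text}} -> {zone: {term: sorted doc IDs}}."""
    raw = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            # Naive tokenization: lowercase and split on whitespace.
            for term in text.lower().split():
                raw[zone][term].add(doc_id)
    return {
        zone: {term: sorted(ids) for term, ids in terms.items()}
        for zone, terms in raw.items()
    }


def zone_query(indexes, constraints):
    """Return doc IDs satisfying every (zone, term) constraint."""
    result = None
    for zone, term in constraints:
        postings = set(indexes.get(zone, {}).get(term, []))
        result = postings if result is None else result & postings
    return sorted(result or [])


# Toy collection (made up for this sketch).
docs = {
    1: {"title": "The Merchant of Venice",
        "author": "William Shakespeare",
        "body": "it droppeth as the gentle rain from heaven"},
    2: {"title": "Hamlet",
        "author": "William Shakespeare",
        "body": "alas poor yorick"},
    3: {"title": "The Merchant",
        "author": "Anonymous",
        "body": "gentle rain"},
}

indexes = build_zone_indexes(docs)
hits = zone_query(indexes, [("title", "merchant"),
                            ("author", "william"),
                            ("body", "gentle")])
```

Note that a term such as william gets its own dictionary entry in every zone where it occurs, which is the dictionary growth the text turns to next.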
In fact, we can reduce the size of the dictionary by encoding the zone in
which a term occurs in the postings. In Figure 6.3 for instance, we show how
occurrences of william in the title and author zones of various documents are
encoded. Such an encoding is useful when the size of the dictionary is a
concern (because we require the dictionary to fit in main memory). But there
is another important reason why the encoding of Figure 6.3 is useful: the
efficient computation of scores using a technique we will call weighted zone
scoring
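To make the zone-in-postings encoding concrete, here is a minimal Python sketch in which the dictionary holds one entry per term and each posting records the zones where the term occurs in that document. The function name, the `(doc_id, zones)` posting layout, and the toy documents are hypothetical and are not taken from Figure 6.3.

```python
from collections import defaultdict


def build_zone_encoded_index(docs):
    """docs: {doc_id: {zone: text}} -> {term: [(doc_id, [zones...]), ...]}."""
    raw = defaultdict(dict)  # term -> {doc_id: set of zones}
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in text.lower().split():
                raw[term].setdefault(doc_id, set()).add(zone)
    # One dictionary entry per term; postings sorted by doc ID.
    return {
        term: sorted((doc_id, sorted(zs)) for doc_id, zs in by_doc.items())
        for term, by_doc in raw.items()
    }


# Toy collection (made up for this sketch).
docs = {
    2: {"author": "William Smith", "title": "William and Mary"},
    3: {"author": "William Jones", "title": "Essays"},
    4: {"author": "Ann Lee", "title": "Life of William"},
}

index = build_zone_encoded_index(docs)
william = index["william"]

# Dictionary size with one entry per term, versus one entry per
# (zone, term) pair as in the separate-index-per-zone layout.
terms_only = len(index)
zone_term_pairs = len({(z, t) for t, ps in index.items()
                       for _, zs in ps for z in zs})
```

With this layout a single pass over the postings of a term reveals every zone in which it matched, which is what a per-zone scoring scheme needs.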