## Probabilistic Information Retrieval and Language Model for Information Retrieval

WSM Assignment 2: ProbabilisticInformation Retrieval and

Language Model for Information

Retrieval

1. [20 points] Probabilistic Information Retrieval.

? [10/20 points] What are the differences between standard vector space

tf-idf weighting and the RSJ probabilistic retrieval model (in the case

where no document relevance information is available)?

? [10/20 points] Describe the differences between vector space relevance

feedback and probabilistic relevance feedback.

2. [15 points] Language Model.

? Consider making a language model from the given following training text:

?

? the martian has landed on the latin pop sensation ricky martin

?

? [6/15 points] Under a MLE-estimated unigram probability model,

what are the probabilities of P(the) and P(martian)?

?

? [9/15 points] Under a MLE-estimated bigram model, what are the

probabilities of P(sensation | pop), P(pop | the), and P(pop | latin)?

?

3. [30 points] Language Model for Information Retrieval.

? Suppose we have a collection that consists of the 4 documents given in

the below table.

?

?

Try to build a query likelihood language model for this document collection.

Assume a mixture model between the documents and the collection, with both

weighted at 0.5, and maximum likelihood estimation (MLE) is used to estimate

both as unigram models. Work out the model probabilities of the queries “click”,

“shears”, and hence “click shears” for each document, and use those

probabilities to rank the documents returned by each query.

? [24/30 points] Fill in these probabilities in the below table:

?

?

? [6/30 points] What is the final ranking of the 4 documents for the query

“click shears”?

4. [35 points] Classic Probabilistic Retrieval Model.

? [15/35 points] In the derivation of the Robertson-Sparck-Jones (RSJ)

model (see the slides and the survey paper by Norbert Fuhr for detail

about this derivation), a multi-variate Bernoulli model was used to model

term presence/absence in a relevant document and a non-relevant

document. Suppose, we change the model to a multinomial model (see

the slide that covers both models for computing query likelihood). Using a

similar independence assumption as we used in deriving RSJ, show that

ranking based on probability that a document is relevant to a query Q,

i.e., p(R=1 | D,Q), is equivalent to ranking based on the following

formula:

?

? where the sum is taken over all the word in our vocabulary (denoted by

V). How many parameters are there in such a retrieval model?

?

? [5/35 points] The retrieval function above won’t work unless we can

estimate all the parameters. Suppose we use the entire collection

C={D1,…,Dn} as an approximation of the examples of non-relevant

documents. Propose an estimate of p(w|Q,R=0). (Hint: study the slide

about how to do this for the RSJ model.)

?

? [5/35 points] Suppose we use the query as the only example of a relevant

document. Propose an estimate of p(w|Q,R=1).

?

? [5/35 points] With the two estimates you proposed, we should now have a

retrieval function that can be used to compute a score for any document D

and any query Q. Write down your retrieval function by plugging in the

two estimates. Can your retrieval function capture the three major

retrieval heuristics (i.e., TF, IDF, and document length normalization)?

How?

?

? [5/35 points] Do you believe your formula would work well as compared

with a state of the art formula such as BM25? Can you propose a way to

further improve your formula? (While it’s the best if you could improve

your formula through using an improved estimate of p(w|Q,R=1), it would

also be fine to propose any reasonable heuristic modification of the

formula.)

?

Submission Details

? Due: 18:10 in class, Tuesday, 18 April 2017

? What to turn in:

?

? Please turn in a hardcopy of your written answers

## Leave a Reply

You must be logged in to post a comment.