
Information Retrieval

Unit 1
Foundations of Information Retrieval
1. Define Information Retrieval (IR) and explain its goals.
2. Discuss the key components of an IR system.
3. What are the major challenges faced in Information Retrieval?
4. Provide examples of applications of Information Retrieval.

Introduction to Information Retrieval (IR) systems


1. Explain the process of constructing an inverted index. How does it facilitate efficient
information retrieval?
2. Discuss techniques for compressing inverted indexes.
3. How are documents represented in an IR system? Discuss different term weighting
schemes.
4. With the help of examples, explain the process of storing and retrieving indexed
documents.
5. Discuss storage mechanisms for indexed documents.
6. Explain the retrieval process of indexed documents.
7. Define k-gram indexing and explain its significance in Information Retrieval systems.
8. Describe the process of constructing a k-gram index. Highlight the key steps
involved and the data structures used.
9. Explain how wildcard queries are handled in k-gram indexing. Discuss the challenges
associated with wildcard queries and potential solutions.

Retrieval Models
1. Describe the Boolean model in Information Retrieval. Discuss Boolean operators and
query processing.
2. Explain the Vector Space Model (VSM) in Information Retrieval. Discuss TF-IDF,
cosine similarity, and query-document matching.
3. What is the Probabilistic Model in Information Retrieval? Discuss Bayesian retrieval
and relevance feedback.
4. How does cosine similarity measure the similarity between queries and documents
in the Vector Space Model?
5. What is relevance feedback in the context of retrieval models? How does it enhance
search results?
Spelling Correction in IR Systems
1. What are the challenges posed by spelling errors in queries and documents?
2. What is edit distance, and how is it used in measuring string similarity? Provide
examples.
3. Discuss string similarity measures used for spelling correction in IR systems.
4. Describe techniques employed for spelling correction in IR systems. Assess their
effectiveness and limitations.
5. What is the Soundex Algorithm and how does it address spelling errors in IR
systems?
6. Discuss the steps involved in the Soundex Algorithm for phonetic matching.

Performance Evaluation
1. Define evaluation metrics used in Information Retrieval, including precision, recall,
and F-measure.
2. Explain the concept of average precision in evaluating IR systems.
3. Explain the importance of test collections and relevance judgments in evaluating
Information Retrieval systems.
4. Discuss the process of relevance judgments and their importance in performance
evaluation.
5. Describe experimental design and significance testing in the context of evaluating
IR systems.
6. Discuss significance testing in Information Retrieval and its role in performance
evaluation.

Numericals
1. Given the following document-term matrix:
Document   Terms
Doc1       cat, dog, fish
Doc2       cat, bird, fish
Doc3       dog, bird, elephant
Doc4       cat, dog, elephant
Construct the posting list for each term: cat, dog, fish, bird, elephant.
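
As a hint at the mechanics, here is a minimal Python sketch (dictionary and variable names are illustrative) that inverts the document-term pairs above into posting lists:

    docs = {
        "Doc1": ["cat", "dog", "fish"],
        "Doc2": ["cat", "bird", "fish"],
        "Doc3": ["dog", "bird", "elephant"],
        "Doc4": ["cat", "dog", "elephant"],
    }

    # Invert: map each term to the list of documents containing it.
    # The lists stay in document-ID order because we scan Doc1..Doc4 in order.
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(doc_id)

    for term in sorted(postings):
        print(term, "->", postings[term])  # e.g. cat -> ['Doc1', 'Doc2', 'Doc4']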

2. Consider the following document-term matrix:


Document   Terms
Doc1       apple, banana, grape
Doc2       apple, grape, orange
Doc3       banana, orange, pear
Doc4       apple, grape, pear
Create the posting list for each term: apple, banana, grape, orange, pear.

3. Given the inverted index with posting lists:


Term   Posting List
cat    Doc1, Doc2, Doc4
dog    Doc1, Doc3, Doc4
fish   Doc1, Doc2
Construct the term-document matrix and find the documents that contain both 'cat' and
'fish' using the Boolean retrieval model.
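
For the Boolean AND, the standard approach is a merge-style intersection of the two sorted posting lists; a sketch, assuming the lists from the table above:

    postings = {
        "cat":  ["Doc1", "Doc2", "Doc4"],
        "dog":  ["Doc1", "Doc3", "Doc4"],
        "fish": ["Doc1", "Doc2"],
    }

    def intersect(p1, p2):
        """Linear-time merge intersection of two sorted posting lists."""
        i = j = 0
        result = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                result.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return result

    print(intersect(postings["cat"], postings["fish"]))  # ['Doc1', 'Doc2']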

4. Given the following term-document matrix for a set of documents:


Term   Doc1   Doc2   Doc3   Doc4
cat      15     28      0      0
dog      18      0     32     25
fish     11     19     13      0
Doc1, Doc2, Doc3, and Doc4 contain 48, 85, 74, and 30 terms in total, respectively.

Calculate the TF-IDF score for each term-document pair using the following TF and IDF
calculations:
● Term Frequency (TF) = (Number of occurrences of the term in the document) /
(Total number of terms in the document)
● Inverse Document Frequency (IDF) = log(Total number of documents / Number of
documents containing the term) + 1
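
A sketch that applies these two definitions verbatim to the matrix in Problem 4; the question does not fix the logarithm base, so base 10 is assumed here:

    import math

    counts = {"cat":  {"Doc1": 15, "Doc2": 28, "Doc3": 0,  "Doc4": 0},
              "dog":  {"Doc1": 18, "Doc2": 0,  "Doc3": 32, "Doc4": 25},
              "fish": {"Doc1": 11, "Doc2": 19, "Doc3": 13, "Doc4": 0}}
    doc_len = {"Doc1": 48, "Doc2": 85, "Doc3": 74, "Doc4": 30}
    n_docs = len(doc_len)

    for term, row in counts.items():
        df = sum(1 for c in row.values() if c > 0)   # documents containing the term
        idf = math.log10(n_docs / df) + 1            # IDF as defined above
        for doc, c in row.items():
            tf = c / doc_len[doc]                    # TF as defined above
            print(term, doc, round(tf * idf, 4))

Problem 7 below is solved identically with its own counts and document lengths swapped in.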

5. Given the term-document matrix and the TF-IDF scores calculated from Problem 4,
calculate the cosine similarity between each pair of documents (Doc1, Doc2), (Doc1,
Doc3), (Doc1, Doc4), (Doc2, Doc3), (Doc2, Doc4), and (Doc3, Doc4).
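
Cosine similarity is the dot product of the two TF-IDF vectors divided by the product of their Euclidean norms. A small helper (the sample vectors are placeholders, not the answers to Problem 4):

    import math

    def cosine(u, v):
        """Cosine similarity of two equal-length weight vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Vectors in term order (cat, dog, fish); the values here are placeholders.
    doc1 = [0.41, 0.38, 0.23]
    doc2 = [0.43, 0.00, 0.22]
    print(round(cosine(doc1, doc2), 4))

The same helper answers Problem 6: treat each query's TF-IDF weights as one vector and each document's as the other.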

6. Consider the following queries expressed in terms of TF-IDF weighted vectors:


Query1: cat: 0.5, dog: 0.5, fish: 0
Query2: cat: 0, dog: 0.5, fish: 0.5

Calculate the cosine similarity between each query and each document from the
term-document matrix in Problem 4.

7. Given the following term-document matrix:


Term     Doc1   Doc2   Doc3   Doc4
apple      22      9      0     40
banana     14      0     12      0
orange      0     23     14      0
Doc1, Doc2, Doc3, and Doc4 contain 65, 48, 36, and 92 terms in total, respectively.
Calculate the TF-IDF score for each term-document pair.

8. Suppose you have a test collection with 50 relevant documents for a given query.
Your retrieval system returns 30 documents, out of which 20 are relevant. Calculate
the Recall, Precision, and F-score for this retrieval.
● Recall = (Number of relevant documents retrieved) / (Total number of relevant
documents)
● Precision = (Number of relevant documents retrieved) / (Total number of
documents retrieved)
● F-score = 2 * (Precision * Recall) / (Precision + Recall)
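
Plugging the numbers from this problem into the three definitions:

    relevant_total = 50       # relevant documents in the collection
    retrieved = 30            # documents the system returned
    relevant_retrieved = 20   # returned documents that are relevant

    recall = relevant_retrieved / relevant_total             # 20/50 = 0.4
    precision = relevant_retrieved / retrieved               # 20/30 ≈ 0.667
    f_score = 2 * precision * recall / (precision + recall)  # = 0.5
    print(recall, precision, f_score)

The same three lines answer Problems 9 through 12 with their numbers substituted.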

9. You have a test collection containing 100 relevant documents for a query. Your
retrieval system retrieves 80 documents, out of which 60 are relevant. Calculate the
Recall, Precision, and F-score for this retrieval.

10. In a test collection, there are a total of 50 relevant documents for a query. Your
retrieval system retrieves 60 documents, out of which 40 are relevant. Calculate the
Recall, Precision, and F-score for this retrieval.

11. You have a test collection with 200 relevant documents for a query. Your retrieval
system retrieves 150 documents, out of which 120 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.

12. In a test collection, there are 80 relevant documents for a query. Your retrieval
system retrieves 90 documents, out of which 70 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.

13. Construct 2-gram, 3-gram, and 4-gram indexes for the following terms:
a. banana
b. pineapple
c. computer
d. programming
e. elephant
f. Database
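
A sketch of k-gram index construction over these terms, assuming the common textbook convention of padding each term with a '$' boundary marker (some treatments omit the markers):

    def kgrams(term, k):
        """All k-grams of a term, padded with '$' boundary markers."""
        padded = "$" + term.lower() + "$"
        return [padded[i:i + k] for i in range(len(padded) - k + 1)]

    terms = ["banana", "pineapple", "computer",
             "programming", "elephant", "database"]

    index = {}
    for k in (2, 3, 4):
        for term in terms:
            for gram in kgrams(term, k):
                index.setdefault(gram, set()).add(term)

    # A wildcard query like 'ba*' can then probe the '$ba' posting list.
    print(sorted(index["$ba"]))  # ['banana']
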
14. Calculate the Levenshtein distance between the following pairs of words:
a. kitten and sitting
b. intention and execution
c. robot and orbit
d. power and flower
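
The standard dynamic-programming recurrence for Levenshtein distance, sketched in Python:

    def levenshtein(s, t):
        """Edit distance with unit-cost insertion, deletion, and substitution."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                    # delete all of s[:i]
        for j in range(n + 1):
            d[0][j] = j                    # insert all of t[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[m][n]

    print(levenshtein("kitten", "sitting"))  # 3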

15. Using the Soundex algorithm, encode the following:


a. Williams
b. Gonzalez
c. Harrison
d. Parker
e. Jackson
f. Thompson
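
A sketch of the basic textbook Soundex variant (retain the first letter, map the remaining letters to digits, collapse adjacent duplicate digits, drop the zeros, pad to four characters); refinements such as the h/w separator rule are omitted:

    def soundex(name):
        """Basic Soundex: first letter plus three digits, e.g. 'Williams' -> 'W452'."""
        table = {}
        for letters, digit in [("aehiouwy", "0"), ("bfpv", "1"),
                               ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")]:
            for ch in letters:
                table[ch] = digit
        digits = [table[ch] for ch in name.lower() if ch in table]
        collapsed = [digits[0]]
        for d in digits[1:]:             # remove consecutive duplicate digits
            if d != collapsed[-1]:
                collapsed.append(d)
        body = [d for d in collapsed[1:] if d != "0"]   # drop the vowel code 0
        return (name[0].upper() + "".join(body) + "000")[:4]

    for name in ["Williams", "Gonzalez", "Harrison", "Parker", "Jackson", "Thompson"]:
        print(name, soundex(name))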

Unit 2
Text Categorization and Filtering:
1. Define text categorization and explain its importance in information retrieval
systems. Discuss the challenges associated with text categorization.
2. Discuss the Naive Bayes algorithm for text classification. How does it work, and
what are its assumptions? (A worked sketch follows this list.)
3. Explain Support Vector Machines (SVM) and their application in text categorization.
How does SVM handle text classification tasks?
4. Compare and contrast the Naive Bayes and Support Vector Machines (SVM)
algorithms for text classification. Highlight their strengths and weaknesses.
5. Describe feature selection and dimensionality reduction techniques used in text
categorization. Why are these techniques important?
6. Discuss the applications of text categorization and filtering in real-world scenarios
such as spam detection, sentiment analysis, and news categorization.
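
A minimal multinomial Naive Bayes with add-one smoothing, relating to questions 2 and 6; the toy spam/ham training data is invented purely for illustration:

    import math
    from collections import Counter

    # Invented toy training set: (tokens, label).
    train = [(["win", "cash", "now"], "spam"),
             (["cheap", "cash", "offer"], "spam"),
             (["meeting", "at", "noon"], "ham"),
             (["project", "meeting", "notes"], "ham")]

    labels = {y for _, y in train}
    vocab = {w for x, _ in train for w in x}
    prior = {y: sum(1 for _, l in train if l == y) / len(train) for y in labels}
    word_counts = {y: Counter(w for x, l in train if l == y for w in x) for y in labels}
    total = {y: sum(word_counts[y].values()) for y in labels}

    def predict(tokens):
        """Label maximizing log P(y) + sum of log P(w | y), add-one smoothed."""
        def log_score(y):
            return math.log(prior[y]) + sum(
                math.log((word_counts[y][w] + 1) / (total[y] + len(vocab)))
                for w in tokens if w in vocab)
        return max(labels, key=log_score)

    print(predict(["cash", "offer"]))  # spam

The conditional-independence assumption question 2 asks about is visible in the sum: each word contributes its probability independently given the class.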

Text Clustering for Information Retrieval:


1. Explain the K-means clustering algorithm and how it is applied to text data. What are
its key steps, and how does it handle document clustering? Discuss its strengths
and limitations. (A minimal sketch follows this list.)
2. Describe hierarchical clustering techniques and their relevance in organizing text
data for information retrieval. What are the advantages and disadvantages of
hierarchical clustering compared to K-means?
3. Discuss the evaluation measures used to assess the quality of clustering results in
text data. Explain purity, normalized mutual information, and F-measure in the
context of text clustering evaluation.
4. How can clustering be utilized for query expansion and result grouping in
information retrieval systems? Provide examples.
5. Compare and contrast the effectiveness of K-means and hierarchical clustering in
text data analysis. Discuss their suitability for different types of text corpora and
retrieval tasks.
6. Discuss challenges and issues in applying clustering techniques to large-scale text
data.
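
A bare-bones Lloyd's-algorithm K-means, relating to question 1; real document clustering would run on TF-IDF vectors (often with cosine distance), so the 2-D points and Euclidean distance here are simplifications:

    import math
    import random

    def kmeans(points, k, iters=10, seed=0):
        """Lloyd's algorithm: assign each point to its nearest centroid, re-average."""
        random.seed(seed)
        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                clusters[nearest].append(p)
            for j, cluster in enumerate(clusters):
                if cluster:  # keep the old centroid if a cluster empties
                    centroids[j] = tuple(sum(x) / len(cluster) for x in zip(*cluster))
        return centroids, clusters

    docs = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]  # stand-ins for documents
    centroids, clusters = kmeans(docs, k=2)
    print(clusters)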

Web Information Retrieval:


1. Describe the architecture of a web search engine. Explain the components involved
in crawling and indexing web pages.
2. Discuss the challenges faced by web search engines, such as spam, dynamic
content, and scale. How are these challenges addressed in modern web search
engines?
3. Explain link analysis and the PageRank algorithm. How does PageRank work to
determine the importance of web pages? (A small sketch follows this list.)
4. Describe the PageRank algorithm and how it calculates the importance of web
pages based on their incoming links. Discuss its role in web search ranking.
5. Explain how link analysis algorithms like HITS (Hypertext Induced Topic Search)
contribute to improving search engine relevance.
6. Discuss the impact of web information retrieval on modern search engine
technologies and user experiences.
7. Discuss applications of link analysis in information retrieval systems beyond web
search.
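
A minimal power-iteration PageRank for questions 3 and 4, on an invented three-page graph; the damping factor 0.85 is the conventional choice, and dangling pages (no outlinks) are not handled:

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # invented toy graph
    pages = list(links)
    n = len(pages)
    d = 0.85                        # damping factor
    pr = {p: 1 / n for p in pages}  # start from a uniform distribution

    for _ in range(20):             # power iteration until (roughly) converged
        pr = {p: (1 - d) / n
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}

    print({p: round(score, 3) for p, score in pr.items()})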

Learning to Rank
1. Explain the concept of learning to rank and its importance in search engine result
ranking.
2. Discuss algorithms and techniques used in learning to rank for Information Retrieval.
Explain the principles behind RankSVM, RankBoost, and their application in ranking
search results.
3. Compare and contrast pairwise and listwise learning to rank approaches. Discuss
their advantages and limitations.
4. Explain evaluation metrics used to assess the performance of learning to rank
algorithms. Discuss metrics such as Mean Average Precision (MAP), Normalized
Discounted Cumulative Gain (NDCG), and Precision at K (P@K). (An NDCG sketch
follows this list.)
5. Discuss the role of supervised learning techniques in learning to rank and their
impact on search engine result quality.
6. How does supervised learning for ranking differ from traditional relevance feedback
methods in Information Retrieval? Discuss their respective advantages and
limitations.
7. Describe the process of feature selection and extraction in learning to rank. What are
the key features used to train ranking models, and how are they selected or
engineered?
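
For question 4, one common formulation of DCG and NDCG over a ranked list of graded relevance labels (several variants exist; this one uses rel_i / log2(i + 1) with ranks starting at 1):

    import math

    def dcg(rels):
        """Discounted cumulative gain: sum of rel_i / log2(i + 1), 1-based ranks."""
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    def ndcg(rels):
        """DCG normalized by the DCG of the ideal (descending) ordering."""
        ideal = dcg(sorted(rels, reverse=True))
        return dcg(rels) / ideal if ideal else 0.0

    print(round(ndcg([3, 2, 3, 0, 1]), 4))  # graded labels down the ranked list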

Link Analysis and its Role in IR Systems:


1. Describe web graph representation in link analysis. How are web pages and
hyperlinks represented in a web graph? OR Explain how web graphs are represented
in link analysis. Discuss the concepts of nodes, edges, and directed graphs in the
context of web pages and hyperlinks.
2. Explain the HITS algorithm for link analysis. How does it compute authority and hub
scores?
3. Discuss the PageRank algorithm and its significance in web search engines. How is
PageRank computed?
4. Discuss the difference between the PageRank and HITS algorithms.
5. How are link analysis algorithms applied in information retrieval systems? Provide
examples.
6. Discuss future directions and emerging trends in link analysis and its role in modern
IR systems. OR Discuss how link analysis can be used in social network analysis and
recommendation systems.
7. How do link analysis algorithms contribute to combating web spam and improving
search engine relevance?

Numerical Questions
1. Consider a simplified web graph with the following link structure:
• Page A has links to pages B, C, and D.
• Page B has links to pages C and E.
• Page C has links to pages A and D.
• Page D has a link to page E.
• Page E has a link to page A.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and
hub scores for each page after one or two iterations of the HITS algorithm.

2. Consider a web graph with the following link structure:


• Page A has links to pages B and C.
• Page B has a link to page C.
• Page C has links to pages A and D.
• Page D has a link to page A.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores
for each page. Assume the initial authority and hub scores are both 1 for all pages.

3. Given the following link structure:


• Page A has links to pages B and C.
• Page B has a link to page D.
• Page C has links to pages B and D.
• Page D has links to pages A and C.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and
hub scores for each page after one iteration of the HITS algorithm.

4. Consider a web graph with the following link structure:


• Page A has links to pages B and C.
• Page B has links to pages C and D.
• Page C has links to pages A and D.
• Page D has a link to page B.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores
for each page. Assume the initial authority and hub scores are both 1 for all pages.
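
A sketch of the update rules used by all four problems, run on the graph from Problem 4; note that many treatments normalize the scores after each iteration (e.g., to unit length), which changes the numbers but not the ranking, so check which convention your course uses:

    links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A", "D"], "D": ["B"]}
    pages = sorted(links)
    auth = dict.fromkeys(pages, 1.0)  # initial authority scores
    hub = dict.fromkeys(pages, 1.0)   # initial hub scores

    for _ in range(2):  # two iterations, as Problem 4 asks
        # Authority of p: sum of hub scores of the pages that link to p.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub of p: sum of (updated) authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}

    print("authority:", auth)
    print("hub:", hub)

As a quick sanity check on this graph, the first iteration gives unnormalized authority scores A=1, B=2, C=2, D=2.
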
Unit 3
Web Page Crawling Techniques:
1. Explain the breadth-first and depth-first crawling strategies. Compare their
advantages and disadvantages. (A toy sketch follows this list.)
2. Describe focused crawling and its significance in building specialized search
engines. Discuss the key components of a focused crawling system and the
importance of focused crawling in targeted web data collection. Provide examples
of scenarios where focused crawling is preferred over general crawling.
3. How do web crawlers handle dynamic web content during crawling? Explain
techniques such as AJAX crawling, HTML parsing, URL normalization, and session
handling for dynamic content extraction, and discuss the challenges associated
with handling dynamic content during crawling.
4. Describe the role of AJAX crawling scheme and the use of sitemaps in crawling
dynamic web content. Provide examples of how these techniques are implemented
in practice.
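
For question 1, the two strategies differ only in the frontier discipline: a FIFO queue yields breadth-first crawling and a LIFO stack yields depth-first. A toy sketch over an in-memory link graph (no real HTTP fetching):

    from collections import deque

    # Invented link graph standing in for fetched-and-parsed pages.
    graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["a"]}

    def crawl(seed, breadth_first=True):
        """Visit pages from the seed; the pop side of the frontier sets the strategy."""
        frontier = deque([seed])
        seen = {seed}
        order = []
        while frontier:
            url = frontier.popleft() if breadth_first else frontier.pop()
            order.append(url)
            for link in graph[url]:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return order

    print(crawl("seed", breadth_first=True))   # ['seed', 'a', 'b', 'c', 'd']
    print(crawl("seed", breadth_first=False))  # ['seed', 'b', 'd', 'a', 'c']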

Near-Duplicate Page Detection:


1. Define near-duplicate page detection and its significance in web search. Discuss the
challenges associated with identifying near-duplicate pages.
2. Discuss common techniques used for near-duplicate detection, such as
fingerprinting and shingling. (A shingling sketch follows this list.)
3. Compare and contrast local and global similarity measures for near-duplicate
detection. Provide examples of scenarios where each measure is suitable.
4. Describe common near-duplicate detection algorithms such as SimHash and
MinHash. Explain how these algorithms work and their computational complexities.
5. Provide examples of applications where near-duplicate page detection is critical,
such as detecting plagiarism and identifying duplicate content in search results.
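
For question 2, shingling turns each document into a set of overlapping word n-grams whose Jaccard coefficient measures resemblance; MinHash then approximates that coefficient without comparing the full sets. A minimal shingling sketch with 3-word shingles:

    def shingles(text, w=3):
        """Set of overlapping w-word shingles of a document."""
        words = text.lower().split()
        return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

    def jaccard(a, b):
        """Set resemblance: |A intersect B| / |A union B|."""
        return len(a & b) / len(a | b) if a | b else 0.0

    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox leaps over the lazy dog"
    print(round(jaccard(shingles(d1), shingles(d2)), 3))  # 0.4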

Text Summarization:
1. Explain the difference between extractive and abstractive text summarization
methods. Compare their advantages and disadvantages.
2. Describe common techniques used in extractive text summarization, such as
graph-based methods and sentence scoring approaches. (A sentence-scoring
sketch follows this list.)
3. Discuss challenges in abstractive text summarization and recent advancements in
neural network-based approaches.
4. Discuss common evaluation metrics used to assess the quality of text summaries,
such as ROUGE and BLEU. Explain how these metrics measure the similarity
between generated summaries and reference summaries.
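
A deliberately naive sentence-scoring sketch for question 2: score each sentence by the average corpus frequency of its words and extract the top one (real systems use TF-IDF weights or graph centrality such as TextRank):

    import re
    from collections import Counter

    text = ("Information retrieval finds relevant documents. "
            "Retrieval systems rank documents by relevance. "
            "The weather was pleasant yesterday.")
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freqs = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        """Average corpus frequency of the sentence's words."""
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    print(max(sentences, key=score))  # the 1-sentence extractive summary
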
Question Answering:
1. Discuss different approaches for question answering in information retrieval,
including keyword-based, document retrieval, and passage retrieval methods.
2. Explain how natural language processing techniques such as Named Entity
Recognition (NER) and semantic parsing contribute to question answering systems.
3. Provide examples of question answering systems and evaluate their effectiveness in
providing precise answers.
4. Discuss the challenges associated with question answering, including ambiguity
resolution, answer validation, and handling of incomplete or noisy queries.

Recommender Systems:
1. Define collaborative filtering and content-based filtering in recommender systems.
Compare their strengths and weaknesses.
2. Explain how collaborative filtering algorithms such as user-based and item-based
methods work. Discuss techniques to address the cold start problem in collaborative
filtering. (A similarity sketch follows this list.)
3. Describe content-based filtering approaches, including feature extraction and
similarity measures used in content-based recommendation systems.
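
A tiny user-based collaborative-filtering sketch for question 2: user similarity is cosine over co-rated items, and a prediction would weight neighbours' ratings by that similarity (the rating matrix is invented):

    import math

    # Invented user -> {item: rating} matrix.
    ratings = {"u1": {"m1": 5, "m2": 3, "m3": 4},
               "u2": {"m1": 4, "m2": 2, "m3": 5},
               "u3": {"m1": 1, "m2": 5, "m3": 2}}

    def cosine_sim(a, b):
        """Cosine similarity between two users over their co-rated items."""
        common = set(a) & set(b)
        dot = sum(a[i] * b[i] for i in common)
        norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
        norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(round(cosine_sim(ratings["u1"], ratings["u2"]), 3))  # u1 vs u2
    print(round(cosine_sim(ratings["u1"], ratings["u3"]), 3))  # u1 vs u3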

Cross-Lingual and Multilingual Retrieval:


1. Discuss the challenges associated with cross-lingual retrieval, including language
barriers, lexical gaps, and cultural differences.
2. Describe the role of machine translation in information retrieval. Discuss different
approaches to machine translation, including rule-based, statistical, and neural
machine translation models.
3. Describe methods for multilingual document representations and query translation,
including cross-lingual word embeddings and bilingual lexicons.

Evaluation Techniques for IR Systems:


1. Explain user-based evaluation methods, including user studies and surveys, and their
role in assessing the effectiveness of IR systems. Discuss methodologies for
conducting user studies, including usability testing, eye-tracking experiments, and
relevance assessments.
2. Describe the role of test collections and benchmarking datasets in evaluating IR
systems. Discuss common test collections, such as TREC and CLEF, and their use in
benchmarking retrieval algorithms.
3. Define A/B testing and interleaving experiments as online evaluation methods for
information retrieval systems. Explain how these methods compare different
retrieval algorithms or features using real user interactions.
4. Discuss the advantages and limitations of online evaluation methods compared to
offline evaluation methods, such as test collections and user studies.
