Learning Guide Unit 3 - Home
1 of 10 12/10/2024, 12:02 PM
Learning Guide Unit 3 | Home https://my.uopeople.edu/mod/book/tool/print/index.php?id=443836
• Index compression
• Rule of 30
• Lossless versus lossy compression
• Zipf’s law
• Dictionary compression
• Postings file compression
• Variable byte codes (γ codes and δ codes)
1. Explain the need and value of compression within Information Retrieval (IR) systems.
2. Describe the different forms of data compression including the difference between lossless and lossy compression.
3. Recognize Heaps’ law and be able to calculate the value of M for a collection.
4. Recognize Zipf’s law as it relates to the distribution of terms within a collection.
5. Implement techniques for dictionary compression.
6. Implement techniques for postings file compression.
Unit three addresses the problem of index size: the techniques that can be employed to reduce the size of an index, as well as
approaches that can be used to improve the efficiency of processing queries against an inverted index.
The basic concept of index compression is to reduce the size of the index. One approach to accomplish this is to apply compression
algorithms to the data in the index. We are all aware of the compression technologies that are employed in file formats such as RAR, Zip,
and Gz. Programs that employ such algorithms, such as the WinZip utility, can often significantly reduce the size of data. These
utilities typically employ lossless compression, which means that no data is lost during the compression process. Further compression
can be achieved if a lossy compression algorithm is used. In a lossy algorithm, some amount of data is lost as part of the compression
process. We are all familiar with music that is stored in the MP3 format. MP3 is an example of a compression algorithm that takes audio
data and compresses it into a much smaller format. It is a lossy algorithm because some of the audio data is lost in the conversion
process. The average person simply cannot hear the difference between the original audio and the MP3 version that has lost some of the
detail in the music. MP3 files are an excellent example of a lossy algorithm that accepts some data loss in exchange for much smaller
size and processing efficiency. This is the same approach that is discussed in chapter 5.
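The distinction between lossless and lossy compression can be illustrated with a minimal Python sketch. The sample text and token list below are hypothetical stand-ins for index data, and case folding is used here only as a simple example of a lossy step; none of this is taken from the course indexer.

```python
import zlib

# Hypothetical sample data; the repetition makes it highly compressible.
original = ("inverted index postings list " * 200).encode("utf-8")

# Lossless: the decompressed bytes are identical to the original bytes.
compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)
lossless_ok = restored == original
print(len(original), "bytes before,", len(compressed), "bytes after")
print("round trip is exact:", lossless_ok)

# Lossy: case folding shrinks the vocabulary, but the original
# capitalization cannot be recovered from the folded terms.
tokens = ["Index", "index", "INDEX", "Zip", "zip"]
folded = [t.lower() for t in tokens]
print(len(set(tokens)), "distinct terms before folding,",
      len(set(folded)), "after")
```

The lossless round trip reproduces every byte, while the folded token list is smaller but no longer carries the information needed to rebuild the original tokens.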
Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge
University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• Compression
• Rule of 30
• Lossy Compression
• Lossless Compression
• Heaps’ law
• Zipf’s law
• Power law
• Front Coding
• Variable Byte Encoding
• Nibble
• Unary Code
• γ Encoding
• Entropy
• δ Codes
The output of the indexer that we started to develop in unit 2 and are continuing to develop in this unit (unit 3) includes statistics such
as the number of documents, the total number of terms, and the number of unique terms in the collection added to the index.
The number of unique terms is the size of the vocabulary (M) in the dictionary of the inverted index. Heaps’ law provides a formula,
M = kT^b, that can be used to estimate the number of unique terms in a collection based upon constants k and b and the number of terms
or tokens (T) parsed from all documents.
In section 5.1.1 of the textbook (page 88), we are provided typical values for both k and b. The value of k typically falls in a range
between 10 and 100, and b ≈ 0.4 to 0.6. Using the formula for Heaps’ law, calculate the estimated size of the vocabulary (M) from the
“total number of terms parsed from all documents” statistic reported when running your indexer program. Given that both k and b are
typically found through empirical analysis, assume that k will be 40 and b will be 0.50. Compare the estimate with the “total number of
unique terms found and added to the index” statistic reported by your indexer program, which represents the actual size of the vocabulary
in your collection. Report your findings in a posting response in the unit 3 discussion forum. If the size of the vocabulary estimated by
Heaps’ law is not consistent with the vocabulary discovered by your indexer process, speculate on why this may have occurred. Consider
that this discrepancy may be uncovering a flaw in your program or that the corpus you are using may be limited in vocabulary due to its
subject content. Discuss your findings with your peers and provide feedback to at least 3 peers on this submission.
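The calculation described above can be sketched in a few lines of Python. The function name `heaps_estimate` and the two statistics are hypothetical placeholders; substitute the values reported by your own indexer. The sketch assumes the usual form of Heaps’ law, M = k * T^b, with the k = 40 and b = 0.50 given in this assignment.

```python
def heaps_estimate(total_tokens: int, k: float = 40.0, b: float = 0.5) -> float:
    """Estimate vocabulary size M = k * T**b (Heaps' law)."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return k * total_tokens ** b


# Hypothetical indexer statistics; replace with your own output.
total_tokens = 1_000_000        # total number of terms parsed (T)
actual_unique_terms = 35_000    # unique terms added to the index

estimate = heaps_estimate(total_tokens)
ratio = actual_unique_terms / estimate

print(f"Heaps' law estimate of M: {estimate:,.0f}")   # 40,000
print(f"Actual vocabulary size:   {actual_unique_terms:,}")
print(f"Actual / estimate:        {ratio:.3f}")       # 0.875
```

A ratio far from 1 is the kind of discrepancy the assignment asks you to discuss. It may point to a tokenization issue in the indexer or to a corpus whose vocabulary is unusually narrow or broad.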
You must post your initial response before being able to review other students’ responses. Once you have made your first response, you
will be able to reply to other students’ posts. You are expected to make a minimum of 3 responses to your fellow students’ posts.
This assignment relies upon the completion of the indexer part 1 assigned in unit 2. Each student must use the statistics produced by their
indexer programs to complete this assignment.
• Does the posting include the statistics output from the student’s indexer part 1? (25%)
• Does the posting include calculations made using the Heaps’ law formula that estimate the size of the vocabulary for the corpus?
(50%)
• Does the posting compare the actual vocabulary of the corpus as reported by the student’s indexer part 1 with the estimates
derived from Heaps’ law? (50%)
• Does the discussion examine and explain any inconsistencies (if relevant and the actual vocabulary is significantly different from the
Heaps’ law estimate) between the Heaps’ law estimate and the actual vocabulary of the corpus? (25%)
Your learning journal entry must be a reflective statement that considers the following questions:
• Describe what you did. This does not mean that you copy and paste from what you have posted or the assignments you have
prepared. You need to describe what you did and how you did it.
• Describe your reactions to what you did.
• Describe any feedback you received or any specific interactions you had. Discuss how they were helpful.
• Describe your feelings and attitudes.
• Describe what you learned.
The Self-Quiz gives you an opportunity to self-assess your knowledge of what you have learned so far.
The results of the Self-Quiz do not count towards your final grade, but the quiz is an important part of the University’s learning process
and it is expected that you will take it to ensure understanding of the materials presented. Reviewing and analyzing your results will help
you perform better on future Graded Quizzes and the Final Exam.
Please access the Self-Quiz on the main course homepage; it will be listed inside the Unit.
Participate in the Discussion Assignment (post, comment, and rate in the Discussion Forum)