Learning Guide Unit 3 - Home

Unit 3 of the CS 3308-01 Information Retrieval course focuses on index compression techniques to reduce index size and improve query processing efficiency. Key topics include lossless and lossy compression, Heaps' law, and Zipf's law, along with practical assignments involving peer assessments, discussions, and self-reflection in learning journals. Students will apply these concepts through calculations and comparisons of vocabulary sizes derived from their indexer programs.


Learning Guide Unit 3 | Home https://my.uopeople.edu/mod/book/tool/print/index.php?id=443836

Site: University of the People Printed by: Patrick Rolemodel Asante

Course: CS 3308-01 Information Retrieval - AY2025-T2 Date: Tuesday, 10 December 2024, 12:02 PM
Book: Learning Guide Unit 3

Learning Guide Unit 3

• Index compression
• Rule of 30
• Lossless versus lossy compression
• Zipf’s law
• Dictionary compression
• Postings file compression
• Variable byte codes, γ codes, and δ codes

By the end of this Unit, you will be able to:

1. Explain the need and value of compression within Information Retrieval (IR) systems.
2. Describe the different forms of data compression including the difference between lossless and lossy compression.
3. Recognize Heaps’ law and be able to calculate the value of M for a collection.
4. Recognize Zipf’s law as it relates to the distribution of terms within a collection.
5. Implement techniques for dictionary compression.
6. Implement techniques for postings file compression.

• Peer assess Unit 2 Development Assignment

• Read the Learning Guide and Reading Assignments
• Participate in the Discussion Assignment (post, comment, and rate in the Discussion Forum)
• Make entries to the Learning Journal
• Take the Self-Quiz


Unit 3 addresses the problem of index size: the techniques that can be employed to reduce the size of an index, and the approaches
that can be used to improve the efficiency of processing queries against an inverted index.

The basic goal of index compression is to reduce the size of the index. One way to accomplish this is to apply compression
algorithms to the data in the index. We are all aware of the compression technologies that are employed in file formats such as RAR, Zip,
and Gz. Programs that employ such algorithms, such as the WinZip utility, can often significantly reduce the size of data.
These utilities typically employ lossless compression, which means that no data is lost during the compression
process: the original can be restored exactly. Further compression can be achieved if a lossy compression algorithm is used. In a lossy
algorithm, some amount of data is discarded as part of the compression process. We are all familiar with music stored in the MP3 format.
MP3 is a compression algorithm that takes audio data and compresses it into a much smaller format. It is lossy because some of the
audio data is lost in the conversion process, yet the average person simply cannot hear the difference between the original audio and the
MP3 version. MP3 files are an excellent example of a lossy algorithm that accepts
some data loss in exchange for much smaller size and processing efficiency. Chapter 5 discusses the same trade-off for indexes.
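The difference between the two kinds of compression can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text and token list are hypothetical, the standard-library zlib module (which implements DEFLATE, the algorithm family behind Zip and gzip) stands in for "a lossless utility", and case folding stands in for a lossy step, since chapter 5 of the textbook treats case folding, stemming, and stop word elimination as lossy compression.

```python
import zlib

# Hypothetical sample text standing in for index data; repetitive input compresses well.
original = ("information retrieval index compression " * 50).encode("utf-8")

# Lossless: decompress() restores exactly the bytes that went into compress().
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), "bytes before,", len(compressed), "bytes after")
print("round trip exact:", restored == original)

# Lossy: case folding shrinks the vocabulary, but the original
# capitalisation cannot be recovered from the folded terms.
tokens = ["Index", "index", "INDEX", "Compression", "compression"]
folded = {t.lower() for t in tokens}
print(len(set(tokens)), "distinct tokens before folding,", len(folded), "after")
```

The lossless half gives back every byte, while the lossy half trades information (capitalisation) for a smaller dictionary, which is the same kind of trade MP3 makes with audio detail.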


Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge
University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Chapter 5: Index Compression

• Compression
• Rule of 30
• Lossy Compression
• Lossless Compression
• Heaps' law
• Zipf’s law
• Power law
• Front Coding
• Variable Byte Encoding
• Nibble
• Unary Code
• γ Encoding
• Entropy
• δ Codes
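As a rough illustration of one of the terms above, the following sketch shows variable byte encoding applied to the gaps between document IDs in a postings list. It follows the scheme described in chapter 5 of the textbook (7 payload bits per byte, with the high bit set on the last byte of each number); the function names are hypothetical and are not taken from the course's indexer code.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of docIDs into the first ID followed by the gaps."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vb_encode_number(n):
    """Encode one non-negative integer as a variable byte code."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # high bit set marks the final byte of this number
    return bytes(chunks)

def vb_encode(numbers):
    """Concatenate the variable byte codes of a list of integers."""
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    """Decode a variable byte stream back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:           # continuation byte: keep accumulating
            n = 128 * n + byte
        else:                    # final byte: strip the marker bit and emit
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

doc_ids = [824, 829, 215406]
gaps = to_gaps(doc_ids)
encoded = vb_encode(gaps)
print(gaps, "->", len(encoded), "bytes")
print(vb_decode(encoded) == gaps)
```

Small gaps such as 5 fit in a single byte, so a postings list for a frequent term, whose gaps are mostly small, takes far less space than one fixed-width integer per docID.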


The output of the indexer that we started to develop in Unit 2 and are continuing to develop in this unit (Unit 3) includes statistics such
as the number of documents, the total number of terms, and the number of unique terms in the collection added to the index.

The number of unique terms is the size of the vocabulary (M) in the dictionary of the inverted index. Heaps' law provides a formula,
M = kT^b, that can be used to estimate the number of unique terms in a collection based upon constants k and b and the number of
terms or tokens (T) parsed from all documents.

Section 5.1.1 of the textbook (page 88) provides typical values for both k and b. The value of k is typically in a range
between 10 and 100, and b ≈ 0.4 to 0.6. Using the formula for Heaps' law, calculate the estimated size of the vocabulary (M) from the
"total number of terms parsed from all documents" statistic reported when running your indexer program. Because both k and b are
typically found through empirical analysis, assume that k is 40 and b is 0.50. Compare the estimate with the "total number of
unique terms found and added to the index" statistic reported by your indexer program, which represents the actual size of the vocabulary
in your collection. Report your findings in a posting response in the Unit 3 discussion forum. If the size of the vocabulary estimated by
Heaps' law is not consistent with the vocabulary discovered by your indexer process, speculate on why this may have occurred. Consider
that this discrepancy may be uncovering a flaw in your program, or that the corpus you are using may be limited in vocabulary due to its
subject content. Discuss your findings with your peers and provide feedback to at least 3 peers on this submission.
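The calculation described above can be sketched as follows, assuming the textbook form of Heaps' law, M = kT^b, with the k = 40 and b = 0.50 given in the assignment. The two statistics in the example are hypothetical placeholders, not output from any real indexer; substitute the numbers your own program reports.

```python
def heaps_estimate(total_tokens, k=40, b=0.5):
    """Estimate vocabulary size M = k * T**b (Heaps' law)."""
    return k * total_tokens ** b

# Hypothetical statistics; replace with the values your indexer prints.
total_tokens = 1_000_000        # total number of terms parsed from all documents (T)
actual_unique_terms = 35_000    # unique terms actually added to the index

estimate = heaps_estimate(total_tokens)
ratio = actual_unique_terms / estimate

print(f"Heaps' law estimate of M: {estimate:,.0f}")
print(f"Actual vocabulary size:   {actual_unique_terms:,}")
print(f"Actual / estimate:        {ratio:.2f}")
```

With these placeholder numbers the arithmetic is M = 40 × 1,000,000^0.5 = 40 × 1,000 = 40,000. A ratio far from 1 is the cue to look for a tokenizing flaw in the indexer or a corpus with an unusually narrow vocabulary.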

You must post your initial response before being able to review other students' responses. Once you have made your first response, you
will be able to reply to other students' posts. You are expected to make a minimum of 3 responses to your fellow students' posts.

*In addition to the criteria already posted in the Discussion Forum

This assignment relies upon the completion of the indexer part 1 assigned in unit 2. Each student must use the statistics produced by their
indexer programs to complete this assignment.

• Does the posting include the statistics output from the student's indexer part 1? (25%)
• Does the posting include calculations made using the Heaps' law formula that estimate the size of the vocabulary for the corpus? (50%)
• Does the posting compare the actual vocabulary of the corpus as reported by the student's indexer part 1 with the estimates derived from Heaps' law? (50%)
• Does the discussion examine and explain any inconsistencies (if relevant, that is, if the actual vocabulary is significantly different from the Heaps' law estimate) between the Heaps' law estimate and the actual vocabulary of the corpus? (25%)


Your learning journal entry must be a reflective statement that considers the following questions:

• Describe what you did. This does not mean that you copy and paste from what you have posted or the assignments you have
prepared. You need to describe what you did and how you did it.
• Describe your reactions to what you did
• Describe any feedback you received or any specific interactions you had. Discuss how they were helpful
• Describe your feelings and attitudes
• Describe what you learned

Another set of questions to consider in your learning journal statement includes:

• What surprised me or caused me to wonder?

• What happened that felt particularly challenging? Why was it challenging to me?
• What skills and knowledge do I recognize that I am gaining?
• What am I realizing about myself as a learner?
• In what ways am I able to apply the ideas and concepts gained to my own experience?

Your Learning Journal must be a minimum of 500 words.


The Self-Quiz gives you an opportunity to self-assess your knowledge of what you have learned so far.

The results of the Self-Quiz do not count towards your final grade, but the quiz is an important part of the University's learning process
and it is expected that you will take it to ensure understanding of the materials presented. Reviewing and analyzing your results will help
you perform better on future Graded Quizzes and the Final Exam.

Please access the Self-Quiz on the main course homepage; it will be listed inside the Unit.


Peer assess Unit 2 Development Assignment

Read the Learning Guide and Reading Assignments

Participate in the Discussion Assignment (post, comment, and rate in the Discussion Forum)

Make entries to the Learning Journal

Take the Self-Quiz
