Hypothetical Document Embeddings (HyDE) - Jupyter Notebook
HyDE uses a large language model (LLM), such as ChatGPT, to generate a hypothetical
document in response to a query, rather than embedding the query itself and searching the
vector database directly.
It is inspired by the paper Precise Zero-Shot Dense Retrieval without Relevance Labels
(https://arxiv.org/pdf/2212.10496)
The Process
1. Query Formulation: Begin with your question. For instance, "What triggered the French
Revolution?"
2. LLM Generates Hypothetical Doc: HyDE prompts an LLM, like GPT-3, to write a
hypothetical document answering your query. This document may be factually inaccurate,
but it captures the vocabulary and phrasing relevant to your question.
3. Embedding Creation: The system encodes this hypothetical document into a numerical
representation, known as an embedding. Think of this embedding as a unique fingerprint for
the hypothetical document.
4. Similarity Search: The system compares this embedding against the embeddings of real
documents in the vector store. This is an answer-to-answer embedding similarity search, in
contrast to the query-to-answer similarity search used in traditional RAG retrieval.
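The steps above can be sketched in a few lines of Python. Note the stand-ins: `fake_llm` is a hard-coded stub in place of a real LLM call, and `embed` uses toy term-frequency vectors instead of a real embedding model; both are illustrative assumptions, not part of HyDE itself. The key point is that retrieval matches the *hypothetical answer* against the corpus, not the raw query.

```python
# Minimal HyDE sketch using only the standard library.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a lower-cased term-frequency vector.
    A real system would use a dense encoder model here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fake_llm(query: str) -> str:
    """Stand-in for the LLM call that writes a hypothetical answer."""
    return ("The French Revolution was triggered by financial crisis, "
            "food shortages, and resentment of the monarchy and taxes.")

corpus = [
    "The storming of the Bastille marked the start of the French Revolution.",
    "Financial crisis, heavy taxes, and food shortages fueled resentment of the monarchy.",
    "Photosynthesis converts sunlight into chemical energy in plants.",
]

def hyde_retrieve(query: str, docs: list) -> str:
    # Step 2: generate a hypothetical document for the query.
    hypothetical = fake_llm(query)
    # Step 3: embed the hypothetical document (not the raw query).
    q_vec = embed(hypothetical)
    # Step 4: answer-to-answer similarity search over the corpus.
    return max(docs, key=lambda d: cosine(q_vec, embed(d)))

print(hyde_retrieve("What triggered the French Revolution?", corpus))
```

Because the hypothetical answer shares vocabulary with the relevant document, the financially themed passage scores highest even though the query itself mentions none of those terms.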
In essence, HyDE utilizes the LLM to construct a "search template" based on your question.
It then retrieves real documents that align with this template.
Benefits of HyDE
Improved Retrieval Accuracy: HyDE can outperform traditional search methods,
particularly for intricate or nuanced questions.
Limitations to Consider
Factual Inconsistencies: The hypothetical documents are not factual, so retrieved
documents might contain irrelevant information.
This approach may also fail to produce good results consistently. If the subject is entirely
unfamiliar to the language model, the method is ineffective and can increase the rate of
incorrect, hallucinated information.
Early Stage Technology: HyDE is a relatively new approach, and further research is needed
to refine its effectiveness.
Conclusion
HyDE, or Hypothetical Document Embeddings, leverages large language models (LLMs)
like ChatGPT to generate hypothetical documents that enhance search accuracy. It employs
an unsupervised encoder to convert these hypothetical documents into vectors for retrieval.
The method performs well on tasks like web search, QA, and fact verification, with
performance comparable to well-tuned retrievers. However, it is not infallible: if the subject
is unfamiliar to the model, the generated document can mislead retrieval.