Interview Task 1

The document outlines the development of an Intelligent Document Processing and Query System that processes technical PDF documents, extracts key information, and stores it in a vector database for user query responses. It includes requirements for document processing, information extraction, vector database integration, query processing, response generation, system integration, and performance optimization. The deliverables consist of Python code, documentation, and a performance report.

Uploaded by

phalkeshubham19
Copyright
© All Rights Reserved

Task: Intelligent Document Processing and Query System

Objective:
Develop a system that processes technical PDF documents, extracts key
information, stores it in a vector database, and provides relevant responses to
user queries using Retrieval-Augmented Generation (RAG).

Requirements:
1. Document Processing:
- Accept 10 PDF files as input.
- Extract text content from each PDF.
- Split each document into logical sections (e.g., paragraphs or pages).
2. Information Extraction and Tagging:
- For each section, extract and tag the following information:
a. Equipment name
b. Domain (e.g., electronics, mechanical, software)
c. Model numbers
d. Manufacturer
3. Vector Database Integration:
- Choose and implement a suitable vector database (e.g., Pinecone, Weaviate,
Milvus, or any other of your choice).
- Convert each tagged section into a vector representation.
- Store the vectors along with their associated metadata (tags) in the database.
4. Query Processing:
- Implement a user interface to accept natural language queries.
- Identify which equipment, model, or manufacturer the query refers to.
- Convert user queries into vector representations.
- Perform a cosine similarity search in the vector database to retrieve the most
relevant sections for the matching equipment, model, or manufacturer.
5. Response Generation:
- Utilize a Language Model (e.g., GPT-3, GPT-4) for response generation.
- Use the API key supplied with this task if you do not have your own
(key: [redacted]). Activity on this key is monitored, so use it only for this
task.
- Design an effective prompt that incorporates the retrieved relevant sections
and the user's query.
- Generate a coherent and informative response based on the retrieved
information.
6. System Integration:
- Develop a Python application that integrates all the above components.
- Ensure smooth data flow from document processing to query response.
7. Performance and Scalability:
- Optimize the system for quick response times.
- Design the system to handle potential scaling to more documents in the future.
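
The retrieval step in requirements 3 and 4 can be sketched in a few lines. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the `sections` list stands in for a vector database such as Pinecone, Weaviate, or Milvus, and the names `Section` and `search` are hypothetical. It shows the two ideas the requirements combine: filtering on the extracted tags first, then ranking the remaining sections by cosine similarity.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Section:
    text: str
    tags: dict  # e.g. equipment, domain, model, manufacturer
    vector: Counter = field(init=False)

    def __post_init__(self):
        # Compute the vector once, when the section is stored.
        self.vector = embed(self.text)


def search(sections, query, tag_filter=None, top_k=3):
    """Return up to top_k sections matching tag_filter, best match first."""
    tag_filter = tag_filter or {}
    candidates = [
        s for s in sections
        if all(s.tags.get(k) == v for k, v in tag_filter.items())
    ]
    q = embed(query)
    ranked = sorted(
        candidates,
        key=lambda s: cosine_similarity(q, s.vector),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical sample data, loosely based on the example scenario.
sections = [
    Section("power consumption is 5W in standby mode",
            {"manufacturer": "XYZ Corp", "model": "ABC123"}),
    Section("the display is 6 inches",
            {"manufacturer": "XYZ Corp", "model": "ABC123"}),
    Section("power consumption is 40W",
            {"manufacturer": "Other Inc", "model": "Q9"}),
]
hits = search(sections, "what is the power consumption",
              {"manufacturer": "XYZ Corp"}, top_k=1)
print(hits[0].text)
```

In a real system the tag filter would typically be passed to the vector database as a metadata filter alongside the query vector, so that filtering and ranking happen in one call.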

Example Scenario:
Input: 10 PDF files containing technical specifications of various electronic
devices.
User Query: "What is the power consumption of the latest XYZ Corp
smartphone?"

Expected System Behaviour:
1. Process and tag all 10 PDFs, storing information in the vector database.
2. Convert the user query to a vector.
3. Retrieve the most relevant section(s) from the database.
4. Generate a response using the LLM, incorporating the retrieved information.
5. Present the answer to the user, e.g., "The latest XYZ Corp smartphone, model
ABC123, has a power consumption of 5W in standby mode and up to 15W during
peak usage, according to the technical specifications."
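
Step 4 depends on how the retrieved sections and the query are combined into a prompt. The sketch below shows one possible layout; the function name `build_prompt`, the instruction wording, and the sample section text are all assumptions made for illustration, not part of the task. The resulting string would then be sent to whichever language model is used.

```python
def build_prompt(query, retrieved_sections):
    """Combine retrieved sections and the user's query into one prompt string."""
    context = "\n\n".join(
        f"[Section {i}]\n{text}"
        for i, text in enumerate(retrieved_sections, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical retrieved text, based on the example scenario.
prompt = build_prompt(
    "What is the power consumption of the latest XYZ Corp smartphone?",
    ["Model ABC123: power consumption is 5W in standby mode "
     "and up to 15W during peak usage."],
)
print(prompt)
```

Numbering the sections makes it possible to ask the model to cite which section an answer came from, which helps when checking accuracy.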

Deliverables:
1. Python code for the entire system.
2. Documentation explaining the architecture, chosen technologies, and how to
run the system.
3. A brief report on the system's performance, including response times and
accuracy.
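
For the response-time part of the report, a small timing harness is usually enough. The sketch below is one way to collect the numbers, assuming the whole pipeline is exposed as a single callable; `fake_answer` is a hypothetical placeholder for that callable, and the function names are illustrative.

```python
import statistics
import time


def measure_latency(answer_fn, queries):
    """Run answer_fn on each query and return wall-clock times in seconds."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        answer_fn(q)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to the figures a short report needs."""
    return {
        "count": len(timings),
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


def fake_answer(query):
    # Hypothetical placeholder for the real query-to-response pipeline.
    return query.upper()


report = summarize(measure_latency(fake_answer, ["q1", "q2", "q3"]))
print(report)
```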

Tip:
Feel free to use an LLM to generate code for you.
