Notes - by Kishor

Let's understand what happens when you run the application:

1. Document Processing:
   - The application loads all .txt files from your documents directory
   - It splits these documents into smaller chunks (default 1000 characters)
   - Each chunk is converted into a numerical vector using OpenAI's embedding model

2. Vector Storage:
   - The document vectors are stored in a Chroma database
   - This database is saved locally in a ./chroma_db directory
   - This allows for quick similarity searching when answering questions

3. Question Answering:
   - When you ask a question, the system:
     - Converts your question to a vector
     - Finds the most similar document chunks
     - Combines these chunks with your question
     - Generates an answer using the OpenAI model
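The "finds the most similar chunks" step above usually means ranking chunk vectors by cosine similarity to the question vector. Here is a minimal sketch using toy 3-dimensional vectors in place of real OpenAI embeddings (the chunk names and vector values are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for real embedding-model output
chunk_vectors = {
    "chunk about cats": [0.9, 0.1, 0.0],
    "chunk about cooking": [0.5, 0.5, 0.5],
    "chunk about tax law": [0.0, 0.1, 0.9],
}
question_vector = [0.85, 0.2, 0.05]  # pretend embedding of a cat question

# Rank chunks by similarity to the question, most similar first
ranked = sorted(chunk_vectors,
                key=lambda name: cosine_similarity(question_vector, chunk_vectors[name]),
                reverse=True)
print(ranked[0])  # → chunk about cats
```

The top-k chunks from this ranking are what get combined with your question in the next step.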

Common Issues and Solutions:

1. Import Errors:
   - Make sure your virtual environment is activated
   - Verify all packages are installed correctly

2. API Key Errors:
   - Check your .env file exists and contains the correct API key
   - Or provide the API key via command line argument

3. Document Loading Issues:
   - Verify your documents are in the correct directory
   - Make sure they are .txt files
   - Check file permissions
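For item 2, the .env file is just a plain-text file in the project root with a single assignment (the key shown here is a placeholder, not a real key):

```
OPENAI_API_KEY=sk-your-key-here
```

There should be no quotes and no spaces around the equals sign; load_dotenv() picks this file up automatically when the application starts.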

```python
import os
import argparse
from typing import List, Dict

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from dotenv import load_dotenv


class RAGApplication:
    def __init__(self, documents_dir: str, openai_api_key: str = None):
        """
        Initialize the RAG application.

        Parameters:
        - documents_dir: Directory containing the source documents
        - openai_api_key: Optional API key (can also be set via .env file)
        """
        # Load environment variables from .env file
        load_dotenv()

        # Set API key if provided, otherwise use from .env
        if openai_api_key:
            os.environ["OPENAI_API_KEY"] = openai_api_key

        self.documents_dir = documents_dir

        # Initialize OpenAI embeddings - this converts text to numerical vectors
        self.embeddings = OpenAIEmbeddings()

        # Initialize storage variables
        self.vector_store = None  # Will store document embeddings
        self.qa_chain = None      # Will handle the Q&A process

    def load_documents(self) -> List[str]:
        """
        Load all text documents from the specified directory.
        Supports recursive directory search for .txt files.
        """
        # Create a loader that will read all .txt files in the directory
        loader = DirectoryLoader(
            self.documents_dir,
            glob="**/*.txt",       # Pattern to match text files
            loader_cls=TextLoader  # Use basic text loader
        )
        # Load all documents
        documents = loader.load()
        return documents

    def split_documents(self, documents: List[str], chunk_size: int = 1000) -> List[str]:
        """
        Split documents into smaller chunks for better processing.

        Parameters:
        - documents: List of loaded documents
        - chunk_size: Size of each chunk in characters
        """
        # Create a text splitter that uses recursive character splitting
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,  # Number of characters per chunk
            chunk_overlap=200,      # Overlap between chunks to maintain context
            length_function=len     # Function to measure text length
        )
        # Split all documents into chunks
        splits = text_splitter.split_documents(documents)
        return splits

    def create_vector_store(self, splits: List[str]):
        """
        Create a vector store from document chunks.
        This converts text chunks to vectors and stores them for similarity search.
        """
        # Create a Chroma vector store from the document chunks
        self.vector_store = Chroma.from_documents(
            documents=splits,
            embedding=self.embeddings,
            persist_directory="./chroma_db"  # Store vectors on disk
        )
        # Save the vector store to disk
        self.vector_store.persist()

    def setup_qa_chain(self):
        """
        Set up the question-answering chain that combines retrieval and generation.
        """
        # Create a retriever that will fetch relevant documents
        retriever = self.vector_store.as_retriever(
            search_kwargs={"k": 3}  # Number of documents to retrieve
        )
        # Create the QA chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=OpenAI(),                 # Language model for generation
            chain_type="stuff",           # Combine retrieved docs into prompt
            retriever=retriever,          # Document retriever
            return_source_documents=True  # Include source docs in response
        )

    def query(self, question: str) -> Dict:
        """
        Process a question and return an answer with source documents.

        Parameters:
        - question: The question to answer

        Returns:
        - Dictionary containing answer and source documents
        """
        # Check if QA chain is initialized
        if not self.qa_chain:
            raise ValueError("QA chain not initialized. Run setup_qa_chain first.")

        # Process the question
        response = self.qa_chain({"query": question})

        # Format the response
        return {
            "answer": response["result"],
            "source_documents": [doc.page_content for doc in response["source_documents"]]
        }


def main():
    # Set up command line argument parsing
    parser = argparse.ArgumentParser(description='RAG Application')
    parser.add_argument('--docs_dir', type=str, default='./documents',
                        help='Directory containing documents')
    parser.add_argument('--question', type=str,
                        help='Question to ask the RAG system')
    parser.add_argument('--api_key', type=str,
                        help='OpenAI API key (optional if set in .env file)')
    parser.add_argument('--chunk_size', type=int, default=1000,
                        help='Chunk size for splitting documents')

    # Parse command line arguments
    args = parser.parse_args()

    # Initialize the RAG application
    rag_app = RAGApplication(
        documents_dir=args.docs_dir,
        openai_api_key=args.api_key
    )

    # Process the documents
    print("Loading documents...")
    documents = rag_app.load_documents()
    print(f"Found {len(documents)} documents")

    print("Splitting documents...")
    splits = rag_app.split_documents(documents, chunk_size=args.chunk_size)
    print(f"Created {len(splits)} chunks")

    print("Creating vector store...")
    rag_app.create_vector_store(splits)
    rag_app.setup_qa_chain()

    # Handle questions either from command line or interactive mode
    if args.question:
        # Single question mode
        print("\nProcessing question:", args.question)
        response = rag_app.query(args.question)
        print("\nAnswer:", response['answer'])
        print("\nSource Documents:")
        for idx, doc in enumerate(response['source_documents'], 1):
            print(f"\nDocument {idx}:")
            print(doc[:200] + "...")
    else:
        # Interactive mode
        while True:
            question = input("\nEnter your question (or 'quit' to exit): ")
            if question.lower() == 'quit':
                break
            response = rag_app.query(question)
            print("\nAnswer:", response['answer'])
            print("\nSource Documents:")
            for idx, doc in enumerate(response['source_documents'], 1):
                print(f"\nDocument {idx}:")
                print(doc[:200] + "...")


if __name__ == "__main__":
    main()
```

Now comes the exciting part - actually running your RAG application! You have several ways to do
this:

1. Interactive Mode (recommended for first-time users):


```bash

python rag_app.py

```

When you run this, several things happen in sequence:

a) Document Processing:

- The system reads all your text files from the documents folder

- It prints "Loading documents..." and tells you how many it found

- Then it splits these documents into smaller, manageable chunks

- You'll see "Created X chunks" showing how many pieces it made

b) Vector Store Creation:

- The system converts your document chunks into numerical vectors

- These vectors are stored in a database for quick searching

- You'll see "Creating vector store..." during this process

c) Question-Answering:

- The system will prompt you to "Enter your question"

- Type your question and press Enter

- The system will:
  - Find relevant document chunks
  - Generate an answer using those chunks
  - Show you both the answer and the source documents it used

2. Single Question Mode:

If you want to ask just one question:

```bash

python rag_app.py --question "What is the main topic of these documents?"

```
3. Advanced Configuration:

For more control over how the system works:

```bash

python rag_app.py --docs_dir "./my_documents" --chunk_size 500

```
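Under the hood, the "stuff" chain type mentioned in the code simply concatenates the retrieved chunks and your question into one prompt for the language model. A minimal sketch of that assembly step (the template wording is illustrative, not LangChain's exact internal prompt):

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = ["Python was created by Guido van Rossum.",
          "Python 3.0 was released in 2008."]
prompt = build_stuff_prompt("Who created Python?", chunks)
print(prompt)
```

This is why the retriever's `k` setting matters: all `k` chunks must fit in the model's context window alongside your question.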

Troubleshooting Common Issues:

If you encounter problems, here's what to check:

1. "Module not found" errors:

- Make sure your virtual environment is activated

- Try running `pip install -r requirements.txt` again

2. API Key errors:

- Check your .env file exists and has the correct API key

- Make sure there are no spaces around the equals sign

3. Document loading issues:

- Verify your documents are .txt files

- Check if they're in the correct folder

- Make sure they're readable (try opening them in Notepad)

4. Memory or performance issues:

- Try processing fewer documents first

- Use a larger chunk size (e.g., --chunk_size 2000)
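As a rough sanity check for item 4, you can estimate how many chunks (and therefore how many embedding calls) a corpus will produce: with a 200-character overlap, each chunk advances about chunk_size - 200 characters. The helper below is hypothetical, and the real splitter breaks on separators, so actual counts will differ somewhat:

```python
def estimate_chunks(total_chars, chunk_size, overlap=200):
    """Rough estimate of chunk count for a corpus of total_chars characters."""
    step = chunk_size - overlap       # each new chunk advances this far
    return max(1, -(-total_chars // step))  # ceiling division

# A 100,000-character corpus:
print(estimate_chunks(100_000, 1000))  # → 125 chunks
print(estimate_chunks(100_000, 2000))  # → 56 chunks
```

Doubling the chunk size roughly halves the number of chunks to embed and store, which is why it helps with memory and performance.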

Understanding the Output:

When you ask a question, you'll see:

1. The answer to your question
2. Snippets from the source documents that were used

This helps you understand where the information came from.


Linux commands:
Intro 0:05
⏩ ssh 0:21
⏩ ls 0:30
⏩ pwd 0:35
⏩ cd 0:51
⏩ touch 1:23
⏩ echo 1:32
⏩ nano 1:42
⏩ vim 1:56
⏩ cat 2:02
⏩ shred 2:10
⏩ mkdir 2:15
⏩ cp 2:26
⏩ rm 2:28
⏩ rmdir 2:38
⏩ ln 2:45
⏩ clear 2:50
⏩ whoami 2:57
⏩ useradd 3:02
⏩ sudo 3:08
⏩ adduser 3:15
⏩ su 3:21
⏩ exit 3:29
⏩ passwd 3:50
⏩ apt 4:12
⏩ finger 4:20
⏩ man 4:33
⏩ whatis 4:55
⏩ curl 5:05
⏩ zip 5:13
⏩ unzip 5:20
⏩ less 5:29
⏩ head 5:32
⏩ tail 5:34
⏩ cmp 5:42
⏩ diff 5:50
⏩ sort 6:00
⏩ find 6:19
⏩ chmod 6:24
⏩ chown 6:34
⏩ ifconfig 6:40
⏩ ip address 6:47
⏩ grep 7:02
⏩ awk 7:26
⏩ resolvectl status 7:31
⏩ ping 7:57
⏩ netstat 8:08
⏩ ss 8:14
⏩ iptables 8:24
⏩ ufw 8:43
⏩ uname 8:52
⏩ neofetch 9:01
⏩ cal 9:14
⏩ free 9:21
⏩ df 9:28
⏩ ps 9:36
⏩ top 9:40
⏩ htop 9:44
⏩ kill 10:03
⏩ pkill 10:14
⏩ systemctl 10:29
⏩ history 10:35
⏩ reboot 10:37
⏩ shutdown
