Semantic Search

The document outlines the implementation of a semantic search engine using BERT and Anaconda, detailing steps for setting up the environment, loading pre-trained models, and performing searches using nearest neighbors and cosine similarity. It provides a structured approach to building a semantic search system for various applications, including research paper retrieval. Key tools discussed include FAISS, Haystack, OpenAI API, and Elasticsearch, with a focus on using Sentence-BERT for generating contextual embeddings.

Uploaded by

venkat Mohan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views9 pages

Semantic Search

The document outlines the implementation of a semantic search engine using BERT and Anaconda, detailing steps for setting up the environment, loading pre-trained models, and performing searches using nearest neighbors and cosine similarity. It provides a structured approach to building a semantic search system for various applications, including research paper retrieval. Key tools discussed include FAISS, Haystack, OpenAI API, and Elasticsearch, with a focus on using Sentence-BERT for generating contextual embeddings.

Uploaded by

venkat Mohan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 9

Semantic Search: Which One Should You Use?

Tool            Best For                        Key Strength
FAISS           Large-scale similarity search   Efficient and optimized for vectors
Haystack        Document retrieval & QA         Uses BM25 & Transformers for NLP
OpenAI API      Chatbots, AI search             High-quality embeddings
Elasticsearch   Enterprise-level search         Hybrid search (BM25 + vectors)
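
Since FAISS is the option listed here (and in the Next Steps below) for large-scale search, the following is a minimal sketch of how it could pair with the SBERT embeddings used in the steps that follow. It assumes faiss-cpu is installed (pip install faiss-cpu); the index type and document list are illustrative, not part of the original steps.

python
# Minimal FAISS sketch (assumes: pip install faiss-cpu sentence-transformers)
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine learning makes predictions using data.",
    "Deep learning is a branch of machine learning."
]

# Encode and L2-normalize so inner product equals cosine similarity
vectors = model.encode(documents, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(vectors)

index = faiss.IndexFlatIP(vectors.shape[1])  # exact search; approximate indexes exist for very large corpora
index.add(vectors)

query = model.encode(["How does AI impact industries?"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)  # top-3 results

for score, idx in zip(scores[0], ids[0]):
    print(f"{documents[idx]} (Score: {score:.3f})")
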
Step-by-Step Implementation of Semantic Search Using BERT in Anaconda

BERT (Bidirectional Encoder Representations from Transformers) is a powerful NLP model that understands the context and meaning of text. We will use Sentence-BERT (SBERT), which fine-tunes BERT for efficient semantic search.

Step 1: Set Up Anaconda Environment

Open Anaconda Prompt and create a new environment.

bash
conda create --name semantic_bert python=3.9 -y
conda activate semantic_bert

Install Required Packages

bash
pip install torch torchvision torchaudio
pip install sentence-transformers scikit-learn numpy

 torch → Required for deep learning models.
 sentence-transformers → Pre-trained BERT models for text embeddings.
 scikit-learn → Efficient nearest neighbor search.
 numpy → Handles numerical operations.

Step 2: Load a Pre-Trained BERT Model

We use Sentence-BERT (SBERT) to generate contextual embeddings.

python
from sentence_transformers import SentenceTransformer

# Load a pre-trained BERT model
model = SentenceTransformer("bert-base-nli-mean-tokens")

# Example documents for semantic search
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine learning makes predictions using data.",
    "Deep learning is a branch of machine learning.",
    "Natural language processing helps chatbots understand text.",
    "Computer vision enables machines to interpret images."
]

# Convert documents into vector embeddings
document_vectors = model.encode(documents, convert_to_numpy=True)

print("Vector Shape:", document_vectors.shape)  # Example: (5, 768)

✅ This converts text into high-dimensional vectors (BERT embeddings).

Step 3: Build a Semantic Search Engine Using Nearest Neighbors

Now, we use Scikit-learn’s Nearest Neighbors search to find the most similar documents.

python
from sklearn.neighbors import NearestNeighbors

# Initialize Nearest Neighbors model with cosine similarity
nn_model = NearestNeighbors(n_neighbors=3, metric="cosine")

# Fit model on document embeddings
nn_model.fit(document_vectors)

✅ The model is now trained to retrieve semantically similar text.

Step 4: Perform a Semantic Search Query

Now, we take a user query, convert it to a BERT vector, and find similar results.

python
# Define a user query
query = "How does AI impact industries?"

# Convert query into a vector
query_vector = model.encode([query], convert_to_numpy=True)

# Search for top 3 most similar documents
distances, indices = nn_model.kneighbors(query_vector)

# Display results
print("\nTop Search Results:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {documents[idx]} (Score: {1 - distances[0][i]})")

✅ The most contextually similar documents are returned.

Step 5: Improve Search Results Using Cosine Similarity

We can manually compute cosine similarity for better ranking.

python
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between query and all document vectors
cosine_similarities = cosine_similarity(query_vector, document_vectors)

# Get top 3 results
top_indices = cosine_similarities.argsort()[0][-3:][::-1]

print("\nTop Search Results (Using Cosine Similarity):")
for i, idx in enumerate(top_indices):
    print(f"{i+1}. {documents[idx]} (Score: {cosine_similarities[0][idx]})")

✅ This ensures better-ranked search results.

Step 6: Save and Load the Search Model

To avoid recomputation, save and reload the model.

Save the Model

python
import pickle

# Save embeddings and model
with open("semantic_bert_model.pkl", "wb") as f:
    pickle.dump((documents, document_vectors, nn_model), f)

Load the Model

python
# Load saved model
with open("semantic_bert_model.pkl", "rb") as f:
    documents, document_vectors, nn_model = pickle.load(f)

✅ This allows reusing the semantic search model without retraining.

Final Thoughts

Feature         Implementation
Search Type     Semantic (context-based)
Model Used      bert-base-nli-mean-tokens (SBERT)
Search Method   Nearest Neighbors + Cosine Similarity
Performance     Accurate for contextual search
Scalability     Works on large datasets

Next Steps 🚀

 Try larger BERT models (roberta-base-nli-stsb-mean-tokens, all-mpnet-base-v2).
 Implement FAISS for large-scale search (see the sketch after the tool comparison above).
 Deploy as a REST API with Flask or FastAPI (a minimal FastAPI sketch follows below).

Would you like help with integrating this into a web app? 🚀
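
As a concrete starting point for that, here is a minimal FastAPI sketch that exposes the SBERT + NearestNeighbors search from Steps 2–4 as a /search endpoint. The file name app.py, the endpoint path, and the uvicorn command are illustrative assumptions, not part of the original steps.

python
# Minimal FastAPI sketch (assumes: pip install fastapi uvicorn)
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

app = FastAPI()

documents = [
    "Artificial Intelligence is changing the world.",
    "Machine learning makes predictions using data.",
    "Deep learning is a branch of machine learning."
]

# Build the same SBERT + NearestNeighbors search used in Steps 2-4
model = SentenceTransformer("bert-base-nli-mean-tokens")
document_vectors = model.encode(documents, convert_to_numpy=True)
nn_model = NearestNeighbors(n_neighbors=3, metric="cosine").fit(document_vectors)

@app.get("/search")
def search(q: str):
    # Encode the query and return the top matches with cosine-similarity scores
    query_vector = model.encode([q], convert_to_numpy=True)
    distances, indices = nn_model.kneighbors(query_vector)
    return [
        {"text": documents[idx], "score": float(1 - distances[0][i])}
        for i, idx in enumerate(indices[0])
    ]

# Run with: uvicorn app:app --reload   (assuming this file is saved as app.py)
# Then query: http://127.0.0.1:8000/search?q=How%20does%20AI%20impact%20industries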

You said:
another interesting sample for semantic search using above

ChatGPT said:

Semantic Search Example: Finding Similar Research Papers Using BERT

Let's build a semantic search system that allows users to search for research papers based on their abstract or keywords. This is useful for academic search engines, knowledge management systems, and AI-powered document retrieval.

📌 Step 1: Install and Set Up Anaconda Environment

If you haven’t already set up an environment, run:

bash
conda create --name semantic_research python=3.9 -y
conda activate semantic_research

Install Dependencies

bash
pip install torch torchvision torchaudio
pip install sentence-transformers scikit-learn numpy pandas

📌 Step 2: Load Research Paper Data

We'll use a sample dataset of research paper titles and abstracts.

python
import pandas as pd

# Sample dataset of research papers (title + abstract)
data = [
    {"title": "Deep Learning for Computer Vision",
     "abstract": "Deep learning techniques such as CNNs are widely used for image recognition and object detection."},

    {"title": "Natural Language Processing for Chatbots",
     "abstract": "NLP models, including transformers, help chatbots understand and generate human-like text."},

    {"title": "Quantum Computing and Cryptography",
     "abstract": "Quantum algorithms promise breakthroughs in encryption and secure communications."},

    {"title": "Reinforcement Learning in Robotics",
     "abstract": "Robotic control and automation benefit from reinforcement learning techniques like Q-learning."},

    {"title": "Neural Networks for Financial Forecasting",
     "abstract": "Predictive models using deep neural networks improve stock market analysis and risk assessment."}
]

# Convert dataset into DataFrame
df = pd.DataFrame(data)
df["text"] = df["title"] + ". " + df["abstract"]  # Combine title & abstract

print(df.head())

✅ This dataset simulates research papers, where semantic search will help find the most relevant paper.

📌 Step 3: Generate Semantic Embeddings with BERT

We convert titles and abstracts into vector representations using SBERT.

python
from sentence_transformers import SentenceTransformer

# Load BERT-based model
model = SentenceTransformer("bert-base-nli-mean-tokens")

# Convert research papers to embeddings
document_vectors = model.encode(df["text"].tolist(), convert_to_numpy=True)

print("Vector Shape:", document_vectors.shape)  # Example: (5, 768)

✅ Now, each research paper is represented as a high-dimensional vector.

📌 Step 4: Build the Semantic Search Model

We use Nearest Neighbors search to find similar research papers.

python
from sklearn.neighbors import NearestNeighbors

# Initialize Nearest Neighbors model with cosine similarity
nn_model = NearestNeighbors(n_neighbors=3, metric="cosine")

# Fit the model on research paper embeddings
nn_model.fit(document_vectors)

✅ The model is now trained for semantic search on research papers.

📌 Step 5: Search for Similar Research Papers

Now, let's enter a query (e.g., "AI for chatbots") and retrieve the most relevant papers.

python
# Define a research query
query = "Conversational AI using deep learning"

# Convert query to vector
query_vector = model.encode([query], convert_to_numpy=True)

# Perform search
distances, indices = nn_model.kneighbors(query_vector)

# Display search results
print("\nTop 3 Relevant Research Papers:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {df.iloc[idx]['title']} (Score: {1 - distances[0][i]})")

✅ This will return the most contextually similar research papers based on meaning rather than just keywords.

📌 Step 6: Save and Load the Semantic Search Model

To reuse the model without recomputation:

Save the Model

python
import pickle

# Save model, document vectors, and dataset
with open("semantic_research_model.pkl", "wb") as f:
    pickle.dump((df, document_vectors, nn_model), f)

Load the Model

python
# Load saved model
with open("semantic_research_model.pkl", "rb") as f:
    df, document_vectors, nn_model = pickle.load(f)

✅ This allows us to quickly load and perform searches without recomputing embeddings.

📌 Step 7: Try More Complex Queries

Now, let's test some realistic academic queries.

python
queries = [
    "How can we use AI in finance?",
    "Quantum encryption methods",
    "Reinforcement learning for automation"
]

# Process each query
for query in queries:
    query_vector = model.encode([query], convert_to_numpy=True)
    distances, indices = nn_model.kneighbors(query_vector)

    print(f"\n🔎 **Query:** {query}")
    for i, idx in enumerate(indices[0]):
        print(f"{i+1}. {df.iloc[idx]['title']} (Score: {1 - distances[0][i]})")

✅ This tests semantic retrieval on diverse AI-related topics.

📌 Final Thoughts

Feature         Implementation
Use Case        Research Paper Search
Model Used      bert-base-nli-mean-tokens (SBERT)
Search Type     Semantic (meaning-based)
Performance     Fast & Contextually Relevant
Scalability     Works on Large Paper Repositories
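
The earlier Next Steps suggest trying a stronger SBERT model. As a hedged sketch of what that swap looks like for this research-paper example, only the model name passed to SentenceTransformer changes (all-mpnet-base-v2 is one of the general-purpose sentence-transformers models); the paper texts below simply inline the title + abstract strings from Step 2, and the rest of the encode/fit/search flow is unchanged.

python
# Sketch: swapping in a stronger SBERT model (all-mpnet-base-v2) for the same pipeline
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

papers = [
    "Deep Learning for Computer Vision. Deep learning techniques such as CNNs are widely used for image recognition and object detection.",
    "Natural Language Processing for Chatbots. NLP models, including transformers, help chatbots understand and generate human-like text.",
    "Quantum Computing and Cryptography. Quantum algorithms promise breakthroughs in encryption and secure communications."
]

# Only the model name changes; the encode/fit/search flow stays the same
model = SentenceTransformer("all-mpnet-base-v2")
paper_vectors = model.encode(papers, convert_to_numpy=True)
nn_model = NearestNeighbors(n_neighbors=3, metric="cosine").fit(paper_vectors)

query_vector = model.encode(["Conversational AI using deep learning"], convert_to_numpy=True)
distances, indices = nn_model.kneighbors(query_vector)
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {papers[idx]} (Score: {1 - distances[0][i]:.3f})")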
