Articles Search Project
Directed by: Ahmed Benameur∗ Siwar Ben Gharsallah† Sarra Ben Hadj Slama‡
∗ Author
† Author
‡ Author
Contents
1 The problematic
1.1 Recurrent Neural Networks
1.2 Word Embeddings
1.3 Semantic Matching
2 The Project
2.1 Introduction
2.2 Data Base used
2.3 Generating data set
2.4 Data Preprocessing
2.5 Model Architecture Selection
2.6 Model Development
2.7 Model Training
List of Figures
1 Bibliomatrix sample data base
2 Keywords
3 Data preprocessing
4 Defining the model
5 Training the model
6 Performing a query
1 The problematic
The goal of this project is to develop a search query system that can understand and process user queries
to retrieve the most relevant articles from a given corpus. Traditional search systems often rely on
keyword matching and simple statistical methods, which can fall short on complex queries
or nuanced language. By using deep learning (DL) architectures, our system aims to capture the contextual
meaning of search queries and articles and thereby improve retrieval accuracy. The following are three
approaches to this problem.
2 The Project
2.1 Introduction
Throughout this report, we will detail the steps taken to develop and evaluate our RNN-based search
query system. This includes data preprocessing, model architecture selection, training and optimization
processes, and performance evaluation. We will also compare our RNN approach with traditional search
methods to highlight its advantages and areas for further improvement.
The findings from this project suggest that RNNs can improve search query performance in article
retrieval, paving the way for more intelligent and efficient information retrieval systems.
2.3 Generating data set
By running keyword queries and recording the results they returned, we generated a dataset on which
to train our RNN model.
Figure 2: Keywords
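The report does not say how the keyword queries were run or how their results were stored, so the following is only a minimal sketch of one possible way to turn keyword queries into labeled training pairs. The corpus `articles`, the list `keywords`, and the helpers `matches` and `generate_pairs` are hypothetical names introduced for illustration; exact word matching stands in for whatever query mechanism was actually used.

```python
# Hypothetical toy corpus and keyword list (not taken from the report).
articles = {
    "a1": "Recurrent neural networks for text retrieval",
    "a2": "Word embeddings capture semantic similarity",
    "a3": "A survey of graph databases",
}
keywords = ["neural", "semantic", "retrieval"]


def matches(keyword, text):
    """Return True if the keyword appears as a whole word in the text."""
    return keyword.lower() in text.lower().split()


def generate_pairs(keywords, articles):
    """Build (keyword, article_id, label) triples; label 1 means a match."""
    pairs = []
    for keyword in keywords:
        for article_id, title in articles.items():
            label = 1 if matches(keyword, title) else 0
            pairs.append((keyword, article_id, label))
    return pairs


pairs = generate_pairs(keywords, articles)
positives = [(kw, art) for kw, art, label in pairs if label == 1]
print(positives)
```

Each keyword is paired with every article, so the resulting dataset contains both matching and non-matching examples, which a supervised model generally needs in order to learn relevance.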
2.4 Data Preprocessing
Data preprocessing consisted of three main steps:
Tokenization: Split the sentences into individual words or subword units to create a vocabulary.
Numerical Encoding: Convert words or subword units into numerical representations using the vocabulary.
Padding: Ensure all sequences have the same length by padding shorter sentences with special tokens.
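The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual preprocessing code: whitespace tokenization, the reserved tokens `<pad>` and `<unk>`, and the helper names `tokenize`, `build_vocab`, `encode`, and `pad` are all choices made here for the example.

```python
# Reserved special tokens (assumed names; the report does not specify them).
PAD, UNK = "<pad>", "<unk>"


def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def build_vocab(sentences):
    """Map each distinct token to an integer id; ids 0 and 1 are reserved."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Replace each token with its id, using the <unk> id for unseen tokens."""
    return [vocab.get(token, vocab[UNK]) for token in tokenize(sentence)]


def pad(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


sentences = ["deep learning for search", "neural search"]
vocab = build_vocab(sentences)
padded = pad([encode(s, vocab) for s in sentences])
print(padded)
```

Reserving id 0 for padding is a common convention because it lets a downstream model mask padded positions; a subword tokenizer could replace `tokenize` without changing the other two steps.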
2.7 Model Training
Model training was one of the most critical steps in our project. We used a training dataset to adjust
the model's weights by minimizing the loss function.
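To make the idea of adjusting weights by minimizing a loss concrete, here is a deliberately tiny sketch. It is not the RNN used in the project: a single linear unit `y = w * x + b` trained with plain gradient descent on a mean squared error loss stands in for the full model, and the data, learning rate, and function names are assumptions chosen for illustration.

```python
def mse_loss(w, b, data):
    """Mean squared error of the linear unit y = w * x + b over the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, lr=0.05, epochs=500):
    """Run gradient descent and return the weights and the loss per epoch."""
    w, b = 0.0, 0.0
    history = []
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Move each weight a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(mse_loss(w, b, data))
    return w, b, history


# Toy training set generated from y = 2x + 1 (hypothetical data).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, history = train(data)
print(round(w, 3), round(b, 3))
```

An RNN is trained on the same principle, except that the gradients are computed by backpropagation through time and the update is usually handled by a deep learning framework's optimizer.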
The search query model we developed showed promising performance, although further improvement
is possible with additional training and tuning.