Sanjeev Mishra
Bachelor of Technology
In
Artificial Intelligence and Data Science
Submitted By
Sanjeev Mishra
0901AD211049
Submitted To
I hereby declare that the work entitled “Data Science Intern” is my own work, carried out during the
session May-June 2024. The report submitted by me is a record of bona fide work carried out by me.
I further declare that the work reported in this report has not been submitted and will not be
submitted, either in part or in full, for the award of any other degree or diploma in this institute
or any other institute or university.
--------------------------------
Sanjeev Mishra
0901AD211049
Date: 20.11.24
Place: Gwalior
This is to certify that the above statement made by the candidate is correct to the best of my
knowledge and belief.
Class Coordinator:
Dr. Tej Singh                          Dr. Rajni Ranjan Singh
Assistant Professor                    Prof. & Head
Centre for Artificial Intelligence     Centre for Artificial Intelligence
MITS, Gwalior                          MITS, Gwalior
ABSTRACT
This report summarizes a data science internship project that analyzes a restaurant dataset across
three levels of increasing complexity. Level 1 covers data exploration and preprocessing, descriptive
analysis, and geospatial analysis of restaurant locations. Level 2 emphasizes deeper analysis,
including insights into table booking and online delivery services, with comparisons of ratings across
restaurants offering or lacking these services. Price range analysis identifies the most common price
categories and explores the relationship between price range and restaurant ratings. Feature
engineering is introduced to extract additional insights by creating new features from existing data.
Level 3 extends the work to predictive modeling of aggregate ratings, customer preference analysis,
and data visualization of the resulting trends.
ACKNOWLEDGEMENT
The summer semester internship has proved to be pivotal to my career. I am thankful to my
institute, Madhav Institute of Technology and Science, for allowing me to pursue my disciplinary /
interdisciplinary internship as a curriculum requirement under the provisions of the Flexible
Curriculum Scheme (based on the AICTE Model Curriculum 2018), approved by the Academic
Council of the institute. I extend my gratitude to the Director of the institute, Dr. R. K. Pandit,
and the Dean Academics, Dr. Manjaree Pandit, for this.
I would sincerely like to thank my department, Centre for Artificial Intelligence, for allowing
me to explore this internship. I humbly thank Dr. Rajni Ranjan Singh Makwana, Professor
and Head, Centre for Artificial Intelligence, for his continued support during the course of
this engagement, which eased the process and formalities involved.
I also sincerely thank Mr. Ashish Namdev, EddyTools Tech Solution Pvt. Ltd., for his guidance and
mentorship during the internship period.
-----------------------------
Sanjeev Mishra
0901AD211049
Certificate
CONTENT
Declaration by the Candidate
Abstract
Acknowledgement
Certificate
Content
Acronyms
Chapter 1: Introduction
Chapter 2: Company Profile
Chapter 3: Techniques/Methodology
Chapter 4: Software Used
Chapter 5: Project
References
ACRONYMS
DREAM:
Data Exploration, Restaurant Analysis, Engineering, Aggregation, and Modeling – summarizing the
multi-level project objectives and tasks.
SERVE:
Statistics, Exploration, Regression, Visualization, and Engineering – focusing on the core processes
involved in analyzing restaurant-related data.
TASTE:
Target Analysis, Aggregate Ratings, Statistics, Trends, and Engineering – emphasizing the analysis of
ratings, trends, and feature engineering.
MEAL:
Multi-level Exploration, Analysis, and Learning – representing the multi-stage structure and learning
outcomes of the internship.
PLATE:
Preprocessing, Level-Based Analysis, Aggregation, Trends, and Evaluation – summarizing the
progression through levels and focus on trends and evaluation.
CHAPTER 1: INTRODUCTION
The internship project is organized as a structured set of tasks carried out on a restaurant dataset,
progressing through three levels of increasing complexity.
Initial tasks include dataset inspection, handling missing values, data type conversions, and analyzing the
target variable ("Aggregate Rating") for imbalances.
Descriptive statistics are calculated, with insights drawn from categorical variables like "Country Code," "City,"
and "Cuisines."
Level 2 focuses on table booking and online delivery, comparing ratings and availability across price ranges.
Price range analysis identifies common categories and explores their relationship with ratings, including
identifying standout attributes like color codes.
Feature engineering creates new insights by generating features such as name length or service availability.
Predictive modeling involves building regression models to forecast aggregate ratings, testing various
algorithms, and evaluating their performance.
Customer preference analysis examines cuisine popularity and ratings trends to uncover actionable insights.
Data visualization represents patterns and relationships through charts, highlighting trends across cuisines, cities,
and other features.
This structured task list equips participants with skills in data exploration, statistical analysis, feature
engineering, machine learning, and visualization, fostering a deeper understanding of restaurant
industry data and trends.
CHAPTER 2: COMPANY PROFILE
Cognifyz Technologies delivers AI- and ML-based solutions that help businesses harness the power of
data. Its offerings include the following.
Chatbot Solutions
1. A versatile chatbot system that integrates seamlessly with communication channels such as
websites, social media, and messaging apps.
2. Automates customer support and engagement, reduces response times, and enhances customer
satisfaction.
ML-Based Solutions
1. Tools for predictive analytics, enabling real-time insights to optimize business strategies.
2. Fraud detection systems to safeguard transactions and minimize risks.
3. Recommendation engines to personalize customer experiences and boost engagement.
Cognifyz Technologies equips businesses with robust tools to harness the power of data, improve
decision-making, and enhance operational efficiency, ensuring they remain at the forefront of
technological innovation.
CHAPTER 3: TECHNIQUES/METHODOLOGY
Task 1: Data Exploration and Preprocessing
Techniques:
1. Dataset Inspection:
1. Use Pandas to load and inspect the dataset: df.shape for dimensions, and df.info() to
examine column types and data completeness.
2. Handling Missing Values:
1. Identify missing values with df.isnull().sum() and either impute them (e.g., fillna()) or
drop the affected rows.
3. Data Type Conversion:
1. Convert columns to the appropriate type (e.g., datetime, categorical, numeric) using
pd.to_datetime() or astype(). A minimal sketch of these steps is shown below.
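The following is a minimal sketch of the inspection and conversion steps; the file name and column
names (e.g., "Cuisines") are placeholders and would need to match the actual dataset.

    import pandas as pd

    # Load the dataset (file name is a placeholder)
    df = pd.read_csv("Dataset.csv")

    # Dimensions, column types, and completeness
    print(df.shape)
    df.info()

    # Missing values per column
    print(df.isnull().sum())

    # Convert a column to an appropriate type ("Cuisines" is an assumed column name)
    df["Cuisines"] = df["Cuisines"].astype("category")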
Task 2: Descriptive Analysis
Techniques:
o Analyze unique values and their distributions using .value_counts() or bar plots.
o Investigate categorical columns such as “Country Code,” “City,” and “Cuisines.”
Task 3: Geospatial Analysis
Techniques:
1. Mapping Locations:
o Use Folium or Plotly to create interactive maps based on latitude and longitude.
o Overlay restaurant density in different regions.
2. Correlation Analysis:
o Correlate geographical location with restaurant ratings using scatter plots or correlation matrices
(Pandas/Seaborn).
Methodology:
Use a combination of geospatial tools and visualization techniques to extract location-based insights.
Present geospatial data in an interactive and understandable format for better interpretation.
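A minimal sketch of the correlation step is shown below, assuming columns named "Latitude",
"Longitude", and "Aggregate rating"; the actual names may differ.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Scatter plot of restaurant locations colored by rating (assumed column names)
    sns.scatterplot(data=df, x="Longitude", y="Latitude",
                    hue="Aggregate rating", palette="viridis", s=15)
    plt.title("Restaurant locations by aggregate rating")
    plt.show()

    # Correlation matrix between coordinates and rating
    print(df[["Latitude", "Longitude", "Aggregate rating"]].corr())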
CHAPTER 4: SOFTWARE USED
This project utilizes a combination of software tools and libraries to handle tasks ranging from data
preprocessing to advanced predictive modeling and visualization.
Data Exploration and Preprocessing
1. Python with Pandas, NumPy, and Jupyter Notebook for cleaning, manipulation, and
interactive exploration.
Descriptive Analysis
1. Python with Pandas, NumPy, Matplotlib, and Seaborn for statistical computations and
visualizations.
2. Optional use of R for additional descriptive analysis.
Geospatial Analysis
1. Python with Folium, Plotly, and Geopandas for mapping and geospatial data visualization.
2. QGIS (Optional) for advanced geospatial analysis.
Feature Engineering
1. Python with Pandas for feature creation and Scikit-learn for transformations and statistical
evaluations.
Predictive Modeling
1. Python with Scikit-learn, XGBoost, LightGBM, and Statsmodels for building and comparing
regression models.
These tools ensure a comprehensive approach to data analysis, from foundational tasks to advanced
insights and visual representations.
CHAPTER 5: PROJECT
LEVEL-1
Data Exploration and Preprocessing
To approach this project, we begin by exploring and preprocessing the dataset, which is the first task in
the process. We start by loading the dataset into a working environment, such as a Jupyter notebook,
using a tool like Python’s Pandas library. This allows us to quickly assess the number of rows and
columns in the dataset. The next step involves checking each column for missing values or any
inconsistencies. Missing values can appear in various forms such as NaN (Not a Number), null, or
blank entries, and handling them is crucial for the integrity of our analysis. These can be filled using
various strategies like replacing them with the mean or median of the column, or in some cases,
removing rows with too many missing values. After addressing missing data, we also perform data
type conversion, ensuring that each column's data type is appropriate for analysis (e.g., ensuring
numerical data is in numeric format and categorical data is properly encoded). Additionally, we
analyze the distribution of the target variable, "Aggregate Rating." This involves checking whether the
ratings are distributed evenly or if there are imbalances, where some ratings may be overrepresented or
underrepresented. This step is crucial as imbalances could affect any predictive modeling we perform
later.
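The snippet below is a minimal sketch of these preprocessing steps; the file name and column names
("Cuisines", "Votes", "Aggregate rating") are assumptions for illustration.

    import pandas as pd

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Count missing values in each column
    print(df.isnull().sum())

    # Fill a sparse text column and drop any remaining incomplete rows
    df["Cuisines"] = df["Cuisines"].fillna("Unknown")
    df = df.dropna()

    # Ensure numeric data is stored in numeric format
    df["Votes"] = pd.to_numeric(df["Votes"], errors="coerce")

    # Distribution of the target variable, to check for imbalances
    print(df["Aggregate rating"].value_counts().sort_index())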
Descriptive Analysis
The second task involves performing Descriptive Analysis to get a deeper understanding of the dataset.
We begin by calculating basic statistical measures such as the mean, median, and standard deviation
for numerical columns. These measures help summarize the central tendency and spread of the data,
giving us insights into variables like restaurant prices or ratings. Next, we explore the categorical
variables like "Country Code," "City," and "Cuisines." This step is focused on identifying patterns and
frequencies within these categories. For example, we might find that certain countries or cities have a
higher concentration of restaurants, or that some cuisines are more popular than others. This
information can be helpful for further analysis or for identifying trends that may be important for
business decisions.
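A short sketch of these calculations is given below, using the same assumed file and column names as
before.

    import pandas as pd

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Mean, median (50%), standard deviation, and other summary statistics
    print(df.describe())

    # Frequencies of key categorical variables (assumed column names)
    for col in ["Country Code", "City", "Cuisines"]:
        print(df[col].value_counts().head(10))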
Geospatial Analysis
The third task is focused on Geospatial Analysis, where we work with the geographic locations of
restaurants. Using the latitude and longitude data available in the dataset, we visualize the locations of
restaurants on a map. This is done using libraries like Folium or Plotly, which allow us to create
interactive maps. These maps give us a clear view of how restaurants are distributed geographically.
We can then analyze if there are any clusters of restaurants in specific cities or countries, and whether
the location of the restaurant might correlate with its rating. For example, we may find that restaurants
in city centers have higher ratings than those in more rural locations. By visually representing this data,
we can uncover hidden patterns that may not be immediately obvious from the raw dataset.
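A minimal Folium sketch follows; "Latitude", "Longitude", "Restaurant Name", and "Aggregate rating"
are assumed column names.

    import pandas as pd
    import folium
    from folium.plugins import MarkerCluster

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Center the map on the mean coordinates of all restaurants
    m = folium.Map(location=[df["Latitude"].mean(), df["Longitude"].mean()], zoom_start=2)

    # Cluster markers so dense regions remain readable
    cluster = MarkerCluster().add_to(m)
    for _, row in df.iterrows():
        folium.Marker(
            location=[row["Latitude"], row["Longitude"]],
            popup=f'{row["Restaurant Name"]} ({row["Aggregate rating"]})'
        ).add_to(cluster)

    m.save("restaurant_map.html")  # open in a browser to explore the clusters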
LEVEL-2
Table Booking and Online Delivery Analysis
In Task 1, we focus on analyzing the availability of table booking and online delivery services across
the dataset. The first step is to determine the percentage of restaurants that offer these services. This
can be done by calculating the proportion of restaurants with table booking and online delivery options
compared to the total number of restaurants. Next, we compare the average ratings between restaurants
that offer table booking and those that do not. This will help us understand if there's any significant
difference in customer satisfaction between these two categories. Similarly, we analyze the availability
of online delivery among restaurants with different price ranges. By grouping restaurants based on
their price range, we can identify if higher or lower-priced establishments are more likely to offer
online delivery services. This analysis can provide valuable insights into customer preferences and
business strategies.
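A minimal sketch of these calculations is shown below, assuming Yes/No columns named "Has Table
booking" and "Has Online delivery" and a numeric "Price range" column.

    import pandas as pd

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Percentage of restaurants offering each service
    print(df["Has Table booking"].value_counts(normalize=True) * 100)
    print(df["Has Online delivery"].value_counts(normalize=True) * 100)

    # Average rating with and without table booking
    print(df.groupby("Has Table booking")["Aggregate rating"].mean())

    # Share of restaurants offering online delivery in each price range
    print(df.groupby("Price range")["Has Online delivery"]
            .apply(lambda s: (s == "Yes").mean() * 100))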
Price Range Analysis
In Task 2, we dive into Price Range Analysis, which helps us understand how pricing correlates with
restaurant ratings. We begin by identifying the most common price range across all the restaurants in
the dataset. This can be done by calculating the frequency of each price range and determining which
one is most prevalent. After this, we calculate the average rating for each price range. This step will
provide insights into how restaurant prices might influence customer ratings. We can further visualize
these results by using color coding to identify which price range corresponds to the highest average
rating. This can help restaurants determine if adjusting their pricing strategy could potentially lead to
better customer satisfaction.
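The sketch below illustrates this analysis, again assuming columns named "Price range", "Aggregate
rating", and "Rating color".

    import pandas as pd

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Most common price range
    print(df["Price range"].value_counts())

    # Average rating per price range, highest first
    avg_by_price = (df.groupby("Price range")["Aggregate rating"]
                      .mean().sort_values(ascending=False))
    print(avg_by_price)

    # Rating color most associated with the best-rated price range
    best_range = avg_by_price.index[0]
    print(df.loc[df["Price range"] == best_range, "Rating color"].mode())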
Feature Engineering
In Task 3, we focus on Feature Engineering, where we create new features that could enhance the
predictive power of our models. One common approach is to extract additional features from existing
columns. For example, we can calculate the length of the restaurant name or address, as this could
provide useful information about the restaurant’s branding or location. Additionally, we can create new
binary features such as “Has Table Booking” and “Has Online Delivery” by encoding the relevant
categorical variables. This allows us to easily quantify whether a restaurant offers these services or not,
which can be a significant predictor in later analysis or predictive modeling. These new features will
help us enrich the dataset and uncover additional patterns in the data.
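A minimal sketch of these derived features is shown below; the column names are assumptions for
illustration.

    import pandas as pd

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Length-based features from existing text columns
    df["Name Length"] = df["Restaurant Name"].str.len()
    df["Address Length"] = df["Address"].str.len()

    # Binary indicators encoding service availability
    df["Has Table Booking Flag"] = (df["Has Table booking"] == "Yes").astype(int)
    df["Has Online Delivery Flag"] = (df["Has Online delivery"] == "Yes").astype(int)

    print(df[["Name Length", "Address Length",
              "Has Table Booking Flag", "Has Online Delivery Flag"]].head())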
LEVEL-3
Predictive Modeling
In Task 1, the goal is to build a regression model to predict the aggregate rating of a restaurant based
on available features. This task involves using machine learning techniques to understand how various
factors influence restaurant ratings and creating a model that can make predictions based on these
features. The first step is to split the dataset into two subsets: a training set and a testing set. The
training set is used to train the model, while the testing set helps evaluate the model’s performance on
unseen data, which ensures that the model generalizes well to new data. Once the data is split, we
proceed by experimenting with different algorithms, such as linear regression, decision trees, and
random forest, to determine which one best predicts the target variable (aggregate rating). After
training the model, we evaluate its performance using appropriate metrics, such as mean squared
error (MSE) or R-squared. This comparison helps identify the model that performs the best,
providing a foundation for future improvements and predictions.
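The sketch below compares three regressors under the assumption that the target is "Aggregate rating"
and that a small numeric feature set ("Price range", "Votes") is used for illustration; a full model would
also include the engineered features.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    df = pd.read_csv("Dataset.csv").dropna()  # placeholder file name

    X = df[["Price range", "Votes"]]   # assumed numeric features
    y = df["Aggregate rating"]         # target variable

    # Hold out 20% of the data to evaluate generalization
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=42),
        "Random Forest": RandomForestRegressor(random_state=42),
    }

    # Train each model and compare test-set MSE and R-squared
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(name, "MSE:", mean_squared_error(y_test, preds),
              "R2:", r2_score(y_test, preds))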
Customer Preference Analysis
In Task 2, we analyze customer preferences by examining the relationship between the type of cuisine
and restaurant ratings. We identify the most common cuisines in the dataset and compare their average
ratings and vote counts to see which cuisines tend to be the most popular and most highly rated,
uncovering actionable insights into customer tastes.
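A minimal sketch of this analysis follows, assuming "Cuisines", "Aggregate rating", and "Votes" columns.

    import pandas as pd

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Most common cuisines by number of restaurants
    print(df["Cuisines"].value_counts().head(10))

    # Average rating and total votes per cuisine, ordered by popularity
    cuisine_stats = (df.groupby("Cuisines")
                       .agg({"Aggregate rating": "mean", "Votes": "sum"})
                       .sort_values("Votes", ascending=False))
    print(cuisine_stats.head(10))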
Data Visualization
Task 3 focuses on data visualization, which plays a crucial role in understanding patterns and
communicating insights. In this task, we create visualizations to represent the distribution of ratings
across the dataset. This can be done using various types of charts, such as histograms or bar plots, to
show how ratings are distributed across different restaurants. Visualizing the distribution of ratings
helps us understand the overall trend and identify any skewness in the data. We also use visualizations
to compare the average ratings of different cuisines or cities, which can help highlight regions or
cuisine types that consistently receive higher ratings. Lastly, we use visualizations to uncover
relationships between different features (e.g., price range, table booking, or delivery options) and the
target variable (aggregate rating). For instance, we can create scatter plots to see if higher price
ranges correlate with higher ratings. These visualizations help draw insights from the data, enabling
better decision-making and providing clarity on key factors that influence customer ratings.
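The sketch below shows the kinds of plots described above, with the same assumed column names as in
earlier sketches.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv("Dataset.csv")  # placeholder file name

    # Distribution of aggregate ratings
    sns.histplot(df["Aggregate rating"], bins=20)
    plt.title("Distribution of aggregate ratings")
    plt.show()

    # Average rating for the ten most common cities
    top_cities = df["City"].value_counts().head(10).index
    (df[df["City"].isin(top_cities)]
       .groupby("City")["Aggregate rating"].mean()
       .sort_values().plot(kind="barh"))
    plt.xlabel("Average aggregate rating")
    plt.show()

    # Relationship between price range and rating
    sns.scatterplot(data=df, x="Price range", y="Aggregate rating", alpha=0.3)
    plt.show()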
REFERENCES