
Data Science Intern

Summer Internship - III Report

Bachelor of Technology
In
Artificial Intelligence and Data Science

Submitted By

Sanjeev Mishra

0901AD211049

Submitted To

Dr. Bhagat S. Raghuwanshi


Assistant Professor

Prof. Ramnaresh Sharma


Assistant Professor

Centre for Artificial Intelligence

May - June 2024


DECLARATION BY THE CANDIDATE

I hereby declare that the work entitled “Data Science Intern” is my own work, carried out during the session May-June 2024. The report submitted by me is a record of bonafide work carried out by me.

I further declare that the work reported in this report has not been submitted and will not be
submitted, either in part or in full, for the award of any other degree or diploma in this institute
or any other institute or university.

--------------------------------

Sanjeev Mishra
0901AD211049
Date: 20.11.24
Place: Gwalior

This is to certify that the above statement made by the candidate is correct to the best of my
knowledge and belief.
Class Coordinator:

Dr. Vibha Tiwari


Assistant Professor
Centre for Artificial Intelligence
MITS, Gwalior

Departmental Project Coordinator                         Approved by HoD

Dr. Tej Singh                                            Dr. Rajni Ranjan Singh
Assistant Professor                                      Prof. & Head
Centre for Artificial Intelligence                       Centre for Artificial Intelligence
MITS, Gwalior                                            MITS, Gwalior

i
ABSTRACT

This internship involves a structured, multi-level data analysis project aimed at exploring, analyzing, and modeling restaurant-related data. The project progresses through three levels, each encompassing distinct tasks that build upon the preceding stages.

In Level 1, the focus is on data exploration and preprocessing by assessing dataset dimensions, handling missing values, and performing data type conversions. The
distribution of the target variable, "Aggregate Rating," is analyzed to identify potential
class imbalances. Descriptive analysis follows, with statistical measures calculated for
numerical columns, and insights derived from categorical data, such as identifying the
most popular cuisines and cities. Geospatial analysis involves mapping restaurant
locations to explore correlations between geographic factors and ratings.

Level 2 emphasizes deeper analysis, including insights into table booking and online
delivery services, with comparisons of ratings across restaurants offering or lacking
these services. Price range analysis identifies the most common price categories and
explores the relationship between price range and restaurant ratings. Feature engineering
is introduced to extract additional insights by creating new features from existing data.

In Level 3, advanced tasks include predictive modeling to forecast aggregate ratings using regression models. Various algorithms are implemented and evaluated to
determine the best-performing model. Additionally, customer preferences are analyzed
to identify popular cuisines and their influence on ratings. The internship concludes with
data visualization, leveraging charts and plots to illustrate insights, relationships, and
patterns within the data.

This comprehensive internship equips participants with skills in data exploration, statistical analysis, geospatial analysis, feature engineering, predictive modeling, and
visualization, fostering a thorough understanding of restaurant data trends and behaviors.

ii
ACKNOWLEDGEMENT
The summer semester internship has proved to be pivotal to my career. I am thankful to my
institute, Madhav Institute of Technology and Science, for allowing me to continue my disciplinary/
interdisciplinary internship as a curriculum requirement, under the provisions of the Flexible
Curriculum Scheme (based on the AICTE Model Curriculum 2018), approved by the Academic
Council of the institute. I extend my gratitude to the Director of the institute, Dr. R. K. Pandit,
and the Dean Academics, Dr. Manjaree Pandit, for this.

I would sincerely like to thank my department, Centre for Artificial Intelligence, for allowing
me to explore this internship. I humbly thank Dr. Rajni Ranjan Singh Makwana, Professor
and Head, Centre for Artificial Intelligence, for his continued support during the course of
this engagement, which eased the process and formalities involved.

I am sincerely thankful to my faculty mentors. I am grateful for the guidance of Prof. Vibha
Tiwari, Assistant Professor, Centre for Artificial Intelligence, and for her continued support and
close mentoring throughout the internship. I am also very thankful to my industry mentor,
Mr. Ashish Namdev, EddyTools Tech Solution Pvt. Ltd., for his guidance and mentorship
during the internship period.

-----------------------------

Sanjeev Mishra

0901AD211049

iii
Certificate

iv
CONTENT

Table of Contents
Declaration by the Candidate......................................................................................................... i

Abstract..........................................................................................................................................ii

Acknowledgement........................................................................................................................iii

Certificate..................................................................................................................................... iv

Content...........................................................................................................................................v

Acronyms......................................................................................................................................vi

List of Figures...........................................................................................................................vii

Chapter 1: Introduction..................................................................................................................1

Chapter 2: Company Profile.......................................................................................................... 2

Chapter 3: Techniques/Methodology............................................................................................ 3

Chapter 4: Software Used..............................................................................................................5

Chapter 5: Projects.........................................................................................................................6

References....................................................................................................................................14

v
ACRONYMS

 DREAM: Data Exploration, Restaurant Analysis, Engineering, Aggregation, and Modeling – summarizing the multi-level project objectives and tasks.

 SERVE: Statistics, Exploration, Regression, Visualization, and Engineering – focusing on the core processes involved in analyzing restaurant-related data.

 TASTE: Target Analysis, Aggregate Ratings, Statistics, Trends, and Engineering – emphasizing the analysis of ratings, trends, and feature engineering.

 MEAL: Multi-level Exploration, Analysis, and Learning – representing the multi-stage structure and learning outcomes of the internship.

 PLATE: Preprocessing, Level-Based Analysis, Aggregation, Trends, and Evaluation – summarizing the progression through levels and the focus on trends and evaluation.

vi
LIST OF FIGURES

1. Fig No. 1: Figma............................................................................................................5


2. Fig No. 2: Project 1........................................................................................................7
3. Fig No. 3: Project 2........................................................................................................9
4. Fig No. 4: Project 3........................................................................................................11
5. Fig No. 5: Project 4........................................................................................................13

vii
CHAPTER 1: INTRODUCTION

1.1 About Internship & Projects


My one-month internship at Cognifyz Technologies has been a transformative experience, enriching
my knowledge and skills in the field of Data Science. This internship provided me with a platform to
apply theoretical knowledge to real-world projects and refine my analytical sensibilities under the
guidance of experienced professionals. Throughout this period, I contributed as a Data Science intern and
worked across different levels, each presenting unique challenges and learning opportunities.

Level 1: Data Exploration and Preprocessing

Tasks include dataset inspection, handling missing values, data type conversions, and analyzing the target
variable ("Aggregate Rating") for imbalances.

Descriptive statistics are calculated, with insights drawn from categorical variables like "Country Code," "City,"
and "Cuisines."

Level 2: Advanced Analysis and Feature Engineering

Focuses on table booking and online delivery, comparing ratings and availability across price ranges.

Price range analysis identifies common categories and explores their relationship with ratings, including
identifying standout attributes like color codes.

Feature engineering creates new insights by generating features such as name length or service availability.

Level 3: Predictive Modeling and Visualization

Predictive modeling involves building regression models to forecast aggregate ratings, testing various
algorithms, and evaluating their performance.

Customer preference analysis examines cuisine popularity and ratings trends to uncover actionable insights.

Data visualization represents patterns and relationships through charts, highlighting trends across cuisines, cities,
and other features.

This structured task list equips participants with skills in data exploration, statistical analysis, feature
engineering, machine learning, and visualization, fostering a deeper understanding of restaurant
industry data and trends.

3
CHAPTER 2: COMPANY PROFILE

Cognifyz Technologies is a forward-thinking technology company specializing in software solutions
for businesses. Their product suite spans cutting-edge areas like artificial intelligence (AI), machine
learning (ML), and data analytics, enabling businesses to stay competitive in a rapidly evolving
landscape.

Key offerings include:

AI-Powered Chatbot Platform

1. A versatile chatbot system that integrates seamlessly with communication channels such as
websites, social media, and messaging apps.
2. Automates customer support and engagement, reduces response times, and enhances customer
satisfaction.

ML-Based Solutions

1. Tools for predictive analytics, enabling real-time insights to optimize business strategies.
2. Fraud detection systems to safeguard transactions and minimize risks.
3. Recommendation engines to personalize customer experiences and boost engagement.

Cognifyz Technologies equips businesses with robust tools to harness the power of data, improve
decision-making, and enhance operational efficiency, ensuring they remain at the forefront of
technological innovation.

4
CHAPTER 3: TECHNIQUES/METHODOLOGY

Techniques and Methodology for the Restaurant Dataset Analysis Project


Level 1: Foundational Analysis

Task 1: Data Exploration and Preprocessing

Techniques:

1. Dataset Inspection:
   o Use Pandas to load and inspect the dataset: df.shape for dimensions, and df.info() to examine column types and data completeness.
2. Handling Missing Values:
   o Identify missing values using df.isnull().sum().
   o Apply appropriate strategies:
      - Fill missing numerical values with the mean/median.
      - Fill missing categorical values with the mode or “Unknown.”
      - Drop columns/rows if missing values exceed a threshold (e.g., 50%).
3. Data Type Conversion:
   o Convert columns to the appropriate type (e.g., datetime, categorical, numeric) using pd.to_datetime() or astype().
4. Target Variable Analysis:
   o Check the distribution of "Aggregate Rating" using histograms (Matplotlib/Seaborn) or .value_counts().
   o Identify class imbalances and consider techniques like SMOTE or oversampling for imbalanced datasets.

Methodology:

 Use a step-by-step approach to clean and preprocess data (a short sketch follows below).
 Document every change in a Jupyter Notebook for reproducibility.
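
A minimal Pandas sketch of these preprocessing steps is given below. The file name Dataset.csv and the column names ("Cuisines", "Has Table booking", "Aggregate Rating") are assumptions and should be matched to the actual dataset.

import pandas as pd

# Load the dataset (the file name is a placeholder assumption)
df = pd.read_csv("Dataset.csv")

# 1. Dataset inspection
print(df.shape)   # (rows, columns)
df.info()         # column types and non-null counts

# 2. Handling missing values
print(df.isnull().sum())                           # missing values per column
df["Cuisines"] = df["Cuisines"].fillna("Unknown")  # categorical: fill with "Unknown"

# 3. Data type conversion (column name assumed)
df["Has Table booking"] = df["Has Table booking"].astype("category")

# 4. Target variable distribution and possible class imbalance
print(df["Aggregate Rating"].value_counts().sort_index())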

6
Task 2: Descriptive Analysis

Techniques:

1. Statistical Measures for Numerical Columns:

o Use .describe() to compute mean, median, standard deviation, and more.


o Additional measures like skewness and kurtosis can be calculated using scipy.stats.

2. Categorical Variable Exploration:

o Analyze unique values and their distributions using .value_counts() or bar plots.
o Investigate categorical columns such as “Country Code,” “City,” and “Cuisines.”

3. Identifying Top Categories:

o Use group-by operations (df.groupby()) to rank cuisines or cities by the number of restaurants.

Methodology:

 Focus on both numerical and categorical columns to derive insights.
 Visualize patterns with bar plots, pie charts, or count plots to communicate findings effectively (a brief sketch follows below).
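
A brief sketch of the descriptive analysis, under the same assumptions about the file and column names, could look like this:

import pandas as pd
from scipy import stats

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Statistical summary of numerical columns
print(df.describe())

# Skewness and kurtosis of the target variable (column name assumed)
print(stats.skew(df["Aggregate Rating"].dropna()))
print(stats.kurtosis(df["Aggregate Rating"].dropna()))

# Most common cuisines and cities
print(df["Cuisines"].value_counts().head(10))
print(df["City"].value_counts().head(10))

# Cities ranked by number of restaurants using a group-by
print(df.groupby("City").size().sort_values(ascending=False).head(10))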

Task 3: Geospatial Analysis

Techniques:

1. Mapping Locations:

o Use Folium or Plotly to create interactive maps based on latitude and longitude.
o Overlay restaurant density in different regions.

2. Analyzing Spatial Distributions:

o Use heatmaps to analyze clustering patterns (e.g., using Seaborn or geopandas).


o Group data by city or country and examine rating distributions.

3. Correlation Analysis:

o Correlate geographical location with restaurant ratings using scatter plots or correlation matrices
(Pandas/Seaborn).

Methodology:

 Use a combination of geospatial tools and visualization techniques to extract location-based insights.
 Present geospatial data in an interactive and understandable format for better interpretation.
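
An illustrative sketch of this geospatial workflow is shown below, assuming the dataset exposes "Latitude", "Longitude", and "Aggregate Rating" columns under those names:

import pandas as pd
import folium
from folium.plugins import HeatMap

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Centre an interactive map on the mean coordinates
m = folium.Map(location=[df["Latitude"].mean(), df["Longitude"].mean()], zoom_start=2)

# Overlay restaurant density as a heatmap and save the map to an HTML file
HeatMap(df[["Latitude", "Longitude"]].dropna().values.tolist()).add_to(m)
m.save("restaurant_map.html")

# Simple correlation between geographic location and ratings
print(df[["Latitude", "Longitude", "Aggregate Rating"]].corr())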

7
CHAPTER 4: SOFTWARE USED

This project utilizes a combination of software tools and libraries to handle tasks ranging from data
preprocessing to advanced predictive modeling and visualization.

Data Exploration and Preprocessing

1. Python with Pandas, NumPy, and Jupyter Notebook for cleaning, manipulation, and
interactive exploration.

Descriptive Analysis

1. Python with Pandas, NumPy, Matplotlib, and Seaborn for statistical computations and
visualizations.
2. Optional use of R for additional descriptive analysis.

Geospatial Analysis

1. Python with Folium, Plotly, and Geopandas for mapping and geospatial data visualization.
2. QGIS (Optional) for advanced geospatial analysis.

Advanced Analysis and Feature Engineering

1. Python with Pandas for feature creation and Scikit-learn for transformations and statistical
evaluations.

Predictive Modeling

1. Python with Scikit-learn, XGBoost, LightGBM, and Statsmodels for building and comparing
regression models.

Customer Preference Analysis and Visualization

1. Python with Matplotlib, Seaborn, and Plotly for visualizations.


2. Optional tools: Tableau, Power BI, and Excel for enhanced interactive or quick visualizations.

These tools ensure a comprehensive approach to data analysis, from foundational tasks to advanced
insights and visual representations.

8
CHAPTER 5: PROJECT

Level 1: Data Exploration and Preprocessing

To approach this project, we begin by exploring and preprocessing the dataset, which is the first task in
the process. We start by loading the dataset into a working environment, such as a Jupyter notebook,
using a tool like Python’s Pandas library. This allows us to quickly assess the number of rows and
columns in the dataset. The next step involves checking each column for missing values or any
inconsistencies. Missing values can appear in various forms such as NaN (Not a Number), null, or
blank entries, and handling them is crucial for the integrity of our analysis. These can be filled using
various strategies like replacing them with the mean or median of the column, or in some cases,
removing rows with too many missing values. After addressing missing data, we also perform data
type conversion, ensuring that each column's data type is appropriate for analysis (e.g., ensuring
numerical data is in numeric format and categorical data is properly encoded). Additionally, we
analyze the distribution of the target variable, "Aggregate Rating." This involves checking whether the
ratings are distributed evenly or if there are imbalances, where some ratings may be overrepresented or
underrepresented. This step is crucial as imbalances could affect any predictive modeling we perform
later.
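
A minimal sketch of these checks, assuming the dataset file is Dataset.csv and that columns such as "Average Cost for two", "Cuisines", and "Aggregate Rating" exist under those names, is shown below:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Fill missing numerical values with the median and categorical values with "Unknown"
# (column names here are assumptions about the dataset)
df["Average Cost for two"] = df["Average Cost for two"].fillna(df["Average Cost for two"].median())
df["Cuisines"] = df["Cuisines"].fillna("Unknown")

# Visualize the distribution of the target variable
df["Aggregate Rating"].plot(kind="hist", bins=20, title="Distribution of Aggregate Rating")
plt.xlabel("Aggregate Rating")
plt.show()

# Count restaurants per rating value to spot over- or under-represented ratings
print(df["Aggregate Rating"].value_counts().sort_index())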

9
Descriptive Analysis

The second task involves performing Descriptive Analysis to get a deeper understanding of the dataset.
We begin by calculating basic statistical measures such as the mean, median, and standard deviation
for numerical columns. These measures help summarize the central tendency and spread of the data,
giving us insights into variables like restaurant prices or ratings. Next, we explore the categorical
variables like "Country Code," "City," and "Cuisines." This step is focused on identifying patterns and
frequencies within these categories. For example, we might find that certain countries or cities have a
higher concentration of restaurants, or that some cuisines are more popular than others. This
information can be helpful for further analysis or for identifying trends that may be important for
business decisions.

10
Geospatial Analysis

The third task is focused on Geospatial Analysis, where we work with the geographic locations of
restaurants. Using the latitude and longitude data available in the dataset, we visualize the locations of
restaurants on a map. This is done using libraries like Folium or Plotly, which allow us to create
interactive maps. These maps give us a clear view of how restaurants are distributed geographically.
We can then analyze if there are any clusters of restaurants in specific cities or countries, and whether
the location of the restaurant might correlate with its rating. For example, we may find that restaurants
in city centers have higher ratings than those in more rural locations. By visually representing this data,
we can uncover hidden patterns that may not be immediately obvious from the raw dataset.

11
LEVEL-2

Table Booking and Online Delivery Analysis

In Task 1, we focus on analyzing the availability of table booking and online delivery services across
the dataset. The first step is to determine the percentage of restaurants that offer these services. This
can be done by calculating the proportion of restaurants with table booking and online delivery options
compared to the total number of restaurants. Next, we compare the average ratings between restaurants
that offer table booking and those that do not. This will help us understand if there's any significant
difference in customer satisfaction between these two categories. Similarly, we analyze the availability
of online delivery among restaurants with different price ranges. By grouping restaurants based on
their price range, we can identify if higher or lower-priced establishments are more likely to offer
online delivery services. This analysis can provide valuable insights into customer preferences and
business strategies.
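
A short sketch of this comparison is given below; it assumes Yes/No columns named "Has Table booking" and "Has Online delivery" and a numeric "Price range" column:

import pandas as pd

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Percentage of restaurants offering each service (column names are assumptions)
print(round(df["Has Table booking"].eq("Yes").mean() * 100, 2), "% offer table booking")
print(round(df["Has Online delivery"].eq("Yes").mean() * 100, 2), "% offer online delivery")

# Average rating with and without table booking
print(df.groupby("Has Table booking")["Aggregate Rating"].mean())

# Share of restaurants offering online delivery within each price range
print(df.groupby("Price range")["Has Online delivery"].apply(lambda s: round((s == "Yes").mean() * 100, 2)))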

12
Price Range Analysis

In Task 2, we dive into Price Range Analysis, which helps us understand how pricing correlates with
restaurant ratings. We begin by identifying the most common price range across all the restaurants in
the dataset. This can be done by calculating the frequency of each price range and determining which
one is most prevalent. After this, we calculate the average rating for each price range. This step will
provide insights into how restaurant prices might influence customer ratings. We can further visualize
these results by using color coding to identify which price range corresponds to the highest average
rating. This can help restaurants determine if adjusting their pricing strategy could potentially lead to
better customer satisfaction.
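
A minimal sketch of the price range analysis, under the same assumptions about column names, follows; the colour coding simply highlights the price range with the best average rating:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Most common price range ("Price range" is an assumed column name)
print(df["Price range"].value_counts())

# Average rating for each price range
avg_rating = df.groupby("Price range")["Aggregate Rating"].mean().sort_index()
print(avg_rating)

# Colour-coded bar chart: green marks the price range with the highest average rating
colors = ["green" if r == avg_rating.max() else "grey" for r in avg_rating]
avg_rating.plot(kind="bar", color=colors, title="Average Rating by Price Range")
plt.ylabel("Average Aggregate Rating")
plt.show()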

13
Feature Engineering

In Task 3, we focus on Feature Engineering, where we create new features that could enhance the
predictive power of our models. One common approach is to extract additional features from existing
columns. For example, we can calculate the length of the restaurant name or address, as this could
provide useful information about the restaurant’s branding or location. Additionally, we can create new
binary features such as “Has Table Booking” and “Has Online Delivery” by encoding the relevant
categorical variables. This allows us to easily quantify whether a restaurant offers these services or not,
which can be a significant predictor in later analysis or predictive modeling. These new features will
help us enrich the dataset and uncover additional patterns in the data.
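
The sketch below illustrates these engineered features, again treating the column names ("Restaurant Name", "Address", "Has Table booking", "Has Online delivery") as assumptions:

import pandas as pd

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Length-based features derived from existing text columns
df["Restaurant Name Length"] = df["Restaurant Name"].str.len()
df["Address Length"] = df["Address"].str.len()

# Binary encodings of service availability (1 = offered, 0 = not offered)
df["Has Table Booking (binary)"] = (df["Has Table booking"] == "Yes").astype(int)
df["Has Online Delivery (binary)"] = (df["Has Online delivery"] == "Yes").astype(int)

print(df[["Restaurant Name Length", "Address Length",
          "Has Table Booking (binary)", "Has Online Delivery (binary)"]].head())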

14
LEVEL-3

Predictive Modeling

In Task 1, the goal is to build a regression model to predict the aggregate rating of a restaurant based
on available features. This task involves using machine learning techniques to understand how various
factors influence restaurant ratings and creating a model that can make predictions based on these
features. The first step is to split the dataset into two subsets: a training set and a testing set. The
training set is used to train the model, while the testing set helps evaluate the model’s performance on
unseen data, which ensures that the model generalizes well to new data. Once the data is split, we
proceed by experimenting with different algorithms, such as linear regression, decision trees, and
random forest, to determine which one best predicts the target variable (aggregate rating). After
training the model, we evaluate its performance using appropriate metrics, such as mean squared
error (MSE) or R-squared. This comparison helps identify the model that performs the best,
providing a foundation for future improvements and predictions.
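
A compact sketch of this modeling workflow with scikit-learn is shown below. The feature set ("Price range", "Votes", and the binary service flags from Level 2) is an assumption and would be adjusted to the columns actually available:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Recreate the binary service flags (column names are assumptions)
df["Has Table Booking (binary)"] = (df["Has Table booking"] == "Yes").astype(int)
df["Has Online Delivery (binary)"] = (df["Has Online delivery"] == "Yes").astype(int)

features = ["Price range", "Votes", "Has Table Booking (binary)", "Has Online Delivery (binary)"]
data = df.dropna(subset=features + ["Aggregate Rating"])
X, y = data[features], data["Aggregate Rating"]

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

# Train each model and compare performance on the unseen test set
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "MSE:", mean_squared_error(y_test, preds), "R2:", r2_score(y_test, preds))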

16
Customer Preference Analysis

In Task 2, we focus on understanding customer preferences by analyzing the relationship between


cuisine type and restaurant ratings. This task involves examining whether certain types of cuisines
are more highly rated than others. By analyzing the dataset, we can identify which cuisines tend to
receive higher aggregate ratings from customers, as well as which cuisines are more popular based on
the number of votes or reviews. For example, we might find that Italian or Indian cuisines are more
commonly rated highly compared to others. This analysis helps identify trends in customer behavior
and preferences, which can be useful for restaurant owners when determining menu offerings or
marketing strategies. Additionally, understanding the relationship between cuisine type and ratings can
guide restaurants in adjusting their offerings to meet customer preferences and boost satisfaction.
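
A short sketch of this analysis is given below; it assumes a "Cuisines" column that may list several comma-separated cuisines per restaurant and a "Votes" column as a popularity proxy:

import pandas as pd

df = pd.read_csv("Dataset.csv")   # placeholder file name

# Split multi-cuisine entries such as "North Indian, Chinese" into one row per cuisine
cuisine_df = df.assign(Cuisine=df["Cuisines"].str.split(", ")).explode("Cuisine")

# Average rating, total votes, and restaurant count per cuisine
summary = (cuisine_df.groupby("Cuisine")
           .agg(avg_rating=("Aggregate Rating", "mean"),
                total_votes=("Votes", "sum"),
                restaurants=("Cuisine", "size")))

print(summary.sort_values("avg_rating", ascending=False).head(10))   # highest-rated cuisines
print(summary.sort_values("total_votes", ascending=False).head(10))  # most popular by votes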

17
Data Visualization

Task 3 focuses on data visualization, which plays a crucial role in understanding patterns and
communicating insights. In this task, we create visualizations to represent the distribution of ratings
across the dataset. This can be done using various types of charts, such as histograms or bar plots, to
show how ratings are distributed across different restaurants. Visualizing the distribution of ratings
helps us understand the overall trend and identify any skewness in the data. We also use visualizations
to compare the average ratings of different cuisines or cities, which can help highlight regions or
cuisine types that consistently receive higher ratings. Lastly, we use visualizations to uncover
relationships between different features (e.g., price range, table booking, or delivery options) and the
target variable (aggregate rating). For instance, we can create scatter plots to see if higher price
ranges correlate with higher ratings. These visualizations help draw insights from the data, enabling
better decision-making and providing clarity on key factors that influence customer ratings.
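
The sketch below combines these three views (rating distribution, ratings by city, and price range versus rating) using Matplotlib and Seaborn, with the usual assumptions about column names:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Dataset.csv")   # placeholder file name

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Distribution of aggregate ratings
sns.histplot(df["Aggregate Rating"], bins=20, ax=axes[0])
axes[0].set_title("Distribution of Aggregate Rating")

# Average rating for the ten cities with the most restaurants
top_cities = df["City"].value_counts().head(10).index
city_avg = df[df["City"].isin(top_cities)].groupby("City")["Aggregate Rating"].mean()
sns.barplot(x=city_avg.values, y=city_avg.index, ax=axes[1])
axes[1].set_title("Average Rating by City (top 10)")

# Relationship between price range and rating
sns.scatterplot(data=df, x="Price range", y="Aggregate Rating", ax=axes[2])
axes[2].set_title("Price Range vs Aggregate Rating")

plt.tight_layout()
plt.show()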

18
REFERENCES

[1] YouTube: https://www.youtube.com/

[2] Course: https://learnuiux.in
