FINAL INTERN DOCUMENT Dhanunjai
MACHINE LEARNING, AI
A report submitted to the department of
Computer Science And Engineering in partial fulfillment of the
requirements of the award of the Degree
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
April 2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
External Examiner
ORGANIZATION CERTIFICATE
DECLARATION
I certify that
a. The internship contained in the report is original and has been done by me under the
guidance of my supervisor.
b. The work has not been submitted to any other University for the award of any degree or
diploma.
c. The guidelines of the college are followed in writing the internship report.
Date:
ACKNOWLEDGEMENT
INDEX
1 Learning Objectives/Internship Objectives
2 Organization Profile
3 Introduction
4 Software Requirement Specifications
5 About Technologies
5.1 Python for Data Science
5.2 SQL
5.3 Statistics for Data Science
5.4 Machine Learning
6 Screenshots
7 Quiz Questions
8 Internship Registration Proofs
9 Conclusion
10 List of References
8th Week Schedule

DATE         DAY         NAME OF THE TOPIC
01/04/2024   MONDAY      PROJECT INTRODUCTION
02/04/2024   TUESDAY     OVERVIEW OF PROJECT
03/04/2024   WEDNESDAY   DESIGN AND ANALYSIS
04/04/2024   THURSDAY    PREPROCESSING DATA
05/04/2024   FRIDAY      APPLYING ALGORITHMS
1. LEARNING OBJECTIVES/INTERNSHIP OBJECTIVES
Internships are often thought of as being reserved for college students looking to gain experience in a particular field, but a wide range of people can benefit from training internships to gain real-world experience and develop their skills.
An objective for this position should emphasize the skills you already possess in the area and your interest in learning more.
Internships are used in a number of different career fields, including architecture, engineering, healthcare, economics, advertising, and many more.
Some internships allow individuals to perform scientific research, while others are specifically designed to give people first-hand working experience.
Internships are a great way to build your resume and develop skills that can be emphasized in your resume for future jobs. When you are applying for a training internship, make sure to highlight any special skills or talents that can set you apart from the rest of the applicants, so that you have a better chance of landing the position.
2. ORGANIZATION PROFILE
Organization Information:
Datavalley.ai is a leading provider of top-notch training and consulting services in the cutting-
edge fields of Big Data, Data Engineering, Data Architecture, DevOps, Data Science, Machine
Learning, IoT, and Cloud Technologies.
Training:
Datavalley's training programs, led by industry experts, are tailored to equip professionals and
organizations with the essential skills and knowledge needed to thrive in the rapidly evolving data
landscape. We believe in continuous learning and growth, and our commitment to staying on top of
emerging trends and technologies ensures that our clients receive the most cutting-edge training
possible.
3. INTRODUCTION
Data science is the study of data. Just as the biological sciences study living organisms and the physical sciences study physical phenomena, data science studies data. Data is real, data has real properties, and we need to study them if we are going to work with it. Data science is a process, not an event: it is the process of using data to understand many different things, to understand the world.
For example, when you have a model or proposed explanation of a problem, you try to validate that model against your data. Data science is the skill of uncovering the insights and trends that are hidden behind data. It is when you translate data into a story and use that storytelling to generate insight, and with these insights you can make strategic choices for a company or an institution.
Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and probability to
forecast or estimate more granular, specific outcomes.
For example, predictive modeling could help identify customers who are likely to purchase
our new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) in which computers learn to act on and adapt to new data without being explicitly programmed to do so; the computer is able to act independently of human interaction.
Forecasting:
Forecasting is the process of predicting or estimating future events based on past and present data, most commonly by analysis of trends. Guessing does not cut it: a forecast, unlike a prediction, must have logic behind it and must be defensible. This logic is what differentiates it from a magic 8-ball's lucky guess; after all, even a broken clock is right twice a day.
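As a small illustration of trend-based forecasting (a minimal sketch only; the sales figures below are invented and a simple linear trend is just one possible approach), a straight line can be fitted to past values with NumPy and extended one step into the future:

import numpy as np

sales = np.array([100, 104, 110, 115, 121, 128])  # hypothetical past monthly sales
months = np.arange(len(sales))

# Fit a degree-1 polynomial (a linear trend) to the historical values.
slope, intercept = np.polyfit(months, sales, 1)

# Forecast the next month by extending the fitted trend.
next_month = len(sales)
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month}: {forecast:.1f}")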
4. SOFTWARE REQUIREMENT SPECIFICATIONS
For data science, you typically need a combination of software tools to perform various tasks
such as data manipulation, analysis, visualization, and machine learning. Here's a list of essential
software requirements for data science:
1. Programming Languages:
• Python: It's the most widely used language in data science due to its extensive libraries for
data manipulation (e.g., Pandas), visualization (e.g., Matplotlib, Seaborn), and machine
learning (e.g., Scikit-learn, TensorFlow, PyTorch).
• R: Another popular language for statistical analysis and visualization, particularly in
academia.
2. Integrated Development Environments (IDEs):
• Jupyter Notebook: A web-based interactive computing environment that allows you to create
and share documents containing live code, equations, visualizations, and narrative text.
• Spyder: A powerful IDE for Python that provides a MATLAB-like interface for data analysis.
3. Data Manipulation and Analysis:
• Pandas: A Python library for data manipulation and analysis.
• NumPy: Provides support for large, multi-dimensional arrays and matrices, along with a
collection of mathematical functions to operate on these arrays.
• R Studio: An integrated development environment (IDE) for R that makes data analysis
easier with its intuitive interface.
4. Data Visualization:
• Matplotlib: A plotting library for Python that provides a MATLAB-like interface for creating
static, interactive, and animated visualizations.
• Seaborn: A Python visualization library based on Matplotlib that provides a high-level
interface for drawing attractive statistical graphics.
• ggplot2: A plotting system for R, based on the grammar of graphics, which provides a highly
customizable approach to data visualization.
5. Machine Learning:
• Scikit-learn: A simple and efficient tool for data mining and data analysis, built on NumPy,
SciPy, and Matplotlib.
• TensorFlow / Keras: TensorFlow is an open-source machine learning library developed by
Google. Keras is a high-level neural networks API, which can run on top of TensorFlow.
6. Deep Learning (Optional):
• TensorFlow / Keras: Widely used for deep learning tasks due to its flexibility and performance.
• PyTorch: Another popular choice for deep learning, known for its dynamic computational graph
and ease of use.
7. Databases (Optional):
• NoSQL Databases: Depending on the project requirements, familiarity with NoSQL databases like
MongoDB or Cassandra might be necessary.
8. Version Control:
• Git: Essential for tracking changes in code and collaborating with other team members.
Platforms like GitHub, GitLab, or Bitbucket are commonly used for hosting Git repositories.
9. Text Editors:
• VS Code: A lightweight and powerful source code editor that comes with built-in support for
Python and many other languages.
• Atom, Sublime Text, etc.: Other popular text editors with extensive support for various
programming languages.
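As a minimal sketch of how several of these tools work together (the study-hours data below is invented, and the choice of a linear regression model is only illustrative), Pandas holds the data, scikit-learn fits a model, and Matplotlib plots the result:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied versus exam score.
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 58, 65, 70, 78]})

model = LinearRegression()
model.fit(df[["hours"]], df["score"])       # train on the labeled data
predicted = model.predict(df[["hours"]])    # predictions for plotting

plt.scatter(df["hours"], df["score"], label="observed")
plt.plot(df["hours"], predicted, label="fitted line")
plt.xlabel("hours studied")
plt.ylabel("score")
plt.legend()
plt.show()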
5. TECHNOLOGIES
5.1 Python for Data Science
PANDAS:
When it comes to data manipulation and analysis, nothing beats Pandas: it is the most popular Python library for these tasks. Pandas is written in Python specifically for data manipulation and analysis.
Pandas provides features like:
• Dataset joining and merging
• Data Structure column deletion and insertion
• Data filtration
• Reshaping datasets
• DataFrame objects to manipulate data, and much more!
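A minimal sketch of these Pandas features, using an invented employee/department example (the table and column names are hypothetical):

import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [10, 10, 20]})
departments = pd.DataFrame({"dept_id": [10, 20], "dept": ["Sales", "IT"]})

merged = employees.merge(departments, on="dept_id")   # joining/merging
merged["active"] = True                               # column insertion
merged = merged.drop(columns=["dept_id"])             # column deletion
sales_only = merged[merged["dept"] == "Sales"]        # data filtration
counts = merged.pivot_table(index="dept",             # reshaping
                            values="emp_id",
                            aggfunc="count")
print(counts)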
NUMPY :
NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in functions to support
large multi-dimensional arrays and matrices. It also brings in high-level mathematical functions to
work with these arrays and matrices. NumPy is an open-source library and has multiple contributors.
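A minimal sketch of NumPy arrays and a few of its mathematical functions (the values are illustrative only):

import numpy as np

matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2-D array (matrix)

print(matrix.T)              # transpose
print(matrix @ matrix)       # matrix multiplication
print(np.mean(matrix))       # mean of all elements
print(np.sqrt(matrix))       # element-wise square root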
MATPLOTLIB :
Matplotlib is the most popular data visualization library in Python. It allows us to generate and build
plots of all kinds. This is my go-to library for exploring data visually along with Seaborn.
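A minimal Matplotlib sketch (the plotted series is invented) showing a basic line plot:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 6, 3]

plt.plot(x, y, marker="o")   # simple line plot with point markers
plt.title("Sample trend")
plt.xlabel("x")
plt.ylabel("y")
plt.show()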
5.2 SQL Databases:
SQL (Structured Query Language) databases are a cornerstone of modern data storage, retrieval, and
management systems. They are designed to efficiently handle structured data and have been the go-to
solution for relational data management for decades. The key points are as follows:
SQL databases use a structured approach to organize data into tables, which consist of rows and
columns. This tabular structure allows for easy querying, indexing, and relationships among data.
The relational model, proposed by Edgar F. Codd in 1970, underpins the design of SQL databases,
emphasizing the use of keys and relationships to maintain data integrity.
Relational Structure: Data is stored in tables, with each table representing a specific entity.
Relationships among tables are established using primary keys (unique identifiers for each row) and
foreign keys (references to primary keys in other tables).
SQL Language: SQL is the standard language for interacting with relational databases. It provides
commands for querying, updating, and managing data, such as SELECT, INSERT, UPDATE, and
DELETE.
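A minimal sketch of these SQL commands, run from Python against an in-memory SQLite database (the customers table and its rows are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Asha"))
cur.execute("UPDATE customers SET name = ? WHERE id = ?", ("Asha K", 1))

cur.execute("SELECT id, name FROM customers")
print(cur.fetchall())        # [(1, 'Asha K')]

cur.execute("DELETE FROM customers WHERE id = ?", (1,))
conn.commit()
conn.close()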
Data Integrity and Constraints: SQL databases enforce data integrity through constraints like
primary keys, foreign keys, unique constraints, and check constraints. These rules ensure that the
data remains consistent and reliable.
ACID Properties: SQL databases adhere to ACID properties—Atomicity, Consistency, Isolation, and
Durability. These properties ensure that transactions are processed reliably, even in the event of system
failures or concurrent access.
Normalization: SQL databases use normalization to minimize data redundancy and avoid data
anomalies. This process involves decomposing complex tables into simpler ones to maintain
consistency and reduce duplication.
Several SQL database systems are widely used across different industries. Some of the most popular
ones include:
MySQL: An open-source SQL database known for its speed, reliability, and ease of use. It's
commonly used for web applications and small to medium-sized businesses.
5.3 Statistics for Data Science:
Statistics simply means numerical data, and it is the field of mathematics that deals with the collection, tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses different quantitative models to produce a set of experimental data or studies of real life. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics deals with how data can be used to solve complex problems. Some people consider statistics to be a distinct mathematical science rather than a branch of mathematics. Statistics makes work easy and simple and provides a clear and clean picture of the work you do on a regular basis.
Basic terminology of Statistics:
• Population – It is actually a collection of set of individuals or objects or events whose
properties are to be analyzed.
• Sample – It is the subset of a population.
(i) Mean:
It is a measure of the average of all values in a sample set.
(ii) Median:
It is a measure of the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is taken.
(iii) Mode:
It is the value that appears most frequently in the sample set; the value repeated the most times in the data set is the mode.
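A minimal sketch computing these three measures with Python's standard statistics module (the sample values are invented):

import statistics

sample = [2, 3, 3, 5, 7, 10]

print(statistics.mean(sample))    # 5.0
print(statistics.median(sample))  # 4.0  (average of the two middle values)
print(statistics.mode(sample))    # 3    (most frequent value)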
Understanding the spread of data:
A measure of variability, also known as a measure of dispersion, is used to describe the variability in a sample or population. In statistics, there are three common measures of variability, as shown below:
(i) Range:
It is a measure of how spread apart the values in a data set are.
Range = Maximum value - Minimum value
(ii) Variance:
It describes how much a random variable differs from its expected value, and it is computed as the average of the squared deviations from the mean.
(iii) Standard deviation:
It is a measure of the dispersion of a set of data from its mean, computed as the square root of the variance.
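A minimal sketch computing these measures of variability with the statistics module (again on an invented sample, using the population formulas):

import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(sample) - min(sample)      # range
variance = statistics.pvariance(sample)     # population variance
std_dev = statistics.pstdev(sample)         # standard deviation

print(data_range, variance, std_dev)        # 7 4 2.0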
5.4 Machine Learning
Predictive analytics involves performing certain manipulations on data from existing data sets with the goal of identifying new trends and patterns. These trends and patterns are then used to predict future outcomes and trends. By performing predictive analysis, we can predict future trends and performance. It is also known as prognostic analysis; the word prognostic means prediction. Predictive analytics uses data, statistical algorithms, and machine learning techniques to identify the probability of future outcomes based on historical data.
• Supervised learning:
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning some of the data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyzes the training data (the set of training examples) and produces a correct outcome from the labeled data.
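A minimal sketch of supervised learning with scikit-learn (the built-in Iris dataset and the decision-tree classifier are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # features and their correct labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)                    # learn from the labeled examples

predictions = clf.predict(X_test)            # predict labels for unseen data
print(accuracy_score(y_test, predictions))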
• Unsupervised learning:
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
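A minimal sketch of unsupervised learning with scikit-learn's k-means (the points and the choice of two clusters are illustrative):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1, 2], [2, 1],      # one loose group
                   [8, 8], [8, 9], [9, 8]])     # another loose group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two discovered group centres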
Stages of Predictive Models:
Steps to perform predictive analysis:
2. Data Collection:
Data collection involves gathering the necessary details required for the analysis.
3. Data Cleaning:
Data cleaning is the process in which we refine our data sets. In the process of data cleaning, we remove unnecessary and erroneous data, including redundant and duplicate entries, from our data sets.
4. Data Analysis:
This stage involves the exploration of the data. We explore the data and analyze it thoroughly in order to identify patterns or new outcomes from the data set. In this stage, we discover useful information and draw conclusions by identifying patterns or trends.
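A minimal sketch of the cleaning step with Pandas (the messy data frame below is invented): drop duplicate rows, remove an unneeded column, and discard rows with missing values before analysis.

import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "customer": ["A", "A", "B", "C"],
    "amount": [100, 100, np.nan, 250],
    "temp_note": ["x", "x", "y", "z"],
})

clean = (raw.drop_duplicates()                 # remove duplicate rows
            .drop(columns=["temp_note"])       # drop a redundant column
            .dropna(subset=["amount"]))        # remove rows with missing values
print(clean)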
6. SCREENSHOTS
7. QUIZ QUESTIONS
8. INTERNSHIP REGISTRATION PROOF
9. CONCLUSION
In summary, the convergence of data science, machine learning, and artificial intelligence
represents a transformative force in our digital landscape. This interdisciplinary fusion empowers
us to extract valuable insights from vast datasets, automate processes, and create intelligent
systems capable of learning and adapting. By leveraging advanced algorithms, statistical
techniques, and computational power, practitioners in these fields can tackle complex problems,
drive innovation, and enhance decision-making across diverse domains. As we continue to
advance, interdisciplinary collaboration and ongoing research will further propel the capabilities
of data science, machine learning, and AI, ushering in a new era of technological sophistication
and societal impact.
10. LIST OF REFERENCES
Schapire, R.E. (2003). The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pp. 149–172. Springer.
Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651–1686.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Ragavan, H. and Rendell, L.A. (1993). Lookahead feature construction for learning hard concepts. In Proceedings of the Tenth International Conference on Machine Learning (ICML 1993), pp. 252–259. Morgan Kaufmann.
Rajnarayan, D.G. and Wolpert, D. (2010). Bias-variance trade-offs: Novel applications. In C. Sammut and G.I. Webb (eds.), Encyclopedia of Machine Learning, pp. 101–110. Springer.