01 - Introduction To Data Science
• Data Science is the discipline that draws on computer science, statistics,
machine learning, visualization and human-computer interaction to collect,
clean, integrate, analyze, visualize and interact with data in order to create
data products.
Dr. Iyad H. Alshami – ICTS 6345 4
Data Sources
• Data is everywhere and still growing.
• The amount of digital data that exists is growing at a rapid rate, doubling
every two years, and changing the way we live.
• Data is growing faster than ever before: it was estimated that by 2020 about
1.7 megabytes of new information would be created every second for every
human being on the planet.
https://web-assets.domo.com/blog/wp-content/uploads/2021/09/data-never-sleeps-9.0-1200px-1.png
http://www.tribune242.com/news/2020/aug/20/activtrades-data-new-oil/
Business Purposes
• A data product embeds an algorithm in a web site so that users can input
values and get predictions.
• Data Science employs techniques and theories drawn from many fields
within the broad areas of computer science, mathematics, statistics, data
engineering, visualization, predictive analytics, uncertainty modelling, data
warehousing and high-performance computing.
• Before we even begin doing anything with “Data Science”, we must first take
into consideration what problem we’re trying to solve.
• Data scientists keep asking what and why, to ensure that every decision made
in the company is supported by concrete data and has a high probability of
achieving results.
• As a rule of thumb, there are some things we must take into consideration
when obtaining data.
• We must identify all the available datasets that relate to the project,
whether they come from the internet or from external/internal databases.
• This phase requires the most time and effort, because the output of a
machine learning model is only as good as what you put into it.
In short: garbage in, garbage out.
• For instance, the data could also have inconsistencies within the same
column, meaning that some rows could be labelled 0 or 1, and others could
be labelled no or yes.
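One way such a column could be made consistent is sketched below in Python. The sample values and the CANONICAL mapping are hypothetical illustrations, not data from the course; a real project would build the mapping from the spellings actually found in the column.

```python
# Hypothetical raw column mixing two encodings of the same yes/no answer.
raw_labels = ["0", "1", "no", "yes", "Yes", " NO ", "1"]

# Assumed mapping from every known spelling to one canonical value.
CANONICAL = {"0": 0, "no": 0, "1": 1, "yes": 1}

def normalize_label(value):
    """Map a raw label to 0 or 1; raise if the spelling is unknown."""
    key = str(value).strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unexpected label: {value!r}")
    return CANONICAL[key]

clean_labels = [normalize_label(v) for v in raw_labels]
print(clean_labels)
```

Raising an error on unknown spellings, rather than guessing, keeps new inconsistencies visible instead of silently passing them on to the model.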
• Data is transformed or consolidated into a form appropriate for the given
data mining method.
[Figure: a table with columns A1, A2, A3, A4, C reduced to columns A2, A4, C]
• For example, if you have a feature for age, but your model only cares about if
a person is an adult or minor, you could threshold it at 18, and assign
different categories to instances above and below that threshold.
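The age example can be sketched in a few lines of Python. The names ADULT_AGE and age_category and the sample ages are hypothetical, and treating exactly 18 as an adult is an assumption, since the text only speaks of values above and below the threshold.

```python
ADULT_AGE = 18  # threshold taken from the example above

def age_category(age):
    """Return 'adult' for ages at or above the threshold, else 'minor'."""
    return "adult" if age >= ADULT_AGE else "minor"

ages = [5, 17, 18, 42]  # hypothetical sample values
categories = [age_category(a) for a in ages]
print(categories)
```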
• For example, if you were predicting student scores and had features for the
number of hours of sleep on each night, you might want to create a feature
that denoted the average sleep that the student had instead.
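A minimal sketch of deriving such a feature, assuming each student record holds a list of nightly sleep hours. The record layout, the field names and the values are hypothetical.

```python
from statistics import mean

# Hypothetical records: hours of sleep per night for each student.
students = [
    {"name": "A", "sleep_hours": [7, 8, 6, 7]},
    {"name": "B", "sleep_hours": [5, 4, 6, 5]},
]

def add_average_sleep(record):
    """Return a copy of the record with one derived feature, avg_sleep."""
    derived = dict(record)
    derived["avg_sleep"] = mean(record["sleep_hours"])
    return derived

features = [add_average_sleep(s) for s in students]
for row in features:
    print(row["name"], row["avg_sleep"])
```

The single avg_sleep value can then replace the per-night columns as model input, which reduces the number of features without discarding the signal the example cares about.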
• A data scientist has access to many algorithms and uses them to
accomplish different business goals.
• The input data for predictive modeling consists of two types of variables:
• The first is the explanatory variables, which define the essential properties of the data, and
• The second is the target variable, whose value is to be predicted. Classification is
used to predict the value of a discrete target variable.
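To make the two kinds of variables concrete, here is a small sketch using a 1-nearest-neighbour classifier. This is only one of many classification methods and was chosen here for brevity, not because the slides prescribe it. The training data, the meaning of the two explanatory variables and the function name classify are all hypothetical.

```python
import math

# Hypothetical training data: two explanatory variables per example
# (say, age and income in thousands) and a discrete target variable.
training = [
    ((25, 30), "no"),
    ((30, 40), "no"),
    ((45, 90), "yes"),
    ((50, 110), "yes"),
]

def classify(point):
    """Predict the target of `point` as the label of its nearest training example."""
    _, nearest_label = min(
        training, key=lambda example: math.dist(example[0], point)
    )
    return nearest_label

print(classify((28, 35)))
print(classify((48, 100)))
```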
• Approach:
• Identify frequently occurring terms in each document.
• Form a similarity measure based on the frequencies of different terms.
• Use it to cluster the documents.
• Gain:
• Information Retrieval can utilize the clusters to relate a new document or search
term to clustered documents.
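The three steps of the approach can be sketched as follows. This is a simplified illustration: cosine similarity is one common similarity measure over term frequencies, and the greedy single-pass grouping stands in for a full clustering algorithm. The mini-corpus, the threshold of 0.3 and all function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical mini-corpus.
documents = [
    "stocks market shares trading market",
    "market shares investors stocks",
    "football match goal team",
    "team goal league football match",
]

def term_frequencies(text):
    """Step 1: count how often each term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Step 2: similarity based on the term frequencies of two documents."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.3):
    """Step 3: greedy single-pass clustering (a simplification).

    Each document joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster.
    """
    vectors = [term_frequencies(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

print(cluster(documents))
```

A new document or search term could be related to the clusters in the same way, by computing its term frequencies and comparing them with a representative of each cluster.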
Data Science Tasks
Outlier Analysis
• Discovers data points that are significantly different from the rest of the data.
Such points are known as anomalies or outliers.
• Applications:
• Credit Card Fraud Detection
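As a rough illustration of outlier analysis, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The z-score is only one simple technique, and real fraud detection uses far richer features. The transaction amounts, the threshold of 2.0 and the function name find_outliers are assumptions.

```python
from statistics import mean, stdev

# Hypothetical card transaction amounts; one value is far from the rest.
amounts = [25.0, 30.0, 27.5, 22.0, 31.0, 28.0, 950.0, 26.5]

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(find_outliers(amounts))
```

Because a single extreme value inflates both the mean and the standard deviation, a low threshold is needed on small samples; measures based on the median are often preferred for that reason.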
• Seeks to produce a set of rules describing the sets of features that are
strongly related to each other.
Gender  Age  Smoker  LAD%  RCA%
F       52   Y       85    100
M       62   N       80    0
M       75   Y       70    80
M       73   Y       40    99
M       66   N       50    45
…       …    …       …     …
LAD%: the percentage of heart disease caused by the left anterior descending coronary artery
RCA%: the percentage of heart disease caused by the right coronary artery
No.  Rule
1    Gender=M ∩ Age≥70 ∩ Smoker=Y ⇒ RCA%≥50 (40%, 100%)
2    Gender=F ∩ Age<70 ∩ Smoker=Y ⇒ LAD%≥70 (20%, 100%)
• Rule 1 indicates: 40% of the cases are male smokers aged 70 or over; for
these cases the probability that RCA%≥50 is 100%.
• Rule 2 indicates: 20% of the cases are female smokers under 70; for
these cases the probability that LAD%≥70 is 100%.
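The two percentages attached to each rule can be checked with a short sketch. It assumes the table columns are gender, age, smoker, LAD% and RCA% in that order (the order consistent with both rules) and uses only the five rows listed, not the elided ones. The function name support_and_confidence is hypothetical.

```python
# The five rows listed in the table (the elided rows are not included).
# Assumed column order: gender, age, smoker, LAD%, RCA%.
cases = [
    ("F", 52, "Y", 85, 100),
    ("M", 62, "N", 80, 0),
    ("M", 75, "Y", 70, 80),
    ("M", 73, "Y", 40, 99),
    ("M", 66, "N", 50, 45),
]

def support_and_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    Support is taken here as the fraction of rows matching the antecedent,
    which is how the two rules are read in the text; confidence is the
    fraction of those rows that also satisfy the consequent.
    """
    matching = [r for r in rows if antecedent(r)]
    if not matching:
        return 0.0, 0.0
    support = len(matching) / len(rows)
    confidence = sum(1 for r in matching if consequent(r)) / len(matching)
    return support, confidence

rule1 = support_and_confidence(
    cases,
    lambda r: r[0] == "M" and r[1] >= 70 and r[2] == "Y",
    lambda r: r[4] >= 50,
)
rule2 = support_and_confidence(
    cases,
    lambda r: r[0] == "F" and r[1] < 70 and r[2] == "Y",
    lambda r: r[3] >= 70,
)
print(rule1)  # (0.4, 1.0)
print(rule2)  # (0.2, 1.0)
```

On these five rows the results match the (40%, 100%) and (20%, 100%) figures given for Rule 1 and Rule 2.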