Big Data
Data Science
Data science is the process of examining data sets to draw conclusions from the information they contain, increasingly with the aid of specialized systems and software, using techniques, scientific models, theories, and hypotheses. These pillars have been the mainstay of data science ever since businesses began embracing it over the past two decades, and they should remain so in the future.
Data science is an idea accepted in both academia and industry. It is an intersection of programming, analytical, and business skills that allows meaningful insights to be extracted from data to benefit business growth. It is also used in social research, scientific and space programs, government planning, and so on.
Business acumen in its purest form means understanding how a business enterprise runs. Any business exists to sell its products or services for a profit, incurs costs in doing so, and generally has functions such as HR, supply chain, finance, and sales & marketing to support it.
Big data
Definition
Big Data is defined as the tools and platforms used to store, process, and analyze data to identify business insights that were not attainable with traditional data processing and management technologies, given their limitations. Big Data is also viewed as a technology for processing huge datasets on distributed, scalable platforms.
Big Data Applications in Industry 4.0: Data that cannot be stored and processed on commodity hardware, typically exceeding one terabyte, is called Big Data. Existing commodity hardware is limited to roughly one terabyte of storage and processing capacity.
The 7 Vs of Big Data:
- Volume: Volume represents the amount of data that is growing at an exponential rate, i.e. in
Petabytes and Exabytes.
- Velocity: Velocity refers to the speed at which data is generated and grows. Today, yesterday's data is considered stale, and social media is a big contributor to this rapid growth.
- Variety: Variety refers to the heterogeneity of data types. In other words, the data collected comes in many formats, such as video, audio, CSV, and more, and these different formats represent many types of data.
- Veracity: Veracity refers to the doubt or uncertainty in available data caused by inconsistency and incompleteness. Available data can sometimes be messy and hard to trust. With many forms of big data, quality and accuracy are difficult to control, and volume is often the reason behind this lack of quality and accuracy.
- Validity: The fifth V denotes the validity of data that is essential in business to identify the validity
of data patterns for planning business strategies.
- Virality: The sixth V denotes the virality aspect of data, which is generally used to measure the reach of data.
- Value: Access to big data is all well and good, but it is of little use unless we can turn it into value.
Example:
The characteristics of Big Data for the coronavirus pandemic are mapped below.
+ Volume: A huge volume of data is generated every hour relating to affected patients, illness conditions, precautionary measures, diagnoses, and hospital facilities.
+ Velocity: The information about affected people and the ill effects of COVID-19 is streaming in nature and evolves dynamically.
+ Variety: A huge volume of COVID-19 data is accumulated as structured data in patient databases, citizen demographics, clinical diagnoses, travel data, genomic studies, and drug targets. Unstructured COVID-19 data is voluminous on social media platforms such as Twitter, Facebook, and WhatsApp, where preventive measures are shared as text, audio, video, and related chats.
+ Veracity and Virality: The preventive-cure information mentioned on social media platforms is inconsistent and viral, leading to uncertainty among people.
+ Validity and Value: Measuring the validity and the value of the content available in the digital globe for the pandemic has become a challenge.
To create big data for a manufacturing process, you can follow these steps:
1.Data Collection: Collect data from various sources such as sensors, machines,
production systems, and databases. This data can include production data,
machine performance data, quality control data, and supply chain data. When
using big data in production, the data integration process is critical to ensure that
the data is correctly formatted and usable by big data technologies.
2.Data Integration: Integrate the data from different sources into a centralized repository such as a Hadoop cluster or a data lake (a minimal sketch follows the steps below).
Step 1: Collect data from different sources
Step 2: Standardize data
Step 3: Integrate data
Step 4: Ensure data consistency and format
Step 5: Store data
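As referenced above, this step can be sketched in Python. A minimal example, assuming sensor readings arrive as CSV exports and using a local Parquet store as a stand-in for the data lake (the paths and column names such as machine_id are hypothetical):

```python
import glob
import pandas as pd

# Step 1: collect raw CSV exports from different sources (hypothetical paths)
frames = [pd.read_csv(path) for path in glob.glob("raw/sensors/*.csv")]

# Steps 2-4: standardize, integrate, and enforce a consistent format
df = pd.concat(frames, ignore_index=True)
df.columns = [c.strip().lower() for c in df.columns]  # uniform column names
df["timestamp"] = pd.to_datetime(df["timestamp"])     # one timestamp format
df = df.drop_duplicates().dropna(subset=["machine_id"])

# Step 5: store in the centralized repository (Parquet here; a Hadoop
# cluster or cloud data lake would be written to in the same spirit)
df.to_parquet("lake/sensor_readings.parquet", index=False)
```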
3. Data Processing:
- Data processing converts source data into useful and reliable information. It includes activities such as collecting, storing, organizing, classifying, calculating, analyzing, and presenting information in reports, charts, or other formats.
- Data processing can be performed using various means and tools, including data processing software, database systems, data query tools, algorithms, and different data processing techniques. The purpose of data processing is to help managers, researchers, or other organizations find useful information in data and make the right decisions.
- Use big data technologies such as Apache Spark or Apache Flink to process and clean the data and to identify patterns and anomalies.
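A minimal PySpark sketch of this step, assuming the integrated Parquet data from step 2 and a hypothetical temperature column; readings more than three standard deviations from the mean are flagged as anomalies:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mfg-processing").getOrCreate()

# Load the integrated data and apply basic cleaning
df = spark.read.parquet("lake/sensor_readings.parquet")
df = df.dropna(subset=["temperature"])

# Flag readings more than 3 standard deviations from the mean
stats = df.agg(F.mean("temperature").alias("mu"),
               F.stddev("temperature").alias("sigma")).first()
df = df.withColumn(
    "anomaly",
    F.abs(F.col("temperature") - stats["mu"]) > 3 * stats["sigma"],
)

df.write.mode("overwrite").parquet("lake/sensor_readings_clean.parquet")
```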
4. Data Analysis:
Data analysis can be applied to many different fields such as business, science, health, education, politics, and so on. Each field has its own goals and methods of data analysis. In general, however, data analysis aims to solve specific problems using existing or newly collected data.
Analyze the data using tools such as Apache Hive, Apache Impala, or Apache
Drill to gain insights into the manufacturing process and make data-driven
decisions.
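The same kind of analysis can be expressed in SQL. A minimal Spark SQL sketch (the syntax is Hive-compatible; the table and columns continue the hypothetical example above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mfg-analysis").getOrCreate()

df = spark.read.parquet("lake/sensor_readings_clean.parquet")
df.createOrReplaceTempView("sensor_readings")

# Average temperature and anomaly count per machine
insight = spark.sql("""
    SELECT machine_id,
           AVG(temperature) AS avg_temp,
           SUM(CASE WHEN anomaly THEN 1 ELSE 0 END) AS anomaly_count
    FROM sensor_readings
    GROUP BY machine_id
    ORDER BY anomaly_count DESC
""")
insight.show()
```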
Statistical Analysis
Descriptive Analytics
• Define business metrics: Determine which metrics are important for evaluating performance against business goals. Goals include increasing revenue, reducing costs, improving operational efficiency, and measuring productivity. Each goal must have associated key performance indicators (KPIs) to help monitor achievement.
• Identify data required: Data are located in many different sources within the enterprise,
including systems of record, databases, desktops and shadow IT repositories. To measure
data accurately against KPIs, companies must catalog and prepare the correct data sources
to extract the needed data and calculate metrics based on the current state of the business.
• Extract and prepare data: Data must be prepared for analysis. Deduplication,
transformation and cleansing are a few examples of the data preparation steps that need to
occur before analysis. This is often the most time-consuming and labor-intensive step,
requiring up to 80% of an analyst’s time, but it is critical for ensuring accuracy.
• Analyze data: Data analysts can create models and run analyses such as summary statistics, clustering, and regression analysis on the data to determine patterns and measure performance. Key metrics are calculated and compared with stated business goals to evaluate performance based on historical results. Data scientists often use open source tools such as R and Python to programmatically analyze and visualize data (a short Python sketch follows this list).
• Present data: Results of the analytics are usually presented to stakeholders in the form of
charts and graphs. This is where data visualization comes into play. Business intelligence
tools give users the ability to present data visually in a way that non-data analysts can
understand. Many self-service data visualization tools also enable business users to create
their own visualizations and manipulate the output.
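As referenced in the "Analyze data" step above, a minimal Python sketch of summary statistics and a simple regression on already-prepared data (the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("prepared_kpi_data.csv")  # hypothetical prepared dataset

# Summary statistics for the key metrics
print(df[["revenue", "cost", "units_produced"]].describe())

# Simple regression: how does output scale with cost?
model = LinearRegression().fit(df[["cost"]], df["units_produced"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```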
Diagnostic Analytics
At this stage, historical data can be measured against other data to answer the question of why something happened. Diagnostic analytics makes it possible to drill down, find dependencies, and identify patterns. Companies choose diagnostic analytics because it gives deep insight into a particular problem. At the same time, a company should have detailed information at its disposal; otherwise, data collection may turn out to be issue-specific and time-consuming.
Predictive Analytics
Prescriptive Analytics
• Build a business case: Prescriptive analytics are best used when data-driven decision-
making goes beyond human capabilities, such as when there are too many input
variables, or data volumes are high. A business case will help identify whether machine-
generated recommendations are appropriate and trustworthy.
• Define rules: Prescriptive analytics require rules to be codified that can be applied to
generate recommendations. Business rules thus need to be identified and actions defined
for each possible outcome. Rules are decisions that are programmatically implemented
in software. The system receives and analyzes data, then prescribes the next best course
of action based on predetermined parameters. Prescriptive models can be very complex
to implement. Appropriate analytic techniques need to be applied to ensure that all
possible outcomes are considered to prevent missteps. This includes the application of
optimization and other analytic techniques in conjunction with rules management.
• Test, Test, Test: As the intent of prescriptive analytics is to automate the decision-
making process, testing the models to ensure that they are providing meaningful
recommendations is imperative to prevent costly mistakes.
Statistical Learning
What is Statistical Learning ?
Y = f(X) + ε
Why Estimate f ?
Prediction
Ŷ = f̂(X)
E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)
The squared term is the reducible error; Var(ε) is the irreducible error.
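A one-line expansion shows where this decomposition comes from, assuming X and f̂ are held fixed so the expectation is over ε alone, with E[ε] = 0:

```latex
\begin{aligned}
E(Y-\hat{Y})^2 &= E\big[f(X)+\epsilon-\hat{f}(X)\big]^2\\
&= \big[f(X)-\hat{f}(X)\big]^2 + 2\big[f(X)-\hat{f}(X)\big]\,E[\epsilon] + E[\epsilon^2]\\
&= \big[f(X)-\hat{f}(X)\big]^2 + \operatorname{Var}(\epsilon)
\end{aligned}
```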
Inference
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a
linear equation, or is the relationship more complicated?
Statistical Learning
How do we estimate f ?
Parametric Methods
Make an assumption about the functional form, or shape, of f.
f(X) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ
Y ≈ β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ
Statistical Learning
Linear Regression
- Linear Regression: one of the most fundamental machine learning methods. It covers
least squares estimation, model assumptions, and performance metrics like R-squared.
• Dependent Variable
• Independent Variables
• Regression Line
• Regression Equation
• Coefficients
• Intercept
• Residuals
PhD. Quang-Phuoc Tran
Ho Chi Minh City Industry 4.0 technologies in Mechanical Engineering
University of Technology
minimize (1/n) Σᵢ (yᵢ − ŷᵢ)²
Linear Regression
+ Linearity
+ Independence
+ Homoscedasticity (constant error variance)
+ Normality
Residual = Observed value − Predicted value
eᵢ = yᵢ − ŷᵢ
RSS = e₁² + e₂² + ⋯ + eₙ²
Linear Regression
θ₂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
θ₁ = ȳ − θ₂ x̄
where
x̄ ≡ (1/n) Σᵢ xᵢ   and   ȳ ≡ (1/n) Σᵢ yᵢ
Linear Regression
Example: We have the following dataset with 12 total observations:

xi    yi    (xi − x̄)   (yi − ȳ)   (xi − x̄)(yi − ȳ)   (xi − x̄)²
8     41    −10.58     −2.67      28.22              112.01
12    42    −6.58      −1.67      10.97              43.34
12    39    −6.58      −4.67      30.72              43.34
13    37    −5.58      −6.67      37.22              31.17
14    35    −4.58      −8.67      39.72              21.01
16    39    −2.58      −4.67      12.06              6.67
17    45    −1.58      1.33       −2.11              2.51
22    46    3.42       2.33       7.97               11.67
24    39    5.42       −4.67      −25.28             29.34
26    49    7.42       5.33       39.56              55.01
29    55    10.42      11.33      118.06             108.51
30    57    11.42      13.33      152.22             130.34
Totals                            449.33             594.92

Here x̄ = 223/12 ≈ 18.58 and ȳ = 524/12 ≈ 43.67 (deviations are rounded to two decimals; the totals use the unrounded values).

The slope: θ₂ = 449.33 / 594.92 ≈ 0.7553
The intercept: θ₁ = ȳ − θ₂x̄ ≈ 43.67 − 0.7553 × 18.58 ≈ 29.63
Fitted line: ŷ = θ₁ + θ₂x = 29.63 + 0.7553x
RSE = √(RSS / (n − 2)) = √((1 / (n − 2)) Σᵢ (yᵢ − ŷᵢ)²)

RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

R-square:
R² = (TSS − RSS) / TSS = 1 − RSS / TSS,   where TSS = Σᵢ (yᵢ − ȳ)²
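A minimal NumPy check of the worked example and the fit statistics above, using the 12 observations listed:

```python
import numpy as np

x = np.array([8, 12, 12, 13, 14, 16, 17, 22, 24, 26, 29, 30], dtype=float)
y = np.array([41, 42, 39, 37, 35, 39, 45, 46, 39, 49, 55, 57], dtype=float)

# Closed-form least squares estimates
theta2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta1 = y.mean() - theta2 * x.mean()
print(f"y_hat = {theta1:.2f} + {theta2:.4f} x")  # y_hat = 29.63 + 0.7553 x

# Residual standard error and R-square
y_hat = theta1 + theta2 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (len(x) - 2))
r2 = 1 - rss / tss
print(f"RSE = {rse:.2f}, R^2 = {r2:.3f}")  # RSE ≈ 4.44, R^2 ≈ 0.632
```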
Ex 1:
Suppose we have a data set with five predictors: X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
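Part (b) is a direct plug-in; a quick Python check of the arithmetic:

```python
# Salary is in thousands of dollars; female => gender = 1
gpa, iq, gender = 4.0, 110, 1
salary = 50 + 20*gpa + 0.07*iq + 35*gender + 0.01*(gpa*iq) - 10*(gpa*gender)
print(salary)  # 137.1, i.e. a predicted starting salary of $137,100
```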
Ex 2:
A scientist is studying the relationship between the amount of fertilizer used on a crop and
the resulting yield. They conduct an experiment using different amounts of fertilizer (in
kilograms per hectare) and measure the corresponding crop yield (in tonnes per hectare).
The data collected is as follows:
Fertilizer (kg/ha)   Yield (tonnes/ha)
1                    2
2                    3
3                    5
4                    4
5                    6

The scientist wants to model this relationship using a simple linear regression model:
Yield = β₀ + β₁ * Fertilizer + ε
where:
+ Yield is the crop yield (in tonnes per hectare)
+ Fertilizer is the amount of fertilizer used (in kg/ha)
+ β₀ and β₁ are the coefficients to be estimated
+ ε is the error term
1.Write out the cost function (also known as the sum of squared errors) that needs to be minimized. Express this
function in terms of β₀, β₁, Fertilizerᵢ, and Yieldᵢ, where i represents the data point index (i=1 to 5 in this case).
2.Derive the normal equations by taking the partial derivatives of the cost function with respect to β₀ and β₁, and
setting them equal to zero. This will give you two equations that can be solved simultaneously to find the
optimal values of β₀ and β₁.
3.Calculate the following sums from the provided data:
1. Σ Fertilizerᵢ
2. Σ Yieldᵢ
3. Σ (Fertilizerᵢ)²
4. Σ (Fertilizerᵢ * Yieldᵢ)
4.Use the sums calculated in step 3 and the normal equations derived in step 2 to calculate the least squares
estimates for β₀ and β₁.
5.Write out the final linear equation with the calculated values of β₀ and β₁.
6.Interpret the meaning of β₁. What does it tell you about the relationship between fertilizer and yield? For
example, what is the predicted increase in yield for every additional kg/ha of fertilizer used?
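A NumPy sketch of tasks 3 and 4 (computing the sums and solving the normal equations); for this data the estimates come out to β₁ = 0.9 and β₀ = 1.3:

```python
import numpy as np

fert = np.array([1, 2, 3, 4, 5], dtype=float)
crop = np.array([2, 3, 5, 4, 6], dtype=float)
n = len(fert)

# Task 3: the four sums
s_x, s_y = fert.sum(), crop.sum()                    # 15, 20
s_xx, s_xy = (fert ** 2).sum(), (fert * crop).sum()  # 55, 69

# Task 4: solve the normal equations
#   n*b0   + s_x*b1  = s_y
#   s_x*b0 + s_xx*b1 = s_xy
b1 = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)  # 0.9
b0 = (s_y - b1 * s_x) / n                            # 1.3
print(f"Yield = {b0} + {b1} * Fertilizer")
```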
Example:
Let's say you want to predict a house's sale price (Y) based on its size (X₁ in square feet), the number of bedrooms (X₂), and
the age of the house (X₃ in years). You collect data on several houses and fit a multiple linear regression model.
Suppose you obtain the following estimated coefficients:
•β₀ = 50,000
•β₁ = 200
•β₂ = 10,000
•β₃ = -500
The resulting regression equation would be:
Price = 50,000 + 200·X₁ + 10,000·X₂ − 500·X₃
For example, a 2,000-square-foot house with 3 bedrooms that is 20 years old has a predicted price of 50,000 + 200(2,000) + 10,000(3) − 500(20) = $470,000.
Exercise 1:
A data scientist is building a model to predict apartment rental prices (in dollars) based on the size of the apartment
(in square feet) and its distance from downtown (in miles). They collect data on four apartments:
Use the least squares method and matrix notation to find the coefficients of the multiple linear regression model:
Rent = β₀ + β₁ * Size + β₂ * Distance
Solution:
β = (XᵀX)⁻¹XᵀY
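Since the apartment table was not reproduced here, the numbers below are hypothetical placeholders, not the exercise's data; the sketch only illustrates the matrix solution β = (XᵀX)⁻¹XᵀY:

```python
import numpy as np

# Hypothetical placeholder data: size (sq ft), distance (miles), rent ($)
size = np.array([600, 800, 1000, 1200], dtype=float)
dist = np.array([5.0, 3.0, 4.0, 1.0])
rent = np.array([1100, 1450, 1500, 1900], dtype=float)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(size), size, dist])
Y = rent

# Normal-equation solution (np.linalg.lstsq is more stable in practice)
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)  # [b0, b1 (size), b2 (distance)]
```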
Exercise 2:
An engineer is studying the effects of cutting speed (V), feed rate (S), and cutting time (T) on three important output variables in a drilling process: hole temperature (Temp), thrust force (Force), and surface roughness (Roughness). They conduct a series of experiments and collect the following data:
The engineer wants to build three separate multiple linear regression models:
Temp = β₀ + β₁V + β₂S + β₃T
Force = β₀ + β₁V + β₂S + β₃T
Roughness = β₀ + β₁V + β₂S + β₃T
Tasks:
1.Using the least squares method and matrix notation, determine the coefficients (β₀, β₁, β₂, β₃) for the chosen
model. Clearly show the following steps:
1. Set up the matrices Y and X.
2. Calculate Xᵀ (X transpose).
3. Calculate XᵀX.
4. Calculate (XᵀX)⁻¹ (You'll need a calculator or software for this).
5. Calculate XᵀY.
6. Calculate β = (XᵀX)⁻¹XᵀY.
2.Write out the final regression equation for your chosen output variable.
3.Using your model, predict the output variable when V = 62 m/min, S = 0.16 mm/rev, and T = 2.2 min.
4.(Critical Thinking): Based on the coefficients you obtained, discuss the relationship between the input variables (V, S,
T) and your chosen output variable. Which input variable seems to have the strongest effect on the output? (This doesn't
require further calculations, just interpretation of the results.)
Polynomial Regression
y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε
where:
•y is the dependent/outcome variable.
•x is the independent/predictor variable.
•β₀, β₁, β₂, ..., βₙ are the coefficients to be estimated.
•n is the degree of the polynomial (e.g., n=2 for quadratic, n=3 for cubic).
•ε is the error term.
Example:
y = β₀ + β₁x + β₂x² + ε
• Creating a new predictor variable: calculate x² for each data point.

Temperature (x)   Yield (y)
100               50
120               65
140               70
160               65
180               50

• Setting up the design matrix:
X = [[1, 100, 10000],
     [1, 120, 14400],
     [1, 140, 19600],
     [1, 160, 25600],
     [1, 180, 32400]]
• Solving for β: β₀ = −175, β₁ = 3.5, β₂ = −0.0125 (this quadratic passes through all five points exactly).
• Final equation: y = −175 + 3.5x − 0.0125x²
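A quick NumPy check of the quadratic fit above:

```python
import numpy as np

x = np.array([100, 120, 140, 160, 180], dtype=float)
y = np.array([50, 65, 70, 65, 50], dtype=float)

# np.polyfit returns the highest-degree coefficient first
b2, b1, b0 = np.polyfit(x, y, deg=2)
print(b0, b1, b2)                   # -175.0  3.5  -0.0125
print(np.polyval([b2, b1, b0], x))  # reproduces y exactly: [50. 65. 70. 65. 50.]
```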
Exercise:
A machinist is analyzing the relationship between cutting speed (V) and the resulting thrust force (Force) in a milling operation. They collect the following data:

V (m/min)   Force (N)
20          100
30          120
40          130
50          120
60          100

The machinist suspects a non-linear relationship and wants to fit a second-degree polynomial (quadratic) regression model:
Force = β₀ + β₁V + β₂V² + ε

Tasks:
1. Create the necessary matrices (Y and X) for polynomial regression. Remember to include the intercept term and the squared term (V²) in the design matrix X.
2. Calculate Xᵀ (X transpose).
3. Calculate XᵀX.
4. Calculate (XᵀX)⁻¹ (the inverse of XᵀX). You'll likely need a calculator or software for this step.
5. Calculate XᵀY.
6. Calculate the coefficients β = (XᵀX)⁻¹XᵀY.
7. Write out the final fitted quadratic equation for Force.
8. Predict the Force when V = 45 m/min.
9. (Critical Thinking): Does a quadratic model seem appropriate for this data? Explain your reasoning.
Logistic Regression
Mapping to Probability
logit(p) = ln(p / (1-p))
Sigmoid Function:
Sigmoid(x) = P(Y = 1 | X) = 1 / (1 + e^(−(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ)))
Where:
•P(Y=1|X) is the probability of the dependent variable Y belonging to category 1, given the
independent variables X.
•X₁, X₂, ..., Xₙ are the independent variables (features).
•β₀, β₁, β₂, ..., βₙ are the coefficients learned by the model during training. These coefficients
represent the weights assigned to each feature.
•exp() is the exponential function.
log(p / (1 − p)) = mx + b
p / (1 − p) = 1 / (1 − p) − 1 = e^(mx + b)
1 / (1 − p) = e^(mx + b) + 1
p = 1 − 1 / (e^(mx + b) + 1) = e^(mx + b) / (e^(mx + b) + 1)
p = 1 / (1 + e^(−(mx + b)))
Loss function
Apply MLE (Maximum Likelihood Estimation):
L = ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)
Taking the negative log-likelihood gives the cross-entropy loss:
Loss = − Σᵢ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
Example:
We want to predict drill bit breakage (Yes/No) based on Drill Speed, Feed Rate,
Depth of Cut, and Material Hardness.
Drill Speed (RPM)   Bit Breakage
1500                No
1800                No
2000                No
2200                Yes
2500                Yes
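A minimal scikit-learn sketch fitting a logistic regression to this example; since only the drill-speed column was reproduced above, the other three features are omitted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

speed = np.array([[1500], [1800], [2000], [2200], [2500]], dtype=float)
breakage = np.array([0, 0, 0, 1, 1])  # No = 0, Yes = 1

model = LogisticRegression().fit(speed, breakage)

# Predicted probability of breakage at a new spindle speed
p = model.predict_proba([[2100.0]])[0, 1]
print(f"P(breakage | 2100 RPM) = {p:.2f}")
```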
• Quadratic Discriminant Analysis (QDA): Similar to LDA, but allows for non-
linear decision boundaries.
5. Data Visualization:
Common data visualization formats: column chart, bar chart, line graph, dual-axis chart, Mekko chart, pie chart, bubble chart, domain chart, scatter plot, heat map, area chart, and more.
Visualize the results of the data analysis using tools such as Apache Zeppelin, Tableau, or Power BI.
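A minimal matplotlib sketch as a code-based alternative to the BI tools above (the two production lines and their defect rates are hypothetical):

```python
import matplotlib.pyplot as plt

days = list(range(1, 8))
line_a = [2.1, 1.8, 2.4, 2.0, 1.6, 1.9, 1.7]  # hypothetical defect rates (%)
line_b = [3.0, 2.7, 2.9, 3.2, 2.8, 2.6, 2.5]

plt.plot(days, line_a, marker="o", label="Line A")
plt.plot(days, line_b, marker="s", label="Line B")
plt.xlabel("Day")
plt.ylabel("Defect rate (%)")
plt.title("Daily defect rate by production line")
plt.legend()
plt.show()
```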
6. Data Management:
- Manage the data using tools such as Apache HBase or Apache Cassandra
to ensure data consistency, reliability, and security.