Big Data
Data Science
Data science is the process of examining data sets to draw conclusions from the information they contain, increasingly with the aid of specialized systems and software, using techniques, scientific models, theories, and hypotheses. These pillars have been the mainstay of data science ever since businesses began embracing it over the past two decades, and they should remain so in the future.
Data science is an idea accepted in both academia and industry. It is an intersection of programming, analytical, and business skills that allows meaningful insights to be extracted from data to benefit business growth. It is also used in social research, scientific and space programs, government planning, and so on.
Business acumen in its purest form means understanding how a business enterprise runs. Any business exists to sell its products or services for a profit, incurs costs in doing so, and generally has functions such as HR, supply chain, finance, and sales & marketing to support it.
Big data
Definition
Big Data is defined as the tools and platforms used to store, process, and analyze data to identify business insights that were not attainable with traditional data processing and management technologies, given their limitations. Big Data is also viewed as a technology for processing huge datasets on distributed, scalable platforms.
Big Data Applications in Industry 4.0: Data that cannot be stored and processed on commodity hardware, typically exceeding one terabyte, is called Big Data. Existing commodity hardware is limited to roughly one terabyte of storage and processing capacity.
The 7 Vs of Big Data:
- Volume: Volume represents the amount of data that is growing at an exponential rate, i.e. in
Petabytes and Exabytes.
- Velocity: Velocity refers to the speed at which data is generated and grows. Today, yesterday's data is considered stale, and social media is a big contributor to this rapid growth.
- Variety: Variety refers to the heterogeneity of data types. In other words, the data collected comes in many formats, such as video, audio, CSV, and more, and these different formats represent many types of data.
- Veracity: Veracity refers to the doubt or uncertainty in available data caused by inconsistency and incompleteness. Available data can sometimes be messy and hard to trust. With many forms of big data, quality and accuracy are difficult to control, and volume is often the reason behind this lack of quality and accuracy.
- Validity: The fifth V denotes the validity of data that is essential in business to identify the validity
of data patterns for planning business strategies.
- Virality: The sixth V denotes the virality aspect of data, which is generally used to measure the reach of data.
- Value: Access to big data is all well and good, but it is of little use unless we can turn it into value.
Example:
The characteristics of Big Data for the coronavirus pandemic are mapped below.
+ Volume: A huge volume of data is generated every hour relating to affected patients, illness conditions, precautionary measures, diagnoses, and hospital facilities.
+ Velocity: The information about affected people and the ill effects of COVID-19 is streaming in nature and evolves dynamically.
+ Variety: A huge volume of COVID-19 data is accumulated as structured data in patient databases, citizen demographics, clinical diagnoses, travel data, genomic studies, and drug targets. Unstructured COVID-19 data is voluminous on social media platforms such as Twitter, Facebook, and WhatsApp, where preventive measures are shared as text, audio, video, and related chats.
+ Veracity and Virality: The preventive-cure information mentioned on social media platforms is inconsistent and viral, leading to uncertainty among people.
+ Validity and Value: Measuring the validity and the value of the content available in the digital globe for the pandemic has become a challenge.
To create big data for a manufacturing process, you can follow these steps:
1.Data Collection: Collect data from various sources such as sensors, machines,
production systems, and databases. This data can include production data,
machine performance data, quality control data, and supply chain data. When
using big data in production, the data integration process is critical to ensure that
the data is correctly formatted and usable by big data technologies.
2.Data Integration: Integrate the data from different sources into a centralized repository such as a Hadoop cluster or a data lake (a minimal sketch follows the steps below).
Step 1: Collect data from different sources
Step 2: Standardize data
Step 3: Integrate data
Step 4: Ensure data consistency and format
Step 5: Store data
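As referenced above, this step can be sketched in Python. A minimal example, assuming sensor readings arrive as CSV exports and using a local Parquet store as a stand-in for the data lake (the paths and column names such as machine_id are hypothetical):

```python
import glob
import pandas as pd

# Step 1: collect raw CSV exports from different sources (hypothetical paths)
frames = [pd.read_csv(path) for path in glob.glob("raw/sensors/*.csv")]

# Steps 2-4: standardize, integrate, and enforce a consistent format
df = pd.concat(frames, ignore_index=True)
df.columns = [c.strip().lower() for c in df.columns]  # uniform column names
df["timestamp"] = pd.to_datetime(df["timestamp"])     # one timestamp format
df = df.drop_duplicates().dropna(subset=["machine_id"])

# Step 5: store in the centralized repository (Parquet here; a Hadoop
# cluster or cloud data lake would be written to in the same spirit)
df.to_parquet("lake/sensor_readings.parquet", index=False)
```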
3. Data Processing:
- Data processing converts source data into useful and reliable information. It includes activities such as collecting, storing, organizing, classifying, calculating, analyzing, and presenting information in reports, charts, or other formats.
- Data processing can be performed using various means and tools, including data processing software, database systems, data query tools, algorithms, and different data processing techniques. The purpose of data processing is to help managers, researchers, or other organizations find useful information in data and make the right decisions.
- Use big data technologies such as Apache Spark or Apache Flink to process and clean the data and to identify patterns and anomalies.
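A minimal PySpark sketch of this step, assuming the integrated Parquet data from step 2 and a hypothetical temperature column; readings more than three standard deviations from the mean are flagged as anomalies:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mfg-processing").getOrCreate()

# Load the integrated data and apply basic cleaning
df = spark.read.parquet("lake/sensor_readings.parquet")
df = df.dropna(subset=["temperature"])

# Flag readings more than 3 standard deviations from the mean
stats = df.agg(F.mean("temperature").alias("mu"),
               F.stddev("temperature").alias("sigma")).first()
df = df.withColumn(
    "anomaly",
    F.abs(F.col("temperature") - stats["mu"]) > 3 * stats["sigma"],
)

df.write.mode("overwrite").parquet("lake/sensor_readings_clean.parquet")
```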
4. Data Analysis:
Data analysis can be applied to many different fields such as business, science, health, education, politics, and so on. Each field has its own goals and methods of data analysis. In general, however, data analysis aims to solve specific problems using existing or newly collected data.
Analyze the data using tools such as Apache Hive, Apache Impala, or Apache
Drill to gain insights into the manufacturing process and make data-driven
decisions.
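The same kind of analysis can be expressed in SQL. A minimal Spark SQL sketch (the syntax is Hive-compatible; the table and columns continue the hypothetical example above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mfg-analysis").getOrCreate()

df = spark.read.parquet("lake/sensor_readings_clean.parquet")
df.createOrReplaceTempView("sensor_readings")

# Average temperature and anomaly count per machine
insight = spark.sql("""
    SELECT machine_id,
           AVG(temperature) AS avg_temp,
           SUM(CASE WHEN anomaly THEN 1 ELSE 0 END) AS anomaly_count
    FROM sensor_readings
    GROUP BY machine_id
    ORDER BY anomaly_count DESC
""")
insight.show()
```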
Statistical Analysis
Descriptive Analytics
• Define business metrics: Determine which metrics are important for evaluating performance against business goals. Goals include increasing revenue, reducing costs, improving operational efficiency, and measuring productivity. Each goal must have associated key performance indicators (KPIs) to help monitor achievement.
• Identify data required: Data are located in many different sources within the enterprise,
including systems of record, databases, desktops and shadow IT repositories. To measure
data accurately against KPIs, companies must catalog and prepare the correct data sources
to extract the needed data and calculate metrics based on the current state of the business.
• Extract and prepare data: Data must be prepared for analysis. Deduplication,
transformation and cleansing are a few examples of the data preparation steps that need to
occur before analysis. This is often the most time-consuming and labor-intensive step,
requiring up to 80% of an analyst’s time, but it is critical for ensuring accuracy.
• Analyze data: Data analysts can create models and run analyses such as summary statistics, clustering, and regression analysis on the data to determine patterns and measure performance. Key metrics are calculated and compared with stated business goals to evaluate performance based on historical results. Data scientists often use open source tools such as R and Python to programmatically analyze and visualize data (a short Python sketch follows this list).
• Present data: Results of the analytics are usually presented to stakeholders in the form of
charts and graphs. This is where data visualization comes into play. Business intelligence
tools give users the ability to present data visually in a way that non-data analysts can
understand. Many self-service data visualization tools also enable business users to create
their own visualizations and manipulate the output.
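As referenced in the "Analyze data" step above, a minimal Python sketch of summary statistics and a simple regression on already-prepared data (the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("prepared_kpi_data.csv")  # hypothetical prepared dataset

# Summary statistics for the key metrics
print(df[["revenue", "cost", "units_produced"]].describe())

# Simple regression: how does output scale with cost?
model = LinearRegression().fit(df[["cost"]], df["units_produced"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```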
Diagnostic Analytics
At this stage, historical data can be measured against other data to answer the question of why something happened. Diagnostic analytics makes it possible to drill down, find dependencies, and identify patterns. Companies choose diagnostic analytics because it gives deep insight into a particular problem. At the same time, a company should have detailed information at its disposal; otherwise, data collection may turn out to be issue-specific and time-consuming.
Predictive Analytics
Prescriptive Analytics
• Build a business case: Prescriptive analytics are best used when data-driven decision-
making goes beyond human capabilities, such as when there are too many input
variables, or data volumes are high. A business case will help identify whether machine-
generated recommendations are appropriate and trustworthy.
• Define rules: Prescriptive analytics require rules to be codified that can be applied to
generate recommendations. Business rules thus need to be identified and actions defined
for each possible outcome. Rules are decisions that are programmatically implemented
in software. The system receives and analyzes data, then prescribes the next best course
of action based on predetermined parameters. Prescriptive models can be very complex
to implement. Appropriate analytic techniques need to be applied to ensure that all
possible outcomes are considered to prevent missteps. This includes the application of
optimization and other analytic techniques in conjunction with rules management.
• Test, Test, Test: As the intent of prescriptive analytics is to automate the decision-
making process, testing the models to ensure that they are providing meaningful
recommendations is imperative to prevent costly mistakes.
Statistical Learning
What is Statistical Learning ?
Y = f(X) + ε
Why Estimate f ?
Prediction
Ŷ = f̂(X)
E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)
The squared term is the reducible error; Var(ε) is the irreducible error.
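A one-line expansion shows where this decomposition comes from, assuming X and f̂ are held fixed so the expectation is over ε alone, with E[ε] = 0:

```latex
\begin{aligned}
E(Y-\hat{Y})^2 &= E\big[f(X)+\epsilon-\hat{f}(X)\big]^2\\
&= \big[f(X)-\hat{f}(X)\big]^2 + 2\big[f(X)-\hat{f}(X)\big]\,E[\epsilon] + E[\epsilon^2]\\
&= \big[f(X)-\hat{f}(X)\big]^2 + \operatorname{Var}(\epsilon)
\end{aligned}
```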
Inference
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a
linear equation, or is the relationship more complicated?
Statistical Learning
How do we estimate f ?
Parametric Methods
Make an assumption about the functional form, or shape, of f.
f(X) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ
Y ≈ β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ
Statistical Learning
Linear Regression
- Linear Regression: one of the most fundamental machine learning methods. It covers
least squares estimation, model assumptions, and performance metrics like R-squared.
• Dependent Variable
• Independent Variables
• Regression Line
• Regression Equation
• Coefficients
• Intercept
• Residuals
PhD. Quang-Phuoc Tran
Ho Chi Minh City Industry 4.0 technologies in Mechanical Engineering
University of Technology
minimize (1/n) Σᵢ (yᵢ − ŷᵢ)²
Linear Regression
+ Linearity
+ Independence
+ Homoscedasticity (constant error variance)
+ Normality
Residual = Observed value − Predicted value
eᵢ = yᵢ − ŷᵢ
RSS = e₁² + e₂² + ⋯ + eₙ²
Linear Regression
θ₂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
θ₁ = ȳ − θ₂ x̄
where
x̄ ≡ (1/n) Σᵢ xᵢ   and   ȳ ≡ (1/n) Σᵢ yᵢ
Linear Regression
Example: We have the following dataset with 12 total observations:

xi    yi    (xi − x̄)   (yi − ȳ)   (xi − x̄)(yi − ȳ)   (xi − x̄)²
8     41    −10.58     −2.67      28.22              112.01
12    42    −6.58      −1.67      10.97              43.34
12    39    −6.58      −4.67      30.72              43.34
13    37    −5.58      −6.67      37.22              31.17
14    35    −4.58      −8.67      39.72              21.01
16    39    −2.58      −4.67      12.06              6.67
17    45    −1.58      1.33       −2.11              2.51
22    46    3.42       2.33       7.97               11.67
24    39    5.42       −4.67      −25.28             29.34
26    49    7.42       5.33       39.56              55.01
29    55    10.42      11.33      118.06             108.51
30    57    11.42      13.33      152.22             130.34
Totals                            449.33             594.92

Here x̄ = 223/12 ≈ 18.58 and ȳ = 524/12 ≈ 43.67 (deviations are rounded to two decimals; the totals use the unrounded values).

The slope: θ₂ = 449.33 / 594.92 ≈ 0.7553
The intercept: θ₁ = ȳ − θ₂x̄ ≈ 43.67 − 0.7553 × 18.58 ≈ 29.63
Fitted line: ŷ = θ₁ + θ₂x = 29.63 + 0.7553x
RSE = √(RSS / (n − 2)) = √((1 / (n − 2)) Σᵢ (yᵢ − ŷᵢ)²)

RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

R-square:
R² = (TSS − RSS) / TSS = 1 − RSS / TSS,   where TSS = Σᵢ (yᵢ − ȳ)²
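A minimal NumPy check of the worked example and the fit statistics above, using the 12 observations listed:

```python
import numpy as np

x = np.array([8, 12, 12, 13, 14, 16, 17, 22, 24, 26, 29, 30], dtype=float)
y = np.array([41, 42, 39, 37, 35, 39, 45, 46, 39, 49, 55, 57], dtype=float)

# Closed-form least squares estimates
theta2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta1 = y.mean() - theta2 * x.mean()
print(f"y_hat = {theta1:.2f} + {theta2:.4f} x")  # y_hat = 29.63 + 0.7553 x

# Residual standard error and R-square
y_hat = theta1 + theta2 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (len(x) - 2))
r2 = 1 - rss / tss
print(f"RSE = {rse:.2f}, R^2 = {r2:.3f}")  # RSE ≈ 4.44, R^2 ≈ 0.632
```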
Ex 1:
Suppose we have a data set with five predictors: X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
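Part (b) is a direct plug-in; a quick Python check of the arithmetic:

```python
# Salary is in thousands of dollars; female => gender = 1
gpa, iq, gender = 4.0, 110, 1
salary = 50 + 20*gpa + 0.07*iq + 35*gender + 0.01*(gpa*iq) - 10*(gpa*gender)
print(salary)  # 137.1, i.e. a predicted starting salary of $137,100
```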
Ex 2:
A scientist is studying the relationship between the amount of fertilizer used on a crop and
the resulting yield. They conduct an experiment using different amounts of fertilizer (in
kilograms per hectare) and measure the corresponding crop yield (in tonnes per hectare).
The data collected is as follows:
Fertilizer (kg/ha)   Yield (tonnes/ha)
1                    2
2                    3
3                    5
4                    4
5                    6

The scientist wants to model this relationship using a simple linear regression model:
Yield = β₀ + β₁ * Fertilizer + ε
where:
+ Yield is the crop yield (in tonnes per hectare)
+ Fertilizer is the amount of fertilizer used (in kg/ha)
+ β₀ and β₁ are the coefficients to be estimated
+ ε is the error term
1.Write out the cost function (also known as the sum of squared errors) that needs to be minimized. Express this
function in terms of β₀, β₁, Fertilizerᵢ, and Yieldᵢ, where i represents the data point index (i=1 to 5 in this case).
2.Derive the normal equations by taking the partial derivatives of the cost function with respect to β₀ and β₁, and
setting them equal to zero. This will give you two equations that can be solved simultaneously to find the
optimal values of β₀ and β₁.
3.Calculate the following sums from the provided data:
1. Σ Fertilizerᵢ
2. Σ Yieldᵢ
3. Σ (Fertilizerᵢ)²
4. Σ (Fertilizerᵢ * Yieldᵢ)
4.Use the sums calculated in step 3 and the normal equations derived in step 2 to calculate the least squares
estimates for β₀ and β₁.
5.Write out the final linear equation with the calculated values of β₀ and β₁.
6.Interpret the meaning of β₁. What does it tell you about the relationship between fertilizer and yield? For
example, what is the predicted increase in yield for every additional kg/ha of fertilizer used?
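A NumPy sketch of tasks 3 and 4 (computing the sums and solving the normal equations); for this data the estimates come out to β₁ = 0.9 and β₀ = 1.3:

```python
import numpy as np

fert = np.array([1, 2, 3, 4, 5], dtype=float)
crop = np.array([2, 3, 5, 4, 6], dtype=float)
n = len(fert)

# Task 3: the four sums
s_x, s_y = fert.sum(), crop.sum()                    # 15, 20
s_xx, s_xy = (fert ** 2).sum(), (fert * crop).sum()  # 55, 69

# Task 4: solve the normal equations
#   n*b0   + s_x*b1  = s_y
#   s_x*b0 + s_xx*b1 = s_xy
b1 = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)  # 0.9
b0 = (s_y - b1 * s_x) / n                            # 1.3
print(f"Yield = {b0} + {b1} * Fertilizer")
```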
Example:
Let's say you want to predict a house's sale price (Y) based on its size (X₁ in square feet), the number of bedrooms (X₂), and
the age of the house (X₃ in years). You collect data on several houses and fit a multiple linear regression model.
Suppose you obtain the following estimated coefficients:
•β₀ = 50,000
•β₁ = 200
•β₂ = 10,000
•β₃ = -500
The resulting regression equation would be:
Price = 50,000 + 200·X₁ + 10,000·X₂ − 500·X₃
For example, a 2,000-square-foot house with 3 bedrooms that is 20 years old has a predicted price of 50,000 + 200(2,000) + 10,000(3) − 500(20) = $470,000.
Exercise 1:
A data scientist is building a model to predict apartment rental prices (in dollars) based on the size of the apartment
(in square feet) and its distance from downtown (in miles). They collect data on four apartments:
Use the least squares method and matrix notation to find the coefficients of the multiple linear regression model:
Rent = β₀ + β₁ * Size + β₂ * Distance
Solution:
β = (XᵀX)⁻¹XᵀY
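Since the apartment table was not reproduced here, the numbers below are hypothetical placeholders, not the exercise's data; the sketch only illustrates the matrix solution β = (XᵀX)⁻¹XᵀY:

```python
import numpy as np

# Hypothetical placeholder data: size (sq ft), distance (miles), rent ($)
size = np.array([600, 800, 1000, 1200], dtype=float)
dist = np.array([5.0, 3.0, 4.0, 1.0])
rent = np.array([1100, 1450, 1500, 1900], dtype=float)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(size), size, dist])
Y = rent

# Normal-equation solution (np.linalg.lstsq is more stable in practice)
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)  # [b0, b1 (size), b2 (distance)]
```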
Exercise 2:
An engineer is studying the effects of cutting speed (V), feed rate (S), and cutting time (T) on three important output variables in a drilling process: hole temperature (Temp), thrust force (Force), and surface roughness (Roughness). They conduct a series of experiments and collect the following data:
The engineer wants to build three separate multiple linear regression models:
Temp = β₀ + β₁V + β₂S + β₃T
Force = β₀ + β₁V + β₂S + β₃T
Roughness = β₀ + β₁V + β₂S + β₃T
Tasks:
1.Using the least squares method and matrix notation, determine the coefficients (β₀, β₁, β₂, β₃) for the chosen
model. Clearly show the following steps:
1. Set up the matrices Y and X.
2. Calculate Xᵀ (X transpose).
3. Calculate XᵀX.
4. Calculate (XᵀX)⁻¹ (You'll need a calculator or software for this).
5. Calculate XᵀY.
6. Calculate β = (XᵀX)⁻¹XᵀY.
2.Write out the final regression equation for your chosen output variable.
3.Using your model, predict the output variable when V = 62 m/min, S = 0.16 mm/rev, and T = 2.2 min.
4.(Critical Thinking): Based on the coefficients you obtained, discuss the relationship between the input variables (V, S,
T) and your chosen output variable. Which input variable seems to have the strongest effect on the output? (This doesn't
require further calculations, just interpretation of the results.)
Polynomial Regression
y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε
where:
•y is the dependent/outcome variable.
•x is the independent/predictor variable.
•β₀, β₁, β₂, ..., βₙ are the coefficients to be estimated.
•n is the degree of the polynomial (e.g., n=2 for quadratic, n=3 for cubic).
•ε is the error term.
Example:
y = β₀ + β₁x + β₂x² + ε
• Creating a new predictor variable: calculate x² for each data point.

Temperature (x)   Yield (y)
100               50
120               65
140               70
160               65
180               50

• Setting up the design matrix:
X = [[1, 100, 10000],
     [1, 120, 14400],
     [1, 140, 19600],
     [1, 160, 25600],
     [1, 180, 32400]]
• Solving for β: β₀ = −175, β₁ = 3.5, β₂ = −0.0125 (this quadratic passes through all five points exactly).
• Final equation: y = −175 + 3.5x − 0.0125x²
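A quick NumPy check of the quadratic fit above:

```python
import numpy as np

x = np.array([100, 120, 140, 160, 180], dtype=float)
y = np.array([50, 65, 70, 65, 50], dtype=float)

# np.polyfit returns the highest-degree coefficient first
b2, b1, b0 = np.polyfit(x, y, deg=2)
print(b0, b1, b2)                   # -175.0  3.5  -0.0125
print(np.polyval([b2, b1, b0], x))  # reproduces y exactly: [50. 65. 70. 65. 50.]
```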
Exercise:
A machinist is analyzing the relationship between cutting speed (V) and the resulting thrust force (Force) in a milling operation. They collect the following data:

V (m/min)   Force (N)
20          100
30          120
40          130
50          120
60          100

The machinist suspects a non-linear relationship and wants to fit a second-degree polynomial (quadratic) regression model:
Force = β₀ + β₁V + β₂V² + ε

Tasks:
1. Create the necessary matrices (Y and X) for polynomial regression. Remember to include the intercept term and the squared term (V²) in the design matrix X.
2. Calculate Xᵀ (X transpose).
3. Calculate XᵀX.
4. Calculate (XᵀX)⁻¹ (the inverse of XᵀX). You'll likely need a calculator or software for this step.
5. Calculate XᵀY.
6. Calculate the coefficients β = (XᵀX)⁻¹XᵀY.
7. Write out the final fitted quadratic equation for Force.
8. Predict the Force when V = 45 m/min.
9. (Critical Thinking): Does a quadratic model seem appropriate for this data? Explain your reasoning.
Logistic Regression
Mapping to Probability
logit(p) = ln(p / (1-p))
Sigmoid Function:
Sigmoid(x) = P(Y = 1 | X) = 1 / (1 + e^(−(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ)))
Where:
•P(Y=1|X) is the probability of the dependent variable Y belonging to category 1, given the
independent variables X.
•X₁, X₂, ..., Xₙ are the independent variables (features).
•β₀, β₁, β₂, ..., βₙ are the coefficients learned by the model during training. These coefficients
represent the weights assigned to each feature.
•exp() is the exponential function.
log(p / (1 − p)) = mx + b
p / (1 − p) = 1 / (1 − p) − 1 = e^(mx + b)
1 / (1 − p) = e^(mx + b) + 1
p = 1 − 1 / (e^(mx + b) + 1) = e^(mx + b) / (e^(mx + b) + 1)
p = 1 / (1 + e^(−(mx + b)))
Loss function
Apply MLE (Maximum Likelihood Estimation):
L = ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)
Taking the negative log-likelihood gives the cross-entropy loss:
Loss = − Σᵢ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
Example:
We want to predict drill bit breakage (Yes/No) based on Drill Speed, Feed Rate,
Depth of Cut, and Material Hardness.
Drill Speed (RPM)   Bit Breakage
1500                No
1800                No
2000                No
2200                Yes
2500                Yes
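A minimal scikit-learn sketch fitting a logistic regression to this example; since only the drill-speed column was reproduced above, the other three features are omitted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

speed = np.array([[1500], [1800], [2000], [2200], [2500]], dtype=float)
breakage = np.array([0, 0, 0, 1, 1])  # No = 0, Yes = 1

model = LogisticRegression().fit(speed, breakage)

# Predicted probability of breakage at a new spindle speed
p = model.predict_proba([[2100.0]])[0, 1]
print(f"P(breakage | 2100 RPM) = {p:.2f}")
```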
• Quadratic Discriminant Analysis (QDA): Similar to LDA, but allows for non-
linear decision boundaries.
5. Data Visualization:
Common data visualization formats: column chart, bar chart, line graph, dual-axis chart, Mekko chart, pie chart, bubble chart, domain chart, scatter plot, heat map, area chart, and more.
Visualize the results of the data analysis using tools such as Apache Zeppelin, Tableau, or Power BI.
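A minimal matplotlib sketch as a code-based alternative to the BI tools above (the two production lines and their defect rates are hypothetical):

```python
import matplotlib.pyplot as plt

days = list(range(1, 8))
line_a = [2.1, 1.8, 2.4, 2.0, 1.6, 1.9, 1.7]  # hypothetical defect rates (%)
line_b = [3.0, 2.7, 2.9, 3.2, 2.8, 2.6, 2.5]

plt.plot(days, line_a, marker="o", label="Line A")
plt.plot(days, line_b, marker="s", label="Line B")
plt.xlabel("Day")
plt.ylabel("Defect rate (%)")
plt.title("Daily defect rate by production line")
plt.legend()
plt.show()
```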
6. Data Management:
- Manage the data using tools such as Apache HBase or Apache Cassandra
to ensure data consistency, reliability, and security.