Big Data

The document discusses the integration of Industry 4.0 technologies in Mechanical Engineering, focusing on data science and big data applications. It outlines the processes of data collection, integration, processing, analysis, and the different types of analytics including descriptive, diagnostic, predictive, and prescriptive analytics. Additionally, it highlights the significance of big data characteristics, particularly in the context of the COVID-19 pandemic, and the role of statistical learning in deriving insights from data.


Ho Chi Minh City University of Technology
Course: Industry 4.0 Technologies in Mechanical Engineering
PhD. Quang-Phuoc Tran

Data Science

Data science is the process of examining data sets to draw conclusions from the information they contain, increasingly with the aid of specialized systems and software, using scientific techniques, models, theories, and hypotheses. These three pillars have been the mainstay of data science since businesses began embracing it over the past two decades, and they should remain so in the future.

Data science is an idea accepted in both academia and industry. It is an intersection of programming, analytical, and business skills that allows meaningful insights to be extracted from data to support business growth. It is also used in social research, scientific and space programs, government planning, and so on.


The Data Science model rests on three pillars:

- Methods, Models, and Processes: industry- and academia-proven practices that form the backbone of data science, including mathematical models, theorems, statistical methods and techniques, and process methodologies such as CRISP-DM, Six Sigma, and Lean.

- Computer Science & IT practice: the full range of hardware and software involved in providing computing for processing data, storage for storing and sharing data, and networking for collecting and moving data.

- Business Acumen: in its purest form, the ability to run a business enterprise. Any business exists to sell its products or services for a profit, incurring some cost, and generally has functions such as HR, supply chain, finance, and sales & marketing to support it.

Big data
Definition

Big Data is defined as a set of tools and platforms used to store, process, and analyze data in order to identify business insights that were not possible within the limitations of traditional data processing and management technologies. Big Data is also viewed as a technology for processing huge datasets on distributed, scalable platforms.

Big Data applications in Industry 4.0: data that cannot be stored and processed on commodity hardware, typically larger than one terabyte, is called Big Data. Existing commodity hardware is limited to roughly one terabyte, which constrains the processing and storage of such data.

The 5 Vs of Big Data (figure)


The 7 Vs of Big Data:
- Volume: the amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
- Velocity: the speed at which data is generated and grows. Today, yesterday's data is already considered stale, and social media is a major contributor to this growth.
- Variety: the heterogeneity of data types. The data collected comes in many formats such as video, audio, CSV, and so on, so these different formats represent many types of data.
- Veracity: the doubtfulness or uncertainty of the available data due to inconsistency and incompleteness. Available data can be messy and hard to trust, and with many forms of big data, quality and accuracy are difficult to control. Volume is often the reason behind the lack of quality and accuracy.
- Validity: the validity of data patterns, which is essential in business for planning strategies.
- Virality: the spread of data, generally used to measure its reach.
- Value: access to big data is of little use unless it can be turned into value.

Sample:
The characteristics of Big Data for the coronavirus pandemic are mapped below.
+ Volume: a huge volume of data is generated every hour related to affected patients, illness conditions, precautionary measures, diagnoses, and hospital facilities.
+ Velocity: information about affected people and the ill effects of COVID-19 is streaming in nature and evolves dynamically.
+ Variety: a huge volume of COVID-19 data is accumulated as structured data in patient databases, citizen demographics, clinical diagnoses, travel data, genomic studies, and drug targets. Unstructured COVID-19 data is voluminous on social media platforms such as Twitter, Facebook, and WhatsApp, where preventive measures are shared as text, audio, video, and related chats.
+ Veracity and Virality: the information on preventive cures shared on social media platforms is inconsistent and viral, leading to uncertainty among people.
+ Validity and Value: measuring the validity and the value of the content available online about the pandemic has become a challenge.


To create big data for a manufacturing process, you can follow these steps:
1. Data Collection: collect data from various sources such as sensors, machines, production systems, and databases. This can include production data, machine performance data, quality control data, and supply chain data.
2. Data Integration: integrate the data from the different sources into a centralized repository such as a Hadoop cluster or a data lake. When using big data in production, the integration process is critical to ensure that the data is correctly formatted and usable by big data technologies (see the sketch after this list):
Step 1: Collect data from different sources
Step 2: Standardize data
Step 3: Integrate data
Step 4: Ensure data consistency and format
Step 5: Store data
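A minimal sketch of these collection-to-storage steps in Python with pandas, purely for illustration: the file names, the machine_id key, and the column names below are assumptions, not part of the course material.

```python
import pandas as pd

# Step 1: collect data exported from hypothetical sources (file names are illustrative)
sensors = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
quality = pd.read_csv("quality_control.csv", parse_dates=["timestamp"])
orders = pd.read_csv("erp_orders.csv", parse_dates=["timestamp"])

# Step 2: standardize units and column names so the sources line up
sensors = sensors.rename(columns={"temp_C": "temperature_c"})
quality["defect"] = quality["defect"].astype(bool)

# Step 3: integrate on machine id and nearest timestamp
merged = pd.merge_asof(
    sensors.sort_values("timestamp"),
    quality.sort_values("timestamp"),
    on="timestamp", by="machine_id", direction="nearest",
)
merged = merged.merge(orders, on=["machine_id", "timestamp"], how="left")

# Step 4: enforce a consistent schema and drop obviously incomplete rows
merged = merged.dropna(subset=["machine_id", "temperature_c"])

# Step 5: store the integrated dataset (a data lake would typically use Parquet)
merged.to_parquet("integrated_production_data.parquet", index=False)
```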

3. Data Processing:

- Data processing converts source data into useful and reliable information. It includes activities such as collecting, storing, organizing, classifying, calculating, analyzing, and presenting information in reports, charts, or other formats.

- Data processing can be performed using various means and tools, including data processing software, database systems, data query tools, and algorithms and techniques for processing different kinds of data. Its purpose is to help managers, researchers, and other organizations find useful information in data and make the right decisions.

- Use big data technologies such as Apache Spark or Apache Flink to process and clean the data and to identify patterns and anomalies.
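As a hedged illustration of this step, a small PySpark sketch that cleans the integrated dataset and flags simple anomalies; the file path, column names, and 3-sigma rule are assumptions made for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("manufacturing-processing").getOrCreate()

# Load the integrated dataset (path and columns are illustrative)
df = spark.read.parquet("integrated_production_data.parquet")

# Clean: drop duplicates and rows missing key fields
df = df.dropDuplicates().na.drop(subset=["machine_id", "temperature_c"])

# Identify simple anomalies: readings more than 3 standard deviations from the mean
stats = df.agg(F.mean("temperature_c").alias("mu"),
               F.stddev("temperature_c").alias("sigma")).first()
df = df.withColumn(
    "temp_anomaly",
    F.abs(F.col("temperature_c") - F.lit(stats["mu"])) > 3 * F.lit(stats["sigma"]),
)

# Pattern summary: anomaly counts per machine
df.groupBy("machine_id").agg(
    F.sum(F.col("temp_anomaly").cast("int")).alias("anomalies")
).show()
```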

4. Data Analysis:

Data analysis is the process of examining, cleaning, transforming, and modeling data with the purpose of discovering useful information, drawing conclusions, and supporting decision making. It uses analytical and logical reasoning to interpret the information obtained from data.

Data analysis can be applied to many different fields such as business, science, health, education, and politics. Each field has its own goals and methods of analysis, but in general, data analysis aims to solve specific problems using existing or newly collected data.


Analyze the data using tools such as Apache Hive, Apache Impala, or Apache Drill to gain insights into the manufacturing process and make data-driven decisions.

Data analysis methods:
+ Statistical analysis
+ Regression analysis
+ Classification analysis
+ Data mining analysis
+ Machine learning


Statistical Analysis

The Analytics Advancement Model helps define, identify, and illustrate what these types of analysis mean.

In this model, we can visualize the four possible types of analysis and position them in terms of the complexity of the analysis and the volume of the analysis. Volume here means how often the analysis is performed. There is no apparent relationship between volume and complexity.

(Figure: Analytics advancement model.)



Descriptive Analytics

• Define business metrics: Determine which metrics are important for evaluating performance against business goals. Goals include increasing revenue, reducing costs, improving operational efficiency, and measuring productivity. Each goal must have associated key performance indicators (KPIs) to help monitor achievement.

• Identify data required: Data are located in many different sources within the enterprise, including systems of record, databases, desktops, and shadow IT repositories. To measure data accurately against KPIs, companies must catalog and prepare the correct data sources to extract the needed data and calculate metrics based on the current state of the business.


• Extract and prepare data: Data must be prepared for analysis. Deduplication,
transformation and cleansing are a few examples of the data preparation steps that need to
occur before analysis. This is often the most time-consuming and labor-intensive step,
requiring up to 80% of an analyst’s time, but it is critical for ensuring accuracy.

• Analyze data: Data analysts can create models and run analyses such as summary
statistics, clustering and regression analysis on the data to determine patterns and measure
performance. Key metrics are calculated and compared with stated business goals to
evaluate performance based on historical results. Data scientists often use open source tools
such as R and Python to programmatically analyze and visualize data.

• Present data: Results of the analytics are usually presented to stakeholders in the form of
charts and graphs. This is where data visualization comes into play. Business intelligence
tools give users the ability to present data visually in a way that non-data analysts can
understand. Many self-service data visualization tools also enable business users to create
their own visualizations and manipulate the output.
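A minimal sketch of the analyze-and-present steps in Python with pandas and matplotlib; the sales table, the profit-margin KPI, and the 25% goal are made up purely for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical historical data (in practice this comes from the prepared sources)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East", "East"],
    "revenue": [120_000, 135_000, 90_000, 95_000, 110_000, 99_000],
    "cost":    [80_000, 88_000, 70_000, 72_000, 85_000, 81_000],
})

# Summary statistics describe what has happened
print(sales[["revenue", "cost"]].describe())

# Example KPI: profit margin per region, compared against a stated goal of 25%
kpi = sales.groupby("region").sum(numeric_only=True)
kpi["profit_margin"] = (kpi["revenue"] - kpi["cost"]) / kpi["revenue"]
print(kpi)

# Present the results visually for non-analyst stakeholders
kpi["profit_margin"].plot(kind="bar", title="Profit margin by region (goal: 0.25)")
plt.axhline(0.25, linestyle="--")
plt.tight_layout()
plt.show()
```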

Diagnostic Analytics

At this stage, historical data can be measured against other data to answer the question of why something happened. Diagnostic analytics makes it possible to drill down, find dependencies, and identify patterns. Companies adopt diagnostic analytics because it gives deep insight into a particular problem. At the same time, a company should have detailed information at its disposal; otherwise, data collection may have to be repeated for every issue and become time-consuming.


Predictive Analytics

1. Define project: define the project outcomes, deliverables, the scope of the effort, and the business objectives, and identify the data sets to be used.
2. Data collection: data mining for predictive analytics prepares data from multiple sources for analysis.
3. Data analysis: the process of inspecting, cleaning, transforming, and modeling data with the objective of discovering useful information and arriving at conclusions.
4. Statistics: statistical analysis validates the assumptions and hypotheses and tests them using standard statistical models.

5. Modeling: predictive modeling provides the ability to automatically create accurate predictive models of the future, with options to choose the best solution through multi-model evaluation.
6. Deployment: predictive model deployment provides the option to deploy the analytical results into the everyday decision-making process, obtaining results, reports, and output by automating decisions based on the modeling.
7. Model monitoring: models are managed and monitored to review their performance and ensure they are providing the expected results.
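A minimal scikit-learn sketch of steps 5–7, assuming a hypothetical history of process settings and measured tool wear; the file name, feature names, and the wear threshold are illustrative, not taken from the course.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical historical data: process settings and measured tool wear
data = pd.read_csv("tool_wear_history.csv")          # illustrative file name
X = data[["speed", "feed_rate", "depth_of_cut"]]     # illustrative features
y = data["tool_wear"]

# Modeling: hold out data so candidate models can be compared fairly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Model monitoring: track error on held-out (and later, live) data
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Deployment: automate a decision based on the model's prediction
def should_replace_tool(speed, feed_rate, depth_of_cut, threshold=0.8):
    """Return True if predicted wear exceeds the agreed threshold."""
    x_new = pd.DataFrame([[speed, feed_rate, depth_of_cut]], columns=X.columns)
    return model.predict(x_new)[0] > threshold
```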

Prescriptive Analysis
• Build a business case: Prescriptive analytics are best used when data-driven decision-
making goes beyond human capabilities, such as when there are too many input
variables, or data volumes are high. A business case will help identify whether machine-
generated recommendations are appropriate and trustworthy.
• Define rules: Prescriptive analytics require rules to be codified that can be applied to
generate recommendations. Business rules thus need to be identified and actions defined
for each possible outcome. Rules are decisions that are programmatically implemented
in software. The system receives and analyzes data, then prescribes the next best course
of action based on predetermined parameters. Prescriptive models can be very complex
to implement. Appropriate analytic techniques need to be applied to ensure that all
possible outcomes are considered to prevent missteps. This includes the application of
optimization and other analytic techniques in conjunction with rules management.
• Test, Test, Test: As the intent of prescriptive analytics is to automate the decision-
making process, testing the models to ensure that they are providing meaningful
recommendations is imperative to prevent costly mistakes.

Statistical Learning
What is Statistical Learning?
Y = f(X) + ε
Why estimate f?
Prediction
Ŷ = f̂(X)
E[(Y − Ŷ)²] = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)
The first term, [f(X) − f̂(X)]², is the reducible error; Var(ε) is the irreducible error.
Inference
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a
linear equation, or is the relationship more complicated?

Statistical Learning
How do we estimate f ?
Parametric Methods
Make an assumption about the functional form, or shape, of f:

f(X) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ

We want to find values of these parameters such that

Y ≈ β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ


Statistical Learning
Linear Regression
- Linear Regression: one of the most fundamental machine learning methods, covering least squares estimation, model assumptions, and performance metrics like R-squared.

Simple linear regression: y = α + βx + ε
Cubic polynomial regression: y = β₀ + β₁x + β₂x² + β₃x³ + ε



Several key concepts :

• Dependent Variable

• Independent Variables

• Regression Line

• Regression Equation

• Coefficients

• Intercept

• Residuals

Types of Regression Analysis

Linear Regression

How do we update the θ₁ and θ₂ values to get the best-fit line?

minimize (1/n) Σᵢ (yᵢ − ŷᵢ)²


Linear Regression

Assumptions:
+ Linearity
+ Independence
+ Homoscedasticity (constant variance of the errors)


Linear Regression

+ Normality
Residual = Observed value − Predicted value
eᵢ = yᵢ − ŷᵢ

Residual sum of squares (RSS):
RSS = e₁² + e₂² + ⋯ + eₙ²
RSS = (y₁ − θ₁ − θ₂x₁)² + (y₂ − θ₁ − θ₂x₂)² + ⋯ + (yₙ − θ₁ − θ₂xₙ)²


Linear Regression

The least squares approach chooses θ₁ and θ₂ to minimize the RSS:

θ₂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²

θ₁ = ȳ − θ₂x̄

where x̄ ≡ (1/n) Σᵢ xᵢ and ȳ ≡ (1/n) Σᵢ yᵢ
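A minimal NumPy sketch of these two formulas; the x and y values below are made up purely for illustration.

```python
import numpy as np

def simple_least_squares(x, y):
    """Closed-form least squares estimates for y = theta1 + theta2 * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    theta2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    theta1 = y_bar - theta2 * x_bar                                        # intercept
    return theta1, theta2

# Illustrative data only
x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.8, 4.2, 6.0]
theta1, theta2 = simple_least_squares(x, y)
print(f"fitted line: y = {theta1:.3f} + {theta2:.3f} x")
```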


Linear Regression
Example: We have the following dataset with 12 observations.

Calculate the means of X and Y:

+ Mean of X: x̄ = (8 + 12 + 12 + 13 + 14 + 16 + 17 + 22 + 24 + 26 + 29 + 30) / 12 = 19.25
+ Mean of Y: ȳ = (41 + 42 + 39 + 37 + 35 + 39 + 45 + 46 + 39 + 49 + 55 + 57) / 12 = 43.25


Linear Regression

xi     yi     (xi − x̄)   (yi − ȳ)   (xi − x̄)(yi − ȳ)   (xi − x̄)²
8      41     −11.25     −2.25        25.3125           126.5625
12     42     −7.25      −1.25         9.0625            52.5625
12     39     −7.25      −4.25        30.8125            52.5625
13     37     −6.25      −6.25        39.0625            39.0625
14     35     −5.25      −8.25        43.3125            27.5625
16     39     −3.25      −4.25        13.8125            10.5625
17     45     −2.25       1.75        −3.9375             5.0625
22     46      2.75       2.75         7.5625             7.5625
24     39      4.75      −4.25       −20.1875            22.5625
26     49      6.75       5.75        38.8125            45.5625
29     55      9.75      11.75       114.5625            95.0625
30     57     10.75      13.75       147.8125           115.5625
Totals                               398.625            544.75

The slope is: θ₂ = 398.625 / 544.75 ≈ 0.732
The intercept is: θ₁ = ȳ − θ₂x̄ = 43.25 − 0.732 × 19.25 ≈ 29.16

y = θ₁ + θ₂x

Residual standard error

RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) Σᵢ (yᵢ − ŷᵢ)² )

RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

R-square

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,   where TSS = Σᵢ (yᵢ − ȳ)²
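A short, self-contained NumPy sketch of RSE and R², reusing the illustrative data from the least-squares sketch above.

```python
import numpy as np

# Illustrative data and fit (same made-up values as the earlier sketch)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 2.9, 4.8, 4.2, 6.0])
theta2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta1 = y.mean() - theta2 * x.mean()
y_hat = theta1 + theta2 * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rse = np.sqrt(rss / (len(y) - 2))     # residual standard error
r2 = 1 - rss / tss                    # R-squared
print(f"RSE = {rse:.3f}, R^2 = {r2:.3f}")
```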


Ex 1 :
Suppose we have a data set with five predictors: X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = interaction between GPA and IQ, and X5 = interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.

(a) Which answer is correct, and why?


i. For a fixed value of IQ and GPA, males earn more on average than females.
ii. For a fixed value of IQ and GPA, females earn more on average than males.
iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high
enough.
iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high
enough.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence
of an interaction effect. Justify your answer.

Ex 2:
A scientist is studying the relationship between the amount of fertilizer used on a crop and the resulting yield. They conduct an experiment using different amounts of fertilizer (in kilograms per hectare) and measure the corresponding crop yield (in tonnes per hectare). The data collected is as follows:

Fertilizer (kg/ha)   Yield (tonnes/ha)
1                    2
2                    3
3                    5
4                    4
5                    6

The scientist wants to model this relationship using a simple linear regression model:

Yield = β₀ + β₁ * Fertilizer + ε

where:
+ Yield is the crop yield (in tonnes per hectare)
+ Fertilizer is the amount of fertilizer used (in kg/ha)
+ β₀ and β₁ are the coefficients to be estimated
+ ε is the error term


Using the Least Squares method:

1.Write out the cost function (also known as the sum of squared errors) that needs to be minimized. Express this
function in terms of β₀, β₁, Fertilizerᵢ, and Yieldᵢ, where i represents the data point index (i=1 to 5 in this case).
2.Derive the normal equations by taking the partial derivatives of the cost function with respect to β₀ and β₁, and
setting them equal to zero. This will give you two equations that can be solved simultaneously to find the
optimal values of β₀ and β₁.
3.Calculate the following sums from the provided data:
1. Σ Fertilizerᵢ
2. Σ Yieldᵢ
3. Σ (Fertilizerᵢ)²
4. Σ (Fertilizerᵢ * Yieldᵢ)
4.Use the sums calculated in step 3 and the normal equations derived in step 2 to calculate the least squares
estimates for β₀ and β₁.
5.Write out the final linear equation with the calculated values of β₀ and β₁.
6.Interpret the meaning of β₁. What does it tell you about the relationship between fertilizer and yield? For
example, what is the predicted increase in yield for every additional kg/ha of fertilizer used?


Multiple Linear Regression

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ + ε
Where : +Y is the dependent/outcome variable (what you're predicting).
+X₁, X₂, ..., Xₙ are the independent/predictor variables (features).
+β₀ is the intercept (the value of Y when all Xs are zero).
+ β₁, β₂, ..., βₙ are the regression coefficients
+ ε is the error term (representing the variability not explained by the model).
Formulating the cost function (SSE):
SSE = Σᵢ (Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₙXₙᵢ))²
Minimizing the SSE gives the least squares solution (the normal equations):
β = (XᵀX)⁻¹XᵀY

Where: + β is the vector of coefficients.


+ X is the matrix of predictor variables (including a column of 1s for the intercept).
+ Y is the vector of the outcome variable.
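A minimal NumPy sketch of β = (XᵀX)⁻¹XᵀY with made-up numbers; in practice np.linalg.lstsq or a library such as scikit-learn or statsmodels is preferred for numerical stability.

```python
import numpy as np

# Illustrative data: two predictors and an outcome (values are made up)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

# Design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones_like(X1), X1, X2])

# Normal equations: beta = (X^T X)^(-1) X^T Y
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print("beta (intercept, b1, b2):", beta)

# Predict the outcome for a new observation (illustrative values)
x_new = np.array([1.0, 2.5, 3.0])
print("prediction:", x_new @ beta)
```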


Multiple Linear Regression

Example:
Let's say you want to predict a house's sale price (Y) based on its size (X₁ in square feet), the number of bedrooms (X₂), and
the age of the house (X₃ in years). You collect data on several houses and fit a multiple linear regression model.
Suppose you obtain the following estimated coefficients:
•β₀ = 50,000
•β₁ = 200
•β₂ = 10,000
•β₃ = -500
The resulting regression equation would be:

Sale Price = 50,000 + 200*(Size) + 10,000*(Bedrooms) - 500*(Age)


Exercise 1:
A data scientist is building a model to predict apartment rental prices (in dollars) based on the size of the apartment
(in square feet) and its distance from downtown (in miles). They collect data on four apartments:

Apartment   Size (sq ft)   Distance (miles)   Rent ($)
1           600            2                  1200
2           800            5                  1400
3           700            3                  1300
4           900            1                  1600

Use the least squares method and matrix notation to find the coefficients of the multiple linear regression model:
Rent = β₀ + β₁ * Size + β₂ * Distance

Solution:
β = (XᵀX)⁻¹XᵀY


Exercise 2:
An engineer is studying the effects of cutting speed (V), feed rate (S), and cutting time (T) on three important output variables in a drilling process: hole temperature (Temp), thrust force (Force), and surface roughness (Roughness). They conduct a series of experiments and collect the following data:

Experiment V (m/min) S (mm/rev) T (min) Temp (°C) Force (N) Roughness(μm)


1 50 0.1 1 80 1000 2.5
2 60 0.15 1.5 90 1200 2.8
3 55 0.12 2 85 1100 2.6
4 65 0.18 2.5 95 1300 3.0
5 70 0.2 3 100 1400 3.2

The engineer wants to build three separate multiple linear regression models:
Temp = β₀ + β₁V + β₂S + β₃T
Force = β₀ + β₁V + β₂S + β₃T
Roughness = β₀ + β₁V + β₂S + β₃T


Tasks:

1.Using the least squares method and matrix notation, determine the coefficients (β₀, β₁, β₂, β₃) for the chosen
model. Clearly show the following steps:
1. Set up the matrices Y and X.
2. Calculate Xᵀ (X transpose).
3. Calculate XᵀX.
4. Calculate (XᵀX)⁻¹ (You'll need a calculator or software for this).
5. Calculate XᵀY.
6. Calculate β = (XᵀX)⁻¹XᵀY.

2.Write out the final regression equation for your chosen output variable.

3.Using your model, predict the output variable when V = 62 m/min, S = 0.16 mm/rev, and T = 2.2 min.

4.(Critical Thinking): Based on the coefficients you obtained, discuss the relationship between the input variables (V, S,
T) and your chosen output variable. Which input variable seems to have the strongest effect on the output? (This doesn't
require further calculations, just interpretation of the results.)


Polynomial Regression
y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε
where:
•y is the dependent/outcome variable.
•x is the independent/predictor variable.
•β₀, β₁, β₂, ..., βₙ are the coefficients to be estimated.
•n is the degree of the polynomial (e.g., n=2 for quadratic, n=3 for cubic).
•ε is the error term.
Example:
y = β₀ + β₁x + β₂x² + ε

Temperature (x)   Yield (y)
100               50
120               65
140               70
160               65
180               50

• Creating the new predictor variable: calculate x² for each data point.
• Setting up the design matrix:
  X = [[1, 100, 10000],
       [1, 120, 14400],
       [1, 140, 19600],
       [1, 160, 25600],
       [1, 180, 32400]]
• Solving for β: β₀ = -175, β₁ = 3.5, and β₂ = -0.0125.
• Final equation: y = -175 + 3.5x - 0.0125x²
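A short NumPy check of this example; np.polyfit returns the coefficients from the highest degree down, and the explicit normal-equations form gives the same result.

```python
import numpy as np

x = np.array([100, 120, 140, 160, 180], dtype=float)
y = np.array([50, 65, 70, 65, 50], dtype=float)

# Fit y = b0 + b1*x + b2*x^2 by least squares
b2, b1, b0 = np.polyfit(x, y, deg=2)      # polyfit orders coefficients high -> low
print(f"y = {b0:.4g} + {b1:.4g} x + {b2:.4g} x^2")

# Equivalent normal-equations form with an explicit design matrix
X = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("beta:", beta)                       # same coefficients as above
```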

Exercise:
A machinist is analyzing the relationship between cutting speed (V) and the resulting thrust force (Force) in a milling operation. They collect the following data:

V (m/min)   Force (N)
20          100
30          120
40          130
50          120
60          100

The machinist suspects a non-linear relationship and wants to fit a second-degree polynomial (quadratic) regression model:

Force = β₀ + β₁V + β₂V² + ε

Tasks:
1.Create the necessary matrices (Y and X) for polynomial regression. Remember to include the intercept term and the squared term (V²) in the design matrix X.
2.Calculate Xᵀ (X transpose).
3.Calculate XᵀX.
4.Calculate (XᵀX)⁻¹ (the inverse of XᵀX). You'll likely need a calculator or software for this step.
5.Calculate XᵀY.
6.Calculate the coefficients β = (XᵀX)⁻¹XᵀY.
7.Write out the final fitted quadratic equation for Force.
8.Predict the Force when V = 45 m/min.
9.(Critical Thinking): Does a quadratic model seem appropriate for this data? Explain your reasoning.


Logistic Regression
Mapping to Probability
logit(p) = ln(p / (1-p))

+ p: the probability of a specific event occurring. This probability must be between 0 and 1 (inclusive).
+ (1 − p): the probability of the event not occurring.
+ p / (1 − p): the odds of the event, i.e. the ratio of the probability of the event happening to the probability of it not happening.
+ ln(): the natural logarithm (logarithm base e), the inverse of the exponential function eˣ.
+ logit(p): the logit function applied to the probability p. It transforms the probability into log-odds, the natural logarithm of the odds.

Logistic Regression
Sigmoid Function:

P(Y = 1 | X) = Sigmoid(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ) = 1 / (1 + e^(−(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ)))

Where:
•P(Y=1|X) is the probability of the dependent variable Y belonging to category 1, given the
independent variables X.
•X₁, X₂, ..., Xₙ are the independent variables (features).
•β₀, β₁, β₂, ..., βₙ are the coefficients learned by the model during training. These coefficients
represent the weights assigned to each feature.
•exp() is the exponential function.
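A minimal Python sketch of the logit and sigmoid functions; the coefficients and feature values are made up for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Illustrative coefficients and features: z = b0 + b1*x1 + b2*x2
b0, b1, b2 = -2.0, 0.8, 1.5
x1, x2 = 1.2, 0.5
z = b0 + b1 * x1 + b2 * x2
p = sigmoid(z)
print(f"P(Y=1|X) = {p:.3f}")
print(f"logit(p) recovers z: {logit(p):.3f}")
```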


Logistic regression equation:

log( p / (1 − p) ) = mx + b

p / (1 − p) = 1/(1 − p) − 1 = e^(mx + b)

1/(1 − p) = e^(mx + b) + 1

p = 1 − 1/(e^(mx + b) + 1) = e^(mx + b) / (e^(mx + b) + 1)

p = 1 / (e^(−(mx + b)) + 1)

Loss function
Apply MLE (maximum likelihood estimation):

p(data) = Πᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

log p = log( Πᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ) ) = Σᵢ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]

Maximizing log(p) is equivalent to minimizing −log(p).

Negative log likelihood:

− Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]
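A small NumPy illustration of the negative log likelihood for a set of 0/1 labels and predicted probabilities; the numbers are made up.

```python
import numpy as np

def negative_log_likelihood(y_true, p_pred, eps=1e-12):
    """Negative log likelihood (binary cross-entropy) for labels y and probabilities p."""
    p = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Illustrative labels and predicted probabilities
y = np.array([0, 0, 1, 1])
p_good = np.array([0.1, 0.2, 0.8, 0.9])        # confident and correct -> low loss
p_bad = np.array([0.6, 0.7, 0.4, 0.3])         # mostly wrong -> higher loss
print(negative_log_likelihood(y, p_good))
print(negative_log_likelihood(y, p_bad))
```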


Example:
We want to predict drill bit breakage (Yes/No) based on Drill Speed, Feed Rate, Depth of Cut, and Material Hardness.

Drill Speed (RPM)   Bit Breakage
1500                No
1800                No
2000                No
2200                Yes
2500                Yes

Find the coefficients of our logistic regression model.


1. Assume a model: let's assume a simple logistic regression model,
   logit(p) = β₀ + β₁ * Drill_Speed
   where p is the probability of breakage.
2. Likelihood function: for each observation, we calculate the probability predicted by our model:
   - If breakage occurred (Yes), the probability is p.
   - If no breakage occurred (No), the probability is (1 − p).
   The likelihood function is the product of these individual probabilities across all observations. The goal of MLE is to find the values of β₀ and β₁ that maximize this product. Intuitively, we're finding the model parameters that make the observed outcomes the "most probable."
3. Log-likelihood: instead of maximizing the product of probabilities (which can be computationally tricky), we typically maximize the log-likelihood. The logarithm turns the product into a sum, which is easier to work with mathematically.
4. Optimization: software like scikit-learn or statsmodels uses numerical optimization algorithms to find the β₀ and β₁ that maximize the log-likelihood. These optimal coefficients are our MLE estimates.
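A minimal scikit-learn sketch of this fitting procedure on the drill-speed data above. Note that scikit-learn applies L2 regularization by default, so the fitted coefficients only approximate the plain MLE solution; treat the output as illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Drill speed (RPM) and breakage labels from the example above (Yes = 1, No = 0)
X = np.array([[1500], [1800], [2000], [2200], [2500]])
y = np.array([0, 0, 0, 1, 1])

# Default L2 regularization (strength 1/C) keeps this separable problem well-behaved
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

b0, b1 = model.intercept_[0], model.coef_[0][0]
print(f"logit(p) = {b0:.4f} + {b1:.4f} * Drill_Speed")
print("P(breakage | 2100 RPM) =", model.predict_proba([[2100]])[0, 1])
```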

Classification Methods (for predicting categorical outcomes):

• Linear Discriminant Analysis (LDA): A classification method that finds linear combinations of predictors that best separate different classes.

• Quadratic Discriminant Analysis (QDA): Similar to LDA, but allows for non-
linear decision boundaries.


• K-Nearest Neighbors (KNN): A non-parametric method that classifies observations based on the majority class among its k nearest neighbors in the feature space.

• Support Vector Machines (SVM): Powerful classification methods that find optimal hyperplanes to separate different classes. ISLR covers both linear and non-linear SVMs.

• Tree-Based Methods (Decision Trees, Bagging, Random Forests, Boosting): Methods that build tree-like structures to partition the data and make predictions. ISLR covers decision trees, bagging, random forests, and boosting.
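A short scikit-learn sketch contrasting two of these classifiers on a made-up two-feature dataset; the data and the choice of k are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy 2-feature dataset with two classes (illustrative values only)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # majority vote of 3 neighbors
svm = SVC(kernel="linear").fit(X, y)                  # linear separating hyperplane

x_new = np.array([[2, 2], [6, 7]])
print("KNN predictions:", knn.predict(x_new))
print("SVM predictions:", svm.predict(x_new))
```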


Testing


5. Data Visualization:

Data visualization is a technique of representing data in the form of images, graphs, and charts in an intuitive, easy-to-understand way, so that the information in the data is clearly conveyed to readers and users. Instead of keeping the data in spreadsheet form, we convert it into charts and dashboards so it can be read and understood more easily.

Common types of data visualization formats: column chart, bar chart, line graph, two-axis chart, Mekko chart, pie chart, bubble chart, domain chart, scatter plot, heat map, area chart, and more.

Visualize the results of the data analysis using tools such as Apache Zeppelin, Tableau, or Power BI.
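A minimal matplotlib sketch that turns a small table into a column chart and a scatter plot; the production counts are made up for illustration.

```python
import matplotlib.pyplot as plt

# Illustrative daily production counts per machine
machines = ["M1", "M2", "M3", "M4"]
good_parts = [480, 455, 510, 430]
defects = [12, 25, 8, 30]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.bar(machines, good_parts)        # column chart of output
ax1.set_title("Good parts per machine")

ax2.scatter(good_parts, defects)     # scatter plot: output vs. defects
ax2.set_xlabel("Good parts")
ax2.set_ylabel("Defects")
ax2.set_title("Defects vs. output")

plt.tight_layout()
plt.show()
```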

6. Data Management:

- Data management is the process of managing data within an organization or system, including collecting, organizing, storing, processing, and protecting data. It includes activities such as data entry, data processing, data backup and recovery, and data security.

- Data management plays an important role in modern organizations and information systems because it helps ensure that data is managed effectively and reliably and is available when needed. Good data management also improves an organization's ability to access and use data, improves data quality, minimizes security risks, and supports compliance with regulations related to data management.


Data management methods

+ Database Management System


+ File Management System
+ Distributed Data Management System
+ Document Management System
+ Object-oriented Data Management System
+ Metadata Management System

- Manage the data using tools such as Apache HBase or Apache Cassandra
to ensure data consistency, reliability, and security.

