
Optimizing Wine Quality Prediction: A Machine Learning Approach Using Chemical Properties

Submitted by:

Ballad, Jeremiah Khalil T.
Beig, Zyrin B.
Dela Cruz, John Benedict C.
Garrido, Kyla C.
Manalese, Jan Raya Altaire P.
Origenes, Joshua Paul Andrae A.

DS100-4 / B12
1st Term AY 2024 – 2025
INTRODUCTION
The physicochemical properties of wine play a crucial role in determining its quality, as
perceived by human tasters and analyzed through data-driven methodologies. Research conducted by
Cortez et al. (2009) and Nebot et al. (2018) highlights the application of various machine learning
techniques to analyze and predict wine quality based on these properties, ultimately aiming to support
the wine industry by providing objective and scalable assessments.
In their study, Cortez et al. (2009) employed multiple regression, neural networks, and support
vector machines (SVM) on the Vinho Verde dataset, concluding that certain chemical properties—such
as alcohol content, volatile acidity, and residual sugar—significantly influence wine ratings. This
research demonstrates that by identifying and modeling the relationships between chemical variables
and quality, machine learning models can predict wine quality with promising accuracy.
In contrast, Nebot et al. (2018) used fuzzy logic techniques to capture the intricacies of wine
preferences, finding that alcohol content, fixed acidity, free sulfur dioxide, and volatile acidity are
critical indicators of wine quality. Their study corroborates that specific physicochemical
properties consistently affect quality, even across different modeling approaches. The interpretability
of the fuzzy model proved particularly advantageous for industry applications, where understanding
the influence of each variable is essential.
Additionally, Angus (2020) explored the use of neural networks to automate wine scoring,
highlighting how the chemical properties of wine can predict sensory scores without the need for
human tasters. Collectively, these studies suggest that machine learning and data mining
methodologies offer reliable and insightful approaches for assessing wine quality. They also
underscore the significance of specific chemical properties, supporting the development of machine
learning models as viable tools for production and objective quality assessment in the wine industry.
The quality of wine arises from a delicate balance of chemical elements that collectively shape
its taste, texture, and aging potential. According to Volschenk et al. (2017), one key factor is fixed
acidity, which consists mainly of tartaric and malic acids found naturally in grapes. This type of acidity
remains stable throughout fermentation, lending structure and a crispness to wine. This is especially
valued in white wines, where it enhances freshness and balances sweetness (Payan et al., 2023).
Another form of acidity is volatile acidity, primarily acetic acid, which can add complexity but,
if present in excess, gives the wine an undesirable vinegar-like taste.
In addition to the natural acidity from tartaric and malic acids, winemakers introduce small
amounts of citric acid to boost acidity, which adds a bright, fresh note that sharpens the wine’s edge.
Another key component shaping a wine’s profile is residual sugar. It is the sugar remaining after
fermentation that defines its level of sweetness. As Gadd (2021) notes, wines low in residual sugar are
considered dry, while those with higher levels are sweeter, catering to diverse palates and preferences.
Additionally, salt, specifically sodium chloride, also influences a wine's taste; as Logothetis and Walker
(2010) point out, it adds a subtle salinity that enhances texture, though an excess can disrupt the
wine’s balance.
Beyond the components that shape a wine's taste, it also contains elements with antimicrobial
properties. Sulfur dioxide, for instance, is commonly used as a preservative to prevent oxidation and
control microbial growth, but, as Grogan (2015) points out, overuse can compromise aroma and taste,
making balance essential. Additionally, the wine’s density reflects its sugar and alcohol content which
affects the body and mouthfeel. Furthermore, pH, generally between 3 and 4, plays a crucial role by
influencing acidity, which helps maintain freshness and balance between sweetness and alcohol.
Finally, Granuzzo et al. (2023) found that sulfates, acting as antioxidants, stabilize the wine, while
carefully managed alcohol levels contribute body and warmth, qualities especially valued in fuller-
bodied red wines. Together, these elements allow winemakers to craft wines with varied profiles,
catering to diverse tastes and preferences.
The wine quality dataset by Cortez et al. (2009) consists of 13 variables in total. There are 11
numerical variables representing various chemical properties, one categorical variable indicating wine
type (red or white), and one discrete variable for quality rating, with each sample evaluated on a scale
from 0 (worst) to 10 (best).
The objective of this project is to develop a machine learning model capable of predicting wine
quality based on its chemical properties. A prediction system like this could also be helpful for
marketing or oenology student training (Cortez et al., 2009). Furthermore, this study aims to identify
which chemical properties most significantly impact wine quality and compare the influence of these
chemical properties on quality between red and white wines. It is also necessary to evaluate the
predictive performance of the model to ensure its accuracy and reliability. Ultimately, this project
aspires to provide valuable insights into the relationship between chemical composition and perceived
quality in wines, enhancing the understanding of what defines exceptional wines in the competitive
market.

MODEL DESCRIPTION

Figure 1. Machine learning project procedure flowchart

Data Loading and Preparation


The dataset was sourced from the Kaggle repository and includes records of Portuguese Vinho
Verde wine, with 1,599 samples of red wine and 4,898 samples of white wine, totaling 6,497 entries.
Data loading was accomplished using the Pandas library in Python, with the pd.read_csv() function
to read the dataset. Once loaded, the dataset was divided into two distinct DataFrames for red and
white wines to facilitate separate analyses for each type. The initial data structure, types, and
completeness were assessed using the info() function, confirming that there were no missing values.
Additionally, Matplotlib and Seaborn libraries were utilized for visualization to allow for subsequent
visual analysis of wine characteristics by type.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) was conducted to gain insights into the distribution of the
dataset’s features and assess relationships between variables. Histograms were generated to
represent the distribution of physicochemical properties and quality scores for red and white wines
using the DataFrame.hist() method (as in the appendix code), allowing visual assessment of each
variable's spread and central tendency. Skewness was calculated for all numeric variables using the
skew() function from scipy.stats to identify any significant asymmetries, with a threshold of absolute
skewness >1 set for logarithmic transformations.
Furthermore, Pearson correlation coefficients were calculated, and heatmaps were generated to
assess the strength and direction of correlations between wine quality and other features in both the
red and white wine datasets.
Data Preprocessing
Since no missing values were detected in the dataset, imputation was deemed unnecessary.
Data preprocessing focused primarily on normalizing and standardizing skewed variables to meet
model assumptions. Variables with an absolute skewness greater than 1 were log-transformed to
approximate a normal distribution, after which they were standardized to ensure consistency across
features. Additionally, feature selection was conducted based on correlation strength. Only features
with correlation coefficients of |R| ≥ 0.2 relative to the quality variable were retained, with the aim
of reducing dimensionality and better satisfying the linearity and homoscedasticity (equal variance)
assumptions of the regression model.
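The correlation-based filter described above can be illustrated with a minimal sketch. The data here are synthetic and the column names (alcohol, noise_feature) are assumptions chosen for illustration only; the sketch shows the thresholding idea and is not the project's actual selection run.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine table (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
noise_feature = rng.normal(0.0, 1.0, n)            # unrelated to quality by construction
quality = 0.5 * alcohol + rng.normal(0.0, 0.8, n)  # depends on alcohol only

df = pd.DataFrame(
    {"alcohol": alcohol, "noise_feature": noise_feature, "quality": quality}
)

# Pearson correlation of every feature with the target.
corr_with_quality = df.corr()["quality"].drop("quality")

# Keep features whose absolute correlation reaches the 0.2 cutoff.
THRESHOLD = 0.2
selected = corr_with_quality[corr_with_quality.abs() >= THRESHOLD].index.tolist()

print(corr_with_quality.round(2))
print("selected:", selected)
```

A fixed cutoff is a simple heuristic: it ignores correlations among the predictors themselves, so two retained features may still carry overlapping information.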
Model Training and Testing
The study employed a multiple linear regression (MLR) model to predict wine quality based
on chemical and sensory attributes. The dataset was split into training (80%) and testing (20%) subsets
to allow model training and subsequent evaluation. The multiple linear regression model aimed to
predict the target variable (wine quality) using the selected features. Model predictions were
evaluated against both actual and synthetic data samples, with the ultimate goal of establishing a
reliable prediction model for wine quality based on physicochemical properties.
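As a hedged variation on the split-and-fit step, the sketch below wraps the scaler and the regression in a scikit-learn pipeline so that the scaler is fitted on the training rows only, and it fixes random_state so the split is reproducible. The data are synthetic and the coefficients are arbitrary assumptions; this is one possible arrangement, not the procedure used in the appendix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (hypothetical; not the wine data).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 5.8 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.7, size=400)

# 80/20 split with a fixed seed so the result is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training rows only, then the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"R2 (train): {r2_train:.3f}  R2 (test): {r2_test:.3f}")
```

Fitting the scaler inside the pipeline keeps test-set statistics out of the training step, which makes the test score a cleaner estimate of performance on unseen data.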
Dataset Summary
The dataset consists of chemical and sensory attributes, including fixed acidity, volatile acidity,
residual sugar, chlorides, sulfur dioxide levels (free and total), density, pH, sulphates, and alcohol
content. Quality scores, as determined by expert sensory evaluations, serve as the target variable. For
analysis purposes, the quality scores were grouped into three distinct categories: Low Quality (scores
0-4), Average Quality (scores 5-7), and High Quality (scores 8-10). This classification facilitated a
structured approach to examining the impact of each feature on wine quality across different quality
categories.
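The three-tier grouping can be expressed with pandas' cut function. This is a minimal sketch on made-up scores; the bin edges follow the score ranges stated above, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical quality scores on the 0-10 scale.
scores = pd.Series([3, 4, 5, 6, 7, 8, 9], name="quality")

# Right-inclusive bins: (-1, 4] -> Low, (4, 7] -> Average, (7, 10] -> High.
tiers = pd.cut(
    scores,
    bins=[-1, 4, 7, 10],
    labels=["Low Quality", "Average Quality", "High Quality"],
)

print(tiers.tolist())
```

The lower edge is set to -1 so that a score of 0 falls inside the first bin, since pd.cut excludes the left edge of each interval by default.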
This machine learning workflow facilitated an organized investigation of the physicochemical
characteristics of wine samples and their relationship with quality, culminating in a predictive model
designed to assess wine quality based on chemical composition and sensory metrics.

RESULTS AND DISCUSSION

Figure 2. White (left) and red (right) wine variable distributions

Figure 2 shows the histograms of the numerical variables in the white and red wine datasets,
which reveal the shape of each distribution. Many variables exhibit right-skewed distributions,
such as the chlorides in white wine and sulphates in red wine, indicating a concentration of data points
towards lower values. On the other hand, the density and pH variables for both wine types exhibit
approximately normal distributions, indicating a more balanced distribution of values. The quality
variable, a discrete variable representing the wine quality rating, is also approximately normal, having
a central tendency around 5 to 6. This suggests that most wines in the dataset are of moderate quality.
Table 1. Skewness values of white and red wine variables before and after log transformations

                         White                    Red
Variable                 Before      After        Before      After
fixed acidity            0.647553    0.647553     0.981829    0.981829
volatile acidity         1.576497    0.872987     0.670962    0.670962
citric acid              1.281528    0.612170     0.318039    0.318039
residual sugar           1.076764    0.004297     4.536395    0.958917
chlorides                5.021792    0.984836     5.675017    0.961571
free sulfur dioxide      1.406314   -0.828047     1.249394   -0.097307
total sulfur dioxide     0.390590    0.390590     1.514109   -0.035712
density                  0.977474    0.977474     0.071221    0.071221
pH                       0.457642    0.457642     0.193502    0.193502
sulphates                0.976894    0.976894     2.426393    0.877375
alcohol                  0.487193    0.487193     0.860021    0.860021
quality                  0.155749    0.155749     0.217597    0.217597

Table 1 above lists the quantitative skewness of the variables, before and after treatment, for
both white and red wine types. Reinforcing the histograms, the values calculated also show that some
of the variables are positively skewed, thereby making the data distribution far from normality. To
make the data suitable for regression modeling, the skewed variables underwent logarithmic
transformations to reduce the skewness and handle the outlier values. The processing assumes an
approximately normal distribution for skewness below the threshold of 1. For skewed data with
significant correlations, as seen later on, their contributions to the model prediction will also be on a
logarithmic scale.
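The effect of a log transformation on skewness can be seen in a small sketch. The sample below is synthetic (log-normal) and stands in for a right-skewed wine variable; the seed and distribution parameters are assumptions. It applies log1p once, whereas the appendix code repeats the transform until the absolute skewness drops to 1 or below.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample (log-normal); a stand-in, not the wine data.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=5000)

skew_before = skew(x)
skew_after = skew(np.log1p(x))  # log1p(x) = log(1 + x), defined at x = 0

print(f"before: {skew_before:.3f}  after: {skew_after:.3f}")
```

Whether a single pass brings the skewness under the threshold depends on the data, which is why some variables in Table 1 may have needed more than one application.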

Figure 3. Correlation heat map for white (left) and red (right) wine variables to quality

Figure 3 displays the heat maps for the correlations of white and red wine parameters with
the quality variable. In the color map used, a deep red represents a strong positive correlation,
whereas a deep blue denotes a strong negative relationship; a neutral gray indicates very
weak to no correlation. For each wine type, the parameters whose correlation coefficient with quality
has an absolute value of 0.2 or higher were considered significant in determining wine quality and
were retained. These are, for white wine: chlorides, density, and alcohol; and for red wine: volatile
acidity, citric acid, sulphates, and alcohol. Filtering the features in this way yields a much
simpler regression model for wine quality and reduces the noise contributed by weakly related
factors.
Figure 4. Predicted vs. true quality for the multiple regression model for white (left) and red (right) wine

Using Scikit-Learn’s LinearRegression estimator, multiple linear regression models predicting the
quality of white and red wine were fitted on the features identified as significant. The regression
equations for the two models are given below (Note: VA – volatile acidity, CA – citric acid).
Features that were originally highly skewed, and were therefore log-transformed, appear inside a
log function in the equations. Because all features were standardized before fitting, each coefficient
is the change in predicted quality per one standard deviation of the corresponding (transformed)
feature.
white_quality = 5.8721 – 0.0586 × log[chlorides] + 0.0756 × [density] + 0.4133 × [alcohol]
red_quality = 5.6278 – 0.2100 × [VA] – 0.0210 × [CA] + 0.1227 × log[sulphates] + 0.3394 × [alcohol]
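
To make the two equations concrete, the sketch below evaluates them as plain Python functions. It assumes the inputs are already log-transformed where indicated and then standardized (z-scored), as in the appendix preprocessing; the function and argument names are hypothetical.

```python
def predict_white_quality(z_log_chlorides, z_density, z_alcohol):
    """Evaluate the reported white-wine equation on standardized inputs."""
    return (5.8721
            - 0.0586 * z_log_chlorides
            + 0.0756 * z_density
            + 0.4133 * z_alcohol)


def predict_red_quality(z_va, z_ca, z_log_sulphates, z_alcohol):
    """Evaluate the reported red-wine equation on standardized inputs."""
    return (5.6278
            - 0.2100 * z_va
            - 0.0210 * z_ca
            + 0.1227 * z_log_sulphates
            + 0.3394 * z_alcohol)


# A white wine with every feature at its sample mean (all z-scores zero).
average_white = predict_white_quality(0.0, 0.0, 0.0)

# The same wine with alcohol one standard deviation above the mean.
stronger_white = predict_white_quality(0.0, 0.0, 1.0)

print(round(average_white, 4), round(stronger_white, 4))
```

Under this reading, the intercept is the predicted quality of a wine whose selected features all sit at their mean, and alcohol has the largest coefficient in both equations.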

Figure 4 above displays the scatterplot of predicted versus true quality values for the
entries in the test set. The solid diagonal line denotes a perfect model, i.e.,
predicted value equals true value. The data points do not closely follow the solid line:
the white wine predictions show a flatter trend, while the red wine predictions are more
slanted but still diverge from the line. Notably, only the modal values (5 and 6) are
predicted with good precision, which exposes a limitation of both models.
A different model, such as logistic regression for a dichotomous response (i.e.,
bad versus good quality), could be more appropriate for this dataset; however, because the quality
feature is a discrete variable from 0 to 10 rather than a simple good/bad label, we did not use that
model. The plot thus illustrates a weak fit of the regression model to the actual data, which the
evaluation metrics below confirm.
Table 2. Regression model evaluation metrics

Metric        White wine model    Red wine model
R² (test)     0.212749            0.341379
R² (train)    0.192415            0.332091
MSE           0.609705            0.423061
RMSE          0.780836            0.641312
MAE           0.619060            0.503290

Table 2 lists the evaluation metrics for the white and red wine regression models. For
both models, the relatively low coefficient of determination (R²) indicates weak
predictive power, consistent with the inference from the visualization above. The table also includes
the mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE); an
MAE of roughly 0.5 to 0.6 means that predictions are off by about half a quality point on average.
The R² score for red wine is higher than that for white wine, consistent with the trend seen in
Figure 4. The difference between the testing and training R² is small for both models,
indicating that the models are not overfitted to the training data; they generalize consistently
to new data, but their predictive power remains limited.
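
The four metrics can be computed by hand, which also makes their relationships explicit. The sketch below uses made-up true and predicted scores (an assumption for illustration) and then recomputes the RMSE from the MSE values reported in Table 2.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE, and MAE for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2": 1 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Hypothetical true and predicted quality scores.
y_true = [5, 6, 7, 5, 6]
y_pred = [5.5, 5.5, 6.0, 5.0, 6.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# RMSE should equal sqrt(MSE); recompute it from the (MSE, RMSE) pairs in Table 2.
reported = {"white": (0.609705, 0.780836), "red": (0.423061, 0.641312)}
recomputed = {name: math.sqrt(mse) for name, (mse, _) in reported.items()}
for name, (_, rmse) in reported.items():
    print(name, "sqrt(MSE) =", round(recomputed[name], 4), "reported RMSE =", rmse)
```

The white-wine values agree, while the square root of the red-wine MSE is about 0.6504 rather than the reported 0.641312. One possible explanation, not confirmed by the source, is that the two figures came from different runs of an unseeded train/test split.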

CONCLUSION
The development of this machine learning model has highlighted the potential of data-driven
approaches for predicting wine quality based on measurable chemical properties. Through
rigorous data exploration and analysis, several key variables were identified as significant indicators of
quality in both red and white wines. Among these, alcohol content was found to have a strong positive
correlation with quality ratings, suggesting that higher alcohol levels may enhance certain desirable
sensory attributes. In contrast, volatile acidity, chlorides, and residual sugar were generally associated
with lower quality scores, implying that the excessive levels of these compounds can detract from the
wine’s balance and overall flavor profile.
The process of data preprocessing, which included normalization, log transformation, and the
filtering of highly skewed data, ensured that the model was well-equipped to handle variability within
the dataset. Selecting only the most relevant features further streamlined the model, enhancing its
predictive capacity while minimizing unnecessary complexity. This focused approach not only
improved the model’s accuracy but also facilitated an interpretative framework for understanding the
influence of each variable on wine quality. In addition to identifying the primary factors affecting
quality, the model categorized wines into low, average, and high-quality tiers, providing an accessible
way to interpret the results and making it easier to derive actionable insights. This classification can
serve as a valuable tool for winemakers, offering guidance on the ideal chemical compositions that
may yield higher-quality wines. The findings also underscore the potential for machine learning
applications to standardize quality assessments in the wine industry, reducing reliance on subjective
sensory evaluations and supporting consistent quality control.
Future research could expand this work by incorporating additional sensory data and testing
the model on a more diverse range of wine types and varieties. Additionally, integrating more
advanced machine learning algorithms may uncover deeper insights into the complex relationships
between wine composition and perceived quality. The adaptability of this model positions it as a useful
asset for wine producers and researchers, who can leverage these insights to refine production
processes and tailor wine profiles to meet evolving consumer tastes. This project illustrates the power
of machine learning in offering precise, scalable solutions for quality prediction, with the potential to
transform quality assessment practices in oenology.

REFERENCES
Gadd, D. (2021, December 13). Understanding the dryness scale of wines. Wine Wisdoms.
https://winewisdoms.com/article/understanding-dryness-scale-of-wines
Granuzzo, S., Righetto, F., Peggion, C., Bosaro, M., Frizzarin, M., Antoniali, P., Sartori, G., & Lopreiato,
R. (2023). Sulphate uptake plays a major role in the production of sulphur dioxide by yeast cells
during oenological fermentations. Fermentation, 9(3), 280.
https://doi.org/10.3390/fermentation9030280
Grogan, K. A. (2015). The value of added sulfur dioxide in French organic wine. Agricultural and Food
Economics, 3(1). https://doi.org/10.1186/s40100-015-0038-1
Logothetis, S., & Walker, G. (2010). Influence of sodium chloride on wine yeast fermentation
performance. International Journal of Wine Research, 35. https://doi.org/10.2147/ijwr.s10889
Payan, C., Gancel, A., Jourdes, M., Christmann, M., & Teissedre, P. (2023). Wine acidification
methods: a review. OENO One, 57(3), 113–126. https://doi.org/10.20870/oeno-one.2023.57.3.7476
Volschenk, H., Van Vuuren, H., & Viljoen-Bloom, M. (2017). Malic acid in wine: Origin, function and
metabolism during vinification. South African Journal of Enology and Viticulture, 27(2).
https://doi.org/10.21548/27-2-1613
APPENDIX: Python Code

# Import Python libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error

# Load csv file


data = pd.read_csv('wine-quality-white-and-red.csv')
data

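# Split red and white wine data
# .copy() gives independent DataFrames, so later column assignments do not act on a slice of `data`
whitewine_data = data[data['type'] == 'white'].copy()
redwine_data = data[data['type'] == 'red'].copy()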

Exploratory Data Analysis

# Show information on datatypes, columns, and number of entries


whitewine_data.info()
redwine_data.info()

# Show statistics on different columns


whitewine_data.describe()
redwine_data.describe()

# Show distributions of other numerical variables


# Note that some of the distributions are right-skewed, thus needing treatment to normalize
whitewine_data.hist(bins=20, figsize=(14,10), color='pink')
plt.suptitle('White Wine Variable Distributions');
redwine_data.hist(bins=20, figsize=(14,10), color='red')
plt.suptitle('Red Wine Variable Distributions');

# Calculate skewness
white_numeric = whitewine_data.select_dtypes(include=[np.number]).columns
white_skewval = whitewine_data[white_numeric].apply(skew)
red_numeric = redwine_data.select_dtypes(include=[np.number]).columns
red_skewval = redwine_data[red_numeric].apply(skew)
white_skewval, red_skewval

# Analyze correlation between X variables vs. quality


white_num = whitewine_data.select_dtypes(include=[np.number])
white_correl = white_num.corr()
white_quality_correl = white_correl['quality'].drop('quality')
plt.figure(figsize=(8,6))
sns.heatmap(pd.DataFrame(white_quality_correl), annot=True, cmap='coolwarm', center=0, fmt='.2f',
            linewidths=0.5, cbar_kws={'shrink': .8})
plt.title('Correlation Heat Map for White Wine Parameters to Quality');
red_num = redwine_data.select_dtypes(include=[np.number])
red_correl = red_num.corr()
red_quality_correl = red_correl['quality'].drop('quality')
plt.figure(figsize=(8,6))
sns.heatmap(pd.DataFrame(red_quality_correl), annot=True, cmap='coolwarm', center=0, fmt='.2f',
            linewidths=0.5, cbar_kws={'shrink': .8})
plt.title('Correlation Heat Map for Red Wine Parameters to Quality');

Preprocessing

# Identify highly skewed variables, then normalize using log transformation
# Iteration is necessary for variables with really high skewness
white_highskew = white_skewval[abs(white_skewval) > 1].index
for col in white_highskew:
    while abs(whitewine_data[col].skew()) > 1:
        whitewine_data[col] = np.log1p(whitewine_data[col])
whitewine_data[white_numeric].apply(skew)
red_highskew = red_skewval[abs(red_skewval) > 1].index
for col in red_highskew:
    while abs(redwine_data[col].skew()) > 1:
        redwine_data[col] = np.log1p(redwine_data[col])
redwine_data[red_numeric].apply(skew)

# Filter variables with significant correlation


white_highcorrel = white_correl['quality'][abs(white_correl['quality']) >= 0.2].index
white_filtered = whitewine_data[white_highcorrel]
red_highcorrel = red_correl['quality'][abs(red_correl['quality']) >= 0.2].index
red_filtered = redwine_data[red_highcorrel]
white_highcorrel, red_highcorrel

# Classify variables as X or y
X_white = white_filtered.drop('quality', axis=1)
y_white = white_filtered['quality']
X_red = red_filtered.drop('quality', axis=1)
y_red = red_filtered['quality']

# Standardize variables
scaler = StandardScaler()
X_white_scaled = scaler.fit_transform(X_white)
X_red_scaled = scaler.fit_transform(X_red)

# Split data into training and testing sets


Xw_train, Xw_test, yw_train, yw_test = train_test_split(X_white_scaled, y_white, test_size=0.2)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_red_scaled, y_red, test_size=0.2)

Multiple Linear Regression Model

# Creating linear regression model


# White wine model
linreg_white = LinearRegression()
white_model = linreg_white.fit(Xw_train,yw_train)
white_target = linreg_white.predict(Xw_test)
plt.figure()
plt.scatter(yw_test, white_target, color='pink')
plt.plot(range(10), range(10), color='black')
plt.xlabel('True Values')
plt.ylabel('Predictions');

# White wine model evaluation metrics


print('R2 (test):', white_model.score(Xw_test, yw_test))
print('Coefficients:', linreg_white.coef_)
print('R2 (train):', white_model.score(Xw_train, yw_train))
print('Intercept:', linreg_white.intercept_)
print('MSE:', mean_squared_error(yw_test, white_target))
print('RMSE:', root_mean_squared_error(yw_test, white_target))
print('MAE:', mean_absolute_error(yw_test, white_target))

# Red wine model


linreg_red = LinearRegression()
red_model = linreg_red.fit(Xr_train,yr_train)
red_target = linreg_red.predict(Xr_test)
plt.figure()
plt.scatter(yr_test, red_target, color='red')
plt.plot(range(10), range(10), color='black')
plt.xlabel('True Values')
plt.ylabel('Predictions');

# Red wine model evaluation


print('R-squared Score (test):', red_model.score(Xr_test, yr_test))
print('Coefficients:', linreg_red.coef_)
print('R-squared Score (train):', red_model.score(Xr_train, yr_train))
print('Intercept:', linreg_red.intercept_)
print('Mean Squared Error:', mean_squared_error(yr_test, red_target))
print('RMSE:', root_mean_squared_error(yr_test, red_target))
print('MAE:', mean_absolute_error(yr_test, red_target))
