Lab Rep
Submitted by:
DS100-4 / B12
1st Term AY 2024 – 2025
INTRODUCTION
The physicochemical properties of wine play a crucial role in determining its quality, as
perceived by human tasters and analyzed through data-driven methodologies. Research conducted by
Cortez et al. (2009) and Nebot et al. (2018) highlights the application of various machine learning
techniques to analyze and predict wine quality based on these properties, ultimately aiming to support
the wine industry by providing objective and scalable assessments.
In their study, Cortez et al. (2009) employed multiple regression, neural networks, and support
vector machines (SVM) on the Vinho Verde dataset, concluding that certain chemical properties—such
as alcohol content, volatile acidity, and residual sugar—significantly influence wine ratings. This
research demonstrates that by identifying and modeling the relationships between chemical variables
and quality, machine learning models can predict wine quality with promising accuracy.
Nebot et al. (2018) instead used fuzzy logic techniques to capture the intricacies of wine preferences,
finding that alcohol content, fixed acidity, free sulfur dioxide, and volatile acidity are critical
indicators of wine quality. Their study corroborates that specific physicochemical properties
consistently affect quality, even across different modeling approaches. The interpretability of the
fuzzy model proved particularly advantageous for industry applications, where understanding the
influence of each variable is essential.
Additionally, Angus (2020) explored the use of neural networks to automate wine scoring,
highlighting how the chemical properties of wine can predict sensory scores without the need for
human tasters. Collectively, these studies suggest that machine learning and data mining
methodologies offer reliable and insightful approaches for assessing wine quality objectively. They
also underscore the significance of specific chemical properties in wine quality, supporting the
development of machine learning models as viable tools for the wine industry.
The quality of wine arises from a delicate balance of chemical elements that collectively shape
its taste, texture, and aging potential. According to Volschenk et al. (2017), one key factor is fixed
acidity, which consists mainly of tartaric and malic acids found naturally in grapes. This type of acidity
remains stable throughout fermentation, lending structure and a crispness to wine. This is especially
valued in white wines, where it enhances freshness and balances sweetness (Payan et al., 2023).
Another form of acidity is volatile acidity, primarily acetic acid, which can add complexity but,
if present in excess, gives the wine an undesirable vinegar-like taste.
In addition to the natural acidity from tartaric and malic acids, winemakers may introduce small
amounts of citric acid to boost acidity, adding a bright, fresh note that sharpens the wine’s edge.
Another key component shaping a wine’s profile is residual sugar, the sugar remaining after
fermentation, which defines the wine’s level of sweetness. As Gadd (2021) notes, wines low in residual
sugar are considered dry, while those with higher levels are sweeter, catering to diverse palates and
preferences. Salt, specifically sodium chloride, also influences a wine's taste; as Logothetis and
Walker (2010) point out, it adds a subtle salinity that enhances texture, though an excess can disrupt
the wine’s balance.
Beyond the components that shape its taste, wine also contains compounds with antimicrobial
properties. Sulfur dioxide, for instance, is commonly used as a preservative to prevent oxidation and
control microbial growth, but, as Grogan (2015) points out, overuse can compromise aroma and taste,
making balance essential. Additionally, the wine’s density reflects its sugar and alcohol content,
which affect the body and mouthfeel. Furthermore, pH, generally between 3 and 4, indicates the strength
of the wine's acidity, which helps maintain freshness and balance between sweetness and alcohol.
Finally, Granuzzo et al. (2023) found that sulphates, acting as antioxidants, stabilize the wine, while
carefully managed alcohol levels contribute body and warmth, qualities especially valued in
fuller-bodied red wines. Together, these elements allow winemakers to craft wines with varied profiles,
catering to diverse tastes and preferences.
The wine quality dataset by Cortez et al. (2009) consists of 13 variables in total. There are 11
numerical variables representing various chemical properties, one categorical variable indicating wine
type (red or white), and one discrete variable for quality rating, with each sample evaluated on a scale
from 0 (worst) to 10 (best).
The objective of this project is to develop a machine learning model capable of predicting wine
quality based on its chemical properties. A prediction system like this could also be helpful for
marketing or oenology student training (Cortez et al., 2009). Furthermore, this study aims to identify
which chemical properties most significantly impact wine quality and compare the influence of these
chemical properties on quality between red and white wines. It is also necessary to evaluate the
predictive performance of the model to ensure its accuracy and reliability. Ultimately, this project
aspires to provide valuable insights into the relationship between chemical composition and perceived
quality in wines, enhancing the understanding of what defines exceptional wines in the competitive
market.
MODEL DESCRIPTION
Figure 2 shows the histograms of the numerical variables in the white and red wine datasets,
providing valuable insights into their distributions. Many variables exhibit right-skewed
distributions, such as chlorides in white wine and sulphates in red wine, indicating a concentration
of data points towards lower values. On the other hand, the density and pH variables for both wine
types are approximately normally distributed. The quality variable, a discrete variable representing
the wine quality rating, is also approximately normal, with a central tendency around 5 to 6. This
suggests that most wines in the dataset are of moderate quality.
Table 1. Skewness values of white and red wine variables before and after log transformations
                               White                  Red
Variable                   Before      After     Before      After
fixed acidity            0.647553   0.647553   0.981829   0.981829
volatile acidity         1.576497   0.872987   0.670962   0.670962
citric acid              1.281528   0.612170   0.318039   0.318039
residual sugar           1.076764   0.004297   4.536395   0.958917
chlorides                5.021792   0.984836   5.675017   0.961571
free sulfur dioxide      1.406314  -0.828047   1.249394  -0.097307
total sulfur dioxide     0.390590   0.390590   1.514109  -0.035712
density                  0.977474   0.977474   0.071221   0.071221
pH                       0.457642   0.457642   0.193502   0.193502
sulphates                0.976894   0.976894   2.426393   0.877375
alcohol                  0.487193   0.487193   0.860021   0.860021
quality                  0.155749   0.155749   0.217597   0.217597
Table 1 above lists the skewness of each variable, before and after treatment, for both white and
red wine types. Reinforcing the histograms, the calculated values show that several variables are
strongly positively skewed and therefore far from normally distributed. To make the data more suitable
for regression modeling, variables with skewness above the threshold of 1 underwent logarithmic
transformation to reduce the skewness and dampen the influence of outliers; variables with skewness
below 1 were treated as approximately normal and left unchanged. For transformed variables that
correlate significantly with quality, as seen later on, their contributions to the model prediction
are likewise on a logarithmic scale.
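The thresholded transformation described above can be sketched as follows. This is a minimal illustration, not the exact code used for Table 1: the helper name log_transform_skewed, the synthetic data, and the choice of the natural logarithm (np.log, which requires strictly positive values) are assumptions; columns containing zeros would need an offset such as np.log1p instead.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew


def log_transform_skewed(df, threshold=1.0, exclude=("quality",)):
    """Log-transform numeric columns whose skewness exceeds `threshold`.

    Hypothetical helper: assumes every transformed column is strictly positive.
    Returns the transformed copy and the list of columns that were changed.
    """
    out = df.copy()
    changed = []
    for col in out.select_dtypes(include=[np.number]).columns:
        if col in exclude:
            continue  # leave the target variable untouched
        if skew(out[col]) > threshold:
            out[col] = np.log(out[col])
            changed.append(col)
    return out, changed


# Synthetic stand-in data (not the wine dataset): one right-skewed column,
# one roughly symmetric column, and a discrete target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "chlorides": rng.lognormal(mean=-3.0, sigma=0.8, size=500),
    "pH": rng.normal(loc=3.2, scale=0.15, size=500),
    "quality": rng.integers(3, 9, size=500),
})

demo_log, changed = log_transform_skewed(demo)
print("transformed columns:", changed)
print("skewness before:", skew(demo["chlorides"]))
print("skewness after: ", skew(demo_log["chlorides"]))
```

Only columns above the threshold are modified, which mirrors the pattern in Table 1 where the before and after values are identical for variables that were already below 1.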
Figure 3. Correlation heat map for white (left) and red (right) wine variables to quality
Figure 3 displays the heat maps for the correlations of white and red wine parameters with
the quality variable. In the color map used, deep red represents a strong positive correlation, deep
blue a strong negative correlation, and neutral gray very weak to no correlation. For each wine type,
the parameters whose correlation coefficient with quality has an absolute value of 0.2 or higher were
considered significant in determining wine quality and were selected from the other variables. These
are, for white wine: chlorides, density, and alcohol; and for red wine: volatile acidity, citric acid,
sulphates, and alcohol. This filtering of wine features enables a much simpler regression model for
wine quality and reduces the noise contributed by weakly related factors.
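A correlation filter of this kind might look like the sketch below. The function name select_by_correlation and the synthetic data are assumptions made for illustration; only the 0.2 cutoff on the absolute Pearson correlation with quality comes from the text.

```python
import numpy as np
import pandas as pd


def select_by_correlation(df, target="quality", threshold=0.2):
    """Return the features whose |Pearson r| with `target` is at least `threshold`."""
    numeric = df.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)
    selected = corr[corr.abs() >= threshold].index.tolist()
    return selected, corr


# Synthetic stand-in data: quality depends on alcohol (+) and density (-),
# while pH is unrelated noise.
rng = np.random.default_rng(0)
n = 1000
alcohol = rng.normal(10.5, 1.2, n)
density = rng.normal(0.994, 0.003, n)
pH = rng.normal(3.2, 0.15, n)
quality = (
    6.0
    + 0.4 * (alcohol - 10.5) / 1.2
    - 0.3 * (density - 0.994) / 0.003
    + rng.normal(0.0, 0.8, n)
)
demo = pd.DataFrame(
    {"alcohol": alcohol, "density": density, "pH": pH, "quality": quality}
)

selected, corr = select_by_correlation(demo)
print(corr.round(3))
print("selected features:", selected)
```

Using the absolute value keeps both positively and negatively correlated features, which is why a feature such as density can pass the filter even though its relationship with quality is negative.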
Figure 4. Predicted vs. true quality for the multiple regression model for white (left) and red (right) wine
Using Scikit-Learn’s linear regression tool, multiple linear regression models predicting the
quality of white and red wine were fitted on the features identified as significant. The regression
equations for the two models are given below (Note: VA – volatile acidity, CA – citric acid). Features
that were originally highly skewed, and were therefore log-transformed, appear inside a log function
in the equations.
white_quality = 5.8721 – 0.0586 × log[chlorides] + 0.0756 × [density] + 0.4133 × [alcohol]
red_quality = 5.6278 – 0.2100 × [VA] – 0.0210 × [CA] + 0.1227 × log[sulphates] + 0.3394 × [alcohol]
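A minimal sketch of how a model of this form might be fitted with Scikit-Learn, assuming the selected features are standardized first (as in the appendix) and that a held-out test split is used. The synthetic data, the 80/20 split, and the random seeds are illustrative assumptions, not the settings used in this report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two selected features (not the wine dataset).
rng = np.random.default_rng(42)
n = 800
X = np.column_stack([
    rng.normal(0.50, 0.18, n),   # stand-in for volatile acidity
    rng.normal(10.4, 1.10, n),   # stand-in for alcohol
])
y = (
    5.6
    - 0.20 * (X[:, 0] - 0.50) / 0.18
    + 0.35 * (X[:, 1] - 10.4) / 1.10
    + rng.normal(0.0, 0.65, n)
)

# Hold out a test set (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
print("intercept:   ", model.intercept_)
print("coefficients:", model.coef_)
```

When the features are standardized on the training set, the fitted intercept equals the mean training quality, and each coefficient is the change in predicted quality per one standard deviation of that feature.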
Figure 4 above displays the scatterplot of the predicted versus true quality values for the
entries in the dataset. The solid diagonal line denotes a perfect regression model, i.e., predicted
value equals true value. The data points do not closely follow the solid line; the white wine data
shows a flatter trend, while the red wine data is more slanted but still diverges from the line.
Notably, only the modal values (5 and 6) are predicted with good precision, exemplifying the
limitation of both models. This suggests that a different model, such as logistic regression for a
dichotomous response (i.e., bad and good quality), could be more appropriate for the dataset; however,
because the quality feature is a discrete variable from 0 to 10 rather than a mere good/bad variable,
we did not use that model. The plot illustrates a weak fit of the regression model to the actual data,
which the numerical evaluation in Table 2 confirms.
Table 2. Regression model evaluation metrics
Metric        White wine model   Red wine model
R² (test)             0.212749         0.341379
R² (train)            0.192415         0.332091
MSE                   0.609705         0.423061
RMSE                  0.780836         0.641312
MAE                   0.619060         0.503290
Table 2 lists the evaluation metrics for the white and red wine regression models. For both
models, the relatively low coefficient of determination (R²) demonstrates weak predictive power,
consistent with the inference from the visualization above. Also included are the mean squared error
(MSE), root mean squared error (RMSE), and mean absolute error (MAE), all indicating fairly large but
tolerable deviations between the predicted and true values. The R² score for red wine is higher than
that for white wine, consistent with the steeper trend seen in Figure 4. The difference between the
testing and training R² is small for both models, meaning that the regression models are not
overfitted to the training data and perform about as well on new data as on the data used to fit
them, although their overall predictive power remains limited.
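The metrics in Table 2 can be computed with Scikit-Learn's metrics module, as in the sketch below. The helper name regression_report and the small example arrays are assumptions for illustration; they are not the predictions behind Table 2.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R², MSE, RMSE, and MAE for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "MAE": mean_absolute_error(y_true, y_pred),
    }


# Illustrative values only.
y_true = np.array([5, 6, 5, 7, 6, 4])
y_pred = np.array([5.4, 5.8, 5.3, 6.1, 5.9, 5.1])

report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

Calling the helper once on the training predictions and once on the test predictions gives the two R² values needed to check for overfitting. Because RMSE is the square root of MSE, the two rows can be cross-checked; for the white wine model, the square root of 0.609705 is approximately 0.780836.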
CONCLUSION
The development of this machine learning model has highlighted the potential of data-driven
approaches in predicting wine quality based on measurable chemical properties. Through data
exploration and analysis, several key variables were identified as significant indicators of
quality in both red and white wines. Among these, alcohol content was positively correlated with
quality ratings and had the largest coefficient in both regression models, suggesting that higher
alcohol levels may enhance certain desirable sensory attributes. In contrast, volatile acidity,
chlorides, and residual sugar were generally associated with lower quality scores, implying that
excessive levels of these compounds can detract from the wine’s balance and overall flavor profile.
Data preprocessing, which included log transformation of highly skewed variables and
standardization, prepared the dataset for regression modeling. Selecting only the most relevant
features further streamlined the model, minimizing unnecessary complexity. This focused approach
also facilitated an interpretative framework for understanding the influence of each variable on
wine quality. In addition to identifying the primary factors affecting
quality, the model categorized wines into low, average, and high-quality tiers, providing an accessible
way to interpret the results and making it easier to derive actionable insights. This classification can
serve as a valuable tool for winemakers, offering guidance on the ideal chemical compositions that
may yield higher-quality wines. The findings also underscore the potential for machine learning
applications to standardize quality assessments in the wine industry, reducing reliance on subjective
sensory evaluations and supporting consistent quality control.
Future research could expand this work by incorporating additional sensory data and testing
the model on a more diverse range of wine types and varieties. Additionally, integrating more
advanced machine learning algorithms may uncover deeper insights into the complex relationships
between wine composition and perceived quality. The adaptability of this model positions it as a useful
asset for wine producers and researchers, who can leverage these insights to refine production
processes and tailor wine profiles to meet evolving consumer tastes. This project illustrates the power
of machine learning in offering precise, scalable solutions for quality prediction, with the potential to
transform quality assessment practices in oenology.
REFERENCES
Gadd, D. (2021, December 13). Understanding the dryness scale of wines. Wine Wisdoms.
https://winewisdoms.com/article/understanding-dryness-scale-of-wines
Granuzzo, S., Righetto, F., Peggion, C., Bosaro, M., Frizzarin, M., Antoniali, P., Sartori, G., & Lopreiato,
R. (2023). Sulphate uptake plays a major role in the production of sulphur dioxide by yeast cells
during oenological fermentations. Fermentation, 9(3), 280.
https://doi.org/10.3390/fermentation9030280
Grogan, K. A. (2015). The value of added sulfur dioxide in French organic wine. Agricultural and Food
Economics, 3(1). https://doi.org/10.1186/s40100-015-0038-1
Logothetis, S., & Walker, G. (2010). Influence of sodium chloride on wine yeast fermentation
performance. International Journal of Wine Research, 35. https://doi.org/10.2147/ijwr.s10889
Payan, C., Gancel, A., Jourdes, M., Christmann, M., & Teissedre, P. (2023). Wine acidification
methods: a review. OENO One, 57(3), 113–126.
https://doi.org/10.20870/oeno-one.2023.57.3.7476
Volschenk, H., Van Vuuren, H., & Viljoen-Bloom, M. (2017). Malic acid in wine: Origin, function and
metabolism during vinification. South African Journal of Enology and Viticulture, 27(2).
https://doi.org/10.21548/27-2-1613
APPENDIX: Python Code
import numpy as np
from scipy.stats import skew

# Calculate skewness of every numeric column in each dataset
white_numeric = whitewine_data.select_dtypes(include=[np.number]).columns
white_skewval = whitewine_data[white_numeric].apply(skew)
red_numeric = redwine_data.select_dtypes(include=[np.number]).columns
red_skewval = redwine_data[red_numeric].apply(skew)
# Display both results (notebook cell output)
white_skewval, red_skewval
Preprocessing
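from sklearn.preprocessing import StandardScaler

# Classify variables as X (features) or y (target)
X_white = white_filtered.drop('quality', axis=1)
y_white = white_filtered['quality']
X_red = red_filtered.drop('quality', axis=1)
y_red = red_filtered['quality']
# Standardize variables; a separate scaler per wine type keeps each fitted mean and scale available
scaler_white = StandardScaler()
scaler_red = StandardScaler()
X_white_scaled = scaler_white.fit_transform(X_white)
X_red_scaled = scaler_red.fit_transform(X_red)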