Prastiani Social Media Sentiment Analysis For Local
Abstract— The United Nations under the WMO predicts that more than 5 billion people will experience a water crisis in
the PDAM account on Instagram updates activities more
2023 10th International Conference on ICT for Smart Society (ICISS) | 979-8-3503-3954-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICISS59129.2023.10291991
Authorized licensed use limited to: Universita degli Studi di Napoli Federico II. Downloaded on May 16,2024 at 07:46:58 UTC from IEEE Xplore. Restrictions apply.
III. METHODOLOGY
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is used as the research methodology. CRISP-DM is defined as a hierarchical process model made up of groups of tasks [20].
Fig. 1. Method CRISP-DM
A. Business understanding
In this study, the company wants to know the sentiments of PDAM users towards the services that have been provided. By knowing user sentiment, the company can identify services that need improvement and thereby increase customer satisfaction. To get accurate results, user sentiment data must be collected directly. Facebook and Instagram are considered suitable media for collecting PDAM user sentiment data.
B. Data understanding
The data used in this study is PDAM user sentiment data from Facebook and Instagram. It is primary data collected through scraping with the Data Scraper tool. The data is taken from PDAM social media accounts spread throughout Indonesia. Data collection was carried out from 12 November 2022 to 6 March 2023. A total of 12,019 records were collected, each with username and comment attributes, distributed as follows.
Fig. 2. Raw data distribution
The data then enters the next stage, data preparation, where data with a neutral label is deleted because it does not match the research objectives, bringing the total to 5,131.
C. Data preparation
The dataset needs to be prepared in order to produce a good machine learning model. Seven steps are carried out at this stage. First, labeling is done using the SentiStrength algorithm, and the labels are then checked again manually. Second, data cleaning aims to improve data quality by identifying and eliminating errors and inconsistencies [21]. The cleaning steps are case folding and the removal of usernames, hashtags, URLs, ASCII and Unicode/emoticon characters, punctuation, numbers, duplicates, and empty comments. Third, spelling correction fixes spelling errors in a text or word. Fourth, stemming removes the affixes of a word to reduce it to its base form, using the Sastrawi library with the StemmerFactory() function. Fifth, tokenizing separates the data into tokens, which can be words, numbers, or symbols, to simplify analysis at a later stage. Sixth, stopword removal uses the NLTK library as a corpus, with words added to the list according to the context of the dataset. Seventh, TF-IDF is implemented using the TfidfVectorizer() function in the sklearn library. After data preparation, 5,071 records remain, distributed as follows.
Fig. 3. Dataset Distribution
TABLE I. DATA PREPARATION
Before: AyoooLahhhhh PDAM,ini dah lebih dr 24 Jam Lochhh.......Kita Butuh Airrrrrrrrrr, ,Normalisasi sampe Kapan woiiiiiii
After:  ayo pdam sudah lebih dari jam kita butuh air normalisasi sampai kapan
Before: https://instagram.com/stories/perumdatugutirta/2942401817216478157?utm_source=ig_story_item_share&igshid=MDJmNzVkMjY= harusnya dikasih piala juara 1 mati air, angin doang yang keluar
After:  harus kasih piala juara mati angin doang keluar
D. Modelling
The steps taken are data splitting, class balancing, and implementation of the SVM algorithm.
1) Data splitting divides the data into two components, the training set and the test set. It uses the holdout method, with the data split at ratios of 60:40, 70:30, 75:25, 80:20, and 90:10.
2) Class balancing is done because the dataset is imbalanced. The data is balanced using the SMOTE technique from the imbalanced-learn package with the minority approach.
TABLE II. SMOTE RESULT
Ratio   Imbalanced data                Balanced data
        Positive  Negative  Total      Positive  Negative  Total
60:40   350       2692      3042       2692      2692      5384
70:30   409       3140      3549       3140      3140      6280
75:25   438       3365      3803       3365      3365      6730
80:20   467       3589      4056       3589      3589      7178
90:10   525       4038      4563       4038      4038      8076
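As a rough illustration of the holdout split in step 1), the sketch below shuffles record indices and cuts them at each ratio. It assumes that the test size is rounded up and the remainder goes to training, which mirrors how scikit-learn's train_test_split sizes a fractional split; the function name `holdout_split` and the seed are hypothetical and not taken from the paper. With the 5,071 prepared records, the resulting training sizes agree with the imbalanced totals in Table II.

```python
import math
import random


def holdout_split(n_records, test_fraction, seed=42):
    """Shuffle record indices and split them into (train, test) index lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Assumption: the test size is rounded up and the rest is used for training.
    n_test = math.ceil(test_fraction * n_records)
    return indices[n_test:], indices[:n_test]


# Ratios from the paper, written as train:test percentages.
RATIOS = {"60:40": 0.40, "70:30": 0.30, "75:25": 0.25, "80:20": 0.20, "90:10": 0.10}

split_sizes = {}
for name, test_fraction in RATIOS.items():
    train_idx, test_idx = holdout_split(5071, test_fraction)
    split_sizes[name] = (len(train_idx), len(test_idx))
    print(name, len(train_idx), len(test_idx))
```

The test sizes produced this way (2029, 1522, 1268, 1015, 508) also equal the sums of the negative and positive support values reported per ratio in the classification report.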
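The idea behind the SMOTE balancing in step 2) can be sketched in plain Python: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration and not the imbalanced-learn implementation used in the study; the function `smote_oversample`, the neighbour count `k`, and the toy vectors are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_majority, k=2, seed=0):
    """Return the minority samples plus synthetic ones until n_majority is reached.

    Needs at least two minority samples so that a neighbour always exists.
    """
    rng = random.Random(seed)
    samples = [list(x) for x in minority]
    while len(samples) < n_majority:
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        samples.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return samples


# Toy two-dimensional vectors standing in for TF-IDF rows of the positive class.
positive = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
balanced_positive = smote_oversample(positive, n_majority=7)
print(len(balanced_positive))
```

In Table II this corresponds, for the 70:30 split, to raising the 409 positive training samples to 3,140 so that both classes have the same size.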
3) Implementation of the SVM algorithm with a linear kernel is carried out on the training data that has gone through the processes of the previous stages, so that the model can make predictions or identify patterns in data it has never seen before. For a manual calculation, SVM modelling can refer to the linear equation $f(x) = \operatorname{sign}(w^T x + b)$. Before that, the weight vector value of each word must be found. Documents with a positive label use the formula $w^T x_1 + b = 1$, and documents with a negative label use the formula $w^T x_2 + b = -1$, where the TF-IDF calculation results for each token form the input vector $x$.
TABLE III. TEST RESULTS OF THE HOLDOUT METHOD RATIO
Ratio   Accuracy
        Imbalanced data   Balanced data (SMOTE)
60:40   94.5%             93.8%
70:30   94.8%             94.5%
75:25   94.5%             94%
80:20   94.8%             93.3%
90:10   94.5%             93.3%
Table III shows the accuracy of the model after training using the holdout method at all ratios.
IV. EXPERIMENT AND RESULT ANALYSIS
A. Evaluation
Evaluation is needed to measure the performance and quality of machine learning algorithms that have been trained to generate accurate predictions on previously unseen data. In this study, the machine learning models are evaluated with four evaluation methods. After experimenting, the model trained on balanced data at a 70:30 ratio is the best model, with the following evaluation results.
1) Confusion Matrix
The confusion matrix is a data mining method that can be used to assess the correctness of the data so that the data can be utilized in a decision support system [22].
TABLE IV. MODEL EVALUATION WITH CONFUSION MATRIX ON IMBALANCED DATA
Ratio   TP    FN   TN     FP
60:40   140   94   1777   18
70:30   108   67   1335   12
75:25   89    57   1110   12
80:20   73    44   889    9
90:10   36    23   444    5
2) Classification report
The classification report provides information about the effectiveness of the classification model based on evaluation parameters including accuracy, precision, recall (sensitivity), F1-score, and support for each class. In this study, model selection focuses on a high F1-score because it represents a good balance between precision and recall, which minimizes false positive and false negative errors. On this basis, the best model is the one trained on balanced data at a 70:30 ratio.
TABLE VI. MODEL EVALUATION WITH CLASSIFICATION REPORT ON BALANCED DATA (SMOTE)
Ratio   Label      Precision   Recall   F1-Score   Support   Accuracy
60:40   Negative   0.96        0.97     0.97       1795      0.94
        Positive   0.75        0.70     0.72       234
70:30   Negative   0.96        0.98     0.97       1347      0.95
        Positive   0.79        0.71     0.75       175
75:25   Negative   0.96        0.97     0.97       1122      0.94
        Positive   0.74        0.73     0.73       146
80:20   Negative   0.96        0.96     0.96       898       0.94
        Positive   0.71        0.71     0.71       117
90:10   Negative   0.96        0.96     0.96       449       0.94
        Positive   0.72        0.69     0.71       59
3) ROC/AUC
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR). The ROC curve is used to describe the classifier's diagnostic capability [23]. The following are general guidelines for classifying test accuracy using AUC [24].
0.90-1.00 = Excellent classification
0.80-0.90 = Good classification
0.70-0.80 = Fair classification
0.60-0.70 = Poor classification
0.50-0.60 = Failure
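The AUC guideline above can be written as a small lookup, shown here as a minimal sketch. Because adjacent bands share their boundary values, the code assumes each band includes its lower bound; the function name `auc_category` is hypothetical. The value 0.927 is the AUC reported in this paper.

```python
def auc_category(auc):
    """Map an AUC value to the guideline categories for classification quality."""
    # Assumption: each band includes its lower bound, so 0.90 counts as excellent.
    bands = [
        (0.90, "Excellent classification"),
        (0.80, "Good classification"),
        (0.70, "Fair classification"),
        (0.60, "Poor classification"),
        (0.50, "Failure"),
    ]
    for lower_bound, label in bands:
        if auc >= lower_bound:
            return label
    return "Below 0.50 (not covered by the guideline)"


print(auc_category(0.927))  # Excellent classification
```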
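As a cross-check on the evaluation tables, the counts in Table IV can be turned into the usual metrics with a few lines of arithmetic. The sketch below assumes the standard definitions of accuracy, precision, recall, and F1 with the positive class as the class of interest; the helper name `metrics_from_counts` is hypothetical. For the 70:30 row it gives an accuracy of about 94.8%, matching the imbalanced-data column of Table III.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 70:30 row of Table IV (imbalanced data): TP=108, FN=67, TN=1335, FP=12.
report = metrics_from_counts(tp=108, fn=67, tn=1335, fp=12)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```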
In the ROC-AUC evaluation, the model has an AUC value of 0.927, which falls into the excellent classification category, and the resulting curve also shows good performance.
4) K-Fold Cross Validation
K-Fold Cross Validation helps establish the model's robustness, namely the accuracy and success of classification when applied to new data. K-Fold Cross Validation also determines the extent to which the model is overfitting, which can occur when the calibration error rate is low but the cross-validation error rate is high. This indicates that the model works well for some initial data or situations but not for other data or situations [25].
Based on Figure 6 above, modeling is carried out using scikit-learn, a module for the Python programming language. The model fitted on the training data is then saved as a .pkl file using Python's pickle module. Next, a prediction routine is built to classify the input received from the user. All of these processes are packaged using Flask as the web development framework, after which the website is ready to use.
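A minimal sketch of the save-and-reload step with Python's pickle module follows. Because the paper's trained scikit-learn model is not available here, a hypothetical stand-in is used instead: a dictionary of per-token weights and a bias for a linear decision function of the form sign(w.x + b). The file name `model.pkl`, the example tokens, and all numeric values are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model: per-token weights and a bias
# of a linear decision function f(x) = sign(w.x + b).
model = {"weights": {"lancar": 1.2, "mati": -1.5, "keruh": -0.9}, "bias": 0.1}


def predict(model, token_scores):
    """Classify a document given its token -> TF-IDF score mapping."""
    score = model["bias"] + sum(
        model["weights"].get(token, 0.0) * value
        for token, value in token_scores.items()
    )
    return "positive" if score >= 0 else "negative"


# Save the model as a .pkl file, then reload it as a web app would at startup.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

print(predict(restored, {"air": 0.4, "mati": 0.7}))  # negative
```

In a Flask application, the reloaded object would typically be loaded once at startup and then called inside the request handler that receives the user's comment.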
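The K-Fold Cross Validation procedure described in the evaluation can be illustrated with a small index generator: the data is cut into k consecutive folds, and each fold serves once as the validation set while the remaining folds form the training set. The paper does not state the value of k, so the sample count, `k=3`, and the function name `k_fold_indices` below are hypothetical; shuffling is left out to keep the sketch short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

A large gap between the training error and the average validation error across folds is the overfitting signal that the text refers to.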
positive class is 0.79, 0.71 and 0.75. The resulting accuracy is 94% and the AUC value is 0.927. After further analysis of the negative sentiments of PDAM users, many complain about services related to frequent water failures, payments, leaks, cloudy water, and meter records.
In this study, data with positive labels had fewer word variations and a small amount of data, so there are still prediction errors on positive sentiment. It is hoped that future research can increase the amount of data with positive labels and its word variation in order to minimize prediction errors on positive sentiment.
REFERENCES
[1] World Meteorological Organization, “Wake Up to The Looming Water Crisis, Report Warns,” World Meteorological Organization, 2021. https://public.wmo.int/en/media/press-release/wake-looming-water-crisis-report-warns (accessed Nov. 27, 2022).
[2] Kementrian Pekerjaan Umum dan Perumahan Rakyat, “Meski Semakin Langka, Air Tanah Masih Diminati Masyarakat,” Kementrian Pekerjaan Umum dan Perumahan Rakyat, 2021. https://www.pu.go.id/index.php/berita/meski-semakin-langka-air-tanah-masih-diminati-masyarakat (accessed Nov. 27, 2022).
[3] Badan Pusat Statistik, “Jumlah Pelanggan Perusahaan Air Bersih 2019-2021,” Badan Pusat Statistik, 2022. https://www.bps.go.id/indicator/7/76/1/jumlah-pelanggan-perusahaan-air-bersih.html (accessed Jun. 27, 2023).
[4] wearesocial, “The World’s Most-Used Social Platforms,” wearesocial, 2023. https://wearesocial.com/uk/blog/2023/01/the-changing-world-of-digital-in-2023/ (accessed Jun. 27, 2023).
[5] F. S. Lubis, M. Lubis, L. Hakim, and H. Fakhrurroja, “The Text Mining Analysis Approach for Electronic Information and Transaction (ITE) Implementation Based on Sentiment in the Social Media,” in Lecture Notes in Networks and Systems, Springer Science and Business Media Deutschland GmbH, 2023, pp. 263–271. doi: 10.1007/978-981-19-7660-5_23.
[6] R. Santosa, “Quality of Public Service for Regional Water Companies: A Case Study in Local Water company Region II Makassar City,” International Journal of Multicultural and Multireligious Understanding, vol. 7, no. 2, p. 498, Mar. 2020, doi: 10.18415/ijmmu.v7i2.1496.
[7] R. A. Wildan, R. A. Rajagede, and R. Rahmadi, “Analisis Sentimen Politik Berdasarkan Big Data dari Media Sosial Youtube : Sebuah Tinjauan Literatur,” Automata, vol. 2, 2021.
[8] L. Magfiroh, H. Sembiring, A. Sihombing, and S. Kaputama Binjai, “Clustering of Customer Complaints from PDAM Kota Binjai Using the K-Means Method,” 2022. [Online]. Available: https://ijhet.com/index.php/ijhess/
[9] R. Diouf, E. N. Sarr, O. Sall, B. Birregah, M. Bousso, and S. N. Mbaye, “Web Scraping: State-of-the-Art and Areas of Application,” in 2019 IEEE International Conference on Big Data (Big Data), IEEE, Dec. 2019, pp. 6040–6042. doi: 10.1109/BigData47090.2019.9005594.
[10] M. Agarwal, “An Overview of Natural Language Processing,” Int J Res Appl Sci Eng Technol, vol. 7, no. 5, pp. 2811–2813, May 2019, doi: 10.22214/ijraset.2019.5462.
[11] M. Abd, E. Mohammed, A. A. Al-Qaness, A. A. Ewees, and A. Dahou, “Studies in Computational Intelligence 874 Recent Advances in NLP: The Case of Arabic Language,” 2020. [Online]. Available: http://www.springer.com/series/7092
[12] Y. Santur, “Sentiment Analysis Based on Gated Recurrent Unit,” in 2019 International Artificial Intelligence and Data Processing Symposium (IDAP), IEEE, Sep. 2019, pp. 1–5. doi: 10.1109/IDAP.2019.8875985.
[13] C. S. Hudaya, H. Fakhrurroja, and A. Alamsyah, “ANALISIS PERSEPSI KONSUMEN TERHADAP BRAND GO-JEK PADA MEDIA SOSIAL TWITTER MENGGUNAKAN METODE SENTIMENT ANALYSIS DAN TOPIC MODELLING,” Jurnal Mitra Manajemen, vol. 3, no. 6, pp. 664–673, Jul. 2019, doi: 10.52160/ejmm.v3i6.244.
[14] B. Pahwa, S. Taruna, and N. Kasliwal, “Sentiment Analysis-Strategy for Text Pre-Processing,” Int J Comput Appl, vol. 180, no. 34, pp. 15–18, Apr. 2018, doi: 10.5120/ijca2018916865.
[15] R. Ferreira-Mello, M. André, A. Pinheiro, E. Costa, and C. Romero, “Text mining in education,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 6. Wiley-Blackwell, Nov. 01, 2019. doi: 10.1002/widm.1332.
[16] S.-W. Kim and J.-M. Gil, “Research paper classification systems based on TF-IDF and LDA schemes,” Human-centric Computing and Information Sciences, vol. 9, no. 1, p. 30, Dec. 2019, doi: 10.1186/s13673-019-0192-7.
[17] H. Hairani, K. E. Saputro, and S. Fadli, “K-means-SMOTE for handling class imbalance in the classification of diabetes with C4.5, SVM, and naive Bayes,” Jurnal Teknologi dan Sistem Komputer, vol. 8, no. 2, pp. 89–93, Apr. 2020, doi: 10.14710/jtsiskom.8.2.2020.89-93.
[18] A. N. Kasanah, M. Muladi, and U. Pujianto, “Penerapan Teknik SMOTE untuk Mengatasi Imbalance Class dalam Klasifikasi Objektivitas Berita Online Menggunakan Algoritma KNN,” Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), vol. 3, no. 2, pp. 196–201, Aug. 2019, doi: 10.29207/resti.v3i2.945.
[19] B. Quinto, Next-generation machine learning with spark: Covers XGBoost, LightGBM, Spark NLP, distributed deep learning with keras, and more. Apress Media LLC, 2020. doi: 10.1007/978-1-4842-5669-5.
[20] O. Monica, F. W. Wahida, and H. Fakhruroja, “The Relations Between Influencers in Social Media and The Election Winning Party 2019,” in 2019 International Conference on ICT for Smart Society (ICISS), IEEE, Nov. 2019, pp. 1–5. doi: 10.1109/ICISS48059.2019.8969801.
[21] F. Ridzuan and W. M. N. Wan Zainon, “A Review on Data Cleansing Methods for Big Data,” Procedia Comput Sci, vol. 161, pp. 731–738, 2019, doi: 10.1016/j.procs.2019.11.177.
[22] F. Rahmad, Y. Suryanto, and K. Ramli, “Performance Comparison of Anti-Spam Technology Using Confusion Matrix Classification,” IOP Conf Ser Mater Sci Eng, vol. 879, no. 1, p. 012076, Jul. 2020, doi: 10.1088/1757-899X/879/1/012076.
[23] C. Kar, A. Kumar, and S. Banerjee, “Tropical cyclone intensity detection by geometric features of cyclone images and multilayer perceptron,” SN Appl Sci, vol. 1, no. 9, p. 1099, Sep. 2019, doi: 10.1007/s42452-019-1134-8.
[24] W. Bourequat and H. Mourad, “Sentiment Analysis Approach for Analyzing iPhone Release using Support Vector Machine,” International Journal of Advances in Data and Information Systems, vol. 2, no. 1, pp. 36–44, Apr. 2021, doi: 10.25008/ijadis.v2i1.1216.
[25] B. G. Marcot and A. M. Hanea, “What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?,” Comput Stat, vol. 36, no. 3, pp. 2009–2031, Sep. 2021, doi: 10.1007/s00180-020-00999-9.