Abstract
Malay is a major language used by citizens of Malaysia, Indonesia, Singapore and Brunei. Because the language is so widely used, a large and rapidly growing number of Malay text articles is available on the internet. Automatically labeling these articles is crucial for managing them. Because resources and tools for automatic topic selection of Malay text articles are lacking, this paper studies the factors that influence the performance of algorithms that can be applied to this task. Topic selection is done by comparing the content of each article with the candidate topics, and each article is assigned to the appropriate topic according to the result of the classification process. In this paper, Malay articles are classified using the k-Nearest Neighbors (k-NN) and Naïve Bayes classifiers, each of which assigns an article a topic from a predefined set of topics. The effectiveness of the k-NN classifier depends heavily on the distance measure used and on the number of nearest neighbors, k. This paper therefore also assesses the effects of using different distance measures (e.g., Cosine Similarity and Euclidean Distance) and of varying the number of nearest neighbors, k. The effect of stemming on the performance of the classifiers is also studied. The results show that the k-NN classifier performs better than the Naïve Bayes classifier in classifying Malay articles into their respective topics, and that stemming improves the overall performance of both classifiers. Using Cosine Similarity as the distance measure also improves the performance of the k-NN classifier.
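To make the k-NN setup concrete, the following is a minimal sketch of how the choice of distance measure and of k enters the classification. It is not the authors' implementation: the function names, the plain term-frequency representation, the whitespace tokenisation (with no stemming), and the four toy training sentences with their topic labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector from whitespace tokens (assumption: no stemming)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def euclidean_distance(a, b):
    """Euclidean distance over the union of terms in both vectors."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))


def knn_predict(train, text, k, distance):
    """Majority vote among the k training articles closest to `text`."""
    query = term_vector(text)
    nearest = sorted(train, key=lambda item: distance(term_vector(item[0]), query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: (article text, topic label). Not data from the paper.
train = [
    ("pasukan bola sepak menang perlawanan", "sukan"),
    ("pemain badminton menang kejohanan", "sukan"),
    ("harga saham naik di bursa", "ekonomi"),
    ("bank negara umum kadar faedah", "ekonomi"),
]
query = "pemain bola sepak menang"

for name, dist in [("cosine", cosine_distance), ("euclidean", euclidean_distance)]:
    print(name, knn_predict(train, query, k=3, distance=dist))
```

On this toy corpus both distance measures assign the query to the same topic; the abstract's claim that the two measures perform differently concerns the authors' actual data set and cannot be reproduced from an example this small.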
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Alfred, R., Ren, L.J., Obit, J.H. (2016). Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles. In: Berry, M., Hj. Mohamed, A., Yap, B. (eds) Soft Computing in Data Science. SCDS 2016. Communications in Computer and Information Science, vol 652. Springer, Singapore. https://doi.org/10.1007/978-981-10-2777-2_27
DOI: https://doi.org/10.1007/978-981-10-2777-2_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2776-5
Online ISBN: 978-981-10-2777-2
eBook Packages: Computer Science, Computer Science (R0)