Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles

Alfred, Rayner; Ren, Leow Jia; Obit, Joe Henry

doi:10.1007/978-981-10-2777-2_27

Part of the book series: Communications in Computer and Information Science (CCIS, volume 652)

Included in the following conference series:

International Conference on Soft Computing in Data Science

Abstract

Malay is a major language used by citizens of Malaysia, Indonesia, Singapore and Brunei. Because the language is widely used, an abundance of text articles written in Malay is available on the internet, and the number of Malay articles published online has increased greatly over the years. Automatically labeling Malay text articles is crucial for managing them. Because resources and tools for automatic topic selection in Malay text articles are lacking, this paper studies the factors that influence the performance of algorithms that can be applied to select topics automatically for Malay articles. This is done by comparing the contents of the articles with the corresponding topics; each Malay article is assigned to the appropriate topic depending on the results of the classification process. In this paper, all Malay articles are classified using the k-Nearest Neighbors (k-NN) and Naïve Bayes classifiers. Both classifiers assign a topic to each article from a predefined set of topics. The effectiveness of the k-NN classifier is highly dependent on the distance method used and the number of nearest neighbors, k. Thus, this paper also assesses the effects of using different distance methods (e.g., Cosine Similarity and Euclidean Distance) and of varying the number of nearest neighbors, k. The effects of stemming on the performance of the classifiers are also studied. Based on the results obtained, the k-NN classifier performs better than the Naïve Bayes classifier in classifying the Malay articles into their respective topics. In addition, stemming improves the overall performance of both classifiers. Other findings include that applying Cosine Similarity as the distance measure improves the performance of the k-NN classifier.
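As a rough illustration of the two k-NN factors named in the abstract (the distance method and the number of nearest neighbors, k), the following is a minimal sketch and not the authors' implementation. It assumes raw term counts as features and no stemming; the function names, the toy Malay snippets and the topic labels "sukan" and "ekonomi" are all invented for this example.

```python
import math
from collections import Counter

def term_counts(text):
    """Lowercase, split on whitespace, and count terms (no stemming)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean distance over the union of terms in both vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))

def knn_predict(train, query, k, distance):
    """Majority vote among the k training articles closest to the query."""
    q = term_counts(query)
    ranked = sorted(train, key=lambda item: distance(term_counts(item[0]), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: (article text, topic label). Not from the paper.
TRAIN = [
    ("pasukan bola menang perlawanan", "sukan"),
    ("pemain bola jaringkan gol", "sukan"),
    ("harga saham naik di bursa", "ekonomi"),
    ("bank umum kadar faedah baharu", "ekonomi"),
]
QUERY = "pasukan bola jaringkan gol"

if __name__ == "__main__":
    for name, dist in [("cosine", cosine_distance), ("euclidean", euclidean_distance)]:
        for k in (1, 3):
            print(name, k, knn_predict(TRAIN, QUERY, k, dist))
```

On this toy corpus both distance functions assign the query to "sukan" for k = 1 and k = 3. The two measures can disagree on real articles because Euclidean distance is sensitive to document length while cosine distance is not, which is one plausible reason the choice of distance method matters for text.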

Author information

Authors and Affiliations

Faculty of Computing and Informatics, Universiti Malaysia Sabah, Sabah, Malaysia
Rayner Alfred, Leow Jia Ren & Joe Henry Obit

Corresponding author

Correspondence to Rayner Alfred.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, Tennessee, USA
Michael W. Berry
Universiti Teknologi MARA, Shah Alam, Malaysia
Azlinah Hj. Mohamed
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, Malaysia
Bee Wah Yap

About this paper

Cite this paper

Alfred, R., Ren, L.J., Obit, J.H. (2016). Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles. In: Berry, M., Hj. Mohamed, A., Yap, B. (eds) Soft Computing in Data Science. SCDS 2016. Communications in Computer and Information Science, vol 652. Springer, Singapore. https://doi.org/10.1007/978-981-10-2777-2_27

DOI: https://doi.org/10.1007/978-981-10-2777-2_27
Published: 18 September 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2776-5
Online ISBN: 978-981-10-2777-2
eBook Packages: Computer Science, Computer Science (R0)
