
Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles

  • Conference paper
Soft Computing in Data Science (SCDS 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 652)

Abstract

Malay is a major language used by citizens of Malaysia, Indonesia, Singapore and Brunei. Because the language is so widely used, an abundance of text articles written in Malay is available on the internet, and the number of Malay articles published online has increased greatly over the years. Automatically labeling Malay text articles is crucial for managing them. Because resources and tools for automatic topic selection for Malay text articles are lacking, this paper studies the factors that influence the performance of algorithms that can be applied to this task. Topic selection is done by comparing the content of each article with the corresponding topics, and every article is assigned to the appropriate topic according to the result of the classification process. In this paper, the Malay articles are classified using the k-Nearest Neighbors (k-NN) and Naïve Bayes classifiers, each of which assigns a topic from a predefined set of topics. The effectiveness of the k-NN classifier depends heavily on the distance method used and on the number of nearest neighbors, k. This paper therefore also assesses the effects of using different distance methods (e.g., Cosine Similarity and Euclidean Distance) and of varying the number of nearest neighbors, k. The effect of stemming on the performance of the classifiers is also studied. The results show that the k-NN classifier performs better than the Naïve Bayes classifier in classifying the Malay articles into their respective topics. Stemming also improves the overall performance of both classifiers. In addition, using Cosine Similarity as the distance measure improves the performance of the k-NN classifier.
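
To make the two k-NN factors named in the abstract concrete (the distance method and the number of nearest neighbors, k), the following is a minimal sketch only, not the authors' implementation. It assumes plain bag-of-words term counts, whitespace tokenisation, and no stemming or term weighting; the function names, the toy training texts and the topic labels are all hypothetical and were invented for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two term-count vectors (0.0 = identical)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def knn_predict(query, training, k=3, metric="cosine"):
    """Assign the majority topic among the k training articles nearest to query."""
    q = term_vector(query)
    if metric == "cosine":
        # Higher similarity means closer, so negate it to sort nearest-first.
        scored = [(-cosine_similarity(q, term_vector(doc)), label)
                  for doc, label in training]
    elif metric == "euclidean":
        scored = [(euclidean_distance(q, term_vector(doc)), label)
                  for doc, label in training]
    else:
        raise ValueError(f"unknown metric: {metric}")
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: (article text, topic label). Not data from the paper.
TRAINING = [
    ("bola sepak pasukan menang perlawanan", "sukan"),
    ("pemain bola jaringan gol perlawanan", "sukan"),
    ("harga saham pasaran naik ekonomi", "ekonomi"),
    ("bank ekonomi pasaran pelaburan saham", "ekonomi"),
]

if __name__ == "__main__":
    query = "pasukan bola menang gol"
    for metric in ("cosine", "euclidean"):
        print(metric, knn_predict(query, TRAINING, k=3, metric=metric))
```

Cosine similarity ignores vector length and compares only the direction of the term counts, whereas Euclidean distance grows with differences in article length. That is one plausible reason the choice of distance method matters for text, although this sketch does not reproduce the paper's experiments or results.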

Author information

Corresponding author

Correspondence to Rayner Alfred.

Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Alfred, R., Ren, L.J., Obit, J.H. (2016). Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles. In: Berry, M., Hj. Mohamed, A., Yap, B. (eds) Soft Computing in Data Science. SCDS 2016. Communications in Computer and Information Science, vol 652. Springer, Singapore. https://doi.org/10.1007/978-981-10-2777-2_27

  • DOI: https://doi.org/10.1007/978-981-10-2777-2_27

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2776-5

  • Online ISBN: 978-981-10-2777-2

  • eBook Packages: Computer Science, Computer Science (R0)
