Artificial Intelligence Methods for Spanish Documents Classification

Authors

  • Tad Gonsalves, Sophia University, Tokyo, Japan
  • Hang Hu, Sophia University, Tokyo, Japan
  • Yoshimi Hiroyasu, Sophia University, Tokyo, Japan https://orcid.org/0000-0001-5596-9933

DOI:

https://doi.org/10.15381/lengsoc.v23i2.29208

Keywords:

artificial intelligence, machine learning, deep learning, data augmentation, document classification

Abstract

Rapid globalization and the growing need for cross-language communication call for modern, real-time corpora to aid language learners. Traditional methods for creating such corpora, especially in Spanish, are inadequate because they cannot process the vast amount of unstructured data available online. This study explores Artificial Intelligence (AI) methods for automatically acquiring Spanish documents from the web, pre-processing them, and classifying them in order to build a large and flexible corpus for Spanish learning. Data are collected by web crawling with the Scrapy framework, then cleaned and classified using advanced Natural Language Processing (NLP) models. Specifically, the study employs BERT (Bidirectional Encoder Representations from Transformers) and its enhanced variant RoBERTa for document classification. By combining data augmentation techniques with deep learning models, the study achieves high accuracy in classifying Spanish-language texts, demonstrating the potential of AI to overcome the limitations of traditional corpus-building approaches.
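The cleaning step between crawling and classification can be pictured with a minimal sketch. It is an illustration only and not the authors' pipeline: the abstract does not specify the cleaning rules, so the helper names `clean_html` and `build_corpus`, the `min_words` threshold, and the exact-duplicate filter are all assumptions. It uses only the Python standard library and takes raw HTML strings such as a crawler might return.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def clean_html(raw_html: str) -> str:
    """Strip markup and collapse whitespace (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(pages, min_words=5):
    """Keep cleaned pages that are long enough and not exact duplicates.

    min_words is an assumed threshold, not a value from the paper.
    """
    seen = set()
    corpus = []
    for raw in pages:
        text = clean_html(raw)
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        corpus.append(text)
    return corpus
```

In a full pipeline, the strings returned by `build_corpus` would then be passed to a tokenizer and a fine-tuned classifier such as BERT or RoBERTa. That stage is omitted here because it depends on model and training details not given in the abstract.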

Author Biographies

  • Tad Gonsalves, Sophia University, Tokyo, Japan

    Tad Gonsalves is a full professor in the Department of Information and Communication Sciences, Faculty of Science and Technology, Sophia University, Tokyo, Japan. His research areas include bio-inspired optimization techniques and the application of deep learning to diverse problems such as autonomous driving, drones, digital art and music, and computational linguistics. More recently, he has also been developing Affective Computing models. Gonsalves holds a BS in Theoretical Physics and an MS in Astrophysics. He earned his PhD in Information Systems from Sophia University, Tokyo, Japan. His research laboratory (https://www.gonken.tokyo/) in Tokyo specializes in applications of deep learning and multi-GPU computing. Gonsalves has published over a hundred and fifty papers in international conferences and journals. He is the author of the book Artificial Intelligence: A Non-Technical Introduction (2017), Sophia University Press, Tokyo, Japan, and co-author of Artificial Intelligence for Business Optimization: Research and Applications (2021), CRC Press, London.

  • Hang Hu, Sophia University, Tokyo, Japan

    Hang Hu obtained a BS in Software Engineering from Beijing University of Posts and Telecommunications and an MS in Information Science from Sophia University, Tokyo, Japan. As an undergraduate, he worked on the development of a social network content analysis system with a self-designed crawler. He began with a Scrapy-based crawler, carried out basic data processing and analysis with various methods, and developed a website with Django to present the results. He also has hands-on experience in developing mobile and cloud applications. His research field is natural language processing, especially the classification of web texts through fine-tuning of pre-trained deep learning models.

  • Yoshimi Hiroyasu, Sophia University, Tokyo, Japan

    Yoshimi Hiroyasu is a full professor at the Center for Language Education and Research (CLER) at Sophia University in Tokyo. She obtained her MA in Linguistics from Sophia University. Since 1989, she has been engaged in the field of teaching Spanish as a Foreign Language (ELE). Her publications include several grammar books, self-study Spanish books, and dictionaries. She has also collaborated on numerous Spanish textbooks, notably El español y yo (2013) and ¡Muy bien! 1 and 2 (2018 and 2019). Currently, she is studying the textual traditions of Spanish teaching in Japan and is developing a corpus of textbooks used in Japan from 1900 to the present.

Published

2024-12-30

Section

Dossier on artificial intelligence, language, and digital discourse

How to Cite

Gonsalves, T., Hu, H., & Hiroyasu, Y. (2024). Artificial Intelligence Methods for Spanish Documents Classification. Lengua y Sociedad, 23(2), 1047-1068. https://doi.org/10.15381/lengsoc.v23i2.29208