Artificial Intelligence Methods for Spanish Documents Classification
DOI:
https://doi.org/10.15381/lengsoc.v23i2.29208
Keywords:
artificial intelligence, machine learning, deep learning, data augmentation, document classification
Abstract
Rapid globalization and the growing need for cross-language communication call for modern, real-time corpora to aid language learners. Traditional methods for creating such corpora, especially in Spanish, are inadequate because they cannot process the vast, unstructured data available online. This study explores Artificial Intelligence (AI) methods for automatically acquiring Spanish documents from the web, pre-processing them, and classifying them in order to build a large and flexible corpus for Spanish learning. The research uses web crawling with the Scrapy framework to collect data, which is then cleaned and classified using advanced Natural Language Processing (NLP) models. Specifically, the study employs BERT (Bidirectional Encoder Representations from Transformers) and its enhanced variant RoBERTa for document classification. By combining data augmentation techniques with deep learning models, the study achieves high accuracy in classifying Spanish-language texts, demonstrating the potential of AI to overcome the limitations of traditional corpus-building approaches.
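The cleaning step that sits between crawling and classification can be illustrated with a short sketch. This is not the authors' pipeline: it is a minimal example using only the Python standard library, and the helper name `clean_html` and the sample page are hypothetical. It assumes that cleaning means dropping markup, scripts, and styles, normalizing Unicode so that Spanish accented characters are stored consistently, and collapsing whitespace.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw_html: str) -> str:
    """Hypothetical helper: strip markup, normalize Unicode, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    # NFC keeps characters such as "ñ" and "í" as single code points.
    text = unicodedata.normalize("NFC", text)
    # \s also matches non-breaking spaces produced by &nbsp;.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical crawled page used only for illustration.
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Economía</h1><p>El  mercado sube&nbsp;hoy.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(clean_html(page))
```

Text cleaned this way could then be tokenized and passed to a classifier; that stage depends on the specific model and is not shown here.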
License
Copyright (c) 2024 Tad Gonsalves, Hu Hang, Yoshimi Hiroyasu

This work is licensed under a Creative Commons Attribution 4.0 International License.
AUTHORS RETAIN THEIR RIGHTS
a. Authors retain their trademark and patent rights, as well as the rights to any process or procedure described in the article.
b. Authors may submit to the journal Lengua y Sociedad papers previously disseminated as preprints in repositories. This should be stated in the cover letter.
c. Authors retain their right to share, copy, distribute, perform, and publicly communicate their article (e.g., to place it in an institutional repository or publish it in a book), with an acknowledgment of its initial publication in the journal Lengua y Sociedad.
d. Authors retain their right to publish their work subsequently and to use the article or any part of it (e.g., in a compilation of their papers, lecture notes, a thesis, or a book), always indicating its initial publication in the journal Lengua y Sociedad (the originator of the work, journal, volume, number, and date).