Artificial Intelligence Methods for the Classification of Documents in Spanish

Authors

  • Tad Gonsalves, Sophia University, Tokyo, Japan
  • Hang Hu, Sophia University, Tokyo, Japan
  • Yoshimi Hiroyasu, Sophia University, Tokyo, Japan, https://orcid.org/0000-0001-5596-9933

DOI:

https://doi.org/10.15381/lengsoc.v23i2.29208

Keywords:

artificial intelligence, machine learning, deep learning, data augmentation, document classification

Abstract

Rapid globalization and the growing need for cross-lingual communication call for modern, real-time corpora to support language learners. Traditional methods for building such corpora, particularly for Spanish, are inadequate because they cannot process the vast, unstructured data available online. This study explores Artificial Intelligence (AI) methodologies for automatically acquiring Spanish documents from the Web, preprocessing them, and classifying them in order to build a large and flexible corpus for learning Spanish. The research applies web crawling with the Scrapy framework to collect data, which are then cleaned and classified using advanced natural language processing (NLP) models. Specifically, the study uses the BERT (Bidirectional Encoder Representations from Transformers) model and its improved variant RoBERTa to classify the documents. Through a combination of data augmentation techniques and deep learning models, the study achieves high accuracy in classifying Spanish texts, demonstrating the potential of AI to overcome the limitations of traditional approaches to corpus construction.
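
The cleaning and data augmentation stages mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. The abstract does not say which cleaning steps or augmentation techniques the authors used, so everything here is an assumption made for illustration: tag stripping via `html.parser`, and random word swap and random word deletion as stand-in augmentations. The names `clean_html`, `random_swap`, `random_deletion`, and `augment` are hypothetical, and the crawling (Scrapy) and classification (BERT/RoBERTa) stages are deliberately left out.

```python
import random
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def clean_html(raw_html):
    """Strip markup from a crawled page and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def random_swap(words, rng):
    """Return a copy of `words` with two randomly chosen positions swapped."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, rng, p=0.1):
    """Drop each word with probability `p`, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(text, n_variants=2, seed=0):
    """Produce `n_variants` perturbed copies of `text` (assumed augmentation)."""
    rng = random.Random(seed)
    words = text.split()
    return [
        " ".join(random_deletion(random_swap(words, rng), rng))
        for _ in range(n_variants)
    ]


if __name__ == "__main__":
    page = "<html><body><p>El gato   come pescado fresco hoy.</p></body></html>"
    text = clean_html(page)
    print(text)
    for variant in augment(text, n_variants=3, seed=1):
        print(variant)
```

In a pipeline of the kind the abstract describes, the cleaned text and its augmented variants would share the same class label and be passed to the fine-tuning step. Word-level perturbations like these change word order and can alter meaning, so they are best treated as a rough baseline rather than as the method used in the study.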

Author Biographies

  • Tad Gonsalves, Sophia University, Tokyo, Japan

    Tad Gonsalves is a full professor in the Department of Information and Communication Sciences, Faculty of Science and Technology, Sophia University, Tokyo, Japan. His research areas include bio-inspired optimization techniques and the application of deep learning techniques to diverse problems such as autonomous driving, drones, digital art and music, and computational linguistics. More recently, he has also been developing Affective Computing models. Gonsalves holds a BS in Theoretical Physics and an MS in Astrophysics. He earned his PhD in Information Systems from Sophia University, Tokyo, Japan. His research laboratory (https://www.gonken.tokyo/) in Tokyo specializes in applications of deep learning and multi-GPU computing. Gonsalves has published over 150 papers in international conferences and journals. He is the author of the book Artificial Intelligence: A Non-Technical Introduction (2017), Sophia University Press, Tokyo, Japan, and co-author of Artificial Intelligence for Business Optimization: Research and Applications (2021), CRC Press, London.

  • Hang Hu, Sophia University, Tokyo, Japan

    Hang Hu obtained a BS in Software Engineering from Beijing University of Posts and Telecommunications and an MS in Information Science from Sophia University, Tokyo, Japan. As an undergraduate, he worked on the development of a social network content analysis system with a self-designed crawler. He began with a Scrapy-based crawler, carried out basic data processing and analysis with various methods, and developed a website with Django for presentation. He also has hands-on experience in developing mobile and cloud applications. His research field is natural language processing, especially the classification of web texts through fine-tuning of pre-trained deep learning models.

  • Yoshimi Hiroyasu, Sophia University, Tokyo, Japan

    Yoshimi Hiroyasu is a full professor at the Center for Language Education and Research (CLER) at Sophia University in Tokyo. She obtained her MA in Linguistics from Sophia University. Since 1989, she has been engaged in teaching Spanish as a Foreign Language (ELE). Her publications include several grammar books, self-study Spanish books, and dictionaries. She has also collaborated on numerous Spanish textbooks, notably El español y yo (2013) and ¡Muy bien! 1 and 2 (2018 and 2019). Currently, she is studying the textual traditions of Spanish teaching in Japan and is developing a corpus of textbooks used in Japan from 1900 to the present.

Published

2024-12-30

Section

Dossier on artificial intelligence, language, and digital discourse

How to Cite

Gonsalves, T., Hang, H., & Hiroyasu, Y. (2024). Métodos de inteligência artificial para a classificação de documentos em Espanhol. Lengua Y Sociedad, 23(2), 1047-1068. https://doi.org/10.15381/lengsoc.v23i2.29208