Integración de embeddings de nueva generación y recursos lingüísticos actuales para identificar palabras complejas en español con machine learning

Luis Iván Mera Dávila

doi:10.15381/rpcs.v6i2.29211

Integration of new generation embeddings and current linguistic resources to identify complex words in Spanish with machine learning

Authors

Luis Iván Mera Dávila, Universidad Nacional Mayor de San Marcos, Lima, Perú. https://orcid.org/0009-0002-7765-2691


Keywords:

Complex word identification, Embeddings, Lexical Simplification, Spanish

Abstract

Word complexity can limit access to information, a barrier that may affect millions of Spanish speakers. This study develops a machine learning model for the binary task of identifying complex words in Spanish, using next-generation embeddings, current linguistic resources, and lexical properties. The Spanish dataset from the CWI Shared Task 2018 was used, together with embeddings generated by the text-embedding-3-large model and word frequencies extracted from resources such as the Corpus del Español del Siglo XXI, the Corpus de Referencia del Español Actual, the Spanish Billion Word Corpus and Embeddings, and Wordfreq. Features and their best combination were selected through 5-fold cross-validation with XGBClassifier. After several machine learning algorithms were compared, the final model, based on LGBMClassifier, achieved a macro F1 score of 0.7993, surpassing the best team in that competition, more recent studies that used neural networks, and some large language models. These results show that such continually updated resources can improve accuracy on this task.
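The macro F1 score reported above averages the per-class F1 of the "complex" and "non-complex" labels with equal weight, so the minority class counts as much as the majority class. The following is a minimal sketch of that metric in plain Python; the function names and the toy labels are illustrative assumptions, not the author's code or data.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical toy labels: 1 = complex word, 0 = non-complex word.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

f1_complex = f1_for_class(y_true, y_pred, 1)  # 2/3
f1_simple = f1_for_class(y_true, y_pred, 0)   # 4/5
score = macro_f1(y_true, y_pred)              # (2/3 + 4/5) / 2 = 11/15
```

In this toy case the per-class scores are 2/3 and 4/5, so the macro F1 is 11/15, or about 0.733.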


Published

2024-12-30

Issue

Vol. 6 No. 2 (2024)

Section

Contribution

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

THE AUTHORS RETAIN THEIR RIGHTS:

(a) The authors retain their trademark and patent rights, as well as rights over any process or procedure described in the article.

(b) The authors retain the right to share, copy, distribute, perform, and publicly communicate the article published in the Revista Peruana de Computación y Sistemas (for example, by placing it in an institutional repository or publishing it in a book), with acknowledgment of its initial publication in the Revista Peruana de Computación y Sistemas.

(c) The authors retain the right to publish their work subsequently and to use the article or any part of it (for example, in a compilation of their work, lecture notes, a thesis, or a book), provided that they indicate the source of publication (authors of the work, journal, volume, number, and date).

How to Cite

Mera Dávila, L. I. (2024). Integration of new generation embeddings and current linguistic resources to identify complex words in Spanish with machine learning. Revista Peruana de Computación y Sistemas, 6(2), 55-64. https://doi.org/10.15381/rpcs.v6i2.29211
