Automatic thematic classification of documents based on vocabularies and usage frequencies. The case of science dissemination articles

César González-Pérez; José Ignacio Vidal Liy; Ana García García; Pablo Calleja Ibáñez

doi:10.3989/redc.2023.3.1996

Authors

César González-Pérez Instituto de Ciencias del Patrimonio (Incipit), CSIC https://orcid.org/0000-0002-3976-7589
José Ignacio Vidal Liy Centro de Ciencias Humanas y Sociales, CSIC https://orcid.org/0000-0001-6169-784X
Ana García García Centro de Ciencias Humanas y Sociales, CSIC https://orcid.org/0000-0002-5952-4971
Pablo Calleja Ibáñez Universidad Politécnica de Madrid https://orcid.org/0000-0001-8423-8240

DOI:

https://doi.org/10.3989/redc.2023.3.1996

Keywords:

Document classification, thematic classification, algorithm, vocabularies, lexical frequencies, science dissemination

Abstract

Documents often need to be classified by assigning each one a theme or topic from a set of predefined options. This work is usually done manually by a specialist who reads the document. The manual process is tedious, requires time and resources, and is prone to the biases and preferences of each specialist.

As an alternative, this article presents an automatic thematic classification system that is highly parameterized, can classify hundreds of documents in a few seconds, and does not require a specialist's intervention. The system is based on predefined thematic vocabularies and the frequencies of use of lexical forms, and it assigns one or more priority topics to each document. The proposed approach has been developed and tested on science dissemination articles written in Spanish.
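The general idea of scoring a document against per-topic vocabularies can be illustrated with a minimal sketch. This is not the system presented in the article, whose algorithm and parameters are not given in this abstract. The vocabularies, the optional `weights` table (a stand-in for frequency-of-use information), and the `min_score` and `max_topics` parameters are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical topic vocabularies (illustrative only, not the article's).
VOCABULARIES = {
    "astronomy": {"planeta", "estrella", "galaxia", "telescopio"},
    "biology": {"célula", "gen", "proteína", "especie"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def classify(text, vocabularies, weights=None, min_score=0.01, max_topics=2):
    """Return up to max_topics (topic, score) pairs, best first.

    A topic's score is the weighted count of its vocabulary terms in the
    text, divided by the total number of tokens. The weights dict is an
    assumed hook for frequency-of-use data; unlisted terms weigh 1.0.
    """
    weights = weights or {}
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return []
    scores = {}
    for topic, vocab in vocabularies.items():
        hits = sum(counts[word] * weights.get(word, 1.0) for word in vocab)
        scores[topic] = hits / total
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only topics that reach the threshold, then cap the list length.
    return [(topic, score) for topic, score in ranked if score >= min_score][:max_topics]


sample = "El telescopio observó una galaxia y una estrella lejana"
result = classify(sample, VOCABULARIES)
print(result)
```

In this sketch a document can receive more than one topic when several vocabularies score above the threshold, which mirrors the "one or more priority topics" described above. How the actual system weights terms or chooses thresholds is left open here.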

With this approach, large numbers of documents can be classified systematically by topic, using fewer resources than manual classification and avoiding unknown biases. The approach has been shown to be as effective as other proposals while requiring fewer computational resources.

