Automatic thematic classification of documents based on vocabularies and use frequencies. The case of scientific dissemination articles
DOI:
https://doi.org/10.3989/redc.2023.3.1996
Keywords:
Document classification, thematic classification, algorithm, vocabularies, lexical frequencies, science dissemination
Abstract
It is often necessary to classify documents by assigning each one a theme or topic from a set of predefined options. This work is usually done manually by a specialist who reads the document. The manual process is tedious, requires time and resources, and is prone to the biases and preferences of each specialist.
As an alternative, this article presents an automatic thematic classification system that is highly parameterized, can classify hundreds of documents in a few seconds, and does not require a specialist's intervention. The system is based on predefined thematic vocabularies and on the frequencies of use of lexical forms, and it assigns one or more priority topics to each document. The suggested approach has been developed and tested on scientific dissemination articles in the Spanish language.
With this approach, large numbers of documents can be classified by topic systematically, using fewer resources than manual classification and avoiding unknown biases. The approach has proven to be as effective as other proposals while requiring fewer computational resources.
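The general idea of vocabulary-based classification can be illustrated with a minimal sketch. It is not the authors' implementation: the topic names, vocabulary terms, weights, and the `top_n` and `min_score` parameters are all hypothetical, and the per-term weights merely stand in for whatever frequency-based weighting the article actually uses. Each topic is scored by the weighted relative frequency of its vocabulary terms in the document, and the highest-scoring topics are returned.

```python
import re
from collections import Counter

# Hypothetical thematic vocabularies: term -> weight (illustrative values only).
VOCABULARIES = {
    "health": {"vacuna": 2.0, "virus": 1.5, "hospital": 1.0},
    "environment": {"clima": 2.0, "emisiones": 1.5, "bosque": 1.0},
}


def tokenize(text):
    """Lowercase the text and extract runs of Spanish letters as tokens."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def score_topics(text, vocabularies):
    """Score each topic by the weighted relative frequency of its terms."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return {
        topic: sum(weight * counts[term] for term, weight in vocab.items()) / total
        for topic, vocab in vocabularies.items()
    }


def classify(text, vocabularies, top_n=2, min_score=0.0):
    """Return up to top_n topics whose score exceeds min_score, best first."""
    scores = score_topics(text, vocabularies)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, score in ranked[:top_n] if score > min_score]


if __name__ == "__main__":
    print(classify("La vacuna contra el virus llegó al hospital", VOCABULARIES))
```

Because scoring needs only one token count per document and one pass over each vocabulary, a sketch of this kind runs without any training step. The `top_n` and `min_score` values are examples of the sort of parameters such a system might expose.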
License
Copyright (c) 2023 Consejo Superior de Investigaciones Científicas (CSIC)

This work is licensed under a Creative Commons Attribution 4.0 International License.
© CSIC. Manuscripts published in both the print and online versions of this journal are the property of the Consejo Superior de Investigaciones Científicas, and quoting this source is a requirement for any partial or full reproduction.
All contents of this electronic edition, except where otherwise noted, are distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence. You may read the basic information and the legal text of the licence. The indication of the CC BY 4.0 licence must be expressly stated in this way when necessary.
Self-archiving in repositories, personal webpages or similar, of any version other than the final version of the work produced by the publisher, is not allowed.