Revista española de Documentación Científica, Vol 40, No 3 (2017)

Determining user groups of digital libraries through log file analysis

Juan Antonio Martínez-Comeche



This study analyzes how users carry out query-based information search and retrieval tasks in the Biblioteca Digital Hispánica, distinguishing groups of users according to their information behavior. To this end, the log files collected by the server over one year are used, and several clustering algorithms are compared. The k-means algorithm is found to be a suitable clustering procedure for analyzing large query log files in digital libraries. In the case of the Biblioteca Digital Hispánica, three groups of users are distinguished, and the distinctive information behavior of each is described.
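As a rough illustration of the kind of grouping described above, the following minimal sketch runs k-means (Lloyd's iteration) on a few per-session features. Everything specific in it is an assumption made for illustration: the two features (queries per session, session length in minutes), the toy values, and the deterministic farthest-point initialization are hypothetical and are not taken from the study, whose actual variables and tooling are not stated in this abstract.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means: alternate assignment and centroid update steps
    until the assignment no longer changes."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical sessions: (queries per session, session length in minutes).
sessions = [
    (1, 2), (2, 3), (1, 3),        # brief visits
    (10, 20), (11, 22), (9, 19),   # moderate use
    (30, 60), (32, 58), (29, 61),  # intensive use
]

labels, centroids = kmeans(sessions, k=3)
for session, label in zip(sessions, labels):
    print(session, "-> group", label)
```

On real log data the features would normally be standardized first, since k-means relies on Euclidean distance and is sensitive to differences in scale, and the number of groups would be checked with a validity index rather than fixed in advance.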

Keywords

Clustering; k-means algorithm; digital libraries; log files; transaction log analysis; Biblioteca Digital Hispánica

Copyright (c) 2017 Consejo Superior de Investigaciones Científicas (CSIC)

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Spain (CC-by) license.