Applying genetic algorithms for the identification of Websites’ structure
DOI:
https://doi.org/10.3989/redc.2011.2.779
Keywords:
Link analysis, Website structure, factor analysis, genetic algorithms
Abstract
This paper explores website link structure: websites are treated as interconnected graphs and their features are analyzed as a social network. For each root domain, two networks are extracted: the domain network and the page network. In each case, a series of indicators taken from social network analysis is evaluated in order to characterize the website structure. Factor analysis may provide an appropriate statistical methodology for extracting, in graphic form, the principal profile of a website in terms of its internal structure. However, the large number of indicators available means that an exploratory search over them would involve a prohibitive number of possibilities. This work therefore proposes the use of genetic algorithms. By performing a guided search over a given space of possible solutions, genetic algorithms can provide a subset of indicators that optimizes a fitness function. The results categorize corporate websites in terms of their link structure and highlight the possibilities of genetic algorithms as a tool for knowledge discovery.
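The guided search described in the abstract can be illustrated with a minimal sketch of a genetic algorithm that picks a subset of indicators. The abstract does not state the fitness function or the genetic operators used in the paper, so everything below is an assumption made for illustration: the fitness is a placeholder (mean absolute pairwise correlation among the selected indicators, a rough proxy for how well a few factors could summarize them), and the names `pearson`, `fitness`, `select_indicators`, the parameter values, and the synthetic data are all hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def fitness(mask, columns, min_size=3):
    """Placeholder fitness (an assumption, not the paper's function):
    mean |r| over all pairs of selected indicator columns."""
    chosen = [col for bit, col in zip(mask, columns) if bit]
    if len(chosen) < min_size:
        return 0.0  # subsets that are too small are treated as invalid
    pairs = [(i, j) for i in range(len(chosen)) for j in range(i + 1, len(chosen))]
    return sum(abs(pearson(chosen[i], chosen[j])) for i, j in pairs) / len(pairs)


def select_indicators(columns, pop_size=20, generations=40, p_mut=0.1, seed=0):
    """Search over bit masks (1 = indicator selected) with a simple GA.
    Returns the best mask, its fitness, and the best fitness per generation."""
    rng = random.Random(seed)
    n = len(columns)

    def score(mask):
        return fitness(mask, columns)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(population, key=score)
        history.append(score(elite))
        children = [elite[:]]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            # Tournament selection (size 2) for each parent.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=score)
    return best, score(best), history


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Synthetic data: 6 indicator columns measured on 12 hypothetical websites.
    columns = [[data_rng.random() for _ in range(12)] for _ in range(6)]
    best_mask, best_score, _ = select_indicators(columns)
    print(best_mask, round(best_score, 3))
```

In a real study the placeholder fitness would be replaced by a criterion derived from the factor analysis itself, and the columns would hold the social network indicators measured on the domain and page networks of each website.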
License
Copyright (c) 2011 Consejo Superior de Investigaciones Científicas (CSIC)

This work is licensed under a Creative Commons Attribution 4.0 International License.
© CSIC. Manuscripts published in both the print and online versions of this journal are the property of the Consejo Superior de Investigaciones Científicas, and quoting this source is a requirement for any partial or full reproduction.
All contents of this electronic edition, except where otherwise noted, are distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence. You may read the basic information and the legal text of the licence. The indication of the CC BY 4.0 licence must be expressly stated in this way when necessary.
Self-archiving in repositories, personal webpages or similar, of any version other than the final version of the work produced by the publisher, is not allowed.