Differences and evolution of scholarly impact in Google Scholar Citations profiles: An application of decision trees

The aim of this paper is to analyse the research performance of more than 3,000 profiles from Google Scholar Citations in order to determine which groups (by gender, academic position and discipline) bring together the most successful profiles. The analysis was approached from both a static and a longitudinal point of view. Decision trees were used to detect the most important variables for distinguishing successful profiles and to observe which categories bring together more authors with high numbers of citations and high h-indexes. Results show that career stage is the most relevant factor in obtaining citations and improving the h-index. Senior researchers are thus ranked in the best positions, while young scholars show nascent curricula. However, this distribution changes when growth rates are computed. It is concluded that researchers with a stable career in the life sciences have greater research impact than young researchers from the humanities and social sciences, although the fastest growing profiles belong to young scholars.


Introduction
characterize the research impact of a scientist and then differentiating profiles in research assessment exercises. On the other hand, the impact that these products have on both the research community and society in general varies widely and is, above all, difficult to measure. For this reason, bibliometrics is opening up to new indicators (e.g., altmetrics, webometrics) and new sources (e.g., social network sites, academic search engines) that widen the scope of scientific impact (Aguillo et al., 2005; Piwowar, 2013). Within this framework, this study introduces Google Scholar Citations (GSC) as a new open source, exploring its utility for bibliometric analyses and its suitability for research evaluation.

Related research
One of the initial purposes of bibliometrics was to uncover which elements influence the obtaining of citations. From the outset, significant differences were detected in the distribution of citations by discipline, as a result of different scientific cultures (Solla Price, 1970; Small & Griffith, 1974). This prompted several studies exploring the reasons behind these differences and their implications for research evaluation (Kostoff, 1998). In this sense, many papers argued that these differences among disciplines are due to size effects such as the number of publications (Small & Crane, 1979; Schubert & Braun, 1986), the number of references (Garfield, 1980) or the degree of collaboration (Smart & Bayer, 1986). More recently, these disciplinary citation patterns have been studied from an evolutionary perspective. Radicchi et al. (2008) observed that the evolution of citations at article level follows a universal distribution when citations are normalized by the citation average of the discipline. Althouse et al. (2009) analysed the reasons for the increase in journal impact factors and found that it is mostly due to changes in the length of reference lists. Finally, Finardi (2014) observed different evolutionary patterns in several journals from chemistry and the social sciences.
These differences are perhaps easier to observe in relation to academic position, as a proxy for research maturity. Many works have addressed these differences mainly with regard to academic production (Long, 1978; Hancock et al., 1992; Jacobs & Ingwersen, 2000), whereas only a few have addressed the relationship between career and citations. Ventura and Mombrú (2006) tested the performance of Full Professors and Associate Professors from Uruguay and found that academic position positively influenced their citation rates. Abramo et al. (2009) studied 33,000 Italian researchers and observed that research impact increased as higher academic positions were achieved. Similar results were obtained by Pagel and Hudetz (2011) when they analysed the h-index of more than 1,600 US anaesthesiologists. However, Aksnes et al. (2011) studied 8,500 Norwegian researchers and, surprisingly, did not find significant differences according to academic rank. From an evolutionary perspective, Penner et al. (2013) concluded that research performance strongly depends on career age, and that the first years are critical for building a successful future curriculum (Maranto & Streuly, 1994).
The literature on gender differences has been more prolific. Many studies detected variations in the number of research papers (Kyvik & Teigen, 1996; Abramo et al., 2009) and in academic rank (Long, 2001). Nevertheless, no such differences between males and females were observed with regard to citations (Ding et al., 2006; Penas & Willett, 2006). Aksnes et al. (2011), however, did observe differences, although these were caused by production factors and cumulative advantages.
The launch of GSC in 2011 attracted the attention of several researchers, who explored the potential of this tool for research evaluation (Pitney and Gilson, 2012; Huang and Yuan, 2012). Ortega and Aguillo (2012) built a map of science using the labels of the profiles, and also mapped country and institutional collaboration networks using the co-author lists included in these profiles (Ortega and Aguillo, 2013). On the other hand, Delgado López-Cózar et al. (2014) demonstrated that the bibliometric scores in profiles can be manipulated.

Objectives
The aim of this study is to analyse the evolution of citation patterns in more than 3,000 GSC research profiles from several samples extracted during the 2011-2013 period. It seeks to describe which factors (gender, position and research area) could influence the evolution of these bibliometric indicators. This principal objective is broken down into several research questions:
• Are there any gender, position or research area differences when it comes to achieving a better research impact?
• Are there any gender, position or research area differences that influence, to a greater or lesser extent, the evolution of this research impact?
• Are decision trees suitable tools for distinguishing and classifying the research performance of authors?
• Is Google Scholar Citations an appropriate instrument for bibliometric analyses?

Data extraction
Google Scholar Citations is a web service that facilitates the web publication of personal curricula with bibliographic data from Google Scholar. Besides the publication list, it calculates several bibliometric indicators (citations, h-index, etc.) and shows identification data (name, affiliation, e-mail domain, etc.). The service was launched in November 2011 and probably contains around 300,000 profiles from around the world (Ortega, in press). The reasons for selecting this data source were:
• It is an open web service that allows data to be extracted automatically from the profiles.
• Its fast updating favours the extraction of several samples over time and their comparison.
• In some cases, it is possible to identify the position, gender and research area of each profile, which facilitates grouping profiles by categories.
• Google Scholar is probably the most exhaustive scientific database, so its figures should be fairly consistent and reliable.
The data harvesting process was detailed in previous works (Ortega and Aguillo, 2012; 2013). It was carried out in two stages. First, a SQL script was written to crawl the entire site, querying the 25 letters of the Latin alphabet in groups of two letters, identifying as many profiles as possible and extracting their author identifiers. Once this process was finished, a second script harvested the basic data from each profile, such as name and affiliation, together with the bibliometric indicators (citations, papers, h-index and i10-index). Five quarterly samples were taken from December 2011 to December 2012, and another one in December 2013. The 12,480 profiles that appear in every sample were taken to test the bibliometric evolution of these researchers. Next, these records were subjected to a cleansing and normalization process to homogenize and group the categorical variables. Of these, only 3,034 could be classified according to all three categories at the same time. This process was mainly intended to standardize three categorical variables:
• Gender: The gender of 7,673 (61%) user profiles was identified through their first names. This was done only for names that are commonly male or female. When in doubt, no gender was assigned.
• Position: Six professional categories, as close as possible to the academic hierarchy, were defined to group academic ranks. This variable was filled in only where the affiliation mentioned the position of the researcher, for example, "Candidate Ph.D of Computer Science, QUT". 6,559 (52.5%) positions were identified.
• Subject area: As in previous works on GSC (Ortega and Aguillo, 2012), labels were grouped and classified to describe the research interests of each scholar. The Subject Area categories of Scopus (2013) were then used to group the labels into four main Subject Areas: Physical Sciences, Health Sciences, Social Sciences and Life Sciences. An Arts and Humanities area was added because it was assumed that these researchers would behave differently from Social Sciences researchers and therefore had to be analysed separately. 8,743 (70%) profiles could be classified.

Indicators
Next, two bibliometric indicators that express a relative value between production (papers) and impact (citations) were calculated. These indicators are considered more robust because they are built as a ratio between two interrelated and dependent magnitudes:
• Cit./Pap.: Total number of citations received by each author divided by the number of papers.
• h-index: It is formally defined as the number of papers h that have each received at least h citations. For example, an h-index of 5 means that the author has published at least 5 papers that have each been cited five or more times. However, this indicator is highly dependent on the number of publications.
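The two definitions above can be illustrated with a minimal sketch. In the study these values were harvested directly from each profile, so the functions below only show how the indicators are defined; the function names and the sample citation counts are hypothetical.

```python
def citations_per_paper(citations):
    """Cit./Pap.: total citations divided by the number of papers."""
    if not citations:
        return 0.0
    return sum(citations) / len(citations)


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical profile: citation counts of seven papers.
profile = [25, 12, 8, 6, 5, 3, 0]
print(citations_per_paper(profile))
print(h_index(profile))
```

In this invented profile, five papers have at least five citations each but there are not six papers with six or more, so the h-index is 5, whereas Cit./Pap. is pulled upwards by the single highly cited paper.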
A growth measure (G_q) that quantifies the quarterly increase of these indicators from December 2011 to December 2013 was calculated. The compound interest formula was used to express the average growth of the bibliometric indicators as a percentage.
In this formula, V_1 is the initial value, V_n the final value and n the number of moments from the initial observation to the last one.
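The formula itself is not reproduced in this copy of the text. A minimal sketch of the standard compound growth rate, G = ((V_n / V_1)^(1/n) - 1) × 100, is given below. It assumes that n counts the quarterly intervals between the first and last observation; this reading of n, the function name and the sample values are assumptions and may differ from the authors' exact expression.

```python
def compound_growth(v_first, v_last, periods):
    """Average per-period growth rate, in percent, under compound growth.

    Assumes `periods` counts the intervals between the first and the last
    observation; this is an assumed reading of n, not the paper's formula.
    """
    if v_first <= 0 or periods <= 0:
        raise ValueError("v_first and periods must be positive")
    return ((v_last / v_first) ** (1.0 / periods) - 1.0) * 100.0


# Hypothetical author whose h-index rises from 10 to 14 over 8 quarters
# (December 2011 to December 2013).
g_q = compound_growth(10, 14, 8)
print(round(g_q, 2))
```

Under this assumption, a value that doubles in a single period yields a growth of 100%, while the same doubling spread over eight quarters yields roughly 9% per quarter.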

Decision trees
Decision trees are a statistical technique widely used in data mining that groups elements described by a dependent variable according to the values of other, independent variables (predictors). The objective is to trace significant variations in the distribution of the dependent variable with regard to the independent ones, identifying which factors have most influence on the detection of homogeneous groups. An algorithm (CHAID, CRT, QUEST, etc.) detects the variable with most influence on the dependent or target variable, splits the original node and builds a tree of new nodes that again classify the observations with regard to the target variable. This process continues until the groups reach the highest purity, that is, until each group contains the highest possible proportion of a single value of the target variable. In this way, it is possible to know which values of a variable significantly affect the distribution of the dependent variable, building profiles of objects or persons. The technique is well suited to nominal or ordinal variables because it is easier to observe how the presence or absence of a variable value affects the distribution of the sample. The exhaustive CHAID (Chi-square Automatic Interaction Detector) algorithm was used because it is the most widespread and the most restrictive in its results. This algorithm generates new nodes by detecting significant differences in the distributions according to the chi-square test.
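The splitting criterion can be sketched as follows. This is not the exhaustive CHAID algorithm, which also merges predictor categories and compares adjusted p-values; it only computes Pearson's chi-square statistic for each predictor against the target and picks the strongest one. The data and variable names are invented for illustration.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic and degrees of freedom for two
    equally long sequences of categorical values."""
    n = len(predictor)
    observed = Counter(zip(predictor, target))
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical sample of 60 profiles with two binary predictors.
quartile = ["Q1"] * 20 + ["Q4"] * 10 + ["Q1"] * 10 + ["Q4"] * 20
predictors = {
    "position": ["Professor"] * 30 + ["Doctoral Student"] * 30,
    "gender": ["M", "F"] * 30,
}

# Both predictors have one degree of freedom here, so the raw statistics
# are directly comparable; in general p-values should be compared instead.
scores = {name: chi_square(values, quartile) for name, values in predictors.items()}
best = max(scores, key=lambda name: scores[name][0])
print(best, scores[best])
```

In this invented sample, position separates the quartiles while gender does not, so position would be chosen for the first split. Repeating the test within each resulting node yields the next level of the tree.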
The continuous bibliometric variables (Cit./Pap. and h-index) were transformed into ordinal ones to apply this technique and make the results easier to interpret. These variables were thus ranked and grouped into quartiles. In this way, quartile 1 corresponds to the profiles with the 25% highest values of the bibliometric indicators, while quartile 4 groups the 25% lowest ones.
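The transformation into quartiles can be sketched as a simple rank-based binning, with Q1 holding the highest values as described above. This is an illustrative sketch: the function name and values are hypothetical, and tied values are split by their original order here, whereas the study reports fixed value intervals per quartile.

```python
def assign_quartiles(values):
    """Map each value to 'Q1'..'Q4', where Q1 holds the 25% highest values."""
    n = len(values)
    # Indices of the values sorted from highest to lowest.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = "Q" + str(rank * 4 // n + 1)
    return labels


# Hypothetical h-index values for eight profiles.
h_values = [3, 40, 12, 7, 25, 1, 18, 9]
print(assign_quartiles(h_values))
```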

Results
Decision trees are used to find out which research profiles achieve better performance in bibliometric terms. However, the resulting groups do not show high purity because the target variables (citations and h-index quartiles) are not strictly categorical, so their values are not mutually exclusive. For example, the Doctoral Student cluster may include 7% of researchers in Q1 because this academic category does not exclude the presence of outstanding scientists. The objective of this technique in this study is only to observe visually how research impact is distributed according to position, gender and subject area, not to build a classification model with high purity and low risk. For this reason, risk values are generally high (risk > .5) and the groups are usually balanced. In spite of this, a p-value > .005 was considered to determine each leaf with an acceptable statistical significance.

Current performance
In this section, present research activity is examined so that the cumulative performance of a researcher can later be compared with how it evolves over three years. The reference point used is the most recent one, December 2013. According to the h-index, the decision tree gives a noticeably different picture, with more marked differences (Figure 2). Position is still the most important variable: 44.8% of Professors-Emeritus Professors are located in the first quartile, and 84.3% of Doctoral Students are in Q4, with an h-index below 7. The second branch is, in some cases, formed by gender or subject area criteria. This could be due to the low presence of women in the sample, which is sometimes insufficient to reveal significant differences. In the Professor-Emeritus Professor branch, it is notable that men have almost twice the proportion of authors in the first quartile (47.1%) as women (26.3%).
According to subject areas, the most successful researchers are male Professors-Emeritus Professors from Life Sciences (64.5%), Health Sciences (50%) and Physical Sciences-Social Sciences (46%). On the other hand, the researchers with the lowest success rates (Q4) are female Research Fellows (54.4%) and Assistant Professors from Social Sciences-Arts and Humanities (56%).

Evolving performance
After describing the current performance of GSC profiles, this section presents how these profiles evolve according to gender, position and research area. Increases are measured as the average quarterly growth over two years. Figure 3 shows the decision tree for the increase in the ratio of citations per paper. As in the previous trees, academic position is the principal criterion for splitting the tree. In contrast to current performance, Doctoral Students (59.7%) and Research Fellows (49.5%) are the researchers who most increase their citations per paper ratio, while Professor-Emeritus Professor (30.8%) is the group with the highest share of Q4 values, that is, with an increase below 1.02%. The second variable in order of importance is subject area. This makes it possible to specify that the researchers with the lowest growth rates are Professors-Emeritus Professors from Life Sciences-Arts and Humanities (38.3%) and Social Sciences-Health Sciences-Multidisciplinary (34.6%). On the other hand, Assistant Professors from Social Sciences-Arts and Humanities (44.7%) are the scientists with the strongest improvement in their curricula.
Finally, Figure 4 shows the decision tree for the growth of h-index values. As in Figure 3, Research Fellows (80.8%) and Doctoral Students (68.6%) are the authors with the highest proportion of cases in Q1 and Q2, which means that these are the positions that most increase their h-indexes. In contrast, Professors-Emeritus Professors (67.6%) and Associate Professors (45.1%) are the segments with the highest proportion of cases in Q3 and Q4, being the academic ranks that increase their h-indexes the least. Thematic distinctions were found only in the case of Assistant Professors, although these differences are not substantial. Thus, Social Sciences-Arts and Humanities-Multidisciplinary contains the highest proportion of authors in Q1 (39.8%), while Assistant Professors from Life Sciences-Physical Sciences-Health Sciences concentrate their cases in Q2 and Q3 (58%). Professor-Emeritus Professor and Associate Professor were the only academic positions where gender differences were found. Women Professors-Emeritus Professors show a slightly higher increase in h-index than men, with 38.5% in Q1 and Q2 compared with 31.5% for men. These differences are more pronounced among Associate Professors, where women reach 67.4% in Q1 and Q2, while men reach only 52.3%.

Discussion
The use of Google Scholar Citations profiles makes it possible to analyse author performance directly, without having to group papers by author, which raises well-known problems of disambiguation and assignment of publications (Wooding et al., 2006; D'Angelo et al., 2011). In this service, it is the authors themselves who create the profile, adding, removing and merging their publications. This ensures high reliability of the profiles because the publications actually correspond to these authors and not to others with similar names. Another advantage is that the publications come from the Google Scholar database, which is considered the most complete scientific search engine (Meho & Yang, 2007; Kousha & Thelwall, 2007); consequently, the results are an exhaustive and broad reflection of the research impact of these researchers. However, this profiling service has some limitations that have to be borne in mind. For example, several studies have reported a significant proportion of citations assigned to the wrong papers (Bar-Ilan, 2008; García-Pérez, 2010), which could alter the bibliometric indicators of a profile. In this study, these mistakes are considered horizontal problems that therefore affect every group (by subject area, position and gender) equally. Another problem, which could indeed introduce a bias in the samples, is that Google Scholar has poor coverage of materials published before 1980 (Pauly & Stergiou, 2005). This could affect older authors, mainly Professors and Emeritus Professors with a pre-1980 trajectory, who could find themselves undervalued. Perhaps the most serious problem, however, is how easily a profile can be manipulated by uploading fictitious papers full of self-citations to an unsupervised repository (Delgado López-Cózar et al., 2014). Nevertheless, it can be assumed that the number of altered profiles is insignificant because such behaviour borders on a breach of scientific ethics. For example, in this sample, 51.6% is the highest h-index increase for an author, and only nine scholars increased their Cit./Pap. rate by more than 50% in two years. These figures do not suggest the presence of unethical behaviour in the sample. In general, Google Scholar Citations profiles can be recommended as a tool for bibliometric analyses at author level because they make it easy to track the scientific productivity and impact of a wide range of researchers.
Before turning to the results themselves, it is worth noting the low presence of women in the sample. Only 14.8% of the authors are female, a lower percentage than in other statistics (20-25%) (NSF, 2013; Landivar, 2013). This difference could be due to a high presence of profiles from emerging countries (Brazil, India) with lower involvement of women in research activities (Larivière et al., 2013). This reduced presence of female scientists could undermine the importance of gender in differentiating research performance, but previous results confirm that there are no significant differences between women and men with regard to citation impact (Ding et al., 2006; Penas & Willett, 2006).
With regard to the differences between positions and thematic groups, results show that the first element that distinguishes an author's research performance is academic position. In this sense, it could be claimed that career stage is the factor that most influences the bibliometric success of an author (Penner et al., 2013). Thus, young researchers at the start of their activity show lower performance than established scholars with a long career, such as Professors and Emeritus Professors. These results are in line with previous analyses (Ventura and Mombrú, 2006; Abramo et al., 2009; Pagel and Hudetz, 2011) and are usually explained as a cumulative advantage phenomenon (Cole and Cole, 1972; Long, 1978). However, this situation changes when growth is considered. In this case, Doctoral Students and Research Fellows are those who most increase their curricula, while Professors maintain stable profiles with slight growth. This could reflect an evolutionary phenomenon in which small entities grow faster than big ones (Gibrat, 1930), so that many novice researchers develop their curricula in their early stages and experience significant initial increases that will mark their future prestige (Maranto & Streuly, 1994).
Results indicate that the second factor in order of importance in differentiating the research impact of a scientist is the research discipline. In general, Arts & Humanities and Social Sciences researchers have a lower research impact than those from Life Sciences, the discipline that places most authors in Q1. This result fits with previous analyses in which the biosciences achieve more citations per article than other disciplines (Radicchi et al., 2008) and the humanities are scarcely cited (Althouse et al., 2009). However, the evolution of the indicators by thematic area shows an interesting pattern. When the model detects thematic groups for the Professor-Emeritus Professor category, these groups show similar distributions, which makes it hard to observe a clear growth pattern and leads to the conclusion that there are no significant differences between disciplines. This suggests that senior researchers slow down their careers regardless of their research field. Young researchers, in contrast, do show different growth patterns according to subject area. Novice researchers, specifically Assistant Professors, from Arts & Humanities and Social Sciences experience the greatest increases, unlike their Life Sciences and Health Sciences colleagues. As with academic positions, it is possible that these thematic differences are due to a growth phenomenon in which researchers with low performance increase their citation impact faster than authors with higher productivity. In some way, it could be claimed that Life Sciences and Health Sciences Assistant Professors reach research maturity earlier than their Arts & Humanities and Social Sciences colleagues, and that the rapid slowdown in their research performance could be understood as a sign of stability, while Arts & Humanities and Social Sciences Assistant Professors are still developing their careers (Smeby, 1998).

Conclusions
Decision trees have made it possible to conclude that the first qualitative aspect that differentiates the research performance of an author is academic position. Authors with an established career thus obtain better citation impact than early-career researchers, as a consequence of cumulative advantage. This influence is also observed across disciplines: authors from the life sciences gain more research impact than arts and humanities researchers. However, results do not show significant differences between men and women with regard to research impact.
From an evolutionary point of view, decision trees show that young researchers, mainly from the humanities and social sciences, increase their curricula faster than senior professors, who show small increments in all research areas. This could be interpreted as a growth phenomenon in which beginning researchers tend to increase their curricula in the initial stage of their career and then remain stable in their mature stage. Decision trees have also made it possible to group and categorize which types of authors, by gender, position and thematic affiliation, show a better research impact according to their citations and h-index, from both a static and a longitudinal view. It is therefore concluded that this data mining tool is recommended for studying the influence of various qualitative elements involved in research activity in relation to impact and productivity, showing which aspects of a profile bring together more members with a promising research career. Finally, Google Scholar Citations can be regarded as an appropriate bibliometric analysis tool because it makes it easy to build exhaustive and up-to-date author profiles with mutually comparable bibliometric indicators. However, prior cleansing of these data is recommended to avoid duplicated and manipulated profiles, as well as to normalize affiliations and names.

Figure 1
Figure 1 displays the decision tree for quartiles of the number of citations per paper (Cit./Pap.). The variable that most influences the distribution of quartiles is the research position. Thus, 35% of Professors-Emeritus Professors are ranked in the first quartile; in contrast, 54% of Doctoral Students are located in the fourth quartile, having the lowest research performance. Descending to the next branch, the second variable in order of importance is subject area. Accordingly, the most successful authors are Professors-Emeritus Professors from Life Sciences, with 47.7% of cases in Q1, followed by Social Sciences-Health Sciences with 41.6%. On the other hand, the authors with the lowest scientific performance are Doctoral Students from Physical Sciences-Arts and Humanities-Social Sciences-Multidisciplinary, with 60.5% of cases in Q4, along with Research Fellows-Assistant Professors from Arts and Humanities-Social Sciences, with 44.3% also in Q4. The third variable, gender, has little relevance and only detects differences among Professors-Emeritus Professors from Physical Sciences. In this case, 31.5% of the men are in Q1 and 13.6% in Q4, while 26% of the women are in Q1 and in Q4.

Figure 2. Decision tree according to quartiles of h-index.

Figure 3. Decision tree according to Cit./Pap. percentage increases classified in quartiles.

Figure 4. Decision tree according to h-index percentage increases classified in quartiles.

Table 1 shows the intervals and number of cases per quartile, which makes the results easier to interpret.
Table 1. Distribution of Cit./Pap. and h-index by quartiles in the current performance
Table 2 describes the intervals and number of cases per quartile. The growth values are measured in percentages.

Table 2. Distribution of Cit./Pap. and h-index by quartiles in the evolving performance