In recent years, the data sharing movement has become very popular. In this context, Thomson Reuters launched at the end of 2012 a new product within the Web of Knowledge family: the Data Citation Index. The aim of this new database is to enable discovery of, and access to, data from a single place, drawing on a variety of data repositories that cover different subject areas and are located around the world. In this short note we present some results from an analysis of the Data Citation Index. Specifically, we address the following issues: discipline coverage, the data types present in the database, and the repositories that were included at the time of the study.
In recent years, the movement known as “data sharing”, that is, the sharing of research data, has been gaining great popularity. In this context, Thomson Reuters launched at the end of 2012 a new product within its Web of Knowledge platform: the Data Citation Index. The aim of this new database is to provide access, from a single point, to the data indexed in different data repositories around the world. This note presents the results of an analysis of the Data Citation Index; more specifically, it analyzes the coverage of this product in terms of disciplines, the document types indexed, and the repositories available at the time the study was carried out.
During the last decade, there has been a heated debate in the scientific community about the need to release research data, a movement commonly referred to as data sharing. Although the practice of sharing data has been present among researchers for a long time (Hrynaszkiewicz, Altman,
The benefits of data sharing have already been studied and identified (Arzberger et al.,
Currently there are a large number of initiatives, commonly called data banks or data repositories, dedicated to storing, describing and disseminating scientific data. Unlike pre-print or post-print repositories, which deal with only one bibliographic format for the items they contain, data repositories vary greatly and each adopts its own solutions, which often makes them difficult to use for people without knowledge of the data bank’s subject area (Torres-Salinas et al.,
Within the context described above, Thomson Reuters has added a new member to the Web of Knowledge family of databases: the Data Citation Index (henceforth DCI). The DCI, released in November
For this reason, in this note we present an analysis of this new database; more specifically, we address the following questions:
Question 1. What is the discipline and subject area coverage in the DCI?
Question 2. What kinds of data types are present in the DCI, and what is their statistical distribution?
Question 3. Which repositories contribute the largest share of records to the DCI, and what are their basic characteristics (data type, country, etc.)?
These results are of interest because they are the first empirical results obtained from an analysis of the DCI as a scientific information and evaluation tool. We should also mention that this note is based on a previous working paper deposited in arXiv in June 2013 (Torres-Salinas et al.,
For the purpose of this analysis, all records from the Data Citation Index were downloaded in April-May 2013, using the DCI web interface. The resulting text files were processed and added to a relational database, using the Accession Number Field (UT) as the primary key for the data records. The rest of the fields analyzed were: Document Type (DT), Publication Year (PY), and Web of Science Category (WC). Regarding the issue of discipline coverage, two classification systems have been used in order to assign categories to the records: one of them comprises four major subject areas (Science, Social Sciences, Humanities & Arts, and Engineering & Technology), and the other is the one proposed by Moed (
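The loading step described above can be sketched as follows. This is only an illustration, not the pipeline actually used in the study: it assumes the downloaded text files use a tagged layout (a two-letter field tag at the start of each line, indented continuation lines, and `ER` closing each record), and it uses SQLite as a stand-in relational database. The sample export, its accession numbers, the table name `records`, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical export excerpt; accession numbers and values are invented.
SAMPLE = """FN Hypothetical export
VR 1.0
UT HYPOTHETICAL:0001
DT Data set
PY 2010
WC Genetics & Heredity;
   Biochemistry & Molecular Biology
ER

UT HYPOTHETICAL:0002
DT Data study
PY 2008
WC Sociology
ER

EF
"""

HEADER_TAGS = {"FN", "VR", "EF"}  # assumed file-level markers, not record fields


def parse_records(text):
    """Split a tagged export into a list of {tag: [values]} dictionaries."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head = line[:2]
        if head == "ER":  # end of the current record
            if current:
                records.append(current)
            current, tag = {}, None
        elif head in HEADER_TAGS:
            tag = None
        elif head.strip() == "":  # indented continuation of the previous field
            if tag is not None:
                current[tag].append(line.strip())
        else:
            tag = head
            current.setdefault(tag, []).append(line[3:].strip())
    return records


def field(record, tag):
    """Join all values of a field into one string, or None if absent."""
    return " ".join(record[tag]) if tag in record else None


def load_records(records, conn):
    """Store UT, DT, PY and WC, with the accession number as primary key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "ut TEXT PRIMARY KEY, dt TEXT, py INTEGER, wc TEXT)"
    )
    rows = []
    for rec in records:
        ut = field(rec, "UT")
        if ut is None:
            continue  # without an accession number there is no key
        py = field(rec, "PY")
        year = int(py) if py and py.isdigit() else None
        rows.append((ut, field(rec, "DT"), year, field(rec, "WC")))
    # INSERT OR IGNORE drops records downloaded more than once.
    conn.executemany("INSERT OR IGNORE INTO records VALUES (?, ?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_records(parse_records(SAMPLE), conn)
    query = "SELECT dt, COUNT(*) FROM records GROUP BY dt ORDER BY dt"
    for doc_type, n in conn.execute(query):
        print(doc_type, n)
```

Keying the table on the accession number means that overlapping downloads cannot inflate the record counts, and the distribution by document type or publication year then reduces to a simple `GROUP BY` query.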
At the time of the download, the Data Citation Index held a total of 2,623,528 records. The oldest of them can be traced back to the year 1800 (
If we consider the classification system proposed by Moed (
The Data Citation Index contains at the moment three different document types: data repositories, data studies, and data sets (Thomson Reuters,
Lastly, in
In this note we have presented some preliminary results based on the analysis of the Data Citation Index. We have described its discipline coverage and the data repositories and document types that can be found in this new database. The main conclusions and findings about the DCI can be summarized as follows:
1) It is heavily oriented towards the hard sciences; Science accounts for 80% of the records in the database. Within this area, the best represented disciplines are Clinical Medicine, Genetics & Heredity, and Biochemistry & Molecular Biology.
2) The DCI uses three document types (data set, data study and repository). There are 96 data repositories, and the predominant document type is the data set, with 2,475,534 records, which is 94% of the entire database.
3) Although 29 repositories contain at least 4,000 records and 64 repositories contain at least 100 records, just four repositories account for 75% of all the records in the database: Gene Expression Omnibus, UniProt Knowledgebase, PANGAEA, and U.S. Census Bureau TIGER/Line Shapefiles.
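The percentages quoted in these conclusions are simple shares of the total, and the concentration finding amounts to asking how many of the largest repositories are needed to reach a given cumulative share. The sketch below illustrates both calculations. The two record totals are taken from the text; the per-repository counts and the function names are hypothetical and serve only to show the calculation, since the actual counts per repository are not listed here.

```python
TOTAL_RECORDS = 2_623_528     # total DCI records reported in the text
DATA_SET_RECORDS = 2_475_534  # data set records reported in the text


def share(part, total):
    """Return part as a percentage of total."""
    return 100 * part / total


def repositories_needed(counts, threshold):
    """Smallest number of largest repositories whose cumulative share
    of all records reaches `threshold` percent."""
    total = sum(counts.values())
    cumulative = 0
    for n, size in enumerate(sorted(counts.values(), reverse=True), start=1):
        cumulative += size
        if share(cumulative, total) >= threshold:
            return n
    return len(counts)


# Hypothetical repository sizes, invented for illustration only.
EXAMPLE_COUNTS = {"Repo A": 500, "Repo B": 300, "Repo C": 150, "Repo D": 50}

if __name__ == "__main__":
    print(f"Data sets: {share(DATA_SET_RECORDS, TOTAL_RECORDS):.1f}% of records")
    print("Repositories needed for 75%:", repositories_needed(EXAMPLE_COUNTS, 75))
```

Sorting the repositories by size before accumulating is what makes the result the minimum number of repositories for the chosen threshold.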
This article is based on a previous working paper deposited in arXiv in June 2013: Torres-Salinas, D.; Martín-Martín, A.; Fuente-Gutiérrez, E. (
This article was written as part of the University of Granada's “Introduction to Scientific Research” Grant Program.
This article has been translated by Alberto Martín-Martín and Nicolás Robinson-García.