Request for Information (RFI): Input on Development of a NIH Data Catalog

June 6, 2013

Biomedical research is becoming more data-intensive as researchers generate and use increasingly large, complex, and diverse datasets. This era of “Big Data” in biomedical research taxes the ability of many researchers to release, locate, analyze, and interact with these data and associated software due to the lack of tools, accessibility, and training. In response to these new challenges in biomedical research, and in response to the recommendations of the Data and Informatics Working Group (DIWG) of the Advisory Committee to the NIH Director, NIH has launched the trans-NIH Big Data to Knowledge (BD2K) Initiative.

The long-term goal of the BD2K Initiative is to support advances in data science, other quantitative sciences, policy, and training that are needed for the effective use of Big Data in biomedical research. (The term “biomedical” is used here in the broadest sense to include biological, biomedical, behavioral, social, environmental, and clinical studies that relate to understanding health and disease.) The term “Big Data” refers to datasets that are increasingly large and complex and that exceed the abilities of currently used approaches to manage and analyze them. “Big Data” is also meant to capture the opportunities and address the challenges facing all biomedical researchers in accessing, managing, analyzing, and integrating large datasets of diverse data types. Such data types may include imaging, phenotypic, molecular (including –omics), clinical, environmental, behavioral, and many other types of biological and biomedical data. “Big Data” also includes data generated for other purposes (e.g., social media, search histories, cell phone data) when they are repurposed and applied to address health research questions. Biomedical Big Data primarily emanate from three sources: (1) a small number of groups that produce very large amounts of data, usually as part of projects specifically funded to produce important resources for use by the research community at large, or large collections of electronic health records; (2) individual investigators who produce large datasets for their own projects that might also be broadly useful to the research community at large; and (3) an even greater number of investigators who each produce small datasets whose value can be amplified by aggregating or integrating them with other data.

One of the DIWG recommendations was to promote data sharing through the establishment of central and federated Data Catalogs. Among the issues raised were how to establish minimal and relevant metadata to facilitate data sharing, how to achieve broad adoption of standards to enhance data retrieval, and how to address data citation and adoption of the catalog by the broader biomedical community.

BD2K is now considering the development of a biomedical Data Catalog to make biomedical research data findable and citable, as PubMed does for scientific publications. Such a Data Catalog would make it easier for researchers to find, share, and cite data, as well as the publications and grants with which they are associated. A Data Catalog is distinct from a data repository, but would help make data in such repositories more easily findable and citable in a consistent manner. In addition to supplying core, minimal metadata to ensure a valid data reference, it is envisioned that a Data Catalog would include links out to the location of the data, to the NIH RePORTER record of the grant that supported the research, to relevant publications within PubMed or journals, and possibly to associated software or algorithms.
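
For illustration only, the kind of entry described above might be sketched as a simple record that describes a dataset and links out to it rather than storing it. This notice does not define a schema; the class name, field names, citation format, and example values below are all hypothetical assumptions, not a proposed NIH design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogRecord:
    """Hypothetical catalog entry: metadata and links only, no data."""

    # Assumed "core, minimal metadata" for a valid data reference.
    title: str
    creators: List[str]
    year: int
    # Link out to the repository where the data actually reside.
    data_url: str
    # Optional links to related records (all field names are assumptions).
    grant_ids: List[str] = field(default_factory=list)        # e.g. grant records
    publication_ids: List[str] = field(default_factory=list)  # e.g. PubMed entries
    software_urls: List[str] = field(default_factory=list)    # associated software

    def citation(self) -> str:
        """Build a citation string in one assumed, illustrative format."""
        authors = ", ".join(self.creators)
        return f"{authors} ({self.year}). {self.title}. {self.data_url}"


# Placeholder values, not a real dataset or repository.
record = CatalogRecord(
    title="Example imaging dataset",
    creators=["Doe J", "Roe R"],
    year=2013,
    data_url="https://repository.example.org/datasets/123",
)
print(record.citation())
```

Keeping the data location as a link rather than as content reflects the distinction drawn above between a catalog and a repository.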