Data

Digital Data Infrastructure

The Center is building a digital data infrastructure that meets the requirements of long-term preservation, reproducibility, searchability, and sharing of data.

We have created Qresp ("Curation and Exploration of Reproducible Scientific Papers"), a simple, general-purpose, open source software tool that makes data available on a per-paper basis.

Qresp is a web-based application that may be used either to curate and explore the data presented in scientific papers, or simply to explore papers that have already been curated. The curation and exploration strategies are implemented in four steps:

1) Paper Organization
The author organizes the data presented in a scientific paper. We suggest the following organization, although it is by no means mandatory for using Qresp. The data are organized as a collection of datasets (raw data acquired or generated for the paper, either as a result of a computation or collected by an instrument), charts (images of figures and tables, the notebooks used to create them, and the data displayed in each figure or table), scripts (codes that are not publicly available, used to manipulate datasets and generate the data files of charts), tools (publicly available software, facilities, or instruments used to generate the data), and notebooks.
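The suggested organization can be pictured as a small data model. The following is a minimal sketch only: the class names, field names, and file paths are hypothetical and are not taken from Qresp itself, which does not require any particular layout.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Chart:
    """A figure or table, with the files needed to reproduce it (hypothetical fields)."""
    caption: str
    image_file: str
    data_files: List[str] = field(default_factory=list)
    notebook_file: str = ""


@dataclass
class PaperCollection:
    """The five suggested groups of material for one paper (hypothetical names)."""
    title: str
    datasets: List[str] = field(default_factory=list)   # raw computed or measured data
    charts: List[Chart] = field(default_factory=list)   # figures and tables
    scripts: List[str] = field(default_factory=list)    # non-public processing codes
    tools: List[str] = field(default_factory=list)      # public software or instruments
    notebooks: List[str] = field(default_factory=list)


def build_example_paper() -> PaperCollection:
    """Return an illustrative paper with made-up paths."""
    return PaperCollection(
        title="Example paper",
        datasets=["datasets/run_001"],
        charts=[
            Chart(
                caption="Figure 1",
                image_file="charts/fig1/fig1.png",
                data_files=["charts/fig1/data.csv"],
                notebook_file="charts/fig1/plot.ipynb",
            )
        ],
        scripts=["scripts/extract_energies.py"],
        tools=["Example simulation code"],
        notebooks=["notebooks/overview.ipynb"],
    )


paper = build_example_paper()
# Count the entries in each group; the title is a string and is skipped.
summary = {
    name: len(items)
    for name, items in asdict(paper).items()
    if isinstance(items, list)
}
```

Keeping each chart together with its data files and notebook mirrors the idea that a reader should be able to go from a figure back to the data behind it.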

2) Paper Curation
Once the data have been organized, either as suggested above or in a manner chosen by the investigator, the GUI of Qresp guides the user in creating metadata from the data associated with a scientific paper. The metadata gathered during this curation step include the data location, publication details, and user-defined attributes. Qresp also offers the option to generate a data workflow that describes the procedure(s) followed to obtain the data. The metadata are generated using JSON (JavaScript Object Notation) syntax, and the metadata file is sent to a document-oriented database (our current implementation uses MongoDB, which is open source software). In addition, the curator feature of Qresp allows authors to version their data. We emphasize the importance of generating workflows, not only for tracing the provenance of data and making data transparent to the community, but also for explaining, in a detailed and compact way, the scientific strategies used in the paper. These workflows may then be used for training students or investigators interested in joining a specific project related to the paper.
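To make the curation step concrete, here is a minimal sketch of what a JSON metadata record with a simple workflow might look like. The field names, the `make_metadata` helper, and the example values are assumptions made for illustration; they are not the schema that Qresp actually produces.

```python
import json


def make_metadata(title, authors, data_location, charts, workflow_edges, attributes=None):
    """Assemble a metadata record as a plain dictionary (hypothetical layout)."""
    return {
        "publication": {"title": title, "authors": list(authors)},
        "data_location": data_location,
        "charts": list(charts),
        # The workflow is stored as directed edges: [source, target].
        "workflow": {"edges": [list(edge) for edge in workflow_edges]},
        "attributes": dict(attributes or {}),
        "version": 1,
    }


record = make_metadata(
    title="Example paper",
    authors=["A. Author", "B. Author"],
    data_location="https://example.org/data/example-paper",  # placeholder URL
    charts=["Figure 1"],
    workflow_edges=[
        ("datasets/run_001", "scripts/extract_energies.py"),
        ("scripts/extract_energies.py", "Figure 1"),
    ],
    attributes={"field": "materials science"},
)

# Serialize to JSON text, the form in which a record could be sent to a
# document-oriented database.
text = json.dumps(record, indent=2, sort_keys=True)
```

Storing the workflow as a list of edges keeps the record flat and easy to serialize, while still recording which dataset and script led to which chart.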

3) Metadata Collection
The document-oriented database collects the generated metadata (the database is maintained by the user or the user's institution). By generating and collecting metadata in a database, the authors of a publication store their paper's data only in the location of their choice, organized in the way that is most appropriate for their research, and make the data searchable through open source databases. The overall strategy of Qresp is to provide a fully flexible tool that different researchers can use in different ways. Eventually, the databases where metadata are collected may be linked to each other and to other databases of interest, for example the Materials Project, OQMD, AFLOW, and others in the case of papers focused on materials science.
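The idea of collecting metadata records and searching them can be sketched with an in-memory stand-in for a document-oriented collection. This is an illustration under stated assumptions, not Qresp's implementation: the `MetadataCollection` class, its methods, and the record fields are hypothetical, and a real deployment would use a database such as MongoDB instead.

```python
from typing import Any, Dict, List


def _matches(field_value: Any, wanted: Any) -> bool:
    """A list-valued field matches if it contains the wanted value."""
    if isinstance(field_value, list):
        return wanted in field_value
    return field_value == wanted


class MetadataCollection:
    """In-memory stand-in for a document collection (illustration only)."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def insert(self, doc: Dict[str, Any]) -> None:
        # Store a shallow copy so later changes by the caller do not leak in.
        self._docs.append(dict(doc))

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        """Return documents whose top-level fields match every criterion."""
        return [
            doc
            for doc in self._docs
            if all(_matches(doc.get(key), value) for key, value in criteria.items())
        ]


collection = MetadataCollection()
collection.insert({
    "title": "Paper A",
    "tags": ["materials science", "water"],
    "data_location": "https://example.org/data/paper-a",  # placeholder URL
})
collection.insert({
    "title": "Paper B",
    "tags": ["quantum information"],
    "data_location": "https://example.org/data/paper-b",  # placeholder URL
})

hits = collection.find(tags="materials science")
titles = [doc["title"] for doc in hits]
```

Only the metadata live in the collection; each record points to the data through its `data_location` field, so the data themselves stay wherever the authors chose to keep them.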

4) Paper Exploration
The GUI of Qresp allows users to explore scientific papers. Users may search curated papers; view charts, notebooks, and workflows on a per-publication basis; and download the data, organized, for example, as suggested above.

Collections

MICCoM focuses its data activities on validation, data production and collection, the use of public databases, and data analysis tools (scripts and codes to analyze data will be provided online).

At present the Center focuses on: