Facilitating Discovery of Scientific Notebooks: Applying Schema.org and CodeMeta Metadata to Jupyter Notebooks

CG Auditorium
Keith Maull

Jupyter Notebooks and related computational tools are becoming commonplace
in scientific analyses, research explorations, publications, educational
tutorials and nearly anywhere computational narratives are used.
The rapid explosion of these notebooks has resulted in the availability of
millions of open notebooks on platforms like GitHub, but a corresponding
lack of both metadata and metadata standards has significantly hindered the
widespread discovery of all but the most popular of this important class
of computational and expository artifacts. In this talk we present
metadata mechanisms to understand and improve the discovery of
Jupyter Notebooks on GitHub. Building upon the existing Schema.org and
CodeMeta metadata schemas for software, we apply these schemas
to Jupyter Notebooks. Drawing from nearly half a million notebooks on
GitHub, we develop semi-automated methods for applying the metadata schemas
to notebooks in an effort to improve both short- and long-term discovery
of notebooks on GitHub and elsewhere. Our results indicate that the
technique is promising and that it can serve as an initial step toward
fully automated metadata generation, improved notebook discovery, and
standardized metadata for these and other artifacts developed as
computational narratives.
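
To make the idea concrete, the following is a minimal sketch of how a draft
CodeMeta record might be derived from a notebook file. It is not the method
presented in the talk. It assumes the standard nbformat 4 JSON layout, uses
the CodeMeta 2.0 context URL and the terms name, programmingLanguage and
codeRepository, and the function names and the "codemeta" metadata key are
hypothetical choices made only for this illustration.

```python
import json
import re

# CodeMeta 2.0 JSON-LD context; CodeMeta builds on Schema.org's SoftwareSourceCode.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def first_heading(notebook):
    """Return the text of the first Markdown heading in the notebook, or None."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # nbformat stores cell source as either one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
            if match:
                return match.group(1)
    return None


def draft_codemeta(notebook, repository_url=None):
    """Build a draft CodeMeta record (hypothetical helper) from notebook contents.

    Only fields that can be read directly from the notebook are filled in;
    everything else is left for a human curator to complete.
    """
    meta = notebook.get("metadata", {})
    language = (
        meta.get("language_info", {}).get("name")
        or meta.get("kernelspec", {}).get("language")
    )
    record = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}
    title = first_heading(notebook)
    if title:
        record["name"] = title
    if language:
        record["programmingLanguage"] = language
    if repository_url:
        record["codeRepository"] = repository_url
    return record


def embed_codemeta(notebook, record, key="codemeta"):
    """Store the record in the notebook's top-level metadata.

    The "codemeta" key is an assumption of this sketch, not an established
    Jupyter convention.
    """
    notebook.setdefault("metadata", {})[key] = record
    return notebook


if __name__ == "__main__":
    # A tiny in-memory notebook stands in for a file loaded with json.load().
    example = {
        "cells": [
            {"cell_type": "markdown", "source": ["# Sea Ice Trends\n", "Intro text"]},
            {"cell_type": "code", "source": ["import numpy as np\n"]},
        ],
        "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    print(json.dumps(draft_codemeta(example), indent=2))
```

In a semi-automated workflow of this kind, a draft such as this would be
reviewed and completed by a person (authors, license, keywords) before being
embedded in the notebook or published alongside it.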

Speaker Description: 

Keith is a software engineer and data scientist at the NCAR Library specializing in scholarly metrics and digital scholarship, from data to software to computational narratives. Keith joined the NCAR Library in 2013 after completing his PhD in computer science at CU/Boulder, and has been passionately involved in research and development projects at the Library, including bibliometrics and scholarly metrics initiatives. He currently leads efforts focused on the importance of software as a first-class scholarly artifact and explores how libraries might become critical partners in developing strategies and techniques for the future of research traceability and reproducibility through software, data, computational narratives and other scientific digital artifacts.

