Cryohub: JupyterHub Deployments for the Cryosphere

CG Auditorium
Luis Lopez

Workflows for analyzing large scientific datasets have historically followed a search-subset-download pattern. This traditional approach has several disadvantages, from research being compartmentalized and difficult to reproduce to a lack of scalability and portability. JupyterHub is a relatively new platform in the data science world that encourages reproducibility and lowers the barrier to entry for non-expert users by eliminating the individual prerequisites and machine dependencies otherwise needed to share data analysis workflows.

At NSIDC we have taken tools from the Jupyter and Pangeo ecosystems to deploy two JupyterHub instances that serve a multi-tenant environment where users have different needs and requirements. The first deployment runs on our own internal infrastructure with VMware and Docker Swarm; this hub is intended for analyzing and processing data with xarray and Dask in distributed mode. The second deployment runs on Kubernetes in AWS and hosts interactive Jupyter Notebook-based data tutorials that teach scientists and the general public how to access and work with the data we distribute.
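To make the Kubernetes deployment concrete, a hub like the second one is typically configured through a Helm values file such as the Zero to JupyterHub chart uses. The sketch below is illustrative only: the image name and resource figures are hypothetical placeholders, not NSIDC's actual configuration.

```yaml
# Minimal, hypothetical values.yaml for a tutorial-oriented JupyterHub
# on Kubernetes (Zero to JupyterHub Helm chart). Values are examples,
# not the configuration described in the talk.
singleuser:
  image:
    name: pangeo/pangeo-notebook   # hypothetical tutorial image
    tag: latest
  memory:
    limit: 2G        # per-user memory cap
    guarantee: 1G    # memory reserved per user pod
  cpu:
    limit: 1
```

With the Helm chart installed, such a file is applied with a command along the lines of `helm upgrade --install <release> jupyterhub/jupyterhub --values values.yaml`, letting Kubernetes schedule one notebook pod per tutorial user.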

In this talk we present the work done to deploy and operate these instances, describe the challenges encountered in each approach, and offer lessons learned for those interested in implementing similar workflows.

Speaker Description: 

Software Engineer with a master's degree in computer science from the University of Colorado and interests in distributed systems, cloud computing, and machine learning.
