Spark and TensorFlow for scientific data at scale

Date and Time: 
2017 June 22nd - whole day class
2017 June 23rd - whole day hands-on
Mesa Lab Fleischmann Building
Neal McBurnett

Apache Spark is a modern open source cluster computing platform. It is helping data scientists analyze and explore large datasets more effectively than ever before, in terms of both software development productivity and efficient use of hardware, scaling from on-premises clusters to on-demand cloud computing.

Come see examples of Spark at work on scientific datasets, and learn how the largest open source project in data processing can help unify a variety of tasks, including machine learning, streaming data and SQL queries, using Python, Scala, Java or R.

We'll also briefly introduce TensorFlow, the hot new deep learning and numerical computation library from Google.

Bring your laptops, because after setting the stage, we'll have lots of time for you to dig into these projects on your own, or in small groups, and ask questions. You can set up Spark and TensorFlow in your own environment on NCAR's supercomputers, and work through tutorial material. We'll also provide a collection of papers, web sites and Jupyter notebooks with relevant ideas that we can explore and discuss, both in a group setting and one-on-one.

We are especially interested in identifying and exploring problems you think might be suited to these modern tools, ranging from Spark applications like:

  • Temporal and Zonal averaging of data
  • Computation of climatologies and anomalies
  • Pre-processing of CMIP data such as:
    • Regridding
    • Variable clustering (min/max)
    • Calendar harmonizing

and TensorFlow applications like:

  • Image classification
  • Image segmentation
  • Automatic differentiation
  • High-level API for many popular GPU operations
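For readers new to the term, automatic differentiation means computing exact derivatives by applying the chain rule to each operation as a program runs. The sketch below illustrates the idea in plain Python using forward-mode "dual numbers". It is not how TensorFlow implements it (TensorFlow works in reverse mode over a computation graph); the `Dual` class, the `derivative` helper and the `poly` function are hypothetical names made up for this illustration.

```python
class Dual:
    """A value paired with its derivative (illustrative helper, not a TensorFlow API)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants, whose derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


def poly(x):
    # Example function: f(x) = 3x^2 + 2x + 1, so f'(x) = 6x + 2.
    return 3 * x * x + 2 * x + 1


slope = derivative(poly, 2.0)
print(slope)  # 14.0
```

The same function `poly` works on ordinary floats and on `Dual` values, which is the appeal of the approach: the derivative comes from the code that computes the value, with no hand-written gradient.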


So bring ideas for use-cases with you.
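To make one of the Spark-style items concrete, here is a minimal sketch of computing a monthly climatology and the anomalies relative to it. It is written in plain Python with only the standard library so it runs anywhere; the grouping step has the same shape as a reduce-by-key in Spark (for example `reduceByKey` on a PySpark RDD keyed by month). The sample records and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records: (year, month, temperature in degrees C).
records = [
    (2015, 1, 2.0), (2015, 7, 20.0),
    (2016, 1, 4.0), (2016, 7, 22.0),
    (2017, 1, 3.0), (2017, 7, 24.0),
]


def monthly_climatology(rows):
    """Mean value per calendar month, accumulated as (sum, count) per key."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [running sum, count]
    for _year, month, value in rows:
        acc = totals[month]
        acc[0] += value
        acc[1] += 1
    return {month: total / count for month, (total, count) in totals.items()}


def anomalies(rows, climatology):
    """Departure of each record from the mean for its calendar month."""
    return [(year, month, value - climatology[month])
            for year, month, value in rows]


clim = monthly_climatology(records)
anom = anomalies(records, clim)

print(clim)  # {1: 3.0, 7: 22.0}
for row in anom:
    print(row)
```

Keeping a running (sum, count) pair per key, rather than collecting all values first, is what lets this kind of aggregation be split across many workers and merged afterwards.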

The fields of climate science, supercomputing, data formats, data science and machine learning are each highly complex, and we're still in the exploration phase of figuring out how to leverage a variety of modern software tools for the science. We've set aside two days to explore good ways to combine them. So please bring your expertise, your problems and your questions.

Although other languages can be used with these systems, we will use the Python programming language, which is well suited to these and other modern data science tools. The world is transitioning from Python 2 to Python 3, and we'll briefly cover the reasons for doing so and techniques for doing it most effectively.
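As a small taste of what changes, the sketch below shows three Python 3 behaviors that commonly surprise people coming from Python 2: true division, the separation of bytes from text, and lazy `range`. It is meant to run under Python 3; the variable names are illustrative only. In Python 2 code, placing `from __future__ import division, print_function` at the top of a module gives the Python 3 behavior for division and printing, which is one common way to migrate gradually.

```python
# "/" is true division in Python 3; in Python 2, 7 / 2 evaluated to 3.
half = 7 / 2          # 3.5
floor_half = 7 // 2   # 3, floor division in both versions

# Bytes and text are distinct types; decoding is an explicit step.
raw = b"caf\xc3\xa9"        # five bytes of UTF-8
text = raw.decode("utf-8")  # four characters

# range() is lazy in Python 3; wrap it in list() when a real list is needed.
first_squares = [n * n for n in range(4)]

# print is a function, so it takes arguments like any other call.
print(half, floor_half, len(raw), len(text), first_squares)
```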

Schedule for Thursday June 22:

  • Introduction to Spark [about an hour]
  • Break
  • Setting up Spark in your CISL environment
  • Explore tutorials and notebooks
  • Lunch
  • Intro to TensorFlow
  • Setting up TensorFlow, and/or
  • Continued exploration of tutorials and problems with Spark


Schedule for Friday June 23:

  • Efficient modern data formats for parallel processing
  • Python 3: advantages, transitioning from Python 2
  • Exploration / Discussion of good use cases for Spark and TensorFlow
  • Break
  • Continued exploration of both Spark and TensorFlow on your own and in small groups
  • Lunch
  • Unconference: submit ideas for small group discussions, and break up into groups to discuss them


Class materials

Class materials (presentation, notes, etc) are available at:

Speaker Description: 

Neal McBurnett is a consultant in Boulder, Colorado. Since his career as a Distinguished Member of Technical Staff at Bell Labs, working on tools for software development, security and open source web collaboration, he has taught Artificial Intelligence at CU and worked as a technical content developer at Databricks for courses on Apache Spark, including two massive online courses on Spark in 2015.
