Apache Spark is a modern open source cluster computing platform. It is helping data scientists analyze and explore large datasets more effectively than ever before, in terms of both software development productivity and efficient use of hardware, scaling from on-premises clusters to on-demand cloud computing.
Come see examples of Spark at work on scientific datasets, and learn how the largest open source project in data processing can help unify a variety of tasks, including machine learning, streaming data, and SQL queries, using Python, Scala, Java, or R.
We'll also briefly introduce TensorFlow, the hot new deep learning and numerical computation library from Google.
Bring your laptops, because after setting the stage, we'll have lots of time for you to dig into these projects on your own, or in small groups, and ask questions. You can set up Spark and TensorFlow in your own environment on NCAR's supercomputers, and work through tutorial material. We'll also provide a collection of papers, web sites and Jupyter notebooks with relevant ideas, which we can explore and discuss, both in a group setting and one-on-one.
We are especially interested in identifying and exploring problems you think might be suited to these modern tools, ranging from Spark applications like:
and TensorFlow applications like:
So bring ideas for use-cases with you.
The fields of climate science, supercomputing, data formats, data science and machine learning are each highly complex, and we're still in the exploration phase of figuring out how to leverage a variety of modern software tools for the science. We've set aside two days to explore good ways to combine them. So please bring your expertise, your problems and your questions.
Although other languages can be used with these systems, we will use the Python programming language, which is well suited to these and other modern data science tools. The world is transitioning from Python 2 to Python 3, and we'll briefly cover the reasons for doing so and techniques for doing it most effectively.
Class materials (presentation, notes, etc) are available at: http://bcn.boulder.co.us/~neal/talks/spark-science-scale-2017/
Neal McBurnett is a consultant in Boulder, Colorado. Since his career as a Distinguished Member of Technical Staff at Bell Labs, working on tools for software development, security and open source web collaboration, he has taught Artificial Intelligence at CU and worked as a technical content developer at Databricks for courses on Apache Spark, including two massive online courses on Spark in 2015.