NumPy and Pandas provide usable high-level abstractions over efficient low-level algorithms. Unfortunately, both NumPy and Pandas are largely limited to single-core, in-memory computing. When inconveniently large data forces users beyond this context, we re-enter the frontier of novel solutions.
The Blaze project eases navigation of this frontier by providing a uniform user interface on top of a variety of pre-existing computational solutions. This allows Python users to interact in a NumPy/Pandas style while actually driving large distributed or out-of-core computational systems like SQL or Spark.
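The uniform-interface idea can be sketched in miniature: a single abstract expression is written once in a Pandas-like style and then rendered against different backends. The classes and methods below are purely illustrative stand-ins, not Blaze's actual API.

```python
import pandas as pd

# Illustrative sketch (NOT Blaze's real API): one abstract expression,
# here "amount > 100", is evaluated by two different backends.
class Column:
    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        # Building the expression records intent; nothing executes yet.
        return Filter(self, value)

class Filter:
    def __init__(self, column, value):
        self.column, self.value = column, value

    def to_pandas(self, df):
        # Backend 1: execute in memory with Pandas.
        return df[df[self.column.name] > self.value]

    def to_sql(self, table):
        # Backend 2: compile the same expression to a SQL string.
        return f"SELECT * FROM {table} WHERE {self.column.name} > {self.value}"

expr = Column('amount') > 100
df = pd.DataFrame({'amount': [50, 150, 300]})

in_memory = expr.to_pandas(df)        # rows with amount 150 and 300
query = expr.to_sql('accounts')       # the same logic, as SQL text
```

The user writes one expression; the choice of backend is a separate, later decision.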
Blaze also provides a setting for the construction of new computational systems out of old ones. In some cases we coordinate many processes using Pandas or NumPy to perform complex computations in parallel and out-of-core.
In this talk we give a brief outline of the Blaze project as a whole and then specialize down to the case of out-of-core shared-memory n-dimensional arrays. Using standard storage technologies (e.g. HDF5) and NumPy in memory, we show how a task scheduling framework can achieve NumPy-like usability on large datasets for a broad class of operations.
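The task-scheduling approach can be sketched as follows: the large array is split into blocks (on disk in practice, e.g. in HDF5), and a computation is expressed as a small graph of tasks, each of which touches only one in-memory block at a time. The dict-based graph format and `get` function below are illustrative assumptions, not the talk's actual implementation.

```python
import numpy as np

def get(graph, key):
    """Recursively evaluate `key` in a dict-based task graph.

    A task is either a concrete value or a tuple of
    (function, *argument-keys) to be computed on demand.
    """
    task = graph[key]
    if isinstance(task, tuple):
        func, *args = task
        return func(*(get(graph, a) for a in args))
    return task

x = np.arange(1_000_000)
b0, b1 = np.array_split(x, 2)   # stand-ins for HDF5-backed chunks on disk

graph = {
    'block-0': b0,
    'block-1': b1,
    'sum-0': (np.sum, 'block-0'),                     # per-block partial sums
    'sum-1': (np.sum, 'block-1'),
    'total': (lambda a, b: a + b, 'sum-0', 'sum-1'),  # combine the partials
}

total = get(graph, 'total')     # equals x.sum(), one block at a time
```

A real scheduler would also decide task ordering and free each block as soon as its consumers finish, which is what keeps the memory footprint bounded by the block size rather than the dataset size.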
Matthew Rocklin is a computational scientist at Continuum Analytics. He writes open source tools to help scientists interact with large volumes of data.