February 26, 2016

SparkSummit East

Apache Spark 1.0 was released in May of 2014. Originally developed in 2009 by Matei Zaharia at UC Berkeley, Spark has rapidly become the dominant cluster computing framework in the big data and machine learning revolution. Open source, easy to configure, and highly extensible, Spark has made data mining, language processing, and predictive analytics extremely fast and cheap, and a highly engaged and passionate fanbase (and business demand) has formed around it.

To better inform that base, Databricks, a company that offers extremely simple out-of-the-box data analysis tools, has organized a series of conferences, called SparkSummit, that promote and discuss the Spark ecosystem. The latest such gathering took place in New York City from February 16 to 18, and the three-day event was a thrill-ride of information, exposition, and advertisement. The keynote talks and breakout sessions were jam-packed with interesting topics and discussion.

Mr. Zaharia kicked off the event with a talk on the upcoming release of Spark 2.0, scheduled for sometime in April or May. The big takeaways from the release announcement are the new phase for Tungsten, which dramatically improves the efficiency of memory and CPU usage for Spark execution, and the introduction of Structured Streaming. I also very much enjoyed the live demo of the Databricks system, which provides impressively simple tools for interacting with datasets using Spark.

It would be nearly impossible to effectively break down a three-day summit into an easily consumable article. Instead of trying, I will cover my three favorite talks, and then list the other talks I attended at the bottom.

Relationship Extraction from Unstructured Text

A common challenge of natural language systems is the accurate extraction of features and relationships in unstructured text. Supervised learning (using manual annotations on a training dataset) is a common method of training a model. However, manual annotation is time-consuming and labor-intensive, and machine learning systems that depend on fully supervised learning are less flexible and scalable. Semi-supervised and unsupervised learning are methods for extracting data from unstructured text with less (or no) human intervention. Stanford CoreNLP is an open-source system that provides a high-quality named-entity recognizer to automatically annotate text by extracting and classifying known entities, and this process can be massively distributed and made near real-time with Spark. The use of Stanford CoreNLP on unstructured text was explained well by data scientists from Capgemini, Yana Ponomarova and Nicolas Claudon.

Apache SystemML

Data scientists often struggle to translate their low-scale development practices into high-scale production systems. Python or R scripts that were rapidly iterated on during development, often on a single machine, must then be translated into code for a distributed system that will run across a cluster. SystemML is an attempt to marry those two workflows, providing a single declarative large-scale machine learning framework that runs atop Spark and can range effortlessly from single-node in-memory computations to massively distributed computations. SystemML was originally developed by IBM and very recently open-sourced, and Frederick Reiss did an excellent job reviewing the problem space and the efficiency win it offers.

MLeap

If you ever work in a production machine learning environment, you will hear many tales of translating models from the lab into the production space. Scientists will develop data pipelines to build new models in a local environment, and then engineers will translate the spirit of those pipelines to scale, often with nearly full rewrites. It can be nearly impossible to standardize feature extraction and data transformation, as scientists want to change extractors and transformation logic frequently, and frustration can quickly build between the engineering team and the science team. MLeap is an attempt to package data pipelines so that they can move freely from the development space to the production environment. It was explained well by data scientists from TrueCar, Mikhail Semeniuk and Hollin Wilkins.
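
To make the packaging idea concrete, here is a minimal Python sketch of a pipeline that bundles its feature transformations together with its model and ships as a single serialized file. This is not MLeap's API: the `Pipeline` class, the `TRANSFORMS` registry, the feature names, and the weights are all hypothetical and chosen only for illustration, and the model is a plain linear scorer rather than anything Spark-based.

```python
import json
import math

# Hypothetical registry of named transforms shared by development and production.
TRANSFORMS = {
    "scale": lambda value, factor: value * factor,
    "log1p": lambda value: math.log1p(value),
}


class Pipeline:
    """Feature transformations and linear model weights kept in one bundle."""

    def __init__(self, steps, weights, bias):
        self.steps = steps      # feature name -> list of [transform name, params]
        self.weights = weights  # feature name -> coefficient
        self.bias = bias

    def transform(self, row):
        """Apply each feature's chain of transforms to the raw row."""
        features = {}
        for name, chain in self.steps.items():
            value = row[name]
            for transform_name, params in chain:
                value = TRANSFORMS[transform_name](value, **params)
            features[name] = value
        return features

    def predict(self, row):
        """Score a raw row: transform it, then apply the linear model."""
        features = self.transform(row)
        return self.bias + sum(self.weights[n] * v for n, v in features.items())

    def to_json(self):
        """Serialize the whole pipeline, transforms included, as one document."""
        return json.dumps(
            {"steps": self.steps, "weights": self.weights, "bias": self.bias}
        )

    @classmethod
    def from_json(cls, text):
        """Rebuild a pipeline from its serialized form."""
        spec = json.loads(text)
        return cls(spec["steps"], spec["weights"], spec["bias"])


# "Development": a scientist defines the transformations and the model together.
dev = Pipeline(
    steps={"mileage": [["scale", {"factor": 0.5}]], "age": [["log1p", {}]]},
    weights={"mileage": -2.0, "age": -1.5},
    bias=30.0,
)

# "Production": an engineer loads the bundle without rewriting any feature logic.
bundle = dev.to_json()
prod = Pipeline.from_json(bundle)

row = {"mileage": 40, "age": 3}
same_score = prod.predict(row) == dev.predict(row)
```

The point of the sketch is that the feature logic travels with the model: if the scientist changes a transform chain, the change lands in the bundle, and the production side picks it up by loading a new file rather than by reimplementing the pipeline.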

Additional talks that I attended and enjoyed:

1. Distributed Time Travel for Feature Generation
2. Realtime Risk Management Using Kafka, Python, and Spark Streaming
3. Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
4. Building a Recommendation Engine Using Diverse Features
5. How to Use Apache Spark Properly in Your Big Data Architecture
6. Generalized Linear Models in Spark MLlib and SparkR
7. Spark Tuning for Enterprise System Administrators
8. Distributed TensorFlow on Spark: Scaling Google's Deep Learning Library
9. ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization
10. Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark
11. Succinct Spark: Fast Interactive Queries on Compressed RDDs
12. Highlights and Challenges from Running Spark on Mesos in Production

The conference was packed, filled with scientists, engineers, business representatives, and media professionals. There were booths showcasing business offerings from companies ranging from small start-ups to Intel and IBM. It was informative, immersive, and a lot of fun. The next SparkSummit is to take place in San Francisco in June. If you are a big data or machine learning scientist, or just want to learn more about Spark and the ecosystem of products and services that live atop and beside it, I highly recommend attending!
