Apache Spark reached its 1.0 release in May of 2014. Originally developed in 2009 by Matei Zaharia at UC Berkeley, Spark has rapidly become the dominant cluster computing framework of the big data and machine learning revolution. Open source, easy to configure, and highly extensible, Spark has made data mining, language processing, and predictive analytics extremely fast and cheap, and a highly engaged and passionate fanbase (and business demand) has formed around it.
To better inform that base, Databricks, a company that offers extremely simple out-of-the-box data analysis tools, organizes a series of conferences, called Spark Summit, that promote and discuss the Spark ecosystem. The latest such gathering took place in New York City from February 16-18, and the three-day event was a thrill ride of information, exposition, and advertisement. The keynote talks and breakout sessions were jam-packed with interesting topics and discussion.
The summit kicked off with a keynote from Mr. Zaharia on the upcoming release of Spark 2.0, scheduled for sometime in April or May. The big takeaways from the announcement were the next phase of Project Tungsten, which dramatically improves the efficiency of memory and CPU usage during Spark execution, and the introduction of Structured Streaming. I also very much enjoyed the live demo of the Databricks platform, which provides impressively simple tools for interacting with datasets using Spark.
It would be nearly impossible to effectively break down a three-day summit into an easily consumable article. Instead of trying, I will cover my three favorite talks and then provide links to slides for all the talks I attended at the bottom.
Relationship Extraction from Unstructured Text
A common challenge of natural language systems is the accurate extraction of features and relationships from unstructured text. Supervised learning (using manual annotations on a training dataset) is a common way to train a model. However, manual annotation is time-consuming and labor-intensive, and machine learning systems that depend on fully supervised learning are less flexible and scalable. Semi-supervised and unsupervised learning are methods for extracting data from unstructured text with less (or no) human intervention. Stanford CoreNLP is an open-source system that provides a high-quality named-entity recognizer to automatically annotate text by extracting and classifying known entities, and this process can be massively distributed, and made near real-time, with Spark. The use of Stanford CoreNLP on unstructured text was explained well by Capgemini data scientists Yana Ponomarova and Nicolas Claudon.
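As a rough illustration of the distribution pattern (not Capgemini's actual pipeline), here is a toy sketch in which a regex-based "annotator" stands in for Stanford CoreNLP. In a real Spark job, the partition-level function below is the shape of code you would hand to `mapPartitions`, so that the expensive NLP pipeline is initialized once per partition rather than once per document.

```python
import re

# Toy stand-in for a named-entity recognizer: tag runs of capitalized
# words as candidate entities. A real pipeline would build a Stanford
# CoreNLP annotator here instead.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")

def annotate(document):
    """Extract candidate named entities from a single document."""
    return ENTITY_PATTERN.findall(document)

def annotate_partition(documents):
    """Annotate every document in one partition of the corpus.

    In Spark, this is the function you would pass to
    rdd.mapPartitions(...), initializing the heavyweight NLP
    pipeline once here rather than once per document.
    """
    # pipeline = StanfordCoreNLP(props)  # expensive, once per partition
    for doc in documents:
        yield (doc, annotate(doc))

corpus = ["Yana Ponomarova spoke about Spark.", "no entities here"]
results = list(annotate_partition(corpus))
```

The toy regex is only there to keep the sketch self-contained; the interesting part is the partition-level structure, which is what makes the annotation step cheap to parallelize.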
Declarative Machine Learning with SystemML
Data scientists often struggle to translate their low-scale development practices into high-scale production systems. Python or R scripts that were rapidly iterated on during development, often on a single machine, must then be translated into code for a distributed system that will run across a cluster. SystemML is an attempt to marry those two workflows, providing a single declarative large-scale machine learning framework that runs atop Spark and can range effortlessly from single-node in-memory computations to massively distributed computations. SystemML was originally developed by IBM and very recently open-sourced, and Frederick Reiss did an excellent job reviewing the problem space and the efficiency wins of SystemML.
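To make the core idea concrete, here is a minimal sketch (in plain Python, not SystemML's actual DML language) of what "declarative" buys you: the user writes one computation against an abstract backend, and the runtime decides whether to execute it single-node or in distributed-style chunks.

```python
# One "script" the user writes once; the backend decides how to run it.
def column_means(matrix, backend):
    return backend.mean_by_column(matrix)

class LocalBackend:
    """Single-node, in-memory execution."""
    def mean_by_column(self, matrix):
        return [sum(col) / len(col) for col in zip(*matrix)]

class ChunkedBackend:
    """Stands in for a distributed runtime: partial sums per chunk,
    combined at the end, as a cluster would combine partition results."""
    def __init__(self, chunk_size=2):
        self.chunk_size = chunk_size

    def mean_by_column(self, matrix):
        n = len(matrix)
        totals = None
        for i in range(0, n, self.chunk_size):
            chunk = matrix[i : i + self.chunk_size]
            partial = [sum(col) for col in zip(*chunk)]
            totals = partial if totals is None else [
                a + b for a, b in zip(totals, partial)
            ]
        return [t / n for t in totals]

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local = column_means(data, LocalBackend())
dist = column_means(data, ChunkedBackend())
```

Both backends produce identical results from the same user-facing code, which is the promise SystemML makes at real scale.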
Portable Data Pipelines with MLeap
If you ever work in a production machine learning environment, you will hear many tales of translating models from the lab into production. Scientists develop data pipelines to build new models in a local environment, and then engineers translate the spirit of those pipelines to scale, often with nearly full rewrites. It can be nearly impossible to standardize feature extraction and data transformation, as scientists want to experiment with extractors and transformation logic frequently, and frustration can quickly build between the engineering team and the science team. MLeap is an attempt to package data pipelines so that they can move freely from the development space to the production environment. It was explained well by data scientists from TrueCar, Mikhail Semeniuk and Hollin Wilkins.
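As a toy illustration of the "packaged pipeline" idea (not MLeap's actual serialization format), the sketch below describes each pipeline step as a named transform plus parameters, so the same serialized document can be loaded by a research notebook and by a production scoring service without a rewrite.

```python
import json

# Registry of known transforms; each entry maps params -> callable.
# Both dev and prod environments share this registry.
TRANSFORMS = {
    "lowercase": lambda params: lambda s: s.lower(),
    "truncate": lambda params: lambda s: s[: params["length"]],
}

def serialize(pipeline):
    """Turn a list of (name, params) steps into a portable JSON blob."""
    return json.dumps([{"name": n, "params": p} for n, p in pipeline])

def deserialize(blob):
    """Rebuild an executable pipeline from JSON in any environment."""
    steps = json.loads(blob)
    fns = [TRANSFORMS[s["name"]](s["params"]) for s in steps]

    def run(value):
        for fn in fns:
            value = fn(value)
        return value

    return run

blob = serialize([("lowercase", {}), ("truncate", {"length": 5})])
score = deserialize(blob)
result = score("MLeap Pipelines")
```

Because the pipeline travels as data rather than as code, the science team can iterate on transforms while the engineering team deploys them unchanged, which is the friction MLeap aims to remove.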
The conference was packed, filled with scientists, engineers, business representatives, and media professionals. There were booths to showcase business offerings, from small start-up companies to Intel and IBM. It was informative, immersive, and a lot of fun. The next Spark Summit will take place in San Francisco in June. If you are a big data or machine learning scientist, or just want to learn more about Spark and the ecosystem of products and services that live atop and beside it, I highly recommend attendance!
In 2014, I released a book, entitled Start-up Struggles, in which I discussed, at a high level, common issues encountered during the growth and development of a technical start-up. Each chapter in the collection of essays highlighted a specific challenge: finding a co-founder, managing finances, managing equity, building the minimum viable product, hiring employees, finding advisors, preparing for scale, and more. No chapter went into significant depth, but rather tried to provide some basic analysis and options for each topic.
In 2016, I plan to release the first in a series of follow-up books that will focus on many of these individual topics in greater detail. The first, entitled Start-up People, will dive into the difficult and muddy waters of technical recruiting, onboarding, and training. I will cover the pains of managing equity and options, finding and attracting the best talent (often with limited dollars), getting engineers up to speed and productive quickly, and scaling at different stages of growth. I'll cover interviewing best practices, tools for addressing underperforming developers, and how best to keep a cash-strapped and over-stressed organization on track and happy.
Start-ups are never easy, but it helps to have a little more information when you kick off!
The human attention span falls into two distinct categories: transient attention and sustained attention. When you are cramming for an exam, investigating a new skill, or otherwise engaged in deep learning, you are using sustained attention; it allows you to focus for extended periods of time so that you can absorb and analyze complex concepts. In sustained attention, you have between five and twenty minutes before the mind requires a break. In transient attention, however, the mind has only seconds (as little as eight seconds, according to some research) to respond to a stimulus. When you attempt to read a sign along a walking route, for example, you focus only briefly before losing interest.
When browsing the internet, a user is not likely to be engaged in sustained attention. Discovery of new websites is often an act of casual interaction: a random click of a link, the following of a story, the comments of a conversation. When a user arrives at a new website in this way, they are likely to be in the transient phase of attention. That means you have only seconds to grab their attention and convince them to move their mind into sustained attention for a deeper dive.
Make sure that your design can earn sustained attention! Make good use of those eight seconds and hook each new user as quickly as possible.
For centuries, Boston has been a champion of higher education and academic achievement. In addition to the world-renowned Harvard and MIT, both sitting along the Charles River in Cambridge, the city hosts several other top-tier institutions, including Boston College, Boston University, Northeastern University, Emerson College, and Tufts University; altogether, the Boston area holds over 50 institutions of higher learning, with notable specialists in art, music, technology, engineering, law, and science. Furthermore, the city is fed by an even greater number of institutions throughout the state and region, with a total of 114 in Massachusetts and 167 in all of southern New England, including three of the eight members of the Ivy League (with a fourth, Dartmouth College, just north in New Hampshire).
This staggering collection of learned institutions has given Boston a number of historical triumphs in business, finance, and heavy industry, and has recently been a major driver in the amazing expansion of the innovation economy. While the west coast is culturally more famous for technological prowess, Boston now boasts over 1,600 notable start-ups (amounting to $38.55 billion in funding and over 74,000 jobs), significant operations or headquarters for many firms, and a collective GDP ($370.77 billion) that ranks higher than that of many countries, including Venezuela, Finland, Israel, Portugal, Egypt, and Qatar.
Kendall Square, Downtown, and the Waterfront of South Boston account for the three biggest innovation centers in the city, but more are emerging nearly every day in and around it. Somerville, just to the north, has been experiencing impressive growth for several years, and Assembly Row is becoming a crowning achievement for the area; other Charles River sites have begun to make noise, including the Mill District, and Route 128 has long been called the Silicon Valley of the East, with several miles of highway bordered by dozens of high-powered tech firms, including the world headquarters of TripAdvisor, VistaPrint, Constant Contact, and many more.
With high-rise construction continuing to expand in number, scale, and scope, and with money continuing to flood the region as graduates decide to remain local and build their start-up companies in the city and surrounding towns, Boston has a bright and impressive future ahead; and, indeed, the rest of Massachusetts and southern New England is sure to experience spillover growth as the limitless opportunities of the innovation economy continue to seek out new homes and champions. If I were a betting man, I'd bet on Boston every day of the week and twice on Sunday!