Confirmed Sessions at Data Day Seattle 2016

NEW: Check out the list of topics to be covered at this year's Data Day Seattle.
We're continuing to announce the confirmed talks for Data Day Seattle 2016. We have about 20 more sessions to announce, and will be adding to the page every few days.

Maslow's Hierarchy of Needs for Databases (PLENARY KEYNOTE)

Charity Majors - Honeycomb
Are you an accidental DBA? A software engineer, or operations engineer, or startup founder who suddenly found yourself responsible for designing/scaling/not destroying a bunch of data? Yo me too -- we should form a support group or something. In this talk we’ll cover devops/DBA best practices from the earliest seed stages (survival, selecting the right storage layer) all the way up through what you should expect from a mature, self-actualized database tier. Along the way we’ll talk about how to ensure that your databases are a first-class citizen of your engineering and operational processes, and how your observability and reliability requirements are likely to evolve along with your organization as it matures.

Large-scale stream processing using Apache Kafka (PLENARY KEYNOTE)

Jay Kreps (Confluent)

Most applications continuously transform streams of inputs into streams of outputs. Yet the idea of directly modeling stream processing in applications is just coming into its own after a few decades on the periphery.
This talk will cover the basic challenges of reliable, distributed, stateful stream processing. It will cover how Apache Kafka was designed to support capturing and processing distributed data streams by building up the basic primitives needed for a stream processing system. It will also introduce Kafka Streams, a lightweight library that applications can embed for expressing stateful stream processing operations.
Finally, it will explore how Kafka and Kafka Streams solve practical problems in building scalable and stateful microservices, based on our experience building and scaling Kafka to handle streams that captured hundreds of billions of records per day.
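To make the consume-process-produce pattern concrete, here is a minimal Python sketch (Kafka Streams itself is a JVM library; the third-party kafka-python client stands in here, and the topic names and broker address are placeholders):

    import json
    from collections import Counter

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("page-views",                       # hypothetical input topic
                             bootstrap_servers="localhost:9092",
                             value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    # Local state; Kafka Streams would back this with a changelog topic.
    counts = Counter()
    for message in consumer:
        user = message.value["user"]
        counts[user] += 1
        producer.send("page-view-counts", {"user": user, "count": counts[user]})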

A Little Cassandra for the Relational Brain

Patrick McFadin (DataStax)

You would love to try some Apache Cassandra but that relational brain is still on? You aren’t alone. Let’s take that OLAP data model and get it OLTP. As you will see, static schema, one-to-many, and many-to-many still have a place in Cassandra. From the familiar, we'll go into the specific differences in Cassandra and tricks to make your application fast and resilient. We'll cover real techniques to translate application patterns into effective models, as well as common pitfalls that can slow you down and send you running back to RDBMS land.
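As a taste of the query-first modeling the talk covers, here is a hedged sketch of a one-to-many "videos by user" table using the DataStax Python driver (keyspace, table, and columns are invented for illustration):

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")    # assumes a local node and an existing 'demo' keyspace

    # In an RDBMS this would be users JOIN videos; in Cassandra we denormalize
    # into one table partitioned by the query key (user_id).
    session.execute("""
        CREATE TABLE IF NOT EXISTS videos_by_user (
            user_id   uuid,
            added_at  timestamp,
            video_id  uuid,
            title     text,
            PRIMARY KEY ((user_id), added_at, video_id)
        ) WITH CLUSTERING ORDER BY (added_at DESC)
    """)

    some_user_id = uuid.uuid4()
    rows = session.execute(
        "SELECT title, added_at FROM videos_by_user WHERE user_id = %s LIMIT 10",
        [some_user_id])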

Stuff you should know as an Advanced Cassandra user

Patrick McFadin (DataStax)

Cassandra has been around long enough that this is not a new topic for many of you. So what should you know as a more advanced user? I’ll cover some great topics that I hear about over and over: operations, programming, and data modeling. It’s all there! Have you tried any of the new Cassandra 3.0 features? Let’s dive in and find out what you have been missing. Not sure if you are hitting anti-patterns? Let’s go through a few and find out. I’ll leave some time at the end for some Q&A, so bring your questions!

Graphs vs Tables: Ready? Fight.

Denise Gosnell (PokitDok)

Lessons learned from building similarity models from structured healthcare data in both graph and relational dbs
The infrastructure debate for the “optimal” data science environment is a loud and ever-changing conversation. At PokitDok, the data engineering and data science teams have tested and deployed a myriad of architecture combinations, including dbs like Titan, DataStax Enterprise, Neo4j, Elasticsearch, MySQL, Cassandra, Mongo, … the list goes on. For us, the final implementations of tested and deployed data science pipelines became a balance of the scientific modeling domain, the right engineering tool, and a bunch of sandboxes.
In this talk, a Data Scientist from PokitDok discusses the polarizing false dichotomy of graph dbs vs. relational dbs. She will step through two different recommendation pipelines which ingest and transform structured healthcare transactions into similarity models. She will use (a) graph traversals to rank entities in a database, (b) relational tables to create co-occurrence similarity clusters, and then (c) discuss the modeling intricacies of each development process. Attendees of this talk will be introduced to the complexities of healthcare data analysis, step through graph and tabular based similarity models, and dive into the ongoing false dichotomy of graph vs relational dbs.
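For a flavor of the tabular side of this comparison, here is an illustrative (invented-data) sketch of turning a flat transaction table into a provider co-occurrence similarity matrix with pandas:

    import pandas as pd

    claims = pd.DataFrame({
        "patient":  ["p1", "p1", "p2", "p2", "p3"],
        "provider": ["dr_a", "dr_b", "dr_a", "dr_c", "dr_b"],
    })

    # Patient x provider incidence matrix, then provider x provider co-occurrence.
    incidence = pd.crosstab(claims["patient"], claims["provider"])
    cooccurrence = incidence.T.dot(incidence)

    print(cooccurrence)   # off-diagonal entries count shared patients between providers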

Real-time Search on Terabytes of Data Per Day: Lessons Learned

Joey Echeverria (Rocana)

The challenge with operating modern data centers is figuring out how to collect, store, and search all of the log and event data generated by complex infrastructure. To address these challenges at Rocana, we're applying big data technologies that are designed to handle this massive scale. In particular, we need to simultaneously offer full-text and faceted search queries against large volumes of historical data as well as perform near-real-time search against events collected in the last minute. In this talk we describe the challenges we've experienced performing search against petabytes of historical data while scaling to terabytes of new data ingest per day. We'll also share the key lessons we've learned and detail our architecture for handling search at these massive volumes. We detail how we use Apache Kafka to stream event data in near real-time to Apache Hadoop HDFS for deep storage and how we leverage Apache Solr and Apache Lucene to scale our search infrastructure to terabytes per day of ingest. We present our solution in the context of an overall event data management solution and show real-world scalability results.
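As a small, generic illustration of the full-text plus faceted search pattern (a stand-in, not Rocana's engine; the Solr URL, core, and fields are placeholders), the pysolr client can index and query a core like this:

    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/events", timeout=10)
    solr.add([
        {"id": "1", "host": "web01", "message": "disk error on /dev/sda"},
        {"id": "2", "host": "web02", "message": "login succeeded"},
    ])

    results = solr.search("message:error", **{"facet": "on", "facet.field": "host"})
    for doc in results:
        print(doc["host"], doc["message"])
    print(results.facets)   # per-host counts for the matching events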

TORA - eBay’s realtime data processing engine

Alex Liang / Thomas Varghese (eBay)

TORA is eBay’s next-generation real-time big data processing platform that’s capable of processing millions of events per second from various sources. The TORA ecosystem consists of a cohesive suite of software products and capabilities to address the complexities of processing high-throughput streaming data from heterogeneous sources. TORA provides an extensible framework to build applications and services that process continuous streams of data. TORA also provides a compute platform on top of eBay’s internal cloud infrastructure (Connected Commerce Cloud – C3) to build, deploy, and life-cycle-manage real-time analytics and data integration apps. TORA converges real-time streams of transactional, behavioral, and other eBay/external data. The TORA framework components curate these data streams and make such data available for consumption to all apps and services running on top of it. TORA also enables real-time data extraction from various sources and integration into eBay's Hadoop analytics environment as well as contextual access to historical data in Teradata & Hadoop environments.

The Algorithm Economy for Healthcare: best systems practices for data analytics

Sanjay Joshi (EMC Emerging Technologies)

Healthcare is complex. Human Biology is complexity wrapped in high dimensionality wrapped in n-of-ones. Quoting Dr. Roger Perlmutter, Executive VP of Research at Merck: "we don't know how the machine works, therefore we don't know what to do when it breaks." As we move toward the "Algorithm Economy" in Healthcare, phrases like "Precision Medicine" and "Population Scale Health" are appearing in the lay-press. Understanding Science before understanding Data Science is critical. This presentation will provide a summary review of four basic topics, each with profound implications: a) Sample Size (and how to get it); b) Infrastructure (and how to scale it); c) Workflows (and how to identify them); and d) Complexity (and how to handle it).
One use-case each from Radiology and Genomics will be presented as an example.
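As a back-of-the-envelope illustration of the sample size question (the effect size, alpha, and power targets below are arbitrary placeholders, not recommendations), a standard power calculation looks like this:

    from statsmodels.stats.power import TTestIndPower

    # Subjects needed per arm for a two-group t-test under the stated assumptions.
    n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
    print(round(n_per_group))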

Virtualizing Relational Databases as Graphs: a multi-model approach

Juan Sequeda (Capsenta)

Relational Databases are inflexible due to the rigid constraints of the relational data model. If you have new data that doesn’t fit your schema, you will need to alter your schema (add a column or a new table). This is not always possible: IT departments don't have time, or they won't allow it, and even when they do, the result is often just more nulls, which can degrade query performance.
A goal of NoSQL databases is to address this problem with their schema-less data models (key-values, JSON documents). However, many businesses have large investments in commercial RDBMSs and their associated applications, and can't expect to move all of their data to NoSQL stores.
In this talk, I will present a multi-model graph/relational architecture solution: keep your relational data where it is, virtualize it as a graph, and then connect it with additional data stored in a graph database. This way, both graph and relational technologies can seamlessly interact together.
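As a toy illustration of the idea (this is not Capsenta's implementation; the tables and columns are invented), relational rows can be exposed as graph nodes and foreign keys as edges:

    import sqlite3
    import networkx as nx

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders    VALUES (10, 1, 99.0), (11, 1, 15.5), (12, 2, 42.0);
    """)

    g = nx.DiGraph()
    for cid, name in conn.execute("SELECT id, name FROM customers"):
        g.add_node(("customer", cid), name=name)
    for oid, cid, total in conn.execute("SELECT id, customer_id, total FROM orders"):
        g.add_node(("order", oid), total=total)
        g.add_edge(("customer", cid), ("order", oid), label="PLACED")

    print(g.number_of_nodes(), g.number_of_edges())   # 5 nodes, 3 edges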

How to Observe: Lessons from Epidemiologists, Actuaries and Charlatans

Juliet Hougland - Cloudera
What can you do when you can’t implement an experiment? A/B testing is the bread and butter of many data scientists’ work. Some data scientists aspire to “a culture of experimentation” or even go as far as to (incorrectly) claim that randomized controlled trials are the only way to make inferences from data. What analytic tools are available when we can’t randomize treatment groups and perform direct experiments?
Epidemiologists and actuaries have been working in this situation for decades because many of the processes they need to study are impossible or unethical to experiment on. This talk will provide an overview of observational methods: their strengths, limitations, and the situations they are best suited for. We will dig into a real observational study which was not suited for A/B testing -- measuring the comparative quality of different versions of on-premise enterprise software.
As useful as observational methods of inference are, they are also easy to misuse and misinterpret. We will discuss some choice examples of misuse and abuse of observational methods, and hopefully avoid our own charlatanry in the future.

Governed Self Service analytics at eBay

Alex Liang (eBay)

How eBay has tamed complexity through a self-service culture of business intelligence
As part of eBay’s effort to put data into everyone’s hands, front-line eBay staff use a self-service interface to help themselves to its massive data store. eBay has spent the past three years transforming its data analysis and reporting capabilities; this talk covers how we have tamed the phenomenal amount of data generated by one of the largest e-commerce platforms.

  • Challenges faced in the development of a multi-platform, self-serve data analytics environment
  • How to enable innovation in the business through data access
  • How eBay’s business is structured to capitalise upon the data opportunity
  • Encouraging data analysts to collaborate across departments
  • How eBay is using this system to make predictions for the future
  • Insight into the key business results that have been achieved so far

Thinking like Spark: How trying to optimize one algorithm helped me re-think distributed data processing.

Rachel Warren - Alpine Data
Data processing routines with Spark are relatively easy to write, but are often very difficult to use in production. In this talk I focus on a use case from Alpine Data Labs (an enterprise analytics company), and some of the lessons we learned in bringing one data cleaning procedure from failing on 100,000 rows to succeeding on a noisy cluster with a billion rows. I hope to provide you with not just another list of abstract Spark "tips and tricks", but the tools to re-engineer your computations to work effectively in the Spark paradigm. Some of these lessons include narrow vs. wide transformations, how to avoid shuffles even in sorting and grouping operations, and managing high-dimensionality or dirty data.
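One of those lessons in miniature (a generic PySpark sketch, not Alpine's pipeline): prefer reduceByKey, which combines values within each partition before the shuffle, over groupByKey, which ships every raw value across the network:

    from pyspark import SparkContext

    sc = SparkContext(appName="shuffle-demo")
    pairs = sc.parallelize([("a", 1), ("b", 4), ("a", 2), ("b", 3)])

    # Shuffles every value for a key to one executor before summing:
    sums_slow = pairs.groupByKey().mapValues(sum)

    # Combines within each partition first, then shuffles only partial sums:
    sums_fast = pairs.reduceByKey(lambda x, y: x + y)

    print(sorted(sums_fast.collect()))   # [('a', 3), ('b', 7)]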

SQL and NoSQL on MySQL.

Peter Zaitsev - Percona

MySQL is a relational database engine, giving you the full capabilities of the SQL language. It also means you are subject to the limitations of the relational data model, without the more fluid document-based model abilities of some NoSQL solutions.
At least that used to be true!
Recent improvements in MySQL 5.7 allow the use of MySQL as a relational database engine with the power of SQL, AND as a document store with a CRUD interface and dynamic schema – or as both models working together on the same data.
This presentation explores the powerful new NoSQL features of MySQL 5.7, and provides examples of how you can put them to use in your applications.
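For a sense of what this looks like in practice, here is a hedged sketch of the JSON column type queried from Python via mysql-connector-python (connection details and table name are placeholders; the document-store CRUD interface mentioned above is a separate API not shown here):

    import mysql.connector

    db = mysql.connector.connect(host="localhost", user="app",
                                 password="secret", database="demo")
    cur = db.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS products "
                "(id INT AUTO_INCREMENT PRIMARY KEY, attrs JSON)")
    cur.execute("INSERT INTO products (attrs) VALUES (%s)",
                ('{"name": "kettle", "tags": ["kitchen", "steel"], "price": 29.99}',))
    db.commit()

    # SQL and JSON in the same statement: filter on a field inside the document.
    cur.execute("SELECT JSON_EXTRACT(attrs, '$.name') FROM products "
                "WHERE JSON_EXTRACT(attrs, '$.price') < 50")
    print(cur.fetchall())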

Lessons learned from deploying the top deep learning frameworks in production

Kenny Daniel - Algorithmia
Algorithmia has a unique perspective on using not just one, but five different deep learning frameworks. Since users depend on Algorithmia to host and scale their algorithms, Algorithmia has been forced to deal with all the idiosyncrasies of the many deep learning frameworks out there. Kenny Daniel covers the pros and cons of popular frameworks like TensorFlow, Caffe, Torch, and Theano.
Cloud hosting deep learning models can be especially challenging due to complex hardware and software dependencies. Using GPU computing is not yet mainstream and is not as easy as spinning up an EC2 instance, but it is essential for making deep learning performant. Kenny explains why you should use one framework over another, and more importantly, once you have picked a framework and trained a machine-learning model to solve your problem, how to reliably deploy it at scale. Kenny also discusses the challenges Algorithmia faced when it moved beyond simple demos and used deep learning in real production systems. Kenny shares what Algorithmia has learned from fighting these battles so that you don’t have to fight them yourself.

Turning Unstructured Data into Kernels of Ideas

Jason Kessler - CDK Digital Marketing
This session covers tools and techniques to help users turn large amounts of text data into digestible insights. We will look at case studies that include:
- Discovering ideas for advertisements from a large set of product reviews
- Uncovering common talking points in American political convention speeches
- Finding what language in film reviews predicts a movie will do well at the box office
These case studies will allow us to compare and contrast a variety of Information Retrieval and Natural Language Processing techniques. They will include:
- obscure but simple term-association scores,
- using penalized regression to identify important terms in a model,
- and, a brief and gentle foray into inspecting trained convolutional neural networks.
We'll also discuss the pros and cons of different approaches to visualizing text and term-associations, including word clouds, ordered term lists, word-bubble charts, and ways of representing term contexts.
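To make the penalized-regression item above concrete, here is a toy sketch (invented documents and labels) of an L1-penalized logistic regression over TF-IDF features, whose nonzero coefficients point at the terms most associated with a label:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs   = ["great movie, loved the cast", "boring plot and weak acting",
              "loved it, great fun", "weak, boring and forgettable"]
    labels = [1, 0, 1, 0]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    model.fit(X, labels)

    # Map surviving (nonzero) coefficients back to their terms.
    terms = {idx: term for term, idx in vec.vocabulary_.items()}
    important = [(terms[i], w) for i, w in enumerate(model.coef_[0]) if w != 0]
    print(sorted(important, key=lambda tw: -abs(tw[1])))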

How to Visualize Graph Data: A Developer’s Guide

Corey Lanum - Cambridge Intelligence
As new use cases for connected data analysis develop, more and more applications built on graph data find their way into the workplace. As a result, enterprise architects, analysts, and data scientists need to communicate their graphs to a non-technical audience of business users.
This is where visualization plays a vital role. Graph visualization tools give users rapid insight into complex connected data without technical knowledge. They can make decisions, perform graph analysis and query graph databases without the need to learn obscure query languages. Increasingly too, these advanced tools not only help answer the ‘who / why / how’ questions, but also the where and when.
This tutorial will give a developer an overview of the different approaches and techniques for visualizing graphs, as well as a summary of the different tools available.
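As a bare-bones starting point (the talk surveys more capable dedicated tools), drawing a graph in Python amounts to computing a layout and rendering nodes and edges:

    import networkx as nx
    import matplotlib.pyplot as plt

    g = nx.karate_club_graph()          # small built-in example graph
    pos = nx.spring_layout(g)           # force-directed layout
    nx.draw_networkx(g, pos, node_size=200, font_size=8)
    plt.axis("off")
    plt.show()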

Data Science for the Masses: Can KNIME make the impossible possible?

Michael Berthold - KNIME
The vision of “Data Science for the Masses” is simple: allow non-data scientists to make use of the power of data science without really understanding the machinery under the hood. While this is possible for narrow disciplines in which a nice GUI can hide complexity, it is not so simple for serious data science. First of all, it is already hard enough to make use of the wisdom of fellow data scientists who use their own favorite tools. Secondly, and still very much an open problem, is the issue of injecting feedback from casual users. Using the KNIME Analytics Platform, I will demonstrate a number of methods to encapsulate, reuse, and deploy analytical procedures at various levels of abstraction - giving data scientists control to expose select parts of the analytics processes to users unfamiliar with the underlying tool or without the in-depth knowledge of the analytical algorithms used.

Building better models faster using active learning

Nick Gaylord - CrowdFlower
Active learning is an increasingly popular technique for rapidly iterating the construction of machine learning models. Models built using active learning require decreasing amounts of additional training data over time, because the current state of the model can be used to predict which additional examples will be the most informative. Active learning is appealing for two main reasons: it optimizes ongoing human involvement in the model building process, and it helps overcome the negative effects of imbalanced training data.
In this talk, I discuss how CrowdFlower has drawn inspiration from active learning in the development of our new offering, CrowdFlower AI. By coupling a machine learning environment with our human data collection platform, we enable a highly efficient workflow whereby items that are classified with low confidence can be automatically re-routed to receive human annotations. This yields more accurate classification of those specific items, as well as additional, targeted training data for the model.
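A minimal uncertainty-sampling loop illustrates the idea (this is an illustration, not CrowdFlower AI; the data and the "human label" step are simulated):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["cheap meds now", "meeting at 3pm", "win a free prize",
             "lunch tomorrow?", "limited offer!!!", "see attached report"]
    y = np.array([1, 0, 1, 0, 1, 0])          # 1 = spam, for illustration

    X = TfidfVectorizer().fit_transform(texts)
    labeled = [0, 1]                           # indices already labeled by humans
    unlabeled = [i for i in range(len(texts)) if i not in labeled]

    for _ in range(2):                         # two rounds of the loop
        model = LogisticRegression().fit(X[labeled], y[labeled])
        confidence = model.predict_proba(X[unlabeled]).max(axis=1)
        pick = unlabeled[int(confidence.argmin())]   # least-confident item goes to humans
        print("requested human label for item", pick)
        labeled.append(pick)                   # pretend a human returned the true label
        unlabeled.remove(pick)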

Transforming Data to Unlock Its Latent Value

Tony Ojeda - District Data Labs
At the heart of data analysis, there lies a need to understand the real world entities being represented in the data. Every data set we encounter is an attempt to capture a slice of our complex world and communicate some information about it in a way that has potential to be informative to humans, machines, or both. Moving from basic analyses to advanced analytics requires the ability to imagine multiple ways of conceptualizing the composition of entities and the relationships present in our data. It also requires the realization that different levels of aggregation, disaggregation, and transformation can open up new pathways to understanding our data and identifying the valuable insights it contains.
In this talk, we’ll discuss several ways to think about the composition and representation of our data. We’ll also demonstrate a series of methods that leverage tools like networks, hierarchical aggregations, and unsupervised clustering to visually explore our data, transform it to discover new insights, help frame analytical problems and questions, and even improve machine learning model performance. In exploring these approaches, and with the help of Python libraries such as Pandas, Scikit-Learn, Seaborn, and Yellowbrick, we will provide a practical framework for thinking creatively and visually about your data and unlocking latent value and insights hidden deep beneath its surface.
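In the same spirit, here is a tiny (invented-data) sketch of re-representing raw rows at a higher level of aggregation and letting clustering suggest groupings worth exploring:

    import pandas as pd
    from sklearn.cluster import KMeans

    orders = pd.DataFrame({
        "customer": ["c1", "c1", "c2", "c3", "c3", "c3"],
        "amount":   [20.0, 35.0, 400.0, 15.0, 22.0, 18.0],
    })

    # Describe each customer by aggregate features rather than raw rows.
    features = orders.groupby("customer")["amount"].agg(["count", "mean", "sum"])

    labels = KMeans(n_clusters=2, random_state=0).fit_predict(features)
    print(features.assign(cluster=labels))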

How Machine Learning is like Cycling

Michelle Casbon - Qordoba
Over the past year, two seemingly unrelated dimensions of my life intersected. As a data science engineer at Idibon, it was all NLP all the time, coupled with a heavy dose of cycling as part of company culture. One long-distance bike trip later and the explanation for this co-occurrence became clear: at their roots, NLP and cycling are essentially the same thing. My talk will explain this bold assertion and provide my logic in coming to this conclusion. Join me for a recap of the most useful lessons I learned over the past year at a machine learning startup and how they were affirmed by means of cycling.

Modernizing the Fashion Industry with Data

Andy Terrel - Fashion Metric
The Fashion industry is going through a landmark moment. Profits are down, consumers are pickier, and purchases are shifting to online retail. But buying clothes online stinks. The clothes don’t fit. You don't have to be the size of Shaq or Bilbo to have a hard time finding clothes that fit. The clothes don’t flatter -- do you trust the attention to detail from an online experience? The shopping experience of the next few years will evolve rapidly, and data has a key role to play.

In the last few years numerous tech companies have popped up to address the opportunities in the Fashion space. I’ll discuss my own experiences modeling the human body and clothing preference. Additionally, I’ll look at breakthrough technologies and how they are being used by numerous players in the field.

Web Scraping in a JavaScript World

Ryan Mitchell - HedgeServ
Client computing power and the sophistication of JavaScript is increasing dramatically, which means website development is moving away from servers, towards browsers. This changes the face of web scraping dramatically, as simply wget’ing and parsing the response from a URL becomes useless without executing vast amounts of JavaScript and other "browser junk" with it.

We'll focus on three ways to extract data from these types of "difficult" sites: Parsing and analyzing the JavaScript itself with existing scraping tools, finding and working with hidden APIs, and, if all else fails, using Selenium and Python to work through your browser to scrape and manipulate the site. In addition, I'll show you why the switch to client side processing is actually a blessing in disguise for web scraping in many situations, and some of the niftier tricks you can do with the Python + Selenium + PhantomJS stack.
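The last of those options looks roughly like this (a hedged sketch; the URL and CSS selector are placeholders):

    from selenium import webdriver

    driver = webdriver.PhantomJS()        # headless browser that executes the page's JavaScript
    driver.get("http://example.com/js-rendered-page")

    driver.implicitly_wait(10)            # give scripts time to populate the DOM
    for element in driver.find_elements_by_css_selector("div.result"):
        print(element.text)

    driver.quit()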

Generating personalized travel recommendations from natural language queries

Melanie Tosik - WayBlazer
Travel planning can be very time-consuming. At WayBlazer, we are striving to provide travelers with a single platform to dynamically deliver personalized hotel recommendations. All that is left to the user is to describe their desired travel experience in natural language. For example, you could ask "Where are the best hotels in the Caribbean for my honeymoon in June", or "I want a pet-friendly hotel in Ireland with access to a great golf course". In order to generate recommendations that are most relevant to the user query, WayBlazer leverages a number of internal and external NLP services. Specifically, we are developing semantic microservices which are designed to address the main aspects of any trip: who, what, when and where. As opposed to a traditional tf-idf search (a statistical measure to evaluate how important a word is to a document in a given corpus), our discovery tool is powered by a graph database similar to ConceptNet 5. This talk will give the audience an overview of WayBlazer's basic NLP stack, as well as a deep dive into our NL search technology.

NLP for the web: augmenting traditional systems with web specific features

Matthew Peters - Moz
The web provides a near infinite source of text data with a large variety of commercial NLP applications. However, web pages also contain visual, stylistic and semantic markup that is ignored by traditional NLP systems. In this talk, we will explore ways to leverage web page markup to improve machine learning systems for three tasks: web page dechroming, keyphrase extraction and author identification. The resulting algorithms provide state of the art performance and are capable of processing millions of pages per day on a single CPU core.

In Search of Database Nirvana – The Challenges of Delivering Hybrid Transaction/Analytical Processing

Rohit Jain - Esgyn
Companies are looking for a single database engine that can address all their varied needs—from transactional to analytical workloads, against structured, semi-structured, and unstructured data, leveraging graph databases, document stores, text search engines, column stores, key value stores, and wide column stores. They are looking for the ultimate database nirvana.
The term hybrid transactional/analytical processing (HTAP), coined by Gartner, perhaps comes closest to describing this concept. (451 Research uses the terms convergence or converged data platform. The terms multi-model or unified are also used.) But can such a nirvana be achieved? Rohit Jain discusses the challenges one faces on the path to this nirvana, including:

  • Creating a single query engine for all workloads, including both operational and analytical workloads
  • Supporting multiple storage engines, each serving a different need
  • Using the same data model for all workloads, delivering high levels of performance for both operational and analytical workloads
  • Meeting the enterprise operational capabilities needed to support operational and analytical applications

Attendees looking to assess query and storage engines would benefit from understanding what the key considerations are when picking an engine to run their targeted workloads. Also, developers working on such engines can better understand capabilities they need to provide in order to run workloads that span the HTAP spectrum.

What's Your Data Worth?

John Akred - Silicon Valley Data Science
The unique properties of data make assessing its value difficult when using the traditional approaches of intangible asset valuation. In this talk, John Akred will discuss a number of alternative approaches to valuing data within an organization for specific purposes, including informing decisions to purchase third party data, and monitoring data’s value internally to manage and increase that value over time.
Data is difficult to value in large part because, economically, it does not adhere to the three main conditions of a traditional market system. In addition, traditional valuation methods of intangible assets do not apply to data valuation.
* Cost - data is often produced as a by-product of other business processes, making its cost hard to pin down
* Comparables - data varies greatly by content and quality so comparables are difficult to find
* Forecasts - the dominance of data aggregators and one-on-one deals in the buying and selling of data obscure the prices of any comparables that may actually exist in the market

While a traditional valuation of data’s value in general may not be applicable, we can think of data’s value in the context of specific uses and intentions within the organization. John will give several examples of how to use methods such as the Value of Information (VOI) framework and A/B testing to assess whether a third party data source should be purchased, or should continue to be purchased. He will also show how mutual information (MI) can be used to assess the value of a data source once it is in use within the organization.
Lastly, John will discuss the qualities that make data more valuable within an organization, and provide a range of concrete and straightforward metrics that allow the value of data to be monitored internally to ensure that business decisions can be optimized to maximize that value over time.
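As a toy illustration of the MI idea mentioned above (real valuations would use the organization's own outcome data, not this invented example), mutual information between a candidate attribute and a business outcome can be computed directly:

    from sklearn.metrics import mutual_info_score

    # A candidate third-party attribute vs. whether the customer converted.
    third_party_segment = ["A", "A", "B", "B", "B", "C", "C", "A"]
    converted           = [ 1,   1,   0,   0,   1,   0,   0,   1 ]

    print(mutual_info_score(third_party_segment, converted))   # higher = more informative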

Paths of Learning: The most effective way to learn about learning is to play among lovely graphs

Taylor Martin - O'Reilly Media

What if we could tell what you learn based on what you _do_ while you learn? No tests, no silly fake training exercises, in general a lot less wasting of your time. In this talk, we’ll explore how to do that. Traditional approaches to solving this problem are based on assumptions about what people need to know and how they should learn it (curricula) that have been passed down through folk wisdom. We’ll take a more data-driven approach to developing an emergent curriculum based on large bodies of data. Once we have a reasonable subject graph, we can start to figure out the variety of paths that people follow as they traverse that graph and how to group those paths using sequence clustering. Finally, we can investigate when, for whom, and under what conditions these paths are optimal.

Extreme Streaming Processing at Uber

Hien Luu - Uber

Uber is a technology company that provides a thriving Marketplace/Logistics platform. It is very important to have real-time visibility and actionable insights into the health and efficiency of the marketplace. In addition, these insights should be easily accessible for humans (Operations, Data Scientists, and Engineers) as well as machines to consume in real time.
This talk will discuss how stream processing is used within Uber's Marketplace system to solve a wide range of use cases, including but not limited to realtime indexing and querying of geospatial time series, aggregation and computing of streaming data, and extracting patterns from data streams. In addition, it will present the architecture of the stream processing system and how various open source technologies such as Apache Kafka, Samza, and Spark Streaming are being used.

Data Pipelines with Kafka and Spark (2 hour workshop)

John Akred, Stephen O'Sullivan, and Mark Mims of Silicon Valley Data Science will lead this workshop.

Spark and Kafka have emerged as a core part of distributed data processing pipelines. This tutorial will explain how Spark, Kafka, and the rest of the big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads. By examining use cases and architectures, we’ll trace the flow of data from source to output, and explore the options and considerations for each stage of the pipeline.

Data Infrastructure at a Small Company

Melissa Santos - Big Cartel

Big Cartel is a small (less than 50 people) company that creates and supports shops for artists. Melissa has recently joined them as their first data hire. She has been delighted to find that Big Cartel already has many tools in place that make it easier to work with data, and would like to spread the word so that others are not as surprised about the possibilities. The talk will touch on solutions from Xplenty, Segment, Mode Analytics, Periscope Data, New Relic, Google Analytics, GoSquared and how they interact with MySQL and Redshift. Melissa will demo some of the analyses and reports that are created by these working together, and talk through the best and worst parts of using these tools. The goal is to give the audience a feel for what tools are available and to generate discussion about how attendees address these needs at their own companies.

Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS and to the 2016 Primary Elections

Steve Kramer - Paragon Science

Paragon Science used a combination of network analysis, community detection, topic detection, sentiment analysis, and anomaly detection methods to find key influencers and viral topics in two recent Twitter data sets: one of 7.9 M tweets regarding ISIS and a second set consisting of more than 60 M tweets about the 2016 primary elections.
Paragon Science's patented dynamic anomaly detection technology is based on methods drawn from dynamical systems and chaos theory. In particular, we can calculate finite-time Lyapunov exponents from any time-dependent data stream to find the clusters of entities that are behaving most chaotically compared to the rest of the data set. Because we do not have to specify normal vs. abnormal behavior in advance, no machine learning per se is required. In a robust fashion that is tolerant of missing or erroneous data, we can identify the "unknown unknowns" that can represent threats to be mitigated or opportunities to be seized. To date, our technique has been applied successfully to a broad range of industry verticals, including healthcare data (Advisory Board Company), web user behavior data (Vast), mobile phone data (Place IQ), vehicle pricing analytics (Digital Motorworks/CDK Global), online coupon data (RetailMeNot), email monitoring for patent law cases, and social media monitoring.

Introducing Apache Airflow (Incubating) - A Better Way to Build Data Pipelines

Siddarth Anand - The Apache Software Foundation / Agari

Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines. In the Summer of 2015, Airbnb unveiled Airflow at the Hadoop Summit in San Jose. In the 10 months since, Airflow has grown in popularity and adoption: >300 users, >30 companies using it in production, >130 contributors, and >2500 GitHub stars. In March of this year, Airflow was accepted into the Apache Software Foundation's Incubation program, further fueling both interest and adoption. Why the interest? Come to this talk to learn about Airflow (i.e. its features, some case-studies, and its roadmap) from one of its committers and maintainers.
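For readers who have not seen it, the basic shape of an Airflow pipeline is a DAG of operators with explicit dependencies (the task names, commands, and schedule below are invented for illustration):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("example_pipeline",
              start_date=datetime(2016, 1, 1),
              schedule_interval="@daily")

    extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
    load    = BashOperator(task_id="load",    bash_command="echo loading",    dag=dag)

    extract.set_downstream(load)   # 'load' runs only after 'extract' succeeds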

Catching trains: Iterative model development with Jupyter Notebook

Chloe Mawer - Silicon Valley Data Science

Jupyter notebooks have become a highly valued medium for data scientists and researchers to explore data and develop models. Their ability to provide immediate feedback and act as an easy and integrated record of what was done can be used to effectively communicate work and results to a variety of audiences.
However, the ad hoc nature of the notebook can often result in a code free-for-all, leading to poor code quality that results in limited reusability and reproducibility without serious rework. Moreover, models are often developed entirely within one notebook, making the testing of different options either result in the multiplication of notebooks, all slightly different, or the writing over of previous code, making it difficult to recall what has been tried before.
This talk will step through the development of a multi-step algorithm created to detect the passing of a train and its direction from video to demonstrate best practices and tips for developing and iterating models within Jupyter notebooks. These best practices include maintaining a productive workflow for data scientists, one that both maximizes reproducibility and allows for effective communication.

Elevating Your Data Platform

Kurt Brown - Netflix

Are you getting the most out of your data platform? The technologies you choose are important, but even more so is how you put them into practice. Part philosophy and part pragmatic reality, Kurt will dive into the thinking at Netflix on technology selection and trade-offs, challenging everything (constructively), providing building blocks and paved paths, staffing, and more. Kurt will also talk through our tech stack, which includes many big data technologies (e.g. Hadoop, Spark, and Presto), traditional BI tools (e.g. Teradata, MicroStrategy, and Tableau), and custom tools / services (e.g. Netflix' big data portal and API). Expect to leave with an arsenal of new ideas on the best way to get things done.
Kurt's talk will be followed with an extended Q&A / office hour.

Visualizing the Model Selection Process

Benjamin Bengfort - District Data Labs

Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it's easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is _model selection_. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model's evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.
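One simple example of such a visual aid (an illustrative sketch; the dataset and parameter range are arbitrary) is a validation curve that shows train and cross-validation scores across a hyperparameter sweep:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import validation_curve
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    param_range = np.logspace(-6, -1, 5)

    train_scores, test_scores = validation_curve(
        SVC(), X, y, param_name="gamma", param_range=param_range, cv=3)

    plt.semilogx(param_range, train_scores.mean(axis=1), label="train")
    plt.semilogx(param_range, test_scores.mean(axis=1), label="cross-val")
    plt.xlabel("gamma")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()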

Distilling dark knowledge from neural networks

Alex Korbonits (Data Scientist Analyst, Remitly)

Recent papers from the NIPS 2015 workshop on feature extraction suggest that representational learning consisting of "supervised coupled" methods (such as the training of supervised deep neural networks) can significantly improve classification accuracy vis a vis unsupervised and/or uncoupled methods. Such methods jointly learn a representation function and a labeling function. If you are a machine learning practitioner in a field whose applications demand or require strict interpretability constraints, a major drawback of using deep neural networks is that they are notoriously difficult to interpret. In this talk, Alex will discuss "distilled learning" -- training a classifier and extracting its outputs for use as training labels for another model -- and "dark knowledge" -- implicit knowledge of the underlying data representation learned by a classifier. Together, Alex will show their efficacy in improving classification accuracy in more readily interpretable models such as single decision tree and logistic regression learners. Finally, Alex will discuss applications such as health sciences, credit decisions, and fraud detection.
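A rough sketch of the recipe (synthetic data; a random forest stands in for the neural-network teacher to keep the example dependency-light, and hard predictions stand in for the soft "dark knowledge" probabilities):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    teacher_labels = teacher.predict(X)            # teacher outputs become the training target

    student = DecisionTreeClassifier(max_depth=4).fit(X, teacher_labels)
    print("agreement with teacher:", student.score(X, teacher_labels))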

Deep Learning for Natural Language Processing

Jonathan Mugan (Co-Founder / CEO, Deep Grammar)

Deep Learning represents a significant advance in artificial intelligence because it enables computers to represent concepts using vectors instead of symbols. Representing concepts using vectors is particularly useful in natural language processing, and this talk will elucidate those benefits and provide an understandable introduction to the technologies that make up deep learning. It will outline ways to get started in deep learning, and it will conclude with a discussion of the gaps that remain between our current technologies and true computer understanding.

Graph Database Engine Shoot-out: part 1

Josh Perryman (Expero)

Our client’s legacy system held graph-like data in a relational database, but new customers’ data sizes were crippling performance and scale. As part of an overall architectural rejuvenation, we evaluated migrating their data to graph and relational schemas to determine if query performance and scalability could be improved. With representative data in hand, we designed alternate relational schemas, graph database designs, and triple store designs, benchmarking performance and noting subjective measures such as ease-of-use and fluency of the query language. Vendors included PostgreSQL, Neo4j, Titan, and AllegroGraph. Follow-up studies included other vendors. The results surprised us, leading to a hybrid relational and graph recommendation. We have implemented the first milestone over the last year. Follow-up work shows that graph DB vendors have come a long way even in that time. This methodology and the information in this case study should be useful to teams choosing a database engine, whether graph or relational, for their next project.

Graph Database Engine Shoot-out: part 2

Josh Perryman (Expero)

[Does not require attendance in part 1.] Our client’s legacy system held graph-like data in a relational database, but new customers’ data sizes were crippling performance and scale. We measured several graph and relational database options and were ready to make a recommendation to our client, but first our architects needed to understand what the numbers all meant. Why did some engines perform better with some queries, and not with others? Why was the “Entity-Attribute-Value” relational schema approach such a poor performer? Why did a reporting relational schema perform better than the graph databases in some cases? This discussion looks at how database engines work, where graph databases perform better than relational databases, and where relational databases perform better than graph databases. This will be helpful for teams choosing a database engine to address performance or scaling concerns.

NLP @HomeAway: how to mine reviews and track competition

Brent Schneeman - HomeAway

HomeAway attracts millions of vacationers to its millions of listings every year. The company has also attracted worthy competition and uses various means to keep track of those competitors. Examining unstructured data such as text reveals a good signal that can be used to estimate relative inventory overlap in the industry. We'll look at comparing property descriptions via TF-IDF vectors and topic models, and discuss various distance metrics used to detect similarities.
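A minimal sketch of that comparison (the listings are invented; cosine similarity is just one of the distance measures discussed):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    listings = [
        "Cozy beachfront condo with ocean views and a private pool",
        "Beachfront apartment, ocean view, shared pool and gym",
        "Rustic mountain cabin near ski lifts, wood stove, no wifi",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(listings)
    print(cosine_similarity(tfidf))   # the two beachfront listings score closest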

word2vec, LDA, and introducing a new hybrid algorithm: lda2vec

Chris Moody - Stitch Fix

Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec, LDA, and introduce our hybrid algorithm lda2vec.
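To make the word-vector idea concrete, here is a tiny gensim word2vec sketch (four toy sentences; a real model would train on millions of customer notes, and the calls shown follow gensim's pre-4.0 interface):

    from gensim.models import Word2Vec

    notes = [
        ["used", "to", "wear", "scrubs", "to", "work"],
        ["nurse", "needs", "comfortable", "work", "clothes"],
        ["taking", "a", "trip", "needs", "vacation", "outfits"],
        ["beach", "vacation", "in", "june"],
    ]

    model = Word2Vec(notes, size=50, window=3, min_count=1, workers=1)
    print(model.most_similar("vacation", topn=3))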

Beyond Shuffling - Tips & Tricks for Scaling Apache Spark Programs

Holden Karau - IBM

This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with a very brief introduction to Datasets (new in Spark 1.6) and how they can be used to make our Spark jobs faster.
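The RDD re-use point in one small sketch (the input path is a placeholder): persist an expensively derived RDD before using it in several actions, so its lineage is not recomputed each time:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="reuse-demo")
    parsed = sc.textFile("hdfs:///logs/*.gz").map(lambda line: line.split("\t"))

    parsed.persist(StorageLevel.MEMORY_AND_DISK)   # computed once, reused by both actions below
    print(parsed.count())
    print(parsed.filter(lambda fields: fields[0] == "ERROR").count())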

Open Source Lambda Architecture with Kafka, Samza, Hadoop, and Druid

Fangjin Yang - Stealth

The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a real-time analytics stack using Kafka, Samza, and Druid.
Analytics pipelines running purely on Hadoop can suffer from hours of data lag. Initial attempts to solve this problem often lead to inflexible solutions, where the queries must be known ahead of time, or fragile solutions where the integrity of the data cannot be assured. Combining Hadoop with Kafka, Samza, and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries.
In the described system, Kafka provides a fast message bus and is the delivery point for machine-generated event streams. Samza and Hadoop work together to load data into Druid. Samza handles near-real-time data and Hadoop handles historical data and data corrections. Druid provides flexible, highly available, low-latency queries.

What does the future hold for Business Analysts in the New World?

Matthew Baird - AtScale

Advances in technology, and in our understanding of machine learning and statistics, have moved key analysis and predictions from the realm of science (theoretical) to engineering (operational) to business analysis (actionable). Based on these advances, the traditional role of a business analyst will evolve to answer not only _what_ questions but _why_ questions, and not only to analyze the past but also to predict the future. This talk will discuss what you can do today, and the requirements to enable the next generation of business intelligence.
BI is traditionally a conservative form of data manipulation. Mistakes have a high cost, limiting the potential for risk-taking. As the appetite for deeper insights increases, so does the willingness to push those conservative boundaries. Consumer Internet companies have been mavens in using modern statistical methods to run their businesses, and that data-driven decision making philosophy is gaining steam in the enterprise. With the adoption of Hadoop as a data platform for BI comes new drivers for BI's evolution, including marrying foundational BI concepts with modern statistics and visualization.

Building Recommendations at Scale: Lessons Learned

Preetha Appan - Indeed.com

The recommendations engine at Indeed processes billions of input signals daily and drives millions of weekly page views. In this talk, we will delve into how we leveraged probabilistic data structures to build a hybrid (online+offline learning) recommendation pipeline. To address scaling challenges, we incrementally modified the system architecture, model output format, and A/B testing mechanisms. We'll describe these changes and highlight the impact each had on product metrics. We will conclude with lessons learned in system design that apply to any high traffic machine learning application.
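For readers unfamiliar with probabilistic data structures, here is a tiny count-min sketch in pure Python (parameters chosen for illustration, not production) that keeps approximate counts of input signals in bounded memory:

    import hashlib

    class CountMinSketch(object):
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            for row in range(self.depth):
                digest = hashlib.md5(("%d:%s" % (row, item)).encode("utf-8")).hexdigest()
                yield row, int(digest, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def estimate(self, item):
            # Never undercounts; may overcount slightly due to hash collisions.
            return min(self.table[row][col] for row, col in self._buckets(item))

    cms = CountMinSketch()
    for signal in ["job_view", "job_view", "apply_click", "job_view"]:
        cms.add(signal)
    print(cms.estimate("job_view"))   # 3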
