Confirmed Sessions for Data Day Seattle

Keynote - This is Our Fight: Technology for Defending Public Discourse

Jonathon Morgan - New Knowledge

This has been a tough year for Silicon Valley media companies. Every week more comes to light about how the platforms that revolutionized how we communicate have become so corrupted by fake accounts and fake news that they're undermining our democracy. Amid the important ethical and legal conversations about how our society can protect itself, it's important to understand what we, as data scientists and machine learning practitioners, can do to prevent our systems from being weaponized. We'll take a look at some of the structural problems in our media platforms, and dig into the new and novel technologies that might repair public discourse.

Graph Keynote - Databases: the past, the present, and the future in cognitive computing

Haikal Pribadi - GRAKN.AI

In the past 70 years, we have seen the evolution of databases starting from punch cards and tapes, all the way to globally distributed databases supporting web scale applications. Every evolution of the database was critical to the enablement of different fields of computing. The relational database enabled the rise of business information systems, and NoSQL databases enabled the development of web scale applications. Now, the future is cognitive computing. However, cognitive systems process data that is far more complex than what we’ve been used to. In this keynote talk we will walk through the evolution of databases in the history of computing, and review the applications of each. We will discuss where knowledge graphs and knowledge bases sit in this evolution. Could they potentially serve as the next generation of databases to support the future of cognitive computing?

Foundations of Streaming SQL or: How I Learned to Love Stream & Table Theory

Tyler Akidau - Google

What does it mean to execute robust streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing conceptually, or different? And how does all of this relate to the programmatic frameworks we’re all familiar with? This talk will address all of those questions in two parts.
First, we’ll explore the relationship between the Apache Beam Model and stream & table theory (as popularized by Martin Kleppmann and Jay Kreps, amongst others, but essentially originating out of the database world). It turns out that stream & table theory does an illuminating job of describing the low-level concepts that underlie the Beam Model.
Second, we’ll apply our clear understanding of that relationship towards explaining what is required to provide robust stream processing support in SQL. We’ll discuss concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, and talk about new ideas yet to come.
In the end, you can expect to have a much better understanding of the key concepts underpinning data processing, regardless of whether that data processing is batch or streaming, SQL or programmatic, as well as a concrete notion of what robust stream processing in SQL looks like.

Testing in Apache Spark 2+: avoiding the fail boat beyond RDDs

Holden Karau - IBM / Rachel Warren - Salesforce

As Spark continues to evolve, we need to revisit our testing techniques to support Datasets, streaming, and more. This talk expands on “Beyond Parallelize and Collect” (not required to have been seen) to discuss how to create large-scale test jobs while supporting Spark’s latest features like Structured Streaming. We will explore the difficulties of testing streaming programs, options for setting up integration testing with Spark beyond just local mode, and best practices for acceptance tests.

From Natural Language Processing to Artificial Intelligence

Jonathan Mugan - Deep Grammar

Why isn’t Siri smarter? Our computers currently have no commonsense understanding of our world. This deficiency forces them to rely on parlor tricks when processing natural language because our language evolved not to describe the world but rather to communicate the delta on top of our shared conception. The field of natural language processing has largely been about developing these parlor tricks into robust techniques. This talk will explain these techniques and discuss what needs to be done to move from tricks to understanding.

Machine Learning: From The Lab To The Factory

John Akred - Silicon Valley Data Science

When data scientists are done building their models, there are questions to ask:
* How do the model results get to the hands of the decision makers or applications that benefit from this analysis?
* Can the model run automatically without issues and how does it recover from failure?
* What happens if the model becomes stale because it was trained on data that is no longer relevant?
* How do you deploy and manage new versions of that model without breaking downstream consumers?
This talk will illustrate the importance of these questions and provide a perspective on how to address them. John will share experiences deploying models across many enterprises, some of the problems encountered along the way, and best practices for running machine learning models in production.

Chatbots from First Principles

Jonathan Mugan - Deep Grammar

There are lots of frameworks for building chatbots, but those abstractions can obscure understanding and hinder application development. In this talk, we will cover building chatbots from the ground up in Python. This can be done with either classic NLP or deep learning. We will cover both approaches, but this talk will focus on how one can build a chatbot using spaCy, pattern matching, and context-free grammars.
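As a minimal illustration of the pattern-matching approach, here is a tiny chatbot built from nothing but the Python standard library. The patterns and canned responses are invented for the example; a real system would swap in spaCy matchers or a context-free grammar as the talk describes.

```python
import re

# Ordered rules: the first pattern that matches wins.
# These intents and responses are illustrative, not from the talk.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.I), "Hello! How can I help you?"),
    (re.compile(r"\bhours?\b", re.I), "We are open 9am-5pm, Monday to Friday."),
    (re.compile(r"\b(bye|goodbye)\b", re.I), "Goodbye!"),
]

def respond(utterance: str) -> str:
    """Return the response for the first rule whose pattern matches."""
    for pattern, response in RULES:
        if pattern.search(utterance):
            return response
    return "Sorry, I didn't understand that."

print(respond("Hey there"))            # greeting rule fires
print(respond("What are your hours?")) # hours rule fires
```

The same loop generalizes directly: replace the regexes with spaCy `Matcher` patterns for linguistic features, or with grammar rules for structured slot filling.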

Evolution of Natural Language Comprehension with Human-Machine Collaboration

Sanghamitra Deb - Chegg

Huge amounts of text data collected daily reside in darkness: because they are not structured, it is difficult to draw inferences from them directly. In order to convert this to structured data we need to develop language models. This requires heavy investment in feature engineering and the manual creation of large labeled training sets. This process of model building is expensive, and evolving language makes revisiting the process difficult, so older models end up being used on data that has changed significantly. In this talk I propose combining human input with data programming and weak supervision to create a high-quality model that evolves with feedback. We apply Snorkel, a dark-data extraction framework developed at Stanford, to create an honor code violation detector (HCVD). Snorkel takes inputs from SMEs and business partners and converts them into noisy heuristic rules. It combines the rules using a generative model to distinguish high-quality from low-quality rules, and outputs high-accuracy training data based on the combined rules.
HCVD detects key phrases (example: “do my online quiz”) that indicate honor code violations. We run this model daily and place the flagged texts (around 2%) in front of humans; the human feedback is periodically reviewed and the rules are edited, updating the weak supervision to produce a fresh training set for modeling. This is an ongoing, iterative process that uses interactive machine learning to evolve the natural language comprehension model as new data is collected.
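The rules-plus-combination pipeline can be sketched in miniature: a few noisy labeling functions each vote on a text, and the votes are combined into a single label. Snorkel combines rules with a generative model; the simple majority vote and the trigger phrases below are stand-ins invented for illustration.

```python
# Labels a labeling function can emit. ABSTAIN means "no opinion",
# which is what lets many weak, partial rules coexist.
ABSTAIN, OK, VIOLATION = -1, 0, 1

def lf_do_my(text):      # heuristic: "do my <assignment>" phrasing
    return VIOLATION if "do my" in text.lower() else ABSTAIN

def lf_pay(text):        # heuristic: offering payment
    return VIOLATION if "pay" in text.lower() else ABSTAIN

def lf_question(text):   # heuristic: ordinary question words suggest no violation
    return OK if text.lower().startswith(("how", "what", "why")) else ABSTAIN

LFS = [lf_do_my, lf_pay, lf_question]

def label(text):
    """Combine non-abstaining votes by simple majority (Snorkel learns
    rule accuracies instead of weighting all rules equally)."""
    votes = [lf(text) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return VIOLATION if votes.count(VIOLATION) > votes.count(OK) else OK

print(label("Will pay someone to do my online quiz"))  # two rules vote VIOLATION
```

Editing a rule in response to human feedback then regenerates the entire training set, which is what makes the iteration loop in the abstract cheap.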

Bootstrapping Knowledge-bases from Text

Garrett Eastham - Edgecase / Data Exhaust

Garrett will give an overview of the types of knowledge-bases (ontologies, graph structures, etc.). He will then go over examples of how knowledge-bases can be, and are being, applied to improve common intelligent systems (specifically search and personalization engines). Next, he will discuss state-of-the-art approaches to information extraction and the challenges and opportunities in leveraging these when attempting to train a knowledge base from large text corpora. Finally, Garrett will walk through an example music knowledge base extracted from music reviews and user-submitted tags, focusing on the methods used and the challenges overcome.
Intended Audience: Applied machine learning engineers / scientists - ideally those who are working on improving the results of existing production search and/or recommendation pipelines. Technical Skills / Concepts: Familiarity with general NLP practices / techniques (specifically modern information extraction approaches), Apache Spark, and an understanding of core search / personalization principles.

Understanding Cultures and Perspectives through Text and Emoji Visualization

Jason Kessler - CDK Global

How do people from different groups (e.g., political perspectives, genders, and language speakers) express themselves differently? This talk introduces the Python package Scattertext, a tool for visualizing how word choice varies between document sets. Next, we will look at how Scattertext has been used to visualize the differences between political parties, genders, and other groups. The widespread adoption of Emoji gives us an unprecedented opportunity to see how word (i.e. Emoji) use differs among speakers of different languages, and the talk will conclude with such an analysis.

Scaling Data Science at Stitch Fix

Stefan Krawczyk - Stitch Fix

Stitch Fix is an online clothing retailer that not only focuses on delivering personalized clothing recommendations for our customers, but also applies the output of data science to automate numerous other business functions through the delivery of forecasts, predictions, and analyses. We rely heavily on the ability of applied mathematics & statistics and our human decision makers to work synergistically; doing this well requires us to merge art & science together. However, with around eighty data scientists in residence, it can be challenging to support so many different needs from an infrastructure perspective. The talk will cover how Stitch Fix scales access to data using S3 as our source of truth, as well as how Stitch Fix scales ad-hoc compute resources for data scientists using Docker & ECS.

Detecting Bias in News Articles

Rob McDaniel - Lingistic / Rakuten

Bias is a hard thing to define, let alone detect. What is bias? How many different types of bias exist? What, if any, lexical cues exist to identify bias-inducing words? Can machines help us qualify and improve news articles?
Using millions of individual Wikipedia revisions, we will discuss a supervised method for identifying bias in news articles. First, we will discuss the last several decades of linguistic research into bias and the various types of biased verbs and lexicons that exist. Then, with plenty of examples, we will explore the way that these words introduce hidden bias into a text, and will follow up with a demonstration of a model for predicting the presence of bias-inducing words.
We will conclude with an exploration of ways to automatically suggest improvements to an article, to associate bias with topics, future implications in the field of stance detection, and a discussion of the background bias of various publishers.

Tech Battle: Machine Learning vs Graphs

Haikal Pribadi - GRAKN.AI / Denis Vrdoljak - Berkeley Data Science

In this Tech Battle, we'll put Berkeley Data Science Group's Machine Learning expertise up against Grakn AI's Hyper-Relational Database, and see who comes out on top! The challenge dataset: Credit and Fraud. In the process, we'll cover some of the challenges of using blackbox machine learning models, and the techniques used to deal with them. We'll cover alternative approaches available with Grakn's platform--such as rule mining. And, finally, we'll tell you where we see the future going, and how we envision these two technologies working together to take the industry beyond just Machine Learning and create true Artificial Intelligence.

Applying an Active Learning Algorithm For Entity Deduplication In Graph Data

William Lyon - Neo4j

Data deduplication, or entity resolution, is a common problem for anyone working with data, especially public data sets. Many real world datasets do not contain unique IDs, instead we often use a combination of fields to identify unique entities across records by linking and grouping. This talk will show how we can use active learning techniques to train learnable similarity functions that outperform standard similarity metrics (such as edit or cosine distance) for deduplicating data in a graph database. Further, we show how these techniques can be enhanced by inspecting the structure of the graph to inform the linking and grouping processes. We will demonstrate how to use open source tools to perform entity resolution on a dataset of campaign finance contributions loaded into the Neo4j graph database.
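To make the setup concrete, here is a sketch of multi-field record matching in Python. The talk trains a learnable similarity function via active learning; this stand-in uses a fixed weighted combination of a standard string similarity, and the fields, weights, and threshold are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and decision threshold. In the approach the
# talk describes, these would be learned from analyst-labeled pairs
# rather than hand-set.
WEIGHTS = {"name": 0.6, "employer": 0.3, "city": 0.1}
THRESHOLD = 0.8

def field_sim(a: str, b: str) -> float:
    """Standard string similarity in [0, 1] for one field."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(rec_a: dict, rec_b: dict) -> bool:
    """Weighted combination of per-field similarities, thresholded."""
    score = sum(w * field_sim(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())
    return score >= THRESHOLD

a = {"name": "John A. Smith", "employer": "Acme Corp", "city": "Seattle"}
b = {"name": "Smith, John", "employer": "ACME Corporation", "city": "Seattle"}
print(same_entity(a, b))
```

The graph-aware enhancement in the talk adds signals this sketch lacks, e.g. whether two candidate records share neighbors (same committees, same addresses) in the contribution graph.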

Graph Analytics - For Fun and Profit

David Bechberger - Gene by Gene

Great, you have cleared your first hurdle by building your data model and loading your data into a graph, but you know that there’s more. Now the real fun begins: finding out what secrets reside within your data.
We will use a data model we are all familiar with, family trees, and a common language, Apache TinkerPop, to demonstrate how you can begin applying some common graph analytical techniques (e.g. path analysis, centrality analysis, community detection) to pull interesting information from within your data.
- Who's married their 1st cousin?
- Who is the most influential person in my family?
- Am I really only 6 degrees from Kevin Bacon?
By the end of this session you will have enough knowledge to begin running useful analytics on your graphs, or at least have a better appreciation for how you can use analytics to provide valuable insight into your data.
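The talk works in Apache TinkerPop's Gremlin; as a language-neutral warm-up, the "degrees from Kevin Bacon" question is just a shortest-path breadth-first search, shown here in Python over a small invented family graph.

```python
from collections import deque

# A toy undirected graph as adjacency lists. The people and
# relationships are invented for the example.
GRAPH = {
    "me": ["mom", "dad"],
    "mom": ["me", "grandma"],
    "dad": ["me", "uncle"],
    "uncle": ["dad", "kevin"],
    "kevin": ["uncle"],
    "grandma": ["mom"],
}

def degrees(start: str, target: str) -> int:
    """Breadth-first search: hops from start to target, or -1 if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in GRAPH.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return -1

print(degrees("me", "kevin"))  # me -> dad -> uncle -> kevin = 3 hops
```

In Gremlin the same question is a path traversal (roughly a `repeat(both()).until(...)` pattern), and centrality or community detection build on the same neighbor-expansion primitive.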

Combining graph analytics with real-time graph query workloads for solving business problems

Ryan Boyd - Neo4j

Abstract forthcoming.

Text Mining Using Tidy Data Principles

Julia Silge - Stack Overflow

Unstructured, text-heavy data sets are increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. In this talk, we will explore how you can manipulate, summarize, and visualize the characteristics of text using R packages from the tidy tool ecosystem; these tools extend naturally to many text analyses and allow analysts to integrate natural language processing into effective workflows already in wide use. We will discuss how to implement approaches such as sentiment analysis of texts and measuring tf-idf to quantify what a document is about.
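The talk works in R with tidy tools; for readers who want the underlying arithmetic, tf-idf can be sketched in a few lines of Python. The two toy "documents" below are invented: tf-idf weights a term by its frequency within a document, discounted by how many documents contain it.

```python
import math

# Two tiny invented documents, tokenized into word lists.
docs = {
    "austen": "the pride and the prejudice".split(),
    "melville": "the whale the sea".split(),
}

def tf_idf(term: str, doc: str) -> float:
    """Term frequency in the document times inverse document frequency."""
    words = docs[doc]
    tf = words.count(term) / len(words)            # how common in this doc
    df = sum(1 for d in docs.values() if term in d)  # how many docs contain it
    idf = math.log(len(docs) / df)
    return tf * idf

print(round(tf_idf("whale", "melville"), 3))  # distinctive term: 0.173
print(tf_idf("the", "melville"))              # in every doc, so idf = 0: 0.0
```

A word that appears in every document gets idf of zero and drops out, which is exactly why tf-idf surfaces the words that characterize a document rather than the words that are merely frequent.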

Graph Data Obfuscation

Mike Downie - Expero

It only takes 5 pieces of identifying data to uniquely identify someone within a zip code with fairly high confidence. If you have HIPAA or PCI (or both!) compliance requirements, and you are using graph, that problem just got hard. The graph ecosystem doesn’t have the same obfuscation support as other data engines. Since graph is an enabling and strategic technology for many projects, and regulatory requirements aren’t going away, that means you have to solve the data obfuscation problem. Come join Mike as he looks at techniques and strategies for obfuscating, de-identifying, or encrypting data in your graph implementation.
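One common de-identification technique in this space is keyed pseudonymization: replace identifying vertex properties with deterministic tokens so that edges still line up while the raw identifiers disappear. Below is a Python sketch; the secret key, field names, and record are invented, and this is one option among the techniques a talk like this would cover, not a prescribed method.

```python
import hashlib
import hmac

# A keyed hash (HMAC) rather than a plain hash, so tokens cannot be
# reversed by hashing guessed values without the key. Rotate and protect
# this key as you would any credential; this value is illustrative.
SECRET = b"rotate-me-regularly"

def pseudonymize(value: str) -> str:
    """Deterministic keyed token: same input -> same token, so graph
    edges keyed on the identifier still connect after obfuscation."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

patient = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "flu"}
IDENTIFYING = {"name", "ssn"}
safe = {k: (pseudonymize(v) if k in IDENTIFYING else v)
        for k, v in patient.items()}
print(safe["diagnosis"])  # non-identifying fields pass through unchanged
```

Determinism is the point and also the risk: it preserves joinability across vertices, but it also means frequency analysis is possible, which is why key rotation and per-dataset salting come up in practice.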

3 ways to build a near real-time recommendation engine

David Gilardi - DataStax

David Gilardi will show how to add a near real-time recommendation engine to KillrVideo, a reference application built to help developers learn Cassandra, Graph, and DataStax Enterprise. David will discuss the benefits and practical considerations of building a solution that leverages multiple, highly connected data models such as graph and tabular. He’ll also take a look at multiple recommendation engine examples, including use of domain specific languages to make development even simpler.
An introduction to KillrVideo - David will briefly introduce the reference implementation, a cloud-based video sharing application which uses the Apache Cassandra core of DataStax Enterprise as well as DSE Search and DSE Graph integrations.
What do we mean by “multiple, highly connected models”? - David will talk about what this means and discuss the benefits of these attributes in building applications that include transaction processing, search, and graph.
Adding a recommendation engine - David will discuss the task of extending KillrVideo to provide real-time video recommendations using DSE Graph and the popular Gremlin graph traversal language, using DSLs (Domain Specific Languages).

Everything is not a graph problem (but there are plenty)

Dr. Denise Gosnell - DataStax

As the reality of the hype cycle sets in, the graph pragmatists have shown up to guide the charge. What we are seeing and experiencing is an adjustment in mindset: the convergence to multi-model database systems parallels the mentality of using the right tool for the problem. With graph databases, there is an intricate balance to find where the rubber meets the road between theorists and practitioners.
Before hammering away on the keyboard to insert vertices and edges, it is crucial to iterate and drive the development life cycle from definitive use cases. Too many times the field has seen monoglot system thinking pressure the construction of the one graph that can rule it all, which can result in some impressive scope creep. In this talk, Dr. Gosnell will walk through three overlooked architectural components that can make or break a graph implementation and suggest some best practices for navigating common anti-patterns.

Case Study: Visualize and Analyze the GDELT Global Knowledge Graph

Kevin Madden - Tom Sawyer Software

Every 15 minutes Google captures the world’s news into the Global Database of Events, Language, and Tone (GDELT) Global Knowledge Graph. The 2015 data contains nearly three quarters of a trillion emotional snapshots and more than 1.5 billion location references.
This talk describes the challenges, complexities, and technical solutions involved in creating a massive data lake from GDELT data that enables users to query, visualize and analyze the world’s news every 15 minutes, and save the results back to a graph database.
On the back end, this project included ingesting the entire underlying event and graph datasets – more than 2.5TB for last year alone – and then updating the data every 15 minutes. We migrated the link structure from a relational to a spatial and graph database. This enabled regional geographic clustering and provided the platform to support all-in-one analysis.
The front end required interactive, user-friendly search and filtering to automatically generate a unified view of diverse data and analytics. The view includes geospatial information, network topologies and sentiment analysis combined with timelines, link analysis, maps, trees, charts and tables.
The talk concludes with a demonstration.
Intended Audience: CTOs, CIOs, data architects, data engineers, data scientists, enterprise architects, solution architects, VPs of Engineering, VP of Analytics.
Technical Skills and Concepts Required: Property Graphs, Knowledge Graphs, Graph Query Languages, Graph Theory and Algorithms, Graph Analytics and Visualization, Linked Data, Sentiment Analysis.

Conversational Assistants with Deep Learning

Dr. Zornitsa Kozareva - AWS

Over the years there has been a paradigm shift in how humans interact with machines. Today’s users are no longer satisfied with seeing a list of relevant web pages, instead they want to talk to machines, complete tasks and take actions. This raises the questions: “How do we teach machines to become useful in a human-centered environment?” and “How do we build machines that help us organize our daily schedules, arrange our travel and be aware of our preferences and habits?”. In this talk, I will describe these challenges in the context of conversational assistants. Then, I will delve into deep learning algorithms for entity extraction and user intent prediction. Finally, I will highlight findings on user intent prediction from shopping and movie domains.

Exploring the graph database landscape through graph visualization

Christian Miles - Cambridge Intelligence

As with any exciting and widely applicable technology, the graph database domain is an ever-evolving landscape of tools and applications that can be hard to keep up with. It can often seem like there is a new database announced each month. In this talk Christian will use cutting-edge visualization techniques to provide a live exploration of the world of graph databases and show how this landscape has shifted over time. Of particular interest will be an attempt to untangle the different database classifications, as well as plotting the active development of particular platforms. Christian will conclude by showing that graph visualization is a valuable tool even when it's applied to the set of applications that visualization itself often leverages.

Evolution of the Graph Schema

Joshua Shinavier - Uber

To every graph database, there is a schema -- a language in which we describe a domain of interest, modeled as a graph. Historically, NoSQL graph databases emerged as a more dynamic alternative to relational databases and RDF triple stores, with their relatively formal schemas. However, as graph database applications have grown in scale over the years, schema management has become essential for optimal performance. Likewise, as applications have grown in complexity, schemas have become increasingly necessary for data integration.
At Uber, we are tackling the problem of modeling an open-ended and rapidly changing domain at very large scale. Although a graph of “things” connected by “relationships” is perhaps the most generic and universally understandable format for such a data set, the sheer volume of streaming and historical data, combined with the need to act on incoming data in real time, pose significant challenges.
In this talk, we will focus specifically on the challenges of graph data modeling and integration, surveying techniques from academic research, the early graph landscape, and current high-performance applications including the Uber knowledge graph.

Build horizontally scalable graphs and real-time data science solutions with Azure Cosmos DB

Denny Lee / Luis Bosquez - Microsoft

Azure Cosmos DB lets you interact with graphs utilizing Apache TinkerPop’s Gremlin APIs with a service that provides turn-key global distribution, elastic scaling of storage and throughput, and multiple consistency models. Because Cosmos DB is a multi-model solution, the same graph container can be used to build real-time data science solutions by combining the power of Apache Spark and Cosmos DB. This enables blazing fast analytics using Spark by efficiently exploiting Cosmos DB's indexes with pushdown predicate filtering and enabling updateable columns while performing analytics. By the end of this session, you will have learned how you can use the Spark connector for Azure Cosmos DB in a variety of scenarios including online recommendations and personalization, fraud detection, event messaging, and IoT sensor-data anomaly detection.

Securing Federated Data with TinkerPop and how to handle the “search engine” problem

Josh Perryman - Expero

"Hey, I have a great idea,” the CIO said, “let’s use a graph database to integrate all of our enterprise data sources!” “But security?” you replied. “Oh, it can’t be that hard” he dismissed, “Graph is easy. I’m sure that security in graph will be a cake-walk for you.”
Little did he know the mighty project of woe he had just bequeathed to you. Security was easy to model when you talked to the owner of the first data source, but when you had the requirements conversation with the business owner of the second data source, and then the third, and each one thereafter, you started to fret. Security in a graph database is not easy, or obvious. And when full-text search capabilities made it into the MVP requirements, you started to despair.
With his unique style of wisdom and wit Josh will look at the options for doing “cell level” security in a TinkerPop3-based graph data engine, and how to solve “the search engine problem” as well.

Interactive prototyping of Graph Applications with JanusGraph

Alan Pita - Expero

Graph Databases are being used to create applications that can expose new insights into very large, complex datasets. Technical skills are in short supply for writing programs to take advantage of graph query languages such as Gremlin and Cypher, and even with a team of top-notch experts the first working prototype consumes weeks of coding effort in several programming languages.
This talk explores the essential steps, components, and technologies necessary to build a scalable cloud-based graph database application. We will also visit the concept of model driven development and how it might be applied to accelerate the development of scalable cloud-based graph database applications. Finally, we will show examples of how model driven development can create rapid prototypes for graph applications that solve problems for real clients.

A lap around Azure Cosmos DB: Microsoft's globally distributed, multi-model database

Aravind Krishna R / Luis Bosquez - Microsoft

Earlier this year, we announced Azure Cosmos DB - the first and only globally distributed, multi-model database system. The service is designed to allow customers to elastically and horizontally scale both throughput and storage across any number of geographical regions, and it offers guaranteed low latency, high availability, and well-defined consistency models backed by comprehensive SLAs.

Graph Representations in Machine Learning

Trey Wilson - Expero

Success in machine learning is all about the data we have available. Using graph representations and graph analytics enables organizations to understand their data in new and powerful ways. Many datasets are naturally graphs, and some benefit from being treated as a graph. There is significant potential then for graph analytics and machine learning to be used together. This session will cover the intersection of these two fields at a high level. We will introduce basic concepts and explore three areas in which graph-based machine learning is seeing traction: measures on graphs for features and constraints, semi-supervised learning with graphs, and graph-based deep learning.

Scaling Deep Link Graph Analytics using Native Parallel Graph by TigerGraph

Dr. Yu Xu - TigerGraph

Graph databases offer great promise, but the early generation of systems weren’t designed to simultaneously meet all the performance needs for today’s applications: huge and growing data, efficient scale-out, high-speed processing of complex queries, and high-speed loading and updates. Dr. Yu will discuss how TigerGraph's Native Parallel Graph can provide:
• Fast data loading speed to build graphs -- 50 to 150 GB data per hour, per machine
• Fast execution of graph algorithms -- traversing 100s of millions of vertices/edges per second per machine
• Real-time updates and inserts -- streaming 2B+ events per day to a graph with 100B+ vertices and 600B+ edges on a cluster of 20 commodity machines
• Fast and efficient deep analytics -- queries which traverse 3 to 10+ hops into big graphs -- to find non-obvious “hidden” relationships

Improving the Hiring Pipeline with GraphAI

Denis Vrdoljak - Berkeley Data Science / Gunnar Kleeman - Austin Capital Data

Qualified job seekers can't make it past the keyword search filter. Recruiters can't find better than marginal candidates to present to their hiring managers. And, everyone dreads the entire hiring process.
Our suite of tools, based on graph databases and NLP, streamlines the hiring process. We show resume writers which permutations of keywords recruiters are filtering against, helping them write resumes that will make it in front of a human. We show recruiters the permutations of skills their applicants have, helping them work with their hiring managers to fine-tune the job requirements. And, finally, we rate the quality of researchers based on their publication patterns and networks, allowing recruiters to know which applicants stand out.
In this talk, we'll share how we built it all with Graphs!