Confirmed Sessions for Data Day Seattle

We are just now beginning to announce the confirmed sessions. Check this page regularly for updates.

Foundations of Streaming SQL or: How I Learned to Love Stream & Table Theory

Tyler Akidau - Google

What does it mean to execute robust streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing conceptually, or different? And how does all of this relate to the programmatic frameworks we’re all familiar with? This talk will address all of those questions in two parts.
First, we’ll explore the relationship between the Apache Beam Model and stream & table theory (as popularized by Martin Kleppmann and Jay Kreps, amongst others, but essentially originating out of the database world). It turns out that stream & table theory does an illuminating job of describing the low-level concepts that underlie the Beam Model.
Second, we’ll apply our clear understanding of that relationship towards explaining what is required to provide robust stream processing support in SQL. We’ll discuss concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, and talk about new ideas yet to come.
In the end, you can expect to have a much better understanding of the key concepts underpinning data processing, regardless of whether that data processing is batch or streaming, SQL or programmatic, as well as a concrete notion of what robust stream processing in SQL looks like.
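As a rough illustration of the stream and table relationship this abstract refers to, the sketch below folds a changelog stream into a table and emits each table update as a stream record. It is not taken from the talk or from Apache Beam; the names `Change`, `stream_to_table`, and `running_counts` are hypothetical and exist only for this example.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record type: (key, new_value), one entry in a changelog stream.
Change = Tuple[str, int]


def stream_to_table(changes: Iterable[Change]) -> Dict[str, int]:
    """Fold a changelog stream into a table holding the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changes:
        table[key] = value  # later records overwrite earlier ones
    return table


def running_counts(events: Iterable[str]) -> Iterator[Change]:
    """Maintain a per-key count table and emit every update as a stream record."""
    counts: Dict[str, int] = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


events = ["a", "b", "a", "a", "b"]
changelog = list(running_counts(events))  # table -> stream of its changes
table = stream_to_table(changelog)        # stream -> table at a point in time
print(changelog)
print(table)
```

Under this simplified view, a table is what you get by applying a stream of changes over time, and a stream is the record of how a table changed.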

Machine Learning: From The Lab To The Factory

John Akred - Silicon Valley Data Science

When data scientists are done building their models, there are questions to ask:
* How do the model results get to the hands of the decision makers or applications that benefit from this analysis?
* Can the model run automatically without issues and how does it recover from failure?
* What happens if the model becomes stale because it was trained on data that is no longer relevant?
* How do you deploy and manage new versions of that model without breaking downstream consumers?
This talk will illustrate the importance of these questions and provide a perspective on how to address them. John will share experiences deploying models across many enterprises, some of the problems encountered along the way, and best practices for running machine learning models in production.

Graph Sessions

Everything is not a graph problem (but there are plenty)

Dr. Denise Gosnell - DataStax

As the reality of the hype cycle sets in, the graph pragmatists have shown up to guide the charge. What we are seeing and experiencing is an adjustment in mindset: the convergence to multi-model database systems parallels the mentality of using the right tool for the problem. With graph databases, there is an intricate balance to find where the rubber meets the road between theorists and practitioners.
Before hammering away on the keyboard to insert vertices and edges, it is crucial to iterate and drive the development life cycle from definitive use cases. Too many times the field has seen monoglot system thinking pressure the construction of the one graph that can rule it all, which can result in some impressive scope creep. In this talk, Dr. Gosnell will walk through three overlooked architectural components that can make or break a graph implementation and suggest some best practices for navigating common anti-patterns.

Evolution of the Graph Schema

Joshua Shinavier - Uber

To every graph database, there is a schema -- a language in which we describe a domain of interest, modeled as a graph. Historically, NoSQL graph databases emerged as a more dynamic alternative to relational databases and RDF triple stores, with their relatively formal schemas. However, as graph database applications have grown in scale over the years, schema management has become essential for optimal performance. Likewise, as applications have grown in complexity, schemas have become increasingly necessary for data integration.
At Uber, we are tackling the problem of modeling an open-ended and rapidly changing domain at very large scale. Although a graph of “things” connected by “relationships” is perhaps the most generic and universally understandable format for such a data set, the sheer volume of streaming and historical data, combined with the need to act on incoming data in real time, poses significant challenges.
In this talk, we will focus specifically on the challenges of graph data modeling and integration, surveying techniques from academic research, the early graph landscape, and current high-performance applications including the Uber knowledge graph.

Combining graph analytics with real-time graph query workloads for solving business problems

Ryan Boyd - Neo4j

Abstract forthcoming.

Graph Data Obfuscation

Mike Downie - Expero

It only takes 5 pieces of identifying data to uniquely identify someone within a zip code with fairly high confidence. If you have HIPAA or PCI (or both!) compliance requirements, and you are using graph, that problem just got hard. The graph ecosystem doesn’t have the same obfuscation support as other data engines. Since graph is an enabling and strategic technology for many projects, and regulatory requirements aren’t going away, that means you have to solve the data obfuscation problem. Come join Mike as he looks at techniques and strategies for obfuscating, de-identifying or encrypting data in your graph implementation.

GraphAI Predictions on Real-World Data

Haikal Pribadi - Grakn.AI / Denis Vrdoljak - Berkeley Data Science Group

At Graph Day San Francisco, we presented our approach to applying machine learning to GraphDBs, for both predicting edges and identifying outlier nodes. Some of the applications for this technology include Fraud Detection, Context Sensitive Recommender Systems, and even Counter Intelligence. In this talk, we’ll review the system briefly, and then dive into real-world examples: The Panama Papers, The Enron Emails, The Common Crawl Dataset, and a few surprise applications! We’ll show how we are able to implement and deploy these quickly--without racking up five- or six-figure AWS bills. And, we’ll talk about our ongoing collaboration with Grakn.AI, to integrate machine learning with graph databases--going beyond simple knowledge graphs and machine learning, and into building Artificial Intelligence!

3 ways to build a near real-time recommendation engine

David Gilardi - DataStax

David Gilardi will show how to add a near real-time recommendation engine to KillrVideo, a reference application built to help developers learn Cassandra, Graph, and DataStax Enterprise. David will discuss the benefits and practical considerations of building a solution that leverages multiple, highly connected data models such as graph and tabular. He’ll also take a look at multiple recommendation engine examples, including use of domain specific languages to make development even simpler.
* An introduction to KillrVideo - David will briefly introduce the reference implementation, a cloud-based video sharing application which uses the Apache Cassandra core of DataStax Enterprise as well as DSE Search and DSE Graph integrations.
* What do we mean by “multiple, highly connected models”? - David will talk about what this means and discuss the benefits of these attributes in building applications that include transaction processing, search, and graph.
* Adding a recommendation engine - David will discuss extending KillrVideo to provide real-time video recommendations using DSE Graph and the popular Gremlin graph traversal language, with DSLs (domain-specific languages) built on top of it.

Build horizontally scalable graphs and real-time data science solutions with Azure Cosmos DB

Denny Lee / Shireesh Thota - Microsoft

Azure Cosmos DB lets you interact with graphs utilizing Apache TinkerPop’s Gremlin APIs with a service that provides turn-key global distribution, elastic scaling of storage and throughput, and multiple consistency models. Because Cosmos DB is a multi-model solution, the same graph container can be used to build real-time data science solutions by combining the power of Apache Spark and Cosmos DB. This enables blazing-fast analytics using Spark by efficiently exploiting Cosmos DB's indexes with pushdown predicate filtering and enabling updateable columns while performing analytics. By the end of this session, you will have learned how you can use the Spark connector for Azure Cosmos DB in a variety of scenarios including online recommendations and personalization, fraud detection, event messaging, and IoT sensor-data anomaly detection.

Applying an active learning algorithm for entity de-duplication in graph data

William Lyon - Neo4j

Abstract forthcoming.

Exploring the graph database landscape through graph visualization

Christian Miles - Cambridge Intelligence

As with any exciting and applicable technology, the graph database domain is an ever-evolving landscape of tools and applications that can be hard to keep up with. It can often seem like there is a new database announced each month. In this talk, Christian will use cutting-edge visualization techniques to provide a live exploration of the world of graph databases and show how this landscape has shifted over time. Of particular interest will be an attempt to untangle the different database classifications as well as plotting the active development of particular platforms. Christian will conclude by showing that graph visualization is a valuable tool even when it's applied to a set of applications that visualization itself often leverages.

Securing Federated Data with TinkerPop and how to handle the “search engine” problem

Josh Perryman - Expero

"Hey, I have a great idea,” the CIO said, “let’s use a graph database to integrate all of our enterprise data sources!” “But security?” you replied. “Oh, it can’t be that hard” he dismissed, “Graph is easy. I’m sure that security in graph will be a cake-walk for you.”
Little did he know the mighty project of woe he had just bequeathed to you. Security was easy to model when you talked to the owner of the first data source, but when you had the requirements conversation with the business owner of the second data source, and then the third, and each one thereafter, you started to fret. Security in a graph database is not easy, or obvious. And when full-text search capabilities made it into the MVP requirements, you started to despair.
With his unique style of wisdom and wit, Josh will look at the options for doing “cell level” security in a TinkerPop3-based graph data engine, and how to solve “the search engine problem” as well.

Interactive prototyping of Graph Applications with JanusGraph

Alan Pita - Expero

Graph Databases are being used to create applications that can expose new insights into very large, complex datasets. Technical skills are in short supply for writing programs to take advantage of graph query languages such as Gremlin and Cypher, and even with a team of top-notch experts the first working prototype consumes weeks of coding effort in several programming languages.
This talk explores the essential steps, components, and technologies necessary to build a scalable cloud-based graph database application. We will also visit the concept of model driven development and how it might be applied to accelerate the development of scalable cloud-based graph database applications. Finally, we will show examples of how model driven development can create rapid prototypes for graph applications that solve problems for real clients.

A lap around Azure Cosmos DB: Microsoft's globally distributed, multi-model database

Aravind Krishna R / Shireesh Thota - Microsoft

Earlier this year, we announced Azure Cosmos DB - the first and only globally distributed, multi-model database system. The service is designed to allow customers to elastically and horizontally scale both throughput and storage across any number of geographical regions, it offers guaranteed

Graph Representations in Machine Learning

Trey Wilson / Steve Purves - Expero

Success in machine learning is all about the data we have available. Using graph representations and graph analytics enables organisations to understand their data in new and powerful ways. Many datasets are naturally graphs, and some benefit from being treated as a graph. There is significant potential then for graph analytics and machine learning to be used together. This session will cover the intersection of these two fields at a high level. We will introduce basic concepts and explore three areas in which graph-based machine learning is seeing traction: measures on graphs for features and constraints, semi-supervised learning with graphs, and graph-based deep learning.

TigerGraph - A Game Changer: A Complete High-Performance Graph Data & Analytics Platform

Dr. Yu Xu - TigerGraph

TigerGraph develops a high-performance enterprise-class graph data platform to simplify and empower real-time graph analytics of massively connected data from all sources in complex enterprise data ecosystems.
TigerGraph’s ground-breaking technology innovation enables businesses to transform structured, semi-structured, and unstructured data and massive enterprise data silos into a massive intelligent inter-connected data network of business entities and meaningful relationships, and uncover implicit patterns and critical insights in order to achieve better business outcomes faster, easier and cheaper.
TigerGraph’s technology excels at fast data loading to build graphs rapidly, high-performance parallel execution of graph algorithms, real-time capability for streaming updates and inserts using REST, and unified real-time analytics with large-scale offline data processing in a single hassle-free environment. A complete high-level SDK package with an intuitive visualization library is provided for graph modeling, mapping, loading, and querying to ease and accelerate the analytic application development and delivery lifecycle. Application developers can significantly shorten their design and development cycle time with TigerGraph’s expressive SQL-like graph query language (GSQL). No additional coding in Java, C++, or other programming languages is necessary.

NLP Sessions

Detecting Bias in News Articles

Rob McDaniel - Lingistic / Rakuten

Bias is a hard thing to define, let alone detect. What is bias? How many different types of bias exist? What, if any, lexical cues exist to identify bias-inducing words? Can machines help us qualify and improve news articles?
Using millions of individual Wikipedia revisions, we will discuss a supervised method for identifying bias in news articles. First, we will discuss the last several decades of linguistic research into bias and the various types of biased verbs and lexicons that exist. Then, with plenty of examples, we will explore the way that these words introduce hidden bias into a text, and will follow up with a demonstration of a model for predicting the presence of bias-inducing words.
We will conclude by exploring ways to automatically suggest improvements to an article and to associate bias with topics, followed by future implications for the field of stance detection and a discussion of the background bias of various publishers.