E2E ML Apache Spark

`whoami`

@rhdzmota

Data Engineer / Scientist at LeadGenius

Agenda

ML Project Overview
Operationalizing
Spark-Based Projects
Code Example!

Acknowledgments

Operationalizing Machine Learning - Serving ML Models by Boris Lublinsky
Concept Drift: Monitoring Model Quality in Streaming Machine Learning Applications by Emre Velipasaoglu
R, Scikit-Learn, and Apache Spark ML: What Difference Does It Make? by Villu Ruusmann

ML Project Cycle

ML Pipelines

Operationalizing

Traditional Approach

A simple solution

Predictive Model Markup Language (PMML) is:

"an XML-based language that provides a way for applications to define statistical and data-mining models as well as to share models between PMML-compliant applications."

PMML Interop

Integration with the most popular ML frameworks via JPMML:

Simple Solution using PMML

Best Practice

We can perform model scoring either with a stream-processing engine or a stream-processing library.

Best Practice - Akka Approach

We can use Akka Streams - based on Akka Actors (see syntax example).
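As a rough illustration of the syntax, the sketch below runs records through a scoring stage with Akka Streams. It assumes Akka 2.6 or later (where an implicit `ActorSystem` supplies the materializer). The `Record` and `Scored` types and the `score` function are hypothetical stand-ins for a real model evaluator, not part of any library.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object AkkaScoringSketch {
  // Hypothetical input and output types, for illustration only.
  final case class Record(id: String, features: Vector[Double])
  final case class Scored(id: String, score: Double)

  // Hypothetical placeholder model: the mean of the features.
  // A real deployment would call a PMML evaluator here instead.
  def score(r: Record): Scored =
    Scored(r.id, r.features.sum / math.max(r.features.size, 1))

  // A reusable stream stage that scores each incoming record.
  val scoringFlow: Flow[Record, Scored, NotUsed] = Flow[Record].map(score)

  // Source -> scoring flow -> sink that collects all results.
  def run(records: List[Record])(implicit system: ActorSystem): Future[Seq[Scored]] =
    Source(records).via(scoringFlow).runWith(Sink.seq[Scored])

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("scoring-sketch")
    val result = Await.result(run(List(Record("r1", Vector(1.0, 3.0)))), 10.seconds)
    result.foreach(println)
    Await.result(system.terminate(), 10.seconds)
  }
}
```

In a real service the `Source` would typically be an unbounded stream (for example a message queue consumer) rather than a fixed list.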

Akka Cluster

The Big Picture

Spark-Based Projects

Why Apache Spark?

According to their website,

"Apache Spark is a unified analytics engine for large-scale data processing."

Intro to Spark ML

Spark ML is a practical and scalable machine learning library built on the Dataset API.

Dataset[A].map(fn: A => B): Dataset[B]
Dataset[A].flatMap(fn: A => TraversableOnce[B]): Dataset[B]
Dataset[A].filter(fn: A => Boolean): Dataset[A]
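The three operations above can be sketched on a small typed Dataset. This is a minimal example assuming Spark SQL is on the classpath and a local session is acceptable. The `Lead` case class and the 0.5 threshold are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetOpsSketch {
  // Hypothetical record type used only for this illustration.
  final case class Lead(name: String, tags: Seq[String], score: Double)

  // Keep high-scoring leads, flatten their tags, and upper-case each tag.
  def highScoreTags(leads: Dataset[Lead])(implicit spark: SparkSession): Dataset[String] = {
    import spark.implicits._
    leads
      .filter((l: Lead) => l.score >= 0.5) // Dataset[Lead] => Dataset[Lead]
      .flatMap((l: Lead) => l.tags)        // Dataset[Lead] => Dataset[String]
      .map((t: String) => t.toUpperCase)   // Dataset[String] => Dataset[String]
  }

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-ops-sketch")
      .getOrCreate()
    import spark.implicits._

    val leads = Seq(
      Lead("a", Seq("ml", "spark"), 0.9),
      Lead("b", Seq("akka"), 0.2)
    ).toDS()

    highScoreTags(leads).show()
    spark.stop()
  }
}
```

Each call returns a new Dataset, and nothing is computed until an action such as `show()` or `collect()` runs.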

Relevant Concepts

Dataset[Row]
Transformer
Estimator
Pipeline
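To show how these concepts fit together, here is a minimal sketch assuming Spark MLlib is on the classpath. `VectorAssembler` is a Transformer, `LogisticRegression` is an Estimator, and a `Pipeline` chains them. Fitting the pipeline on a `Dataset[Row]` (a DataFrame) yields a `PipelineModel`, which is itself a Transformer. The column names `x1`, `x2`, and `label` and the toy data are hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Column names "x1", "x2", and "label" are assumptions for this example.
  def fitPipeline(training: DataFrame): PipelineModel = {
    // Transformer: packs the numeric columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")

    // Estimator: learns a model when fit on data.
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setMaxIter(10)

    // Pipeline: an Estimator made of ordered stages.
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    pipeline.fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy training data as a DataFrame, i.e. Dataset[Row].
    val training = Seq(
      (0.1, 1.2, 0.0), (0.3, 0.9, 0.0), (1.5, 0.2, 1.0), (1.8, 0.1, 1.0)
    ).toDF("x1", "x2", "label")

    val model = fitPipeline(training)
    // The fitted PipelineModel appends prediction columns to its input.
    model.transform(training).select("x1", "x2", "prediction").show()
    spark.stop()
  }
}
```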

Intro to JPMML

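import org.jpmml.sparkml.PMMLBuilder
// schema: StructType of the training data; pipelineModel: the fitted PipelineModel
val pmmlBuilder = new PMMLBuilder(schema, pipelineModel)
val pmml = pmmlBuilder.build() // in-memory PMML document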

See the official jpmml-sparkml GitHub repo for a complete list of supported PipelineStage types.

Intro to Openscoring

We can use Openscoring, a Java-based REST web service, as the scoring engine for the resulting PMML model.

Simple but powerful API
Supports both single and batch predictions
Acceptable performance (usually sub-millisecond response time)
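As a sketch of how a client might talk to Openscoring, the example below builds the HTTP requests with the JDK 11+ `java.net.http` client. The endpoint layout (`PUT /model/{id}` to deploy, `POST /model/{id}` to evaluate) and the JSON shape follow the Openscoring README as I understand it, so check them against your server version. The base URL, the model id, and the field names in the usage are assumptions, and the JSON is assembled by hand without escaping, which is only suitable for a demo.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object OpenscoringSketch {
  // Assumed address of a locally running Openscoring server.
  val baseUrl = "http://localhost:8080/openscoring"

  // Deploy a model: PUT the PMML bytes under a chosen model id.
  def deployRequest(modelId: String, pmmlBytes: Array[Byte]): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/model/$modelId"))
      .header("Content-Type", "text/xml")
      .PUT(HttpRequest.BodyPublishers.ofByteArray(pmmlBytes))
      .build()

  // Hand-built JSON body for a single prediction (no escaping, demo only).
  def evaluateBody(recordId: String, arguments: Seq[(String, Double)]): String = {
    val args = arguments.map { case (k, v) => "\"" + k + "\": " + v }.mkString(", ")
    "{\"id\": \"" + recordId + "\", \"arguments\": {" + args + "}}"
  }

  // Score one record: POST the JSON body to the deployed model.
  def evaluateRequest(modelId: String, recordId: String,
                      arguments: Seq[(String, Double)]): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/model/$modelId"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(evaluateBody(recordId, arguments)))
      .build()

  // Send a request and return the response body as a string.
  def send(request: HttpRequest): String =
    HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
      .body()
}
```

With a server running, `OpenscoringSketch.send(OpenscoringSketch.evaluateRequest("demo", "record-001", Seq("x1" -> 5.1)))` would return the JSON response for that record. The bytes passed to `deployRequest` could come from `PMMLBuilder.buildByteArray()`.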