TFX and ML Ops

December 29, 2024 · 4 min read

Sanjeev Sarda

High Performance Developer

Some notes on TFX and ML Ops.

TensorFlow

ML Ops with TensorFlow

A production grade ML based system needs a number of components:

Serving infrastructure - provide inference services.

TF Serving can serve multiple models, or multiple versions of the same model, from a single process. It exposes REST and gRPC endpoints and can be scaled out with Kubernetes.
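As a rough sketch of what a REST call to TF Serving looks like, the snippet below builds the URL and JSON body of a predict request without sending it. The host, the REST port 8501, the model name `my_model` and the input values are assumptions chosen for illustration.

```python
import json


def predict_url(host, model_name, version=None, port=8501):
    """Build a TF Serving REST predict URL, optionally pinned to one model version."""
    base = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        base += f"/versions/{version}"
    return base + ":predict"


def predict_body(instances):
    """Serialize a batch of inputs in the row-oriented "instances" format."""
    return json.dumps({"instances": instances})


# Hypothetical model name and inputs, for illustration only.
url_latest = predict_url("localhost", "my_model")
url_v2 = predict_url("localhost", "my_model", version=2)
body = predict_body([[1.0, 2.0], [3.0, 4.0]])
```

Omitting the version lets the server pick which loaded version answers the request, while the versioned URL pins the request to one specific version of the same model.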

TFX

TFX (TensorFlow Extended) is a suite of components that covers the entire ML pipeline.

TFX Component Architecture

Each component performs a single task, e.g. ExampleGen ingests data and the Trainer component trains the model.

Inputs and outputs of components are artifacts. The metadata store handles all artifacts across the lifetime of the product.

Artifacts flow from upstream components to downstream ones through the metadata store.
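To make the artifact flow concrete, here is a toy model of it in plain Python. It is not the TFX or ML Metadata API: the `Artifact` and `MetadataStore` classes, the two component functions and the URIs are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    type_name: str  # e.g. "Examples" or "Model"
    uri: str        # where the payload would live
    producer: str   # component that emitted it


@dataclass
class MetadataStore:
    artifacts: list = field(default_factory=list)

    def publish(self, artifact):
        self.artifacts.append(artifact)

    def latest(self, type_name):
        """Return the most recently published artifact of a given type."""
        for artifact in reversed(self.artifacts):
            if artifact.type_name == type_name:
                return artifact
        raise LookupError(f"no artifact of type {type_name!r}")


def example_gen(store):
    # Upstream component: publishes ingested examples.
    store.publish(Artifact("Examples", "/tmp/pipeline/examples/1", "ExampleGen"))


def trainer(store):
    # Downstream component: finds its input via the store, then publishes a model.
    examples = store.latest("Examples")
    store.publish(Artifact("Model", "/tmp/pipeline/model/1", "Trainer"))
    return examples


store = MetadataStore()
example_gen(store)
consumed = trainer(store)
lineage = [(a.producer, a.type_name) for a in store.artifacts]
```

The components never call each other directly. The trainer only asks the store for the latest `Examples` artifact, which is what makes lineage queries over the store possible.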

TF Serving Architecture

A SavedModel is a Servable managed by a Loader.

DynamicManager determines when to load the Servable.

You can create new Servables to serve non-TensorFlow models.
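The relationship between servables, loaders and a manager can be sketched with a toy Python model. This is only an analogy for the roles described above, not TF Serving's actual (C++) API: `CallableLoader`, `ToyManager` and the "doubler" model are hypothetical names.

```python
class CallableLoader:
    """Toy loader: wraps any factory that returns a callable model."""

    def __init__(self, factory):
        self._factory = factory
        self.servable = None

    def load(self):
        self.servable = self._factory()

    def unload(self):
        self.servable = None


class ToyManager:
    """Toy manager: tracks which versions are loaded and answers lookups."""

    def __init__(self):
        self._loaders = {}  # (name, version) -> loader

    def load(self, name, version, loader):
        loader.load()
        self._loaders[(name, version)] = loader

    def unload(self, name, version):
        self._loaders.pop((name, version)).unload()

    def get(self, name, version=None):
        versions = [v for (n, v) in self._loaders if n == name]
        if not versions:
            raise KeyError(name)
        chosen = max(versions) if version is None else version
        return self._loaders[(name, chosen)].servable


manager = ToyManager()
# A non-TensorFlow "model": a plain Python function behind the same interface.
manager.load("doubler", 1, CallableLoader(lambda: (lambda x: 2 * x)))
manager.load("doubler", 2, CallableLoader(lambda: (lambda x: 2 * x + 1)))
latest = manager.get("doubler")(10)
pinned = manager.get("doubler", version=1)(10)
```

Because the manager only depends on the loader's `load` and `unload` methods, anything that can be wrapped in a loader can be served, which is the idea behind offering non-TensorFlow servables.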

TF Serving Workflow

You can save a TF model from Python and then launch a TF Serving process in a Docker container, specifying the model through an environment variable.
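A minimal sketch of the launch step, assuming the model has already been exported from Python (for example with `tf.saved_model.save(model, "/tmp/my_model/1")`) into a numbered version directory. The helper below only assembles the `docker run` arguments for the `tensorflow/serving` image and does not start a container. The path `/tmp/my_model` and the name `my_model` are placeholders.

```python
import shlex


def serving_command(model_dir, model_name, rest_port=8501):
    """Build a `docker run` argument list for the tensorflow/serving image."""
    return [
        "docker", "run", "--rm",
        # Publish the container's REST port on the host.
        "-p", f"{rest_port}:8501",
        # Mount the exported model where the image looks for models.
        "--mount", f"type=bind,source={model_dir},target=/models/{model_name}",
        # The model to serve is selected through an environment variable.
        "-e", f"MODEL_NAME={model_name}",
        "tensorflow/serving",
    ]


cmd = serving_command("/tmp/my_model", "my_model")
printable = shlex.join(cmd)  # shell-ready string, e.g. for logging
```

Passing `cmd` to `subprocess.run` would start the server. Building the list first keeps the model path and name in one place and avoids shell quoting issues.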

Lessons in MLOps from Google

There is a 2020 article from Google called "Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)" (see Links and References section below) which offers us insight into how TFX developed and the various lessons Google learnt from deploying numerous production grade machine learning pipelines. It also discusses ML Engineering as a discipline and how it relates to Software Engineering.

We discovered early on from our endeavors to apply ML 
in production that while ML algorithms are important, 
they are usually insufficient in realizing the 
successful application of ML in a product

Sibyl

This was Google's first E2E ML platform. It provided a number of wide models, nonlinear transformations, customizable loss functions and regularization.

Sibyl also offered tools for several aspects of the ML
workflow including Data Ingestion, Data Analysis and 
Validation, Training (of course), Model Analysis, and 
Training-Serving Skew Detection

It used Flume and MapReduce under the hood.

TFX

Launched in 2017 as an E2E ML platform for deep learning, TFX superseded Sibyl, which has since been decommissioned. It uses Apache Beam instead of MapReduce and Flume.

It has a modular, layered architecture, in contrast to the somewhat monolithic Sibyl.

Composed of:

ML Services
Composable Pipelines
Binaries for serving
Libraries

Includes things like connectors for data ingestion.

Uses Apache Arrow for in-memory columnar representation.

Rules of Machine Learning

These rules capture lessons learned from iteratively applying ML to a number of products at Google:

● Start with simple rules and heuristics, and 
generate data to learn from; this journey usually 
starts from the serving side.

● Move to simple ML (i.e., simple models) and 
realize large gains; this is usually the entry 
point for introduction of ML pipelines.

● Move to ML with more features and more advanced 
models to realize decent gains.

● Move to state-of-the-art ML, manage refinement 
and complexity (for solutions to the problems that
 are worth it), and realize small gains.

● Apply the above launch-and-iterate cycle to more 
aspects of products and to solve more problems, 
bearing in mind return on investment (and 
diminishing returns).

TFX tries to codify these rules in software.

ML Engineering and TFX Components

Artifacts - artifacts produced or consumed by components are first-class citizens. The ML Metadata store underlies everything and provides lineage, cataloguing and querying of this metadata.
Data - data is key, which means setting expectations on it via a schema, keeping it under version control, etc. TFX offers the ExampleGen, StatisticsGen, SchemaGen and ExampleValidator components.
Models - these can have weak contracts compared to conventional software, and those contracts are expressed in a more statistical fashion. Models are the product of code and data, so they require end-to-end validation and understanding, which the TFX Evaluator and InfraValidator components provide. This includes the ability to generate counterfactual and out-of-distribution data.
Mergeable fragments - we need the ability to merge data fragments, e.g. to merge summary statistics from two datasets. This may also involve creating new counterfactual data, or data that is out of distribution. This kind of functionality is provided by the TFX ExampleGen, Transform, Trainer and Tuner components.
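As an illustration of what "mergeable" means for summary statistics, the sketch below combines the count, mean and variance of two datasets without revisiting the raw data, using the standard pairwise update for mean and sum of squared deviations. The `Stats` class is a hypothetical helper written for this note and is not part of TFX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @classmethod
    def of(cls, values):
        """Summarize an iterable by merging one single-value summary at a time."""
        stats = cls()
        for value in values:
            stats = stats.merge(cls(1, float(value), 0.0))
        return stats

    def merge(self, other):
        """Combine two summaries without revisiting the raw data."""
        if self.count == 0:
            return other
        if other.count == 0:
            return self
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return Stats(count, mean, m2)

    @property
    def variance(self):
        """Population variance of everything summarized so far."""
        return self.m2 / self.count if self.count else 0.0


left = Stats.of([1, 2, 3])
right = Stats.of([4, 5, 6, 7])
merged = left.merge(right)
whole = Stats.of([1, 2, 3, 4, 5, 6, 7])
```

Merging `left` and `right` gives the same summary as computing it over the whole dataset, which is the property that lets statistics be computed per fragment (for example per day of data) and combined later.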

Links and References

MLOps with TensorFlow playlist, YouTube

Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX), Konstantinos (Gus) Katsiapis, Abhijit Karmarkar, Ahmet Altay, Aleksandr Zaks, Neoklis Polyzotis, Anusha Ramesh, Ben Mathes, Gautam Vasudevan, Irene Giannoumis, Jarek Wilkiewicz, Jiri Simsa, Justin Hong, Mitch Trott, Noé Lutz, Pavel A. Dournov, Robert Crowe, Sarah Sirajuddin, Tris Brian Warkentin, Zhitao Li
