TFX and ML Ops
Some notes on TFX and ML Ops.
ML Ops with TensorFlow
A production grade ML based system needs a number of components:
Serving infrastructure - provide inference services.
TF Serving can serve multiple models, or multiple versions of the same model, from a single process. It exposes REST and gRPC APIs and can be scaled out with Kubernetes (K8s).
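As a rough illustration of the REST side, here is a minimal sketch that builds the URL and JSON body for a predict call. The host, the model name `my_model`, the version number and the helper name `build_predict_request` are assumptions made up for the example; 8501 is the REST port used by the official Docker image. Nothing is sent over the network here.

```python
import json


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Return (url, body) for a TF Serving REST predict call.

    Hypothetical helper for illustration; it only formats the request.
    """
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin the request to one specific model version.
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    # The "row" request format wraps inputs in an "instances" list.
    body = json.dumps({"instances": instances})
    return url, body


url, body = build_predict_request("localhost", "my_model", [[1.0, 2.0]], version=3)
print(url)
print(body)
```

The resulting URL and body could then be POSTed with any HTTP client. Leaving out `version` lets the server pick the version it is currently serving.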
TFX
TFX (TensorFlow Extended) is a suite of components that covers the entire ML pipeline.
TFX Component Architecture
Each component performs one task, e.g. ExampleGen ingests data and the Trainer component trains the model.
Inputs and outputs of components are artifacts. The metadata store tracks all artifacts across the lifetime of the product.
Artifacts flow from upstream components to downstream ones through the metadata store.
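To make the artifact flow concrete, here is a toy sketch in plain Python. It is not the TFX or ML Metadata API: the `Artifact` and `MetadataStore` classes, the two component functions and the URIs are all invented for illustration. It only shows the idea that a downstream component finds its inputs by querying the store rather than by being handed them directly.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    type_name: str   # e.g. "Examples" or "Model"
    uri: str         # where the payload lives (hypothetical paths)
    producer: str    # component that created the artifact


@dataclass
class MetadataStore:
    artifacts: list = field(default_factory=list)

    def publish(self, artifact):
        self.artifacts.append(artifact)

    def latest(self, type_name):
        # Return the most recently published artifact of a given type.
        for artifact in reversed(self.artifacts):
            if artifact.type_name == type_name:
                return artifact
        raise LookupError(f"no artifact of type {type_name!r}")


def example_gen(store):
    # Upstream component: publishes ingested examples.
    store.publish(Artifact("Examples", "/tmp/pipeline/examples/1", "ExampleGen"))


def trainer(store):
    # Downstream component: looks up its input in the store, then publishes a model.
    examples = store.latest("Examples")
    store.publish(Artifact("Model", "/tmp/pipeline/model/1", "Trainer"))
    return examples


store = MetadataStore()
example_gen(store)
consumed = trainer(store)
lineage = [(a.producer, a.type_name) for a in store.artifacts]
print(lineage)
```

Because every artifact records its producer, the store doubles as a simple lineage log.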
TF Serving Architecture
A saved model is exposed as a Servable, whose lifecycle (loading and unloading) is managed by a Loader.
The DynamicManager decides when to load or unload a Servable.
You can create new Servables to offer non Tensorflow based models.
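The relationship between loaders and the manager can be sketched with a toy model in Python. This is not the TF Serving C++ API; the `Loader` and `ToyManager` classes and the "serve only the newest version" policy are assumptions chosen to keep the example small. The new version is loaded before the old one is unloaded, in the spirit of an availability-preserving policy.

```python
class Loader:
    """Knows how to load and unload one version of one servable."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.loaded = False

    def load(self):
        self.loaded = True

    def unload(self):
        self.loaded = False


class ToyManager:
    """Keeps only the highest aspired version of each servable loaded."""

    def __init__(self):
        self.loaders = {}  # servable name -> Loader currently serving

    def set_aspired_versions(self, name, versions):
        newest = max(versions)
        current = self.loaders.get(name)
        if current is not None and current.version == newest:
            return  # already serving the aspired version
        new_loader = Loader(name, newest)
        new_loader.load()        # bring the new version up first
        if current is not None:
            current.unload()     # then retire the old one
        self.loaders[name] = new_loader

    def serving_version(self, name):
        return self.loaders[name].version


manager = ToyManager()
manager.set_aspired_versions("my_model", [1])
first = manager.loaders["my_model"]
manager.set_aspired_versions("my_model", [1, 2])
print(manager.serving_version("my_model"))
```

A loader for a non-TensorFlow model would only need to provide its own `load` and `unload` behaviour; the manager logic stays the same.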
TF Serving Workflow
You can save a TF model from Python in the SavedModel format, then launch a TF Serving process in a Docker container. The model to serve is specified through an environment variable (MODEL_NAME in the official tensorflow/serving image).
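The workflow above can be sketched as two small helpers: one that computes the versioned export directory TF Serving expects, and one that assembles a `docker run` command for the official `tensorflow/serving` image. The paths, the model name `my_model` and both helper names are assumptions for illustration. The command is only built as a list, not executed, and the actual save step is left as a comment so the sketch runs without TensorFlow installed.

```python
def export_dir(base_dir, model_name, version):
    """Path for one model version: <base_dir>/<model_name>/<version>."""
    return f"{base_dir}/{model_name}/{version}"


def docker_run_args(host_model_dir, model_name, rest_port=8501):
    """Arguments for running the tensorflow/serving image (not executed here)."""
    return [
        "docker", "run",
        "-p", f"{rest_port}:8501",  # expose the REST port
        "--mount",
        f"type=bind,source={host_model_dir},target=/models/{model_name}",
        "-e", f"MODEL_NAME={model_name}",  # tells the container which model to serve
        "-t", "tensorflow/serving",
    ]


path = export_dir("/tmp/models", "my_model", 1)
# With TensorFlow available, the model would be written with something like:
#   tf.saved_model.save(model, path)
args = docker_run_args("/tmp/models/my_model", "my_model")
print(path)
print(" ".join(args))
```

The numeric subdirectory matters: TF Serving treats each numbered directory under the model's base path as a separate version.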
Lessons in MLOps from Google
There is a 2020 article from Google called "Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)" (see Links and References section below) which offers us insight into how TFX developed and the various lessons Google learnt from deploying numerous production grade machine learning pipelines. It also discusses ML Engineering as a discipline and how it relates to Software Engineering.
We discovered early on from our endeavors to apply ML
in production that while ML algorithms are important,
they are usually insufficient in realizing the
successful application of ML in a product
Sibyl
This was Google's first E2E ML platform. It provided a number of wide models, non-linear transformations, customizable loss functions and regularization.
Sibyl also offered tools for several aspects of the ML
workflow including Data Ingestion, Data Analysis and
Validation, Training (of course), Model Analysis, and
Training-Serving Skew Detection
It used Flume and MapReduce under the hood.
TFX
Launched in 2017, an E2E ML platform for deep learning. It superseded Sibyl, which has been decommissioned, and makes use of Apache Beam instead of MapReduce and Flume.
Uses a modular, layered architecture compared to the somewhat monolithic Sibyl.
Composed of:
- ML Services
- Composable Pipelines
- Binaries for serving
- Libraries
Includes things like connectors for data ingestion.
Uses Apache Arrow for in-memory columnar representation.
Rules of Machine Learning
These rules represent learning from the iterative application of ML to a number of products at Google:
● Start with simple rules and heuristics, and
generate data to learn from; this journey usually
starts from the serving side.
● Move to simple ML (i.e., simple models) and
realize large gains; this is usually the entry
point for introduction of ML pipelines.
● Move to ML with more features and more advanced
models to realize decent gains.
● Move to state-of-the-art ML, manage refinement
and complexity (for solutions to the problems that
are worth it), and realize small gains.
● Apply the above launch-and-iterate cycle to more
aspects of products and to solve more problems,
bearing in mind return on investment (and
diminishing returns).
TFX tries to codify these rules in software.
ML Engineering and TFX Components
- Artifacts produced or consumed by components are first-class citizens. The ML Metadata store underlies everything and provides lineage, cataloguing and querying of this metadata.
- Data is key: this means setting expectations on data via a schema, version control etc. TFX offers the ExampleGen, StatisticsGen, SchemaGen and ExampleValidator components.
- Models: these can have weak contracts compared to conventional software, and those contracts are also expressed in a more statistical fashion. Models are the product of code and data. This requires end-to-end model validation and understanding, which is provided by the TFX Evaluator and InfraValidator components. This includes the ability to generate counterfactual and out-of-distribution data.
- Mergeable fragments: we need the ability to merge data fragments, e.g. merge summary statistics from 2 datasets. This may also involve the creation of new counterfactual data, or data that is out of distribution. This kind of functionality is provided by TFX ExampleGen, Transform, Trainer and Tuner.
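As an example of what "mergeable" means for summary statistics, here is a minimal sketch that merges the count, mean and variance of two datasets without revisiting the raw values. It uses the standard pairwise update for combining means and sums of squared deviations. The `Summary` class is hypothetical and is not how TFX implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        n = len(values)
        if n == 0:
            return cls(0, 0.0, 0.0)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values)
        return cls(n, mean, m2)

    def merge(self, other):
        """Combine two summaries as if computed over the union of the data."""
        n = self.count + other.count
        if n == 0:
            return Summary(0, 0.0, 0.0)
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / n
        return Summary(n, mean, m2)

    @property
    def variance(self):
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.count if self.count else 0.0


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0, 7.0]
merged = Summary.from_values(a).merge(Summary.from_values(b))
direct = Summary.from_values(a + b)
print(merged)
print(direct)
```

The merged summary matches the one computed directly over the combined data (up to floating-point rounding), which is what allows statistics to be computed per fragment and combined later.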
Links and References
MLOps with TensorFlow playlist, YouTube
Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX), Konstantinos (Gus) Katsiapis, Abhijit Karmarkar, Ahmet Altay, Aleksandr Zaks, Neoklis Polyzotis, Anusha Ramesh, Ben Mathes, Gautam Vasudevan, Irene Giannoumis, Jarek Wilkiewicz, Jiri Simsa, Justin Hong, Mitch Trott, Noé Lutz, Pavel A. Dournov, Robert Crowe, Sarah Sirajuddin, Tris Brian Warkentin, Zhitao Li