<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Dr Alex Ioannides</title><link href="https://alexioannides.github.io/" rel="alternate"></link><link href="https://alexioannides.github.io/feeds/all.atom.xml" rel="self"></link><id>https://alexioannides.github.io/</id><updated>2022-11-07T00:00:00+00:00</updated><subtitle>machine_learning_engineer - (data)scientist - reformed_quant - habitual_coder</subtitle><entry><title>Best Practices for Engineering ML Pipelines - Part 2</title><link href="https://alexioannides.github.io/2022/11/07/best-practices-for-engineering-ml-pipelines-part-2/" rel="alternate"></link><published>2022-11-07T00:00:00+00:00</published><updated>2022-11-07T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2022-11-07:/2022/11/07/best-practices-for-engineering-ml-pipelines-part-2/</id><summary type="html">&lt;p&gt;&lt;img alt="ml-pipeline-engineering" src="https://alexioannides.github.io/images/machine-learning-engineering/ml-pipeline-engineering/pipelines-logo.png"&gt;&lt;/p&gt;
&lt;p&gt;This is the second part in a series of articles demonstrating best practices for engineering &lt;span class="caps"&gt;ML&lt;/span&gt; pipelines and deploying them to production. In the &lt;a href="https://alexioannides.github.io/2021/03/03/best-practices-for-engineering-ml-pipelines-part-1/"&gt;first part&lt;/a&gt; we focused on project setup - everything from codebase structure to configuring a &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; pipeline and making an initial deployment of a skeleton&amp;nbsp;pipeline …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="ml-pipeline-engineering" src="https://alexioannides.github.io/images/machine-learning-engineering/ml-pipeline-engineering/pipelines-logo.png"&gt;&lt;/p&gt;
&lt;p&gt;This is the second part in a series of articles demonstrating best practices for engineering &lt;span class="caps"&gt;ML&lt;/span&gt; pipelines and deploying them to production. In the &lt;a href="https://alexioannides.github.io/2021/03/03/best-practices-for-engineering-ml-pipelines-part-1/"&gt;first part&lt;/a&gt; we focused on project setup - everything from codebase structure to configuring a &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; pipeline and making an initial deployment of a skeleton&amp;nbsp;pipeline.&lt;/p&gt;
&lt;p&gt;In this part we are going to focus on developing a fully-operational pipeline and will&amp;nbsp;cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A simple approach to data and model versioning, using cloud object&amp;nbsp;storage.&lt;/li&gt;
&lt;li&gt;How to factor-out common code and make it reusable between&amp;nbsp;projects.&lt;/li&gt;
&lt;li&gt;Defending against errors and handling&amp;nbsp;failure.&lt;/li&gt;
&lt;li&gt;How to enable configurable pipelines that can run in multiple environments without code&amp;nbsp;changes.&lt;/li&gt;
&lt;li&gt;Developing the automated model-training stage and how to write tests for&amp;nbsp;it.&lt;/li&gt;
&lt;li&gt;Developing and testing the serve-model stage that exposes the trained model via a web &lt;span class="caps"&gt;API&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Updating the deployment configuration and releasing the changes to&amp;nbsp;production.&lt;/li&gt;
&lt;li&gt;Scheduling the pipeline to run on a&amp;nbsp;schedule.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of the code referred to in this series of posts is available on  &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering"&gt;GitHub&lt;/a&gt;, with a dedicated branch for each part, so you can explore the code in its various stages of development. Have a quick look before reading&amp;nbsp;on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#a-simple-strategy-for-dataset-and-model-versioning"&gt;A Simple Strategy for Dataset and Model&amp;nbsp;Versioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reusing-common-code"&gt;Reusing Common Code&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#distributing-python-packages-within-your-company"&gt;Distributing Python Packages within your&amp;nbsp;Company&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#defending-against-errors-and-handling-failures"&gt;Defending Against Errors and Handling&amp;nbsp;Failures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#configurable-pipelines"&gt;Configurable&amp;nbsp;Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#engineering-the-model-training-job"&gt;Engineering the Model Training Job&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#prepare-data"&gt;Prepare&amp;nbsp;Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#train-model"&gt;Train&amp;nbsp;Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#validating-trained-models"&gt;Validating Trained&amp;nbsp;Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#end-to-end-functional-tests"&gt;End-to-End Functional&amp;nbsp;Tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#input-validation-for-the-stage"&gt;Input Validation for the&amp;nbsp;Stage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#developing-the-model-serving-stage"&gt;Developing the Model Serving Stage&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#updating-the-tests"&gt;Updating the&amp;nbsp;Tests&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#updating-the-deployment-and-releasing-to-production"&gt;Updating the Deployment and Releasing to&amp;nbsp;Production&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#scheduling-the-pipeline-to-run-on-a-schedule"&gt;Scheduling the Pipeline to run on a&amp;nbsp;Schedule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#wrap-up"&gt;Wrap-Up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#appendix"&gt;Appendix&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-dataset-class"&gt;The Dataset&amp;nbsp;Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-model-class"&gt;The Model&amp;nbsp;Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#train_modelpy"&gt;train_model.py&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="a-simple-strategy-for-dataset-and-model-versioning"&gt;A Simple Strategy for Dataset and Model&amp;nbsp;Versioning&lt;/h2&gt;
&lt;p&gt;To recap, the data engineering team will deliver the latest tranche of training data to an &lt;span class="caps"&gt;AWS&lt;/span&gt; S3 bucket, in &lt;span class="caps"&gt;CSV&lt;/span&gt; format. They will take responsibility for verifying that these files have the correct schema and contain no unexpected errors. Each filename will contain the timestamp of its creation, in &lt;span class="caps"&gt;ISO&lt;/span&gt; format, so that the datasets in the bucket will look as&amp;nbsp;follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;s3://time-to-dispatch/
|-- datasets/
    |-- time_to_dispatch_2021-07-03T23:05:32.csv
    |-- time_to_dispatch_2021-07-02T23:05:13.csv
    |-- time_to_dispatch_2021-07-01T23:04:52.csv
    |-- ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The train-model stage of the pipeline will only need to download the latest file for training a new model. We could stop here and rely solely on the filenames as a lightweight versioning strategy, but it is safer to enable &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html"&gt;versioning&lt;/a&gt; for the S3 bucket and to keep track of the hash of the dataset used for training, which is computed automatically for every object stored on S3 (the &lt;span class="caps"&gt;MD5&lt;/span&gt; hash of an object is stored as its &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html"&gt;Entity Tag or ETag&lt;/a&gt;). This allows us to defend against accidental deletes and/or overwrites and enables us to locate the precise dataset associated with a trained&amp;nbsp;model.&lt;/p&gt;
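&lt;p&gt;The latest file can be located by parsing the &lt;span class="caps"&gt;ISO&lt;/span&gt; timestamp embedded in each key, and a download can be verified by comparing a locally computed &lt;span class="caps"&gt;MD5&lt;/span&gt; digest against the object’s ETag (which holds for single-part uploads). A minimal sketch of this logic - the helper names are illustrative, not part of any&amp;nbsp;library:&lt;/p&gt;

```python
import hashlib
import re
from datetime import datetime

ISO_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def latest_dataset_key(keys):
    """Return the key whose filename contains the most recent ISO timestamp."""
    def timestamp(key):
        match = ISO_TIMESTAMP.search(key)
        if match is None:
            raise ValueError(f"no ISO timestamp found in key: {key}")
        return datetime.fromisoformat(match.group(0))

    return max(keys, key=timestamp)


def local_etag(data):
    """MD5 hex digest of an object - equal to the S3 ETag for single-part uploads."""
    return hashlib.md5(data).hexdigest()
```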
&lt;p&gt;Because this concept of a dataset is bigger than just an arbitrarily named file on S3, we will need to develop a custom &lt;code&gt;Dataset&lt;/code&gt; class for representing files on S3 and retrieving their hashes, together with functions/methods for getting and putting &lt;code&gt;Datasets&lt;/code&gt; to S3.  All of this can be developed on top of  the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html"&gt;boto3&lt;/a&gt; &lt;span class="caps"&gt;AWS&lt;/span&gt; client library for&amp;nbsp;Python.&lt;/p&gt;
&lt;p&gt;Trained models will be serialised to file using Python’s &lt;a href="https://docs.python.org/3.8/library/pickle.html"&gt;pickle&lt;/a&gt; module (this works well for SciKit-Learn models), and uploaded to the same &lt;span class="caps"&gt;AWS&lt;/span&gt; bucket, using the same timestamped file-naming&amp;nbsp;convention:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;s3://time-to-dispatch/
|-- models/
    |-- time_to_dispatch_2021-07-03T23:45:23.pkl
    |-- time_to_dispatch_2021-07-02T23:45:31.pkl
    |-- time_to_dispatch_2021-07-01T23:44:25.pkl
    |-- ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When triggered, the serve-model stage of the pipeline will only need to download the most recently persisted model, to ensure that it will generate predictions using the model from the output of the train-model stage. As with the datasets, we could stop here and rely solely on the filenames as a lightweight versioning strategy, but auditing and debugging predictions will be made much easier if we can access model metadata, such as the details of the exact dataset used for&amp;nbsp;training.&lt;/p&gt;
&lt;p&gt;The concept of a model becomes bigger than just the trained model in isolation, so we will also need to develop a custom &lt;code&gt;Model&lt;/code&gt; class. This needs to ‘wrap’ the trained model object, so that it can be associated with all of the metadata that we need to operate our basic model versioning system. As with the custom &lt;code&gt;Dataset&lt;/code&gt; class, we will need to develop functions/methods for getting and putting the &lt;code&gt;Model&lt;/code&gt; object to&amp;nbsp;S3.&lt;/p&gt;
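&lt;p&gt;As a stripped-down sketch of this idea (the linked implementations below define the real &lt;span class="caps"&gt;API&lt;/span&gt;; the field names here are illustrative), the wrapper only needs to hold the estimator alongside the audit&amp;nbsp;metadata:&lt;/p&gt;

```python
import pickle
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Model:
    """Wraps a trained model together with the metadata needed for auditing."""
    model: Any                 # e.g. a fitted Scikit-Learn estimator
    dataset_key: str           # S3 key of the training dataset
    dataset_hash: str          # ETag/MD5 hash of the training dataset
    trained_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_pickle(self) -> bytes:
        return pickle.dumps(self)

    @staticmethod
    def from_pickle(data: bytes) -> "Model":
        return pickle.loads(data)
```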
&lt;p&gt;There is a significant development effort required for implementing the functionality described above and it is likely that this will be repeated in many projects. We are going to cover how to handle reusable code in the section below, but you can see our implementations for the &lt;code&gt;Dataset&lt;/code&gt; and &lt;code&gt;Model&lt;/code&gt; classes using the links below, which we have also reproduced at the end of this&amp;nbsp;article.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/bodywork-ml/bodywork-pipeline-utils/blob/main/src/bodywork_pipeline_utils/aws/datasets.py"&gt;Dataset&amp;nbsp;class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bodywork-ml/bodywork-pipeline-utils/blob/main/src/bodywork_pipeline_utils/aws/models.py"&gt;Model&amp;nbsp;class&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="reusing-common-code"&gt;Reusing Common&amp;nbsp;Code&lt;/h2&gt;
&lt;p&gt;The canonical way to distribute reusable Python modules is to implement them within a Python package that can be installed into any project that benefits from the functionality. This is what we have done for the dataset and model versioning functionality described in the previous section, and for configuring the logger used in both stages (so we can enforce a common log format across projects). You can explore the codebase for this package, named &lt;code&gt;bodywork-pipeline-utils&lt;/code&gt;, on &lt;a href="https://github.com/bodywork-ml/bodywork-pipeline-utils"&gt;GitHub&lt;/a&gt;. The functions and classes within it are shown&amp;nbsp;below,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;|-- aws
    |-- Dataset
    |-- get_latest_csv_dataset_from_s3
    |-- get_latest_parquet_dataset_from_s3
    |-- put_csv_dataset_to_s3
    |-- put_parquet_dataset_to_s3
    |-- Model
    |-- get_latest_pkl_model_from_s3
|-- logging
    |-- configure_logger
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
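&lt;p&gt;To illustrate the logging utility, a minimal &lt;code&gt;configure_logger&lt;/code&gt; only needs to attach a single handler with a fixed format - a sketch of the idea, not the actual implementation in the&amp;nbsp;package:&lt;/p&gt;

```python
import logging
import sys


def configure_logger(name: str = "pipeline") -> logging.Logger:
    """Return a logger that writes a consistent line format to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(module)s.%(funcName)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```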

&lt;p&gt;A discussion of best practices for developing a Python package is beyond the scope of these articles, but you can use &lt;code&gt;bodywork-pipeline-utils&lt;/code&gt; as a template and/or refer to the &lt;a href="https://www.pypa.io/en/latest/"&gt;Python Packaging Authority&lt;/a&gt;. The Scikit-Learn team has also published their insights into &lt;a href="https://arxiv.org/abs/1309.0238"&gt;&lt;span class="caps"&gt;API&lt;/span&gt; design for machine learning software&lt;/a&gt;, which we recommend&amp;nbsp;reading.&lt;/p&gt;
&lt;h3 id="distributing-python-packages-within-your-company"&gt;Distributing Python Packages within your&amp;nbsp;Company&lt;/h3&gt;
&lt;p&gt;The easiest way to distribute Python packages within an organisation is directly from your Version Control System (&lt;span class="caps"&gt;VCS&lt;/span&gt;) - e.g. a remote Git repository hosted on GitHub. You do not &lt;strong&gt;need&lt;/strong&gt; to host an internal PyPI server, unless you have a specific reason to do so. To install a Python package from a remote Git repo you can&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;git+https://github.com/bodywork-ml/bodywork-pipeline-utils@v0.1.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;v0.1.5&lt;/code&gt; is the release tag, but could also be a Git commit hash. This will need to be specified in &lt;code&gt;requirements_pipe.txt&lt;/code&gt; as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git+https://github.com/bodywork-ml/bodywork-pipeline-utils@v0.1.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pip supports many VCSs and protocols - e.g. private Git repositories can be accessed via &lt;span class="caps"&gt;SSH&lt;/span&gt; by using &lt;code&gt;git+ssh&lt;/code&gt; and ensuring that the machine making the request has the appropriate &lt;span class="caps"&gt;SSH&lt;/span&gt; keys available. Refer to the &lt;a href="https://pip.pypa.io/en/stable/cli/pip_install/#vcs-support"&gt;documentation for pip&lt;/a&gt; for more&amp;nbsp;information.&lt;/p&gt;
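&lt;p&gt;For example, a private repository accessed over &lt;span class="caps"&gt;SSH&lt;/span&gt; can be referenced in a requirements file as follows - the repository path here is&amp;nbsp;hypothetical,&lt;/p&gt;

```
git+ssh://git@github.com/your-org/your-pipeline-utils@v1.0.0
```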
&lt;h2 id="defending-against-errors-and-handling-failures"&gt;Defending Against Errors and Handling&amp;nbsp;Failures&lt;/h2&gt;
&lt;p&gt;Pipelines can experience many types of error - here are some&amp;nbsp;examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Invalid configuration, such as specifying the wrong storage location for datasets and&amp;nbsp;models.&lt;/li&gt;
&lt;li&gt;Access to datasets and models becomes temporarily&amp;nbsp;unavailable.&lt;/li&gt;
&lt;li&gt;Errors in an unverified dataset cause model-training to&amp;nbsp;fail.&lt;/li&gt;
&lt;li&gt;An unexpected jump in &lt;a href="https://en.wikipedia.org/wiki/Concept_drift"&gt;concept drift&lt;/a&gt; causes model metrics to breach performance&amp;nbsp;thresholds.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When developing pipeline stages, it is critical that error events such as these are identified and logged to aid with debugging, and that the pipeline is not allowed to proceed. Our chosen pattern for handling errors is demonstrated in this snippet from &lt;code&gt;train_model.py&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;HYPERPARAM_GRID&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Error encountered when training model - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The pipeline is defined in the &lt;code&gt;main&lt;/code&gt; function, which is executed within a &lt;code&gt;try... except&lt;/code&gt; block. If it executes without error, then we signal this to Kubernetes with an exit-code of &lt;code&gt;0&lt;/code&gt;. If any error is encountered, then the exception is caught, we log the details and signal this to Kubernetes with an exit-code of &lt;code&gt;1&lt;/code&gt; (so it can attempt a retry, if this has been&amp;nbsp;configured).&lt;/p&gt;
&lt;p&gt;Exceptions within &lt;code&gt;main&lt;/code&gt; are likely to be raised from within 3rd party packages that we’ve installed - e.g. if &lt;code&gt;bodywork-pipeline-utils&lt;/code&gt; can’t access &lt;span class="caps"&gt;AWS&lt;/span&gt; or if Scikit-Learn fails to train a model. We recommend reading the documentation (or source code) for external functions and classes to understand what exceptions they raise and if the pipeline would benefit from custom handling and&amp;nbsp;logging.&lt;/p&gt;
&lt;p&gt;Sometimes, however, we need to look for the error ourselves and raise the exception manually, as shown below when the key test metric falls below a pre-configured threshold&amp;nbsp;level,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Main training job.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Starting train-model stage.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ...&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Metrics breached warning threshold - check for drift.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persist_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Model serialised and persisted to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r-squared metric (&lt;/span&gt;&lt;span class="se"&gt;{{&lt;/span&gt;&lt;span class="s2"&gt;metrics.r_squared:.3f&lt;/span&gt;&lt;span class="se"&gt;}}&lt;/span&gt;&lt;span class="s2"&gt;) is below deployment &amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This works as&amp;nbsp;follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the r-squared metric is above the error threshold and the warning threshold, then persist the trained&amp;nbsp;model.&lt;/li&gt;
&lt;li&gt;If the r-squared metric is above the error threshold, but below the warning threshold, then log a warning message and then persist the trained&amp;nbsp;model.&lt;/li&gt;
&lt;li&gt;If the r-squared metric is below the error threshold, then raise an exception, which will cause the stage to log an error and exit with a non-zero exit code (halting the pipeline), using the logic in the &lt;code&gt;try... except&lt;/code&gt; block discussed earlier in this&amp;nbsp;section.&lt;/li&gt;
&lt;/ul&gt;
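&lt;p&gt;This three-way decision can be isolated in a small pure function, which makes the branching trivial to unit test - a hypothetical helper, not part of the actual &lt;code&gt;train_model.py&lt;/code&gt;,&lt;/p&gt;

```python
def evaluate_metric(r_squared: float, error_threshold: float, warning_threshold: float) -> str:
    """Classify a model's r-squared score against deployment thresholds."""
    if r_squared >= error_threshold:
        if r_squared >= warning_threshold:
            return "persist"
        return "warn-and-persist"  # log a warning, then persist the model
    return "halt"                  # raise, so the stage exits with a non-zero code
```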
&lt;p&gt;Using logs to communicate pipeline state will take on additional importance later on in Part Three of this series, when we add monitoring, observability and alerting to our&amp;nbsp;pipeline.&lt;/p&gt;
&lt;h2 id="configurable-pipelines"&gt;Configurable&amp;nbsp;Pipelines&lt;/h2&gt;
&lt;p&gt;Pipelines can benefit from parametrisation to make them re-usable across deployment environments (and potentially tenants, if this makes sense for your project). For example, passing the S3 bucket as an external argument to each stage, enables the pipeline to operate both in a staging environment, as well as in production. Similarly, external arguments can be used to set thresholds for defining when warnings and alerts are triggered, based on model training metrics, which can make testing the pipeline much&amp;nbsp;easier.&lt;/p&gt;
&lt;p&gt;Each stage of our pipeline is defined by an executable Python module.  The easiest way to pass arguments to a module is via the command line. For&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python -m pipeline.train_model time-to-dispatch 0.9 0.8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This passes the array of strings, &lt;code&gt;["time-to-dispatch", "0.9", "0.8"]&lt;/code&gt;, to &lt;code&gt;train_model.py&lt;/code&gt;, which can be retrieved from &lt;code&gt;sys.argv&lt;/code&gt; (from index one onwards - index zero holds the module’s path), as demonstrated in the excerpt from &lt;code&gt;train_model.py&lt;/code&gt; below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;
        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ne"&gt;IndexError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Invalid arguments passed to train_model.py. &amp;quot;&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Expected S3_BUCKET R_SQUARED_ERROR_THRESHOLD R_SQUARED_WARNING_THRESHOLD, &amp;quot;&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;where all thresholds must be in the range [0, 1].&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;HYPERPARAM_GRID&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Error encountered when training model - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note how we cast the numeric arguments to &lt;code&gt;float&lt;/code&gt; types before performing basic input validation, to ensure that users can’t accidentally specify invalid arguments that could lead to unintended&amp;nbsp;consequences.&lt;/p&gt;
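&lt;p&gt;For stages with more arguments, the same casting and range checks can be delegated to &lt;code&gt;argparse&lt;/code&gt;, which also generates a usage message automatically - an alternative sketch, not what this project&amp;nbsp;uses,&lt;/p&gt;

```python
import argparse


def unit_interval(value):
    """Cast a string to float, requiring the result to lie in (0, 1]."""
    threshold = float(value)
    if not (threshold > 0.0 and 1.0 >= threshold):
        raise argparse.ArgumentTypeError(f"{value} is not in the range (0, 1]")
    return threshold


parser = argparse.ArgumentParser(description="Train-model stage.")
parser.add_argument("s3_bucket")
parser.add_argument("r2_metric_error_threshold", type=unit_interval)
parser.add_argument("r2_metric_warning_threshold", type=unit_interval)
```

&lt;p&gt;Invalid values then cause &lt;code&gt;parse_args&lt;/code&gt; to print the error and exit with a non-zero code, matching the behaviour of the manual validation&amp;nbsp;above.&lt;/p&gt;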
&lt;p&gt;When deployed by Bodywork,  &lt;code&gt;train_model.py&lt;/code&gt; will be executed in a dedicated container on Kubernetes. The required arguments can be passed via the &lt;code&gt;args&lt;/code&gt; parameter in the &lt;code&gt;bodywork.yaml&lt;/code&gt; file that describes the deployment, as shown&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# bodywork.yaml&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="nt"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/train_model.py&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;args&lt;/span&gt;&lt;span class="p p-Indicator"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;time-to-dispatch&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.9&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.8&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="engineering-the-model-training-job"&gt;Engineering the Model Training&amp;nbsp;Job&lt;/h2&gt;
&lt;p&gt;The core task here is to engineer the &lt;span class="caps"&gt;ML&lt;/span&gt; solution in the &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/master/notebooks/time_to_dispatch_model.ipynb"&gt;time_to_dispatch_model.ipynb notebook&lt;/a&gt;, provided to us by the data scientist who worked on this task, into the pipeline stage defined in &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/part-two/pipeline/train_model.py"&gt;pipeline/train_model.py&lt;/a&gt; (reproduced in the Appendix below). The central workflow is defined in the &lt;code&gt;main&lt;/code&gt; function,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;configure_logger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Main training job.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Starting train-model stage.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_latest_csv_dataset_from_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;datasets&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Retrieved dataset from s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;feature_and_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prepare_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_and_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature_and_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Trained model: r-squared=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, &amp;quot;&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;MAE=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Metrics breached warning threshold - check for drift.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persist_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Model serialised and persisted to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r-squared metric (&lt;/span&gt;&lt;span class="se"&gt;{{&lt;/span&gt;&lt;span class="s2"&gt;metrics.r_squared:.3f&lt;/span&gt;&lt;span class="se"&gt;}}&lt;/span&gt;&lt;span class="s2"&gt;) is below deployment &amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This splits the job into smaller sub-tasks, such as preparing the data, that can be delegated to specialised functions that are easier to write (unit) tests for. All interaction with cloud object storage (&lt;span class="caps"&gt;AWS&lt;/span&gt; S3), for retrieving datasets and persisting trained models, is handled by functions imported from the &lt;a href="https://github.com/bodywork-ml/bodywork-pipeline-utils"&gt;bodywork-pipeline-utils&lt;/a&gt; package, leaving three key functions that we will discuss in&amp;nbsp;turn:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;prepare_data&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;train_model&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;validate_trained_model_logic&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;persist_model&lt;/code&gt; function creates the &lt;code&gt;Model&lt;/code&gt; object and calls its &lt;code&gt;put_model_to_S3&lt;/code&gt; method. It will be tested implicitly in the functional tests for &lt;code&gt;main&lt;/code&gt;, which we will look at later&amp;nbsp;on.&lt;/p&gt;
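Since `persist_model` is only a thin wrapper, its shape is easy to sketch. The stand-in `Model` class below merely mimics the two behaviours the wrapper relies on; the real class lives in bodywork-pipeline-utils, and its exact constructor signature and return values are assumptions here.

```python
# Minimal, self-contained sketch of what persist_model does: wrap the
# trained estimator in a Model object and upload it to object storage.
from typing import Any, NamedTuple


class Model:
    """Stand-in for the Model class from bodywork-pipeline-utils."""

    def __init__(self, name: str, estimator: Any, dataset: Any, metadata: dict):
        self.name = name
        self.estimator = estimator
        self.dataset = dataset
        self.metadata = metadata

    def put_model_to_s3(self, bucket: str, folder: str) -> str:
        # The real method serialises the estimator and uploads it to S3;
        # here we just return the location it would be written to.
        return f"{bucket}/{folder}/{self.name}.joblib"


def persist_model(s3_bucket: str, model: Any, dataset: Any, metrics: Any) -> str:
    """Wrap the trained estimator and persist it, returning its location."""
    wrapped = Model("time-to-dispatch", model, dataset, {"metrics": metrics._asdict()})
    return wrapped.put_model_to_s3(s3_bucket, "models")
```

Keeping this wrapper thin is what lets it be covered implicitly by the functional tests for `main`, rather than needing dedicated unit tests.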
&lt;h3 id="prepare-data"&gt;Prepare&amp;nbsp;Data&lt;/h3&gt;
&lt;p&gt;The purpose of this function is to take the dataset as a &lt;code&gt;DataFrame&lt;/code&gt;, split the features from the labels and then partition each of these into ‘train’ and ‘test’ subsets. We return the results as a &lt;code&gt;NamedTuple&lt;/code&gt; called &lt;code&gt;FeatureAndLabels&lt;/code&gt;, which facilitates easier access within the functions that consume these data&amp;nbsp;structures.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Container for features and labels split by test and train sets.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Split the data into features and labels for training and testing.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is tested in &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/part-two/tests/test_train_model.py"&gt;tests/test_train_model.py&lt;/a&gt; as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="nd"&gt;@fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;session&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tests/resources/dataset.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;tests&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;resources&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;foobar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_prepare_data_splits_labels_and_features_into_test_and_train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;label_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;n_rows_in_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;n_cols_in_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;prepared_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prepare_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n_cols_in_dataset&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;label_column&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n_cols_in_dataset&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;label_column&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;label_column&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;label_column&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n_rows_in_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n_rows_in_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To help with testing, we have saved a snapshot of &lt;span class="caps"&gt;CSV&lt;/span&gt; data to &lt;code&gt;tests/resources/dataset.csv&lt;/code&gt; within the project repository, and made it available to all tests in this module, via a &lt;a href="https://docs.pytest.org/en/6.2.x/fixture.html"&gt;Pytest fixture&lt;/a&gt; called &lt;code&gt;dataset&lt;/code&gt;. There is only one unit test for this function. It checks that &lt;code&gt;prepare_data&lt;/code&gt; splits labels from features, for both the ‘train’ and ‘test’ sets, and that it doesn’t lose any rows of data in the process. If we refactor &lt;code&gt;prepare_data&lt;/code&gt; in the future, this test will help prevent us from accidentally leaking the label into the&amp;nbsp;features.&lt;/p&gt;
&lt;h3 id="train-model"&gt;Train&amp;nbsp;Model&lt;/h3&gt;
&lt;p&gt;Given a &lt;code&gt;FeatureAndLabels&lt;/code&gt; object together with a grid of hyper-parameters, this function will yield a trained model, together with the model’s performance metrics on the ‘test’ set. The hyper-parameter grid is an input to this function, so that when testing we can use a single point in the grid, but can specify many more points for the actual job, when training time is less of a constraint. The metrics are contained within a &lt;code&gt;NamedTuple&lt;/code&gt; called &lt;code&gt;TaskMetrics&lt;/code&gt;, to make passing them between functions easier and less prone to&amp;nbsp;error.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU002&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU004&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU005&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Container for the task&amp;#39;s performance metrics.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskMetrics&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Train a model and compute performance metrics.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;grid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DecisionTreeRegressor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;refit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;best_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;
    &lt;span class="n"&gt;y_test_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;performance_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TaskMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;performance_metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Create features for training model.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We have further delegated the task of pre-processing the features for the model (in this case, just mapping categories to integers) to a dedicated function called &lt;code&gt;preprocess&lt;/code&gt;. The &lt;code&gt;train_model&lt;/code&gt; function is tested in &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/part-two/tests/test_train_model.py"&gt;tests/test_train_model.py&lt;/a&gt; as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.utils.validation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check_is_fitted&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="nd"&gt;@fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;session&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]][:&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_train_model_yields_model_and_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeaturesAndLabels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;random_state&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;check_is_fitted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;NotFittedError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_absolute_error&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This tests that &lt;code&gt;train_model&lt;/code&gt; returns a fitted model and acceptable performance metrics, given a reasonably sized tranche of&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;Note that we haven’t relied on &lt;code&gt;prepare_data&lt;/code&gt; to create the &lt;code&gt;FeatureAndLabels&lt;/code&gt; object - we have created it manually in another fixture that relies on the &lt;code&gt;dataset&lt;/code&gt; fixture discussed earlier. This is a deliberate choice, made with the aim of decoupling the outcome of this test from the behaviour of &lt;code&gt;prepare_data&lt;/code&gt;. Tests that depend on multiple functions can be ‘brittle’ and lead to cascades of failing tests when only a single function or method is raising an error. We cannot stress enough how important it is to structure your code in such a way that it can be easily&amp;nbsp;tested.&lt;/p&gt;
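&lt;p&gt;The decoupled-fixture idea can be sketched in isolation. The names below (&lt;code&gt;Split&lt;/code&gt;, &lt;code&gt;prepared_split&lt;/code&gt;, &lt;code&gt;prepare_data&lt;/code&gt;) are hypothetical illustrations, not part of the pipeline:&lt;/p&gt;

```python
from typing import NamedTuple


class Split(NamedTuple):
    """Hypothetical container for train/test data (not the pipeline's class)."""
    train: list
    test: list


# In a real pytest suite this function would carry @fixture(scope="session");
# the decorator is omitted so the sketch runs stand-alone.
def prepared_split() -> Split:
    # Build the object directly, instead of calling a hypothetical
    # prepare_data() helper - a bug in prepare_data() can then no longer
    # make this test fail for the wrong reason.
    return Split(train=[1, 2, 3], test=[4])


def test_split_sizes():
    split = prepared_split()
    assert len(split.train) == 3
    assert len(split.test) == 1


test_split_sizes()
```

&lt;p&gt;The principle is the same as in the fixture above: the test’s input is constructed directly, rather than via another function that the test would then implicitly depend&amp;nbsp;on.&lt;/p&gt;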
&lt;p&gt;For completeness, we also provide a simple test for &lt;code&gt;preprocess&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_preprocess_processes_features&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SKU004&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;processed_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="validating-trained-models"&gt;Validating Trained&amp;nbsp;Models&lt;/h3&gt;
&lt;p&gt;The goal of the pipeline is to automate the process of training a new model and deploying it - i.e. to take the data scientist out of the loop. Consequently, we need to exercise caution before deploying the latest model. Although the final go/no-go decision on deploying the model will be based on performance metrics, we should also sense-check the model based on basic behaviours we expect it to have. The &lt;code&gt;validate_trained_model_logic&lt;/code&gt; function performs three logical tests of the model and will raise an exception if it finds an issue (thereby terminating the pipeline before deployment). The three checks&amp;nbsp;are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Does the &lt;code&gt;hours_to_dispatch&lt;/code&gt; variable increase with &lt;code&gt;orders_placed&lt;/code&gt;, for each&amp;nbsp;product?&lt;/li&gt;
&lt;li&gt;Are all predictions for the ‘test’ set&amp;nbsp;positive?&lt;/li&gt;
&lt;li&gt;Are all predictions for the ‘test’ set within 25% of the highest &lt;code&gt;hours_to_dispatch&lt;/code&gt; observation?&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Verify that a trained model passes basic logical expectations.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;orders_placed_sensitivity_checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;]]))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;orders_placed_sensitivity_checks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;hours_to_dispatch predictions do not increase with orders_placed&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;test_set_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;negative hours_to_dispatch predictions found for test set&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;outlier hours_to_dispatch predictions found for test set&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Trained model failed verification: &amp;quot;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;, &amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that we perform all three checks before raising the exception, so that the error message (and the logs that will be generated from it) can be maximally informative when it comes to&amp;nbsp;debugging.&lt;/p&gt;
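&lt;p&gt;This accumulate-then-raise pattern can be sketched in a minimal, self-contained form (hypothetical names and thresholds, not the pipeline’s code):&lt;/p&gt;

```python
from typing import List


def validate(predictions: List[float], upper_bound: float) -> None:
    """Accumulate every failed check, then raise once with all issues."""
    issues: List[str] = []
    if any(p < 0 for p in predictions):
        issues.append("negative predictions found")
    if any(p > upper_bound for p in predictions):
        issues.append("outlier predictions found")
    # Raising only after all checks have run means a single error message
    # reports every problem, not just the first one encountered.
    if issues:
        raise RuntimeError("Validation failed: " + ", ".join(issues) + ".")


validate([1.0, 2.0], upper_bound=10.0)  # all checks pass - returns None
```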
&lt;p&gt;The associated test can also be found in &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/part-two/tests/test_train_model.py"&gt;tests/test_train_model.py&lt;/a&gt;.  This is the most complex test thus far, because we have to use Scikit-Learn’s &lt;code&gt;DummyRegressor&lt;/code&gt; to create models that will fail each one of the tests individually, as can be seen&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.dummy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DummyRegressor&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_validate_trained_model_logic_raises_exception_for_failing_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeaturesAndLabels&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dummy_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DummyRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;constant&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dummy_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expected_exception_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;Trained model failed verification: &amp;quot;&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;hours_to_dispatch predictions do not increase with orders_placed.&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_exception_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dummy_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DummyRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;constant&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dummy_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expected_exception_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;Trained model failed verification: &amp;quot;&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;hours_to_dispatch predictions do not increase with orders_placed, &amp;quot;&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;negative hours_to_dispatch predictions found for test set.&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_exception_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dummy_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DummyRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;constant&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dummy_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expected_exception_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;Trained model failed verification: &amp;quot;&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;hours_to_dispatch predictions do not increase with orders_placed, &amp;quot;&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;outlier hours_to_dispatch predictions found for test set.&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_exception_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="end-to-end-functional-tests"&gt;End-to-End Functional&amp;nbsp;Tests&lt;/h3&gt;
&lt;p&gt;We’ve tested the individual sub-tasks within &lt;code&gt;main&lt;/code&gt;, but how do we know that we’ve assembled them correctly, so that &lt;code&gt;persist_model&lt;/code&gt; will upload the expected &lt;code&gt;Model&lt;/code&gt; object to cloud storage? We now need to turn our attention to testing &lt;code&gt;main&lt;/code&gt; from end-to-end - i.e. functional tests for the train-model&amp;nbsp;stage.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;main&lt;/code&gt; function will try to access &lt;span class="caps"&gt;AWS&lt;/span&gt; S3 to get a dataset and then save a pickled &lt;code&gt;Model&lt;/code&gt; to S3. We could set up an S3 bucket for testing this interaction, but that would constitute an integration test, which is not our current aim. We will disable the calls to &lt;span class="caps"&gt;AWS&lt;/span&gt; by mocking the &lt;code&gt;bodywork_pipeline_utils.aws&lt;/code&gt; module, using the &lt;code&gt;patch&lt;/code&gt; function from the Python standard library’s &lt;a href="https://docs.python.org/3/library/unittest.mock.html"&gt;unittest.mock&lt;/a&gt;&amp;nbsp;module.&lt;/p&gt;
&lt;p&gt;Decorating our test with &lt;code&gt;@patch("pipeline.train_model.aws")&lt;/code&gt; causes &lt;code&gt;bodywork_pipeline_utils.aws&lt;/code&gt; (which we import into &lt;code&gt;train_model.py&lt;/code&gt;) to be replaced by a &lt;code&gt;MagicMock&lt;/code&gt; object called &lt;code&gt;mock_aws&lt;/code&gt;. This allows us to perform a number of useful&amp;nbsp;tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hard-code the return value from &lt;code&gt;aws.get_latest_csv_dataset_from_s3&lt;/code&gt;, so that it returns our local test dataset instead of a remote dataset on&amp;nbsp;S3.&lt;/li&gt;
&lt;li&gt;Check whether the &lt;code&gt;put_model_to_s3&lt;/code&gt; method of the &lt;code&gt;aws.Model&lt;/code&gt; object created in &lt;code&gt;persist_model&lt;/code&gt; was&amp;nbsp;called.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can see this in action&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;_pytest.logging&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogCaptureFixture&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline.train_model.aws&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_train_job_happy_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogCaptureFixture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_latest_csv_dataset_from_s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;project-bucket&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;random_state&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_model_to_s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_called_once&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Starting train-model stage&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Retrieved dataset from s3&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Trained model&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Model serialised and persisted to s3&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This test also makes use of Pytest’s &lt;a href="https://docs.pytest.org/en/6.2.x/reference.html?highlight=caplog#pytest.logging.caplog"&gt;caplog&lt;/a&gt; fixture, enabling us to test that &lt;code&gt;main&lt;/code&gt; yields the expected log records when everything goes according to plan (i.e. the ‘happy path’). This gives us confidence that model artefacts will be persisted as expected, when run in&amp;nbsp;production.&lt;/p&gt;
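&lt;p&gt;The &lt;code&gt;MagicMock&lt;/code&gt; behaviour that this test relies on can be demonstrated in a self-contained sketch. Note that the function and attribute names below are illustrative stand-ins for the real &lt;code&gt;aws&lt;/code&gt; module, not actual pipeline&amp;nbsp;code.&lt;/p&gt;

```python
from unittest.mock import MagicMock

# Illustrative stand-in for the mocked `aws` module - not real pipeline code.
mock_aws = MagicMock()

# Hard-code a return value, as done for get_latest_csv_dataset_from_s3.
mock_aws.get_latest_csv_dataset_from_s3.return_value = "local-test-dataset"
dataset = mock_aws.get_latest_csv_dataset_from_s3("project-bucket", "datasets")
assert dataset == "local-test-dataset"

# Calls to mock attributes are recorded, so interactions can be verified.
# Every call to mock_aws.Model(...) returns the same return_value object,
# which is why mock_aws.Model().put_model_to_s3.assert_called_once() in the
# test above sees the call made inside persist_model.
mock_aws.Model("time-to-dispatch", dataset).put_model_to_s3("project-bucket")
mock_aws.Model().put_model_to_s3.assert_called_once()
```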
&lt;p&gt;What about the ‘unhappy paths’ - when performance metrics fall below warning and error thresholds? We need to test that &lt;code&gt;main&lt;/code&gt; will behave as we expect it to, and so we will have to write tests for these scenarios, as&amp;nbsp;well.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline.train_model.aws&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_train_job_raises_exception_when_metrics_below_error_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_latest_csv_dataset_from_s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;below deployment threshold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;project-bucket&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;random_state&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;


&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline.train_model.aws&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_train_job_logs_warning_when_metrics_below_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogCaptureFixture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mock_aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_latest_csv_dataset_from_s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;project-bucket&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;random_state&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;WARNING&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;breached warning threshold&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These tests work by setting the thresholds artificially high (or low) and checking that exceptions are raised or that warning messages are logged. Note that this testing strategy only works because &lt;code&gt;main&lt;/code&gt; accepts the thresholds as arguments, which was one of the key motivations for designing it in this&amp;nbsp;way.&lt;/p&gt;
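&lt;p&gt;A minimal sketch makes this pattern concrete. The function below is a hypothetical stand-in for the threshold handling inside &lt;code&gt;main&lt;/code&gt;, not the actual pipeline&amp;nbsp;code.&lt;/p&gt;

```python
# Hypothetical stand-in for main's metric-threshold logic - illustrative only.
def check_metric(r_squared: float, warn_below: float, error_below: float) -> str:
    """Raise, warn or pass depending on the injected thresholds."""
    if r_squared < error_below:
        raise RuntimeError(f"r_squared={r_squared} below deployment threshold")
    if r_squared < warn_below:
        return f"r_squared={r_squared} breached warning threshold"
    return "ok"

# Happy path: realistic thresholds, good metric.
assert check_metric(0.95, warn_below=0.8, error_below=0.5) == "ok"

# Unhappy paths: set the thresholds artificially high to force each branch,
# without needing S3, real data or a badly performing model.
assert "breached warning threshold" in check_metric(0.95, 0.99, 0.5)
try:
    check_metric(0.95, warn_below=0.99, error_below=0.99)
except RuntimeError as e:
    assert "below deployment threshold" in str(e)
```

Because the thresholds are plain function arguments, each branch is reachable directly from a unit test - the same property that the real tests exploit.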
&lt;h3 id="input-validation-for-the-stage"&gt;Input Validation for the&amp;nbsp;Stage&lt;/h3&gt;
&lt;p&gt;The train-model stage works by executing &lt;code&gt;train_model.py&lt;/code&gt;, which requires three arguments to be passed to it (as discussed earlier on). These inputs are validated, and this validation needs to be tested for completeness. This is a long and boring test, so we will not reproduce the whole thing, but instead discuss the testing strategy (which is a bit more&amp;nbsp;interesting).&lt;/p&gt;
&lt;p&gt;The approach to testing input validation is to run &lt;code&gt;train_model.py&lt;/code&gt; as Bodywork would run it within a container on Kubernetes, by calling &lt;code&gt;python pipeline/train_model.py&lt;/code&gt; from the command line. We can replicate this using &lt;code&gt;subprocess.run&lt;/code&gt; from the Python standard library and capturing the output. We can then pass invalid arguments and check the output for the expected error messages. You can see this pattern in action below, for the case when no arguments are&amp;nbsp;passed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;subprocess&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_run_job_handles_error_for_invalid_args&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;process_one&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;python&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;pipeline/train_model.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;utf-8&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;process_one&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ERROR&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;process_one&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Invalid arguments passed to train_model.py&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;process_one&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;

      &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="developing-the-model-serving-stage"&gt;Developing the Model Serving&amp;nbsp;Stage&lt;/h2&gt;
&lt;p&gt;In Part One of this series we developed a skeleton web service that returned a hard-coded value whenever the &lt;span class="caps"&gt;API&lt;/span&gt; was called. Our task in this part is to extend this to downloading the latest model persisted to cloud object storage (&lt;span class="caps"&gt;AWS&lt;/span&gt; S3), and then use the model for generating predictions. Unlike the train-model stage, the effort required for this task is relatively small and so we will reproduce &lt;code&gt;serve_model.py&lt;/code&gt; in full and then discuss it in more detail&amp;nbsp;afterwards.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;uvicorn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pipeline.train_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;configure_logger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;SKU001&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;SKU002&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU002&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;SKU003&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU003&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;SKU004&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU004&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;SKU005&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU005&amp;quot;&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;product_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProductCode&lt;/span&gt;
    &lt;span class="n"&gt;orders_placed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;est_hours_to_dispatch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTP_200_OK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;time_to_dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders_placed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_code&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]]])&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;
        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;wrapped_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_latest_pkl_model_from_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;models&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Successfully loaded model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;IndexError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Invalid arguments passed to serve_model.py - expected S3_BUCKET&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Could not get latest model and start web server - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.0.0.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The key changes from the version in Part One are as&amp;nbsp;follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We now pass the name of the &lt;span class="caps"&gt;AWS&lt;/span&gt; S3 bucket as an argument to &lt;code&gt;serve_model.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;if __name__ == "__main__"&lt;/code&gt; block we now attempt to retrieve the latest &lt;code&gt;Model&lt;/code&gt; object that was persisted to &lt;span class="caps"&gt;AWS&lt;/span&gt; S3, before starting the FastAPI&amp;nbsp;server.&lt;/li&gt;
&lt;li&gt;We placed a new constraint on the &lt;code&gt;Data.orders_placed&lt;/code&gt; field to ensure that all values sent to the &lt;span class="caps"&gt;API&lt;/span&gt; must be greater-than-or-equal-to zero, and another new constraint on &lt;code&gt;Data.product_code&lt;/code&gt; that forces this field to be one of the values specified in the &lt;code&gt;ProductCode&lt;/code&gt; &lt;a href="https://docs.python.org/3/library/enum.html"&gt;enumeration&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We now use the model to generate predictions, using the &lt;code&gt;PRODUCT_CODE_MAP&lt;/code&gt; dictionary from &lt;code&gt;train_model.py&lt;/code&gt; to map product codes to integers, before calling the&amp;nbsp;model.&lt;/li&gt;
&lt;li&gt;We use the string representation of the &lt;code&gt;Model&lt;/code&gt; object in the response’s &lt;code&gt;model_version&lt;/code&gt; field, which contains the full information on which S3 object is being used, as well as other metadata such as the dataset used to train the model, the type of model, etc. This verbose information is designed to facilitate easy debugging of problematic&amp;nbsp;responses.&lt;/li&gt;
&lt;/ul&gt;
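&lt;p&gt;The validation and feature-mapping steps described above can be sketched in isolation. The integer codes in the dictionary below are assumed values for illustration - the real &lt;code&gt;PRODUCT_CODE_MAP&lt;/code&gt; is defined in &lt;code&gt;train_model.py&lt;/code&gt;.&lt;/p&gt;

```python
from enum import Enum

class ProductCode(Enum):
    SKU001 = "SKU001"
    SKU002 = "SKU002"

# Assumed integer codes for illustration - the real PRODUCT_CODE_MAP
# lives in train_model.py and is shared with the serving stage.
PRODUCT_CODE_MAP = {"SKU001": 0, "SKU002": 1}

# The enum restricts product_code to known values...
code = ProductCode("SKU001")

# ...so unknown codes raise ValueError and never reach the model.
try:
    ProductCode("SKU999")
    raise AssertionError("should have been rejected")
except ValueError:
    pass

# The categorical field is mapped to an integer feature before prediction,
# mirroring the feature construction in time_to_dispatch.
features = [[10.0, PRODUCT_CODE_MAP[code.value]]]
assert features == [[10.0, 0]]
```

Sharing the mapping between the training and serving stages guards against train/serve skew in how categories are encoded.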
&lt;p&gt;If we start the server&amp;nbsp;locally,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python -m pipeline.serve_model time-to-dispatch

2021-07-24 09:56:42,718 - INFO - serve_model.&amp;lt;module&amp;gt; - Successfully loaded model: name:time-to-dispatch|model_type:&amp;lt;class &amp;#39;sklearn.tree._classes.DecisionTreeRegressor&amp;#39;&amp;gt;|model_timestamp:2021-07-20 14:44:13.558375|model_hash:b4860f56fa24193934fe1ea51b66818d|train_dataset_key:datasets/time_to_dispatch_2021-07-01T16|45|38.csv|train_dataset_hash:&amp;quot;759eccda4ceb7a07cda66ad4ef7cdfbc&amp;quot;|pipeline_git_commit_hash:NA
INFO:     Started server process [88289]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then we can send a test&amp;nbsp;request,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ curl http://localhost:8000/api/v0.1/time_to_dispatch \
    --request POST \
    --header &amp;quot;Content-Type: application/json&amp;quot; \
    --data &amp;#39;{&amp;quot;product_code&amp;quot;: &amp;quot;SKU001&amp;quot;, &amp;quot;orders_placed&amp;quot;: 10}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which should return a response along the lines&amp;nbsp;of,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.6527543057985115&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name:time-to-dispatch|model_type:&amp;lt;class &amp;#39;sklearn.tree._classes.DecisionTreeRegressor&amp;#39;&amp;gt;|model_timestamp:2021-07-20 14:44:13.558375|model_hash:b4860f56fa24193934fe1ea51b66818d|train_dataset_key:datasets/time_to_dispatch_2021-07-01T16|45|38.csv|train_dataset_hash:\&amp;quot;759eccda4ceb7a07cda66ad4ef7cdfbc\&amp;quot;|pipeline_git_commit_hash:ed3113197adcbdbe338bf406841b930e895c42d6&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="updating-the-tests"&gt;Updating the&amp;nbsp;Tests&lt;/h3&gt;
&lt;p&gt;We only need to add one more (small) test to &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/part-two/tests/test_serve_model.py"&gt;tests/test_serve_model.py&lt;/a&gt;, but we will have to modify the existing tests to take into account that we are now using a trained model to generate predictions, as opposed to returning fixed values. This introduces a complication, because we need to inject a working model into the&amp;nbsp;module.&lt;/p&gt;
&lt;p&gt;To facilitate testing, we have persisted a valid &lt;code&gt;Model&lt;/code&gt; object to &lt;code&gt;tests/resources/model.pkl&lt;/code&gt;, which will be loaded in a function called &lt;code&gt;wrapped_model&lt;/code&gt; and injected into the module at test-time as a new object, using &lt;code&gt;unittest.mock.patch&lt;/code&gt;. We are unable to use &lt;code&gt;patch&lt;/code&gt; as we did for &lt;code&gt;train_model.py&lt;/code&gt;, because the model is only loaded when &lt;code&gt;serve_model.py&lt;/code&gt; is executed as a script, whereas our tests rely only on the FastAPI test&amp;nbsp;client.&lt;/p&gt;
&lt;p&gt;The modified test for a valid request is shown&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pickle&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;subprocess&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastapi.testclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;

&lt;span class="n"&gt;test_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapped_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tests/resources/model.pkl&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;r+b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wrapped_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapped_model&lt;/span&gt;


&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline.serve_model.wrapped_model&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_web_api_returns_valid_response_given_valid_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prediction_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prediction_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;expected_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected_prediction&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This works by checking the output from the &lt;span class="caps"&gt;API&lt;/span&gt; against the output from the model loaded from the test resources, to make sure that they are identical. Next, we modify the test that covers the &lt;span class="caps"&gt;API&lt;/span&gt; data validation, to reflect the extra constraints we have placed on&amp;nbsp;requests.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline.serve_model.wrapped_model&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_web_api_returns_error_code_given_invalid_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prediction_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prediction_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;422&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;value_error.missing&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="n"&gt;prediction_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU000&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prediction_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;422&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;not a valid enumeration member&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="n"&gt;prediction_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prediction_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;422&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ensure this value is greater than or equal to 0&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, we add one more test to cover the input validation for the &lt;code&gt;serve_model.py&lt;/code&gt; module, using the same strategy as we did for the equivalent test for &lt;code&gt;train_model.py&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;subprocess&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_web_server_raises_exception_if_passed_invalid_args&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;python&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-m&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;pipeline.serve_model&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;utf-8&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ERROR&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Invalid arguments passed to serve_model.py&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="updating-the-deployment-and-releasing-to-production"&gt;Updating the Deployment and Releasing to&amp;nbsp;Production&lt;/h2&gt;
&lt;p&gt;The last task we need to complete before we can commit all changes, push to GitHub and trigger the &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; pipeline, is to update the deployment configuration in &lt;code&gt;bodywork.yaml&lt;/code&gt;. This requires four&amp;nbsp;changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Arguments now need to be passed to each&amp;nbsp;stage.&lt;/li&gt;
&lt;li&gt;The Python package requirements for each stage need to be&amp;nbsp;updated.&lt;/li&gt;
&lt;li&gt;&lt;span class="caps"&gt;AWS&lt;/span&gt; credentials need to be injected into each stage, as required by &lt;code&gt;bodywork_pipeline_utils.aws&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;span class="caps"&gt;CPU&lt;/span&gt; and memory resources need to be updated, together with max completion/startup&amp;nbsp;timeouts.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;
&lt;span class="nt"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;time-to-dispatch&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodyworkml/bodywork-core:3.0&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;train_model &amp;gt;&amp;gt; serve_model&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;secrets_group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;dev&lt;/span&gt;
&lt;span class="nt"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/train_model.py&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;time-to-dispatch&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.9&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.8&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;numpy&amp;gt;=1.21.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pandas&amp;gt;=1.2.5&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;scikit-learn&amp;gt;=1.0.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;git+https://github.com/bodywork-ml/bodywork-pipeline-utils@v0.1.5&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1.0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1000&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;max_completion_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;180&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-credentials&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-credentials&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-credentials&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;serve_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/serve_model.py&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;time-to-dispatch&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;numpy&amp;gt;=1.21.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;scikit-learn&amp;gt;=1.0.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;fastapi&amp;gt;=0.65.2&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;uvicorn&amp;gt;=0.14.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;git+https://github.com/bodywork-ml/bodywork-pipeline-utils@v0.1.5&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.5&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;250&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;max_startup_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;180&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;8000&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;ingress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-credentials&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-credentials&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-credentials&lt;/span&gt;
&lt;span class="nt"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;log_level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;INFO&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will instruct Bodywork to look for &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; and &lt;code&gt;AWS_DEFAULT_REGION&lt;/code&gt; in a secret record called &lt;code&gt;aws-credentials&lt;/code&gt;, so that it can inject these secrets into the containers running the stages of our pipeline (as environment variables that will be picked-up automatically). These secrets will have to be created first, which can be done as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw create secret aws-credentials \
    --group=dev \
    --data AWS_ACCESS_KEY_ID=put-your-key-in-here \
    --data AWS_SECRET_ACCESS_KEY=put-your-other-key-in-here \
    --data AWS_DEFAULT_REGION=wherever-your-cluster-is
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now you’re ready to push this branch to your remote Git repo! If your tests pass and your colleagues approve the merge, the &lt;span class="caps"&gt;CD&lt;/span&gt; part of the &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; pipeline we set up in Part One will ensure that the new pipeline is deployed to Kubernetes by Bodywork and executed immediately. Bodywork will perform a rolling deployment that ensures zero downtime and automatically rolls back failed deployments to the previous version. When Bodywork has finished, test the new web &lt;span class="caps"&gt;API&lt;/span&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ curl http://CLUSTER_IP/pipelines/time-to-dispatch--serve-model/api/v0.1/time_to_dispatch \
    --request POST \
    --header &amp;quot;Content-Type: application/json&amp;quot; \
    --data &amp;#39;{&amp;quot;product_code&amp;quot;: &amp;quot;SKU001&amp;quot;, &amp;quot;orders_placed&amp;quot;: 10}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
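&lt;p&gt;The same request can also be composed in Python using only the standard library - a sketch in which &lt;code&gt;CLUSTER_IP&lt;/code&gt; remains a placeholder and the request is constructed but not sent:&lt;/p&gt;

```python
import json
from urllib.request import Request

# CLUSTER_IP is a placeholder - substitute the ingress IP for your cluster
url = "http://CLUSTER_IP/pipelines/time-to-dispatch--serve-model/api/v0.1/time_to_dispatch"
payload = json.dumps({"product_code": "SKU001", "orders_placed": 10}).encode("utf-8")
request = Request(url, data=payload, headers={"Content-Type": "application/json"}, method="POST")

# urllib.request.urlopen(request) would send it, but that needs a live cluster
assert request.get_method() == "POST"
```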

&lt;p&gt;You should observe the same response you received when testing&amp;nbsp;locally,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.6527543057985115&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name:time-to-dispatch|model_type:&amp;lt;class &amp;#39;sklearn.tree._classes.DecisionTreeRegressor&amp;#39;&amp;gt;|model_timestamp:2021-07-20 14:44:13.558375|model_hash:b4860f56fa24193934fe1ea51b66818d|train_dataset_key:datasets/time_to_dispatch_2021-07-01T16|45|38.csv|train_dataset_hash:\&amp;quot;759eccda4ceb7a07cda66ad4ef7cdfbc\&amp;quot;|pipeline_git_commit_hash:ed3113197adcbdbe338bf406841b930e895c42d6&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;See our guide to &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#accessing-services"&gt;accessing services&lt;/a&gt; for information on how to determine &lt;code&gt;CLUSTER_IP&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="scheduling-the-pipeline-to-run-on-a-schedule"&gt;Scheduling the Pipeline to run on a&amp;nbsp;Schedule&lt;/h2&gt;
&lt;p&gt;At this point, the pipeline will have deployed a model using the most recent dataset made available for this task. We know, however, that new data will arrive every Friday evening and so we’d like to schedule the pipeline to run just after the data is expected. We can achieve this using Bodywork cronjobs, as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw create cronjob https://github.com/bodywork-ml/ml-pipeline-engineering \
    --name=weekly-update \
    --branch master \
    --schedule=&amp;quot;45 11 * * 5&amp;quot; \
    --retries=2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
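&lt;p&gt;As a sanity-check, the cron expression passed to &lt;code&gt;--schedule&lt;/code&gt; decomposes field-by-field as follows (plain Python, nothing Bodywork-specific):&lt;/p&gt;

```python
# decode the cron expression used above: "45 11 * * 5"
fields = "45 11 * * 5".split()
labels = ["minute", "hour", "day-of-month", "month", "day-of-week"]
schedule = dict(zip(labels, fields))

assert schedule["minute"] == "45"
assert schedule["hour"] == "11"
assert schedule["day-of-week"] == "5"  # 5 maps to Friday in standard cron
```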

&lt;h2 id="wrap-up"&gt;Wrap-Up&lt;/h2&gt;
&lt;p&gt;In this second part we have gone from a skeleton “Hello, Production!” deployment to a fully-functional train-and-deploy pipeline that automates re-training and re-deployment in a production environment, on a periodic basis. We have factored-out common code so that it can be re-used across projects, and discussed various strategies for developing automated tests for both stages of the pipeline, ensuring that subsequent modifications can be reliably integrated and deployed with relative&amp;nbsp;ease.&lt;/p&gt;
&lt;h2 id="appendix"&gt;Appendix&lt;/h2&gt;
&lt;p&gt;For&amp;nbsp;reference.&lt;/p&gt;
&lt;h3 id="the-dataset-class"&gt;The &lt;code&gt;Dataset&lt;/code&gt; Class&lt;/h3&gt;
&lt;p&gt;Reproduced from the &lt;a href="https://github.com/bodywork-ml/bodywork-pipeline-utils"&gt;bodywork-pipeline-utils&lt;/a&gt; package, which is available to download from &lt;a href="https://pypi.org/project/bodywork-pipeline-utils/"&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tempfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NamedTuple&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_parquet&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws.artefacts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;find_latest_artefact_on_s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;make_timestamped_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;put_file_to_s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Container for downloaded datasets and associated metadata.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_latest_csv_dataset_from_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Get the latest CSV dataset from S3.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        bucket: S3 bucket to look in.&lt;/span&gt;
&lt;span class="sd"&gt;        folder: Folder within bucket to limit search, defaults to &amp;quot;&amp;quot;.&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        Dataset object.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;artefact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;find_latest_artefact_on_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obj_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_latest_parquet_dataset_from_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Get the latest Parquet dataset from S3.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        bucket: S3 bucket to look in.&lt;/span&gt;
&lt;span class="sd"&gt;        folder: Folder within bucket to limit search, defaults to &amp;quot;&amp;quot;.&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        Dataset object.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;artefact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;find_latest_artefact_on_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obj_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put_csv_dataset_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filename_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_datetime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Upload DataFrame to S3 as a CSV file.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        data: The DataFrame to upload.&lt;/span&gt;
&lt;span class="sd"&gt;        filename_prefix: Prefix before datetime filename element.&lt;/span&gt;
&lt;span class="sd"&gt;        ref_datetime: The reference date associated with data.&lt;/span&gt;
&lt;span class="sd"&gt;        bucket: Location on S3 to persist the data.&lt;/span&gt;
&lt;span class="sd"&gt;        folder: Folder within the bucket, defaults to &amp;quot;&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;        kwargs: Keywork arguments to pass to pandas.to_csv.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_timestamped_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;put_file_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put_parquet_dataset_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filename_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_datetime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Upload DataFrame to S3 as a Parquet file.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        data: The DataFrame to upload.&lt;/span&gt;
&lt;span class="sd"&gt;        filename_prefix: Prefix before datetime filename element.&lt;/span&gt;
&lt;span class="sd"&gt;        ref_datetime: The reference date associated with data.&lt;/span&gt;
&lt;span class="sd"&gt;        bucket: Location on S3 to persist the data.&lt;/span&gt;
&lt;span class="sd"&gt;        folder: Folder within the bucket, defaults to &amp;quot;&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;        kwargs: Keywork arguments to pass to pandas.to_csv.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_timestamped_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;put_file_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
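Both upload functions delegate filename construction to <code>make_timestamped_filename</code>, imported from the package's <code>artefacts</code> module. A minimal sketch of what such a helper does is shown below; note that the exact timestamp format is an assumption for illustration, not necessarily what bodywork-pipeline-utils produces.

```python
from datetime import datetime


def make_timestamped_filename(prefix: str, ref_datetime: datetime, ext: str) -> str:
    """Build a filename of the form '<prefix>_<ISO-timestamp>.<ext>'.

    NOTE: the timestamp format here is illustrative - the real helper in
    bodywork-pipeline-utils may format the datetime differently.
    """
    timestamp = ref_datetime.isoformat(timespec="seconds")
    return f"{prefix}_{timestamp}.{ext}"


filename = make_timestamped_filename("dataset", datetime(2022, 11, 7, 12, 30), "csv")
# e.g. "dataset_2022-11-07T12:30:00.csv"
```

Embedding the reference datetime in the object key is what allows <code>find_latest_artefact_on_s3</code> to order artefacts and return the most recent one.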

&lt;h3 id="the-model-class"&gt;The &lt;code&gt;Model&lt;/code&gt; Class&lt;/h3&gt;
&lt;p&gt;Reproduced from the &lt;a href="https://github.com/bodywork-ml/bodywork-pipeline-utils"&gt;bodywork-pipeline-utils&lt;/a&gt; package, which is available to download from &lt;a href="https://pypi.org/project/bodywork-pipeline-utils/"&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;hashlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pickle&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PicklingError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UnpicklingError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tempfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws.artefacts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;find_latest_artefact_on_s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;make_timestamped_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;put_file_to_s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Base class for representing ML models and metadata.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Constructor.&lt;/span&gt;

&lt;span class="sd"&gt;        Args:&lt;/span&gt;
&lt;span class="sd"&gt;            name: Model name.&lt;/span&gt;
&lt;span class="sd"&gt;            model: Trained model object.&lt;/span&gt;
&lt;span class="sd"&gt;            train_dataset: Dataset object used to train the model.&lt;/span&gt;
&lt;span class="sd"&gt;            metadata: Arbitrary model metadata.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_compute_model_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_creation_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_pipeline_git_commit_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GIT_COMMIT_HASH&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;NA&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__eq__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Model quality operator.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_creation_time&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_creation_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_pipeline_git_commit_hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_pipeline_git_commit_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__repr__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Stdout representation.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_timestamp: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_creation_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_hash: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;train_dataset_key: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;train_dataset_hash: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline_git_commit_hash: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_pipeline_git_commit_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__str__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;String representation.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;|&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_type:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;|&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_timestamp:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_creation_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;|&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;model_hash:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;|&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;train_dataset_key:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;|&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;train_dataset_hash:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_train_dataset_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;|&amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipeline_git_commit_hash:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_pipeline_git_commit_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_metadata&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_model_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Compute a hash for a model object.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;model_bytestream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;model_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_bytestream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_hash&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PicklingError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Could not pickle model into bytes before hashing.&amp;quot;&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Could not hash model.&amp;quot;&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;e&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put_model_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Upload model to S3 as a pickle file.&lt;/span&gt;

&lt;span class="sd"&gt;        Args:&lt;/span&gt;
&lt;span class="sd"&gt;            bucket: Location on S3 to persist the data.&lt;/span&gt;
&lt;span class="sd"&gt;            folder: Folder within the bucket, defaults to &amp;quot;&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_timestamped_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_creation_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;pkl&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# ensure all bytes are written before upload&lt;/span&gt;
            &lt;span class="n"&gt;put_file_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_latest_pkl_model_from_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Get the latest model from S3.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        bucket: S3 bucket to look in.&lt;/span&gt;
&lt;span class="sd"&gt;        folder: Folder within bucket to limit search, defaults to &amp;quot;&amp;quot;.&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        Model object.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;artefact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;find_latest_artefact_on_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pkl&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;artefact_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;artefact_bytes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;UnpicklingError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;artefact at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obj_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; could not be unpickled.&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;AttributeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;artefact at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;artefact&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obj_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is not type Model.&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="train_modelpy"&gt;&lt;code&gt;train_model.py&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Reproduced from the &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/tree/part-two"&gt;ml-pipeline-engineering&lt;/a&gt;&amp;nbsp;repository.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;- Download training dataset from AWS S3.&lt;/span&gt;
&lt;span class="sd"&gt;- Prepare data and train model.&lt;/span&gt;
&lt;span class="sd"&gt;- Persist model to AWS S3.&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bodywork_pipeline_utils.aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseEstimator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeRegressor&lt;/span&gt;

&lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU002&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU004&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU005&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;HYPERPARAM_GRID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;random_state&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;criterion&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;absolute_error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;max_depth&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;min_samples_split&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;min_samples_leaf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;configure_logger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Container for features and labels split by test and train sets.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NamedTuple&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Container for the task&amp;#39;s performance metrics.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Main training job.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Starting train-model stage.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_latest_csv_dataset_from_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;datasets&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Retrieved dataset from s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;feature_and_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prepare_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_and_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature_and_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Trained model: r-squared=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, &amp;quot;&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;MAE=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Metrics breached warning threshold - check for drift.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persist_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Model serialised and persisted to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r-squared metric (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;) is below deployment &amp;quot;&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric_error_threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Split the data into features and labels for training and testing.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskMetrics&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Train a model and compute performance metrics.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;grid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DecisionTreeRegressor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hyperparam_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;refit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;best_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;
    &lt;span class="n"&gt;y_test_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;performance_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TaskMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;performance_metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_trained_model_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeatureAndLabels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Verify that a trained model passes basic logical expectations.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;orders_placed_sensitivity_checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;]]))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;orders_placed_sensitivity_checks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;hours_to_dispatch predictions do not increase with orders_placed&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;test_set_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;negative hours_to_dispatch predictions found for test set&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_set_predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;outlier hours_to_dispatch predictions found for test set&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Trained model failed verification: &amp;quot;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;, &amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues_detected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Create features for training model.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PRODUCT_CODE_MAP&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;persist_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskMetrics&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Persist the model and metadata to S3.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;r_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;mean_absolute_error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;wrapped_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;time-to-dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wrapped_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_model_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;models&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s3_location&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;
        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ne"&gt;IndexError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Invalid arguments passed to train_model.py. &amp;quot;&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Expected S3_BUCKET R_SQUARED_ERROR_THRESHOLD R_SQUARED_WARNING_THRESHOLD, &amp;quot;&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;where all thresholds must be in the range [0, 1].&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r2_metric_error_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r2_metric_warning_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;HYPERPARAM_GRID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Error encountered when training model - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="machine-learning-engineering"></category><category term="python"></category><category term="machine-learning"></category><category term="mlops"></category><category term="kubernetes"></category><category term="bodywork"></category></entry><entry><title>Best Practices for Engineering ML Pipelines - Part 1</title><link href="https://alexioannides.github.io/2021/03/03/best-practices-for-engineering-ml-pipelines-part-1/" rel="alternate"></link><published>2021-03-03T00:00:00+00:00</published><updated>2021-03-03T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2021-03-03:/2021/03/03/best-practices-for-engineering-ml-pipelines-part-1/</id><summary type="html">&lt;p&gt;&lt;img alt="ml-pipeline-engineering" src="https://alexioannides.github.io/images/machine-learning-engineering/ml-pipeline-engineering/pipelines-logo.png"&gt;&lt;/p&gt;
&lt;p&gt;This is the first in a series of articles demonstrating how to engineer a machine learning pipeline and deploy it to a production environment. We’re going to assume that a solution to an &lt;span class="caps"&gt;ML&lt;/span&gt; problem already exists within a Jupyter notebook, and that our task is to engineer this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="ml-pipeline-engineering" src="https://alexioannides.github.io/images/machine-learning-engineering/ml-pipeline-engineering/pipelines-logo.png"&gt;&lt;/p&gt;
&lt;p&gt;This is the first in a series of articles demonstrating how to engineer a machine learning pipeline and deploy it to a production environment. We’re going to assume that a solution to an &lt;span class="caps"&gt;ML&lt;/span&gt; problem already exists within a Jupyter notebook, and that our task is to engineer this solution into an operational &lt;span class="caps"&gt;ML&lt;/span&gt; system that can train a model, serve it via a web &lt;span class="caps"&gt;API&lt;/span&gt; and automatically repeat this process on a schedule when new data is made&amp;nbsp;available.&lt;/p&gt;
&lt;p&gt;The focus will be on software engineering and DevOps, as applied to &lt;span class="caps"&gt;ML&lt;/span&gt;, with an emphasis on ‘best practices’. All of the code developed in each part of this project is available on &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering"&gt;GitHub&lt;/a&gt;, with a dedicated branch for each part, so you can explore the code in its various stages of&amp;nbsp;development.&lt;/p&gt;
&lt;p&gt;This first part is focused on how to set up an &lt;span class="caps"&gt;ML&lt;/span&gt; pipeline engineering project and&amp;nbsp;covers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Basic solution&amp;nbsp;architecture.&lt;/li&gt;
&lt;li&gt;How to structure the codebase (and&amp;nbsp;repo).&lt;/li&gt;
&lt;li&gt;Setting-up automated testing and static code analysis&amp;nbsp;tools.&lt;/li&gt;
&lt;li&gt;Making an initial “Hello, Production”&amp;nbsp;deployment.&lt;/li&gt;
&lt;li&gt;Configuring a &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt;&amp;nbsp;pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#reviewing-the-business-problem"&gt;Reviewing the Business&amp;nbsp;Problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reviewing-the-technical-problem"&gt;Reviewing the Technical Problem&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#example-prediction-request-json"&gt;Example Prediction Request &lt;span class="caps"&gt;JSON&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#example-prediction-response-json"&gt;Example Prediction Response &lt;span class="caps"&gt;JSON&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#solution-architecture"&gt;Solution&amp;nbsp;Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#structuring-the-pipeline-project"&gt;Structuring the Pipeline&amp;nbsp;Project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#setting-up-the-local-dev-environment"&gt;Setting-Up the Local Dev&amp;nbsp;Environment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#setting-up-the-testing-framework"&gt;Setting-Up the Testing Framework&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#using-tox-for-test-automation"&gt;Using Tox for Test&amp;nbsp;Automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#testing-manually"&gt;Testing&amp;nbsp;Manually&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#creating-a-deployment-environment"&gt;Creating a Deployment&amp;nbsp;Environment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#configuring-cicd"&gt;Configuring &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#wrapping-up"&gt;Wrapping-Up&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="reviewing-the-business-problem"&gt;Reviewing the Business&amp;nbsp;Problem&lt;/h2&gt;
&lt;p&gt;A manufacturer of industrial spare-parts wants the ability to give its customers an estimate for the time it could take to dispatch an order. This depends on how many existing orders have yet to be processed, such that customers ordering late on a busy day can encounter unexpected delays, which sometimes leads to complaints; this is an exercise in keeping customers happy by managing their&amp;nbsp;expectations.&lt;/p&gt;
&lt;p&gt;Orders are placed on a &lt;span class="caps"&gt;B2B&lt;/span&gt; eCommerce platform that is developed and maintained by the manufacturer’s in-house software engineering team. The product manager for the platform wants the estimated dispatch time to be presented to the customer (through the &lt;span class="caps"&gt;UI&lt;/span&gt;) before they place an&amp;nbsp;order.&lt;/p&gt;
&lt;h2 id="reviewing-the-technical-problem"&gt;Reviewing the Technical&amp;nbsp;Problem&lt;/h2&gt;
&lt;p&gt;A data scientist has worked on this (regression) task and has handed us the &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering/blob/master/notebooks/time_to_dispatch_model.ipynb"&gt;Jupyter notebook&lt;/a&gt; containing their solution. They have concluded that optimal performance can be achieved by training on the preceding week’s orders data, so the model will have to be re-trained and redeployed on a weekly&amp;nbsp;basis.&lt;/p&gt;
&lt;p&gt;At the end of each week, the data engineering team deliver a new tranche of training data, as a &lt;span class="caps"&gt;CSV&lt;/span&gt; file on cloud object storage (&lt;span class="caps"&gt;AWS&lt;/span&gt; S3). The platform engineering team want access to order-dispatch estimates via a web service with a simple &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt;, and have supplied us with an example request and response (reproduced below). The platform and data engineering teams both deploy their systems and services to &lt;span class="caps"&gt;AWS&lt;/span&gt;, and we too are required to deploy our solution (the pipeline) to &lt;span class="caps"&gt;AWS&lt;/span&gt;.&lt;/p&gt;
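To make the weekly data-delivery convention concrete, here is a minimal sketch of how the pipeline might derive the S3 object key for the latest tranche of CSV data. The `datasets/` prefix and file-naming scheme are illustrative assumptions, not taken from the project:

```python
from datetime import date, timedelta


def latest_tranche_key(today: date) -> str:
    """Return a hypothetical S3 object key for the most recent Friday's data."""
    # Friday has weekday index 4; step back to the most recent Friday
    # (or use today's date, if today is a Friday).
    days_since_friday = (today.weekday() - 4) % 7
    last_friday = today - timedelta(days=days_since_friday)
    return f"datasets/time_to_dispatch_{last_friday.isoformat()}.csv"
```

A train-model stage could pass a key like this to `boto3`'s `get_object` to download the latest tranche.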
&lt;h3 id="example-prediction-request-json"&gt;Example Prediction Request &lt;span class="caps"&gt;JSON&lt;/span&gt;&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;112&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="example-prediction-response-json"&gt;Example Prediction Response &lt;span class="caps"&gt;JSON&lt;/span&gt;&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5.321&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.1&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
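Taken together, the two payloads define a simple functional contract for the prediction service. A minimal sketch of that contract — the `predict` function and its linear placeholder logic are hypothetical stand-ins, not the service implementation developed later in the series:

```python
def predict(request: dict) -> dict:
    """Map a prediction request payload to a response payload."""
    # Placeholder estimate only: a real service would load the trained
    # model from S3 and call its predict method instead.
    est_hours = 0.05 * request["orders_placed"]
    return {
        "est_hours_to_dispatch": round(est_hours, 3),
        "model_version": "0.1",
    }


response = predict({"product_code": "SKU001", "orders_placed": 112})
```

Whatever the internals, the service must accept the request fields and return exactly the response fields shown above.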

&lt;h2 id="solution-architecture"&gt;Solution&amp;nbsp;Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img alt="architecture" src="https://bodywork-media.s3.eu-west-2.amazonaws.com/eng-ml-pipes/pt1/scope_and_context.png"&gt;&lt;/p&gt;
&lt;p&gt;The architecture for the target solution is outlined above - the workflow is as&amp;nbsp;follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every Friday night at 2300 a new batch of training data is added to an S3 bucket in &lt;span class="caps"&gt;CSV&lt;/span&gt;&amp;nbsp;format.&lt;/li&gt;
&lt;li&gt;After the new data arrives, a pipeline needs to be triggered that will train a new model and then deploy it, tearing down the previous prediction service in the process (with zero downtime&amp;nbsp;in-between).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The pipeline will be split into two stages, each of which will be implemented as an executable Python&amp;nbsp;module:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;train model&lt;/strong&gt; - downloads the latest tranche of data from object storage, trains a model and then persists the model to object&amp;nbsp;storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;serve model&lt;/strong&gt; - downloads the latest trained model and then starts a web server that exposes a &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; endpoint that serves requests for dispatch duration&amp;nbsp;predictions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The pipeline will be deployed in containers to &lt;span class="caps"&gt;AWS&lt;/span&gt; &lt;span class="caps"&gt;EKS&lt;/span&gt; (managed Kubernetes cluster), using &lt;a href="https://bodywork.readthedocs.io/en/latest/"&gt;Bodywork&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="structuring-the-pipeline-project"&gt;Structuring the Pipeline&amp;nbsp;Project&lt;/h2&gt;
&lt;p&gt;The files in the &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering"&gt;project’s git repository&lt;/a&gt; are organised as&amp;nbsp;follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;root/
 |-- .circleci/
     |-- config.yml
 |-- notebooks/    
     |-- time_to_dispatch_model.ipynb
     |-- requirements_nb.txt
 |-- pipeline/
     |-- __init__.py
     |-- serve_model.py
     |-- train_model.py
     |-- utils.py
 |-- tests/
     |-- __init__.py
     |-- test_train_model.py
     |-- test_serve_model.py
 |-- requirements_cicd.txt
 |-- requirements_pipe.txt
 |-- flake8.ini
 |-- mypy.ini
 |-- tox.ini
 |-- bodywork.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.circleci/config.yml&lt;/code&gt; contains the configuration for the project’s &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; pipeline (using &lt;a href="https://circleci.com"&gt;CircleCI&lt;/a&gt;). We&amp;#8217;ll discuss this in more depth later&amp;nbsp;on.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;notebooks/*&lt;/code&gt; - has all of the Jupyter notebooks detailing the &lt;span class="caps"&gt;ML&lt;/span&gt; solution to the business&amp;nbsp;problem.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pipeline/*&lt;/code&gt; has all Python modules that define the&amp;nbsp;pipeline.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tests/*&lt;/code&gt; contains Python modules defining automated tests for the&amp;nbsp;pipeline.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements_cicd.txt&lt;/code&gt; lists the Python packages required by the &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; pipeline - e.g. for running tests and deploying the&amp;nbsp;pipeline.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements_pipe.txt&lt;/code&gt; lists the Python packages required by the pipeline - e.g. Scikit-Learn, FastAPI,&amp;nbsp;etc.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;flake8.ini&lt;/code&gt; &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;code&gt;mypy.ini&lt;/code&gt; are configuration files for &lt;a href="https://flake8.pycqa.org/en/latest/#"&gt;Flake8&lt;/a&gt; code style enforcement and &lt;a href="https://mypy.readthedocs.io/en/stable/"&gt;MyPy&lt;/a&gt; static type&amp;nbsp;checking.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tox.ini&lt;/code&gt; provides configuration for the &lt;a href="https://tox.readthedocs.io/en/latest/index.html"&gt;Tox&lt;/a&gt; test automation&amp;nbsp;framework.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bodywork.yaml&lt;/code&gt; is the &lt;a href="https://bodywork.readthedocs.io/en/latest/"&gt;Bodywork&lt;/a&gt; deployment configuration&amp;nbsp;file.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="setting-up-the-local-dev-environment"&gt;Setting-Up the Local Dev&amp;nbsp;Environment&lt;/h2&gt;
&lt;p&gt;We’ve split the various Python package requirements into separate&amp;nbsp;files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;requirements_pipe.txt&lt;/code&gt; contains the packages required by the&amp;nbsp;pipeline.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements_cicd.txt&lt;/code&gt; contains the packages required by the &lt;span class="caps"&gt;CICD&lt;/span&gt;&amp;nbsp;pipeline.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;notebooks/requirements_nb.txt&lt;/code&gt; contains the packages required to run the&amp;nbsp;notebook.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’re planning to deploy the pipeline using Bodywork, which currently targets the Python 3.9 runtime, so we create a Python 3.9 virtual environment in which to install all&amp;nbsp;requirements.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3.9 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements_pipe.txt
$ pip install -r requirements_cicd.txt
$ pip install -r notebooks/requirements_nb.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
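Because the pipeline is pinned to the Python 3.9 runtime, a mismatch between the local interpreter and the deployment target can cause subtle issues. As an illustrative sketch (the helper name and warning message below are our own, not part of the project), you could fail fast with a check like:

```python
import sys


def check_python_version(major: int, minor: int) -> bool:
    """Return True if the active interpreter matches the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


if not check_python_version(3, 9):
    print(
        "Warning: pipeline targets Python 3.9, but you are running "
        f"{sys.version_info.major}.{sys.version_info.minor}"
    )
```

Running this at the top of a pipeline module would surface version drift before any harder-to-diagnose dependency errors appear.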

&lt;h2 id="setting-up-the-testing-framework"&gt;Setting-Up the Testing&amp;nbsp;Framework&lt;/h2&gt;
&lt;p&gt;We’re going to use &lt;a href="https://docs.pytest.org/en/6.2.x/"&gt;pytest&lt;/a&gt; to support test development and we’re going to run the tests via the &lt;a href="https://tox.readthedocs.io/en/latest/index.html"&gt;Tox&lt;/a&gt; test automation framework. The best way to get this operational is to write some skeleton code for the pipeline that can be covered by a couple of basic tests. For example, at a trivial level the &lt;code&gt;train_model.py&lt;/code&gt; batch job should produce some basic logs, whose existence we can test for in &lt;code&gt;test_train_model.py&lt;/code&gt;. Taking a Test-Driven Development (&lt;span class="caps"&gt;TDD&lt;/span&gt;) approach, we start with the test in &lt;code&gt;test_train_model.py&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;_pytest.logging&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogCaptureFixture&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pipeline.train_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_main_execution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogCaptureFixture&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;caplog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Starting train-model stage.&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where we use pytest’s &lt;code&gt;caplog&lt;/code&gt; fixture to capture log messages. We now provide the implementation in &lt;code&gt;train_model.py&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pipeline.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configure_logger&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configure_logger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Starting train-model stage.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;configure_logger&lt;/code&gt; configures a Python logger that will be common to both &lt;code&gt;train_model.py&lt;/code&gt; and &lt;code&gt;serve_model.py&lt;/code&gt;. &lt;/p&gt;
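The article doesn’t reproduce `configure_logger` itself, but the log output shown later (timestamp, level, `module.function`, then the message) suggests an implementation along these lines. Treat this as an illustrative sketch, not the project’s actual code:

```python
import logging
import sys


def configure_logger() -> logging.Logger:
    """Return a logger that writes to stdout in a format like:
    '2021-07-05 18:52:24,264 - INFO - train_model.main - Starting train-model stage.'
    """
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated imports
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s - %(levelname)s - %(module)s.%(funcName)s - %(message)s"
            )
        )
        logger.addHandler(handler)
    return logger
```

Because `logging.getLogger` returns the same named logger on every call, both modules share one logger and its single stdout handler.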
&lt;p&gt;Similarly, for the &lt;code&gt;serve_model.py&lt;/code&gt; module, we can write a trivial test for the &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; endpoint in &lt;code&gt;test_serve_model.py&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastapi.testclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pipeline.serve_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;test_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_web_api_returns_valid_response_given_valid_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prediction_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;orders_placed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prediction_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_web_api_returns_error_code_given_invalid_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prediction_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;SKU001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prediction_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_request&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;422&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;value_error.missing&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prediction_response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This loads the FastAPI test client and uses it to verify that sending a request with valid data results in a response with an &lt;span class="caps"&gt;HTTP&lt;/span&gt; status code of &lt;code&gt;200&lt;/code&gt;, while sending invalid data results in an &lt;span class="caps"&gt;HTTP&lt;/span&gt; &lt;code&gt;422&lt;/code&gt; error (see &lt;a href="https://httpstatuses.com"&gt;this reference&lt;/a&gt; for more information on &lt;span class="caps"&gt;HTTP&lt;/span&gt; status codes). In &lt;code&gt;serve_model.py&lt;/code&gt; we implement the code to satisfy these&amp;nbsp;tests,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;uvicorn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;product_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;orders_placed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;est_hours_to_dispatch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;/api/v0.1/time_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTP_200_OK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;time_to_dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;0.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.0.0.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you’re unfamiliar with how FastAPI uses Python type hints and &lt;a href="https://pydantic-docs.helpmanual.io"&gt;Pydantic&lt;/a&gt; to define &lt;span class="caps"&gt;JSON&lt;/span&gt; schema, then take a look at the &lt;a href="https://fastapi.tiangolo.com/python-types/"&gt;FastAPI docs&lt;/a&gt;.&lt;/p&gt;
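Pydantic is what turns the malformed request in the test above into the HTTP 422 response: when a required field is missing, the model raises a `ValidationError`, which FastAPI translates into the error payload. A minimal sketch of this behaviour, independent of FastAPI (reusing the `Data` model from `serve_model.py`):

```python
from pydantic import BaseModel, ValidationError


class Data(BaseModel):
    product_code: str
    orders_placed: float


# a valid payload - note that orders_placed is coerced from int to float
valid = Data(product_code="SKU001", orders_placed=100)

# an invalid payload - orders_placed is missing, so validation fails
try:
    Data(product_code="SKU001", foo=100)
    error_fields = []
except ValidationError as e:
    # collect the names of the fields that failed validation
    error_fields = [err["loc"][0] for err in e.errors()]
```

This is the same check that produces the `value_error.missing` text asserted on in `test_web_api_returns_error_code_given_invalid_data`.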
&lt;p&gt;You can run all tests in the &lt;code&gt;tests&lt;/code&gt; folder&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ pytest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or isolate a specific test using the &lt;code&gt;-k&lt;/code&gt; flag, for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ pytest -k test_web_api_returns_valid_response_given_valid_data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="using-tox-for-test-automation"&gt;Using Tox for Test&amp;nbsp;Automation&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://tox.readthedocs.io/en/latest/index.html"&gt;Tox&lt;/a&gt; is a test automation framework that helps to manage groups of tests, together with isolated environments in which to run them. Configuration for Tox is defined in &lt;code&gt;tox.ini&lt;/code&gt; , which is reproduced&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[tox]&lt;/span&gt;
&lt;span class="na"&gt;envlist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;{py39}_{unit_and_functional_tests,static_code_analysis}&lt;/span&gt;

&lt;span class="k"&gt;[testenv]&lt;/span&gt;
&lt;span class="na"&gt;skip_install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;-rrequirements_cicd.txt&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;-rrequirements_pipe.txt&lt;/span&gt;
&lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;unit_and_functional_tests&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;pytest tests/ --disable-warnings {posargs}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;static_code_analysis&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;mypy --config-file mypy.ini&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;static_code_analysis&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;flake8 --config flake8.ini pipeline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calling Tox from the command&amp;nbsp;line,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ tox
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Will run every set of tests - those defined in the commands tagged with &lt;code&gt;unit_and_functional_tests&lt;/code&gt; and &lt;code&gt;static_code_analysis&lt;/code&gt; - for every chosen environment, which in this case is just Python 3.9 (&lt;code&gt;py39&lt;/code&gt;). This environment will have none of the environment variables or commands that are present in the local shell, unless they’ve been specified (we haven’t), and can only use the packages specified in &lt;code&gt;requirements_cicd.txt&lt;/code&gt; and &lt;code&gt;requirements_pipe.txt&lt;/code&gt;. Individual test-environment pairs can be executed using the &lt;code&gt;-e&lt;/code&gt; flag - for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ tox -e py39_static_code_analysis
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Will only run Flake8 and MyPy (static code analysis tools) and leave out the unit and functional tests. For more information on working with Tox, see the &lt;a href="https://tox.readthedocs.io"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="testing-manually"&gt;Testing&amp;nbsp;Manually&lt;/h3&gt;
&lt;p&gt;Sometimes you just need to test on an &lt;em&gt;ad hoc&lt;/em&gt; basis, by running the modules, setting breakpoints, etc. You can run the batch job in &lt;code&gt;train_model.py&lt;/code&gt; using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;pipeline.train_model
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which should print the following to&amp;nbsp;stdout,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;2021-07-05 18:52:24,264 - INFO - train_model.main - Starting train-model stage.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Similarly, the web &lt;span class="caps"&gt;API&lt;/span&gt; defined in &lt;code&gt;serve_model&lt;/code&gt; can be started&amp;nbsp;with,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python -m pipeline.serve_model
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which should print the following to&amp;nbsp;stdout,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;INFO:     Started server process [21974]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And make the &lt;span class="caps"&gt;API&lt;/span&gt; available for testing locally - e.g., issuing the following request from the command&amp;nbsp;line,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;$ curl http://localhost:8000/api/v0.1/time_to_dispatch \&lt;/span&gt;
&lt;span class="err"&gt;    --request POST \&lt;/span&gt;
&lt;span class="err"&gt;    --header &amp;quot;Content-Type: application/json&amp;quot; \&lt;/span&gt;
&lt;span class="err"&gt;    --data &amp;#39;{&amp;quot;product_code&amp;quot;: &amp;quot;001&amp;quot;, &amp;quot;orders_placed&amp;quot;: 10}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Should&amp;nbsp;return,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.1&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As defined in the tests. FastAPI will also automatically expose the following endpoints on your&amp;nbsp;service:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;http://localhost:8000/docs - &lt;a href="https://en.wikipedia.org/wiki/OpenAPI_Specification"&gt;OpenAPI&lt;/a&gt; documentation for the &lt;span class="caps"&gt;API&lt;/span&gt;, with a &lt;span class="caps"&gt;UI&lt;/span&gt; for&amp;nbsp;testing.&lt;/li&gt;
&lt;li&gt;http://localhost:8000/openapi.json - the &lt;a href="https://json-schema.org"&gt;&lt;span class="caps"&gt;JSON&lt;/span&gt; schema&lt;/a&gt; for the &lt;span class="caps"&gt;API&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="creating-a-deployment-environment"&gt;Creating a Deployment&amp;nbsp;Environment&lt;/h2&gt;
&lt;p&gt;Here at Bodywork &lt;span class="caps"&gt;HQ&lt;/span&gt;, we’re advocates for the &lt;a href="https://blog.thepete.net/blog/2019/10/04/hello-production/"&gt;“Hello, Production”&lt;/a&gt; school of thought, which encourages teams to make the deployment of a skeleton application (such as the trivial pipeline sketched out in this article) one of the first tasks for any new project. As we have written about &lt;a href="https://www.bodyworkml.com/posts/scikit-learn-meet-production"&gt;before&lt;/a&gt;, there are many benefits to taking deployment pains early on in a software development project, and then using the initial deployment skeleton as the basis for rapidly delivering useful functionality into&amp;nbsp;production.&lt;/p&gt;
&lt;p&gt;We’re planning to deploy to Kubernetes using &lt;a href="https://bodywork.readthedocs.io/en/latest/"&gt;Bodywork&lt;/a&gt;, but we appreciate that not everyone has easy access to a Kubernetes cluster for development. If this is your reality, then the next best thing your team can do is start by deploying to a local test cluster, to make sure that the pipeline is at least deployable. You can get started with a single-node cluster on your laptop, using Minikube - see &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#quickstart"&gt;our guide&lt;/a&gt; to get this up-and-running in &lt;strong&gt;under 10 minutes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The full description of the deployment is contained in &lt;code&gt;bodywork.yaml&lt;/code&gt;, which we’ve reproduced&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;
&lt;span class="nt"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;time-to-dispatch&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodyworkml/bodywork-core:3.1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;train_model &amp;gt;&amp;gt; serve_model&lt;/span&gt;
&lt;span class="nt"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/train_model.py&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.25&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;max_completion_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;60&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;serve_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/serve_model.py&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;fastapi==0.65.2&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;uvicorn==0.14.0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.25&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;max_startup_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;90&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;8000&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;ingress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="nt"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;log_level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;INFO&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This describes a deployment with two stages - &lt;code&gt;train-model&lt;/code&gt; and &lt;code&gt;serve-model&lt;/code&gt; - that are executed one after the other, as described in &lt;code&gt;pipeline.DAG&lt;/code&gt;. For more information on how to configure a Bodywork deployment, check out the &lt;a href="https://bodywork.readthedocs.io/en/latest/user_guide/"&gt;User Guide&lt;/a&gt;.&lt;/p&gt;
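The `DAG` string is the whole orchestration layer here: `&gt;&gt;` separates sequential steps, and (per the Bodywork docs) comma-separated stages within a step run in parallel. A toy parser illustrates how such a string maps to an execution plan - this is our own sketch for intuition, not Bodywork’s implementation:

```python
from typing import List


def parse_dag(dag: str) -> List[List[str]]:
    """Split a DAG string into sequential steps, each a list of stages to run in parallel."""
    return [
        [stage.strip() for stage in step.split(",")]
        for step in dag.split(">>")
    ]


plan = parse_dag("train_model >> serve_model")
# plan == [["train_model"], ["serve_model"]]
```

A hypothetical multi-stage pipeline such as `"prep >> train_a, train_b >> serve"` would therefore run `train_a` and `train_b` concurrently, after `prep` and before `serve`.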
&lt;p&gt;Once you have access to a test cluster, configure it for Bodywork&amp;nbsp;deployments,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw configure-cluster
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then deploy the workflow directly from the GitHub repository (so make sure all commits have been pushed to your remote&amp;nbsp;branch),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw create deployment https://github.com/bodywork-ml/ml-pipeline-engineering --branch part-one
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We like to watch our deployments rolling-out using the Kubernetes dashboard, as you can see in the video clip&amp;nbsp;below.&lt;/p&gt;
&lt;div align="center"&gt;
&lt;img src="https://bodywork-media.s3.eu-west-2.amazonaws.com/eng-ml-pipes/pt1/ml-pipeline-engineering.gif"/&gt;
&lt;/div&gt;

&lt;p&gt;Once the deployment has completed successfully, retrieve the details of the prediction&amp;nbsp;service,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw get deployment time-to-dispatch serve-model
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can manually test the deployed prediction endpoint&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ curl http://CLUSTER_IP/time-to-dispatch/serve-model/api/v0.1/time_to_dispatch \
    --request POST \
    --header &amp;quot;Content-Type: application/json&amp;quot; \
    --data &amp;#39;{&amp;quot;product_code&amp;quot;: &amp;quot;001&amp;quot;, &amp;quot;orders_placed&amp;quot;: 10}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which should return the same response as&amp;nbsp;before,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;est_hours_to_dispatch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model_version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.1&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;See our guide to &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#accessing-services"&gt;accessing services&lt;/a&gt; for information on how to determine &lt;code&gt;CLUSTER_IP&lt;/code&gt;.&lt;/p&gt;
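&lt;p&gt;If you prefer to script this smoke test, the same request can be made from Python. The sketch below is our own illustration, not part of the project: the helper names are hypothetical, &lt;code&gt;CLUSTER_IP&lt;/code&gt; is still a placeholder, and only the offline schema check at the bottom is exercised&amp;nbsp;here.&lt;/p&gt;

```python
import json
import urllib.request


def validate_response(payload: str) -> dict:
    """Parse the service's JSON response and check the expected schema."""
    response = json.loads(payload)
    assert set(response) == {"est_hours_to_dispatch", "model_version"}
    assert isinstance(response["est_hours_to_dispatch"], float)
    return response


def request_prediction(cluster_ip: str, product_code: str, orders_placed: int) -> dict:
    """POST a prediction request to the time-to-dispatch service."""
    url = f"http://{cluster_ip}/time-to-dispatch/serve-model/api/v0.1/time_to_dispatch"
    data = json.dumps({"product_code": product_code, "orders_placed": orders_placed})
    request = urllib.request.Request(
        url, data=data.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return validate_response(response.read().decode())


# Offline check against the sample response shown above.
sample = '{"est_hours_to_dispatch": 1.0, "model_version": "0.1"}'
print(validate_response(sample))
```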
&lt;h2 id="configuring-cicd"&gt;Configuring &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt;&lt;/h2&gt;
&lt;div align="center"&gt;
&lt;img src="https://bodywork-media.s3.eu-west-2.amazonaws.com/eng-ml-pipes/pt1/ci_workflow.png"/&gt;
&lt;/div&gt;

&lt;p&gt;Now that the overall structure of the project has been created, all that remains is to put in place the processes required to get new code merged and deployed as quickly and efficiently as possible. The process of automatically testing and merging new code as it is proposed is referred to as Continuous Integration (&lt;span class="caps"&gt;CI&lt;/span&gt;), while deploying new code as soon as it is merged is known as Continuous Deployment (&lt;span class="caps"&gt;CD&lt;/span&gt;). The workflow we intend to impose is outlined in the diagram above.&amp;nbsp;Briefly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pushing changes (commits) to the &lt;code&gt;master&lt;/code&gt; branch of the repository is forbidden. All changes should first be raised as merge (or pull) requests, that have to pass all automated testing and some kind of peer review process (e.g. a code review), before they can be merged to the &lt;code&gt;master&lt;/code&gt; branch.&lt;/li&gt;
&lt;li&gt;Once changes are merged to the master branch, they can be&amp;nbsp;deployed.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here at Bodywork &lt;span class="caps"&gt;HQ&lt;/span&gt; we use &lt;a href="https://github.com/bodywork-ml/ml-pipeline-engineering"&gt;GitHub&lt;/a&gt; and &lt;a href="https://app.circleci.com/pipelines/github/bodywork-ml"&gt;CircleCI&lt;/a&gt; to run this workflow. &lt;a href="https://docs.github.com/en/github/administering-a-repository/defining-the-mergeability-of-pull-requests/about-protected-branches"&gt;Branch protection rules&lt;/a&gt; on GitHub are used to prevent changes being pushed to master, unless automated tests and peer review have been passed. CircleCI is a paid-for &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt; service (with an outrageously generous free-tier) that integrates with GitHub to trigger jobs (such as automated tests) automatically following merge requests, changes to the &lt;code&gt;master&lt;/code&gt; branch, etc. Our CircleCI pipeline is defined in &lt;code&gt;.circleci/config.yml&lt;/code&gt; and reproduced&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2.1&lt;/span&gt;

&lt;span class="nt"&gt;orbs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;aws-eks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;circleci/aws-eks@1.0.3&lt;/span&gt;

&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;run-static-code-analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;circleci/python:3.9&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;checkout&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Installing Python dependencies&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pip install -r requirements_cicd.txt&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Running tests&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;tox -e py39_static_code_analysis&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;run-tests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;circleci/python:3.9&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;checkout&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Installing Python dependencies&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pip install -r requirements_cicd.txt&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Running tests&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;tox -e py39_unit_and_functional_tests&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;trigger-bodywork-deployment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-eks/python&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;3.9&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;aws-eks/update-kubeconfig-with-authenticator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cluster-name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodywork-dev&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;checkout&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Installing Python dependencies&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pip install -r requirements_cicd.txt&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Trigger Deployment&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodywork create deployment https://github.com/bodywork-ml/ml-pipeline-engineering --branch master&lt;/span&gt;

&lt;span class="nt"&gt;workflows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;test-build-deploy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run-static-code-analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="nt"&gt;ignore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;master&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run-tests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;run-static-code-analysis&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="nt"&gt;ignore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;master&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;trigger-bodywork-deployment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="nt"&gt;only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Although this configuration file is specific to CircleCI, it will be easily recognisable to anyone who’s ever worked with similar services such as &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt;, &lt;a href="https://about.gitlab.com"&gt;GitLab &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt;&lt;/a&gt;, &lt;a href="https://travis-ci.org"&gt;Travis &lt;span class="caps"&gt;CI&lt;/span&gt;&lt;/a&gt;, etc. In essence, it defines the&amp;nbsp;following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Three separate jobs: &lt;code&gt;run-static-code-analysis&lt;/code&gt;, &lt;code&gt;run-tests&lt;/code&gt; and &lt;code&gt;trigger-bodywork-deployment&lt;/code&gt;. Each of these runs in its own Docker container, with the project’s GitHub repo checked-out and any Python dependencies installed. The &lt;code&gt;trigger-bodywork-deployment&lt;/code&gt; job is set to run on a custom &lt;span class="caps"&gt;AWS&lt;/span&gt;-managed image (or ‘Orb’), that comes with additional tools for working with &lt;span class="caps"&gt;AWS&lt;/span&gt;’s &lt;span class="caps"&gt;EKS&lt;/span&gt; (managed Kubernetes) service, which is our ultimate deployment&amp;nbsp;target.&lt;/li&gt;
&lt;li&gt;A workflow that is triggered upon every merge request: &lt;code&gt;run-static-code-analysis&lt;/code&gt; is first executed, which runs &lt;code&gt;tox -e py39_static_code_analysis&lt;/code&gt;. If this passes, then the &lt;code&gt;run-tests&lt;/code&gt; job is executed, which runs &lt;code&gt;tox -e py39_unit_and_functional_tests&lt;/code&gt;. If this also passes, then CircleCI will mark this workflow as ‘passed’ and report this back to GitHub (see&amp;nbsp;below).&lt;/li&gt;
&lt;li&gt;A workflow that is triggered upon every merge to &lt;code&gt;master&lt;/code&gt;: &lt;code&gt;trigger-bodywork-deployment&lt;/code&gt; is the only job in this pipeline, which uses Bodywork to deploy the latest pipeline (using rolling updates to maintain service&amp;nbsp;availability).&lt;/li&gt;
&lt;/ul&gt;
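&lt;p&gt;For teams on GitHub Actions, the same three-job workflow can be sketched as below. This is a hedged translation of our CircleCI config, not part of the project: the job names mirror the config above, and the &lt;span class="caps"&gt;EKS&lt;/span&gt; kubeconfig setup (handled by the Orb on CircleCI) is assumed to be done via repository secrets and an explicit step of your&amp;nbsp;own.&lt;/p&gt;

```yaml
name: test-build-deploy

on:
  pull_request:
    branches: [master]
  push:
    branches: [master]

jobs:
  run-static-code-analysis:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install -r requirements_cicd.txt
      - run: tox -e py39_static_code_analysis

  run-tests:
    if: github.event_name == 'pull_request'
    needs: run-static-code-analysis
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install -r requirements_cicd.txt
      - run: tox -e py39_unit_and_functional_tests

  trigger-bodywork-deployment:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install -r requirements_cicd.txt
      # kubeconfig setup for EKS (e.g. aws eks update-kubeconfig) assumed here
      - run: bodywork create deployment https://github.com/bodywork-ml/ml-pipeline-engineering --branch master
```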
&lt;div align="center"&gt;
&lt;img src="https://bodywork-media.s3.eu-west-2.amazonaws.com/eng-ml-pipes/pt1/github_pr.png"/&gt;
&lt;/div&gt;

&lt;h2 id="wrapping-up"&gt;Wrapping-Up&lt;/h2&gt;
&lt;p&gt;In the first part of this project we have expended a lot of effort to lay the foundations for the work that is to come - developing the model training job, the prediction service and deploying these to a production environment where they will need to be monitored. Thanks to automated tests and &lt;span class="caps"&gt;CI&lt;/span&gt;/&lt;span class="caps"&gt;CD&lt;/span&gt;, our team will be able to quickly iterate towards a well-engineered solution, with results that can be demonstrated to stakeholders early&amp;nbsp;on.&lt;/p&gt;</content><category term="machine-learning-engineering"></category><category term="python"></category><category term="machine-learning"></category><category term="mlops"></category><category term="kubernetes"></category><category term="bodywork"></category></entry><entry><title>Deploying ML Models with Bodywork</title><link href="https://alexioannides.github.io/2020/12/01/deploying-ml-models-with-bodywork/" rel="alternate"></link><published>2020-12-01T00:00:00+00:00</published><updated>2020-12-01T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2020-12-01:/2020/12/01/deploying-ml-models-with-bodywork/</id><summary type="html">&lt;p&gt;Tags: python, machine-learning, mlops, kubernetes,&amp;nbsp;bodywork&lt;/p&gt;
&lt;p&gt;&lt;img alt="bodywork_logo" src="https://alexioannides.github.io/images/machine-learning-engineering/bodywork/bodywork-cli.png"&gt;&lt;/p&gt;
&lt;p&gt;Solutions to &lt;span class="caps"&gt;ML&lt;/span&gt; problems are usually first developed in Jupyter notebooks. We are then faced with an altogether different problem - how to engineer these notebook solutions into your products and systems and continue to maintain their performance through time, after new data is&amp;nbsp;generated …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Tags: python, machine-learning, mlops, kubernetes,&amp;nbsp;bodywork&lt;/p&gt;
&lt;p&gt;&lt;img alt="bodywork_logo" src="https://alexioannides.github.io/images/machine-learning-engineering/bodywork/bodywork-cli.png"&gt;&lt;/p&gt;
&lt;p&gt;Solutions to &lt;span class="caps"&gt;ML&lt;/span&gt; problems are usually first developed in Jupyter notebooks. We are then faced with an altogether different problem - how to engineer these notebook solutions into your products and systems and continue to maintain their performance through time, after new data is&amp;nbsp;generated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-is-this-tutorial-going-to-teach-me"&gt;What is this Tutorial Going to Teach&amp;nbsp;Me?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-is-mlops-getting-so-much-attention"&gt;Why is MLOps Getting so Much&amp;nbsp;Attention?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ml-deployment-with-bodywork"&gt;&lt;span class="caps"&gt;ML&lt;/span&gt; Deployment with&amp;nbsp;Bodywork&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#before-we-start"&gt;Before we&amp;nbsp;Start&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-ml-task"&gt;The &lt;span class="caps"&gt;ML&lt;/span&gt;&amp;nbsp;Task&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-continuous-training-pipeline"&gt;A Continuous Training&amp;nbsp;Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#configuring-the-training-stage"&gt;Configuring the Training&amp;nbsp;Stage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#configuring-the-prediction-service"&gt;Configuring the Prediction&amp;nbsp;Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#configuring-the-pipeline"&gt;Configuring the&amp;nbsp;Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#deploying-the-pipeline"&gt;Deploying the&amp;nbsp;Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#testing-the-api"&gt;Testing the &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#scheduling-the-pipeline"&gt;Scheduling the&amp;nbsp;Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#cleaning-up"&gt;Cleaning&amp;nbsp;Up&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="what-is-this-tutorial-going-to-teach-me"&gt;What is this Tutorial Going to Teach&amp;nbsp;Me?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;How to re-engineer a &lt;span class="caps"&gt;ML&lt;/span&gt; solution from a Jupyter notebook into production-ready Python&amp;nbsp;modules.&lt;/li&gt;
&lt;li&gt;How to develop a two-stage &lt;span class="caps"&gt;ML&lt;/span&gt; pipeline that trains a model and then creates a prediction service to expose it via a &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;How to deploy the pipeline to &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; using &lt;a href="https://github.com/"&gt;GitHub&lt;/a&gt; and &lt;a href="https://bodywork.readthedocs.io/en/latest/"&gt;Bodywork&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;How to configure the pipeline to run on a schedule, so the model is periodically re-trained and re-deployed without the intervention of an &lt;span class="caps"&gt;ML&lt;/span&gt;&amp;nbsp;engineer.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I’ve written at length on the subject of getting machine learning into production - an area that now falls under Machine Learning Operations (MLOps). MLOps has become a hot topic - take my &lt;a href="https://alexioannides.github.io/2019/01/10/deploying-python-ml-models-with-flask-docker-and-kubernetes/"&gt;blog post&lt;/a&gt; on &lt;em&gt;Deploying Python &lt;span class="caps"&gt;ML&lt;/span&gt; Models with Flask, Docker and Kubernetes&lt;/em&gt;, which is accessed by hundreds of &lt;span class="caps"&gt;ML&lt;/span&gt; practitioners every month; or the fact that Thoughtwork’s &lt;a href="https://www.thoughtworks.com/insights/articles/intelligent-enterprise-series-cd4ml"&gt;essay&lt;/a&gt; on &lt;em&gt;Continuous Delivery for &lt;span class="caps"&gt;ML&lt;/span&gt;&lt;/em&gt; has become an essential reference for all &lt;span class="caps"&gt;ML&lt;/span&gt; engineers, together with Google’s &lt;a href="https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html"&gt;paper&lt;/a&gt; on the &lt;em&gt;Hidden Technical Debt in &lt;span class="caps"&gt;ML&lt;/span&gt; Systems&lt;/em&gt;; and MLOps even has its own entry on &lt;a href="https://en.wikipedia.org/wiki/MLOps"&gt;Wikipedia&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="why-is-mlops-getting-so-much-attention"&gt;Why is MLOps Getting so Much&amp;nbsp;Attention?&lt;/h3&gt;
&lt;p&gt;In my opinion, this is because we are at a point where a significant number of organisations have now overcome their data ingestion and engineering problems. They are able to provide their data scientists with the data required to solve business problems using &lt;span class="caps"&gt;ML&lt;/span&gt;, only to find that, as Thoughtworks put&amp;nbsp;it,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“&lt;em&gt;Getting machine learning applications into production is hard&lt;/em&gt;”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To tackle some of the core complexities of MLOps, many &lt;span class="caps"&gt;ML&lt;/span&gt; engineering teams have settled on approaches that are based upon deploying containerised models, usually as RESTful prediction services, to some type of cloud platform. Kubernetes is especially useful for this, as I have &lt;a href="https://alexioannides.github.io/2019/01/10/deploying-python-ml-models-with-flask-docker-and-kubernetes/"&gt;written about before&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="ml-deployment-with-bodywork"&gt;&lt;span class="caps"&gt;ML&lt;/span&gt; Deployment with&amp;nbsp;Bodywork&lt;/h3&gt;
&lt;p&gt;Running &lt;span class="caps"&gt;ML&lt;/span&gt; code in containers has become a common pattern to guarantee reproducibility between what has been developed and what is deployed in production&amp;nbsp;environments.&lt;/p&gt;
&lt;p&gt;Most &lt;span class="caps"&gt;ML&lt;/span&gt; engineers do not, however, have the time to develop the skills and expertise required to deliver and deploy containerised &lt;span class="caps"&gt;ML&lt;/span&gt; systems into production environments. This requires an understanding of how to build container images, how to push build artefacts to image repositories and how to configure a container orchestration platform to use these, to execute batch jobs and deploy&amp;nbsp;services.&lt;/p&gt;
&lt;p&gt;Developing and maintaining these deployment pipelines is time-consuming. If there are multiple projects - each requiring re-training and re-deployment - then the management of these pipelines will quickly become a large&amp;nbsp;burden.&lt;/p&gt;
&lt;p&gt;This is where Bodywork steps in - it will deliver your project&amp;#8217;s Python modules directly from your Git repository into Docker containers and manage their deployment to a Kubernetes cluster. In other words, Bodywork automates the repetitive tasks that most &lt;span class="caps"&gt;ML&lt;/span&gt; engineers think of as &lt;a href="https://en.wikipedia.org/wiki/DevOps"&gt;DevOps&lt;/a&gt;, allowing them to focus their time on what they do best - i.e., engineering solutions to &lt;span class="caps"&gt;ML&lt;/span&gt;&amp;nbsp;tasks.&lt;/p&gt;
&lt;p&gt;This post serves as a short tutorial on how to use Bodywork to productionise a common pipeline pattern (train-and-deploy), and it will refer to files within a Bodywork project hosted on GitHub - see &lt;a href="https://github.com/bodywork-ml/bodywork-ml-pipeline-project"&gt;bodywork-ml-pipeline-project&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="bodywork_logo" src="https://alexioannides.github.io/images/machine-learning-engineering/bodywork/ml-pipeline.png"&gt;&lt;/p&gt;
&lt;h2 id="before-we-start"&gt;Before we&amp;nbsp;Start&lt;/h2&gt;
&lt;p&gt;If you want to run the examples you will need to &lt;a href="https://bodywork.readthedocs.io/en/latest/installation/"&gt;install Bodywork&lt;/a&gt; on your machine and set up access to Kubernetes (see this &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#quickstart"&gt;Kubernetes Quickstart Guide&lt;/a&gt; for help here). I recommend that you find five minutes to read about the &lt;a href="https://bodywork.readthedocs.io/en/latest/key_concepts/"&gt;key concepts&lt;/a&gt; that Bodywork is built upon, before beginning to work through the examples&amp;nbsp;below.&lt;/p&gt;
&lt;h2 id="the-ml-task"&gt;The &lt;span class="caps"&gt;ML&lt;/span&gt;&amp;nbsp;Task&lt;/h2&gt;
&lt;p&gt;The &lt;span class="caps"&gt;ML&lt;/span&gt; problem we have chosen to use for this example, is the classification of iris plants into one of their three sub-species, given their physical dimensions. It uses the infamous &lt;a href="https://scikit-learn.org/stable/datasets/index.html#iris-dataset"&gt;iris plants dataset&lt;/a&gt; and is an example of a multi-class classification&amp;nbsp;task.&lt;/p&gt;
&lt;p&gt;The Jupyter notebook titled &lt;a href="https://github.com/bodywork-ml/bodywork-ml-pipeline-project/blob/master/notebooks/ml_prototype_work.ipynb"&gt;ml_prototype_work.ipynb&lt;/a&gt;, documents the trivial &lt;span class="caps"&gt;ML&lt;/span&gt; workflow used to arrive at a solution to this task. It trains a Decision Tree classifier and persists the trained model to cloud storage (an &lt;span class="caps"&gt;AWS&lt;/span&gt; bucket). Take five minutes to read through&amp;nbsp;it.&lt;/p&gt;
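&lt;p&gt;The core of that notebook can be condensed into a few lines. The sketch below is a minimal illustration of the train-and-persist step, using the same scikit-learn and joblib libraries as the project - the upload of the artefact to the &lt;span class="caps"&gt;AWS&lt;/span&gt; bucket is omitted and the model is persisted locally&amp;nbsp;instead.&lt;/p&gt;

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset and hold out a test set for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a Decision Tree classifier, as in the notebook.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"test accuracy = {model.score(X_test, y_test):.2f}")

# Persist the trained model - the notebook uploads this artefact to cloud storage.
joblib.dump(model, "iris_tree_classifier.joblib")
```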
&lt;h2 id="a-continuous-training-pipeline"&gt;A Continuous Training&amp;nbsp;Pipeline&lt;/h2&gt;
&lt;p&gt;The two-stage train-and-deploy pipeline is packaged as a &lt;a href="https://github.com/bodywork-ml/bodywork-ml-pipeline-project"&gt;GitHub repository&lt;/a&gt;, and is structured as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;root/
 |-- notebooks/
     |-- ml_prototype_work.ipynb
 |-- pipeline/
     |-- train_model.py
     |-- serve_model.py
 |-- bodywork.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All the configuration for this deployment is held within &lt;code&gt;bodywork.yaml&lt;/code&gt;, whose contents are reproduced&amp;nbsp;below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;

&lt;span class="nt"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodywork-ml-pipeline-project&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodyworkml/bodywork-core:latest&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;stage_1_train_model &amp;gt;&amp;gt; stage_2_scoring_service&lt;/span&gt;

&lt;span class="nt"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;stage_1_train_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/train_model.py&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;boto3==1.21.14&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;joblib==1.1.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pandas==1.4.1&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;scikit-learn==1.0.2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.5&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;max_completion_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;60&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;stage_2_scoring_service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pipeline/serve_model.py&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;flask==2.1.2&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;joblib==1.1.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;numpy==1.22.3&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;scikit-learn==1.0.2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.25&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;max_startup_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;60&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;5000&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;ingress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;

&lt;span class="nt"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;log_level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;INFO&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The remainder of this tutorial is concerned with explaining how the configuration within &lt;code&gt;bodywork.yaml&lt;/code&gt; is used to deploy the pipeline, as defined within the &lt;code&gt;train_model.py&lt;/code&gt; and &lt;code&gt;serve_model.py&lt;/code&gt; Python&amp;nbsp;modules.&lt;/p&gt;
&lt;h2 id="configuring-the-training-stage"&gt;Configuring the Training&amp;nbsp;Stage&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;stages.stage_1_train_model.executable_module_path&lt;/code&gt; parameter points to the executable Python module - &lt;code&gt;train_model.py&lt;/code&gt; - that defines what will happen when the &lt;code&gt;stage_1_train_model&lt;/code&gt; (batch) stage is executed, within a pre-built &lt;a href="https://hub.docker.com/repository/docker/bodyworkml/bodywork-core"&gt;Bodywork container&lt;/a&gt;. This module contains the code required&amp;nbsp;to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;download data from an &lt;span class="caps"&gt;AWS&lt;/span&gt; S3&amp;nbsp;bucket;&lt;/li&gt;
&lt;li&gt;pre-process the data (e.g. extract labels for supervised&amp;nbsp;learning);&lt;/li&gt;
&lt;li&gt;train the model and compute performance metrics;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;persist the model to the same &lt;span class="caps"&gt;AWS&lt;/span&gt; S3 bucket that contains the original&amp;nbsp;data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It can be summarised&amp;nbsp;as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;

&lt;span class="c1"&gt;# other imports&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="n"&gt;DATA_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;http://bodywork-ml-pipeline-project.s3.eu-west-2.amazonaws.com&amp;#39;&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;/data/iris_classification_data.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# other constants&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Main script to be executed.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;download_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATA_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pre_process_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;trained_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;persist_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trained_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# other functions definitions used in main()&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We recommend that you spend five minutes familiarising yourself with the full contents of &lt;a href="https://github.com/bodywork-ml/bodywork-ml-pipeline-project/blob/master/pipeline/train_model.py"&gt;train_model.py&lt;/a&gt;. When Bodywork runs the stage, it will do so in exactly the same way as if you were to&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python train_model.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And so everything defined in &lt;code&gt;main()&lt;/code&gt; will be&amp;nbsp;executed.&lt;/p&gt;
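The helper functions elided from the summary above could look something like the following sketch. The function names match those called in <code>main()</code>, but the bodies and the <code>species</code> column name are illustrative assumptions; only the <code>DecisionTreeClassifier</code> hyper-parameters are taken from the model info returned by the deployed service.

```python
from typing import Tuple

import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def pre_process_data(data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Split the raw dataset into a feature matrix and a label series.

    Assumes the labels live in a column named 'species'.
    """
    labels = data['species']
    features = data.drop(columns=['species'])
    return features, labels


def train_model(features: pd.DataFrame, labels: pd.Series) -> DecisionTreeClassifier:
    """Fit the classifier used by the pipeline."""
    model = DecisionTreeClassifier(class_weight='balanced', random_state=42)
    model.fit(features, labels)
    return model
```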
&lt;p&gt;The &lt;code&gt;stages.stage_1_train_model.requirements&lt;/code&gt; parameter in the &lt;code&gt;bodywork.yaml&lt;/code&gt; file lists the 3rd party Python packages that will be Pip-installed on the pre-built Bodywork container, as required to run the &lt;code&gt;train_model.py&lt;/code&gt; module. In this example we&amp;nbsp;have,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;boto3==1.21.14
joblib==1.1.0
pandas==1.4.1
scikit-learn==1.0.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;boto3&lt;/code&gt; - for interacting with &lt;span class="caps"&gt;AWS&lt;/span&gt;;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;joblib&lt;/code&gt; - for persisting&amp;nbsp;models;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pandas&lt;/code&gt; - for manipulating the raw data;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;scikit-learn&lt;/code&gt; - for training the&amp;nbsp;model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, the remaining parameters in the &lt;code&gt;stages.stage_1_train_model&lt;/code&gt; section of &lt;code&gt;bodywork.yaml&lt;/code&gt; allow us to configure the other key settings for the&amp;nbsp;stage,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;stage_1_train_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;stage_1_train_model/train_model.py&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;boto3==1.21.14&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;joblib==1.1.0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pandas==1.4.1&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;scikit-learn==1.0.2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.5&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;max_completion_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;60&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From which it is clear that we have specified that this stage is a batch stage (as opposed to a service-deployment), together with an estimate of the &lt;span class="caps"&gt;CPU&lt;/span&gt; and memory resources to request from the Kubernetes cluster, the maximum time to allow for completion, and the number of times to&amp;nbsp;retry.&lt;/p&gt;
&lt;h2 id="configuring-the-prediction-service"&gt;Configuring the Prediction&amp;nbsp;Service&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;stages.stage_2_scoring_service.executable_module_path&lt;/code&gt; parameter points to the executable Python module - &lt;code&gt;serve_model.py&lt;/code&gt; - that defines what will happen when the &lt;code&gt;stage_2_scoring_service&lt;/code&gt; (service) stage is executed, within a pre-built Bodywork container. This module contains the code required&amp;nbsp;to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;load the model trained in &lt;code&gt;stage_1_train_model&lt;/code&gt; and persisted to cloud storage;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;start a Flask service to score instances (or rows) of data, sent as &lt;span class="caps"&gt;JSON&lt;/span&gt; to the &lt;span class="caps"&gt;API&lt;/span&gt;&amp;nbsp;endpoint.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We chose to develop the prediction service using &lt;a href="https://flask.palletsprojects.com/en/1.1.x/"&gt;Flask&lt;/a&gt;, but this is &lt;strong&gt;not&lt;/strong&gt; a requirement in any way and you are free to use any frameworks you like - e.g., &lt;a href="https://fastapi.tiangolo.com"&gt;FastAPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The contents of &lt;code&gt;serve_model.py&lt;/code&gt; defines the &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; server and can be summarised&amp;nbsp;as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;

&lt;span class="c1"&gt;# other imports&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="n"&gt;MODEL_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;http://bodywork-ml-pipeline-project.s3.eu-west-2.amazonaws.com/models&amp;#39;&lt;/span&gt;
             &lt;span class="s1"&gt;&amp;#39;/iris_tree_classifier.joblib&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# other constants&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/iris/v1/score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;POST&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Iris species classification API endpoint&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;request_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_features_from_request_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_predictions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;model_info&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;make_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# other functions definitions used in score() and below&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loaded model=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;starting API server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0.0.0.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We recommend that you spend five minutes familiarising yourself with the full contents of &lt;a href="https://github.com/bodywork-ml/bodywork-ml-pipeline-project/blob/master/pipeline/serve_model.py"&gt;serve_model.py&lt;/a&gt;. When Bodywork runs the stage, it will start the server defined by &lt;code&gt;app&lt;/code&gt; and expose the &lt;code&gt;/iris/v1/score&lt;/code&gt; route that is being handled by &lt;code&gt;score()&lt;/code&gt;. Note, that this process has no scheduled end and the stage will be kept up-and-running until it is re-deployed or &lt;a href="https://bodywork.readthedocs.io/en/latest/user_guide/#deleting-services"&gt;deleted&lt;/a&gt;.&lt;/p&gt;
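The <code>make_features_from_request_data()</code> function called by <code>score()</code> has to map the incoming JSON payload to the two-dimensional array that scikit-learn models expect. A minimal sketch, assuming the four iris fields shown in the request example below (the implementation is hypothetical, not the module's actual code):

```python
from typing import Any, Dict

import numpy as np

# field order must match the order of the columns used to train the model
FEATURE_ORDER = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']


def make_features_from_request_data(request_data: Dict[str, Any]) -> np.ndarray:
    """Map a JSON payload to a single-row feature matrix.

    Raises KeyError if any expected field is missing from the request.
    """
    return np.array([[float(request_data[field]) for field in FEATURE_ORDER]])
```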
&lt;p&gt;The &lt;code&gt;stages.stage_2_scoring_service.requirements&lt;/code&gt; parameter in the &lt;code&gt;bodywork.yaml&lt;/code&gt; file lists the 3rd party Python packages that will be Pip-installed on the pre-built Bodywork container, as required to run the &lt;code&gt;serve_model.py&lt;/code&gt; module. In this example we&amp;nbsp;have,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;boto3==1.21.14
joblib==1.1.0
pandas==1.4.1
scikit-learn==1.0.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Flask&lt;/code&gt; - the framework upon which the &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; server is&amp;nbsp;built;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;joblib&lt;/code&gt; - for loading the persisted&amp;nbsp;model;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;numpy&lt;/code&gt; &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;code&gt;scikit-learn&lt;/code&gt; - for working with the &lt;span class="caps"&gt;ML&lt;/span&gt;&amp;nbsp;model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, the remaining parameters in the &lt;code&gt;stages.stage_2_scoring_service&lt;/code&gt; section of &lt;code&gt;bodywork.yaml&lt;/code&gt; allow us to configure the other key settings for the&amp;nbsp;stage,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;stage_2_scoring_service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;executable_module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;stage_2_scoring_service/serve_model.py&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;flask==2.1.2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;joblib==1.1.0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;numpy==1.22.3&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;scikit-learn==1.0.2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;cpu_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.25&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;memory_request_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;max_startup_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;30&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;5000&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;ingress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From which it is clear that this stage will create a service (as opposed to running a batch job). We have also specified an estimate of the &lt;span class="caps"&gt;CPU&lt;/span&gt; and memory resources to request from the Kubernetes cluster, the maximum time to wait for the service to start-up and become &amp;#8216;ready&amp;#8217;, the port to expose, the number of replicas of the server to create behind the cluster-service, and that a route to the service should be created from an externally-facing ingress controller (if one is present in the&amp;nbsp;cluster).&lt;/p&gt;
&lt;h2 id="configuring-the-pipeline"&gt;Configuring the&amp;nbsp;Pipeline&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;pipeline&lt;/code&gt; section of &lt;code&gt;bodywork.yaml&lt;/code&gt; contains the configuration for the&amp;nbsp;pipeline,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodywork-ml-pipeline-project&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bodyworkml/bodywork-core:latest&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;stage_1_train_model &amp;gt;&amp;gt; stage_2_scoring_service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The most important element is the specification of the workflow &lt;span class="caps"&gt;DAG&lt;/span&gt;, which in this instance is simple and will instruct the Bodywork workflow-controller to first run the training stage, and then (if successful) create the prediction&amp;nbsp;service.&lt;/p&gt;
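More complex pipelines can be expressed with the same syntax. As I understand it, Bodywork composes sequential steps with `>>` and runs comma-separated stages within a step in parallel; the stage names below are purely hypothetical:

```yaml
pipeline:
  # prepare data, then train two candidate models in parallel,
  # then deploy a single scoring service once both have succeeded
  DAG: stage_prep_data >> stage_train_model_a, stage_train_model_b >> stage_scoring_service
```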
&lt;h2 id="deploying-the-pipeline"&gt;Deploying the&amp;nbsp;Pipeline&lt;/h2&gt;
&lt;p&gt;To deploy the pipeline and create the prediction service, use the following&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw create deployment &amp;quot;https://github.com/bodywork-ml/bodywork-ml-pipeline-project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which will run the pipeline defined in the default branch of the project&amp;#8217;s remote Git repository (e.g., &lt;code&gt;master&lt;/code&gt;), and stream the logs to stdout -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;========================================== deploying master branch from https://github.com/bodywork-ml/bodywork-ml-pipeline-project ===========================================
[02/21/22 14:50:59] INFO     Creating k8s namespace = bodywork-ml-pipeline-project                                                                                             
[02/21/22 14:51:00] INFO     Creating k8s service account = bodywork-stage                                                                                                     
[02/21/22 14:51:00] INFO     Attempting to execute DAG step = [stage_1_train_model]                                                                                            
[02/21/22 14:51:00] INFO     Creating k8s job for stage = stage-1-train-model  
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="testing-the-api"&gt;Testing the &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;The details of any services associated with the pipeline can be retrieved&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw get deployment &amp;quot;bodywork-ml-pipeline-project&amp;quot; &amp;quot;stage-2-scoring-service&amp;quot;

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Field                ┃ Value                                                                         ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ name                 │ stage-2-scoring-service                                                       │
│ namespace            │ bodywork-ml-pipeline-project                                                  │
│ service_exposed      │ True                                                                          │
│ service_url          │ http://stage-2-scoring-service.bodywork-ml-pipeline-project.svc.cluster.local │
│ service_port         │ 5000                                                                          │
│ available_replicas   │ 2                                                                             │
│ unavailable_replicas │ 0                                                                             │
│ git_url              │ https://github.com/bodywork-ml/bodywork-ml-pipeline-project                   │
│ git_branch           │ master                                                                        │
│ git_commit_hash      │ e9df4b4                                                                       │
│ has_ingress          │ True                                                                          │
│ ingress_route        │ /bodywork-ml-pipeline-project/stage-2-scoring-service                         │
└──────────────────────┴───────────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Services are accessible via the public internet if you have &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#installing-nginx"&gt;installed an ingress controller&lt;/a&gt; within your cluster, and the &lt;code&gt;stages.STAGE_NAME.service.ingress&lt;/code&gt; &lt;a href="#service-deployment-stages"&gt;configuration parameter&lt;/a&gt; is set to &lt;code&gt;true&lt;/code&gt;. If you are using Kubernetes via Minikube and our &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#quickstart"&gt;Kubernetes Quickstart&lt;/a&gt; guide, then this will have been enabled for you. Otherwise, services will only be accessible via &lt;span class="caps"&gt;HTTP&lt;/span&gt; from &lt;strong&gt;within&lt;/strong&gt; the cluster, via the &lt;code&gt;service_url&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Assuming that you are set up to access services from outside the cluster, you can test the endpoint&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ curl http://YOUR_CLUSTERS_EXTERNAL_IP/bodywork-ml-pipeline-project/stage-2-scoring-service/iris/v1/score \
    --request POST \
    --header &amp;quot;Content-Type: application/json&amp;quot; \
    --data &amp;#39;{&amp;quot;sepal_length&amp;quot;: 5.1, &amp;quot;sepal_width&amp;quot;: 3.5, &amp;quot;petal_length&amp;quot;: 1.4, &amp;quot;petal_width&amp;quot;: 0.2}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;See &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#accessing-services"&gt;here&lt;/a&gt; for instructions on how to retrieve &lt;code&gt;YOUR_CLUSTERS_EXTERNAL_IP&lt;/code&gt; if you are using Minikube, otherwise refer to the instructions &lt;a href="https://bodywork.readthedocs.io/en/latest/kubernetes/#connecting-to-the-cluster"&gt;here&lt;/a&gt;. This request ought to&amp;nbsp;return,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;species_prediction&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;setosa&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;probabilities&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;setosa=1.0|versicolor=0.0|virginica=0.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model_info&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;DecisionTreeClassifier(class_weight=&amp;#39;balanced&amp;#39;, random_state=42)&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The format of this response is determined by how the payload has been defined in the &lt;code&gt;stage_2_scoring_service/serve_model.py&lt;/code&gt; module.&lt;/p&gt;
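&lt;p&gt;A client consuming this service will need to unpack the pipe-delimited &lt;code&gt;probabilities&lt;/code&gt; field. As a purely illustrative sketch (this helper is not part of the project), it could be parsed as&amp;nbsp;follows,&lt;/p&gt;

```python
import json

# An example response body, in the shape returned by the scoring service.
response_body = (
    '{"species_prediction": "setosa", '
    '"probabilities": "setosa=1.0|versicolor=0.0|virginica=0.0"}'
)


def parse_probabilities(probabilities: str) -> dict:
    """Unpack a 'class=prob|class=prob|...' string into a dict of floats."""
    return {
        label: float(prob)
        for label, prob in (pair.split("=") for pair in probabilities.split("|"))
    }


response = json.loads(response_body)
print(parse_probabilities(response["probabilities"]))
# {'setosa': 1.0, 'versicolor': 0.0, 'virginica': 0.0}
```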
&lt;h2 id="scheduling-the-pipeline"&gt;Scheduling the&amp;nbsp;Pipeline&lt;/h2&gt;
&lt;p&gt;If you&amp;#8217;re happy with the results of this test deployment, you can then schedule the pipeline to run on the cluster. For example, to set up the workflow to run every day at midnight, use the following&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw create cronjob &amp;quot;https://github.com/bodywork-ml/bodywork-ml-pipeline-project&amp;quot; \
    --name &amp;quot;daily&amp;quot; \
    --schedule &amp;quot;0 0 * * *&amp;quot; \
    --retries 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each scheduled pipeline execution will attempt to run the pipeline - i.e., retraining the model and updating the prediction service - as defined by the state of this repository&amp;#8217;s default branch (&lt;code&gt;master&lt;/code&gt;), at the time of execution. To change the branch used for deployment, use the &lt;code&gt;--branch&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;To get the execution history for this cronjob&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw get cronjob &amp;quot;daily&amp;quot; --history
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which should return output along the lines&amp;nbsp;of,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;           run ID = daily-1645446900
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Field           ┃ Value                     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ start_time      │ 2022-02-21 12:35:06+00:00 │
│ completion_time │ 2022-02-21 12:39:32+00:00 │
│ active          │ False                     │
│ succeeded       │ True                      │
│ failed          │ False                     │
└─────────────────┴───────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then to stream the logs from any given cronjob run (e.g. to debug and/or monitor for errors),&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw get cronjobs daily --logs &amp;quot;hourly-1645446900&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="cleaning-up"&gt;Cleaning&amp;nbsp;Up&lt;/h2&gt;
&lt;p&gt;To tear down the prediction service created by the pipeline you can&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ bw delete deployment &amp;quot;bodywork-ml-pipeline-project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="machine-learning-engineering"></category></entry><entry><title>Best Practices for PySpark ETL Projects</title><link href="https://alexioannides.github.io/2019/07/28/best-practices-for-pyspark-etl-projects/" rel="alternate"></link><published>2019-07-28T00:00:00+01:00</published><updated>2019-07-28T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2019-07-28:/2019/07/28/best-practices-for-pyspark-etl-projects/</id><summary type="html">&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data-engineering/pyspark-etl/etl.png"&gt;&lt;/p&gt;
&lt;p&gt;I have often leaned heavily on Apache Spark and the SparkSQL APIs for operationalising any type of batch data-processing &amp;#8216;job&amp;#8217;, within a production environment where handling fluctuating volumes of data reliably and consistently is an on-going business concern. These batch data-processing jobs may involve nothing more than joining data sources and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data-engineering/pyspark-etl/etl.png"&gt;&lt;/p&gt;
&lt;p&gt;I have often leaned heavily on Apache Spark and the SparkSQL APIs for operationalising any type of batch data-processing &amp;#8216;job&amp;#8217;, within a production environment where handling fluctuating volumes of data reliably and consistently is an on-going business concern. These batch data-processing jobs may involve nothing more than joining data sources and performing aggregations, or they may apply machine learning models to generate inventory recommendations - regardless of the complexity, this often reduces to defining &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;Extract, Transform and Load (&lt;span class="caps"&gt;ETL&lt;/span&gt;)&lt;/a&gt; jobs. I&amp;#8217;m a self-proclaimed Pythonista, so I use PySpark for interacting with SparkSQL and for writing and testing all of my &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;scripts.&lt;/p&gt;
&lt;p&gt;This post is designed to be read in parallel with the code in the &lt;a href="https://github.com/AlexIoannides/pyspark-example-project"&gt;&lt;code&gt;pyspark-template-project&lt;/code&gt; GitHub repository&lt;/a&gt;. Together, these constitute what I consider to be a &amp;#8216;best practices&amp;#8217; approach to writing &lt;span class="caps"&gt;ETL&lt;/span&gt; jobs using Apache Spark and its Python (&amp;#8216;PySpark&amp;#8217;) APIs. These &amp;#8216;best practices&amp;#8217; have been learnt over several years in the field, often the result of hindsight and the quest for continuous improvement. I am also grateful to the various contributors to this project for adding their own wisdom to this&amp;nbsp;endeavour.&lt;/p&gt;
&lt;p&gt;I aim to address the following&amp;nbsp;topics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how to structure &lt;span class="caps"&gt;ETL&lt;/span&gt; code in such a way that it can be easily tested and&amp;nbsp;debugged;&lt;/li&gt;
&lt;li&gt;how to pass configuration parameters to a PySpark&amp;nbsp;job;&lt;/li&gt;
&lt;li&gt;how to handle dependencies on other modules and packages;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;what constitutes a &amp;#8216;meaningful&amp;#8217; test for an &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;job.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#pyspark-etl-project-structure"&gt;PySpark &lt;span class="caps"&gt;ETL&lt;/span&gt; Project&amp;nbsp;Structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-structure-of-an-etl-job"&gt;The Structure of an &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;Job&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#passing-configuration-parameters-to-the-etl-job"&gt;Passing Configuration Parameters to the &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;Job&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#packaging-etl-job-dependencies"&gt;Packaging &lt;span class="caps"&gt;ETL&lt;/span&gt; Job&amp;nbsp;Dependencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#running-the-etl-job"&gt;Running the &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;job&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#debugging-spark-jobs-using-start_spark"&gt;Debugging Spark Jobs Using&amp;nbsp;start_spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#automated-testing"&gt;Automated&amp;nbsp;Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#managing-project-dependencies-using-pipenv"&gt;Managing Project Dependencies using Pipenv&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#installing-pipenv"&gt;Installing&amp;nbsp;Pipenv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#installing-this-projects-dependencies"&gt;Installing this Projects&amp;#8217;&amp;nbsp;Dependencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#running-python-and-ipython-from-the-projects-virtual-environment"&gt;Running Python and IPython from the Project&amp;#8217;s Virtual&amp;nbsp;Environment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pipenv-shells"&gt;Pipenv&amp;nbsp;Shells&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#automatic-loading-of-environment-variables"&gt;Automatic Loading of Environment&amp;nbsp;Variables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#summary"&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="pyspark-etl-project-structure"&gt;PySpark &lt;span class="caps"&gt;ETL&lt;/span&gt; Project&amp;nbsp;Structure&lt;/h2&gt;
&lt;p&gt;The basic project structure is as&amp;nbsp;follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;root/
&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;configs/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;etl_config.json
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;dependencies/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;logging.py
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;spark.py
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;jobs/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;etl_job.py
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;tests/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;test_data/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;employees/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;employees_report/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;test_etl_job.py
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;Pipfile
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;Pipfile.lock&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;build_dependencies.sh
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;packages.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The main Python module containing the &lt;span class="caps"&gt;ETL&lt;/span&gt; job (which will be sent to the Spark cluster) is &lt;code&gt;jobs/etl_job.py&lt;/code&gt;. Any external configuration parameters required by &lt;code&gt;etl_job.py&lt;/code&gt; are stored in &lt;span class="caps"&gt;JSON&lt;/span&gt; format in &lt;code&gt;configs/etl_config.json&lt;/code&gt;. Additional modules that support this job can be kept in the &lt;code&gt;dependencies&lt;/code&gt; folder (more on this later). In the project&amp;#8217;s root we include &lt;code&gt;build_dependencies.sh&lt;/code&gt; - a bash script for building these dependencies into a zip-file to be sent to the cluster (&lt;code&gt;packages.zip&lt;/code&gt;). Unit test modules are kept in the &lt;code&gt;tests&lt;/code&gt; folder and small chunks of representative input and output data, to be used with the tests, are kept in the &lt;code&gt;tests/test_data&lt;/code&gt; folder.&lt;/p&gt;
&lt;h2 id="the-structure-of-an-etl-job"&gt;The Structure of an &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;Job&lt;/h2&gt;
&lt;p&gt;In order to facilitate easy debugging and testing, we recommend that the &amp;#8216;Transformation&amp;#8217; step be isolated from the &amp;#8216;Extract&amp;#8217; and &amp;#8216;Load&amp;#8217; steps, into its own function - taking input data arguments in the form of DataFrames and returning the transformed data as a single DataFrame. For example, in the &lt;code&gt;main()&lt;/code&gt; job function from &lt;code&gt;jobs/etl_job.py&lt;/code&gt; we&amp;nbsp;have,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_transformed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The code that surrounds the use of the transformation function in the &lt;code&gt;main()&lt;/code&gt; job function is concerned with Extracting the data, passing it to the transformation function and then Loading (or writing) the results to their ultimate destination. Testing is simplified, as mock or test data can be passed to the transformation function and the results explicitly verified, which would not be possible if all of the &lt;span class="caps"&gt;ETL&lt;/span&gt; code resided in &lt;code&gt;main()&lt;/code&gt; and referenced production data sources and&amp;nbsp;destinations.&lt;/p&gt;
&lt;p&gt;More generally, transformation functions should be designed to be &lt;a href="https://en.wikipedia.org/wiki/Idempotence"&gt;idempotent&lt;/a&gt;. This is a technical way of saying&amp;nbsp;that,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the repeated application of the transformation function to the input data, should have no impact on the fundamental state of output data, until the instance when the input data&amp;nbsp;changes. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the key advantages of idempotent &lt;span class="caps"&gt;ETL&lt;/span&gt; jobs, is that they can be set to run repeatedly (e.g. by using &lt;code&gt;cron&lt;/code&gt; to trigger the &lt;code&gt;spark-submit&lt;/code&gt; command on a pre-defined schedule), rather than having to factor-in potential dependencies on other &lt;span class="caps"&gt;ETL&lt;/span&gt; jobs completing&amp;nbsp;successfully.&lt;/p&gt;
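&lt;p&gt;To make this property concrete, here is a minimal, Spark-free sketch of a pure transformation function (plain Python records stand in for DataFrames) - applying it repeatedly to unchanged input always yields the same&amp;nbsp;output,&lt;/p&gt;

```python
def transform_data(records: list) -> list:
    """A pure transformation: the output depends only on the input records."""
    return [
        {
            "id": record["id"],
            "name": record["name"].title(),
            "salary_band": "high" if record["salary"] > 50_000 else "low",
        }
        for record in sorted(records, key=lambda record: record["id"])
    ]


records = [
    {"id": 2, "name": "jane doe", "salary": 60_000},
    {"id": 1, "name": "john smith", "salary": 40_000},
]

# Repeated application with unchanged input yields identical output.
assert transform_data(records) == transform_data(records)
```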
&lt;h2 id="passing-configuration-parameters-to-the-etl-job"&gt;Passing Configuration Parameters to the &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;Job&lt;/h2&gt;
&lt;p&gt;Although it is possible to pass arguments to &lt;code&gt;etl_job.py&lt;/code&gt;, as you would for any generic Python module running as a &amp;#8216;main&amp;#8217; program - by specifying them after the module&amp;#8217;s filename and then parsing these command line arguments - this can get very complicated, &lt;strong&gt;very quickly&lt;/strong&gt;, especially when there are a lot of parameters (e.g. credentials for multiple databases, table names, &lt;span class="caps"&gt;SQL&lt;/span&gt; snippets, etc.). This also makes debugging the code from within a Python interpreter extremely awkward, as you don&amp;#8217;t have access to the command line arguments that would ordinarily be passed to the code, when calling it from the command&amp;nbsp;line.&lt;/p&gt;
&lt;p&gt;A much more effective solution is to send Spark a separate file - e.g. using the &lt;code&gt;--files configs/etl_config.json&lt;/code&gt; flag with &lt;code&gt;spark-submit&lt;/code&gt; - containing the configuration in &lt;span class="caps"&gt;JSON&lt;/span&gt; format, which can be parsed into a Python dictionary in one line of code with &lt;code&gt;json.loads(config_file_contents)&lt;/code&gt;. Testing the code from within a Python interactive console session is also greatly simplified, as all one has to do to access configuration parameters for testing, is to copy and paste the contents of the file -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;{&amp;quot;field&amp;quot;: &amp;quot;value&amp;quot;}&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This also has the added bonus that the &lt;span class="caps"&gt;ETL&lt;/span&gt; job configuration can be explicitly version controlled within the same project structure, avoiding the risk that configuration parameters escape any type of version control - e.g. because they are passed as arguments in bash scripts written by separate teams, whose responsibility is deploying the code, not writing&amp;nbsp;it.  &lt;/p&gt;
&lt;p&gt;For the exact details of how the configuration file is located, opened and parsed, please see the &lt;code&gt;start_spark()&lt;/code&gt; function in &lt;code&gt;dependencies/spark.py&lt;/code&gt; (also discussed in more detail below), which in addition to parsing the configuration file sent to Spark (and returning it as a Python dictionary), also launches the Spark driver program (the application) on the cluster and retrieves the Spark logger at the same&amp;nbsp;time.&lt;/p&gt;
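&lt;p&gt;The lookup step alone could be sketched as follows - note that this is a simplified, hypothetical version, not the actual implementation in &lt;code&gt;dependencies/spark.py&lt;/code&gt;,&lt;/p&gt;

```python
import json
from pathlib import Path


def load_job_config(search_dir="."):
    """Find a file ending in 'config.json' and parse it, if one exists.

    Simplified sketch of the config lookup in start_spark() - the real
    function also starts the Spark session and retrieves the logger, and
    searches the directory that spark-submit copies --files into.
    """
    config_files = sorted(Path(search_dir).glob("*config.json"))
    if not config_files:
        return None  # signal that no job configuration was supplied
    return json.loads(config_files[0].read_text())
```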
&lt;h2 id="packaging-etl-job-dependencies"&gt;Packaging &lt;span class="caps"&gt;ETL&lt;/span&gt; Job&amp;nbsp;Dependencies&lt;/h2&gt;
&lt;p&gt;In this project, functions that can be used across different &lt;span class="caps"&gt;ETL&lt;/span&gt; jobs are kept in a module called &lt;code&gt;dependencies&lt;/code&gt; and referenced in specific job modules using, for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dependencies.spark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;start_spark&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This package, together with any additional dependencies referenced within it, must be copied to each Spark node for all jobs that use &lt;code&gt;dependencies&lt;/code&gt; to run. This can be achieved in one of several&amp;nbsp;ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;send all dependencies as a &lt;code&gt;zip&lt;/code&gt; archive together with the job, using &lt;code&gt;--py-files&lt;/code&gt; with Spark&amp;nbsp;submit;&lt;/li&gt;
&lt;li&gt;formally package and upload &lt;code&gt;dependencies&lt;/code&gt; to somewhere like the &lt;code&gt;PyPI&lt;/code&gt; archive (or a private version) and then run &lt;code&gt;pip3 install dependencies&lt;/code&gt; on each node;&amp;nbsp;or,&lt;/li&gt;
&lt;li&gt;a combination of manually copying new modules (e.g. &lt;code&gt;dependencies&lt;/code&gt;) to the Python path of each node and using &lt;code&gt;pip3 install&lt;/code&gt; for additional dependencies (e.g. for &lt;code&gt;requests&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Option (1) is by far the easiest and most flexible approach, so we will make use of this. To make this task easier, especially when modules such as &lt;code&gt;dependencies&lt;/code&gt; have their own downstream dependencies (e.g. the &lt;code&gt;requests&lt;/code&gt; package), we have provided the &lt;code&gt;build_dependencies.sh&lt;/code&gt; bash script for automating the production of &lt;code&gt;packages.zip&lt;/code&gt;, given a list of dependencies documented in &lt;code&gt;Pipfile&lt;/code&gt; and managed by the &lt;a href="https://pipenv.readthedocs.io/en/latest/"&gt;Pipenv&lt;/a&gt; Python application (we discuss the use of Pipenv in greater depth&amp;nbsp;below).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note, that dependencies (e.g. NumPy) requiring extensions (e.g. C code) to be compiled locally, will have to be installed manually on each node as part of the node&amp;nbsp;setup.&lt;/p&gt;
&lt;/blockquote&gt;
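&lt;p&gt;The end product of this script is just a flat &lt;code&gt;zip&lt;/code&gt; archive of Python modules. As a rough, hypothetical sketch of the core idea using only the standard library (the real script also bundles the third-party packages resolved from &lt;code&gt;Pipfile&lt;/code&gt;), the archive could be assembled&amp;nbsp;with,&lt;/p&gt;

```python
import zipfile
from pathlib import Path


def build_packages_zip(package_dirs, archive_path="packages.zip"):
    """Bundle local package folders into a zip for spark-submit --py-files."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        for package_dir in package_dirs:
            package_dir = Path(package_dir)
            for module in package_dir.rglob("*.py"):
                # store each module under its package name, so that e.g.
                # 'from dependencies.spark import start_spark' resolves
                arcname = package_dir.name / module.relative_to(package_dir)
                archive.write(module, arcname=str(arcname))
    return archive_path
```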
&lt;h2 id="running-the-etl-job"&gt;Running the &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;job&lt;/h2&gt;
&lt;p&gt;Assuming that the &lt;code&gt;$SPARK_HOME&lt;/code&gt; environment variable points to your local Spark installation folder, then the &lt;span class="caps"&gt;ETL&lt;/span&gt; job can be run from the project&amp;#8217;s root directory using the following command from the&amp;nbsp;terminal,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/bin/spark-submit&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
--master&lt;span class="w"&gt; &lt;/span&gt;local&lt;span class="o"&gt;[&lt;/span&gt;*&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
--packages&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;com.some-spark-jar.dependency:1.0.0&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
--py-files&lt;span class="w"&gt; &lt;/span&gt;packages.zip&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
--files&lt;span class="w"&gt; &lt;/span&gt;configs/etl_config.json&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
jobs/etl_job.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Briefly, the options supplied serve the following&amp;nbsp;purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--master local[*]&lt;/code&gt; - the address of the Spark cluster to start the job on. If you have a Spark cluster in operation (either in single-executor mode locally, or something larger in the cloud) and want to send the job there, then modify this with the appropriate Spark &lt;span class="caps"&gt;IP&lt;/span&gt; - e.g. &lt;code&gt;spark://the-clusters-ip-address:7077&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--packages 'com.some-spark-jar.dependency:1.0.0,...'&lt;/code&gt; - Maven coordinates for any &lt;span class="caps"&gt;JAR&lt;/span&gt; dependencies required by the job (e.g. &lt;span class="caps"&gt;JDBC&lt;/span&gt; driver for connecting to a relational&amp;nbsp;database);&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--files configs/etl_config.json&lt;/code&gt; - the (optional) path to any config file that may be required by the &lt;span class="caps"&gt;ETL&lt;/span&gt;&amp;nbsp;job;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--py-files packages.zip&lt;/code&gt; - archive containing Python dependencies (modules) referenced by the job;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;jobs/etl_job.py&lt;/code&gt; - the Python module file containing the &lt;span class="caps"&gt;ETL&lt;/span&gt; job to&amp;nbsp;execute.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Full details of all possible options can be found &lt;a href="http://spark.apache.org/docs/latest/submitting-applications.html"&gt;here&lt;/a&gt;. Note, that we have left some options to be defined within the job (which is actually a Spark application) - e.g. &lt;code&gt;spark.cores.max&lt;/code&gt; and &lt;code&gt;spark.executor.memory&lt;/code&gt; are defined in the Python script as it is felt that the job should explicitly contain the requests for the required cluster&amp;nbsp;resources.&lt;/p&gt;
&lt;h2 id="debugging-spark-jobs-using-start_spark"&gt;Debugging Spark Jobs Using &lt;code&gt;start_spark&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;It is not practical to test and debug Spark jobs by sending them to a cluster using &lt;code&gt;spark-submit&lt;/code&gt; and examining stack traces for clues on what went wrong. A more productive workflow is to use an interactive console session (e.g. IPython) or a debugger (e.g. the &lt;code&gt;pdb&lt;/code&gt; package in the Python standard library or the Python debugger in Visual Studio Code). In practice, however, it can be hard to test and debug Spark jobs in this way, as they can implicitly rely on arguments that are sent to &lt;code&gt;spark-submit&lt;/code&gt;, which are not available in a console or debug&amp;nbsp;session.&lt;/p&gt;
&lt;p&gt;We wrote the &lt;code&gt;start_spark&lt;/code&gt; function - found in &lt;code&gt;dependencies/spark.py&lt;/code&gt; - to facilitate the development of Spark jobs that are aware of the context in which they are being executed - i.e. as &lt;code&gt;spark-submit&lt;/code&gt; jobs or within an IPython console, etc. The expected location of the Spark and job configuration parameters required by the job is contingent on which execution context has been detected. The docstring for &lt;code&gt;start_spark&lt;/code&gt; gives the precise&amp;nbsp;details,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_spark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;my_spark_app&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local[*]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jar_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
                &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="n"&gt;spark_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{}):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Start Spark session, get Spark logger and load config files.&lt;/span&gt;

&lt;span class="sd"&gt;    Start a Spark session on the worker node and register the Spark&lt;/span&gt;
&lt;span class="sd"&gt;    application with the cluster. Note, that only the app_name argument&lt;/span&gt;
&lt;span class="sd"&gt;    will apply when this is called from a script sent to spark-submit.&lt;/span&gt;
&lt;span class="sd"&gt;    All other arguments exist solely for testing the script from within&lt;/span&gt;
&lt;span class="sd"&gt;    an interactive Python console.&lt;/span&gt;

&lt;span class="sd"&gt;    This function also looks for a file ending in &amp;#39;config.json&amp;#39; that&lt;/span&gt;
&lt;span class="sd"&gt;    can be sent with the Spark job. If it is found, it is opened,&lt;/span&gt;
&lt;span class="sd"&gt;    the contents parsed (assuming it contains valid JSON for the ETL job&lt;/span&gt;
&lt;span class="sd"&gt;    configuration), into a dict of ETL job configuration parameters,&lt;/span&gt;
&lt;span class="sd"&gt;    which are returned as the last element in the tuple returned by&lt;/span&gt;
&lt;span class="sd"&gt;    this function. If the file cannot be found then the return tuple&lt;/span&gt;
&lt;span class="sd"&gt;    only contains the Spark session and Spark logger objects and None&lt;/span&gt;
&lt;span class="sd"&gt;    for config.&lt;/span&gt;

&lt;span class="sd"&gt;    The function checks the enclosing environment to see if it is being&lt;/span&gt;
&lt;span class="sd"&gt;    run from inside an interactive console session or from an&lt;/span&gt;
&lt;span class="sd"&gt;    environment which has a `DEBUG` environment varibale set (e.g.&lt;/span&gt;
&lt;span class="sd"&gt;    setting `DEBUG=1` as an environment variable as part of a debug&lt;/span&gt;
&lt;span class="sd"&gt;    configuration within an IDE such as Visual Studio Code or PyCharm.&lt;/span&gt;
&lt;span class="sd"&gt;    In this scenario, the function uses all available function arguments&lt;/span&gt;
&lt;span class="sd"&gt;    to start a PySpark driver from the local PySpark package as opposed&lt;/span&gt;
&lt;span class="sd"&gt;    to using the spark-submit and Spark cluster defaults. This will also&lt;/span&gt;
&lt;span class="sd"&gt;    use local module imports, as opposed to those in the zip archive&lt;/span&gt;
&lt;span class="sd"&gt;    sent to spark via the --py-files flag in spark-submit. &lt;/span&gt;

&lt;span class="sd"&gt;    Note, if using the local PySpark package on a machine that has the&lt;/span&gt;
&lt;span class="sd"&gt;    SPARK_HOME environment variable set to a local install of Spark,&lt;/span&gt;
&lt;span class="sd"&gt;    then the versions will need to match as PySpark appears to pick-up&lt;/span&gt;
&lt;span class="sd"&gt;    on SPARK_HOME automatically and version conflicts yield errors.&lt;/span&gt;

&lt;span class="sd"&gt;    :param app_name: Name of Spark app.&lt;/span&gt;
&lt;span class="sd"&gt;    :param master: Cluster connection details (defaults to local[*].&lt;/span&gt;
&lt;span class="sd"&gt;    :param jar_packages: List of Spark JAR package names.&lt;/span&gt;
&lt;span class="sd"&gt;    :param files: List of files to send to Spark cluster (master and&lt;/span&gt;
&lt;span class="sd"&gt;        workers).&lt;/span&gt;
&lt;span class="sd"&gt;    :param spark_config: Dictionary of config key-value pairs.&lt;/span&gt;
&lt;span class="sd"&gt;    :return: A tuple of references to the Spark session, logger and&lt;/span&gt;
&lt;span class="sd"&gt;        config dict (only if available).&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="c1"&gt;# ...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;spark_sess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config_dict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, the following code&amp;nbsp;snippet,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start_spark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;my_etl_job&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;jar_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;com.somesparkjar.dependency:1.0.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;configs/etl_config.json&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Will use the arguments provided to &lt;code&gt;start_spark&lt;/code&gt; to set up the Spark job if executed from an interactive console session or debugger, but will look for the same arguments sent via &lt;code&gt;spark-submit&lt;/code&gt; if that is how the job has been&amp;nbsp;executed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note, if you are using the local PySpark package - e.g. if running from an interactive console session or debugger - on a machine that also has the &lt;code&gt;SPARK_HOME&lt;/code&gt; environment variable set to a local install of Spark, then the two versions will need to match, as PySpark appears to pick up on &lt;code&gt;SPARK_HOME&lt;/code&gt; automatically, with version conflicts leading to (unintuitive)&amp;nbsp;errors.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="automated-testing"&gt;Automated&amp;nbsp;Testing&lt;/h2&gt;
&lt;p&gt;In order to test with Spark, we use the &lt;code&gt;pyspark&lt;/code&gt; Python package, which is bundled with the Spark JARs required to programmatically start up and tear down a local Spark instance, on a per-test-suite basis (we recommend using the &lt;code&gt;setUp&lt;/code&gt; and &lt;code&gt;tearDown&lt;/code&gt; methods in &lt;code&gt;unittest.TestCase&lt;/code&gt; to do this - or &lt;code&gt;setUpClass&lt;/code&gt; and &lt;code&gt;tearDownClass&lt;/code&gt; if one session per test-suite is all that is required). Note that using &lt;code&gt;pyspark&lt;/code&gt; to run Spark is an alternative way of developing with Spark, as opposed to using the PySpark shell or &lt;code&gt;spark-submit&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Given that we have chosen to structure our &lt;span class="caps"&gt;ETL&lt;/span&gt; jobs in such a way as to isolate the &amp;#8216;Transformation&amp;#8217; step into its own function (see &amp;#8216;Structure of an &lt;span class="caps"&gt;ETL&lt;/span&gt; job&amp;#8217; above), we are free to feed it a small slice of &amp;#8216;real-world&amp;#8217; production data that has been persisted locally - e.g. in &lt;code&gt;tests/test_data&lt;/code&gt; or some easily accessible network directory - and check it against known results (e.g. computed manually or interactively within a Python interactive console session), as demonstrated in this extract from &lt;code&gt;tests/test_etl_job.py&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# assemble&lt;/span&gt;
&lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_data_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;employees&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;expected_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_data_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;employees_report&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;expected_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;expected_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;expected_avg_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;expected_data&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;steps_to_desk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;avg_steps_to_desk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;avg_steps_to_desk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# act&lt;/span&gt;
&lt;span class="n"&gt;data_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;avg_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;expected_data&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;steps_to_desk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;avg_steps_to_desk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;avg_steps_to_desk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# assert&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_avg_steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertTrue&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
                 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_transformed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
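&lt;p&gt;The same assemble/act/assert shape is independent of Spark. The snippet below is an illustration only - it uses a hypothetical pure-Python &lt;code&gt;transform_data&lt;/code&gt; operating on a list of dicts, not the project&amp;#8217;s Spark&amp;nbsp;version,&lt;/p&gt;

```python
# Illustration of the assemble/act/assert test pattern,
# with a hypothetical pure-Python transform_data (not the project's).
def transform_data(rows, steps_per_floor):
    # add a derived column to each row - stands in for the Spark transform
    return [dict(row, steps_to_desk=row['floor'] * steps_per_floor) for row in rows]

# assemble
input_data = [{'name': 'ada', 'floor': 2}, {'name': 'bob', 'floor': 3}]
expected_rows = 2

# act
data_transformed = transform_data(input_data, 21)

# assert
assert len(data_transformed) == expected_rows
assert data_transformed[0]['steps_to_desk'] == 42
```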

&lt;p&gt;To execute the example unit tests for this project,&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;unittest&lt;span class="w"&gt; &lt;/span&gt;tests/test_*.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you&amp;#8217;re wondering what the &lt;code&gt;pipenv&lt;/code&gt; command is, then read the next&amp;nbsp;section.&lt;/p&gt;
&lt;h2 id="managing-project-dependencies-using-pipenv"&gt;Managing Project Dependencies using&amp;nbsp;Pipenv&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://docs.pipenv.org"&gt;Pipenv&lt;/a&gt; for managing project dependencies and Python environments (i.e. virtual environments). All direct package dependencies (e.g. NumPy may be used in a User Defined Function), as well as all the packages used during development (e.g. PySpark, flake8 for code linting, IPython for interactive console sessions, etc.), are described in the &lt;code&gt;Pipfile&lt;/code&gt;. Their &lt;strong&gt;precise&lt;/strong&gt; downstream dependencies are described and frozen in &lt;code&gt;Pipfile.lock&lt;/code&gt; (generated automatically by Pipenv, given a&amp;nbsp;Pipfile).&lt;/p&gt;
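&lt;p&gt;For orientation, a minimal &lt;code&gt;Pipfile&lt;/code&gt; along these lines might look as follows - the package list and Python version here are illustrative, not the project&amp;#8217;s exact&amp;nbsp;file,&lt;/p&gt;

```toml
[packages]
numpy = "*"

[dev-packages]
pyspark = "*"
flake8 = "*"
ipython = "*"

[requires]
python_version = "3.7"
```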
&lt;h3 id="installing-pipenv"&gt;Installing&amp;nbsp;Pipenv&lt;/h3&gt;
&lt;p&gt;To get started with Pipenv, first install it - assuming that there is a global version of Python available on your system and on the &lt;span class="caps"&gt;PATH&lt;/span&gt;, this can be achieved by running the following&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip3&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;pipenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pipenv is also available to install from many non-Python package managers. For example, on &lt;span class="caps"&gt;OS&lt;/span&gt; X it can be installed using the &lt;a href="https://brew.sh"&gt;Homebrew&lt;/a&gt; package manager, with the following terminal&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;pipenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For more information, including advanced configuration options, see the &lt;a href="https://docs.pipenv.org"&gt;official Pipenv documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="installing-this-projects-dependencies"&gt;Installing this Projects&amp;#8217;&amp;nbsp;Dependencies&lt;/h3&gt;
&lt;p&gt;Make sure that you&amp;#8217;re in the project&amp;#8217;s root directory (the same one in which the &lt;code&gt;Pipfile&lt;/code&gt; resides), and then&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--dev
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will install all of the direct project dependencies as well as the development dependencies (the latter a consequence of the &lt;code&gt;--dev&lt;/code&gt; flag).&lt;/p&gt;
&lt;h3 id="running-python-and-ipython-from-the-projects-virtual-environment"&gt;Running Python and IPython from the Project&amp;#8217;s Virtual&amp;nbsp;Environment&lt;/h3&gt;
&lt;p&gt;In order to continue development in a Python environment that precisely mimics the one the project was initially developed with, use Pipenv from the command line as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;python3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;python3&lt;/code&gt; command could just as well be &lt;code&gt;ipython&lt;/code&gt;, for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;ipython
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will fire-up an IPython console session &lt;em&gt;where the default Python 3 kernel includes all of the direct and development project dependencies&lt;/em&gt; - this is our&amp;nbsp;preference.&lt;/p&gt;
&lt;h3 id="pipenv-shells"&gt;Pipenv&amp;nbsp;Shells&lt;/h3&gt;
&lt;p&gt;Prepending &lt;code&gt;pipenv&lt;/code&gt; to every command you want to run within the context of your Pipenv-managed virtual environment can get very tedious. This can be avoided by entering into a Pipenv-managed&amp;nbsp;shell,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;shell
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is equivalent to &amp;#8216;activating&amp;#8217; the virtual environment; any command will now be executed within the virtual environment. Use &lt;code&gt;exit&lt;/code&gt; to leave the shell&amp;nbsp;session.&lt;/p&gt;
&lt;h3 id="automatic-loading-of-environment-variables"&gt;Automatic Loading of Environment&amp;nbsp;Variables&lt;/h3&gt;
&lt;p&gt;Pipenv will automatically pick-up and load any environment variables declared in the &lt;code&gt;.env&lt;/code&gt; file, located in the package&amp;#8217;s root directory. For example,&amp;nbsp;adding,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;applications/spark-2.3.1/bin
&lt;span class="nv"&gt;DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will enable access to these variables within any Python program - e.g. via a call to &lt;code&gt;os.environ['SPARK_HOME']&lt;/code&gt;. Note that if any security credentials are placed here, then this file &lt;strong&gt;must&lt;/strong&gt; be removed from source control - i.e. add &lt;code&gt;.env&lt;/code&gt; to the &lt;code&gt;.gitignore&lt;/code&gt; file to prevent potential security&amp;nbsp;risks.&lt;/p&gt;
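&lt;p&gt;Pipenv handles this automatically, but the mechanism is easy to picture - it is roughly equivalent to parsing &lt;code&gt;KEY=VALUE&lt;/code&gt; lines and injecting them into &lt;code&gt;os.environ&lt;/code&gt;. A simplified sketch (a hypothetical &lt;code&gt;load_dotenv&lt;/code&gt; helper, ignoring quoting, comments and &lt;code&gt;export&lt;/code&gt;&amp;nbsp;prefixes),&lt;/p&gt;

```python
import os

def load_dotenv(text):
    # parse simple KEY=VALUE lines into os.environ
    # (no quoting, comment or export support - a sketch, not Pipenv's code)
    for line in text.splitlines():
        line = line.strip()
        if line and '=' in line:
            key, _, value = line.partition('=')
            os.environ[key.strip()] = value.strip()

load_dotenv("SPARK_HOME=applications/spark-2.3.1\nDEBUG=1")
print(os.environ['DEBUG'])  # 1
```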
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;The workflow described above, together with the &lt;a href="https://github.com/AlexIoannides/pyspark-example-project"&gt;accompanying Python project&lt;/a&gt;, represents a stable foundation for writing robust &lt;span class="caps"&gt;ETL&lt;/span&gt; jobs, regardless of their complexity and regardless of how the jobs are being executed - e.g. via use of &lt;code&gt;cron&lt;/code&gt; or more sophisticated workflow automation tools, such as &lt;a href="https://airflow.apache.org"&gt;Airflow&lt;/a&gt;. I am always interested in collating and integrating more &amp;#8216;best practices&amp;#8217; - if you have any, please submit them &lt;a href="https://github.com/AlexIoannides/pyspark-example-project/issues"&gt;here&lt;/a&gt;. &lt;/p&gt;</content><category term="data-engineering"></category><category term="data-engineering"></category><category term="data-processing"></category><category term="apache-spark"></category><category term="python"></category></entry><entry><title>Stochastic Process Calibration using Bayesian Inference &amp; Probabilistic Programs</title><link href="https://alexioannides.github.io/2019/01/18/stochastic-process-calibration-using-bayesian-inference-probabilistic-programs/" rel="alternate"></link><published>2019-01-18T00:00:00+00:00</published><updated>2019-01-18T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2019-01-18:/2019/01/18/stochastic-process-calibration-using-bayesian-inference-probabilistic-programs/</id><summary type="html">&lt;p&gt;&lt;img alt="jpeg" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/trading_screen.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Stochastic processes are used extensively throughout quantitative finance - for example, to simulate asset prices in risk models that aim to estimate key risk metrics such as Value-at-Risk (VaR), Expected Shortfall (&lt;span class="caps"&gt;ES&lt;/span&gt;) and Potential Future Exposure (&lt;span class="caps"&gt;PFE&lt;/span&gt;). Estimating the parameters of a stochastic processes - referred to as &amp;#8216;calibration&amp;#8217; in the parlance …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="jpeg" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/trading_screen.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Stochastic processes are used extensively throughout quantitative finance - for example, to simulate asset prices in risk models that aim to estimate key risk metrics such as Value-at-Risk (VaR), Expected Shortfall (&lt;span class="caps"&gt;ES&lt;/span&gt;) and Potential Future Exposure (&lt;span class="caps"&gt;PFE&lt;/span&gt;). Estimating the parameters of a stochastic process - referred to as &amp;#8216;calibration&amp;#8217; in the parlance of quantitative finance - usually&amp;nbsp;involves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;computing the distribution of price returns for a financial&amp;nbsp;asset;&lt;/li&gt;
&lt;li&gt;deriving point-estimates for the mean and volatility of the returns; and&amp;nbsp;then,&lt;/li&gt;
&lt;li&gt;solving a set of simultaneous&amp;nbsp;equations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An excellent and accessible account of these statistical procedures for a variety of commonly used stochastic processes is given in &lt;a href="https://arxiv.org/abs/0812.4210"&gt;&amp;#8216;A Stochastic Processes Toolkit for Risk Management&amp;#8217;, by Damiano Brigo &lt;em&gt;et al.&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The parameter estimates are usually equivalent to Maximum Likelihood (&lt;span class="caps"&gt;ML&lt;/span&gt;) point estimates, and often no effort is made to capture the estimation uncertainty and incorporate it explicitly into the derived risk metrics, since doing so involves additional financial engineering that is burdensome. Instead, parameter estimates are usually adjusted heuristically until the results of &amp;#8216;back-testing&amp;#8217; risk metrics on historical data become&amp;nbsp;&amp;#8216;acceptable&amp;#8217;.&lt;/p&gt;
&lt;p&gt;The purpose of this Python notebook is to demonstrate how Bayesian Inference and Probabilistic Programming (using &lt;a href="https://docs.pymc.io"&gt;&lt;span class="caps"&gt;PYMC3&lt;/span&gt;&lt;/a&gt;) offer an alternative and more powerful approach that can be viewed as a unified framework&amp;nbsp;for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;exploiting any available prior knowledge on market prices (quantitative or&amp;nbsp;qualitative);&lt;/li&gt;
&lt;li&gt;estimating the parameters of a stochastic process;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;naturally incorporating parameter uncertainty into risk&amp;nbsp;metrics. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By simulating a Geometric Brownian Motion (&lt;span class="caps"&gt;GBM&lt;/span&gt;) and then estimating the parameters based on the randomly generated observations, we will quantify the impact of using Bayesian Inference against traditional &lt;span class="caps"&gt;ML&lt;/span&gt; estimation, when the available data is both plentiful and scarce - the latter being a scenario in which Bayesian Inference is shown to be especially&amp;nbsp;powerful.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#imports-and-global-settings"&gt;Imports and Global&amp;nbsp;Settings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#synthetic-data-generation-using-geometric-brownian-motion"&gt;Synthetic Data Generation using Geometric Brownian&amp;nbsp;Motion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-traditional-approach-to-parameter-estimation"&gt;The Traditional Approach to Parameter Estimation&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#parameter-estimation-when-data-is-plentiful"&gt;Parameter Estimation when Data is&amp;nbsp;Plentiful&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#parameter-estimation-when-data-is-scarce"&gt;Parameter Estimation when Data is&amp;nbsp;Scarce&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#parameter-estimation-using-bayesian-inference-and-probabilistic-programming"&gt;Parameter Estimation using Bayesian Inference and Probabilistic Programming&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#selecting-suitable-prior-distributions"&gt;Selecting Suitable Prior Distributions&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#choosing-a-prior-distribution-for-the-expected-return-of-daily-returns"&gt;Choosing a Prior Distribution for the Expected Return of Daily&amp;nbsp;Returns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#choosing-a-prior-distribution-for-the-volatility-of-daily-returns"&gt;Choosing a Prior Distribution for the Volatility of Daily&amp;nbsp;Returns&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference-using-a-probabilistic-program-markov-chain-monte-carlo-mcmc"&gt;Inference using a Probabilistic Program &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Markov Chain Monte Carlo (&lt;span class="caps"&gt;MCMC&lt;/span&gt;)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#making-predictions"&gt;Making Predictions&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#impact-on-risk-metrics-value-at-risk-var"&gt;Impact on Risk Metrics - Value-at-Risk&amp;nbsp;(VaR)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#summary"&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="imports-and-global-settings"&gt;Imports and Global&amp;nbsp;Settings&lt;/h2&gt;
&lt;p&gt;Before we get going in earnest, we follow the convention of declaring all imports at the top of the&amp;nbsp;notebook.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;warnings&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;arviz&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;az&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pymc3&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ndarray&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then the notebook-wide (global) settings that enable in-line plotting, configure Seaborn for visualisation and explicitly ignore warnings (e.g. NumPy&amp;nbsp;deprecations).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filterwarnings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="synthetic-data-generation-using-geometric-brownian-motion"&gt;Synthetic Data Generation using Geometric Brownian&amp;nbsp;Motion&lt;/h2&gt;
&lt;p&gt;We start by defining a function for simulating a single path from a &lt;span class="caps"&gt;GBM&lt;/span&gt; - perhaps the most commonly used stochastic process for modelling the time-series of asset prices. We make use of the &lt;a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion"&gt;following equation&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;$$
\tilde{S_t} = S_0 \exp \left\{ \left(\mu - \frac{\sigma^2}{2} \right) t + \sigma \tilde{W_t}\right\}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$t$ is the time in&amp;nbsp;years;&lt;/li&gt;
&lt;li&gt;$S_0$ is the value of the time-series at the&amp;nbsp;start;&lt;/li&gt;
&lt;li&gt;$\tilde{S_t}$ is the value of the time-series at time&amp;nbsp;$t$;&lt;/li&gt;
&lt;li&gt;$\mu$ is the annualised drift (or expected&amp;nbsp;return);&lt;/li&gt;
&lt;li&gt;$\sigma$ is the annualised standard deviation of the returns;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;$\tilde{W_t}$ is a Brownian&amp;nbsp;motion.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the solution to the following stochastic differential&amp;nbsp;equation,&lt;/p&gt;
&lt;p&gt;$$
d\tilde{S_t} = \mu \tilde{S_t} dt + \sigma \tilde{S_t} d\tilde{W_t}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;For a more in-depth discussion refer to &lt;a href="https://arxiv.org/abs/0812.4210"&gt;&amp;#8216;A Stochastic Processes Toolkit for Risk Management&amp;#8217;, by Damiano Brigo &lt;em&gt;et al.&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
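&lt;p&gt;A quick sanity check on this closed-form solution: with $\sigma = 0$ the Brownian term drops out and the formula reduces to deterministic exponential growth at rate $\mu$ - the parameter values below are chosen purely for&amp;nbsp;illustration,&lt;/p&gt;

```python
import math

# with sigma = 0 the GBM solution collapses to S_t = S_0 * exp(mu * t)
s0, mu, sigma, t = 100.0, 0.05, 0.0, 1.0
w_t = 12.345  # the Brownian motion's value is irrelevant when sigma = 0

s_t = s0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w_t)
print(round(s_t, 3))  # 105.127
```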
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gbm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Generate a time-series using a Geometric Brownian Motion (GBM).&lt;/span&gt;

&lt;span class="sd"&gt;    Yields daily values for the specified number of days.&lt;/span&gt;

&lt;span class="sd"&gt;    :parameter start: The starting value.&lt;/span&gt;
&lt;span class="sd"&gt;    :type start: float&lt;/span&gt;
&lt;span class="sd"&gt;    :parameter mu: Anualised drift.&lt;/span&gt;
&lt;span class="sd"&gt;    :type: float&lt;/span&gt;
&lt;span class="sd"&gt;    :parameter sigma: Annualised volatility.&lt;/span&gt;
&lt;span class="sd"&gt;    :type: float&lt;/span&gt;
&lt;span class="sd"&gt;    :parameter days: The number of days to simulate.&lt;/span&gt;
&lt;span class="sd"&gt;    :type: int&lt;/span&gt;
&lt;span class="sd"&gt;    :return: A time-series of values.&lt;/span&gt;
&lt;span class="sd"&gt;    :rtype: np.ndarray&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;dw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dw&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;s_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s_t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We now choose &lt;em&gt;ex ante&lt;/em&gt; parameter values for an example &lt;span class="caps"&gt;GBM&lt;/span&gt; time-series that we will then estimate using both maximum likelihood and Bayesian&amp;nbsp;Inference.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These are &amp;#8216;reasonable&amp;#8217; parameter choices for a liquid stock in a &amp;#8216;flat&amp;#8217; market - i.e. 0% drift and 15% expected volatility on an annualised basis (the equivalent volatility on a daily basis is ~0.8%). We then take a look at a single simulated time-series over the course of a single year, which we define as 365 days (i.e. ignoring the existence of weekends and bank holidays for the sake of&amp;nbsp;simplicity).&lt;/p&gt;
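&lt;p&gt;As a quick sanity check on the quoted ~0.8% figure, the annualised volatility can be rescaled to a daily value using the square-root-of-time rule. This is a minimal sketch, not part of the original analysis:&lt;/p&gt;

```python
import numpy as np

# Ex-ante parameters from the text: 15% annualised volatility and a
# 365-day year (weekends and bank holidays ignored for simplicity).
sigma_annual = 0.15
dt = 1 / 365

# Volatility scales with the square root of time, so the equivalent
# daily volatility is sigma * sqrt(dt).
sigma_daily = sigma_annual * np.sqrt(dt)  # ~0.00785, i.e. ~0.8%
```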
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;example_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gbm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;example_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_10_0.png"&gt;&lt;/p&gt;
&lt;h2 id="the-traditional-approach-to-parameter-estimation"&gt;The Traditional Approach to Parameter&amp;nbsp;Estimation&lt;/h2&gt;
&lt;p&gt;Traditionally, the parameters are estimated using the empirical mean and standard deviation of the daily logarithmic (or geometric) returns. The reasoning behind this can be seen by re-arranging the above equation for $\tilde{S_t}$ as&amp;nbsp;follows,&lt;/p&gt;
&lt;p&gt;$$
\log \left( \frac{S_t}{S_{t-1}} \right) = \left(\mu - \frac{\sigma^2}{2} \right) \Delta t + \sigma \tilde{\Delta W_t}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;Which implies&amp;nbsp;that,&lt;/p&gt;
&lt;p&gt;$$
\log \left( \frac{S_t}{S_{t-1}} \right) \sim \text{Normal} \left[ \left(\mu - \frac{\sigma^2}{2} \right) \Delta t, \sigma^2  \Delta t \right]&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;Where,&lt;/p&gt;
&lt;p&gt;$$
\Delta t = \frac{1}{365}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;From which it is possible to solve the implied simultaneous equations for $\mu$ and $\sigma$, as functions of the mean and standard deviation of the geometric (i.e. logarithmic) returns. Once again, for a more in-depth discussion we refer the reader to &lt;a href="https://arxiv.org/abs/0812.4210"&gt;&amp;#8216;A Stochastic Processes Toolkit for Risk Management&amp;#8217;, by Damiano Brigo &lt;em&gt;et al.&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
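&lt;p&gt;Concretely, writing $m$ and $s$ for the sample mean and standard deviation of the daily logarithmic returns, inverting the relationships above&amp;nbsp;yields,&lt;/p&gt;
&lt;p&gt;$$
\sigma = s \sqrt{365}, \qquad \mu = 365 \, m + \frac{\sigma^2}{2}&amp;nbsp;$$&lt;/p&gt;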
&lt;h3 id="parameter-estimation-when-data-is-plentiful"&gt;Parameter Estimation when Data is&amp;nbsp;Plentiful&lt;/h3&gt;
&lt;p&gt;An example computation, using the whole time-series generated above (364 observations of daily returns), is shown below. We start by taking a look at the distribution of daily&amp;nbsp;returns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;returns_geo_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returns_geo_full&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_13_0.png"&gt;&lt;/p&gt;
&lt;p&gt;The empirical distribution is relatively Normal in appearance, as expected. We now compute $\mu$ and $\sigma$ using the mean and standard deviation (or volatility) of this&amp;nbsp;distribution.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dist_mean_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;returns_geo_full&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dist_vol_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;returns_geo_full&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;sigma_ml_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist_vol_full&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mu_ml_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist_mean_full&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sigma_ml_full&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;empirical estimate of mu = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mu_ml_full&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.4f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;empirical estimate of sigma = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sigma_ml_full&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.4f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;empirical estimate of mu = 0.0220
empirical estimate of sigma = 0.1423
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can see that the empirical estimate of $\sigma$ is close to the &lt;em&gt;ex ante&lt;/em&gt; parameter value we chose, but that the estimate of $\mu$ is poor - estimating the drift of a stochastic process is notoriously&amp;nbsp;hard.&lt;/p&gt;
&lt;h3 id="parameter-estimation-when-data-is-scarce"&gt;Parameter Estimation when Data is&amp;nbsp;Scarce&lt;/h3&gt;
&lt;p&gt;Very often data is scarce - we may not have 364 observations of geometric returns. To demonstrate the impact this can have on parameter estimation, we sub-sample the distribution of geometric returns by picking 12 returns at random - e.g. to simulate the impact of having only 12 monthly returns on which to base the&amp;nbsp;estimation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;n_observations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We now take a look at the distribution of&amp;nbsp;returns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;returns_geo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;returns_geo_full&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_observations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returns_geo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_20_0.png"&gt;&lt;/p&gt;
&lt;p&gt;And the corresponding empirical parameter&amp;nbsp;estimates.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dist_mean_ml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;returns_geo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dist_vol_ml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;returns_geo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;sigma_ml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist_vol_ml&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mu_ml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist_mean_ml&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sigma_ml&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;empirical estimate of mu = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mu_ml&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.4f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;empirical estimate of sigma = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sigma_ml&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.4f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;empirical estimate of mu = -1.3935
empirical estimate of sigma = 0.1080
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can clearly see that the estimates of &lt;strong&gt;both&lt;/strong&gt; $\mu$ and $\sigma$ are now&amp;nbsp;poor.&lt;/p&gt;
&lt;h2 id="parameter-estimation-using-bayesian-inference-and-probabilistic-programming"&gt;Parameter Estimation using Bayesian Inference and Probabilistic&amp;nbsp;Programming&lt;/h2&gt;
&lt;p&gt;Like statistical data analysis more broadly, the main aim of Bayesian Data Analysis (&lt;span class="caps"&gt;BDA&lt;/span&gt;) is to infer unknown parameters for models of observed data, in order to test hypotheses about the physical processes that lead to the observations. Bayesian data analysis deviates from traditional statistics - on a practical level - when it comes to the explicit assimilation of prior knowledge regarding the uncertainty of the model parameters, into the statistical inference process and overall analysis workflow. To this end, &lt;span class="caps"&gt;BDA&lt;/span&gt; focuses on the posterior&amp;nbsp;distribution,&lt;/p&gt;
&lt;p&gt;$$
p(\Theta | X) = \frac{p(X | \Theta) \cdot p(\Theta)}{p(X)}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;Where,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$\Theta$ is the vector of unknown model parameters, that we wish to&amp;nbsp;estimate; &lt;/li&gt;
&lt;li&gt;$X$ is the vector of observed&amp;nbsp;data;&lt;/li&gt;
&lt;li&gt;$p(X | \Theta)$ is the likelihood function that models the probability of observing the data for a fixed choice of parameters;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;$p(\Theta)$ is the prior distribution of the model&amp;nbsp;parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For an &lt;strong&gt;excellent&lt;/strong&gt; (inspirational) introduction to practical &lt;span class="caps"&gt;BDA&lt;/span&gt;, take a look at &lt;a href="https://xcelab.net/rm/statistical-rethinking/"&gt;Statistical Rethinking by Richard McElreath&lt;/a&gt;, or for a more theoretical treatment try &lt;a href="http://www.stat.columbia.edu/~gelman/book/"&gt;Bayesian Data Analysis by Gelman &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; co.&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We will use &lt;span class="caps"&gt;BDA&lt;/span&gt; to estimate the &lt;span class="caps"&gt;GBM&lt;/span&gt; parameters from our time series with &lt;strong&gt;scarce data&lt;/strong&gt;, to demonstrate the benefits of incorporating prior knowledge into the inference process, and then compare these results with those derived using &lt;span class="caps"&gt;ML&lt;/span&gt; estimation (discussed&amp;nbsp;above).&lt;/p&gt;
&lt;h3 id="selecting-suitable-prior-distributions"&gt;Selecting Suitable Prior&amp;nbsp;Distributions&lt;/h3&gt;
&lt;p&gt;We will choose regularising priors that are also in-line with our prior knowledge of the time-series - that is, priors that place the bulk of their probability mass near zero, but allow for enough variation to make &amp;#8216;reasonable&amp;#8217; parameter values viable for our liquid stock in a &amp;#8216;flat&amp;#8217; (or drift-less)&amp;nbsp;market.&lt;/p&gt;
&lt;p&gt;Note that, in the discussion that follows, we will reason about the priors in terms of our real-world experience of daily price returns - their expected return and volatility - i.e. the mean and standard deviation of our likelihood&amp;nbsp;function.&lt;/p&gt;
&lt;h4 id="choosing-a-prior-distribution-for-the-expected-return-of-daily-returns"&gt;Choosing a Prior Distribution for the Expected Return of Daily&amp;nbsp;Returns&lt;/h4&gt;
&lt;p&gt;We choose a Normal distribution for this prior, centered at 0 (i.e. regularising), with a standard deviation of 0.0001 (i.e. 1 basis-point or 0.01%) - on an annualised basis this puts one standard deviation of drift at ~3.65%, rendering larger drifts improbable and consistent with a market for a liquid stock trading&amp;nbsp;&amp;#8216;flat&amp;#8217;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;prior_mean_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;prior_mean_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0001&lt;/span&gt;

&lt;span class="n"&gt;prior_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean_sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Plotting the prior distribution for the mean return of daily&amp;nbsp;returns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;prior_mean_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.0005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.00001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prior_mean_density&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prior_mean&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prior_mean_x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean_density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_29_0.png"&gt;&lt;/p&gt;
&lt;h4 id="choosing-a-prior-distribution-for-the-volatility-of-daily-returns"&gt;Choosing a Prior Distribution for the Volatility of Daily&amp;nbsp;Returns&lt;/h4&gt;
&lt;p&gt;We choose a positive &lt;a href="https://en.wikipedia.org/wiki/Half-normal_distribution"&gt;Half-Normal distribution&lt;/a&gt; for this prior distribution. Most of the mass is near 0 (i.e. regularising), but with a standard deviation of 0.0188 that corresponds to an expected daily volatility of ~0.015 (or&amp;nbsp;1.5%).&lt;/p&gt;
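&lt;p&gt;The quoted expected daily volatility of ~1.5% follows from the mean of a Half-Normal distribution, which is $\sigma \sqrt{2 / \pi}$ for scale parameter $\sigma$. A quick check, separate from the original analysis:&lt;/p&gt;

```python
import numpy as np

# The mean of a Half-Normal distribution with scale sigma is sigma * sqrt(2 / pi).
prior_vol_sigma = 0.0188
expected_daily_vol = prior_vol_sigma * np.sqrt(2 / np.pi)  # ~0.0150, i.e. ~1.5%
```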
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;prior_vol_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0188&lt;/span&gt;

&lt;span class="n"&gt;prior_vol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HalfNormal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_vol_sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Plotting the prior distribution for volatility of daily&amp;nbsp;returns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;prior_vol_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prior_vol_density&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prior_vol&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                       &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prior_vol_x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_vol_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_vol_density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_33_0.png"&gt;&lt;/p&gt;
&lt;h3 id="inference-using-a-probabilistic-program-markov-chain-monte-carlo-mcmc"&gt;Inference using a Probabilistic Program &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Markov Chain Monte Carlo (&lt;span class="caps"&gt;MCMC&lt;/span&gt;)&lt;/h3&gt;
&lt;p&gt;Performing Bayesian inference usually requires some form of Probabilistic Programming Language (&lt;span class="caps"&gt;PPL&lt;/span&gt;), unless analytical approaches (e.g. based on conjugate prior models) are appropriate for the task at hand. More often than not, PPLs such as &lt;a href="https://docs.pymc.io"&gt;&lt;span class="caps"&gt;PYMC3&lt;/span&gt;&lt;/a&gt; implement Markov Chain Monte Carlo (&lt;span class="caps"&gt;MCMC&lt;/span&gt;) algorithms that allow one to draw samples and make inferences from the posterior distribution implied by the choice of model - the likelihood and prior distributions for its parameters - conditional on the observed&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;We will make use of the default &lt;span class="caps"&gt;MCMC&lt;/span&gt; method in &lt;span class="caps"&gt;PYMC3&lt;/span&gt;&amp;#8217;s &lt;code&gt;sample&lt;/code&gt; function, which is Hamiltonian Monte Carlo (&lt;span class="caps"&gt;HMC&lt;/span&gt;). Those interested in the precise details of the &lt;span class="caps"&gt;HMC&lt;/span&gt; algorithm are directed to the &lt;a href="https://arxiv.org/abs/1701.02434"&gt;excellent paper by Michael Betancourt&lt;/a&gt;. Briefly, &lt;span class="caps"&gt;MCMC&lt;/span&gt; algorithms work by defining multi-dimensional Markovian stochastic processes that, when simulated (using Monte Carlo methods), will eventually converge to a state where successive simulations are equivalent to drawing random samples from the posterior distribution of the model we wish to&amp;nbsp;estimate.&lt;/p&gt;
&lt;p&gt;The posterior distribution has one dimension for each model parameter, so we can then use the distribution of samples for each parameter to infer the range of possible values and/or compute point estimates (e.g. by taking the mean of all&amp;nbsp;samples).&lt;/p&gt;
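&lt;p&gt;As an illustration of this last step, the sketch below computes a point estimate and a credible interval from an array of posterior draws. The draws here are synthetic stand-ins for a fitted trace - in practice they would come from the sampler:&lt;/p&gt;

```python
import numpy as np

# Hypothetical posterior draws standing in for a fitted MCMC trace -
# generated randomly here, purely for illustration.
rng = np.random.default_rng(42)
posterior_vol = np.abs(rng.normal(0.0, 0.008, size=20_000))

# Point estimate: the mean over all posterior samples...
point_estimate = posterior_vol.mean()

# ...and a 94% credible interval from the empirical quantiles.
ci_low, ci_high = np.quantile(posterior_vol, [0.03, 0.97])
```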
&lt;p&gt;We start by defining the model we wish to infer - i.e. the probabilistic&amp;nbsp;program.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model_gbm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;model_gbm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prior_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean_sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prior_vol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;volatility&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_vol_sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;likelihood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;daily_returns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior_vol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;returns_geo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the canonical format adopted by Bayesian data analysts, this is expressed mathematically&amp;nbsp;as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model_gbm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;$$
\begin{array}{rcl}
\text{mean} &amp;amp;\sim&amp;amp; \text{Normal}(\mathit{mu}=0,~\mathit{sd}=0.0001) \\
\text{volatility} &amp;amp;\sim&amp;amp; \text{HalfNormal}(\mathit{sd}=0.0188) \\
\text{daily\_returns} &amp;amp;\sim&amp;amp; \text{Normal}(\mathit{mu}=\text{mean},~\mathit{sd}=f(\text{volatility}))
\end{array}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;We now proceed to perform the inference step. For our purposes, we sample two chains in parallel (as we have two &lt;span class="caps"&gt;CPU&lt;/span&gt; cores available for doing so and this effectively doubles the number of samples), allow 5,000 steps for each chain to converge to its steady-state and then sample for a further 10,000 steps - i.e. we generate 20,000 samples from the posterior distribution, assuming that each chain has converged after 5,000&amp;nbsp;steps.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;model_gbm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;njobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;assigning&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;NUTS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampler&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;Initializing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;NUTS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;adapt_diag&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;Multiprocess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nl"&gt;NUTS:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;volatility&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;Sampling&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;chains:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;100&lt;/span&gt;&lt;span class="o"&gt;%|&lt;/span&gt;&lt;span class="err"&gt;██████████&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;30000&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mh"&gt;30000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mh"&gt;00&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mh"&gt;27&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mh"&gt;00&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mh"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1097.48&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then take a look at the marginal parameter distributions inferred by each chain, together with the corresponding trace plots - i.e. the sequential sample-by-sample draws of each chain - to look for&amp;nbsp;&amp;#8216;anomalies&amp;#8217;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_41_0.png"&gt;&lt;/p&gt;
&lt;p&gt;No obvious anomalies can be seen by visual inspection. We now compute the summary statistics for the inference (aggregating the draws from each&amp;nbsp;chain).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;round_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;th&gt;sd&lt;/th&gt;
      &lt;th&gt;mc error&lt;/th&gt;
      &lt;th&gt;hpd 3%&lt;/th&gt;
      &lt;th&gt;hpd 97%&lt;/th&gt;
      &lt;th&gt;eff_n&lt;/th&gt;
      &lt;th&gt;r_hat&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;td&gt;-0.000009&lt;/td&gt;
      &lt;td&gt;0.000102&lt;/td&gt;
      &lt;td&gt;0.000001&lt;/td&gt;
      &lt;td&gt;-0.000201&lt;/td&gt;
      &lt;td&gt;0.000183&lt;/td&gt;
      &lt;td&gt;20191.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;volatility&lt;/th&gt;
      &lt;td&gt;0.007363&lt;/td&gt;
      &lt;td&gt;0.001705&lt;/td&gt;
      &lt;td&gt;0.000016&lt;/td&gt;
      &lt;td&gt;0.004614&lt;/td&gt;
      &lt;td&gt;0.010505&lt;/td&gt;
      &lt;td&gt;15261.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Both values of the Gelman-Rubin statistic (&lt;code&gt;r_hat&lt;/code&gt;) are 1 and the effective number of draws for each marginal parameter distribution (&lt;code&gt;eff_n&lt;/code&gt;) is &amp;gt; 10,000. Thus, we have confidence that the &lt;span class="caps"&gt;MCMC&lt;/span&gt; algorithm has successfully inferred (or explored) the posterior distribution for our chosen probabilistic program. We now take a closer look at the marginal parameter&amp;nbsp;distributions.&lt;/p&gt;
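To build intuition for what `r_hat` measures, the following is a self-contained sketch (not part of the original analysis, using synthetic data rather than the trace above) that computes the Gelman-Rubin statistic by hand for two well-mixed chains. When all chains sample the same distribution, the between-chain and within-chain variances agree and the statistic is close to 1.

```python
# Hypothetical illustration: the Gelman-Rubin statistic computed by hand
# for two synthetic, well-mixed chains drawn from the same distribution.
import numpy as np

rng = np.random.default_rng(42)
chains = rng.normal(loc=0.0, scale=1.0, size=(2, 10000))  # 2 chains, 10k draws

m, n = chains.shape
chain_means = chains.mean(axis=1)
B = n * chain_means.var(ddof=1)           # between-chain variance
W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
var_plus = (n - 1) / n * W + B / n        # pooled estimate of the posterior variance
r_hat = np.sqrt(var_plus / W)

print(f'r_hat = {r_hat:.4f}')  # close to 1 for well-mixed chains
```

Values of `r_hat` materially above 1 would indicate that the chains have not yet converged on the same distribution.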
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot_posterior&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;round_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_45_0.png"&gt;&lt;/p&gt;
&lt;p&gt;And their dependency&amp;nbsp;structure.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot_pair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_47_0.png"&gt;&lt;/p&gt;
&lt;p&gt;Finally, we compute estimates for $\mu$ and $\sigma$, based on our Bayesian&amp;nbsp;point-estimates.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dist_mean_bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dist_sd_bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;volatility&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;sigma_bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist_sd_bayes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mu_bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist_mean_bayes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dist_sd_bayes&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bayesian estimate of mu = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mu_bayes&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.5f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bayesian estimate of sigma = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sigma_bayes&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.4f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;bayesian estimate of mu = -0.00309
bayesian estimate of sigma = 0.1407
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The estimate for $\mu$ is far better than both &lt;span class="caps"&gt;ML&lt;/span&gt; estimates (full and partial data) and the estimate for $\sigma$ is considerably better than the &lt;span class="caps"&gt;ML&lt;/span&gt; estimate with partial data and approaching that with full&amp;nbsp;data.&lt;/p&gt;
&lt;h2 id="making-predictions"&gt;Making&amp;nbsp;Predictions&lt;/h2&gt;
&lt;p&gt;Perhaps most importantly, how do the differences in parameter inference methodology translate into predictions for future distributions of geometric returns? We compare a (Normal) distribution of daily geometric returns simulated using the constant empirical parameter estimates with partial data (black line in the plot below), to that simulated by using random draws of Bayesian parameter estimates from the marginal posterior distributions (red line in the plot&amp;nbsp;below).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;posterior_predictive_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampling&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_ppc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_gbm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;returns_geo_bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;posterior_predictive_samples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;daily_returns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;returns_geo_ml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist_mean_ml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dist_vol_ml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returns_geo_ml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;black&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returns_geo_bayes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;100&lt;/span&gt;&lt;span class="err"&gt;%|██████████|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10000&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;10000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;06&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1555.86&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="err"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/bayes_stoch_proc/output_52_1.png"&gt;&lt;/p&gt;
&lt;p&gt;We can clearly see that taking a Bayesian inference approach to calibrating stochastic processes leads to more probability mass in the &amp;#8216;tails&amp;#8217; of the distribution of geometric&amp;nbsp;returns.&lt;/p&gt;
&lt;h3 id="impact-on-risk-metrics-value-at-risk-var"&gt;Impact on Risk Metrics - Value-at-Risk&amp;nbsp;(VaR)&lt;/h3&gt;
&lt;p&gt;We now quantify the impact that the difference in these distributions has on the VaR for a single unit of the stock, at the 1% and 99% percentile levels - i.e. on 1/100 chance&amp;nbsp;events.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;var_ml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returns_geo_ml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;var_bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returns_geo_bayes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;VaR-1%:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;-------&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;maximum likelihood = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;var_ml&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bayesian = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;var_bayes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;VaR-99%:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--------&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;maximum likelihood = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;var_ml&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bayesian = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;var_bayes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gh"&gt;VaR-1%:&lt;/span&gt;
&lt;span class="gh"&gt;-------&lt;/span&gt;
maximum likelihood = -0.017048787051462327
bayesian = -0.01853874227071885

&lt;span class="gh"&gt;VaR-99%:&lt;/span&gt;
&lt;span class="gh"&gt;--------&lt;/span&gt;
maximum likelihood = 0.009175421564332082
bayesian = 0.019038871195300778
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can see that maximum likelihood estimation in our setup would underestimate risk for both long (VaR-1%) and short (VaR-99%) positions, but particularly for the short position, where the difference is over&amp;nbsp;100%.&lt;/p&gt;
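As a quick sanity check (not from the original post), the relative difference between the two VaR-99% figures printed above can be computed directly:

```python
# Relative difference between the VaR-99% estimates printed earlier.
var_99_ml = 0.009175421564332082     # maximum likelihood estimate
var_99_bayes = 0.019038871195300778  # Bayesian estimate

relative_difference = (var_99_bayes - var_99_ml) / var_99_ml
print(f'{relative_difference:.1%}')  # ~107.5% - i.e. over 100%
```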
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Bayesian inference can exploit relevant prior knowledge to yield more precise parameter estimates for stochastic processes, especially when data is&amp;nbsp;scarce;&lt;/li&gt;
&lt;li&gt;because it doesn&amp;#8217;t rely on point-estimates of parameters and is intrinsically stochastic in nature, it provides a natural unified framework for parameter inference and simulation under uncertainty;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;taken together, the above two points make the case for using Bayesian inference to calibrate risk models with greater confidence that they represent the real-world economic events the risk modeller needs them to, without having to rely as heavily on heuristic manipulation of these estimates. Indeed, the discussion now shifts to the choice of prior distribution for the parameters, which is more in keeping with theoretical&amp;nbsp;rigour.&lt;/li&gt;
&lt;/ul&gt;</content><category term="data-science"></category><category term="probabilistic-programming"></category><category term="python"></category><category term="pymc3"></category><category term="quant-finance"></category><category term="stochastic-processes"></category></entry><entry><title>Deploying Python ML Models with Flask, Docker and Kubernetes</title><link href="https://alexioannides.github.io/2019/01/10/deploying-python-ml-models-with-flask-docker-and-kubernetes/" rel="alternate"></link><published>2019-01-10T00:00:00+00:00</published><updated>2019-01-10T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2019-01-10:/2019/01/10/deploying-python-ml-models-with-flask-docker-and-kubernetes/</id><summary type="html">&lt;p&gt;&lt;img alt="jpeg" src="https://alexioannides.github.io/images/machine-learning-engineering/k8s-ml-ops/docker+k8s.jpg"&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;17th August 2019&lt;/strong&gt; - &lt;em&gt;updated to reflect changes in the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt; and Seldon&amp;nbsp;Core.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;14th December 2020&lt;/strong&gt; - &lt;em&gt;the work in this post forms the basis of the &lt;a href="https://www.bodyworkml.com"&gt;Bodywork&lt;/a&gt; MLOps tool - read about it &lt;a href="https://alexioannides.github.io/2020/12/01/deploying-ml-models-with-bodywork/"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A common pattern for deploying Machine Learning (&lt;span class="caps"&gt;ML&lt;/span&gt;) models into production environments - e.g. &lt;span class="caps"&gt;ML&lt;/span&gt; models …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="jpeg" src="https://alexioannides.github.io/images/machine-learning-engineering/k8s-ml-ops/docker+k8s.jpg"&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;17th August 2019&lt;/strong&gt; - &lt;em&gt;updated to reflect changes in the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt; and Seldon&amp;nbsp;Core.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;14th December 2020&lt;/strong&gt; - &lt;em&gt;the work in this post forms the basis of the &lt;a href="https://www.bodyworkml.com"&gt;Bodywork&lt;/a&gt; MLOps tool - read about it &lt;a href="https://alexioannides.github.io/2020/12/01/deploying-ml-models-with-bodywork/"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A common pattern for deploying Machine Learning (&lt;span class="caps"&gt;ML&lt;/span&gt;) models into production environments - e.g. &lt;span class="caps"&gt;ML&lt;/span&gt; models trained using the SciKit Learn or Keras packages (for Python), that are ready to provide predictions on new data - is to expose these &lt;span class="caps"&gt;ML&lt;/span&gt; models as RESTful &lt;span class="caps"&gt;API&lt;/span&gt; microservices, hosted from within &lt;a href="https://www.docker.com"&gt;Docker&lt;/a&gt; containers. These can then be deployed to a cloud environment that handles everything required for maintaining continuous availability - e.g. fault-tolerance, auto-scaling, load balancing and rolling service&amp;nbsp;updates.&lt;/p&gt;
&lt;p&gt;The configuration details for a continuously available cloud deployment are specific to the targeted cloud provider(s) - e.g. the deployment process and topology for Amazon Web Services is not the same as that for Microsoft Azure, which in turn is not the same as that for Google Cloud Platform. This constitutes knowledge that needs to be acquired for every cloud provider. Furthermore, it is difficult (some would say near impossible) to test entire deployment strategies locally, which makes issues such as networking hard to&amp;nbsp;debug.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://kubernetes.io"&gt;Kubernetes&lt;/a&gt; is a container orchestration platform that seeks to address these issues. Briefly, it provides a mechanism for defining &lt;strong&gt;entire&lt;/strong&gt; microservice-based application deployment topologies and their service-level requirements for maintaining continuous availability. It is agnostic to the targeted cloud provider, can be run on-premises and even locally on your laptop - all that&amp;#8217;s required is a cluster of virtual machines running Kubernetes - i.e. a Kubernetes&amp;nbsp;cluster.&lt;/p&gt;
&lt;p&gt;This blog post is designed to be read in conjunction with the code in &lt;a href="https://github.com/AlexIoannides/kubernetes-ml-ops"&gt;this GitHub repository&lt;/a&gt;, which contains the Python modules, Docker configuration files and Kubernetes instructions for demonstrating how a simple Python &lt;span class="caps"&gt;ML&lt;/span&gt; model can be turned into a production-grade RESTful model-scoring (or prediction) &lt;span class="caps"&gt;API&lt;/span&gt; service, using Docker and Kubernetes - both locally and with Google Cloud Platform (&lt;span class="caps"&gt;GCP&lt;/span&gt;). It is not a comprehensive guide to Kubernetes, Docker or &lt;span class="caps"&gt;ML&lt;/span&gt; - think of it more as a &amp;#8216;&lt;span class="caps"&gt;ML&lt;/span&gt; on Kubernetes 101&amp;#8217; for demonstrating capability and allowing newcomers to Kubernetes (e.g. data scientists who are more focused on building models as opposed to deploying them), to get up-and-running quickly and become familiar with the basic concepts and&amp;nbsp;patterns.&lt;/p&gt;
&lt;p&gt;We will demonstrate &lt;span class="caps"&gt;ML&lt;/span&gt; model deployment using two different approaches: a first principles approach using Docker and Kubernetes; and then a deployment using the &lt;a href="https://www.seldon.io"&gt;Seldon-Core&lt;/a&gt; Kubernetes native framework for streamlining the deployment of &lt;span class="caps"&gt;ML&lt;/span&gt; services. The former will help to appreciate the latter, which constitutes a powerful framework for deploying and performance-monitoring many complex &lt;span class="caps"&gt;ML&lt;/span&gt; model&amp;nbsp;pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#containerising-a-simple-ml-model-scoring-service-using-flask-and-docker"&gt;Containerising a Simple &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service using Flask and Docker&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#defining-the-flask-service-in-the-apipy-module"&gt;Defining the Flask Service in the api.py&amp;nbsp;Module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#defining-the-docker-image-with-the-dockerfile"&gt;Defining the Docker Image with the&amp;nbsp;Dockerfile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#building-a-docker-image-for-the-ml-scoring-service"&gt;Building a Docker Image for the &lt;span class="caps"&gt;ML&lt;/span&gt; Scoring Service&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#testing"&gt;Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pushing-the-image-to-the-dockerhub-registry"&gt;Pushing the Image to the DockerHub&amp;nbsp;Registry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#installing-kubernetes-for-local-development-and-testing"&gt;Installing Kubernetes for Local Development and Testing&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#installing-kubernetes-via-docker-desktop"&gt;Installing Kubernetes via Docker&amp;nbsp;Desktop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#installing-kubernetes-via-minikube"&gt;Installing Kubernetes via&amp;nbsp;Minikube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#deploying-the-containerised-ml-model-scoring-service-to-kubernetes"&gt;Deploying the Containerised &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service to&amp;nbsp;Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#configuring-a-multi-node-cluster-on-google-cloud-platform"&gt;Configuring a Multi-Node Cluster on Google Cloud Platform&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#getting-up-and-running-with-google-cloud-platform"&gt;Getting Up-and-Running with Google Cloud&amp;nbsp;Platform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#initialising-a-kubernetes-cluster"&gt;Initialising a Kubernetes&amp;nbsp;Cluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#launching-the-containerised-ml-model-scoring-service-on-gcp"&gt;Launching the Containerised &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service on &lt;span class="caps"&gt;GCP&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#switching-between-kubectl-contexts"&gt;Switching Between Kubectl&amp;nbsp;Contexts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#using-yaml-files-to-define-and-deploy-the-ml-model-scoring-service"&gt;Using &lt;span class="caps"&gt;YAML&lt;/span&gt; Files to Define and Deploy the &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring&amp;nbsp;Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#using-helm-charts-to-define-and-deploy-the-ml-model-scoring-service"&gt;Using Helm Charts to Define and Deploy the &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#installing-helm"&gt;Installing&amp;nbsp;Helm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#deploying-with-helm"&gt;Deploying with&amp;nbsp;Helm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#using-seldon-to-deploy-the-ml-model-scoring-service-to-kubernetes"&gt;Using Seldon to Deploy the &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service to Kubernetes&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#building-an-ml-component-for-seldon"&gt;Building an &lt;span class="caps"&gt;ML&lt;/span&gt; Component for Seldon&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#building-the-docker-image-for-use-with-seldon"&gt;Building the Docker Image for use with&amp;nbsp;Seldon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#deploying-a-ml-component-with-seldon-core"&gt;Deploying a &lt;span class="caps"&gt;ML&lt;/span&gt; Component with Seldon&amp;nbsp;Core&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#testing-the-api-via-the-ambassador-gateway-api"&gt;Testing the &lt;span class="caps"&gt;API&lt;/span&gt; via the Ambassador Gateway &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#tear-down"&gt;Tear&amp;nbsp;Down&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#where-to-go-from-here"&gt;Where to go from&amp;nbsp;Here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#appendix-using-pipenv-for-managing-python-package-dependencies"&gt;Appendix - Using Pipenv for Managing Python Package Dependencies&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#installing-pipenv"&gt;Installing&amp;nbsp;Pipenv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#installing-projects-dependencies"&gt;Installing Projects&amp;nbsp;Dependencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#running-python-ipython-and-jupyterlab-from-the-projects-virtual-environment"&gt;Running Python, IPython and JupyterLab from the Project&amp;#8217;s Virtual&amp;nbsp;Environment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pipenv-shells"&gt;Pipenv&amp;nbsp;Shells&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="containerising-a-simple-ml-model-scoring-service-using-flask-and-docker"&gt;Containerising a Simple &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service using Flask and&amp;nbsp;Docker&lt;/h2&gt;
&lt;p&gt;We start by demonstrating how to achieve this basic competence using the simple Python &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; contained in the &lt;code&gt;api.py&lt;/code&gt; module, together with the &lt;code&gt;Dockerfile&lt;/code&gt;, both within the &lt;code&gt;py-flask-ml-score-api&lt;/code&gt; directory, whose core contents are as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;py-flask-ml-score-api/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Dockerfile
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Pipfile
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Pipfile.lock
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;api.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you&amp;#8217;re already feeling lost then these files are discussed in the points below, otherwise feel free to skip to the next&amp;nbsp;section.&lt;/p&gt;
&lt;h3 id="defining-the-flask-service-in-the-apipy-module"&gt;Defining the Flask Service in the &lt;code&gt;api.py&lt;/code&gt; Module&lt;/h3&gt;
&lt;p&gt;This is a Python module that uses the &lt;a href="http://flask.pocoo.org"&gt;Flask&lt;/a&gt; framework for defining a web service (&lt;code&gt;app&lt;/code&gt;), with a function (&lt;code&gt;score&lt;/code&gt;), that executes in response to a &lt;span class="caps"&gt;HTTP&lt;/span&gt; request to a specific &lt;span class="caps"&gt;URL&lt;/span&gt; (or &amp;#8216;route&amp;#8217;), thanks to being wrapped by the &lt;code&gt;app.route&lt;/code&gt; function. For reference, the relevant code is reproduced&amp;nbsp;below,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;make_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;POST&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;X&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;make_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0.0.0.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If running locally - e.g. by starting the web service using &lt;code&gt;python api.py&lt;/code&gt; - we would be able to reach our function (or &amp;#8216;endpoint&amp;#8217;) at &lt;code&gt;http://localhost:5000/score&lt;/code&gt;. This function takes data sent to it as &lt;span class="caps"&gt;JSON&lt;/span&gt; (automatically de-serialised into a Python dict, made available via the &lt;code&gt;request&lt;/code&gt; variable in our function definition), and returns a response that is automatically serialised as &lt;span class="caps"&gt;JSON&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;In our example function, we expect an array of features, &lt;code&gt;X&lt;/code&gt;, that we pass to an &lt;span class="caps"&gt;ML&lt;/span&gt; model, which in our example simply returns those same features to the caller - i.e. our chosen &lt;span class="caps"&gt;ML&lt;/span&gt; model is the identity function, chosen purely for demonstrative purposes. We could just as easily have loaded a pickled SciKit-Learn or Keras model and passed the data to the appropriate &lt;code&gt;predict&lt;/code&gt; method, returning a score for the feature-data as &lt;span class="caps"&gt;JSON&lt;/span&gt; - see &lt;a href="https://github.com/AlexIoannides/ml-workflow-automation/blob/master/deploy/py-sklearn-flask-ml-service/api.py"&gt;here&lt;/a&gt; for an example of this in&amp;nbsp;action.&lt;/p&gt;
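&lt;p&gt;The request-handling logic described above can also be sketched independently of Flask, using only the standard library - a minimal illustration of the &lt;span class="caps"&gt;JSON&lt;/span&gt; round-trip, where &lt;code&gt;predict&lt;/code&gt; is a hypothetical stand-in for a real model&amp;#8217;s &lt;code&gt;predict&lt;/code&gt;&amp;nbsp;method,&lt;/p&gt;

```python
import json


def predict(X):
    # hypothetical stand-in for model.predict - here, the identity function
    return X


def score(request_body: str) -> str:
    # de-serialise the JSON payload, as Flask does via request.json
    features = json.loads(request_body)["X"]
    # serialise the model's output, as jsonify does for the HTTP response
    return json.dumps({"score": predict(features)})


print(score('{"X": [1, 2]}'))  # -> {"score": [1, 2]}
```

&lt;p&gt;Flask adds the &lt;span class="caps"&gt;HTTP&lt;/span&gt; plumbing around exactly this de-serialise, score and re-serialise&amp;nbsp;cycle.&lt;/p&gt;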
&lt;h3 id="defining-the-docker-image-with-the-dockerfile"&gt;Defining the Docker Image with the &lt;code&gt;Dockerfile&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;A &lt;code&gt;Dockerfile&lt;/code&gt; is the configuration file used by Docker to define the contents of an image and to configure how it operates when run as a container. This static data, when not executing as a container, is referred to as the &amp;#8216;image&amp;#8217;. For reference, the &lt;code&gt;Dockerfile&lt;/code&gt; is reproduced&amp;nbsp;below,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.6-slim&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/usr/src/app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;.&lt;span class="w"&gt; &lt;/span&gt;.
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;pipenv
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;install
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;5000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pipenv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;run&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;python&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;api.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In our example &lt;code&gt;Dockerfile&lt;/code&gt; we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start by using a pre-configured Docker image (&lt;code&gt;python:3.6-slim&lt;/code&gt;) that has a slimmed-down version of the &lt;a href="https://www.debian.org"&gt;Debian Linux&lt;/a&gt; distribution with Python already&amp;nbsp;installed;&lt;/li&gt;
&lt;li&gt;then copy the contents of the &lt;code&gt;py-flask-ml-score-api&lt;/code&gt; local directory to a directory on the image called &lt;code&gt;/usr/src/app&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;then use &lt;code&gt;pip&lt;/code&gt; to install the &lt;a href="https://pipenv.readthedocs.io/en/latest/"&gt;Pipenv&lt;/a&gt; package for Python dependency management (see the appendix at the bottom for more information on how we use&amp;nbsp;Pipenv);&lt;/li&gt;
&lt;li&gt;then use Pipenv to install the dependencies described in &lt;code&gt;Pipfile.lock&lt;/code&gt; into a virtual environment on the&amp;nbsp;image;&lt;/li&gt;
&lt;li&gt;configure port 5000 to be exposed to the &amp;#8216;outside world&amp;#8217; on the running container; and&amp;nbsp;finally,&lt;/li&gt;
&lt;li&gt;to start our Flask RESTful web service - &lt;code&gt;api.py&lt;/code&gt;. Note that here we are relying on Flask&amp;#8217;s internal &lt;a href="https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface"&gt;&lt;span class="caps"&gt;WSGI&lt;/span&gt;&lt;/a&gt; server, whereas in a production setting we would recommend configuring a more robust option (e.g. Gunicorn), &lt;a href="https://pythonspeed.com/articles/gunicorn-in-docker/"&gt;as discussed here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Building this custom image and asking the Docker daemon to run it (remember that a running image is a &amp;#8216;container&amp;#8217;), will expose our RESTful &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service on port 5000 as if it were running on a dedicated virtual machine. Refer to the official &lt;a href="https://docs.docker.com/get-started/"&gt;Docker documentation&lt;/a&gt; for a more comprehensive discussion of these core&amp;nbsp;concepts.&lt;/p&gt;
&lt;h3 id="building-a-docker-image-for-the-ml-scoring-service"&gt;Building a Docker Image for the &lt;span class="caps"&gt;ML&lt;/span&gt; Scoring&amp;nbsp;Service&lt;/h3&gt;
&lt;p&gt;We assume that &lt;a href="https://www.docker.com"&gt;Docker is running locally&lt;/a&gt; (both Docker client and daemon), that the client is logged into an account on &lt;a href="https://hub.docker.com"&gt;DockerHub&lt;/a&gt; and that there is a terminal open in this project&amp;#8217;s root directory. To build the image described in the &lt;code&gt;Dockerfile&lt;/code&gt; run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;build&lt;span class="w"&gt; &lt;/span&gt;--tag&lt;span class="w"&gt; &lt;/span&gt;alexioannides/test-ml-score-api&lt;span class="w"&gt; &lt;/span&gt;py-flask-ml-score-api
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where &amp;#8216;alexioannides&amp;#8217; refers to the name of the DockerHub account that we will push the image to, once we have tested&amp;nbsp;it. &lt;/p&gt;
&lt;h4 id="testing"&gt;Testing&lt;/h4&gt;
&lt;p&gt;To test that the image can be used to create a Docker container that functions as we expect it to,&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;--rm&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;test-api&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;:5000&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;alexioannides/test-ml-score-api
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where we have mapped port 5000 from the Docker container - i.e. the port our &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service is listening to - to port 5000 on our host machine (localhost). Then check that the container is listed as running&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;ps
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then test the exposed &lt;span class="caps"&gt;API&lt;/span&gt; endpoint&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:5000/score&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--request&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--header&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Content-Type: application/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;X&amp;quot;: [1, 2]}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where you should expect a response along the lines&amp;nbsp;of,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;score&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All our test model does is return the input data - i.e. it is the identity function. Only a few lines of additional code are required to modify this service to load a SciKit-Learn model from disk and pass new data to its &lt;code&gt;predict&lt;/code&gt; method for generating predictions - see &lt;a href="https://github.com/AlexIoannides/ml-workflow-automation/blob/master/deploy/py-sklearn-flask-ml-service/api.py"&gt;here&lt;/a&gt; for an example. Now that the container has been confirmed as operational, we can stop&amp;nbsp;it,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;stop&lt;span class="w"&gt; &lt;/span&gt;test-api
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="pushing-the-image-to-the-dockerhub-registry"&gt;Pushing the Image to the DockerHub&amp;nbsp;Registry&lt;/h4&gt;
&lt;p&gt;In order for a remote Docker host or Kubernetes cluster to have access to the image we&amp;#8217;ve created, we need to publish it to an image registry. All cloud computing providers that offer managed Docker-based services will provide private image registries, but we will use the public image registry at DockerHub, for convenience. To push our new image to DockerHub (where my account &lt;span class="caps"&gt;ID&lt;/span&gt; is &amp;#8216;alexioannides&amp;#8217;)&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;push&lt;span class="w"&gt; &lt;/span&gt;alexioannides/test-ml-score-api
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where we can now see that our chosen naming convention for the image is intrinsically linked to our target image registry (you will need to insert your own account &lt;span class="caps"&gt;ID&lt;/span&gt; where required). Once the upload is finished, log onto DockerHub to confirm that the upload has been successful via the &lt;a href="https://hub.docker.com/u/alexioannides"&gt;DockerHub &lt;span class="caps"&gt;UI&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="installing-kubernetes-for-local-development-and-testing"&gt;Installing Kubernetes for Local Development and&amp;nbsp;Testing&lt;/h2&gt;
&lt;p&gt;There are two options for installing a single-node Kubernetes cluster that is suitable for local development and testing: via the &lt;a href="https://www.docker.com/products/docker-desktop"&gt;Docker Desktop&lt;/a&gt; client, or via &lt;a href="https://github.com/kubernetes/minikube"&gt;Minikube&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="installing-kubernetes-via-docker-desktop"&gt;Installing Kubernetes via Docker&amp;nbsp;Desktop&lt;/h3&gt;
&lt;p&gt;If you have been using Docker on a Mac, then the chances are that you will have been doing this via the Docker Desktop application. If not (e.g. if you installed Docker Engine via Homebrew), then Docker Desktop can be downloaded &lt;a href="https://www.docker.com/products/docker-desktop"&gt;here&lt;/a&gt;. Docker Desktop now comes bundled with Kubernetes, which can be activated by going to &lt;code&gt;Preferences -&amp;gt; Kubernetes&lt;/code&gt; and selecting &lt;code&gt;Enable Kubernetes&lt;/code&gt;. It will take a while for Docker Desktop to download the Docker images required to run Kubernetes, so be patient. After it has finished, go to &lt;code&gt;Preferences -&amp;gt; Advanced&lt;/code&gt; and ensure that at least 2 CPUs and 4 GiB have been allocated to the Docker Engine, which are the minimum resources required to deploy a single Seldon &lt;span class="caps"&gt;ML&lt;/span&gt;&amp;nbsp;component.&lt;/p&gt;
&lt;p&gt;To interact with the Kubernetes cluster you will need the &lt;code&gt;kubectl&lt;/code&gt; Command Line Interface (&lt;span class="caps"&gt;CLI&lt;/span&gt;) tool, which will need to be downloaded separately. The easiest way to do this on a Mac is via Homebrew - i.e with &lt;code&gt;brew install kubernetes-cli&lt;/code&gt;. Once you have &lt;code&gt;kubectl&lt;/code&gt; installed and a Kubernetes cluster up-and-running, test that everything is working as expected by&amp;nbsp;running,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;cluster-info
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which ought to return something along the lines&amp;nbsp;of,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Kubernetes&lt;span class="w"&gt; &lt;/span&gt;master&lt;span class="w"&gt; &lt;/span&gt;is&lt;span class="w"&gt; &lt;/span&gt;running&lt;span class="w"&gt; &lt;/span&gt;at&lt;span class="w"&gt; &lt;/span&gt;https://kubernetes.docker.internal:6443
KubeDNS&lt;span class="w"&gt; &lt;/span&gt;is&lt;span class="w"&gt; &lt;/span&gt;running&lt;span class="w"&gt; &lt;/span&gt;at&lt;span class="w"&gt; &lt;/span&gt;https://kubernetes.docker.internal:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To&lt;span class="w"&gt; &lt;/span&gt;further&lt;span class="w"&gt; &lt;/span&gt;debug&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;diagnose&lt;span class="w"&gt; &lt;/span&gt;cluster&lt;span class="w"&gt; &lt;/span&gt;problems,&lt;span class="w"&gt; &lt;/span&gt;use&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;kubectl cluster-info dump&amp;#39;&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="installing-kubernetes-via-minikube"&gt;Installing Kubernetes via&amp;nbsp;Minikube&lt;/h3&gt;
&lt;p&gt;On Mac &lt;span class="caps"&gt;OS&lt;/span&gt; X, the steps required to get up-and-running with Minikube are as&amp;nbsp;follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;make sure the &lt;a href="https://brew.sh"&gt;Homebrew&lt;/a&gt; package manager for &lt;span class="caps"&gt;OS&lt;/span&gt; X is installed;&amp;nbsp;then,&lt;/li&gt;
&lt;li&gt;install VirtualBox using, &lt;code&gt;brew cask install virtualbox&lt;/code&gt; (you may need to approve installation via &lt;span class="caps"&gt;OS&lt;/span&gt; X System Preferences); and&amp;nbsp;then,&lt;/li&gt;
&lt;li&gt;install Minikube using, &lt;code&gt;brew cask install minikube&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To start the test cluster&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;minikube&lt;span class="w"&gt; &lt;/span&gt;start&lt;span class="w"&gt; &lt;/span&gt;--memory&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4096&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where we have specified the minimum amount of memory required to deploy a single Seldon &lt;span class="caps"&gt;ML&lt;/span&gt; component. Be patient - Minikube may take a while to start. To test that the cluster is operational&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;cluster-info
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;kubectl&lt;/code&gt; is the standard Command Line Interface (&lt;span class="caps"&gt;CLI&lt;/span&gt;) client for interacting with the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt; (which was installed as part of Minikube, but is also available&amp;nbsp;separately).&lt;/p&gt;
&lt;h3 id="deploying-the-containerised-ml-model-scoring-service-to-kubernetes"&gt;Deploying the Containerised &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service to&amp;nbsp;Kubernetes&lt;/h3&gt;
&lt;p&gt;To launch our test model scoring service on Kubernetes, we will start by deploying the containerised service within a Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/"&gt;Pod&lt;/a&gt;, whose rollout is managed by a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/"&gt;Deployment&lt;/a&gt;, which in turn creates a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/"&gt;ReplicaSet&lt;/a&gt; - a Kubernetes resource that ensures a minimum number of pods (or replicas) running our service are operational at any given time. This is achieved&amp;nbsp;with,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api&lt;span class="w"&gt; &lt;/span&gt;--image&lt;span class="o"&gt;=&lt;/span&gt;alexioannides/test-ml-score-api:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To check on the status of the deployment&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;rollout&lt;span class="w"&gt; &lt;/span&gt;status&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And to see the pods that it has created&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;get&lt;span class="w"&gt; &lt;/span&gt;pods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It is possible to use &lt;a href="https://en.wikipedia.org/wiki/Port_forwarding"&gt;port forwarding&lt;/a&gt; to test an individual container without exposing it to the public internet. To use this, open a separate terminal and run (for&amp;nbsp;example),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;port-forward&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api-szd4j&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;:5000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;test-ml-score-api-szd4j&lt;/code&gt; is the precise name of the pod currently active on the cluster, as determined from the &lt;code&gt;kubectl get pods&lt;/code&gt; command. Then from your original terminal, to repeat our test request against the same container running on Kubernetes&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:5000/score&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--request&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--header&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Content-Type: application/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;X&amp;quot;: [1, 2]}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To expose the container as a (load balanced) &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/"&gt;service&lt;/a&gt; to the outside world, we have to create a Kubernetes service that references it. This is achieved with the following&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;expose&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api&lt;span class="w"&gt; &lt;/span&gt;--port&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--type&lt;span class="o"&gt;=&lt;/span&gt;LoadBalancer&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api-lb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you are using Docker Desktop, then this will automatically emulate a load balancer at &lt;code&gt;http://localhost:5000&lt;/code&gt;. To find where Minikube has exposed its emulated load balancer&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;minikube&lt;span class="w"&gt; &lt;/span&gt;service&lt;span class="w"&gt; &lt;/span&gt;list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we test our new service - for example (with Docker&amp;nbsp;Desktop),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:5000/score&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--request&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--header&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Content-Type: application/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;X&amp;quot;: [1, 2]}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note, neither Docker Desktop nor Minikube sets up a real-life load balancer (which is what would happen if we made this request on a cloud platform). To tear-down the load balancer, deployment and pod, run the following commands in&amp;nbsp;sequence,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api
kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;service&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api-lb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="configuring-a-multi-node-cluster-on-google-cloud-platform"&gt;Configuring a Multi-Node Cluster on Google Cloud&amp;nbsp;Platform&lt;/h2&gt;
&lt;p&gt;In order to perform testing on a real-world Kubernetes cluster with far greater resources than those available on a laptop, the easiest way is to use a managed Kubernetes platform from a cloud provider. We will use Kubernetes Engine on &lt;a href="https://cloud.google.com"&gt;Google Cloud Platform (&lt;span class="caps"&gt;GCP&lt;/span&gt;)&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="getting-up-and-running-with-google-cloud-platform"&gt;Getting Up-and-Running with Google Cloud&amp;nbsp;Platform&lt;/h3&gt;
&lt;p&gt;Before we can use Google Cloud Platform, sign up for an account and create a project specifically for this work. Next, make sure that the &lt;span class="caps"&gt;GCP&lt;/span&gt; &lt;span class="caps"&gt;SDK&lt;/span&gt; is installed on your local machine -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;brew&lt;span class="w"&gt; &lt;/span&gt;cask&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;google-cloud-sdk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or by downloading an installation image &lt;a href="https://cloud.google.com/sdk/docs/quickstart-macos"&gt;directly from &lt;span class="caps"&gt;GCP&lt;/span&gt;&lt;/a&gt;. Note that if you haven&amp;#8217;t already installed &lt;code&gt;kubectl&lt;/code&gt;, then you will need to do so now, which can be done using the &lt;span class="caps"&gt;GCP&lt;/span&gt; &lt;span class="caps"&gt;SDK&lt;/span&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;components&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;kubectl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then need to initialise the &lt;span class="caps"&gt;SDK&lt;/span&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;init
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which will open a browser and guide you through the necessary authentication steps. Make sure you pick the project you created, together with a default zone and region (if this has not been set via Compute Engine -&amp;gt;&amp;nbsp;Settings).&lt;/p&gt;
&lt;h3 id="initialising-a-kubernetes-cluster"&gt;Initialising a Kubernetes&amp;nbsp;Cluster&lt;/h3&gt;
&lt;p&gt;Firstly, within the &lt;span class="caps"&gt;GCP&lt;/span&gt; &lt;span class="caps"&gt;UI&lt;/span&gt; visit the Kubernetes Engine page to trigger the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt; to start-up. From the command line we then start a cluster&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;container&lt;span class="w"&gt; &lt;/span&gt;clusters&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;k8s-test-cluster&lt;span class="w"&gt; &lt;/span&gt;--num-nodes&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--machine-type&lt;span class="w"&gt; &lt;/span&gt;g1-small
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then go make a cup of coffee while you wait for the cluster to be created. Note, that this will automatically switch your &lt;code&gt;kubectl&lt;/code&gt; context to point to the cluster on &lt;span class="caps"&gt;GCP&lt;/span&gt;, as you will see if you run, &lt;code&gt;kubectl config get-contexts&lt;/code&gt;. To switch back to the Docker Desktop client use &lt;code&gt;kubectl config use-context docker-desktop&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="launching-the-containerised-ml-model-scoring-service-on-gcp"&gt;Launching the Containerised &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service on &lt;span class="caps"&gt;GCP&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;This is largely the same as running the test service locally - run the following commands in&amp;nbsp;sequence,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api&lt;span class="w"&gt; &lt;/span&gt;--image&lt;span class="o"&gt;=&lt;/span&gt;alexioannides/test-ml-score-api:latest
kubectl&lt;span class="w"&gt; &lt;/span&gt;expose&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api&lt;span class="w"&gt; &lt;/span&gt;--port&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--type&lt;span class="o"&gt;=&lt;/span&gt;LoadBalancer&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api-lb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But, to find the external &lt;span class="caps"&gt;IP&lt;/span&gt; address for the &lt;span class="caps"&gt;GCP&lt;/span&gt; cluster we will need to&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;get&lt;span class="w"&gt; &lt;/span&gt;services
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then we can test our service on &lt;span class="caps"&gt;GCP&lt;/span&gt; - for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://35.246.92.213:5000/score&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--request&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--header&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Content-Type: application/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;X&amp;quot;: [1, 2]}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or, we could again use port forwarding to attach to a single pod - for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;port-forward&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api-nl4sc&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;:5000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then in a separate&amp;nbsp;terminal,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:5000/score&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--request&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--header&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Content-Type: application/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;X&amp;quot;: [1, 2]}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, we tear down the deployment and the load&amp;nbsp;balancer,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;deployment&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api
kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;service&lt;span class="w"&gt; &lt;/span&gt;test-ml-score-api-lb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="switching-between-kubectl-contexts"&gt;Switching Between Kubectl&amp;nbsp;Contexts&lt;/h2&gt;
&lt;p&gt;If you are running Kubernetes both locally and on a cluster in &lt;span class="caps"&gt;GCP&lt;/span&gt;, then you can switch the Kubectl &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/"&gt;context&lt;/a&gt; from one cluster to the other, as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;use-context&lt;span class="w"&gt; &lt;/span&gt;docker-desktop
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where the list of available contexts can be found&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;get-contexts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="using-yaml-files-to-define-and-deploy-the-ml-model-scoring-service"&gt;Using &lt;span class="caps"&gt;YAML&lt;/span&gt; Files to Define and Deploy the &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring&amp;nbsp;Service&lt;/h2&gt;
&lt;p&gt;Up to this point we have been using Kubectl commands to define and deploy a basic version of our &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service. This is fine for demonstration purposes, but it quickly becomes limiting and unmanageable. In practice, the standard way of defining entire Kubernetes deployments is with &lt;span class="caps"&gt;YAML&lt;/span&gt; files posted to the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt;. The &lt;code&gt;py-flask-ml-score.yaml&lt;/code&gt; file in the &lt;code&gt;py-flask-ml-score-api&lt;/code&gt; directory is an example of how our &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service can be defined in a single &lt;span class="caps"&gt;YAML&lt;/span&gt; file. This can now be deployed using a single&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;apply&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;py-flask-ml-score-api/py-flask-ml-score.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that we have defined three separate Kubernetes components in this single file - a &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/"&gt;namespace&lt;/a&gt;, a deployment and a load-balanced service - using &lt;code&gt;---&lt;/code&gt; to delimit the definition of each component (and its sub-components). To see all of the components deployed into this namespace&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;get&lt;span class="w"&gt; &lt;/span&gt;all&lt;span class="w"&gt; &lt;/span&gt;--namespace&lt;span class="w"&gt; &lt;/span&gt;test-ml-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And likewise set the &lt;code&gt;--namespace&lt;/code&gt; flag when using any &lt;code&gt;kubectl get&lt;/code&gt; command to inspect the different components of our test app. Alternatively, we can set our new namespace as the default&amp;nbsp;context,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;set-context&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;current-context&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--namespace&lt;span class="o"&gt;=&lt;/span&gt;test-ml-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;get&lt;span class="w"&gt; &lt;/span&gt;all
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Where we can switch back to the default namespace&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;set-context&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;current-context&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--namespace&lt;span class="o"&gt;=&lt;/span&gt;default
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To tear-down this application we can then&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;py-flask-ml-score-api/py-flask-ml-score.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which saves us from having to use multiple commands to delete each component individually. Refer to the &lt;a href="https://kubernetes.io/docs/home/"&gt;official documentation for the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/a&gt; to understand the contents of this &lt;span class="caps"&gt;YAML&lt;/span&gt; file in greater&amp;nbsp;depth.&lt;/p&gt;
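&lt;p&gt;To make the role of the &lt;code&gt;---&lt;/code&gt; delimiter concrete, the toy Python sketch below splits a condensed manifest into its component documents. This is purely illustrative - &lt;code&gt;kubectl&lt;/code&gt; uses a full &lt;span class="caps"&gt;YAML&lt;/span&gt; parser - and the manifest shown is a simplified stand-in for the real &lt;code&gt;py-flask-ml-score.yaml&lt;/code&gt;.&lt;/p&gt;

```python
# Toy illustration of how one manifest file holds several Kubernetes
# component definitions, delimited by `---` lines. kubectl uses a full
# YAML parser; splitting on the delimiter is enough to show the idea.
manifest = """apiVersion: v1
kind: Namespace
metadata:
  name: test-ml-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ml-score-api
---
apiVersion: v1
kind: Service
metadata:
  name: test-ml-score-api-lb
"""

components = [doc.strip() for doc in manifest.split("\n---\n")]
for doc in components:
    kind = next(line for line in doc.splitlines() if line.startswith("kind:"))
    print(kind)  # kind: Namespace, kind: Deployment, kind: Service
```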
&lt;h2 id="using-helm-charts-to-define-and-deploy-the-ml-model-scoring-service"&gt;Using Helm Charts to Define and Deploy the &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring&amp;nbsp;Service&lt;/h2&gt;
&lt;p&gt;Writing &lt;span class="caps"&gt;YAML&lt;/span&gt; files for Kubernetes can get repetitive and hard to manage, especially if there is a lot of &amp;#8216;copy-paste&amp;#8217; involved - when only a handful of parameters need to change from one deployment to the next, but there is a &amp;#8216;wall of &lt;span class="caps"&gt;YAML&lt;/span&gt;&amp;#8217; that needs to be modified. Enter &lt;a href="https://helm.sh//"&gt;Helm&lt;/a&gt; - a framework for creating, executing and managing Kubernetes deployment templates. What follows is a very high-level demonstration of how Helm can be used to deploy our &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service - for a comprehensive discussion of Helm&amp;#8217;s full capabilities (and there are a lot of them), please refer to the &lt;a href="https://docs.helm.sh"&gt;official documentation&lt;/a&gt;. Seldon-Core can also be deployed using Helm and we will cover this in more detail later&amp;nbsp;on.&lt;/p&gt;
&lt;h3 id="installing-helm"&gt;Installing&amp;nbsp;Helm&lt;/h3&gt;
&lt;p&gt;As before, the easiest way to install Helm onto Mac &lt;span class="caps"&gt;OS&lt;/span&gt; X is to use the Homebrew package&amp;nbsp;manager,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;kubernetes-helm
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Helm relies on a dedicated deployment server, referred to as the &amp;#8216;Tiller&amp;#8217;, running within the same Kubernetes cluster we wish to deploy our applications to. Before we deploy Tiller we need to create a cluster-wide super-user role to assign to it, so that it can create and modify Kubernetes resources in any namespace. To achieve this, we start by creating a &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/"&gt;Service Account&lt;/a&gt; for our Tiller. A Service Account is a means by which a pod (and any service running within it) can authenticate itself to the Kubernetes &lt;span class="caps"&gt;API&lt;/span&gt;, in order to view, create and modify resources. We create this in the &lt;code&gt;kube-system&lt;/code&gt; namespace (a common convention) as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;--namespace&lt;span class="w"&gt; &lt;/span&gt;kube-system&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;serviceaccount&lt;span class="w"&gt; &lt;/span&gt;tiller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then create a binding between this Service Account and the &lt;code&gt;cluster-admin&lt;/code&gt; &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/"&gt;Cluster Role&lt;/a&gt;, which, as the name suggests, grants cluster-wide admin&amp;nbsp;rights,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;clusterrolebinding&lt;span class="w"&gt; &lt;/span&gt;tiller&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--clusterrole&lt;span class="w"&gt; &lt;/span&gt;cluster-admin&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--serviceaccount&lt;span class="o"&gt;=&lt;/span&gt;kube-system:tiller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can now deploy the Helm Tiller to a Kubernetes cluster, with the desired access rights&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;init&lt;span class="w"&gt; &lt;/span&gt;--service-account&lt;span class="w"&gt; &lt;/span&gt;tiller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="deploying-with-helm"&gt;Deploying with&amp;nbsp;Helm&lt;/h3&gt;
&lt;p&gt;To create a fresh Helm deployment definition - referred to as a &amp;#8216;chart&amp;#8217; in Helm terminology -&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;NAME-OF-YOUR-HELM-CHART
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This creates a new directory - e.g. &lt;code&gt;helm-ml-score-app&lt;/code&gt; as included with this repository - with the following high-level directory&amp;nbsp;structure,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm-ml-score-app/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;charts/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;templates/
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Chart.yaml
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;values.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Briefly, the &lt;code&gt;charts&lt;/code&gt; directory contains other charts that our new chart will depend on (we will not make use of this), the &lt;code&gt;templates&lt;/code&gt; directory contains our Helm templates, &lt;code&gt;Chart.yaml&lt;/code&gt; contains core information for our chart (e.g. name and version information) and &lt;code&gt;values.yaml&lt;/code&gt; contains default values to render our templates with (in the case that no values are set from the command&amp;nbsp;line).&lt;/p&gt;
&lt;p&gt;The next step is to delete all of the files in the &lt;code&gt;templates&lt;/code&gt; directory (apart from &lt;code&gt;NOTES.txt&lt;/code&gt;), and to replace them with our own. We start with &lt;code&gt;namespace.yaml&lt;/code&gt; for declaring a namespace for our&amp;nbsp;app,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Namespace&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Anyone familiar with &lt;span class="caps"&gt;HTML&lt;/span&gt; template frameworks (e.g. Jinja) will recognise the use of &lt;code&gt;{{}}&lt;/code&gt; for defining values that will be injected into the rendered template. In this specific instance, &lt;code&gt;.Values.app.namespace&lt;/code&gt; injects the &lt;code&gt;app.namespace&lt;/code&gt; variable, whose default value is defined in &lt;code&gt;values.yaml&lt;/code&gt;. Next, we define a deployment of pods in &lt;code&gt;deployment.yaml&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;apps/v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Deployment&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;matchLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.image&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;containerPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.containerPort&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;TCP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And the details of the load balancer service in &lt;code&gt;service.yaml&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Service&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;-lb&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;LoadBalancer&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.containerPort&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;targetPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.targetPort&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;.Values.app.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
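&lt;p&gt;To see how the templates connect to &lt;code&gt;values.yaml&lt;/code&gt;, the toy Python sketch below renders the namespace template against a hypothetical set of values. Helm&amp;#8217;s real engine is Go&amp;#8217;s text/template and is far more capable - the keys mirror the &lt;code&gt;.Values&lt;/code&gt; references used above, but the values themselves are assumptions, not the contents of the actual &lt;code&gt;values.yaml&lt;/code&gt;.&lt;/p&gt;

```python
import re

# Hypothetical values, mirroring the `.Values` references in the
# templates above - the real defaults live in helm-ml-score-app/values.yaml.
values = {
    "app.name": "test-ml-score-api",
    "app.namespace": "test-ml-app",
    "app.env": "test",
    "app.image": "alexioannides/test-ml-score-api:latest",
    "containerPort": "5000",
    "targetPort": "5000",
}

namespace_template = """apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Values.app.namespace }}"""

def render(template: str, values: dict) -> str:
    """Replace each `{{ .Values.x.y }}` placeholder with its value."""
    return re.sub(
        r"\{\{\s*\.Values\.([\w.]+)\s*\}\}",
        lambda match: values[match.group(1)],
        template,
    )

print(render(namespace_template, values))
```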

&lt;p&gt;What we have done, in essence, is to split out each component of the deployment details from &lt;code&gt;py-flask-ml-score.yaml&lt;/code&gt; into its own file, and then to define template variables for the parameters that are most likely to change from one deployment to the next. To test and examine the rendered template, without having to attempt a deployment,&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;helm-ml-score-app&lt;span class="w"&gt; &lt;/span&gt;--debug&lt;span class="w"&gt; &lt;/span&gt;--dry-run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you are happy with the results of the &amp;#8216;dry run&amp;#8217;, then execute the deployment and generate a release from the chart&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;helm-ml-score-app&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;test-ml-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will automatically print the status of the release, together with the name Helm has ascribed to it (the name we chose, &lt;code&gt;test-ml-app&lt;/code&gt; - had we not set one, Helm would have generated one, e.g. &amp;#8216;willing-yak&amp;#8217;) and the contents of &lt;code&gt;NOTES.txt&lt;/code&gt; rendered to the terminal. To list all available Helm releases and their names&amp;nbsp;use,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And to see the status of all their constituent components (e.g. pods, replication controllers, services, etc.), use, for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;status&lt;span class="w"&gt; &lt;/span&gt;test-ml-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;span class="caps"&gt;ML&lt;/span&gt; scoring service can now be tested in exactly the same way as we have done previously (above). Once you have convinced yourself that it&amp;#8217;s working as expected, the release can be deleted&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;test-ml-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="using-seldon-to-deploy-the-ml-model-scoring-service-to-kubernetes"&gt;Using Seldon to Deploy the &lt;span class="caps"&gt;ML&lt;/span&gt; Model Scoring Service to&amp;nbsp;Kubernetes&lt;/h2&gt;
&lt;p&gt;Seldon&amp;#8217;s core mission is to simplify the repeated deployment and management of complex &lt;span class="caps"&gt;ML&lt;/span&gt; prediction pipelines on top of Kubernetes. In this demonstration we are going to focus on the simplest possible example - i.e. the simple &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring &lt;span class="caps"&gt;API&lt;/span&gt; we have already been&amp;nbsp;using.&lt;/p&gt;
&lt;h3 id="building-an-ml-component-for-seldon"&gt;Building an &lt;span class="caps"&gt;ML&lt;/span&gt; Component for&amp;nbsp;Seldon&lt;/h3&gt;
&lt;p&gt;To deploy an &lt;span class="caps"&gt;ML&lt;/span&gt; component using Seldon, we need to create Seldon-compatible Docker images. We start by following &lt;a href="https://docs.seldon.io/projects/seldon-core/en/latest/python/python_wrapping_docker.html"&gt;these guidelines&lt;/a&gt; for defining a Python class that wraps an &lt;span class="caps"&gt;ML&lt;/span&gt; model targeted for deployment with Seldon. This is contained within the &lt;code&gt;seldon-ml-score-component&lt;/code&gt; directory.&lt;/p&gt;
&lt;h4 id="building-the-docker-image-for-use-with-seldon"&gt;Building the Docker Image for use with&amp;nbsp;Seldon&lt;/h4&gt;
&lt;p&gt;Seldon requires that the Docker image for the &lt;span class="caps"&gt;ML&lt;/span&gt; scoring service be structured in a particular&amp;nbsp;way:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;span class="caps"&gt;ML&lt;/span&gt; model has to be wrapped in a Python class with a &lt;code&gt;predict&lt;/code&gt; method with a particular signature (or&amp;nbsp;interface);&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;seldon-core&lt;/code&gt; Python package must be installed (we use &lt;code&gt;pipenv&lt;/code&gt; to manage dependencies as discussed above and in the Appendix below);&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;the container starts by running the Seldon service using the &lt;code&gt;seldon-core-microservice&lt;/code&gt; entry-point provided by the &lt;code&gt;seldon-core&lt;/code&gt; package.&lt;/li&gt;
&lt;/ul&gt;
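&lt;p&gt;As a sketch of what such a wrapper looks like, the class below follows the pattern from the Seldon guidelines - the class is named after its module (&lt;code&gt;MLScore.py&lt;/code&gt;) and the identity &lt;code&gt;predict&lt;/code&gt; logic mirrors the test service used in this post (the parameter names are assumptions based on the guidelines, not a verbatim copy of the repository&amp;#8217;s&amp;nbsp;code),&lt;/p&gt;

```python
# MLScore.py - hypothetical sketch of a Seldon-compatible model wrapper.
# Seldon instantiates the class named after the module and calls its
# predict method for every request the service receives.

class MLScore:
    """Wraps an ML model for deployment with seldon-core."""

    def __init__(self):
        # A real component would load a trained model here (e.g. from
        # disk); the demo service has no model, so there is nothing to do.
        pass

    def predict(self, X, features_names=None):
        """Return predictions for a 2D array X of feature vectors."""
        # A real model would return something like model.predict(X);
        # the test service in this post echoes the features back.
        return X
```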
&lt;p&gt;For the precise details, see &lt;code&gt;MLScore.py&lt;/code&gt; and &lt;code&gt;Dockerfile&lt;/code&gt; in the &lt;code&gt;seldon-ml-score-component&lt;/code&gt; directory. Next, build this&amp;nbsp;image,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;build&lt;span class="w"&gt; &lt;/span&gt;seldon-ml-score-component&lt;span class="w"&gt; &lt;/span&gt;-t&lt;span class="w"&gt; &lt;/span&gt;alexioannides/test-ml-score-seldon-api:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Before we push this image to our registry, we need to make sure that it&amp;#8217;s working as expected. Start the image on the local Docker&amp;nbsp;daemon,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;--rm&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;:5000&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;alexioannides/test-ml-score-seldon-api:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then send it a request (using a different request format to the ones we&amp;#8217;ve used thus&amp;nbsp;far),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;http://localhost:5000/predict&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data-urlencode&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;json={&amp;quot;data&amp;quot;:{&amp;quot;names&amp;quot;:[&amp;quot;a&amp;quot;,&amp;quot;b&amp;quot;],&amp;quot;tensor&amp;quot;:{&amp;quot;shape&amp;quot;:[2,2],&amp;quot;values&amp;quot;:[0,0,1,1]}}}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If the response is as expected (i.e. it contains the same payload as the request), then push the&amp;nbsp;image,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;push&lt;span class="w"&gt; &lt;/span&gt;alexioannides/test-ml-score-seldon-api:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="deploying-a-ml-component-with-seldon-core"&gt;Deploying a &lt;span class="caps"&gt;ML&lt;/span&gt; Component with Seldon&amp;nbsp;Core&lt;/h3&gt;
&lt;p&gt;We now move on to deploying our Seldon compatible &lt;span class="caps"&gt;ML&lt;/span&gt; component to a Kubernetes cluster and creating a fault-tolerant and scalable service from it. To achieve this, we will &lt;a href="https://docs.seldon.io/projects/seldon-core/en/latest/workflow/install.html"&gt;deploy Seldon-Core using Helm charts&lt;/a&gt;. We start by creating a namespace that will contain the &lt;code&gt;seldon-core-operator&lt;/code&gt;, a custom Kubernetes resource required to deploy any &lt;span class="caps"&gt;ML&lt;/span&gt; model using&amp;nbsp;Seldon,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;namespace&lt;span class="w"&gt; &lt;/span&gt;seldon-core
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then we deploy Seldon-Core using Helm and the official Seldon Helm chart repository hosted at &lt;code&gt;https://storage.googleapis.com/seldon-charts&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;seldon-core-operator&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;seldon-core&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--repo&lt;span class="w"&gt; &lt;/span&gt;https://storage.googleapis.com/seldon-charts&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--set&lt;span class="w"&gt; &lt;/span&gt;usageMetrics.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--namespace&lt;span class="w"&gt; &lt;/span&gt;seldon-core
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Next, we deploy the Ambassador &lt;span class="caps"&gt;API&lt;/span&gt; gateway for Kubernetes, which will act as a single point of entry into our Kubernetes cluster and route requests to any &lt;span class="caps"&gt;ML&lt;/span&gt; model we have deployed using Seldon. We will create a dedicated namespace for the Ambassador&amp;nbsp;deployment,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;namespace&lt;span class="w"&gt; &lt;/span&gt;ambassador
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then deploy Ambassador using the most recent charts in the official Helm&amp;nbsp;repository,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;stable/ambassador&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;ambassador&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--set&lt;span class="w"&gt; &lt;/span&gt;crds.keep&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--namespace&lt;span class="w"&gt; &lt;/span&gt;ambassador
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we now run &lt;code&gt;helm list --namespace seldon-core&lt;/code&gt; we should see that Seldon-Core has been deployed and is waiting for Seldon &lt;span class="caps"&gt;ML&lt;/span&gt; components to be deployed. To deploy our Seldon &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service we create a separate namespace for&amp;nbsp;it,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;namespace&lt;span class="w"&gt; &lt;/span&gt;test-ml-seldon-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then configure and deploy another official Seldon Helm chart as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;seldon-single-model&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;test-ml-seldon-app&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--repo&lt;span class="w"&gt; &lt;/span&gt;https://storage.googleapis.com/seldon-charts&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--set&lt;span class="w"&gt; &lt;/span&gt;model.image.name&lt;span class="o"&gt;=&lt;/span&gt;alexioannides/test-ml-score-seldon-api:latest&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--namespace&lt;span class="w"&gt; &lt;/span&gt;test-ml-seldon-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that multiple &lt;span class="caps"&gt;ML&lt;/span&gt; models can now be deployed using Seldon by repeating the last two steps, and they will all be automatically reachable via the same Ambassador &lt;span class="caps"&gt;API&lt;/span&gt; gateway, which we will now use to test our Seldon &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring&amp;nbsp;service.&lt;/p&gt;
&lt;h3 id="testing-the-api-via-the-ambassador-gateway-api"&gt;Testing the &lt;span class="caps"&gt;API&lt;/span&gt; via the Ambassador Gateway &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;To test the Seldon-based &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service, we follow the same general approach as we did for our first-principles Kubernetes deployments above, but we will route our requests via the Ambassador &lt;span class="caps"&gt;API&lt;/span&gt; gateway. To find the &lt;span class="caps"&gt;IP&lt;/span&gt; address of the Ambassador service,&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;ambassador&lt;span class="w"&gt; &lt;/span&gt;get&lt;span class="w"&gt; &lt;/span&gt;service&lt;span class="w"&gt; &lt;/span&gt;ambassador
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will be &lt;code&gt;localhost:80&lt;/code&gt; if using Docker Desktop, or an &lt;span class="caps"&gt;IP&lt;/span&gt; address if running on &lt;span class="caps"&gt;GCP&lt;/span&gt; or Minikube (where you will need to remember to use &lt;code&gt;minikube service list&lt;/code&gt; in the latter case). Now test the prediction end-point - for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://35.246.28.247:80/seldon/test-ml-seldon-app/test-ml-seldon-app/api/v0.1/predictions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--request&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--header&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Content-Type: application/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--data&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;data&amp;quot;:{&amp;quot;names&amp;quot;:[&amp;quot;a&amp;quot;,&amp;quot;b&amp;quot;],&amp;quot;tensor&amp;quot;:{&amp;quot;shape&amp;quot;:[2,2],&amp;quot;values&amp;quot;:[0,0,1,1]}}}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you want to understand the full logic behind the routing see the &lt;a href="https://docs.seldon.io/projects/seldon-core/en/latest/workflow/serving.html"&gt;Seldon documentation&lt;/a&gt;, but the &lt;span class="caps"&gt;URL&lt;/span&gt; is essentially assembled&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;http://&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;ambassadorEndpoint&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;/seldon/&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;/&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;deploymentName&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;/api/v0.1/predictions
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
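&lt;p&gt;For scripting tests against the gateway, this routing scheme can be sketched in Python using only the standard library - the endpoint, namespace and deployment name below are the illustrative values used in this&amp;nbsp;post,&lt;/p&gt;

```python
# Sketch: assemble a Seldon prediction URL from the Ambassador routing
# scheme and POST a tensor payload to it, using only the standard library.
import json
from urllib import request


def seldon_prediction_url(ambassador_endpoint, namespace, deployment):
    """Assemble the prediction URL from its routing components."""
    return (f"http://{ambassador_endpoint}/seldon/"
            f"{namespace}/{deployment}/api/v0.1/predictions")


def score(url, names, shape, values):
    """POST a tensor payload to a Seldon prediction endpoint."""
    payload = {"data": {"names": names,
                        "tensor": {"shape": shape, "values": values}}}
    req = request.Request(url, data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as response:
        return json.load(response)


url = seldon_prediction_url("35.246.28.247:80", "test-ml-seldon-app",
                            "test-ml-seldon-app")
```

&lt;p&gt;Calling &lt;code&gt;score(url, ["a", "b"], [2, 2], [0, 0, 1, 1])&lt;/code&gt; reproduces the &lt;code&gt;curl&lt;/code&gt; request&amp;nbsp;above.&lt;/p&gt;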

&lt;p&gt;If your request has been successful, then you should see a response along the lines&amp;nbsp;of,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;meta&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;puid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hsu0j9c39a4avmeonhj2ugllh9&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tags&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;routing&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;requestPath&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;classifier&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alexioannides/test-ml-score-seldon-api:latest&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;metrics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;names&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;t:0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;t:1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tensor&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;shape&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;values&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
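&lt;p&gt;Note that the predictions arrive as a flat list of values together with a shape, so a client needs to reshape them - a quick sketch using the (abridged) example response&amp;nbsp;above,&lt;/p&gt;

```python
# Sketch: extract and reshape the prediction tensor from a Seldon
# response - the dict below abridges the example response shown above.
response = {
    "data": {
        "names": ["t:0", "t:1"],
        "tensor": {"shape": [2, 2], "values": [0.0, 0.0, 1.0, 1.0]},
    },
}

tensor = response["data"]["tensor"]
n_rows, n_cols = tensor["shape"]

# Reshape the flat list of values into one row per input instance.
rows = [tensor["values"][i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]
```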

&lt;h2 id="tear-down"&gt;Tear&amp;nbsp;Down&lt;/h2&gt;
&lt;p&gt;To delete a single Seldon &lt;span class="caps"&gt;ML&lt;/span&gt; model and its namespace, deployed using the steps above,&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;test-ml-seldon-app&lt;span class="w"&gt; &lt;/span&gt;--purge&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;namespace&lt;span class="w"&gt; &lt;/span&gt;test-ml-seldon-app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Follow the same pattern to remove the Seldon Core Operator and&amp;nbsp;Ambassador,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;seldon-core&lt;span class="w"&gt; &lt;/span&gt;--purge&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;namespace&lt;span class="w"&gt; &lt;/span&gt;seldon-core
helm&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;ambassador&lt;span class="w"&gt; &lt;/span&gt;--purge&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;namespace&lt;span class="w"&gt; &lt;/span&gt;ambassador
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If there is a &lt;span class="caps"&gt;GCP&lt;/span&gt; cluster that needs to be killed,&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;container&lt;span class="w"&gt; &lt;/span&gt;clusters&lt;span class="w"&gt; &lt;/span&gt;delete&lt;span class="w"&gt; &lt;/span&gt;k8s-test-cluster
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And likewise if working with&amp;nbsp;Minikube,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;minikube&lt;span class="w"&gt; &lt;/span&gt;stop
minikube&lt;span class="w"&gt; &lt;/span&gt;delete
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If running on Docker Desktop, navigate to &lt;code&gt;Preferences -&amp;gt; Reset&lt;/code&gt; to reset the&amp;nbsp;cluster.&lt;/p&gt;
&lt;h2 id="where-to-go-from-here"&gt;Where to go from&amp;nbsp;Here&lt;/h2&gt;
&lt;p&gt;The following list of resources will help you dive deeply into the subjects we skimmed over&amp;nbsp;above:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the full set of functionality provided by &lt;a href="https://www.seldon.io/open-source/"&gt;Seldon&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;running multi-stage containerised workflows (e.g. for data engineering and model training) using &lt;a href="https://argoproj.github.io/argo"&gt;Argo Workflows&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;the excellent &amp;#8216;&lt;em&gt;Kubernetes in Action&lt;/em&gt;&amp;#8216; by Marko Lukša &lt;a href="https://www.manning.com/books/kubernetes-in-action"&gt;available from Manning Publications&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;&lt;span class="quo"&gt;&amp;#8216;&lt;/span&gt;&lt;em&gt;Docker in Action&lt;/em&gt;&amp;#8216; by Jeff Nickoloff and Stephen Kuenzli &lt;a href="https://www.manning.com/books/docker-in-action-second-edition"&gt;also available from Manning Publications&lt;/a&gt;;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;span class="quo"&gt;&amp;#8216;&lt;/span&gt;Flask Web Development&amp;#8217;&lt;/em&gt; by Miguel Grinberg &lt;a href="http://shop.oreilly.com/product/0636920089056.do"&gt;O&amp;#8217;Reilly&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This work was initially committed in 2018 and has since formed the basis of &lt;a href="https://github.com/bodywork-ml/bodywork-core"&gt;Bodywork&lt;/a&gt; - an open-source MLOps tool for deploying machine learning projects developed in Python, to Kubernetes. Bodywork, to which I am one of the core contributors, is an attempt to automate many of the steps that this project has demonstrated to machine learning engineers over the&amp;nbsp;years.&lt;/p&gt;
&lt;h2 id="appendix-using-pipenv-for-managing-python-package-dependencies"&gt;Appendix - Using Pipenv for Managing Python Package&amp;nbsp;Dependencies&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://docs.pipenv.org"&gt;pipenv&lt;/a&gt; for managing project dependencies and Python environments (i.e. virtual environments). All of the direct package dependencies required to run the code (e.g. Flask or Seldon-Core), as well as any packages that could have been used during development (e.g. flake8 for code linting and IPython for interactive console sessions), are described in the &lt;code&gt;Pipfile&lt;/code&gt;. Their &lt;strong&gt;precise&lt;/strong&gt; downstream dependencies are described in &lt;code&gt;Pipfile.lock&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="installing-pipenv"&gt;Installing&amp;nbsp;Pipenv&lt;/h3&gt;
&lt;p&gt;To get started with Pipenv, first of all install it - assuming that there is a global version of Python available on your system and on the &lt;span class="caps"&gt;PATH&lt;/span&gt;, this can be achieved by running the following&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip3&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;pipenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pipenv is also available to install from many non-Python package managers. For example, on &lt;span class="caps"&gt;OS&lt;/span&gt; X it can be installed using the &lt;a href="https://brew.sh"&gt;Homebrew&lt;/a&gt; package manager, with the following terminal&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;pipenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For more information, including advanced configuration options, see the &lt;a href="https://docs.pipenv.org"&gt;official pipenv documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="installing-projects-dependencies"&gt;Installing Projects&amp;nbsp;Dependencies&lt;/h3&gt;
&lt;p&gt;If you want to experiment with the Python code in the &lt;code&gt;py-flask-ml-score-api&lt;/code&gt; or &lt;code&gt;seldon-ml-score-component&lt;/code&gt; directories, then make sure that you&amp;#8217;re in the appropriate directory and then&amp;nbsp;run,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will install all of the direct project&amp;nbsp;dependencies.&lt;/p&gt;
&lt;h3 id="running-python-ipython-and-jupyterlab-from-the-projects-virtual-environment"&gt;Running Python, IPython and JupyterLab from the Project&amp;#8217;s Virtual&amp;nbsp;Environment&lt;/h3&gt;
&lt;p&gt;In order to continue development in a Python environment that precisely mimics the one the project was initially developed with, use Pipenv from the command line as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;python3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;python3&lt;/code&gt; command could just as well be &lt;code&gt;seldon-core-microservice&lt;/code&gt; or any other entry-point provided by the &lt;code&gt;seldon-core&lt;/code&gt; package - for example, in the &lt;code&gt;Dockerfile&lt;/code&gt; for the &lt;code&gt;seldon-ml-score-component&lt;/code&gt; we start the Seldon-based &lt;span class="caps"&gt;ML&lt;/span&gt; model scoring service&amp;nbsp;using,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;seldon-core-microservice&lt;span class="w"&gt; &lt;/span&gt;...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="pipenv-shells"&gt;Pipenv&amp;nbsp;Shells&lt;/h3&gt;
&lt;p&gt;Prepending &lt;code&gt;pipenv&lt;/code&gt; to every command you want to run within the context of your Pipenv-managed virtual environment can get very tedious. This can be avoided by entering into a Pipenv-managed&amp;nbsp;shell,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pipenv&lt;span class="w"&gt; &lt;/span&gt;shell
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which is equivalent to &amp;#8216;activating&amp;#8217; the virtual environment. Any command will now be executed within the virtual environment. Use &lt;code&gt;exit&lt;/code&gt; to leave the shell&amp;nbsp;session.&lt;/p&gt;</content><category term="machine-learning-engineering"></category><category term="python"></category><category term="machine-learning"></category><category term="machine-learning-operations"></category><category term="kubernetes"></category></entry><entry><title>Bayesian Regression in PYMC3 using MCMC &amp; Variational Inference</title><link href="https://alexioannides.github.io/2018/11/07/bayesian-regression-in-pymc3-using-mcmc-variational-inference/" rel="alternate"></link><published>2018-11-07T00:00:00+00:00</published><updated>2018-11-07T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2018-11-07:/2018/11/07/bayesian-regression-in-pymc3-using-mcmc-variational-inference/</id><summary type="html">&lt;p&gt;&lt;img alt="jpeg" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/pymc3_logo.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Conducting a Bayesian data analysis - e.g. estimating a Bayesian linear regression model - will usually require some form of Probabilistic Programming Language (&lt;span class="caps"&gt;PPL&lt;/span&gt;), unless analytical approaches (e.g. based on conjugate prior models), are appropriate for the task at hand. More often than not, PPLs implement Markov Chain Monte Carlo …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="jpeg" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/pymc3_logo.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Conducting a Bayesian data analysis - e.g. estimating a Bayesian linear regression model - will usually require some form of Probabilistic Programming Language (&lt;span class="caps"&gt;PPL&lt;/span&gt;), unless analytical approaches (e.g. based on conjugate prior models) are appropriate for the task at hand. More often than not, PPLs implement Markov Chain Monte Carlo (&lt;span class="caps"&gt;MCMC&lt;/span&gt;) algorithms that allow one to draw samples and make inferences from the posterior distribution implied by the choice of model - the likelihood and prior distributions for its parameters - conditional on the observed&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;&lt;span class="caps"&gt;MCMC&lt;/span&gt; algorithms are, generally speaking, computationally expensive and do not scale very easily. For example, it is not as easy to distribute the execution of these algorithms over a cluster of machines, when compared to the optimisation algorithms used for training deep neural networks (e.g. stochastic gradient&amp;nbsp;descent).&lt;/p&gt;
&lt;p&gt;Over the past few years, however, a new class of algorithms for inferring Bayesian models has been developed that does &lt;strong&gt;not&lt;/strong&gt; rely heavily on computationally expensive random sampling. These algorithms are referred to as Variational Inference (&lt;span class="caps"&gt;VI&lt;/span&gt;) algorithms and have been shown to be successful, with the potential to scale to &amp;#8216;large&amp;#8217;&amp;nbsp;datasets.&lt;/p&gt;
&lt;p&gt;My preferred &lt;span class="caps"&gt;PPL&lt;/span&gt; is &lt;a href="https://docs.pymc.io"&gt;&lt;span class="caps"&gt;PYMC3&lt;/span&gt;&lt;/a&gt;, which offers a choice of both &lt;span class="caps"&gt;MCMC&lt;/span&gt; and &lt;span class="caps"&gt;VI&lt;/span&gt; algorithms for inferring models in Bayesian data analysis. This blog post is based on a Jupyter notebook located in &lt;a href="https://github.com/AlexIoannides/pymc-advi-hmc-demo"&gt;this GitHub repository&lt;/a&gt;, whose purpose is to demonstrate using &lt;span class="caps"&gt;PYMC3&lt;/span&gt;, how &lt;span class="caps"&gt;MCMC&lt;/span&gt; and &lt;span class="caps"&gt;VI&lt;/span&gt; can both be used to perform a simple linear regression, and to make a basic comparison of their&amp;nbsp;results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of&amp;nbsp;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#a-very-quick-introduction-to-bayesian-data-analysis"&gt;A (very) Quick Introduction to Bayesian Data&amp;nbsp;Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#imports-and-global-settings"&gt;Imports and Global&amp;nbsp;Settings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#create-synthetic-data"&gt;Create Synthetic&amp;nbsp;Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#split-data-into-training-and-test-sets"&gt;Split Data into Training and Test&amp;nbsp;Sets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#define-bayesian-regression-model"&gt;Define Bayesian Regression&amp;nbsp;Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#model-inference-using-mcmc-hmc"&gt;Model Inference Using &lt;span class="caps"&gt;MCMC&lt;/span&gt; (&lt;span class="caps"&gt;HMC&lt;/span&gt;)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#model-inference-using-variational-inference-mini-batch-advi"&gt;Model Inference using Variational Inference (mini-batch &lt;span class="caps"&gt;ADVI&lt;/span&gt;)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#comparing-predictions"&gt;Comparing&amp;nbsp;Predictions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="a-very-quick-introduction-to-bayesian-data-analysis"&gt;A (very) Quick Introduction to Bayesian Data&amp;nbsp;Analysis&lt;/h2&gt;
&lt;p&gt;Like statistical data analysis more broadly, the main aim of Bayesian Data Analysis (&lt;span class="caps"&gt;BDA&lt;/span&gt;) is to infer unknown parameters for models of observed data, in order to test hypotheses about the physical processes that lead to the observations. Bayesian data analysis deviates from traditional statistics - on a practical level - when it comes to the explicit assimilation of prior knowledge regarding the uncertainty of the model parameters, into the statistical inference process and overall analysis workflow. To this end, &lt;span class="caps"&gt;BDA&lt;/span&gt; focuses on the posterior&amp;nbsp;distribution,&lt;/p&gt;
&lt;p&gt;$$
p(\Theta | X) = \frac{p(X | \Theta) \cdot p(\Theta)}{p(X)}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;Where,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$\Theta$ is the vector of unknown model parameters, that we wish to&amp;nbsp;estimate; &lt;/li&gt;
&lt;li&gt;$X$ is the vector of observed&amp;nbsp;data;&lt;/li&gt;
&lt;li&gt;$p(X | \Theta)$ is the likelihood function that models the probability of observing the data for a fixed choice of parameters;&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;$p(\Theta)$ is the prior distribution of the model&amp;nbsp;parameters.&lt;/li&gt;
&lt;/ul&gt;
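&lt;p&gt;For intuition, the posterior can be computed directly from Bayes&amp;#8217; theorem on a discrete grid of parameter values - a hypothetical example of inferring a coin&amp;#8217;s bias after observing 6 heads in 9&amp;nbsp;flips,&lt;/p&gt;

```python
# Sketch: Bayes' theorem evaluated on a discrete grid - a hypothetical
# coin-bias example, with a binomial likelihood and a uniform prior.
from math import comb

thetas = [i / 100 for i in range(101)]       # grid of candidate parameters
prior = [1 / len(thetas)] * len(thetas)      # uniform prior p(theta)

heads, flips = 6, 9
likelihood = [comb(flips, heads) * t**heads * (1 - t)**(flips - heads)
              for t in thetas]               # p(X | theta) for each theta

evidence = sum(l * p for l, p in zip(likelihood, prior))           # p(X)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]  # p(theta | X)

map_estimate = thetas[posterior.index(max(posterior))]  # mode, close to 6/9
```

&lt;p&gt;PPLs such as &lt;span class="caps"&gt;PYMC3&lt;/span&gt; exist because this brute-force grid evaluation becomes intractable as the number of model parameters&amp;nbsp;grows.&lt;/p&gt;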
&lt;p&gt;For an &lt;strong&gt;excellent&lt;/strong&gt; (inspirational) introduction to practical &lt;span class="caps"&gt;BDA&lt;/span&gt;, take a look at &lt;a href="https://xcelab.net/rm/statistical-rethinking/"&gt;Statistical Rethinking by Richard McElreath&lt;/a&gt;, or for a more theoretical treatment try &lt;a href="http://www.stat.columbia.edu/~gelman/book/"&gt;Bayesian Data Analysis by Gelman &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; co.&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This notebook is concerned with demonstrating and comparing two separate approaches for inferring the posterior distribution, $p(\Theta | X)$, for a linear regression&amp;nbsp;model.&lt;/p&gt;
&lt;h2 id="imports-and-global-settings"&gt;Imports and Global&amp;nbsp;Settings&lt;/h2&gt;
&lt;p&gt;Before we get going in earnest, we follow the convention of declaring all imports at the top of the&amp;nbsp;notebook.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pymc3&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;theano&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;warnings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy.random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then apply notebook-wide (global) settings that enable in-line plotting, configure Seaborn for visualisation and explicitly ignore warnings (e.g. NumPy&amp;nbsp;deprecations).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filterwarnings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ignore&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="create-synthetic-data"&gt;Create Synthetic&amp;nbsp;Data&lt;/h2&gt;
&lt;p&gt;We will assume that there is a dependent variable (or labelled data) $\tilde{y}$, that is a linear function of independent variables (or feature data), $x$ and $c$. In this instance, $x$ is a positive real number and $c$ denotes membership to one of two categories that occur with equal likelihood. We express this model mathematically, as&amp;nbsp;follows,&lt;/p&gt;
&lt;p&gt;$$
\tilde{y} = \alpha_{c} + \beta_{c} \cdot x + \sigma \cdot \tilde{\epsilon}&amp;nbsp;$$&lt;/p&gt;
&lt;p&gt;where $\tilde{\epsilon} \sim N(0, 1)$, $\sigma$ is the standard deviation of the noise in the data and $c \in \{0, 1\}$ denotes the category. We start by defining our &lt;em&gt;a priori&lt;/em&gt; choices for the model&amp;nbsp;parameters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;alpha_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;alpha_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;

&lt;span class="n"&gt;beta_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;beta_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;

&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then use these to generate some random samples that we store in a DataFrame and visualise using the Seaborn&amp;nbsp;package.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha_0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha_1&lt;/span&gt;
     &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;beta_0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;beta_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
     &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;model_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;category&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;y&lt;/th&gt;
      &lt;th&gt;x&lt;/th&gt;
      &lt;th&gt;category&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;3.429483&lt;/td&gt;
      &lt;td&gt;2.487456&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;6.987868&lt;/td&gt;
      &lt;td&gt;5.801619&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;3.340802&lt;/td&gt;
      &lt;td&gt;3.046879&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;8.826015&lt;/td&gt;
      &lt;td&gt;6.172437&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;10.659304&lt;/td&gt;
      &lt;td&gt;9.829751&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_9_1.png"&gt;&lt;/p&gt;
&lt;h2 id="split-data-into-training-and-test-sets"&gt;Split Data into Training and Test&amp;nbsp;Sets&lt;/h2&gt;
&lt;p&gt;One of the advantages of generating synthetic data is that we can ensure we have enough data to be able to partition it into two sets - one for training models and one for testing models. We use a helper function from the Scikit-Learn package for this task and make use of stratified sampling to ensure that we have a balanced representation of each category in both training and test&amp;nbsp;datasets.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
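&lt;p&gt;As a quick sanity check (a hypothetical aside using freshly simulated categories, rather than the DataFrame above), we can confirm that stratification preserves the category proportions in both partitions:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sanity check (not in the original notebook): stratified sampling keeps
# the proportion of each category (almost) identical across partitions.
rng = np.random.default_rng(42)
category = rng.binomial(n=1, p=0.5, size=1000)

train_cat, test_cat = train_test_split(category, test_size=0.2, stratify=category)

print(category.mean(), train_cat.mean(), test_cat.mean())  # all approximately equal
```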

&lt;p&gt;We will be using the &lt;a href="https://docs.pymc.io"&gt;&lt;span class="caps"&gt;PYMC3&lt;/span&gt;&lt;/a&gt; package for building and estimating our Bayesian regression models, which in turn uses the Theano package as a computational &amp;#8216;back-end&amp;#8217; (in much the same way that the Keras package for deep learning uses TensorFlow as a back-end). Consequently, we will have to interact with Theano if we want the ability to swap between training and test data (which we do). As such, we will explicitly define &amp;#8216;shared&amp;#8217; tensors for all of our model&amp;nbsp;variables.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;y_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theano&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;float64&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theano&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;float64&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cat_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theano&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;int64&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="define-bayesian-regression-model"&gt;Define Bayesian Regression&amp;nbsp;Model&lt;/h2&gt;
&lt;p&gt;Now we move on to define the model that we want to estimate (i.e. our hypothesis regarding the data), irrespective of how we will perform the inference. We will assume full knowledge of the data-generating model we defined above and define conservative regularising priors for each of the model&amp;nbsp;parameters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;alpha_prior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;beta_prior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;beta&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sigma_prior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mu_likelihood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha_prior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cat_tensor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta_prior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cat_tensor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x_tensor&lt;/span&gt;
    &lt;span class="n"&gt;y_likelihood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu_likelihood&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma_prior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="model-inference-using-mcmc-hmc"&gt;Model Inference Using &lt;span class="caps"&gt;MCMC&lt;/span&gt; (&lt;span class="caps"&gt;HMC&lt;/span&gt;)&lt;/h2&gt;
&lt;p&gt;We will make use of the default &lt;span class="caps"&gt;MCMC&lt;/span&gt; method in &lt;span class="caps"&gt;PYMC3&lt;/span&gt;&amp;#8217;s &lt;code&gt;sample&lt;/code&gt; function, which is Hamiltonian Monte Carlo (&lt;span class="caps"&gt;HMC&lt;/span&gt;). Those interested in the precise details of the &lt;span class="caps"&gt;HMC&lt;/span&gt; algorithm are directed to the &lt;a href="https://arxiv.org/abs/1701.02434"&gt;excellent paper by Michael Betancourt&lt;/a&gt;. Briefly, &lt;span class="caps"&gt;MCMC&lt;/span&gt; algorithms work by defining multi-dimensional Markovian stochastic processes that, when simulated (using Monte Carlo methods), will eventually converge to a state where successive simulations will be equivalent to drawing random samples from the posterior distribution of the model we wish to&amp;nbsp;estimate.&lt;/p&gt;
&lt;p&gt;The posterior distribution has one dimension for each model parameter, so we can then use the distribution of samples for each parameter to infer the range of possible values and/or compute point estimates (e.g. by taking the mean of all&amp;nbsp;samples).&lt;/p&gt;
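&lt;p&gt;To make this concrete, here is how a point estimate and a 95% credible interval would be read off a set of hypothetical posterior draws for a single parameter (the real draws are generated by &lt;code&gt;pm.sample&lt;/code&gt; below):&lt;/p&gt;

```python
import numpy as np

# Hypothetical posterior draws for one parameter, to show how summaries
# are read straight off the samples (illustrative values only).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.25, scale=0.05, size=5000)

point_estimate = samples.mean()                        # posterior mean
ci_low, ci_high = np.percentile(samples, [2.5, 97.5])  # 95% credible interval

print(point_estimate, ci_low, ci_high)
```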
&lt;p&gt;For the purposes of this demonstration, we sample two chains in parallel (as we have two &lt;span class="caps"&gt;CPU&lt;/span&gt; cores available for doing so and this effectively doubles the number of samples), allow 1,000 steps for each chain to converge to its steady-state and then sample for a further 5,000 steps - i.e. generate 5,000 samples from the posterior distribution, assuming that the chain has converged after 1,000&amp;nbsp;samples.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;hmc_trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let&amp;#8217;s take a look at what we can infer from the &lt;span class="caps"&gt;HMC&lt;/span&gt; samples of the posterior&amp;nbsp;distribution.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traceplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hmc_trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hmc_trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;th&gt;sd&lt;/th&gt;
      &lt;th&gt;mc_error&lt;/th&gt;
      &lt;th&gt;hpd_2.5&lt;/th&gt;
      &lt;th&gt;hpd_97.5&lt;/th&gt;
      &lt;th&gt;n_eff&lt;/th&gt;
      &lt;th&gt;Rhat&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;beta__0&lt;/th&gt;
      &lt;td&gt;1.002347&lt;/td&gt;
      &lt;td&gt;0.013061&lt;/td&gt;
      &lt;td&gt;0.000159&lt;/td&gt;
      &lt;td&gt;0.977161&lt;/td&gt;
      &lt;td&gt;1.028955&lt;/td&gt;
      &lt;td&gt;5741.410305&lt;/td&gt;
      &lt;td&gt;0.999903&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;beta__1&lt;/th&gt;
      &lt;td&gt;1.250504&lt;/td&gt;
      &lt;td&gt;0.012084&lt;/td&gt;
      &lt;td&gt;0.000172&lt;/td&gt;
      &lt;td&gt;1.226709&lt;/td&gt;
      &lt;td&gt;1.273830&lt;/td&gt;
      &lt;td&gt;5293.506143&lt;/td&gt;
      &lt;td&gt;1.000090&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;alpha__0&lt;/th&gt;
      &lt;td&gt;0.989984&lt;/td&gt;
      &lt;td&gt;0.073328&lt;/td&gt;
      &lt;td&gt;0.000902&lt;/td&gt;
      &lt;td&gt;0.850417&lt;/td&gt;
      &lt;td&gt;1.141318&lt;/td&gt;
      &lt;td&gt;5661.466167&lt;/td&gt;
      &lt;td&gt;0.999900&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;alpha__1&lt;/th&gt;
      &lt;td&gt;1.204203&lt;/td&gt;
      &lt;td&gt;0.069373&lt;/td&gt;
      &lt;td&gt;0.000900&lt;/td&gt;
      &lt;td&gt;1.069428&lt;/td&gt;
      &lt;td&gt;1.339139&lt;/td&gt;
      &lt;td&gt;5514.158012&lt;/td&gt;
      &lt;td&gt;1.000004&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sigma__0&lt;/th&gt;
      &lt;td&gt;0.734316&lt;/td&gt;
      &lt;td&gt;0.017956&lt;/td&gt;
      &lt;td&gt;0.000168&lt;/td&gt;
      &lt;td&gt;0.698726&lt;/td&gt;
      &lt;td&gt;0.768540&lt;/td&gt;
      &lt;td&gt;8925.864908&lt;/td&gt;
      &lt;td&gt;1.000337&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_19_1.png"&gt;&lt;/p&gt;
&lt;p&gt;Firstly, note that &lt;code&gt;Rhat&lt;/code&gt; values (the Gelman-Rubin statistic) converging to 1 implies chain convergence for the marginal parameter distributions, while &lt;code&gt;n_eff&lt;/code&gt; describes the effective number of samples after autocorrelations in the chains have been accounted for. We can see from the &lt;code&gt;mean&lt;/code&gt; (point) estimate of each parameter that &lt;span class="caps"&gt;HMC&lt;/span&gt; has done a reasonable job of estimating our original&amp;nbsp;parameters.&lt;/p&gt;
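&lt;p&gt;For intuition, a simplified version of the statistic behind &lt;code&gt;Rhat&lt;/code&gt; can be sketched as follows - it compares between-chain and within-chain variance, so chains that agree yield values near 1 (the exact estimator used by &lt;span class="caps"&gt;PYMC3&lt;/span&gt; differs in its details):&lt;/p&gt;

```python
import numpy as np

def gelman_rubin(chains):
    """Simplified Gelman-Rubin R-hat for an (m_chains, n_samples) array."""
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)        # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    v_hat = (n - 1) / n * w + b / n        # pooled posterior variance estimate
    return np.sqrt(v_hat / w)

rng = np.random.default_rng(1)
converged = rng.normal(0, 1, size=(2, 5000))   # both chains sample N(0, 1)
diverged = np.stack([rng.normal(0, 1, 5000),
                     rng.normal(3, 1, 5000)])  # chains stuck in different regions

print(gelman_rubin(converged), gelman_rubin(diverged))  # near 1 vs. well above 1
```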
&lt;h2 id="model-inference-using-variational-inference-mini-batch-advi"&gt;Model Inference using Variational Inference (mini-batch &lt;span class="caps"&gt;ADVI&lt;/span&gt;)&lt;/h2&gt;
&lt;p&gt;Variational Inference (&lt;span class="caps"&gt;VI&lt;/span&gt;) takes a completely different approach to inference. Briefly, &lt;span class="caps"&gt;VI&lt;/span&gt; is a name for a class of algorithms that seek to fit a chosen class of functions to approximate the posterior distribution, effectively turning inference into an optimisation problem. In this instance &lt;span class="caps"&gt;VI&lt;/span&gt; minimises the &lt;a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence"&gt;Kullback–Leibler (&lt;span class="caps"&gt;KL&lt;/span&gt;) divergence&lt;/a&gt; (a measure of the &amp;#8216;similarity&amp;#8217; between two densities), between the approximated posterior density and the actual posterior density. An excellent review of &lt;span class="caps"&gt;VI&lt;/span&gt; can be found in the &lt;a href="https://arxiv.org/abs/1601.00670"&gt;paper by Blei &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; co.&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Just to make things more complicated (and for this description to be complete), the &lt;span class="caps"&gt;KL&lt;/span&gt; divergence is actually minimised by maximising the Evidence Lower BOund (&lt;span class="caps"&gt;ELBO&lt;/span&gt;), which is equal to the negative of the &lt;span class="caps"&gt;KL&lt;/span&gt; divergence up to a constant term. This constant - the log of the model evidence - is computationally infeasible to compute, which is why, technically, we optimise the &lt;span class="caps"&gt;ELBO&lt;/span&gt; and not the &lt;span class="caps"&gt;KL&lt;/span&gt; divergence, albeit to achieve the same&amp;nbsp;end-goal.&lt;/p&gt;
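&lt;p&gt;Concretely, writing $q(\Theta)$ for the approximating density, the identity underpinning this is:&lt;/p&gt;

```latex
\log p(X)
  = \underbrace{\mathbb{E}_{q}\big[\log p(X, \Theta)\big]
      - \mathbb{E}_{q}\big[\log q(\Theta)\big]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\big(q(\Theta) \,\|\, p(\Theta | X)\big)
```

&lt;p&gt;Since $\log p(X)$ does not depend on $q$, maximising the &lt;span class="caps"&gt;ELBO&lt;/span&gt; is equivalent to minimising the &lt;span class="caps"&gt;KL&lt;/span&gt;&amp;nbsp;divergence.&lt;/p&gt;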
&lt;p&gt;We are going to make use of &lt;span class="caps"&gt;PYMC3&lt;/span&gt;&amp;#8217;s Auto-Differentiation Variational Inference (&lt;span class="caps"&gt;ADVI&lt;/span&gt;) algorithm (full details in the paper by &lt;a href="https://arxiv.org/abs/1603.00788"&gt;Kucukelbir &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; co.&lt;/a&gt;), which is capable of computing a &lt;span class="caps"&gt;VI&lt;/span&gt; for any differentiable posterior distribution (i.e. any model with continuous prior distributions). In order to achieve this very clever feat (the paper is well worth a read), the algorithm first maps the posterior into a space where all prior distributions have the same support, such that they can be well approximated by fitting a spherical n-dimensional Gaussian distribution within this space - this is referred to as the &amp;#8216;Gaussian mean-field approximation&amp;#8217;. Note that, due to the initial transformation, this is &lt;strong&gt;not&lt;/strong&gt; the same as approximating the posterior distribution using an n-dimensional Normal distribution. The parameters of this Gaussian are then chosen to maximise the &lt;span class="caps"&gt;ELBO&lt;/span&gt; using gradient ascent - i.e. using high-performance auto-differentiation techniques in numerical computing back-ends such as Theano, TensorFlow,&amp;nbsp;etc.&lt;/p&gt;
&lt;p&gt;The assumption of a spherical Gaussian distribution does, however, imply no dependency (i.e. zero correlations) between parameter distributions. One of the advantages of &lt;span class="caps"&gt;HMC&lt;/span&gt; over &lt;span class="caps"&gt;ADVI&lt;/span&gt; is that it captures these correlations, which, when ignored, can lead to under-estimated variances in the parameter distributions. &lt;span class="caps"&gt;ADVI&lt;/span&gt; gives these up in the name of computational efficiency (i.e. speed and scale of data). This simplifying assumption can be dropped, however, as &lt;span class="caps"&gt;PYMC3&lt;/span&gt; does offer the option to use &amp;#8216;full-rank&amp;#8217; Gaussians, but I have not used this in anger&amp;nbsp;(yet).&lt;/p&gt;
&lt;p&gt;We also take the opportunity to make use of &lt;span class="caps"&gt;PYMC3&lt;/span&gt;&amp;#8217;s ability to compute &lt;span class="caps"&gt;ADVI&lt;/span&gt; using &amp;#8216;batched&amp;#8217; data, analogous to how Stochastic Gradient Descent (&lt;span class="caps"&gt;SGD&lt;/span&gt;) is used to optimise loss functions in deep neural networks. This further facilitates model training at scale, as the auto-differentiated, batched computations can be distributed across &lt;span class="caps"&gt;CPU&lt;/span&gt;s (or&amp;nbsp;GPUs).&lt;/p&gt;
&lt;p&gt;In order to enable mini-batch &lt;span class="caps"&gt;ADVI&lt;/span&gt;, we first have to setup the mini-batches (we use batches of 100&amp;nbsp;samples).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;map_tensor_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;y_tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minibatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;x_tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minibatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;cat_tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minibatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then compute the variational inference using 30,000 iterations (for the gradient ascent of the &lt;span class="caps"&gt;ELBO&lt;/span&gt;). We use the &lt;code&gt;more_replacements&lt;/code&gt; keyword argument to swap out the original Theano tensors for the batched versions defined&amp;nbsp;above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;advi_fit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ADVI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;more_replacements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;map_tensor_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
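To build intuition for why minibatching works here, the following is a plain NumPy sketch (not PyMC3): scaling a minibatch log-likelihood by the ratio of dataset size to batch size yields a noisy but unbiased estimate of the full-data log-likelihood, which is what makes stochastic gradient ascent on the ELBO viable. The data and parameter values below are illustrative only.

```python
import numpy as np

# Noisy-but-unbiased minibatch estimation, the idea behind pm.Minibatch.
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=0.5, size=5000)

def full_log_lik(mu, x, sigma=0.5):
    # Gaussian log-likelihood over the whole dataset (up to a constant).
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2)

def minibatch_log_lik(mu, x, batch_size=100, sigma=0.5):
    # Evaluate on a random batch, then rescale by N / batch_size.
    batch = rng.choice(x, size=batch_size, replace=False)
    return (x.size / batch_size) * full_log_lik(mu, batch, sigma)

full = full_log_lik(0.9, data)
estimates = np.array([minibatch_log_lik(0.9, data) for _ in range(2000)])
# The average of many minibatch estimates approaches the full-data value.
print(full, estimates.mean())
```

Each individual estimate is noisy, but the noise averages out over many gradient steps.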

&lt;p&gt;Before we take a look at the parameters, let&amp;#8217;s make sure the &lt;span class="caps"&gt;ADVI&lt;/span&gt; fit has converged by plotting &lt;span class="caps"&gt;ELBO&lt;/span&gt; as a function of the number of&amp;nbsp;iterations.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;advi_elbo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;log-ELBO&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advi_fit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advi_fit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])})&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;log-ELBO&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;advi_elbo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_27_0.png"&gt;&lt;/p&gt;
&lt;p&gt;In order to see what we can infer from the posterior distribution we have fit with &lt;span class="caps"&gt;ADVI&lt;/span&gt;, we first have to draw samples from it, before summarising them as we did for the &lt;span class="caps"&gt;HMC&lt;/span&gt;&amp;nbsp;inference.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;advi_trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;advi_fit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traceplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advi_trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advi_trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;th&gt;sd&lt;/th&gt;
      &lt;th&gt;mc_error&lt;/th&gt;
      &lt;th&gt;hpd_2.5&lt;/th&gt;
      &lt;th&gt;hpd_97.5&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;beta__0&lt;/th&gt;
      &lt;td&gt;1.000717&lt;/td&gt;
      &lt;td&gt;0.022073&lt;/td&gt;
      &lt;td&gt;0.000220&lt;/td&gt;
      &lt;td&gt;0.957703&lt;/td&gt;
      &lt;td&gt;1.044096&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;beta__1&lt;/th&gt;
      &lt;td&gt;1.250904&lt;/td&gt;
      &lt;td&gt;0.020917&lt;/td&gt;
      &lt;td&gt;0.000206&lt;/td&gt;
      &lt;td&gt;1.209715&lt;/td&gt;
      &lt;td&gt;1.292017&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;alpha__0&lt;/th&gt;
      &lt;td&gt;0.984404&lt;/td&gt;
      &lt;td&gt;0.122010&lt;/td&gt;
      &lt;td&gt;0.001109&lt;/td&gt;
      &lt;td&gt;0.755816&lt;/td&gt;
      &lt;td&gt;1.230404&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;alpha__1&lt;/th&gt;
      &lt;td&gt;1.192829&lt;/td&gt;
      &lt;td&gt;0.120833&lt;/td&gt;
      &lt;td&gt;0.001146&lt;/td&gt;
      &lt;td&gt;0.966362&lt;/td&gt;
      &lt;td&gt;1.433906&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sigma__0&lt;/th&gt;
      &lt;td&gt;0.760702&lt;/td&gt;
      &lt;td&gt;0.060009&lt;/td&gt;
      &lt;td&gt;0.000569&lt;/td&gt;
      &lt;td&gt;0.649582&lt;/td&gt;
      &lt;td&gt;0.883380&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_29_1.png"&gt;&lt;/p&gt;
&lt;p&gt;Not bad! The mean estimates are comparable, but we note that the standard deviations appear to be larger than those estimated with &lt;span class="caps"&gt;HMC&lt;/span&gt;.&lt;/p&gt;
&lt;h2 id="comparing-predictions"&gt;Comparing&amp;nbsp;Predictions&lt;/h2&gt;
&lt;p&gt;Let&amp;#8217;s move on to comparing the inference algorithms on the practical task of making predictions on our test dataset. We start by swapping the test data into our Theano&amp;nbsp;variables.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;y_tensor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x_tensor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cat_tensor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;int64&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
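The following plain-Python sketch (not Theano) illustrates the shared-variable pattern being used here: computations reference a mutable container, so swapping the underlying data re-points every downstream calculation at the test set without rebuilding the model. The `Shared` class is a hypothetical stand-in for Theano's shared variables.

```python
import numpy as np

# Minimal stand-in for a Theano shared variable.
class Shared:
    def __init__(self, value):
        self.value = np.asarray(value)
    def set_value(self, value):
        self.value = np.asarray(value)

x_shared = Shared([1.0, 2.0, 3.0])
prediction = lambda beta: beta * x_shared.value  # 'graph' built once

train_pred = prediction(2.0)                     # uses training data
x_shared.set_value([10.0, 20.0, 30.0])           # swap in 'test' data
test_pred = prediction(2.0)                      # same graph, new data
print(train_pred, test_pred)
```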

&lt;p&gt;We then draw posterior-predictive samples for each new data point, using their mean as the point estimate for&amp;nbsp;comparison.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;hmc_posterior_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_ppc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hmc_trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hmc_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hmc_posterior_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;advi_posterior_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_ppc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advi_trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;advi_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advi_posterior_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prediction_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HMC&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hmc_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
     &lt;span class="s1"&gt;&amp;#39;ADVI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;advi_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
     &lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="s1"&gt;&amp;#39;error_HMC&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hmc_predictions&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
     &lt;span class="s1"&gt;&amp;#39;error_ADVI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;advi_predictions&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lmplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ADVI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HMC&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;line_kws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;color&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_34_1.png"&gt;&lt;/p&gt;
&lt;p&gt;As we might expect, given the parameter estimates, the two models generate similar&amp;nbsp;predictions. &lt;/p&gt;
&lt;p&gt;To begin to get an insight into the differences between &lt;span class="caps"&gt;HMC&lt;/span&gt; and &lt;span class="caps"&gt;ADVI&lt;/span&gt;, we look at the inferred dependency structure between the samples of &lt;code&gt;alpha_0&lt;/code&gt; and &lt;code&gt;beta_0&lt;/code&gt;, for both &lt;span class="caps"&gt;HMC&lt;/span&gt; and &lt;span class="caps"&gt;VI&lt;/span&gt;, starting with &lt;span class="caps"&gt;HMC&lt;/span&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;param_samples_HMC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hmc_trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
     &lt;span class="s1"&gt;&amp;#39;beta_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hmc_trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;beta&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;beta_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;param_samples_HMC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HMC&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_36_0.png"&gt;&lt;/p&gt;
&lt;p&gt;And again for &lt;span class="caps"&gt;ADVI&lt;/span&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;param_samples_ADVI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;advi_trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
     &lt;span class="s1"&gt;&amp;#39;beta_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;advi_trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;beta&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;beta_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;param_samples_ADVI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ADVI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_38_0.png"&gt;&lt;/p&gt;
&lt;p&gt;We can clearly see the impact of &lt;span class="caps"&gt;ADVI&lt;/span&gt;&amp;#8217;s mean-field assumption of n-dimensional spherical Gaussians (i.e. uncorrelated parameters), manifest in the&amp;nbsp;inference!&lt;/p&gt;
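This limitation can be demonstrated directly with NumPy: a mean-field (diagonal-covariance) Gaussian cannot represent correlation between parameters, whereas HMC samples from the true, possibly correlated, posterior. The correlation value below is illustrative, not taken from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 'true' posterior with strong negative correlation, as often
# arises between an intercept (alpha) and a slope (beta).
cov = np.array([[1.0, -0.8], [-0.8, 1.0]])
correlated = rng.multivariate_normal([0.0, 0.0], cov, size=10000)

# The best mean-field approximation keeps the marginal variances but is
# forced to assume independence between the two parameters.
mean_field = rng.normal(0.0, 1.0, size=(10000, 2))

corr_true = np.corrcoef(correlated.T)[0, 1]
corr_mf = np.corrcoef(mean_field.T)[0, 1]
print(round(corr_true, 2), round(corr_mf, 2))
```

The mean-field samples show (near) zero correlation regardless of the structure in the true posterior, which is exactly the difference visible in the two scatter plots.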
&lt;p&gt;Finally, let&amp;#8217;s compare predictions with the actual&amp;nbsp;data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;RMSE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_ADVI&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RMSE for ADVI predictions = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;RMSE&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.3f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lmplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ADVI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prediction_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="n"&gt;line_kws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;color&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;RMSE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;ADVI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;.&lt;span class="mi"&gt;746&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="https://alexioannides.github.io/images/data_science/mcmc_vi_pymc3/output_40_1.png"&gt;&lt;/p&gt;
&lt;p&gt;This is what one might expect, given the data-generating model: the &lt;span class="caps"&gt;RMSE&lt;/span&gt; is close to the estimated noise standard deviation (&lt;code&gt;sigma__0&lt;/code&gt;&amp;nbsp;above).&lt;/p&gt;
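A quick NumPy sketch makes this concrete: when data are generated with additive Gaussian noise, even a perfect point estimate of the regression function cannot beat an RMSE of roughly the noise standard deviation. The parameter values below are illustrative stand-ins, loosely matching the fitted model, not the article's actual data-generating code.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.75                          # assumed noise scale
x = rng.uniform(0.0, 1.0, 10000)
y = 1.0 + 1.25 * x + rng.normal(0.0, sigma, size=x.size)  # hypothetical truth

# An oracle that knows the true parameters still incurs the noise floor.
perfect_predictions = 1.0 + 1.25 * x
rmse = np.sqrt(np.mean((perfect_predictions - y) ** 2))
print(round(rmse, 2))
```

Any RMSE close to sigma therefore indicates the model is performing about as well as is possible on this data.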
&lt;h2 id="conclusions"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;&lt;span class="caps"&gt;MCMC&lt;/span&gt; and &lt;span class="caps"&gt;VI&lt;/span&gt; present two very different approaches for drawing inferences from Bayesian models. Despite these differences, their high-level output for a simplistic (but not entirely trivial) regression problem, based on synthetic data, is comparable regardless of the approximations used within &lt;span class="caps"&gt;ADVI&lt;/span&gt;. This is important to note, because general purpose &lt;span class="caps"&gt;VI&lt;/span&gt; algorithms such as &lt;span class="caps"&gt;ADVI&lt;/span&gt; have the potential to work at scale - on large volumes of data in a distributed computing environment (see the references embedded above, for case&amp;nbsp;studies).&lt;/p&gt;</content><category term="data-science"></category><category term="machine-learning"></category><category term="probabilistic-programming"></category><category term="python"></category><category term="pymc3"></category></entry><entry><title>Machine Learning Pipelines for R</title><link href="https://alexioannides.github.io/2017/05/08/machine-learning-pipelines-for-r/" rel="alternate"></link><published>2017-05-08T00:00:00+01:00</published><updated>2017-05-08T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2017-05-08:/2017/05/08/machine-learning-pipelines-for-r/</id><summary type="html">&lt;p&gt;&lt;img alt="pipes" src="https://alexioannides.github.io/images/r/pipeliner/pipelines1.png" title="Pipelines!"&gt;&lt;/p&gt;
&lt;p&gt;Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="pipes" src="https://alexioannides.github.io/images/r/pipeliner/pipelines1.png" title="Pipelines!"&gt;&lt;/p&gt;
&lt;p&gt;Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from these models requires repeated application of transformation and inverse-transformation functions - to go from the domain of the original input variables to the domain of the original output variables (via the model). This is usually quite a laborious and repetitive process that leads to messy code and&amp;nbsp;notebooks.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;pipeliner&lt;/code&gt; package aims to provide an elegant solution to these issues by implementing a common interface and workflow with which it is possible&amp;nbsp;to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;define transformation and inverse-transformation&amp;nbsp;functions;&lt;/li&gt;
&lt;li&gt;fit a model on training data; and&amp;nbsp;then,&lt;/li&gt;
&lt;li&gt;generate a prediction (or model-scoring) function that automatically applies the entire pipeline of transformations and inverse-transformations to the inputs and outputs of the inner-model and its predicted values (or&amp;nbsp;scores).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The idea of pipelines is inspired by the machine learning pipelines implemented in &lt;a href="http://spark.apache.org/docs/latest/ml-pipeline.html" title="Pipelines in Apache Spark MLlib"&gt;Apache Spark&amp;#8217;s MLlib library&lt;/a&gt; (which are in turn inspired by Python&amp;#8217;s scikit-learn package). This package is still in its infancy and the latest development version can be downloaded from &lt;a href="https://github.com/AlexIoannides/pipeliner" title="Pipeliner on GitHub"&gt;this GitHub repository&lt;/a&gt; using the &lt;code&gt;devtools&lt;/code&gt; package (bundled with&amp;nbsp;RStudio),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;devtools&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;install_github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alexioannides/pipeliner&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="pipes-in-the-pipeline"&gt;Pipes in the&amp;nbsp;Pipeline&lt;/h2&gt;
&lt;p&gt;There are currently four types of pipeline section - a section being a function that wraps a user-defined function - that can be assembled into a&amp;nbsp;pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;transform_features&lt;/code&gt;: wraps a function that maps input variables (or features) to another space -&amp;nbsp;e.g.,&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;transform_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;var1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;transform_response&lt;/code&gt;: wraps a function that maps the response variable to another space -&amp;nbsp;e.g.,&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;estimate_model&lt;/code&gt;: wraps a function that defines how to estimate a model from training data in a data.frame -&amp;nbsp;e.g.,&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;estimate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;inv_transform_response&lt;/code&gt;: wraps a function that is the inverse of &lt;code&gt;transform_response&lt;/code&gt;, such that we can map from the space of inner-model predictions to that of the output domain -&amp;nbsp;e.g.,&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;inv_transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;pred_y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As demonstrated above, each of these functions expects as its argument a unary function of a single data.frame. All of the transform functions also expect their input functions to return data.frames, consisting entirely of columns &lt;strong&gt;not&lt;/strong&gt; present in the input data.frame. The one &lt;strong&gt;exception&lt;/strong&gt; is &lt;code&gt;estimate_model&lt;/code&gt;, whose input function must return an object that has a &lt;code&gt;predict.object-class-name&lt;/code&gt; method available in the current environment (e.g. &lt;code&gt;predict.lm&lt;/code&gt; for linear models built using &lt;code&gt;lm()&lt;/code&gt;). If any of these rules are violated, appropriately named errors will be thrown to help you locate the&amp;nbsp;issue.&lt;/p&gt;
&lt;p&gt;If this sounds complex and convoluted then I encourage you to skip to the examples below - this framework is &lt;strong&gt;very&lt;/strong&gt; simple to use in practice. Simplicity is the key aim&amp;nbsp;here.&lt;/p&gt;
&lt;h2 id="two-interfaces-to-rule-them-all"&gt;Two Interfaces to Rule Them&amp;nbsp;All&lt;/h2&gt;
&lt;p&gt;I am a great believer in, and proponent of, functional programming - especially for data-related tasks like building machine learning models. At the same time, the notion of a &amp;#8216;machine learning pipeline&amp;#8217; is well represented by a simple object-oriented class hierarchy (which is how it is implemented in &lt;a href="http://spark.apache.org/docs/latest/ml-pipeline.html" title="Pipelines in Apache Spark MLib"&gt;Apache Spark&amp;#8217;s MLlib&lt;/a&gt;). I couldn&amp;#8217;t decide which style of interface was best, so I implemented both within &lt;code&gt;pipeliner&lt;/code&gt; (using the same underlying code) and ensured their outputs can be used interchangeably. To keep this introduction simple, however, I&amp;#8217;m only going to talk about the functional interface - those interested in the (more) object-oriented approach are encouraged to read the manual pages for the &lt;code&gt;ml_pipeline_builder&lt;/code&gt; &lt;span class="quo"&gt;&amp;#8216;&lt;/span&gt;class&amp;#8217;.&lt;/p&gt;
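&lt;p&gt;For a quick flavour of the object-oriented interface before we move on, a minimal sketch of an equivalent pipeline built with &lt;code&gt;ml_pipeline_builder&lt;/code&gt; might look as follows - note that the exact method names here are assumptions best confirmed against the package&amp;#8217;s manual&amp;nbsp;pages,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# sketch only - method names assumed from the pipeliner manual pages
lm_pipeline &amp;lt;- ml_pipeline_builder()

lm_pipeline$transform_features(function(df) {
  data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
})

lm_pipeline$transform_response(function(df) {
  data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
})

lm_pipeline$estimate_model(function(df) {
  lm(y ~ 1 + x1, df)
})

lm_pipeline$inv_transform_response(function(df) {
  data.frame(pred_eruptions = df$pred_model * sd(df$eruptions) + mean(df$eruptions))
})

lm_pipeline$fit(faithful)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;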
&lt;h3 id="example-usage-with-a-functional-flavor"&gt;Example Usage with a Functional&amp;nbsp;Flavor&lt;/h3&gt;
&lt;p&gt;We use the &lt;code&gt;faithful&lt;/code&gt; dataset shipped with R, together with the &lt;code&gt;pipeliner&lt;/code&gt; package, to estimate a linear regression model for the eruption duration of &amp;#8216;Old Faithful&amp;#8217; as a function of the inter-eruption waiting time. The transformations we apply to the input and response variables - before we estimate the model - are simple centring and scaling by the mean and standard deviation (i.e. mapping the variables to&amp;nbsp;z-scores).&lt;/p&gt;
&lt;p&gt;The end-to-end process for building the pipeline, estimating the model and generating in-sample predictions (that include all interim variable transformations), is as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeliner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;faithful&lt;/span&gt;

&lt;span class="n"&gt;lm_pipeline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;transform_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;estimate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;inv_transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_eruptions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;pred_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;in_sample_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_sample_predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;##   eruptions waiting         x1 pred_model pred_eruptions&lt;/span&gt;
&lt;span class="c1"&gt;## 1     3.600      79  0.5960248  0.5369058       4.100592&lt;/span&gt;
&lt;span class="c1"&gt;## 2     1.800      54 -1.2428901 -1.1196093       2.209893&lt;/span&gt;
&lt;span class="c1"&gt;## 3     3.333      74  0.2282418  0.2056028       3.722452&lt;/span&gt;
&lt;span class="c1"&gt;## 4     2.283      62 -0.6544374 -0.5895245       2.814917&lt;/span&gt;
&lt;span class="c1"&gt;## 5     4.533      85  1.0373644  0.9344694       4.554360&lt;/span&gt;
&lt;span class="c1"&gt;## 6     2.883      55 -1.1693335 -1.0533487       2.285521&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="accessing-inner-models-prediction-functions"&gt;Accessing Inner Models &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Prediction&amp;nbsp;Functions&lt;/h3&gt;
&lt;p&gt;We can access the estimated inner models directly and compute summaries, etc - for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm_pipeline&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;inner_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;##&lt;/span&gt;
&lt;span class="c1"&gt;## Call:&lt;/span&gt;
&lt;span class="c1"&gt;## lm(formula = y ~ 1 + x1, data = df)&lt;/span&gt;
&lt;span class="c1"&gt;##&lt;/span&gt;
&lt;span class="c1"&gt;## Residuals:&lt;/span&gt;
&lt;span class="c1"&gt;##      Min       1Q   Median       3Q      Max&lt;/span&gt;
&lt;span class="c1"&gt;## -1.13826 -0.33021  0.03074  0.30586  1.04549&lt;/span&gt;
&lt;span class="c1"&gt;##&lt;/span&gt;
&lt;span class="c1"&gt;## Coefficients:&lt;/span&gt;
&lt;span class="c1"&gt;##               Estimate Std. Error t value Pr(&amp;gt;|t|)    &lt;/span&gt;
&lt;span class="c1"&gt;## (Intercept) -3.139e-16  2.638e-02    0.00        1    &lt;/span&gt;
&lt;span class="c1"&gt;## x1           9.008e-01  2.643e-02   34.09   &amp;lt;2e-16 ***&lt;/span&gt;
&lt;span class="c1"&gt;## ---&lt;/span&gt;
&lt;span class="c1"&gt;## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/span&gt;
&lt;span class="c1"&gt;##&lt;/span&gt;
&lt;span class="c1"&gt;## Residual standard error: 0.435 on 270 degrees of freedom&lt;/span&gt;
&lt;span class="c1"&gt;## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108&lt;/span&gt;
&lt;span class="c1"&gt;## F-statistic:  1162 on 1 and 270 DF,  p-value: &amp;lt; 2.2e-16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pipeline prediction functions can also be accessed directly in a similar way - for&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pred_function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm_pipeline&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;pred_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;##   pred_eruptions&lt;/span&gt;
&lt;span class="c1"&gt;## 1       4.100592&lt;/span&gt;
&lt;span class="c1"&gt;## 2       2.209893&lt;/span&gt;
&lt;span class="c1"&gt;## 3       3.722452&lt;/span&gt;
&lt;span class="c1"&gt;## 4       2.814917&lt;/span&gt;
&lt;span class="c1"&gt;## 5       4.554360&lt;/span&gt;
&lt;span class="c1"&gt;## 6       2.285521&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1 id="turbo-charged-pipelines-in-the-tidyverse"&gt;Turbo-Charged Pipelines in the&amp;nbsp;Tidyverse&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;pipeliner&lt;/code&gt; approach to building models becomes even more concise when combined with the set of packages in the &lt;a href="http://tidyverse.org" title="Welcome to The Tidyverse!"&gt;tidyverse&lt;/a&gt;. For example, the &amp;#8216;Old Faithful&amp;#8217; pipeline could be rewritten&amp;nbsp;as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lm_pipeline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;transform_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;transmute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;transmute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;estimate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;inv_transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;transmute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pred_eruptions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pred_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;## [1] 4.100592 2.209893 3.722452 2.814917 4.554360 2.285521&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Nice, compact and expressive (if I do say so&amp;nbsp;myself)!&lt;/p&gt;
&lt;h3 id="compact-cross-validation"&gt;Compact&amp;nbsp;Cross-validation&lt;/h3&gt;
&lt;p&gt;If we now introduce the &lt;code&gt;modelr&lt;/code&gt; package into this workflow and adopt the list-columns pattern described in Hadley Wickham&amp;#8217;s &lt;a href="http://r4ds.had.co.nz/many-models.html#list-columns-1" title="R 4 Data Science - Many Models &amp;amp; List Columns"&gt;R for Data Science&lt;/a&gt;, we can also achieve wonderfully compact end-to-end model estimation and&amp;nbsp;cross-validation,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# define a function that estimates a machine learning pipeline on a single fold of the data&lt;/span&gt;
&lt;span class="n"&gt;pipeline_func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;transform_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;transmute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;transmute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;estimate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;inv_transform_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;transmute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pred_eruptions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pred_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 5-fold cross-validation using machine learning pipelines&lt;/span&gt;
&lt;span class="n"&gt;cv_rmse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;crossv_kfold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;pipeline_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as.data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;map2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.y&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;residuals&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;map2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eruptions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;map_dbl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;residuals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;summarise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_rmse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sd_rmse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cv_rmse&lt;/span&gt;
&lt;span class="c1"&gt;## # A tibble: 1 × 2&lt;/span&gt;
&lt;span class="c1"&gt;##   mean_rmse    sd_rmse&lt;/span&gt;
&lt;span class="c1"&gt;##       &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;## 1 0.4877222 0.05314748&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1 id="forthcoming-attractions"&gt;Forthcoming&amp;nbsp;Attractions&lt;/h1&gt;
&lt;p&gt;I built &lt;code&gt;pipeliner&lt;/code&gt; largely to fill a hole in my own workflows. Up until now I&amp;#8217;ve used Max Kuhn&amp;#8217;s excellent &lt;a href="http://topepo.github.io/caret/index.html" title="Caret"&gt;caret package&lt;/a&gt; quite a bit, but for in-the-moment model building (e.g. within an R Notebook) it wasn&amp;#8217;t simplifying the code &lt;em&gt;that&lt;/em&gt; much, and the style doesn&amp;#8217;t quite fit with the tidy and functional world that I now inhabit most of the time. So, I plugged the hole by myself. I intend to live with &lt;code&gt;pipeliner&lt;/code&gt; for a while to get an idea of where it might go next, but I am always open to suggestions (and bug notifications) - please &lt;a href="https://github.com/AlexIoannides/pipeliner/issues" title="Pipeliner Issues on GitHub"&gt;leave any ideas here&lt;/a&gt;.&lt;/p&gt;</content><category term="r"></category><category term="machine-learning"></category><category term="data-processing"></category></entry><entry><title>elasticsearchr - a Lightweight Elasticsearch Client for R</title><link href="https://alexioannides.github.io/2016/11/28/elasticsearchr-a-lightweight-elasticsearch-client-for-r/" rel="alternate"></link><published>2016-11-28T00:00:00+00:00</published><updated>2016-11-28T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-11-28:/2016/11/28/elasticsearchr-a-lightweight-elasticsearch-client-for-r/</id><summary type="html">&lt;p&gt;&lt;img alt="elasticsearchr" src="https://alexioannides.github.io/images/r/elasticsearchr/elasticsearchr2.png" title="Elasticsearchr"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.elastic.co/products/elasticsearch" title="Elasticsearch"&gt;Elasticsearch&lt;/a&gt; is a distributed &lt;a href="https://en.wikipedia.org/wiki/NoSQL" title="What is NoSQL?"&gt;NoSQL&lt;/a&gt; document store search-engine and &lt;a href="https://www.elastic.co/blog/elasticsearch-as-a-column-store" title="Elasticsearch as a Column Store"&gt;column-oriented database&lt;/a&gt;, whose &lt;strong&gt;fast&lt;/strong&gt; (near real-time) reads and powerful aggregation engine make it an excellent choice as an &amp;#8216;analytics database&amp;#8217; for R&amp;amp;D, production-use or both. Installation is simple, it ships with default settings that allow it to work effectively out-of-the-box …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="elasticsearchr" src="https://alexioannides.github.io/images/r/elasticsearchr/elasticsearchr2.png" title="Elasticsearchr"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.elastic.co/products/elasticsearch" title="Elasticsearch"&gt;Elasticsearch&lt;/a&gt; is a distributed &lt;a href="https://en.wikipedia.org/wiki/NoSQL" title="What is NoSQL?"&gt;NoSQL&lt;/a&gt; document store search-engine and &lt;a href="https://www.elastic.co/blog/elasticsearch-as-a-column-store" title="Elasticsearch as a Column Store"&gt;column-oriented database&lt;/a&gt;, whose &lt;strong&gt;fast&lt;/strong&gt; (near real-time) reads and powerful aggregation engine make it an excellent choice as an &amp;#8216;analytics database&amp;#8217; for R&amp;amp;D, production-use or both. Installation is simple, it ships with default settings that allow it to work effectively out-of-the-box, and all interaction is made via a set of intuitive and extremely &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" title="Elasticsearch documentation"&gt;well documented&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Representational_state_transfer" title="RESTful?"&gt;RESTful&lt;/a&gt; APIs. I&amp;#8217;ve been using it for two years now and I am&amp;nbsp;evangelical.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;elasticsearchr&lt;/code&gt; package implements a simple Domain-Specific Language (&lt;span class="caps"&gt;DSL&lt;/span&gt;) for indexing, deleting, querying, sorting and aggregating data in Elasticsearch, from within R. The main purpose of this package is to remove the labour involved with assembling &lt;span class="caps"&gt;HTTP&lt;/span&gt; requests to Elasticsearch&amp;#8217;s &lt;span class="caps"&gt;REST&lt;/span&gt; APIs and parsing the responses. Instead, users of this package need only send and receive data frames to Elasticsearch resources. Users needing richer functionality are encouraged to investigate the excellent &lt;code&gt;elastic&lt;/code&gt; package from the good people at &lt;a href="https://github.com/ropensci/elastic" title="rOpenSci"&gt;rOpenSci&lt;/a&gt;.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;This package is available on &lt;a href="https://cran.r-project.org/web/packages/elasticsearchr/" title="elasticsearchr on CRAN"&gt;&lt;span class="caps"&gt;CRAN&lt;/span&gt;&lt;/a&gt; or from &lt;a href="https://github.com/AlexIoannides/elasticsearchr" title="Alex's GitHub repository"&gt;this GitHub repository&lt;/a&gt;. To install the latest (development) version from GitHub, make sure that you have the &lt;code&gt;devtools&lt;/code&gt; package installed (available from &lt;span class="caps"&gt;CRAN&lt;/span&gt;), and then execute the following on the R command&amp;nbsp;line:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;devtools&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;install_github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alexioannides/elasticsearchr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="installing-elasticsearch"&gt;Installing&amp;nbsp;Elasticsearch&lt;/h2&gt;
&lt;p&gt;Elasticsearch can be downloaded &lt;a href="https://www.elastic.co/downloads/elasticsearch" title="Download"&gt;here&lt;/a&gt;, where the instructions for installing and starting it can also be found. &lt;span class="caps"&gt;OS&lt;/span&gt; X users (such as myself) can also make use of &lt;a href="http://brew.sh/" title="Homebrew for OS X"&gt;Homebrew&lt;/a&gt; to install it with the&amp;nbsp;command,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;elasticsearch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then start it by executing &lt;code&gt;$ elasticsearch&lt;/code&gt; from within any Terminal window. Successful installation can be checked by navigating any web browser to &lt;code&gt;http://localhost:9200&lt;/code&gt;, where the following message should greet you (give or take the node name, which changes with every&amp;nbsp;restart),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Kraven the Hunter&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cluster_name&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;elasticsearch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;version&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;number&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2.3.5&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;build_hash&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;90f439ff60a3c0f497f91663701e64ccd01edbb4&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;build_timestamp&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-07-27T10:36:52Z&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;build_snapshot&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lucene_version&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;5.5.0&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tagline&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;You Know, for Search&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="elasticsearch-101"&gt;Elasticsearch&amp;nbsp;101&lt;/h2&gt;
&lt;p&gt;If you followed the installation steps above, you have just installed a single Elasticsearch &amp;#8216;node&amp;#8217;. When &lt;strong&gt;not&lt;/strong&gt; testing on your laptop, Elasticsearch usually comes in clusters of nodes (typically at least three). The easiest way to get access to a managed Elasticsearch cluster is to use the &lt;a href="https://www.elastic.co/cloud/as-a-service" title="Elastic Cloud"&gt;Elastic Cloud&lt;/a&gt; service provided by &lt;a href="https://www.elastic.co" title="Elastic corp."&gt;Elastic&lt;/a&gt; (Amazon Web Services offer something similar too). For the rest of this brief tutorial I will assume you&amp;#8217;re running a single node on your&amp;nbsp;laptop.&lt;/p&gt;
&lt;p&gt;In Elasticsearch a &amp;#8216;row&amp;#8217; of data is stored as a &amp;#8216;document&amp;#8217;. A document is a &lt;a href="https://en.wikipedia.org/wiki/JSON" title="JSON"&gt;&lt;span class="caps"&gt;JSON&lt;/span&gt;&lt;/a&gt; object - for example, the first row of R&amp;#8217;s &lt;code&gt;iris&lt;/code&gt; dataset,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;#   sepal_length sepal_width petal_length petal_width species&lt;/span&gt;
&lt;span class="c1"&gt;# 1          5.1         3.5          1.4         0.2  setosa&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;would be represented as follows using &lt;span class="caps"&gt;JSON&lt;/span&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sepal_length&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sepal_width&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;petal_length&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;petal_width&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;species&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;setosa&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Documents are classified into &amp;#8216;types&amp;#8217; and stored in an &amp;#8216;index&amp;#8217;. In a crude - but often used - analogy with traditional &lt;span class="caps"&gt;SQL&lt;/span&gt; databases, we would associate an index with a database instance and the document types with tables within that database. In practice this analogy is not accurate - it is better to think of all documents as residing in a single - possibly sparse - table (defined by the index), where the document types represent sub-sets of columns in the table. This is especially so as fields that occur in multiple document types (within the same index) must have the same data-type - for example, if &lt;code&gt;"name"&lt;/code&gt; exists in document type &lt;code&gt;customer&lt;/code&gt; as well as in document type &lt;code&gt;address&lt;/code&gt;, then &lt;code&gt;"name"&lt;/code&gt; will need to be a &lt;code&gt;string&lt;/code&gt; in&amp;nbsp;both.&lt;/p&gt;
&lt;p&gt;Each document is a &amp;#8216;resource&amp;#8217; that has a Uniform Resource Locator (&lt;span class="caps"&gt;URL&lt;/span&gt;) associated with it. Elasticsearch URLs all have the following&amp;nbsp;format:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;http://your_cluster:9200/your_index/your_doc_type/your_doc_id&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;For example, the above &lt;code&gt;iris&lt;/code&gt; document could be living&amp;nbsp;at&lt;/p&gt;
&lt;p&gt;&lt;code&gt;http://localhost:9200/iris/data/1&lt;/code&gt;&lt;/p&gt;
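&lt;p&gt;Because each document is just a resource at a &lt;span class="caps"&gt;URL&lt;/span&gt;, it can be retrieved directly over &lt;span class="caps"&gt;HTTP&lt;/span&gt; - for example, using curl from a terminal window (assuming a document with id &amp;#8216;1&amp;#8217; has been&amp;nbsp;indexed),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:9200/iris/data/1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
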
&lt;p&gt;Although Elasticsearch - like most NoSQL databases - is often referred to as being &amp;#8216;schema free&amp;#8217;, as we have already seen, this is not entirely correct. What is true, however, is that the schema - or &amp;#8216;mapping&amp;#8217; as it&amp;#8217;s called in Elasticsearch - does not &lt;em&gt;need&lt;/em&gt; to be declared up-front (although you certainly can do this). Elasticsearch is more than capable of guessing the types of fields based on new data indexed for the first time. For more information on any of these basic concepts take a look &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html" title="Basic Concepts"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="elasticsearchr-a-quick-start"&gt;elasticsearchr: a Quick&amp;nbsp;Start&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;elasticsearchr&lt;/code&gt; is a &lt;strong&gt;lightweight&lt;/strong&gt; client - by this I mean that it only aims to do &amp;#8216;just enough&amp;#8217; work to make using Elasticsearch with R easy and intuitive. You will still need to read the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" title="Elasticsearch documentation"&gt;Elasticsearch documentation&lt;/a&gt; to understand how to compose queries and aggregations. What follows is a quick summary of what is&amp;nbsp;possible.&lt;/p&gt;
&lt;h3 id="resources"&gt;Resources&lt;/h3&gt;
&lt;p&gt;Elasticsearch resources, as defined by the URLs described above, are defined as &lt;code&gt;elastic&lt;/code&gt; objects in &lt;code&gt;elasticsearchr&lt;/code&gt;. For&amp;nbsp;example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This refers to documents of type &amp;#8216;data&amp;#8217; in the &amp;#8216;iris&amp;#8217; index located on an Elasticsearch node on my laptop. Note that:
- it is possible to leave the document type empty if you need to refer to all documents in an index; and,
- &lt;code&gt;elastic&lt;/code&gt; objects can be defined even if the underling resources have yet to be brought into&amp;nbsp;existence.&lt;/p&gt;
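&lt;p&gt;For example, a resource referring to &lt;em&gt;all&lt;/em&gt; documents in the &amp;#8216;iris&amp;#8217; index - regardless of document type - could be defined as follows (the variable name is&amp;nbsp;arbitrary),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;es_iris&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
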
&lt;h3 id="indexing-new-data"&gt;Indexing New&amp;nbsp;Data&lt;/h3&gt;
&lt;p&gt;To index (insert) data from a data frame, use the &lt;code&gt;%index%&lt;/code&gt; operator as&amp;nbsp;follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%index%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this example, the &lt;code&gt;iris&lt;/code&gt; dataset is indexed into the &amp;#8216;iris&amp;#8217; index and given a document type called &amp;#8216;data&amp;#8217;. Note that I have not provided any document ids here. &lt;strong&gt;To explicitly specify document ids there must be a column in the data frame that is labelled &lt;code&gt;id&lt;/code&gt;&lt;/strong&gt;, from which the document ids will be&amp;nbsp;taken.&lt;/p&gt;
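&lt;p&gt;A minimal sketch of this - assuming we simply want the row numbers to serve as document ids -&amp;nbsp;is,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;iris_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;
&lt;span class="n"&gt;iris_df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iris_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# document ids are taken from the &amp;#39;id&amp;#39; column&lt;/span&gt;
&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%index%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iris_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
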
&lt;h3 id="deleting-data"&gt;Deleting&amp;nbsp;Data&lt;/h3&gt;
&lt;p&gt;Documents can be deleted in three different ways using the &lt;code&gt;%delete%&lt;/code&gt; operator. Firstly, an entire index (including the mapping information) can be erased by referencing just the index in the resource -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%delete%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, documents can be deleted on a type-by-type basis leaving the index and its mappings untouched, by referencing both the index and the document type as the resource -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%delete%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, specific documents can be deleted by referencing their ids directly -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%delete%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;4&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;5&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="queries"&gt;Queries&lt;/h3&gt;
&lt;p&gt;Any type of query that Elasticsearch makes available can be defined in a &lt;code&gt;query&lt;/code&gt; object using the native Elasticsearch &lt;span class="caps"&gt;JSON&lt;/span&gt; syntax - e.g. to match every document we could use the &lt;code&gt;match_all&lt;/code&gt; query,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;for_everything&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;{&lt;/span&gt;
&lt;span class="s"&gt;  &amp;quot;match_all&amp;quot;: {}&lt;/span&gt;
&lt;span class="s"&gt;}&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To execute this query we use the &lt;code&gt;%search%&lt;/code&gt; operator on the appropriate resource -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%search%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;for_everything&lt;/span&gt;

&lt;span class="c1"&gt;#     sepal_length sepal_width petal_length petal_width    species&lt;/span&gt;
&lt;span class="c1"&gt;# 1            4.9         3.0          1.4         0.2     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# 2            4.9         3.1          1.5         0.1     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# 3            5.8         4.0          1.2         0.2     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# 4            5.4         3.9          1.3         0.4     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# 5            5.1         3.5          1.4         0.3     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# 6            5.4         3.4          1.7         0.2     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="sorting-query-results"&gt;Sorting Query&amp;nbsp;Results&lt;/h3&gt;
&lt;p&gt;Query results can be sorted on multiple fields by defining a &lt;code&gt;sort&lt;/code&gt; object using the same Elasticsearch &lt;span class="caps"&gt;JSON&lt;/span&gt; syntax - e.g. to sort by &lt;code&gt;sepal_width&lt;/code&gt; in ascending order the required &lt;code&gt;sort&lt;/code&gt; object would be defined&amp;nbsp;as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;by_sepal_width&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;{&amp;quot;sepal_width&amp;quot;: {&amp;quot;order&amp;quot;: &amp;quot;asc&amp;quot;}}&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is then added to a &lt;code&gt;query&lt;/code&gt; object whose results we want sorted and executed using the &lt;code&gt;%search%&lt;/code&gt; operator as before -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%search%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;for_everything&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;by_sepal_width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#   sepal_length sepal_width petal_length petal_width    species&lt;/span&gt;
&lt;span class="c1"&gt;# 1          5.0         2.0          3.5         1.0 versicolor&lt;/span&gt;
&lt;span class="c1"&gt;# 2          6.0         2.2          5.0         1.5  virginica&lt;/span&gt;
&lt;span class="c1"&gt;# 3          6.0         2.2          4.0         1.0 versicolor&lt;/span&gt;
&lt;span class="c1"&gt;# 4          6.2         2.2          4.5         1.5 versicolor&lt;/span&gt;
&lt;span class="c1"&gt;# 5          4.5         2.3          1.3         0.3     setosa&lt;/span&gt;
&lt;span class="c1"&gt;# 6          6.3         2.3          4.4         1.3 versicolor&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="aggregations"&gt;Aggregations&lt;/h3&gt;
&lt;p&gt;Similarly, any type of aggregation that Elasticsearch makes available can be defined in an &lt;code&gt;aggs&lt;/code&gt; object - e.g. to compute the average &lt;code&gt;sepal_width&lt;/code&gt; per-species of flower we would specify the following&amp;nbsp;aggregation,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;avg_sepal_width&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aggs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;{&lt;/span&gt;
&lt;span class="s"&gt;  &amp;quot;avg_sepal_width_per_species&amp;quot;: {&lt;/span&gt;
&lt;span class="s"&gt;    &amp;quot;terms&amp;quot;: {&lt;/span&gt;
&lt;span class="s"&gt;      &amp;quot;field&amp;quot;: &amp;quot;species&amp;quot;,&lt;/span&gt;
&lt;span class="s"&gt;      &amp;quot;size&amp;quot;: 3&lt;/span&gt;
&lt;span class="s"&gt;    },&lt;/span&gt;
&lt;span class="s"&gt;    &amp;quot;aggs&amp;quot;: {&lt;/span&gt;
&lt;span class="s"&gt;      &amp;quot;avg_sepal_width&amp;quot;: {&lt;/span&gt;
&lt;span class="s"&gt;        &amp;quot;avg&amp;quot;: {&lt;/span&gt;
&lt;span class="s"&gt;          &amp;quot;field&amp;quot;: &amp;quot;sepal_width&amp;quot;&lt;/span&gt;
&lt;span class="s"&gt;        }&lt;/span&gt;
&lt;span class="s"&gt;      }&lt;/span&gt;
&lt;span class="s"&gt;    }&lt;/span&gt;
&lt;span class="s"&gt;  }&lt;/span&gt;
&lt;span class="s"&gt;}&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(Elasticsearch 5.x users please note that when using the out-of-the-box mappings the above aggregation requires that &lt;code&gt;"field": "species"&lt;/code&gt; be changed to &lt;code&gt;"field": "species.keyword"&lt;/code&gt; - see &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.0/breaking_50_mapping_changes.html" title="Text fields in Elasticsearch 5.x"&gt;here&lt;/a&gt; for more information as to&amp;nbsp;why)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This aggregation is also executed via the &lt;code&gt;%search%&lt;/code&gt; operator on the appropriate resource -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%search%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;avg_sepal_width&lt;/span&gt;

&lt;span class="c1"&gt;#          key doc_count avg_sepal_width.value&lt;/span&gt;
&lt;span class="c1"&gt;# 1     setosa        50                 3.428&lt;/span&gt;
&lt;span class="c1"&gt;# 2 versicolor        50                 2.770&lt;/span&gt;
&lt;span class="c1"&gt;# 3  virginica        50                 2.974&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Queries and aggregations can be combined such that the aggregations are computed on the results of the query. For example, to execute the combination of the above query and aggregation, we would&amp;nbsp;execute,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%search%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;for_everything&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;avg_sepal_width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#          key doc_count avg_sepal_width.value&lt;/span&gt;
&lt;span class="c1"&gt;# 1     setosa        50                 3.428&lt;/span&gt;
&lt;span class="c1"&gt;# 2 versicolor        50                 2.770&lt;/span&gt;
&lt;span class="c1"&gt;# 3  virginica        50                 2.974&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;where the combination&amp;nbsp;yields,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;for_everything&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;avg_sepal_width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;span class="c1"&gt;#     &amp;quot;size&amp;quot;: 0,&lt;/span&gt;
&lt;span class="c1"&gt;#     &amp;quot;query&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#         &amp;quot;match_all&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#         }&lt;/span&gt;
&lt;span class="c1"&gt;#     },&lt;/span&gt;
&lt;span class="c1"&gt;#     &amp;quot;aggs&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#         &amp;quot;avg_sepal_width_per_species&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#             &amp;quot;terms&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#                 &amp;quot;field&amp;quot;: &amp;quot;species&amp;quot;,&lt;/span&gt;
&lt;span class="c1"&gt;#                 &amp;quot;size&amp;quot;: 0&lt;/span&gt;
&lt;span class="c1"&gt;#             },&lt;/span&gt;
&lt;span class="c1"&gt;#             &amp;quot;aggs&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#                 &amp;quot;avg_sepal_width&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#                     &amp;quot;avg&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#                         &amp;quot;field&amp;quot;: &amp;quot;sepal_width&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#                     }&lt;/span&gt;
&lt;span class="c1"&gt;#                 }&lt;/span&gt;
&lt;span class="c1"&gt;#             }&lt;/span&gt;
&lt;span class="c1"&gt;#         }&lt;/span&gt;
&lt;span class="c1"&gt;#     }&lt;/span&gt;
&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For comprehensive coverage of all query and aggregation types, please refer to the rather excellent &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" title="Elasticsearch documentation"&gt;official documentation&lt;/a&gt; (newcomers to Elasticsearch are advised to start with the &amp;#8216;Query String&amp;#8217;&amp;nbsp;query).&lt;/p&gt;
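&lt;p&gt;For example - as a sketch, reusing the &lt;code&gt;iris&lt;/code&gt; index from the examples above - a &amp;#8216;Query String&amp;#8217; query restricted to a single species could be composed and executed as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# a sketch (not a definitive recipe): compose a Query String query as JSON
for_setosa &amp;lt;- query('{"query_string": {"query": "species:setosa"}}')

# execute it against the iris index used in the examples above
elastic("http://localhost:9200", "iris", "data") %search% for_setosa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;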
&lt;h3 id="mappings"&gt;Mappings&lt;/h3&gt;
&lt;p&gt;Finally, I have included the ability to create an empty index with a custom mapping, using the &lt;code&gt;%create%&lt;/code&gt; operator -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;elastic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://localhost:9200&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;iris&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%create%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mapping_default_simple&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;mapping_default_simple()&lt;/code&gt; is a default mapping that ships with &lt;code&gt;elasticsearchr&lt;/code&gt;. It switches off the text analyser for all fields of type &amp;#8216;string&amp;#8217; (i.e. switches off free-text search), allows all text search to work with case-insensitive lower-case terms, and maps any field named &amp;#8216;timestamp&amp;#8217; to type &amp;#8216;date&amp;#8217;, so long as it has the appropriate string or long&amp;nbsp;format.&lt;/p&gt;
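&lt;p&gt;To sketch what a hand-rolled alternative might look like - with hypothetical index and field names - a custom mapping can be defined as a &lt;span class="caps"&gt;JSON&lt;/span&gt; string and supplied to &lt;code&gt;%create%&lt;/code&gt; in the same&amp;nbsp;way,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# a sketch with hypothetical names: type 'timestamp' as a date and
# switch off the text analyser for the 'species' field
custom_mapping &amp;lt;- '{
  "mappings": {
    "data": {
      "properties": {
        "timestamp": {"type": "date"},
        "species": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'

elastic("http://localhost:9200", "iris_2") %create% custom_mapping
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;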
&lt;h2 id="forthcoming-attractions"&gt;Forthcoming&amp;nbsp;Attractions&lt;/h2&gt;
&lt;p&gt;I do not have a grand vision for &lt;code&gt;elasticsearchr&lt;/code&gt; - I want to keep it a lightweight client that requires knowledge of Elasticsearch - but I would like to add the ability to compose major query and aggregation types, without having to type-out lots of &lt;span class="caps"&gt;JSON&lt;/span&gt;, and to be able to retrieve simple information like the names of all indices in a cluster, and all the document types within an index, etc. Future development will likely be focused in these&amp;nbsp;areas.&lt;/p&gt;
&lt;h2 id="acknowledgements"&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;A big thank you to Hadley Wickham and Jeroen Ooms, the authors of the &lt;code&gt;httr&lt;/code&gt; and &lt;code&gt;jsonlite&lt;/code&gt; packages that &lt;code&gt;elasticsearchr&lt;/code&gt; leans upon &lt;em&gt;heavily&lt;/em&gt;.&lt;/p&gt;</content><category term="r"></category><category term="data-processing"></category><category term="data-stores"></category></entry><entry><title>Asynchronous and Distributed Programming in R with the Future Package</title><link href="https://alexioannides.github.io/2016/11/02/asynchronous-and-distributed-programming-in-r-with-the-future-package/" rel="alternate"></link><published>2016-11-02T00:00:00+00:00</published><updated>2016-11-02T00:00:00+00:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-11-02:/2016/11/02/asynchronous-and-distributed-programming-in-r-with-the-future-package/</id><summary type="html">&lt;p&gt;&lt;img alt="Future!" src="https://alexioannides.github.io/images/r/future/the_future.jpg" title="the_future"&gt;&lt;/p&gt;
&lt;p&gt;Every now and again someone comes along and writes an R package that I consider to be a &amp;#8216;game changer&amp;#8217; for the language and its application to Data Science. For example, I consider &lt;a href="https://github.com/hadley/dplyr" title="dplyr on GitHub"&gt;dplyr&lt;/a&gt; one such package as it has made data munging/manipulation &lt;em&gt;that much&lt;/em&gt; more intuitive and more …&lt;/p&gt;&lt;/summary&gt;&lt;content type="html"&gt;&lt;p&gt;&lt;img alt="Future!" src="https://alexioannides.github.io/images/r/future/the_future.jpg" title="the_future"&gt;&lt;/p&gt;
&lt;p&gt;Every now and again someone comes along and writes an R package that I consider to be a &amp;#8216;game changer&amp;#8217; for the language and its application to Data Science. For example, I consider &lt;a href="https://github.com/hadley/dplyr" title="dplyr on GitHub"&gt;dplyr&lt;/a&gt; one such package as it has made data munging/manipulation &lt;em&gt;that much&lt;/em&gt; more intuitive and more productive than it had been before. Although I first read about it only at the beginning of this week, my instinct tells me that in &lt;a href="https://www.linkedin.com/in/henrikbengtsson" title="Henrik Bengtsson on LinkedIn"&gt;Henrik Bengtsson&amp;#8217;s&lt;/a&gt; &lt;a href="https://github.com/HenrikBengtsson/future" title="future package in GitHub"&gt;future&lt;/a&gt; package we might have another such game-changing R&amp;nbsp;package.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/HenrikBengtsson/future" title="future package in GitHub"&gt;future&lt;/a&gt; package provides an &lt;span class="caps"&gt;API&lt;/span&gt; for futures (or promises) in R. To quote Wikipedia, a &lt;a href="https://en.wikipedia.org/wiki/Futures_and_promises" title="Wikipedia on futures and promises"&gt;future or promise&lt;/a&gt;&amp;nbsp;is,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&amp;#8230; a proxy for a result that is initially unknown, usually because the computation of its value is yet&amp;nbsp;incomplete.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A classic example would be a request made to a web server via &lt;span class="caps"&gt;HTTP&lt;/span&gt; that has yet to return and whose value remains unknown until it does (and which has promised to return at some point in the future). This &amp;#8216;promise&amp;#8217; is an object assigned to a variable in R like any other, and allows code execution to progress until the moment the code explicitly requires the future to be resolved (i.e. to &amp;#8216;make good&amp;#8217; on its promise). So the code does not need to wait for the web server until the very moment that the information anticipated in its response is actually needed. In the intervening execution time we can send requests to other web servers, run some other computations, etc. Ultimately, this leads to &lt;strong&gt;faster&lt;/strong&gt; and &lt;strong&gt;more efficient code&lt;/strong&gt;. This way of working also opens the door to distributed (i.e. parallel) computation, as the computation assigned to each new future can be executed on a new thread (on a different core on the same machine, or on another&amp;nbsp;machine/node).&lt;/p&gt;
&lt;p&gt;The future &lt;span class="caps"&gt;API&lt;/span&gt; is extremely expressive and the associated documentation is excellent. My motivation here is not to repeat any of this, but rather to give a few examples to serve as inspiration for how futures could be used for day-to-day Data Science tasks in&amp;nbsp;R.&lt;/p&gt;
&lt;h1 id="creating-a-future-to-be-executed-on-a-different-core-to-that-running-the-main-script"&gt;Creating a Future to be Executed on a Different Core to that Running the Main&amp;nbsp;Script&lt;/h1&gt;
&lt;p&gt;To demonstrate the syntax and structure required to achieve this aim, I am going to delegate to a future the task of estimating the mean of 10 million random samples from the normal distribution, and ask it to spawn a new R process on a different core in order to do so. The code to achieve this is as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="c1"&gt;# [1] 3.046653e-05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;future({...})&lt;/code&gt; assigns the code (actually a construct known as a &lt;a href="http://adv-r.had.co.nz/Functional-programming.html#closures" title="Hadley Wickham on closures"&gt;closure&lt;/a&gt;) to be computed asynchronously from the main script. The code will start executing the moment this initial assignment is&amp;nbsp;made;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%plan% multiprocess&lt;/code&gt; sets the future&amp;#8217;s execution plan to be on a different core (or thread);&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;value&lt;/code&gt; asks for the return value of the future. This will block further code execution until the future can be&amp;nbsp;resolved.&lt;/li&gt;
&lt;/ul&gt;
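&lt;p&gt;Note that attaching &lt;code&gt;%plan%&lt;/code&gt; to every future is not obligatory - the execution plan can also be set once with the &lt;code&gt;plan&lt;/code&gt; function, after which all subsequent futures will use it -&amp;nbsp;e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(future)

# set the execution plan once, globally, for all subsequent futures
plan(multiprocess)

f &amp;lt;- future({ mean(rnorm(10000000)) })
value(f)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;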
&lt;p&gt;The above example can easily be turned into a function that outputs dots (&lt;code&gt;...&lt;/code&gt;) to the console until the future is resolved, at which point it returns its&amp;nbsp;value,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;f_dots&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;...&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;\n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;f_dots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ............&lt;/span&gt;
&lt;span class="c1"&gt;# [1] -0.0001872372&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;resolved(f)&lt;/code&gt; will return &lt;code&gt;FALSE&lt;/code&gt; until the future &lt;code&gt;f&lt;/code&gt; has finished&amp;nbsp;executing.&lt;/p&gt;
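&lt;p&gt;Note that the polling loop in &lt;code&gt;f_dots&lt;/code&gt; spins as fast as it can. One small refinement - sketched below - is to pause between checks with &lt;code&gt;Sys.sleep&lt;/code&gt;, so that the main process is not kept&amp;nbsp;busy,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# a sketch: pause between polls so the main process isn't busy-waiting
while (!resolved(f)) {
  cat("...")
  Sys.sleep(0.25)  # check roughly four times per second
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;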
&lt;h1 id="useful-use-cases"&gt;Useful Use&amp;nbsp;Cases&lt;/h1&gt;
&lt;p&gt;I can recall many situations where futures would have been handy when writing R scripts. The examples below are the most obvious that come to mind. No doubt there will be many&amp;nbsp;more.&lt;/p&gt;
&lt;h2 id="distributed-parallel-computation"&gt;Distributed (Parallel)&amp;nbsp;Computation&lt;/h2&gt;
&lt;p&gt;In the past, when I&amp;#8217;ve felt the need to distribute a calculation I have usually used the &lt;code&gt;mclapply&lt;/code&gt; function (i.e. multi-core &lt;code&gt;lapply&lt;/code&gt;), from the &lt;code&gt;parallel&lt;/code&gt; library that comes bundled together with base R. Computing the mean of 100 million random samples from the normal distribution would look something&amp;nbsp;like,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sub_means&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mclapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;FUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;25000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;mc.cores&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;final_mean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sub_means&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;final_mean&lt;/span&gt;
&lt;span class="c1"&gt;# [1] -0.0002100956&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note, however, that the script will be &amp;#8216;blocked&amp;#8217; until &lt;code&gt;sub_means&lt;/code&gt; has finished executing. We can achieve the same end-result, but without blocking, using&amp;nbsp;futures,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;single_thread_mean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;25000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;multi_thread_mean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;single_thread_mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;f2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;single_thread_mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;f3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;single_thread_mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;f4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;single_thread_mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;multi_thread_mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# [1] -4.581293e-05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can compare computation time between the single and multi-threaded versions of the mean computation (using the &lt;a href="https://cran.r-project.org/web/packages/microbenchmark/index.html" title="microbenchmark on CRAN"&gt;microbenchmark&lt;/a&gt;&amp;nbsp;package),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;100000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="nf"&gt;multi_thread_mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Unit: seconds&lt;/span&gt;
&lt;span class="c1"&gt;#                  expr      min       lq     mean   median       uq      max neval&lt;/span&gt;
&lt;span class="c1"&gt;#  single_thread(1e+08) 7.671721 7.729608 7.886563 7.765452 7.957930 8.406778    10&lt;/span&gt;
&lt;span class="c1"&gt;#   multi_thread(1e+08) 2.046663 2.069641 2.139476 2.111769 2.206319 2.344448    10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can see that the multi-threaded version is nearly 4 times faster, which is not surprising given that we&amp;#8217;re using 3 extra threads. Note that time is lost spawning the extra threads and combining their results (usually referred to as &amp;#8216;overhead&amp;#8217;), such that distributing a calculation can actually increase computation time if the benefit of parallelisation is less than the cost of the&amp;nbsp;overhead.&lt;/p&gt;
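&lt;p&gt;A quick way to see this overhead in action - a sketch, using an arbitrarily small workload - is to benchmark a computation that is too cheap to be worth&amp;nbsp;distributing,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(future)
library(microbenchmark)

# a sketch: for a tiny computation the cost of spawning processes and
# collecting their results outweighs any benefit from parallelism
tiny_direct &amp;lt;- function() mean(rnorm(1000))

tiny_future &amp;lt;- function() {
  f &amp;lt;- future({ mean(rnorm(1000)) }) %plan% multiprocess
  value(f)
}

# expect tiny_future() to be slower, despite doing the same work
microbenchmark(tiny_direct(), tiny_future(), times = 10)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;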
&lt;h2 id="non-blocking-asynchronous-inputoutput"&gt;Non-Blocking Asynchronous&amp;nbsp;Input/Output&lt;/h2&gt;
&lt;p&gt;I have often found myself in the situation where I need to read several large &lt;span class="caps"&gt;CSV&lt;/span&gt; files, each of which can take a long time to load. Because the files can only be loaded sequentially, I have had to wait for one file to be read before the next one can start loading, which compounds the time devoted to input. Thanks to futures, we can now achieve &lt;a href="https://en.wikipedia.org/wiki/Asynchronous_I/O" title="Wikipedia on asynchronous io"&gt;asynchronous input and output&lt;/a&gt; as&amp;nbsp;follows,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data/csv1.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data/csv2.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data/csv3.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;
&lt;span class="n"&gt;df4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;data/csv4.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;microbenchmark&lt;/code&gt; on the above code illustrates the speed-up (each file is ~&lt;span class="caps"&gt;50MB&lt;/span&gt; in&amp;nbsp;size),&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Unit: seconds&lt;/span&gt;
&lt;span class="c1"&gt;#                   min       lq     mean   median       uq      max neval&lt;/span&gt;
&lt;span class="c1"&gt;#  synchronous 7.880043 8.220015 8.502294 8.446078 8.604284 9.447176    10&lt;/span&gt;
&lt;span class="c1"&gt;# asynchronous 4.203271 4.256449 4.494366 4.388478 4.490442 5.748833    10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same pattern can be applied to making &lt;span class="caps"&gt;HTTP&lt;/span&gt; requests asynchronously. In the following example I make an asynchronous &lt;span class="caps"&gt;HTTP&lt;/span&gt; &lt;span class="caps"&gt;GET&lt;/span&gt; request to the OpenCPU public &lt;span class="caps"&gt;API&lt;/span&gt;, to retrieve the Boston housing dataset via &lt;span class="caps"&gt;JSON&lt;/span&gt;. While I&amp;#8217;m waiting for the future holding the response to resolve, I keep making more asynchronous requests, but this time to &lt;code&gt;http://time.jsontest.com&lt;/code&gt; to get the current time. Once the original future has resolved, I block until all of the remaining futures have been&amp;nbsp;resolved.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonlite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;time_futures&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;data_future&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://public.opencpu.org/ocpu/library/MASS/data/Boston/json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;fromJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;text&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;

&lt;span class="nf"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_future&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;time_futures&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_futures&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;future&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://time.jsontest.com&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%plan%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;multiprocess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# [[1]]&lt;/span&gt;
&lt;span class="c1"&gt;# Response [http://time.jsontest.com/]&lt;/span&gt;
&lt;span class="c1"&gt;#   Date: 2016-11-02 01:31&lt;/span&gt;
&lt;span class="c1"&gt;#   Status: 200&lt;/span&gt;
&lt;span class="c1"&gt;#   Content-Type: application/json; charset=ISO-8859-1&lt;/span&gt;
&lt;span class="c1"&gt;#   Size: 100 B&lt;/span&gt;
&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;span class="c1"&gt;#    &amp;quot;time&amp;quot;: &amp;quot;01:31:19 AM&amp;quot;,&lt;/span&gt;
&lt;span class="c1"&gt;#    &amp;quot;milliseconds_since_epoch&amp;quot;: 1478050279145,&lt;/span&gt;
&lt;span class="c1"&gt;#    &amp;quot;date&amp;quot;: &amp;quot;11-02-2016&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;# }&lt;/span&gt;

&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_future&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv&lt;/span&gt;
&lt;span class="c1"&gt;# 1 0.0063 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0&lt;/span&gt;
&lt;span class="c1"&gt;# 2 0.0273  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6&lt;/span&gt;
&lt;span class="c1"&gt;# 3 0.0273  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7&lt;/span&gt;
&lt;span class="c1"&gt;# 4 0.0324  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4&lt;/span&gt;
&lt;span class="c1"&gt;# 5 0.0690  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2&lt;/span&gt;
&lt;span class="c1"&gt;# 6 0.0298  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same logic applies to accessing databases and executing &lt;span class="caps"&gt;SQL&lt;/span&gt; queries via &lt;a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity" title="Wikipedia on ODBC"&gt;&lt;span class="caps"&gt;ODBC&lt;/span&gt;&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Java_Database_Connectivity" title="Wikipedia on JDBC"&gt;&lt;span class="caps"&gt;JDBC&lt;/span&gt;&lt;/a&gt;. For example, large, complex queries can be split into &amp;#8216;chunks&amp;#8217; that are sent asynchronously to the database server, so that they are executed on multiple server threads. Once the server has sent back the chunks, the output can be unified in R (e.g. with &lt;a href="https://github.com/hadley/dplyr" title="dplyr on GitHub"&gt;dplyr&lt;/a&gt;). This is a strategy that I have previously used with Apache Spark, but which I could now implement entirely within R. Similarly, multiple database tables can be accessed concurrently, and so&amp;nbsp;on.&lt;/p&gt;
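&lt;p&gt;As a rough sketch of this pattern - assuming a hypothetical &lt;code&gt;sales&lt;/code&gt; table partitioned by year, accessed with the &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;odbc&lt;/code&gt; packages and a made-up &lt;span class="caps"&gt;DSN&lt;/span&gt; - each chunk can be requested in its own future and the results bound together once they have all&amp;nbsp;resolved,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(future)
library(dplyr)

plan(multiprocess)

# one future per 'chunk' of the larger query - each worker opens its own
# connection, as connections cannot be shared across processes
years &amp;lt;- 2013:2016
chunk_futures &amp;lt;- lapply(years, function(y) {
  future({
    con &amp;lt;- DBI::dbConnect(odbc::odbc(), dsn = &amp;quot;warehouse&amp;quot;)  # hypothetical DSN
    on.exit(DBI::dbDisconnect(con))
    DBI::dbGetQuery(con, paste0(&amp;quot;SELECT * FROM sales WHERE year = &amp;quot;, y))
  })
})

# unify the chunks in R
sales &amp;lt;- bind_rows(lapply(chunk_futures, value))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;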
&lt;h1 id="final-thoughts"&gt;Final&amp;nbsp;Thoughts&lt;/h1&gt;
&lt;p&gt;I have only really scratched the surface of what is possible with futures. For example, &lt;a href="https://github.com/HenrikBengtsson/future" title="future package in GitHub"&gt;future&lt;/a&gt; supports multiple execution plans including &lt;code&gt;lazy&lt;/code&gt; and &lt;code&gt;cluster&lt;/code&gt; (for multiple machines/nodes) - I have only focused on increasing performance on a single machine with multiple cores. If this post has provided some inspiration or left you curious, then head over to the official &lt;a href="https://github.com/HenrikBengtsson/future" title="future package in GitHub"&gt;future docs&lt;/a&gt; for the full details (which are a joy to read and&amp;nbsp;work-through).&lt;/p&gt;</content><category term="r"></category><category term="data-processing"></category><category term="high-performance-computing"></category></entry><entry><title>An R Function for Generating Authenticated URLs to Private Web Sites Hosted on AWS S3</title><link href="https://alexioannides.github.io/2016/09/19/an-r-function-for-generating-authenticated-urls-to-private-web-sites-hosted-on-aws-s3/" rel="alternate"></link><published>2016-09-19T00:00:00+01:00</published><updated>2016-09-19T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-09-19:/2016/09/19/an-r-function-for-generating-authenticated-urls-to-private-web-sites-hosted-on-aws-s3/</id><summary type="html">&lt;p&gt;&lt;img alt="crypto" src="https://alexioannides.files.wordpress.com/2016/08/hmac.png" title="HMAC"&gt;&lt;/p&gt;
&lt;p&gt;Quite often I want to share simple (static) web pages with other colleagues or clients. For example, I may have written a report using &lt;a href="http://rmarkdown.rstudio.com" title="R Markdown @ R Studio"&gt;R Markdown&lt;/a&gt; and rendered it to &lt;span class="caps"&gt;HTML&lt;/span&gt;. &lt;span class="caps"&gt;AWS&lt;/span&gt; S3 can easily host such a simple web page (e.g. see &lt;a href="http://docs.aws.amazon.com/gettingstarted/latest/swh/website-hosting-intro.html" title="AWS S3 Static Web Page"&gt;here&lt;/a&gt;), but it cannot, however, offer …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="crypto" src="https://alexioannides.files.wordpress.com/2016/08/hmac.png" title="HMAC"&gt;&lt;/p&gt;
&lt;p&gt;Quite often I want to share simple (static) web pages with other colleagues or clients. For example, I may have written a report using &lt;a href="http://rmarkdown.rstudio.com" title="R Markdown @ R Studio"&gt;R Markdown&lt;/a&gt; and rendered it to &lt;span class="caps"&gt;HTML&lt;/span&gt;. &lt;span class="caps"&gt;AWS&lt;/span&gt; S3 can easily host such a simple web page (e.g. see &lt;a href="http://docs.aws.amazon.com/gettingstarted/latest/swh/website-hosting-intro.html" title="AWS S3 Static Web Page"&gt;here&lt;/a&gt;), but it cannot, however, offer any authentication to prevent anyone from accessing potentially sensitive&amp;nbsp;information.&lt;/p&gt;
&lt;p&gt;Yegor Bugayenko has created an external service, &lt;a href="http://www.s3auth.com" title="S3 Authentication Service"&gt;S3Auth.com&lt;/a&gt;, that acts as an authentication gateway in front of any S3-hosted web site, but this is a little too much for my needs. All I want is to limit access to specific S3 resources that will be largely transient in nature. A simple and viable solution is &amp;#8216;query string request authentication&amp;#8217;, which is described in detail &lt;a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html#RESTAuthenticationQueryStringAuth" title="AWS documentation"&gt;here&lt;/a&gt;. I must confess to not really understanding what was going on, until I had dug around on the web to see what others have been up&amp;nbsp;to.&lt;/p&gt;
&lt;p&gt;This blog post describes a simple R function for generating authenticated and ephemeral URLs to private S3 resources (including web pages) that only the holders of the &lt;span class="caps"&gt;URL&lt;/span&gt; can&amp;nbsp;access.&lt;/p&gt;
&lt;h1 id="creating-user-credentials-for-read-only-access-to-s3"&gt;Creating User Credentials for Read-Only Access to&amp;nbsp;S3&lt;/h1&gt;
&lt;p&gt;Before we can authenticate anyone, we need someone to authenticate. From the &lt;span class="caps"&gt;AWS&lt;/span&gt; Management Console create a new user, download their security credentials and then attach the &lt;code&gt;AmazonS3ReadOnlyAccess&lt;/code&gt; policy to them. For more details on how to do this, refer to a &lt;a href="https://alexioannides.com/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="Part 1"&gt;previous post&lt;/a&gt;. Note that you should &lt;strong&gt;not&lt;/strong&gt; create passwords for them to access the &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;nbsp;console.&lt;/p&gt;
&lt;h1 id="loading-a-static-web-page-to-aws-s3"&gt;Loading a Static Web Page to &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;nbsp;S3&lt;/h1&gt;
&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; be tempted to follow the S3 &amp;#8216;Getting Started&amp;#8217; page on how to host a static web page and in doing so enable &amp;#8216;Static Website Hosting&amp;#8217;. We need our resources to remain private and we would also like to use &lt;span class="caps"&gt;HTTPS&lt;/span&gt;, which this option does not support. Instead, create a new bucket and upload a simple &lt;span class="caps"&gt;HTML&lt;/span&gt; file &lt;a href="https://alexioannides.com/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="Part 1"&gt;as usual&lt;/a&gt;. An example html file - e.g. &lt;code&gt;index.html&lt;/code&gt; - could&amp;nbsp;be,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;html&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Hello, World!&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;html&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1 id="an-r-function-for-generating-authenticated-urls"&gt;An R Function for Generating Authenticated&amp;nbsp;URLs&lt;/h1&gt;
&lt;p&gt;We can now use our new user&amp;#8217;s Access Key &lt;span class="caps"&gt;ID&lt;/span&gt; and Secret Access Key to create a &lt;span class="caps"&gt;URL&lt;/span&gt; with a limited lifetime that enables access to &lt;code&gt;index.html&lt;/code&gt;. Technically, we are making an &lt;span class="caps"&gt;HTTP&lt;/span&gt; &lt;span class="caps"&gt;GET&lt;/span&gt; request to the S3 &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt;, with the authentication details sent as part of a query string. Creating this &lt;span class="caps"&gt;URL&lt;/span&gt; is a bit tricky - I have adapted the Python example (number 3) provided &lt;a href="https://s3.amazonaws.com/doc/s3-developer-guide/RESTAuthentication.html" title="Python Auth Example"&gt;here&lt;/a&gt; into an R function (which can be found in the Gist below) - &lt;code&gt;aws_query_string_auth_url(...)&lt;/code&gt;. Here&amp;#8217;s an example showing this R function in&amp;nbsp;action:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;path_to_file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;index.html&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;my.s3.bucket&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;eu-west-1&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DWAAAAJL4KIEWJCV3R36&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;jH1pEfnQtKj6VZJOFDy+t253OZJWZLEo9gaEoFAY&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;lifetime_minutes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="nf"&gt;aws_query_string_auth_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_to_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lifetime_minutes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# &amp;quot;https://s3-eu-west-1.amazonaws.com/my.s3.bucket/index.html?AWSAccessKeyId=DWAAAKIAJL4EWJCV3R36&amp;amp;Expires=1471994487&amp;amp;Signature=inZlnNHHswKmcPfTBiKhziRSwT4%3D&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
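&lt;p&gt;For the curious, the heart of the function is the construction of the &amp;#8216;string to sign&amp;#8217; and its &lt;span class="caps"&gt;HMAC&lt;/span&gt;-&lt;span class="caps"&gt;SHA1&lt;/span&gt; signature. A minimal sketch of this step - re-using the variables from the example above - looks something&amp;nbsp;like,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# expiry time as seconds since the epoch
expiry &amp;lt;- as.integer(Sys.time()) + 60 * lifetime_minutes

# the canonical string that AWS expects for a GET request with no headers
string_to_sign &amp;lt;- paste0(&amp;quot;GET\n\n\n&amp;quot;, expiry, &amp;quot;\n/&amp;quot;, bucket, &amp;quot;/&amp;quot;, path_to_file)

# sign with HMAC-SHA1, base64-encode and make the result URL-safe
signature &amp;lt;- URLencode(
  base64enc::base64encode(
    digest::hmac(aws_secret_access_key, string_to_sign, algo = &amp;quot;sha1&amp;quot;, raw = TRUE)),
  reserved = TRUE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;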

&lt;p&gt;And here&amp;#8217;s the code for it as inspired by the short code snippet &lt;a href="https://s3.amazonaws.com/doc/s3-developer-guide/RESTAuthentication.html" title="Python Auth Example"&gt;here&lt;/a&gt;:&lt;/p&gt;
&lt;script src="https://gist.github.com/AlexIoannides/927dc77c8258ab436f602096c8491460.js"&gt;&lt;/script&gt;

&lt;p&gt;Note the dependencies on the &lt;code&gt;digest&lt;/code&gt; and &lt;code&gt;base64enc&lt;/code&gt; packages.&lt;/p&gt;</content><category term="r"></category><category term="AWS"></category></entry><entry><title>Building a Data Science Platform for R&amp;D, Part 4 - Apache Zeppelin &amp; Scala Notebooks</title><link href="https://alexioannides.github.io/2016/08/29/building-a-data-science-platform-for-rd-part-4-apache-zeppelin-scala-notebooks/" rel="alternate"></link><published>2016-08-29T00:00:00+01:00</published><updated>2016-08-29T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-08-29:/2016/08/29/building-a-data-science-platform-for-rd-part-4-apache-zeppelin-scala-notebooks/</id><summary type="html">&lt;p&gt;&lt;img alt="zeppelin" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt4/zeppelin.png" title="Apache Zeppelin"&gt;&lt;/p&gt;
&lt;p&gt;Parts &lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="Part 1"&gt;one&lt;/a&gt;, &lt;a href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" title="Part 2"&gt;two&lt;/a&gt; and &lt;a href="https://alexioannides.github.io/2016/08/22/building-a-data-science-platform-for-rd-part-3-r-r-studio-server-sparkr-sparklyr/" title="Part 3"&gt;three&lt;/a&gt; of this series of posts have taken us from creating an account on &lt;span class="caps"&gt;AWS&lt;/span&gt; to loading and interacting with data in Spark via R and R Studio. My vision of a Data Science platform for R&amp;amp;D is nearly complete - the only outstanding component is …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="zeppelin" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt4/zeppelin.png" title="Apache Zeppelin"&gt;&lt;/p&gt;
&lt;p&gt;Parts &lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="Part 1"&gt;one&lt;/a&gt;, &lt;a href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" title="Part 2"&gt;two&lt;/a&gt; and &lt;a href="https://alexioannides.github.io/2016/08/22/building-a-data-science-platform-for-rd-part-3-r-r-studio-server-sparkr-sparklyr/" title="Part 3"&gt;three&lt;/a&gt; of this series of posts have taken us from creating an account on &lt;span class="caps"&gt;AWS&lt;/span&gt; to loading and interacting with data in Spark via R and R Studio. My vision of a Data Science platform for R&amp;amp;D is nearly complete - the only outstanding component is the ability to interact (&lt;span class="caps"&gt;REPL&lt;/span&gt;-style) with Spark using code written in Scala and to run this on some sort of scheduled basis. So, for this last part I am going to focus on getting &lt;a href="http://zeppelin.apache.org" title="Apache Zeppelin"&gt;Apache Zeppelin&lt;/a&gt;&amp;nbsp;up-and-running.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://zeppelin.apache.org" title="Apache Zeppelin"&gt;Zeppelin&lt;/a&gt; is a notebook server in a similar vein as the Jupyter or Beaker notebooks (and very similar to those available on Databricks). Code is submitted and executed in &amp;#8216;chunks&amp;#8217; with interim output (e.g. charts and tables) displayed after it has been computed. Where Zeppelin differs from the other, is its first-class support for Spark and it&amp;#8217;s ability to run notebooks (and thereby &lt;span class="caps"&gt;ETL&lt;/span&gt; process) on a schedule (in essence it uses &lt;code&gt;chron&lt;/code&gt; for scheduling and&amp;nbsp;execution).&lt;/p&gt;
&lt;h1 id="installing-apache-zeppelin"&gt;Installing Apache&amp;nbsp;Zeppelin&lt;/h1&gt;
&lt;p&gt;Following the steps laid-out in previous posts, &lt;span class="caps"&gt;SSH&lt;/span&gt; into our Spark cluster&amp;#8217;s master node (or use &lt;code&gt;$ ./flintrock login my-cluster&lt;/code&gt; for extra convenience). Just like we did for R Studio Server we&amp;#8217;re going to install Zeppelin here as well. Find the &lt;span class="caps"&gt;URL&lt;/span&gt; for the latest version of Zeppelin &lt;a href="http://www.apache.org/dyn/closer.cgi/zeppelin/zeppelin-0.6.1/zeppelin-0.6.1-bin-all.tgz" title="Download Zeppelin"&gt;here&lt;/a&gt; and then from the master node&amp;#8217;s shell&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cd /home/ec2-user&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ wget http://apache.mirror.anlx.net/zeppelin/zeppelin-0.6.1/zeppelin-0.6.1-bin-all.tgz&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ tar -xzf zeppelin-0.6.1-bin-all.tgz&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ rm zeppelin-0.6.1-bin-all.tgz&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Note that I have chosen to install the binaries that contain all of the available language interpreters - there is no restriction on choice of language and you could just as easily use R or Python for interacting with&amp;nbsp;Spark.&lt;/p&gt;
&lt;h1 id="configuring-zeppelin"&gt;Configuring&amp;nbsp;Zeppelin&lt;/h1&gt;
&lt;p&gt;Before we can start up and test Zeppelin, we will need to configure it. Templates for configuration files can be found in the &lt;code&gt;conf&lt;/code&gt; directory of the Zeppelin folder. Make copies of these by executing the following&amp;nbsp;commands,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cd /home/ec2-user/zeppelin-0.6.1-bin-all/conf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cp zeppelin-env.sh.template zeppelin-env.sh&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cp zeppelin-site.xml.template zeppelin-site.xml&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then use a text editor such as &lt;a href="https://en.wikipedia.org/wiki/Vi" title="vi Wiki"&gt;vi&lt;/a&gt; - e.g. &lt;code&gt;$ vi zeppelin-env.sh&lt;/code&gt; - to edit each file, making the changes described&amp;nbsp;below.&lt;/p&gt;
&lt;h2 id="zeppelin-envsh"&gt;zeppelin-env.sh&lt;/h2&gt;
&lt;p&gt;Find the following variable exports, uncomment them, and then make the following&amp;nbsp;assignments:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;MASTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;spark://ip-172-31-6-33:7077&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# use the appropriate local IP address here&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/lib/spark
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;SPARK_SUBMIT_OPTIONS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;--packages com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Most of these options should be familiar to you by now, so I won&amp;#8217;t go over them again&amp;nbsp;here.&lt;/p&gt;
&lt;h2 id="zeppelin-sitexml"&gt;zeppelin-site.xml&lt;/h2&gt;
&lt;p&gt;Find the following property and change its value to the one shown&amp;nbsp;below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;zeppelin.server.port&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;8081&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;Server&lt;span class="w"&gt; &lt;/span&gt;port.&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All we&amp;#8217;re doing here is assigning Zeppelin to port 8081 (which we opened in Part 2), so that it does not clash with the Spark master web &lt;span class="caps"&gt;UI&lt;/span&gt; on port 8080 (the default port for Zeppelin). Test that Zeppelin is working by executing the&amp;nbsp;following,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cd /home/ec2-user/zeppelin-0.6.1-bin-all/bin&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./zeppelin-daemon.sh start&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Open a browser and navigate to &lt;code&gt;http://your_master_node_public_ip:8081&lt;/code&gt;. If Zeppelin has been installed and configured properly you should be presented with Zeppelin&amp;#8217;s home&amp;nbsp;screen:&lt;/p&gt;
&lt;p&gt;&lt;img alt="zeppelin-home" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt4/zeppelin-home.png" title="Zeppelin Home"&gt;&lt;/p&gt;
&lt;p&gt;To shut Zeppelin down return to the master node&amp;#8217;s shell and&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./zeppelin-daemon.sh stop&lt;/code&gt;.&lt;/p&gt;
&lt;h1 id="running-zeppelin-with-a-service-manager"&gt;Running Zeppelin with a Service&amp;nbsp;Manager&lt;/h1&gt;
&lt;p&gt;Unlike R Studio Server, which automatically configures and starts a &lt;a href="https://en.wikipedia.org/wiki/Daemon_(computing)" title="daemon Wiki"&gt;daemon&lt;/a&gt; that will shut down and restart with our master node when required, we have to configure and perform these steps manually for Zeppelin - otherwise it will need to be started by hand every time the cluster is brought back up after being stopped (and I&amp;#8217;m far too lazy for this&amp;nbsp;inconvenience).&lt;/p&gt;
&lt;p&gt;To make this happen on Amazon Linux we will make use of &lt;a href="https://en.wikipedia.org/wiki/Upstart" title="Upstart"&gt;Upstart&lt;/a&gt; and the &lt;code&gt;initctl&lt;/code&gt; command. But first of all we will need to create a configuration file in the &lt;code&gt;/etc/init&lt;/code&gt; directory,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cd /etc/init&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ sudo touch zeppelin.conf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We then need to edit this file - e.g. &lt;code&gt;$ sudo vi zeppelin.conf&lt;/code&gt; - and copy the following script, which is adapted from &lt;code&gt;rstudio-server.conf&lt;/code&gt; and this &lt;strong&gt;fantastic&lt;/strong&gt; blog post from &lt;a href="http://doatt.com/2015/03/03/amazon-linux-and-upstart-init/index.html" title="doatt blog"&gt;DevOps All the Things&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;description&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;zeppelin&amp;quot;&lt;/span&gt;

start&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;runlevel&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;345&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;started&lt;span class="w"&gt; &lt;/span&gt;network&lt;span class="o"&gt;)&lt;/span&gt;
stop&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;runlevel&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;!345&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;or&lt;span class="w"&gt; &lt;/span&gt;stopping&lt;span class="w"&gt; &lt;/span&gt;network&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# start on (local-filesystems and net-device-up IFACE!=lo)&lt;/span&gt;
&lt;span class="c1"&gt;# stop on shutdown&lt;/span&gt;

&lt;span class="c1"&gt;# Respawn the process on unexpected termination&lt;/span&gt;
respawn

&lt;span class="c1"&gt;# respawn the job up to 7 times within a 5 second period.&lt;/span&gt;
&lt;span class="c1"&gt;# If the job exceeds these values, it will be stopped and marked as failed.&lt;/span&gt;
respawn&lt;span class="w"&gt; &lt;/span&gt;limit&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;

&lt;span class="c1"&gt;# zeppelin was installed in /home/ec2-user/zeppelin-0.6.1-bin-all in this example&lt;/span&gt;
chdir&lt;span class="w"&gt; &lt;/span&gt;/home/ec2-user/zeppelin-0.6.1-bin-all
&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;bin/zeppelin-daemon.sh&lt;span class="w"&gt; &lt;/span&gt;upstart
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To test our script return to the shell and&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ sudo initctl start zeppelin&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;And return to the browser to check that Zeppelin is up-and-running. You can check that this works by stopping the cluster and then starting it&amp;nbsp;again.&lt;/p&gt;
&lt;h1 id="scala-notebooks"&gt;Scala&amp;nbsp;Notebooks&lt;/h1&gt;
&lt;p&gt;From the Zeppelin home page select the &amp;#8216;Zeppelin Tutorial&amp;#8217;, accept the interpreter options and you should be presented with the following&amp;nbsp;notebook:&lt;/p&gt;
&lt;p&gt;&lt;img alt="zeppeling-nb" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt4/zeppelin-nb.png" title="Zeppelin Scala Notebook"&gt;&lt;/p&gt;
&lt;p&gt;Click into the first code chunk and hit &lt;code&gt;shift + enter&lt;/code&gt; to run it. If everything has been configured correctly then the code will run and the Zeppelin application will be listed in the Spark master node&amp;#8217;s web &lt;span class="caps"&gt;UI&lt;/span&gt;. We then test our connectivity to S3 by attempting to access our data there in the usual&amp;nbsp;way:&lt;/p&gt;
&lt;p&gt;&lt;img alt="zeppelin-s3" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt4/zeppelin-s3-nb.png" title="Connecting to S3"&gt;&lt;/p&gt;
&lt;p&gt;Note that this notebook, as well as any other, can be set to execute on a schedule defined using the &amp;#8216;Run Scheduler&amp;#8217; from the notebook&amp;#8217;s menu bar. This will happen irrespective of whether or not you have it loaded in the browser - so long as the Zeppelin daemon is running the notebooks will run on their defined&amp;nbsp;schedule.&lt;/p&gt;
&lt;h1 id="storing-zeppelin-notebooks-on-s3"&gt;Storing Zeppelin Notebooks on&amp;nbsp;S3&lt;/h1&gt;
&lt;p&gt;By default Zeppelin will store all notebooks locally. This is likely to be fine under most circumstances (as it is also very easy to export them), but it makes sense to exploit the ability to have them stored in an S3 bucket instead. For example, if you have amassed a lot of notebooks working on one cluster and you&amp;#8217;d like to run them on another (maybe much larger) cluster, then it makes sense not to have to manually export them all from one cluster to&amp;nbsp;another.&lt;/p&gt;
&lt;p&gt;Enabling access to S3 is relatively easy as we already have S3-enabled &lt;span class="caps"&gt;IAM&lt;/span&gt; roles assigned to our nodes (via Flintrock configuration). Start by creating a new bucket to store them in - e.g. &lt;code&gt;my.zeppelin.notebooks&lt;/code&gt;. Then create a folder within this bucket - e.g. &lt;code&gt;userone&lt;/code&gt; - and another one within that called &lt;code&gt;notebook&lt;/code&gt;.&lt;/p&gt;
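&lt;p&gt;The folder structure above matters because Zeppelin&amp;#8217;s S3 storage layer looks for notebooks under &lt;code&gt;bucket/user/notebook/&lt;/code&gt;. As a minimal sketch (the note id below is a hypothetical example), the object key for a stored notebook can be built as&amp;nbsp;follows:&lt;/p&gt;

```python
# Sketch of the S3 key layout assumed by Zeppelin's S3 notebook storage:
#   bucket/user/notebook/note-id/note.json


def zeppelin_note_key(user: str, note_id: str) -> str:
    """Build the S3 object key under which a notebook's JSON is stored."""
    return "/".join([user, "notebook", note_id, "note.json"])


print(zeppelin_note_key("userone", "2A94M5J1Z"))  # userone/notebook/2A94M5J1Z/note.json
```

&lt;p&gt;Once notebook storage has been switched over to S3, you can also list the stored notebooks with, e.g., &lt;code&gt;aws s3 ls s3://my.zeppelin.notebooks/userone/notebook/&lt;/code&gt; (assuming the &lt;span class="caps"&gt;AWS&lt;/span&gt; &lt;span class="caps"&gt;CLI&lt;/span&gt; is&amp;nbsp;installed).&lt;/p&gt;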
&lt;p&gt;Next, &lt;span class="caps"&gt;SSH&lt;/span&gt; into the master node and open the &lt;code&gt;zeppelin-site.xml&lt;/code&gt; file for editing as we did above. This time, un-comment and set the following&amp;nbsp;properties,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;zeppelin.notebook.s3.bucket&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;my.zeppelin.notebooks&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;bucket&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;for&lt;span class="w"&gt; &lt;/span&gt;notebook&lt;span class="w"&gt; &lt;/span&gt;storage&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;zeppelin.notebook.s3.user&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;userone&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;user&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;for&lt;span class="w"&gt; &lt;/span&gt;s3&lt;span class="w"&gt; &lt;/span&gt;folder&lt;span class="w"&gt; &lt;/span&gt;structure&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;zeppelin.notebook.storage&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;org.apache.zeppelin.notebook.repo.S3NotebookRepo&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;notebook&lt;span class="w"&gt; &lt;/span&gt;persistence&lt;span class="w"&gt; &lt;/span&gt;layer&lt;span class="w"&gt; &lt;/span&gt;implementation&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And comment-out the property for local&amp;nbsp;storage,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;zeppelin.notebook.storage&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;org.apache.zeppelin.notebook.repo.VFSNotebookRepo&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;notebook&lt;span class="w"&gt; &lt;/span&gt;persistence&lt;span class="w"&gt; &lt;/span&gt;layer&lt;span class="w"&gt; &lt;/span&gt;implementation&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Save the changes and return to the terminal. Finally,&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ sudo initctl restart zeppelin&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;And wait a few seconds before re-loading Zeppelin in your browser. If you create a new notebook you should be able to find it if you go looking for it in the &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;nbsp;console.&lt;/p&gt;
&lt;h1 id="basic-notebook-security"&gt;Basic Notebook&amp;nbsp;Security&lt;/h1&gt;
&lt;p&gt;Being able to limit access to Zeppelin as well as control the read/write permissions on individual notebooks will be useful if multiple people are likely to be working on the platform and using it to trial and schedule jobs on the cluster. It&amp;#8217;s also handy if you just want to grant someone access to read results and don&amp;#8217;t want to risk them changing the code by&amp;nbsp;accident.&lt;/p&gt;
&lt;p&gt;Enabling basic authentication is relatively straightforward. First, open the &lt;code&gt;zeppelin-site.xml&lt;/code&gt; file for editing and ensure that the &lt;code&gt;zeppelin.anonymous.allowed&lt;/code&gt; property is set to &lt;code&gt;false&lt;/code&gt;,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;zeppelin.anonymous.allowed&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;false&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;Anonymous&lt;span class="w"&gt; &lt;/span&gt;user&lt;span class="w"&gt; &lt;/span&gt;allowed&lt;span class="w"&gt; &lt;/span&gt;by&lt;span class="w"&gt; &lt;/span&gt;default&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Next, open the &lt;code&gt;shiro.ini&lt;/code&gt; file in Zeppelin&amp;#8217;s &lt;code&gt;conf&lt;/code&gt; directory and then&amp;nbsp;change,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/** = anon&lt;/span&gt;
&lt;span class="l l-Scalar l-Scalar-Plain"&gt;#/** = authc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;#/** = anon&lt;/span&gt;
&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/** = authc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This file also allows you to set usernames, passwords and groups. For a slightly more detailed explanation head over to the &lt;a href="http://zeppelin.apache.org/docs/0.6.1/security/shiroauthentication.html" title="Shiro on Zeppelin"&gt;Zeppelin documentation&lt;/a&gt;.&lt;/p&gt;
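&lt;p&gt;As a rough illustration of the format (the usernames and passwords below are placeholders, not values from this deployment), a &lt;code&gt;[users]&lt;/code&gt; block in &lt;code&gt;shiro.ini&lt;/code&gt; takes the form &lt;code&gt;username = password, role1, role2&lt;/code&gt;:&lt;/p&gt;

```ini
[users]
# username = password, optional comma-separated roles (placeholder values)
admin = change_me_admin_password, admin
analyst = change_me_analyst_password
```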
&lt;h1 id="zeppelin-as-a-spark-job-rest-server"&gt;Zeppelin as a Spark Job &lt;span class="caps"&gt;REST&lt;/span&gt;&amp;nbsp;Server&lt;/h1&gt;
&lt;p&gt;Each notebook on a Zeppelin server can be considered as an &amp;#8216;analytics job&amp;#8217;. We have already briefly mentioned the ability to execute such &amp;#8216;jobs&amp;#8217; on a schedule - e.g. execute an &lt;span class="caps"&gt;ETL&lt;/span&gt; process every hour, etc. We can actually take this further by exploiting Zeppelin&amp;#8217;s &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; that controls pretty much any server action. So, for example, we could execute a job (as defined in a notebook) remotely, possibly on an event-driven basis. A comprehensive description of the Zeppelin &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; can be found on the &lt;a href="http://zeppelin.apache.org/docs/0.6.1/rest-api/rest-notebook.html" title="Zeppelin RESTful API"&gt;official &lt;span class="caps"&gt;API&lt;/span&gt; documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is the point at which I start to get excited as our R&amp;amp;D platform starts to resemble a production platform. To illustrate how one could remotely execute Zeppelin jobs I have written a few basic R functions (with examples) to facilitate this - these can be found on &lt;a href="https://github.com/AlexIoannides/alexutilr/blob/master/R/zeppelin_utils.R" title="alexutilr"&gt;GitHub&lt;/a&gt;, a discussion of which may make a post of its own in the near&amp;nbsp;future.&lt;/p&gt;
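&lt;p&gt;To make this concrete, here is a minimal sketch in Python of how a notebook run could be triggered remotely. It assumes the &lt;code&gt;POST /api/notebook/job/&lt;/code&gt; endpoint described in the Zeppelin 0.6.1 &lt;span class="caps"&gt;REST&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; documentation linked above; the host, port and note id are placeholders for your own&amp;nbsp;deployment:&lt;/p&gt;

```python
import urllib.request


def run_notebook_request(host: str, port: int, note_id: str) -> urllib.request.Request:
    """Build the POST request that asks Zeppelin to run all paragraphs in a
    notebook (endpoint as per the Zeppelin 0.6.1 REST API documentation)."""
    url = "http://{0}:{1}/api/notebook/job/{2}".format(host, port, note_id)
    return urllib.request.Request(url, method="POST")


# placeholder host, port and note id - submitting the request runs the job
req = run_notebook_request("zeppelin-master-host", 8080, "2A94M5J1Z")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # uncomment to actually trigger the run
```

&lt;p&gt;Scheduling this call from a cron job or an event handler is what turns a notebook into a remotely-executable analytics&amp;nbsp;job.&lt;/p&gt;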
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;That&amp;#8217;s it - mission&amp;nbsp;accomplished!&lt;/p&gt;
&lt;p&gt;I have met all of my initial aims - possibly more. I have myself a Spark-based R&amp;amp;D platform that I can interact with using my favorite R tools and Scala, all from the comfort of my laptop. And we&amp;#8217;re not far removed from being able to deploy code and &amp;#8216;analytics jobs&amp;#8217; in a production environment. All we&amp;#8217;re really missing is a database for serving analytics (e.g. Elasticsearch) and maybe another for storing data if we won&amp;#8217;t be relying on S3. More on this in another&amp;nbsp;post.&lt;/p&gt;</content><category term="data-science"></category><category term="AWS"></category><category term="data-processing"></category></entry><entry><title>Building a Data Science Platform for R&amp;D, Part 3 - R, R Studio Server, SparkR &amp; Sparklyr</title><link href="https://alexioannides.github.io/2016/08/22/building-a-data-science-platform-for-rd-part-3-r-r-studio-server-sparkr-sparklyr/" rel="alternate"></link><published>2016-08-22T00:00:00+01:00</published><updated>2016-08-22T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-08-22:/2016/08/22/building-a-data-science-platform-for-rd-part-3-r-r-studio-server-sparkr-sparklyr/</id><summary type="html">&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt3/sparklyr.png" title="Command Line R"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="Part 1"&gt;Part 1&lt;/a&gt; and &lt;a href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" title="Part 2"&gt;Part 2&lt;/a&gt; of this series dealt with setting up &lt;span class="caps"&gt;AWS&lt;/span&gt;, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and R Studio Server to our Spark cluster&amp;#8217;s master node and use it to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt3/sparklyr.png" title="Command Line R"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="Part 1"&gt;Part 1&lt;/a&gt; and &lt;a href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" title="Part 2"&gt;Part 2&lt;/a&gt; of this series dealt with setting up &lt;span class="caps"&gt;AWS&lt;/span&gt;, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and R Studio Server to our Spark cluster&amp;#8217;s master node and use it to serve my favorite R &lt;span class="caps"&gt;IDE&lt;/span&gt;: R Studio.
We will then install and configure both the &lt;a href="http://spark.rstudio.com/index.html" title="sparklyr"&gt;Sparklyr&lt;/a&gt; and SparkR packages for connecting and interacting with Spark and our data. After this, we will be on our way to interacting with and computing on large-scale data as if it were sitting on our&amp;nbsp;laptops.&lt;/p&gt;
&lt;h1 id="installing-r"&gt;Installing&amp;nbsp;R&lt;/h1&gt;
&lt;p&gt;Our first task is to install R onto our master node. Start by &lt;span class="caps"&gt;SSH&lt;/span&gt;-ing into the master node using the steps described in &lt;a href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" title="Part 2"&gt;Part 2&lt;/a&gt;. Then execute the following commands in the following&amp;nbsp;order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;$ sudo yum update&lt;/code&gt; - update all the packages on the Amazon Linux machine image to the latest versions in Amazon Linux&amp;#8217;s&amp;nbsp;repository;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$ sudo yum install R&lt;/code&gt; - install R and all of its&amp;nbsp;dependencies;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$ sudo yum install libcurl libcurl-devel&lt;/code&gt; - ensure that &lt;a href="https://curl.haxx.se/" title="CURL"&gt;Curl&lt;/a&gt; is installed (a dependency for the &lt;code&gt;httr&lt;/code&gt; and &lt;code&gt;curl&lt;/code&gt; R packages used to install other R packages);&amp;nbsp;and,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$ sudo yum install openssl openssl-devel&lt;/code&gt; - ensure that &lt;a href="https://www.openssl.org/" title="OpenSSL"&gt;OpenSSL&lt;/a&gt; is installed (another dependency for the &lt;code&gt;httr&lt;/code&gt; R&amp;nbsp;package).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If everything has worked as intended, then executing &lt;code&gt;$ R&lt;/code&gt; should present you with R on the command&amp;nbsp;line:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt3/r_terminal.png" title="Command Line R"&gt;&lt;/p&gt;
&lt;h1 id="installing-r-studio-server"&gt;Installing R Studio&amp;nbsp;Server&lt;/h1&gt;
&lt;p&gt;Installing R Studio on the same local network as the Spark cluster that we want to connect to - in our case directly on the master node - is the recommended approach for using R Studio with a remote Spark cluster. Using a local version of R Studio to connect to a remote Spark cluster is prone to the same networking issues as trying to use the Spark shell remotely in client-mode (see &lt;a href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" title="Part 2"&gt;part 2&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;First of all we need the &lt;span class="caps"&gt;URL&lt;/span&gt; for the latest version of R Studio Server. Preview versions can be found &lt;a href="https://www.rstudio.com/products/rstudio/download/preview/" title="R Studio Server Preview"&gt;here&lt;/a&gt; while stable releases can be found &lt;a href="https://www.rstudio.com/products/rstudio/download-server/" title="R Studio Server Current"&gt;here&lt;/a&gt;. At the time of writing Sparklyr integration is a preview feature, so I&amp;#8217;m using the latest preview version of R Studio Server for 64-bit RedHat/CentOS (should this fail at any point, revert to the latest stable release as all of the scripts used in this post will still run). Picking up where we left off in the master node&amp;#8217;s terminal window, execute the following&amp;nbsp;commands,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-0.99.1289-i686.rpm&lt;/code&gt;
&lt;code&gt;$ sudo yum install --nogpgcheck rstudio-server-rhel-0.99.1289-i686.rpm&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Next, we need to assign a password to our ec2-user so that they can login to R Studio as&amp;nbsp;well,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ sudo passwd ec2-user&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;If we wanted to create additional users (with their own R Studio workspaces and local R package repositories), we would&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ sudo useradd alex&lt;/code&gt;
&lt;code&gt;$ sudo passwd alex&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Because we have installed Spark in our ec2-user&amp;#8217;s &lt;code&gt;home&lt;/code&gt; directory, other users will not be able to access it. To get around this problem (if we want to have multiple users working on the platform), we need a local copy of Spark available to everyone. A sensible place to store this is in &lt;code&gt;/usr/local/lib&lt;/code&gt; and we can make a copy of our Spark directory here as&amp;nbsp;follows:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ cd /home/ec2-user&lt;/code&gt;
&lt;code&gt;$ sudo cp -r spark /usr/local/lib&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Now check that everything works as expected by opening your browser and heading to &lt;code&gt;http://master_nodes_public_ip_address:8787&lt;/code&gt; where you should be greeted with the R Studio login&amp;nbsp;page:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt3/r_studio_login.png" title="R Studio Server Login"&gt;&lt;/p&gt;
&lt;p&gt;Enter a username and password and then we should be ready to&amp;nbsp;go:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt3/r_studio_server.png" title="R Studio Server"&gt;&lt;/p&gt;
&lt;p&gt;Finally, on R Studio&amp;#8217;s command line&amp;nbsp;run,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; install.packages("devtools")&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;to install the &lt;code&gt;devtools&lt;/code&gt; R package that will allow us to install packages directly from GitHub repositories (as well as many other things). If OpenSSL and Curl were installed correctly in the above steps, then this should take under a&amp;nbsp;minute.&lt;/p&gt;
&lt;h1 id="connect-to-spark-with-sparklyr"&gt;Connect to Spark with&amp;nbsp;Sparklyr&lt;/h1&gt;
&lt;p&gt;&lt;a href="http://spark.rstudio.com/index.html" title="sparklyr"&gt;Sparklyr&lt;/a&gt; is an extensible R &lt;span class="caps"&gt;API&lt;/span&gt; for Spark from the people at &lt;a href="https://www.rstudio.com" title="rStudio"&gt;R Studio&lt;/a&gt;- an alternative to the SparkR package that ships with Spark as standard. In particular, it provides a &amp;#8216;back end&amp;#8217; for the powerful &lt;code&gt;dplyr&lt;/code&gt; data manipulation package that lets you manipulate Spark DataFrames using the same package and functions that I would use to manipulate native R data frames on my&amp;nbsp;laptop.&lt;/p&gt;
&lt;p&gt;Sparklyr is still in its infancy and is not yet available on the &lt;span class="caps"&gt;CRAN&lt;/span&gt; archives. As such, it needs to be installed directly from its GitHub repo, which from within R Studio is done by&amp;nbsp;executing,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; devtools::install_github("rstudio/sparklyr")&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This will take a few minutes as there are a lot of dependencies that need to be built from source. Once this is finished, create a new script and copy the following code for testing Sparklyr, its ability to connect to our Spark cluster and our S3&amp;nbsp;data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# set system variables for access to S3 using older &amp;quot;s3n:&amp;quot; protocol ----&lt;/span&gt;
&lt;span class="c1"&gt;# Sys.setenv(AWS_ACCESS_KEY_ID=&amp;quot;AKIAJL4EWJCQ3R86DWAA&amp;quot;)&lt;/span&gt;
&lt;span class="c1"&gt;# Sys.setenv(AWS_SECRET_ACCESS_KEY=&amp;quot;nVZJQtKj6ODDy+t253OZJWZLEo2gaEoFAYjH1pEf&amp;quot;)&lt;/span&gt;

&lt;span class="c1"&gt;# load packages ----&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sparklyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dplyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# add packages to Spark config ----&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;spark_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sparklyr.defaultPackages&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;org.apache.hadoop:hadoop-aws:2.7.2&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sparklyr.defaultPackages&lt;/span&gt;
&lt;span class="c1"&gt;# [1] &amp;quot;com.databricks:spark-csv_2.11:1.3.0&amp;quot;    &amp;quot;com.amazonaws:aws-java-sdk-pom:1.10.34&amp;quot; &amp;quot;org.apache.hadoop:hadoop-aws:2.7.2&amp;quot;&lt;/span&gt;

&lt;span class="c1"&gt;# connect to Spark cluster ----&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;spark_connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spark://ip-172-31-11-216:7077&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                   &lt;/span&gt;&lt;span class="n"&gt;spark_home&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/usr/local/lib/spark&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                   &lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# copy the local iris dataset to Spark ----&lt;/span&gt;
&lt;span class="n"&gt;iris_tbl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;copy_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iris_tbl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Sepal_Length Sepal_Width Petal_Length Petal_Width  Species&lt;/span&gt;
&lt;span class="c1"&gt;#        &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;    &amp;lt;chr&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;#          5.1         3.5          1.4         0.2 &amp;quot;setosa&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#          4.9         3.0          1.4         0.2 &amp;quot;setosa&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#          4.7         3.2          1.3         0.2 &amp;quot;setosa&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#          4.6         3.1          1.5         0.2 &amp;quot;setosa&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#          5.0         3.6          1.4         0.2 &amp;quot;setosa&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#          5.4         3.9          1.7         0.4 &amp;quot;setosa&amp;quot;&lt;/span&gt;

&lt;span class="c1"&gt;# load S3 file into Spark&amp;#39;s using the &amp;quot;s3a:&amp;quot; protocol ----&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;spark_read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;s3a://adhoc.analytics.data/README.md&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt;
&lt;span class="c1"&gt;# Source:   query [?? x 1]&lt;/span&gt;
&lt;span class="c1"&gt;# Database: spark connection master=spark://ip-172-31-11-216:7077 app=sparklyr local=FALSE&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#                                                                  _Apache_Spark&lt;/span&gt;
&lt;span class="c1"&gt;#                                                                          &amp;lt;chr&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;# Spark is a fast and general cluster computing system for Big Data. It provides&lt;/span&gt;
&lt;span class="c1"&gt;#                                                       high-level APIs in Scala&lt;/span&gt;
&lt;span class="c1"&gt;#      supports general computation graphs for data analysis. It also supports a&lt;/span&gt;
&lt;span class="c1"&gt;#      rich set of higher-level tools including Spark SQL for SQL and DataFrames&lt;/span&gt;
&lt;span class="c1"&gt;#                                                     MLlib for machine learning&lt;/span&gt;
&lt;span class="c1"&gt;#                                     and Spark Streaming for stream processing.&lt;/span&gt;
&lt;span class="c1"&gt;#                                                     &amp;lt;http://spark.apache.org/&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;#                                                        ## Online Documentation&lt;/span&gt;
&lt;span class="c1"&gt;#                                    You can find the latest Spark documentation&lt;/span&gt;
&lt;span class="c1"&gt;#                                                                          guide&lt;/span&gt;
&lt;span class="c1"&gt;# # ... with more rows&lt;/span&gt;

&lt;span class="c1"&gt;# disconnect ----&lt;/span&gt;
&lt;span class="nf"&gt;spark_disconnect_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Execute line-by-line and check the key outputs with those commented-out in the above script. Sparklyr is changing rapidly at the moment - for the latest documentation and information on: how to use it with the &lt;code&gt;dplyr&lt;/code&gt; package, how to leverage Spark machine learning libraries and how to extend Sparklyr itself, head over to the &lt;a href="http://spark.rstudio.com/index.html" title="sparklyr"&gt;Sparklyr web site&lt;/a&gt; hosted by R&amp;nbsp;Studio.&lt;/p&gt;
&lt;h1 id="connect-to-spark-with-sparkr"&gt;Connect to Spark with&amp;nbsp;SparkR&lt;/h1&gt;
&lt;p&gt;SparkR is shipped with Spark and as such there is no external installation process that we&amp;#8217;re required to follow. It does, however, require R to be installed on every node in the cluster. This can be achieved by &lt;span class="caps"&gt;SSH&lt;/span&gt;-ing into every node in our cluster and repeating the above R installation steps, or by using Flintrock&amp;#8217;s &lt;code&gt;run-command&lt;/code&gt;, which will automatically execute the same command on every node in the cluster, such&amp;nbsp;as,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock run-command the_name_of_your_cluster 'sudo yum install -y R'&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;To enable SparkR to be used via R Studio and demonstrate the same connectivity as we did above for Sparklyr, create a new script for the following&amp;nbsp;code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# set system variables ----&lt;/span&gt;
&lt;span class="c1"&gt;# - location of Spark on master node;&lt;/span&gt;
&lt;span class="c1"&gt;# - add sparkR package directory to the list of path to look for R packages&lt;/span&gt;
&lt;span class="n"&gt;Sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/ec2-user/spark&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;libPaths&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SPARK_HOME&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lib&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;libPaths&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="c1"&gt;# load packages ----&lt;/span&gt;
&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SparkR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# connect to Spark cluster ----&lt;/span&gt;
&lt;span class="c1"&gt;# check your_public_ip_address:8080 to get the local network address of your master node&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sparkR&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;master&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;spark://ip-172-31-11-216:7077&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="n"&gt;sparkPackages&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;com.databricks:spark-csv_2.11:1.3.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                       &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;com.amazonaws:aws-java-sdk-pom:1.10.34&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                       &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;org.apache.hadoop:hadoop-aws:2.7.2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# copy the local iris dataset to Spark ----&lt;/span&gt;
&lt;span class="n"&gt;iris_tbl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iris_tbl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Sepal_Length Sepal_Width Petal_Length Petal_Width Species&lt;/span&gt;
&lt;span class="c1"&gt;#          5.1         3.5          1.4         0.2  setosa&lt;/span&gt;
&lt;span class="c1"&gt;#          4.9         3.0          1.4         0.2  setosa&lt;/span&gt;
&lt;span class="c1"&gt;#          4.7         3.2          1.3         0.2  setosa&lt;/span&gt;
&lt;span class="c1"&gt;#          4.6         3.1          1.5         0.2  setosa&lt;/span&gt;
&lt;span class="c1"&gt;#          5.0         3.6          1.4         0.2  setosa&lt;/span&gt;
&lt;span class="c1"&gt;#          5.4         3.9          1.7         0.4  setosa&lt;/span&gt;

&lt;span class="c1"&gt;# load S3 file into Spark&amp;#39;s using the &amp;quot;s3a:&amp;quot; protocol ----&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3a://adhoc.analytics.data/README.md&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;#                                                                            value&lt;/span&gt;
&lt;span class="c1"&gt;# 1                                                                 # Apache Spark&lt;/span&gt;
&lt;span class="c1"&gt;# 2&lt;/span&gt;
&lt;span class="c1"&gt;# 3 Spark is a fast and general cluster computing system for Big Data. It provides&lt;/span&gt;
&lt;span class="c1"&gt;# 4    high-level APIs in Scala, Java, Python, and R, and an optimized engine that&lt;/span&gt;
&lt;span class="c1"&gt;# 5      supports general computation graphs for data analysis. It also supports a&lt;/span&gt;
&lt;span class="c1"&gt;# 6     rich set of higher-level tools including Spark SQL for SQL and DataFrames,&lt;/span&gt;

&lt;span class="c1"&gt;# close connection&lt;/span&gt;
&lt;span class="n"&gt;sparkR&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, execute line-by-line and check the key outputs against those commented-out in the script above. Use the &lt;a href="https://spark.apache.org/docs/latest/sparkr.html" title="sparkR guide"&gt;sparkR programming guide&lt;/a&gt; and the &lt;a href="https://spark.apache.org/docs/latest/api/R/index.html" title="sparkR API"&gt;sparkR &lt;span class="caps"&gt;API&lt;/span&gt; documentation&lt;/a&gt; for more information on the available&amp;nbsp;functions.&lt;/p&gt;
&lt;p&gt;We have nearly met all of the aims set-out at the beginning of this series of posts. All that remains now is to install Apache Zeppelin so we can interact with Spark using Scala in the same way we can now interact with it using&amp;nbsp;R.&lt;/p&gt;</content><category term="data-science"></category><category term="AWS"></category><category term="data-processing"></category><category term="apache-spark"></category></entry><entry><title>Building a Data Science Platform for R&amp;D, Part 2 - Deploying Spark on AWS using Flintrock</title><link href="https://alexioannides.github.io/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/" rel="alternate"></link><published>2016-08-18T00:00:00+01:00</published><updated>2016-08-18T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-08-18:/2016/08/18/building-a-data-science-platform-for-rd-part-2-deploying-spark-on-aws-using-flintrock/</id><summary type="html">&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/spark.png" title="spark"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="PartOne"&gt;Part 1&lt;/a&gt; in this series of blog posts describes how to setup &lt;span class="caps"&gt;AWS&lt;/span&gt; with some basic security and then load data into S3. This post walks-through the process of setting up a Spark cluster on &lt;span class="caps"&gt;AWS&lt;/span&gt; and accessing our S3 data from within&amp;nbsp;Spark.&lt;/p&gt;
&lt;p&gt;A key part of my vision …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/spark.png" title="spark"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="PartOne"&gt;Part 1&lt;/a&gt; in this series of blog posts describes how to setup &lt;span class="caps"&gt;AWS&lt;/span&gt; with some basic security and then load data into S3. This post walks-through the process of setting up a Spark cluster on &lt;span class="caps"&gt;AWS&lt;/span&gt; and accessing our S3 data from within&amp;nbsp;Spark.&lt;/p&gt;
&lt;p&gt;A key part of my vision for a Spark-based R&amp;amp;D platform is being able to launch, stop, start and then connect to a cluster from my laptop. By this I mean that I don&amp;#8217;t want to have to directly interact with &lt;span class="caps"&gt;AWS&lt;/span&gt; every time I want to switch my cluster on or off. Versions of Spark prior to v2 had a folder in the home directory, &lt;code&gt;/ec2&lt;/code&gt;, containing scripts for doing exactly this from the terminal. I was perturbed to find this folder missing in Spark 2.0 and &amp;#8216;Amazon &lt;span class="caps"&gt;EC2&lt;/span&gt;&amp;#8217; missing from the &amp;#8216;Deploying&amp;#8217; menu of the official Spark documentation. It appears that these scripts have not been actively maintained and as such they&amp;#8217;ve been moved to a separate &lt;a href="https://github.com/amplab/spark-ec2" title="ec2-tools"&gt;GitHub repo&lt;/a&gt; for the foreseeable future. I spent a little bit of time trying to get them to work, but ultimately they do not yet support v2 of Spark. They also don&amp;#8217;t allow you the flexibility of choosing which version of Hadoop to install along with Spark, and this can cause headaches when it comes to accessing data on S3 (a bit more on this&amp;nbsp;later).&lt;/p&gt;
&lt;p&gt;I&amp;#8217;m very keen on using Spark 2.0 so I needed an alternative solution. Manually firing-up VMs on &lt;span class="caps"&gt;EC2&lt;/span&gt; and installing Spark and Hadoop on each node was out of the question, as was an ascent of the &lt;span class="caps"&gt;AWS&lt;/span&gt; DevOps learning-curve required to automate such a process. This sort of thing is not part of my day-job and I don&amp;#8217;t have the time otherwise. So I turned to Google and was &lt;strong&gt;very&lt;/strong&gt; happy to stumble upon the &lt;a href="https://github.com/nchammas/flintrock" title="Flintrock"&gt;Flintrock&lt;/a&gt; project on GitHub. It&amp;#8217;s still in its infancy, but using it I managed to achieve everything I could do with the old Spark ec2 scripts, with far greater flexibility and speed. It is really rather good and I will be using it for Spark cluster&amp;nbsp;management.&lt;/p&gt;
&lt;h2 id="download-spark-locally"&gt;Download Spark&amp;nbsp;Locally&lt;/h2&gt;
&lt;p&gt;In order to be able to send jobs to our Spark cluster we will need a local version of Spark so we can use the &lt;code&gt;spark-submit&lt;/code&gt; command. In any case, it&amp;#8217;s useful for development and learning as well as for small ad hoc jobs. Download Spark 2.0 &lt;a href="https://spark.apache.org/downloads.html" title="SparkDownload"&gt;here&lt;/a&gt; and choose &amp;#8216;Pre-built for Hadoop 2.7 and later&amp;#8217;. My version lives in &lt;code&gt;/applications&lt;/code&gt; and I will assume that yours does too. To check that everything is okay, open the terminal and make Spark-2.0.0 your current directory. From here&amp;nbsp;run,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./bin/spark-shell&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;If everything is okay you should be met with the Spark shell for Scala&amp;nbsp;interaction:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/welcome_to_spark.png" title="spark-shell"&gt;&lt;/p&gt;
&lt;h2 id="install-flintrock"&gt;Install&amp;nbsp;Flintrock&lt;/h2&gt;
&lt;p&gt;Exit the Spark shell (ctrl-d on a Mac, just in case you didn&amp;#8217;t know&amp;#8230;) and return to Spark&amp;#8217;s home directory. For convenience, I&amp;#8217;m going to download Flintrock here as well - where the old ec2 scripts used to be. The steps for downloading the Flintrock binaries - taken verbatim from the Flintrock repo&amp;#8217;s &lt;span class="caps"&gt;README&lt;/span&gt; - are as&amp;nbsp;follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;flintrock_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.5.0&amp;quot;&lt;/span&gt;

&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;curl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="k"&gt;remote&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;https://github.com/nchammas/flintrock/releases/download/v$flintrock_version/Flintrock-$flintrock_version-standalone-OSX-x86_64.zip&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;unzip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;flintrock&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Flintrock-$flintrock_version-standalone-OSX-x86_64.zip&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;flintrock&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And test that it works by&amp;nbsp;running,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock --help&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It&amp;#8217;s worth familiarizing yourself with the available commands. We&amp;#8217;ll only be using a small subset of these, but there&amp;#8217;s a lot more you can do with&amp;nbsp;Flintrock.&lt;/p&gt;
&lt;h2 id="configure-flintrock"&gt;Configure&amp;nbsp;Flintrock&lt;/h2&gt;
&lt;p&gt;The configuration details of the default cluster are kept in a &lt;span class="caps"&gt;YAML&lt;/span&gt; file that will be opened in your favorite text editor if you&amp;nbsp;run&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock configure&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/figure_configure.png" title="FlintrockConfig"&gt;&lt;/p&gt;
&lt;p&gt;Most of these are the default Flintrock options, but a few of them deserve a little more&amp;nbsp;discussion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;key-name&lt;/code&gt; and &lt;code&gt;identity-file&lt;/code&gt; - in &lt;a href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" title="PartOne"&gt;Part 1&lt;/a&gt; we generated a key-pair to allow us to connect remotely to &lt;span class="caps"&gt;EC2&lt;/span&gt; VMs. These options refer to the name of the key-pair and the path to the file containing our private&amp;nbsp;key.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;instance-profile-name&lt;/code&gt; - this assigns an &lt;span class="caps"&gt;IAM&lt;/span&gt; &amp;#8216;role&amp;#8217; to each node. A role is like an &lt;span class="caps"&gt;IAM&lt;/span&gt; user that isn&amp;#8217;t a person, but can have access policies attached to it. Ultimately, this determines what our Spark nodes can and cannot do on &lt;span class="caps"&gt;AWS&lt;/span&gt;. I have chosen the default role that &lt;span class="caps"&gt;EMR&lt;/span&gt; assigns to nodes, which allows them to access data held in&amp;nbsp;S3.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;instance-type&lt;/code&gt; - I think running 2 x m4.large instances is more than enough for testing a Spark cluster. In total, this gets you 4 cores, 16GB of &lt;span class="caps"&gt;RAM&lt;/span&gt; and Elastic Block Storage (&lt;span class="caps"&gt;EBS&lt;/span&gt;). The latter is important as it means your VMs will &amp;#8216;persist&amp;#8217; when you stop them - just like shutting down your laptop. Check that the overall pricing is acceptable to you &lt;a href="https://aws.amazon.com/ec2/pricing/" title="AWS-pricing"&gt;here&lt;/a&gt;. If it isn&amp;#8217;t, then choose another instance type, but make sure it has &lt;span class="caps"&gt;EBS&lt;/span&gt; (or add it separately if you need&amp;nbsp;to).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;region&lt;/code&gt; - the &lt;span class="caps"&gt;AWS&lt;/span&gt; region that you want the cluster to be created in. I&amp;#8217;m in the &lt;span class="caps"&gt;UK&lt;/span&gt; so my default region is Ireland (aka&amp;nbsp;eu-west-1).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;ami&lt;/code&gt; - which Amazon Machine Image (&lt;span class="caps"&gt;AMI&lt;/span&gt;) should the VMs in our cluster be based on? For the time being I&amp;#8217;m using the latest version of Amazon&amp;#8217;s Linux distribution, which is based on Red Hat Linux and includes &lt;span class="caps"&gt;AWS&lt;/span&gt; tools. Be aware that this has its idiosyncrasies (deviations from what would be expected on Red Hat and CentOS), and that these can create headaches (some of which I encountered when I was trying to get the Apache Zeppelin daemon to run). It is free and easy, however, and the &lt;span class="caps"&gt;ID&lt;/span&gt; for the latest version can be found &lt;a href="https://aws.amazon.com/amazon-linux-ami/" title="AMI"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;user&lt;/code&gt; - the setup scripts will create a non-root user on each &lt;span class="caps"&gt;VM&lt;/span&gt; and this will be the associated&amp;nbsp;username.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;num-slaves&lt;/code&gt; - the number of non-master Spark nodes - 1 or 2 will suffice for&amp;nbsp;testing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;install-hdfs&lt;/code&gt; - should Hadoop be installed on each machine alongside Spark? We want to access data in S3 and Hadoop is also a convenient way of making files and JARs visible to all nodes. So it&amp;#8217;s a &amp;#8216;True&amp;#8217; for&amp;nbsp;me.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
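&lt;p&gt;Put together, the options above all live in a single Flintrock &lt;code&gt;config.yaml&lt;/code&gt;. The sketch below shows roughly the shape it takes - the exact key names and nesting can vary between Flintrock versions, and the &lt;span class="caps"&gt;AMI&lt;/span&gt; &lt;span class="caps"&gt;ID&lt;/span&gt;, key-pair name and file paths are placeholders that you will need to substitute with your&amp;nbsp;own:&lt;/p&gt;

```yaml
# Illustrative Flintrock config.yaml - key names may differ slightly
# between Flintrock versions; all values here are placeholders.
services:
  spark:
    version: 2.0.0
  hdfs:
    version: 2.7.2              # corresponds to install-hdfs: True

provider: ec2

providers:
  ec2:
    key-name: my-key-pair                   # the key-pair from Part 1
    identity-file: /path/to/my-key-pair.pem
    instance-type: m4.large
    region: eu-west-1
    ami: ami-xxxxxxxx                       # latest Amazon Linux AMI ID
    user: ec2-user
    instance-profile-name: EMR_EC2_DefaultRole

launch:
  num-slaves: 1
```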
&lt;h2 id="launch-cluster"&gt;Launch&amp;nbsp;Cluster&lt;/h2&gt;
&lt;p&gt;Once you&amp;#8217;ve decided on the cluster&amp;#8217;s configuration, head back to the terminal and launch a cluster&amp;nbsp;using,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock launch the_name_of_my_cluster&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This took me under 3 minutes, which is an &lt;em&gt;enormous&lt;/em&gt; improvement on the old ec2 scripts. Once Flintrock issues its health report and returns control of the terminal back to you, log in to the &lt;span class="caps"&gt;AWS&lt;/span&gt; console and head over to the &lt;span class="caps"&gt;EC2&lt;/span&gt; page to see the VMs that have been created for&amp;nbsp;you:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/ec2_instances.png" title="EC2-dashboard"&gt;&lt;/p&gt;
&lt;p&gt;Select the master node to see its details and check that the correct &lt;span class="caps"&gt;IAM&lt;/span&gt; role has been&amp;nbsp;added:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/instance_details.png" title="EC2-instances"&gt;&lt;/p&gt;
&lt;p&gt;Note that Flintrock has created two security groups for us: flintrock-your_cluster_name-cluster and flintrock. The former allows each node to connect with every other node, and the latter determines who can connect to the nodes from the &amp;#8216;outside world&amp;#8217;. Select the &amp;#8216;flintrock&amp;#8217; security&amp;nbsp;group:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/flintrock_security_group.png" title="SecurityGroup"&gt;&lt;/p&gt;
&lt;p&gt;The Sources are the &lt;span class="caps"&gt;IP&lt;/span&gt; addresses allowed to access the cluster. Initially, this should be set to the &lt;span class="caps"&gt;IP&lt;/span&gt; address of the machine that has just created your cluster. If you are unsure what your &lt;span class="caps"&gt;IP&lt;/span&gt; address is, then try &lt;a href="http://whatismyip.com" title="whatismyip"&gt;whatismyip.com&lt;/a&gt;. The ports that should be open&amp;nbsp;are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4040 - allows you to connect to a Spark application&amp;#8217;s web &lt;span class="caps"&gt;UI&lt;/span&gt; (e.g. the spark-shell or Zeppelin,&amp;nbsp;etc.),&lt;/li&gt;
&lt;li&gt;8080 &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; 8081 - the Spark master node&amp;#8217;s web &lt;span class="caps"&gt;UI&lt;/span&gt; and a free port that we&amp;#8217;ll use for Apache Zeppelin when we set that up later on (in the final post of this&amp;nbsp;series),&lt;/li&gt;
&lt;li&gt;22 - the default port for connecting via &lt;span class="caps"&gt;SSH&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Edit this list and add another Custom &lt;span class="caps"&gt;TCP&lt;/span&gt; rule to allow port 8787 to be accessed by your &lt;span class="caps"&gt;IP&lt;/span&gt; address. We will use this port to connect to R Studio when we set that up in the next post in this&amp;nbsp;series.&lt;/p&gt;
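&lt;p&gt;If you prefer the terminal to the console for this sort of thing, the same rule can be added with the &lt;span class="caps"&gt;AWS&lt;/span&gt; command-line interface. The sketch below is illustrative only - it assumes the &lt;span class="caps"&gt;CLI&lt;/span&gt; is installed and configured with credentials, and the example &lt;span class="caps"&gt;IP&lt;/span&gt; address is a placeholder for your&amp;nbsp;own:&lt;/p&gt;

```shell
# Illustrative only - requires the AWS CLI with valid credentials;
# replace 203.0.113.7 with your own public IP address.
aws ec2 authorize-security-group-ingress \
    --group-name flintrock \
    --protocol tcp \
    --port 8787 \
    --cidr 203.0.113.7/32
```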
&lt;h2 id="connect-to-cluster"&gt;Connect to&amp;nbsp;Cluster&lt;/h2&gt;
&lt;p&gt;Find the Public &lt;span class="caps"&gt;IP&lt;/span&gt; address of the master node from the Instances tab of the &lt;span class="caps"&gt;EC2&lt;/span&gt; Dashboard. Enter this into a browser followed by &lt;code&gt;:8080&lt;/code&gt;, which should allow us to access the Spark master node&amp;#8217;s web &lt;span class="caps"&gt;UI&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/spark_web_ui.png" title="SparkBebUI"&gt;&lt;/p&gt;
&lt;p&gt;If everything has worked correctly then you should see one worker node registered with the&amp;nbsp;master.&lt;/p&gt;
&lt;p&gt;Back on the Instances tab, select the master node and hit the connect button. You should be presented with all the information required for connecting to the master node via &lt;span class="caps"&gt;SSH&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/ssh_connect.png" title="SSH-details"&gt;&lt;/p&gt;
&lt;p&gt;Return to the terminal and follow this advice. If successful, you should see something along the lines&amp;nbsp;of:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt2/ssh_master.png" title="SSH-connect"&gt;&lt;/p&gt;
&lt;p&gt;Next, fire-up the Spark shell for Scala by executing &lt;code&gt;spark-shell&lt;/code&gt;. To run a trivial job across all nodes and test the cluster, run the following program on a line-by-line&amp;nbsp;basis:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;val localArray = Array(1,2,3,4,5)
val rddArray = sc.parallelize(localArray)
val rddArraySum = rddArray.reduce((x, y) =&amp;gt; x + y)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If no errors were thrown and the shell&amp;#8217;s final output&amp;nbsp;is,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;rddArraySum: Int = 15&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;then give yourself a pat-on-the-back as you&amp;#8217;ve just executed your first distributed computation on a cloud-hosted Spark&amp;nbsp;cluster.&lt;/p&gt;
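&lt;p&gt;As a sanity check, the value Spark returned is just the fold &lt;code&gt;(x, y) =&amp;gt; x + y&lt;/code&gt; applied to &lt;code&gt;[1, 2, 3, 4, 5]&lt;/code&gt; - the cluster distributes the work across the nodes, but the arithmetic is the same as this local shell&amp;nbsp;equivalent:&lt;/p&gt;

```shell
# The same fold Spark performed across the cluster, run locally:
# start from 0 and repeatedly apply (x, y) => x + y.
sum=0
for x in 1 2 3 4 5; do
  sum=$((sum + x))
done
echo "rddArraySum: Int = $sum"
```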
&lt;p&gt;There are two ways we can send a complete Spark application - a &lt;span class="caps"&gt;JAR&lt;/span&gt; file - to the cluster. Firstly, we could copy our &lt;span class="caps"&gt;JAR&lt;/span&gt; to the master node - let&amp;#8217;s assume it&amp;#8217;s the Apache Spark example application that estimates Pi by Monte Carlo simulation, where the number of partitions to spread the work over, &lt;code&gt;n&lt;/code&gt;, is passed as an argument to the application. In this instance, we could &lt;span class="caps"&gt;SSH&lt;/span&gt; into the master node as we did for the Spark shell and then execute Spark in &amp;#8216;client&amp;#8217;&amp;nbsp;mode,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ spark/bin/spark-submit --master spark://ip-172-31-6-33:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi spark/examples/jars/spark-examples_2.11-2.0.0.jar 10&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Note that the &lt;code&gt;--master&lt;/code&gt; option takes the local &lt;span class="caps"&gt;IP&lt;/span&gt; address of the master node within our network in &lt;span class="caps"&gt;AWS&lt;/span&gt;. An alternative method is to send our &lt;span class="caps"&gt;JAR&lt;/span&gt; file directly from our local machine using Spark in &amp;#8216;cluster&amp;#8217;&amp;nbsp;mode,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ bin/spark-submit --master spark://52.48.93.43:6066 --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;A common pattern is to use the latter when the application both reads data and writes output to and from S3 or some other data repository (or database) in our &lt;span class="caps"&gt;AWS&lt;/span&gt; network. I have not had any luck running an application on the cluster from my local machine in &amp;#8216;client&amp;#8217; mode. I haven&amp;#8217;t been able to make the master node &amp;#8216;see&amp;#8217; my laptop - pinging the latter from the former always fails, and in client mode the Spark master node must be able to reach the machine running the driver application (which, in this context, is my laptop). I&amp;#8217;m sure that I could circumvent this issue if I set up a &lt;span class="caps"&gt;VPN&lt;/span&gt; or an &lt;span class="caps"&gt;SSH&lt;/span&gt;-tunnel between my laptop and the &lt;span class="caps"&gt;AWS&lt;/span&gt; cluster, but this seems like more hassle than it&amp;#8217;s worth considering that most of my interaction with Spark will be via R Studio or Zeppelin, which I will set up to access&amp;nbsp;remotely.&lt;/p&gt;
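&lt;p&gt;Note the two different ports in the &lt;code&gt;--master&lt;/code&gt; URLs above: 7077 is the Spark standalone master&amp;#8217;s regular submission port, while 6066 is its &lt;span class="caps"&gt;REST&lt;/span&gt; submission endpoint, which is what cluster-mode submissions from outside the network use. A quick shell sketch for pulling the host and port out of a &lt;code&gt;spark://&lt;/code&gt; master &lt;span class="caps"&gt;URL&lt;/span&gt;, using the cluster-mode address from&amp;nbsp;above:&lt;/p&gt;

```shell
# Split a spark:// master URL into its host and port components
# using plain parameter expansion (no external tools needed).
master_url="spark://52.48.93.43:6066"
hostport="${master_url#spark://}"   # strip the scheme
host="${hostport%:*}"               # everything before the last ':'
port="${hostport##*:}"              # everything after the last ':'
echo "host=$host port=$port"
```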
&lt;h2 id="read-s3-data-from-spark"&gt;Read S3 Data from&amp;nbsp;Spark&lt;/h2&gt;
&lt;p&gt;In order to access our S3 data from Spark (via Hadoop), we need to make a couple of packages (&lt;span class="caps"&gt;JAR&lt;/span&gt; files and their dependencies) available to all nodes in our cluster. The easiest way to do this is to start the spark-shell with the following&amp;nbsp;options:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Once the cluster has downloaded everything it needs and the shell has started, run the following program that &amp;#8216;opens&amp;#8217; the &lt;span class="caps"&gt;README&lt;/span&gt; file we uploaded to S3 in Part 1 of this series of blogs, and &amp;#8216;collects&amp;#8217; it back to the master node from its distributed (&lt;span class="caps"&gt;RDD&lt;/span&gt;)&amp;nbsp;representation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;val data = sc.textFile(&amp;quot;s3a://alex.data/README.md&amp;quot;)
data.collect
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If everything is successful then you should see the contents of the file printed to&amp;nbsp;screen.&lt;/p&gt;
&lt;p&gt;If you have read elsewhere about accessing data on S3, you may have seen references made to connection strings that start with &lt;code&gt;"s3n://...&lt;/code&gt; or maybe even &lt;code&gt;"s3://...&lt;/code&gt; with accompanying discussions about passing credentials either as part of the connection string or by setting system variables, etc. Because we are using a recent version of Hadoop and the Amazon packages required to map S3 objects onto Hadoop, and because we have assigned our nodes &lt;span class="caps"&gt;IAM&lt;/span&gt; roles that have permission to access S3, we do not need to negotiate any of these (sometimes painful)&amp;nbsp;issues.&lt;/p&gt;
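&lt;p&gt;If you do come across legacy &lt;code&gt;s3://&lt;/code&gt; or &lt;code&gt;s3n://&lt;/code&gt; connection strings in older examples, only the scheme needs to change for our setup. A small, hypothetical helper that rewrites them to the &lt;code&gt;s3a://&lt;/code&gt; form used&amp;nbsp;here:&lt;/p&gt;

```shell
# Hypothetical helper: rewrite legacy s3:// or s3n:// URIs to s3a://.
# URIs that already use s3a:// are left untouched.
to_s3a() {
  printf '%s\n' "$1" | sed -E 's|^s3n?://|s3a://|'
}

to_s3a "s3n://alex.data/README.md"
```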
&lt;h2 id="stopping-starting-and-destroying-clusters"&gt;Stopping, Starting and Destroying&amp;nbsp;Clusters&lt;/h2&gt;
&lt;p&gt;Stopping a cluster - shutting it down to be re-started in the state you left it in - and preventing any further costs from accumulating is as simple as asking Flintrock&amp;nbsp;to,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock stop the_name_of_my_cluster&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;and similarly for starting and destroying (terminating the cluster VMs and their state&amp;nbsp;forever),&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock start the_name_of_my_cluster&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock destroy the_name_of_my_cluster&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be aware&lt;/strong&gt; that when you restart a cluster the public &lt;span class="caps"&gt;IP&lt;/span&gt; addresses for all the nodes will have changed. This can be a bit of a (minor) hassle, so I have opted to create an &lt;a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html" title="ElasticIP"&gt;Elastic &lt;span class="caps"&gt;IP&lt;/span&gt;&lt;/a&gt; address and assign it to my master node to keep its public &lt;span class="caps"&gt;IP&lt;/span&gt; address constant over stops and restarts (for a nominal cost). To see what clusters are running at any one moment in&amp;nbsp;time,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ ./flintrock describe&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We are now ready to install R, R Studio and start using Sparklyr and/or SparkR to start interacting with our data (Part 3 in this series of&amp;nbsp;blogs).&lt;/p&gt;</content><category term="data-science"></category><category term="AWS"></category><category term="data-processing"></category><category term="apache-spark"></category></entry><entry><title>Building a Data Science Platform for R&amp;D, Part 1 - Setting-Up AWS</title><link href="https://alexioannides.github.io/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/" rel="alternate"></link><published>2016-08-16T00:00:00+01:00</published><updated>2016-08-16T00:00:00+01:00</updated><author><name>Dr Alex Ioannides</name></author><id>tag:alexioannides.github.io,2016-08-16:/2016/08/16/building-a-data-science-platform-for-rd-part-1-setting-up-aws/</id><summary type="html">&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/aws.png" title="AWS"&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s my vision: I get into the office and switch on my laptop; then I start up my &lt;a href="https://spark.apache.org"&gt;Spark&lt;/a&gt; cluster; I interact with it via &lt;a href="https://www.rstudio.com"&gt;RStudio&lt;/a&gt; to explore a new dataset a client uploaded overnight; after getting a handle on what I want to do with it, I prototype an &lt;span class="caps"&gt;ETL …&lt;/span&gt;&lt;/p&gt;&lt;/summary&gt;&lt;content type="html"&gt;&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/aws.png" title="AWS"&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s my vision: I get into the office and switch on my laptop; then I start up my &lt;a href="https://spark.apache.org"&gt;Spark&lt;/a&gt; cluster; I interact with it via &lt;a href="https://www.rstudio.com"&gt;RStudio&lt;/a&gt; to explore a new dataset a client uploaded overnight; after getting a handle on what I want to do with it, I prototype an &lt;span class="caps"&gt;ETL&lt;/span&gt; and/or model-building process in &lt;a href="http://www.scala-lang.org"&gt;Scala&lt;/a&gt; by using &lt;a href="http://zeppelin.apache.org"&gt;Zeppelin&lt;/a&gt; and I might even ask it to run every hour to see how it&amp;nbsp;fares.&lt;/p&gt;
&lt;p&gt;In all likelihood this is going to be more than one day&amp;#8217;s work, but you get the idea - I want a workspace that lets me use production-scale technologies to test ideas and processes that are a small step away from being handed-over to someone who can put them into&amp;nbsp;production.&lt;/p&gt;
&lt;p&gt;This series of posts is about how to setup and configure what I&amp;#8217;m going to refer to as the &amp;#8216;Data Science R&amp;amp;D platform&amp;#8217;. I&amp;#8217;m intending to cover the&amp;nbsp;following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;setting-up Amazon Web Services (&lt;span class="caps"&gt;AWS&lt;/span&gt;) with some respect for security, and loading data to &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;#8217;s S3 file system (where I&amp;#8217;m assuming all static data will&amp;nbsp;live);&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;launching, connecting-to and controlling an Apache Spark cluster on &lt;span class="caps"&gt;AWS&lt;/span&gt;, from my laptop, with the ability to start and stop it at&amp;nbsp;will,&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;installing R and RStudio Server on my Spark cluster&amp;#8217;s master node and then configuring &lt;a href="https://spark.apache.org/docs/latest/sparkr.html"&gt;SparkR&lt;/a&gt; and &lt;a href="http://spark.rstudio.com/index.html"&gt;Sparklyr&lt;/a&gt; to connect to Spark and &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;nbsp;S3,&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;installing and configuring Apache Zeppelin for Scala and &lt;span class="caps"&gt;SQL&lt;/span&gt; based Spark interaction, and for automating basic &lt;span class="caps"&gt;ETL&lt;/span&gt;/model-building&amp;nbsp;processes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;#8217;m running on Mac &lt;span class="caps"&gt;OS&lt;/span&gt; X so this will be my frame of reference, but the Unix/Linux terminal-based parts of these posts should play nicely with all Linux distributions. I have no idea about&amp;nbsp;Windows.&lt;/p&gt;
&lt;p&gt;You might be wondering why I don&amp;#8217;t use &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;#8217;s &lt;a href="https://aws.amazon.com/emr/"&gt;Elastic Map Reduce&lt;/a&gt; (&lt;span class="caps"&gt;EMR&lt;/span&gt;) service that can also run a Spark cluster with Zeppelin. I did try, but I found that it wasn&amp;#8217;t really suited to ad hoc R&amp;amp;D - I couldn&amp;#8217;t configure it with all my favorite tools (e.g. RStudio) and then easily &amp;#8216;pause&amp;#8217; the cluster when I&amp;#8217;m done for the day. I&amp;#8217;d be forced to stop the cluster and re-install my tools when I start another cluster up. &lt;span class="caps"&gt;EMR&lt;/span&gt; clusters appear to be better suited to being programmatically brought up and down as and when required, or for long-running clusters - excellent for a production environment. Not quite so good for R&amp;amp;D. Costs more too, which is the main reason &lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt; doesn&amp;#8217;t work for me&amp;nbsp;either.&lt;/p&gt;
&lt;h2 id="sign-up-for-an-aws-account"&gt;Sign-Up for an &lt;span class="caps"&gt;AWS&lt;/span&gt;&amp;nbsp;Account!&lt;/h2&gt;
&lt;p&gt;This is obvious, but nevertheless for completeness head over to &lt;a href="https://aws.amazon.com/"&gt;aws.amazon.com&lt;/a&gt; and create an&amp;nbsp;account:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/1_aws_create_account.png" title="AWS"&gt;&lt;/p&gt;
&lt;p&gt;Once you&amp;#8217;ve entered your credentials and payment details you&amp;#8217;ll be brought to the main &lt;span class="caps"&gt;AWS&lt;/span&gt; Management Console that lists all the services at your disposal. The &lt;a href="https://aws.amazon.com/documentation"&gt;&lt;span class="caps"&gt;AWS&lt;/span&gt; documentation&lt;/a&gt; is excellent and a great way to get an understanding of what everything is and how you might use&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;This is also a good point to choose the region you want your services to be created in. I live in the &lt;span class="caps"&gt;UK&lt;/span&gt; so it makes sense for me to choose Ireland (aka&amp;nbsp;eu-west-1):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/0_region.png" title="Region"&gt;&lt;/p&gt;
&lt;h2 id="setup-users-and-grant-them-roles"&gt;Setup Users and Grant them&amp;nbsp;Roles&lt;/h2&gt;
&lt;p&gt;It is considered bad practice to log in to &lt;span class="caps"&gt;AWS&lt;/span&gt; as the root user (i.e. the one that opened the account), so it&amp;#8217;s worth knowing how to set up users, restrict their access to the platform and assign them credentials. This is also easy to&amp;nbsp;do.&lt;/p&gt;
&lt;p&gt;For now I&amp;#8217;m just going to create an &amp;#8216;admin&amp;#8217; user that has more-or-less the same privileges as the root user, but is unable to delete the account or change the billing details,&amp;nbsp;etc.&lt;/p&gt;
&lt;p&gt;To begin with, log in to the &lt;span class="caps"&gt;AWS&lt;/span&gt; console as the root user and navigate to Identity and Access Management (&lt;span class="caps"&gt;IAM&lt;/span&gt;) under Security and Identity. Click on the Users tab and then Create New User. Enter a new user name and then Create. You should then see the following confirmation together with the new user&amp;#8217;s&amp;nbsp;credentials:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/3_user_credentials.png" title="User Credentials"&gt;&lt;/p&gt;
&lt;p&gt;Make a note of these - or even better download them in &lt;span class="caps"&gt;CSV&lt;/span&gt; format using the &amp;#8216;Download Credentials&amp;#8217; button. Close the window and then select the new user again on the Users tab. Next, find the Permissions tab and Attach&amp;nbsp;Policy:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/4_attach_policy.png" title="AttachPolicy"&gt;&lt;/p&gt;
&lt;p&gt;Choose AdministratorAccess for our admin&amp;nbsp;user:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/5_admin_rights_policy.png" title="AdminAccess"&gt;&lt;/p&gt;
&lt;p&gt;There is an enormous number of policies you could apply, depending on what your users need to access. For example, we could just as easily have created a user that can only access Amazon&amp;#8217;s &lt;span class="caps"&gt;EMR&lt;/span&gt; service with read-only permission on&amp;nbsp;S3.&lt;/p&gt;
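&lt;p&gt;As a sketch of what such a restricted policy could look like, here is a hypothetical custom &lt;span class="caps"&gt;IAM&lt;/span&gt; policy document granting full access to &lt;span class="caps"&gt;EMR&lt;/span&gt; and read-only access to S3 - a starting point to adapt, not a vetted production&amp;nbsp;policy:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullEMRAccess",
      "Effect": "Allow",
      "Action": "elasticmapreduce:*",
      "Resource": "*"
    },
    {
      "Sid": "ReadOnlyS3Access",
      "Effect": "Allow",
      "Action": ["s3:Get*", "s3:List*"],
      "Resource": "*"
    }
  ]
}
```

&lt;p&gt;A document like this can be pasted in via Create Policy on the &lt;span class="caps"&gt;IAM&lt;/span&gt; dashboard and then attached to a user in the same way as the managed&amp;nbsp;policies.&lt;/p&gt;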
&lt;p&gt;Finally, because we&amp;#8217;d like our admin user to be able to log in to the &lt;span class="caps"&gt;AWS&lt;/span&gt; Management Console, we need to give them a password by navigating to the Security Credentials tab and selecting Manage&amp;nbsp;Password.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/6_create_user_password.png" title="Password"&gt;&lt;/p&gt;
&lt;p&gt;Note that non-root users need to log in via a different &lt;span class="caps"&gt;URL&lt;/span&gt;, which can be found at the top of the &lt;span class="caps"&gt;IAM&lt;/span&gt;&amp;nbsp;Dashboard:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/7_user_login_link.png" title="UserLogin"&gt;&lt;/p&gt;
&lt;p&gt;Log out of the console and then back in again using this link, as your new admin user. It&amp;#8217;s worth noting that the &lt;span class="caps"&gt;IAM&lt;/span&gt; Dashboard encourages you to follow a series of steps for securing your platform. The steps above represent a subset of what is required to get the &amp;#8216;green light&amp;#8217; and I recommend that you work your way through all of them once you know your way around. For example, Multi-Factor Authentication (&lt;span class="caps"&gt;MFA&lt;/span&gt;) for the root user makes a lot of&amp;nbsp;sense.&lt;/p&gt;
&lt;h2 id="generate-ec2-key-pairs"&gt;Generate &lt;span class="caps"&gt;EC2&lt;/span&gt; Key&amp;nbsp;Pairs&lt;/h2&gt;
&lt;p&gt;In order for you to remotely access &lt;span class="caps"&gt;AWS&lt;/span&gt; services - e.g. data in S3 and virtual machines on &lt;span class="caps"&gt;EC2&lt;/span&gt; from the comfort of your laptop - you will need to authenticate yourself. This is achieved using Key Pairs. Cryptography has never been a strong point of mine, so if you want to know more about how this works I suggest taking a look &lt;a href="https://en.wikipedia.org/wiki/Public-key_cryptography"&gt;here&lt;/a&gt;. To generate our Key Pair and download the private key we use for authentication, start by navigating from the main console page to the &lt;span class="caps"&gt;EC2&lt;/span&gt; dashboard under Compute, and then to Key Pairs under Network &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Security. Once there, Create Key Pair and name it (e.g. &amp;#8216;spark_cluster&amp;#8217;). The file containing your private key will be automatically downloaded. Stash it somewhere safe like your home directory, or even better in a hidden folder like &lt;code&gt;~/.ssh&lt;/code&gt;. We will ultimately assign these Key Pairs to Virtual Machines (VMs) and other services we want to set up and access&amp;nbsp;remotely.&lt;/p&gt;
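&lt;p&gt;The stashing step can be sketched as follows, assuming the Key Pair was named &amp;#8216;spark_cluster&amp;#8217; and the private key landed in your Downloads&amp;nbsp;folder:&lt;/p&gt;

```shell
# A sketch - adjust the file name and download location to match yours.
KEY=spark_cluster.pem

# Make sure the hidden .ssh folder exists and is private.
mkdir -p ~/.ssh
chmod 700 ~/.ssh

# Move the private key into it and lock down its permissions -
# ssh will refuse to use a key that other users can read.
if [ -f ~/Downloads/"$KEY" ]; then
  mv ~/Downloads/"$KEY" ~/.ssh/
  chmod 400 ~/.ssh/"$KEY"
fi

# Later on, connect to a VM launched with this Key Pair, e.g.
#   ssh -i ~/.ssh/spark_cluster.pem ec2-user@INSTANCE_PUBLIC_DNS
```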
&lt;h2 id="install-the-aws-cli-tools"&gt;Install the &lt;span class="caps"&gt;AWS&lt;/span&gt; &lt;span class="caps"&gt;CLI&lt;/span&gt;&amp;nbsp;Tools&lt;/h2&gt;
&lt;p&gt;By no means an essential step, but the &lt;span class="caps"&gt;AWS&lt;/span&gt; terminal tools are useful - e.g. for copying files to S3 or starting and stopping &lt;span class="caps"&gt;EMR&lt;/span&gt; clusters without having to login to the &lt;span class="caps"&gt;AWS&lt;/span&gt; console and click&amp;nbsp;buttons.&lt;/p&gt;
&lt;p&gt;I think the easiest way to install the &lt;span class="caps"&gt;AWS&lt;/span&gt; &lt;span class="caps"&gt;CLI&lt;/span&gt; tools is to use &lt;a href="https://brew.sh"&gt;Homebrew&lt;/a&gt;, a package manager for &lt;span class="caps"&gt;OS&lt;/span&gt; X (the Mac equivalent of &lt;span class="caps"&gt;APT&lt;/span&gt; or &lt;span class="caps"&gt;RPM&lt;/span&gt;). With Homebrew, installation is as easy as&amp;nbsp;executing,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ brew install awscli&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;from the terminal. Once installation is finished the &lt;span class="caps"&gt;AWS&lt;/span&gt; &lt;span class="caps"&gt;CLI&lt;/span&gt; Tools need to be configured. Make sure you have your users&amp;#8217; credentials details to hand (open the file that downloaded when you created your admin user). From the terminal&amp;nbsp;run,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ aws configure&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This will ask you for, in sequence: Access Key &lt;span class="caps"&gt;ID&lt;/span&gt; (copy from credentials file), Secret Access Key (copy from credentials file), Default region name (I use eu-west-1 in Ireland), and Default output format (I prefer &lt;span class="caps"&gt;JSON&lt;/span&gt;). To test that everything is working&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ aws s3 ls&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;to list all the buckets we&amp;#8217;ve made in S3 (currently&amp;nbsp;none).&lt;/p&gt;
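&lt;p&gt;Under the hood, &lt;code&gt;aws configure&lt;/code&gt; just writes two small plain-text files in &lt;code&gt;~/.aws&lt;/code&gt;, which you can inspect or edit directly. With the answers above they should look roughly like this (both files are shown together here, and the key values are&amp;nbsp;placeholders):&lt;/p&gt;

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# ~/.aws/config
[default]
region = eu-west-1
output = json
```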
&lt;h2 id="upload-data-to-s3"&gt;Upload Data to&amp;nbsp;S3&lt;/h2&gt;
&lt;p&gt;Finally, it&amp;#8217;s time to do something data science-y - loading data. Before we can do this we need to create a &amp;#8216;bucket&amp;#8217; in S3 to put our data objects in. Using the &lt;span class="caps"&gt;AWS&lt;/span&gt; &lt;span class="caps"&gt;CLI&lt;/span&gt; tools we&amp;nbsp;execute,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ aws s3 mb s3://alex.data&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;to create the &lt;code&gt;alex.data&lt;/code&gt; bucket. &lt;span class="caps"&gt;AWS&lt;/span&gt; is quite strict about what names are valid (no underscores, for example), so it&amp;#8217;s worth reading the &lt;span class="caps"&gt;AWS&lt;/span&gt; documentation on S3 if you get any errors. We can then copy a file over to our new bucket by&amp;nbsp;executing,&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ aws s3 cp ./README.md s3://alex.data&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We can check this file has been successfully copied by returning to the &lt;span class="caps"&gt;AWS&lt;/span&gt; console and heading to S3 under Storage &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Content Delivery where it should be easy to browse to our&amp;nbsp;file:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt" src="https://alexioannides.github.io/images/data_science/data_science_platform_pt1/8_S3.png" title="S3"&gt;&lt;/p&gt;
&lt;p&gt;All of the above steps could have been carried out through the console, but I prefer using the&amp;nbsp;terminal.&lt;/p&gt;
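&lt;p&gt;On the subject of terminals, the core bucket naming rules mentioned earlier (3-63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit) can be sketched as a quick shell check. This covers the common cases only, not every rule in the S3&amp;nbsp;documentation:&lt;/p&gt;

```shell
# Rough check of a proposed S3 bucket name - not exhaustive.
valid_bucket_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

if valid_bucket_name "alex.data"; then
  echo "alex.data looks ok"
fi
if ! valid_bucket_name "alex_data"; then
  echo "alex_data is invalid - no underscores"
fi
```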
&lt;p&gt;We are now ready to fire-up a Spark cluster and use it to read our data (Part 2 in this series of&amp;nbsp;blogs).&lt;/p&gt;</content><category term="data-science"></category><category term="AWS"></category><category term="data-processing"></category></entry></feed>