# Continuous integration for data science with pytest, GitHub Actions, and Hypervector

June 12, 2021


Update, July 8th 2021: there's now a video accompanying this tutorial post.


Building a continuous integration (CI) step for your production-facing data science models and functions can ensure they remain valid and fit for purpose in a busy codebase. Furthermore, a build check that passes visibly for all contributors helps keep developer confidence high around such components, which can often become 'black box' parts of the stack that only a small number of data scientists have expertise with.

This guide outlines how you can build an automated check for a data science feature that runs every time a pull request is opened against the project, using the Python library pytest, GitHub's continuous integration platform Actions, and Hypervector, an API for building data science test fixtures easily.

Hypervector offers free access as part of its early adopters Alpha programme; you can learn more here. All code associated with this tutorial is available here.

# A simple data feature

For this example, we'll build a toy model using data from sklearn's make_classification. In reality, your data science or machine learning model is likely to be more complex than the one illustrated; the main aim of this tutorial is to demonstrate the pipeline-building process. All code for this data science feature is available in a Jupyter Notebook.

All dependencies are listed in a requirements.txt in the root directory; simply run pip install -r requirements.txt from your Python 3.8+ environment of choice to install all the necessary libraries, including Hypervector's hypervector-wrapper helper package.

Let's generate a 5,000-example dataset with four float features and two classes, [0, 1].

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=5000, 
    n_features=4, 
    n_informative=2,
    random_state=0
)

Now that we have our data, we can build a very rudimentary classifier using an sklearn Pipeline, consisting of a StandardScaler pre-processing step and a RandomForestClassifier:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from joblib import dump

X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)

dump(pipeline, 'pipeline.joblib')

Checking the accuracy on the held-out test set:

>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, pipeline.predict(X_test))
0.976

To take this toy model to a production-like context, we'll 'deploy' it using Flask. This makes it available to provide predictions via a REST endpoint: a fairly common pattern when shipping machine learning components.

# app.py

import numpy as np
from flask import Flask
from joblib import load

app = Flask(__name__)
model = load('./pipeline.joblib')


def get_prediction(data):
    # A single feature vector needs reshaping into a 2-D array for the pipeline
    if len(np.shape(data)) == 1:
        data = np.array(data).reshape(1, -1)
    results = list(model.predict(data))
    response = {"prediction": [int(result) for result in results]}
    return response


@app.route('/')
def health_check():
    feature_vector = [0, 0, 0, 0]
    return get_prediction(feature_vector)


@app.route('/<a>/<b>/<c>/<d>')
def entrypoint(a, b, c, d):
    # URL parameters arrive as strings, so cast them to floats before predicting
    feature_vector = [float(a), float(b), float(c), float(d)]
    return get_prediction(feature_vector)


if __name__ == '__main__':
    app.run()

Running python app.py from the root of the project will start a local webserver which provides predictions from this model over HTTP. While this is up, you can navigate to localhost:5000 to see the model output with inputs [0, 0, 0, 0] (as served by the health_check method), and custom inputs by editing the URL. For example, to get a prediction for the vector [1, 4, 2.5, 0.1], you would navigate to localhost:5000/1/4/2.5/0.1.
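
As a quick programmatic check, you can hit the same endpoint from Python. The snippet below is just a sketch: it assumes the Flask server is running locally on port 5000 and uses the requests library, which you may need to install separately.

import requests

# Query the locally running Flask service for a single prediction
feature_vector = [1, 4, 2.5, 0.1]
url = "http://localhost:5000/" + "/".join(str(value) for value in feature_vector)

response = requests.get(url)
response.raise_for_status()
print(response.json())  # e.g. {"prediction": [0]} or {"prediction": [1]}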

# Writing a test fixture

With a basic but functional web service providing access to our model, we want to write some tests to ensure the expected output is maintained as development on the project continues. Historically, this might have been done manually by data scientists when re-training a model, but Hypervector allows us to assert this on any change to the codebase, even changes not directly associated with the model artefact itself: updating library dependencies, refactoring wrapper functions around the model, or changes to upstream data processing.

On top of the training data we used to build the classifier, Hypervector allows us to generate up to 10,000 further examples per fixture. This is useful as we can target specific areas of the feature vector distribution for testing, and Hypervector makes this available conveniently via a dedicated endpoint — so there's no headache of managing test resources in version control.

Examining the histogram of each feature across the training data gives an idea of the distribution each one occupies, which lets us decide on the best statistical properties for a test fixture to run as part of our continuous integration.
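
As a rough sketch, the per-feature histograms can be plotted with matplotlib (this assumes the X array from the make_classification call above and that matplotlib is installed):

import matplotlib.pyplot as plt

# Plot a histogram for each of the four features in the training data
feature_names = ['a', 'b', 'c', 'd']
fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for i, ax in enumerate(axes):
    ax.hist(X[:, i], bins=50)
    ax.set_title(f"feature {feature_names[i]}")
plt.tight_layout()
plt.show()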

For this integration test, we'll generate test fixture data that is broadly equivalent to the training data: a Gaussian distribution N(0, 2) for both features a and d, and two Gaussian mixtures, N(-0.75, 0.5) + N(0.75, 0.5) and N(-1, 0.5) + N(1, 0.5), for features b and c respectively. In other contexts, we might want to test extremes of the distribution, or a subset of specific data cases that are of particular interest for a given model. Hypervector's Definition API provides a flexible set of data primitives to build fixtures for a wide variety of test scenarios.
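
For intuition about what one of these mixtures looks like, here's a minimal local sampling sketch with numpy; it is purely illustrative and not part of the Hypervector API.

import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(components, n):
    # Draw n samples from a 1-D Gaussian mixture given (mu, sigma, weight) components
    weights = np.array([c["weight"] for c in components], dtype=float)
    weights = weights / weights.sum()
    picks = rng.choice(len(components), size=n, p=weights)
    return np.array([rng.normal(components[i]["mu"], components[i]["sigma"]) for i in picks])

# Feature b: a weighted mixture of N(-0.75, 0.5) and N(0.75, 0.5)
feature_b_samples = sample_mixture(
    [{"mu": -0.75, "sigma": 0.5, "weight": 2},
     {"mu": 0.75, "sigma": 0.5, "weight": 1}],
    n=1000
)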

These distributions can be defined in Hypervector using a Python dictionary as follows:

definition_json = {
    "definition_name": "Classifier test fixture",
    "features": [
        {
            "type": "float",
            "distribution": {
              "type": "gaussian", "mu": 0, "sigma": 2
            }
        },
        {
            "type": "float",
            "distribution": {
                "type": "mixture",
                "components": [
                    {"mu": -0.75, "sigma": 0.5, "weight": 2},
                    {"mu": 0.75, "sigma": 0.5, "weight": 1}
                ]
            }
        },
        {
            "type": "float",
            "distribution": {
                "type": "mixture",
                "components": [
                    {"mu": -1, "sigma": 0.5, "weight": 1},
                    {"mu": 1, "sigma": 0.5, "weight": 2}
                ]
            }
        },
        {
            "type": "float",
            "distribution": {
                "type": "gaussian", "mu": 0, "sigma": 2
            }
        }
    ]
}

Next, set your Hypervector API key and create a new Project to register the Definition against:

import hypervector
hypervector.API_KEY = "YOUR_API_KEY"

project = hypervector.Project.new()

definition = hypervector.Definition.new(
    definition=definition_json,
    project_uuid=project.project_uuid
)

Hypervector generates test fixture data in groups known as Ensembles — we can generate a new Ensemble from this Definition and retrieve the data as follows:

ensemble = hypervector.Ensemble.new(
    definition_uuid=definition.definition_uuid,
    size=10000
)
test_fixture_data = ensemble.hypervectors()

We can take a look at what ensemble.hypervectors() returns from the API for the test fixtures.
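
For example, a quick inspection might look like the following; this assumes hypervectors() returns the generated feature vectors in a shape that pipeline.predict can consume directly, as the next step implies.

import numpy as np

# Sanity-check the generated fixture data
print(np.shape(test_fixture_data))   # expected to be (10000, 4) for this Ensemble
print(test_fixture_data[:3])         # peek at the first few generated feature vectors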

The data from this fixture allows us to snapshot the behaviour of pipeline using Hypervector's Benchmark functionality. This saves a hashed representation of the feature's output — enabling repeated test assertions to be run en masse as part of the CI process.

benchmark = hypervector.Benchmark.new(
    ensemble=ensemble,
    expected_output=pipeline.predict(test_fixture_data)
)

To check that this works as expected, we can assert with the same predictions the Benchmark was created from, verifying that the resource exists:

>>> benchmark.assert_equal(pipeline.predict(test_fixture_data))
{'assertion_uuid': '71f990e9-e336-4207-b146-870f72093d1b',
 'timestamp': '2021/06/08 13:06:42',
 'benchmark_uuid': 'fce8ef24-fc2f-4d0d-819b-209289a02734',
 'ensemble_uuid': 'e6ed88c0-6fb0-4b6a-bad3-90adfc0e3d17',
 'asserted': True,
 'diff': None}

We can run this as part of a wider test suite for the project using pytest.

# test_app.py

import hypervector
import pytest

from app import get_prediction

hypervector.API_KEY = "YOUR_API_KEY"


@pytest.fixture
def hypervector_fixture():
    definition = hypervector.Definition.get("YOUR_DEFINITION_UUID")
    ensemble = definition.ensembles[0]
    hypervectors = ensemble.hypervectors()
    benchmark = ensemble.benchmarks[0]
    return hypervectors, benchmark


def test_single_prediction():
    test_case = [0, 0, 0, 0]
    result = get_prediction(test_case)['prediction']
    assert result == [1]


def test_bulk_prediction(hypervector_fixture):
    hypervectors, benchmark = hypervector_fixture
    results = get_prediction(hypervectors)['prediction']
    assertion = benchmark.assert_equal(results)

    assert assertion['asserted'] is True

You can see there are two tests for our model: test_single_prediction(), which ensures a single input feature vector of [0, 0, 0, 0] returns the class 1 as expected, and test_bulk_prediction(), which uses our Hypervector Ensemble and Benchmark.

Running pytest from the root of the project will run both of these tests.

# Automating the test pipeline

Using GitHub Actions in conjunction with our Hypervector fixture, we'll automate the test suite to run as part of the continuous integration strategy for this repository. Hypervector provides an Action specifically for reporting the results of the last Assertion run, and you can make use of it by adding a .github/workflows/main.yaml file to the project containing the following:

name: Run Hypervector tests

on:
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.8]

    steps:
      - uses: actions/checkout@v2
      - name: Install Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: |
          pytest
        continue-on-error: true
      - name: Get Hypervector results
        uses: hypervectorio/hypervector-results-gh-action@1.0.0
        id: hypervector
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          hypervector_api_key: YOUR_API_KEY

Opening a pull request from a new branch of the project will automatically post the results of the Assertion to the pull request's conversation, making it easy to quickly verify there have been no breaking changes with the latest commit. The pull request used in this example is available here.

Let's deliberately introduce a regression to see what a failure case looks like. If we add another preprocessing step to the get_prediction() function in our service, the model will return erroneous results. We'll use MinMaxScaler() for this breaking change:

from sklearn.preprocessing import MinMaxScaler


def get_prediction(data):
    if len(np.shape(data)) == 1:
        data = np.array(data).reshape(1, -1)

    # breaking change: an extra scaling step the pipeline was not trained with
    scaler = MinMaxScaler()
    scaler.fit(data)
    data = scaler.transform(data)

    results = list(model.predict(data))
    response = {"prediction": [int(result) for result in results]}
    return response

Pushing this change to the open pull request on the branch above results in a test_bulk_prediction() failure, and the Hypervector diff is automatically commented on the pull request.

Most importantly, we can see the test has failed as expected. Breaking this example down further, Hypervector provides the size of the ensemble (size: 10000), the number of failing cases (n_failures: 4959), and the indices and inputs of the first 10 failing cases (diffs can get large with Ensemble size, so the truncated flag indicates the API has provided only the first 10 test cases for convenience).

Furthermore, we can click through on the provided Assertion history deep-link to have a closer look at what's going on in the Hypervector Dashboard.

Hypervector tracks each Assertion against every Benchmark, and you can retrieve the full diff including all failing cases from here for further analysis of any code- or model-based regressions that might have been introduced.
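
If you'd rather dig in from Python, one option is to re-run the assertion locally and inspect the returned diff; the sketch below reuses the ensemble and benchmark objects created earlier and treats the diff payload as an opaque dictionary.

# Re-run the assertion against the current service code and inspect the result
results = get_prediction(ensemble.hypervectors())['prediction']
assertion = benchmark.assert_equal(results)

if not assertion['asserted']:
    # The diff structure is defined by the Hypervector API; printing it is
    # enough to see which cases disagree with the Benchmark
    print(assertion['diff'])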

In this case, a failing Assertion across so many cases of this Benchmark would probably be enough to signal that a service-side breaking change had been introduced (which was the case), and that the PR should not be merged into the project. For more subtle regressions, Hypervector's diff, which provides all the failing cases, enables deeper analysis of the nature of any problems that have surfaced.

# Hypervector's Alpha & beyond

Now that there's a test fixture in place to verify your data science feature is still fit for purpose, you and your team can contribute to the codebase with more confidence and speed, without worrying about introducing tricky regressions or silent breaking changes to your data-driven pipelines and models. The CI build step provides a central checkpoint for all developers to reference when working on these parts of the stack.

Hypervector is an API for bringing improved quality to data-driven software. It's currently in Alpha, which allows you to build a single Project with a Definition, Ensemble and Benchmark for free. Any and all feedback is highly appreciated, and we're always on the lookout for early adopters and users to help build better testing tools for data scientists and engineers working on empirically-derived, data-heavy software components.

Happy testing!

—Jason
jason@hypervector.io