Ways I Use Testing as a Data Scientist

In my work, writing tests serves three purposes: making sure things work, documenting my understanding, and preventing future errors. When I was starting out with testing, I had a hard time understanding what I should be writing tests for. As a beginner, I just assumed my code worked: I was staring right at the output in a notebook and visually confirming that it was correct.

After gaining some experience writing tests, I realized one of my problems initially was the fact that knowing what to test requires some experience in knowing what can go wrong. It requires making mistakes and encountering issues so we know what needs to be tested when we encounter a similar problem. One complicating factor on top of this is that beginner mistakes are often syntactical (i.e. “how do I get this code to actually run”) and not conceptual or domain-specific. Telling a beginner to write a test every time they encounter a syntax error to make sure they don’t do that again is not valuable.

As a data scientist, I wear many different hats, which also made learning about testing difficult. There’s plenty of material on testing from a software development perspective, but if I’m doing an analysis and not developing software, I found many of those concepts difficult to translate and apply in my work.

In that spirit, I thought I would write a blog post on the many ways I use testing in my work, in hopes that other data scientists will find it helpful when they’re trying to figure out what to test and how to test in the code they write.

Testing Analysis & Processing Code: `assert`

When doing a one-time or ad hoc analysis, the assert statement in Python is my go-to tool. Typically in an analysis, I use assert statements on as many intermediate calculations or processes as I can.

One example is merging two datasets by some common id. In this example, assume we have some knowledge that there should be no IDs that are in one data set and not the other. In this case, we can write a quick assert like the following:

ids1 = set(df1["ID"].unique())
ids2 = set(df2["ID"].unique())

assert len(ids1.symmetric_difference(ids2)) == 0, "One Dataset Contains Exclusive IDs"

In this test, we create a set out of the unique identifiers (assuming they’re in a DataFrame), and check that the symmetric difference between those two sets is empty.

Pro Tip: It’s possible to attach a message to an assert by adding an expression after the assertion expression, separated by a comma. That expression is only evaluated when the assert fails. We can use this fact to make a failed assert even more helpful when debugging, e.g.:

assert len(ids1.symmetric_difference(ids2)) == 0, f"DF1 not DF2: {ids1 - ids2} - DF2 not DF1: {ids2 - ids1}"
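The same check can also be made at merge time. Here is a minimal sketch, assuming two small made-up DataFrames (the data and column names are illustrative only). Passing indicator=True to merge adds a _merge column recording whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

# Hypothetical example data: both frames are expected to share the same IDs
df1 = pd.DataFrame({"ID": [1, 2, 3], "age": [34, 51, 29]})
df2 = pd.DataFrame({"ID": [1, 2, 3], "score": [0.2, 0.9, 0.5]})

# An outer merge keeps unmatched rows, and indicator=True labels their origin
merged = df1.merge(df2, on="ID", how="outer", indicator=True)

# Any row not labelled "both" exists in only one of the two datasets
unmatched = merged[merged["_merge"] != "both"]
assert unmatched.empty, f"IDs found in only one dataset: {unmatched['ID'].tolist()}"
```

Checking after the merge rather than before has the side benefit that the rows causing the failure are already in one place to inspect.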

Another thing to check is basic calculations and arithmetic. Recently I was analyzing a survey that had some logic determining who should be asked what questions, where I needed to check that the total number of responses to a certain question added up to the number of “Yes” responses to a prior question. My data was in crosstabs generated by pandas, so my quick check looked something like this:

Example Crosstab 
         Count    Percent
Yes         30         .3
No          50         .5
Missing     20         .2

prior_question_yes = prior_question_crosstab.loc["Yes", "Count"]

subsequent_question_total = subsequent_question_crosstab["Count"].sum()

assert prior_question_yes == subsequent_question_total, "Subsequent Question Total Not Matching"

You might be thinking “this is too obvious of a thing to even test”, but it saved me in this instance as I looped over the pairs of parent/child questions and realized I had two typos in my question mapping and my code was referencing the wrong data frames in those cases. This is an important aspect of tests: while they may seem less valuable when running them on a single aspect of the data, they are very helpful as we write code that scales to touching multiple aspects of the data.
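To make the looped version concrete, here is a rough sketch of what checking every parent/child pair might look like. The question names, the mapping, and the counts are all made up for illustration. The point is that the assert message names the pair, so a bad entry in the mapping is easy to spot.

```python
import pandas as pd

# Hypothetical crosstabs keyed by question name (made-up numbers)
crosstabs = {
    "q1": pd.DataFrame({"Count": [30, 50, 20]}, index=["Yes", "No", "Missing"]),
    "q1a": pd.DataFrame({"Count": [18, 12]}, index=["Agree", "Disagree"]),
    "q2": pd.DataFrame({"Count": [45, 55]}, index=["Yes", "No"]),
    "q2a": pd.DataFrame({"Count": [40, 5]}, index=["Agree", "Disagree"]),
}

# Hypothetical mapping of parent question -> follow-up question
question_map = {"q1": "q1a", "q2": "q2a"}

for parent, child in question_map.items():
    parent_yes = crosstabs[parent].loc["Yes", "Count"]
    child_total = crosstabs[child]["Count"].sum()
    # Naming the pair in the message points straight at a bad mapping entry
    assert parent_yes == child_total, (
        f"{parent} -> {child}: {parent_yes} 'Yes' responses "
        f"but {child_total} follow-up responses"
    )
```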

Writing More Tests

While I’ve mostly migrated away from notebooks, for some projects they still make sense. One practice I’ve started is that whenever I visually investigate some aspect of my data by writing some disposable code in a notebook, I convert that validation into an assert statement.

As a beginner, one might have an idea for what to test but struggle to find the right tools to write tests. One place to look for help is the documentation and test suite of the libraries that are being used. For example, to check that two arrays of floats are close to each other, with the caveat they might not be exactly the same, there’s np.isclose. We can see use of np.isclose in the numpy test suite. If we’re using pandas, they also have a helpful testing module in case we need to do things like check if two DataFrames are equal.
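As a quick illustration of those helpers, here is a small sketch with made-up values. It uses np.isclose, np.testing.assert_allclose, and pd.testing.assert_frame_equal, all of which ship with numpy and pandas.

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is not exactly 0.3 in floating point, so == would fail here
computed = np.array([0.1 + 0.2, 1.0])
expected = np.array([0.3, 1.0])

assert np.isclose(computed, expected).all()
# Raises an AssertionError with a readable report if the arrays differ
np.testing.assert_allclose(computed, expected)

# pandas.testing compares values, dtypes, index, and column labels
left = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
right = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
pd.testing.assert_frame_equal(left, right)
```

The functions under np.testing and pd.testing raise on failure with a description of the mismatch, which tends to be more useful than a bare assert on a boolean.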

Identifying New Tests & Testing Code that Operates on Data: `hypothesis`

If we have a function that operates on data and have a hard time figuring out what to test, hypothesis is a great library that can help. Rather than explicitly state the exact objects you want to run through a test, hypothesis generates examples of inputs that follow certain properties you define.

One way I’ve used this is analyzing code that operated on a Likert-style question from a survey. These are the type of survey questions that go from “strongly disagree” to “strongly agree”, and each value is associated with a number (e.g. 1 to 5).

I wrote a function that returned a bunch of information about a column of these values, similar to the following:


import pandas as pd


def summarize_likert(series: pd.Series):
    n = len(series)
    n_completed = series.notnull().sum()
    mean = series.mean()
    empty = series.isnull().sum()
    pct_empty = empty / n
    small = n_completed <= 7
    summary = dict(
        n=n,
        n_completed=n_completed,
        mean=mean,
        empty=empty,
        pct_empty=pct_empty,
        small=small,
    )
    return summary

Pretty simple, right? Now I’ll show how to use hypothesis to run some tests with this type of function. There is a bit of setup to create a hypothesis strategy that mimics our data, but we only have to do it once and it’s reusable.


import numpy as np
from hypothesis import assume, given, strategies as st
from hypothesis.extra.pandas import range_indexes, series

# DEFINE STRATEGY
@st.composite
def plus_nan(draw, strat):
    return draw(st.one_of(st.just(np.nan), strat))


index_strategy = range_indexes(min_size=0, max_size=500)
likert_data = st.integers(min_value=1, max_value=5)
likert_data_with_nan = plus_nan(likert_data)

likert_series = series(elements=likert_data, index=index_strategy)
likert_series_with_nan = series(elements=likert_data_with_nan, index=index_strategy)


# CREATE TEST
@given(likert_series_with_nan)
def test_likert_na(series):
    summary = summarize_likert(series)
    # Likert are 1-5, so mean is in that interval
    assert 1 <= summary["mean"] <= 5

# EXECUTE TEST
if __name__ == "__main__":
    # actually run the test; pytest works as well
    test_likert_na()

If we run this simple test, we’ll get the following error:

RuntimeWarning: invalid value encountered in long_scalars
  pct_empty = empty / n
Falsifying example: test_likert_na(
    series=Series([], dtype: float64),
)

This is hypothesis telling us we had an error at runtime with the code and showing us the example it generated that gave that error. It turns out we have behavior that doesn’t work when passed an empty Series. At first I thought this wasn’t worth considering, but in my use case I was often automatically filtering data based on other columns, so it is realistic at some point I might pass an empty series and want to know about it. At this point, we’d write some behavior to handle the case where we are passed an empty series. A simple fix would be to add the following at the top of our function and add an assumption to the test that the series is not empty.

def summarize_likert(series: pd.Series):
    if series.empty:
        raise ValueError("Series Empty")
    ...


@given(likert_series_with_nan)
def test_likert_na(series):
    assume(not series.empty)
    summary = summarize_likert(series)
    # Likert are 1-5, so mean is in that interval
    assert 1 <= summary["mean"] <= 5

So we’ve fixed that issue, let’s run the test again.

Falsifying example: test_likert_na(
    series=0   NaN
    dtype: float64,
)
Traceback (most recent call last):
  File "likert.py", line 50, in <module>
    test_likert_na()
  File "likert.py", line 41, in test_likert_na
    @settings(verbosity=Verbosity.verbose)
  File "/Users/peter/opt/miniconda3/envs/blog-test/lib/python3.8/site-packages/hypothesis/core.py", line 1190, in wrapped_test
    raise the_error_hypothesis_found
  File "likert.py", line 46, in test_likert_na
    assert 1 <= summary["mean"] <= 5
AssertionError

A similar issue: we haven’t defined what happens when the series contains only NaN values, in which case the mean is itself NaN and the comparison fails. If we look further up in the traceback, we actually see that hypothesis is smart enough that it found a more complex example that didn’t work (a longer series of NaNs), and then reduced it down to the simpler case with only one NaN.
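One possible way to make that behavior explicit is to treat a series with no completed responses like an empty one and raise. This is a sketch under that assumption, not necessarily the right choice for every project; the function name summarize_likert_strict and the second error message are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_likert_strict(series: pd.Series) -> dict:
    """Hypothetical variant that rejects series with no usable responses."""
    if series.empty:
        raise ValueError("Series Empty")
    if series.isnull().all():
        raise ValueError("Series Contains Only Missing Values")
    n = len(series)
    empty = series.isnull().sum()
    return dict(
        n=n,
        n_completed=n - empty,
        mean=series.mean(),  # pandas skips NaN by default
        empty=empty,
        pct_empty=empty / n,
    )


summary = summarize_likert_strict(pd.Series([1, np.nan, 5]))
assert 1 <= summary["mean"] <= 5
```

With a guard like this in place, the matching change in the hypothesis test would be an extra assumption such as assume(series.notnull().any()), so that generated all-NaN series are skipped rather than reported as failures.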

Tests on the Data: `pandera` & Great Expectations

It’s a good idea to write tests on the data itself. To do this, we’ll need to do a bit of exploratory work to understand the qualities of the data, then translate that into tests. Testing data is extremely helpful if we will be repeatedly receiving new data with the same structure: tests are a quick way to make sure there are no issues with new data.

For lightweight use cases, I like pandera. With pandera, I do two high level activities with a dataset: explicitly define a schema for the data and add in any supplemental info I have about the data structure.

For the first definition of the schema, all I do is define the column names and types. We can use the infer_schema function to get started with this from an existing DataFrame. From here, we need to manually validate column types and make corrections. For example, an object type column might be better as a categorical variable, and we’ll want to check that any timestamp columns are correctly inferred. Even something as simple as this has saved me from a costly error when a new dataset I received was missing a column I was expecting from the first version of the dataset.

From there, I create v2 of the schema, which adds Checks to the columns. Checks are information we gain after exploring the data – for example, whether a column should always be positive, whether the column name should be formatted a certain way, or whether a column should only contain certain values (e.g. a bool represented as a 0/1 int).

If we’re expecting to repeatedly read in new data, I would recommend exploring Great Expectations. The killer feature of Great Expectations is that it will generate a template of tests for the data based on a sample set of data we give it, like pandera’s infer_schema on steroids. Again, this is only a starting point for adding in future tests (or expectations), but can be really helpful in generating basic things to test.

Great Expectations is a little more involved to set up, so I think the investment is worth it if we know we’ll be repeatedly reading in new data with the same structure. There are also some stellar additional features like the data docs, which have been helpful for me in communicating data quality issues to a larger team.

Finally, even if we are not expecting new versions of the data, writing tests about the data is still a good idea. It’s great documentation for ourselves when we come back to this project or when others join our team. Additionally, it gives us something to communicate to others to help in validating data assumptions.

Writing Code for Other People: `pytest`

The final way I use tests is if I’m writing software for other people. Two recent libraries I’ve written, SetFit and EmBuddy, are examples of this. In this case, I take a more traditional software testing approach and use pytest to create and execute tests.

My testing approach is close to Test-Driven Development where I typically write a test first. I’m not rigorous with this, but I do use this process to sketch out the API I want to create so I can get a better picture of how people will interact with my code.

Here’s one example from EmBuddy that tests the save and load functionality. I know I want save and load to be as simple as possible from an API perspective – just provide a path and save the object there. Given that, I wrote a test with the API for this how I imagined it before I wrote the actual functionality. It looks like this:

def test_persist_str(tmp_path, embuddy_sm):
    path = str(tmp_path / "test_embeddings.emb")
    emb1 = embuddy_sm
    emb1.embed(["this is a sentence"])
    emb1.save(path)

    emb2 = EmBuddy.load(path)
    assert np.array_equal(emb1.embedding_cache, emb2.embedding_cache)

There are some advanced pytest features going on here to explain. First, the test definition includes two arguments: these are fixtures, which are objects commonly used across tests. In this case I’m using a fixture that comes with pytest, tmp_path, to create a temporary path to save to, and another fixture, embuddy_sm, which is a pre-created instance of an EmBuddy object. The utility of these fixtures is that I don’t have to rewrite the code to create a temporary folder or EmBuddy object every time I want to use them in a test, which is especially important when the test isn’t about creating those things.

The test structure itself follows a common pattern called Arrange-Act-Assert. Until I learned about this, I really struggled trying to figure out how to write tests. As a bonus, this pattern fits nicely into how tests are written with pytest. In the above example, I arrange by including the fixtures, act by saving and loading the model, and assert that the results from the initial model and persisted model are the same.
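To show the pattern without any EmBuddy specifics, here is a self-contained sketch with the three phases marked in comments. WordCounter is a hypothetical stand-in class invented for this example; the only pytest feature it relies on is the built-in tmp_path fixture.

```python
import json
from pathlib import Path


class WordCounter:
    """Hypothetical stand-in for an object with save/load methods."""

    def __init__(self):
        self.counts = {}

    def add(self, text: str) -> None:
        for word in text.split():
            self.counts[word] = self.counts.get(word, 0) + 1

    def save(self, path: Path) -> None:
        Path(path).write_text(json.dumps(self.counts))

    @classmethod
    def load(cls, path: Path) -> "WordCounter":
        counter = cls()
        counter.counts = json.loads(Path(path).read_text())
        return counter


def test_save_load_roundtrip(tmp_path):
    # Arrange: tmp_path is pytest's built-in temporary directory fixture
    path = tmp_path / "counts.json"
    original = WordCounter()
    original.add("this is a sentence")

    # Act: persist the object and read it back
    original.save(path)
    restored = WordCounter.load(path)

    # Assert: the round trip preserved the state
    assert restored.counts == original.counts
```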

Even if we’re not sure what to assert, writing a test that executes the code is still valuable. Sometimes I write code that raises an exception when I run it because I made a mistake or forgot to implement something – so the value of a test was literally just running that code again and knowing that I had made a mistake that I needed to fix. When fixing a mistake, it’s a good idea to also convert that fix into a test as well.

Here’s an example of that from EmBuddy. After attempting to ask for the nearest neighbors before I had built the index to do so, I realized I needed to add an instructive error for other users who might make that same mistake. In this case, I created a custom exception and then tested that the exception was raised when that same mistake was made, like so:

def test_no_index_exception(embuddy_sm):
    with pytest.raises(IndexNotBuiltError):
        embuddy_sm.nearest_neighbors("Some text")
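For readers who want to try this pattern end to end, here is a minimal runnable sketch. TinySearch is a hypothetical stand-in and not the EmBuddy implementation; only the exception name is borrowed from the test above. The match argument to pytest.raises also checks the error message, which helps keep the hint to users from being removed by accident.

```python
import pytest


class IndexNotBuiltError(Exception):
    """Raised when a search is attempted before the index exists."""


class TinySearch:
    """Hypothetical stand-in, not the EmBuddy implementation."""

    def __init__(self):
        self.index = None

    def build_index(self, items):
        self.index = list(items)

    def nearest_neighbors(self, query):
        if self.index is None:
            raise IndexNotBuiltError("Call build_index() before nearest_neighbors().")
        # Toy lookup: return indexed items that contain the query string
        return [item for item in self.index if query in item]


def test_no_index_exception():
    search = TinySearch()
    # match is a regex searched against the exception message
    with pytest.raises(IndexNotBuiltError, match="build_index"):
        search.nearest_neighbors("Some text")
```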

Wrap-Up

There are plenty of things to test when doing data science work, but it’s not always clear what to test or how to test it. In my experience, I’m usually testing one of the following things:

The results of some analysis process (using assert)
Code that operates on data (using hypothesis)
Aspects of the data (using pandera or Great Expectations)
Code for others (using pytest)

If you’re a data scientist and test other things or have other tools, reach out and let me know.
