The guts of a property testing library

Property-based testing is a technique for improving tests by checking properties of a piece of software against many values drawn at random from a large space of valid inputs. The methodology was first introduced in the paper "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs", which describes the basic idea and shows a possible implementation in Haskell. Since then, many tools to aid in property-based testing have appeared for many programming languages: as of the time of writing, there are libraries for Haskell, Erlang, Clojure, Python, Scala, and many others. A few days ago I released the first version of StreamData, a property testing (and data generation) library for Elixir (and a candidate for inclusion in Elixir itself in the future). This post is neither an introduction to property-based testing nor a tutorial on how to use StreamData: what I want to do is dig into the mechanics of how StreamData works, its design, and how it compares to some of the other property-based testing libraries mentioned above.


What StreamData looks like

Let's say we want to test a basic property of the :lists.reverse/1 function: that is, we want to test that the first element of a list is the last element of the reverse of that list. This property, written using StreamData, looks like this:

property "the first elem of a list is the last elem of its reverse" do
  check all list <- list_of(int()) do
    assert hd(list) == List.last(:lists.reverse(list))
  end
end

This syntax means: take a list randomly generated by the list_of(int()) generator (which generates lists of integers) and run the do body (the assertion) with the generated list. If the test passes, a new list is generated, and this goes on until a failure is found. When a failure is found, StreamData tries to shrink the list in order to find the "smallest" example that still fails the property. In the case above, hd/1 will fail for the empty list [], and StreamData will report that (the correct test would use values generated by non_empty(list_of(int())), as shown below).
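
For reference, the fixed property would look like this:

property "the first elem of a non-empty list is the last elem of its reverse" do
  check all list <- non_empty(list_of(int())) do
    assert hd(list) == List.last(:lists.reverse(list))
  end
end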

check all is capable of doing more than the above: we can also combine generators, use previously generated values to influence further generators, and filter out values that are not good for our tests. For example, take this trivial test:

property "in/2 returns true for elements taken out of a list" do
  check all list <- list_of(int()),
            list != [],
            elem <- member_of(list) do
    assert elem in list
  end
end

As you can see, member_of/1 is a generator, and in this case it uses list, a value previously generated by the list_of(int()) generator. The list != [] line is a filter: when a filter fails for a generated value, that test "run" is thrown away and new values are generated.

Data generation

While we've only focused on property testing so far, StreamData also provides data generation functionality. All generators implement the Enumerable protocol and are infinite streams of data, so they can easily be used for generating test data and similar purposes.

StreamData.int() |> Stream.map(&abs/1) |> Enum.take(3)
#=> [1, 0, 2]

The inner workings

The first things I did when starting to research a possible property testing implementation were to read the original QuickCheck paper and to take a look at the resulting library.

Generators

After reading the paper and doing some more research on Haskell's implementation, I had a few requirements in mind for the generators:

  • they have to be able to generate random values indefinitely (without running out of values);
  • they have to be reproducible, that is, they should explicitly "take in" the randomness in order to be able to deterministically "replay" the generated values;
  • we have to be able to use a "size" to control the space of generated values to be able to start a test run with smaller values and grow the generated values over time.

The requirements above were enough to get me going, and I implemented the first version of generators. Back then, a generator was a struct (so that it could be told apart from any other term) with just one field, :generator, which held a function that took a random seed and a size.

defmodule StreamData do
  defstruct [:generator]
end

In the first version, generator functions had to return {term, :rand.state} tuples, where the first element was the generated value and the second element was the updated random seed. It's important that generators don't use a mutable seed or share the seed: making the seed explicit both as input and as output ensured that, by passing the same initial seed around, the generated values would be the same. To give an idea, the int/0 generator (the generator of integers) was implemented like this:

def int() do
  %StreamData{generator: fn seed, size ->
    {random_int, new_seed} = :rand.uniform_s(size * 2 + 1, seed)
    {random_int - size - 1, new_seed}
  end}
end

As you can see, the seed is used to generate a random value, and the size is used to control the space this value can be drawn from (in this case, int/0 returns integers in the -size..size range).

Generating values from this generator was straightforward:

data = int()
seed0 = :rand.seed_s(:exs1024)
{first_value, seed1} = data.generator.(seed0, _size = 10)
{second_value, seed2} = data.generator.(seed1, _size = 10)
# ...and so on

Random splitting

This worked fairly well, but when reading the QuickCheck paper again I noticed something I had missed the first time: the seed was never returned alongside the value. What QuickCheck does instead is "split" the seed. Splitting is a way to take one random seed and deterministically derive two new random seeds from it. This is useful when we need to use the random seed twice (or more): instead of threading the updated seed around like :rand.uniform_s/2 does above, we can split the seed into seed1 and seed2, use seed1 just once to generate an integer and then throw it away, and keep seed2 for when we need a seed for something else. seed2 can be further split, effectively letting us derive as many seeds as we need to generate values.

With this in mind, I changed the generator function to take a seed and a size but only return a value. If I needed to call a generator twice, I would just split the seed and use the two resulting seeds to call it twice. It now looked like this:

def int() do
  %StreamData{generator: fn seed, size ->
    {random_int, _seed} = :rand.uniform_s(size * 2 + 1, seed)
    random_int - size - 1
  end}
end

data = int()
seed0 = :rand.seed_s(:exs1024)
{seed1, seed2} = split_seed(seed0)
first_value = data.generator.(seed1, _size = 10)
second_value = data.generator.(seed2, _size = 10)
# ...and so on
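
The split_seed/1 function used above is assumed, since :rand doesn't provide a splitting function out of the box. A minimal (hypothetical) sketch could derive two integers from the seed and use them to deterministically seed two new states:

# Hypothetical sketch, not StreamData's actual implementation: derive two
# integers from the given seed and use them to seed two fresh states,
# deterministically.
defp split_seed(seed) do
  {int1, seed} = :rand.uniform_s(1_000_000_000, seed)
  {int2, _seed} = :rand.uniform_s(1_000_000_000, seed)
  seed1 = :rand.seed_s(:exs1024, {int1, int2, int1 + int2})
  seed2 = :rand.seed_s(:exs1024, {int2, int1, int1 + int2 + 1})
  {seed1, seed2}
end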

This worked really well, and it's what ended up being used in StreamData. The only thing that changed since then is the return value, which is not a plain value anymore because of shrinking, as we'll discuss below.

Random byte stream approach

The seed-splitting approach is used by Haskell's QuickCheck and also, for example, by Clojure's test.check. A very interesting and somewhat different approach is taken by Hypothesis, Python's library for property testing. Instead of using a random seed to "drive" the generation, Hypothesis uses an infinite random stream of bytes. Each generator can take as many bytes as it needs from the stream to generate a value that is deterministically derived from those bytes. This approach is also deterministic and reproducible, since feeding generators the same byte stream leaves the whole generation unchanged. This enables Hypothesis to have a "database" of known failures that is just a set of byte streams saved from previous runs: these streams are "replayed" on new runs to ensure known failures are always retested. In StreamData's case, it's enough to store the initial seed of a known failure in order to reproduce it. Below we'll also take a look at how Hypothesis takes advantage of the byte stream approach to model shrinking.
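
To give a feel for the idea, here's a toy sketch in Elixir (not how Hypothesis is actually implemented): a byte-stream-driven generator is a function that consumes bytes from a stream and returns a value plus the rest of the stream.

# Toy sketch: a "generator" takes an enumerable of bytes and returns
# {generated_value, rest_of_the_stream}. This one consumes one byte and
# turns it into a small integer.
int_from_bytes = fn bytes ->
  [byte] = Enum.take(bytes, 1)
  {byte - 128, Stream.drop(bytes, 1)}
end

# An infinite stream of random bytes drives the generation...
random_bytes = Stream.repeatedly(fn -> Enum.random(0..255) end)
{_value, _rest} = int_from_bytes.(random_bytes)

# ...and replaying a saved stream reproduces the same value.
{-128, _rest} = int_from_bytes.([0, 1, 2])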

Haskell's approach

Haskell takes a similar approach to StreamData's (or rather, the other way around) but with a big difference: Haskell can use its type system to infer what values to generate based only on the type. So in Haskell, while the generation of integers is implemented similarly to our int/0 above, there's no need to define int/0 in the first place: you only define how to generate values of the Int type. This makes a noticeable difference, since you can give Haskell's QuickCheck a function and it will be able to test it on the right randomly generated values just by looking at the function's type signature. However, the "typed approach" has a drawback, as we'll see in the "Shrinking" section below.

Shrinking

Once I had generation down, I knew the other tough problem to tackle was shrinking. Shrinking is an important component of property testing: since property testing is based on the idea of generating random data in order to find bugs in software, the data that exposes a bug is often full of noise. Shrinking takes the noise out of the failing data and finds the minimal failing case. A common example used to illustrate this is an always-failing property that looks like this:

property "lists of integers don't contain 42" do
  check all list <- list_of(int()) do
    assert 42 not in list
  end
end

This property will fail every time 42 is generated by int(), but the list that 42 ends up in will often contain other elements as well, for example [30, 12, 9, 0, 4, 42, 93]. From such a list, it's hard to tell that the failure is caused by the number 42 being in it. After shrinking, however, StreamData will reduce the list to [42], which is the minimal value that list_of(int()) can generate that still fails the property.

The typed approach to shrinking

The approach Haskell takes is straightforward: it completely separates generation and shrinking. Haskell can do this fairly easily thanks to its type system: all you need in order to generate data and to shrink that data is a type. Haskell's QuickCheck defines an Arbitrary class with two functions, arbitrary and shrink:

class Arbitrary a where
  arbitrary :: Gen a
  shrink :: a -> [a]

The arbitrary function is responsible for generating a random value for the type that implements the class. Gen a is a wrapper type around a function that takes a random seed and a size and returns a value of type a (same as what StreamData does as we described above). For example, arbitrary for the Int type is implemented very similarly to how we implemented the function in int/0 above.

The shrink function is completely independent of the arbitrary function. It takes a value of type a (let's call it x) and returns a list of values of type a. These values are the possible shrinks of x. To go into slightly more detail, the return value of shrink is the first level of the shrink tree for x: as you can imagine, since the shrinks of x also have type a, they can be passed to shrink themselves and be shrunk further. This can be done recursively, effectively forming a tree of shrinks of x. Note that thanks to Haskell's laziness, this tree is completely lazy: the list returned by shrink is evaluated one element at a time when needed, so when we want to "go down" in the tree instead of "going right", we can just call shrink on a value and the rest of the original list will not be evaluated.

shrink is easy to implement as well: for example, shrink for the [a] type could simply take a list and return all the lists obtained by removing a single element from the original list. This makes lists smaller and smaller as we go down the tree, ending with [] (the smallest list) at the leaves.
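
Translated to Elixir for illustration (hypothetically, since shrink really belongs to a Haskell type class), shrink for lists could look like this:

# Hypothetical Elixir translation of shrink for lists: return all the lists
# obtained by removing one element from the original list.
def shrink_list([]), do: []

def shrink_list(list) do
  Enum.map(0..(length(list) - 1), &List.delete_at(list, &1))
end

shrink_list([1, 2, 3])
#=> [[2, 3], [1, 3], [1, 2]]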

This approach is elegant and easy to implement, but it has one major drawback. Since generation and shrinking are completely separated, any generation logic that is not encoded in the type must be repeated when shrinking. To understand what this means, let's say we want to generate even integers. Assuming we have a map/2 function that maps a function over values generated by a generator, the even_int generator would look like this:

map(int(), &(&1 * 2))
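
(For completeness, here's a minimal sketch of what such a map/2 could look like for the value-returning generators from earlier; this is my own sketch, not StreamData's actual implementation, which, as we'll see, maps over whole shrink trees.)

# Minimal map/2 sketch for the pre-lazy-tree generators: run the inner
# generator and apply fun to the value it produces.
def map(%StreamData{generator: generator}, fun) do
  %StreamData{generator: fn seed, size ->
    fun.(generator.(seed, size))
  end}
end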

The even_int generator correctly generates even integers, but under the typed approach those integers would still be shrunk as plain integers, so we would have to repeat the "even" logic when using this generator in tests:

require Integer

check all i <- even_int,
          Integer.is_even(i) do
  # ...
end

The solution to this problem is to couple generation and shrinking, as we'll see next.

Lazy trees

The approach I went with (taken straight out of test.check) builds on the idea of the shrink tree mentioned above. Basically, a generator now returns a StreamData.LazyTree struct instead of a plain value. A "lazy tree" is a tree that has a realized (that is, not lazy) root and a lazy stream of children, each child being a lazy tree itself.

defmodule StreamData.LazyTree do
  defstruct [:root, :children]
end

The root is the generated value, and the children are the shrunk values. The roots of the children form the first level of the shrink tree, which corresponds to the return value of shrink in the Haskell implementation. Let's see how the list_of/1 generator, which takes a generator and generates lists of values generated by that generator, could be implemented so that it returns a lazy tree:

alias StreamData.LazyTree

def list_of(data) do
  %StreamData{generator: fn seed, size ->
    {seed1, seed2} = split_seed(seed)
    {length, _seed} = :rand.uniform_s(size, seed1)
    # Let's pretend we have a function that generates values from a generator
    # "n" times.
    root = generate_n_values(data, length, seed2, size)
    list_lazy_tree(root)
  end}
end

defp list_lazy_tree([]) do
  %LazyTree{root: [], children: []}
end

defp list_lazy_tree(list) do
  # Each child is the (lazily computed) tree for the list obtained by
  # deleting one element from the original list.
  children =
    0..(length(list) - 1)
    |> Stream.map(&List.delete_at(list, &1))
    |> Stream.map(&list_lazy_tree/1)

  %LazyTree{root: list, children: children}
end

The core idea is that the root of the lazy tree we return is the generated list itself. The first-level children are a stream of trees whose roots are the lists obtained by removing one element at a time from the generated list, and whose children are computed recursively through the same process.
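
To make this concrete, here's what inspecting such a tree could look like (the generated values below are made up for illustration):

seed = :rand.seed_s(:exs1024)
tree = list_of(int()).generator.(seed, _size = 5)

tree.root
#=> [2, -1, 4]
Enum.map(tree.children, & &1.root)
#=> [[-1, 4], [2, 4], [2, -1]]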

Generator combinators (such as map/2) now just transform the whole tree instead of just the returned value. For example, map/2 maps the given function over all the values in each tree generated by the given generator.
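
As a sketch (a hypothetical helper, assuming the LazyTree struct above), mapping over a whole lazy tree could look like this:

# Apply fun to the root and, lazily, to every child tree.
defp map_lazy_tree(%LazyTree{root: root, children: children}, fun) do
  %LazyTree{
    root: fun.(root),
    children: Stream.map(children, &map_lazy_tree(&1, fun))
  }
end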

With lazy trees in place, shrinking becomes straightforward: it's essentially a depth-first search of the lazy tree. Starting from the failing root, we look at its children: children whose values pass the property are skipped by moving "to the right", while the first child that still fails becomes the new current node, and we repeat the process on its children. When a failing node has no failing children (or no children at all), that node holds the smallest failing value. This algorithm does not explore the whole tree (which would be unfeasible, since the trees can become very large once we start composing generators), but in practice it usually finds very good shrinks.
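
In (hypothetical) code, the search could look something like this, where fails? runs the property against a value and returns true if the property fails:

# Assumes tree.root already fails the property. Enum.find/2 walks the lazy
# children left to right, skipping passing values, and stops at the first
# failing child.
defp shrink(%LazyTree{root: root, children: children}, fails?) do
  case Enum.find(children, fn child -> fails?.(child.root) end) do
    nil -> root
    failing_child -> shrink(failing_child, fails?)
  end
end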

Random byte stream approach (to shrinking this time)

I briefly mentioned above that Hypothesis, Python's library for property-based testing, uses an infinite stream of random bytes that generators can consume to generate values. This stream is the foundation of shrinking as well: when we need to shrink a value, we can remove bytes from the original stream or shrink individual bytes (towards 0). Generators are expected to be written so that the generated values are somehow "proportional" to the byte stream, so that shrinking the byte stream also shrinks the generated value. Since values are generated deterministically from the byte stream, shrinking is deterministic as well. Hypothesis' approach is fascinating, and I'm doing some research to see how it compares in practice to our approach on our platform, in order to evaluate a potential switch to this architecture.

Conclusion

I had tons of fun writing this library and learned a lot of things. This post was a little insight into how I learned those things and, hopefully, an interesting read for everyone, especially potential contributors to StreamData.

Credit

Most ideas in StreamData are not original: they're taken partially from QuickCheck but mostly from test.check, so a big thank you goes to everyone who helped with those projects. This blog post was also influenced by this great article by the author of Hypothesis, so thank you as well, David!