Testing in pyro is a case of the more general problem of testing stochastic software. This breaks down into two common problems: knowing that your implementation is correct at the start, and knowing that you haven't broken it as you work on other things. This issue is focused on the latter, which we can call difference testing. (This issue is related to #18 and #16.)
If we think of our test as a stochastic black box, then running it multiple times yields a population of samples for its return value and run time. A difference test is essentially a test that the distributions before and after a change (e.g. the current git commit vs dev) are the same. We can thus use any strategy for estimating the probability that two distributions are the same. Ideally, any such strategy will have a confidence knob that lets us run the test to a desired level of confidence. (We might, for instance, run at 90% confidence for daily testing but run a 99% confidence test weekly.)
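As a rough sketch of what such a harness might look like (the helper names `collect_population` and `assert_no_difference` are hypothetical, and the Kolmogorov-Smirnov test here is only a placeholder for whichever strategy we settle on):

```python
import numpy as np
from scipy import stats

def collect_population(test_fn, n_runs=200, seed=0):
    """Run the stochastic test repeatedly and return its outputs as an array."""
    rng = np.random.default_rng(seed)
    return np.array([test_fn(rng) for _ in range(n_runs)])

def assert_no_difference(current, baseline, confidence=0.99):
    """Fail if a two-sample test rejects 'same distribution' at the given confidence.

    `baseline` would be a population recorded at the reference commit;
    `current` is the population from the commit under test.
    """
    alpha = 1.0 - confidence
    _, p_value = stats.ks_2samp(current, baseline)
    assert p_value > alpha, f"distributions differ (p={p_value:.4g}, alpha={alpha:.2g})"
```

The `confidence` argument is the knob mentioned above: daily jobs could pass 0.90, weekly jobs 0.99.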
Here are some candidate test strategies (I think there are deep connections among these, and relations to topics such as model criticism, MMD, two-sample tests, etc.):
- Permutation: choose a statistic (a real-valued function of a population of samples) and compute it for the two populations as well as for permuted versions of them; see the sketch after this list.
- Moment matching (or distribution fitting): fit a parametric model (e.g. a normal) to each population to get confidence intervals on its parameters, then check whether the fitted parameters differ.
- Classifier: directly train a classifier to determine which of the two populations a sample comes from. If it can do so sufficiently above chance, then the populations are different. (Note the connection to MMD, GANs, etc.)
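A minimal sketch of the permutation strategy, assuming a difference-of-means statistic by default (both the statistic and the number of permutations are free choices, not part of the proposal):

```python
import numpy as np

def permutation_test(a, b, statistic=lambda x, y: abs(x.mean() - y.mean()),
                     n_permutations=1000, seed=0):
    """Return a p-value for the hypothesis that a and b come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = statistic(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        # Count permutations whose statistic is at least as extreme as the observed one.
        if statistic(perm_a, perm_b) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)
```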
For pyro, we would like to do tests at multiple levels of granularity:
- Very small deterministic tests, such as poutine tests, where we may still need to test the distribution of run times.
- Stochastic elements of an algorithm, such as a single gradient estimate or a single MCMC transition.
- Full integration tests: a complete run of an inference algorithm against a model.
Note 1: for the problem of initial correctness we can use the same machinery when there is a known-good reference for a particular test, e.g. an analytic solution. Here we may only want to test for a difference of means, not of variances.
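A sketch for that case, assuming the quantity of interest is the mean of the test's return values (the helper name and the use of a one-sample t-test are illustrative, not a fixed choice):

```python
from scipy import stats

def assert_matches_analytic_mean(samples, analytic_mean, confidence=0.99):
    """Fail if a one-sample t-test rejects 'mean == analytic_mean' at the given confidence."""
    alpha = 1.0 - confidence
    _, p_value = stats.ttest_1samp(samples, analytic_mean)
    assert p_value > alpha, f"mean differs from analytic value (p={p_value:.4g})"
```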
Note 2: I'm guessing there is relevant prior work, though it may be scattered; for instance, this is somewhat relevant.