The Often Overlooked Test Oracle – The Key to More Powerful Testing

This is a transcript of the above Webinar by Doug Hoffman. We thank Doug Hoffman for all of this great content.

Good morning, afternoon, evening, or night, as the case may be. In the next 45 minutes I’m going to skim over material that I typically present in a three to five day class. So hold on. Prepare for what my students have described as drinking from a fire hose.

OK, so a rhetorical question about software, the system under test: what can happen when there’s a bug? Well, when there’s a bug in the software, anything can happen. Anywhere. If the software can touch it, it can screw it up. So we’re dealing with infinite numbers of possibilities here.

A few background ideas before I get into the meat of today’s talk about Oracles. One way to think about testing is from the information perspective. Testing is a technical investigation of the product under test, conducted to provide stakeholders with quality-related information. It’s all about the information we get. When we design and run a test, our goal is to ask a question of the software: are you broken like this? Our challenge isn’t just creating and running a test. It’s creating and running a test that tells us something interesting about the software.

So let me describe what I define as a test. There are three parts to it: there’s an exercise, there’s data gathering, and there’s analysis. The exercise is the setup before the test, and then the steps and values that we input to stimulate the software. Along with that we gather relevant data (relevant being what we think is useful). It may be test-specific results, or it may be independent of the test case; we may be looking at information we’re gathering about memory usage, for example. And it can occur before, during, and/or after the exercise. Some of the stuff we may collect beforehand would be system variables, configuration information, and information about the software itself. And of course the final goal is to analyze and determine whether the test has told us something interesting.

Typically I’ve found that people look at testing as an input, process, output model: we provide the stimulation, the influences, the inputs; the software does its thing; and then we get the results. It implies that we control all the inputs, and it implies that we can verify all the results. Now, the truth is a little bit different from that. We’re not dealing with all the factors. There’s data in memory and external datasets that influence the behavior and can be mucked with by bugs. There’s program state: you’ve got to have the right screen up, you’ve got to have the right internal state in the program in order for it to behave properly. And there’s the system environment, the stuff that supports our application, that we haven’t dealt with.

So I came up with what I call the expanded execution model. It has the input, process, output, of course. And we also have to consider the pre- and post-datasets, because they both influence and are impacted by the software. The second time you add a user ID you get a different response from the software than you did the first time; it tells you to pick another user ID instead of just entering it. Also, it’s possible the software makes changes to other users in the database besides the one it added. Remember, we’re looking for bugs, and when one is encountered strange things happen. The software also has to be in the right starting state. You get different behavior when you’re logged in and when you’re not. You need to be on the correct screen in order to order an item. The program state is not always visible. It’s quite possible for the system to correctly reject the loan application for one user and then fail to reset the internal state variable, and therefore reject all subsequent loan applications. Environment plays a critical role in the behavior of the software. How much memory is available? What services are available? What OS or browser are you using, and what is the available network bandwidth? Those are a few of the millions of environmental factors that influence the behavior of the software. Have you ever had a timeout? Well, that’s the environment. Likewise memory leaks, soaking up all the network bandwidth, and leaving scratch files around are examples of environmental outcomes.

With the expanded model we have some implications. We don’t control all the inputs, all the influences. We don’t verify everything. Multiple domains are involved; it’s not just our exercise. The test exercise actually is usually the easy part. And we can’t verify everything; we can’t even know all the factors. Now, it’s almost a relief to realize that no test or tester can be perfect. A non-reproducible bug doesn’t mean there is or isn’t a bug; it means that some influence was not noticed and is different now. A bug that escapes to the field means we didn’t check all the possible outcomes. We aren’t wrong and we’re not neglectful; it’s only that we’re not omniscient.

In ancient Greece there were temples for the various gods, and in some of them there were priests and priestesses called Oracles who spoke for the gods. People would go to the Oracles to ask their burning questions and to hear the answers from the gods. One of the most famous was the Oracle at Delphi. A person would go to that Oracle and ask a question of the priests, who would in turn respond with the answer from the god Apollo, who among other things was the god of truth. And thus we’ve picked up the term Oracle for the means to determine the outcome from a test. We’re asking the god of testing. For today I’ll define a test Oracle as a mechanism or principle by which we determine whether or not the software under test is behaving reasonably. Unreasonable behavior requires investigation by a person. And when the Oracle says that the behavior is unremarkable, we just move on. We don’t second-guess a pass on our tests. Did it pass our test? OK, let’s look at the ones that failed.

Some notes on test Oracles. Every test has some kind of Oracle. The Oracles that we actually implement may blend characteristics of the multiple types of Oracles that I will describe. We may run multiple Oracles on one test. We may check different kinds of behaviors: did it go to the right screen, did it update the database correctly? Those may actually be separate Oracles. And when you add an Oracle the tests get stronger, and by stronger I mean the likelihood of finding a bug, if it’s there, goes up. The more Oracles we have, the more kinds of bugs we can find. The Oracles may be independent of or integrated into a test. Often as we test we’re checking to see what’s going on. Sometimes when we test we may have a memory leak checker running that’s totally independent of our test. But it’s an Oracle nevertheless. And what I’ve discovered, and what people who automate tests discover, is that you have to have an Oracle for a test. And particularly when we start doing high-volume test automation, it’s necessary to have good Oracles.

Basically we have two ways we do comparisons. One is deterministic: a mismatch means the behavior is abnormal; we didn’t get the expected results. This is the binary approach: pass/fail. Most of us doing testing, and most of our tests, use a deterministic approach. There’s another approach, which is probabilistic, that indicates when behavior warrants further investigation but doesn’t necessarily mean we found a bug. It means that we need to look into this. And as I mentioned on the last slide, comparison is sometimes done within the test, and often it’s a separate activity altogether.
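
The two comparison styles can be sketched in a few lines (a minimal Python illustration; the function names and thresholds are invented for the example):

```python
def deterministic_verdict(expected, actual):
    # Binary oracle: any mismatch is a failure.
    return "pass" if actual == expected else "fail"

def probabilistic_verdict(value, low, high):
    # Probabilistic oracle: out-of-band doesn't prove a bug;
    # it flags the behavior for human investigation.
    return "unremarkable" if low <= value <= high else "investigate"

assert deterministic_verdict(4, 2 + 2) == "pass"
assert deterministic_verdict(4, 5) == "fail"
assert probabilistic_verdict(12.0, 3.3, 30.0) == "unremarkable"
assert probabilistic_verdict(45.0, 3.3, 30.0) == "investigate"
```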

Chris – Can you expand on that a little bit: So you’re saying that comparisons are done sometimes? 

Doug – Well, sometimes the comparisons are done within the test. Sometimes we may create a log file, and after the test is complete we go through it to see: did the test do what we wanted it to do, or what we expected? So the comparison that we do, whether deterministic or probabilistic, is sometimes built into the test. My automated test, instead of creating a log as we execute, may read the next line of the expected log and do the comparison within the program. So I don’t create a new log. I just read the old log so that I can catch the error at the moment that it occurs. OK, so it’s built into the test in that case.

So I came up with a baker’s dozen of Test Oracles, and what I’m going to do now is go through and quickly describe each of these. These are distinct Oracle approaches. As I said, a particular instance may be a blend, but each one of these is distinctive.

The first is No Oracle. It’s distinctly possible that the test just runs and we ignore the outcomes. This is really easy, and the tests run really quickly. But we only notice spectacular events, as in the case where the system crashes or the application crashes. In one instance my test set the printer on fire. People noticed that. With the No Oracle strategy we can do a whole lot of activities and feel like we did something worthwhile. But there may be a false sense of accomplishment.

The other extreme is an independent implementation; that is, we write the application, or pieces of it, again as the Oracle. We may cover some range of inputs, some particular variables, some results. And we’ve implemented the actions of the software we’re testing, so we generate “correct” results. I put correct in quotes because software has bugs, and if we’re creating an Oracle then it’s software and it may have bugs. The independent implementation strategy is usually expensive and time consuming. And there’s one factor here that we need to keep in mind: the more complex the Oracle, the more likely it has bugs. And if it actually gets more complex than the software we’re testing, then a miscompare is more likely to be a bug in the Oracle than in the software we’re testing. And our job isn’t to debug Oracles.

So the first of the two types of consistency Oracles is the saved master strategy. This Oracle checks for changes, for differences between this run of the test and some other run of the test. That’s the most frequently used automation strategy, and we have both validated and unvalidated masters. A validated master means we’ve gone through every character in that master file and have reason to believe it is really correct. Unvalidated is what we typically have, and that is we’ve gone through the file and didn’t notice anything. IBM, in their infinite wisdom, actually went out to customers two years after a release to identify all of the bugs that had been found in the field, to figure out why their tests didn’t find those bugs. When they looked back through their test results, they found that 30 percent of the bugs the customers reported had actually been uncovered by their tests; they just hadn’t noticed they were there. In the most common method, the test generates a log of activity and then we compare the log this time with the log last time, or we create a golden master, something that we may validate, and then we compare each run to that golden master.
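
The log-against-master comparison can be sketched like this (a minimal Python sketch; the log contents are made up for illustration):

```python
import difflib

def compare_to_master(master_lines, current_lines):
    """Saved-master oracle: diff this run's log against the golden master.
    An empty diff is unremarkable; anything else warrants investigation."""
    return list(difflib.unified_diff(
        master_lines, current_lines,
        fromfile="golden_master", tofile="current_run", lineterm=""))

master = ["login ok", "add user doug: ok", "logout ok"]      # validated earlier
this_run = ["login ok", "add user doug: duplicate id", "logout ok"]

assert compare_to_master(master, master) == []    # matches: move on
assert compare_to_master(master, this_run) != []  # mismatch: investigate
```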

The second kind of consistency Oracle is also looking for differences, but we call it the functional equivalence strategy. We’re typically using very high volume and some alternative program. A validated instance would be, for example, testing spreadsheets against the behavior of Excel: by definition Excel is right; even when it’s wrong, it’s right. Your spreadsheet had better do the same thing. So we have a trusted Oracle there. For an unvalidated one, if we’re testing the spreadsheet we may use the OpenOffice spreadsheet, and if there’s a miscompare we may have found a bug, or we may have a bug in OpenOffice.

Chris – Someone is asking: what is golden in your description of a golden master? Isn’t it sort of the same as a benchmark?

Doug – It can be, it can be. It depends on how you’re using the benchmark. If it’s a set of activities where deviating from that benchmark means that you have encountered a bug, then yeah, it’s a golden master approach.

OK. So with functional equivalence we’re testing against a competing product or another version of the software, for example our software on another platform. We want to compare the results, and if they’re different then we’ve found something that is likely to be a bug. So we just feed test data into our software and the similar product and compare the results.
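
As a toy illustration (the `my_mean` function is an invented stand-in for the product under test, and Python’s `statistics.mean` plays the trusted functional equivalent):

```python
import statistics

def my_mean(xs):
    # Stand-in for the implementation under test.
    return sum(xs) / len(xs)

# Functional-equivalence oracle: feed the same inputs into both
# implementations and compare the outputs.
for case in ([1, 2, 3], [10.0, 20.0], [5], [2.5, 2.5, 5.0]):
    ours = my_mean(case)
    theirs = statistics.mean(case)
    assert abs(ours - theirs) < 1e-9, f"miscompare on {case}: {ours} vs {theirs}"
```

A miscompare here sends a human to investigate; as noted above, the bug may be in either implementation.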

So the first of what I think are the more interesting Oracles is self-verifying data. The Oracles so far you’re probably familiar with; they’re fairly common. Self-verifying data builds the expected outcomes into the data. There are three basic ways I’ve found. One is self-descriptive data. Another is a cyclic algorithm. I was doing data communications testing and wanted to send random data in random packet sizes, but the line was half duplex at that time, and I didn’t want to take twice as long to run the test; I wanted to be able to send the data and check it at the far end. So by using a random number generator for the first value, an increment between values, and a count of the number of values we’d have, I could append that to the data itself, and at the far end, using the start, increment, and count, identify exactly what should have come down the line. The third, and I think most interesting, way is a shared key with algorithms: we generate the random record, the data in the fields, from a seed value. If you’re familiar with the random number generators on computers, they’re called pseudo-random number generators: they will generate the exact same sequence of statistically random numbers. If you have a key or seed value to start with, you’ll get exactly the same series of random numbers. So if we embed the key with the record that we’ve generated (a person’s name, address, telephone numbers), we can write a program that will generate records of that kind of information. Then if we have the seed, we can identify what that record should contain. So we embed the key, and then we use the algorithm and key to identify, as the Oracle, what the answer should be.
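
The shared-key idea can be sketched like this (a minimal sketch; the field names and name lists are invented for the example):

```python
import random

FIRST = ["Ada", "Grace", "Alan"]
LAST = ["Lovelace", "Hopper", "Turing"]

def make_record(seed):
    # A seeded pseudo-random generator always produces the same sequence,
    # so the same seed always regenerates the same record.
    rng = random.Random(seed)
    return {"seed": seed,
            "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
            "phone": f"555-{rng.randint(1000, 9999)}"}

def verify_record(record):
    # Self-verifying-data oracle: regenerate the record from its
    # embedded key and compare against what we actually have.
    return record == make_record(record["seed"])

rec = make_record(12345)
assert verify_record(rec)          # intact record checks out
rec["phone"] = "555-0000"          # simulate corruption in storage/transport
assert not verify_record(rec)      # the oracle catches the change
```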

Another interesting approach is the model-based strategy, and the way we do this is we identify and describe some aspect of the software as a state machine. The menu tree of an application can be described as a state machine. Each window is a state, and the events are what we do in that window that will take us to another state, another window. So we can identify the states and the events and put that into a machine-readable model. And then we design our tests to read in that state machine and do whatever the test is designed to do to get from this place to that place. So we’re reading and applying the model. Each time we make a transition, or don’t make a transition, we can check: did what happened actually match what the model says should happen? Then we can update the model as needed as the code changes. I’ve used this in testing front panel operations for a family of products, so we could write the test for setting the clock, and by reading in the state model for the various devices, the same test works with each device.
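
A tiny menu-tree model and its check might look like this (toy code; the states, events, and the `app_transition` stub are invented, and a real test would drive the actual UI):

```python
# Machine-readable model of a small menu tree: state -> {event: next_state}
MODEL = {
    "main":     {"open_settings": "settings", "open_help": "help"},
    "settings": {"back": "main"},
    "help":     {"back": "main"},
}

def app_transition(state, event):
    # Stub standing in for the application under test: a real harness
    # would fire the event at the UI and observe which window appears.
    return MODEL[state].get(event)

state = "main"
for event in ["open_settings", "back", "open_help", "back"]:
    observed = app_transition(state, event)
    expected = MODEL[state].get(event)   # model-based oracle
    assert observed == expected, f"in {state}, {event}: got {observed}"
    state = observed
```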

The next is constraint-based. Here we’re looking for individual values or characteristics that have constraints, limits on the values: 256-character names, or the maximum integer value. So with the constraint-based strategy, we’re writing our tests and looking, as an Oracle, to see that we’re maintaining those constraints. Often we have variables that constrain one another. An example of this is the spreadsheet: the number of active cells you can have in a spreadsheet is limited, maybe sixty-five thousand for example. Well, that means that the number of rows times the number of columns has to be less than that. If we add rows, we’ve constrained the number of columns, and so our Oracle may check whether the number of cells specified is correct or not. The size, form, type, etc. in some applications can be used to constrain one another: the page width is limited, and so the size and type of the print, and the type font, define what we can put on that page; one constrains the other.
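
The spreadsheet example reduces to a one-line check (the limit here is a hypothetical value, per the talk):

```python
MAX_CELLS = 65_000  # hypothetical limit on active cells in the spreadsheet

def within_cell_constraint(rows, cols):
    # Constraint-based oracle: rows and columns constrain one another;
    # their product must stay within the cell limit.
    return rows * cols <= MAX_CELLS

assert within_cell_constraint(1000, 65)       # 65,000 cells: at the limit
assert not within_cell_constraint(1000, 66)   # over the limit: flag it
```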

And we have constraints like the invariant rule: I’d better have been born before I got hired. If that’s violated, we have some problem in our dataset or in our application. So the way we use this: we just confirm conformance. I don’t think I’ve defined synchronous and asynchronous checking yet. Synchronous checking is checking that’s done along with the test, at the same time, and usually as part of the test itself, so it’s synchronized with the activity of the test. Asynchronous checking is independent of what’s going on. A good example here is memory leak checks. But we can also do things like database integrity checks that we run just in a loop, to make sure that the internal links in the database are kept consistent and we don’t have bugs in the engine itself; I was doing that when I was working for a database company. And the checking may be specific to or independent of the tests; of those two examples, one is specific and one is independent.

The next is a probabilistic strategy. We’re looking for relationships that usually, but don’t always, hold. So here it’s an approximation, a heuristic, a rule of thumb: partial information that supports, but doesn’t necessarily mandate, the conclusion. An error in this case means there probably is a problem. An example here is that an employee is usually older than their dependents. There are exceptions possible, but if you find a case in your dataset where a dependent is older, that’s worth looking into. At the company Oracle, at one point in time, they actually checked the length of time it took to run each test, and if it ran faster than one third of the time it took last time, or took longer than three times as long, they looked into it; about half the time there was a bug. The test actually passed, but there was a bug in the code: we weren’t doing something that we should have, or we were doing a whole bunch of stuff we weren’t supposed to be doing.
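
That run-time heuristic fits in a few lines (a sketch; the function name and sample timings are invented):

```python
def runtime_suspicious(last_seconds, this_seconds):
    # Probabilistic oracle: flag a run faster than 1/3 of the previous
    # time, or slower than 3x. A flag means "investigate", not "bug".
    ratio = this_seconds / last_seconds
    return ratio < 1 / 3 or ratio > 3

assert not runtime_suspicious(10.0, 12.0)   # within the band: unremarkable
assert runtime_suspicious(10.0, 2.0)        # much faster: worth a look
assert runtime_suspicious(10.0, 45.0)       # much slower: worth a look
```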

With the computational strategy, the principal idea is that for some things we’re testing, we can run an inverse or reverse function to identify whether or not these results could be the consequence of the inputs that we used. So basically what we’re doing is applying the reversal function. If we split a table, we merge the tables back, and from the results we compute possible values for the inputs and then check to see that the actual inputs are in that set. Here we are subject to common mode errors, because usually the function we’re using to reverse has some code in common with the function that we’re testing, and so if there’s a bug in that common code, we may end up undoing the error when we do the reverse computation. It can also miss obvious errors. The square root of 4 is 2, not negative 2, but the square root function could give us negative 2, and when we square it to see whether or not the square root function is working, we get 4 back anyway. So there are some drawbacks to the computational strategy.
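
The square-root blind spot can be shown in a few lines (toy code; `sqrt_under_test` is a deliberately buggy stand-in):

```python
import math

def sqrt_under_test(x):
    # Deliberately buggy stand-in: returns the wrong sign.
    return -math.sqrt(x)

def inverse_oracle(x, result):
    # Computational oracle: apply the reverse function (squaring)
    # and check that it reproduces the input.
    return math.isclose(result * result, x)

result = sqrt_under_test(4.0)
assert inverse_oracle(4.0, result)   # the oracle passes...
assert result != 2.0                 # ...even though the answer is wrong
```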

The statistical strategy: this one is pretty interesting. It’s rarely used and is not for the math-phobic, but it’s a reasonable way to test some very complex systems. Some applications can be characterized statistically. Let me use the example of sales taxes. The sales tax can be computed by knowing where the orders originate, because the sales taxes are location dependent, and the amount of each order. But there are a lot of different places and therefore a lot of different tax rates, so it’s impractical to walk through all of the locations and then check the tax computations. But it’s also impractical to check the correctness of an arbitrary, randomly selected order without creating a copy of the application itself. So the statistical strategy might be used in order to do massive test case execution. We can relate the population statistics for the locations, and therefore the tax rates, and the population statistics for the amounts, to the population statistics for the resulting tax amounts. If we generate random orders with known statistical characteristics for the locations, and a similarly known statistical distribution of the amounts, we can predict statistically what the taxes should be. So we can measure the statistics of the resulting taxes and identify whether or not they are a logical outcome of the inputs. We can do massive numbers of random orders and get a feel for whether or not the software is working. This involves complex statistical mathematics, but we can generate gazillions of statistically distributed orders and know the population statistics we should get from the taxes without checking each order.
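
A heavily simplified sketch of the idea (the rates, distributions, and `compute_tax` stub are all invented; a real system would have thousands of locations and a far richer statistical model):

```python
import random

RATES = {"CA": 0.0725, "NY": 0.04, "TX": 0.0625}   # hypothetical tax rates

def compute_tax(location, amount):
    # Stand-in for the system under test.
    return amount * RATES[location]

rng = random.Random(7)
n = 100_000
total_amount = total_tax = 0.0
for _ in range(n):
    loc = rng.choice(sorted(RATES))     # uniform over locations
    amt = rng.uniform(10.0, 100.0)      # known distribution of amounts
    total_amount += amt
    total_tax += compute_tax(loc, amt)

# Statistical oracle: with locations chosen uniformly and independently of
# amounts, the overall effective rate should approach the mean of the rates.
expected_rate = sum(RATES.values()) / len(RATES)
observed_rate = total_tax / total_amount
assert abs(observed_rate - expected_rate) < 0.002   # statistical, not exact
```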

Property-based. We use a secondary characteristic of a test, or a variable or variables, that we compare: coincidentally correlated relationships that are not necessarily complete and not necessarily causal. Some examples: sales order numbers should be in time sequence order, so if we sort the sales orders by number we should get the same order, all of the same orders, that we get sorting by the timestamp on the orders. I’ve also checked the property of a printer test by looking at the number of pages that it printed. If the print test is supposed to generate three pages and it generates some other number of pages, then we’ve got a problem, and so we can do a property-based Oracle. In another example, I’ve used the fact that a U.S. ZIP code is either 5 or 9 digits: if it has a character in it, it’s wrong; if it’s not five digits or nine digits, it’s wrong. I used this to test a database unload/load, where we were checking to see: did we lose a character, or shift by a bit or a byte? Another property-based example that I’ve written about, in an article on heuristic oracles: when I was checking for discontinuities in a sine wave, the sine function has the characteristic that the values steadily increase or decrease between the minimum and the maximum. So if we generate the sine of x, and then take the sine of x plus some little amount, and then add a little amount to that, and a little amount to that, we should find that the results steadily increase or decrease, depending on which half of the wave you’re on. The only exception is when it gets to the minimum or maximum.
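
Both the ZIP code property and the sine monotonicity check can be sketched briefly (toy code; the sample values are made up):

```python
import math

def zip_ok(zip_code):
    # Property-based oracle: a U.S. ZIP code is exactly 5 or 9 digits.
    return zip_code.isdigit() and len(zip_code) in (5, 9)

assert zip_ok("95014") and zip_ok("950141234")
assert not zip_ok("9501a")     # a non-digit character slipped in
assert not zip_ok("9501")      # a digit was lost in the unload/load

# Property-based oracle for sine: on the rising half-wave (0 to pi/2),
# each successive value should be strictly greater than the last.
xs = [i * 0.001 for i in range(1571)]
ys = [math.sin(x) for x in xs]
assert all(a < b for a, b in zip(ys, ys[1:]))   # no discontinuity found
```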

With the diagnostic strategy we’re actually getting into white box testing, because we’re tracking the execution of the code using code assertions. We insert assertions for logging of the activities, tracing of the program. Then we run the tests, and either the assertions check themselves, or we generate a log of execution and comb through that to look for errors. Code assertions: yeah, you can’t get there from here; you have to be logged in with a valid user ID in order to buy something. In the other method, we put print statements into the code that identify the values going into a branch, so that we can find out that this branch did not go to the right place, and here are the values that went into it. Then we can see that it’s not the code itself but the values that were input, and we have to trace back to figure out how the values got out of whack.
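
Both flavors, a precondition assertion and a trace of the values entering a branch, can be sketched like this (toy code; the `buy_item` function and its session shape are invented):

```python
def buy_item(session, item):
    # Diagnostic oracle: the assertion guards the "you can't get here
    # from here" precondition, and the print traces the values flowing
    # into this branch for later log analysis.
    assert session.get("logged_in"), "can't buy without a valid login"
    print(f"trace: buy_item item={item!r} user={session.get('user')!r}")
    return {"item": item, "buyer": session["user"]}

order = buy_item({"logged_in": True, "user": "doug"}, "widget")
assert order == {"item": "widget", "buyer": "doug"}

try:
    buy_item({}, "widget")             # precondition violated
    raise SystemExit("assertion failed to fire")
except AssertionError:
    pass                               # the diagnostic caught it
```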

These last two are not often used for automated tests. The first is the hand-crafted strategy: the input and results are specified together. When we choose one, we’re choosing the other, based on this being easy or well known. We carefully craft the input values to match the expected result. In these cases the Oracle is frequently built into the test, because they’re so closely aligned. And it’s often the approach taken for manual regression tests in complex software. My brother-in-law worked for an organization where he was testing a program that tries to predict what happens when atomic particles collide. Is there a chain reaction? What other particles are being generated? You want to simulate that as closely as possible before you actually collide the particles, because if there is a chain reaction it can be damaging to your accelerator, and to the five miles or so radius around it, so you want to be fairly confident that it isn’t going to happen. The test program itself takes two days to run on supercomputers. So we want to know exactly what to input and exactly what to expect in our test.

And the human Oracle. The most common one: you set a person in front of the software to observe what’s going on. The human uses their judgment to decide the verdict. In this case we are actually an Oracle on many different levels; we are multiple Oracles at once. It works for manual or automated exercises, although with automated exercises it’s hard to look at a million individual results. And it works for scripted or even unscripted tests and doing random walks. The computer that we have between our ears is still the most powerful computer that we know of. People with A.I. are trying to get past that, but in general we are a whole lot more flexible. And one note here: a human Oracle is applied whenever we think there may be a bug, because it takes a human to investigate. We haven’t figured out how to do that automatically, without having a human double check. I’m not sure we’ll ever get to that point.

So, recapping: all tests use Oracles. There are many types of Oracle; I’ve identified a baker’s dozen. It’s impossible to check everything. We don’t control all of the inputs, all of the influences on our software. Oracles don’t need to be deterministic to be useful. Myers wrote that in order to have a good test you must have a known result. Well, at the time that may have been true, but non-deterministic Oracles can be very useful, and you don’t necessarily need to know what the exact answers should be. And we have Oracles that are independent of the tests themselves. It’s not just that two plus two equals four. It’s that two plus two equals four, and we didn’t touch the network, and we didn’t leave files around, and we didn’t interfere with the action of other programs that are running.

So you’ve been fire hosed! I’m sure that there are questions out there… 

Chris – That was good. You’re right, that’s a lot of information, especially with a baker’s dozen of oracles. There are a few questions, and if anyone has any more, please just go ahead. The first one: is a smoke test a form of No Oracle?

Doug – Well, it depends on whether or not you’re checking results. A smoke test can have individual test cases; I like to describe it as a Hello World before every individual function. And it may be No Oracle in the sense that you’re just touching the function and don’t care what the response is: you have no Oracle. But sometimes you’re actually checking the response.

Chris – Could you clarify the difference between constraint based oracles and property based oracles? 

Doug – That’s subtle. Let me run back here. So in a property-based strategy, it’s not necessarily a constraint. The sales order number being in time sequence order is a property: you can sort by sales order and you can sort by the time the order was entered. Checking the number of pages printed by a test: that’s not a constraint, that’s a property.

Chris – You said the diagnostic strategy was more of a white box approach which is knowing more about the code. Are some oracles more white box or some more black box? How do you sort of divide those?

Doug – Right. So I define white box testing, or glass box testing, as testing based on knowledge of the code. Based on having looked at the code, I design a test for the code: I want to get to this place in the code with these values and then see that that’s true. That’s white box testing. Once I create that test I may run it in a black box fashion, but the design of that test is based on knowledge of the internals. In a black box test we’re just looking at the externals. We know that it’s supposed to do this, so we give it the input, we don’t care how it processes it, and we look at the result. That’s a black box. Here we’re actually instrumenting the code. We have to open up the code and look at it, and our tests are based on that code; we’re plugging instructions into the code in order to test it.

Chris – Would another example of computational strategy be translating text into another language using an online translator? And then back to English? Would that be. 

Doug – Yeah, that’s called round-tripping, and yes, that is a computational Oracle. I’ve used it, and the results can be hilarious, because in the case of translators they may be lossy functions, lossy meaning some of the information is thrown away. The language may not have the future tense, and therefore whatever you said, when translated into that language and then translated back, can’t be identical. Information was lost in translation.

Chris – Doug you said in the last two examples of Handcrafted and Human oracles they’re not really used much in test automation. Are some of the oracles more likely to be used? and so where are or which ones?

Doug – Yeah, any of the other Oracles are likely to be used. What I found is that when you do a massive set of activity, massive random walks, you use one or more of the Oracles, and depending on the application different Oracles may be applied. So all of them except hand-crafted and human. Though there are cases where you can hand craft a whole bunch: we may generate a table of inputs and expected results, and we may have a function that will generate those combinations to test some aspect, some subset, of what our application does, so we may use a hand-crafted approach where we’re selecting the inputs and outputs together. And James Bach taught me about the blink test, where you run a huge number of inputs and watch an output field, for example, and you can spot exceptions, like seeing a negative time. Say it’s computing the duration of meetings: if you see a negative amount of time, then either incorrect inputs were accepted or the computations are not correct. So the human Oracle is used in that case with a massive number of inputs.