In January, I read a few press releases about an NSF grant (from the same organization that originally funded development of our BBST class series) for a research project spanning four universities: Princeton, Yale, MIT, and the University of Pennsylvania. The news claimed the DeepSpec research project will open the door to the possibility of completely bug-free software. Not long after the press releases came outbursts on social media from the testing community making the obvious argument that bug-free software just isn’t possible, and that anyone who thinks it is must be a fool.

DeepSpec announced a workshop to be held in the computer science building at Princeton University on June 6-8, 2016. I attended as a representative of the AST Committee on Standards and Professional Practices.

The DeepSpec workshop was split into three days. Each day featured a mixture of representatives from industrial partners (Google, Intel, Galois, Free & Fair, and NASA) and researchers from the participating universities. Each speaker described projects they were working on in varying detail. Sometimes it was a high-level discussion of their experience with specification, formal verification, and teaching developers to use the Coq proof assistant and Gallina, its specification language. Other times, it was a very low-level discussion with demonstrations of proofs and code.

The Coq proof assistant is not a testing tool, and the workshop was not about testing. Formal verification is a tool for discovering how low-level aspects of a program (kernels, SSL libraries, compilers such as the CompCert C compiler, programming languages such as Haskell, or chip and circuit design) do or do not match up with a specification. A phrase commonly used in the workshop was “We want to reason about how software works”. The proofs built in the Coq proof assistant look very similar to mathematical proofs.

On Specifications

Mathematics works because there is a formal system of rules defining what numbers, concepts, and different functions do. That system has existed for a long time now, and because of that we (or some people at least) understand the rule set well enough to work within it. Formal specification has the potential to work for the same reason in very specific projects. Compilers, for example, have a set of functions that people mostly understand at this point, and because of that, specifications can mostly describe them.

Mostly is an important word there. The formal specification people have the same problems with specifications that the rest of the software world does. Their specifications are incomplete, they are out of date, they are fluid and in a constant state of change for active projects, they are wrong, and they are misunderstood. Proofs built with the proof assistant are only as good as the specification they are proving against. One example of this problem came up on day two during the poster-board session after lunch. Ken Roe from Johns Hopkins described the Heartbleed SSL bug and mentioned that it could have been avoided if only a specific read condition had been described in the SSL specification. John Hughes, CEO of Quviq AB, talked about the specification problem clearly in this slide.
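To make the Heartbleed point concrete, here is a toy sketch of how code can satisfy a specification as written and still be dangerous when a read condition is missing from that specification. Everything in it is an assumption made for illustration: the `heartbeat` function, the `SECRET` value, and both checker functions are hypothetical, and none of it is the real OpenSSL code or the formulation Ken Roe presented.

```python
# Toy model only: these names are hypothetical, not taken from OpenSSL.
SECRET = b"private-key"


def heartbeat(payload: bytes, claimed_len: int) -> bytes:
    """Echo back claimed_len bytes, trusting the caller's length."""
    # Model process memory: the payload sits right before unrelated secret data.
    memory = payload + SECRET
    # The flaw: claimed_len is never compared with len(payload).
    return memory[:claimed_len]


def meets_incomplete_spec(payload: bytes, claimed_len: int, response: bytes) -> bool:
    """Spec as written: the reply begins with the bytes that were sent."""
    return response.startswith(payload[:claimed_len])


def meets_stronger_spec(payload: bytes, claimed_len: int, response: bytes) -> bool:
    """Adds the missing read condition: never return more than was sent."""
    return response == payload[:claimed_len]


honest = heartbeat(b"hi", 2)
leaky = heartbeat(b"hi", 8)

print(meets_incomplete_spec(b"hi", 8, leaky))  # the weak spec accepts the leak
print(meets_stronger_spec(b"hi", 8, leaky))    # the stronger spec rejects it
```

A proof against the first checker would succeed even though the reply leaks part of the secret. The proof is not wrong; the specification is simply silent about the thing that matters.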

Areas talked about in the workshop where specifications are heavily relied upon include medical devices, and certain aspects of automotive and aeronautical software. These are areas of development where a specification is far more precise than what I generally see in commercial software development.

On Testing

Some of the attendees seemed at least aware that software testing is a craft. One person used the phrase ‘exploratory testing’ in conversation without prompting. For the most part though, the type of testing they focus on is what some in the testing community now refer to as checking. The most commonly discussed technique at the workshop was continually generating random input values and then analyzing the results. They called this technique fuzzing.
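The technique described above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not any tool shown at the workshop: `clamp_percent` is a hypothetical function with a deliberate bug, `in_range` is a hypothetical check, and the input range and fixed seed are arbitrary choices.

```python
import random


def clamp_percent(value: int) -> int:
    """Hypothetical function under test: should always return 0..100."""
    if value > 100:
        return 100
    return value  # deliberate bug: negative values pass straight through


def in_range(_value: int, result: int) -> bool:
    """The check applied to every result."""
    return 0 <= result <= 100


def fuzz(func, check, trials=1000, seed=0):
    """Feed random integers to func and collect the inputs that fail check."""
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    failures = []
    for _ in range(trials):
        value = rng.randint(-1000, 1000)
        result = func(value)
        if not check(value, result):
            failures.append((value, result))
    return failures


failures = fuzz(clamp_percent, in_range)
print(f"{len(failures)} of 1000 random inputs violated the check")
```

Note that the loop only reports what the check was written to look for, which is why this reads as checking rather than testing.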

The conversation veered to coverage models during one of the breakout sessions. There seemed to be a common sentiment that coverage models are useless. This was partly because they thought that continually passing in random values gives approximately the same result, and partly because being able to see coverage in some particular way (decision, line, method, function, variable, or whatever) doesn’t tell you much about how the program was tested or how meaningful that testing is.
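The second half of that sentiment is easy to demonstrate. In the sketch below, which uses a hypothetical `is_adult` function with a deliberate boundary bug, one test executes every line and both branches, so line and decision coverage would both report 100%, yet the bug goes unnoticed until someone thinks to try the boundary value.

```python
def is_adult(age: int) -> bool:
    """Hypothetical rule: 18 and over is an adult. Bug: uses > instead of >=."""
    if age > 18:
        return True
    return False


def weak_test() -> None:
    """Runs every line and both branches of is_adult, and passes."""
    assert is_adult(40) is True
    assert is_adult(5) is False


def boundary_test() -> None:
    """Adds no coverage at all, but is the only test that finds the bug."""
    assert is_adult(18) is True


weak_test()  # passes despite full coverage of is_adult
try:
    boundary_test()
    bug_found = False
except AssertionError:
    bug_found = True
print("boundary test found the bug:", bug_found)
```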

Some of the presenters viewed testing as a trivial task. The presenter representing Galois and Free & Fair, Joe Kiniry, described bugs in electronic voting machines that “Couldn’t even add 1” without adding the context that voting is a concurrent, hopefully secure, geographically distributed, and data-intensive activity.

The Trading Zone

Harry Collins describes a phenomenon called a trading zone in a series of essays titled Trading Zones and Interactional Expertise: Creating New Kinds of Collaboration. The basic idea is that a trading zone forms when different fields of work collide and have to talk to each other somehow. We sometimes see this through outrage and conflict, for example in the response from some CDT testers on reading the initial press release claiming the ability to create bug-free software. When trading zones happen, the languages can blend into something new that both groups can understand, or the language of one group overpowers that of the other.

This was my first experience with falling into a trading zone. The people at the workshop participate in a specific culture. They use words differently from what I am used to; for example, the word invariant was commonly used, instead of the word bug, to describe a program that did not match up with a specification. The topic of bug-freeness was much more nuanced in person. When I asked several people, they said something like “maybe not the whole program, but at least the parts that work well with formal verification”. Since I was wading through a very obvious trading zone, I chose to use their language and asked if they were willing to refute or disprove Edsger Dijkstra’s assertion that the absence of bugs could not be proven. No one I spoke with was willing to do that.

I took a lot of pictures while I was at the workshop. Most of them are of slides being presented. Some are of buildings around the campus, the oldest of which is Nassau Hall, completed in 1756. And there is one picture of an unassuming white house. That is where Albert Einstein lived while he worked at Princeton. You can find those pictures here. You can get little tidbits of the workshop by looking at #deepspec on Twitter.

If you are aware of other academic projects related to software testing and development, please let me know.

Author’s notes:

My feeling is that there is value in this project in specific areas of software development. While I know that verification based on a specification can be dangerous if that is where things stop, exploration (aka testing) of that spec could discover when important things are missing, wrong, or poorly designed. I do think they have a very big hurdle to overcome for commercial acceptance in any regard aside from research partners.

There are two big problems preventing acceptance at the moment: training and education, and performance. Writing proofs in the Coq proof assistant is not equivalent to any kind of test code I have seen. To increase usage, there would need to be a large outreach effort from people who know how to use the proof assistant to the industrial partners (that was part of the focus of the workshop). The other part, performance, has two prongs. Writing proofs is a time-consuming activity, and running the proofs is slow and resource intensive. That slowness might be acceptable when releases happen a little less than once a year, but it does mean more time spent struggling with tooling.