Wednesday, September 9, 2020

How much distrust do we need, how much trust can we afford in software science?

While summarizing results of identifier studies for a magazine, I had to make a decision: how much do I trust some experiments? In other words: how much distrust is needed when reading papers, reports, etc.? For example, if someone just wrote "I did an experiment and technique A turned out better than technique B", should I take this statement at face value and assume it is some form of evidence?

Example: Shneiderman's Experiment on Identifiers

The problem happened to me when I tried to summarize identifier studies. One of the earlier studies on identifiers was mentioned by Shneiderman and Mayer, and I asked myself whether I should take the available, very short description of the experiment into account as a form of evidence:
"Two other experiments, carried out by Ken Yasukawa and Don McKay, sought to measure the effect of commenting and mnemonic variable names on program comprehension in short, 20-50 statement FORTRAN programs. The subjects were first- and second-year computer science students. The programs using comments (28 subjects received the noncommented version, 31 the commented) and the programs using meaningful variable names (29 subjects received the mnemonic form, 26 the nonmnemonic) were statistically significantly easier to comprehend as measured by multiple choice questions." [1, p. 231]
That's it. Almost nothing more is said about the experiment in the given source. I could have said "well, the paper appeared in a peer-reviewed journal, hence this is evidence", but I did not feel that way. The problem was not only that concrete numbers were missing from the description; the problem was also that I did not have a precise idea of what was done in the experiment.

I wanted to give an impression of what was done in the experiment and then report means (and differences in means), and in case there were interaction effects, I wanted to report them as well (not in terms of statistical numbers, but rather in terms of text). I know means are problematic, but my goal was to summarize results for a magazine - the audience should not be bothered with p-values, etc. But I think it was also important to give people an idea of what exactly the participants did in the experiment and how conclusions were drawn from it. I did not know what programs (or how many of them) were given to the subjects, what exactly the different treatments looked like, etc.

Ok, I was not satisfied with the description. But I also did not want to make it too easy for myself and just say "I should ignore the text", because in the end the description still came from a peer-reviewed journal. So I tried to find out more about the experiment. I dug through Ben Shneiderman's webpage without success, but in his book Software Psychology [2] I found some more text:
"One of our experiments, performed by Don McKay, was similar to Newstead's, but the program did not contain comments. Four different FORTRAN programs were ranked by difficulty by experienced programmers. The programs were presented to novices in mnemonic (IDVSR, ISUM, COEF) or nonmnemonic (I1, I2, I3) forms with a comprehension quiz. The mnemonic groups performed signifianctly better (5 percent level) than the nonmnemonik groups for all four programs." [2, pp. 70-71]
After this paragraph, the book contains a figure that illustrates a difference in "mean comprehension scores" for four different programs. The scores seem to vary across the programs from 3 to 5 for mnemonic variables and from 2 to 3.5 for non-mnemonic variables.

The second citation in combination with the figure gave me some more trust that the experiment revealed something. But it still puzzled me what the programs given to the subjects looked like. I also wanted to see in more detail what variable names were used, and I wanted to know what questions were given to the participants. The first citation mentions multiple-choice questions (the second just speaks about a quiz). How many alternative answers did the subjects have? How much time did they have for reading the code? What were the raw measurements, the means, the confidence intervals? I just had the figure, which neither shows confidence intervals nor gives a precise understanding of what the means are. And finally: how was the data analyzed and what were the precise results? Was just a repeated measures ANOVA used? What about the second factor (the programs)? Were there interaction effects?
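
Just to make concrete what I was missing: with the raw data at hand, checking for an interaction between naming style and program would be a standard two-factor analysis. The following sketch uses entirely invented scores and a between-subjects layout assumed purely for illustration - it is not the original analysis:

```python
# A minimal sketch (not the original analysis): testing for an interaction
# between naming style and program with a two-way ANOVA on invented data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Invented comprehension scores: 2 naming styles x 4 programs, 2 subjects per cell.
data = pd.DataFrame({
    "score":   [5, 4, 4, 5, 3, 4, 5, 4,     # mnemonic, programs P1..P4
                3, 2, 2, 3, 2, 3, 3, 2],    # nonmnemonic, programs P1..P4
    "naming":  ["mnemonic"] * 8 + ["nonmnemonic"] * 8,
    "program": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"] * 2,
})

# Two-way ANOVA: main effects of naming and program plus their interaction.
model = ols("score ~ C(naming) * C(program)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```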

In the end, I spent a lot of time on the experiment (mainly searching for a more detailed experiment description and comparing both descriptions to see whether they match). And I finally decided for myself that I should not take the experiment into account as a form of evidence. In the text for the magazine, I just wrote that "there was once an experiment which is today rather historically interesting, but inappropriate as a form of evidence".

I felt bad. 

I had the feeling that I did not give Ben Shneiderman the appropriate credit for his efforts. But I really felt that it was my duty to be much more sceptical about his text - despite the fact that Ben Shneiderman is one of the leading experimenters in software science.

On Scepticism, Evidence and Trust

As a scientist (well, in fact as any educated person), you should not trust too much: you should not just believe someone (no matter who he is), and you should not stick with your own fantasies or your own personal and subjective impressions. I.e., when you are confronted with a statement, the following should hold. A statement ...
  • ... is not more important because it follows a current hype,
  • ... does not become true just because the author of the statement is an expert,
  • ... is not more valid because it was articulated by an authority,
  • ... is not more important just because you believe in it or because the statement comes from you.
This kind of argumentation is far from being new. For example, Karl Popper wrote:
"Thus I may be utterly convinced of the truth of a statement; certain of the evidence of my perceptions; overwhelmed by the intensity of my experience: every doubt may seem to me absurd. But does this afford the slightest reason for science to accept my statement? Can any statement be justified by the fact that K. R. P. is utterly convinced of its truth? The answer is, ‘No’” [3, p. 24]
Of course, you cannot endlessly play this scepticism game. You cannot ignore everything on this planet and just say that you feel sceptical about it. In the very end, you need to take some evidence into account. This evidence might be damn strong or just weak (and weak evidence does not mean that you heard an anecdote somewhere).

But even strong evidence implies trust to a certain extent. You must trust that a study was executed, you must trust that the resulting numbers were measured, you must trust that no further numbers were measured and withheld by the authors, you must trust in the validity of the analysis, and you must trust in the seriousness of the interpretation. You must trust that the goal of the study's author was to find out something.

Unfortunately, whenever some kind of trust is required, this opens a door for fraud. Trust can be exploited. People can intentionally lie when trust is required.

Reporting Experiments - Setup, Execution, and Analysis Protocol

Before speaking about the problem with trust, let's take a look at what can be known about an experiment.

In an ideal world, there is a guarantee that a study was executed, that this study follows a well-defined protocol, and that the study was analyzed in a way that matches the study's design. Such a protocol consists of three parts: the setup protocol (which describes what should be done and how, so that the study can be replicated), the execution protocol (which describes the specific circumstances under which the experiment was actually executed), and the analysis protocol (which gives the results of the study in statistical terms).

The setup protocol defines the subjects that are permitted to participate (such as "professional software developers with skills X and Y"), the dependent variables (such as "reaction time"), the independent variables (such as "programming language"), and the hypotheses that are tested. Furthermore, the protocol contains the experiment layout (such as "AB test", etc.), the measurement techniques (such as "reaction time measurement with stop watch"), and the different treatments given to the subjects (such as Java 1.5 or Squeak 5.0). It also describes how and under what circumstances the different treatments are given to the subjects (such as the programming tasks, the task descriptions, the used IDE, etc.). In case the measurement techniques require some apparatus (such as software used for the measurements), this is also contained in the setup protocol. And in case the apparatus cannot be delivered as part of the protocol, a precise description of the apparatus is given.
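
To make this a bit more tangible, here is a minimal sketch of how the ingredients of a setup protocol could be written down as a structured record. It is purely illustrative - the field names and example values are mine, not part of any established standard:

```python
# A minimal, illustrative sketch of a setup protocol as a structured record.
# All field names and example values are invented.
from dataclasses import dataclass

@dataclass
class SetupProtocol:
    subject_criteria: str             # e.g. "professional software developers with skills X and Y"
    dependent_variables: list[str]    # e.g. ["reaction time"]
    independent_variables: list[str]  # e.g. ["programming language"]
    hypotheses: list[str]             # the hypotheses that are tested
    layout: str                       # e.g. "AB test"
    measurement_technique: str        # e.g. "reaction time measurement with stop watch"
    treatments: list[str]             # e.g. ["Java 1.5", "Squeak 5.0"]
    task_descriptions: list[str]      # the tasks given to the subjects
    apparatus: str                    # measurement software, or a precise description of it

setup = SetupProtocol(
    subject_criteria="professional software developers with skills X and Y",
    dependent_variables=["reaction time"],
    independent_variables=["programming language"],
    hypotheses=["treatment A leads to shorter reaction times than treatment B"],
    layout="AB test",
    measurement_technique="reaction time measurement with stop watch",
    treatments=["Java 1.5", "Squeak 5.0"],
    task_descriptions=["task 1: ...", "task 2: ..."],
    apparatus="description of the measurement software",
)
```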

The execution protocol describes the selection process for the subjects, the subjects that were finally tested, and the specific conditions under which they were tested. These specific conditions could be the time interval in which participants were tested, the location where the test was executed, the machines used in the experiment, the concrete IDE (incl. version), etc. And finally, the execution protocol contains the raw data.

The analysis protocol describes how the possible effect of the independent variables on the dependent variables is determined. Since some statistics software is probably used for the analysis, this software is mentioned as well. The analysis protocol describes the results of the experiment in terms of statistical values. For these values, corresponding reporting styles such as APA should be used (although this is very uncommon in software science). Each test comes with a measure of the evidence (i.e., a p-value) and an effect size (such as Cohen's d or eta squared), or just the means and the differences in means - the latter are not effect sizes in the standardized sense, but it is often more useful to have measurements that mean something to the readers instead of abstract values such as Cohen's d that most readers won't be familiar with.
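
Again just as an illustration: assuming a simple two-group comparison and invented raw measurements, the statistical core of such an analysis protocol (means, difference in means, p-value, Cohen's d) could be computed with a few lines of Python. The numbers below are made up and do not come from any real study:

```python
# Illustrative sketch of the statistical part of an analysis protocol,
# assuming a simple two-group comparison on invented measurements.
import numpy as np
from scipy import stats

group_a = np.array([12.1, 10.4, 11.8, 13.0, 9.9, 12.5])   # e.g. reaction times, treatment A
group_b = np.array([14.2, 13.5, 15.1, 12.9, 14.8, 13.7])  # e.g. reaction times, treatment B

# Evidence: p-value of an independent two-sample t-test.
t, p = stats.ttest_ind(group_a, group_b)

# Effect size: Cohen's d, i.e. the difference in means divided by the pooled standard deviation.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd

print(f"means: {group_a.mean():.2f} vs. {group_b.mean():.2f}, "
      f"difference: {group_a.mean() - group_b.mean():.2f}")
print(f"t = {t:.2f}, p = {p:.4f}, Cohen's d = {d:.2f}")
```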

On Double-Checking Results

The information above is required because it permits readers to double-check the experiment. It can be checked whether the layout followed a standard layout, whether the measurement technique is state of the art, whether the tasks given to the participants were appropriate, and whether the analysis follows the experimental design. And in case the reader doubts that the statistical results are right, he can recompute them. It even permits him to apply alternative statistical procedures.
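
As a hedged sketch of what "applying alternative statistical procedures" could look like: with the raw measurements at hand (again invented here), a reader could recompute a reported t-test and cross-check it with a nonparametric test that makes fewer assumptions:

```python
# Illustrative sketch: double-checking a reported parametric result
# against a nonparametric alternative on invented measurements.
import numpy as np
from scipy import stats

group_a = np.array([12.1, 10.4, 11.8, 13.0, 9.9, 12.5])
group_b = np.array([14.2, 13.5, 15.1, 12.9, 14.8, 13.7])

t, p_t = stats.ttest_ind(group_a, group_b)                              # recompute the reported test
u, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")  # alternative procedure

print(f"t-test: p = {p_t:.4f}; Mann-Whitney U: p = {p_u:.4f}")
```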

In a truly scientific world, there would be no need for all this, because if an experiment was published in a peer-reviewed journal, you could trust that the reviewers did all this for you. Of course, this is no 100% guarantee. Even in disciplines such as medicine that have very high research standards, published studies get retracted (see for example a recent case with a COVID-19 study). But the situation in software science is different.

Taking the terribly low number of experiments in software science into account, there is good reason to doubt that an average reviewer in software science is able to do a serious review (simply because only few people are familiar with experimental designs and analyses). As a consequence, we cannot assume that a reviewer double-checked an experiment. Hence, it makes sense today not to trust published experiments in software science, but to double-check them. This does not necessarily mean that authors intentionally lied. They might have just made some errors - not intentionally, but accidentally.

Unfortunately, we run into a problem here: double-checking costs time. Even if someone is well trained in experimental analyses, it takes time to do the recomputation from the raw data (in case the data is available). But statistics are just part of the game. There is also the need to check whether the data collection followed the protocols, etc. But we cannot double-check everything. At a certain point we have to stop and say "I just have to trust". However, we should make explicit what we need to trust and what was actually double-checked. And we should give readers a fair chance to decide on their own what to trust and what not.

Why Not Make Chains of Trust More Explicit?

But what about the average developer who is interested in what evidence actually exists in software science? He is probably not trained enough to double-check experimental results. But that implies that the developer will not take into account any evidence that exists in software science. And that implies that the results of software science will be in vain. This should not be the consequence; otherwise our discipline will never get out of this situation where countless statements exist without any evaluation.

Hence, it makes sense to give developers all essential information about an experiment, but also information about whom and what needs to be trusted. I.e., we should provide developers with information such as "I, Stefan Hanenberg, recomputed the results of experiment X and the results of the analysis are Y. I.e., if you cannot do the analysis on your own, you need to trust me that the computation of the analysis is correct".

Probably it makes sense to make even more information available, such as "I got the measurements and I repeated the analysis, but I was not able to access the tasks given to the subjects. I.e., I can only confirm that the results of the experiment match the given data, but I cannot confirm that the data followed an appropriate experiment protocol; hence we need to trust the author of the experiment on that".

And maybe it even makes sense to define ratings for the resulting chains of trust. An experiment result such as "we need to trust the author of the experiment" seems less trustworthy than "we need to trust that a valid setup protocol was followed, but the results match the given raw data", which in turn is less trustworthy than "we received everything about the experiment and we confirm that the experiment followed an appropriate design, was executed in an appropriate way, and the reported results match the raw data". And the best case would probably be: "we received everything needed from the experiment, the results are as described by the author, and the experiment was replicated by others who received comparable results".
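
Just to illustrate the idea (the level names and their ordering are my own invention, not an established scale), such a rating could be written down as an ordered scale:

```python
# Illustrative sketch: the chains of trust described above as an ordered scale.
# The names and the numeric ordering are invented for illustration.
from enum import IntEnum

class TrustLevel(IntEnum):
    AUTHOR_ONLY      = 1  # we need to trust the author of the experiment
    RAW_DATA_CHECKED = 2  # results match the raw data, but the setup protocol must be trusted
    FULLY_CHECKED    = 3  # setup, execution, and reported results were all double-checked
    REPLICATED       = 4  # independently replicated by others with comparable results

# The ordering allows simple comparisons between experiment reports.
assert TrustLevel.RAW_DATA_CHECKED > TrustLevel.AUTHOR_ONLY
assert max(TrustLevel) is TrustLevel.REPLICATED
```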

Why Could that Help?

Our discipline suffers from the problem that a number of statements ("object-orientation is good", "functional programming is good", "UML improves understandability", etc.) are hardly or badly evaluated. But even if experiments are available, we should make explicit which parts of the experiments are trustworthy and which parts are not - because in the very end, we want to rely on strongly trustworthy results and not just on "we trust some single person".

By making explicit which results in our field do not just depend on our trust in the authors, we also make explicit where we could or need to improve our discipline.

References


  1. Ben Shneiderman, Richard Mayer. Syntactic/semantic interactions in programmer behavior: A model and experimental results. International Journal of Computer and Information Sciences 8, 219–238 (1979). https://doi.org/10.1007/BF00977789
  2. Ben Shneiderman. Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers, 1980.
  3. Karl Raimund Popper. The Logic of Scientific Discovery. Routledge, 2002. First English edition: 1959.