Tuesday, May 19, 2020

Reporting Standards in Software Science Desperately Needed


  
If we are really interested in achieving something in software science, we need reporting standards. I mean this statement seriously. Just recently I experienced first-hand how urgently such standards are needed: I summarized an experiment and became aware of how much time it took to extract the relevant information from it. Some information was missing, some was confusing, etc. Had a standard such as CONSORT been applied, summarizing the experiment would probably have taken me minutes - instead of many, many hours, at the end of which I even had to contact the author because some information was missing.
  

Recent Experiences While Summarizing Research Results

I recently summarized research results from experiments. The goal was relatively simple: just collect the results from some studies and present them in a way that an ordinary software developer can understand. I think such work is needed because, for example, the study by Devanbu et al. has shown that most developers judge the validity of claims in software construction based on their personal experience rather than on independent studies [1]. But taking into account that experience is limited and that subjective experience is quite error-prone, it makes sense to give developers information about the studies that exist and that provide evidence for some claims. And, even more important: give developers the studies that contradict given claims.

The topic of my summary was "identifiers", i.e. I wanted to summarize studies that examined the influence of identifiers on code reading or code understanding. Yes, I know. No big deal. Every one of us knows how important the choice of good identifiers is. But I really wanted to know what researchers actually measured. And we should know something about the effect sizes.

Most of the studies were done or at least initiated by Dave Binkley. I had already read most of his papers in the past, and since I am well-trained in reading studies, I assumed it would be no big deal to give a quick summary of some of them. And there was another reason why I focused on his papers: from my experience, and in my opinion, his studies are well-conducted and I trust the validity of the results, i.e. I trust that the numbers were collected as described in the papers, I trust the analyses of the data, and I trust that the write-ups do not try to over-sell the results. I think his research has the goal of finding answers. His papers are not written for the sake of writing papers, but for the sake of improving the knowledge in our field.
 

My goal for the summary

More precisely, I wanted to summarize each paper in a way that gives a 1-2 sentence description of the experimental design, another 1-2 sentences about the dependent and independent variables, and a few sentences about the main results. And maybe some more sentences about what can be learned from the study. The goal was not to bother readers with the machinery that is needed for scientific writing, i.e. I wanted to skip information about whether the experiment followed a crossover design, whether e.g. a Latin square was used, or what statistical procedure was applied.

Actually, I think it is necessary that readers who are not too deep into scientific writing get the results in an understandable way. I.e. if an AB test was applied, I think it makes sense not to write about statistical power, p-values, confidence intervals or effect sizes, but just to write that "a difference was detected" (in case a significant result was achieved) and then to report means and mean differences. And in case multiple factors are tested, my goal was not to write about interaction effects, etc., but just to explain the interactions in a way that an average person grasps the meaning quickly.
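To make this concrete, here is a minimal sketch (in Python, with invented reading times - not data from any of the studies discussed below) of what such a plain-language report of an AB test could look like: it only computes group means and the mean difference and prints them as a sentence that a reader without statistics training can follow.

```python
# Minimal sketch with invented data; "short" and "long" are hypothetical groups.
from statistics import mean

times_short = [21.0, 25.5, 19.8, 23.1, 22.4]   # hypothetical reading times in seconds
times_long = [40.2, 44.9, 38.7, 43.0, 41.5]    # hypothetical reading times in seconds

m_short, m_long = mean(times_short), mean(times_long)
diff = m_long - m_short

# Assuming a significance test was already run elsewhere and came out positive,
# report only what a reader without statistical training needs.
print(f"A difference was detected: short expressions took {m_short:.1f} s on average, "
      f"long expressions {m_long:.1f} s, i.e. {diff:.1f} s ({diff / m_short:.0%}) more.")
```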
 

Shouldn't someone else do the job?

Quickly is the point here. If we want developers to understand the results of studies, these results have to be communicated efficiently. And the typical research paper at a conference or in a journal does not seem to have the goal of communicating results efficiently. Authors of conference papers are given a certain number of pages they can fill. And authors are actually forced to fill this number of pages. If, for example, a conference such as the International Conference on Software Engineering (ICSE) has a page limit of 12 pages, you will hardly find a paper at that conference that does not have 12 pages. This has something to do with the review process (which should not be discussed here, although there is an urgent need to discuss it). Ok, so you want to communicate scientific results to a broader audience. But how?

Actually, scientific journalism does this job in other disciplines: people who are trained in writing (for a popular market) summarize results in a way that people are able to understand them. This is important, because people should be informed about what knowledge exists - especially taking into account that people pay for the generation of this knowledge (a lot of scientific work is funded by tax money). But for software science this kind of journalism does not exist. Yes, there are a bunch of magazines that address technical topics. You find books on new APIs or new technologies that explain how to apply them. But this is something different. These writings explain how industrial products can be used. They do not explain what we actually know about them. It would be great if there were people who summarized research results - but we currently have to live with the fact that this is simply not done in our field.

So, back to the studies.
 

Giving a quick summary took a damn long time

Again, I really love Dave's work. I think his studies are great. His writings are great. But it turned out that just writing a quick summary took much more time than expected. When I now explain what happened to me and why I had trouble summarizing the paper, this should not and must not be understood as criticism of Dave's work. Really not. Dave's work is definitely a shining example of what good science in our field should look like. Dave's paper is just an example of the trouble people can have reading scientific papers. And I assume my own papers suffer from the very same problems.

One of the papers I started with was Identifier length and limited programmer memory [2]. I remembered that this study compared 8 expressions with different lengths and that subjects were asked to write down a part of the expression. So, I wanted to write sentences such as:
 "The experiment gave A subjects B expressions to read for a time C (D subjects were removed for some reasons). Each expression consisted of E parts and the authors used the criterion F to distinguish between short and long expressions. After reading, a part from the expression was removed and subjects had to complete it. The average time for reading short expressions was T1 and for long expressions it was T2, so the (statistical significant) differences was T3, respectively it took people G percent more time to read the long compared to the short expressions."
I am aware that these sentences are quite a simplification of the results. In particular, I do not mention all independent variables and I do not mention the applied statistical method. By reporting only the means, people do not get an idea of the size of the confidence intervals, etc. But, again, the goal was to give a quick (but still informative) overview, not a complete one. Why do I think that this kind of summary is informative? Well, I think it contains the most relevant information. The number of subjects gives an idea of how large the experiment was (and people are mad about this idea of "being representative" - that's another point that needs to be discussed, but not here, not now), the dropout rate gives an idea of how much the data says about the relation between the originally addressed sample and the actual data used for the analysis. And the average times give people an idea of how large such differences are. Yes, there are effect size metrics such as Cohen's d or eta squared, but if someone does not know these things, such numbers would rather confuse them.
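For comparison, here is a small, purely illustrative sketch (again with made-up numbers, not data from the paper) of how Cohen's d relates to the raw mean difference: it is the same difference, just rescaled by the pooled standard deviation - which is exactly why it is informative for statisticians and opaque for everyone else.

```python
# Purely illustrative numbers; pooled-SD formula for two independent groups.
from math import sqrt
from statistics import mean, stdev

a = [23.0, 27.5, 21.8, 25.1, 24.4]   # hypothetical times (s), short expressions
b = [44.2, 48.9, 42.7, 47.0, 45.5]   # hypothetical times (s), long expressions

mean_diff = mean(b) - mean(a)
pooled_sd = sqrt(((len(a) - 1) * stdev(a) ** 2 + (len(b) - 1) * stdev(b) ** 2)
                 / (len(a) + len(b) - 2))
cohens_d = mean_diff / pooled_sd

print(f"mean difference: {mean_diff:.1f} s")   # directly interpretable in seconds
print(f"Cohen's d: {cohens_d:.2f}")            # dimensionless, needs statistical background
```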
 

Sample size and dropout rate

Doing the first step (the number of subjects) seemed relatively easy, because the number 158 is already mentioned in the abstract, so I directly started searching for the dropout rate. But then it took some time to understand what exactly had happened to the data. The paper does not have an explicit section such as "experiment execution" or the like. But there is a section "Data preparation" where I found the following:
"[...] the data for a few subjects was removed. For example, one subject reported writing down each name. A second subject reported being a biology faculty member with little computer science training. Finally, the time spent viewing Screen 1 was examined. It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p. 435]
Ok, but what was the actual data being used? The second sentence seems to describe that the data of a whole subject was removed. But what does "this affected 18 responses" mean? Does it mean that the data of 18 subjects was removed? Or just 18 answers? And what about the other six? Does it mean that 24 answers, i.e. three subjects, were removed? Or was each single response treated individually? I suddenly felt slightly reluctant to write down a sentence such as "158 subjects participated", because I was not able to find out precisely what data was skipped. But, ok, I lived with the problem - and just reported that 158 subjects participated. Actually, this step alone took me quite a bit of time, because I reread the paper more than once, assuming I had missed some relevant information.
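To illustrate why the wording left me unsure, here is a small sketch (using pandas; the file name and column names are invented, not the paper's) of the two possible readings: dropping only the offending responses versus dropping every subject who produced one. The two readings lead to different numbers of subjects in the analysis.

```python
# Hypothetical data layout: one row per response, with a subject id and a reading time.
import pandas as pd

df = pd.read_csv("responses.csv")                  # invented file name
ok = df["reading_time_s"].between(1.5, 9 * 60)     # the 1.5 s and 9 min cut-offs from the quote

# Reading 1: remove only the 18 + 6 offending responses, keep the rest of each subject's data.
per_response = df[ok]

# Reading 2: remove every subject who produced at least one offending response.
bad_subjects = df.loc[~ok, "subject_id"].unique()
per_subject = df[~df["subject_id"].isin(bad_subjects)]

print(per_response["subject_id"].nunique(), per_subject["subject_id"].nunique())
```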
 

How large is the effect of expression length?

The main reason why I looked into the paper was that I wanted to know whether expression length was a significant factor and, if it was, how large the effect was. The paper reports a significant average difference of 20.1 seconds in reading time between long and short expressions, i.e. longer expressions took longer to read. But how much longer did they take in comparison to the short ones?

I started searching for effect size measures or at least for some descriptive numbers such as means or confidence intervals. I was really convinced that I must have missed them somewhere. So I re-read the paper over and over again - and did not find the number. The only thing I had was the following:
"It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p.435]
So, should I just report that the reading times were between 1.5 s and 9 min, which would mean that 20.1 seconds is somewhere "between a factor of 14 and 4%" of the reading time? That does not sound meaningful. Again, searching just for this single number (which I ultimately did not get) took me some time. The same is true for a second variable: syllables. It is reported that each additional syllable costs the developer 1.8 seconds. But what does the first syllable cost?

In fact, I felt even more uncomfortable with the variable syllables, because there are multiple treatments of this variable and I would have been much more interested in the precision of the 1.8 seconds.
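As an aside, the "between a factor of 14 and 4%" above is nothing more than relating the reported 20.1 s difference to the only two time bounds the paper gives, the 1.5 s and 9 min cut-offs from the data preparation step - a trivial piece of arithmetic:

```python
diff = 20.1                 # reported mean difference in seconds
lower, upper = 1.5, 9 * 60  # the cut-offs from the data preparation section, in seconds

print(f"{diff / lower:.1f}x the lower cut-off")    # about 13.4 times 1.5 s
print(f"{diff / upper:.1%} of the upper cut-off")  # about 3.7% of 9 min
```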
 

How exactly were the results of the study computed?

What puzzled me as well was the question of how the results were obtained: what statistical procedure was used? And what tool? The paper just says that linear mixed-effects regression models were used. Ok, but with what tool? And what exactly were the input variables for the regression models?

Going back to the question of the effect of the variable length, the paper says that "the initial model includes the explanatory variable Length" [2, p. 437]. Length? In a regression? The paper uses length as a binary variable (it distinguishes between short and long), so in principle it is just a simple AB test - or did I miss something? Or was a bunch of variables added to the initial model and length was just the one that turned out to be significant?
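For readers who wonder what such a model even looks like: below is my own guess at an analysis of this kind, a minimal sketch in Python with statsmodels and invented column names. It is not the authors' actual setup (the paper does not report a tool), just a mixed-effects model with a fixed effect for the binary Length variable and a random intercept per subject.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per response with reading_time_s, length ("short"/"long"),
# and subject_id; the real variable names in the study may differ.
df = pd.read_csv("responses.csv")

# Fixed effect for the binary Length variable, random intercept per subject.
model = smf.mixedlm("reading_time_s ~ C(length)", data=df, groups=df["subject_id"])
result = model.fit()

# The coefficient of the length term is the estimated mean difference in seconds
# between the two groups.
print(result.summary())
```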

Actually, it turned out that I had many, many more problems. And it took me quite a lot of time just to figure out which of these problems were real. I should mention that, because I had so much trouble, I contacted Dave, who sent me the raw data set within hours, so I was able to analyze the data on my own in order to get the results of the experiment.
 

Why standards such as CONSORT are urgently needed in software science

Finally, I got the raw measurements, was able to recompute some numbers, and everything was fine. But why did I still feel that something is really problematic?

Again, I am well-trained in reading studies. But if it took me hours to understand what was in the paper, I assume it would take many more hours for people who are not trained in these things. So, how can we even expect someone to take studies into account if it takes many hours to read them? The study by Devanbu et al. suggests that we could blame developers for not knowing what is actually known in the field. But if understanding a single study takes many hours, it actually makes sense that people do not read them. Why? Because developers have more to do than spend a whole day reading a single paper. And if essential information turns out to be missing in the end, that whole day was spent in vain.

So, how come essential information is hard to find in studies, or even missing? Again, I do not blame the authors of the study mentioned here for forgetting something. But the paper was published in a peer-reviewed journal. How is it possible that it passed the peer-review process while some essential information is missing (again, we need to talk about the review process at some point, but not here)?

I am happy that the paper was published, because otherwise the whole body of knowledge in our field would be even smaller - and it is already unacceptably small (see the study by Ko et al. [3]). But what would have reduced the problem?

This is where research standards come into play. If our field were disciplined enough to apply a relatively simple reporting standard such as CONSORT [4], things would be easier. Such a standard implicitly gives each paper a structured summary that permits you to find information quite fast. For the review process, it is relatively easy to check whether a paper fulfills the standard, i.e. authors can double-check whether the relevant information is contained, and reviewers can do this double-checking as well.
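How mechanical such a check can be is easy to see. Below is a tiny sketch of the idea, with a handful of reporting items loosely paraphrased from CONSORT-style checklists (my own abbreviated selection, not the official list), that an author or reviewer could tick off before submission.

```python
# Loosely CONSORT-inspired items, abbreviated and paraphrased; not the official checklist.
REPORTING_ITEMS = [
    "number of participants assigned to each group",
    "number of participants and responses excluded, with reasons",
    "definition of all outcome measures (dependent variables)",
    "statistical methods used to compare groups",
    "for each outcome: result per group, estimated effect size and its precision",
]

def missing_items(items_present: set) -> list:
    """Return the reporting items that a manuscript does not yet cover."""
    return [item for item in REPORTING_ITEMS if item not in items_present]

print(missing_items({"statistical methods used to compare groups"}))
```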

Applying such a standard would have another implication: if, for example, a conference adopted such a standard, many papers could be directly rejected because they do not fulfill it. The problem identified by Ko et al. (and there are in fact many, many more authors who have documented that evidence is hardly gathered in our field) would vanish: scientific venues would publish only papers that follow the scientific rules. This would reduce the problem that readers are confronted with tons of papers whose content cannot be considered part of our body of knowledge.

Yes, there is the other problem that makes it hard to imagine that we will finally get to the point where the software science literature contains scientifically relevant studies: people must be willing to execute (and publish) experiments whose results may conflict with their own position. But this is a different issue that I have discussed elsewhere.

Yes, research standards are urgently needed. At least reporting standards. Urgently.

References

  1. Devanbu, Zimmermann, Bird. Belief & evidence in empirical software engineering. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, pages 108–119, 2016. [https://doi.org/10.1145/2884781.2884812]
     
  2. Binkley, Lawrie, Maex, Morrell. Identifier length and limited programmer memory. Science of Computer Programming, 74 (2009). [https://doi.org/10.1016/j.scico.2009.02.006]
     
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
     
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]
