Thursday, October 15, 2020

What Should Software Science learn from the Corona Crisis?

The corona crisis influences not only people's daily lives but also how people think about science. Suddenly, scientists are present in the news, scientific results shape the new laws that appear because of the corona crisis, and the results of scientific studies become part of people's daily conversations. This new popularity of science is good.

There are also a number of people who doubt scientific results. Actually, this is not that bad. Science requires doubt. Progress happens because some people do not believe in commonly accepted theories and search for alternative explanations or new interpretations of given phenomena.

What's bad is when people ignore results or invent new theories without having any evidence for them. And what's even worse is that there are people who follow such new theories without even demanding evidence. Such people can be fooled too easily, and for others fraud becomes a profitable business.

People's typical reaction to skeptics of the corona crisis is that education would help. If people were better educated, their knowledge about science would help them to distinguish between serious interpretations of scientific results and rather wild guesses based on personal anecdotes. But while this statement is probably true in general, we cannot assume that every discipline provides such profound knowledge.

Taking into account that the scientific foundation of software science is rather weak, it actually makes sense to think the other way around: what can software science learn from the corona crisis? So, why not try to find some "lessons learned" for software science in the ongoing crisis?

It is the numbers that do matter

It sounds stupid to point this out, but the first thing to be learned from the corona crisis is that it is the numbers that matter.

The first and rather obvious number that comes to mind is the death rate. But other numbers, such as infection rates, matter for medicine as well. Numbers matter for other disciplines such as economics, too. There, monetary aspects such as the costs of the crisis matter. In the end, it is not a single number that matters. Each number plays its role, but a number of different numbers need to be taken into account in order to get the big picture.

But the important insight is not only that numbers matter. The important insight is that hardly anything other than numbers matters. Even if a person has a strong belief in the effectiveness of some medicine, therapy or vaccine, it does not imply that such a statement should be taken too seriously. Even if someone ignores the huge negative impact of corona on the economy, it does not imply that this negative impact does not exist. Rhetorical skills might work strongly on some people. But rhetorical skills hardly change reality.

In the end, the effectiveness of some treatment or the validity of an argument requires numbers: we want evidence for statements and not only some famous people's beliefs. Software science should demand evidence. Software science should demand numbers - numbers that matter.

Numbers are dirty

The second lesson to be learned is that numbers are rarely pure. Numbers are dirty. Empiricists are aware that measurements are rarely as pure as people would want them to be. Measurements come with measurement errors, and measurement tools have their drawbacks. Although people want measurement tools to be as precise as possible, we have to accept that every measurement tool has problems.

And empiricists are used to the problem that people who do not accept empirical results discredit the numbers. In the corona crisis, the death rates are discredited. People doubt that the number of deaths is valid - and they have good reasons to doubt the perfection of the reported numbers. Obviously, there is no independent institute that can analyze for every single case whether a person died just with or from corona. We do not necessarily speak about intentional lies. We speak about cases where it is not clear whether there was a causal relationship between the virus and a person's death. And we have to accept that even corona tests can fail. It is the nature of measurements that there are error rates - and it is the goal to reduce such error rates.

We also see that the numbers are attacked on different levels. For example, we find people who doubt whether the reported death cases are actually correct, i.e. people argue that there could be additional corona cases that were intentionally not reported. Or some people argue that the reported number of infections is too low, because some governments are not interested in reporting high numbers. And even if numbers are accepted, people who are not willing to accept empirical results come up with new interpretations. For example, people argue that a high infection rate is not the result of an ongoing pandemic, but rather the result of a high number of tests. Or that a high death rate is not the result of failing countermeasures, but rather the result of an extremely aggressive virus.

In the end, we have to accept that all numbers have their problems. This does not mean that we should blindly trust all reported numbers. It is important to see how reality is mapped to numbers and it is important to understand potential problems. But just stating "the reported numbers are wrong" is rarely constructive criticism. There is a need to understand how problematic a number is, how measurements could be improved, etc. And it is always necessary to question the relevance of reported numbers. And it is important to identify people who discredit numbers for rather personal reasons and who thereby hinder the process of knowledge gathering.

For software science, the lesson learned is that we should not be too quick to discredit reported numbers. We need to understand the process of data collection (and interpretation) and need to understand how large the possible errors of certain measurement techniques are. This means that we ultimately need to identify relevant measurements for our discipline and we need to define measurement techniques in order to get valid measures. And we should be cautious with people who discredit numbers for the sake of discrediting numbers.

It is not a single study that matters, it is multiple of them

Another lesson learned from the corona crisis should be that scientific knowledge usually does not arise from a single study. 

Up to now, there are hundreds and hundreds of studies on corona from the field of medicine. And our knowledge on corona is the result of a combination of a large number of these studies. This does not mean that each single study is fantastic. By now there are a number of studies which are considered invalid. And there are studies that just reproduce results that have already been reproduced by others.

But the essential lesson learned is that people in a mature discipline study the same phenomenon over and over again from different perspectives. Different experimental designs, different treatments, different measurements, different measurement methods, etc. -- the knowledge in the field consists of multiple tools and efforts in order to get the big picture.

Such effort is required in software science as well. Instead of celebrating novel ideas in our field, we should appreciate more the studies that examine given phenomena in depth. We should collect multiple studies on the same phenomena. We should encourage people to study a phenomenon even if some studies on it already exist.

The Need for Education and Demystifying Science

These corona times teach us how necessary it is that people understand non-subjective reasoning and how necessary it is that people distinguish between fact and fiction. Unfortunately, this requires education. It is not enough to argue for or against a statement by adding a phrase such as "scientific studies have shown" to it. Science is not a magical process. Science just means being as non-subjective as possible. Science tries to run, collect, summarize and interpret studies without any agenda in mind. Education demystifies science, and statements such as "there is a scientific study" start losing their authority -- which is good, because there is a need to understand studies and not only to accept an author's interpretation of a study.

People should doubt the results of studies. Such doubt must not be naive scepticism. It requires knowledge about the underlying procedures and it requires knowledge about the lines of reasoning built upon collected numbers. The necessary willingness to doubt results also requires knowledge and recognition of valid results. Knowledge about methods teaches us where the limits of doubt are.

Unfortunately, this is probably the biggest issue for software science. Actually, it is not clear whether software science provides its actors with enough knowledge for the mentioned kind of reasoning. There are even reasons to believe that software science education, which is massively influenced by or based upon math, is counterproductive for understanding the results of empirical studies: when you are familiar with proofs by contradiction or with counterexamples that disprove a general statement, it is hard to understand why a single case in an empirical discipline does not destroy a whole theory. When you are used to counterexamples, it is hard to understand why a single person who suffers from COVID-19 for a second time does not automatically falsify an immunity theory.

Summary

There is a lot that software science can learn from the corona crisis. We as software engineers or software scientists should not just read newspapers today and pretend that the process of knowledge gathering for corona is completely different from what needs to be done in our field.

We should demand numbers. We need to provide such numbers. Our lines of argumentation should rely on numbers. And we need to accept the impurity of numbers - and use education as a weapon against wild speculations and naive scepticism in our field.



Wednesday, September 9, 2020

How much distrust do we need, how much trust can we afford in software science?

While summarizing results of identifier studies for a magazine I had to make a decision: I had to decide how much I trust some experiments. In other words: how much distrust is needed when reading papers, reports, etc.? For example, if someone just wrote "I did an experiment and technique A turned out better than technique B", should I just take these words for granted and assume this is some evidence?

Example: Shneiderman's Experiment on Identifiers

The problem happened to me when I tried to summarize identifier studies. One of the earlier studies on identifiers was mentioned by Shneiderman and Mayer and I asked myself whether I should take the available, very short description of the experiment as a form of evidence into account: 
"Two other experiments, carried out by Ken Yasukawa and Don McKay, sought to measure the effect of commenting and mnemonic variable names on program comprehension in short, 20-50 statement FORTRAN programs. The subjects were first- and second-year computer science students. The programs using comments (28 subjects received the noncommented version, 31 the commented) and the programs using meaningful variable names (29 subjects received the mnemonic form, 26 the nonmnemonic) were statistically significantly easier to comprehend as measured by multiple choice questions." [1, p. 231]
That's it. Almost nothing more is said about the experiment in the given source. I could have said "well, the paper appeared in a peer-reviewed journal, hence this is evidence", but I did not feel that way. The problem was not only that concrete numbers were missing in the description. The problem was also that I did not have a precise idea of what was done in the experiment.

I wanted to give an impression of what was done in the experiment and then report means (and differences in means), and in case there were interaction effects, I wanted to report them as well (not in terms of statistical numbers, but rather in terms of text). I know, means are problematic, but my goal was to summarize results for a magazine - the audience should not be bothered with p-values, etc. But I think it was also important to give people an idea of what exactly participants did in the experiment and how conclusions were drawn from it. I did not know what programs or how many of them were given to the subjects, what exactly the different treatments looked like, etc.

Ok, I was not satisfied with the description. But I also did not want to make it too easy for myself and just say "I should ignore the text", because in the end the description still came from a peer-reviewed journal. So, I tried to find out more about the experiment. I dug through Ben Shneiderman's webpage, but was not successful. But in his book Software Psychology [2] I found some more text:
"One of our experiments, performed by Don McKay, was similar to Newstead's, but the program did not contain comments. Four different FORTRAN programs were ranked by difficulty by experienced programmers. The programs were presented to novices in mnemonic (IDVSR, ISUM, COEF) or nonmnemonic (I1, I2, I3) forms with a comprehension quiz. The mnemonic groups performed significantly better (5 percent level) than the nonmnemonic groups for all four programs." [2, pp. 70-71]
After this paragraph, the book contains a figure that illustrates a difference in "mean comprehension scores" for four different programs. The score between the programs seems to vary from 3 to 5 for mnemonic, respectively 2 to 3.5 for non-mnemonic variables.

The second citation in combination with the figure gave me some more trust that the experiment revealed something. But it still puzzled me what the programs looked like that were given to the subjects. I also wanted to see in more detail what variable names were used. And I wanted to know what questions were given to the participants. The first citation mentions that multiple choice questions were used (the second just speaks about a quiz). How many alternative answers did the subjects have? How much time did they have for reading the code? What were the raw measurements, what the means, the confidence intervals? I just had the figure, which neither shows confidence intervals nor gives a precise understanding of what the mean is. And finally: how was the data analyzed and what were the precise results? Was just a repeated measures ANOVA used? What about the second factor (programs)? Were there interaction effects?

Finally, I spent a lot of time on the experiment (mainly searching for a more detailed description of the experiment and comparing both descriptions to see whether they match). And I finally decided for myself that I should not take the experiment into account as a form of evidence. In the text for the magazine, I just wrote that "there was once an experiment which is today rather historically interesting, but inappropriate as a form of evidence".

I felt bad. 

I had the feeling that I did not give Ben Shneiderman the appropriate credit for his efforts. But I really felt that it was my duty to be much more sceptical of his text - despite the fact that Ben Shneiderman is one of the leading experimenters in software science.

On Scepticism, Evidence and Trust

As a scientist (well, in fact as an educated person), you should not trust too much, you should not just believe someone (no matter who he is) and you should not stick with your own fantasy or your own personal and subjective impressions. I.e. when you are confronted with a statement, the following should hold: A statement ...
  • ... is not more important because it follows a current hype,
  • ... does not become true just because its author is an expert,
  • ... is not more valid because it was articulated by an authority,
  • ... is not more important just because you believe in it or because the statement comes from you.
This kind of argumentation is far from being new. For example, Karl Popper wrote:
"Thus I may be utterly convinced of the truth of a statement; certain of the evidence of my perceptions; overwhelmed by the intensity of my experience: every doubt may seem to me absurd. But does this afford the slightest reason for science to accept my statement? Can any statement be justified by the fact that K. R. P. is utterly convinced of its truth? The answer is, 'No'" [3, p. 24]
Of course, you cannot endlessly play this scepticism game. You cannot ignore everything on this planet and just say that you feel sceptical about it. In the very end, you need to take some evidence into account. This evidence might be damn strong or just weak (and weak evidence does not mean that you heard an anecdote somewhere).

But even strong evidence implies trust to a certain extent. You must trust that a study was executed, you must trust that the resulting numbers were measured, you must trust that no further numbers were measured and then withheld by the authors, you must trust in the validity of the analysis and you must trust in the seriousness of the interpretation. You must trust that the goal of the study's author was to find out something.

Unfortunately, whenever some kind of trust is required, it is a door opener for fraud. Trust can be exploited. People can intentionally lie when trust is required.

Reporting Experiments - Setup, Execution, and Analysis Protocol

Before speaking about the problem with trust, let's take a look at what can be known about an experiment.

In an ideal world, there is a guarantee that a study was executed, that this study follows a well-defined protocol, and that the study was analyzed in a way that matches the study's design. Such a protocol consists of three parts: the setup protocol (which describes what should be done and how when the study is replicated), the execution protocol (which describes the special circumstances in which the experiment was actually executed) and the analysis protocol (which gives the results of the study in statistical terms).

The setup defines the subjects that are permitted to participate (such as "professional software developer with skills X and Y"), the dependent (such as "reaction time") and independent variables (such as "programming language") and the hypotheses that are tested. Furthermore, the protocol contains the experiment layout (such as "AB test", etc.), the measurement techniques (such as "reaction time measurement with stop watch") and the different treatments given to the subjects (such as Java 1.5, Squeak 5.0). It also describes how and under what circumstances the different treatments are given to the subjects (such as the programming tasks given to the subjects, the task descriptions, the used IDE, etc.). In case the measurement techniques require some apparatus (such as some software used for measurements), this is also contained in the setup protocol. And in case the apparatus cannot be delivered as part of the protocol, a precise description of the apparatus is given.

The execution protocol describes the selection process for the subjects, the subjects that were finally tested, and the specific conditions under which they were tested. These special conditions could be the time interval in which participants were tested, the location where the test was executed, the machines used in the experiment, the concrete IDE (incl. version), etc. And finally, the execution protocol contains the raw data.
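To make the setup and execution protocols a bit more tangible, here is a minimal sketch of how they could be written down as data. The class names, field names and the `missing_parts` helper are my own assumptions for illustration, not an established reporting standard, and the example values marked "(hypothetical)" are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SetupProtocol:
    """What a replication would need in order to repeat the experiment."""
    subject_criteria: str
    dependent_variables: List[str]
    independent_variables: List[str]
    hypotheses: List[str]
    layout: str
    measurement_technique: str
    treatments: List[str]


@dataclass
class ExecutionProtocol:
    """The concrete circumstances of one execution, including the raw data."""
    selection_process: str
    conditions: Dict[str, str]
    # Raw measurements per treatment; empty means "not published".
    raw_data: Dict[str, List[float]] = field(default_factory=dict)

    def missing_parts(self) -> List[str]:
        """Return the parts a reader would have to take on trust."""
        missing = []
        if not self.selection_process:
            missing.append("selection_process")
        if not self.conditions:
            missing.append("conditions")
        if not self.raw_data:
            missing.append("raw_data")
        return missing


setup = SetupProtocol(
    subject_criteria="professional software developer with skills X and Y",
    dependent_variables=["reaction time"],
    independent_variables=["programming language"],
    hypotheses=["(hypothetical) reaction time differs between the languages"],
    layout="AB test",
    measurement_technique="reaction time measurement with stop watch",
    treatments=["Java 1.5", "Squeak 5.0"],
)

execution = ExecutionProtocol(
    selection_process="(hypothetical) volunteers from one course",
    conditions={"IDE": "(hypothetical) IDE incl. version"},
)

# The raw data was not provided, so this part has to be trusted.
print(execution.missing_parts())
```

The point of such a structure is only that gaps become visible: whatever is empty is something the reader cannot check.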

The analysis protocol describes how the possible effect of the independent variables on the dependent variables is determined. Since some statistics software is probably used for the analysis, this software is mentioned as well. The analysis protocol describes the results of the experiment in terms of statistical values. For these values, a corresponding reporting style such as APA should be used (although this is very uncommon in software science). Each test comes with a measure of the evidence (the p-value) and an effect size (such as Cohen's d or eta squared), or just the means and the differences in means. The latter are not effect sizes, but it is often more useful to have measurements that mean something to the readers instead of abstract things such as Cohen's d that most readers won't be familiar with.
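As a rough illustration of the difference between "means and differences in means" and an effect size such as Cohen's d, here is a small sketch. The reaction times are invented for this example and do not come from any study; the function names are my own. Cohen's d is computed with the pooled standard deviation for two independent groups, and the p-value is left out on purpose because it would need a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference(a, b):
    """Difference in means, in the unit of the measurement itself."""
    return mean(a) - mean(b)


def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_variance = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_variance)


# Invented reaction times in seconds for two treatments (not real data).
treatment_a = [12.0, 15.0, 11.0, 14.0, 13.0]
treatment_b = [16.0, 18.0, 15.0, 17.0, 19.0]

diff = mean_difference(treatment_a, treatment_b)
d = cohens_d(treatment_a, treatment_b)

print(f"mean A: {mean(treatment_a):.1f} s, mean B: {mean(treatment_b):.1f} s")
print(f"difference in means: {diff:.1f} s")
print(f"Cohen's d: {d:.2f}")
```

A reader who has never heard of Cohen's d can still make sense of "A was about four seconds faster on average", which is the reason for reporting the means as well.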

On double-checking results

The information above is required because it permits readers to double check the experiment. It can be checked, whether the layout followed a standard-layout, whether the measurement technique is state-of-the-art, whether the tasks given to the participants were appropriate and whether the analysis follows the experimental design. And in case the reader doubts that the statistical results are right, he can recompute them. It even permits him to apply alternative statistical procedures.

In a truly scientific world, there would not be the need for all this, because if an experiment was published in a peer-reviewed journal, you could trust that the reviewers did all this for you. Of course, this is no 100% guarantee. Even in disciplines such as medicine that have a very high research standard, studies are retracted from journals (see for example a recent case with a COVID-19 study). But the situation in software science is different.

Taking the terribly low number of experiments in software science into account, there is good reason to doubt that an average reviewer in software science is able to do a serious review (simply because quite few people are familiar with experimental designs and analyses). As a consequence, we cannot assume that a reviewer double-checked an experiment. Hence, it makes sense today not to simply trust published experiments in software science, but to double-check them. This does not necessarily mean that the authors intentionally lied. They might have just made some errors, not intentionally but accidentally.

Unfortunately, we run into a problem here: double-checking costs time. Even if someone is well-trained in experimental analyses, it takes time to do the recomputation from the raw data (in case the data is available). But stats are just part of the game. There is the need to check whether the data collection followed the protocols, etc. But we cannot double-check everything. At a certain point we have to stop and say "I just have to trust". But we should make explicit what we need to trust and what was actually double-checked. And we should give readers a fair chance to decide on their own what they should trust and what not.

Why Not Make Chains of Trust More Explicit?

But what about the average developer who is interested in what evidence actually exists in software science? He is probably not trained enough to double check experimental results. But that implies that the developer will not take any evidence into account that exists in software science. And that implies that the results of software science will be in vain. This should not be the consequence, otherwise our discipline will never get out of this situation where countless statements without any evaluation exist.

Hence, it makes sense to give developers all essential information about an experiment but also the information about whom and what needs to be trusted. I.e. we should provide developers information such as "I, Stefan Hanenberg, recomputed the results of the experiment X and the results of the analysis are Y. I.e. if you cannot do your analysis on your own, you need to trust me that the computation of the analysis is correct". 

Probably it makes sense to make even more information available such as "I got the measurements and I repeated the analysis, but I was not able to access the tasks given to the subjects. I.e. I only confirm that the results of the experiment match the given data, but I cannot confirm that the data followed an appropriate experiment protocol, hence we need to trust the author of the experiment about that".

And maybe it even makes sense to rate the resulting chains of trust. An experiment result such as "we need to trust the author of the experiment" seems to be less trustworthy than "we need to trust that a valid setup protocol was followed, but the results match the given raw data", which is less trustworthy than "we received everything about the experiment and we confirm that the experiment followed an appropriate design, was executed in an appropriate way and the reported results match the raw data". And the best case would probably be: "we received everything needed from the experiment, the results are as described by the author and the experiment was executed by others and they received comparable results".
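The four statements above form an ordering, so one way to sketch such a rating is an ordered enumeration. The level names and the `rate` function are assumptions made for this illustration only; the function is also a simplification, because it treats each level as requiring all the checks of the levels below it.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """Hypothetical rating levels, ordered from least to most trustworthy."""
    AUTHOR_ONLY = 1               # we need to trust the author of the experiment
    RESULTS_MATCH_RAW_DATA = 2    # analysis recomputed, protocol not checked
    FULLY_CHECKED = 3             # design, execution and results confirmed
    INDEPENDENTLY_REPLICATED = 4  # others ran it and got comparable results


def rate(results_match_raw_data, protocol_checked, replicated_by_others):
    """Map what was actually double-checked to a trust level."""
    if not results_match_raw_data:
        return TrustLevel.AUTHOR_ONLY
    if not protocol_checked:
        return TrustLevel.RESULTS_MATCH_RAW_DATA
    if not replicated_by_others:
        return TrustLevel.FULLY_CHECKED
    return TrustLevel.INDEPENDENTLY_REPLICATED


# Example: the analysis was recomputed, but the tasks were not accessible.
level = rate(results_match_raw_data=True, protocol_checked=False, replicated_by_others=False)
print(level.name)
```

Because the levels are ordered, two experiments can be compared directly, for example `TrustLevel.AUTHOR_ONLY < TrustLevel.FULLY_CHECKED`.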

Why Could that Help?

Our discipline suffers from the problem that a number of statements ("object-orientation is good", "functional programming is good", "UML improves understandability", etc.) are hardly or badly evaluated. But even if there are experiments available, we should make explicit what parts of the experiments are trustworthy and what parts are not - because in the very end, we want to rely on strongly trustworthy results and not just on "we trust some single person".

By making explicit what results in our field do not just depend on our trust in the authors, we make explicit where we could or need to improve our discipline.

References


  1. Ben Shneiderman, Richard Mayer. Syntactic/semantic interactions in programmer behavior: A model and experimental results. International Journal of Computer and Information Sciences 8, 219–238 (1979). https://doi.org/10.1007/BF00977789
  2. Ben Shneiderman. Software psychology: Human factors in computer and information systems. Winthrop Publishers, 1980.
  3. Karl Raimund Popper. The Logic of Scientific Discovery. Routledge, 2002. 1st English Edition: 1959.

Tuesday, May 19, 2020

Reporting Standards in Software Science Desperately Needed

  
If we are really interested in achieving something in software science, there is a need for reporting standards. I really mean this statement. Just recently I experienced how urgently such standards are needed: I summarized an experiment and became aware of how much time it took to extract the relevant information from it. Some information was missing, some was confusing, etc. If a standard such as CONSORT had been applied, it probably would have cost me minutes to summarize the experiment - instead of many, many hours, after which I finally even needed to contact the author because some information was missing.
  

Recent Experiences While Summarizing Research Results

I recently summarized research results from experiments. The goal was relatively simple: just collect the results from some studies and summarize them in a way that an ordinary software developer is able to understand them. I think such work is needed, because for example the study by Devanbu et al. has shown that most developers judge the validity of claims in software construction based on their personal experience and not because of independent studies [1]. But taking into account that experience is limited and that subjective experiences are quite error-prone, it makes sense to give developers information about studies that exist and that give evidence for some claims. And what's even more important: give developers studies that contradict given claims.

The topic of my summary was "identifiers", i.e. I wanted to summarize studies that checked what influence identifiers have on code reading or code understanding. Yes, I know. No big deal. Every one of us knows how important the choice of good identifiers is. But I really wanted to know what was actually measured by researchers. And we should know something about the effect sizes.

Most of the studies were done or at least initiated by Dave Binkley. I had already read most of his papers in the past, and since I am well-trained in reading studies I assumed that it would be no big deal to give a quick summary of some of them. And there was another reason why I focussed on his papers: from my experience and in my opinion his studies are well-conducted and I trust in the validity of the results, i.e. I trust that the numbers were collected in the way described in the papers, I trust the analyses of the data and I trust that the writings do not try to over-sell results. I think his research has the goal of finding answers. His papers are not written for the sake of writing papers, but for the sake of improving the knowledge in our field.
 

My goal for the summary

More precisely, I wanted to summarize papers in a way that gives a 1-2 sentence description of the experimental design, another 1-2 sentences about the dependent and independent variables and a few sentences about the main results. And maybe some more sentences about what can be learned from the study. The goal was not to bother readers with stuff that is needed for scientific writings. I.e. I wanted to skip information about whether the experiment followed a crossover design, whether e.g. a latin-square was used or what statistical procedure was applied. 

Actually, I think it is necessary that readers who are not too deep into scientific writing get results in an understandable way. I.e. if an AB test has been applied, I think it makes sense not to write about statistical power, p-values, confidence intervals or effect sizes, but just to write that "a difference was detected" (in case a significant result was achieved) and then to report means and mean differences. And in case multiple factors are tested, my goal was not to write about interaction effects, etc. but just to explain interactions in a way that an average person can get the meaning quickly.
 

Shouldn't someone else do the job?

Quickly is the point here. If we want developers to understand results of studies, they have to be communicated efficiently. And the typical research paper at a conference or in a journal does not seem to have the goal to communicate results efficiently. Authors of conference papers are given a certain number of pages they can fill. And authors are actually forced to fill this number of pages. If for example a conference such as the International Conference on Software Engineering (ICSE) has a page limit of 12 pages, you will hardly find a paper at that conference that does not have 12 pages. This has something to do with the review process (which should not be discussed here, although there is an urgent need to discuss it). Ok, so you want to communicate scientific results for a broader audience. But how?

Actually, scientific journalism in other disciplines does this job: people who are trained in writing (for a popular market) summarize results in a way that people are able to understand them. This is important, because people should be informed about what knowledge exists - especially taking into account that people pay for the generation of this knowledge (because a lot of scientific work is paid for with tax money). But for software science this kind of journalism does not exist. Yes, there are a bunch of magazines that address technical things. You find books on new APIs or new technology that explain how to apply it. But this is something different. These writings explain how industrial products could be used. They do not explain what we actually know about them. It would be great if there were people who summarize research results - but we currently have to live with the fact that this is actually not done in our field.

So, back to the studies.
 

Giving a quick summary took a damn long time

Again, I really love Dave's work. I think his studies are great. His writings are great. But it turned out that just writing a quick summary took much more time than expected. When I now explain what happened to me and why I had trouble summarizing the paper, this should not and must not be understood as a criticism of Dave's work. Really not. Dave's work is definitely a shining example of how good science in our field should be. Dave's paper is just an example of the trouble people can have reading scientific papers. And I assume my papers suffer from the very same problems.

One of the papers I started with was Identifier length and limited programmer memory [2]. I remembered that this study compared 8 expressions with different lengths and that subjects were asked to write down a part of the expression. So, I wanted to write sentences such as:
 "The experiment gave A subjects B expressions to read for a time C (D subjects were removed for some reasons). Each expression consisted of E parts and the authors used the criterion F to distinguish between short and long expressions. After reading, a part from the expression was removed and subjects had to complete it. The average time for reading short expressions was T1 and for long expressions it was T2, so the (statistically significant) difference was T3, i.e. it took people G percent more time to read the long compared to the short expressions."
I am aware that these sentences are quite a simplification of the results. In particular, I do not mention all independent variables and I do not mention the applied statistical method. By reporting only the means, people do not get an idea of the size of the confidence intervals, etc. But, again, the goal was to give a quick (but still informative) and not a complete overview. Why do I think that this kind of summary is informative? Well, I think it contains the most relevant information. The number of subjects gives an idea how large the experiment was (and people are mad about this idea of "being representative" - that's another point that needs to be discussed, but not here, not now), the dropout rate gives an idea how much the data says about the relation between the "originally addressed sample" and the actual data used for the analysis. And the average times give people an idea how large such differences are. Yes, there are effect size metrics such as Cohen's d or eta squared, but if someone does not know these things, such numbers would rather confuse them.
 

Sample size and dropout rate

Doing the first step (number of subjects) seemed relatively easy, because the number 158 is already mentioned in the abstract, so I directly started searching for the dropout rate. Suddenly it took some time to understand what exactly happened to the data. The paper does not have an explicit section such as "experiment execution" or something. But there is a section "Data preparation" where I found the following:
"[...] the data for a few subjects was removed. For example, one subject reported writing down each name. A second subject reported being a biology faculty member with little computer science training. Finally, the time spent viewing Screen 1 was examined. It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p. 435]
Ok, but what was the actual data being used? The second sentence seems to describe that the data of a whole subject was removed. But what does "this affected 18 responses" mean? Does this mean that the data of 18 subjects was removed? Or just 18 answers? And what about the other six? Does it mean that 24 answers, i.e. three subjects, were removed? Or was each single response treated individually? I suddenly felt slightly reluctant to write down a sentence such as "158 subjects participated", because I was not able to find out precisely what data was skipped. But, ok, I lived with the problem - and just reported that 158 subjects participated. Actually, this step alone took me quite a bit of time, because I reread the paper more than once, assuming I had missed some relevant information.
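The numbers in the quoted passage can at least be cross-checked against each other. The following is a minimal sketch in Python; it assumes (as I remembered it, not as a confirmed fact from the paper) that each subject contributed 8 responses, and all function names are made up for this illustration.

```python
# Figures quoted from the paper's "Data preparation" section.
TOTAL_RESPONSES = 1264
REPORTED_SUBJECTS = 158
TOO_SHORT = 18  # responses with viewing time below 1.5 s
TOO_LONG = 6    # responses with viewing time above 9 min


def responses_per_subject(total_responses, subjects):
    """Average number of responses each subject contributed."""
    return total_responses / subjects


def share_removed(removed, total_responses):
    """Fraction of all responses affected by one removal rule."""
    return removed / total_responses


def subjects_if_removed_as_blocks(removed, per_subject):
    """How many whole subjects the removed responses would correspond to,
    ASSUMING responses were dropped subject by subject (one possible reading)."""
    return removed / per_subject


if __name__ == "__main__":
    per_subject = responses_per_subject(TOTAL_RESPONSES, REPORTED_SUBJECTS)
    print(f"responses per subject: {per_subject:.0f}")
    print(f"too short: {share_removed(TOO_SHORT, TOTAL_RESPONSES):.1%}")
    print(f"too long:  {share_removed(TOO_LONG, TOTAL_RESPONSES):.1%}")
    print(f"responses left if removed individually: "
          f"{TOTAL_RESPONSES - TOO_SHORT - TOO_LONG}")
    print(f"subjects if removed as blocks: "
          f"{subjects_if_removed_as_blocks(TOO_SHORT + TOO_LONG, per_subject):.0f}")
```

The percentages match the quote (1.4% and 0.5%), and 1264 responses over 158 subjects gives exactly 8 per subject. What the arithmetic cannot tell is which of the two readings is the right one: 24 individually removed responses, or three removed subjects.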
 

How large is the effect of expression length?

The main reason why I looked into the paper was that I wanted to know whether expression length was a significant factor and, in case it was, how large the effect was. The paper reports a significant average difference of 20.1 seconds between long and short expressions in reading time, i.e. longer expressions took longer. But how much longer did they take in comparison to short expressions? 

I started searching either for effect size measures or at least some descriptive numbers such as means or confidence intervals or something. I was really convinced that I must have missed it somewhere. So I re-read the paper over and over again - and did not find the number. The only thing I had was the following:
"It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p.435]
So, should I just report that the average reading time was between 1.5 s and 9 min, which would mean that 20.1 seconds is "between a factor of roughly 13 and roughly 4 %" of the reading time? That does not sound meaningful. Again, searching just for this single number (that I finally did not get) took me some time. The same is true for a second variable: syllable. It is reported that each additional syllable costs the developer 1.8 seconds. But what does the first syllable cost?
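To show how uninformative these bounds are, here is the arithmetic as a small Python sketch. It only uses the 20.1 seconds and the two cutoffs (1.5 s and 9 min) quoted above; the function name is hypothetical, and the true average reading time remains unknown.

```python
def relative_effect_bounds(effect_s, min_time_s, max_time_s):
    """Return (largest, smallest) possible ratio of the effect to the
    baseline reading time, given only the range the baseline must lie in."""
    largest = effect_s / min_time_s   # baseline as short as allowed
    smallest = effect_s / max_time_s  # baseline as long as allowed
    return largest, smallest


if __name__ == "__main__":
    largest, smallest = relative_effect_bounds(
        effect_s=20.1,       # reported difference between long and short
        min_time_s=1.5,      # shorter responses were removed
        max_time_s=9 * 60,   # longer responses were removed
    )
    print(f"at most  {largest:.1f} times the baseline")   # at most  13.4 times the baseline
    print(f"at least {smallest:.1%} of the baseline")      # at least 3.7% of the baseline
```

A range from about 4 % to more than a factor of 13 says nearly nothing, which is exactly why the mean reading times for short and long expressions would have been needed.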

In fact, I felt more uncomfortable with the variable syllable because there are multiple treatments of this variable and I would be much more interested in the precision of 1.8 seconds.
 

How exactly were the results of the study computed?

What puzzled me as well was the question, how the results were achieved: what statistical procedure was used? And what tool was used? The paper just says that linear mixed-effects regression models were used. Ok, but with what tool? And what exactly were the input variables for the regression models?

Going back to the question on the effect of the variable length, the paper says that "the initial model includes the explanatory variable Length" [2, p. 437]. Length? In a regression? The paper uses length, which is a binary variable (the paper distinguishes between short and long), so in principle, it is just a simple AB-test, or did I miss something? Or was a bunch of variables added to the initial model and length was just the one that was significant?

Actually, it turned out that I had many, many more problems. And it took me quite a lot of time just to identify that some of these problems were real. I should mention that because I had so much trouble, I contacted Dave who sent me the raw data set within hours so I was able to analyze the data on my own in order to get the results from the experiment.
 

Why standards such as CONSORT are urgently needed in software science

Finally, I got the raw measurements and was able to recompute some numbers and everything was fine. But why did I feel something is really problematic?

Again, I am well-trained in reading studies. But if it took me hours to understand what was in the paper, I assume that it takes many more hours for people who are not trained in these things. So, how can we even assume that someone will take studies into account if it takes many hours to read them? The study by Devanbu et al. [1] indicates that we should blame developers for not knowing what is actually known in the field. But if understanding a single study takes many hours, it actually makes sense that people do not read them. Why? Because developers have more to do than spending a whole day on reading a single paper. And in case essential information is missing in the end, the whole day was spent in vain.

So, how come essential information is hard to find in studies or is even missing? Again, I do not blame the authors of the study mentioned here for forgetting something. But the paper was published in a peer-reviewed journal. How is it possible that it passed the peer-reviewing process while some essential information is missing (again, we need to speak about the review process at some point, but not here)? 

I am happy that the paper was published, because otherwise the whole body of knowledge in our field would be even smaller - and it is already unacceptably small (see the study by Ko et al. [3]). But what would have reduced the problem?

This is where research standards come into play. If our field were disciplined enough to apply a relatively simple reporting standard such as CONSORT [4], things would be easier. Such a standard implicitly puts a summary into each paper that permits you to find information quite fast. For the review process, it is relatively easy to check whether a paper fulfills the standard. I.e. authors can double-check whether the relevant information is contained and reviewers can do this double-checking as well. 

Applying such a standard would have another implication: if for example a conference applied such a standard, many papers could be directly rejected because they do not fulfill the standard. The problem identified by Ko et al. (and there are in fact many, many more authors who documented that evidence is hardly gathered in our field) would vanish: scientific venues would publish only papers that follow the scientific rules. This would reduce the problem that readers are confronted with tons of papers whose content cannot be considered part of our body of knowledge. 

Yes, there is the other problem which makes it hard to imagine that we finally get to the point where the software science literature contains scientifically relevant studies: people must be willing to execute (and publish) experiments which are able to conflict with their own position. But this is a different issue I discussed somewhere else.

Yes, research standards are urgently needed. At least reporting standards. Urgently.

References

  1. Devanbu, Zimmermann, Bird. Belief & evidence in empirical software engineering. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, pages 108–119, 2016. [https://doi.org/10.1145/2884781.2884812]
     
  2. Binkley, Lawrie, Maex, Morrell, Identifier length and limited programmer memory, Science of Computer Programming 74 (2009) [https://doi.org/10.1016/j.scico.2009.02.006]
     
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
     
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]

Thursday, May 7, 2020

Before Doing Science in Software Construction Something Else is Needed: Critical Thinking


Why is Science Needed in Software Construction?

Software construction is a huge, multi-billion market where new technology appears almost every day (in case you doubt that software is a multi-billion market, just take a look at the 10 most valuable companies on this planet today). Such new technology comes with multiple claims and the most general one is, that the new technology makes software development easier and hence cheaper.

Taking the size of the market into account, there are good reasons to doubt whether all technology on the market exists for a good reason - beyond the reason that new technology increases the income of companies or consultants who propagate this technology. Taking the size of the software market into account, there are reasons to believe that a lot of technology exists although its promised benefit neither ever existed nor will ever exist.

In the very end, one has to accept that most claims associated with a certain technology are not the result of non-subjective studies. Instead, they are the result of subjective perceptions or impressions of people who either have strong faith in a new technology, who really hope that the technology improves something, or who just love a new technology ("faith, hope, and love are a developer’s dominant virtues" [1, p. 937]). Finally, some of these claims are just the result of marketing considerations: Claims that are made and spread just because they increase the probability of success for the new technology and not because they are true.

That non-subjective studies are rather rare exceptions in the field of software construction is a sad, but well-documented phenomenon. For example, Kaijanaho has shown that up to 2012 only 22 randomized controlled trials on programming language features with human participants were published [2, p. 133]. Another example is the paper by Ko et al. who analyzed the literature published at the four leading, scientific venues in our field. The authors came to the conclusion that "the number of experiments evaluating tool use has ranged from 2 to 9 studies per year in these four venues, for a total of only 44 controlled experiments with human participants over 10 years" [3, p. 137].

So, what's wrong with this situation? The problem is that new technology causes costs. Costs for learning this technology, applying it, and maintaining software written in it. And there are additional, hidden costs. First, there are costs because new technology supersedes existing technology. Existing software is often rewritten, which means that investments made in the past need to be repeated in the future. And in case existing software is not rewritten, there are additional costs for maintaining the old technology. And old technology causes larger costs because once a technology is no longer taught and no longer applied, it becomes more expensive to maintain it, simply because there are no longer people on the market who are able to master the old technology. An extreme example of this was the Y2K problem, whose costs were to a certain extent caused by forgetting the old technology COBOL.

But there is another, tragic problem. The problem is, that in case a new technology would appear that solves a number of problems we have today, such technology could not be identified. The claims associated with this new technology would just be lost among all the other claims that exist for today's technology or claims that will be associated with competitors.

We must not forget that the goal is not to find excuses to stick to old and inefficient technology. The goal is to make progress. But progress does not mean just to apply new stuff that appeared recently, but to apply technology that improves the field of software construction.

So, what we need are methods to separate good from bad technology. We need to separate knowledge from speculation and marketing claims. And we need to teach such methods to developers to give them the ability to separate knowledge from speculation. This does not mean that we need developers who execute studies. But we need developers who are able to read studies and who are able to identify trustworthy studies from bad ones. In the end, we want a discipline that relies on the knowledge of the field as a whole and not on speculations of individuals.

The Scientific Method

The alternative to subjective experiences and impressions is the application of the scientific method, which is actually the alternative to subjectivity and not just one alternative among others. This does not imply that the term scientific method describes a clear, never-changing and unique process of knowledge gathering. Instead, it is a collection of things that can be done, should be done or must be done. And this collection changes over time, because not only knowledge in a certain discipline changes because of the scientific method. The method changes as well.

It is not surprising that the scientific method is often critically discussed in the field of software construction, which is more an expression of the immaturity of the field than of the community's willingness to generate and gain non-subjective insights. Just to give an impression: even at international, academic conferences on software construction, there are discussions whether the scientific method makes any sense at all. At such places, there are discussions about the need for control groups, the validity of statistical methods or the validity of experimental setups. All these discussions exist despite the fact that there are tons of literature available from other fields on these topics (which give very clear answers to these questions). One could argue that this immaturity just exists because the field is quite young. In fact, this statement can be easily rejected. In medicine, which is typically considered one of the old fields, most of the experimental results that we accept today as following valid research methods were produced only in the last 30-40 years.

The fundamental part of the scientific method is that there are people who are willing to test the validity of hypotheses. This implies that they are willing to accept results although they conflict with their own, personal and subjective impressions or attitudes. But this means that they not only accept their own experimental results, but they also accept results from others. Although this seems quite natural, it has one important implication. It means that people have established some common agreement on what a valid research result is and what is not.

Scientific Standards

Let's discuss the very general idea of research standards via an example. Let's assume there are two programming techniques A and B and one would like to test the hypothesis that it takes less time to solve a given problem using technique A than it takes using technique B. So one person tests 20 people, 10 solve a given problem using A, 10 solve it with B. Then the times for both groups are measured and compared. This is a standard AB-test where not only the experimental setup (randomization of participants, etc.) but also the analysis of the data (t-test or U-test) has been well known for decades. But the general question is whether or not one should take the results of the experiment into account as a valid result.
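To make the analysis step of such an AB-test concrete, here is a minimal sketch in Python. Two things are assumptions of this sketch and not part of the example above: the measured times are invented purely for illustration, and instead of the t-test or U-test the sketch uses an exact two-sided permutation test on the difference of means, simply because it needs nothing beyond the standard library.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    Returns (observed mean difference A - B, p-value). The p-value is the
    share of all possible splits of the pooled data into groups of the
    original sizes whose absolute mean difference is at least as large
    as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = mean(group_a) - mean(group_b)

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = sum_a / n_a - (total - sum_a) / n_b
        # Small tolerance so that floating point noise does not drop ties.
        if abs(diff) >= abs(observed) - 1e-9:
            extreme += 1
        splits += 1
    return observed, extreme / splits


if __name__ == "__main__":
    # Hypothetical completion times in seconds, 10 subjects per technique.
    times_a = [312, 287, 334, 298, 305, 291, 320, 276, 310, 301]
    times_b = [355, 342, 360, 318, 371, 349, 333, 362, 340, 357]

    difference, p_value = exact_permutation_test(times_a, times_b)
    print(f"mean A: {mean(times_a):.1f} s, mean B: {mean(times_b):.1f} s")
    print(f"mean difference (A - B): {difference:.1f} s, p = {p_value:.5f}")
```

With 10 subjects per group the test enumerates all 184,756 possible splits, which still runs in about a second. For larger groups one would sample random splits instead, or simply use the t-test or U-test from a statistics package.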

It turns out that especially in software construction people complain a lot about such a standard approach. And in case technique A is more efficient than B, a large number of people who prefer technique B will find reasons either to ignore the result or to discredit the experiment. Actually, there are quite plausible arguments against the experiment and the most general one is the problem of generalizability: one doubts that the number of subjects is "representative" enough to draw any conclusion from the experiment. Another doubt is whether the given programming problems represent "something that can be found in the field" or whether the problems are any "general programming problems at all".

We should not be so ignorant as to reject such objections directly, because there is some truth in them. But we should also not be so open minded as to take such objections too seriously, for the following reason: there is no experiment in the world that is able to solve the problem that underlies these objections. No matter how many developers are used as subjects in the experiment, one can always argue that the number is too low. And no matter on how many programming problems the techniques are tested, there are other programming problems on this planet that were not used in the experiment. 

In order to overcome such a situation there is a need to have some common understanding of the applied methods: there is the need for community agreements. If people agree on how experimental results are to be gathered, there is no need to doubt results that come from experiments that follow such agreements. In other disciplines, the problem was identified as well (some time ago) and corresponding scientific standards were created. Examples of such standards are the CONSORT standard in medicine [4] (which mainly addresses the way experiments are to be reported) and the WWC standard that is used in education [5] (which not only covers the way experiments are to be executed, but also handles the process of how experiments should be reviewed).

The need for such community agreements is obvious and we argued already in 2015 that such community agreements are necessary in software construction as well [6]. Today we find movements towards such standards. An example for this is the Dagstuhl seminar "Toward Scientific Evidence Standards in Empirical Computer Science" that takes place in January 2021 [7].

On the Selection of Desired, and the Ignorance of Undesired Results

Such movements are good and necessary. However, we should ask ourselves whether the field of software construction is ready for such standards, because the introduction of research standards entails some serious risks that should be taken into account. But before discussing these risks, I would like to start with some examples.

In the last years, one situation occurred to me over and over again. A colleague contacted me and asked whether there is one experiment available that supports a certain claim. The colleague's motivation is typically that she or he tries to find a way to argue for the need for some new technology, and from her/his perspective this argument would be stronger if there were some matching experimental results. At that point I usually start a conversation and ask what if there are experimental results that show the opposite. I then usually get the answer that such experiments would be interesting, but wouldn't help in the given situation. In other words: an experimental result (in case it exists) is ignored in case it contradicts a personal intention.

Something else happened to me in the last years which is related to an experiment I published in 2010; an experiment that did not show a difference between static and dynamic type systems [8]. Today it seems quite clear that the experiment had problems and it would have been better if the experiment had never been published. In the meantime, other experiments showed the positive effects of static type systems (such as for example [9]): taking the sum of experiments into account, the question of whether or not a static type system helps developers can be considered answered (so far). But what happened is that people to whom it is helpful that no difference between static and dynamic type systems was detected tend to refer only to the first study from 2010 but not to later ones. For example, Gao, Bird and Barr explain the results of the 2010 paper in relative detail, but do not mention the later one [10]. Again, it seems as if only those results are taken into account that match a given intention - and results that contradict such an intention are ignored.

Finally, another situation occurred more than once or twice. A colleague created some new technology and asked me for advice in order to construct an experiment that reveals the benefit of the new technology. After some discussions (which often last for hours) we typically come to the point that the colleague is really convinced about the benefit of the technology in a certain situation, but thinks that in a different situation the technology could even be harmful. Often, this colleague is in the situation that a PhD needs to be finished and "just the last chapter - the evaluation" needs to be done. And what happens next is often that an experiment is created that just concentrates on the probable positive aspects of the new technology - the (possible) negative aspects are not tested.

The commonality of these examples is, that people today have the tendency to select only those results that do match their own perspectives or attitudes. In other words: even if strong empirical evidence, i.e. a number of experimental results, exists for a given claim, people still have the tendency to search for singular results that contradict such claim if people do not share this claim.

This is comparable to people who advocate homeopathy and select those rare experiments where homeopathy showed a positive effect - and ignore the overwhelming evidence we have about homeopathy.

The Required and Currently Missing Foundation is Critical Thinking

Probably there is a reason for such a behavior and I assume that such a reason has something to do with people's attitude in our field. In our education, from the very beginning people are involved in ideological warfare: procedural versus functional versus object-oriented programming, Eclipse versus IntelliJ, Git versus Mercurial, JavaScript versus TypeScript, Angular versus React, etc. "Choosing a side" seems to play an essential role in software construction. And it actually makes sense to a certain extent. If I master a technology, it is beneficial for me if this technology becomes the leading technology in the field. If I master a technology that no one uses and that no one is interested in, my technological skills are not and maybe will never be beneficial to me. Consequently, people advocate the technology they use and they try to find reasons why this technology should be used by others as well. And in order to achieve this, all kinds of arguments will be applied and it does not matter whether an argument is actually valid as long as it supports my intentions. This behavior becomes stronger as soon as people start developing their own technology. If someone writes a programming language as part of their PhD, there seems to be the tendency to defend this language. 

The idea of defending a self-created technology or to defend a technology just because one is able to master it seems quite natural. But actually, this is probably the core of the problem. We need to communicate from the very beginning that the goal is to make progress. And that progress means that we are willing to identify problems. And in case there is strong evidence that a certain technology has serious problems, we must be open minded enough to take alternative technologies into account. We must be able to accept and apply critical thinking.

Of course, this must not lead to the situation that people switch technology directly after some rumours appear about some better technology - in fact, this would be closer to the situation we have today where a large number of people accept new technology for the sake of being new. Just to throw everything away in order to apply something new is closer to actionism than to critical thinking. Critical thinking also does not mean that we find ad hoc arguments against some technology. Critical thinking must not mean that we encourage wild speculations. It just means that we are willing to accept different arguments. Critical thinking means that we are willing to collect and accept pros and cons. It means that we are willing to give up our own position.

This willingness is the very foundation we need in our field. Because it does not matter if we define research standards in our field and force people to follow such research standards as long as people are not willing to accept results that conflict with their own positions. Otherwise, research results will either be just ignored or people will generate and publish only those results that match their own attitudes.

Once we have achieved this kind of critical thinking and once we are able to pass this idea on to students, we can take the next step towards evidence in order to give people the ability to differentiate between strong, weak and senseless arguments, between arguments that are backed up by evidence and those that are not. Then, we have researchers who are willing to define experiments whose results might contradict the experimenters' positions. This would be the moment where science could start in our field. This would be the moment when we are ready to apply the scientific method.

References

  1. Stefan Hanenberg, Faith, Hope, and Love: An essay on software science’s neglect of human factors, OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications, October 2010, pp. 933–946. [https://doi.org/10.1145/1932682.1869536]
  2. Antti-Juhani Kaijanaho, Evidence-Based Programming Language Design A Philosophical and Methodological Exploration, PhD-Thesis, Faculty of Information Technology, University of Jyväskylä, 2015. [https://jyx.jyu.fi/handle/123456789/47698]
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]
  5. U.S. Department of Education’s Institute of Education Sciences (IES), What Works Clearinghouse Standards Handbook Version 4.1, January 2020. [https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf]
  6. Stefan Hanenberg, Andi Stefik, On the need to define community agreements for controlled experiments with human subjects: a discussion paper, Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, October 2015, pp. 61–67. [https://doi.org/10.1145/2846680.2846692]
  7. Brett A. Becker, Christopher D. Hundhausen, Ciera Jaspan, Andreas Stefik, Thomas Zimmermann (organizers), Toward Scientific Evidence Standards in Empirical Computer Science, Dagstuhl Seminar, 2021 (to appear) [https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21041]
  8. Stefan Hanenberg, An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Reno/Tahoe, Nevada, USA, ACM, 2010, pp. 22–35. [https://doi.org/10.1145/1932682.1869462]
  9. Stefan Endrikat, Stefan Hanenberg, Romain Robbes, Andreas Stefik, How do API documentation and static typing affect API usability?, Proceedings of the 36th International Conference on Software Engineering, May 2014, pp. 632–642. [https://doi.org/10.1145/2568225.2568299]
  10. Zheng Gao, Christian Bird, Earl T. Barr, To Type or Not to Type: Quantifying Detectable Bugs in JavaScript, Proceedings of the 39th International Conference on Software Engineering, 2017, pp. 758–769. [https://doi.org/10.1109/ICSE.2017.75]