Saturday, June 11, 2022

Please, Stop Complaining About Missing Generalizability of Code Examples in Experiments

For years, I have heard and read complaints that studies do not generalize. I mostly get such responses from reviewers who argue why they believe that one of my experiments doesn't generalize. Actually, I have heard and seen such complaints not only about my own studies but about other experiments as well. As a result, such experiments don't get published. 

There is nothing wrong if bad papers do not get published. No one wants to have wrong results in the literature. But it is bad if results don't get published because of someone's belief that the results do not generalize. And it is even more problematic if results do not get published because the doubts about missing generalizability are just the consequence of some misunderstandings about experimentation.

To summarize the following text: Please, stop complaining about missing generalizability of code examples in experiments.

Unfortunately, it takes some space to explain this in more detail.

On Controlled Experiments

Controlled experiments are quite simple. Their goal is to measure something in a situation where everything that can be controlled is controlled. And in case there are things that cannot be controlled (so-called confounding factors), experimenters should either try to avoid them or to measure them. 

In the simplest case there is one dependent variable (such as time to completion in a programming task) and some independent variables (such as certain techniques that are used, code styles, etc.) -- variables that are intentionally varied by the experimenter. The independent variables are the things that are in the focus of the experiment, i.e. the things that are actually studied.

After executing the study, the experimenter checks whether the variations of the independent variable have any effect on the dependent variable, using some statistical procedure. The whole idea behind experiments is quite straightforward. 
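To give a minimal sketch of what such a statistical procedure could look like in code (the completion times below are made up, and the use of Apache Commons Math is my assumption for this illustration, not part of any concrete study), one could compare the completion times of two groups with a two-sample t-test:

  import org.apache.commons.math3.stat.inference.TTest;

  // Minimal sketch: compare completion times (in seconds) of two groups
  // with a two-sample t-test. All numbers are invented for illustration.
  public class AnalysisSketch {
    public static void main(String[] args) {
      double[] groupA = {1.2, 1.1, 0.9, 1.3, 1.0, 1.2};
      double[] groupB = {1.0, 0.9, 1.1, 0.8, 1.0, 0.9};
      double p = new TTest().tTest(groupA, groupB); // two-sided p-value
      System.out.println("p = " + p);
    }
  }

Whether a t-test, a non-parametric test, or something else is appropriate depends on the design and the data; the point here is only that the check is a plain statistical comparison of the measured dependent variable.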

A Simple Example: IfJava vs. IfTrue

Let's consider a possible study: a classical AB experiment (one independent variable with two treatments A and B). An experimenter might suspect that there are differences between Java's if-statement and some given alternative. I.e., there are two variants:

Treatment A (IfJava):
  if (someCondition)  
    ...
  else
    ...

Treatment B (IfTrue):
  someCondition ifTrue 
    ...
  else
    ...

There is one independent variable (if-style) with two treatments (IfJava, IfTrue). With respect to the dependent variable, it is quite plausible to measure the time until the if-statement is understood. But we need to speak about confounding factors.

On Confounding Factors

Confounding factors are factors with an undesired effect. Undesired means that they influence the dependent variable, which should only be influenced by the independent variables. Unfortunately, confounding factors don't just add some constant to the dependent variable. Instead, they come with their own distribution (mean, deviation, etc.). Confounding factors can hide the effect of the independent variable: if a confounding factor is too strong (or its deviation too large), one mainly measures the effect of the confounding factor in an experiment and not the effect of the independent variable. 

In the best case, the effect of confounding factors is small, known, and can be extracted from the literature. But taking the current state of our literature into account, we cannot expect any hard numbers from it. So, what does this mean for our experiment?

The goal of the experiment is to measure the difference between IfJava and IfTrue. And we have to use concrete code snippets. But what should such snippets look like? One could have a spontaneous idea: let's just use some arbitrary if-statement such as the following.

Treatment A (IfJava):
  if ((myVariable > 23) && isThisRight() && !someOtherCondition())  
    return 1;
  else
    return 2;

Great. We could ask participants "What is the result of the if-statement?", and in case the statistical analysis detects a difference, the experimenter calls the if-style that requires less time the more readable one.

Unfortunately, we have a confounding factor: the complexity of the code. It is plausible that the more complex the condition, the more time it takes to answer the question. I.e., the dependent variable time is influenced by something that is not in the focus of the study.

We can examine the literature for readability models for Boolean expressions. Additionally, we need statistical information about such models. But such models with associated statistics don't exist. What can we do?

People would say "well, you just have to vary the complexity of the Boolean expression and consider this as a second variable in the experiment". Such a comment is not serious. First, we cannot vary the expression's complexity in a controlled way because there is no known complexity model for Boolean expressions. Second, it completely misses the problem of confounding factors: in case the effect of the Boolean expression is too strong, we could accidentally hide the difference between IfJava and IfTrue (in case it exists). And third, our goal is not to study Boolean expressions. Our goal is to study if-styles. Why should we bother about the complexity of Boolean expressions?

Actually, the last idea -- not to bother about Boolean expressions -- is both the problem and the solution. It solves our problem in the study. But it comes with the problem that most reviewers will then argue that the study's result is not generalizable.

Becoming Aware of How Large the Problem Is

Before coming to the solution, it makes sense to speak about the problem in more detail -- the reason why it does not make much sense to vary the Boolean expressions in our code.

In the previous code example we see that the condition is not a pure Boolean expression. It is an expression in the programming language Java that finally evaluates to a Boolean value. More precisely, it is an expression of type boolean. It is important to understand this difference.

A Boolean expression comes from Boolean algebra. It consists of variables and operators (and some brackets). But the code contains method calls as well. I.e., even if there were a readability model for Boolean expressions, we would have to live with the problem that the method calls somehow play a role as well. And once we are there, we have to admit that names play a role, too. And we have to take Java's semantics into account, such as the short-circuit evaluation of the operator &&: in case the left-hand side of an && already evaluates to false, the right-hand side will not be evaluated (which is important in the presence of side effects, etc.).

The intention of the previous paragraph is to make explicit that one cannot simply say "let's generate some expressions". A serious scientist will treat all these things as potential confounding factors. And without knowledge about these factors, it is better to get rid of them. 

The Solution And The Problem

As already said, there is a simple solution to this problem: don't bother about Boolean expressions. This simply means that instead of using a Boolean expression in the condition, one just uses a Boolean literal, as in the following code:

Treatment A (IfJava):
  if (true)  
    return 1;
  else
    return 2;

For a number of people (and unfortunately, for a number of people in the software science community as well) this code looks stupid. And the typical arguments (that one also finds in reviews) are:
  • there is no logic in an if-statement whose condition is a literal, because the resulting statement is already known upfront, and 
  • this is pure artificial code you will never find in any code repository.
It is completely understandable if someone from industry argues that way, especially someone who is not familiar with experimentation. But a reviewer should be aware of the problem of confounding factors and the reason why one has to adapt the code in order to get rid of such factors. 

On the Introduction of Additional Factors

Unfortunately, the story about missing generalizability is not yet over. But this time, it comes from a different source.

Let's assume (note that we haven't done the experiment) that it takes on average 1.1 seconds to answer the question with IfJava while it takes on average 1.0 seconds to answer the question with IfTrue. A 10% difference sounds like a lot. But experienced experimenters will be alarmed.

Since you measure something on participants, and since there is deviation between participants (as well as deviation within a participant), not only the mean values are interesting. You also need to know something about the deviation. From that you can determine an effect size such as Cohen's d, and from that you can estimate the required sample size with some statistical tools. Let's assume that the effect size is d = .8 (which assumes that your deviation is really, really small). The resulting sample size will be about 42 participants per group, i.e. 84 participants in total. This is a large number of people. At that point, experimenters typically think about alternatives.
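As a rough sketch of such a sample-size estimation (the choice of a two-sided alpha = .05 and a power of .95 are my assumptions for this illustration; they approximately reproduce the 42 participants per group mentioned above), one can use the standard normal approximation for a two-sample t-test:

  // Rough sample-size sketch for a two-sample t-test, using the
  // normal approximation n = 2 * ((z_alpha/2 + z_beta) / d)^2.
  // Alpha and power are assumptions chosen for illustration only.
  public class SampleSizeSketch {
    public static void main(String[] args) {
      double d = 0.8;            // assumed effect size (Cohen's d)
      double zAlphaHalf = 1.960; // two-sided alpha = 0.05
      double zBeta = 1.645;      // power = 0.95
      double nPerGroup = 2 * Math.pow((zAlphaHalf + zBeta) / d, 2);
      // prints about 41; the usual small-sample correction brings it
      // to roughly 42 participants per group
      System.out.println(Math.ceil(nPerGroup));
    }
  }

Dedicated power-analysis tools do this more precisely, but the order of magnitude is the point here.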

What experimenters can do is measure more data points per participant. I.e., one would rather design the experiment as a crossover trial or even as an N-of-1 trial, i.e., one would give one participant multiple tasks. But such a decision has consequences, and one of them is that you cannot give participants the identical task, because once a participant knows the code, he does not need to think about it a second or a third time. Hence, there is a need to vary the code. 

One could change the Boolean literal. But this does not change much, and it would mean that a participant who receives more than two tasks will get the identical task at least twice. One could vary the body of the if- or the else-branch. But this again introduces some complexity from another source not related to the if-statement.

Fortunately, there is a trick: use the if-statement again in the body. The resulting code looks like the following:

Treatment A (IfJava):
  if (true)  
    if (true)  
      return 1;
    else
      return 2;
  else
    return 3;

This kind of code can be varied. You can, for example, consider nesting depth as a parameter, etc. (a small generator sketch follows below). Then, you can give participants some of this code (you just have to think about learning, fatigue, and novelty effects). Note that the additional factor (such as nesting depth) is not inherently interesting. It is the result of a design choice that became necessary because of the missing knowledge in the literature about the complexity of Boolean expressions, the resulting countermeasures to remove confounding factors, and the expected, required sample size.
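As an illustration of how such stimuli could be produced, here is a minimal sketch of a generator that nests the IfJava pattern to a given depth (the class and method names are made up for this sketch):

  // Minimal sketch: generates an IfJava stimulus with a given nesting depth.
  // Class and method names are hypothetical and only serve this illustration.
  public class StimulusGenerator {

    static String nestedIf(int depth, int indent, int thenValue) {
      String pad = "  ".repeat(indent);
      if (depth == 0) {
        return pad + "return " + thenValue + ";\n";
      }
      return pad + "if (true)\n"
           + nestedIf(depth - 1, indent + 1, thenValue)
           + pad + "else\n"
           + pad + "  return " + (thenValue + depth) + ";\n";
    }

    public static void main(String[] args) {
      // A nesting depth of 2 reproduces the structure of the example above.
      System.out.print(nestedIf(2, 0, 1));
    }
  }

The nesting depth is then the parameter that is varied across tasks, while everything else in the stimulus stays fixed.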

People might argue that the situation is now the same as before: there is one factor (nesting) whose effect is not known upfront and which is potentially a confounding factor. To a certain extent this statement is right. But it ignores that the resulting code does not contain other language constructs that should not be studied (except the Boolean literal, the return statement, and the integer literal).

On Generalizability of Code Examples

The previous code examples are probably good choices to study potential differences between IfJava and IfTrue. Still, your study will probably never be published. Again, the main argument against it will be that the experiment code is not real code. 

The resulting problem is that the results will become available neither to other researchers nor to language designers. In case there is a difference between if-styles, the next language designer has no chance to hear about it. And other researchers will not be able to benefit from the measured differences and deviations. And if, in some years, someone has the same idea as IfTrue, that person cannot just take a look into the literature to find out what is already known.

Actually, the argument against the code examples reveals a complete misunderstanding of experimentation. Again, the resulting code is the result of controlling factors and reducing confounding factors. It was the goal of the experimenter to find an experiment that gets rid of disturbing factors. One can be relatively sure that the experimenter is aware that the experimental code is not what one finds in reality. But he had damn good reasons to use it anyway.

Starting from complaints about the missing generalizability, people will start longer speculations about possible effects of other factors that exist in reality, and they will speculate whether the difference in reality is really 10% or not. Again, this is a complete misunderstanding of experimentation.

Again, the goal of experimentation is to measure the effect of something in a controlled environment. The goal is not to test what the effect in reality is. In reality, there are many more factors that have an effect. In order to understand the effect in reality, these different factors and their possible interactions need to be known first.

Telling a software scientist to find more realistic code examples is comparable to telling the experimenter of an Aspirin study that one should not artificially measure the effect of Aspirin on headaches, but instead consider more realistic scenarios in hospitals, such as heart attacks or cancer. Of course, Aspirin was studied on headaches, because it was designed to reduce headaches. It was not designed to heal cancer. Studying Aspirin in a more or less arbitrary setting (a more realistic example) will probably not measure anything. Not because Aspirin has no effect (on pain), but because the deviation between different illnesses is too large (and the pain-reducing effect plays too minor a role).

Let's get back to our example. IfTrue was built to have a positive effect on if-statements. It was designed neither to make Boolean expressions easier nor to make anything else better. Arguing that such a construct should be studied in a more realistic example is simply wrong.

Conclusion

Again, please stop complaining about missing generalizability of code examples in experiments, because it simply does not make any sense. Check what the focus of a study is, check which factors are intended to be studied, and check whether confounding factors were reduced as much as possible.

The whole idea of peer review is that people judge whether evidence was collected based on known facts from the field. This implies that personal opinions, estimations, or feelings do not belong in the review process. Our current reviewing practice actually has nothing to do with this idea.

And in case you still see the need for generalizability of code examples in experiments, please answer the following questions.

First, what criteria do you apply in order to identify real code?

Second, what evidence do you have that your personal idea of real code is actually real?

Third, how do you think deviation in real code should be considered?

And in case you don't understand the third question, ask yourself whether you should really review any experiments.

Please, feel free to leave comments.

(Actually, the problem of complaining about missing generalizability is not only a matter of code examples. But this is something for a different article.)

Saturday, March 19, 2022

Please, Stop Collecting Developer Opinions

Just recently, I was quite enthusiastic to read a software science paper, because its title sounded promising. I do not want to refer to this specific paper, because the goal is neither to discredit one specific work nor one specific author. It is a whole set of similar papers that needs to be criticized.

Over the last years, more and more questionnaire studies can be found at scientific venues in software science. What these papers have in common is that the authors ask developers about their opinions on some topic. Then responses from a huge number of people are collected, and the results are analyzed.

So far, there is no problem.

There is nothing wrong with opinions. It is interesting to know what people's opinions are. It is especially interesting from a marketing perspective, because it says something about the perception of people.

The problem lies in the conclusions.

What a number of works in software science do -- and what is fundamentally wrong -- is to infer something about the perceived phenomena from subjective perceptions.

Let's assume there is a technology X that tries to make developers' lives easier. Then someone asks developers whether it makes their lives easier, and let's assume that, with strong evidence, the answer is yes. What can we conclude from this? 

We can conclude that developers think that it makes their lives easier. It is also possible that developers just pretend that it makes their lives easier. But we do not know whether it actually makes their lives easier. Making any claim about how or whether the technology influenced a developer's life is not possible from the evidence gathered so far. 

In order to find out whether the technology X helps, no result from subjective perceptions would bring us closer to an answer. Whatever the result of the questionnaire is, the question whether or not the technology helps is still unanswered.

One could argue that a developer's life becomes better because he thinks that there is a technology that helps him -- independent of whether or not it actually helps him. This kind of argument is comparable to the placebo argument that we frequently find in homeopathy. But it should be clear that this argument should not be used in software science, because it is more of a meta-argument: if something makes people think that it makes their lives better, then it is good. The argument is comparable to the question of whether a free beer makes a developer's life better.

Of course, this leads us (as always) to the need for studies. But the argument is not that questionnaires are not studies. They are. The problem with them is that they depend purely on subjective perceptions. 

There are good reasons why you find whole textbooks about perception in psychology. Perception is not only subjective in the sense that people can perceive the same phenomenon in different ways (because of differences in physiology, differences in experience, etc.); perception can also be influenced. You can easily find a bunch of studies showing that perception can be influenced, and the Pepsi versus Coke experiment [1] is just one example (again, whole textbooks exist on that topic; there is no point in giving a longer list here).

So, what is actually the problem? When we study technology, we need to measure interactions with the technology in a non-subjective way. You can still ask developers questions. "In this scenario, what is the outcome?" could be an appropriate question. But it differs from a question such as "Do you think that technology X helps?".

We need to stop asking for subjective perceptions.

The implications of this statement are much more serious than we might think. Community processes that can be found today, for example in programming language design, typically ask people for opinions. But it would lead too far to discuss this issue here.

Please, stop collecting developer opinions.

Opinions are important. But they do not permit us to draw any conclusions beyond people's opinions. And should a technical discipline focus on opinions? I think the answer is no.

Feel free to leave a comment.