Tuesday, May 19, 2020

Reporting Standards in Software Science Desperately Needed


  
If we are really interested in achieving something in software science, we need reporting standards. I mean this statement seriously. Just recently I experienced how urgently such standards are needed: I summarized an experiment and became aware of how much time it took to extract the relevant information from it. Some information was missing, some was confusing, etc. Had a standard such as CONSORT been applied, it probably would have taken me minutes to summarize the experiment - instead of many, many hours, at the end of which I even had to contact the author because some information was missing.
  

Recent Experiences While Summarizing Research Results

I recently summarized research results from experiments. The goal was relatively simple: just collect the results of some studies and summarize them in a way that an ordinary software developer can understand. I think such work is needed because, for example, the study by Devanbu et al. has shown that most developers judge the validity of claims in software construction based on their personal experience rather than on independent studies [1]. But taking into account that experience is limited and that subjective experiences are quite error-prone, it makes sense to give developers information about the studies that exist and that provide evidence for certain claims. And, what's even more important: give developers the studies that contradict given claims.

The topic of my summary was identifiers, i.e. I wanted to summarize studies that examined the influence of identifiers on code reading or code comprehension. Yes, I know. No big deal. Every one of us knows how important the choice of good identifiers is. But I really wanted to know what was actually measured by researchers. And we should know something about the effect sizes.

Most of the studies were done, or at least initiated, by Dave Binkley. I had already read most of his papers in the past, and since I am well-trained in reading studies, I assumed it would be no big deal to give a quick summary of some of them. And there was another reason why I focused on his papers: from my experience and in my opinion, his studies are well-conducted and I trust the validity of the results, i.e. I trust that the numbers were collected in the way described in the papers, I trust the analyses of the data, and I trust that the writing does not try to oversell the results. I think his research has the goal of finding answers. His papers are not written for the sake of writing papers, but for the sake of improving the knowledge in our field.
 

My goal for the summary

More precisely, I wanted to summarize each paper in a way that gives a 1-2 sentence description of the experimental design, another 1-2 sentences about the dependent and independent variables, and a few sentences about the main results. And maybe some more sentences about what can be learned from the study. The goal was not to bother readers with details that are only needed in scientific writing, i.e. I wanted to skip information about whether the experiment followed a crossover design, whether e.g. a Latin square was used, or what statistical procedure was applied.

Actually, I think it is necessary that readers who are not deeply familiar with scientific writing get results in an understandable way. I.e. if an AB test was applied, I think it makes sense not to write about statistical power, p-values, confidence intervals, or effect sizes, but simply to write that "a difference was detected" (in case a significant result was achieved) and then to report the means and mean differences. And in case multiple factors were tested, my goal was not to write about interaction effects, etc., but to explain the interactions in a way that an average person can grasp the meaning quickly.
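Just to make the idea concrete, here is a deliberately trivial sketch with made-up numbers (the minutes and the significance flag are purely hypothetical): the statistics stay behind the scenes, and the reader only gets the means and the relative difference.

    # Deliberately trivial sketch with made-up numbers: translate the outcome
    # of an AB test into the kind of plain-language sentence described above.
    mean_a, mean_b = 41.2, 51.8    # hypothetical average minutes per task
    difference_detected = True     # i.e. the test result was significant

    if difference_detected:
        diff = mean_b - mean_a
        print(f"A difference was detected: group A needed {mean_a:.0f} minutes on "
              f"average, group B needed {mean_b:.0f} minutes, i.e. group B took "
              f"about {diff / mean_a:.0%} more time.")
    else:
        print("No difference was detected between the two groups.")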
 

Shouldn't someone else do the job?

Quickly is the point here. If we want developers to understand the results of studies, these results have to be communicated efficiently. And the typical research paper at a conference or in a journal does not seem to have the goal of communicating results efficiently. Authors of conference papers are given a certain number of pages they can fill. And authors are actually forced to fill this number of pages. If, for example, a conference such as the International Conference on Software Engineering (ICSE) has a page limit of 12 pages, you will hardly find a paper at that conference that does not have 12 pages. This has something to do with the review process (which should not be discussed here, although there is an urgent need to discuss it). Ok, so you want to communicate scientific results to a broader audience. But how?

Actually, scientific journalism in other disciplines does this job: people who are trained in writing (for a popular market) summarize results in a way that people are able to understand them. This is important, because people should be informed about what knowledge exists - especially taking into account that people pay for the generation of this knowledge (a lot of scientific work is paid for from tax money). But for software science this kind of journalism does not exist. Yes, there are a bunch of magazines that address technical topics. You find books on new APIs or new technology that explain how to apply them. But this is something different. These writings explain how industrial products could be used. They do not explain what we actually know about them. It would be great if there were people who summarized research results - but we currently have to live with the fact that this is simply not done in our field.

So, back to the studies.
 

Giving a quick summary took a damn long time

Again, I really love Dave's work. I think his studies are great. His writings are great. But it turned out that just writing a quick summary took much more time than expected. When I now explain what happened to me and why I had trouble summarizing the paper, this should not and must not be understood as criticism of Dave's work. Really not. Dave's work is definitely a shining example of how good science in our field should be. Dave's paper is just an example of the trouble people can have reading scientific papers. And I assume my own papers suffer from the very same problems.

One of the papers I started with was Identifier length and limited programmer memory [2]. I remembered that this study compared 8 expressions with different lengths and that subjects were asked to write down a part of the expression. So, I wanted to write sentences such as:
 "The experiment gave A subjects B expressions to read for a time C (D subjects were removed for some reasons). Each expression consisted of E parts and the authors used the criterion F to distinguish between short and long expressions. After reading, a part from the expression was removed and subjects had to complete it. The average time for reading short expressions was T1 and for long expressions it was T2, so the (statistical significant) differences was T3, respectively it took people G percent more time to read the long compared to the short expressions."
I am aware that these sentences are quite a simplification of the results. In particular, I do not mention all independent variables and I do not mention the applied statistical method. By reporting only the means, people do not get an idea of the size of the confidence intervals, etc. But, again, the goal was to give a quick (but still informative) overview, not a complete one. Why do I think that this kind of summary is informative? Well, I think it contains the most relevant information. The number of subjects gives an idea of how large the experiment was (and people are mad about this idea of "being representative" - that's another point that needs to be discussed, but not here, not now), the dropout rate gives an idea of how much the data says about the relation between the originally addressed sample and the actual data used for the analysis. And the average times give people an idea of how large such differences are. Yes, there are effect size metrics such as Cohen's d or eta squared, but if someone does not know these things, such numbers would rather confuse them.
 

Sample size and dropout rate

Doing the first step (number of subjects) seemed relatively easy, because the number 158 is already mentioned in the abstract, so I directly started searching for the dropout rate. But it took some time to understand what exactly happened to the data. The paper does not have an explicit section such as "experiment execution". But there is a section "Data preparation" where I found the following:
"[...] the data for a few subjects was removed. For example, one subject reported writing down each name. A second subject reported being a biology faculty member with little computer science training. Finally, the time spent viewing Screen 1 was examined. It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p. 435]
Ok, but what was the actual data being used? The second sentence seems to describe that the data of a whole subject was removed. But what does "this affected 18 responses" mean? Does it mean that the data of 18 subjects was removed? Or just 18 answers? And what about the other six? Does it mean that 24 answers, i.e. three subjects, were removed? Or was each single response treated individually? I suddenly felt slightly reluctant to write down a sentence such as "158 subjects participated", because I was not able to find out precisely what data was skipped. But, ok, I lived with the problem - and just reported that 158 subjects participated. Actually, this step alone took me quite a bit of time, because I reread the paper more than once, assuming I had missed some relevant information.
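A back-of-the-envelope calculation shows why the wording left me guessing. It assumes (as the numbers suggest) that each of the 158 subjects answered all 8 expressions; the two readings of the removal sentence lead to quite different pictures of the data:

    # Back-of-the-envelope check of the numbers reported in [2], assuming each
    # of the 158 subjects answered all 8 expressions.
    subjects = 158
    expressions_per_subject = 8
    responses = subjects * expressions_per_subject  # 1264, matches the paper

    removed = 18 + 6  # "removed responses" mentioned in the data preparation

    # Reading 1: individual responses were dropped -> 1240 responses remain,
    # but the number of remaining subjects stays unclear.
    remaining_responses = responses - removed

    # Reading 2: whole subjects were dropped -> the 24 responses could belong
    # to anywhere between 3 subjects (24/8) and 24 subjects.
    min_subjects_affected = removed // expressions_per_subject
    max_subjects_affected = removed

    print(remaining_responses, min_subjects_affected, max_subjects_affected)
    # -> 1240 3 24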
 

How large is the effect of expression length?

The main reason why I looked into the paper was that I wanted to know whether expression length was a significant factor and, in case it was, how large the effect was. The paper reports a significant average difference of 20.1 seconds in reading time between long and short expressions, i.e. longer expressions took longer. But how much longer did they take in comparison to the short expressions?

I started searching for effect size measures or at least some descriptive numbers such as means or confidence intervals. I was really convinced that I must have missed them somewhere. So I re-read the paper over and over again - and did not find the numbers. The only thing I had was the following:
"It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p.435]
So, should I just report that the individual reading times were between 1.5 s and 9 min, which would mean that the 20.1 s difference lies somewhere "between a factor of 14 of the shortest and 4% of the longest reading time"? That does not sound meaningful. Again, searching just for this single number (which I never found) took me quite some time. The same is true for a second variable: syllables. It is reported that each additional syllable costs the developer 1.8 seconds. But what does the first syllable cost?

In fact, I felt even more uncomfortable with the variable syllables, because there are multiple treatments of this variable and I would have been much more interested in the precision of those 1.8 seconds.
 

How exactly were the results of the study computed?

What puzzled me as well was the question of how the results were obtained: what statistical procedure was used? And what tool? The paper just says that linear mixed-effects regression models were used. Ok, but with which tool? And what exactly were the input variables for the regression models?

Going back to the question of the effect of the variable length, the paper says that "the initial model includes the explanatory variable Length" [2, p. 437]. Length? In a regression? The paper uses length as a binary variable (it distinguishes between short and long), so in principle this is just a simple AB-test - or did I miss something? Or was a whole set of variables added to the initial model and length was just the one that turned out to be significant?
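Just to make the question concrete, here is a minimal sketch of what a linear mixed-effects model with a binary Length factor could look like. This is explicitly not the authors' analysis: the data is generated, the column names are made up, and the random-effect structure (a per-subject random intercept) is my assumption.

    # Minimal sketch (not the authors' analysis): a linear mixed-effects model
    # with a binary Length factor and a per-subject random intercept.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n_subjects = 30
    subject = np.repeat(np.arange(n_subjects), 2)
    length = np.tile(["short", "long"], n_subjects)
    # hypothetical reading times: long identifiers take ~20 s longer on average,
    # plus noise and a per-subject offset
    time = (np.where(length == "long", 60.0, 40.0)
            + rng.normal(0, 10, 2 * n_subjects)
            + np.repeat(rng.normal(0, 5, n_subjects), 2))

    data = pd.DataFrame({"subject": subject, "length": length, "time": time})

    # random intercept per subject, Length as the fixed effect of interest
    result = smf.mixedlm("time ~ length", data, groups=data["subject"]).fit()
    print(result.summary())

Which variables enter the model, which random effects are used, and which tool produced the numbers - that is exactly the kind of information a reporting standard would force authors to state explicitly.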

Actually, it turned out that I had many, many more problems. And it took me quite a lot of time just to figure out which of these problems were real. I should mention that, because I had so much trouble, I contacted Dave, who sent me the raw data set within hours, so I was able to analyze the data on my own in order to obtain the results of the experiment.
 

Why standards such as CONSORT are urgently needed in software science

Finally, I got the raw measurements, was able to recompute some numbers, and everything was fine. But why did I still feel that something is really problematic?

Again, I am well-trained in reading studies. But if it took me hours to understand what was in the paper, I assume it would take many more hours for people who are not trained in these things. So, how can we even expect someone to take studies into account if it takes many hours to read them? The study by Devanbu et al. suggests that we should blame developers for not knowing what is actually known in the field. But if understanding a single study takes many hours, it actually makes sense that people do not read them. Why? Because developers have more to do than spend a whole day reading a single paper. And in case essential information turns out to be missing, the whole day was spent in vain.

So, how come essential information is hard to find in studies, or is even missing? Again, I do not blame the authors of the study mentioned here for forgetting something. But the paper was published in a peer-reviewed journal. How is it possible that it passed the peer-review process while some essential information is missing (again, we need to talk about the review process at some point, but not here)?

I am happy that the paper was published, because otherwise the whole body of knowledge in our field would be even smaller - and it is already unacceptably small (see the study by Ko et al. [3]). But what would have reduced the problem?

This is where research standards come into play. If our field were disciplined enough to apply a relatively simple reporting standard such as CONSORT [4], things would be easier. Such a standard ensures that each paper contains the relevant information in a predictable form, which permits you to find it quickly. For the review process, it is relatively easy to check whether a paper fulfills the standard, i.e. authors can double-check whether the relevant information is included, and reviewers can do this double-checking as well.

Applying such a standard would have another implication: if, for example, a conference applied such a standard, many papers could be rejected directly because they do not fulfill it. The problem identified by Ko et al. (and there are in fact many, many more authors who have documented that evidence is rarely gathered in our field) would vanish: scientific venues would publish only papers that follow scientific rules. This would reduce the problem that readers are confronted with tons of papers whose content cannot be considered part of our body of knowledge.

Yes, there is another problem which makes it hard to imagine that we will finally get to the point where the software science literature contains scientifically relevant studies: people must be willing to execute (and publish) experiments whose results may conflict with their own position. But this is a different issue that I have discussed elsewhere.

Yes, research standards are urgently needed. At least reporting standards. Urgently.

References

  1. Devanbu, Zimmermann, Bird. Belief & evidence in empirical software engineering. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, pages 108–119, 2016. [https://doi.org/10.1145/2884781.2884812]
     
  2. Binkley, Lawrie, Maex, Morrell, Identifier length and limited programmer memory, Science of Computer Programming 74 (2009) [https://doi.org/10.1016/j.scico.2009.02.006]
     
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
     
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]

Thursday, May 7, 2020

Before Doing Science in Software Construction Something Else is Needed: Critical Thinking



Why is Science Needed in Software Construction?

Software construction is a huge, multi-billion-dollar market where new technology appears almost every day (in case you doubt that software is a multi-billion-dollar market, just take a look at the 10 most valuable companies on this planet today). Such new technology comes with multiple claims, and the most general one is that the new technology makes software development easier and hence cheaper.

Given the size of the market, there are good reasons to doubt whether all the technology on the market exists for a good reason - beyond the reason that new technology increases the income of the companies or consultants who propagate it. There are, in fact, reasons to believe that a lot of technology exists although its promised benefit never existed and never will.

In the end, one has to accept that most claims associated with a certain technology are not the result of non-subjective studies. Instead, they are the result of the subjective perceptions or impressions of people who either have strong faith in a new technology, really hope that the technology improves something, or just love a new technology ("faith, hope, and love are a developer’s dominant virtues" [1, p. 937]). Finally, some of these claims are just the result of marketing considerations: claims that are made and spread because they increase the probability of success of the new technology, not because they are true.

That non-subjective studies are rather rare exceptions in the field of software construction is a sad but well-documented phenomenon. For example, Kaijanaho has shown that up to 2012 only 22 randomized controlled trials on programming language features with human participants were published [2, p. 133]. Another example is the paper by Ko et al., who analyzed the literature published at the four leading scientific venues in our field. The authors came to the conclusion that "the number of experiments evaluating tool use has ranged from 2 to 9 studies per year in these four venues, for a total of only 44 controlled experiments with human participants over 10 years" [3, p. 137].

So, what's wrong with this situation? The problem is that new technology causes costs: costs for learning the technology, applying it, and maintaining software written with it. And there are additional, hidden costs. First, there are costs because new technology supersedes existing technology. Existing software often gets rewritten, which means that investments made in the past have to be repeated in the future. And in case existing software is not rewritten, there are additional costs for maintaining the old technology. Old technology causes larger costs because once a technology is no longer taught and no longer applied, it becomes more expensive to maintain, simply because there are fewer and fewer people on the market who master it. An extreme example of this was the Y2K problem, whose costs were to a certain extent caused by the field having forgotten the old technology COBOL.

But there is another, tragic problem: if a new technology appeared that solved a number of the problems we have today, it could not be identified as such. The claims associated with this new technology would just be lost among all the other claims made for today's technologies or for its competitors.

We must not forget that the goal is not to find excuses to stick to old and inefficient technology. The goal is to make progress. But progress does not simply mean applying whatever appeared most recently; it means applying technology that actually improves the field of software construction.

So, what we need are methods to separate good from bad technology. We need to separate knowledge from speculation and marketing claims. And we need to teach such methods to developers to give them the ability to separate knowledge from speculation. This does not mean that we need developers who execute studies. But we need developers who are able to read studies and to distinguish trustworthy studies from bad ones. In the end, we want a discipline that relies on the knowledge of the field as a whole and not on the speculations of individuals.

The Scientific Method

The alternative to subjective experiences and impressions is the application of the scientific method - which is actually the alternative to subjectivity, not just one alternative among others. This does not imply that the term scientific method describes a clear, never-changing, and unique process of knowledge gathering. Instead, it is a collection of things that can be done, should be done, or must be done. And this collection changes over time, because the scientific method does not only change the knowledge of a discipline; the method itself changes as well.

It is not surprising that the scientific method is often critically discussed in the field of software construction - which is more an expression of the field's immaturity than of the community's willingness to generate and gain non-subjective insights. Just to give an impression: even at international academic conferences on software construction, there are discussions about whether the scientific method makes any sense at all. At such places, there are discussions about the need for control groups, the validity of statistical methods, or the validity of experimental setups. All these discussions exist despite the fact that there are tons of literature available from other fields on these topics (literature that gives very clear answers). One could argue that this immaturity exists simply because the field is quite young. In fact, this argument can easily be rejected: in medicine, which is typically considered one of the old fields, most of the experimental results that we accept today as following valid research methods were produced only in the last 30-40 years.

The fundamental part of the scientific method is that there are people who are willing to test the validity of hypotheses. This implies that they are willing to accept results even when these conflict with their own personal and subjective impressions or attitudes. It also means that they not only accept their own experimental results, but also results from others. Although this seems quite natural, it has one important implication: it means that people have established some common agreement about what a valid research result is and what is not.

Scientific Standards

Let's discuss the very general idea of research standards via an example. Let's assume there are two programming techniques A and B, and one would like to test the hypothesis that it takes less time to solve a given problem using technique A than using technique B. So one tests 20 people: 10 solve a given problem using A, 10 solve it using B. Then the times for both groups are measured and compared. This is a standard AB-test for which not only the experimental setup (randomization of participants, etc.) but also the analysis of the data (a t-test, respectively a U-test) has been well-known for decades. But the general question is whether or not one should accept the outcome of the experiment as a valid result.
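For readers who have never seen such an analysis, here is a minimal sketch with made-up solution times (in minutes); it is not meant as a complete analysis, just as an illustration of the two standard tests mentioned above.

    # Minimal sketch of the analysis of the AB-test described above, with
    # made-up solution times (in minutes), 10 subjects per group.
    from scipy import stats

    times_a = [38, 42, 35, 50, 41, 39, 44, 37, 46, 40]  # technique A
    times_b = [47, 55, 49, 52, 60, 45, 58, 51, 48, 53]  # technique B

    # t-test, if we are willing to assume (roughly) normally distributed times ...
    t_stat, p_t = stats.ttest_ind(times_a, times_b)

    # ... otherwise the non-parametric Mann-Whitney U-test
    u_stat, p_u = stats.mannwhitneyu(times_a, times_b, alternative="two-sided")

    print(f"t-test: p = {p_t:.4f}, U-test: p = {p_u:.4f}")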

It turns out that, especially in software construction, people complain a lot about such a standard approach. And in case technique A turns out to be more efficient than B, a large number of people who prefer technique B will find reasons either to ignore the result or to discredit the experiment. Actually, there are quite plausible arguments against the experiment, and the most general one is the problem of generalizability: one doubts that the number of subjects is "representative" enough to draw any conclusion from the experiment. Another doubt is whether the given programming problems represent "something that can be found in the field" or whether they are "general programming problems" at all.

We should not be so ignorant as to reject such objections outright, because there is some truth in them. But we should also not be so open-minded as to take them too seriously, for the following reason: there is no experiment in the world that is able to solve the problem underlying these objections. No matter how many developers are used as subjects in an experiment, one can always argue that the number is too low. And no matter on how many programming problems the techniques are tested, there are always other programming problems on this planet that were not used in the experiment.

In order to overcome this situation, there is a need for some common understanding of the applied methods: there is a need for community agreements. If people agree on how experimental results are to be gathered, there is no need to doubt results that come from experiments following such agreements. Other disciplines identified this problem as well (quite some time ago) and created corresponding scientific standards. Examples of such standards are the CONSORT standard in medicine [4] (which mainly addresses how experiments are to be reported) and the WWC standard used in education [5] (which not only covers how experiments are to be executed, but also handles the process of how experiments should be reviewed).

The need for such community agreements is obvious, and we argued already in 2015 that they are necessary in software construction as well [6]. Today we see movements towards such standards. An example is the Dagstuhl seminar "Toward Scientific Evidence Standards in Empirical Computer Science" that takes place in January 2021 [7].

On Selecting Desired and Ignoring Undesired Results

Such movements are good and necessary. However, we should ask ourselves whether the field of software construction is ready for such standards, because the introduction of research standards entails some serious risks that should be taken into account. But before discussing these risks, I would like to start with some examples.

In the last years, one situation has occurred to me over and over again. A colleague contacts me and asks whether there is an experiment available that supports a certain claim. The colleague's motivation is typically that she or he is trying to find a way to argue for the need for some new technology, and from her or his perspective this argument would be stronger if there were some matching experimental results. At that point I usually start a conversation and ask what happens if there are experimental results that show the opposite. The answer I usually get is that such experiments would be interesting, but would not help in the given situation. In other words: an experimental result (in case it exists) is ignored if it contradicts a personal intention.

Something else happened to me in the last years which is related to an experiment I published in 2010; an experiment that did not show a difference between static and dynamic type systems [8]. Today it seems quite clear that the experiment had problems, and it would have been better if it had never been published. In the meantime, other experiments showed positive effects of static type systems (such as, for example, [9]): taking the sum of experiments into account, the question of whether or not a static type system helps developers can be considered answered (so far). But what happened is that people for whom it is helpful that no difference between static and dynamic type systems was detected tend to refer only to the first study from 2010 and not to the later ones. For example, Gao, Bird and Barr explain the results of the 2010 paper in relative detail, but do not mention the later ones [10]. Again, it seems as if only those results are taken into account that match a given intention - and results that contradict that intention are ignored.

Finally, another situation has occurred more than once or twice. A colleague creates some new technology and asks me for advice on how to construct an experiment that reveals the benefit of the new technology. After some discussion (which often lasts for hours) we typically come to the point where the colleague is really convinced of the benefit of the technology in a certain situation, but thinks that in a different situation the technology could even be harmful. Often, this colleague is in the situation that a PhD needs to be finished and "just the last chapter - the evaluation" needs to be done. And what happens next is often that an experiment is created that concentrates only on the probable positive aspects of the new technology - the (possible) negative aspects are not tested.

What these examples have in common is that people today tend to select only those results that match their own perspectives or attitudes. In other words: even if strong empirical evidence, i.e. a number of experimental results, exists for a given claim, people who do not share this claim still tend to search for singular results that contradict it.

This is comparable to people who advocate homeopathy and select those rare experiments in which homeopathy showed a positive effect - while ignoring the overwhelming evidence we have that it does not work.

The Required and Currently Missing Foundation is Critical Thinking

Probably there is a reason for such behavior, and I assume that this reason has something to do with people's attitude in our field. In our education, people are involved in ideological warfare from the very beginning: procedural versus functional versus object-oriented programming, Eclipse versus IntelliJ, Git versus Mercurial, JavaScript versus TypeScript, Angular versus React, etc. "Choosing a side" seems to play an essential role in software construction. And it actually makes sense to a certain extent. If I master a technology, it is beneficial for me if this technology becomes the leading technology in the field. If I master a technology that no one uses and that no one is interested in, my technological skills are not, and maybe never will be, beneficial to me. Consequently, people advocate the technology they use and try to find reasons why this technology should be used by others as well. And in order to achieve this, all kinds of arguments are applied, and it does not matter whether an argument is actually valid as long as it supports my intentions. This behavior becomes stronger as soon as people start developing their own technology. If someone writes a programming language as part of their PhD, there seems to be a tendency to defend this language.

The idea of defending a self-created technology, or of defending a technology just because one masters it, seems quite natural. But this is probably the core of the problem. We need to communicate from the very beginning that the goal is to make progress. And progress means that we are willing to identify problems. And in case there is strong evidence that a certain technology has serious problems, we must be open-minded enough to take alternative technologies into account. We must be able to accept and apply critical thinking.

Of course, this must not lead to a situation where people switch technology directly after some rumours appear about some better technology - in fact, that would be closer to the situation we have today, where a large number of people accept new technology just for the sake of its being new. Throwing everything away in order to apply something new is closer to actionism than to critical thinking. Critical thinking also does not mean that we find ad hoc arguments against some technology. Critical thinking must not mean that we encourage wild speculation. It just means that we are willing to accept different arguments. Critical thinking means that we are willing to collect and accept pros and cons. It means that we are willing to give up our own position.

This willingness is the very foundation we need in our field. It does not matter whether we define research standards and force people to follow them as long as people are not willing to accept results that conflict with their own positions. Otherwise, research results will either just be ignored, or people will generate and publish only those results that match their own attitudes.

Once we have achieved this kind of critical thinking, and once we are able to pass this idea on to students, we can take the next step towards evidence and give people the ability to distinguish between strong, weak, and senseless arguments - between arguments that are backed up by evidence and those that are not. Then we will have researchers who are willing to design experiments whose results might contradict their own positions. This would be the moment where science could start in our field; the moment when we are ready to apply the scientific method.

References

  1. Stefan Hanenberg, Faith, Hope, and Love: An essay on software science’s neglect of human factors, OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications, October 2010, pp. 933–946. [https://doi.org/10.1145/1932682.1869536]
  2. Antti-Juhani Kaijanaho, Evidence-Based Programming Language Design: A Philosophical and Methodological Exploration, PhD Thesis, Faculty of Information Technology, University of Jyväskylä, 2015. [https://jyx.jyu.fi/handle/123456789/47698]
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]
  5. U.S. Department of Education’s Institute of Education Sciences (IES), What Works Clearinghouse Standards Handbook Version 4.1, January 2020. [https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf]
  6. Stefan Hanenberg, Andi Stefik, On the need to define community agreements for controlled experiments with human subjects: a discussion paper, Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, October 2015, pp. 61–67. [https://doi.org/10.1145/2846680.2846692]
  7. Brett A. Becker, Christopher D. Hundhausen, Ciera Jaspan, Andreas Stefik, Thomas Zimmermann (organizers), Toward Scientific Evidence Standards in Empirical Computer Science, Dagstuhl Seminar, 2021 (to appear) [https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21041]
  8. Stefan Hanenberg, An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Reno/Tahoe, Nevada, USA, ACM, 2010, pp. 22–35. [https://doi.org/10.1145/1932682.1869462]
  9. Stefan Endrikat, Stefan Hanenberg, Romain  Robbes, Andreas Stefik, How do API documentation and static typing affect API usability?, Proceedings of the 36th International Conference on Software Engineering, May 2014, pp. 632–642. [https://doi.org/10.1145/2568225.2568299]
  10. Zheng Gao, Christian Bird, Earl T. Barr, To Type or Not to Type: Quantifying Detectable Bugs in JavaScript, Proceedings of the 39th International Conference on Software Engineering, 2017, pp. 758-769. [https://doi.org/10.1109/ICSE.2017.75]