Before Doing Science in Software Construction Something Else is Needed: Critical Thinking
Why is Science Needed in Software Construction?
Software construction is a huge, multi-billion market in which new technology appears almost every day (if you doubt that software is a multi-billion market, just take a look at the 10 most valuable companies on this planet today). Such new technology comes with multiple claims, and the most general one is that the new technology makes software development easier and hence cheaper. Given the size of the market, there are good reasons to doubt whether all technology on the market exists for a good reason - beyond the reason that new technology increases the income of the companies or consultants who promote it. There are also reasons to believe that a lot of technology exists although its promised benefit never existed and never will.
In the end, one has to accept that most claims associated with a given technology are not the result of non-subjective studies. Instead, they are the result of subjective perceptions or impressions of people who have strong faith in a new technology, who really hope that the technology improves something, or who just love a new technology ("faith, hope, and love are a developer’s dominant virtues" [1, p. 937]). Finally, some of these claims are just the result of marketing considerations: claims that are made and spread because they increase the new technology's probability of success, not because they are true.
It is a sad but well-documented phenomenon that non-subjective studies are rare exceptions in the field of software construction. For example, Kaijanaho has shown that up to 2012 only 22 randomized controlled trials on programming language features with human participants had been published [2, p. 133]. Another example is the paper by Ko et al., who analyzed the literature published at the four leading scientific venues in our field. The authors came to the conclusion that "the number of experiments evaluating tool use has ranged from 2 to 9 studies per year in these four venues, for a total of only 44 controlled experiments with human participants over 10 years" [3, p. 137].
So, what's wrong with this situation? The problem is that new technology causes costs: costs for learning the technology, applying it, and maintaining software written in it. And there are additional, hidden costs. There are costs because new technology supersedes existing technology. Existing software is often rewritten, which means that investments made in the past need to be repeated in the future. And if existing software is not rewritten, there are additional costs for maintaining the old technology. Old technology becomes more expensive to maintain once it is no longer taught and no longer applied, simply because there are no longer people on the market who master it. An extreme example of this was the Y2K problem, whose costs were to a certain extent caused by forgetting the old technology COBOL.
But there is another, tragic problem: if a new technology appeared that solves a number of the problems we have today, it could not be identified. The claims associated with this new technology would simply be lost among all the other claims that exist for today's technology or that will be made for its competitors.
We must not forget that the goal is not to find excuses to stick to old and inefficient technology. The goal is to make progress. But progress does not mean simply applying whatever appeared recently; it means applying technology that improves the field of software construction.
So, what we need are methods to separate good technology from bad. We need to separate knowledge from speculation and marketing claims. And we need to teach such methods to developers so that they can separate knowledge from speculation themselves. This does not mean that we need developers who run studies. But we need developers who are able to read studies and to distinguish trustworthy studies from bad ones. In the end, we want a discipline that relies on the knowledge of the field as a whole and not on the speculations of individuals.
The Scientific Method
The alternative to subjective experiences and impressions is the application of the scientific method, which is the alternative to subjectivity and not just one alternative among others. This does not imply that the term scientific method describes a clear, never-changing, and unique process of knowledge gathering. Instead, it is a collection of things that can be done, should be done, or must be done. And this collection changes over time: the scientific method changes not only the knowledge in a discipline, the method itself changes as well.
It is not surprising that the scientific method is often critically discussed in the field of software construction. This is more an expression of the immaturity of the field than of the community's willingness to generate and gain non-subjective insights. Just to give an impression: even at international academic conferences on software construction, there are discussions about whether the scientific method makes any sense at all. At such places, there are discussions about the need for control groups, the validity of statistical methods, or the validity of experimental setups. All these discussions exist despite the fact that tons of literature from other fields are available on these topics (and give very clear answers). One could argue that this immaturity exists only because the field is quite young. This argument can be easily rejected. In medicine, which is typically considered one of the old fields, most of the experimental results that we accept today as following valid research methods were produced only in the last 30-40 years.
The fundamental part of the scientific method is that there are people who are willing to test the validity of hypotheses. This implies that they are willing to accept results even if these conflict with their own personal and subjective impressions or attitudes. And it means that they accept not only their own experimental results but also results from others. Although this seems quite natural, it has one important implication: people must have established some common agreement on what a valid research result is and what is not.
Scientific Standards
Let's discuss the very general idea of research standards via an example. Let's assume there are two programming techniques A and B, and one would like to test the hypothesis that it takes less time to solve a given problem using technique A than using technique B. So one person tests 20 people: 10 solve a given problem using A, 10 solve it using B. The times of both groups are measured and then compared. This is a standard A/B test for which not only the experimental setup (randomization of participants, etc.) but also the analysis of the data (t-test or U-test) has been well known for decades. But the general question is whether or not one should accept the outcome of the experiment as a valid result.
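To make the example concrete, here is a minimal Python sketch of such an A/B comparison. It is an illustration under stated assumptions, not the procedure of any particular study: the function names (assign_groups, mann_whitney_u), the participant labels, and all measured times are hypothetical and invented for this sketch. The U-test is computed by hand using the normal approximation and without tie correction, which is rough for groups of only 10; a real analysis would rather rely on an established statistics package.

```python
import random
from statistics import NormalDist, mean


def assign_groups(participants, seed=0):
    """Randomly split the participants into two equally sized groups."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mann_whitney_u(times_a, times_b):
    """Return (U statistic of group A, two-sided p-value).

    U counts the pairs (a, b) in which a is larger than b; ties count 0.5.
    The p-value uses the normal approximation without tie correction,
    so it is only a rough sketch for small samples.
    """
    n_a, n_b = len(times_a), len(times_b)
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in times_a
        for b in times_b
    )
    mu = n_a * n_b / 2
    sigma = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z = (u_a - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p_value


# Hypothetical participants, randomly assigned to technique A or B.
participants = [f"P{i:02d}" for i in range(1, 21)]
group_a, group_b = assign_groups(participants, seed=1)

# Hypothetical completion times in minutes (invented numbers, not real data).
times_a = [31, 28, 35, 30, 27, 33, 29, 32, 26, 40]
times_b = [38, 41, 36, 44, 39, 37, 42, 34, 45, 43]

u_a, p_value = mann_whitney_u(times_a, times_b)
print("mean time A:", mean(times_a), "mean time B:", mean(times_b))
print("U =", u_a, "p =", p_value)
```

The random assignment happens before any measurement and does not depend on the participants, which is what allows a difference between the groups to be attributed to the technique. Whether such a result should then count as valid is exactly the question the rest of this section is about.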
It turns out that, especially in software construction, people complain a lot about such a standard approach. And if technique A turns out to be more efficient than B, many of the people who prefer technique B will find reasons either to ignore the result or to discredit the experiment. Actually, there are quite plausible arguments against the experiment, and the most general one is the problem of generalizability. One doubt is whether the number of subjects is "representative" enough to draw any conclusion from the experiment. Another doubt is whether the given programming problems represent "something that can be found in the field" or whether the problems are any "general programming problems at all".
We should not be so ignorant as to reject such objections outright, because there is some truth in them. But we should also not be so open-minded as to take such objections too seriously, for the following reason: there is no experiment in the world that is able to solve the problem that underlies these objections. No matter how many developers take part as subjects in the experiment, one can always argue that the number is too low. And no matter on how many programming problems the techniques are tested, there are other programming problems on this planet that were not used in the experiment.
In order to overcome such a situation, there is a need for some common understanding of the applied methods: there is a need for community agreements. If people agree on how experimental results are to be gathered, there is no need to doubt results that come from experiments that follow such agreements. Other disciplines identified the problem as well (quite some time ago) and created corresponding scientific standards. Examples of such standards are the CONSORT standard in medicine [4] (which mainly addresses how experiments are to be reported) and the WWC standard used in education [5] (which covers not only how experiments are to be executed but also how experiments should be reviewed).
The need for such community agreements is obvious, and we argued already in 2015 that such community agreements are necessary in software construction as well [6]. Today we find movements towards such standards. An example of this is the Dagstuhl seminar "Toward Scientific Evidence Standards in Empirical Computer Science" that takes place in January 2021 [7].
On the Selection of Desired, and the Ignorance of Undesired Results
Such movements are good and necessary. However, we should ask ourselves whether the field of software construction is ready for such standards, because the introduction of research standards entails some serious risks that should be taken into account. But before discussing these risks, I would like to start with some examples.
In the last years, one situation occurred to me over and over again. A colleague contacts me and asks whether there is an experiment available that supports a certain claim. The colleague's motivation is typically that she or he is trying to argue for the need for some new technology, and from her or his perspective the argument would be stronger if there were some matching experimental results. At that point I usually start a conversation and ask what would happen if there were experimental results that show the opposite. I usually get the answer that such experiments would be interesting but wouldn't help in the given situation. In other words: an experimental result (if it exists) is ignored if it contradicts a personal intention.
Something else happened to me in the last years that is related to an experiment I published in 2010, an experiment that did not show a difference between static and dynamic type systems [8]. Today it seems quite clear that the experiment had problems and that it would have been better if it had never been published. In the meantime, other experiments have shown positive effects of static type systems (for example [9]). Taking the sum of experiments into account, the question of whether or not a static type system helps developers can be considered answered (so far). But what happened is that people for whom it is helpful that no difference between static and dynamic type systems was detected tend to refer only to the first study from 2010 and not to later ones. For example, Gao, Bird and Barr explain the results of the 2010 paper in relative detail but do not mention the later one [10]. Again, it seems as if only those results are taken into account that match a given intention, and results that contradict such an intention are ignored.
Finally, another situation occurred more than once or twice. A colleague creates some new technology and asks me for advice on how to construct an experiment that reveals the benefit of the new technology. After some discussions (which often last for hours) we typically come to the point that the colleague is really convinced of the benefit of the technology in a certain situation, but thinks that in a different situation the technology could even be harmful. Often, this colleague is in the situation that a PhD needs to be finished and "just the last chapter - the evaluation" needs to be done. And what often happens next is that an experiment is created that concentrates only on the probable positive aspects of the new technology - the (possible) negative aspects are not tested.
What these examples have in common is that people today tend to select only those results that match their own perspectives or attitudes. In other words: even if strong empirical evidence, i.e. a number of experimental results, exists for a given claim, people who do not share this claim still tend to search for singular results that contradict it.
This is comparable to people who advocate homeopathy and select those rare experiments in which homeopathy showed a positive effect - and ignore the overwhelming evidence we have about homeopathy.
The Required and Currently Missing Foundation is Critical Thinking
There is probably a reason for such behavior, and I assume that this reason has something to do with people's attitudes in our field. In our education, people are involved in ideological wars from the very beginning: procedural versus functional versus object-oriented programming, Eclipse versus IntelliJ, Git versus Mercurial, JavaScript versus TypeScript, Angular versus React, etc. "Choosing a side" seems to play an essential role in software construction. And it actually makes sense to a certain extent. If I master a technology, it is beneficial for me if this technology becomes the leading technology in the field. If I master a technology that no one uses and that no one is interested in, my technological skills are not, and maybe never will be, beneficial to me. Consequently, people advocate the technology they use and try to find reasons why this technology should be used by others as well. To achieve this, all kinds of arguments are applied, and it does not matter whether an argument is actually valid as long as it supports one's intentions. This behavior becomes stronger as soon as people start developing their own technology. If someone writes a programming language as part of their PhD, there seems to be a tendency to defend this language.
The idea of defending a self-created technology, or of defending a technology just because one has mastered it, seems quite natural. But this is probably the core of the problem. We need to communicate from the very beginning that the goal is to make progress, and that progress means that we are willing to identify problems. And if there is strong evidence that a certain technology has serious problems, we must be open-minded enough to take alternative technologies into account. We must be able to accept and apply critical thinking.
Of course, this must not lead to a situation in which people switch technology as soon as rumours appear about some better technology - in fact, this would be closer to the situation we have today, where a large number of people accept new technology just because it is new. Throwing everything away in order to apply something new is closer to action for its own sake than to critical thinking. Critical thinking also does not mean that we find ad hoc arguments against some technology, and it must not mean that we encourage wild speculation. It just means that we are willing to accept different arguments. Critical thinking means that we are willing to collect and accept pros and cons. It means that we are willing to give up our own position.
This willingness is the very foundation we need in our field. It does not matter whether we define research standards in our field and force people to follow them as long as people are not willing to accept results that conflict with their own positions. Without this willingness, research results will either just be ignored, or people will generate and publish only those results that match their own attitudes.
Once we have achieved this kind of critical thinking and once we are able to pass this idea on to students, we can take the next step towards evidence and give people the ability to distinguish between strong, weak, and senseless arguments, between arguments that are backed up by evidence and those that are not. Then we will have researchers who are willing to define experiments whose results might contradict the experimenters' positions. This would be the moment where science could start in our field. This would be the moment when we are ready to apply the scientific method.
References
- Stefan Hanenberg, Faith, Hope, and Love: An essay on software science’s neglect of human factors, OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications, October 2010, pp. 933–946. [https://doi.org/10.1145/1932682.1869536]
- Antti-Juhani Kaijanaho, Evidence-Based Programming Language Design: A Philosophical and Methodological Exploration, PhD thesis, Faculty of Information Technology, University of Jyväskylä, 2015. [https://jyx.jyu.fi/handle/123456789/47698]
- Andrew J. Ko, Thomas D. LaToza, Margaret M. Burnett, A practical guide to controlled experiments of software engineering tools with human participants, Empirical Software Engineering, 20(1), February 2015, pp. 110–141. [https://doi.org/10.1007/s10664-013-9279-3]
- The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]
- U.S. Department of Education’s Institute of Education Sciences (IES), What Works Clearinghouse Standards Handbook Version 4.1, January 2020. [https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf]
- Stefan Hanenberg, Andi Stefik, On the need to define community agreements for controlled experiments with human subjects: a discussion paper, Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, October 2015, pp. 61–67. [https://doi.org/10.1145/2846680.2846692]
- Brett A. Becker, Christopher D. Hundhausen, Ciera Jaspan, Andreas Stefik, Thomas Zimmermann (organizers), Toward Scientific Evidence Standards in Empirical Computer Science, Dagstuhl Seminar, 2021 (to appear) [https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21041]
- Stefan Hanenberg, An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Reno/Tahoe, Nevada, USA, ACM, 2010, pp. 22–35. [https://doi.org/10.1145/1932682.1869462]
- Stefan Endrikat, Stefan Hanenberg, Romain Robbes, Andreas Stefik, How do API documentation and static typing affect API usability?, Proceedings of the 36th International Conference on Software Engineering, May 2014, pp. 632–642. [https://doi.org/10.1145/2568225.2568299]
- Zheng Gao, Christian Bird, Earl T. Barr, To Type or Not to Type: Quantifying Detectable Bugs in JavaScript, Proceedings of the 39th International Conference on Software Engineering, 2017, pp. 758–769. [https://doi.org/10.1109/ICSE.2017.75]