Refactoring in Computational Notebooks

ERIC S. LIU, DYLAN A. LUKES, WILLIAM G. GRISWOLD, UC San Diego, USA

Due to the exploratory nature of computational notebook development, a notebook can be extensively evolved even though it is small, potentially incurring substantial technical debt. Indeed, in interview studies notebook authors have attested to performing on-going tidying and big cleanups. However, many notebook authors are not trained as software developers, and environments like JupyterLab possess few features to aid notebook maintenance. As software refactoring is traditionally a critical tool for reducing technical debt, we sought to better understand the unique and growing ecology of computational notebooks by investigating the refactoring of public Jupyter notebooks. We randomly selected 15,000 Jupyter notebooks hosted on GitHub and studied 200 with meaningful commit histories. We found that notebook authors do refactor, favoring a few basic classic refactorings as well as those involving the notebook cell construct. Those with a computing background refactored differently than others, but not more so. Exploration-focused notebooks had a unique refactoring profile compared to more exposition-focused notebooks. Authors more often refactored their code as they went along, rather than deferring maintenance to big cleanups. These findings point to refactoring being intrinsic to notebook development.

CCS Concepts: • Software and its engineering → Software evolution; Maintaining software.

Additional Key Words and Phrases: computational notebooks, end-user programming, refactoring

1 INTRODUCTION

Computational notebook environments like Jupyter have become extremely popular for performing data analysis.
In September 2018 there were over 2.5 million Jupyter notebooks on GitHub [29], and by August 2020 this figure had grown to over 8.2 million [13]. A computational notebook consists of a sequence of cells, each containing code, prose, or media generated by the notebook's computations (e.g., graphs), embodying a combination of literate programming [19] and read-eval-print-loop (REPL) interaction. Unlike a REPL, any cell can be selectively executed at any time. The typical language of choice is Python. Such an environment affords rapid development and evolution, enabling exploratory analysis: write code into a cell, run the cell, examine the output text or media rendering, then decide how to change or extend the notebook [31]. For example, an early version of a notebook for analyzing deep-sea sensor data might reveal some out-of-range readings, prompting the author to add a cell upstream for data cleaning. Later, the same notebook might require the tuning of parameters in a downstream cell to render a more insightful plot. Then the author might send this notebook to someone else who can read it like a literate program. Such data-driven development could compromise the conceptual integrity [4, Ch. 4] of the notebook as various explorations are developed, evolved, and abandoned, thus undermining the author's ability to continue making quick, correct changes [18]. Moreover, many notebook authors identify as scientists (e.g., chemists), and so may not have been exposed to concepts and skills related to reducing technical debt [5, 8] and maintaining conceptual integrity, such as refactoring and software design principles. Compounding this problem is that, to date, notebook

Author's address: Eric S. Liu, Dylan A. Lukes, William G. Griswold; esl036@ucsd.edu, dlukes@eng.ucsd.edu, wgg@cs.ucsd.edu; Computer Science & Engineering, UC San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0404, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2022 Association for Computing Machinery. 1049-331X/2022/8-ART111 $15.00 https://doi.org/10.1145/3576036 ACM Trans. Softw. Eng. Methodol.

environments provide little tool assistance for refactoring, which could especially benefit untrained notebook authors. That said, it is unclear how much computational notebook authors could benefit from the application of such techniques and tools. Compared to traditional computer programs, notebooks are typically quite small – hundreds of lines – and have short lives, with development often ending at the moment of final insight. In recent interview studies, authors who write notebooks to explore data sets have attested that they tidy their code as they go and periodically perform big cleanups [18, 31]. It is unknown, however, whether these reported activities are actually happening, and if so, whether they include recognized maintenance activities like refactoring. Nor is it known whether authors who are employing notebooks for other purposes behave similarly. Insights into these questions could inform, among other things, the ongoing development of notebook IDEs like JupyterLab. This paper reports on an investigation of refactoring in computational notebooks. Little is known about end-user or notebook refactoring in the wild.
We randomly selected 15,000 Jupyter notebooks hosted on GitHub and studied 200 with meaningful commit histories. We then visually inspected each notebook's commits, identifying refactorings throughout its history. Also, expecting that factors such as notebook use (genre) and author background (as indicated by their current profession) would influence notebook refactoring behavior, we classified the notebooks according to genre and author background. Finally, we analyzed these refactorings and related changes with regard to the above questions and factors. In our analysis, we found that notebook authors do indeed refactor. They use a different mix of refactoring operations than traditional developers: they do not often use object-oriented language features such as classes (despite their availability) and commonly use (and refactor) notebook cells, favoring non-OO refactorings like Change Function Signature, Extract Function, Reorder Cells, and Rename. The refactorings employed vary by genre, with exploration-focused genres favoring the above refactorings in the above order, whereas exposition-focused genres favored these in a different order, with Split Cell at the top. Notebook authors with a computing background refactor differently than others, but not more. Finally, we found that notebook authors generally refactor their notebooks as they go, rather than performing big code cleanups.

2 RELATED WORK

Much is known about how traditional software developers refactor, but little is known about how end-user programmers, or computational notebook authors specifically, refactor. However, notebook authors do attest to tidying and cleaning up their notebooks.

2.1 Traditional Software Developers

The most extensive study of refactoring by traditional software developers was performed by Murphy-Hill et al. [26]. Drawing on data from CVS source code repositories and event logs from the Eclipse IDE, they drew a comprehensive picture of how software developers refactor.
They found that traditional developers refactor frequently, favoring refactoring as they develop over concentrated overhauls. They also found that despite the availability of numerous refactoring operations in the IDE, developers overwhelmingly chose to refactor by hand, even experts who were well-versed in the refactoring tools. Finally, they found that Rename was by far the most applied refactoring, and that the vast majority of applications are comprised of just a few refactoring operations (Rename, Extract Local Variable, Inline Method, Extract Method, Move, and Change Method Signature), with a long tail of lightly-used refactorings. An important take-away from Murphy-Hill et al. is that since traditional developers overwhelmingly choose to refactor by hand, the lack of refactoring tools in Jupyter is not necessarily a deterrent to refactoring by notebook authors.

2.2 End-User Programmers

Little is known about whether or how end-user programmers or exploratory programmers refactor. Stolee and Elbaum found that Yahoo! Pipes programs would benefit from refactoring, and developed a tool to automatically detect and refactor "smelly" pipe programs [32]. In a lab study they found that pipe developers generally preferred refactored pipes, although there was a preference for seeing all the pipes at once rather than abstracting portions away into a separate pipe. Badame and Dig similarly found that spreadsheets would benefit from refactoring, and developed a refactoring tool for Excel spreadsheets in which spreadsheet authors could select and apply refactorings to selected cells [1]. As in the pipes study, a lab study found that the refactored spreadsheets were preferred, but participants also preferred seeing all the cells versus abstracting away details. Refactorings with their tool were completed more quickly and accurately than by hand.
These studies tell us that end-user programmers want a level of control over refactoring to achieve their preferred results. However, they do not tell us whether end-user programmers refactor in the wild.

2.3 Computational Notebook Authors

Rule et al. examined over 190,000 GitHub repositories containing Jupyter notebooks, finding that the average computational notebook was short (85 lines) and that 43.9% could not be run simply by running all the cells in the notebook top-to-bottom [31]. They also found that most notebooks do not stand alone, as their repositories contain other notebooks or instructions (e.g., a README file). Sampling 897 of these repositories (over 6,000 notebooks), Koenzen et al. found that code clones (duplicates) within and between notebooks of a GitHub repository are common, with an average of 7.6% of cells representing duplicates [20]. These results suggest, at least for notebooks on GitHub, that while most notebooks are small, they are not trivial computational artifacts. Interview studies have shed additional light on the characteristics of the development of computational notebooks. Chattopadhyay et al. interviewed and surveyed data scientists regarding the challenges they encountered in notebook development [6]. Among many concerns, their respondents reported that refactoring was very important, yet poorly supported by their notebook environments. The study does not report on their actual coding practices. Because we sampled from public GitHub notebooks, our study reports on the evolution of notebooks not only from data scientists, who often have significant training in computer programming, but also from authors from a variety of other backgrounds. Rule et al.
interviewed what they called notebook analysts (i.e., authors of Exploratory Analysis notebooks), many of whom said they need to share their results with others and sometimes wish to document the results of their work for their own use [31]. In this regard, notebooks are valuable in enabling a narrative presentation of code, text, and media such as graphs. However, notebooks are developed in an exploratory fashion, resulting in the introduction of numerous computational alternatives, often as distinct cells or the iterative tweaking of parameters. Many analysts attested that after their exploratory analysis is complete, their notebooks are often reorganized and annotated in a cleanup phase for their final intended use. Often a notebook is used as source material for a new, clean notebook or a PowerPoint presentation. In Kery et al.'s interviews, analysts said that code and other elements may be introduced anywhere in a notebook as a matter of preference, but that exploratory alternatives are often introduced adjacent to the original computation [18]. The resulting "messy" notebook is "cleaned" up by cells being deleted, moved, combined, or split, or by defining new functions. Such changes often occur as incremental "tidying" during the exploration phase, when the messiness is found to interfere with – slow down – ongoing exploration. When notebooks grow too large for Jupyter's simple IDE support, computations are sometimes moved off into new notebooks or libraries. Building on these findings, Kery et al. developed and evaluated tools for fine-grained tracking and management of code versions within a notebook [16, 17]. These interview studies make it clear that notebook analysts feel maintenance is important because exploratory development creates technical debt that both interferes with productivity and runs counter to the needs of sharing and presentation.
Our prestudy identified four other notebook genres, raising the question of whether their notebooks generate similar needs. As interview studies, however, they establish what a number of notebook authors think they do, serving as hypotheses about what their colleagues really do. It is also unclear whether incremental tidying and concentrated cleanup phases during notebook development are achieved by what would be recognized as refactoring, that is, "changing the structure of a program without changing the way it behaves" [26]. If so, there remain questions of which refactoring operations are favored, and how those vary according to notebook genre or author background. The present study provides concrete evidence to evaluate the above hypotheses relating to notebook maintenance.

2.4 Refactoring Support for Jupyter

Refactoring support is minimal in the popular Jupyter Notebook and JupyterLab environments. Both support only basic cell-manipulation operations: Split Cell, Merge Cell Up/Down, and Move Cell Up/Down. The refactoring operations most frequently observed by Murphy-Hill et al. are absent: Rename, Extract Local Variable, Inline Method, and Extract Method [26]. As reported by Chattopadhyay et al. (and discussed in the previous subsection), data scientists, at least, found such omissions to be problematic [6]. Since 2020 (after we sampled the notebooks for this study), Rename can be added to JupyterLab by installing the JupyterLab-LSP extension [22]. More recently, in 2021, JetBrains released a third-party front-end for Jupyter, called DataSpell, that provides many of the same refactoring operations as its PyCharm IDE for traditional Python development, including those cited above as missing in the Jupyter environments. It also provides the next five most frequently observed by Murphy-Hill et al.: Move, Change Method Signature, Convert Local to Field, Introduce Parameter, and Extract Constant.
The present study provides insight on the possible usefulness of the refactoring support provided by these notebook environments. Renaming is an interesting case. To rename a variable in a notebook by hand, one must find and replace each usage. In a complex notebook, it is easy to miss a usage of an identifier. In contrast, developers using a traditional IDE can expect instantaneous feedback when a variable is used without being defined. Even without IDE support, they will become aware of it at runtime. However, in the notebook-kernel execution model, this feedback can be even more delayed, because renaming an identifier amounts to introducing a new identifier. That is, after renaming, there are two potential problems: the old identifier (name) remains bound to its value in the runtime kernel, and the new identifier is uninitialized. Regarding the former, a cell mistakenly referring to the old identifier would access this old value, perhaps disguising the error. Regarding the latter, the notebook author must re-execute the cells upstream of the defining cell to complete the rename. If the notebook author were using the nbsafety custom kernel, it would suggest what cells to re-execute (when a cell dependent on the rename is executed) [23]. The expectation, however, is that a notebook's refactoring operations would be behavior preserving, and thus take the dynamic state of the runtime kernel into account. Head et al. developed a tool for gathering up and reorganizing "messy" notebooks [15]. The tool identifies all the cells involved in producing a chosen result, which can then be reordered into a linear execution order and moved to a new location in the notebook or a different (e.g., new) notebook. Participants in a user study attested that they preferred cleaning their notebooks by copying cells into a new notebook, and that the tool helped them do so, among several other tasks.
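The stale-binding hazard of a manual rename can be reproduced in a few lines. The sketch below simulates a kernel's single shared namespace with a dictionary; the cell contents and variable names are hypothetical illustrations, not drawn from any studied notebook.

```python
# A Jupyter kernel holds one live namespace shared by all cells.
# Simulate it with a dict and exec() (hypothetical cells for illustration).
kernel_ns = {}

def run_cell(source, ns=kernel_ns):
    """Execute a cell's source in the shared kernel namespace."""
    exec(source, ns)

run_cell("raw_data = [3, 1, 2]")       # Cell 1: defines `raw_data`
run_cell("result = sorted(raw_data)")  # Cell 2: uses it

# The author renames `raw_data` to `readings` in Cell 1's source and
# re-runs only that cell:
run_cell("readings = [3, 1, 2]")

# Problem 1: the old identifier is still bound in the kernel ...
old_name_still_bound = "raw_data" in kernel_ns
# ... so Cell 2, missed by the manual rename, still runs without a
# NameError, disguising the incomplete rename:
run_cell("result = sorted(raw_data)")
```

Restarting the kernel and re-running all cells would surface the missed usage as a `NameError`, which is why behavior-preserving rename support for notebooks would need to account for kernel state.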
3 STUDY DESIGN

A challenge for our study was how to gain access to the histories of a sizable number of computational notebooks. As one example, data from an instrumented notebook development environment would be attractive, but Jupyter is not instrumented to capture the detailed edit histories from which manual refactorings could be detected, nor was it feasible for us to instrument Jupyter for distribution to a sizable number of notebook authors for an extended period of time. In this regard, the GitHub commit histories of public Jupyter notebooks provided an attractive opportunity. One limitation is that it is not possible to detect refactorings that occur in the same commit in which the refactored code is introduced. This underreporting pessimizes our results to a degree, and our results should be interpreted in this context. Limitations and threats related to the use of source code control and GitHub are detailed at the beginning of Section 6. We employed a multi-stage process of automated extraction and analysis, as well as visual inspection, to achieve depth, breadth, and accuracy for the following research questions:

RQ-RF: How much and with what operations do computational notebook authors refactor?
RQ-GR: How does refactoring use vary by computational notebook genre?
RQ-BG: How does refactoring use vary according to computational notebook author background? In particular, do those with a CS background refactor differently than others?
RQ-CC: What is the extent of tidying vs. cleanups of code?

3.1 Determining a Catalog of Notebook Genres & Refactorings

To determine the viability of the study and fix a stable catalog of refactoring operations throughout a large-scale analysis, we performed a prestudy. In June 2019, we downloaded Adam Rule et al.'s data set of 1,000 repositories [31], containing approximately 6,000 notebooks, and randomly sampled 1,000.
Using notebook metadata downloaded via the GitHub API, we filtered out inappropriate notebooks: those generated by checkpointing (automated backup), those lacking 10 commits with content changes, and completed fill-in-the-blank homework assignments. A commit can have no notebook content changes when only notebook metadata has changed, e.g., a cell's execution count. Completed fill-in-the-blank homework assignments are uninteresting because they were designed to not require evolution. From the remaining notebooks we selected the 50 notebooks with the most evolutionary development in terms of the number and size of their commits. We next visually inspected every commit in the 50 notebooks. The visual inspection of a GitHub commit is normally enabled by the display of a "diff" computed by git diff, which provides a concise summary of what has been added, deleted, and changed since the previous commit. Because a Jupyter notebook file is stored in JSON format and includes program data and metadata, we instead used nbdime, a tool for diffing and merging Jupyter notebook files. Using nbdime, we visually inspected each notebook, recording a commit number and the context for every change, except those that were merely adding code. Among the changes recorded were manipulations of code and cells and changes to function signatures and imports. In the end, 35 of the 50 notebooks contained such notable changes. To extract a catalog of refactoring operations from these changes, we drew from existing classical refactorings, as well as defining notebook-specific refactorings where necessary. One author inspected all commit diffs and mapped them to refactorings as defined by Fowler [10, 11] and refined by Murphy-Hill et al. [26]. The same author analyzed the remaining changes to define notebook-specific refactorings.
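The distinction between metadata-only commits and content changes can be made concrete. The sketch below assumes the nbformat v4 JSON layout (a top-level "cells" list whose entries carry "cell_type" and "source") and extracts only code-cell sources so two revisions can be compared; it is an illustration of the idea, not the nbdime tooling actually used in the study.

```python
import json

def code_sources(nb_json):
    """Return each code cell's source, ignoring outputs, execution
    counts, and other metadata (nbformat stores source as a string
    or a list of lines, so "".join() handles both)."""
    nb = json.loads(nb_json)
    return ["".join(c["source"]) for c in nb["cells"]
            if c["cell_type"] == "code"]

def content_changed(old_nb_json, new_nb_json):
    """True iff a commit changed cell code, not just metadata."""
    return code_sources(old_nb_json) != code_sources(new_nb_json)

# Two revisions differing only in a cell's execution count:
rev1 = json.dumps({"cells": [{"cell_type": "code", "execution_count": 1,
                              "source": ["x = 1\n"], "outputs": []}]})
rev2 = json.dumps({"cells": [{"cell_type": "code", "execution_count": 7,
                              "source": ["x = 1\n"], "outputs": []}]})
# A revision that actually edits the code:
rev3 = json.dumps({"cells": [{"cell_type": "code", "execution_count": 1,
                              "source": ["x = 2\n"], "outputs": []}]})
```

Here `content_changed(rev1, rev2)` is false, so a commit producing only that difference would be culled as having no content change, while `content_changed(rev1, rev3)` is true.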
A mapping was brought to all authors for evaluation of its validity as a refactoring as defined by Fowler – a structural change lacking external behavioral effects – and its classification was finalized according to structural distinctiveness. In all, we compiled 15 refactorings, as listed on the left of Table 4. During the above visual inspection, we also determined the purpose, or genre, of the notebook, examining the entire notebook at its first and last commit. Through iterative refinement, we settled on five genres. The Exploratory Analysis genre is the classical use-case for a computational notebook, an attempt to understand a data set through computational analysis, iteratively applied until an understanding is reached. Such a notebook might support the publication of a research article or making a policy decision. The other genres operate largely in service of this application. The Programming Assignment genre consists of notebooks developed to complete a programming assignment, but not the fill-in-the-blank type. Many are developed as a final project, and thus are relatively open-ended. However, the primary purpose of writing the notebook is to learn a subject like exploratory analysis. Analytical Demonstration notebooks demonstrate an analytical technique, such as a method for filtering out bad data. Technology Demonstration notebooks demonstrate how to use a particular technology, such as TensorFlow. Finally, Educational Material notebooks support the conduct of a course, serving as "literate computing" lecture notes, textbooks, or labs, for example. There are two sub-genres here, notebooks teaching exploratory analysis and those teaching a traditional academic subject, such as chemistry.

3.2 Selecting Notebooks for the Study

Targeting a sample of 200 notebooks for analysis, we first randomly sampled 15,000 Jupyter notebooks from the 4,343,212 publicly available on GitHub at the time, October 2019.
We then rejected checkpoint files and repositories with fewer than 10 commits, leaving us with 6,694 notebooks. We downloaded these notebooks, allowing us to automatically cull those with fewer than 10 non-empty commits, leaving us with 278. We then employed random sampling until we reached 200 notebooks, not only rejecting notebooks for insufficient evolution as in the prestudy, but also for having the same primary author as a previously selected notebook, as our goal was to identify a diverse set of notebooks. Encountering two notebooks from an author was likely because GitHub users commonly copy someone else's repository, with all its history, called a clone. Cloning is particularly likely for online textbooks and other educational materials, such as labs, which may have thousands of clones. These repositories can contain hundreds of notebooks, most by the same author, further increasing the chance of sampling two notebooks by the same author. The primary authorship of a notebook was determined by visual inspection, as described below in Section 3.4. In the end, to reach 200 notebooks we sampled 273 of the 278, rejecting 66 for insufficient evolution and 7 for repeat authorship. We encountered no repeat (cloned) notebooks. Six of the seven repeat authors were for Educational Material notebooks. The seventh was a low-probability event: a second notebook from an author who had published only three notebooks, none cloned.

3.3 Identifying Refactorings

We used nbdime on the 200 selected notebooks to find all the refactorings as identified in the prestudy. Each refactoring was recorded as an entry in a master table, consisting of the unique notebook ID, notebook genre, commit hash, and refactoring operation code. Due to the large number of commit diffs inspected, identifying refactorings was subject to error due to fatigue.
Consequently, a second author audited the accuracy of the inspector's identification of refactorings by randomly sampling and reexamining 10% of the notebooks (20) and their commit diffs (345). As the data is segmented according to commit diff, the error rate is calculated as the number of reinspected diffs that contained any kind of coding error (missed refactoring, extra refactoring, or misidentified refactoring) divided by the total number of commit diffs that were reinspected. Using the same Negotiated Agreement protocol employed exhaustively for genre, described immediately below, 34 diffs were determined to contain some kind of error, an error rate of 9.9%. Of those errors, over half were due to missing a refactoring (21 commit diffs). Next were extra refactorings (10 commit diffs), with only 1 commit diff containing a mislabeled refactoring and 2 commit diffs containing multiple of these errors. The most common missed refactoring was Rename (11 commit diffs, including one containing multiple different errors), perhaps because the textual difference of a Rename is more visually subtle than others. Still, Rename is ranked as one of the top refactorings in our results.

3.4 Identifying Genre and Author Background

Due to the complexity of determining a notebook's genre, we applied a coding methodology called Negotiated Agreement [12], which produces 100% agreement through a labor-intensive process. First, two authors made their own passes over all the notebooks. At a minimum, they examined the entirety of each notebook at its first and last commits, as well as examining the repository's README file for clues as to the intended use of the repository. The two authors then discussed each notebook on which they disagreed to attempt to reach a common conclusion. For the remaining notebooks where disagreement or uncertainty persisted, the third author was brought into the discussion for a final decision.
Table 1 details the results of each pass.

Table 1. Raw and negotiated agreement between two authors on classification of notebook genre. The 14 notebooks remaining unclassified after negotiation were classified in a discussion with the third author.

                           Count   Proportion
    Raw Agreement            66       33%
    Negotiated Agreement    186       93%
    Undecided/Disagree       14        7%

We also wanted to learn the technical background – primary expertise – of each notebook author, especially whether an author was a computer scientist or not. For this purpose, the categories of background were construed broadly (e.g., Life Sciences). To determine a notebook author's background, one of the authors of this paper inspected each notebook and its repository for author information. The first consideration was to identify the primary author of each notebook. In many cases the owner of a repository is not the notebook author, so it was necessary to inspect the notebook's commit history. In some cases one person originated a notebook and then later someone took it over. In most cases, the originating author wrote the vast majority of the notebook and the authorship was easily attributed to the originating author. In the case of relatively equal contributions, the tie went to the originating author. The next consideration was actually determining the primary author's background. We decided that the best indication of one's primary expertise is their profession, as indicated by their role or job title, as they are quantifiably expert enough to get paid to use their expertise. An author's educational background was consulted when the job title was vague, such as "Senior Engineer". Students were classified according to their program-level academic affiliation. This process required extensive research. The inspector, the same author who determined each notebook's primary author, started with the author's GitHub profile, and, if necessary, e-mail addresses extracted from commits.
We then searched the internet for publicly available data including personal websites, blog posts, company and university websites, and public information provided by LinkedIn (i.e., without use of a login). The inspector was able to determine the backgrounds of 189 of the 200 notebook authors. The results are shown in Table 5a. To assess the accuracy of the result, a second author audited the inspector's determination of primary author and their background by randomly sampling and reexamining 20% of the notebooks, 40 total. Using the same Negotiated Agreement protocol described above, the error rate was determined to be 5%. The two misclassified backgrounds were for notebook authors with a multidisciplinary job role and education.

3.5 Extraction of Notebook Statistics

To perform automated analyses, such as the density of commits over time, we mined the Git history of the notebooks and their repositories, and then developed Jupyter notebooks to perform the analyses. Analyses of code employed the RedBaron library [30].

4 RESULTS

4.1 Data Availability

Data relating to our notebook selection process and supporting the results in this and the following sections are available under a CC-BY 4.0 license at https://figshare.com/s/4c5f96bc7d8a8116c271.

4.2 RQ-RF (How much and with what operations do notebook authors refactor?)

4.2.1 RQ-RF How Much: Computational Notebook Authors Refactor, with High Variation. To understand whether computational notebook authors refactor and how much, we extracted the number of refactorings performed and the number of commits containing those refactorings, per genre. Recall that, with this methodology, it is not possible to detect refactorings that occur in the same commit in which the refactored code is introduced, underreporting the total amount of refactoring. Table 2 shows that notebook authors refactor, with 13% of commits containing a refactoring (about 1 in 7.7).
Of those commits, they contain an average of 1.30 refactorings per commit. Figure 1 enumerates the frequencies of refactorings within refactoring commits, with a maximum of 6 in a single commit. Over 100 notebooks contain 2 or fewer refactorings. The maximum number of refactorings in a notebook is 21.

Fig. 1. The frequency of refactorings in commits that contain refactorings.

4.2.2 RQ-RF Operations: Computational Notebook Authors Favor a Few Non-Object-Oriented Refactorings. To learn what refactoring operations are employed by notebook authors, we clustered our extracted refactorings by operation type and sorted them by frequency, as shown in the rightmost column in Table 4. Notably, just a few refactoring operations account for most of the applied refactorings. The top four operations account for over 57% of refactorings: Change Function Signature, Extract Function, Reorder Cells, and Rename. Including the next one, Split Cell, reveals that a third of the refactoring operations account for over two-thirds of the refactorings. Notably, two of these top five refactoring operations (and three of the top six) refactor cells, which are unique to computational notebooks. All of the cell-refactoring operations together comprise 40% of the total number of observed refactorings.

4.3 RQ-GR: PAs Appear Exploratory; Exploratory and Expository Genres are Distinct

As discussed in Section 2.3, Exploratory Analysis (EA) notebooks are the classic use-case for computational notebooks. The genres of Educational Materials (EMs), Technology Demonstrations (TDs), and Analytical Demonstrations (ADs) are different in their focus on exposition for a wide audience rather than the exploration of a novel data set.
This raises two questions: (a) are Programming Assignments (PAs) refactored more like Exploratory Analyses or the exposition-focused genres, and (b) to what extent are these genres distinct from each other when it comes to refactoring? Regarding the first, although Programming Assignments are often exploratory analyses, they are not truly open-ended, novel investigations. Regarding the second, evolution (refactoring) in exposition-oriented notebooks would be expected to be driven more by, for example, changes in technology or the goal of communicating clearly to a wide audience than by the effects of exploration, which would be expected to be generally absent. Due to the small number of Analytical Demonstration notebooks, we omit them from the following analysis.

At a first level of analysis, we see similarities in refactoring between the Exploratory Analysis and Programming Assignment genres, and differences with the expository genres. Referring back to Table 2, for one, 16% of their commits contain refactorings, whereas for the others just 9% do. Two, they have 13% and 10% of notebooks each with no refactorings, whereas the expository genres hover around 30%. Their percentage of refactoring commits that contain only refactorings is well below the other genres. The expository genres exhibit similarities among each other as well, but there are some differences. Of the Technology Demos' refactoring commits, 7.8% were refactoring-only commits, compared to 5.6% for Educational Materials. Likewise, for the Technology Demos, the Rename refactoring occurs three times as often and Reorder Cells a quarter as often. To further explore this possible exploratory-expository genre clustering, we applied statistical tests for both refactoring rate and the profile of selected refactorings. Note that these tests are employed here as descriptive statistics, as we did not have a specific hypothesis at the outset.
Notebook Genre              | NBs | NB size | NBs w/ no Ref. | All Commits | Commits/NB | Commits w/ Ref. | Ref.-Only Commits | Total Ref's | Ref's/Commit
Exploratory Analysis (EA)   | 79  | 12,299  | 10 (13%)       | 1415        | 17.9       | 221 (16%)       | 10 (4.5%)         | 299         | 1.35
Programming Assignment (PA) | 41  | 9,905   | 4 (10%)        | 717         | 17.5       | 113 (16%)       | 4 (3.5%)          | 144         | 1.27
Educational Material (EM)   | 41  | 6,036   | 14 (34%)       | 843         | 20.3       | 72 (9%)         | 4 (5.6%)          | 100         | 1.39
Technology Demo. (TD)       | 32  | 6,779   | 10 (31%)       | 651         | 20.6       | 51 (8%)         | 4 (7.8%)          | 57          | 1.12
Analytical Demo. (AD)       | 7   | 20,435  | 2 (29%)        | 128         | 18.3       | 18 (14%)        | 1 (5.6%)          | 19          | 1.06
All Notebooks               | 200 | 9,926   | 40 (20%)       | 3754        | 18.8       | 475 (13%)       | 23 (4.8%)         | 619         | 1.30

Table 2. Evolution Statistics by Notebook Genre. Notebook (NB) size is the average number of non-comment code characters, taken from the last commit of each notebook.

Author Background | NBs | NB size | NBs w/ no Ref. | All Commits | Commits/NB | Commits w/ Ref. | Ref.-Only Commits | Total Ref's | Ref's/Commit
DS & ML           | 81  | 8,893   | 18 (22%)       | 1439        | 17.8       | 183 (13%)       | 7 (3.8%)          | 234         | 1.28
CS & IT           | 40  | 10,084  | 8 (20%)        | 793         | 19.8       | 92 (12%)        | 6 (6.5%)          | 136         | 1.48
Other             | 79  | 10,286  | 14 (18%)       | 1522        | 19.3       | 200 (13%)       | 10 (5.0%)         | 249         | 1.25
All Notebooks     | 200 | 9,926   | 40 (20%)       | 3754        | 18.8       | 475 (13%)       | 23 (4.8%)         | 619         | 1.30

Table 3. Evolution Statistics by Author Background. Notebook size is taken from the last commit of each notebook in our snapshot. DS & ML stands for Data Science and Machine Learning. CS & IT stands for Computer Science and Information Technology.

Table 4. Frequency of refactoring operations by genre.

Table 5. Backgrounds of primary notebook authors (a) and frequency of refactoring operations by author background (b). DS & ML stands for Data Science and Machine Learning. CS & IT stands for Computer Science and Information Technology.
To reveal differences in refactoring rates between genres (excluding AD), we computed the average rate of refactoring per commit for each notebook, and ran a Kruskal-Wallis H Test followed by a post-hoc analysis via Dunn's Multiple Comparisons Test [9, 21]. We report these results in Table 6 (A). There is strong evidence that the rates of refactoring differ for the pairwise combinations that are not between exploratory genres (EA vs. PA) or between expository genres (EM vs. TD). Within the like-genre pairings, we find no evidence to suggest a difference in refactoring rates.

To examine the distinctiveness of the refactoring profile of each genre (excluding AD), we ran Pearson's χ² tests of independence for each pair of genres [27]. These tests excluded the Analytical Demos and were run for the top 10 refactorings overall (excluding the bottom 5) because Pearson's test assumes non-zero observation counts (cf. Table 4). We report these results in Table 6 (B). Our results suggest that except for the Exploratory Analysis / Programming Assignment pairing (p=0.192), all genres' refactoring distributions are pairwise distinct. According to the obtained Cramér's V values, the effect sizes are modest.

Genres          | A. Ref's per Commit (H, p) | B. Ref. Type Frequencies (χ², p, V)
Omnibus         | 29.86, <0.001              | 65.91, <0.001, 0.21
EA vs. PA       | –, 0.388                   | 12.39, 0.192, 0.15
EA vs. EM       | –, <0.001                  | 28.97, 0.002, 0.23
EA vs. TD       | –, <0.001                  | 21.43, 0.018, 0.20
PA vs. EM       | –, 0.013                   | 21.75, 0.018, 0.20
PA vs. TD       | –, <0.001                  | 19.92, 0.021, 0.19
EM vs. TD       | –, 0.362                   | 20.96, 0.018, 0.19
EA+PA vs. EM+TD | 28.09, <0.001              | 30.89, 0.002, 0.24

Table 6. A. Result of the Kruskal-Wallis H test for refactorings per commit (by notebook), as well as post-hoc analysis by genre pairing via Dunn's multiple comparison test. The post-hoc test does not return an H value. B. Results of Pearson's χ² tests of independence between all genres, post-hoc analysis via pairwise Pearson's χ² tests (with Yates' correction), and corresponding Cramér's V values [7].
Reported p-values for pairwise comparisons (including EA+PA vs. EM+TD) have been adjusted using the Benjamini-Hochberg procedure [3, 36]. AD notebooks have been excluded due to insufficient representation in our sample (n = 193).

The overall distribution of refactorings per commit fails a normality test, so ANOVA is inappropriate. We additionally applied Yates' continuity correction to avoid overestimation of statistical significance or effect size in light of the small frequencies (<5) for some of the less common refactorings [35].

The pattern observed in the pairwise post-hoc tests suggests that the exploratory genres might be distinct from the expository genres in terms of refactoring rates and types. Analogous to the above, a 2-way Kruskal-Wallis H test comparing aggregated exploratory (EA+PA) and expository (EM+TD) notebooks supports this, as these two groupings refactor at different rates (p<0.001), while an independence test for the distinctiveness of these meta-genres' refactoring profiles was significant (p=0.002). Given the consistent patterns observed between the overall distribution of refactorings and the profile of selected refactoring operations, we find strong evidence that Programming Assignments, from a refactoring perspective, cluster with Exploratory Analyses into an exploratory meta-genre, and that the Educational Materials and Technical Demonstrations cluster into an expository meta-genre.

4.4 RQ-BG: Computational Notebook Authors with a Computing Background Refactor Differently, but not More

Table 3 summarizes refactoring behavior according to author background, and Table 5b further breaks down the use of refactoring operations. For completeness, Table 7 provides a breakdown of author background according to notebook genre. Overall we see a striking similarity between the Data Science category and the Computer Science category.
The major difference that we observe is that Data Scientists tend to favor Extract Module over Extract Class (6.4% versus 1.7%), compared to Computer Scientists (1.5% versus 6.6%). This could be attributed to computer scientists being influenced by their wide use of object-oriented practices in traditional development. Table 9 highlights this, showing that CS & IT authors use classes much more than others. Still, even CS & IT authors use classes lightly: only 6 of their 40 notebooks contain even a single class declaration. Those outside of Data Science and Computer Science ("Other" in Table 5b) are not especially unique, either, except that they employed Change Function Signature and Split Cell rather more, and Extract Function, Reorder Cells, and Rename rather less. The prevalence of Change Function Signature is almost entirely due to those with backgrounds in Mathematics and Finance, a small minority of the notebook authors in the Other category.

Table 7. Notebook genres by author background.

Author Backgrounds | A. Ref's per Commit (H, p) | B. Ref. Type Frequencies (χ², p, V)
Omnibus            | 0.40, 0.819                | 29.60, 0.042, 0.16
CS vs. DS          | –, N/A                     | 8.62, 0.473, 0.12
CS vs. Other       | –, N/A                     | 13.94, 0.166, 0.16
DS vs. Other       | –, N/A                     | 19.87, 0.037, 0.19
CS+DS vs. Other    | N/A, N/A                   | 20.62, 0.037, 0.19

Table 8. A. Result of the Kruskal-Wallis H test for refactorings per commit (by notebook); post-hoc analysis by author background pairing is not applicable because the omnibus test is not significant. The post-hoc test does not return an H value. B. Results of Pearson's χ² tests of independence between all author backgrounds, post-hoc analysis via pairwise Pearson's χ² tests (with Yates' correction), and corresponding Cramér's V values. Reported p-values for pairwise comparisons (including CS+DS vs. Other) have been adjusted using the Benjamini-Hochberg procedure (n = 193).
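The post-hoc machinery behind Tables 6 and 8 — pairwise Pearson's χ² with Yates' correction, Cramér's V, and Benjamini-Hochberg adjustment — can be sketched in a few lines of pure Python. This is an illustrative reimplementation of the standard formulas, not the authors' analysis code:

```python
def chi2_yates(observed):
    """Pearson's chi-square statistic with Yates' continuity correction
    for an r x k contingency table (e.g., refactoring-operation counts
    for two genres), plus the corresponding Cramér's V effect size."""
    rows, cols = len(observed), len(observed[0])
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(r[j] for r in observed) for j in range(cols)]
    n = sum(row_tot)
    chi2 = sum(
        (abs(observed[i][j] - row_tot[i] * col_tot[j] / n) - 0.5) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(rows) for j in range(cols)
    )
    v = (chi2 / (n * (min(rows, cols) - 1))) ** 0.5
    return chi2, v

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment of a family of p-values.
    The running minimum taken from the largest rank downward is why
    adjusted p-values can tie, as seen in Table 8 (0.037 for both
    DS vs. Other and CS+DS vs. Other)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, prev = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted
```

For instance, `chi2_yates([[10, 20], [20, 10]])` yields χ² = 5.4 with V = 0.3, and `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` yields the tied adjusted values [0.02, 0.04, 0.04, 0.02].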
As shown in Table 8, pairwise tests on the refactoring distributions by author category bear out these observations. Only the Data Science and Other pairing presents a statistically significant difference (p=0.037), with a modest effect size (V=0.19). There is scant evidence to suggest that Computer Scientists refactor distinctly from Others (p=0.166, V=0.16) or Data Scientists (p=0.473, V=0.12). We additionally compared all those with computational backgrounds (CS+DS) against those with non-computational backgrounds (Other), and find similarly significant evidence to suggest differences in refactoring (p=0.037, V=0.19). Overall, then, we find strong evidence that notebook authors with computing-related backgrounds refactor differently than their non-computing counterparts, but they do not refactor more. This suggests that the rate of refactoring is mostly influenced by the evolutionary characteristics of the notebook genre, such as exploration, exposition, and changes to underlying technology.

Background | Code Cells | Functions | Classes | Methods
CS & IT    | 25.4       | 3.60      | 0.50    | 2.50
DS & ML    | 22.8       | 4.09      | 0.28    | 0.83
Other      | 27.4       | 2.58      | 0.23    | 0.97
All        | 25.1       | 3.40      | 0.31    | 1.22

Table 9. Per-notebook averages for the use of structural language and notebook features.

4.5 RQ-CC: Computational Notebook Authors Mostly Tidy Code as They Go; EA Code Cleaned Up More

We sought to quantify the self-reports of incremental tidying and bigger cleanups for Exploratory Analysis notebooks [18, 31], as discussed in Section 2.3, with respect to the evolution of code, and to compare these to other genres. Already in Section 4.3 we observed that Programming Assignments exhibit an exploratory character when it comes to the distribution of refactorings across commits and the selection of refactoring operations.
(Footnotes to Section 4.4: The matching p-values for DS vs. Other and CS+DS vs. Other are a result of the Benjamini-Hochberg p-value adjustment procedure, a step-up procedure [3, 36]. The effect sizes appear equal only due to rounding.)

4.5.1 Exploratory Notebooks are not Refactored the Most. We take it as a given that refactoring is an act of tidying or cleaning: an attempt to alter the look and arrangement of code to better reflect the concepts embodied in the code or to ease future changes. As shown in Table 2, Exploratory Analyses do not stand out in the amount of refactoring applied to them, whether measured relative to the number of commits or to notebook code size (Educational Materials do).

4.5.2 Computational Notebook Authors "Floss" Much More than "Root Canal". Murphy-Hill and Black distinguish refactorings that are employed to keep code healthy as a part of ongoing development from refactorings that repair unhealthy code [25]. They name these two strategies floss refactoring and root-canal refactoring, respectively. The distinction is important because their relative frequency says something about developer priorities regarding the ongoing condition of their code. One might suppose that the typical notebook author is unaware of the importance of keeping their notebook code structurally healthy, and so might postpone refactoring until the problems are acutely interfering with ongoing development. Murphy-Hill et al. operationalized floss refactorings as those that are committed with other software changes, and root-canal refactorings as those that are committed alone [26]. While perhaps a low bar for identifying root-canal refactoring, still, only 4.8% of notebook commits containing refactorings are refactoring-only, with the remaining 95.2% of commits containing refactorings being classified as flossing (Table 4, column 8).
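Murphy-Hill et al.'s operationalization is simple to express in code. The following sketch uses a hypothetical commit representation (pairs of counts) purely for illustration:

```python
def classify_refactoring_commits(commits):
    """Classify commits per Murphy-Hill et al.'s operationalization:
    refactoring-only commits count as root-canal refactoring, while
    refactorings committed alongside other changes count as floss.
    Each commit is a (num_refactorings, num_other_changes) pair --
    a hypothetical representation, not the study's data format."""
    floss = root_canal = 0
    for n_ref, n_other in commits:
        if n_ref == 0:
            continue  # not a refactoring commit at all
        if n_other == 0:
            root_canal += 1  # committed alone
        else:
            floss += 1  # mixed with other software changes
    return floss, root_canal
```

For example, `classify_refactoring_commits([(2, 5), (1, 0), (0, 3)])` returns `(1, 1)`: one floss commit, one root-canal commit, and one commit that contains no refactorings at all.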
Of the 23 refactoring-only commits, 10 come from Exploratory Analysis notebooks, 4 from Programming Assignments, 4 from Educational Materials, 4 from Technical Demonstrations, and 1 from Analytical Demonstrations (see Figure 2). Exploratory Analyses' 10 refactoring-only commits, at 4.5% of all their refactoring commits, is slightly below the average of 4.8%. The 23 refactoring-only commits contain 28 refactorings in total, 1.22 refactorings per commit on average, a bit lower than the rate of 1.30 for commits containing both refactorings and other changes. The frequency of refactoring operations is shown in Table 10. It is notable that Extract Function did not occur in the refactoring-only commits, given its popularity generally (Table 4).

A commit that contains many refactorings rather than solely refactorings — a quasi-root-canal classification — could also be a sign of cleanups. As shown in Figure 1, less than a quarter of refactoring commits contain two or more refactorings, only 25 contain three or more, and the most refactorings in one commit is six. Exploratory Analysis notebooks have a slightly above-average number of multi-refactoring commits, and the 6-refactoring commit belongs to an Exploratory Analysis notebook. Although it is hard to define how many refactorings in one commit would be evidence of a concerted cleanup, three would seem to be a low bar. Five or six is more interesting, but we saw just four such commits. Overall, then, there is substantial evidence of tidying via floss refactoring, and little evidence of cleanups via root-canal refactoring.

Table 10. Frequency of refactoring operations within refactoring-only commits.

4.5.3 Computational Notebook Authors do not Perform much Architectural Refactoring. Architectural refactorings could be a signal of code cleanups, as they alter the global name scope of the notebook by creating a new scope and/or adding/removing entities from the global scope.
The architectural refactorings observed are Extract Function, Extract Module (which extracts functions or classes into a separate Python file), Extract Class, and Extract (Global) Constant. They account for 25% of Exploratory Analysis refactorings, a bit more than the 19% for the other genres. Extract Module and Extract Class are especially interesting, as they gather and move one or more functions into a new file and a new scope, respectively. Twenty-three Exploratory Analysis refactorings (7%) come from this category, compared to 22 (6%) for the other genres. Although 78% of observed refactorings are non-architectural, we see some support for cleanup behavior, and more so for Exploratory Analyses.

4.5.4 Computational Notebook Authors Refactor Throughout. Rule et al.'s interviewees reported performing cleanups after analysis was complete [31], whereas Kery et al.'s interviewees reported more periodic cleanups in response to accrued technical debt [18]. As such, we assessed when refactoring takes place over time.

We plotted refactoring commits and non-refactoring commits over time, shown in Figure 2. Time is binned for each notebook according to the observed lifespan of the notebook. Interestingly, we see proportional spikes of both refactoring and non-refactoring commit activity in the first and last 5% of observed notebook lifetimes. This can be seen numerically in Table 11; in particular, commits during the middle 90% of notebook lifetime are 7.7 times less frequent than during the first 5% and 3.5 times less frequent than during the last 5%. Over a third of notebook activity happens during these two periods. Overall, commits containing refactorings occur at a rate that closely tracks the overall commit rate. The relative commit rate for Exploratory Analysis notebooks is nearly identical.
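The two normalizations used for Figure 2 — binning over a notebook's observed wall-clock lifespan, and binning over its commit history — can be sketched as follows. This is an illustrative reconstruction, not the authors' published code:

```python
def bin_by_lifetime(times, flags, n_bins=20):
    """Bin commits over a notebook's observed lifespan (first to last
    commit timestamp), counting flagged (e.g. refactoring) commits per
    bin -- the normalization used for the main plot of Figure 2."""
    t0, t1 = min(times), max(times)
    span = (t1 - t0) or 1  # avoid division by zero for single-instant histories
    bins = [0] * n_bins
    for t, flagged in zip(times, flags):
        if flagged:
            bins[min(int((t - t0) / span * n_bins), n_bins - 1)] += 1
    return bins

def bin_by_commit_clock(flags, n_bins=10):
    """Treat each commit as one tick of a virtual commit clock and bin
    flagged commits over normalized commit history (Figure 2, inset)."""
    total = len(flags)
    bins = [0] * n_bins
    for k, flagged in enumerate(flags):
        if flagged:
            bins[min(k * n_bins // total, n_bins - 1)] += 1
    return bins
```

The commit-clock view removes the early/late wall-clock bursts by construction, which is why refactoring activity looks far more uniform in the inset than in the main plot.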
By treating each commit on a notebook as a tick of a virtual commit clock, as shown in Figure 2's inset, we can see that refactoring activity, on average, is much more uniform over observed notebooks' commit-lifetimes. Using this notion of commit-time, Figure 3 plots the frequency of actual refactorings, both architectural (dark blue) and non-architectural (light blue). Since the average number of refactorings per commit is 1.30, it is not surprising that the trend looks highly similar to Figure 2's inset. We observe a slight hump in the middle for both types of refactorings, and the rate of architectural refactoring closely tracks the non-architectural refactoring rate. Also, the commit-time rate of Exploratory Analysis architectural refactorings closely mirrors that of all notebooks (not shown). The rate of refactoring over time supports the hypothesis of ongoing tidying. To the extent that architectural refactoring is a signal of cleaning, the data provides evidence for ongoing cleaning as opposed to post-analysis cleanups.

Lifetime   | Mean | Median | Std. Dev.
First 5%   | 5.0  | 4      | 4.6
Middle 90% | 11.6 | 9      | 7.9
Last 5%    | 2.2  | 1      | 2.0
All        | 18.8 | 16     | 9.1

Table 11. Statistics on the distribution of commits over normalized notebook lifetime.

4.5.5 Computational Notebook Authors Modestly Comment Out and Delete Code. In the above-mentioned interview studies, exploratory notebook authors attested to commenting out deprecated code as part of cleaning [18, 31]. We observed 151 commits in 92 notebooks in which code was commented out. Although consequential, this rate is much lower than the number of refactoring commits (475) and notebooks that contain refactorings (160). Still, 82 of those commits occurred in Exploratory Analysis notebooks, 5.8% of their commits, twice the rate of the other notebook genres. Although the numbers are small, they suggest that Exploratory Analysis notebooks undergo more tidying or cleaning up of this sort. Deletion of code can also be cleaning.
We measured deletions of non-comment code in commits (Figure 4). The median commit on an Exploratory Analysis notebook that results in a net deletion of code deletes a sizable 3.1% of the notebook's code. The next highest is Technology Demonstrations at 1.8%. The outliers (diamonds) are especially interesting, as they suggest large deletions indicative of cleanups. Each genre has about 20% outliers, suggesting that Exploratory notebooks do not stand out in this regard. Exploratory Analyses have 48 outlier deletions, 0.61 per notebook. Finally, we measured how much smaller a notebook's non-comment code size is between its maximum and its last commit. If code size shrinks substantially, that is a sign of cleanups. Following a Pareto-like distribution, 20% of Exploratory Analysis notebooks shrink more than 22%, whereas 20% of the other genres shrink only a little more than 10%. This is the strongest case for Exploratory Analysis notebooks undergoing cleanups distinctly from the other genres.

Fig. 2. Refactoring commits (bottom, dark blue) vs. non-refactoring commits (top, light blue) over normalized notebook lifetime (20 bins). Inset: Refactoring commits over normalized commit history (10 bins). Ten bins were used because refactoring commits are a small fraction of all commits.

Fig. 3. Refactorings over normalized commit history (10 bins). For each bar, architectural commits are shown at the bottom in dark blue, and non-architectural commits at the top in light blue.

4.5.6 Summary for RQ-CC. Taken together, we see ample evidence of code tidying and some evidence of code cleanups across all genres. Exploratory Analysis notebooks see more code deleted and commented out. On average they apply more architectural refactorings and slightly more multiple-refactoring commits.
The preponderance of observed code tidying is especially notable because any code that was introduced and refactored in the same commit — de facto code tidying — was not observable. This is discussed further in Section 6.

5 DISCUSSION

5.1 Refactoring: Intrinsic to Notebook Development Despite Small Size

Genres exhibit unique refactoring profiles, perhaps due to their distinct evolutionary drivers, but refactoring was observed in most notebooks of all genres. We were surprised to see substantial similarities in refactoring behavior among data scientists, computer scientists, and those from other backgrounds such as physical scientists. Likewise, our observation of a broad practice of floss refactoring, a best practice in traditional software development, is notable, as was the broader pattern of ongoing maintenance over big cleanups. These suggest that the pressures of technical debt are motivating notebook authors to perform regular notebook maintenance.

Fig. 4. Box plots for commits with net deletions of non-comment code, by genre.

We surmise that refactoring is intrinsic to notebook development, despite the small size of notebooks. Belady and Lehman's model for software development predicts that entropy increases exponentially with each change [2], and exploratory development tends to introduce many (often small) changes. Their model predicts that computational notebooks will experience increasing difficulty in making changes, eventually forcing the author to refactor or abandon the notebook (perhaps by copying essential code into a new notebook).

(Footnote: Although Belady and Lehman's article is titled "A Model of Large Program Development", the model itself does not depend on the size of the program.)

5.2 Notebook Authors Refactor Less, and Differently, than Traditional Developers

As observed in Section 4.2.1, 13% of commits for the notebooks in this study contain refactorings. This contrasts with the 33% found by Murphy-Hill et al. in their study of CVS logs of Java developers [26, §3.5]. Since we found that those with a computer science background actually refactored their notebooks a little less than others (see Section 4.4), it appears that unique characteristics of computational notebooks, such as their typically small size, are influential. Another influence could be that notebook authors commit their changes to GitHub less frequently, hiding more floss refactorings (see Section 6.1). Furthermore, as observed in Section 4.2.2, notebook authors rely heavily on a few basic refactorings. The same trend is seen amongst traditional Java developers, but with a distinct selection of refactoring operations. In Murphy-Hill et al.'s analysis of CVS commit logs of manual refactorings of Java code, the top four refactoring operations were, in order, Rename (29.7%), Push Down Method/Field (14.5%), Generalize Declared Type (9%), and Extract Function/Method (6.9%) [26, Fig. 2]. We observe the following:

5.2.1 Rename is much less common in notebook development. For the traditional developers, Rename occurred more than twice as often as the next refactoring, whereas in our sample of notebooks Rename came in fourth at 10.5%. Although the small size of the typical notebook compared to a Java application might be a partial explanation, the typical notebook makes heavy use of the global scope, putting pressure on the notebook author to maintain good naming as the notebook grows and evolves. We hypothesize that the difficulty of renaming identifiers in the Jupyter IDE, as discussed in Section 2.4, is a deterrent to renaming in notebooks.
5.2.2 Object-oriented refactorings are much less common in notebook development. For traditional developers, the next two most common refactorings, Push Down Method/Field and Generalize Declared Type, are core to object-oriented development. Python supports object-oriented development, but as discussed in Section 4.4, it is not widely practiced in the notebooks in this study.

5.2.3 Cells are a key structural affordance in notebook evolution. Although the wide use of Jupyter's unique cell construct is not surprising, it is perhaps more surprising that the refactoring of cells is so common, at 40% of the total. However, as shown in Table 9, code cells occur over seven times more often than functions, the next most common construct, for the notebooks in this study. What we are likely observing is that cells are displacing functions, in comparison to traditional development. Another factor is that Reorder Cells, Split Cell, and Merge Cells are supported in the Jupyter IDE, unlike the other observed refactoring operations.

5.3 Need for Better Refactoring Support in Notebook IDEs

Implementing tool assistance for refactoring is non-trivial, and IDE developers might be hesitant to do so without evidence that the benefits outweigh the costs. The presence of refactoring in 80% of the computational notebooks sampled for this study argues for at least some support for traditional refactoring tools that enable author-directed refactorings. In particular, our results suggest that notebook authors using Jupyter would benefit from environment support for at least the three refactorings in the top six that are not yet automated: Change Function Signature, Extract Function, and Rename. Rename is particularly challenging to perform manually in a computational notebook because old symbols retain their values in the environment until the kernel is restarted or the symbol is removed by the notebook author.
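The stale-binding hazard can be demonstrated by simulating a kernel's global namespace with `exec` and a shared dict (the variable names here are invented for illustration):

```python
# Simulate a notebook kernel's global namespace as a dict. After a
# manual rename in the source, the old binding persists in the kernel,
# so downstream cells that still reference the old name keep "working".
ns = {}
exec("raw_df = [1, 2, 3]", ns)                  # cell 1, original name
exec("cleaned = [x * 2 for x in raw_df]", ns)   # cell 2 uses it

# The author renames raw_df -> sensor_df in cell 1 and re-runs only cell 1:
exec("sensor_df = [1, 2, 3]", ns)

# Cell 2 was not updated, yet re-running it raises no NameError,
# because the stale raw_df binding is still alive in the kernel:
exec("cleaned = [x * 2 for x in raw_df]", ns)
assert "raw_df" in ns and "sensor_df" in ns     # both bindings coexist
```

The missed rename only surfaces as a `NameError` after a kernel restart, which is exactly the failure mode a kernel-aware Rename would prevent.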
A proper Rename refactoring would also rename the symbol in the environment, not just in the code. Similarly, for Reorder Cells, which is supported only syntactically in computational notebooks, support for checking definition-use dependencies between reordered cells [14] could help avoid bugs for the many computational notebook authors who use that refactoring. As observed in Section 2.4, JetBrains' new DataSpell IDE provides a subset of the refactorings supported by their Python refactoring engine. Among these are three of the top six refactorings observed in our study, mixed in with several less useful object-oriented refactorings. The Rename provided by DataSpell does not rename the symbol in the runtime kernel.

Although our results document that phases of code cleanup are not especially common, especially as compared to tidying, one possible reason is the lack of tool support. Without tool support or test suites, a notebook cleanup is high risk, as the substantial changes to a complex notebook could lead to hard-to-fix bugs. In this regard, our results argue for cleanup support like code gathering tools [15] (see Section 2.4). On the other hand, a customized mix of refactoring assistance for different genres or authors of differing backgrounds is not strongly supported. Although we observed differences, the effect sizes of the statistical tests are modest, and the overall top refactorings were performed in sizable percentages in notebooks of both exploratory and expository character, as well as by authors of differing backgrounds.

5.4 Multi-Notebook Refactoring

In interviews, notebook authors said that they sometimes keep exploration and exposition in separate notebooks [31], split a multiple-analysis notebook into multiple notebooks [18], or drop deprecated code into a separate notebook [18, 31]. We saw evidence of these in the ample code deleted from notebooks (see Section 4.5.5).
Such actions create relationships among notebooks akin to version control. Extrapolating from a suggestion from one of Head et al.'s study participants [15], it may be valuable for notebook refactoring tools to be aware of these relationships, for example having an option for Rename to span related notebooks.

5.5 Future Work on Studies of Notebook Refactoring

The present study focused on refactorings introduced between commits on a notebook. Future work could, for example, use logs from instrumented notebook IDEs to reveal the full prevalence of refactoring, as well as study fine-grained evolution behaviors such as how refactoring tools are used (as Murphy-Hill et al. did for Java developers [26]) or track activity among related notebooks (such as those being used for ad hoc version control). Future work could also investigate which situations motivate the use of refactoring in computational notebooks, as well as the effectiveness of refactoring. The latter could be studied for both manual and assisted refactoring, examining properties such as improved legibility and cell ordering better reflecting execution dependencies, versus, say, the introduction of bugs or a worse design.

6 LIMITATIONS AND THREATS

6.1 Limitations Due to Use of Source Control

Many computational notebook authors may be unaware of source code control tools. This study omits their notebooks, and our results may not generalize to that population. However, numerous articles in the traditional sciences advocate for the use of tools like GitHub for the benefits of reproducibility, collaboration, and protecting valuable assets [24, 28], establishing it as a best practice, if not yet a universal one. Notebook authors who know how to use source code control are more likely to know software engineering best practices as well, for example if they had encountered Software Carpentry [33, 34]. By factoring our analysis according to author background, we partially mitigated this limitation.
Notebooks may have been omitted because authors deemed them too trivial to commit to source control. We also excluded notebooks with fewer than 10 commits, the vast majority of notebooks on GitHub. Still, our study includes notebooks with a wide range of sizes and numbers of commits.

Our dependence on inter-commit analysis also presents limitations. Refactorings that occur in the same commit in which the refactored code is introduced are undetectable. This underreporting may not be uniform across refactoring operations, because intra-commit refactorings are by definition floss refactorings, which could be less architectural in nature. Even so, we observed little root-canal refactoring. Additionally, our analysis is more likely to underreport for notebooks whose histories have a relatively small number of large commits rather than many small commits. Relatedly, some notebooks may not be committed early in their history, resulting in an initial large commit that hides refactorings. Also, a few of the notebooks in our study are not "done": they are still in active development or in long-term maintenance. Given the marked bursts in overall commit rate early and late in the histories we captured (Figure 2 and Table 11), we are confident that we observed meaningful lifetimes for the vast majority of notebooks. To further quantify this, we performed an analysis of the similarity of the first and last commit of each notebook. We took a conservative approach, extracting the bag of lexical code tokens for each, and then calculated the similarity as the size of the (bag) intersection of the two commits divided by the size of the larger bag. Half of the notebooks' first-last commit pairs are less than 34% similar. At the extremes, 47 notebooks' first-last commits are less than 10% similar, whereas 14 notebooks' first-last commits are more than 90% similar.
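The bag-of-tokens similarity measure just described can be sketched with `collections.Counter`, whose `&` operator computes exactly the multiset intersection (tokenization is assumed to have already been done):

```python
from collections import Counter

def bag_similarity(tokens_a, tokens_b):
    """First-vs-last commit similarity as described in Section 6.1:
    the size of the multiset (bag) intersection of the two commits'
    lexical code tokens, divided by the size of the larger bag."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    intersection = sum((a & b).values())  # min count per token
    return intersection / max(sum(a.values()), sum(b.values()))
```

For example, `bag_similarity(["x", "=", "1", "x"], ["x", "=", "2"])` is 0.5: the bags share one `x` and one `=`, and the larger bag has four tokens.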
A future study could avoid the limitation of using source control snapshots by investigating notebook evolution through an instrumented notebook IDE (see Section 5.5).

6.2 Limitation Due to Use of Public Notebooks

Our study examined publicly-available notebooks. Notebooks authored as private assets may be evolved differently, perhaps due to having smaller audiences or fewer authors. Likewise, many programming assignments may be completed within a private repository due to academic integrity requirements. These limitations are partially mitigated by factoring our analysis by author background and notebook genre.

In a separate vein, public notebooks on GitHub are frequently cloned, notably textbooks and fill-in-the-blank homework assignments, creating the possibility that our random sample might have selected the same notebook multiple times, skewing our results. Our exclusion of student fill-in-the-blank notebooks (Section 3.4) eliminated one class of possibilities. In the end, as discussed in Section 3.2, our random sample contained no duplicate notebooks.

6.3 Limitation Due to Single-Notebook Focus

Our analysis detected when Extract Module moved code out to a library package and referenced it by import, and we analyzed code deletion in the context of cleanups. Deleted code may have been pasted into another notebook (see Sections 4.5.5 and 5.4), but its destination was not tracked in our analysis. As mentioned in Section 5.5, a future study could track the copying and movement of code across notebooks to better understand their relationship to refactoring.

6.4 Internal Validity Threats Due to Use of Visual Inspection

We classified refactorings through visual inspection, which is susceptible to error. Some previous studies of professional developers used automated methods, but they were shown to detect only a limited range of refactorings [26].
We used visual inspection to enable the detection of idiomatic expressions of refactorings in Jupyter notebooks, regardless of the programming language used. Five notebooks did not employ Python. Refactorings have been standardized and formalized in the literature, and the authors are expert in software refactoring. The methods we practiced as described in Sections 3.1 and 3.2 further controlled for mistakes. An audit, as described in Section 3.3, found a 9.9% error rate.

Determining notebook genre was more difficult, as there is no scientific standard for these. As described in Section 3.4, we employed Negotiated Agreement to eliminate errors. Although we can claim our classifications to be stable, others could dispute our criteria for classification. As described in the same section, for notebook author background, an audit found a 5% error rate.

6.5 External Validity Threat Due to Sample Size

Finally, in order to enable a detailed and accurate inspection of each notebook, its commits, authorship, and containing repository, this study was limited to studying 200 notebooks. We randomized our selection process at multiple stages to ensure a representative sample.

7 CONCLUSION

Computational notebooks have emerged as an important medium for developing analytical software, particularly for those without a background in computing. In recent interview studies, authors of notebooks conducting exploratory analyses frequently spoke of tidying and cleaning up their notebooks. Little was known, however, about how notebook authors in general actually maintain their notebooks, especially as regards refactoring, a key practice among traditional developers. This article contributes a study of computational notebook refactoring in the wild through an analysis of the commit histories of 200 Jupyter notebooks on GitHub. In summary:
RQ-RF (Notebook Refactoring): Despite the small size of computational notebooks, notebook authors refactor, even if they lack a background related to computing. From this we surmise that refactoring is intrinsic to notebook development. Authors depend primarily on a few non-object-oriented refactorings (in order): Change Function Signature, Extract Function, Reorder Cells, Rename, Split Cell, and Merge Cells. Traditional developers coding in languages like Java refactor more than twice as much. They prioritize the same non-cell operations as notebook authors, but apply Rename most frequently and favor a more object-oriented mix of refactorings.

RQ-GR (Refactoring by Genre): Computational notebooks of all genres undergo consequential refactoring, suggesting that the messiness of exploration often discussed in the literature is not the only driver of refactoring. Programming assignments (e.g., term projects) appear rather similar to Exploratory Analyses with respect to how they are refactored, despite their different end purposes. Overall, refactoring behaviors are differentiated by the notebook's exploratory versus expository purpose.

RQ-BG (Refactoring by Background): Computational notebook authors with a computing background (computer scientists and data scientists) seem to refactor differently than others, but not more. This adds weight to the conclusion above that refactoring is intrinsic to computational notebook development.

RQ-CC (Tidying vs. Cleanups of Code): Computational notebook authors exhibit a pattern of ongoing code tidying. Cleanups, cited in interview studies, were less evident, although they occur more in exploratory analyses. Cleanups appear to be achieved more often by moving code into new notebooks, a kind of ad hoc version control.

Our results suggest that notebook authors might benefit from IDE support for Change Function Signature, Extract Function, and Rename, with Rename taking the kernel state into account.
Also, given the frequency of use of the Reorder Cells operation, notebook authors might benefit from it being extended to check for definition-use dependencies. To replicate and extend these results, future work could instrument notebook IDEs to log fine-grained evolution behaviors and how refactoring tools are used, including across related notebooks. Future work could also study the circumstances that motivate the use of refactoring in computational notebooks, as well as the net benefits of refactoring with respect to factors like legibility, cell ordering reflecting execution dependencies, and the accidental introduction of bugs.

ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation under Grant No. CCF-1719155. We thank Jim Hollan for reading an early draft of this paper. We are grateful to the reviewers for their insightful and constructive comments, which helped make this a much better article.

REFERENCES

[1] Sandro Badame and Danny Dig. 2012. Refactoring Meets Spreadsheet Formulas. In Proceedings of the 2012 IEEE International Conference on Software Maintenance (ICSM) (ICSM '12). IEEE Computer Society, Washington, DC, USA, 399–409. https://doi.org/10.1109/ICSM.2012.6405299
[2] L. A. Belady and M. M. Lehman. 1976. A Model of Large Program Development. IBM Syst. J. 15, 3 (Sept. 1976), 225–252. https://doi.org/10.1147/sj.153.0225
[3] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.
[4] Fred Brooks. 1975. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading, MA. 195 pages.
[5] Nanette Brown, Yuanfang Cai, Yuepu Guo, Rick Kazman, Miryung Kim, Philippe Kruchten, Erin Lim, Alan MacCormack, Robert Nord, Ipek Ozkaya, et al. 2010. Managing Technical Debt in Software-Reliant Systems.
In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research (Santa Fe, New Mexico, USA) (FoSER '10). Association for Computing Machinery, New York, NY, USA, 47–52. https://doi.org/10.1145/1882362.1882373
[6] Souti Chattopadhyay, Ishita Prasad, Austin Z. Henley, Anita Sarma, and Titus Barik. 2020. What's Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376729
[7] Harald Cramér. 1999. Mathematical Methods of Statistics (PMS-9). Princeton University Press, Princeton, NJ. http://www.jstor.org/stable/j.ctt1bpm9r4
[8] Ward Cunningham. 1992. The WyCash Portfolio Management System. In Addendum to the Proceedings on Object-Oriented Programming Systems, Languages, and Applications (Addendum) (Vancouver, British Columbia, Canada) (OOPSLA '92). Association for Computing Machinery, New York, NY, USA, 29–30. https://doi.org/10.1145/157709.157715
[9] Olive Jean Dunn. 1964. Multiple comparisons using rank sums. Technometrics 6, 3 (1964), 241–252.
[10] Martin Fowler. 2020. Refactoring Catalog. https://refactoring.com/catalog/. Accessed: 2020-02-17.
[11] Martin Fowler, Kent Beck, John Brant, William Opdyke, and Don Roberts. 1999. Refactoring: Improving the Design of Existing Code. Addison-Wesley, Boston, MA, USA.
[12] D. R. Garrison, M. Cleveland-Innes, Marguerite Koole, and James Kappelman. 2006. Revisiting methodological issues in transcript analysis: Negotiated coding and reliability. The Internet and Higher Education 9, 1 (2006), 1–8. https://doi.org/10.1016/j.iheduc.2005.11.001
[13] GitHub Search 2020. Search via GitHub API. http://api.github.com/search/code?q=python+language:jupyter-notebook. Accessed: 2020-08-26.
[14] William G.
Griswold and David Notkin. 1993. Automated Assistance for Program Restructuring. ACM Trans. Softw. Eng. Methodol. 2, 3 (July 1993), 228–269. https://doi.org/10.1145/152388.152389
[15] Andrew Head, Fred Hohman, Titus Barik, Steven M. Drucker, and Robert DeLine. 2019. Managing Messes in Computational Notebooks. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 270, 12 pages. https://doi.org/10.1145/3290605.3300500
[16] Mary Beth Kery, Amber Horvath, and Brad Myers. 2017. Variolite: Supporting Exploratory Programming by Data Scientists. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI '17). Association for Computing Machinery, New York, NY, USA, 1265–1276. https://doi.org/10.1145/3025453.3025626
[17] Mary Beth Kery and Brad A. Myers. 2018. Interactions for Untangling Messy History in a Computational Notebook. In 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 147–155. https://doi.org/10.1109/VLHCC.2018.8506576
[18] Mary Beth Kery, Marissa Radensky, Mahima Arya, Bonnie E. John, and Brad A. Myers. 2018. The Story in the Notebook: Exploratory Data Science Using a Literate Programming Tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 174, 11 pages. https://doi.org/10.1145/3173574.3173748
[19] Donald E. Knuth. 1984. Literate Programming. Comput. J. 27, 2 (May 1984), 97–111. https://doi.org/10.1093/comjnl/27.2.97
[20] Andreas P. Koenzen, Neil A. Ernst, and Margaret-Anne D. Storey. 2020. Code Duplication and Reuse in Jupyter Notebooks. In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 1–9. https://doi.org/10.1109/VL/HCC50065.2020.9127202
[21] William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal
of the American Statistical Association 47, 260 (1952), 583–621.
[22] LSP 2021. LSP integration for jupyter[lab]. https://jupyterlab-lsp.readthedocs.io/en/latest/index.html. Accessed: 2021-12-29.
[23] Stephen Macke, Hongpu Gong, Doris Jung-Lin Lee, Andrew Head, Doris Xin, and Aditya Parameswaran. 2021. Fine-Grained Lineage for Safer Notebook Interactions. Proc. VLDB Endow. 14, 6 (Feb. 2021), 1093–1101. https://doi.org/10.14778/3447689.3447712
[24] Florian Markowetz. 2015. Five Selfish Reasons to Work Reproducibly. Genome Biology 16 (2015), 274. https://doi.org/10.1186/s13059-015-0850-7
[25] Emerson Murphy-Hill and Andrew P. Black. 2008. Refactoring Tools: Fitness for Purpose. IEEE Softw. 25, 5 (Sept. 2008), 38–44. https://doi.org/10.1109/MS.2008.123
[26] Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black. 2009. How We Refactor, and How We Know It. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 287–297. https://doi.org/10.1109/ICSE.2009.5070529
[27] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175.
[28] Jeffrey Perkel. 2016. Democratic databases: science on GitHub. Nature 538, 7623 (October 2016), 127–128. https://doi.org/10.1038/538127a
[29] Jeffrey Perkel. 2018. Why Jupyter is data scientists' computational notebook of choice. Nature 563 (Nov. 2018), 145–146. https://doi.org/10.1038/d41586-018-07196-1
[30] Laurent Peuch. 2020. Welcome to RedBaron's documentation! https://redbaron.readthedocs.io/en/latest/. Accessed: 2020-08-25.
[31] Adam Rule, Aurélien Tabard, and James D. Hollan. 2018. Exploration and Explanation in Computational Notebooks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 32, 12 pages. https://doi.org/10.1145/3173574.3173606
[32] Kathryn T. Stolee and Sebastian Elbaum. 2011. Refactoring Pipe-like Mashups for End-user Programmers. In Proceedings of the 33rd International Conference on Software Engineering (Waikiki, Honolulu, HI, USA) (ICSE '11). ACM, New York, NY, USA, 81–90. https://doi.org/10.1145/1985793.1985805
[33] G. Wilson. 2006. Software Carpentry: Getting Scientists to Write Better Code by Making Them More Productive. Computing in Science & Engineering 8, 6 (Nov. 2006), 66–69. https://doi.org/10.1109/MCSE.2006.122
[34] Greg Wilson. 2014. Software Carpentry: Lessons Learned. F1000Research 3 (2014), 62. https://doi.org/10.12688/f1000research.3-62.v2
[35] Frank Yates. 1934. Contingency tables involving small numbers and the χ2 test. Supplement to the Journal of the Royal Statistical Society 1, 2 (1934), 217–235.
[36] Daniel Yekutieli and Yoav Benjamini. 1999. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference 82, 1-2 (1999), 171–196.

Abstract

ERIC S. LIU, DYLAN A. LUKES, WILLIAM G. GRISWOLD, UC San Diego, USA

Due to the exploratory nature of computational notebook development, a notebook can be extensively evolved even though it is small, potentially incurring substantial technical debt. Indeed, in interview studies notebook authors have attested to performing on-going tidying and big cleanups. However, many notebook authors are not trained as software developers, and environments like JupyterLab possess few features to aid notebook maintenance. As software refactoring is traditionally a critical tool for reducing technical debt, we sought to better understand the unique and growing ecology of computational notebooks by investigating the refactoring of public Jupyter notebooks. We randomly selected 15,000 Jupyter notebooks hosted on GitHub and studied 200 with meaningful commit histories. We found that notebook authors do refactor, favoring a few basic classic refactorings as well as those involving the notebook cell construct. Those with a computing background refactored differently than others, but not more so. Exploration-focused notebooks had a unique refactoring profile compared to more exposition-focused notebooks. Authors more often refactored their code as they went along, rather than deferring maintenance to big cleanups. These findings point to refactoring being intrinsic to notebook development.

CCS Concepts: · Software and its engineering → Software evolution; Maintaining software.

Additional Key Words and Phrases: computational notebooks, end-user programming, refactoring

1 INTRODUCTION

Computational notebook environments like Jupyter have become extremely popular for performing data analysis.
In September 2018 there were over 2.5 million Jupyter notebooks on GitHub [29], and by August 2020 this figure had grown to over 8.2 million [13]. A computational notebook consists of a sequence of cells, each containing code, prose, or media generated by the notebook's computations (e.g., graphs), embodying a combination of literate programming [19] and read-eval-print-loop (REPL) interaction. Unlike a REPL, any cell can be selectively executed at any time. The typical language of choice is Python. Such an environment affords rapid development and evolution, enabling exploratory analysis: write code into a cell, run the cell, examine the output text or media rendering, then decide how to change or extend the notebook [31]. For example, an early version of a notebook for analyzing deep-sea sensor data might reveal some out-of-range readings, prompting the author to add a cell upstream for data cleaning. Later, the same notebook might require the tuning of parameters in a downstream cell to render a more insightful plot. Then the author might send this notebook to someone else who can read it like a literate program. Such data-driven development could compromise the conceptual integrity [4, Ch. 4] of the notebook as various explorations are developed, evolved, and abandoned, thus undermining the author's ability to continue making quick, correct changes [18]. Moreover, many notebook authors identify as scientists (e.g., chemists), and so may not have been exposed to concepts and skills related to reducing technical debt [5, 8] and maintaining conceptual integrity, such as refactoring and software design principles. Compounding this problem is that, to date, notebook

Author's address: Eric S. Liu, Dylan A. Lukes, William G. Griswold, esl036@ucsd.edu, dlukes@eng.ucsd.edu, wgg@cs.ucsd.edu, Computer Science & Engineering, UC San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0404, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery. 1049-331X/2022/8-ART111 $15.00
https://doi.org/10.1145/3576036

environments provide little tool assistance for refactoring, which could especially benefit untrained notebook authors.

That said, it is unclear how much computational notebook authors could benefit from the application of such techniques and tools. Compared to traditional computer programs, notebooks are typically quite small (hundreds of lines) and have short lives, with development often ending at the moment of final insight. In recent interview studies, authors who write notebooks to explore data sets have attested that they tidy their code as they go and periodically perform big cleanups [18, 31]. It is unknown, however, whether these reported activities are actually happening, and if so, whether they include recognized maintenance activities like refactoring. Nor is it known whether authors who are employing notebooks for other purposes behave similarly. Insights into these questions could inform, among other things, the ongoing development of notebook IDEs like JupyterLab.

This paper reports on an investigation of refactoring in computational notebooks. Little is known about end-user or notebook refactoring in the wild.
We randomly selected 15,000 Jupyter notebooks hosted on GitHub and studied 200 with meaningful commit histories. We then visually inspected each notebook's commits, identifying refactorings throughout its history. Also, expecting that factors such as notebook genre (use) and author background (as indicated by their current profession) would influence notebook refactoring behavior, we classified the notebooks according to genre and author background. Finally, we analyzed these refactorings and related changes with regard to the above questions and factors.

In our analysis, we found that notebook authors do indeed refactor. They use a different mix of refactoring operations than traditional developers: they do not often use object-oriented language features such as classes (despite their availability) and commonly use (and refactor) notebook cells, favoring non-OO refactorings like Change Function Signature, Extract Function, Reorder Cells, and Rename. The refactorings employed vary by genre, with exploration-focused genres favoring the above refactorings in the above order, whereas exposition-focused genres favored these in a different order, with Split Cell at the top. Notebook authors with a computing background refactor differently than others, but not more. Finally, we found that notebook authors generally refactor their notebooks as they go, rather than performing big code cleanups.

2 RELATED WORK

Much is known about how traditional software developers refactor, but little is known about how end-user programmers, or computational notebook authors specifically, refactor. However, notebook authors do attest to tidying and cleaning up their notebooks.

2.1 Traditional Software Developers

The most extensive study of refactoring by traditional software developers was performed by Murphy-Hill et al. [26]. Drawing on data from CVS source code repositories and event logs from the Eclipse IDE, they drew a comprehensive picture of how software developers refactor.
They found that traditional developers refactor frequently, favoring refactoring as they develop over concentrated overhauls. They also found that despite the availability of numerous refactoring operations in the IDE, developers overwhelmingly chose to refactor by hand, even experts who were well-versed in the refactoring tools. Finally, they found that Rename was by far the most applied refactoring, and that the vast majority of applications are comprised of just a few refactoring operations (Rename, Extract Local Variable, Inline Method, Extract Method, Move, and Change Method Signature), with a long tail of lightly-used refactorings. An important take-away from Murphy-Hill et al. is that since traditional developers overwhelmingly choose to refactor by hand, the lack of refactoring tools in Jupyter is not necessarily a deterrent to refactoring by notebook authors.

2.2 End-User Programmers

Little is known about whether or how end-user programmers or exploratory programmers refactor. Stolee and Elbaum found that Yahoo! Pipes programs would benefit from refactoring, and developed a tool to automatically detect and refactor "smelly" pipe programs [32]. In a lab study they found that pipe developers generally preferred refactored pipes, although there was a preference for seeing all the pipes at once rather than abstracting portions away into a separate pipe. Badame and Dig similarly found that spreadsheets would benefit from refactoring, and developed a refactoring tool for Excel spreadsheets in which spreadsheet authors could select and apply refactorings to selected cells [1]. As in the pipes study, a lab study found that the refactored spreadsheets were preferred, but participants also preferred seeing all the cells versus abstracting away details. Refactorings with their tool were completed more quickly and accurately than by hand.
These studies tell us that end-user programmers want a level of control over refactoring to achieve their preferred results. However, they do not tell us whether end-user programmers refactor in the wild.

2.3 Computational Notebook Authors

Rule et al. examined over 190,000 GitHub repositories containing Jupyter notebooks, finding that the average computational notebook was short (85 lines) and that 43.9% could not be run simply by running all the cells in the notebook top-to-bottom [31]. They also found that most notebooks do not stand alone, as their repositories contain other notebooks or instructions (e.g., a README file). Sampling 897 of these repositories (over 6,000 notebooks), Koenzen et al. found that code clones (duplicates) within and between notebooks of a GitHub repository are common, with an average of 7.6% of cells representing duplicates [20]. These results suggest, at least for notebooks on GitHub, that while most notebooks are small, they are not trivial computational artifacts.

Interview studies have shed additional light on the characteristics of the development of computational notebooks. Chattopadhyay et al. interviewed and surveyed data scientists regarding the challenges they encountered in notebook development [6]. Among many concerns, their respondents reported that refactoring was very important, yet poorly supported by their notebook environments. The study does not report on their actual coding practices. Because we sampled from public GitHub notebooks, our study reports on the evolution of notebooks not only from data scientists, who often have significant training in computer programming, but also from authors from a variety of other backgrounds. Rule et al.
interviewed what they called notebook analysts (i.e., authors of Exploratory Analysis notebooks), many of whom said they need to share their results with others and sometimes wish to document the results of their work for their own use [31]. In this regard, notebooks are valuable in enabling a narrative presentation of code, text, and media such as graphs. However, notebooks are developed in an exploratory fashion, resulting in the introduction of numerous computational alternatives, often as distinct cells or the iterative tweaking of parameters. Many analysts attested that after their exploratory analysis is complete, their notebooks are often reorganized and annotated in a cleanup phase for their final intended use. Often a notebook is used as source material for a new, clean notebook or a PowerPoint presentation.

In Kery et al.'s interviews, analysts said that code and other elements may be introduced anywhere in a notebook as a matter of preference, but that exploratory alternatives are often introduced adjacent to the original computation [18]. The resulting "messy" notebook is "cleaned" up by cells being deleted, moved, combined, or split, or by defining new functions. Such changes often occur as incremental "tidying" during the exploration phase, when the messiness is found to interfere with (slow down) ongoing exploration. When notebooks grow too large for Jupyter's simple IDE support, computations are sometimes moved off into new notebooks or libraries. Building on these findings, Kery et al. developed and evaluated tools for fine-grained tracking and management of code versions within a notebook [16, 17].

These interview studies make it clear that notebook analysts feel maintenance is important because exploratory development creates technical debt that both interferes with productivity and runs counter to the needs of sharing and presentation.
Our prestudy identified four other notebook genres, raising the question of whether their notebooks generate similar needs. As interview studies, however, they establish what a number of notebook authors think they do, serving as hypotheses about what their colleagues really do. It's also unclear whether incremental tidying and concentrated cleanup phases during notebook development are achieved by what would be recognized as refactoring, that is, "changing the structure of a program without changing the way it behaves" [26]. If so, there remain questions of which refactoring operations are favored, and how those vary according to notebook genre or author background. The present study provides concrete evidence to evaluate the above hypotheses relating to notebook maintenance.

2.4 Refactoring Support for Jupyter

Refactoring support is minimal in the popular Jupyter Notebook and JupyterLab environments. Both support only basic cell-manipulation operations: Split Cell, Merge Cell Up/Down, and Move Cell Up/Down. The refactoring operations most frequently observed by Murphy-Hill et al. are absent: Rename, Extract Local Variable, Inline Method, and Extract Method [26]. As reported by Chattopadhyay et al. (and discussed in the previous subsection), data scientists, at least, found such omissions to be problematic [6]. Since 2020, after we sampled the notebooks for this study, Rename can be added to JupyterLab by installing the JupyterLab-LSP extension [22]. More recently, in 2021, JetBrains released a third-party front-end for Jupyter, called DataSpell, that provides many of the same refactoring operations as its PyCharm IDE for traditional Python development, including those cited above as missing in the Jupyter environments. It also provides the next five most frequently observed by Murphy-Hill et al.: Move, Change Method Signature, Convert Local to Field, Introduce Parameter, and Extract Constant.
The present study provides insight on the possible usefulness of the refactoring support provided by these notebook environments.

Renaming is an interesting case. To rename a variable in a notebook by hand, one must find and replace each usage. In a complex notebook, it is easy to miss a usage of an identifier. In contrast, developers using a traditional IDE can expect instantaneous feedback when a variable is used without being defined. Even without IDE support, they will become aware of it at runtime. However, in the notebook-kernel execution model, this feedback can be even more delayed, because renaming an identifier amounts to introducing a new identifier. That is, after renaming, there are two potential problems: the old identifier (name) remains bound to its value in the runtime kernel, and the new identifier is uninitialized. Regarding the former, a cell mistakenly referring to the old identifier would access this old value, perhaps disguising the error. Regarding the latter, the notebook author must re-execute the cells upstream of the defining cell to complete the rename. If the notebook author were using the nbsafety custom kernel, it would suggest what cells to re-execute (when a cell dependent on the rename is executed) [23]. The expectation, however, is that a notebook's refactoring operations would be behavior preserving, and thus take the dynamic state of the runtime kernel into account.

Head et al. developed a tool for gathering up and reorganizing "messy" notebooks [15]. The tool identifies all the cells involved in producing a chosen result, which can then be reordered into a linear execution order and moved to a new location in the notebook or a different (e.g., new) notebook. Participants in a user study attested that they preferred cleaning their notebooks by copying cells into a new notebook, and that the tool helped them do so, among several other tasks.
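The stale-binding hazard of a manual Rename described above can be reproduced with a toy model of the kernel: each cell is a source string executed against one shared namespace, just as Jupyter executes cells against shared kernel state. The cell contents here are illustrative, not drawn from the studied notebooks.

```python
# A minimal model of a notebook kernel: cells share one namespace.
kernel_state: dict = {}

def run_cell(source: str) -> None:
    """Execute a cell's code against the shared kernel state."""
    exec(source, kernel_state)

run_cell("raw = [3, 1, 2]")          # cell 1 (original)
run_cell("result = sorted(raw)")     # cell 2

# The author manually renames `raw` to `readings` and re-runs cell 1.
run_cell("readings = [3, 1, 2]")     # cell 1 (after the rename)

# Hazard: the old identifier is still bound in the kernel, so a cell
# missed during the rename still runs, silently using the stale value
# instead of raising NameError and revealing the incomplete rename.
run_cell("result = sorted(raw)")
assert "raw" in kernel_state and "readings" in kernel_state
```

A kernel-state-aware Rename would instead rebind or invalidate the old name, so that any missed usage fails fast rather than reading the stale binding.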
3 STUDY DESIGN

A challenge for our study was how to gain access to the histories of a sizable number of computational notebooks. As one example, data from an instrumented notebook development environment would be attractive, but Jupyter is not instrumented to capture the detailed edit histories from which manual refactorings could be detected, nor was it feasible for us to instrument Jupyter for distribution to a sizable number of notebook authors for an extended period of time. In this regard, the GitHub commit histories of public Jupyter notebooks provided an attractive opportunity. One limitation is that it is not possible to detect refactorings that occur in the same commit in which the refactored code is introduced. This underreporting pessimizes our results to a degree, and our results should be interpreted in this context. Limitations and threats related to the use of source code control and GitHub are detailed at the beginning of Section 6. We employed a multi-stage process of automated extraction and analysis, as well as visual inspection, to achieve depth, breadth, and accuracy for the following research questions:

RQ-RF: How much and with what operations do computational notebook authors refactor?
RQ-GR: How does refactoring use vary by computational notebook genre?
RQ-BG: How does refactoring use vary according to computational notebook author background? In particular, do those with a CS background refactor differently than others?
RQ-CC: What is the extent of tidying vs. cleanups of code?

ACM Trans. Softw. Eng. Methodol.

3.1 Determining a Catalog of Notebook Genres & Refactorings

To determine the viability of the study and fix a stable catalog of refactoring operations throughout a large-scale analysis, we performed a prestudy. In June 2019, we downloaded Adam Rule et al.'s data set of 1,000 repositories [31], containing approximately 6,000 notebooks, and randomly sampled 1,000.
Using notebook metadata downloaded via the GitHub API, we filtered out inappropriate notebooks: those generated by checkpointing (automated backup), those lacking 10 commits with content changes, and completed fill-in-the-blank homework assignments. A commit can have no notebook content changes when only notebook metadata has changed, e.g., a cell's execution count. Completed fill-in-the-blank homework assignments are uninteresting because they were designed to not require evolution. From the remaining notebooks we selected the 50 notebooks with the most evolutionary development in terms of the number and size of their commits. We next visually inspected every commit in the 50 notebooks. The visual inspection of a GitHub commit is normally enabled by the display of a "diff" computed by git diff, which provides a concise summary of what has been added, deleted, and changed since the previous commit. Because a Jupyter notebook file is stored in JSON format and includes program data and metadata, we instead used nbdime, a tool for diffing and merging Jupyter notebook files. Using nbdime, we visually inspected each notebook, recording a commit number and the context for every change, except those that were merely adding code. Among the changes recorded were manipulations of code and cells and changes to function signatures and imports. In the end, 35 of the 50 notebooks contained such notable changes. To extract a catalog of refactoring operations from these changes, we drew from existing classical refactorings, as well as defining notebook-specific refactorings where necessary. One author inspected all commit diffs and mapped them to refactorings as defined by Fowler [10, 11] and refined by Murphy-Hill et al. [26]. The same author analyzed the remaining changes to define notebook-specific refactorings.
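As a rough illustration of what "no notebook content change" means, and of why a plain textual diff is ill-suited to the JSON notebook format, the following sketch (not the authors' tooling, and far simpler than nbdime) compares only the code-cell sources of two notebook revisions, ignoring metadata such as execution counts:

```python
import json

def code_cells(nb_json: str):
    """Extract code-cell sources from a notebook's JSON text,
    ignoring outputs and metadata (e.g., execution counts)."""
    nb = json.loads(nb_json)
    return ["".join(c["source"]) for c in nb["cells"]
            if c["cell_type"] == "code"]

def content_changed(old_json: str, new_json: str) -> bool:
    """True iff any code-cell content differs between two revisions."""
    return code_cells(old_json) != code_cells(new_json)

# Two tiny hypothetical revisions: only an execution count differs,
# so a commit of this change carries no notebook content change.
old = json.dumps({"cells": [{"cell_type": "code", "execution_count": 1,
                             "source": ["x = 1\n"]}]})
new = json.dumps({"cells": [{"cell_type": "code", "execution_count": 7,
                             "source": ["x = 1\n"]}]})
assert not content_changed(old, new)
```

A line-oriented git diff would report these two revisions as different, while a notebook-aware comparison correctly treats the commit as metadata-only.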
Each mapping was brought to all authors for evaluation of its validity as a refactoring as defined by Fowler (a structural change lacking external behavioral effects), and its classification was finalized according to structural distinctiveness. In all, we compiled 15 refactorings, as listed on the left of Table 4. During the above visual inspection, we also determined the purpose, or genre, of the notebook by examining the entire notebook at its first and last commit. Through iterative refinement, we settled on five genres. The Exploratory Analysis genre is the classical use-case for a computational notebook: an attempt to understand a data set through computational analysis, iteratively applied until an understanding is reached. Such a notebook might support the publication of a research article or making a policy decision. The other genres operate largely in service of this application. The Programming Assignment genre consists of notebooks developed to complete a programming assignment, but not the fill-in-the-blank type. Many are developed as a final project, and thus are relatively open-ended. However, the primary purpose of writing the notebook is to learn a subject like exploratory analysis. Analytical Demonstration notebooks demonstrate an analytical technique, such as a method for filtering out bad data. Technology Demonstration notebooks demonstrate how to use a particular technology, such as TensorFlow. Finally, Educational Material notebooks support the conduct of a course, serving as "literate computing" lecture notes, textbooks, or labs, for example. There are two sub-genres here: notebooks teaching exploratory analysis and those teaching a traditional academic subject, such as chemistry.

3.2 Selecting Notebooks for the Study

Targeting a sample of 200 notebooks for analysis, we first randomly sampled 15,000 Jupyter notebooks from the 4,343,212 publicly available on GitHub at the time, October 2019.
We then rejected checkpoint files and repositories with fewer than 10 commits, leaving us with 6,694 notebooks. We downloaded these notebooks, allowing us to automatically cull those with fewer than 10 non-empty commits, leaving us with 278. We then employed random sampling until we reached 200 notebooks, not only rejecting notebooks for insufficient evolution as in the prestudy, but also for having the same primary author as a previously selected notebook, as our goal was to identify a diverse set of notebooks. Encountering two notebooks from an author was likely because GitHub users commonly copy someone else's repository, with all its history, called a clone. Cloning is particularly likely for online textbooks and other educational materials, such as labs, which may have thousands of clones. These repositories can contain hundreds of notebooks, most by the same author, further increasing the chance of sampling two notebooks by the same author. The primary authorship of a notebook was determined by visual inspection, as described below in Section 3.4. In the end, to reach 200 notebooks we sampled 273 of the 278, rejecting 66 for insufficient evolution and 7 for repeat authorship. We encountered no repeat (cloned) notebooks. Six of the seven repeat authors were for Course Material notebooks. The seventh was a low-probability event: a second notebook from an author who had published only three notebooks, none cloned.

3.3 Identifying Refactorings

We used nbdime on the 200 selected notebooks to find all the refactorings as identified in the prestudy. Each refactoring was recorded as an entry in a master table, consisting of the unique notebook ID, notebook genre, commit hash, and refactoring operation code. Due to the large number of commit diffs inspected, identifying refactorings was subject to error due to fatigue.
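The selection loop described above can be sketched as rejection sampling over candidate notebooks. The records and field names below are hypothetical stand-ins for the metadata the authors actually inspected (evolution was judged per the prestudy criteria, and primary authorship by visual inspection):

```python
import random

def sample_notebooks(pool, target, seed=0):
    """Randomly sample notebooks, rejecting those with insufficient
    evolution and any repeat of an already-selected primary author."""
    rng = random.Random(seed)
    candidates = list(pool)
    rng.shuffle(candidates)
    selected, seen_authors = [], set()
    for nb in candidates:
        if len(selected) == target:
            break
        if not nb["sufficient_evolution"]:
            continue                      # rejected, as in the prestudy
        if nb["author"] in seen_authors:
            continue                      # one notebook per primary author
        seen_authors.add(nb["author"])
        selected.append(nb)
    return selected

# Hypothetical pool of 278 candidate notebooks by ~90 authors.
pool = [{"author": f"a{i % 90}", "sufficient_evolution": i % 4 != 0}
        for i in range(278)]
chosen = sample_notebooks(pool, target=50)
assert len(chosen) == 50
assert len({nb["author"] for nb in chosen}) == 50  # all authors distinct
```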
Consequently, a second author audited the accuracy of the inspector's identification of refactorings by randomly sampling and reexamining 10% of the notebooks (20) and their commit diffs (345). As the data is segmented according to commit diff, the error rate is calculated as the number of reinspected diffs that contained any kind of coding error (missed refactoring, extra refactoring, or misidentified refactoring) divided by the total number of commit diffs that were reinspected. Using the same Negotiated Agreement protocol employed exhaustively for genre, described immediately below, 34 diffs were determined to contain some kind of error, an error rate of 9.9%. Of those errors, over half were due to missing a refactoring (21 commit diffs). Next were extra refactorings (10 commit diffs), with only 1 commit diff containing a mislabeled refactoring and 2 commit diffs containing multiple of these errors. The most common missed refactoring was Rename (11 commit diffs, including one containing multiple different errors), perhaps because the textual difference of a Rename is more visually subtle than others. Still, Rename is ranked as one of the top refactorings in our results.

3.4 Identifying Genre and Author Background

Due to the complexity of determining a notebook's genre, we applied a coding methodology called Negotiated Agreement [12], which produces 100% agreement through a labor-intensive process. First, two authors made their own passes over all the notebooks. At a minimum, they examined the entirety of each notebook at its first and last commits, as well as examining the repository's README file for clues as to the intended use of the repository. The two authors then discussed each notebook on which they disagreed to attempt to reach a common conclusion. For the remaining notebooks where disagreement or uncertainty persisted, the third author was brought in to the discussion for a final decision.
Table 1 details the results of each pass.

                            Count   Proportion
    Raw Agreement             66       33%
    Negotiated Agreement     186       93%
    Undecided/Disagree        14        7%

Table 1. Raw and negotiated agreement between two authors on classification of notebook genre. The 14 notebooks remaining unclassified after negotiation were classified in a discussion with the third author.

We also wanted to learn the technical background (primary expertise) of each notebook author, especially whether an author was a computer scientist or not. For this purpose, the categories of background were construed broadly (e.g., Life Sciences). To determine a notebook author's background, one of the authors of this paper inspected each notebook and its repository for author information. The first consideration was to identify the primary author of each notebook. In many cases the owner of a repository is not the notebook author, so it was necessary to inspect the notebook's commit history. In some cases one person originated a notebook and then later someone took it over. In most cases, the originating author wrote the vast majority of the notebook and the authorship was easily attributed to the originating author. In the case of relatively equal contributions, the tie went to the originating author. The next consideration was actually determining the primary author's background. We decided that the best indication of one's primary expertise is their profession, as indicated by their role or job title, as they are quantifiably expert enough to get paid to use their expertise. An author's educational background was consulted when the job title was vague, such as "Senior Engineer". Students were classified according to their program-level academic affiliation. This process required extensive research. The inspector, the same author who determined each notebook's primary author, started with the author's GitHub profile, and, if necessary, e-mail addresses extracted from commits.
We then searched the internet for publicly available data including personal websites, blog posts, company and university websites, and public information provided by LinkedIn (i.e., without use of a login). The inspector was able to determine the backgrounds of 189 of the 200 notebook authors. The results are shown in Table 5a. To assess the accuracy of the result, a second author audited the inspector's determination of primary author and their background by randomly sampling and reexamining 20% of the notebooks, 40 total. Using the same Negotiated Agreement protocol described above, the error rate was determined to be 5%. The two misclassified backgrounds were for notebook authors with a multidisciplinary job role and education.

3.5 Extraction of Notebook Statistics

To perform automated analyses such as the density of commits over time, we mined the Git history of the notebooks and their repositories, and then developed Jupyter notebooks for the analyses. Analyses of code employed the RedBaron library [30].

4 RESULTS

4.1 Data Availability

Data relating to our notebook selection process and supporting the results in this and the following sections are available under a CC-BY 4.0 license at https://figshare.com/s/4c5f96bc7d8a8116c271.

4.2 RQ-RF (How much and with what operations do notebook authors refactor?)

4.2.1 RQ-RF How Much: Computational Notebook Authors Refactor, with High Variation. To understand whether computational notebook authors refactor and how much, we extracted the number of refactorings performed and the number of commits containing those refactorings, per genre. Recall that, with this methodology, it is not possible to detect refactorings that occur in the same commit in which the refactored code is introduced, underreporting the total amount of refactoring. Table 2 shows that notebook authors refactor, with 13% of commits containing a refactoring (about 1 in 7.7).
Of those commits, they contain an average of 1.30 refactorings per commit. Figure 1 enumerates the frequencies of refactorings within refactoring commits, with a maximum of 6 in a single commit. Over 100 notebooks contain 2 or fewer refactorings. The maximum number of refactorings in a notebook is 21.

Fig. 1. The frequency of refactorings in commits that contain refactorings.

4.2.2 RQ-RF Operations: Computational Notebook Authors Favor a Few Non-Object-Oriented Refactorings. To learn what refactoring operations are employed by notebook authors, we clustered our extracted refactorings by operation type and sorted them by frequency, as shown in the rightmost column in Table 4. Notably, just a few refactoring operations account for most of the applied refactorings. The top four operations account for over 57% of refactorings: Change Function Signature, Extract Function, Reorder Cells, and Rename. Including the next one, Split Cell, reveals that a third of the refactoring operations account for over two-thirds of refactorings. Notably, two of these top five refactoring operations (and three of the top six) refactor cells, which are unique to computational notebooks. All of the cell-refactoring operations together comprise 40% of the total number of observed refactorings.

4.3 RQ-GR: PAs Appear Exploratory; Exploratory and Expository Genres are Distinct

As discussed in Section 2.3, Exploratory Analysis (EA) notebooks are the classic use-case for computational notebooks. The genres of Educational Materials (EMs), Technology Demonstrations (TDs), and Analytical Demonstrations (ADs) are different in their focus on exposition for a wide audience rather than the exploration of a novel data set.
This raises two questions: (a) are Programming Assignments (PAs) refactored more like Exploratory Analyses or the exposition-focused genres, and (b) to what extent are these genres distinct from each other when it comes to refactoring? Regarding the first, although Programming Assignments are often exploratory analyses, they are not truly open-ended, novel investigations. Regarding the second, evolution (refactoring) in exposition-oriented notebooks would be expected to be driven more by changes in technology or the goal of communicating clearly to a wide audience, for example, than by the effects of exploration, which would be expected to be generally absent. Due to the small number of Analytical Demonstration notebooks, we omit them from the following analysis. At a first level of analysis, we see similarities in refactoring between the Exploratory Analysis and Programming Assignment genres, and differences with the expository genres. Referring back to Table 2, for one, 16% of their commits contain refactorings, whereas for the others just 9% do. Two, they have 13% and 10% of notebooks each with no refactorings, whereas the expository genres hover around 30%. Their percentage of refactoring commits that contain only refactorings is well below the other genres. The expository genres exhibit similarities among each other as well, but there are some differences. Of the Technology Demos' refactoring commits, 7.8% were refactoring-only commits, compared to 5.6% for Educational Materials. Likewise, for the Technology Demos, the Rename refactoring occurs three times as often and Reorder Cells a quarter as often. To further explore this possible exploratory/expository genre clustering, we applied statistical tests for both refactoring rate and the profile of selected refactorings. Note that these tests are employed here as descriptive statistics, as we did not have a specific hypothesis at the outset.
    Notebook Genre                 Notebooks   NB size        NBs with no   All       Commits   Commits      Ref'ing-Only   Total      Ref's /
                                               (avg. chars)   Refactoring   Commits   / NB      w/Ref'ing    Commits        Ref'ings   Commit
    Exploratory Analysis (EA)          79      12,299         10 (13%)      1415      17.9      221 (16%)    10 (4.5%)      299        1.35
    Programming Assignment (PA)        41       9,905          4 (10%)       717      17.5      113 (16%)     4 (3.5%)      144        1.27
    Educational Material (EM)          41       6,036         14 (34%)       843      20.3       72 (9%)      4 (5.6%)      100        1.39
    Technology Demo. (TD)              32       6,779         10 (31%)       651      20.6       51 (8%)      4 (7.8%)       57        1.12
    Analytical Demo. (AD)               7      20,435          2 (29%)       128      18.3       18 (14%)     1 (5.6%)       19        1.06
    All Notebooks                     200       9,926         40 (20%)      3754      18.8      475 (13%)    23 (4.8%)      619        1.30

Table 2. Evolution Statistics by Notebook Genre. Notebook (NB) size is non-comment code taken from the last commit of each notebook.

    Author Background   Notebooks   NB size        NBs with no   All       Commits   Commits      Ref'ing-Only   Total      Ref's /
                                    (avg. chars)   Refactoring   Commits   / NB      w/Ref'ing    Commits        Ref'ings   Commit
    DS & ML                 81       8,893         18 (22%)      1439      17.8      183 (13%)     7 (3.8%)      234        1.28
    CS & IT                 40      10,084          8 (20%)       793      19.8       92 (12%)     6 (6.5%)      136        1.48
    Other                   79      10,286         14 (18%)      1522      19.3      200 (13%)    10 (5.0%)      249        1.25
    All Notebooks          200       9,926         40 (20%)      3754      18.8      475 (13%)    23 (4.8%)      619        1.30

Table 3. Evolution Statistics by Author Background. Notebook size is taken from the last commit of each notebook in our snapshot. DS & ML stands for Data Science and Machine Learning. CS & IT stands for Computer Science and Information Technology.

Table 4. Frequency of refactoring operations by genre.

Table 5. Backgrounds of primary notebook authors (a) and frequency of refactoring operations by author background (b). DS & ML stands for Data Science and Machine Learning. CS & IT stands for Computer Science and Information Technology.
To reveal differences in refactoring rates between genres (excluding AD), we computed the average rate of refactoring per commit for each notebook, and ran a Kruskal-Wallis H test followed by a post-hoc analysis via Dunn's Multiple Comparisons Test [9, 21]. We report these results in Table 6 (A). There is strong evidence that the rates of refactoring differ for pairwise combinations other than those between the exploratory genres (EA vs. PA) and between the expository genres (EM vs. TD). Within the like-genre pairings, we find no evidence to suggest a difference in refactoring rates. To examine the distinctiveness of the refactoring profile of each genre (excluding AD), we ran Pearson's χ² tests of independence for each pair of genres [27]. These tests excluded the Analytical Demos and were run for the top 10 refactorings overall (excluding the bottom 5) because Pearson's test assumes non-zero observation counts (cf. Table 4). We report these results in Table 6 (B). Our results suggest that except for the Exploratory Analysis / Programming Assignment pairing (p=0.192), all genres' refactoring distributions are pairwise distinct. According to the obtained Cramér's V values, the effect sizes are modest.

                         A. Refactorings per Commit    B. Refactoring Type Frequencies
    Genres                  H         p                  χ²        p        V
    Omnibus               29.86    <0.001               65.91    <0.001    0.21
    EA vs. PA               –       0.388               12.39     0.192    0.15
    EA vs. EM               –      <0.001               28.97     0.002    0.23
    EA vs. TD               –      <0.001               21.43     0.018    0.20
    PA vs. EM               –       0.013               21.75     0.018    0.20
    PA vs. TD               –      <0.001               19.92     0.021    0.19
    EM vs. TD               –       0.362               20.96     0.018    0.19
    EA+PA vs. EM+TD       28.09    <0.001               30.89     0.002    0.24

Table 6. A. Result of Kruskal-Wallis H test for refactorings per commit (by notebook), as well as post-hoc analysis by genre pairing via Dunn's multiple comparison test. The post-hoc test does not return an H value. B. Results of Pearson's χ² tests of independence between all genres, post-hoc analysis via pairwise Pearson's χ² tests (with Yates' correction), and corresponding Cramér's V values [7].
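For readers unfamiliar with the effect-size measure, Cramér's V is derived directly from a contingency table's χ² statistic. The following pure-Python sketch uses illustrative counts (not the paper's data) and omits the Yates correction the authors applied:

```python
import math

def chi_square(table):
    """Pearson's chi-square statistic for an r x c contingency table
    (list of rows of observed counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2, n

def cramers_v(table):
    """Effect size for a chi-square test: V = sqrt(chi2 / (n * (k - 1))),
    where k is the smaller table dimension. Ranges from 0 to 1."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = two genres, columns = three refactoring operations.
table = [[30, 10, 5],
         [12, 25, 8]]
v = cramers_v(table)
assert 0.0 <= v <= 1.0
```

A V around 0.2, as reported in Table 6, is conventionally read as a modest association, consistent with the authors' interpretation.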
Reported p-values for pairwise comparisons (including EA+PA vs. EM+TD) have been adjusted using the Benjamini-Hochberg procedure [3, 36]. AD notebooks have been excluded due to insufficient representation in our sample (n = 193).

The pattern observed in the pairwise post-hoc tests suggests that the exploratory genres might be distinct from the expository genres in terms of refactoring rates and types. Analogous to the above, a 2-way Kruskal-Wallis H test comparing aggregated exploratory (EA+PA) and expository (EM+TD) notebooks supports this, as these two groupings refactor at different rates (p<0.001), while an independence test for the distinctiveness of these meta-genres' refactoring profiles was significant (p=0.002). (The overall distribution of refactorings per commit fails a normality test, so ANOVA is inappropriate. We additionally applied Yates' continuity correction to avoid overestimation of statistical significance or effect size in light of the small frequencies (<5) for some of the less common refactorings [35].) Given the consistent patterns observed between the overall distribution of refactorings and the profile of selected refactoring operations, we find strong evidence that Programming Assignments, from a refactoring perspective, cluster with Exploratory Analyses into an exploratory meta-genre, and that the Educational Materials and Technical Demonstrations cluster into an expository meta-genre.

4.4 RQ-BG: Computational Notebook Authors with a Computing Background Refactor Differently, but not More

Table 3 summarizes refactoring behavior according to author background, and Table 5b further breaks down the use of refactoring operations. For completeness, Table 7 provides a breakdown of author background according to notebook genre. Overall we see a striking similarity between the Data Science category and the Computer Science category.
The major difference that we observe is that Data Scientists tend to favor Extract Module over Extract Class (6.4% versus 1.7%), compared to Computer Scientists (1.5% versus 6.6%). This could be attributed to computer scientists being influenced by their wide use of object-oriented practices in traditional development. Table 9 highlights this, showing that CS & IT authors use classes much more than others. Still, even CS & IT authors use classes lightly: only 6 of their 40 notebooks contain even a single class declaration. Those outside of Data Science and Computer Science ("Other" in Table 5b) are not especially unique, either, except that they employed Change Function Signature and Split Cell rather more, and Extract Function, Reorder Cells, and Rename rather less. The prevalence of Change Function Signature is almost entirely due to those with backgrounds in Mathematics and Finance, a small minority of the notebook authors in the Other category.

Table 7. Notebook genres by author background.

                         A. Refactorings per Commit    B. Refactoring Type Frequencies
    Author Backgrounds      H         p                  χ²        p        V
    Omnibus                0.40     0.819               29.60     0.042    0.16
    CS vs. DS                –       N/A                 8.62     0.473    0.12
    CS vs. Other             –       N/A                13.94     0.166    0.16
    DS vs. Other             –       N/A                19.87     0.037    0.19
    CS+DS vs. Other         N/A      N/A                20.62     0.037    0.19

Table 8. A. Result of Kruskal-Wallis H test for refactorings per commit (by notebook); post-hoc analysis by author background pairing is not applicable because the omnibus test is not significant. The post-hoc test does not return an H value. B. Results of Pearson's χ² tests of independence between all author backgrounds, post-hoc analysis via pairwise Pearson's χ² tests (with Yates' correction), and corresponding Cramér's V values. Reported p-values for pairwise comparisons (including CS+DS vs. Other) have been adjusted using the Benjamini-Hochberg procedure (n = 193).
As shown in Table 8, pairwise tests on the refactoring distributions by author category bear out these observations. Only the Data Science and Other pairing presents a statistically significant difference (p=0.037), with a modest effect size (V=0.19). There is scant evidence to suggest that Computer Scientists refactor distinctly from Others (p=0.166, V=0.16) or Data Scientists (p=0.473, V=0.12). We additionally compared all of those with computational backgrounds (CS+DS) and those with non-computational backgrounds (Other), and find similarly significant evidence to suggest differences in refactoring (p=0.037, V=0.19). Overall, then, we find strong evidence that notebook authors with computing-related backgrounds refactor differently than their non-computing counterparts, but they do not refactor more. This suggests that the rate of refactoring is mostly influenced by the evolutionary characteristics of the notebook genre, such as exploration, exposition, and changes to underlying technology.

    Background   Code Cells   Functions   Classes   Methods
    CS & IT         25.4        3.60       0.50      2.50
    DS & ML         22.8        4.09       0.28      0.83
    Other           27.4        2.58       0.23      0.97
    All             25.1        3.40       0.31      1.22

Table 9. Per-notebook averages for the use of structural language and notebook features.

4.5 RQ-CC: Computational Notebook Authors Mostly Tidy Code as They Go; EA Code Cleaned Up More

We sought to quantify the self-reports of incremental tidying and bigger cleanups for Exploratory Analysis notebooks [18, 31], as discussed in Section 2.3, with respect to the evolution of code, and compare these to other genres. Already in Section 4.3 we observed that Programming Assignments exhibit an exploratory character when it comes to the distribution of refactorings across commits and the selection of refactoring operations.
4.5.1 Exploratory Notebooks are not Refactored the Most. We take it as a given that refactoring is an act of tidying or cleaning: an attempt to alter the look and arrangement of code to better reflect the concepts embodied in the code or ease future changes. As shown in Table 2, Exploratory Analyses do not stand out in the amount of refactoring applied to them, whether measured relative to the number of commits or notebook code size (Educational Materials do).

4.5.2 Computational Notebook Authors "Floss" Much More than "Root Canal". Murphy-Hill and Black distinguish refactorings that are employed to keep code healthy as a part of ongoing development from refactorings that repair unhealthy code [25]. They name these two strategies floss refactoring and root-canal refactoring, respectively. The distinction is important because their relative frequency says something about developer priorities regarding the on-going condition of their code. One might suppose that the typical notebook author is unaware of the importance of keeping their notebook code structurally healthy, and so might postpone refactoring until the problems are acutely interfering with on-going development. Murphy-Hill et al. operationalized floss refactorings as those that are committed with other software changes, and root-canal refactorings as those that are committed alone [26]. While perhaps a low bar for identifying root-canal refactoring, still, only 4.8% of notebook commits containing refactorings are refactoring-only, with the remaining 95.2% of commits containing refactorings being classified as flossing (Table 2, column 8). (A note on the statistics in Section 4.4: the matching p-values for the DS vs. Other and CS+DS vs. Other comparisons are a result of the Benjamini-Hochberg p-value adjustment, a step-up procedure [3, 36]; the effect sizes appear equal only due to rounding.)
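Murphy-Hill et al.'s operationalization amounts to a simple per-commit rule. A sketch, using a hypothetical commit-record format (the real study classified nbdime diffs by hand):

```python
# Sketch: classifying refactoring commits as floss (refactorings mixed
# with other changes) vs. root-canal (refactoring-only), following
# Murphy-Hill et al.'s operationalization. Data model is hypothetical.

def classify(commit):
    """commit: dict with 'refactorings' (list of operation names)
    and 'other_changes' (bool)."""
    if not commit["refactorings"]:
        return "non-refactoring"
    return "floss" if commit["other_changes"] else "root-canal"

history = [
    {"refactorings": ["Rename"], "other_changes": True},     # tidying as you go
    {"refactorings": ["Extract Function", "Reorder Cells"],
     "other_changes": False},                                # dedicated cleanup
    {"refactorings": [], "other_changes": True},             # ordinary commit
]
labels = [classify(c) for c in history]
assert labels == ["floss", "root-canal", "non-refactoring"]
```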
Of the 23 refactoring-only commits, 10 come from Exploratory Analysis notebooks, 4 from Programming Assignments, 4 from Educational Material, 4 from Technical Demonstrations, and 1 from Analytical Demonstrations (see Figure 2). Exploratory Analyses' 10 refactoring-only commits, at 4.5% of all their refactoring commits, is slightly below the average of 4.8%. The 23 refactoring-only commits contain 28 refactorings in total, 1.22 refactorings per commit on average, a bit lower than the rate of 1.30 for commits containing both refactorings and other changes. The frequency of refactoring operations is shown in Table 10. It is notable that Extract Function did not occur in the refactoring-only commits, given its popularity generally (Table 4). A commit that contains many refactorings rather than solely refactorings, a quasi-root-canal classification, could also be a sign of cleanups. As shown in Figure 1, less than a quarter of refactoring commits contain two or more refactorings, only 25 contain three or more, and the most refactorings in one commit is six. Exploratory Analysis notebooks have a slightly above average number of multi-refactoring commits, and the 6-refactoring commit belongs to an Exploratory Analysis notebook. Although it is hard to define what would be enough refactorings in one commit to be evidence of a concerted cleanup, three would seem to be a low bar. Five or six is more interesting, but we saw just four of these. Overall, then, there is substantial evidence of tidying via floss refactoring, and little evidence of cleanups via root-canal refactoring.

Table 10. Frequency of refactoring operations within refactoring-only commits.

4.5.3 Computational Notebook Authors do not Perform much Architectural Refactoring. Architectural refactorings could be a signal of code cleanups, as they alter the global name scope of the notebook by creating a new scope and/or adding/removing entities from the global scope.
The architectural refactorings observed are Extract Function, Extract Module (which extracts functions or classes into a separate Python file), Extract Class, and Extract (Global) Constant. They account for 25% of Exploratory Analysis refactorings, a bit more than the 19% of the other genres. Extract Module and Extract Class are especially interesting, as they gather and move one or more functions into a new file and a new scope, respectively. Twenty-three Exploratory Analysis refactorings (7%) come from this category, compared to 22 (6%) for the other genres. Although 78% of observed refactorings are non-architectural, we see some support for cleanup behavior, and more so for Exploratory Analyses.

4.5.4 Computational Notebook Authors Refactor Throughout. Rule et al.'s interviewees reported performing cleanups after analysis was complete [31], whereas Kery et al.'s interviewees reported more periodic cleanups in response to accrued technical debt [18]. As such, we assessed when refactoring takes place over time. We plotted refactoring commits and non-refactoring commits over time, shown in Figure 2. Time is binned for each notebook according to the observed lifespan of the notebook. Interestingly, we see proportional spikes of both refactoring and non-refactoring commit activity in the first and last 5% of observed notebook lifetimes. This can be seen numerically in Table 11; in particular, commits during the middle 90% of notebook lifetime are 7.7 times less frequent than during the first 5% and 3.5 times less frequent than during the last 5%. Over a third of notebook activity happens during these two periods. Overall, commits containing refactorings occur at a rate that closely tracks the overall commit rate. The relative commit rate for Exploratory Analysis notebooks is nearly identical.
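The lifetime normalization behind Figure 2 can be sketched as follows: each notebook's commit timestamps are mapped onto equal slices of its own observed lifespan. The timestamps below are illustrative, not the paper's data:

```python
# Sketch: binning a notebook's commits over its normalized lifetime.

def bin_commits(timestamps, n_bins=20):
    """Map commit timestamps onto n_bins equal slices of [first, last]."""
    t0, t1 = min(timestamps), max(timestamps)
    span = (t1 - t0) or 1  # guard against a single-commit notebook
    counts = [0] * n_bins
    for t in timestamps:
        frac = (t - t0) / span
        counts[min(int(frac * n_bins), n_bins - 1)] += 1
    return counts

# A hypothetical history with activity clustered at the start and end,
# like the spikes in the first and last 5% described above:
ts = [0, 1, 2, 50, 98, 99, 100]
counts = bin_commits(ts, n_bins=20)
assert sum(counts) == len(ts)
assert counts[0] == 3 and counts[-1] == 3  # spikes at both ends
```

Normalizing per notebook lets notebooks with very different calendar lifespans be aggregated into one histogram.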
By treating each commit on a notebook as a tick of a virtual commit clock, as shown in Figure 2's inset, we can see that refactoring activity, on average, is much more uniform over observed notebooks' commit-lifetimes. Using this notion of commit-time, Figure 3 plots the frequency of actual refactorings, both architectural (dark blue) and non-architectural (light blue). Since the average number of refactorings per commit is 1.30, it is not surprising that the trend looks highly similar to Figure 2's inset. We observe a slight hump in the middle for both types of refactorings, and the rate of architectural refactoring closely tracks the non-architectural refactoring rate. Also, the commit-time rate of Exploratory Analysis architectural refactorings closely mirrors that of all notebooks (not shown). The rate of refactoring over time supports the hypothesis of on-going tidying. To the extent that architectural refactoring is a signal of cleaning, the data provides evidence for on-going cleaning as opposed to post-analysis cleanups.

Lifetime      Mean   Median   Std. Dev.
First 5%       5.0      4       4.6
Middle 90%    11.6      9       7.9
Last 5%        2.2      1       2.0
All           18.8     16       9.1

Table 11. Statistics on the distribution of commits over normalized notebook lifetime.

4.5.5 Computational Notebook Authors Modestly Comment Out and Delete Code. In the above-mentioned interview studies, Exploratory notebook authors attested to commenting out deprecated code as part of cleaning [18, 31]. We observed 151 commits in 92 notebooks in which code was commented out. Although consequential, this rate is much lower than the number of refactoring commits (475) and notebooks that contain refactorings (160). Still, 82 of those commits occurred in Exploratory Analysis notebooks, 5.8% of their commits, twice the rate of the other notebook genres. Although the numbers are small, they suggest that Exploratory Analysis notebooks are undergoing more tidying or cleaning up of this sort. Deletion of code can also be cleaning.
We measured deletions of non-comment code in commits (Figure 4). The median commit on an Exploratory Analysis notebook that results in a net deletion of code deletes a sizable 3.1% of a notebook's code. The next highest is Technology Demonstrations at 1.8%. The outliers (diamonds) are especially interesting, as they suggest large deletions indicative of cleanups. Each genre has about 20% outliers, suggesting that Exploratory notebooks don't stand out in this regard. Exploratory Analyses have 48 outlier deletions, 0.61 per notebook. Finally, we measured how much smaller a notebook's non-comment code size is between its maximum and its last commit. If code size shrinks substantially, then that is a sign of cleanups. Following a Pareto-like distribution, 20% of Exploratory Analysis notebooks shrink more than 22%, whereas 20% of other genres shrink only a little more than 10%. This is the strongest case for Exploratory Analysis notebooks undergoing cleanups distinctly from the other genres.

Fig. 2. Refactoring commits (bottom, dark blue) vs. non-refactoring commits (top, light blue) over normalized notebook lifetime (20 bins). Inset: Refactoring commits over normalized commit history (10 bins). Ten bins were used because refactoring commits are a small fraction of all commits.

Fig. 3. Refactorings over normalized commit history (10 bins). For each bar, architectural commits are shown at the bottom in dark blue, and non-architectural commits are shown at the top in light blue.

4.5.6 Summary for RQ-CC. Taken together, we see ample evidence of code tidying and some evidence of code cleanups across all genres. Exploratory Analysis notebooks see more code deleted and commented out. On average they apply more architectural refactorings and slightly more multiple-refactoring commits.
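The non-comment code-size and net-deletion measurements above can be approximated with a short sketch. The helper names and cell layout below are invented, and real notebook JSON stores a cell's source as a list of strings, which this simplification flattens to a single string; it is an illustration of the measurement, not the study's actual tooling.

```python
def non_comment_size(cells):
    """Count non-blank, non-comment code lines across a notebook's code cells."""
    count = 0
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue
        for line in cell.get("source", "").splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count


def net_deletion_pct(before_cells, after_cells):
    """Percent of non-comment code removed between two snapshots (0 if it grew)."""
    before = non_comment_size(before_cells)
    after = non_comment_size(after_cells)
    if before == 0:
        return 0.0
    return max(0.0, 100.0 * (before - after) / before)
```

Applied to each pair of adjacent commits, `net_deletion_pct` yields the per-commit deletion percentages plotted in Figure 4; applied to the maximum-size and final snapshots, it yields the shrinkage measure.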
The preponderance of observed code tidying is especially notable because any code that was introduced and refactored in the same commit (de facto code tidying) was not observable. This is discussed further in Section 6.

5 DISCUSSION

5.1 Refactoring: Intrinsic to Notebook Development Despite Small Size

Genres exhibit unique refactoring profiles, perhaps due to their distinct evolutionary drivers, but refactoring was observed in most notebooks of all genres. We were surprised to see substantial similarities in refactoring behavior among data scientists, computer scientists, and those from other backgrounds such as physical scientists. Likewise, our observation of a broad practice of floss refactoring, a best practice in traditional software development, is notable, as was the broader pattern of on-going maintenance over big clean-ups. These suggest that the pressures of technical debt are motivating notebook authors to perform regular notebook maintenance.

Fig. 4. Box plots for commits with net deletions of non-comment code, by genre.

We surmise that refactoring is intrinsic to notebook development, despite the small size of notebooks. Belady and Lehman's model for software development predicts that entropy increases exponentially with each change [2], and exploratory development tends to introduce many (often small) changes. (Although Belady and Lehman's article is titled "A Model of Large Program Development", the model itself does not depend on the size of the program.) Their model predicts that computational notebooks will experience increasing difficulty in making changes, eventually being forced to refactor or abandon the notebook (perhaps by copying essential code into a new notebook).

5.2 Notebook Authors Refactor Less, and Differently, than Traditional Developers

As observed in Section 4.2.1, 13% of commits for the notebooks in this study contain refactorings. This contrasts with the 33% found by Murphy-Hill et al. in their study of CVS logs of Java developers [26, §3.5]. Since we found that those with a computer science background actually refactored their notebooks a little less than others (See Section 4.4), it appears that unique characteristics of computational notebooks, such as their typical small size, are influential. Another influence could be that notebook authors commit their changes to GitHub less frequently, hiding more floss refactorings (See Section 6.1). Furthermore, as observed in Section 4.2.2, notebook authors rely heavily on a few basic refactorings. The same trend is seen amongst traditional Java developers, but with a distinct selection of refactoring operations. In Murphy-Hill et al.'s analysis of CVS commit logs of manual refactorings of Java code, the top four refactoring operations were, in order, Rename (29.7%), Push Down Method/Field (14.5%), Generalize Declared Type (9%), and Extract Function/Method (6.9%) [26, Fig. 2]. We observe the following:

5.2.1 Rename is much less common in notebook development. For the traditional developers, Rename occurred more than twice as often as the next refactoring, whereas in our sample of notebooks Rename came in fourth at 10.5%. Although the small size of the typical notebook compared to a Java application might be a partial explanation, the typical notebook makes heavy use of the global scope, putting pressure on the notebook author to maintain good naming as the notebook grows and evolves. We hypothesize that the difficulty of renaming identifiers in the Jupyter IDE, as discussed in Section 2.4, is a deterrent to renaming in notebooks.
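One hazard of manual renaming can be simulated directly: a live kernel keeps the old binding after a rename, so downstream cells that still reference the old name keep working until the kernel restarts, masking the incomplete rename. The variable names below are invented, and the `exec`-on-a-dict setup is a stand-in for a real Jupyter kernel's global namespace.

```python
# Simulate a notebook kernel's global namespace with a plain dict.
kernel_globals = {}

# Cell 1, original version: define `raw` (hypothetical names).
exec("raw = [3, 1, 2]", kernel_globals)

# The author manually renames `raw` to `readings` in Cell 1 and re-runs it.
# The old binding is not removed, so both names now exist in the kernel:
exec("readings = [3, 1, 2]", kernel_globals)

# A downstream cell that still says `raw` runs without a NameError,
# hiding the incomplete rename until the kernel is restarted.
exec("result = sorted(raw)", kernel_globals)

print("raw" in kernel_globals, kernel_globals["result"])  # True [1, 2, 3]
```

Restarting the kernel and re-running all cells would surface the missed reference as a `NameError`, which is why manual renames in notebooks tend to fail late rather than immediately.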
5.2.2 Object-oriented refactorings are much less common in notebook development. For traditional developers, the next two most common refactorings, Push Down Method/Field and Generalize Declared Type, are core to object-oriented development. Python supports object-oriented development, but as discussed in Section 4.4, it is not widely practiced in the notebooks in this study.

5.2.3 Cells are a key structural affordance in notebook evolution. Although the wide use of Jupyter's unique cell construct is not surprising, it is perhaps more surprising that the refactoring of cells is so common, at 40% of the total. However, as shown in Figure 9, the occurrence of code cells is over seven times greater than that of functions, the next most common construct, for the notebooks in this study. What we are likely observing is that cells are displacing functions, in comparison to traditional development. Another factor is that Reorder Cells, Split Cell, and Merge Cells are supported in the Jupyter IDE, unlike the other observed refactoring operations.

5.3 Need for Better Refactoring Support in Notebook IDEs

Implementing tool assistance for refactoring is non-trivial, and IDE developers might be hesitant to do so without evidence that the benefits outweigh the costs. The presence of refactoring in 80% of the computational notebooks sampled for this study argues for at least some support for traditional refactoring tools that enable author-directed refactorings. In particular, our results suggest that notebook authors using Jupyter would benefit from environment support for at least the three refactorings in the top six that are not yet automated: Change Function Signature, Extract Function, and Rename. Rename is particularly challenging to perform manually in a computational notebook because old symbols retain their values in the environment until the kernel is restarted or the symbol is removed by the notebook author.
A proper Rename refactoring would also rename the symbol in the environment, not just the code. Similarly, for Reorder Cells, which is supported only syntactically in computational notebooks, support for checking definition-use dependencies between reordered cells [14] could help avoid bugs for the many computational notebook authors who use that refactoring. As observed in Section 2.4, JetBrains' new DataSpell IDE provides a subset of the refactorings supported by their Python refactoring engine. Among these are three of the top six refactorings observed in our study, mixed in with several less useful object-oriented refactorings. The Rename provided by DataSpell does not rename the symbol in the runtime kernel. Although our results document that phases of code cleanups are not especially common, especially as compared to tidying, one possible reason is the lack of tool support. Without tool support or test suites, a notebook cleanup is high risk, as the substantial changes to a complex notebook could lead to hard-to-fix bugs. In this regard, our results argue for cleanup support like code gathering tools [15] (See Section 2.4). On the other hand, a customized mix of refactoring assistance for different genres or authors of differing backgrounds is not strongly supported. Although we observed differences, the effect sizes of the statistical tests are modest, and the overall top refactorings observed were performed in sizable percentages in notebooks of both exploratory and expository characters, as well as by authors of differing backgrounds.

5.4 Multi-Notebook Refactoring

In interviews, notebook authors said that they sometimes keep exploration and exposition in separate notebooks [31], split a multiple-analysis notebook into multiple notebooks [18], or drop deprecated code into a separate notebook [18, 31]. We saw evidence of these in the ample code deleted from notebooks (See Section 4.5.5).
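The definition-use check for Reorder Cells discussed in Section 5.3 can be sketched with Python's `ast` module. The helper names below are invented, and the analysis is deliberately rough (it ignores imports, attribute accesses, and use-before-definition within a single cell); it illustrates the idea, not a production checker.

```python
import ast
import builtins

BUILTINS = set(dir(builtins))


def defs_and_uses(cell_source):
    """Names assigned vs. names read in one code cell (simplified)."""
    tree = ast.parse(cell_source)
    defined, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
    return defined, used


def reorder_warnings(cells):
    """Flag cells that read a name no earlier cell (or the cell itself) defines."""
    seen, warnings = set(), []
    for i, src in enumerate(cells):
        defined, used = defs_and_uses(src)
        undefined = used - seen - defined - BUILTINS  # rough: ignores imports
        if undefined:
            warnings.append((i, sorted(undefined)))
        seen |= defined
    return warnings
```

Running such a check after a drag-and-drop reorder would let the IDE warn, for instance, that a plotting cell was moved above the cell that defines its data.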
Such actions create relationships among notebooks akin to version control. Extrapolating from a suggestion from one of Head et al.'s study participants [15], it may be valuable for notebook refactoring tools to be aware of these relationships, for example having an option for Rename to span related notebooks.

5.5 Future Work on Studies of Notebook Refactoring

The present study focused on refactorings introduced between commits on a notebook. Future work could, for example, use logs from instrumented notebook IDEs to reveal the full prevalence of refactoring, as well as study fine-grained evolution behaviors such as how refactoring tools are used (as Murphy-Hill et al. did for Java developers [26]) or track activity among related notebooks (such as those being used for ad hoc version control). Future work could also investigate which situations motivate the use of refactoring in computational notebooks, as well as the effectiveness of refactoring. The latter could be studied for both manual and assisted refactoring, examining properties such as improved legibility and cell ordering better reflecting execution dependencies, versus, say, the introduction of bugs or a worse design.

6 LIMITATIONS AND THREATS

6.1 Limitations Due to Use of Source Control

Many computational notebook authors may be unaware of source code control tools. This study omits their notebooks, and our results may not generalize to that population. However, numerous articles in the traditional sciences advocate for the use of tools like GitHub for the benefits of reproducibility, collaboration, and protecting valuable assets [24, 28], establishing it as a best practice, if not yet a universal one. Notebook authors who know how to use source code control are more likely to know software engineering best practices as well, for example if they had encountered Software Carpentry [33, 34]. By factoring our analysis according to author background, we partially mitigated this limitation.
Notebooks may have been omitted because authors deemed them too trivial to commit to source control. We also excluded notebooks with fewer than 10 commits, which are the vast majority of notebooks on GitHub. Still, our study includes notebooks with a wide range of sizes and numbers of commits. Our dependence on inter-commit analysis also presents limitations. Refactorings that occur in the same commit in which the refactored code is introduced are undetectable. This underreporting may not be uniform across refactoring operations, because intra-commit refactorings are by definition floss refactorings, which could be less architectural in nature. Even so, we observed little root canal refactoring. Additionally, our analysis is more likely to underreport for notebooks whose histories have a relatively small number of large commits rather than many small commits. Relatedly, some notebooks may not be committed early in their history, resulting in an initial large commit, hiding refactorings. Also, a few of the notebooks in our study are not "done": still in active development or in long-term maintenance. Given the marked bursts in overall commit rate early and late in the histories we captured (Figure 2 and Table 11), we are confident that we observed meaningful lifetimes for the vast majority of notebooks. To further quantify this, we performed an analysis of the similarity of the first and last commit of each notebook. We took a conservative approach, extracting the bag of lexical code tokens for each, and then calculated the similarity as the size of the (bag) intersection of the two commits, divided by the size of the larger bag. Half of the notebooks' first-last commits are less than 34% similar. At the extremes, 47 notebooks' first-last commits are less than 10% similar, whereas 14 notebooks' first-last commits are more than 90% similar.
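The first-versus-last similarity measure just described is straightforward to sketch. The regex tokenizer below is a stand-in for a real lexer, so this is an approximation of the described computation rather than the study's actual tooling.

```python
import re
from collections import Counter


def token_bag(code):
    """Bag (multiset) of lexical tokens; a simple regex stands in for a lexer."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def bag_similarity(code_a, code_b):
    """|bag intersection| / size of the larger bag, as in the first-last analysis."""
    a, b = token_bag(code_a), token_bag(code_b)
    larger = max(sum(a.values()), sum(b.values()))
    if larger == 0:
        return 1.0  # two empty snapshots are trivially identical
    return sum((a & b).values()) / larger
```

Dividing by the larger bag makes the measure conservative: a notebook that merely grew a lot scores as dissimilar, so high similarity genuinely indicates little change between the first and last commits.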
A future study could avoid the limitation of using source control snapshots by investigating notebook evolution through an instrumented notebook IDE (See Section 5.5).

6.2 Limitation Due to Use of Public Notebooks

Our study examined publicly-available notebooks. Notebooks authored as private assets may be evolved differently, perhaps due to having smaller audiences or fewer authors. Likewise, many programming assignments may be completed within a private repository due to academic integrity requirements. These limitations are partially mitigated by factoring our analysis by author background and notebook genre. In a separate vein, public notebooks on GitHub are frequently cloned, notably textbooks and fill-in-the-blank homework assignments, creating the possibility that our random sample might have selected the same notebook multiple times, skewing our results. Our exclusion of student fill-in-the-blank notebooks (Section 3.4) eliminated one class of possibilities. In the end, as discussed in Section 3.2, our random sample contained no duplicate notebooks.

6.3 Limitation Due to Single-Notebook Focus

Our analysis detected when Extract Module moved code out to a library package and referenced it by import, and we analyzed code deletion in the context of cleanups. Deleted code may have been pasted into another notebook (See Sections 4.5.5 and 5.4), but its destination was not tracked in our analysis. As mentioned in Section 5.5, a future study could track the copying and movement of code across notebooks to better understand their relationship to refactoring.

6.4 Internal Validity Threats Due to Use of Visual Inspection

We classified refactorings through visual inspection, which is susceptible to error. Some previous studies of professional developers used automated methods, but they were shown to detect only a limited range of refactorings [26].
We used visual inspection to enable the detection of idiomatic expressions of refactorings in Jupyter notebooks, regardless of the programming language used. Five notebooks did not employ Python. Refactorings have been standardized and formalized in the literature, and the authors are expert in software refactoring. The methods we practiced, as described in Sections 3.1 and 3.2, further controlled for mistakes. An audit, as described in Section 3.3, found a 9.9% error rate. Determining notebook genre was more difficult, as there is no scientific standard for these. As described in Section 3.4, we employed Negotiated Agreement to eliminate errors. Although we can claim our classifications to be stable, others could dispute our criteria for classification. As described in the same section, for notebook author background, an audit found a 5% error rate.

6.5 External Validity Threat Due to Sample Size

Finally, in order to enable a detailed and accurate inspection of each notebook, its commits, authorship, and containing repository, this study was limited to studying 200 notebooks. We randomized our selection process at multiple stages to ensure a representative sample.

7 CONCLUSION

Computational notebooks have emerged as an important medium for developing analytical software, particularly for those without a background in computing. In recent interview studies, authors of notebooks conducting exploratory analyses frequently spoke of tidying and cleaning up their notebooks. Little was known, however, about how notebook authors in general actually maintain their notebooks, especially as regards refactoring, a key practice among traditional developers. This article contributes a study of computational notebook refactoring in the wild through an analysis of the commit histories of 200 Jupyter notebooks on GitHub. In summary:
RQ-RF (Notebook Refactoring): Despite the small size of computational notebooks, notebook authors refactor, even if they lack a background related to computing. From this we surmise that refactoring is intrinsic to notebook development. Authors depend primarily on a few non-object-oriented refactorings (in order): Change Function Signature, Extract Function, Reorder Cells, Rename, Split Cell, and Merge Cells. Traditional developers coding in languages like Java refactor more than twice as much. They prioritize the same non-cell operations as notebook authors, but apply Rename most frequently and favor a more object-oriented mix of refactorings.

RQ-GR (Refactoring by Genre): Computational notebooks of all genres undergo consequential refactoring, suggesting that the messiness of exploration often discussed in the literature is not the only driver of refactoring. Programming assignments (e.g., term projects) appear rather similar to Exploratory Analyses with respect to how they are refactored, despite their different end purpose. Overall, refactoring behaviors are differentiated by the notebook's exploratory versus expository purpose.

RQ-BG (Refactoring by Background): Computational notebook authors with a computing background (computer scientists and data scientists) seem to refactor differently than others, but not more. This adds weight to the conclusion above that refactoring is intrinsic to computational notebook development.

RQ-CC (Tidying vs. Cleanups of Code): Computational notebook authors exhibit a pattern of ongoing code tidying. Cleanups, cited in interview studies, were less evident, although they occur more in exploratory analyses. Cleanups appear to be achieved more often by moving code into new notebooks, a kind of ad hoc version control. Our results suggest that notebook authors might benefit from IDE support for Change Function Signature, Extract Function, and Rename, with Rename taking the kernel state into account.
Also, given the frequency of use of the Reorder Cells operation, notebook authors might benefit from it being extended to check for definition-use dependencies. To replicate and extend these results, future work could instrument notebook IDEs to log fine-grained evolution behaviors and how refactoring tools are used, including across related notebooks. Future work could also study the circumstances that motivate the use of refactoring in computational notebooks, as well as the net benefits of refactoring with respect to factors like legibility, cell ordering reflecting execution dependencies, and the accidental introduction of bugs.

ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation under Grant No. CCF-1719155. We thank Jim Hollan for reading an early draft of this paper. We are grateful to the reviewers for their insightful and constructive comments, which helped make this a much better article.

REFERENCES

[1] Sandro Badame and Danny Dig. 2012. Refactoring Meets Spreadsheet Formulas. In Proceedings of the 2012 IEEE International Conference on Software Maintenance (ICSM '12). IEEE Computer Society, Washington, DC, USA, 399–409. https://doi.org/10.1109/ICSM.2012.6405299
[2] L. A. Belady and M. M. Lehman. 1976. A Model of Large Program Development. IBM Syst. J. 15, 3 (Sept. 1976), 225–252. https://doi.org/10.1147/sj.153.0225
[3] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.
[4] Fred Brooks. 1975. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading, MA. 195 pages.
[5] Nanette Brown, Yuanfang Cai, Yuepu Guo, Rick Kazman, Miryung Kim, Philippe Kruchten, Erin Lim, Alan MacCormack, Robert Nord, Ipek Ozkaya, et al. 2010. Managing Technical Debt in Software-Reliant Systems.
In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research (Santa Fe, New Mexico, USA) (FoSER '10). Association for Computing Machinery, New York, NY, USA, 47–52. https://doi.org/10.1145/1882362.1882373
[6] Souti Chattopadhyay, Ishita Prasad, Austin Z. Henley, Anita Sarma, and Titus Barik. 2020. What's Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376729
[7] Harald Cramér. 1999. Mathematical Methods of Statistics (PMS-9). Princeton University Press, Princeton, NJ. http://www.jstor.org/stable/j.ctt1bpm9r4
[8] Ward Cunningham. 1992. The WyCash Portfolio Management System. In Addendum to the Proceedings on Object-Oriented Programming Systems, Languages, and Applications (Addendum) (Vancouver, British Columbia, Canada) (OOPSLA '92). Association for Computing Machinery, New York, NY, USA, 29–30. https://doi.org/10.1145/157709.157715
[9] Olive Jean Dunn. 1964. Multiple comparisons using rank sums. Technometrics 6, 3 (1964), 241–252.
[10] Martin Fowler. 2020. Refactoring Catalog. https://refactoring.com/catalog/. Accessed: 2020-02-17.
[11] Martin Fowler, Kent Beck, John Brant, William Opdyke, and Don Roberts. 1999. Refactoring: Improving the Design of Existing Code. Addison-Wesley, Boston, MA, USA.
[12] D. R. Garrison, M. Cleveland-Innes, Marguerite Koole, and James Kappelman. 2006. Revisiting methodological issues in transcript analysis: Negotiated coding and reliability. The Internet and Higher Education 9, 1 (2006), 1–8. https://doi.org/10.1016/j.iheduc.2005.11.001
[13] GitHub Search 2020. Search via GitHub API. http://api.github.com/search/code?q=python+language:jupyter-notebook. Accessed: 2020-08-26.
[14] William G.
Griswold and David Notkin. 1993. Automated Assistance for Program Restructuring. ACM Trans. Softw. Eng. Methodol. 2, 3 (July 1993), 228–269. https://doi.org/10.1145/152388.152389
[15] Andrew Head, Fred Hohman, Titus Barik, Steven M. Drucker, and Robert DeLine. 2019. Managing Messes in Computational Notebooks. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 270, 12 pages. https://doi.org/10.1145/3290605.3300500
[16] Mary Beth Kery, Amber Horvath, and Brad Myers. 2017. Variolite: Supporting Exploratory Programming by Data Scientists. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI '17). Association for Computing Machinery, New York, NY, USA, 1265–1276. https://doi.org/10.1145/3025453.3025626
[17] Mary Beth Kery and Brad A. Myers. 2018. Interactions for Untangling Messy History in a Computational Notebook. In 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 147–155. https://doi.org/10.1109/VLHCC.2018.8506576
[18] Mary Beth Kery, Marissa Radensky, Mahima Arya, Bonnie E. John, and Brad A. Myers. 2018. The Story in the Notebook: Exploratory Data Science Using a Literate Programming Tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 174, 11 pages. https://doi.org/10.1145/3173574.3173748
[19] Donald E. Knuth. 1984. Literate Programming. Comput. J. 27, 2 (May 1984), 97–111. https://doi.org/10.1093/comjnl/27.2.97
[20] Andreas P. Koenzen, Neil A. Ernst, and Margaret-Anne D. Storey. 2020. Code Duplication and Reuse in Jupyter Notebooks. In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 1–9. https://doi.org/10.1109/VL/HCC50065.2020.9127202
[21] William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal
of the American Statistical Association 47, 260 (1952), 583–621.
[22] LSP 2021. LSP integration for jupyter[lab]. https://jupyterlab-lsp.readthedocs.io/en/latest/index.html. Accessed: 2021-12-29.
[23] Stephen Macke, Hongpu Gong, Doris Jung-Lin Lee, Andrew Head, Doris Xin, and Aditya Parameswaran. 2021. Fine-Grained Lineage for Safer Notebook Interactions. Proc. VLDB Endow. 14, 6 (Feb. 2021), 1093–1101. https://doi.org/10.14778/3447689.3447712
[24] Florian Markowetz. 2015. Five Selfish Reasons to Work Reproducibly. Genome Biology 16 (2015), 274. https://doi.org/10.1186/s13059-015-0850-7
[25] Emerson Murphy-Hill and Andrew P. Black. 2008. Refactoring Tools: Fitness for Purpose. IEEE Softw. 25, 5 (Sept. 2008), 38–44. https://doi.org/10.1109/MS.2008.123
[26] Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black. 2009. How We Refactor, and How We Know It. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 287–297. https://doi.org/10.1109/ICSE.2009.5070529
[27] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175.
[28] Jeffrey Perkel. 2016. Democratic databases: science on GitHub. Nature 538, 7623 (October 2016), 127–128. https://doi.org/10.1038/538127a
[29] Jeffrey Perkel. 2018. Why Jupyter is data scientists' computational notebook of choice. Nature 563 (Nov. 2018), 145–146. https://doi.org/10.1038/d41586-018-07196-1
[30] Laurent Peuch. 2020. Welcome to RedBaron's documentation! https://redbaron.readthedocs.io/en/latest/. Accessed: 2020-08-25.
[31] Adam Rule, Aurélien Tabard, and James D. Hollan. 2018. Exploration and Explanation in Computational Notebooks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 32, 12 pages. https://doi.org/10.1145/3173574.3173606
[32] Kathryn T. Stolee and Sebastian Elbaum. 2011. Refactoring Pipe-like Mashups for End-user Programmers. In Proceedings of the 33rd International Conference on Software Engineering (Waikiki, Honolulu, HI, USA) (ICSE '11). ACM, New York, NY, USA, 81–90. https://doi.org/10.1145/1985793.1985805
[33] G. Wilson. 2006. Software Carpentry: Getting Scientists to Write Better Code by Making Them More Productive. Computing in Science & Engineering 8, 6 (Nov. 2006), 66–69. https://doi.org/10.1109/MCSE.2006.122
[34] Greg Wilson. 2014. Software Carpentry: Lessons Learned. F1000Research 3 (2014), 62. https://doi.org/10.12688/f1000research.3-62.v2
[35] Frank Yates. 1934. Contingency tables involving small numbers and the χ² test. Supplement to the Journal of the Royal Statistical Society 1, 2 (1934), 217–235.
[36] Daniel Yekutieli and Yoav Benjamini. 1999. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference 82, 1–2 (1999), 171–196.

ACM Transactions on Software Engineering and Methodology (TOSEM), Association for Computing Machinery. Published: Apr 26, 2023.