Publications

You can also find my articles on my Google Scholar profile.

Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem

Published in IEEE Transactions on Software Engineering (TSE), 2026

Tao Xiao, Dong Wang, Shane McIntosh, Hideaki Hata, Yasutaka Kamei [pdf]

Automated regression testing is a cornerstone of modern software development, often contributing directly to code review and Continuous Integration (CI). Yet some tests suffer from flakiness, where their outcomes vary non-deterministically. Flakiness erodes developer trust in test results, wastes computational resources, and undermines CI reliability. While prior research has examined test flakiness within individual projects, its broader ecosystem-wide impact remains largely unexplored. In this paper, we present an empirical study of test flakiness in the OpenStack ecosystem, which focuses on (1) cross-project flakiness, where flaky tests impact multiple projects, and (2) inconsistent flakiness, where a test exhibits flakiness in some projects but remains stable in others. By analyzing 649 OpenStack projects, we identify 1,535 cross-project flaky tests and 1,105 inconsistently flaky tests. We find that cross-project flakiness affects 55% of OpenStack projects and significantly increases both review time and computational costs. Surprisingly, 70% of unit tests exhibit cross-project flakiness, challenging the assumption that unit tests are inherently insulated from issues that span modules like integration and system-level tests. Through qualitative analysis, we observe that race conditions in CI, inconsistent build configurations, and dependency mismatches are the primary causes of inconsistent flakiness. These findings underline the need for better coordination across complex ecosystems, standardized CI configurations, and improved test isolation strategies.

Self-Admitted GenAI Usage in Open-Source Software

Published in IEEE Transactions on Software Engineering (TSE), 2026

Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, Sebastian Baltes [pdf]

The widespread adoption of generative AI (GenAI) tools such as GitHub Copilot and ChatGPT is transforming software development. Since generated source code is virtually impossible to distinguish from manually written code, their real-world usage and impact on open-source software (OSS) development remain poorly understood. In this paper, we introduce the concept of self-admitted GenAI usage, that is, developers explicitly referring to the use of GenAI tools for content creation in software artifacts. Using this concept as a lens to study how GenAI tools are integrated into OSS projects, we analyze a curated sample of more than 200,000 GitHub repositories, identifying 1,292 such self-admissions across 156 repositories in commit messages, code comments, and project documentation. Using a mixed methods approach, we derive a taxonomy of 32 tasks, 10 content types, and 11 purposes associated with GenAI usage based on 1,292 qualitatively coded mentions. We then analyze 13 documents with policies and usage guidelines for GenAI tools and conduct a developer survey to uncover the ethical, legal, and practical concerns behind them. Our findings reveal that developers actively manage how GenAI is used in their projects, highlighting the need for project-level transparency, attribution, and quality control practices in AI-assisted software development. Finally, we examine the longitudinal impact of GenAI adoption on code churn in 151 repositories with self-admitted GenAI usage and find no general increase, contradicting popular narratives on the impact of GenAI on software development.

AILINKPREVIEWER: Enhancing Code Reviews with LLM-Powered Link Previews

Published in 32nd Asia-Pacific Software Engineering Conference, 2025

Panya Trakoolgerntong, Tao Xiao, Masanari Kondo, Chaiyong Ragkhitwetsagul, Morakot Choetkiertikul, Pattaraporn Sangaroonsilp, Yasutaka Kamei [pdf]

Code review is a key practice in software engineering, where developers evaluate code changes to ensure quality and maintainability. Links to issues and external resources are often included in Pull Requests (PRs) to provide additional context, yet they are typically discarded in automated tasks such as PR summarization and code review comment generation. This limits the richness of information available to reviewers and increases cognitive load by forcing context-switching. To address this gap, we present AILINKPREVIEWER, a tool that leverages Large Language Models (LLMs) to generate previews of links in PRs using PR metadata, including titles, descriptions, comments, and link body content. We analyzed 50 engineered GitHub repositories and compared three approaches: Contextual LLM summaries, Non-Contextual LLM summaries, and Metadata-based previews. The results in metrics such as BLEU, BERTScore, and compression ratio show that contextual summaries consistently outperform other methods. However, in a user study with seven participants, most preferred non-contextual summaries, suggesting a trade-off between metric performance and perceived usability. These findings demonstrate the potential of LLM-powered link previews to enhance code review efficiency and to provide richer context for developers and automation in software engineering. The video demo is available at https://www.youtube.com/watch?v=h2qH4RtrB3E, and the tool and its source code can be found at https://github.com/c4rtune/AILinkPreviewer.

How Far Have LLMs Come Toward Automated SATD Taxonomy Construction?

Published in 32nd Asia-Pacific Software Engineering Conference, 2025

Sota Nakashima, Yuta Ishimoto, Masanari Kondo, Tao Xiao, Yasutaka Kamei [pdf]

Technical debt refers to suboptimal code that degrades software quality. When developers intentionally introduce such debt, it is called self-admitted technical debt (SATD). Since SATD hinders maintenance, identifying its categories is key to uncovering quality issues. Traditionally, constructing such taxonomies requires manually inspecting SATD comments and surrounding code, which is time-consuming, labor-intensive, and often inconsistent due to annotator subjectivity. In this study, we investigated to what extent large language models (LLMs) could generate SATD taxonomies. We designed a structured, LLM-driven pipeline that mirrors the taxonomy construction steps researchers typically follow. We evaluated it on SATD datasets from three domains: quantum software, smart contracts, and machine learning. It successfully recovered domain-specific categories reported in prior work, such as Layer Configuration in machine learning. It also completed taxonomy generation in under two hours and for less than $1, even on the largest dataset. These results suggest that, while full automation remains challenging, LLMs can support semi-automated SATD taxonomy construction. Furthermore, our work opens up avenues for future work, such as automated taxonomy generation in other areas.

Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions

Published in The ACM International Conference on the Foundations of Software Engineering (FSE), 2024

Tao Xiao, Hideaki Hata, Christoph Treude, Kenichi Matsumoto [pdf]

GitHub’s Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs, such as generating summaries of changes or providing complete walkthroughs with links to the relevant code. As this innovative technology gains traction in the Open Source Software (OSS) community, it is crucial to examine its early adoption and its impact on the development process. Additionally, it offers a unique opportunity to observe how developers respond when they disagree with the generated content. In our study, we employ a mixed-methods approach, blending quantitative analysis with qualitative insights, to examine 18,256 PRs in which parts of the descriptions were crafted by generative AI. Our findings indicate that: (1) Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption. (2) PRs enhanced by Copilot for PRs require less review time and have a higher likelihood of being merged. (3) Developers using Copilot for PRs often complement the automated descriptions with their manual input. These results offer valuable insights into the growing integration of generative AI in software development.

How Trustworthy Is Your Continuous Integration (CI) Accelerator?: A Comparison of the Trustworthiness of CI Acceleration Products

Published in IEEE Software, 2024

Zhili Zeng, Tao Xiao, Maxime Lamothe, Hideaki Hata, Shane McIntosh [pdf]

The practice of Continuous Integration (CI) allows developers to quickly integrate and verify project modifications. Thus, CI acceleration products are a boon to developers seeking rapid feedback. However, if outcomes vary between accelerated and non-accelerated settings, the trustworthiness of the acceleration is called into question. In this paper, we study the trustworthiness of two CI acceleration products, one based on program analysis (PA) and the other on machine learning (ML). We re-execute 50 failing builds from ten open-source projects in non-accelerated (baseline), PA-accelerated, and ML-accelerated settings. We find that when applied to known failing builds, PA-accelerated builds more often (43.83 percentage point difference across ten projects) align with the non-accelerated build results. We conclude that while there is still room for improvement for both CI acceleration products, the selected PA-product currently provides a more trustworthy signal of build outcomes than the ML-product.

A Mutation-Guided Assessment of Acceleration Approaches for Continuous Integration: An Empirical Study of Yourbase

Published in 21st IEEE International Conference on Mining Software Repositories (MSR), 2024

Zhili Zeng, Tao Xiao, Maxime Lamothe, Hideaki Hata, Shane McIntosh [pdf]

Continuous Integration (CI) is a popular software development practice that quickly verifies updates to codebases. To cope with the ever-increasing demand for faster software releases, CI acceleration approaches have been proposed; however, adoption of CI acceleration is not without risks. For example, CI acceleration products may mislabel change sets (e.g., a build labeled as failing that passes in an unaccelerated setting or vice versa) or produce results that are inconsistent with an unaccelerated build (e.g., the underlying reasons for failure differ between (un)accelerated builds). These inconsistencies threaten the trustworthiness of CI acceleration products. In this paper, we propose an approach inspired by mutation testing to systematically evaluate the trustworthiness of CI acceleration. We apply our approach to YourBase, a program analysis-based CI acceleration product, and uncover issues that hinder its trustworthiness. First, we study how often the same build in accelerated and unaccelerated CI settings produce different mutation testing outcomes. We call mutants with different outcomes in the two settings “gap mutants”. Next, we study the code locations where gap mutants appear. Finally, we inspect gap mutants to understand why acceleration causes them to survive. Our analysis of ten open-source projects uncovers 2,237 gap mutants. We find that: (1) the gap mutants account for 0.11%–23.50% of the studied mutants; (2) 88.95% of gap mutants can be mapped to specific source code functions and classes using the dependency representation of the studied CI acceleration product; and (3) 69% of gap mutants survive CI acceleration due to deterministic reasons that can be classified into six fault patterns. Our results show that even deterministic CI acceleration solutions suffer from trustworthiness limitations, and highlight the ways in which trustworthiness could be pragmatically improved.

DevGPT: Studying Developer-ChatGPT Conversations

Published in 21st IEEE International Conference on Mining Software Repositories (MSR), 2024

Tao Xiao, Christoph Treude, Hideaki Hata, Kenichi Matsumoto [pdf]

This paper introduces DevGPT, a dataset curated to explore how software developers interact with ChatGPT, a prominent large language model (LLM). The dataset encompasses 29,778 prompts and responses from ChatGPT, including 19,106 code snippets, and is linked to corresponding software development artifacts such as source code, commits, issues, pull requests, discussions, and Hacker News threads. This comprehensive dataset is derived from shared ChatGPT conversations collected from GitHub and Hacker News, providing a rich resource for understanding the dynamics of developer interactions with ChatGPT, the nature of their inquiries, and the impact of these interactions on their work. DevGPT enables the study of developer queries, the effectiveness of ChatGPT in code generation and problem solving, and the broader implications of AI-assisted programming. By providing this dataset, the paper paves the way for novel research avenues in software engineering, particularly in understanding and improving the use of LLMs like ChatGPT by developers.

“My GitHub Sponsors profile is live!” Investigating the Impact of Twitter/X Mentions on GitHub Sponsors

Published in 46th International Conference on Software Engineering (ICSE), 2024

Youmei Fan, Tao Xiao, Christoph Treude, Hideaki Hata, Kenichi Matsumoto [pdf]

GitHub Sponsors was launched in 2019, enabling donations to open-source software developers to provide financial support, as per GitHub’s slogan: “Invest in the projects you depend on”. However, a 2022 study on GitHub Sponsors found that only two-fifths of developers who were seeking sponsorship received a donation. The study found that, other than internal actions (such as offering perks to sponsors), developers had advertised their GitHub Sponsors profiles on social media, such as Twitter (also known as X). Therefore, in this work, we investigate the impact of tweets that contain links to GitHub Sponsors profiles on sponsorship, as well as their reception on Twitter/X. We further characterize these tweets to understand their context and find that (1) such tweets have the impact of increasing the number of sponsors acquired, (2) compared to other donation platforms such as Open Collective and Patreon, GitHub Sponsors has significantly fewer interactions but is more visible on Twitter/X, and (3) developers tend to contribute more to open-source software during the week of posting such tweets. Our findings are the first step toward investigating the impact of social media on obtaining funding to sustain open-source software.

Quantifying and Characterizing Clones of Self-Admitted Technical Debt in Build Systems

Published in Empirical Software Engineering (ESE), 2024

Tao Xiao, Zhili Zeng, Dong Wang, Hideaki Hata, Shane McIntosh, Kenichi Matsumoto [pdf]

Self-Admitted Technical Debt (SATD) annotates development decisions that intentionally exchange long-term software artifact quality for short-term goals. Recent work explores the existence of SATD clones (duplicate or near duplicate SATD comments) in source code. Cloning of SATD in build systems (e.g., CMake and Maven) may propagate suboptimal design choices, threatening qualities of the build system that stakeholders rely upon (e.g., maintainability, reliability, repeatability). Hence, we conduct a large-scale study on 50,608 SATD comments extracted from Autotools, CMake, Maven, and Ant build systems to investigate the prevalence of SATD clones and to characterize their incidences. We observe that: (i) prior work suggests that 41–65% of SATD comments in source code are clones, but in our studied build system context, the rates range from 62% to 95%, suggesting that SATD clones are a more prevalent phenomenon in build systems than in source code; (ii) statements surrounding SATD clones are highly similar, with 76% of occurrences having similarity scores greater than 0.8; (iii) a quarter of SATD clones are introduced by the author of the original SATD statements; and (iv) among the most commonly cloned SATD comments, external factors (e.g., platform and tool configuration) are the most frequent locations, limitations in tools and libraries are the most frequent causes, and developers often copy SATD comments that describe issues to be fixed later. Our work presents the first step toward systematically understanding SATD clones in build systems and opens up avenues for future work, such as distinguishing different SATD clone behavior, as well as designing an automated recommendation system for repaying SATD effectively based on resolved clones.

More than React: Investigating the Role of Emoji Reaction in GitHub Pull Requests

Published in Empirical Software Engineering (ESE), 2023

Dong Wang, Tao Xiao, Teyon Son, Raula Gaikovina Kula, Takashi Ishio, Yasutaka Kamei, Kenichi Matsumoto [pdf]

Open source software development has become more social and collaborative, as is evident on GitHub. Since 2016, GitHub started to support more informal methods such as emoji reactions, with the goal to reduce commenting noise when reviewing any code changes to a repository. From a code review context, the extent to which emoji reactions facilitate a more efficient review process is unknown. We conduct an empirical study to mine 1,850 active repositories across seven popular languages to analyze 365,811 Pull Requests (PRs) for their emoji reactions against the review time, first-time contributors, comment intentions, and the consistency of the sentiments. Answering these four research perspectives, we first find that the number of emoji reactions has a significant correlation with the review time. Second, our results show that a PR submitted by a first-time contributor is less likely to receive emoji reactions. Third, the results reveal that comments with an intention of information giving are more likely to receive an emoji reaction. Fourth, we observe that only a small proportion of sentiments are not consistent between comments and emoji reactions, i.e., with 11.8% of instances being identified. In these cases, the prevalent reason is when reviewers cheer up authors that admit to a mistake, i.e., acknowledge a mistake. Apart from reducing commenting noise, our work suggests that emoji reactions play a positive role in facilitating collaborative communication during the review process.

18 million links in commit messages: purpose, evolution, and decay

Published in Empirical Software Engineering (ESE), 2023

Tao Xiao, Sebastian Baltes, Hideaki Hata, Christoph Treude, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto [pdf]

Commit messages contain diverse and valuable types of knowledge in all aspects of software maintenance and evolution. Links are an example of such knowledge. Previous work on “9.6 million links in source code comments” showed that links are prone to decay, become outdated, and lack bidirectional traceability. We conducted a large-scale study of 18,201,165 links from commits in 23,110 GitHub repositories to investigate whether they suffer the same fate. Results show that referencing external resources is prevalent and that the most frequent domains other than github.com are the external domains of Stack Overflow and Google Code. Similarly, links serve as source code context to commit messages, with inaccessible links being frequent. Although repeatedly referencing links is rare (4%), 14% of links that are prone to evolve become unavailable over time; e.g., tutorials or articles and software homepages become unavailable over time. Furthermore, we find that 70% of the distinct links suffer from decay; the domains that occur the most frequently are related to Subversion repositories. We summarize that links in commits share the same fate as links in code, opening up avenues for future work.

Understanding the Role of Images on Stack Overflow

Published in 20th IEEE International Conference on Mining Software Repositories (MSR), 2023

Dong Wang, Tao Xiao, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, Yasutaka Kamei [pdf]

Images are increasingly being shared by software developers in diverse channels including question-and-answer forums like Stack Overflow. Although prior work has pointed out that these images are meaningful and provide complementary information compared to their associated text, how images are used to support questions is empirically unknown. To address this knowledge gap, in this paper we specifically conduct an empirical study to investigate (I) the characteristics of images, (II) the extent to which images are used in different question types, and (III) the role of images on receiving answers. Our results first show that user interface is the most common image content and undesired output is the most frequent purpose for sharing images. Moreover, these images essentially facilitate the understanding of 68% of sampled questions. Second, we find that discrepancy questions are relatively more frequent compared to those without images, but there are no significant differences observed in description length in all types of questions. Third, the quantitative results statistically validate that questions with images are more likely to receive accepted answers, but do not speed up the time to receive answers. Our work demonstrates the crucial role that images play by approaching the topic from a new angle and lays the foundation for future opportunities to use images to assist in tasks like generating questions and identifying question-relatedness.

GitHub Sponsors: Exploring a New Way to Contribute to Open Source

Published in 44th International Conference on Software Engineering (ICSE), 2022

Shimada Naomichi, Tao Xiao, Hideaki Hata, Christoph Treude, Kenichi Matsumoto [pdf]

GitHub Sponsors, launched in 2019, enables donations to individual open source software (OSS) developers. Financial support for OSS maintainers and developers is a major issue in terms of sustaining OSS projects, and the ability to donate to individuals is expected to support the sustainability of developers, projects, and community. In this work, we conducted a mixed-methods study of GitHub Sponsors, including quantitative and qualitative analyses, to understand the characteristics of developers who are likely to receive donations and what developers think about donations to individuals. We found that: (1) sponsored developers are more active than non-sponsored developers, (2) the possibility to receive donations is related to whether there is someone in their community who is donating, and (3) developers are sponsoring as a new way to contribute to OSS. Our findings are the first step towards data-informed guidance for using GitHub Sponsors, opening up avenues for future work on this new way of financially sustaining the OSS community.

Characterizing and Mitigating Self-Admitted Technical Debt in Build Systems

Published in IEEE Transactions on Software Engineering (TSE), 2021

Tao Xiao, Dong Wang, Shane McIntosh, Hideaki Hata, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto [pdf]

Technical Debt is a metaphor used to describe the situation in which long-term software artifact quality is traded for short-term goals in software projects. In recent years, the concept of self-admitted technical debt (SATD) was proposed, which focuses on debt that is intentionally introduced and described by developers. Although prior work has made important observations about admitted technical debt in source code, little is known about SATD in build systems. In this paper, we set out to better understand the characteristics of SATD in build systems. To do so, through a qualitative analysis of 500 SATD comments in the Maven build system of 291 projects, we characterize SATD by location and rationale (reason and purpose). Our results show that limitations in tools and libraries, and complexities of dependency management are the most frequent causes, accounting for 50% and 24% of the comments. We also find that developers often document SATD as issues to be fixed later. As a first step towards the automatic detection of SATD rationale, we train classifiers to detect the two most frequently occurring reasons and the four most frequently occurring purposes of SATD in the content of comments in Maven build systems. The classifier performance is promising, achieving an F1-score of 0.71–0.79. Finally, within 16 identified ‘ready-to-be-addressed’ SATD instances, the three SATD submitted by pull requests and the five SATD submitted by issue reports were resolved after developers were made aware. Our work presents the first step towards understanding technical debt in build systems and opens up avenues for future work, such as tool support to track and manage SATD backlogs.

More Than React: Investigating The Role of Emoji Reaction in GitHub Pull Requests

Published in 37th International Conference on Software Maintenance and Evolution (ICSME), 2021

Teyon Son, Tao Xiao, Dong Wang, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto [pdf]

Context: Open source software development has become more social and collaborative, especially with the rise of social coding platforms like GitHub. Since 2016, GitHub started to support more informal methods such as emoji reactions, with the goal to reduce commenting noise when reviewing any code changes to a repository. Interestingly, preliminary results indicate that emojis do not always reduce commenting noise (i.e., eight out of 20 emoji reactions), providing evidence that developers use emojis with ulterior intentions. From a reviewing context, the extent to which emoji reactions facilitate a more efficient review process is unknown. Objective: In this registered report, we introduce the study protocols to investigate ulterior intentions and usages of emoji reactions, apart from reducing commenting noise during the discussions in GitHub pull requests (PRs). As part of the report, we first perform a preliminary analysis of whether emoji reactions can reduce commenting noise in PRs and then introduce the execution plan for the study. Method: We will use a mixed-methods approach in this study, i.e., quantitative and qualitative, with three hypotheses to test.

Understanding Shared Links and Their Intentions to Meet Information Needs in Modern Code Review: A Case Study of the OpenStack and Qt Projects

Published in Empirical Software Engineering (ESE), 2021

Dong Wang, Tao Xiao, Patanamon Thongtanunam, Raula Gaikovina Kula, Kenichi Matsumoto [pdf]

Code reviews serve as a quality assurance activity for software teams. Especially for Modern Code Review, sharing a link during a review discussion serves as an effective awareness mechanism where “Code reviews are good FYIs [for your information].” Although prior work has explored link sharing and the information needs of a code review, the extent to which links are used to properly conduct a review is unknown. In this study, we performed a mixed-method approach to investigate the practice of link sharing and their intentions. First, through a quantitative study of the OpenStack and Qt projects, we identify 19,268 reviews that have 39,686 links to explore the extent to which the links are shared, and analyze a correlation between link sharing and review time. Then in a qualitative study, we manually analyze 1,378 links to understand the role and usefulness of link sharing. Results indicate that internal links are more widely referred to (93% and 80% for the two projects). Importantly, although the majority of the internal links reference reviews, bug reports and source code are also shared in review discussions. The statistical models show that the number of internal links as an explanatory factor does have an increasing relationship with the review time. Finally, we present seven intentions of link sharing, with providing context being the most common intention for sharing links. Based on the findings and a developer survey, we encourage the patch author to provide clear context and explore both internal and external resources, while the review team should continue link sharing activities. Future research directions include the investigation of causality between sharing links and the review process, as well as the potential for tool support.