Data Mining Software Version Histories
Making changes within a complex software system is often error-prone: even the smallest mistake can endanger the entire system. Ten years ago, computer scientists in Saarbrücken led by Prof. Andreas Zeller developed a technique that automatically suggests how to manage changes in software, based on the program’s version history. Their work has now been named the most influential contribution of the last ten years at the International Conference on Software Engineering.
In the award-winning research paper, the Saarbrücken researchers examined, for the first time, how software develops over a long period. This development is recorded in version histories, which store the changes made to the software. The researchers applied analysis methods to these histories similar to those used by the US online retailer Amazon. On Amazon, customers are given recommendations such as "Customers who bought this item also bought …." The computer scientists translated this approach into "Programmers who changed these functions also changed the following code blocks." In this way, the program they developed, "eROSE", can guide developers safely through essential changes to complex software.
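The "also changed" idea can be illustrated with a small Python sketch. This is not the eROSE implementation. It is a simplified pairwise version that assumes each commit is available as a set of changed entities. The function names (mine_co_changes, suggest), the thresholds, and the sample history are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_co_changes(transactions, min_support=2, min_confidence=0.5):
    """Derive simple rules "a changed => b also changed" from commits.

    transactions: iterable of sets, each holding the entities changed
    together in one commit (an assumed input format).
    """
    single = Counter()  # number of commits that touched each entity
    pair = Counter()    # number of commits that touched both a and b
    for changed in transactions:
        items = set(changed)
        single.update(items)
        pair.update(permutations(sorted(items), 2))

    rules = {}
    for (a, b), count in pair.items():
        # Confidence: share of commits changing a that also changed b.
        confidence = count / single[a]
        if count >= min_support and confidence >= min_confidence:
            rules.setdefault(a, []).append((b, count, confidence))

    # Strongest suggestions first; names break ties deterministically.
    for a in rules:
        rules[a].sort(key=lambda r: (-r[2], -r[1], r[0]))
    return rules


def suggest(rules, changed_entity):
    """Return entities that were often changed together with the given one."""
    return [b for b, _, _ in rules.get(changed_entity, [])]


# Hypothetical version history: one set of changed entities per commit.
history = [
    {"parse()", "lex()"},
    {"parse()", "lex()", "README"},
    {"parse()", "render()"},
    {"render()", "style.css"},
]

rules = mine_co_changes(history)
print(suggest(rules, "parse()"))
```

In this made-up history, a developer who edits parse() would be pointed to lex(), because the two changed together in two of the three commits that touched parse(). The support and confidence thresholds control how much evidence is needed before a suggestion is shown.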
Their 2004 paper immediately attracted attention. For the first time, the change history of a program had been used to automatically suggest further places to review. The work prompted further research into automated version history analysis, a field that currently engages around 150 researchers worldwide. By combining version histories with bug databases, the Saarbrücken computer scientists were able to predict likely sources of errors in Microsoft’s Windows Vista operating system. At the time, they traced these issues to inadequate team structures. Today, Microsoft maintains its own research department, whose staff systematically review bug and version histories and derive recommendations from these archives.
The Saarbrücken computer scientists also worked successfully with software companies such as IBM, Google and SAP. "Using data mining, we were not only able to predict errors, but also gained insights into software development from a new perspective," says Andreas Zeller. In retrospect, it is no surprise that the analysis of version archives has become an independent research field within software engineering, Zeller says. This was long before "Big Data" became a catchphrase on the Internet.
In their most recent project, the researchers again apply the data mining principle of automatically extracting information from huge amounts of data. With their newest software, Chabada, they examined 22,521 mini programs (apps) available in the Google Play store. Using this software, they were able to identify 81 percent of the existing spy apps without knowing their behavioral patterns in advance. Even Google took an interest in this approach: Úlfar Erlingsson, head of Google Security Research, set up a meeting with the researchers shortly after the paper was published and invited them to visit the Google center last fall to install their automated suggestion program.
The research group around Professor Zeller is establishing itself not only in the scientific world but also in industry. In 2013, Zeller co-founded the software company Testfabrik. With their Webmate software, his former PhD students have developed an automated testing service for complex Web 2.0 applications. The founders estimate the market potential for such services in Germany at around 120 million euros a year.