When a Patch Goes Bad: Exploring the Properties of Vulnerability-Contributing Commits

Abstract—Security is a harsh reality for software teams today. Developers must engineer secure software by preventing vulnerabilities, which are design and coding mistakes that have security consequences. Even in open source projects, vulnerable source code can remain unnoticed for years. In this paper, we traced 68 vulnerabilities in the Apache HTTP server back to the version control commits that contributed the vulnerable code originally. We manually found 124 Vulnerability-Contributing Commits (VCCs), spanning 17 years. In this exploratory study, we analyzed these VCCs quantitatively and qualitatively with the over-arching question: "What could developers have looked for to identify security concerns in this commit?" Specifically, we examined the size of the commit via code churn metrics, the amount developers overwrite each others' code via interactive churn metrics, exposure time between VCC and fix, and dissemination of the VCC to the development community via release notes and voting mechanisms. Our results show that VCCs are large: more than twice as much code churn on average than non-VCCs, even when normalized against lines of code. Furthermore, a commit was twice as likely to be a VCC when the author was a new developer to the source code. The insight from this study can help developers understand how vulnerabilities originate in a system so that security-related mistakes can be prevented or caught in the future.

Index Terms—vulnerability, churn, socio-technical, empirical.

I. INTRODUCTION

Security is a harsh reality for software teams today. Insecure software is not only expensive to maintain, but can cause immeasurable damage to a brand, or worse, to the livelihood of customers, patients, and citizens.

To software developers, the key to secure software lies in preventing vulnerabilities. Software vulnerabilities are special types of "faults that violate an [implicit or explicit] security policy" [1]. If developers want to find and fix vulnerabilities, they must focus beyond making the system work as specified and prevent the system's functionality from being abused. According to security experts [2]–[4], finding vulnerabilities requires expertise in both the specific product and in software security in general.

The field of engineering secure software has a plethora of security practices for finding vulnerabilities, such as threat modeling, penetration testing, code inspections, misuse and abuse cases [5], and automated static analysis [2]–[4]. While these practices have been shown to be effective, they can also be inefficient. Development teams are then faced with the challenge of prioritizing their fortification efforts within the entire development process. Developers might know what is possible, but lack a firm grip on what is probable. As a result, an uninformed development team can easily focus on the wrong areas for fortification.

Fortunately, a historical, longitudinal analysis of how vulnerabilities originated in professional products can inform fortification prioritization. Understanding the specific trends of how vulnerabilities can arise in a software development product can help developers understand where to look and what to look for in their own product. Some of these trends have been quantified in vulnerability prediction studies [6]–[10] using metrics aggregated at the file level, but little has been done to explore the original coding mistakes that contributed the vulnerabilities in the first place. In this study, we have identified and analyzed original coding mistakes as Vulnerability-Contributing Commits (VCCs), or commits in the version control repository that contributed to the introduction of a post-release vulnerability.

A myriad of factors can lead to the introduction and lack of detection of vulnerabilities. A developer may make a single massive change to the system, leaving his peers with an overwhelmingly large review. Furthermore, a developer may make small, incremental changes, but his work might be affecting the work of many other developers. Or, a developer may forget to disseminate her work in the change notes, and so the code may miss out on being reviewed entirely.

The objective of this research is to improve software security by analyzing the size, interactive churn, and community dissemination of VCCs. We conducted an empirical case study of the Apache HTTP Server project (HTTPD). Using a multi-researcher, cross-validating, semi-automated, semi-manual process, we identified the VCCs for each known post-release vulnerability in HTTPD. To explore commit size, we analyzed three code churn metrics. Interactive churn is a suite of five recently-developed [6] socio-technical variants of code churn metrics that measure the degree to which developers' changes overwrite each others' code at the line level. To explore community dissemination, we analyzed the
commit where the system started exhibiting some behavior. Of particular interest to us was how git bisect takes into account divergent branches to make sure no commit is missed. With git bisect, the researcher defines the last known "good" version of the system, then marks a known commit when the system was vulnerable. Then, the researcher provides a program that automatically identifies if the system has a given vulnerability, and git bisect will tell the researcher which commit contributed to the introduction of the vulnerability. We note that we do not claim that our methodology is novel; rather, we are providing this description for transparency and reproducibility.

Related literature [13], [23]–[25] uses terms such as "injecting", "fix-inducing", or "fault introducing" commits to describe what we call VCCs. We find those words to be misleading, as the original commits identified by this method did not cause a developer to immediately fix them, as words like "induce" imply; rather, they contributed to the problem. Furthermore, we find that the word "contributing" is more apt for describing these commits because we found that, upon closer scrutiny, multiple commits often contributed to the introduction of a given vulnerability. Thus, one VCC may not always "introduce" a vulnerability per se, but may still be a developer mistake that played a part in the security-related mistake. Semantics aside, VCCs are a subset of what other researchers have dubbed "fix-inducing" commits.

In this study, three researchers in all conducted this portion of the project, including the first author. Two researchers made the original identifications, and the fourth author was assigned to randomly and independently re-create the bisect to check for correctness. The first author inspected all of these identifications as well. This process of identifying VCCs required hundreds of man-hours over six months to collect and vet this data set. Our VCC identification process can be summarized in these steps:

1. Identify the fix commit(s) of the vulnerability;
2. From the fix, write an ad hoc detection script to identify the coding mistake for the given vulnerability automatically;
3. Use git bisect to binary search the commit history for the VCC;
4. Inspect the potential VCC, revising the detection script and re-running bisect as needed.

1. Identify the fix commit(s) of the vulnerability.

A fix commit is a specific change in the system's version control repository where developers altered the code to fix a vulnerability. For this step, we conduct a manual investigation into each vulnerability to determine the fix commit. Sometimes the development team kept records of which commit fixed a vulnerability; sometimes we needed to search using information in NIST's National Vulnerability Database (NVD, http://nvd.nist.gov/), relevant project-specific notes (e.g. the CHANGES or STATUS file), commit messages in Git, and other vulnerability disclosure information.

When a given vulnerability had multiple fix commits, we considered those commits to be one complete fix. If a vulnerability was incorrectly fixed (i.e. a regression), and the development team discovered this regression at a later date, we deferred to the development team as to whether or not the regression was considered a new vulnerability (in HTTPD, regressions typically were considered new vulnerabilities when significant time had passed).

In our case study, we were able to trace every reported vulnerability in HTTPD back to its original fix commits in the system.

2. From the fix, write an ad hoc detection script to identify the coding mistake for the given vulnerability automatically.

For this step, we examined the fix, description, and surrounding information for the given vulnerability and identified the specific coding mistakes that led to the vulnerability. Thus, the code prior to the fix was the vulnerable code that our script would detect statically via a string search based on our understanding of the vulnerability as a whole.

For example, if the vulnerability was Cross-Site Scripting, then the coding mistake the developer made was to output data that was not sanitized of HTML characters. Specifically, the developer outputted user input and forgot to make an API call to sanitize the HTML characters. To detect if this vulnerability existed, the string search would look for that instance where a developer outputted user input without sanitization. Context would be added to the detection script to ensure that the search does not produce false positives (although a false positive would be clearly apparent to the researcher anyway).

When the vulnerability fix involved entirely new code, we examined the context to determine what error of omission the developer made. For example, if the vulnerability involved forgetting to check for a null pointer, then our script would detect when the surrounding lines would pass from one line to the next without checking for a null pointer.

Some vulnerabilities might have no context with which we can bisect. For example, the fix for a vulnerability might include declaring a new function in a utility library. Developers can place new functions at a variety of locations in a given file when order does not matter. The surrounding context of that new function has nothing to do with the code being vulnerable. Thus, in that situation, the vulnerable file cannot be bisected, and that data point was removed from the study.

Some vulnerabilities might have multiple regions in a given file where a fix was required. In that case, we treated each of those regions with a separate detection script to maximize the number of potential VCCs we could find.

As described in Section V, we found that only 12 of 134 fix commits for files could not be bisected for this reason. In each of those situations, other VCCs were identifiable for the same vulnerability. Thus, no vulnerabilities were missed by the analysis, only a few files.

This step in particular requires human understanding of the security concerns of the case study system and the specific methods of mitigation that the developers undertook to fix the vulnerability. This step cannot be automated, and requires time
and expertise to hone. As discussed above, to mitigate subjectivity we used three researchers who cross-checked each other's work and helped each other understand the specific coding mistakes the developers made.

3. Use git bisect to binary search the commit history.

Once we are satisfied that our script detects the vulnerable chunk of code, we use the command git bisect run to determine the vulnerable file. Git bisect will conduct a binary search of the commits and run our detection script to determine the commit where the code base was initially vulnerable. The output of this step is a commit.

4. Inspect the resulting commit, revising the detection script and re-running as needed.

Upon running the bisect command, the researcher inspects the resulting commit to see if it was, in fact, contributing to the introduction of a vulnerability. For repeatability, the researchers would correct the script and re-run the bisect until the resulting commit was clearly the one that contributed to the introduction of the vulnerability. For example, when a file had been renamed, the detection script would need to check the old file. Or, if the context of the vulnerability was refactored in a way that was irrelevant to contributing to the vulnerability (e.g. a developer renamed a method, changing comments), the string search would be updated to handle the changed context.

V. CASE STUDY: APACHE HTTP SERVER

The Apache HTTP web server (HTTPD) has been the most commonly used web server in the world since 1996, and as of December 2012 is the server for 63.7% of active websites on the World Wide Web [26]. Since 2002, the HTTPD team has been publicly documenting their vulnerabilities. HTTPD is primarily written in C, and provides a range of functionality via various modules and networking protocols. In this study, we were able to trace all 83 documented vulnerabilities back to their original source code fix. We found two vulnerabilities in the NVD that were never acknowledged by HTTPD and never fixed, so they were removed from this study.

Some of the vulnerabilities reported by the HTTPD team were actually vulnerabilities in third-party libraries. As a result, the HTTPD team did not release a fix to their own code base; they simply used a patched version of the library. We removed 15 such vulnerabilities from this study since no VCC can exist for them. One notable external project that was extensively used by HTTPD was the Apache Portable Runtime (APR), which we considered to be a dependency. Thus, this study covers 68 vulnerabilities from HTTPD.

To keep comparisons of code churn the same across languages, we studied only source code in the C language. Two vulnerabilities involved configuration of environment shell scripts, and we removed those files from the code churn study. No other languages were involved in the 68 vulnerabilities.

A single vulnerability can, and often did, involve multiple VCCs due to multiple files and/or multiple developer mistakes. Fixing a vulnerability can involve multiple files, so we bisected each of those files to find VCCs for each of those. Furthermore, multiple regions of a single file may be modified in a given fix, so we bisected each region of a fix to identify VCCs there. Finally, a given VCC can also affect more files than the vulnerable file in question. Thus, we define a VCC as a modification to a given file, and consider the not-known-to-be-vulnerable files of the offending commit to be not a VCC.

In HTTPD, we had 25,847 commit-file data points. Of those data points, 124 were VCCs. Furthermore, a total of 12 files from 7 vulnerabilities had no possible VCC and were removed from the study. These situations were when the vulnerability fix involved defining a new constant or function in the system (as described in Section IV). For those 7 vulnerabilities, other VCCs were found for the other files in that vulnerability, so all 68 vulnerabilities remained covered by our analysis. Table I depicts these metrics.

TABLE I. SUMMARY STATISTICS OF HTTPD

Vulnerabilities Traced                   83
Non-3rd-Party Vulnerabilities            68
Commits                                  24,061
Commit-File data points                  25,847
VCCs found                               124
Vulnerability data points with no VCC    12
Vulnerabilities with no VCCs             0
Number of Authors of VCCs                31

VI. SIZE OF VULNERABILITY-CONTRIBUTING COMMITS

One of the most common methods of measuring version control commits is with code churn metrics [9], [14], [15], [27]. In recent history, software researchers have discovered evidence of a "churn" effect in that frequently-changing code is statistically more likely to have faults and vulnerabilities. The motivation for code churn is intuitively appealing: the more developers change code, the more problems they are likely to introduce. But is code churn truly to blame for introducing faults and vulnerabilities? If the motivation for studying code churn is that high amounts of change in the code base can lead to developers making mistakes, then VCCs ought to have higher code churn than non-VCCs, on average. Thus, we examine the research question:

Q1. Churn. Are VCCs larger than non-VCCs?

To measure commit size, we use three metrics: Code Churn, Relative Churn, and 30-Day Churn. The Code Churn metric is typically computed for a given commit of a file. The version control system computes a diff, which is a matching of which lines of code were changed in the given source code file, denoted by lines deleted and lines added. More changes to a file means more lines added or deleted, and therefore more Code Churn.

One concern from Nagappan et al. [15] regarding the raw Code Churn metric is its relation to the overall size of the file. Code churn of 100 lines has a much different meaning for a file of 200 lines than for one of 20,000 lines. Thus, we provide an additional metric, called Relative Churn, where we normalize the code churn of the file by the number of lines of code (LOC) in the file after the commit. For example, a file with 100 lines of churn that ended with 200 LOC would have a Relative Churn of 50%.
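The three commit-size metrics can be sketched as plain functions. The following is a minimal illustration, not the authors' actual tooling; in practice the added/deleted line counts come from a whitespace-ignoring diff (e.g. `git log -p -w`), and here they are supplied as plain integers. The final line reuses the paper's own example of 100 churned lines in a 200-LOC file yielding 50% Relative Churn.

```python
# Sketch of the three commit-size metrics described above.
# lines_added / lines_deleted would come from a whitespace-ignoring
# diff of one commit to one file; here they are plain integers.

def code_churn(lines_added: int, lines_deleted: int) -> int:
    """Code Churn: lines inserted plus lines deleted in a commit diff."""
    return lines_added + lines_deleted

def relative_churn(churn: int, loc_after: int) -> float:
    """Relative Churn: Code Churn normalized by the file's LOC after
    the commit. Can exceed 1.0 when a file is fully rewritten."""
    return churn / loc_after

def thirty_day_churn(prior_churns: list[int]) -> int:
    """30-Day Churn: sum of Code Churn for the same file over the
    commits in the 30 days prior to the commit under study."""
    return sum(prior_churns)

# The paper's example: 100 churned lines in a file of 200 LOC -> 50%.
print(relative_churn(code_churn(60, 40), 200))  # 0.5
```

Note that `relative_churn` uses the LOC *after* the commit, matching the definition above, so a commit that rewrites a file top to bottom can report more than 100%.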
We note that if a file was fully rewritten in a single commit, it can have a code churn exceeding 100%.

A disadvantage of those metrics is that they do not take into account what has been happening to the system recently. For example, a commit with 10 lines of Code Churn may be considered a low risk, but if that commit is taken together with a burst of five other 10-line commits in the last month, the probability of introducing vulnerabilities may increase. Thus, one commit may not explain the entire story of a given source code file in its temporal context. We collected a third churn metric: 30-Day Churn. We decided upon 30 days as our threshold after an analysis of the commit regularity of HTTPD developers. The 30-Day Churn metric only covers the commits immediately prior in the last 30 days. Thus, 30-Day Churn does not double-count commits from its single-commit counterpart. Also, throughout our data collection, we set all of our diff utilities to ignore whitespace changes.

In summary, our metric definitions are:

• Code Churn is the number of lines inserted plus the number of lines deleted for a commit diff, ignoring whitespace;
• Relative Churn is the Code Churn divided by the total lines of code for the file after the commit;
• 30-Day Churn is the sum of Code Churn for a given source code file in the 30 days prior to the commit.

As an example of computing code churn, Fig. 1 shows an abbreviated example diff for a single commit (git hash 08c38d0831) in HTTPD. The commit involved changing API calls, depicted by three insertions and three deletions, so the Code Churn in this example is six. The file had 560 lines after this commit, so the Relative Churn was 1.1%.

$ git log -1 -p -U0 08c38d0831
commit 08c38d0831c46ed5b62e2f83e42a4c84e111d553
Author: Jeff Trawick <trawick@apache.org>
Date: Tue Aug 7 14:49:44 2012 +0000
    Mutex directive: finish support of DefaultRuntimeDir
diff --git a/server/util_mutex.c b/server/util_mutex.c
--- a/server/util_mutex.c
+++ b/server/util_mutex.c
@@ -120 +120 @@ AP_DECLARE(apr_status_t) ap_parse_mutex[…]
- *mutexfile = ap_server_root_relative(pool, file);
+ *mutexfile = ap_runtime_dir_relative(pool, file);
@@ -307 +307 @@ static const char *get_mutex_filename(a[…]
- return ap_server_root_relative(p,
+ return ap_runtime_dir_relative(p,
@@ -555 +555 @@ AP_CORE_DECLARE(void) ap_dump_mutexes(a[…]
- dir = ap_server_root_relative(p, mxcfg->dir);
+ dir = ap_runtime_dir_relative(p, mxcfg->dir);

Fig. 1. Abbreviated commit and diff in HTTPD from Git

$ git blame 08c38d0831^ -- server/util_mutex.c | \
  grep -e " 120)" -e " 307)" -e " 555)"
55fcb2ed (Jim Jagielski   120) *mutexfile = ap_server[…]
c391b9d1 (Jeff Trawick    307) return ap_server_root_[…]
ff444d9c (Stefan Fritsch  555) dir = ap_server_root_r[…]

Fig. 2. Abbreviated blame output for computing PIC and NAA metrics

To evaluate how the churn metrics are related to security vulnerabilities, we examine the difference between VCCs and the non-VCCs in terms of each metric. As suggested in other metrics validation studies [28]–[30] for not having a normality assumption, we use the non-parametric Mann-Whitney-Wilcoxon (MWW) test. We compare the means, and the p-value against 0.05, for VCCs and non-VCCs.

We present the results of our association analysis in Table II. Based on these results, we find that code churn metrics are empirically associated with VCCs: VCCs tend to have higher Code Churn, Relative Churn, and 30-Day Churn. Thus, we can conclude that bigger commits and bursts of big commits have historically been associated with the commits that have been found to contribute to the introduction of vulnerabilities in HTTPD.

TABLE II. ASSOCIATION RESULTS: MEANS AND MWW TEST

Metric          Mean (VCC)   Mean (non-VCC)   MWW p-value
Code Churn      608.5        42.2             p<0.00001
Relative Churn  55.7%        23.1%            p<0.00001
30-Day Churn    1012.3       266.7            p<0.00001

VII. SOCIO-TECHNICAL CONCERNS

Code churn metrics alone do not take into account a critical factor of any software project: the developers. Specifically, code churn metrics ignore who is making the changes and who is affected by those changes. We examine this additional, socio-technical form of code churn to better gauge how developers are interacting via source code changes. The result, however, is not a measurement of commit size, but of developer interaction via commits. We examine two research questions:

• Q2. Interactive Churn. Are VCCs associated with churn that affects other developers?
• Q3. New Effective Author. Is a commit more likely to be a VCC when the author is a new committer to the code?

We recently introduced a few interactive churn metrics [6]. In that study, we found that interactive churn metrics, when aggregated at the file level, are statistically associated with source code files that had post-release vulnerabilities in the PHP programming language. This is our first study analyzing interactive churn metrics at the commit level, along with the 30-Day versions of interactive churn metrics.

In this section, we first explain our five interactive churn metrics (Section A), then empirically answer our research questions (Section B), providing a brief discussion about how interactive churn metrics can be actionable for software developers.

A. Computing Interactive Churn Metrics

The idea behind interactive churn [6] is to examine who is making the changes and whose code is being changed at the line level in source code. In a single commit, a developer may be revising her own code, or changing her colleague's code. While developers may not have records of explicit code ownership practices, the version control system can provide a listing of who was the last person to modify a given line of
code via a built-in blame tool. Each line that a developer's commit affects was last modified either by him- or herself, or by another developer. This line-level analysis provides a fine-grained record of developers interacting (knowingly or unknowingly) via specific lines of source code.

To compute interactive churn metrics for a given commit and a given file, we do the following:

1. Process the commit diff to identify the author and the lines of code that were affected by the given commit.
2. Run the blame tool on the file to identify the authors of the lines affected.
3. Look up the lines affected by the diff in the blame output, aggregating the different authors affected by the commit.

For example, Fig. 1 shows that the author of the commit was Jeff Trawick and that the three lines affected were at lines 120, 307, and 555 (denoted by @@). To compute interactive churn, we need to know if the three affected lines were last modified by Jeff, or by someone else, so we use the git blame tool. Figure 2 shows the output of the blame command, filtered by the three lines in question. The output shows that two of the three lines affected by Jeff's commit were lines previously modified by two other developers, Jim and Stefan. Thus, we say that the number of interactive churn lines in this commit is two out of a possible three.

We note that Fig. 2 displays a basic grep command for filtering the output for the example. This specific method is too simplistic for actual data collection, as it can lead to false positives. Our scripts for collecting interactive churn metrics handle the blame output more carefully to prevent such false positives.

We also leverage the blame tool to measure the developer activity regarding the most recent authors of the file. When a file undergoes high amounts of change activity, many different developers may be involved. Those developers can be overwriting each others' changes, so prior developers' code may no longer exist in the latest version of the file. Furthermore, for a given commit, a developer may be making changes to a file for the first time (or, for effectively the first time if his code was rewritten since his last commit). To account for whether developers are new to the file or not, we define an Effective Author as one who appears in at least one line of the blame output.

In summary, we define three interactive churn metrics:

• PIC is the percentage of the interactive churn lines to the total number of lines of code affected by the commit. If no lines of code were affected (e.g. only insertions), PIC is undefined due to division by zero;
• NAA is the number of distinct authors besides the commit author whose lines were affected by the given commit. If PIC is undefined, NAA is zero;
• NEA? is a nominal "Yes" or "No" for whether an author is a New Effective Author, i.e. not found by the blame command on the entire file prior to the commit.

Following our example from Fig. 1 and Fig. 2, the PIC of the commit is 66% (two of the three affected lines were last modified by authors other than Jeff), and the NAA is two (for Jim and Stefan). We know from line 307 that Jeff had previously changed the file, so his NEA? is a "No".

Additionally, in the same spirit of 30-Day Churn, we examined 30-Day PIC and 30-Day NAA. We define those metrics as:

• 30-Day PIC is the total percentage of lines that affected other developers across all commits to the given file in the prior 30 days;
• 30-Day NAA is the total number of distinct developers affected by the last 30 days of commits to the given file.

In practice, we recommend that developers use PIC, NAA, and NEA? as supplements to Code Churn and Relative Churn because together they provide a more complete picture of how the code changed in terms of people. Historically, Code Churn has been a prominent predictor of bugs and vulnerabilities [15], [31]. However, one of the disadvantages of Code Churn is how developers can interpret it. One may simplistically view high Code Churn as an indication to just avoid changing code. For developers, however, code must change so it can be improved upon. Thus, we find the Code Churn metric to lack the property of being actionable. Consider the following scenarios, all of which involve high amounts of Code Churn:

• Self Churn: A developer is making major revisions to mostly her own code. Thus, her commits overall would have high Code Churn, but low PIC, low NAA, and she is not a NEA.
• Small Team Churn: A team of 3 developers is revising a large feature together. In aggregate, these commits would have high Code Churn, high PIC, but a low NAA, and the developers would not be NEAs.
• Newbie Rewrite: A new developer is making massive changes to large parts of the system, some of which he is completely unfamiliar with. These commits would have high Code Churn, high PIC, high NAA, and often NEA.

In a software development team, each of those scenarios can be risky or not depending on the context. More importantly, however, all three of those scenarios are considerably different situations, yet they would all yield high Code Churn. With interactive churn metrics, we can separate out these different situations for a better understanding of what is happening in terms of developer activity.

B. Analyzing Interactive Churn Metrics and VCCs

To examine how the activity of developers interacting at the source code line level could potentially be associated with making security-related mistakes, we ask the following question:

Q2. Interactive Churn. Are VCCs associated with churn that affects other developers?

We used the PIC, NAA, 30-Day PIC and 30-Day NAA metrics for this analysis. We also used the same analysis as the churn metrics (MWW test of association). Our results are detailed in Table III.
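The per-commit procedure for PIC, NAA, and NEA? can be sketched as a single function. This is an illustrative simplification, not the authors' collection scripts: it takes the commit author, the blame author of each affected line (from the parent commit, as in Fig. 2), and the set of all authors appearing anywhere in the file's blame output for the Effective Author check. The example below reuses the Fig. 1/Fig. 2 commit, where Jeff's change touched lines last modified by Jim, Jeff, and Stefan.

```python
# Sketch of the interactive churn metrics defined above.
# blame_authors: last author of each line the commit affects,
#   taken from blame on the parent commit.
# file_authors: every author appearing anywhere in that blame
#   output, used for the New Effective Author (NEA?) check.

def interactive_churn(commit_author, blame_authors, file_authors):
    nea = "No" if commit_author in file_authors else "Yes"
    if not blame_authors:
        # Only insertions: PIC is undefined (division by zero), NAA is 0.
        return {"PIC": None, "NAA": 0, "NEA?": nea}
    others = [a for a in blame_authors if a != commit_author]
    return {
        "PIC": len(others) / len(blame_authors),  # share of lines overwriting others
        "NAA": len(set(others)),                  # distinct other authors affected
        "NEA?": nea,
    }

# Fig. 1/Fig. 2 example: Jeff's commit affected three lines last
# modified by Jim (120), Jeff (307), and Stefan (555).
m = interactive_churn(
    "Jeff Trawick",
    ["Jim Jagielski", "Jeff Trawick", "Stefan Fritsch"],
    {"Jim Jagielski", "Jeff Trawick", "Stefan Fritsch"},
)
print(m)  # PIC = 2/3, NAA = 2, NEA? = "No"
```

In real collection, the blame author of each affected line would come from parsing `git blame` output on the parent commit rather than being passed in directly, with the care described above to avoid false positives.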
TABLE III. ASSOCIATION RESULTS: MEANS AND MWW TEST

Metric       Mean (VCC)   Mean (non-VCC)   MWW p-value
PIC          70.8%        66.1%            p=0.54
NAA          1.78         1.01             p<0.05
30-Day PIC   55.6%        65.0%            p<0.01
30-Day NAA   3.45         2.70             p=0.60

Interestingly, the 30-Day PIC is lower on average for VCCs than for non-VCCs. Thus, historically in HTTPD, VCCs happened more often when the prior 30 days of commits had self churn rather than interactive churn. This intriguingly counterintuitive result is consistent with our prior study of the vulnerabilities of the PHP programming language [6].

The NAA metric was statistically significant as well. This result indicates that, historically, commits that affected more developers (specifically, closer to 2 other developers on average rather than 1) were more likely to be VCCs.

Regarding the two statistically insignificant results, PIC and 30-Day NAA, no conclusions can be drawn. Statistically speaking, we do not have enough evidence to claim that PIC and 30-Day NAA are any different for VCCs and non-VCCs. This result does not preclude PIC and 30-Day NAA from being useful to developers, however, in identifying different kinds of churn. Thus, our answer to Q2 is "yes, but with a few exceptions".

Beyond interactive churn metrics, we can also discover if a developer was new to a given source code file using the NEA? metric. This leads to the following research question:

Q3. New Effective Author. Is a commit more likely to be a VCC when the author is a new committer to the source code?

We used the NEA? metric defined in the previous section for this question. We collected the NEA? metric for the entire data set to compare between VCCs and non-VCCs. Since the NEA? metric is nominal (i.e. has an outcome of "Yes" or "No"), we use the Chi-squared contingency table test, as suggested in the literature [28]–[30].

In total, 52 (41.9%) of VCCs were from a New Effective Author. The contingency table is shown in Table IV, and the results were statistically significant (p<0.0001). Thus, the empirical evidence favors that a commit is more likely to be a VCC if it was authored by a NEA. Furthermore, we note that VCCs are extremely rare occurrences to begin with. But, when the author of a commit is a New Effective Author, the probability of that commit being a VCC more than doubles (from 0.3% to 0.8%). Thus, we conclude that a commit is more likely to be a VCC when the author is effectively a new developer to that file.

TABLE IV. CONTINGENCY TABLE FOR NEA? METRIC
(row percentages in parentheses)

            VCC? No          VCC? Yes
NEA? No     19,206 (99.6%)   72 (0.3%)
NEA? Yes    6,517 (99.2%)    52 (0.8%)

VIII. COMMUNITY DISSEMINATION

One of the key components of any open source software project is leveraging the community of developers to review changes. Eric Raymond declared in his famous essay [32] that "many eyes make all bugs shallow". A key part of leveraging the development community, however, is disseminating one's work so that it can be reviewed.

In HTTPD, every commit is automatically sent to a mailing list, and the developers have various venues for discussing their changes to the system. But some commits are explicitly recorded in other fashions, such as the change logs that get released to users. In this section, we investigate community dissemination with the following research questions:

• Q4. Exposure. How long did vulnerabilities remain in the system?
• Q5. Baseline. How often was a VCC part of an original source code import?
• Q6. Known Offender. How many VCCs occurred in files that had already been patched for a different vulnerability?
• Q7. Notable Changes. Were VCCs likely to be noted in the change log or status files?

The typical time between a VCC and its corresponding fix indicates how much time is available for developers to conduct their code reviews and tests.

Q4. Exposure. How long did vulnerabilities remain in the system?

We found that most vulnerabilities remained in the system from VCC to fix for a long time: on average 1,175 days, with a median of 853 days. Figure 3 (see following page) depicts a visualization of all source code files that had vulnerabilities, counting the number of vulnerabilities in the system at each given time. Figure 4 depicts a histogram of the number of days between a VCC and its fix. Only 5% of the VCCs had an exposure of fewer than 30 days, and 26% of the VCCs were in the system for fewer than 365 days. By contrast, 6% of the VCCs remained in the system for over a decade.

In mature systems such as HTTPD, however, code can remain unchanged in the system for a long period of time. But, as developers make commits to each file, they get an opportunity to review and test the code for other problems. To measure this, we computed the number of commits between each VCC and its corresponding fix. Figure 5 depicts a histogram, with outliers 484 and 586 not shown for visual reasons. The median number of commits between VCC and fix was 48, with the average skewed by outliers to 76.9 commits. 100% of VCCs had at least one commit between VCC and fix.

Interestingly, the two outliers where the number of commits between VCC and fix were 484 and 596 were http_protocol.c and mod_rewrite.c, respectively. Both of these files parse un-trusted user data and have each had three vulnerabilities over the years, yet vulnerabilities remained
system? unnoticed for a very long time despite consistent developer
activity. Our analysis of Known Offenders (Q6) examines this phenomenon more deeply.
Since Q1 told us that most VCCs are big commits, and Q4 indicates most vulnerabilities are old, one possibility is that most vulnerabilities pre-existed in the initial writing of the system or of a given feature. When a new feature is written, the initial commit tends to be large, and HTTPD has been a stable and mature product for well over a decade. In terms of community dissemination, developers may be able to find more vulnerabilities by focusing on new source code files. We examine this possibility in Q5.
Q5. Baseline. How often was a VCC part of an original source code import?
To count this, we identified which VCCs were “baseline” commits, that is, commits where the file was new to the system. The git tool identified baseline commits for us, and we manually inspected the results to ensure that the source file was truly new to the system and not renamed or reorganized in the directory structure. We also consulted the change notes and other artifacts to triangulate this data. Additionally, 31 (22.7%) of the VCCs were associated with more than one vulnerability, so we only count those commits once.
In total, only 13.5% of VCCs were baseline commits. If we double-count one VCC for multiple vulnerabilities, then 23.5% of VCCs were baseline commits. While these numbers are not small, they do not account for the majority of commits. In other words, vulnerabilities arise in pre-existing source code more often than in new source code files.
We note here that we do not equate “new features” with “new source code files”. We did not have enough contextual information to classify each VCC as a bug fix or a feature.
Another element of community dissemination is the ability of the developers to react to past vulnerabilities in the system. As Figure 3 showed, the fixes for the 68 vulnerabilities in HTTPD have been spread out over long periods of time. Thus, some source code files can be “Known Offenders”, or files that have been fixed for a vulnerability in the past. If VCCs primarily occur on Known Offender files, then developers can focus their efforts more acutely on the small subset of files that have been affected by a post-release vulnerability.
Q6. Known Offender. How many VCCs occurred in files that had already been patched for a different vulnerability?
To measure Q6, we counted the number of VCCs that occurred after the fix of a different vulnerability on the same file. We found that 33 (26.6%) of VCCs occurred on files that were Known Offenders. These 33 VCCs covered 14 (20.1%) of the vulnerabilities that HTTPD has patched over the years. This result may seem less surprising given the result of Exposure (Q4) that vulnerabilities remain in the system for long periods of time. Thus, while we believe the Known Offender property of VCCs is prevalent enough for developers to pay attention to, it does not explain the majority of VCCs, nor even a quarter of the vulnerabilities in HTTPD.
Finally, the HTTPD project has a wide variety of ways to collaborate on commits. Two ways that we were able to mine were the STATUS file and the CHANGES file in their source tree. The STATUS file is an informal record of what each developer is currently working on that is checked into version control. Developers also use this file for brief discussion or voting on an issue at hand. The CHANGES file is a more formal document used for logging major changes to the system, such as new features or bug fixes that affect users. Thus, if a developer notes her changes in one of these files, she is effectively disseminating the work to the HTTPD community for potential review.
Q7. Notable Changes. Were VCCs likely to be noted in the change log or status files?
To answer this question, we examined each VCC and determined if the developer decided to disseminate his or her change to the community via the CHANGES or STATUS file. We looked at four days on either side of each VCC to catch situations where the developers checked in their votes or change note statements in a separate commit. This investigation was a qualitative one that involved delving into and triangulating developer discussions and notes. To mitigate the subjective nature of this evaluation, we always had at least two researchers make this assessment individually, then resolve any disagreements by consensus. To compare our results against non-VCCs, we also examined a control group by randomly
Fig. 3. Timeline for source code files in HTTPD with vulnerabilities.
Fig. 5. Histogram of commits between VCC and fix.
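The exposure measurement behind Figure 4 reduces to simple date arithmetic over (VCC, fix) pairs mined from version control. A minimal sketch of the exposure-in-days computation, using hypothetical dates rather than the study’s actual HTTPD data:

```python
import statistics
from datetime import date

# Hypothetical (VCC date, fix date) pairs -- illustrative stand-ins,
# not the actual HTTPD data set analyzed in this study.
pairs = [
    (date(2002, 3, 1), date(2004, 8, 15)),
    (date(1998, 6, 10), date(2001, 1, 5)),
    (date(2005, 11, 20), date(2006, 2, 2)),
]

# Exposure: days each vulnerability remained in the system, VCC to fix.
exposures = [(fix - vcc).days for vcc, fix in pairs]

print("mean exposure (days):", round(statistics.mean(exposures)))
print("median exposure (days):", statistics.median(exposures))
```

The per-file commit counts behind Figure 5 could be gathered similarly from the repository history, e.g. by counting the commits that touched a file between a VCC and its fix with `git rev-list --count <vcc>..<fix> -- <file>` (commit IDs hypothetical).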
sampling 150 commits to HTTPD, limiting our sampling to only commits that affected C source code files. Again, we note that 22.7% of our VCCs were associated with more than one vulnerability, so we only count those commits once.
To compare the difference between VCCs and our control group of non-VCCs, we conducted a Chi-squared test of contingency tables. We consider a commit to be notable if we found it to be in the STATUS or CHANGES file.
Our results for CHANGES and STATUS can be found in Table V. Our Chi-squared test of the bottom row of Table V showed that, in fact, VCCs are more frequently noted in STATUS or CHANGES (p<0.05). This result indicates that VCCs are being publicized more often than their counterparts, although not by much. Thus, the issue is not necessarily that VCCs are any less publicized than non-VCCs; rather, developers may be forgetting security concerns when they review changes to HTTPD.
TABLE V. SUMMARY OF COMMUNITY DISSEMINATION RESULTS
                     VCCs         Non-VCC Sample
Noted in STATUS      9 (8.6%)     20 (13.3%)
Noted in CHANGES     46 (43.8%)   41 (27.3%)
STATUS or CHANGES    51 (48.6%)   66 (44.0%)
IX. DISCUSSION
We summarize the results of Q1-Q7 in Table VI. To us, the most telling results are that VCCs tend to be large commits (Q1), yet tend not to be the baseline commits (Q5). The fact that vulnerabilities exist in the code for many years at a time (Q4) and that only around a quarter of the VCCs were in Known Offender files (Q6) may indicate that many more vulnerabilities may still exist in HTTPD to be discovered. The Baseline (Q5) result also indicates that VCCs are likely to occur in the future, since vulnerabilities tend to be introduced as part of evolution, not as the initial import. Furthermore, VCCs historically have been Notable Changes (Q7), yet the development community missed the security concerns when the commits entered the system.
Vulnerabilities also tend to be spread out across the system, as shown by our timeline in Figure 3. In particular, the figure depicts most vulnerable files as having only one or two vulnerabilities exposed at any given time.
Finally, none of the properties in this study covered the majority of VCCs nor vulnerabilities. Having pored over these vulnerabilities one by one, we testify that these vulnerabilities and their VCCs are quite a diverse set when looked at qualitatively. Trends exist, but no single property we know of yet can explain all vulnerabilities, or even the majority of vulnerabilities.
TABLE VI. SUMMARY OF COMMUNITY DISSEMINATION RESULTS
X. LIMITATIONS
The VCC identification process involves a mixture of automation coupled with human judgment about what constitutes a coding mistake that led to a vulnerability. The human judgment can lead to some subjectivity, and potentially variability in the data. To mitigate this, we used three researchers to check each others’ work, debate the differences until agreement, and make corrections as necessary.
Furthermore, our method of identifying VCCs leans more toward sound results than complete results. In other words, we do not know that our VCCs were, in fact, the only VCCs in the system. We are confident that the ones we found are correct (i.e. sound), though. Fortunately, the vastness of the non-VCCs in the data set means that a few false negatives would not skew the results very far.
Also, since our process of identifying VCCs was static, we do not know if a given vulnerability was truly exploitable at the time of the VCC. Most vulnerabilities in systems do not have public exploits, and constructing exploits for vulnerabilities would have made this project infeasible in terms of time and expertise. But, we mitigated this by focusing on the static coding mistakes of the vulnerabilities from the fixes and by taking a holistic view of the vulnerability by way of all the relevant artifacts.
Finally, we do not know that the 68 vulnerabilities in HTTPD were, in fact, all of the vulnerabilities that HTTPD has. New vulnerabilities are being found in HTTPD all the time, so more VCCs may exist that we do not know about. However, this fact is true of nearly every empirical study of bugs or vulnerabilities; we cannot conclude about bugs or vulnerabilities that we do not know about.
XI. SUMMARY
The objective of this research is to improve software security by exploring code churn and other socio-technical properties of VCCs. We adapted a semi-automated methodology for identifying the coding mistakes that led to 68 vulnerabilities in the Apache HTTPD server. We identified 124 VCCs in HTTPD and conducted an exploratory analysis of various properties of these VCCs. We examined seven research questions covering a wide variety of potential properties that contribute to vulnerabilities. We analyzed code churn metrics and interactive churn metrics, and explored questions of community dissemination. Developers can use this insight to
better understand how vulnerabilities arise in a software project.
In the future, we plan to expand this research into more artifacts such as email discussions, more case studies, to the reliability realm, to improve churn metrics, and to increase insight into the meaning of interactive churn metrics in the context of software processes and socio-technical concerns.
REFERENCES
[1] I. V. Krsul, “Software Vulnerability Analysis,” PhD Dissertation, Purdue University, 1998.
[2] J. Allen, S. Barnum, R. Ellison, G. McGraw, and N. Mead, Software Security Engineering, 1st ed. Addison-Wesley Professional, 2008.
[3] G. McGraw, Software Security: Building Security In. Addison-Wesley Professional, 2006.
[4] C. Wysopal, L. Nelson, D. D. Zovi, and E. Dustin, The Art of Software Security Testing: Identifying Software Security Flaws, 1st ed. Addison-Wesley Professional, 2006.
[5] P. Hope, G. McGraw, and A. I. Anton, “Misuse and abuse cases: getting past the positive,” IEEE Security & Privacy, vol. 2, no. 3, pp. 90–92, Jun. 2004.
[6] A. Meneely and O. Williams, “Interactive Churn: Socio-Technical Variants on Code Churn Metrics,” in Int’l Workshop on Software Quality, 2012, pp. 1–10.
[7] A. Meneely and L. Williams, “Strengthening the Empirical Analysis of the Relationship Between Linus’ Law and Software Security,” in Empirical Software Engineering and Measurement (ESEM), Bolzano-Bozen, Italy, 2010, pp. 1–10.
[8] A. Meneely and L. Williams, “Secure Open Source Collaboration: an Empirical Study of Linus’ Law,” in Int’l Conference on Computer and Communications Security (CCS), Chicago, Illinois, USA, 2009, pp. 453–462.
[9] Y. Shin, A. Meneely, L. Williams, and J. Osborne, “Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities,” IEEE Transactions on Software Engineering (TSE), vol. 37, no. 6, pp. 772–787, 2011.
[10] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting vulnerable software components,” in Computer and Communications Security (CCS), New York, NY, USA, 2007, pp. 529–540.
[11] E. L. Trist and K. W. Bamforth, “Some social and psychological consequences of the longwall method of coal-getting,” in Technology, Organizations and Innovation: The Early Debates, p. 79, 2000.
[12] “git-bisect.” [Online]. Available: https://www.kernel.org/pub/software/scm/git/docs/git-bisect.html. [Accessed: 01-Apr-2013].
[13] S. G. Elbaum and J. C. Munson, “Getting a handle on the fault injection process: validation of measurement tools,” in Metrics, 1998, p. 133.
[14] J. C. Munson and S. G. Elbaum, “Code churn: a measure for estimating the impact of code change,” in International Conference on Software Maintenance, 1998, pp. 24–31.
[15] N. Nagappan and T. Ball, “Use of Relative Code Churn Measures to Predict System Defect Density,” in 27th International Conference on Software Engineering (ICSE), St. Louis, MO, USA, 2005, pp. 284–292.
[16] N. Nagappan, B. Murphy, and V. Basili, “The Influence of Organizational Structure on Software Quality: An Empirical Case Study,” in 30th International Conference on Software Engineering (ICSE), Leipzig, Germany, 2008, pp. 521–530.
[17] N. Nagappan, “Toward a Software Testing and Reliability Early Warning Metric Suite,” in Proceedings of the 26th International Conference on Software Engineering, 2004, pp. 60–62.
[18] J. R. Casebolt, J. L. Krein, A. C. MacLean, C. D. Knutson, and D. P. Delorey, “Author entropy vs. file size in the GNOME suite of applications,” in Mining Software Repositories (MSR), 2009, pp. 91–94.
[19] A. Meneely, L. Williams, W. Snipes, and J. Osborne, “Predicting Failures with Developer Networks and Social Network Analysis,” in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Atlanta, Georgia, 2008, pp. 13–23.
[20] A. Meneely and L. Williams, “Socio-Technical Developer Networks: Should We Trust Our Measurements?,” in International Conference on Software Engineering (ICSE), Waikiki, Hawaii, USA, 2011, pp. 281–290.
[21] A. Meneely, M. Corcoran, and L. Williams, “Improving developer activity metrics with issue tracking annotations,” in WETSoM 2010 Workshop on Emerging Trends in Software Metrics, Cape Town, South Africa, 2010, pp. 75–80.
[22] A. Meneely and L. Williams, “On the Use of Issue Tracking Annotations for Improving Developer Activity Metrics,” Advances in Software Engineering, vol. 2010, pp. 1–9, 2010.
[23] C. Williams and J. Spacco, “SZZ revisited: verifying when changes induce fixes,” in Proceedings of the 2008 Workshop on Defects in Large Software Systems, New York, NY, USA, 2008, pp. 32–36.
[24] S. Kim, T. Zimmermann, K. Pan, and E. J. Whitehead, “Automatic Identification of Bug-Introducing Changes,” in 21st IEEE/ACM International Conference on Automated Software Engineering (ASE ’06), 2006, pp. 81–90.
[25] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?,” SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, May 2005.
[26] Netcraft, “December 2012 Web Server Survey.” [Online]. Available: http://news.netcraft.com/archives/2012/12/04/december-2012-web-server-survey.html. [Accessed: 15-Feb-2013].
[27] S. A. Ajila and R. T. Dumitrescu, “Experimental use of code delta, code churn, and rate of change to understand software product line evolution,” Journal of Systems and Software, vol. 80, no. 1, pp. 74–91, 2007.
[28] A. Meneely, B. Smith, and L. Williams, “Validating Software Metrics: A Spectrum of Philosophies,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 21, no. 4, pp. 24–48, Oct. 2012.
[29] N. F. Schneidewind, “Validating Software Metrics: Producing Quality Discriminators,” pp. 225–232, May 1991.
[30] N. F. Schneidewind, “Methodology for Validating Software Metrics,” IEEE Transactions on Software Engineering (TSE), vol. 18, no. 5, pp. 410–422, 1992.
[31] Y. Shin, A. Meneely, L. Williams, and J. Osborne, “Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities,” IEEE Transactions on Software Engineering (TSE), vol. 37, no. 6, pp. 772–787, 2011.
[32] E. S. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, 1st ed. O’Reilly Media, 2010.