How Dataflow Diagrams Impact Software Security Analysis: An Empirical Experiment
Abstract—Models of software systems are used throughout the software development lifecycle. Dataflow diagrams (DFDs), in particular, are well-established resources for security analysis. [...]

[...] distributed nature of this architectural style poses additional challenges in terms of cognitive load to security analysts. Systems following the microservice architecture split their [...]
Fig. 3. Participants’ (a) programming skills, (b) experience in reading Java code, and (c) work experience as developers. All self-reported.
with that of the other lab sessions in this course. No other incentives were pledged or given.

2) Preparation: To prepare the participants, a 90-minute lecture before the lab sessions was dedicated to introducing them to the topic (available in this paper's replication package [22]). The lecture covered key concepts relevant to the experiment. The primary focus was on the origin of software vulnerabilities and methods for detecting them. The lecture also encompassed topics such as DFDs, microservice architectures, and security considerations in microservice applications. Following this lecture, the students were expected to possess the required knowledge to undertake the experiment. Their attendance at the lecture was recorded and was a prerequisite for participating in the experiment.

3) Informed consent and ethical assessment: All participants read and signed an informed consent form before the experiment, informing them that they are the subjects of an empirical experiment, that they participate voluntarily, that they do not have to expect any negative consequences whatsoever if they do not participate, and that they can retract their consent at any time. To ensure the experiment's ethical innocuity, it was assessed by the German Association for Experimental Economic Research e.V. before execution. A description of the planned experiment and its design was approved under grant no. 2pxo1bap. The certificate can be accessed via https://gfew.de/ethik/2pxo1bap.

E. Measurement

To evaluate the participants' performance, three metrics were introduced. The analysis correctness represents the ability to provide correct answers to the tasks. The correctness of evidence measures whether the evidence that the participants provided as support for their answers points to a code snippet that justifies their answer. Both are numerical scores derived from the participants' responses. Additionally, we measured the time spent on solving the tasks. The three metrics (analysis correctness, correctness of evidence, and time) were calculated for each participant in both conditions separately.

1) Analysis Correctness: We quantified the given answers concerning the analysis correctness by manually checking the participants' responses. To remove subjectivity, we created a reference solution that was used to check the answers. It was created prior to the execution of the experiment. The DFDs and source code of the applications were analysed to create the reference answers, which were afterwards confirmed by consulting technical documentation of the code libraries used in the applications, information provided by the developers of the applications, and other typical online resources. This process was performed by the first author and validated afterwards by two additional authors. After the experiment, the participants' responses were mapped via the reference solution to a table indicating correct and incorrect answers. Each response was examined manually, compared against the reference solution, and correct answers were marked in the table. We further reviewed answers that did not match the reference solution to check whether they were nevertheless correct. For this, various typical online resources were consulted to verify whether the specific answer applies to the task. Each correctly given answer is awarded a score of 1. There is a peculiarity for some tasks: task 3 asks for a list of connections between services, and tasks 4 and 5 ask whether a property applies to each item on a given list. Consequently, these tasks each required multiple distinct responses. All responses were checked individually. Then, to allow a more detailed and nuanced evaluation, we converted the results to scores. A score of 0 was assigned for no correct responses, a score of 1 was given for partially correct responses (meaning that some but not all responses of a task were correct), and a score of 2 was awarded when all responses were correct. With three tasks giving a maximum of one point and three tasks giving a maximum of two points, the overall highest achievable score in analysis correctness is 9.
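For illustration, the scoring scheme can be expressed as a short script. This is a minimal sketch of the scheme described above, not tooling from the replication package; the task identifiers and the encoding of marked answers as booleans are our assumptions.

# Minimal sketch of the analysis-correctness scoring described above.
# Single-response tasks (1, 2, and 6) yield 0 or 1 point; multi-response
# tasks (3, 4, and 5) yield 0 (none correct), 1 (some), or 2 (all correct).

def task_score(correct_flags):
    """correct_flags: one boolean per distinct response of a task."""
    n_correct = sum(correct_flags)
    if len(correct_flags) == 1:          # single-response task: 0 or 1
        return int(correct_flags[0])
    if n_correct == 0:                   # multi-response task: no correct response
        return 0
    return 2 if n_correct == len(correct_flags) else 1

def analysis_correctness(responses):
    """responses: dict mapping a task id to its list of per-response booleans."""
    return sum(task_score(flags) for flags in responses.values())

# Hypothetical participant: 3 * 1 + 3 * 2 = 9 is the maximum achievable score.
participant = {
    "task1": [True], "task2": [False], "task6": [True],
    "task3": [True, True, False],  # partially correct -> 1
    "task4": [False, False],       # no correct response -> 0
    "task5": [True, True],         # all correct -> 2
}
print(analysis_correctness(participant))  # 5 out of 9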
2) Correctness of Evidence: The traceability information that is contained in the used DFDs constitutes a reference solution for quantifying the correctness of the evidence given by the participants. Each piece of evidence was checked manually for matches to this reference solution. Here, we employed some tolerance in accepting evidence as correct. For example, when participants referred to a block of code slightly larger than the lines of code needed to prove an answer, we still accepted this as correct (e.g., referring to a method consisting of some lines of code instead of referring to a single line of code in that method). We carried out a further validation check, similar to the quantification of the analysis correctness. Each provided piece of evidence that differed from the reference solution was checked manually to determine whether it supported the given answer or not. The first author carried out the above steps. As for the analysis correctness, each correct piece of evidence is awarded a score of 1. Again, for tasks 3, 4, and 5, the multiple distinct responses were converted into a score of 0, 1, or 2 for no correct, partially correct, and all correct responses, respectively.
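To make the tolerance rule concrete, the following sketch formalizes it as a containment check. This is our illustration only; the actual matching in the experiment was performed manually, and the (file, line-range) evidence format is an assumption.

# Illustrative check for the tolerance rule described above: reported evidence
# is accepted if its code range covers the reference lines, even when the
# participant cited a slightly larger block (e.g., the whole enclosing method).

def evidence_matches(reported, reference):
    """reported, reference: (file, first_line, last_line) triples."""
    rep_file, rep_start, rep_end = reported
    ref_file, ref_start, ref_end = reference
    # A stricter variant could also cap rep_end - rep_start so that only
    # "slightly larger" blocks are accepted.
    return rep_file == ref_file and rep_start <= ref_start and ref_end <= rep_end

# A participant cites the whole method (lines 40-55) instead of the single
# decisive line 47 from the reference solution -> still counted as correct.
print(evidence_matches(("AuthService.java", 40, 55), ("AuthService.java", 47, 47)))  # True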
3) Time: To measure the time spent on solving the tasks, the participants were asked to record the current time when starting and when finishing their work on the tasks. We calculated the time metric based on these answers (i.e., the period of time between the start and finish of solving the tasks).
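As an illustration, and assuming the self-reported times are wall-clock strings such as "14:05" (the data format is an assumption), the metric amounts to a simple difference:

# Sketch of the time metric: minutes between the self-reported start and
# finish times. Sessions lasted 90 minutes, so no midnight wrap-around occurs.
from datetime import datetime

def minutes_spent(start, finish):
    fmt = "%H:%M"
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

print(minutes_spent("14:05", "14:39"))  # 34.0 minutes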
4) Reported Usefulness: The DFDs' usefulness as reported by the participants was assessed via the open question about the participants' experience in using the DFDs that was posed after the technical tasks. We qualitatively analysed all answers' general intents (positive/negative feedback) and identified recurring topics manually. This analysis was performed by the first author and verified by two further authors.

F. Statistical Tests

Throughout the analysis, the difference in scores between two groups was checked for statistical significance with a Wilcoxon-Mann-Whitney test. Beforehand, a Shapiro-Wilk test was used to verify that the data does not follow a normal distribution; hence, no parametric tests could be used. The assumptions for the Wilcoxon-Mann-Whitney test (the samples are independent, random, and continuous, and the sample size is sufficiently large) are met in our experiment.
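The paper does not state the statistical tooling that was used; as an illustration, the procedure described above maps directly onto SciPy, shown here with hypothetical score samples:

# Sketch of the statistical procedure: a Shapiro-Wilk test to check for
# normality, then the non-parametric Wilcoxon-Mann-Whitney test comparing
# the two conditions' scores. All numbers below are hypothetical.
from scipy.stats import shapiro, mannwhitneyu

control = [4, 5, 3, 6, 5, 4, 7, 5, 4, 6, 5, 3]
model_supported = [7, 6, 8, 7, 5, 9, 6, 8, 7, 6, 8, 7]

# If the data followed a normal distribution, parametric tests (e.g., a t-test)
# would be an option; otherwise the non-parametric test is the right fallback.
for name, sample in [("control", control), ("model-supported", model_supported)]:
    stat, p = shapiro(sample)
    print(f"Shapiro-Wilk ({name}): p = {p:.3f}")

u_stat, p_value = mannwhitneyu(control, model_supported, alternative="two-sided")
print(f"Wilcoxon-Mann-Whitney: U = {u_stat}, p = {p_value:.4f}")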
IV. RESULTS

A. Analysis Correctness

Fig. 4. Comparison across the two treatments of the participants' average score in analysis correctness. Per task and overall.

Figure 4 presents the average score in analysis correctness that the participants achieved in the two conditions. The figure shows that the participants performed better in the model-supported condition, both overall and in every individual task. Over all tasks, the average score is 6.75 out of a possible 9 in the model-supported condition compared to 4.79 in the control condition, a 41% higher average. The applied statistical test (cf. Section III-F) indicates a statistically significant difference between the two conditions' average scores in analysis correctness overall (p = 0.0025). These results provide the following answer to RQ1:

- RQ 1: In the context of our experiment, providing a security-annotated DFD of the system to be analysed improved participants' analysis correctness in solving security analysis tasks. We observed a statistically significant (p = 0.0025) improvement of 41% on average.

For some individual tasks, the difference in the average scores is only marginal (task 2: 0.42 vs. 0.46, a 10% improvement in the model-supported condition; task 6: 0.75 vs. 0.79, a 5.6% improvement). In Section V, we discuss whether the nature of the tasks might be an indication of the extent to which a DFD improves the score in analysis correctness. However, even though the improvement is not statistically significant for all individual tasks (statistical significance only for task 3, with a p-value of 0.0003), Figure 4 clearly shows a trend of improved performance in the model-supported condition.

B. Correctness of Evidence

Fig. 5. Comparison across the two treatments of the participants' average score in correctness of evidence. Per task and overall.

Figure 5 presents the average score in correctness of evidence achieved by the participants. There are only small differences in the average scores for the model-supported and control condition. With an average of 3.08 out of a possible 9 in the control condition compared to an average of 2.71 in the model-supported condition (-12%), the participants performed better in the control condition, albeit without statistical significance (p = 0.52). Task 4 has the lowest average correctness of evidence of the individual tasks (average of 0.29 out of 2 in the control condition and 0.13 in the model-supported condition),
while task 1 has the highest average (0.58 out of 1 in the control condition and 0.63 in the model-supported condition).

Fig. 6. Reported usage of provided artefacts per task (in the model-supported condition, where all artefacts were available).

Fig. 7. Average scores in analysis correctness of those participants that reported using an artefact in more than 50% of the tasks (Using Artefact) and those that reported using it less (Not Using Artefact).
C. Use of Provided Artefacts

After each task, the participants were asked to name all artefacts they used for solving it. The reported usages, presented in Figure 6, show that the participants did not solely rely on the provided DFDs.
D. Influence of Use of Artefacts on Scores

We investigated whether the participants' usage of the provided artefacts had an influence on their performance. Although they were provided more artefacts in the model-supported condition, this does not necessarily mean that they used them all. The answers to the tasks could be found with more than one of the artefacts. Thus, the influence of single artefacts on the performance is not necessarily reflected in the comparison of outcomes between the two conditions. For example, participants in the model-supported condition might not have used the provided DFD to answer the tasks. Consequently, we compared the average scores in analysis correctness and in the correctness of evidence between two groups of participants for each artefact. To the group Using Artefact we assigned all participants that reported using the artefact in more than 50% of the tasks (4 or more). The group Not Using Artefact contains those participants who reported using it less (3 or fewer). We considered only the outcomes from the participants in the model-supported condition, since only there did they have access to all artefacts. The grouping and analysis were done separately for each artefact; thus, the cardinality and members of the groups differ between artefacts.
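A minimal sketch of this grouping, with hypothetical usage counts and scores (the data layout is an assumption):

# Participants who reported using an artefact in 4 or more of the 6 tasks
# (i.e., more than 50%) form the "Using Artefact" group; the rest form
# "Not Using Artefact". All data below is hypothetical.
from statistics import mean

def split_by_usage(participants, artefact, threshold=4):
    using, not_using = [], []
    for p in participants:
        (using if p["usage"][artefact] >= threshold else not_using).append(p["score"])
    return using, not_using

participants = [
    {"score": 8, "usage": {"dfd": 6, "source_code": 2}},
    {"score": 5, "usage": {"dfd": 3, "source_code": 6}},
    {"score": 7, "usage": {"dfd": 5, "source_code": 1}},
]

using, not_using = split_by_usage(participants, "dfd")
print(f"Using DFD: mean = {mean(using):.2f}; Not using: mean = {mean(not_using):.2f}")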
Fig. 8. Average scores in the correctness of evidence of those participants that reported using an artefact in more than 50% of the tasks (Using Artefact) and those that reported using it less (Not Using Artefact).

Figure 7 presents the average scores in analysis correctness of the two groups per application artefact. The figure shows that the Using Artefact group performed better compared to the Not Using Artefact group for the artefacts DFD (+23% in score) and traceability information (+12%), while they performed worse for the source code (-17%). Figure 8 presents the results of this analysis for the correctness of evidence. For the artefact source code, the Using Artefact group achieved a 15% higher score in correctness of evidence than the Not Using Artefact group. For the use of DFD, they performed worse than the Not Using Artefact group (-20%). The highest difference, however, is seen in the traceability information: the Using Artefact group achieved a 315% higher average score in correctness of evidence than the Not Using Artefact group. We answer RQ2 based on these results since they distinguish between the use of the DFD and traceability information. In the results above, this distinction could not be made because the traceability information is an integral part of the DFDs and its isolated impact on the analysis could not be measured.

- RQ 2: Using traceability information significantly improved the correctness of evidence given for answers. On average, participants that used this artefact in more than half of the tasks achieved a 315% higher correctness of evidence compared to participants that used it less than that.

E. Time

All participants were able to complete the tasks in the allotted 90 minutes. Their average time to complete all tasks was 34 minutes in the control condition and 35 minutes in the model-supported condition. No notable difference was observed. To examine a possible correlation between performance and time spent to finish the tasks, we also created a scatter plot visualizing their scores against their time. No correlation between scores and time could be visually identified.
F. Perceived Usefulness and Usability of DFDs

The answers given to the open question at the end of the analysis sessions provide insights into the participants' perceived usefulness of the DFDs. The question asked about positive or negative observations during the experiment. For participants in the model-supported condition, it explicitly mentioned the usefulness of the DFDs and traceability information.

Out of 23 answers given by participants in the model-supported condition, two were negative, stating that thorough documentation would be preferred and that the DFD was "a little bit hard to understand at first". Three answers listed both positive and negative experiences, where the negative points were two mentions that finding implementation details was hard (both participants reported using the traceability only in one task) and one that the participant lacked domain knowledge. A further 14 answers were predominantly positive.

"Dataflow Diagrams were incredibly helpful, and all questions were answered almost completely from it."

Of the 23 answers, 9 mentioned specific beneficial scenarios for the use of DFDs. The ability to provide an overview of the system was mentioned 8 times, the benefit of referring to the important places in source code and using the models as an interface to the code was mentioned 3 times, and the reduction of the required domain knowledge was mentioned once.

Mild critique about the accessibility of the DFDs or traceability information was raised in 4 responses, for example:

"[...] the transfer from the DFD to the traceability information could be made easier by clickable links in the DFD [...]"

In summary, the statements made by participants in the model-supported condition include descriptions of the general usefulness of the DFDs, of benefits in finding implementation details via the model items and traceability information, and of their usefulness for architectural details and providing an overview. The positive feedback outweighed the few negative comments. Most participants reported the DFDs to be of help in the analysis and to be accessible to use.

Of the answers given after the control condition, only one was positive, stating that the textual description was helpful. Four others referred to the DFDs (these answers were given in the second week; the participants had thus already performed the session with the DFD), stating that, in comparison, the lack of a DFD was an obstacle during the analysis. Specifically, they raised concerns about the correctness of their given answers and stated that finding the required features directly in source code was challenging.

"The traceability file and the DFD were a big help last time, this time I wasn't really sure if I even answered correctly and didn't really know if the evidence I gave was correct. [...]"

Six further answers of participants in the control condition in the first week (and thus without the comparison to the model-supported condition) reported negatively about their experience in the experiment. Specifically, they mentioned a lack of expertise, uncertainty about the given answers, and general difficulties in answering the tasks. Interestingly, two participants criticized the lack of a "CFG" or "some kind of map of the architecture". This could have been sparked by the introductory lecture, where DFDs were addressed, but is still seen as an interesting comment. The obstacles reported by the participants in the control condition give further weight to the positive feedback of those in the model-supported condition.

Based on a qualitative analysis of the participants' statements, we can cautiously judge the perceived usefulness and accessibility of the DFDs to answer RQ3:

- RQ 3: In our experiment, the perceived usefulness and accessibility reported by the participants varied from very positive feedback to mild critiques reporting some confusion. Overall, the statements focussed on usefulness and were predominantly positive.

G. Open Challenges of DFDs

The above observations of the quantitative and qualitative results allowed us to distill a number of open challenges of DFDs, i.e., current obstacles that would increase the DFDs' positive impact further if solved. Although these challenges were not explicitly investigated as independent variables in our experiment, they became evident from the results of the experiment, explicit answers given by participants, and observations made during the analysis of the tasks.

Open Challenge 1: Understandability of Models. The participants in our experiment performed significantly better in the model-supported condition and they reported a generally good accessibility. Nevertheless, concerns were raised about the understandability of the models. Some participants commented that they did not understand the model initially or that they did not know what some annotations mean. A more usable model representation of software systems should consider the accessibility for human users, especially those with lower domain knowledge.

Open Challenge 2: Presenting Missing Features. The DFDs in their current form do not support the explicit presentation of the absence of features or properties. In the context of security analysis, these could be security mechanisms that are not implemented by a given application.
To enable more comprehensive analysis and increase users' trust, it is important to show that such mechanisms were investigated and are not implemented in the analysed application. In this context, the challenges are to prove the absence, to decide which features to consider, and how to convey this information to the user. We see this open challenge as the hardest one to solve, both conceptually and practically.

Open Challenge 3: Accessibility of Traceability Information. The quantitative results of our experiment show that the traceability information has a positive impact on the correctness of evidence provided for answers to the tasks. While this is an expected observation, multiple participants also mentioned the usefulness of the traceability information for navigating the source code. However, it was also mentioned in some answers that the connection to the source code was difficult to follow. Also, the traceability information was not used by everyone even when it was provided. We conclude that the ease of use can be improved and that navigating the links to source code should be simplified. This challenge is of a more practical nature and can likely be solved with some clever engineering.

- RQ 4: We identified three open challenges of DFDs (understandability of models, presenting missing features, and accessibility of traceability information). If any of these are solved, the positive impact of DFDs on security analysis can be expected to further increase.

V. DISCUSSION
At the heart of the conducted experiment lay the question of the impact of providing DFDs and traceability information on the participants' performance. The results presented in Section IV indicate an overall positive impact on the analysis correctness. The scores improved with statistical significance in the model-supported condition. Figure 7 emphasizes this finding. Participants in the model-supported condition who reported using the provided DFD in more than half of the tasks had a 23% higher score on average than those who reported using it less. A 12% higher average score for participants using the traceability information is further proof of the usefulness of the DFDs, since the traceability information is one of their core features. The observed 17% lower score in analysis correctness for participants who reported using the source code in more than half of the tasks was an unexpected outcome at first sight. A closer look at the usage of the source code as an artefact revealed that, out of 55 responses that mentioned using the source code, 34 (62%) did not use the DFD or traceability information in conjunction. In other words, the source code was predominantly not used alongside the models, but instead as the only artefact to answer a task. Consequently, in our experiment, many participants who reported using the source code could also be described as not using the provided models. With this re-phrasing, the results are another indication of the models' positive impact.

Looking at the individual tasks, the increase in scores differed between them. This raises the question of which type of tasks the DFDs have the most impact on, and how exactly they impact different types of tasks. We investigated whether the nature of the tasks could be an explanation for the observations, i.e., whether the type of task can indicate how the score is influenced. We found that the DFDs impacted the analysis tasks in our experiment in different ways. They are described in the following. Please refer to Table I for the tasks.

Providing an Overview: Tasks 1, 2, and 6 have fairly simple answers in comparison to the other tasks. The answer for task 1 (in which the analysis correctness improved by 33% in the model-supported condition) could be found at two places in the code: either a deployment file indicating the container's port, or a configuration file indicating the service's port. Both answers were accepted as correct. In the DFD, the port is shown as an annotation on the corresponding node. Interestingly, the wrong answers given fall into one of two categories. One is the port number of a different microservice, which likely showed up when searching for "port" with GitHub's search function. The other is the port of a database that is only visible in the code as part of the database's URL. How this answer was reached by participants is puzzling. For task 6, the improvement of the average score was the lowest of all tasks (0.75 in the control condition and 0.79 in the model-supported condition; 5.6% increase). The task has the overall best average scores, likely because the authorization service's name ("auth server") hints towards the answer of which of the services handles the authorization. Task 2 could be answered based on the textual description, on the Java annotation that implements the API gateway in code, or on an annotation in the DFD. A 10% improvement in average score in analysis correctness was observed from the control condition (0.42) to the model-supported condition (0.46). The answers lead us to believe that the question might have been formulated such that participants did not fully understand it. Many of the wrong answers in both conditions stated the used framework (Spring) instead of the library that was asked for (Zuul). Further, this task had the lowest reported number of usages of the DFD as well as traceability information (cf. Figure 6).

The answers and evidence indicate that DFDs are helpful in providing an overview and presenting the answers to simple questions such as the port number of a microservice. Evidently, finding any port in the code is a simple task in many systems' codebases; however, the answers suggest that finding the correct one can be challenging. Likely, this is heightened by the complexity that the microservice architecture adds to an application's codebase due to its decoupling. The answers given by participants in the model-supported condition further emphasize this quality of DFDs to provide an overview of the important system components (cf. Section IV, where this was the benefit most often mentioned by participants). Simultaneously, for simple tasks with a fairly easy answer, good coding practice such as choosing descriptive identifiers seems to support the analysts well, and there is no pressing need to provide a DFD. Whether this holds true in the analysis of larger applications should be investigated in future work. The results of task 2 indicate problems in the DFDs' accessibility.
The presented information seems to not be self-explanatory enough for the participants to answer this task reliably, even when the information is contained in the DFDs.

The results indicate that DFDs serve as a means to "navigate the jungle" that is the application's codebase. They provide an overview of the application's architecture and (security and other) features. At the same time, well-chosen identifiers in code can support the solving of simple analysis tasks, and the DFDs add less value in this scenario.

Reducing Required Domain Knowledge: To answer tasks 3 and 5 in the control condition, some domain knowledge was needed to correctly grasp the functionality of the relevant code. Task 3 required the participants to identify three outgoing connections (for App 1; two for App 2) of a microservice. One is a direct API call implemented with Spring Boot's RestTemplate, another a registration with a service discovery service, and the third a registration with a tracing server (similar for App 2). Some domain knowledge about these technologies or Java was required to identify them. With the DFD at hand, answering the task came down to identifying the correct node in the diagram and noting the three nodes to which there was an information flow. To answer task 5 without the DFD, participants had to check whether three services (for App 1; two for App 2) refer to the authorization service in a configuration file under an authorization section. In the DFD, a connection to the authorization server indicated this. Again, knowledge about Spring or Java made it easier to find the correct answers without the support of the DFDs.

Task 3 showed the biggest impact of the models, with a doubled average score in analysis correctness (0.875 in the control condition and 1.75 in the model-supported condition; 100% increase). While this task was more difficult to answer than the others without a DFD and the required domain knowledge, the magnitude of the difference is still substantial. For task 5, the average score in analysis correctness was 1.29 in the control condition and 1.58 in the model-supported condition, a 23% increase. The differences show how the DFDs reduce the domain knowledge required for analysis activities. However, we hypothesize that the participants without the DFD could answer the task simply by identifying the keyword "authorization" in the configuration files without checking whether the implementation is correct and behaves in the way that is asked for. We believe that this led them to achieve an average score without the DFDs that is still high. Given the scenario in which they solved the tasks (an empirical experiment, where answers are expected), this was likely sufficient evidence for them to answer, independent of whether their domain knowledge was profound enough to fully understand the workings.

Our interpretation of the results is that DFDs are especially helpful in scenarios where a lack of domain knowledge about the analysed application's framework, libraries, etc. hinders the identification of features and system components. The DFDs' ability to shed light on properties shaded by a curtain of domain knowledge seems to be one of their core virtues.

Indicating Absence of Features: Despite open challenge 2 (presenting missing features in the DFDs, see Section IV), the results also indicate that the DFDs in their current form already support users in answering tasks concerning the absence of features in the code. Task 4 was different from the other ones in that the challenge lay not in finding an artefact in the code but instead its absence. The task asked for the presence of encryption in two connections (for App 1; three for App 2) between services. The correct answer to all of them was "No". The average score in analysis correctness was 0.83 in the control condition and 1.33 in the model-supported condition out of a possible score of 2 (60% increase). The difficulty of this task also became apparent when looking at the results for the evidence. The participants achieved an average score in correctness of evidence of 0.042 in both conditions.

Although the DFDs still face the open challenge of presenting missing features, their current form already supports users in answering tasks that require identifying the absence of features in code.

In summary of the discussion of the results, we see that the DFDs had a positive impact on the scores in different types of tasks. Specifically, they provide an overview of the analysed application, they reduce the required domain knowledge, and they can indicate the absence of features in the application. The highest increase in scores is seen for tasks where some domain knowledge was needed to answer them without the DFDs. The only task where the improvement of the analysis correctness in the model-supported condition was negligible was a simple task where descriptive identifiers in code indicated the answer.

VI. THREATS TO VALIDITY

Internal validity: With a large group of university students as participants, collaborations during or between the sessions and resulting cross-contamination cannot be ruled out completely. As mitigation, we strictly discouraged collaborations and conversations about the study and supervised the analysis sessions. Learning effects or the possibility of preparing for the tasks were mitigated with the employed within-groups design, where the scenarios switched over the two sessions, and with the use of different applications. With 90-minute sessions, experimental fatigue is limited. The random assignment to the groups G1 and G2 limits selection bias. Some of the analysed data (timestamps, experience, resource usage) is self-reported, and we have to rely on its correctness. The encouragement of positive as well as negative feedback and the often-repeated reassurance of full anonymity of the answers were used to increase the reliability of the data. By making participation voluntary and using only the standard incentive for attending the lab sessions, it is possible that we have attracted mainly students who show high motivation and are at the top of their class. This could have had distorting effects on the results and could not be reasonably mitigated.
External validity: The conclusions drawn in this paper might not entirely map to other scenarios or populations. The tasks used as examples of security analysis activities might differ from real-world use cases and thus influence the shown effects. Further, the experiment focused on microservice applications written in Java. We chose Java applications because Java is the most used programming language for open-source microservice applications. The analysis of systems that follow a different architectural style or are written in another programming language could show other outcomes. The number of participants (24) is relatively small. We chose robust statistical methods that are suitable for the sample size and discussed the impact of the participants' experience and the choice of tasks. The participants' expertise in security analysis is rather low. Thus, the effects described in this paper might not be observed for other users, e.g., those with more experience. However, the use of DFDs is not confined to security experts, hence rendering the participants a suited population for the experiment. Finally, experiments with practitioners instead of students could lead to different results; however, using students is a common practice and has been shown to produce valid results as well (see Section III-D).

Construct validity: We measured the participants' performance in terms of correctness and time, which are common and objective metrics for such experiments. They relate directly to the practical use case of the investigated effects. The analysis correctness is crucial in security analysis to ensure accurate security evaluations and, consequently, secure systems. The time serves as a measure of productivity and efficiency. Other constructs were disregarded but could be suited as well.

Content validity: The tasks concerned the key security mechanisms implemented in the analysed applications. These or similar tasks would be part of a real-world security analysis. However, other tasks might also be important in this context.

Conclusion validity: The responses to the tasks were given in free-text fields. Although we did not identify ambiguities in quantifying the responses, it is possible that some answers were phrased in a way that was interpreted incorrectly. A more restrictive way of collecting the answers could have increased the conclusion validity.

VII. RELATED WORK

Although DFDs are used for different aspects of security analysis, no related work could be found that investigates their direct impact on the correctness of the analysis. Publications for other model types exist. For example, a considerable body of empirical research on Unified Modeling Language (UML) diagrams has been published [26]. A number of experiments have been conducted to investigate whether users' comprehension of the modelled systems increases with UML diagrams. Gravino et al. [27], [28] observed a positive impact of the models, while experiments by Scanniello et al. [29] did not show such an improvement (the authors attribute this to the type of UML diagrams, which had little connection to the code since they had been created in the initial requirements elicitation phase of the development process). In an experiment by Arisholm et al. [30], code changes performed by participants with access to UML documentation showed significantly improved functional correctness. Other researchers investigated the impact of specific properties of UML diagrams on users' comprehension. For example, Cruz-Lemus et al. [31], [32], Ricca et al. [33], and Staron et al. [34], [35] found that stereotypes (which are similar to annotations in DFDs in our experiment) increased users' efficiency and effectiveness in code comprehension. Some publications found alternative model representations to yield better comprehension among participants in empirical experiments: Otero and Dolado [36] reported that OPEN Modelling Language (OML) models led to faster and easier comprehension than UML diagrams, while Reinhartz-Berger and Dori [37] reported Object-Process Methodology (OPM) models to be better suited than UML diagrams for modelling dynamic aspects of applications.

Bernsmed et al. [38] presented insights into the use of DFDs in agile teams by triangulating four studies on the adoption of DFDs. In the studies, software engineers were confused about the structure, granularity, and what to include in the models, because no formal specification of DFDs exists. The participants in our experiment also showed some difficulties that could be resolved by a clear definition and well-established documentation of DFDs. Regarding DFDs' structure, Faily et al. [39] argued that they should not be enriched with additional groups of model items, since their simplicity and accessibility for human users might suffer. Instead, they proposed to use them together with other system representations. In contrast, Sion et al. argued in a position paper [40] that using DFDs in their basic form is insufficient for threat modelling. Based on our findings, we argue that adding annotations to DFDs does not impede their accessibility and that security-enriched DFDs are well suited to support security analysis activities.

In conclusion, no publications were found that empirically investigate the impact of DFDs (or other model representations) on the security analysis (or related activities) of microservice applications.

VIII. CONCLUSION

This paper presents the results of an empirical experiment conducted to investigate the impact of DFDs on software security analysis tasks. DFDs are widely used for security analysis, and their varied adoption indicates a high confidence in their usefulness. To the best of our knowledge, the presented results are the first to investigate these assumptions and can confirm a positive impact of DFDs in the given context. We found that participants performed significantly better concerning the analysis correctness of security analysis tasks when they were provided a DFD of the analysed application. Additionally, traceability information that links model items to artefacts in source code significantly improved their ability to provide correct evidence for their answers. Consequently, this paper serves as a basis for future research on specific areas of applicability and properties of DFDs. Further, it can provide guidance in decisions on the adoption of model-based practices.

ACKNOWLEDGEMENT

This work was partly funded by the European Union's Horizon 2020 programme under grant agreement No. 952647 (AssureMOSS).
REFERENCES

[1] L. Sion, K. Yskout, D. Van Landuyt, W. Joosen, Solution-aware data flow diagrams for security threat modeling, in: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1425–1432. doi:10.1145/3167132.3167285.

[2] S. Hernan, S. Lambert, T. Ostwald, A. Shostack, Threat modeling: Uncover security design flaws using the STRIDE approach, MSDN Magazine (2006) 68–75.

[3] Microsoft Corporation, Microsoft Threat Modeling Tool 2016 (2016). URL: https://www.microsoft.com/en-us/download/details.aspx?id=49168

[4] P. Torr, Demystifying the threat modeling process, IEEE Security & Privacy 3 (5) (2005) 66–70. doi:10.1109/MSP.2005.119.

[5] M. Abi-Antoun, D. Wang, P. Torr, Checking threat modeling data flow diagrams for implementation conformance and security, in: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, ASE '07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 393–396. doi:10.1145/1321631.1321692.

[6] M. Abi-Antoun, J. M. Barnes, Analyzing security architectures, in: Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 3–12. doi:10.1145/1858996.1859001.

[7] B. Berger, K. Sohr, R. Koschke, Automatically extracting threats from extended data flow diagrams, in: Engineering Secure Software and Systems, Vol. 9639, 2016, pp. 56–71. doi:10.1007/978-3-319-30806-7_4.

[8] C. Cao, S. Schneider, N. Diaz Ferreyra, S. Verweer, A. Panichella, R. Scandariato, CATMA: Conformance analysis tool for microservice applications, in: 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), 2024.

[9] K. Tuma, R. Scandariato, M. Balliu, Flaws in flows: Unveiling design flaws via information flow analysis, in: 2019 IEEE International Conference on Software Architecture (ICSA), 2019, pp. 191–200. doi:10.1109/ICSA.2019.00028.

[10] R. Chen, S. Li, Z. E. Li, From monolith to microservices: A dataflow-driven approach, in: 2017 24th Asia-Pacific Software Engineering Conference (APSEC), 2017, pp. 466–475. doi:10.1109/APSEC.2017.53.

[11] T. D. Stojanovic, S. D. Lazarevic, M. Milic, I. Antovic, Identifying microservices using structured system analysis, in: 2020 24th International Conference on Information Technology (IT), 2020, pp. 1–4. doi:10.1109/IT48810.2020.9070652.

[12] S. Li, H. Zhang, Z. Jia, Z. Li, C. Zhang, J. Li, Q. Gao, J. Ge, Z. Shan, A dataflow-driven approach to identifying microservices from monolithic applications, Journal of Systems and Software 157 (2019) 110380. doi:10.1016/j.jss.2019.07.008.

[13] N. Dragoni, S. Giallorenzo, A. Lluch-Lafuente, M. Mazzara, F. Montesi, R. Mustafin, L. Safina, Microservices: Yesterday, today, and tomorrow, Springer International Publishing, 2016, Ch. 12, pp. 195–216. doi:10.1007/978-3-319-67425-4_12.

[14] J. Lewis, M. Fowler, Microservices: A definition of this new architectural term, MartinFowler.com 25 (14-26) (2014) 12.

[15] S. Schneider, R. Scandariato, Automatic extraction of security-rich dataflow diagrams for microservice applications written in Java, Journal of Systems and Software 202 (2023) 111722. doi:10.1016/j.jss.2023.111722.

[16] T. DeMarco, Structured Analysis and System Specification, Springer Berlin Heidelberg, 1979. doi:10.1007/978-3-642-48354-7_9.

[17] K. Tuma, R. Scandariato, M. Widman, C. Sandberg, Towards security threats that matter, in: S. K. Katsikas, F. Cuppens, N. Cuppens, C. Lambrinoudakis, C. Kalloniatis, J. Mylopoulos, A. Antón, S. Gritzalis (Eds.), Computer Security, Springer International Publishing, Cham, 2018, pp. 47–62. doi:10.1007/978-3-319-72817-9_4.

[18] S. Schneider, T. Özen, M. Chen, R. Scandariato, microSecEnD: A dataset of security-enriched dataflow diagrams for microservice applications, in: 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 2023, pp. 125–129. doi:10.1109/MSR59073.2023.00030.

[19] B. Kitchenham, S. Pfleeger, L. Pickard, P. Jones, D. Hoaglin, K. El Emam, J. Rosenberg, Preliminary guidelines for empirical research in software engineering, IEEE Transactions on Software Engineering 28 (8) (2002) 721–734. doi:10.1109/TSE.2002.1027796.

[20] N. Juristo, A. Moreno, Basics of Software Engineering Experimentation, 2001. doi:10.1007/978-1-4757-3304-4.

[21] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, Springer, Germany, 2012. doi:10.1007/978-3-642-29044-2.

[22] S. Schneider, N. E. Diaz Ferreyra, P.-J. Queval, G. Simhandl, U. Zdun, R. Scandariato, Replication package, 2024. URL: https://github.com/tuhh-softsec/SANER2024_empirical_experiment_DFDs

[23] I. Salman, A. T. Misirli, N. Juristo, Are students representatives of professionals in software engineering experiments?, in: Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE '15, IEEE Press, 2015, pp. 666–676.

[24] M. Svahnberg, A. Aurum, C. Wohlin, Using students as subjects - an empirical evaluation, in: Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 288–290. doi:10.1145/1414004.1414055.

[25] D. Falessi, N. Juristo, C. Wohlin, B. Turhan, J. Münch, A. Jedlitschka, M. Oivo, Empirical software engineering experts on the use of students and professionals in experiments, Empirical Software Engineering 23 (1) (2018) 452–489. doi:10.1007/s10664-017-9523-3.

[26] D. Budgen, A. J. Burn, O. P. Brereton, B. A. Kitchenham, R. Pretorius, Empirical evidence about the UML: A systematic literature review, Software: Practice and Experience 41 (4) (2011) 363–392. doi:10.1002/spe.1009.

[27] C. Gravino, G. Scanniello, G. Tortora, Source-code comprehension tasks supported by UML design models: Results from a controlled experiment and a differentiated replication, Journal of Visual Languages & Computing 28 (2015) 23–38. doi:10.1016/j.jvlc.2014.12.004.

[28] C. Gravino, G. Tortora, G. Scanniello, An empirical investigation on the relation between analysis models and source code comprehension, in: Proceedings of the 2010 ACM Symposium on Applied Computing, SAC '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 2365–2366. doi:10.1145/1774088.1774576.

[29] G. Scanniello, C. Gravino, M. Risi, G. Tortora, G. Dodero, Documenting design-pattern instances: A family of experiments on source-code comprehensibility, ACM Transactions on Software Engineering and Methodology 24 (3) (2015). doi:10.1145/2699696.

[30] E. Arisholm, L. Briand, S. Hove, Y. Labiche, The impact of UML documentation on software maintenance: An experimental evaluation, IEEE Transactions on Software Engineering 32 (6) (2006) 365–381. doi:10.1109/TSE.2006.59.

[31] J. A. Cruz-Lemus, M. Genero, D. Caivano, S. Abrahão, E. Insfrán, J. A. Carsí, Assessing the influence of stereotypes on the comprehension of UML sequence diagrams: A family of experiments, Information and Software Technology 53 (12) (2011) 1391–1403. doi:10.1016/j.infsof.2011.07.002.

[32] M. Genero, J. A. Cruz-Lemus, D. Caivano, S. Abrahão, E. Insfran, J. A. Carsí, Assessing the influence of stereotypes on the comprehension of UML sequence diagrams: A controlled experiment, in: K. Czarnecki, I. Ober, J.-M. Bruel, A. Uhl, M. Völter (Eds.), Model Driven Engineering Languages and Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 280–294.

[33] F. Ricca, M. Di Penta, M. Torchiano, P. Tonella, M. Ceccato, How developers' experience and ability influence web application comprehension tasks supported by UML stereotypes: A series of four experiments, IEEE Transactions on Software Engineering 36 (1) (2010) 96–118. doi:10.1109/TSE.2009.69.

[34] M. Staron, L. Kuzniarz, C. Wohlin, Empirical assessment of using stereotypes to improve comprehension of UML models: A set of experiments, Journal of Systems and Software 79 (5) (2006) 727–742. doi:10.1016/j.jss.2005.09.014.

[35] M. Staron, L. Kuzniarz, C. Thurn, An empirical assessment of using stereotypes to improve reading techniques in software inspections, SIGSOFT Software Engineering Notes 30 (4) (2005) 1–7. doi:10.1145/1082983.1083308.

[36] M. C. Otero, J. J. Dolado, An empirical comparison of the dynamic modeling in OML and UML, Journal of Systems and Software 77 (2) (2005) 91–102. doi:10.1016/j.jss.2004.11.022.

[37] I. Reinhartz-Berger, D. Dori, OPM vs. UML: Experimenting with comprehension and construction of web application models, Empirical Software Engineering 10 (2005) 57–80. doi:10.1023/B:EMSE.0000048323.40484.e0.

[38] K. Bernsmed, D. Cruzes, M. Jaatun, M. Iovan, Adopting threat modelling in agile software development projects, Journal of Systems and Software 183 (2021) 111090. doi:10.1016/j.jss.2021.111090.

[39] S. Faily, R. Scandariato, A. Shostack, L. Sion, D. Ki-Aries, Contextualisation of data flow diagrams for security analysis, in: H. Eades III, O. Gadyatskaya (Eds.), Graphical Models for Security, Springer International Publishing, Cham, 2020, pp. 186–197.

[40] L. Sion, K. Yskout, D. Van Landuyt, A. van den Berghe, W. Joosen, Security threat modeling: Are data flow diagrams enough?, in: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 254–257. doi:10.1145/3387940.3392221.