Porte, G., & McManus, K. (2019). Doing Replication Research in Applied Linguistics.
APPLIED LINGUISTICS
The Second Language Acquisition Research series presents and explores issues bearing
directly on theory construction and/or research methods in the study of second
language acquisition. Its titles (both authored and edited volumes) provide thor-
ough and timely overviews of high-interest topics, and include key discussions of
existing research findings and their implications. A special emphasis of the series is
reflected in the volumes dealing with specific data collection methods or instru-
ments. Each of these volumes addresses the kinds of research questions for which
the method/instrument is best suited, offers extended description of its use, and
outlines the problems associated with its use. The volumes in this series will be
invaluable to students and scholars alike, and perfect for use in courses on research
methodology and in individual research.
Of related interest:
Second Language Acquisition
An Introductory Course, Fourth Edition
Susan M. Gass with Jennifer Behney and Luke Plonsky
ILLUSTRATIONS
Figures
1.1 The research cycle 2
2.1 Google Scholar screenshot: bilingualism 18
2.2 Sample routes to selection of your study 19
3.1 Reading a paper: awareness-raising 29
3.2 Practice awareness-raising: the abstract 42
7.1 Parallel coordinate plot showing individual changes from
pre-test to post-test 131
8.1 Example poster. To view this in colour and in closer detail
please visit the eResource at www.routledge.com/9781138657359 170
Tables
4.1 Descriptive statistics: pre-test 53
4.2 Descriptive statistics: post-test 53
7.1 Descriptive statistics for percentage group mean accuracy scores
(mean, 95% CIs [LL, UL], SDs) at pre-test, post-test, and
delayed post-test 136
7.2 Effect size comparisons (Cohen’s d with CIs for d) with
treatment groups from Bitchener and Knoch (2010), and
effect size changes with effects adjusted for baseline differences 137
ACKNOWLEDGEMENT
The authors would like to thank Luke Plonsky for the feedback and advice
obtained on the original drafts of this book.
1
INTRODUCTION
Why Replication Research Matters
There is much in Sagan’s words which can serve as a stimulus to all research-
ers, novice and seasoned alike. He reminds us that science is more than the sum of
its parts: all the facts we might have learned at school and beyond form the
basic building blocks of our accumulated knowledge about a subject. However,
it is then one’s approach to those facts that defines the indispensable “scientific”
way of looking at things. This is characterized by a questioning, critical approach
to what we are told or what we read: “skeptical” is clearly a cornerstone of
Sagan’s attitude to scientific knowledge.
His words are particularly pertinent, however, for those interested in embark-
ing on replication research. He calls for attention to potential error in what is
claimed in that knowledge, recalling our “human fallibility”. An implication here
is that we are imperfect beings, and it is therefore natural for us to be wrong at
times. Rather than dwell on that very human characteristic, Sagan focuses on the
attitude we must adopt to such a “failing”. We must be able to ask questions of
what we hear and of what we read.
This book is about showing you how to go about honing this questioning
attitude to what we read, and then acting upon any doubts, skepticisms, or just
plain inquisitiveness that reading may have aroused. While science research has
had its share of “charlatans”, we begin by taking all the research we read at
face value. Our principal aim in replicating research, therefore, will not be to
debunk dubious claims nor sniff out potential falsehood, but rather to return to
a study that interests us, “repeating it in a particular way to establish its stability
in nature and eliminate the possible influence of artifacts or chance findings”
(Porte, 2012, p. 4).2
On our way toward that aim, which makes up the core of this book, we
address many conceptual and procedural elements of replication. First, however,
in this introductory chapter, we respond to some fundamental questions we have
heard and anticipate on this topic.
FIGURE 1.1 The research cycle: identify research area → design research study → carry out research → analyze research results → publish the results → (and back to identify research area).
We want to focus your attention at the point in Figure 1.1 where you see the
dotted arrow. This is the critical juncture between the publication of what you
have done, or what you have read, and its producing the next piece of work.
This is traditionally where you will enter the cycle, of course. You might be
encouraged by your interests, or your instructor, to produce further research at
this point, using the previous literature as an appropriate springboard or stimulus
for something original or innovative. Indeed, “original” and/or “innovative”
is often the sine qua non for doctoral theses in many universities, and for many
academic journals for the kind of papers they tend to publish. In other words,
this next step in the cycle – and your conventional entry to it – typically means
initiating something that follows up previous research by extend-
ing it into new areas and original contexts. Typically, then, you might use the
previous methodology in the cycle to produce new data, from a new context with
other participants.
Note how your principal aim here is not to question, nor to reconsider in any
critical way the previous study’s procedures and outcomes in the light of your
new data. Of course, a previous study’s procedures might be an important point
in contextualizing your findings, and so you would want to reflect on those
when your findings pattern similarly and/or differently with previous work. An
important distinction between a research/extension study and a replication study,
therefore, is that the extension study’s principal aim is not to critically question,
revisit, and/or reconsider a previous study’s methodology. Thus, in an extension
study, any subsequent comparison between your study and a previous one would
be encouraged, but is incidental – and not your objective.
This book will have you enter the cycle at the same (dotted) point. Your
intended aim, however, will shift: your contribution will come precisely from
focusing on that previous step in the cycle – the publication. The stimulus for the
next step and for what to research now comes from one earlier study, in particular,
rather than a whole set of studies.
Now we need to shift our gaze back onto that previous study rather than focus
just on the new one being carried out.
We now assume that you are interested in what actually went on in that previ-
ous study.
This is the stimulus for any replication study you may want to undertake.
What attracted you to the study in the first place? We will be looking more
closely at such stimuli in later chapters but, for now, we can suggest that you may
have returned to it because it is a study everyone cites as important, or particularly
significant for your area of research. Or maybe it is a study with results which
somehow don’t seem to “fit in” with other, similar, ones you have read. It may
be that you question the generalizability (i.e., external validity) of the study’s
findings to other learners, contexts, or target structures. Or perhaps you just feel
the results warrant closer inspection. Either way, it has piqued your interest to
the point where you want to review what went on, to reconsider what emerged
from it in a new light, and somehow take another look at it.
Through replication, a study can have its findings more precisely validated, its reliability focused on, its generalization tested,
perhaps even delimited. Furthermore, having the benefit of a number of research-
ers working on the same project independently or as part of a research team can
also add to the amount of detailed understanding that can be achieved. We will
take up the idea of collaborative replication research work in the final chapter.
So replication matters because – whatever the outcome – a contribution to
a better understanding of the target study is made. As Sagan implied, part of
our task as (applied) scientists is to ask questions; in so doing, we can expect
to discover error. By identifying error and having its rectification built in to
our understanding of a phenomenon, we help our field progress and make our
research more credible both for our fellow researchers and for practitioners in the
classroom – as well as the general public.
But error won’t be found if we don’t look, and in AL research at least, there
is evidence we are not looking. As far back as 1970 one social science researcher
was already claiming that “neglecting replication is scientific irresponsibility”
(Smith, 1970, p. 971).4 Such a lack of replication over the long term can even see
the academic base of the discipline brought into question. For many years, this
lack of adequate evidence and relative absence of self-correction in our research
was not considered of great importance. There now seems to be considerably
more concern.
Fast-forward to the last decade and we witness burgeoning debate on the
importance of replication research.5
So, are we observing here merely the periodic reappearance of the replication
debate in social sciences we mentioned above? Various aspects of the current
demand for replications would seem to indicate that – this time – things are more
serious. The difference now appears to be that while the perceived importance of
such research remains high, many of the search hits on this topic reveal articles and blogs
which carry a tone of foreboding, with many choosing to warn of “trouble at the
lab”, “serious problems”, a “growing crisis”, a “failure” in social science research,
or predicting a “deepening crisis” and “a bleak outcome” for such work. What
has happened in the meantime to warrant such pessimism?
» Activity
a. What criticisms are made about the way research is currently being car-
ried out and disseminated?
b. What does each writer see as the way forward for such research?
This last reading activity will have revealed one of the reasons why replica-
tion suddenly came to the attention of the general public once again. A quick
search for the term “reproducibility crisis” will bring up large numbers of recent
articles and discussion. The discussion appears to have centered on how much
confidence we can have in the scientific research we read – and therefore to
what extent apply it to learning situations – if so much appears not to have been
adequately replicated or, in some cases, has proved impossible to reproduce.
This book provides instruction on how you can carry out a replication. Here
is not the place to debate the relative merits of aiming for reproducibility versus
replication. In theory, however, when we speak of “reproducibility” we are
referring to the extent to which an outcome can be confirmed using the same
approach, participants, method, and analysis. Reproducibility also encompasses
research and reporting practices associated with “open science” that can
facilitate efforts to replicate and to compare the results of replication and ini-
tial studies. “Replication”, by contrast, is usually understood here to involve
obtaining the same outcome with variations on the original approach. However,
there is debate even about the definition of “reproducible”, and the confusion
has led to a lack of clarity between what “reproducing” and “replicating” a
study entails. As we will discuss in Chapter 4, “reproducibility” in its strictest
sense is exactly reproducing a previous study to verify its findings or add more
knowledge about its generalizability, for example. We refer to this as “exact”
replication and suggest such an endeavor is impossible in the field of AL (and
many others!). Our use of “replication” embraces a series of modified repeti-
tions of the original experiment along a continuum and promises equally useful
contributions to the field.
The above reading activity could just as easily imply that many who depend
on our research for its possible pedagogical implications and applications may
be rightly concerned about the presence of undetected error or the lack of
confirmatory evidence provided. Ironically, of course, one of the first lessons
you learn as an experimental researcher, whatever research you choose to
do, is that – despite all our built-in safeguards and the precision with which we
prepare and execute our study, and then analyse our data – what we do will
inevitably be subject to potentially significant limitations, bias, and error. It is all
part of researching.
Now, if we admit this from the start, we also need to appreciate the conse-
quences: our research – since it contains unavoidable error – can never be the final
say on something. There will always be something else that needs to be clarified,
tweaked, or further investigated as a result of our work. Maybe our choice of par-
ticipants was restricted by the learning context where we found them. Maybe the
randomness we wanted in selection just was not available there, or at that time.
Maybe there was some confusion in the instructions given the participants, or a
more powerful statistical procedure may have revealed more detailed information
on the data. Maybe the presence of a relationship or the effect of an instructional
treatment is clear but we are unsure of its strength.
In short, it is inevitable that some acknowledged – or as yet undiscovered –
limitation will raise further questions that encourage us to go back to the study
and find out something more about it.
Now, all this means we are encouraged by such a research process to move
forward and look backward to achieve our goal. It also follows that discovery
through research is not always in the one direction everybody else appears to be
heading, but ironically can also be behind us! If we all head or experiment in the
same (frontward) direction, it becomes all too easy to look for (and report) posi-
tive results only that confirm that we are all heading the right way! Replicating
previous research fulfills this need to look back at and review what has led up to
our present state of knowledge.
This book seeks to answer a number of questions on the practical aspects of
doing replication research in AL. In particular:

•• what kind of replication approach is most useful given the nature of the target study;
•• how to carry out the study to maximize its replicative potential;
•• how to write up the study to highlight its comparative core; and
•• where to get the work published to maximize its impact on the field.
It starts from the premise that replication research is essential to the con-
duct of good science and the advancement of knowledge. Albeit somewhat
belatedly, there is now an obvious groundswell of interest in carrying out rep-
lication research in AL and in the essential contribution such work makes to the body
of knowledge. Searching “applied linguistics and replication” in your browser
now yields a plethora of hits. Even the mere publication of this volume and that
which preceded it (Porte, 2012) – the first two of their kind in the field and per-
haps in any social science – would seem to signal a new era as far as replication
research is concerned. Further endorsement has come from the recent publi-
cation of detailed replication reporting guidelines by the prestigious American
Psychological Association6 together with dedicated strands in leading AL jour-
nals, including Language Teaching and Studies in Second Language Acquisition, further
supported by international conference roundtables and workshops.
With an eminently practical approach, this book will answer the need for more
such work by showing you how to conduct meaningful replication studies and
integrate them into your scholarly habits. In an ever changing and increasingly
diverse research environment, it will also answer a perceived need in the field for
authoritative, practical guidance on carrying out replication research in AL.
Envisage our purpose as having much more to do with handling the inher-
ent limitations placed on any research we do, and where we do it. Much of
that research is carried out in educational settings. Typically, the contextual
advantages – and safeguards – afforded the researcher working in other fields
are simply not available. Randomization of participants, for example, is unlikely to
be offered in the typical intact class setup found in schools. Treatment condi-
tions whereby participants are presented with distinct teaching methodologies
or material are unlikely to be welcomed by local authorities (or parents!). Thus
often the researcher in these settings cannot determine, or take account of, the
many peripheral variables which might have affected the outcome. And so we are
typically presented with a study in which the investigation starts after the fact has
occurred without interference from the researcher – providing us with tantaliz-
ingly incomplete results, often ripe for replication.
In such circumstances, and faced with the need for further insights into what
has transpired in a study, we encourage you in this book to undertake replication
studies as a means of improving the interpretability of research, of “filling in the
gaps”, if you will. When a number of such replications are carried out on a study
of interest the result is at the very least a more detailed knowledge of that study
and, potentially, a more comprehensive understanding of the generalizability or
otherwise of its findings.
In Chapter 2 we take up further the reasons why replicating
research can often be beneficial for all those involved in researching, together
with applying that knowledge to AL, and presenting it as evidence for future
practice and policy. Suffice it to suggest here that discovery, generalization, and delimi-
tation, together with acceptance of the inevitability of bias and error, are all
concomitants – and some would say desirable concomitants – of scientific research.
Without such imperfect and partial knowledge, we would not need to ask Sagan’s
“skeptical questions”.
You should be aware at this point, however, that we do not undertake a
replication study because we are assuming there has been error, or poor execu-
tion of the study, or even because we suspect something deceitful has gone on.
Replicating a study is not your embarking on a criminal investigation! That said,
it would be fair to say that some researchers whose studies are replicated do
react to replication endeavors in this way. We will take up this point about
collaboration in the research effort in the final chapter of the book.
Chapter 2 goes on to demonstrate how to begin a search for a suitable study
to replicate. This is done by encouraging you to “ask the right questions” of
your reading, thereby developing a critical awareness of where a particular area
of interest, and by extension a specific study, might reward a revisit. The selec-
tion routes offered move from the more general consideration of main areas of
study, to sub-areas, through possible topics suggested by titles of papers, and on
to specific research studies of interest. At each stage you are shown how to seek
out target studies by making use of tools, including Google Scholar search, state-
of-the-art reviews, and customized calls for replication studies in journals, and
specific suggestions for follow-up research found in “limitations” descriptions
often found at the end of research papers.
Chapter 3 then situates the search at the level of a specific study now identi-
fied as being of interest for possible replication. You are shown how to read a
paper to raise your awareness by stimulating “a stream of consciousness”. This
encourages you to formulate your thoughts in questions that you might reason-
ably ask of the author. The chapter takes you through each section of a typical
research paper in the same way, encouraging you to address aspects in detail such
as participant characteristics, sample size, length and nature of the intervention,
specific task variables, and analysis procedures.
Having homed in on a possible study and the reasons why you might wish
to see it replicated, Chapters 4 and 5 take you through the various replication
types you might now want to use. In the first of these, you will look at internal
replication, by means of the routine checking of the outcomes presented in
the study, from the research questions or hypotheses themselves through to the reported analyses and conclusions.
1. Bitchener, J., & Knoch, U. (2010). Raising the linguistic accuracy level
of advanced L2 writers with written corrective feedback. Journal of Second
Language Writing, 19, 207–217.
2. Carter, M., Ferzli, M., & Wiebe E. (2004). Teaching genre to English first
language adults: A study of the laboratory report. Research in the Teaching of
English, 38.4, 395–419.
Notes
1 Interview with Charlie Rose on May 27, 1996.
2 Porte, G.K. (Ed.) (2012). Replication Research in Applied Linguistics. Cambridge: Cambridge
University Press.
3 Neuliep, J.W. (1991). Replication Research in the Social Sciences. Thousand Oaks:
Sage Publications.
4 Smith, Jr., N.C. (1970). Replication studies: A neglected aspect of psychological research.
American Psychologist, 25, 970–975.
5 Freese, J., & Petersen, D. (2017). Replication in social science. Annual Review of Sociology,
43, 147–165; Ishiyama, J. (2015). Replication, research transparency, and journal pub-
lications: Individualism, community models, and the future of replication studies. PS:
Political Science & Politics, 47.1, 78–83.
6 Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M.
(2018). Journal article reporting standards for quantitative research in psychology: The
APA publications and communications board task force report. American Psychologist,
73.1, 3–25 (see p. 17).
2
FINDING A STUDY TO REPLICATE
Background Research
Our starting point in our search for a useful study which will merit the kind of
attention replication brings might be the current principles or assumptions about
AL we have come across in our reading. While we may well have seen these
repeated in much of the literature we read and while they may appear as givens,
our experience of the area might lead us to question the assumption and ask how
far things are exactly as described.
Read some current assumptions about L2 learning and think about how
far your experience leads you to agree or disagree with them. What
historical research have you read about which might have led to the
formulation of these principles?
•• Teachers and students should use the target language rather than the L1
in the classroom.
•• The more meaningful exposure, the more you learn.
•• The sooner you can acquire the grammatical system of a language, the
sooner you can use the language creatively.
•• Learners who use learning strategies effectively are more successful.
•• Students learn best by having their errors corrected immediately by their teachers.
•• Motivation affects the time spent learning a language.
•• Too much correction or criticism can inhibit your learning.
•• The more the language you are learning is like one you already know, the
more quickly you will learn it.
Your experience, intuition, and/or your reading of the literature may well
have led you to question the veracity or inclusiveness of some of these statements.
The next step in our search could now be to focus on an area of interest suggested
by these (or other) assumptions of interest, and then search out more specific
papers that might merit replication.
As we now search out a suitable study for replication, remember what was dis-
cussed in the Introduction. While in the short term replicating a study will prove
invaluable practice for novice researchers who are finding their way around
research methodology, doing so must also be seen as an important step in advanc-
ing your own research agenda (and career!). Therefore, select your target studies
for replication with an eye to helping move forward both your own research
practice and interests as well as giving service to current knowledge in the field.
Opting for a randomly chosen study merely to check whether the original author
made a mistake is probably not a wise move.
What follows are four possible “routes” to finding your target study: rereading
any experimental research which you came across in your course reading, using
Google Scholar searches, consulting state-of-the-art reviews, and following up
published replication studies and journal calls for replication.
Course Reading
You will doubtless already have read a large number of research papers, some of
which will have sparked your interest in some way – perhaps there was an unu-
sual outcome, maybe you wondered what the outcome would have been like
with more participants or from different backgrounds, or possibly you thought
the instructions given to the control group could have been clearer, and so on.
You might even have wondered whether – given the fact that it was carried out
some time ago – the same outcomes might prevail today. As we shall see below,
these are all legitimate initial stimuli for you to carry out a replication study.
Similarly, the authors themselves may well have suggested how their research
might be carried on or revisited in the “Limitations” section often found toward
the end of the published paper.
At this point, however – once you have settled on those studies that aroused
this initial curiosity – you should begin to analyze their potential for replica-
tion in more practical terms, while bearing in mind the advice in the section
below, Final Considerations for Replication (see also Chapter 6 on the feasi-
bility of replication).
Below we will look at the feasibility of carrying out the replication in more
detail when we critique an individual study. However, for now, and even at
this early stage of selection and reading through the abstract of a potential tar-
get study, we can begin to weigh the labor, cost, and time requirements that might
be involved in carrying out the replication. Here, for example, a quick reflec-
tion on the selection of participants and a summary of procedures might lead
us to question how far in Study 3 our replicating the selection and procedures
would be more onerous than the other studies. The replication may be justi-
fied or a useful contribution to knowledge, but the resources required to get
a research and teaching team together for a long period as well as the recruit-
ment from such a large sample base may prove problematic for us. We might
also feel that the novel and longitudinal nature of the investigation (i.e., new
teaching/learning method over two years) is not something we could commit
to comfortably.
Likewise, the decision between a replication of Study 2 and 1 might hinge on
the fact that participant selection might need to be based on what was available in
the current “L2 classes” of interest. Moreover, video recording of the interviews
could be costly and would need to be voluntary, which might mean some drop-
outs from the original group and, again, there is extra time to be factored in with
the individual interviewing itself.
FIGURE 2.2 Sample routes to selection of your study (detail): from the field APPLIED LINGUISTICS, possible topic routes suggested by titles/keywords include L1 attrition, age of L2 acquisition, gender-specific vocabulary, and ‘inner-circle’ vs. ‘outer-circle’ varieties.
Choose ONE of the following sample AL fields of study and use the term
to carry out a Google Scholar search to establish one sub-field which
interests you. Then use the search outcome to narrow down your search
to two or three possible target studies.
Read pages 312 through 315 from the state-of-the-art review of research
on young L2 learners in Asia by Butler (2015).1 Underline as in the exam-
ple below what the author suggests are possible limitations and/or
potential modifications to a subsequent study.
Chow, McBride-Chang, & Burgess (2005)2 conducted a 9-month study
of Hong Kong kindergarteners in order to examine the relationships among
three phonological processing skills in Chinese (i.e., phonological awareness
(PA), rapid automatized naming, and verbal short-term memory) and word
reading in Chinese (L1) and English (L2/FL). After controlling for visual skills
that were considered to be an important predictor for Chinese reading, the
authors found that the three phonological processing skills were moderately
associated with word reading in both Chinese and English, and that the asso-
ciation remained stable over time. In addition, PA in Chinese (measured by a
syllable deletion test) was a relatively strong predictor for word reading both
in Chinese and English, suggesting that PA is important not only for learning
to read in alphabetic languages but also for reading in Chinese (a morphosyl-
labic/logographic language). However, the authors acknowledged that only
syllable awareness was tested in this study. In Chinese, a syllable constitutes
a morpheme that carries the semantic information of a word. Reading was
also restricted to only word-level reading in their study.
These papers have the considerable advantage that they can not only be used
as a potential source of a replication study, but also as a model for setting out
the need for a replication in your eventual writing up of your work prior to
submission to a journal (see Chapter 6). You will find an initial section in each
paper which explains where the original study (and any other replications) fits
into current knowledge in the field. The next section tells us the main details of
that original study to be replicated in terms of how it approached the problem
and framed the research question(s). Enough information is provided here to
give you an idea of what went on, why, and what the outcomes were. If you are
then interested in proceeding, you will need to read that study in detail (see Final
Considerations for Replication). References are then provided to where it can
be accessed in the references list. Authors then go on to tell the reader how they
suggest the replication might best be framed, pointing you in the direction of
what kind of approach(es) might bear useful (and publishable!) replication fruit,
and what elements of the original study might be usefully varied.
Published replication studies will be a useful source of potential recommendations.
Authors will typically have reviewed the background literature to their own replica-
tion, pointing out the gaps in the literature and eventually focusing on their questions
arising from the original study. Along the way they will often have provided similar
indications of the need for replication in other studies. Equally, in discussing and
concluding a paper, you may well read the author suggestions about how additional
replications might further refine what has been revealed so far. In the replication
studies we will refer to in the text, we will meet the typical “Review of the litera-
ture” section and, on the way to discussing the target study, the authors will typically
endeavor to highlight unresolved aspects of closely related work as well. Likewise, at
the end of a replication, you would expect to see the typical “Limitations” section,
where you might even find the authors suggesting further replications with subtle
variations which they feel might contribute even more useful data about the
target study replicated. Chapters 6 and 7 discuss these aspects in considerable detail
when we take you through the steps of executing and writing up a replication study.
Think about why a paper such as this might continue to be of enough inter-
est to be regularly cited. Is the criticism of the paper all in one direction only?
What conclusions might you draw from this “popularity” as regards the
possible usefulness of revisiting it?
•• F. The references at the end of the original paper are now out of date.
How important is this observation with respect to the need for replication?
Are there any circumstances when this observation might signal a possible
need for replication?
•• G. The original study is cited as one of the most significant exam-
ples in practice of a particular theory.
How far might the continuing relevance of, or interest in, the theory direct
our thinking as regards replication?
•• H. The original study identified limitations.
How might these limitations set us thinking about replication needs?
•• I. The journal in which the paper was first published is a prestigious one.
Why might the reputation of the journal affect our decision? Think about
the impact and potential audience involved.
•• J. The participants chosen/assigned in the original study were very
similar to those I work with.
How similar would they need to be to justify a replication on this basis? Is
this reason enough to proceed to a replication?
•• K. The statistical analysis used on the original data can now be improved upon.
How might a different available analysis or procedure signal the potential
usefulness of a replication?
•• L. The original study used only a small number of participants.
Think about how the number of participants might affect the outcome of a
study and the generalizability of the conclusions.
•• M. The original study targeted students learning ESL (English as a
second language).
How might replication in other language learning contexts or with other L2s prove interesting?
•• N. The study reports that results were not statistically significant.
How might the apparently unsuccessful outcome be interesting with respect
to further replication attempts?
•• O. It would be interesting to use the methodology in my local lan-
guage learning context to see what happens.
Would this be replicating the study in the terms we have described above in section 2.1?
•• P. I wonder whether an intervening variable, such as participants’ educa-
tional background, might have affected the outcome of the original study.
Which specific variables do you think might have affected the result, and
how might a replication attempt to discover this?
•• Q. I wonder if adding a further source of data would provide addi-
tional interest from a replication study?
What other data sources do you think might be added to the original study
and which might conceivably provide us with useful data?
•• R. Effect size data is not presented or is not convincing.
How might a replication which provides such a statistic add to our knowledge?
Before committing to a replication of a target study, we need to know exactly how it was carried out, from participants/data selection,
through instruments, procedures to results, and analysis (see also Chapter 6).
It is a fair assumption that the further back you delve in the published archives,
the more difficult it will be to obtain this kind of detailed data and procedures
from the original author(s). There are severe space restrictions on most printed
journals these days and authors may simply not have the space to present all the
detailed data you need: for example, perhaps we can read about the number of
participants but their learning context/history may need more detail. If the infor-
mation is not available as supplementary data on the publisher’s website, in online
repositories such as IRIS (www.iris-database.org), or on research project websites
such as LANGSNAP (http://langsnap.soton.ac.uk), you will need to be able to
contact the original authors and request it.
Be on the lookout, too, for studies where an unexpected or unusual out-
come is reported. Such outcomes from well-designed and rigorous studies are
interesting in the sense that things appear not to have come out the way the authors
were expecting or the way the literature suggested they might; further examination
of the question is therefore likely to be warranted and welcomed, and would
lead us into critiquing some aspects of the study with a view to its future replica-
tion (see section in Chapter 3, Routes to Replication: Reading with Targeted
Questions). From such unexpected outcomes (perhaps a lesser effect than antici-
pated or the influence of an unforeseen variable), there might well be a significant
discovery made which is a game-changer. We can only begin to get to a more con-
fident conclusion, however, by undertaking suitable replication studies of the effect
or intervention which might also average out any biases or errors in the original.
Of course, we might be equally attracted to studies where we feel exaggerated
claims for a cause–effect or a particular intervention are made. Much experimen-
tal research we read about shows a positive outcome, null hypotheses rejected, or
interventions shown to result in gains in learning. There is evidence that the “file
drawer problem” observed in social science research overall, which is a tendency
to publish positive results rather than negative or non-confirmatory outcomes,
is just as prevalent in AL research output. Research that presents negative or
unexpected outcomes is often seen to be of less interest and, therefore, may not
even reach the published page (see Chapter 8). It is worth noting that moves in
journals such as Language Learning to initiate (as from 2018) a Registered Reports
section may alleviate this.
Notes
1 Butler, Y.G. (2015). English language education among young learners in East Asia: A
review of current research (2004–2014). Language Teaching, 48.03, 303–342.
2 Chow, B.W.-Y., McBride-Chang, C., & Burgess, S. (2005). Phonological processing skills and
early reading abilities in Hong Kong Chinese kindergarteners learning to read English as
a second language. Journal of Educational Psychology, 97.1, 81–87.
3
PLANNING YOUR REPLICATION
RESEARCH PROJECT
Here is a “mind map” starting with four basic aspects of an experimental research study (e.g., Participants, Data-gathering, Data analysis). Brainstorm features of each and then form questions such as those in the examples below to establish possible routes for replication:

•• GENDER (To what objective?)
•• INDIVIDUAL INTERVIEW (What further information might be obtained?)
•• SELECTION PROCEDURES (Volunteers vs intact group?)
•• STATISTICAL vs PRACTICAL SIGNIFICANCE (How might reporting an effect size help?)
•• TIME INTERVALS (Might a delayed post-test provide more precise data?)
The Carter et al. (2004) paper, by being often cited as one of the first to attempt formal online
instruction of the laboratory report genre, may well reward closer attention and possible rep-
lication. In the present case, the authors have concluded their paper with
an interesting challenge. They wonder whether the treatment would be
as successful with “English L2 students, the primary focus of English for
Specific Purposes”.
The quasi-experimental study investigated the effect of genre-based
instruction on the writing of university students learning disciplinary-specific
writing. One aspect that makes it an initially attractive candidate for rep-
lication is that it comes with enough detail regarding method to permit
replication. Two groups were set up, one receiving a handout on preparing
lab reports plus specially designed online instruction software (LabWrite) and
the other (control) group receiving only the handout. Sample lab reports
We now take you through a closer, critical reading of this paper, this time explic-
itly to isolate and question a select number of features and variables that make
up the total research situation in the target study, and that potentially affect the
outcomes of the research and which, therefore, might warrant consideration in a
potential replication study.
Even good research designs cannot account for all of the many threats to
internal validity, and even one of these may have impacted outcomes sufficiently
to warrant adjustment in a subsequent replication. On the other hand, there is
the risk when carrying out a replication study that you may be repeating a study
that is so poorly designed or executed anyway that it would not merit the effort.
Again, your close reading of the text will help you to choose wisely!
All this means that, as we ask these questions with an eye to the usefulness of
a potential replication study, we are also indulging in research critique. The more
questions we can see answered in the study as it is described, the greater appar-
ent soundness that study would seem to have with a view to our replicating it.
Rather more common, however, would be for us to find unanswered questions
arising from our reading which require more details to be obtained – perhaps by
contacting the original authors or accessing a web repository or project website.
Nevertheless, if you find yourself questioning most of the procedures, methods,
and analytical process, this may be a sign of a study which currently lacks suf-
ficient detail to be considered further as a possible replication.
The final practice activity in this chapter (pp. 41–46) will see you apply these
targeted questions to a complete paper. This concluding stage in selection should
help focus our thinking down to what a replication of the paper might look like
and what this might reveal as a result of modifying the factor or variable in question.
intervention treatment to have worked with the experimental group. It might, for
example, be interesting to see how far previous experience, in the shape of stu-
dents’ knowledge of the genre prior to instruction, might influence the outcome.
You might then annotate that section of the text with the targeted question(s)
arising (e.g., “I wonder if a pre-test of previous genre knowledge would provide some more
useful information on the true extent of any treatment success?”). A further question for
a replication might be whether the LabWrite treatment would be as successful
with participants using a similar genre but of another sciences major (i.e., other
than biology). The authors make no claims for the efficacy of the treatment
with L2 students, and it would clearly be interesting to see how useful it would
turn out to be with this audience. In such circumstances, we might assume the
LabWrite treatment might need simplifying or adjusting to the L2 proficiency
level of the students to meet the needs of such students in some way. An adapted
form of LabWrite tested in a similar way could be an interesting replication.
Longer studies require closer control and/or monitoring as they become more susceptible to the
kind of history factors mentioned above. Conversely, you might address the fact
that a treatment or intervention needs time to obtain definitive results – successful
or otherwise. In this sense we would be interested in how far the research design
included suitable post-test measures or events (see page 35).
The study tells us only that the treatment continued in the spring semester
without specifying the number of hours or frequency of classes received by either
group. A replication would probably want to specify and perhaps test this time
period as a new variable. Elsewhere we are told that the control group was “intro-
duced to writing lab reports on the first day of the lab class” but there appears to
be little or no supervised/taught revision beyond this day. The treatment group,
we understand, received this “integrated into” the course, which may indicate
rather more constant, and guided, reference to the materials. This difference in
access and use of the materials offered may also have affected outcomes.
In the case which concerns us, we read that care was clearly taken to make sure
the same professors taught both groups, “Control and treatment students were
taught by the same professor in the lecture sections and the same 2 instructors
in the labs (each instructor having 2 sections), with identical course syllabi, labs,
and assigned lab reports”, and the setting would seem to have been familiar – the
normal classroom or lab participants were used to. The authors themselves, how-
ever, make the point in their Discussion/Limitations section that “the in-class
instructors were ‘insiders’ of the labs” but insist that this would not have affected
results as it was the website which provided the instruction throughout. A rep-
lication of the study might want to see how far similar results are obtained with
non-specialist instructors (particularly in an ESP (English for specific purposes)
or EAP (English for academic purposes) situation) even if these are assigned an
apparently secondary role which is mainly “to encourage the use of the website
through assignments”.
We might suggest the treatment group received a novel, cohesive, and far more
continuous kind of instruction (“integrated into the laboratory activities of the
Spring semester course . . . LabWrite Post-Lab also gives students access to a
step-by-step guide for writing each section of the report.”). Using an L2 cohort,
a further test of the success of the treatment might be made by having the treat-
ment and control groups access any guidance through the same medium.
3.2.1.10 Measurement
Outcomes are likely to be affected by the way individual or group responses are
scored or assessed. We need to read about how responses were then weighted
as scores to fully understand their significance on the results. This can help us
decide how far, say, a Likert scale response over-limited the range of responses
failed, while it actually was successful) do happen, and it is worth replicating even
negative results: they are certainly of no less scientific
worth than positive outcomes. Indeed, it is possible the original researcher made
some mistake in the original work, and you might then be able to “correct” that
mistake and go on to make entirely new discoveries.
b) While attention to statistical significance is an initial point of reference, you
will also read studies which describe outcomes in terms of statistical significance
alone – without actually describing the strength or magnitude of the result. One
of the problems of “Null Hypothesis Significance Testing” mentioned above is
that it encourages dichotomous thinking: either a result is statistically significant,
or it isn’t. But this is only half the story: a more informative and better contribu-
tion to progress of knowledge is made if we then discover how big the effect is
or how strong the relationship is.
Initial point of reference apart, your focus in most instances needs to be on
this effect size. Looking at p values alone will not be enough to suggest a suitable
direction for possible replication. More insidiously, relying on p values alone as the
outcome of a replication study will lead us to misinterpret whether an original study was replicated. By
focusing on the effect size, we can begin more correctly to think about the degree
of replicability obtained.
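To make the dichotomy concrete, here is a minimal sketch of ours – the numbers are hypothetical, not data from any study discussed in this book – of why p values alone can mislead: the same group difference, and the same Cohen’s d, comes out “non-significant” with 15 participants per group but “significant” with 60. It assumes Python with the scipy library installed.

# Minimal sketch: p depends on sample size, Cohen's d does not.
# All numbers are hypothetical. Requires scipy (pip install scipy).
from math import sqrt
from scipy.stats import ttest_ind_from_stats

mean_treat, sd_treat = 72.0, 10.0   # hypothetical treatment-group post-test scores
mean_ctrl, sd_ctrl = 67.0, 10.0     # hypothetical control-group post-test scores

for n in (15, 60):                  # participants per group
    # Pooled-SD Cohen's d: unchanged whatever the sample size
    sd_pooled = sqrt(((n - 1) * sd_treat**2 + (n - 1) * sd_ctrl**2) / (2 * n - 2))
    d = (mean_treat - mean_ctrl) / sd_pooled
    # Independent-samples t-test computed from summary statistics
    t, p = ttest_ind_from_stats(mean_treat, sd_treat, n, mean_ctrl, sd_ctrl, n)
    print(f"n = {n} per group: d = {d:.2f}, p = {p:.4f}")
# n = 15 gives d = 0.50, p ~ .18 (not significant); n = 60 gives d = 0.50, p ~ .007.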
An effect size measurement (e.g., Cohen’s d), together with a confidence
interval or margin of error, is particularly useful when we compare replications of
a study as, for example, we would be able to summarize a group of experiments
or replications that used the same independent and dependent variable and then
directly compare the effects across the studies. These results can then serve as a
starting point for additional replication or follow-up studies that further describe
the psychological processes that underlie an effect and/or tell us more about the
limits of those conditions. Small effect sizes might be down to weaknesses in
the way a study was set up or data collected or analysed: a replication study may
address these apparent flaws. Should an effect size measurement not have been
reported, there are a number of web-based calculators which can be used given
the minimum descriptive statistics (e.g., www.uccs.edu/~lbecker/;
www.polyu.edu.hk/mm/effectsizefaqs/calculator/calculator.html; or
www.psychometrica.de/effect_size.html). A subsequent replication can be set up to further test
the strength or weakness of that effect. Knowing the expected size of an effect
is important when planning a replication aiming to further enhance the strength
of the effect or relationship in the original study. An effect size calculated from
a large sample, for example, is likely to be considerably more accurate than that
from a small sample. An effect size’s confidence intervals can also provide useful
information about the reliability or robustness of an effect size (see Cumming &
Calin-Jageman, 2017).5
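For readers who prefer to see the arithmetic behind the web-based calculators mentioned above, here is a minimal sketch in plain Python of Cohen’s d computed from the minimum descriptive statistics (means, SDs, and group sizes), with an approximate 95% confidence interval based on the standard large-sample variance of d. The descriptive statistics in the example are hypothetical, not taken from any target study.

# Minimal sketch: Cohen's d with an approximate 95% CI from descriptive statistics.
from math import sqrt

def cohens_d_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Return (d, ci_lower, ci_upper) for two independent groups."""
    # Pooled standard deviation across the two groups
    sd_pooled = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / sd_pooled
    # Large-sample approximation to the standard error of d
    se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

# Hypothetical post-test descriptives for a treatment and a control group
d, lo, hi = cohens_d_ci(mean1=78.2, sd1=9.5, n1=16, mean2=70.4, sd2=11.0, n2=16)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")   # d = 0.76, 95% CI [0.04, 1.48]

Note how wide the interval is with only 16 participants per group: exactly the point made above about effect sizes calculated from small samples.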
A principal advantage of comparing effect sizes across original experiments
and their replications is that the different size estimates from each study can then
be combined to give an overall best estimate of the effect size. A number of such
experiments and resulting effect sizes can then be synthesized into a single effect
size estimate in a “meta-analysis”. Even more useful would be for us to examine,
in a series of replications and original studies, any differences between those with
large and small effect sizes and attempt to explain why there might have been
such a difference reported. This process is known as “moderator analysis” (see
Oswald & Plonsky, 2010; Plonsky & Oswald, 2015).6
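As a minimal sketch of the first step of such a synthesis, the fragment below (with hypothetical effect sizes, not drawn from the studies discussed here) combines the d values from an original study and two replications into a single inverse-variance-weighted estimate – the simplest fixed-effect model; a full meta-analysis or moderator analysis would build on this.

# Minimal fixed-effect synthesis sketch: pool (d, SE) pairs from an original
# study and its replications into one weighted estimate. Values hypothetical.
from math import sqrt

def pooled_effect(studies):
    """studies: list of (d, se) pairs; returns (pooled_d, pooled_se)."""
    weights = [1 / se**2 for _, se in studies]   # inverse-variance weights
    pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))           # SE of the weighted mean
    return pooled_d, pooled_se

studies = [(0.76, 0.37), (0.41, 0.25), (0.58, 0.30)]   # (d, SE) per study
d, se = pooled_effect(studies)
print(f"pooled d = {d:.2f}, 95% CI [{d - 1.96*se:.2f}, {d + 1.96*se:.2f}]")
# pooled d = 0.54, 95% CI [0.21, 0.87]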
In the present case, we are presented (on p. 406) with a treatment group which
outperformed the control group (“significantly more effective”) at a probability
level of p < .0001. While inter-rater reliability is calculated and is acceptably high,
this first hypothesis outcome has no reported effect size. Consequently, we only
see part of the story here: an indication of effect from the treatment, but no
information on the size of that effect. A subsequent replication should seek to
remedy this for the reasons stated above. A number of such replications would
have the advantage of allowing us to summarize a series of experiments all with
the same independent variable (the LabWrite treatment) and directly compare
effects across these studies, regardless of the numbers of participants involved –
invaluable knowledge when we are testing the effectiveness of a potentially inno-
vative intervention such as this.
c) Science advances by our asking questions and querying others’ explanations.
We assess the explanations given by examining the evidence put before us, com-
paring it, identifying weak reasoning, highlighting claims that seem to go against
the evidence, and/or suggesting alternative explanations for similar outcomes.
A crucial ability in reading any study – but particularly one which we are
sounding out for possible replication – is that of critically addressing the outcomes
as concluded by the authors. One of the key aims of research is to search out
and discover, but not to prove. Therefore, we can expect the author of the study
to “examine, interpret and qualify the results – as well as draw inferences from
them” (APA Publication Manual, p. 26)7 in a particular way – but by definition,
there will be others.
Furthermore, while we may agree with the results themselves, we may want
to disagree with what those results mean. If a treatment is claimed to work, are
further replications needed to confirm the extent of that finding? For example,
a researcher may suggest results were not affected by an intervening variable
identified post facto but you may feel there is evidence they were. A suitably
designed replication study might be useful to sort the discrepancy out. The
researcher might also indicate in a “Limitations” section what was perceived
to have affected results, and through these thoughts we can map out a use-
ful replication study. We might also feel the kind of generalization (see pages
40–41) made here would be useful for the field, but cannot be justified given
such aspects of procedure as the participant selection, group size, or pre- and
post-test measures. Again, if we felt the contribution of the study merited it, a
time devoted to learning and, of course, other L2s. So many variables are involved
that the needful path of generalizing from a study will inevitably require a slower
pace and detailed approach to replication – and a number of replications at that.
As we saw in the previous chapter, with replications, we are best to talk of degrees
of replication across studies; in such a slow accretion of results each replication is
designed to provide a little more knowledge about the extent of effectiveness –
and the generalizability thereof – found in the original work. As Plonsky (2012,
p. 117) points out “For this reason, we must accept and even embrace a more
incremental and . . . patience-testing pace towards cumulative knowledge”.8
In the Carter et al. paper we have already mentioned the possibilities of repli-
cating the study with L2 participants without previous genre knowledge, but the
apparent usefulness of LabWrite (suitably adapted) shown here with students in
biology labs needs to be further tested with other science subject classes.
Focus, first, on the “Abstract” and carry out the same initial awareness-raising
reading strategy you saw in 3.2. Routes to Replication: Reading for Awareness-
Raising (p.28) above using the space around the text for your annotations. Once
you have completed your own reactions, compare these with our suggestions
in Figure 3.2 overleaf. You are not looking for things that have been omitted or
should be in the abstract itself, but rather things you intend to go on and find or
think about when reading the body of the paper.
Focus now on the “Introduction” (Aims), “The study” (Aims), and the
following Research questions/Hypotheses and carry out the same initial
awareness-raising reading strategy you saw in 3.2 Routes to Replication:
Reading for Awareness-Raising (p.28) using the space around the text
for your annotations. Once you have done so, take a look below at some
of our own thoughts as we were reading these sections.
FIGURE 3.2 Practice awareness-raising: the abstract (Bitchener & Knoch, 2010), shown with sample marginal annotations.

Abstract: “This article presents the findings of a study that investigated (1) the extent to which written corrective feedback (CF) can help advanced L2 learners, who already demonstrate a high level of accuracy in two functional uses of the English article system (the use of ‘a’ for first mention and ‘the’ for subsequent or anaphoric mentions), further increase that level of accuracy; and (2) the extent to which there may be a differential effect for different types of feedback on any observed improvement. Sixty-three advanced L2 learners at a university in the USA formed a control group and three treatment groups: (1) those who received written meta-linguistic explanation; (2) indirect circling of errors; and (3) written meta-linguistic feedback and oral form-focused instruction. On three occasions (pre-test, immediate post-test, delayed post-test) the participants were asked to describe what was happening in a picture of a different social setting. Significant differences were found in the level of accuracy on (1) the immediate post-test piece of writing between the control group and all three treatment groups; and (2) on the delayed post-test piece between the control and indirect groups and the two direct treatment groups.”

Sample marginal annotations:
•• “high level of accuracy” = ? Was a reliable pre-test (and statistic?) given to establish this? If not, can we add one?
•• How was this defined? Could it help other levels I wonder . . .?
•• What subjects were they studying? Might it have influenced outcomes?
•• Similar outcomes in high school context??
•• What other types might be tried? E.g., circling AND marginal indications of error nature?
•• Defined as? Effect size? Check out if selection procedures can be improved on?
•• How long after the treatment? Would further post-tests establish more effectiveness?
Now focus on the section entitled “The study: Context and participants” and
“Target structures” and carry out the same initial awareness-raising reading
strategy you saw in 3.2. Routes to Replication: Reading for Awareness-Raising
(p.28) above using the space around the text for your annotations. Then
use the targeted question sections Participant Characteristics and Sample Size
above to help guide your responses. Once you have completed your own
reactions, compare your answers with our suggestions below.
•• We read that the participants come from the ESL department of a large
US university (211). Would EFL (English as a foreign language) students
yield similar results? Other ESL participants in the US could be a useful
target in a replication, as could ESL participants in another country.
•• The participants are claimed to have had “a high degree of motivation”
(211). How reliably was this measured? Might this be improved upon
in a replication? Would the treatment work equally with less-motivated
participants?
•• In “Target structures” we read about the choice of functions as being
because: “L2 writers across English language proficiency levels (including
those at an advanced level of proficiency who demonstrate a reasonably
high level of accuracy in the use of the targeted form) continue to make
errors in the use of the English article system . . .”. A case for replication
with lower-level students? They arguably have a greater need for correc-
tion and may benefit more from the “treatment”.
•• Furthermore, “L2 writers across English language proficiency levels” had
us wondering whether all L2 writers demonstrated this, or whether it
might also depend on the individual participant.
•• Does the literature suggest other errors appear across proficiency levels
that might be worth testing in any replication?
•• Groups 1 and 3 both received “direct written CF”. Even with a “simple
explanation”, this kind of feedback will require deeper processing than that
offered to the other groups. Is this form not better suited to the more advanced
student, then? Would the feedback need to be further simplified in any
replication with lower-level students . . . or maybe presented in the L1?
•• A replication could try out other ways of presenting the written feedback,
such as marginal indications of the number of errors (but no indication
of where in the line) and/or rough indications above the offending area
of where the error may lie.
•• The CF obtained is all written. Might a replication usefully compare oral
feedback in at least one group? How might this work?
Now focus on the section entitled “The study: Procedure” and “Analysis”
and carry out the same initial awareness-raising reading strategy you saw
in 3.2 Routes to Replication: Reading for Awareness-Raising (p.28) above
using the space around the text for your annotations. Use also the targeted
question sections Specific Task Variables, Length of Treatment and Instructions
and Cues above to help guide your responses. Once you have completed
your own reactions, compare your answers with our suggestions below.
Now focus on the section entitled “The study: Results” and carry out the
same initial awareness-raising reading strategy you saw in 3.2 Routes
to Replication: Reading for Awareness-Raising (p.28) above using the
space around the text for your annotations. Use also the targeted ques-
tion section Statistical Significance and Effect Size above to help guide
your responses. Once you have completed your own reactions, compare
your answers with our suggestions below.
Finally, focus on the section entitled “The study: Discussion and conclusions”
and carry out the same initial awareness-raising reading strategy you saw
in 3.2 Routes to Replication: Reading for Awareness-Raising (p.28) above
using the space around the text for your annotations. Use also the targeted
question section Does the Generalizability of the Study Need to be Further
Tested? above to help guide your responses. Once you have completed your
own reactions, compare your answers with our suggestions below.
Notes
1 Carter, M., Ferzli, M., & Wiebe, E. (2004). Teaching genre to English first-language adults:
A study of the laboratory report. Research in the Teaching of English, 38.4, 395–419.
2 Plonsky, L., & Derrick, D.J. (2016). A meta-analysis of reliability coefficients in second
language research. Modern Language Journal, 100, 538–553.
3 Bitchener, J., & Knoch, U. (2010). The contribution of written corrective feedback to
language development: A ten month investigation. Applied Linguistics, 31.2, 193–214.
4 Norris, J.M. (2015). Statistical significance testing in second language research: Basic
problems and suggestions for reform. Language Learning, 65 (Supp. 1), 97–126.
Plonsky, L. (2015). Statistical power, p values, descriptive statistics, and effect sizes: A
“back-to-basics” approach to advancing quantitative methods in L2 research. In
L. Plonsky (Ed.), Advancing Quantitative Methods in Second Language Research (pp. 23–45).
New York: Routledge.
5 Cumming, G., & Calin-Jageman, R. (2017). Introduction to the New Statistics: Estimation,
Open Science and Beyond. New York: Routledge.
6 Oswald, F.L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices
and challenges. Annual Review of Applied Linguistics, 30, 85–110.
Plonsky, L., & Oswald, F.L. (2015). Meta-analyzing second language research. In
L. Plonsky (Ed.), Advancing Quantitative Methods in Second Language Research (pp. 106–128).
New York: Routledge.
7 American Psychological Association (2010). Publication Manual of the American Psychological
Association (6th Edition). Washington, DC: American Psychological Association.
8 Plonsky, L. (2012). Replication, meta-analysis, and generalizability. In G.K. Porte
(Ed.), Replication Research in Applied Linguistics (pp. 116–132). Cambridge: Cambridge
University Press.
4
WHAT KIND OF REPLICATION
SHOULD YOU DO?
From the Inside, Looking Out: Initial Critique
and Internal Replication
There are many mistakes we can make while carrying out our research, most of them unforeseen, unwanted, and uninten-
tional. While the recent furor in the press indicates that there can sometimes be a
worryingly thin line in published empirical research between genuine error and
a desire to deceive, we might see internal replication as a first line of scrutiny
(or defense) to both find – and correct – errors. Without such safeguards, we
can simply never be confident in the outcomes presented; with any of the many
approaches to replication described here and elsewhere taken up as the norm, we
might begin to see practices change.
Unlike procedures in the hard sciences, where later scrutiny typically comes
from the assumption that external replication will further confirm or disconfirm
findings (see Introduction), much of the focus in the social sciences has been on
aspects such as the correctness and appropriateness of the statistical methods used,
the selection procedures, and/or the parameters or operational definitions used by
the author. As a result, however, we will often be perusing a number of studies
dealing with similar aspects of an area of AL but using very different assumptions,
criteria, and statistical procedures to present their outcomes. Even more reason,
then, why each study needs to be judged carefully on its own merits at the same
time as we think about its contribution to the wider field of knowledge in that area.
And there we have what may prove an initial stumbling block: can we get our
hands on all the data needed to reconstruct the study – in particular, the method
and results (including the raw data)? As we have seen in the previous chapter, this
may well mean looking in online repositories (such as IRIS, see below), the sup-
plementary online materials published by journals, on research project webpages,
or soliciting help from the original authors.
Routine checking sees you first reading through the paper to see whether
results can be reproduced using what is available for examination in the article.
Once again, we need to adopt our close-reading-and-commenting technique
presented in the previous chapter to satisfy ourselves about any problems present
in the data available – aspects such as participant selection, possible errors in data
coding and/or transcription, or in the kind of statistical approach employed in the
light of the assumptions required for such tests.
Ideally, we would like to see an author improve “reproducibility” in his or
her work by presenting enough information and data about the study for the
reader to understand and evaluate the work, prior to the more detailed discussion
of what has been observed in the Results and Conclusions section of the paper.
While there are inevitable restrictions on space in journals, we should still expect
to see enough of the results to satisfactorily respond to the hypotheses or research
questions initially put forward. The appraisal we make here needs to pay special
attention to any tables or figures in which outcomes are presented.
Ask yourself whether you see any anomalies or potential confusion in the
graphical presentation of outcomes that need to be further clarified.
Does the information coincide with what you have read in the text? Is there any
information which appears to be missing and needs to be obtained to make an
adequate appraisal? Are there any unaccounted-for outliers? Do the quantitative
values presented seem reasonable?
Take a good look at the columns and rows in the tables of results. Most likely,
these will have been labelled with the variables, group designations, and scores
that apply to the paper. If we are to analyze outcomes ourselves, we will need to
be sure these correspond to terms that have been sufficiently well explained or
defined in the text itself. Thus, for example, are the “scores” assigned raw data
or computed “per x words”? Are independent and dependent variable labels
satisfactorily operationalized? Do you spot any apparent anomalies in any results,
and are these taken up in the discussion section? And, while most statistical
analysis is carried out with dedicated software these days, it might be as well to
check the math as far as possible!
The statistical operations carried out on the data will usually be presented,
and we will want to check that enough is available to permit adequate checking
to be carried out. Descriptive data will often be a first port of call: we would,
for example, look to see the basics: how many participants were involved, how
they were distributed into groups and in what numbers, what scores were obtained
by whom, and which measure of central tendency (i.e., mean, median, or mode)
and dispersion (SD, standard error, confidence intervals) had been used (out-
comes can be greatly altered by the choice!). The mean is by far the most commonly
chosen in studies manipulating interval/continuous scored data; however, if our routine
checking is to be prior to any further replication of the paper (see below), we
need to remember the measure is highly sensitive to extreme scores (outliers).
In a relatively small group, one or two such scores could shift measures such as
the mean and the SD, and it would be interesting to find out from the original
researcher what was done with such cases. The SD might be equally worthy of
attention as part of our routine checking: larger than normal dispersal of scores
in a small group will attract our attention – particularly if we are considering the
virtues of a subsequent replication involving more participants.
Much has improved in recent years following the calls for more replication
studies in our field, and publishing guidelines from a number of professional
organizations2 and journals themselves now clarify the requirements for data
availability and online data storage, designed to help supplement what is pre-
sented on the space-limited written page. Having said this, at the time of writing,
many journals recommend only the uploading of data collection materials to
databases such as IRIS (www.iris-database.org). There is as yet no obligation to
do so, and while this remains the case you may well still need to depend on the
willingness of the target study author(s) to fill in the gaps. That, in turn, will mean
first approaching the journal’s website to see where such data is provided. If the
journal’s main concerns do not lie in posterior scrutiny of, and debate on, their
papers – and for many this is not a priority – there will be a disincentive for that
journal to publish corrections, a disincentive for outsiders such as you to search
for and suggest corrections, and – perhaps more insidiously – a disincentive for
researchers to be careful when they carry out and write up their work.
Typically, therefore, you will need to prepare yourself for initial lack of preci-
sion and/or more of the space-induced lack of detail mentioned above than you
might hope for. As you read, you will need to be able to “spot” where some-
thing may need further clarification as a result of routine checking. You might
also need to be prepared to change your original choice of target study if faced
with the impossibility of obtaining the kind of “missing” data, methodology, or
procedures needed to carry out your prospective replication.
Read this extract from a study on which you have decided to carry out a
routine check. At certain points (marked ^) we suggest that some impor-
tant data/information may be lacking. This may need to be sourced
online or from the author. Decide what data this might be and write in
the columns on either side of the text.
Research Questions
Does practice in dictation in class improve participants’ knowledge of L2 French
structures, vocabulary, listening, and reading comprehension?
Participants
Forty-four students of L2 French studying in their first year of French conver-
sation were put into two groups ^, the experimental and the control group.
Groups had the same numbers of participants each and were matched on
a number of parameters ^. Participants consisted of males and females ^
varying in age, with a mean age of 18.6 years ^. At the same time as the
experiment was being carried out, participants were undertaking their other
university courses, but none in L2 French ^.
Procedure
The conversation class took place three times a week ^ over a period of 12
weeks. The classes consisted of instruction in a number of elements of French
pronunciation ^, delivered in the classroom using both researchers ^. Both
groups followed this same syllabus for the period and used the same material ^.
The researchers taught the two groups in turns throughout the experimental
period ^. The difference between the two groups came from the fact that the
experimental group received 60 practice dictations in their classes spread through-
out the experimental period ^. These dictations were presented in supplementary
material from the books used which were only available to the researchers and
increased in difficulty as time went on ^. The group heard the dictations from
audio media at the front of the classroom ^ and followed the instructions pre-
sented from the teachers’ version of the text book ^.
All the dictations were collected at the end of each class and scored by
one of the RAs (research assistants) of the researchers who had been trained
in this scoring before the experiment took place ^ and missing and misspelt
words were indicated in red ink. Texts were then returned to the participants
and they were allowed to review the marking and ask questions before pro-
ceeding to the rest of that class. At no time were participants told the main
objectives of the experiment.
Testing Data
Pre- and post-test data was obtained by using the Cercle Français official
test which covers all four language learning skills (testing materials, timing
requirements, and other criteria can be obtained from the authors) ^.
At the end of the experimental period the participants were all also admin-
istered three dictations. The latter were specially designed by the researchers
and adapted from L2 French readers with which the participants were not
familiar. The dictations were then graded. ^
Results
Table 4.1 shows the descriptive data for the two groups and skills subsets
in the pre-test. While some discrepancies between the two groups can be
observed (the experimental group seems better in grammar and vocabulary
than the control and the control better in listening and reading), in general
it was felt that the two groups were sufficiently similar to proceed with the
experiment, as statistical analysis showed none of the score differences (mean
or totals) to be significant, ^ and it was therefore felt these results
supported the matched selection process mentioned above.
Table 4.2 shows the descriptive data for the two groups and skills subsets
in the post-test. There are similar differences between the groups, with the
experimental group better in grammar (but not vocabulary) and the control
group better in listening and reading, but none of the results were seen to
be statistically significant ^.
TABLE 4.2 Descriptive statistics: post-test

                   Experimental            Control
                   Mean       SD           Mean       SD
Listening          55.66      19.15        57.96      19.43
Grammar            71.53      20.95        70.04      22.86
Vocabulary         64.13      22.88        67.69      23.98
Reading            56.84      23.09        60.78      24.44
Dictation 1        55.16      18.07        56.78      18.08
Dictation 2        20.98 ^     7.75        20.43 ^     7.44
Dictation 3        60.34      19.31        57.72      22.31
Total dictation    45.49      15.04        44.97      15.94
TOTAL              62.04      21.51        64.11      22.67
It is clear, therefore, that the extra dictation practice with the experimen-
tal group did not improve test performance in any significant way.
Conclusions
The use and practice of dictation was not shown to improve the language
skills of the experimental group participants. A number of possible explana-
tions can be put forward for this. First, it may be that the instruction ^ had
no effect on the participants’ abilities or they did not have enough time to
exercise their skills. Second, it may be that the failure to show improvement
could be due to the fact that participants did not understand a number
of the words in the dictations. Post-instruction interviews ^ with randomly
selected participants from the experimental group revealed that little extra
French language reading was being done outside the classroom and this,
too, was not seen as conducive to improvement.
Descriptive statistics tell us only part of the picture: group characteristics will
vary across many research contexts, and conclusions drawn from descriptive data
can be informative but not conclusive. We would need to carry out many
replications in these contexts and with different participants to come close to
checking the validity of the outcomes. Inferential statistical procedures are
more often called upon as they provide more insight into what has happened and
potentially allow us to generalize to other contexts.
There is not space enough here to discuss all the current inferential statistical
procedures you may come across in your reading of papers, and whose assump-
tions you will need to check. Rather, we present below a general discussion of
the kind of things your routine checking phase should cover, with particular
regard to the immediate job at hand – checking what has been reported as far as
we are able to – and establishing where replication of the study might help clarify
outcomes or move things forward.3
We would initially want to make sure the various statistical operations carried
out on the data are made clear – particularly assumptions such as normality of
distribution and the selection procedures. Any subsequent replications of the paper will
need to take apparent omissions into account (how far, for example, can we accept
the wish to generalize from these results if these assumptions have not been met?).
As we mentioned in the previous chapter, our initial attention can be drawn
to the statistical significance level. There are a number of specific discussions
already mentioned which are worth reading about the perceived importance of
significance level testing,4 but for our present “replication objectives”, a number
of general points should be borne in mind.
First, as we digest the outcomes, remember that the significance level chosen
by the researcher is unlikely to have been selected through any reasoned argument
based on previous, similar research. Depending on the cut-off point previously
adopted (α level), we could find ourselves presented with results which rest too
heavily on the rejection of the null hypothesis. And then, depending on the replica-
tion interest generated by the study, there might be a case to be argued at the very
least for strengthening the original significance level assigned to the data outcomes.
Next, you will need to bear in mind that the significance statistic you read
will relate closely to the sample size. While large numbers of participants might
be a positive sign as regards the external validity of the sample, a statistically
significant outcome can be obtained with even a small effect if the sample is
large enough. We have also noted above the tendency for journals to publish papers
wherein null hypotheses are rejected, but failure to do so is arguably even more
interesting from the point of view of the emergent replicator. Along these lines,
the number of statistical tests should also be considered in light of statistically
significant findings. It is not uncommon, for example, to find literally dozens of
statistical tests run for a single study. This is not a prudent approach as doing so
greatly increases the chance of Type I errors.
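To see why, consider the family-wise error rate. The short Python sketch below – with an illustrative, hypothetical number of tests, and assuming the tests are independent – shows how quickly the chance of at least one false positive grows:

```python
# Family-wise Type I error: the probability of at least one false positive
# across k independent tests, each run at alpha = .05.
# (Illustrative values; real studies rarely have fully independent tests.)
alpha, k = 0.05, 20

familywise = 1 - (1 - alpha) ** k
print(f"Chance of at least one spurious 'finding': {familywise:.2f}")  # ~0.64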
We have warned in the previous chapter, however, about reading too much
into, or beyond, the significance level presented. The outcome does not
present us with the chances of the results being replicated – however “sta-
tistically significant” they may be. As Nassaji (2012) points out, part of the
misunderstanding can come from a misinterpretation of what significance test-
ing can show us – and what it cannot. Significance testing “only” tells us about
the probability of the outcome, assuming the null hypothesis is correct. It does
not tell us “the probability that a certain mean difference . . . can be repeated”
(Nassaji, 2012, p. 101, our emphasis). Nassaji also cites work by Cumming
(2008)5 in which the latter – through a simulated set of repetitions of experi-
ments – found that the p value varied so wildly that it is clear this value provides
no proven indicator of generalizability, let alone replicability.
As a way of strengthening your own internal replication and, by extension,
the external replication that will be carried out, the bulk of your attention should
be on the effect sizes which express the strength or magnitude of an effect or
relationship (see below). The effect size indices most common in L2 research
are Cohen’s d, r (the correlation coefficient), r2, R2, eta2, and the odds ratio (for
categorical data). Although the reporting of effect sizes has increased dramatically
in the last 15 or so years (Plonsky, 2014),6 it is still commonplace for them to be
omitted. In many such cases, you can calculate the effect size yourself. The for-
mulas are quite straightforward and involve only the most basic arithmetic based
on descriptive statistics such as means, SDs, and sums of squares (in the case of
ANOVA). An understanding of the effect sizes will assist in arriving at a clear and
comprehensive understanding of the initial study in question, will help make sense
of your own results, and will also provide a basis for determining an appropriate
sample size for the replication (via a priori power analysis, see Larson-Hall,
2016 for more information). In Null Hypothesis Significance Testing (commonly
NHST) (see Cumming & Calin-Jageman, 2017),7 sufficient statistical power is
required to help ensure “real differences are found and discoveries are not lost”
(Larson-Hall, 2016, p. 156). Statistical power can be calculated relatively eas-
ily with online calculators (e.g., www.danielsoper.com/statcalc/) which usually
require you to provide information about (a) the statistical test you will use (e.g.,
multiple regression, ANOVA), (b) the alpha level or statistical significance level
you will use (e.g., 0.05), (c) the desired effect size, and/or (d) the desired statisti-
cal power level (usually greater than or equal to 0.80). See Larson-Hall (2016) for
more information.
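As a minimal sketch of both steps – with made-up descriptive statistics standing in for a target study's reported values – Cohen's d can be recovered from group means and SDs and then fed into an a priori power analysis (here using the statsmodels library):

```python
import math
from statsmodels.stats.power import TTestIndPower

# Hypothetical reported descriptives for two groups (not from any real study).
m1, sd1, n1 = 71.5, 12.0, 30   # treatment group
m2, sd2, n2 = 64.2, 13.5, 30   # control group

# Cohen's d: mean difference divided by the pooled standard deviation.
pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m1 - m2) / pooled_sd
print(f"d = {d:.2f}")

# A priori power analysis: participants per group needed to detect this
# effect in a two-group design with alpha = .05 and power = .80.
n_required = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"n per group = {math.ceil(n_required)}")
```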
Finally, we return to a routine check on the assumptions for the statistical
procedures used, and will use some of the more popular procedures to be found
in AL to illustrate the points.
In a correlation between two variables, for example, the researcher will be looking for some kind of positive or
negative relationship: it might be that there is a (positive) relationship observed
between being able to speak and write a foreign language such that the better you
are at speaking, the better you are also at writing that language. It would not be
correct to go on to conclude that the two variables have some causal link, that
better L2 speaking somehow brings about better L2 writing – although subse-
quent research may indeed try to confirm such a possibility.
Again, as part of this routine check, we would want to check that the assump-
tions applied to this procedure were accounted for satisfactorily. We would
expect there to have been examination for normal distribution and homo-
scedasticity (i.e., having equal statistical variances) of data: more often than not
this is assumed rather than established although most statistical programs will esti-
mate it satisfactorily. Ideally, there will have been some examination of the data
visually to establish that the data is related linearly – the data relationship once
plotted should approximate to a straight line. Furthermore, the data or scores
from one participant should not impact those of another – some apparent effort
to have randomized data-gathering and group assignment (i.e., to control
or experimental) will be reported.
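If the raw data are available (from a repository or the author), these checks are quick to run yourself. A sketch, using simulated speaking/writing scores purely as stand-ins, with scipy and matplotlib:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated stand-in data: speaking and writing scores for 40 learners.
rng = np.random.default_rng(1)
speaking = rng.normal(60, 10, 40)
writing = 0.7 * speaking + rng.normal(0, 8, 40)

# Normality check on each variable (Shapiro-Wilk).
print(stats.shapiro(speaking))
print(stats.shapiro(writing))

# Visual linearity check: the scatter should approximate a straight line.
plt.scatter(speaking, writing)
plt.xlabel("L2 speaking score")
plt.ylabel("L2 writing score")
plt.show()

r, p = stats.pearsonr(speaking, writing)
print(f"r = {r:.2f}, p = {p:.3f}")
```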
Replications might equally derive from our checking of the SD presented,
since the bunching of high or low scores might indicate how far a parametric
correlation procedure was warranted. Similarly, we would want to review the
logic behind the correlation tested: if, for example, we were presented with an
experiment correlating L2 listening comprehension with age, we would want to
make sure that the participants’ age range was indeed enough to permit a wide
enough set of data to begin with. (A restricted range in one or both variables
being correlated can attenuate or reduce the observed correlation between the
two variables.) If not, a replication might want to widen this further.
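The attenuation is easy to demonstrate by simulation. In this hypothetical sketch, a clear age–listening relationship built into the full 18–53 sample weakens sharply once the sample is restricted to 18–22-year-olds:

```python
import numpy as np
from scipy import stats

# Simulate a wide age range with a built-in negative age effect on listening.
rng = np.random.default_rng(42)
age = rng.uniform(18, 53, 200)
listening = 90 - 0.6 * age + rng.normal(0, 6, 200)

r_full, _ = stats.pearsonr(age, listening)

# Now restrict the range to 18-22-year-olds and recompute.
mask = age <= 22
r_restricted, _ = stats.pearsonr(age[mask], listening[mask])

print(f"full range r = {r_full:.2f}, restricted r = {r_restricted:.2f}")
```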
We would expect the researcher to identify the specific test used, not least
because of the different assumptions which may apply to each. Non-parametric
tests such as the Wilcoxon or Mann–Whitney U are available to the researcher
who finds the normal distribution of data is not confirmed. The second assump-
tion, of equal or similar variances (SD²) or the homogeneity of variance, is
one which is often assumed rather than checked in much of what you might
read: in our routine checking, it might just be a matter of logic, too. If, for
example, a paper is comparing the error marking of two groups of L2 English
teachers, native as against non-native, we might feel the former were less likely
to present the kind of variance which the latter group might demonstrate. Once
again, however, a replication – if it is to be carried out or recommended – can
use additional estimating procedures to account for such inequality. Third,
we would want to see the data measured as interval or score measurements.
Any conversion of data from nominal non-continuous to continuous score
data (such as frequencies to percentage scores or raw frequencies to rates) in
an effort to accommodate the t-test would need to be satisfactorily explained
or justified.
The same goes for conversions of continuously measured variables to cate-
gorical ones. This transformation, which you might see employed to enable a
comparison of groups, is ill-advised for the loss of variance it entails. Imagine
a study in which the researcher was interested in examining the relationship
between motivation and proficiency and, to that end, gave a group of L2
learners both a motivation measure and a proficiency test. Some research-
ers might be tempted to divide the motivation scores at the median to form
two groups whose proficiency scores would then be compared on a t-test.
This approach is appealing in that it would provide an estimate of the differ-
ence in proficiency between low and high motivation learners. However, we
strongly advise against this practice on several accounts. First, as mentioned
above, this approach results in a loss of variance. An intervally measured
variable has been converted into a dichotomous one and, thus, a lot of infor-
mation about participants’ motivation has been discarded. Second, unless
there are theoretical grounds to believe that motivation exists as a “low” or
“high” phenomenon, the grouping is not sound. And third, the more appropriate –
and, actually, simpler – approach here would be to run a correlation between
these two variables.
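A short sketch of the contrast, with simulated motivation and proficiency scores standing in for real data – the correlation uses all the information the median split throws away:

```python
import numpy as np
from scipy import stats

# Simulated stand-in data for 80 learners.
rng = np.random.default_rng(7)
motivation = rng.normal(50, 10, 80)
proficiency = 0.5 * motivation + rng.normal(0, 9, 80)

# Ill-advised: split motivation at the median, then t-test proficiency.
high = proficiency[motivation >= np.median(motivation)]
low = proficiency[motivation < np.median(motivation)]
t, p_t = stats.ttest_ind(high, low)

# Preferable: keep motivation continuous and correlate directly.
r, p_r = stats.pearsonr(motivation, proficiency)

print(f"median-split t-test: p = {p_t:.4f}")
print(f"correlation:         r = {r:.2f}, p = {p_r:.4f}")
```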
The main assumptions for one-way and factorial ANOVA – which tests a
comparison of three or more group means rather than only two groups – are the
same as those for the t-test. However, since we are now reviewing the data – and
specifically the mean scores and the dispersion around those means (i.e., SDs) –
from more than two groups (unlike the t-test) and levels (in factorial ANOVA),
the researcher will need to go further into the outcomes to establish exactly
where the differences between the groups lie.
First, as with any comparison, we need to know that we are comparing like
with like. If not, can we combine them, average them, or even begin to com-
pare them? Thus, if we are looking at two studies, both of which test whether a
particular approach to L2 vocabulary learning is more successful with intermedi-
ate than advanced learners, but wherein different effect sizes were reported, we
would need to see whether constructs such as “L2 vocabulary”, “Intermediate
(level)”, and “Advanced (level)” were operationalized similarly. Then, we would
need to see how the treatments were carried out in terms of procedures: for
example, if the treatments were the same or similar but the time intervals between
the pre- and posttests were significantly different, any calculated average between
the effect sizes would have little meaning.
The effect sizes you read will have a numerical outcome which can be read-
ily interpreted based on meta-analysis and field-specific benchmarks. As we
will see below, however, you also need to be careful when undertaking such a
rigidly-based interpretation in such a person-variable field as AL. In many cases,
meta-analysis in a particular field of study (e.g., pronunciation instruction) can
be the most helpful.
Thus, Cohen’s d, for example (see below for more details and the r family), will often
be interpreted (and accepted) as “0.2 and below = small effect size; 0.5 = medium effect size;
0.8 and above = large effect size”. In fact, Cohen himself warned against too inflexible
an interpretation of these statistics across studies, and we would therefore need to be
wary when looking to replicate the study in question using the effect size reported as a
possible reason. An effect size of 0.60 might be interpreted as “small” in the context
of one study, while in another context 0.60 may well be considered to be “large”.
Similarly, the same adjustment of interpretation might be needed if a small effect can be
seen to be meaningful.
Therefore, treat these generalized descriptors with caution as we routinely
check papers: if necessary for a potential replication which we want to undertake,
consider consulting the author about his or her interpretation of an effect size
beyond the number reported. And remember, we would be wrong to assume a
“large” effect reported is, ipso facto, more important than a “small” one. We would
also need to consider the practical significance of the outcome (see below).
As we mentioned above, comparisons need to be used with care as the basis
for future replications. The diverse features of research studies we have read about
in our first chapters can make a smaller effect size from one measure appear more
important than a greater effect size based on another type of measure (see below).
Therefore, the moral of the story is not to be over-influenced by the reported
magnitude of the effect size statistic: much depends also on your own critique
(and that of others) as regards the strength of the research methodology itself!
C. Consider the possibilities of the effect size statistic remaining at this level over a longer
period or amount of time.
D. Consider the possibilities of the effect size statistic changing along with the contexts in
which the research takes place.
As Plonsky and Oswald suggest, research setting may also explain variability
in outcomes, and may help us envisage another way in which a replication may
help clarify the result from the target study. These authors specifically single
out two possible context changes which might affect outcomes (the typically
more tightly controlled “laboratory” vs classroom; ESL vs EFL), but you will doubtless
think of others during your reading of the target study (see below, Approximate
Replications). A further context variable mentioned here by the authors concerns
the true nature of the control group (treatment/no treatment) when this is
used to calculate the effect size. In most cases you will be told the control group
received no treatment, but elsewhere local stipulations or timetable needs might
dictate that this group received a “traditional or alternative” intervention. Again,
the potential replicator may want to tease out the true nature of any experimental
effect by comparing treatment/no-treatment control group outcomes.
These authors also remind us that effect size outcomes need to be weighed
against specific methodological aspects of the study in question. Aspects
such as reporting of reliability and validity measures for the dependent varia-
bles “. . . pretesting and delayed posttesting, random assignment to experimental
groups” might conceivably affect the strength or weakness of the statistic reported
and, as such, be a useful focus for replication.
The effect size statistic – which can be calculated in a similar way as for the t-test – will be of greater interest to our cause in that the outcome
will give us a sense of the size of the effect of interest. Eta squared (η2) is used in
ANOVA, where each categorical effect has its own eta squared statistic, so we are
given a specific measure of the effect of the variable in question.
As we mentioned above, the effect size is particularly useful as we check the
data in a paper and consider how a replication might shed further light on the
outcomes: it quantifies the effectiveness of a particular intervention, relative to
some comparison. It moves us along from a basic question such as “Does the
intervention work or not?” to the far more useful, “How well does it work in
a range of contexts?”. In a language learning scenario, for example, we might
undertake a replication to try to increase the effect reported in the original study,
and thereby try to show that even a relatively inexpensive change might
improve academic success through an increase of only 0.1 in the effect size. In that
context, and particularly if a lot of students were positively affected, this could
be a significant outcome through a replication of the original. Moreover, by
placing the emphasis on the most important aspect of an intervention – the size
of the effect – rather than its statistical significance (which conflates effect size
and sample size), it promotes a more scientific approach to the accumulation of
knowledge. Thus, as you pass through this routine checking approach, it is as
well to keep in mind that a statistically significant outcome is often inconsequen-
tial. Statistical significance is affected by the sample size as well as the effect size.
Thus, when checking a study with a view to possible replication, it is as well to
remember that – moving onward – practical significance may well need to be
shored up by working on the effect observed.
In practice, you will more likely see partial eta squared (ηp2) quoted in papers.
This may be down to the fact that classical eta squared has the disadvantage
that, as you add more variables to the calculation, the proportion of variance
explained by any one variable tends to decrease, making it more difficult to compare the effect of
a single variable across different studies or replications. For smaller samples and
one-way ANOVAs you may see omega squared (ω²) being used. In general,
omega is considered a more accurate measure of the effect, where ω2 = .01 is
considered a small effect and ω2 = .06 and .14 are considered medium and large
effects respectively.
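These indices can all be recovered from a reported ANOVA table. A minimal sketch with invented sums of squares (a one-way design is assumed, so eta squared and partial eta squared coincide):

```python
# Hypothetical one-way ANOVA table values (not from any real study).
ss_effect, df_effect = 120.0, 2     # between-groups sum of squares, df
ss_error,  df_error  = 480.0, 57    # within-groups (error) sum of squares, df

ss_total = ss_effect + ss_error     # holds for a one-way design
ms_error = ss_error / df_error

eta_sq = ss_effect / ss_total
partial_eta_sq = ss_effect / (ss_effect + ss_error)   # = eta_sq here
omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

print(f"eta2 = {eta_sq:.3f}, partial eta2 = {partial_eta_sq:.3f}, "
      f"omega2 = {omega_sq:.3f}")
```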
One further statistic you might want to look out for alongside the effect size
statistic – one which will help us interpret the outcome further and feed
into any suggestion for replications – is the confidence interval. As Plonsky and Oswald
suggested above, an effect size statistic from a large sample is probably going to
be more accurate than that from a small sample. The addition of a confidence
interval statistic adjusts for this margin of error and offers the same information as
in a test of significance: thus, a “95% confidence interval” is equivalent to taking
a “5% significance level” but using the former keeps the focus on the effect size
rather than the more problematic p value. Pooling the information from an effect
size and the associated confidence interval will additionally help you to evaluate
the relationships within data more effectively than the use of p values, regardless
of statistical significance (see Cumming & Calin-Jageman, 2017).
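Where only d and the group sizes are reported, an approximate confidence interval can be attached after the fact. A sketch using a standard large-sample standard error for d (values hypothetical):

```python
import math

# Hypothetical reported effect size and group sizes.
d, n1, n2 = 0.55, 30, 30

# Large-sample standard error of Cohen's d, then a 95% CI.
se_d = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
lower, upper = d - 1.96 * se_d, d + 1.96 * se_d
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# A wide interval flags imprecision that a larger replication could reduce.
```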
If replications of a key study are undertaken – on an individual basis or as part
of a research program by a group of researchers – and a resulting set of effect sizes
obtained, the different sizes can then be collected to produce a better estimate
of the average size of the effect being considered. While replicating a study can
obviously help in providing more “evidence” for the strength of an effect, the
subsequent averaging of effects from a number of replications will inevitably
include studies with “large” as well as “small” effects. The “average” effect size
then presented will not tell the whole story of what might have accounted for
the differences in each study’s effect size in the first place, and our undertaking
of a routine check of the individual studies can provide a useful contribution.
Our routine check might enable us to come up with an interesting observation
in the case of a study with few participants such as an intact class, for example,
where there might be a small effect registered in a statistically insignificant result.
Such an outcome may not be of immediate interest because of the original sta-
tistical (in)significance outcome, but the results of a number of replications of
such experiments combined are likely to be statistically significant. In terms of
replicational strength, this cumulative evidence will also carry greater scientific
and practical weight, as it will most likely have been gathered across a number
of different contexts and perhaps a number of everyday settings.
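When a set of replication effect sizes is to hand, a simple inverse-variance (fixed-effect) weighted average gives a better pooled estimate than an unweighted mean. A sketch with invented d values and group sizes:

```python
import math

# Hypothetical (d, n1, n2) triples from three replications of one study.
studies = [(0.62, 30, 30), (0.35, 55, 55), (0.80, 20, 20)]

sum_w, sum_wd = 0.0, 0.0
for d, n1, n2 in studies:
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    w = 1 / var_d                  # weight each study by inverse variance
    sum_w += w
    sum_wd += w * d

d_avg = sum_wd / sum_w
se_avg = math.sqrt(1 / sum_w)
print(f"pooled d = {d_avg:.2f}, "
      f"95% CI [{d_avg - 1.96*se_avg:.2f}, {d_avg + 1.96*se_avg:.2f}]")
```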
To this same end of potential replication, it is also worth noting whether an
effect size is reported in the case of the reporting of non-significant outcomes,
too. Even in non-significant contexts effect sizes with confidence intervals can
indicate to what extent the non-significant findings could be due to inadequate
sample size.
» Activity 11
Look at the “Results” section from the paper by Carter et al. (see Introduction,
p. 10).
Then, first, think about why the statistical procedure might have been con-
sidered the most appropriate and what alternatives there might have been.
Second, decide whether the alpha and significance levels were acceptable
and whether they took their lead from any previous work.
Then, consider whether the effect size statistic was the best choice pos-
sible and decide whether it has then been satisfactorily considered in the
discussion of results.
Finally, and based on your considerations, think about how you might
use the evidence presented through the effect size statistic to justify replicat-
ing the study.
Internal replication offers ways of probing a study's replicability without
setting up another sample and actually redoing the study. As we see below,
all three approaches to internal replication currently in vogue have their
disadvantages, too.
All three typical approaches to internal replication see the researcher
using the original data, drawing subsets as separate samples which are then
recombined in distinct ways. This can also help the researcher to carry out
the kind of self-critique of his or her method and conclusions we have been
looking at above. All the procedures below involve the researcher in repeated
sampling of subsets of data from the overall pool to present replicate samples. The
statistic needed is then calculated for each replicate and the SD is calculated across
all the samples to arrive at the standard error of the estimate.
Cross-Validation
This process looks to resample the data and investigates the predictive power of
the specific statistical procedure, by comparing the outcomes of one subset against
those of another. To do this, the total sample is divided randomly into at least two
parts or splits (a training sample and a cross-validation test sample) and identical
analyses are carried out on each one. Then, taking each part in turn, the analysis
is fitted to the training portion and evaluated against the test portion. This gives
you a large number of estimates of prediction accuracy which can be combined
into an overall measure. The objective is to discover whether the result is
replicable or just a matter of random variability.
As you might expect, however, the outcome can vary a fair amount depend-
ing on what data or observations are in the training split and what are in the
validation split. Ideally, we would want to be sure that both training and test
samples are sufficiently large and diverse to be representative of the whole data.
However, if the sample size is small, logically each analysis is performed with a
smaller number of observations and care needs to be taken in interpretation.
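A minimal sketch of the idea, using scikit-learn and simulated speaking/writing scores as stand-ins: a regression is refitted and scored across five splits, and we inspect how stable its predictive accuracy is.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated stand-in data: does speaking score predict writing score?
rng = np.random.default_rng(0)
speaking = rng.normal(60, 10, 60).reshape(-1, 1)
writing = 0.7 * speaking.ravel() + rng.normal(0, 8, 60)

# 5-fold cross-validation: fit on four folds, score (R^2) on the fifth.
scores = cross_val_score(LinearRegression(), speaking, writing,
                         cv=5, scoring="r2")
print("R2 per fold:", np.round(scores, 2))
print("mean R2:", round(scores.mean(), 2))
# Stable values across folds suggest the result is not an artifact of
# which observations happened to fall in one particular split.
```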
Jackknife
In this procedure we once again resample different subsets of the original data.
However, this goes a step beyond cross-validation in that the same test is
repeated by omitting cases, one at a time. Recalculations are then carried out
on the remaining data. In this way, an attempt is made to partition out the effect
a particular case has on the overall estimate, and we can increase
the confidence in the potential replicability of outcomes by testing the variability
of the remaining samples. The jackknife estimate is the average of values obtained
across all the samples (see above). This is a particularly useful replication proce-
dure when the data dispersion in the distribution is seen to be somewhat wide
and/or extreme scores are evident.
Once again, however, the procedure has been called into doubt with small
samples: sample size itself imposes a limit on the number of subsets that can be
created.
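A sketch of a leave-one-out jackknife on a single group mean, with invented scores that include one extreme case:

```python
import numpy as np

# Hypothetical scores for one small group; note the outlier at 95.
scores = np.array([55, 58, 60, 61, 63, 64, 66, 70, 72, 95])
n = len(scores)

# Recompute the mean n times, omitting one case each time.
jack_means = np.array([np.delete(scores, i).mean() for i in range(n)])

# Jackknife estimate and its standard error.
jack_estimate = jack_means.mean()
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_estimate) ** 2))

print(f"jackknife mean = {jack_estimate:.2f}, SE = {jack_se:.2f}")
# Inspecting jack_means shows how far each single case shifts the estimate.
```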
Bootstrapping
In comparison with the above approaches, this method is often thought to be
a more thorough approach to internal resampling. It consists of copying the
data sample a large number of times into a large mega-file of data. Thus, many
samples are drawn from the “mega” sample, and the results are calculated for
each one and finally averaged, standard errors are calculated, and confidence
intervals also averaged. The theory behind this is that we make use of the whole
sample to represent the population, and draw many samples of the same size from
the original sample.
One of the advantages of this technique over the above is that this procedure
does not delete individual data or create unpredictable splits but rather makes
use of all the data at once and presents different combinations of the sample.
Furthermore, in jackknife and cross-validation the number of observations/data
in the sub-sample selected is smaller than the original sample while in boot-
strapping each reanalysis involves the same number of observations/data as the
original sample.
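A minimal bootstrap sketch for the mean of a single group (simulated scores as stand-ins), resampling with replacement and reading a 95% confidence interval off the percentiles:

```python
import numpy as np

# Simulated stand-in data: 44 pre-test scores.
rng = np.random.default_rng(3)
scores = rng.normal(62, 15, 44)

# Draw 5,000 resamples (with replacement, same size as the original),
# recomputing the mean for each one.
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(5000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {scores.mean():.2f}, "
      f"bootstrap 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```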
Our conclusion must emphasize not only the usefulness of internal replication
calculations being made, adding, as they do, an important element of confidence-
building in one’s data, but also their limitations as being in any way the “last
word” on a study’s replicability. While internal replication takes an important
step beyond “simple” NHST, external replication – to which we turn now –
remains the more reliable way of finding out how far new data, new participants,
or new research contexts reflect on the stability of the original study’s results.
Notes
1 Sokal, A., & Bricmont, J. (1998). Fashionable Nonsense: Postmodern Intellectuals’ Abuse of
Science. New York: Picador. (cf., The Economist, October 19, 2013: 28).
2 The tenure and promotion guidelines issued by the AAAL (American Association
for Applied Linguistics) now refer explicitly to replication, suggesting that evaluation
committees consider “. . . that high quality replication studies, which are critical in
many domains of scientific inquiry within AL, be valued on par with non-replication-
oriented studies” https://www.aaal.org/.
3 Readers will want to consult the many statistics books now available to those working
in the field of second language acquisition and AL for more detailed advice on these
procedures and, in particular, consider what the use of such procedures has assumed in
the methodology used and, therefore, the implications for potential replication studies
(see, for example, Porte, G.K. (2010). Appraising Research in Second Language Learning.
Amsterdam: John Benjamins Publishing; Larson-Hall, J. (2015). A Guide to Doing
Statistics in Second Language Research Using SPSS (2nd edn). New York: Routledge.)
approach you will add value by focusing precisely on that previous step in the cycle.
The stimulus for the next step (and the focus of its findings) comes from one ear-
lier study, rather than a whole set of studies. The aim is an eminently comparative
one focusing on what the replication study says about the original.
This difference between an extension/follow-up study and a replication is
a crucial one – but is not always clear. There should be an obvious and immedi-
ate difference of focus and interest evident not only in the very title of the research
paper (where we would reasonably expect a term such as “a replication of . . .” to
appear), but also in the way the abstract or introductory paragraphs describe what
has gone on. You will need to be careful in your reading, as some studies labelled as
“replications” are actually follow-up studies which take their lead from previous work
but, crucially, do not aim to provide the essential comparisons between the two studies.
» Activity 12
Read carefully each of the following abstracts from papers and decide
for yourself, and justify, whether a replication or extension/follow-up
study appears to have been envisaged.
tool for language learning with traditional teaching. The research tool
used was a specially designed app which adapted the original app for L1
French students learning L2 English. The original study failed to consider
both technology and non-technological language learning tools when
gauging student usage and perceived benefits. The results therefore
inherently favored technology-enhanced tools and support, when non-
technological equivalents may have been just as highly used or effective.
4. Author (1998) studied the way students modified their L2 English inter-
action in class by means of a number of qualitatively different language
learning tasks. This study uses the same methods of data collection and
data analysis as the original study, while testing the outcomes using
a different L2 – in this case French – as well as an enhanced version of
the analysis procedure. The results of the replication study partially con-
firm Author’s (1998) results, but also indicate that a number of other
factors may have affected the outcomes. As a result, the original study’s
conclusions regarding the extent to which laboratory findings can be
transferred to the classroom need to be further investigated.
5. Author (2010) described an extensive reading-into-writing methodol-
ogy that significantly improved the L2 composition skills of a culturally
homogeneous L1 Chinese group of EFL college students in China. This
present research takes up the same treatment to see if it is as effective with
more diverse groups of ESL university students in the United States. An ini-
tial pilot study suggested the instrument needed to be further enhanced,
and this version was then used over a longer period of time (one year
against 6 months in the original). The results indicated gains for the treat-
ment group. Effect size measures underlined the significance of these
outcomes. Our conclusion is that a further development of the original
methodology is very effective with a more diverse group of learners.
Now that we have arrived at this stage of familiarity with the study in question,
you are likely to be in an ideal position not only to verify what has gone on
through your routine checking and critique, but also to conjecture how the
results might have come about. It is therefore only a short step from this point
to the consideration of how you might usefully build on these results in a way in
which you can make a contribution to the field by proceeding to some kind of
replication of the study.
As we have seen, there is often confusion about what a replication study
does, and what an extension or follow-up study does. Historically, moreover,
the literature has referred to degrees of replication according to what is replicated
and how, and then defines – in rather too many ways sometimes – how each
degree of replication might be termed (see Polio, 2012).1 This confusion is not
new: in a review of papers in the Social Sciences Citation Index, Bahr, Caplow,
and Chadwick (1983)2 concluded that studies can be referred to as “replications”
while differing widely in terms of participants, time, place, or methods of data-
gathering and analysis.
We do not want to add to the confusion here with more definitions! Rather,
we intend to present you with three existing definitions and define a practical
use for each within a systematic, cumulative approach to replication work. The
intention is to set out firm replication research series which are interdependent,
and in which you can participate – first through close, followed by approximate,
and perhaps then supplemented by conceptual replications.
Having said this, let us begin by discarding for our present purposes the “clas-
sic” sense of replication. “Exact”, “literal”, or “direct” replications – essentially
doing the same study again, wherein the manipulation, measurement of
variables, and procedures are all kept the same – are just not possible in social
sciences research. Indeed, in the strictest sense, neither are they possible in the
so-called “hard” sciences – the size of a test tube may vary, for example, just as the
quality and texture of a chemical may differ from one lab to another. A replicated
study can never really be the same, be it repeated by the same researcher in the
same context, or others. Time intervals play their part, as do participant variables,
changed contexts, and a host of other variables. As Rosenthal (1991) observed,3
“[In behavioral research] . . . replications are only possible in a relative sense”.
The best we can strive for is to monitor and control for the conditions that might
otherwise affect outcomes.
Thus, if exactness is accepted as an unreachable objective, we will need to
make judgments as to just how close we need to get to permit those sound com-
parisons to be made between the original and replicated studies.
Having got “up close and personal” with your chosen target study, you are now acquainted both
with the methods and procedures used in that study and with that author’s modus
operandi – specifically the large number of decisions he or she would have had to
make along the way.
Armed with this information, you should now be in a position to understand,
or conjecture, how the results you were presented with might have come about.
All this should mean you are also a little more aware of how you will be able to
further the knowledge gained from those outcomes. Ideally, you would even have
made those notes in the margins during your critique, identifying new aspects or
questions the original author did not think about or take into account in his or her
execution of the research, and which you felt might have affected the outcomes.
One of the principal reasons for conducting a replication is to increase the
confirmatory power of the original study. Replications proceed by degrees: with
every pertinent modification in subsequent replications the confirmatory power
of the study may increase and, potentially, be generalized to further or wider
applications. However, as with replication in the pure sciences, we must proceed
cautiously and at a deliberate pace, be systematic and careful in the application
and interpretation of each modification. Just as this cautiousness can eventually
lead us to an increase in confirmatory power, changes in several types of variables
simultaneously (in one study, for example) take us further away from the origi-
nal study and mean that failure to replicate the original outcome
becomes much more difficult to interpret or pin down to a specific variable.
All this argues for an initial, cumulative, pre-planned series of replication
attempts of a target study – a replication scheme if you will – whereby only one
major variable is modified each time (it can be added or removed, of course!) and
all others are kept as constant as possible to be better able to single out the kind
of influences each has on the dependent variable.
Such a systematic series of progressive changes we will call a set of close replica-
tions such as those you see outlined in the next section. Each would ideally be
carried out by a number of independent “labs” or researchers who would present
their data before moving on to the second, and so on. To do so efficiently requires
the integration of these teams into a systematic program.
Such a progression also implies the need for some kind of external executive
control or planning regarding what is replicated, how, and when. While this process
is more measured and deliberate in its approach to research data-gathering than you
might be used to, it better enables us methodically to build up sufficient compara-
tive evidence about a study’s validity or generalizability and involve large groups
of like-minded researchers working in different contexts toward a common aim.
Read the complete paper: F. Pichette, L. de Serres, and M. Lafontaine (2012). “Sentence reading and writing for second language vocabulary acquisition.” Applied Linguistics, 33.1, 66–82. Then reread the three sections marked “Research
questions”, “Participants”, and “Procedure”, paying particular attention to the
description of the variables and the reasons why the study was carried out.
Below is a series of imagined plans for close replications of the study, each
taking up a different key aspect of the original (e.g., participants, time, and task
condition). For each sample series, you are presented with:
varies along with age (14–18). Would ESL high school students show similar
recall as in the original study?
ORIGINAL STUDY: “203 French-speaking ESL students enrolled
in . . . university. 18–53. Mean age 24.2.” Wide age range. Results not
broken down in terms of ages.
REPLICATION: Attempt to replicate research question 1 with a compa-
rable number and status of ESL high school students focusing on a narrower
age range.
OUTCOME: Similar effect as in the original found here in ESL high
school samples. However, F-tests showed similarly high recall for the second
(delayed) recall – in contrast to the original study. Narrower age range across
high school years sampled may indicate younger people are somehow better
able to recall???
********
VARIABLE MODIFICATION: DEGREE STATUS (ESL students
reading Literature/vs??)
JUSTIFICATION: Author a, 2001, Author b, 2006, and Author c, 2007
demonstrated that ESL students studying ESL Business Studies and English
Language majors in the US and Canada had statistically different recall for L2
English pseudo-words through writing. Would outcomes from the original
study be replicated across participants from differing degree areas?
ORIGINAL STUDY: Unspecified degree status. Check with authors
to see if large numbers of participants indicated various degree affiliations
involved.
REPLICATION: Attempt to replicate research questions 1 and 2 with a
comparable number and status of ESL university students focusing on differ-
ent degrees being read.
OUTCOME: ESL Business Studies degree students and English Language degree students show similar recall to that found in the original at the initial time sample, but the former reveal considerably greater losses at the second sample. Consider replications with other degrees?
********
VARIABLE MODIFICATION: L1 CHINESE vs L1 FRENCH
JUSTIFICATION: Studies of native Chinese ESL students (Author a,
1998, Author b et al., 2015) showed that these intermediate and advanced-
level ESL university students (18–22) acquire new words better through
writing and retain a high level of recall across a number of weeks.
» Activity 13a
» Activity 13b
Choose ONE section from the complete paper above (F. Pichette, L. de
Serres, and M. Lafontaine (2012). “Sentence reading and writing for
second language vocabulary acquisition.” Applied Linguistics, 33.1,
66–82) from Participants, Items, Tasks, Time allotted for completing the
tasks, Procedure, Scoring and analysis, or Limitations. Think about TWO
further possible series involving a minimal variable modification within
that section. Then make similar notes to those above describing how a
subsequent close replication with that variable change might shed new
light on the effects observed in the original study.
Our encouraging you to start with close replication serves two purposes. First, it acknowledges that no replication in the social sciences can be truly "exact", which means any attempt to revisit a previous study's findings must perforce accept some, ideally minimal, modification. Second, such a minimal change allows us to make a relatively "safe" comparison between our outcomes and those of the original and thereby feed into the body of knowledge arising from the target study and, now, its subsequent replications.
» Activity 14
i) isolate the variables which appear to have been modified from the origi-
nal/target study and those that have been held constant; and then
ii) hypothesize about what consequences there might be for the interpre-
tation of outcomes of the replication as a result of this combination of
variable modifications.
[and] there remains the ever-present question of the extent to which recall
measures actually measure acquisition.”
APPROXIMATE REPLICATION: Attempt to replicate research ques-
tion 3 with EFL participants and with four samples taken over the original
week period (“surprise”) and an additional two (“announced beforehand”)
over a further one-month period.
Perhaps recall is seen to improve more for the EFL participants with the
modifications to the time sampling.
OUTCOME: Inconclusive. Some participants showed a less pronounced fall in recall, while others stayed at much the same level. Recommendation to attempt to replicate the study across a number of other EFL populations.
As you see in the hypothetical study above, the results of a replication may
well point to the need for its own replication before moving on in the series.
While the outcome may have us wondering about the usefulness of a further
replication with another variable – such as the “altered writing condition” vari-
able modification above – we first need to gather enough data from enough
replications of the two variables at hand to begin to form a picture of what the
interaction might be telling us about the original study and its generalizability.
Hopefully, you will also notice the piecemeal nature of the research process
through replication. As we mentioned in the Introduction, your objective in
this approach to research is not to move from one context or variable or instru-
ment or procedure to another and thereby accumulate as much data as possible
using different forms of data collection or from a number of disparate contexts
and participants. This would be more characteristic of what we do in follow-up
studies or extensions of the target study. Replication rather rewards the patient,
methodical, cumulative approach to data-gathering from individual researchers
or research teams working on the same study.
» Activity 15
You are going to design some approximate replications. Look back at the
close replication series on pp. 73–77, and your own answers in Activities
13a and 13b. Think carefully first about what might justify the study of two
variables you now wish to pursue in each replication of the original study.
Discuss with someone what you might expect as a possible outcome of a
replication involving both variables. Then devise a set of THREE approxi-
mate replications in the way you saw above and present them in note form
as you saw on these pages, together with some hypothetical outcomes.
» Activity 16
Will selecting treatment and control groups on the basis of time spent in
the US educational system present similar or distinct effects on accuracy?
ORIGINAL STUDY: No reference to variable – confirmed by the authors.
OUTCOME: Groups receiving both direct and indirect feedback and iden-
tified with high instrumental motivation indices outperformed the control
in the immediate post-test and over the ten-week period. Very similar,
but not significantly better, results to the original study on both research
questions. Indirect feedback group tended to maintain the initial superior
longitudinal effect on accuracy over ten-week testing period although –
as in the original study – not as strong as the direct feedback group.
» Activity 17
Research three or four "historical" studies from your own area of interest which continue to be frequently cited in the literature as remaining significant for that area. For each study
Our aim is to examine the same underlying theory as the original. In this way, the
use of replication extends beyond confirmation of the original findings toward
theory and model building.
However, unlike in the so-called "hard" or pure sciences, our operationalization of AL constructs will need to be adapted to different learning contexts,
or methodologies, or time periods, and so on. Often, what is defined in one
context in the original study may not square with our own. For example, the
construct “listening comprehension” can be operationally defined in various
ways and apply in various contexts. We need, therefore, to see how far what has
been revealed about listening comprehension in one context applies equally in
another. Conceptual replication encourages us to test multiple manifestations of
the construct and thereby try to arrive at a consensus about the generalizability
of the outcomes.
If the theory, or hypothesis, or treatment is seen to be supported across several
different constructs, or methodologies, or analysis methods, we can begin to have
confidence in that theory, beyond the effects observed in the one, original, study.
As with our updating through replication, we will be employing different pro-
cedures from the original study. These might be in our definition of the principal
construct at hand, the relationship between constructs, the way data is gathered,
and so on. Imagine, for example, that Researcher X (1998) carried out a close
replication of Researchers Y and Z’s (1993) study into the learning strategies L2
English learners used when going about learning new vocabulary. Both studies
used retrospective data collection methods, although Researcher X chose to modify the original by using L2 German students of a similar proficiency level.
Both papers reported similar findings for the main metacognitive strategies, but there were large differences in the individual strategies reported for recording and memorizing new vocabulary. This inconsistency surprised both Researcher X and a number of others in the field, given the findings of similar studies carried out since Y and Z's seminal work. However, Researcher X's experimental procedures diverged somewhat from those of Y and Z in the individual retrospective interviews. The original paper had participants identify the individual strategies they used from a checklist, while in X's replication participants were asked to recall as many strategies as they considered important. X decided that outcomes might have been affected by the fact that the original procedure restricted responses to a predesigned list of options, in contrast with the more open nature of X's own data collection. Moreover, other researchers had criticized X's method of teaching the control group about learning strategies before the study itself. They argued that the resulting priming effects meant that participants in this group were already aware of the kind of strategies to look out for in their subsequent reporting of events.
As a result of these concerns, Researchers A and B decided to replicate Y and
Z’s study conceptually by developing and testing a more comprehensive model of
procedures for identifying learning strategies, streamlining the original procedure
» Activity 18
Below are three abstracts from studies which are described as, or
understood to be, “conceptual replications” of previous work. Read
the extracts carefully, underlining where the author describes the changes made to the original study in his or her replication. Explain what characteristic(s) make this a conceptual replication. If
you think you need more information, read the full article before you
make your decision.
Paper 1
THE EFFECTS OF INPUT ENHANCEMENT ON GRAMMAR LEARNING AND
COMPREHENSION: A MODIFIED REPLICATION OF LEE (2007) WITH EYE-
MOVEMENT DATA
www.cambridge.org/core/journals/studies-in-second-language-acquisition/article/effects-of-input-enhancement-on-grammar-learning-and-comprehension/FA73F01ADB6A7B4148AD25D697F401D7.
PAULA WINKE (Studies in Second Language Acquisition, 35.2, 323–352).
In his 2007 study “Effects of Textual Enhancement and Topic Familiarity
on Korean EFL Students’ Reading Comprehension and Learning of Passive
Form,” Lee demonstrated that learners were better able to correct written
sentences that contained incorrect English passive forms after exposure
to texts flooded with enhanced (versus nonenhanced)4 passive forms. But
with enhanced forms, learners did worse on comprehension tests, which
arguably demonstrated a trade-off: More attention to forms resulted in
less to meaning. In this study, a conceptual replication of Lee’s using eye-
movement data, I assessed how English passive construction enhancement
affects English language learners’ (a) learning of the form (via pre- and
posttest gains on passive construction tests) and (b) text comprehension.
In contrast to Lee’s results, I found enhancement did not significantly
Paper 2
DISCOURSE PROCESSING EFFORT AND PERCEPTIONS OF
COMPREHENSIBILITY IN NONNATIVE DISCOURSE: THE EFFECT OF
ORDERING AND INTERPRETIVE CUES REVISITED
www.cambridge.org/core/journals/studies-in-second-language-acquisition/article/discourse-processing-effort-and-perceptions-of-comprehensibility-in-nonnative-discourse/1AA1DA0EA88AC7060312852465DCC5A5.
ANDREA TYLER AND JOHN BRO (Studies in Second Language
Acquisition, 15.4, 505–522).
The study reported here extends Tyler and Bro’s (1992) investigation of
the sources of native speakers’ perceptions of incoherence in English text
produced by nonnative speakers. Using paper-and-pencil tasks, the origi-
nal study examined two competing hypotheses: (a) The primary source of
interference was the order in which the ideas were presented versus (b) the
primary source of interference was mismatches in discourse structuring cues.
They found no effect for order of ideas but a strong effect of discourse struc-
turing cues. In the present study, 80 subjects were tested on the same texts
as those used in Tyler and Bro (1992) but using microcomputers. Subjects
rated the text for comprehensibility and answered three questions concern-
ing the propositional content. The computer format represented a more
sensitive measure of subjects’ reactions to the text because it did not allow
looking back and because it provided information concerning differences
in reading time for each manipulation. Once again, the results of the com-
prehensibility ratings showed a strong effect for miscues and no significant
effect for order of ideas. Results of the true/false questions indicated that
presence of miscues affected subjects’ comprehension of the propositional
content but that order of ideas had no discernible effect. Finally, reading
time results also showed a strong effect for miscues and a mixed effect for
order of ideas, suggesting that order of ideas does make a minor contribu-
tion to comprehensibility.
Paper 3
ON THE SECOND LANGUAGE ACQUISITION OF SPANISH REFLEXIVE
PASSIVES AND REFLEXIVE IMPERSONALS BY FRENCH- AND ENGLISH-
SPEAKING ADULTS
http://journals.sagepub.com/doi/abs/10.1191/0267658306sr260oa.
ANNE TREMBLAY (Second Language Research, 22.1, 30–63).
This study, a partial replication of Bruhn de Garavito (1999a; 1999b),
investigates the second language (L2) acquisition of Spanish reflexive pas-
sives and reflexive impersonals by French- and English-speaking adults at an
advanced level of proficiency. The L2 acquisition of Spanish reflexive passives
and reflexive impersonals by native French and English speakers instantiates a
potential learnability problem, because (1) the constructions are superficially
very similar (se V DP) but display distinct idiosyncratic morphological and
syntactic behaviour; (2) neither exists in English, and the reflexive impersonal
does not exist in French; and (3) differences between the two are typically
not subject to explicit instruction. Participants – 13 English, 16 French and
27 Spanish speakers (controls) – completed a 64-item grammaticality-
judgement task. Results show that L2 learners could in general differentiate
grammatical from ungrammatical items, but they performed significantly
differently from the control group on most sentence types. A look at the par-
ticipants’ accuracy rates indicates that few L2 learners performed accurately
on most sentence types. Grammatical and ungrammatical test items involv-
ing [+animate] DPs preceded or not by the object-marking preposition a
were particularly problematic, as L2 learners judged them both as grammati-
cal. These results confirm that the L2 acquisition of Spanish reflexive passives
and reflexive impersonals by French- and English-speaking adults instantiates
a learnability problem, not yet overcome at an advanced level of proficiency.
» Activity 19
You are now presented with three abstracts from recent studies which
we will imagine have been singled out as in need of conceptual replica-
tion. Look up the studies themselves if you need more details about the
background, procedures and methodology.
Paper A
AFFECT TRUMPS AGE: A PERSON-IN-CONTEXT RELATIONAL VIEW OF
AGE AND MOTIVATION IN SLA
http://journals.sagepub.com/doi/abs/10.1177/0267658315624476?journalCode=slrb.
SIMONE E. PFENNINGER AND DAVID SINGLETON (Second Language
Research, 32.3, 311–345).
Recent findings indicate that age of onset is not a strong determinant of
instructed foreign language (FL) learners’ achievement and that age is intri-
cately connected with social and psychological factors shaping the learner’s
overall FL experience. The present study, accordingly, takes a participant-
active approach by examining and comparing second language (L2) data,
motivation questionnaire data, and language experience essays collected
from a cohort of 200 Swiss learners of English as a foreign language (EFL)
at the beginning and end of secondary school. These were used to ana-
lyze (1) whether in the long run early instructed FL learners in Switzerland
outperform late instructed FL learners, and if so the extent to which motiva-
tion can explain this phenomenon, (2) the development of FL motivation
and attitudes as students ascend the educational ladder, (3) the degree to
which school-level variables affect age-related differences, and (4) learners’
beliefs about the age factor. We set out to combine large-scale quantitative
methods (multilevel analyses) with individual-level qualitative data. While the
results reveal clear differences with respect to rate of acquisition in favor of
the late starters, whose motivation is more strongly goal- and future-focused
at the first measurement, there is no main effect for starting age at the end of
mandatory school time. Qualitative analyses of language experience essays
offer insights into early and late starters’ L2 learning experience over the
course of secondary school, capturing the multi-faceted complexity of the
role played by starting age.
Paper B
THE INFLUENCE OF FOREIGN SCRIPTS ON THE ACQUISITION OF A
SECOND LANGUAGE PHONOLOGICAL CONTRAST
http://journals.sagepub.com/doi/abs/10.1177/0267658315601882.
LIONEL MATHIEU (Second Language Research, 32. 2, 145–170).
Recent studies in the acquisition of a second language (L2) phonology
have revealed that orthography can influence the way in which L2 learners
Paper C
INPUT PROCESSING AT FIRST EXPOSURE TO A SIGN LANGUAGE
http://journals.sagepub.com/doi/abs/10.1177/0267658315576822.
GERARDO ORTEGA AND GARY MORGAN (Second Language Research,
31.4, 443–463).
There is growing interest in learners’ cognitive capacities to process a
second language (L2) at first exposure to the target language. Evidence sug-
gests that L2 learners are capable of processing novel words by exploiting
phonological information from their first language (L1). Hearing adult learn-
ers of a sign language, however, cannot fall back on their L1 to process novel
signs because the modality differences between speech (aural–oral) and sign
(visual-manual) do not allow for direct cross-linguistic influence. Sign lan-
guage learners might use alternative strategies to process input expressed in
the manual channel. Learners may rely on iconicity, the direct relationship
between a sign and its referent. Evidence up to now has shown that iconicity
facilitates learning in non-signers, but it is unclear whether it also facilitates
sign production. In order to fill this gap, the present study investigated how
iconicity influenced articulation of the phonological components of signs.
In Study 1, hearing non-signers viewed a set of iconic and arbitrary signs
along with their English translations and repeated the signs as accurately as
possible immediately after. The results show that participants imitated iconic
signs significantly less accurately than arbitrary signs. In Study 2, a second
group of hearing non-signers imitated the same set of signs but without the
accompanying English translations. The same lower accuracy for iconic signs
was observed. We argue that learners rely on iconicity to process manual
input because it brings familiarity to the target (sign) language. However,
this reliance comes at a cost as it leads to a more superficial processing of the
signs’ full phonetic form. The present findings add to our understanding of
learners’ cognitive capacities at first exposure to a signed L2, and raises new
theoretical questions in the field of second language acquisition.
Pichette, de Serres, and Lafontaine (2012) set out to compare the "effectiveness of reading and writing sentences for the incidental acquisition of new vocabulary in a second language". To do so, they present three research questions (p. 70) with a number of constructs – some of which we have put in italics:
Question #1
For intermediate and advanced L2 students, does sentence writing lead
to higher vocabulary gains relative to sentence reading?
Question #2
For intermediate and advanced L2 students, does recall vary according to
the concreteness of target words?
Question #3
Does the impact of task and concreteness change over time?
Your close reading and critical address (see Chapter 2) of the text should have
encouraged you to look out for the operationalization of these constructs within
the study. At that point such attention to how constructs were defined in the
study was a question of whether we found them acceptable as a practical, test-
able, observable, and/or measurable quality within that study. Now we need to
think about how far that operationalization might be modified to permit useful
conceptual replications to be undertaken.
So, for example, in the above study, reading further ahead we discover that
“recall” was defined as follows (p. 71):
The recall task chosen was cued recall, which requires the participants to
provide the L2 word via a clue offered by the experimenter. The measured
knowledge is thus of a productive, not receptive, nature. Cued recall is rec-
ognized as sensitive to word forms, since the person tested does not have to
recognize the L2 form, but retrieve it from memory and produce it correctly.
Since the experiment included abstract words, the clues were L1 French
definitions, since the use of illustrations would be difficult, if not impossible.
Later on we read that this operational definition was mooted to have had its
drawbacks (pp. 78–79):
. . . . many students had probably guessed that some sort of recall test would be
given as a follow-up task, given the fact that they are frequently solicited for
participating in studies during their degree program. And there remains the ever-
present question of the extent to which recall measures actually reflect acquisition.
» Activity 20
Look at pp. 207–213 in the key paper by Bitchener & Knoch.6 Decide
what constructs have been used and how these have been operation-
alized. If you were going to carry out a conceptual replication of this
study to further test the underlying theory or hypotheses, how might
you operationalize some of these constructs differently?
If our conceptual replication does not reach the same conclusion, can we say we have "conceptually not replicated" our original? Obviously not, as there would be no adequate basis of comparison in terms of procedures and/or analysis to justify such a conclusion. Moreover, as we will discuss later, there is an apparent publication bias toward successful outcomes in replication, and this quickly biases us toward reading only such work, rather than "failed" replications.
Having said all this, we end this section by encouraging you to undertake well-planned and well-argued conceptual replications. They can be a higher-risk undertaking than close or approximate replications in the sense that failure to replicate will leave us with little or nothing to say about the original. However,
anything that results in our better understanding of the extent or limits of a
hypothesis or theory has to be welcomed. And there are payoffs: when we are
presented with a group of conceptual replications of a particular study which tend
to reinforce the original hypothesis or clearly build upon its underlying theory,
we inevitably need to sit up and listen, for something is clearly emerging as a
result of this work.
We have suggested in this chapter that replication in AL exists on a continuum
from the more closely allied to the original study through to the more distant.
Our most “distant” (conceptual) replication is, nevertheless, still seen to be of
potential value as a contribution to a debate about a study. It takes us beyond the
confines of the study in question and potentially presents evidence for the success
of what has been tested across many more contextual variables.
A well-designed, cumulative program of replication research will ideally result in research that has been tested or validated to the best our circumstances will allow. It will help us understand how far the observed
outcomes are generalizable, and also indicate how far these effects are robust to
variations in contexts and/or conceptual modification.
Notes
1 Polio, C. (2012). Replication in published applied linguistics research: A historical
perspective. In G.K. Porte (Ed.), Replication Research in Applied Linguistics (pp. 47–91).
Cambridge: Cambridge University Press.
2 Bahr, H., Caplow, T., & Chadwick, B. (1983). Middletown III: Problems of replication,
longitudinal measurement, and triangulation. Annual Review of Sociology, 9, 243–264.
3 Rosenthal, R. (1991). Replication in behavioral research. In J.W. Neuliep (Ed.),
Replication Research in the Social Sciences (pp. 1–30). Newbury Park, CA: Sage.
4 Textual enhancement is a form of visually modifying those parts of a printed text which include a targeted syntactic structure for the purpose of instruction. The aim is to bring the learner's attention, while s/he is focusing on the meaning of a stretch of discourse, to the targeted structures and to how they are used. It is hoped that textual enhancement will promote the learner's noticing of the forms and will help him or her acquire or comprehend them.
5 See Introduction (pp. 10–11) for full reference.
6 See Introduction (pp. 10–11) for full reference.
6
EXECUTING AND WRITING UP YOUR
REPLICATION STUDY
Research Questions and Methodology
6.1 Introduction
Executing a replication study and writing it up for journal publication is our focus
in both this chapter and Chapter 7. To achieve this goal, we will pay particular
attention to critical reflections on study design, methodological procedures, and
routine checking of results – as discussed in Chapters 2–5. Because describing
accurately what went on is such a critical part of the research process, we dedi-
cate large portions of Chapters 6 and 7 to the writing up of replication research,
including suggested writing models with examples from published replications in
high quality journals.
Just as we understand that replication can differ from other types of empiri-
cal research, the writing up of a replication study also includes unique features
that we need to be aware of. For example, since our replication is likely to differ
from the original study, we need to make explicit what aspect(s) of the original
study have been changed, why we have changed them, and the way(s) in which our
changes have been executed. As we will see, detailing the procedures and ration-
ale for our replication becomes necessary not only for interpreting our findings
in light of those from the original study, but also in terms of contributing to the
larger academic discourse (see Appelbaum et al., 2018).1 Researchers carrying out
subsequent replications that build on your own will need to clearly understand
how you approached replication, and why.
Understanding journal expectations for replication research is therefore
important. Let us begin with surveying journal requirements for the writing up of
replication research. We will look specifically at one journal in second language
acquisition, and another from an unrelated social science field.
• What is the impact of scant reporting of the original study’s data sample?
{ How could you execute a replication if you know little about the
data sample?
Our planning will need to take into account both the completeness and transparency of the original study's methodological and procedural reporting and our own expertise and access to resources and participants. This reflection then leads to the execution and writing up of our replication study's research questions and methodology.
• How did you address the concerns you had in the abstract?
• Did a closer reading of the research study answer your original concerns?
• To what extent did the discussion and conclusion raise the same questions?
We first need to identify three key features of the original study: what was the sample, what was the context of the data collection, and what materials were used to collect the data?
• Do I have access to a data collection site that runs an academic writing course
for international university students?
{ Possibilities may include Intensive English Programs, Pre-Sessional
courses, and English for Academic Purposes courses.
• Does that course include at least 63 ESL learners, aged 18–20 years old, who
are mostly from East and South Asian countries?
• Does that course last at least ten weeks?
• Can I obtain – or satisfactorily recreate – the data collection materials used in
B&K?
• Teacher participation is needed – so can I get agreements from on-site
teachers?
You will want to ask yourself questions similar to these before proceeding any
further because our answers may well influence what happens next.
For example, if we don’t have access to a data sample similar to that described
in the original study, we might need to rethink our variable modification in order
to execute our close replication. Imagine, for example, we don't have access to 18–20-year-olds, or maybe the students are not mostly from East and South Asian
countries. Because we are at the planning stages of our close replication, we can
still implement change. It is important to remember, however, that our variable
modification has to be motivated by our critique of the study. For example,
assuming we decided to modify age because we didn't have access to 18–20-year-olds, what might our rationale be (see Chapter 5)?
A further consideration might be whether minor variable modifications
are involved. If so, could you still proceed? For example, imagine you meet
all the design features as presented above, but your participants are in an
Intensive English Program in Scotland, whereas the original study was con-
ducted in the US. In such a case, you would want to make a note of why
you consider such a difference to be a minor variable modification. Then,
you would want to ensure that you return to this matter in your discussion of
the results because it is possible that this context difference has contributed in
some way to your findings.
Before proceeding, let us take stock. We have asked important questions to
determine whether it is feasible for us to execute this close replication. These
questions addressed the nature of the data sample (i.e., the study participants),
the context of the data collection, and the materials used to collect the data. Our
answers to these questions may determine whether or not we are able to execute
a close replication of the original study. For our purposes, the results of our pre-
liminary questioning were positive.
B&K summarize their study’s aims and research questions on page 211.
What are the study’s main features?
Points to consider:
1. Does advanced learner accuracy in the use of two functions of the English
article system improve over a 10-week period as a result of written CF?
2. Does advanced learner accuracy in the use of two functions of the
English article system vary according to the type of written CF provided?
(Bitchener & Knoch, 2010, p. 211)
Let us examine each separately in the same critically attentive way we addressed
the reading of the abstract and other sections of the paper in Chapter 3.
First, question 1 narrows the focus to improvement of “accuracy”, although
we do not yet know how “accuracy” is defined (but see the “Analysis” on p. 213).
It is also as yet unclear from the research questions what “use” refers to, but we
assume that it refers to writing because the sample population is “L2 writers”,
as described in “Aims” (p. 211). We therefore assume that B&K’s dependent
variable is written accuracy (but this is specified neither in “Aims” nor in the
research questions). As regards “the use of two functions of the English article
system”, Activity 24 told us that B&K investigated “first and subsequent or
anaphoric mentions”, but we should note this information is missing from the
research questions. The last component of research question 1 stated improve-
ment “over a 10-week period as a result of written CF”, which indicates that
written accuracy improvement was examined over a 10-week period follow-
ing the provision of written CF. The “Aims” section tells us that written CF
was only provided once, between pre-test and post-test in a pre-test-post-test-
delayed post-test design.
Second, research question 2 is structured in the same way as research question
1 up until the last part, which examined the extent to which written accuracy
varied “according to the type of written CF provided”. Research question 2
indicates, therefore, that B&K additionally examined the role of different types
of written CF on written accuracy. Although B&K do not specify how many
types of written CF were provided, the “Treatment” (p. 212) and “Procedure”
sections (p. 213) describe three types of written CF.
Our analysis of B&K’s research questions could be summarized as follows:
Because all participants shared the same L2 proficiency level (in contrast to written CF, which was different for each group), inclusion of proficiency level could be seen as optional. In that case, we would have to make sure proficiency level is defined and explained in our "Aims" section.
Now, we are at a point where we have critiqued B&K’s research ques-
tions to understand the focus of their investigation. Given that our variable
modification is L2 proficiency, and that our close replication is following
every other aspect of B&K’s study, we could take their exact research ques-
tions and replace “advanced learner” with “upper-intermediate learner”.
That said, our critique of B&K’s research questions indicated we could
add more information that would improve the clarity and precision of our
research questions. In the next section, we will write up our research ques-
tions for publication.
Examine how Eckerth (2009, pp. 113–114) set out his replication study’s
research questions.
Consider:
1a. To what extent does providing written CF improve the accuracy of first and
anaphoric mentions in written L2 English immediately after instruction (at
post-test) and eight weeks later (at delayed post-test)?
2a. To what extent do different types of written CF (direct, indirect, direct +
oral review) improve the accuracy of first and anaphoric mentions in written
L2 English?
You will notice that we used "to what extent" rather than "does", which encourages us to ask questions about degrees of improvement instead of binary questions of "improvement" versus "no improvement" (see Cumming & Calin-Jageman, 2017).8 You may also have noticed that we
dropped L2 proficiency from our research questions, for the reasons discussed
earlier: L2 proficiency is not an experimental manipulation in either B&K or in
our close replication. Such a decision is optional, however. If your close replica-
tion did experimentally manipulate L2 proficiency level (e.g., low, intermediate,
and high in the same study), it would be essential to include this information in
your formulation of the research questions.
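To make the estimation mindset behind "to what extent" concrete, the short Python sketch below computes a mean pre-to-post gain together with its 95% confidence interval, which is the kind of "degrees of improvement" answer such a research question invites. It is an illustration only: the function name is our own, and the scores are invented, not data from B&K or from our imagined replication.

import math
from statistics import mean, stdev
from scipy import stats

def gain_with_ci(pre, post, confidence=0.95):
    # Paired pre/post scores for one group; returns the mean gain
    # and a t-based confidence interval around it.
    gains = [b - a for a, b in zip(pre, post)]
    m, sd, n = mean(gains), stdev(gains), len(gains)
    se = sd / math.sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return m, (m - t_crit * se, m + t_crit * se)

# Invented accuracy percentages for eight participants:
pre = [55, 60, 48, 62, 70, 58, 65, 52]
post = [63, 66, 55, 70, 74, 60, 72, 61]
print(gain_with_ci(pre, post))  # mean gain with its 95% CI

Reporting the interval, rather than only a significance verdict, gives later replicators a quantity they can directly compare against their own results.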
For now, we have a set of research questions that, despite some differences
in structure, reflect the same information as in B&K, but with more precision.
Having defined our research questions, the next task is to write them up and link
them to the original study. The extract below is from Eckerth (2009). His write-
up begins with a summary of the original study’s research questions and aims. It
appears under the subtitle “The original research study” (p. 113).
Foster's original study was set up to see "what the student in the classroom
does” with the negotiation of meaning (Foster 1998: 5). For this purpose,
lower intermediate ESL learners in an actual classroom were observed while
they were working in small groups and in pairs on different language learn-
ing tasks. The study sought to investigate to what extent the learners would
Using the previous citations from Eckerth (2009) describing the original
study and the replication, answer these questions:
Aims
This close replication followed very closely the procedures and research design described in B&K (2010); modification of L2 proficiency level was the only difference between the original study and this close replication. B&K's participants
were classified as “advanced learners”, but this replication recruited upper-
intermediate level learners. Although the original study did not use an
independent measure of proficiency to assess L2 proficiency level, we
assessed upper-intermediate proficiency using the ETS assessment of inter-
mediate proficiency in TOEFL writing (17–23) and pre-test performance
(For more information, see Analysis). This replication is otherwise very
closely comparable to the original study’s research design.
As in the original study, this replication examines the extent to which
(a) written corrective feedback (CF) can improve the accuracy of upper-
intermediate learners’ written L2 English, and (b) the impact of different
types of CF on written L2 English. The same two target features were
examined: first and subsequent or anaphoric mentions. Written data was
collected over ten weeks: pre-test in week 1, post-test in week 2, delayed
post-test in week 10. CF was provided once only, between the pre-test
and the post-test.
We addressed very similar research questions to those of B&K (2010):
1. To what extent does providing written CF improve the accuracy of
first and anaphoric mentions in written L2 English immediately after
instruction (at post-test) and eight weeks later (at delayed post-test)?
6.4.3 Methodology
A close replication requires that we follow the original study’s methodology and
procedures as closely as possible. As discussed in Chapter 2, sometimes authors
are unable to fully describe their research design in the paper itself because of
space limitations. That said, we need to ensure that our design and procedures
are reported as fully as possible to facilitate both full evaluation of our findings
and replication. Today, many journals publish supplementary materials along-
side published papers where authors can include additional information helpful
for understanding the research more fully (e.g., extra analyses, sample materi-
als, coding protocols). Data collection materials can also be uploaded to IRIS
(www.iris-database.org), as discussed previously, which is an open-access repos-
itory of data collection materials (see Marsden, Mackey, & Plonsky, 2016).9
PARTICIPANTS
B&K describe their data sample on page 211, which includes infor-
mation about the data collection context and characteristics of their
participants. Answer the following questions:
B&K describe their data sample and the research context in the section titled
“Context and participants” (p. 211). The main attributes of their research context
are as follows:
B&K’s description informs us that data was collected in the US in an ESL pro-
gram, with international students enrolled in a course that prepared them for
academic writing. Additional information about both that university’s English
language entrance requirements and a description for that specific academic writ-
ing course would help us clarify the extent to which our research site matched
the original study. For our purposes, however, we should ensure that our research
context matches B&K’s as per their description.
Moving on to the data sample itself, we know from the above citation that
participants were “63 advanced L2 writers” (p. 211) and they were enrolled
in a course that prepared them for the “academic writing requirements of the
university” (p. 211). Information about age and background was also provided:
“Most of the participants were from a range of East and South Asian coun-
tries and were in the 18–20 year old age bracket” (Bitchener & Knoch, 2010,
p. 211). In summary, we know that B&K’s participants were 63 advanced L2
writers of English, all enrolled in an academic writing course for international
students at a university in the US. We also know that “most” of the participants
were from a variety of “East and South Asian countries” and most were 18–20
years old. In general, we may well conclude that the data sample is only partially reported. For example, how should we interpret "most" in "most of the participants were from a range of East and South Asian countries"? And what types of topics are covered in the course the students are taking? Additional useful information about recruitment would have helped our design as well. For example, how were participants recruited? This type of information might also tell us something about possible attrition and exclusion rates. Were any participants excluded, for example?
TARGET STRUCTURES
The target structures will remain the same in executing our replication. We now look at these as described in B&K, as follows:
this study investigated the effect of targeting two functional uses of the
English article system: the referential indefinite article ‘‘a’’ for referring
to something for the first time (first mention) and the referential definite
article ‘‘the’’ for referring to something already mentioned (subsequent or
anaphoric mentions).
(Bitchener & Knoch, 2010, pp. 211–212)
Examples of the target structure are included in the description of the treatments,
as shown: “A man and a woman were sitting opposite me. The man was British
but I think the woman was Australian" (Bitchener & Knoch, 2010, pp. 211–212).
For our purposes, in line with the original study, the target structures will be uses
of “a” for referring to something for the first time (the first mention) and then
use of “the” for any subsequent mentions of the same thing (the subsequent or
anaphoric mention).
TREATMENTS
We will also provide the same treatments as the original study, which describes
three experimental treatments (see Bitchener & Knoch, 2010, p. 212). The treat-
ment description indicates four different groups: three treatment groups, each
with a different treatment, and one control group that received no treatment.
The treatments consisted of written CF on the texts participants completed for the pre-test. As a result, each participant appeared to receive individualized (and possibly different amounts of) feedback according to their specific treatment condition. Feedback was provided once only for each participant.
First, in group one, participants received direct written CF plus meta-linguistic
explanations:
Example
A man and a woman were sitting opposite me. The man was British but I
think the woman was Australian.
(Bitchener & Knoch, 2010, p. 212)
Second, in group two, participants received indirect written CF only, in the form of the circling of errors without corrections or explanations. Group three is then described as follows:
Group three received direct written CF in the form of (1) the same writ-
ten meta-linguistic explanation as group one and (2) an oral form-focused
review of the written meta-linguistic explanation. The latter took the form
of a 15 minute full class discussion of the written meta-linguistic explana-
tion that the writers wanted to have clarified.
(Bitchener & Knoch, 2010, p. 212)
In executing our close replication of B&K, we will provide the same treatments
as described in the original study: three treatment groups and one control group.
Each treatment group will receive a different type of written CF. Group one will
receive direct written CF plus a written meta-linguistic explanation. Group two will receive indirect written CF only (circling of errors). Group three will receive
written CF plus a written meta-linguistic explanation as well as an oral form-
focused review of the written meta-linguistic explanations lasting 15 minutes.
INSTRUMENTS
PROCEDURE
B&K describe their procedure for data collection in four steps. We will adhere to
these procedures as closely as possible. Participants and then teachers separately
“were provided with information sheets about the study and were given the
opportunity to ask questions before signing a participant consent form” (p. 213).
These information sessions were scheduled five days before the pre-test. The
procedure is then described as follows:
We will now work through the above procedure so as to fully understand the
different steps involved in data collection.
First, the pre-test was administered on day one. Although a pre-test was nei-
ther clearly identified in the description of the data collection instruments nor
here in the procedure, we will assume that the pre-test involved a 30-minute
handwritten description of an image, and that participants had no access to dic-
tionaries, grammar books, or other writing aids. In executing our replication, we
will use the ‘beach’ image as our pre-test (note that the original study did not
specify which image was used for the pre-test).
Second, three days after the pre-test, participants in a treatment group (groups
one, two, and three) received back their texts with written CF that conformed
to their particular treatment, as previously described. B&K explained that par-
ticipants were given “several minutes to consider the feedback” (p. 213). At this
point, the immediate post-test was completed.
Third, participants received back their immediate post-test “1 week after it
had been written” (p. 213), indicating that all groups (treatment and control
• How does the author connect the replication and the original study?
{ Do you notice any expressions or use of subtitles that help connect
the original study and the replication?
• How does the replication report differences with the original study?
{ What methodological differences are described, and how are they
justified?
• How does the replication report similarities with the original study?
{ What particular word choices help the reader understand between-
study similarities?
• Does the replication study add any new tasks?
{ If so, how are these described and how are additions justified?
Participants
While B&K’s (2010) participants were identified as advanced ESL writ-
ers, this replication study’s participants were upper-intermediate ESL
writers because L2 proficiency was our intentional variable modification.
Although L2 proficiency level was not independently measured in the
original study, this replication used TOEFL writing scores as an indi-
cator of L2 English proficiency. All participants reported TOEFL writing
scores between 17 and 23. We also compared our pre-test scores with those
reported in the original study, and pre-test scores in this replication were
descriptively lower than in the original study.
Since the other components of the methodology section are exactly the same as
in the original study (i.e., target structures, treatments, instruments, and proce-
dures), we will cover them together. We should add, however, that as with all
other write-ups of replication research, we should ensure that we consistently
note similarities and differences when they occur. For example, we previously
noted that we were adding detail to the procedures in terms of which tasks were
used at the pre-test, post-test, and delayed post-test (which were not specified in
the original study). Also, since we were not able to use the original study’s data
collection instruments, we created our own based on descriptions in the original
study. These are important points to note for subsequent comparison with the
original study and for future replication.
In short, we must ensure that our write-up contains two features. First, when
something changes between the original and the replication, we must clearly state
what changed. Second, when something is the same as in the original study, we
must clearly state that it was the same. In the following paragraphs, we present an
example write-up of our replication’s target structures, treatments, instruments,
and procedures.
Target structures
This replication study’s two target structures were exactly the same as
in B&K (2010): use of “a” and “the” to index the first and subsequent
mentions, as shown in the following example. “A man and a woman
were sitting opposite me. The man was British but I think the woman
was Australian" (Bitchener & Knoch, 2010, p. 212). The bolding (from the original) on "a" and "the" illustrates that the first mentions of the man and the woman are marked with "a" because they are unknown in the discourse, whereas their subsequent mentions use "the" because they are now known subjects in the discourse. The use of "a" (for first mention) and "the" (for subsequent mention) was selected in the original study because previous research has repeatedly shown these structures to be difficult to acquire, even at advanced levels of proficiency (Butler, 2002; Ferris, 2002, 2006).
Treatments
Our treatment design exactly followed B&K (2010), who provided three
types of CF treatment. Written CF was handwritten and provided on the
pre-test text. We note that the original study did not specify whether texts
were handwritten or typed.
First, a “direct CF group” received direct CF that included a brief meta-
linguistic explanation of “a” and “the” when used for first and subsequent
mention (as described for “target features”). Other uses of “a” and “the”,
correct or incorrect, were ignored without correction. For each error, an
asterisk was marked above the error with the same color marking used for all
asterisks, which referred to the following explanation and example (no other
explanations or examples were provided), exactly as in the original study:
Example
A man and a woman were sitting opposite me. The man was British but I
think the woman was Australian.
(Bitchener & Knoch, 2010, p. 212)
Finally, a control group received no treatment and completed only the pre-
test, post-test, and delayed post-test.
Instruments
As in the original study, three images were used to elicit written descrip-
tions of what was happening in the image (see Appendix and IRIS for
images used). Each image was of a social gathering: One image was at the
beach, one image was at a picnic, and the last image was at a family celebra-
tion. We used the same contexts (beach, picnic, family celebration) as B&K
(2010) to elicit written descriptions. Written descriptions were handwrit-
ten on paper, and participants had 30 minutes to write their descriptions.
No dictionaries, grammar books, or other writing aids were available.
Although we were not able to use the original study’s test instruments,11 we
closely followed their description as reported in the original study: “Each of
the three pieces of writing required a description of what was happening in
a picture of a social gathering (a beach, a picnic, and a family celebration).
Thirty minutes was given for the writing of each description” (Bitchener
& Knoch, 2010, p. 213).
Procedure
1. Five days before the pre-test, we met with all participants and then
with all teachers to discuss the study and give opportunities for
questions to be asked about the study. All participants signed a con-
sent form.
2. We administered the pre-test on day one of week one. Each participant received one color copy of the beach image and was requested to write a description of it on the provided paper using the provided pen. Participants had 30 minutes to complete their writing.
As in the original study, CF was only provided on the pre-test text. Also,
interviews with teachers confirmed no teaching of this study’s target
structures during the course of data collection (“a” and “the” in first and
subsequent mentions). Participants were assigned to a treatment group via
randomization, but we note that the original study did not describe how
participants were assigned to groups.
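Because the original study did not describe its assignment procedure, our replication should document its own so that later replications can follow it. The lines below are a minimal Python sketch of one way to randomize and record assignment to the four conditions; the participant IDs, group labels, and seed are our own illustrative choices, not anything specified by B&K.

import random

# 63 participants, matching the sample size reported in B&K;
# the IDs themselves are invented for illustration.
participants = [f"P{i:02d}" for i in range(1, 64)]

conditions = [
    "direct CF + written meta-linguistic explanation",
    "indirect CF (errors circled)",
    "direct CF + meta-linguistic explanation + oral review",
    "control (no CF)",
]

random.seed(2010)  # a fixed seed makes the assignment reproducible
random.shuffle(participants)

# Deal shuffled participants out to the four conditions in turn,
# keeping group sizes as balanced as possible, and keep the record.
assignment = {pid: conditions[i % len(conditions)]
              for i, pid in enumerate(participants)}

Keeping the seed and the resulting assignment list on file means the procedure itself can be reported, and replicated, exactly.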
Notes
1 Appelbaum, M., Cooper, H., Kline, R.B., Mayo-Wilson, E., Nezu, A.M., & Rao, S. M.
(2018). Journal article reporting standards for quantitative research in psychology: The
APA Publications and Communications Board task force report. American Psychologist,
73.1, 3–25.
2 Eckerth, J. (2009). Negotiated interaction in the L2 classroom. Language Teaching, 42.1,
109–130.
3 Foster, P. (1998). A classroom perspective on the negotiation of meaning. Applied
Linguistics, 19.1, 1–23.
4 McManus, K., & Marsden, E. (2018). Online and offline effects of L1 practice in L2
grammar learning: A partial replication. Studies in Second Language Acquisition, 40.2,
459–475.
5 McManus, K., & Marsden, E. (2017). L1 explicit information can improve L2 online and
offline performance. Studies in Second Language Acquisition, 39.3, 459–492. doi:10.1017/
S027226311600022X.
6 As a reminder, a close replication involves modification of only one major variable at a
time in order to keep all other study design features as constant as possible, and there-
fore facilitate comparison between the original study and the replication (for review,
see Chapter 5).
7 For a general introduction to research questions in second language research, see
Mackey, A., & Gass, S.M. (2016). Second Language Research: Methodology and Design. New York: Routledge.
8 Cumming, G., & Calin-Jageman, R. (2017). Introduction to the New Statistics: Estimation, Open Science and Beyond. New York: Routledge.
9 Marsden, E., Mackey, A., & Plonsky, L. (2016). The IRIS Repository: Advancing research practice and methodology. In A. Mackey & E. Marsden (Eds.), Advancing Methodology and Practice: The IRIS Repository of Instruments for Research into Second Languages (pp. 1–21). New York: Routledge.
10 Mackey, A., & Gass, S.M. (2016). Second Language Research: Methodology and Design. New
York: Routledge.
11 We checked the article’s Appendix and the journal’s webpage, as well as checking on
IRIS. We then wrote to the authors, but they no longer had copies of the data collec-
tion materials.
7
EXECUTING AND WRITING UP YOUR
REPLICATION STUDY
Analysis, Results, Discussion, and Conclusion
7.1 Introduction
In Chapter 6, we critiqued the research design of an original study (research
questions and methodology), followed by recommendations for executing and
writing up a replication study using published replication studies as models (see
also Appelbaum et al. 2018). This chapter extends the execution and writing up
of a replication study to the analysis, results, discussion, and conclusion. Many
of our critiquing and reporting strategies are similar to those used in Chapter 6
because our aim in writing up our replication study will again be to systematically
highlight similarities and differences between the original and replication studies,
including justifications for any differences.
Our focus so far has been methodological and we have not thought very
much about how a variable modification might require a new or different set
of analyses. A related challenge is how to deal with partially described analytical
procedures in the original study. It is for these reasons that our critique of an original study's research methodology is important for replication: in
order to make informed decisions about if and how to implement alterations and
additions, we need to critique and understand the original study’s analytical pro-
cedures. For example, as discussed in earlier chapters, we might want to address
concerns about data measurement or statistical procedures.
the group sample sizes may be unequal, variances may not be homogeneous, or particular codings and/or variable types may not be compatible with the type of statistic used. These would be good grounds for considering alternatives, as sketched below.
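As one concrete illustration of such an alternative: where group variances look unequal, Welch's t-test is a standard substitute for the equal-variance t-test, available in Python via scipy. The scores below are invented for illustration; they are not data from any study discussed here.

from scipy import stats

# Invented post-test scores for two groups of unequal size and spread:
group_a = [61, 64, 58, 72, 69, 66, 70, 59, 75, 68]
group_b = [52, 80, 47, 90, 55, 85]

# equal_var=False requests Welch's t-test, which does not assume
# homogeneous variances across the two groups.
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_welch, p_welch)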
7.1.1 Analysis
A study’s analytical and statistical procedures need to be detailed enough for us to
understand and evaluate a study’s findings. The reader should be clearly aware of
how the data was coded and analysed, including any justifications, as well as the
guidelines for interpreting findings. In this section, we critique B&K’s analysis,
as reported in the section titled “Analysis” (p. 213). The purpose of our critique
is to understand how the original study’s data was analysed in order to replicate
these procedures as closely as possible. However, we must also be aware of the
impact of particular analytical decisions on outcomes, and so we will also consider
alternative procedures, if appropriate, which might increase the transparency and
robustness of the analysis. When it comes to writing up our replication study’s
analysis, we must also ensure that we fully document any changes followed by
justifications for them.
• How are analytical similarities reported between the original study and
the replication study?
{ Do you notice any phrases and/or word choices that are useful to
highlight between-study similarities?
• Do the authors add any new analyses, or present their analyses in a dif-
ferent way?
• What aspects of the data analysis are the same between the studies?
• What aspects of the data analysis are different between the studies?
{ Why did he choose to use the original study’s coding in the end?
{ If yes, what was the impact of that result on the study’s findings?
• If I implement two analyses, how will I interpret the results if they contrast?
In order to ensure comparability, the replication study used the same coding
procedures as the original investigation. Thus, all data was counted for
c-units, defined as “utterances, for example, words, phrases, and sentences,
grammatical and ungrammatical, which provide referential or pragmatic
meaning to NS–NNS interaction” (Foster 1998: 8, referring to Pica et al.
1989 and Brock 1986).
(Eckerth, 2009, pp. 116–117)
We see that Eckerth notes there might be better ways to analyse spoken lan-
guage but implementing such a change could reduce comparability with the
original study. As such, the potentially different analytical procedure is presented
(which can be later addressed in the discussion), but the original procedure is
implemented “to ensure comparability”.
Alternatively, both analytical procedures could be implemented. For example,
it would be permissible to include the analysis as described in the original (“a”
and “the” merged) as well as a different analysis (“a” and “the” analysed sepa-
rately). McManus and Marsden’s (2018) replication calculated effect sizes both
similarly and differently to the original study, and then presented both sets. The
additional/different calculations included “within-group ES [effect sizes] cor-
rected for the dependence (correlation) between the means” (pp. 465–466), which
were presented in the journal’s online supplementary materials.
In our example of analysing “a” and “the” together and then separately, we
could opt to include both sets of analyses in the main body of the paper, in the
online supplementary materials, or as an appendix. It would be important to
compare the two sets of results, however, to verify the extent to which the dif-
ferent analytical procedures led to a different/similar patterning of results. For
example, do the two analyses converge or do they suggest different patterns of
findings? In the latter case, this could be an important discussion point later on in
the replication study write-up.
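To make this concrete, here is a minimal sketch in Python of how the merged and separate analyses might both be computed from one coded data set. The table layout, column names, and values are illustrative assumptions, not B&K’s or Eckerth’s actual data:

import pandas as pd

# Hypothetical long-format coding sheet: one row per article use.
df = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2],
    "group": ["Direct CF"] * 6,
    "test": ["pre", "pre", "post", "pre", "post", "post"],
    "article": ["a", "the", "a", "the", "a", "the"],
    "correct": [1, 0, 1, 1, 1, 0],   # 1 = accurate use, 0 = inaccurate
})

# Analysis 1 (as in the original study): "a" and "the" merged.
merged = df.groupby(["group", "test"])["correct"].mean().mul(100)

# Analysis 2 (our addition): "a" and "the" analysed separately.
separate = df.groupby(["group", "test", "article"])["correct"].mean().mul(100)

print(merged, separate, sep="\n\n")

Keeping both computations side by side in one analysis script makes the convergence check described above straightforward.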
Although the discussion provides an important space for comparing results
between studies, we can also use standardized effect sizes (e.g., Cohen’s d) to
make between-study comparisons, as follows: “Between-group ES are provided
for each of McManus and Marsden’s (2017) groups using the mean and standard
deviation of the relevant group from McManus and Marsden (2017) as the ‘com-
parison/control’ group” (McManus & Marsden, 2018, p. 466). This comparison
method used the means and SDs from the original study to draw between-group
comparisons with the replication study (see McManus & Marsden, 2018, Table 4).
In our case, between-group comparisons would be helpful in determining the
effectiveness of the different treatments between studies, especially since our
intentional variable modification is L2 proficiency. This would allow us, for
example, to directly compare the effectiveness of the same instructional treat-
ment among advanced (original) and upper-intermediate (replication) L2 writers.
Furthermore, although B&K did not calculate effect sizes (a limitation discussed
further below), they provided means and SDs, which we can use to calculate
effect sizes for our replication.
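Because Cohen’s d can be computed from published means, SDs, and group sizes alone, no raw data are needed for this step. The following Python sketch uses the pooled-SD formula for d and a common large-sample approximation for its 95% CI; the input values are invented for illustration, not B&K’s published statistics, and variants such as Hedges’ g (a small-sample correction) could be substituted:

import math

def cohens_d_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Between-group Cohen's d from published means, SDs, and ns,
    with an approximate 95% CI (large-sample variance estimate)."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    # Common approximation to the sampling variance of d.
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    half_width = 1.96 * math.sqrt(var_d)
    return d, (d - half_width, d + half_width)

# Illustrative values only, NOT B&K's published statistics.
d, ci = cohens_d_from_summary(m1=85.0, sd1=12.0, n1=20,
                              m2=70.0, sd2=15.0, n2=12)
print(f"d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")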
Now, having summarized some of the different ways in which we can high-
light similarities, differences, and additions between the original study and our
replication, let us proceed to writing up our replication study’s analysis. Before
doing so, let us take stock of the major points (and critiques) of B&K’s analysis:
•• “A” and “the” were identified, but it is unclear how “omissions” were
examined.
•• Results for “a” and “the” were merged into the same analysis, but separate
analysis might be appropriate.
•• Accuracy was calculated as suppliance in obligatory contexts, but target-like
use is an alternative accuracy calculation.
•• Inter-rater reliability was high for identification and categorisation of errors.
•• Descriptive statistics (means and SDs of group accuracy percentages) were
calculated at each test point (pre-test, post-test, delayed post-test).
•• One-way ANOVA used to test pre-test differences.
•• Repeated-measures ANOVA used to compare change over time, but unclear
whether assumptions were tested.
•• Only treatment groups were included in the repeated-measures ANOVA.
•• One-way ANOVA used for posthoc tests, but with uncorrected alpha level
leading to potentially false-positive results (Type I error, see Chapter 3).
•• Alpha level not stated in the analysis section (but included later in the results).
•• CIs not included.
•• Effect sizes not calculated.
Using the above summary, and the models from Eckerth (2009) and McManus
and Marsden (2018), we provide below an example write-up of our replication
study’s analysis.
Analysis
Consistent with B&K (2010), for each text at each test point (pre-test, post-
test, delayed post-test), we identified all instances of “a” and “the” used to
refer to first and subsequent mentions. We coded all uses as “correct” or
“incorrect”, scored as 1 or 0, respectively. For correct and incorrect usage,
we calculated for all groups at each test point the means, SDs, and 95% CIs.
Accuracy of “a” and “the” use was calculated as suppliance in obliga-
tory contexts (%SOC) in percent (Ellis & Barkhuizen, 2005).3 Although
%SOC can lead to inflated accuracy rates because inappropriate uses are not
accounted for, in contrast to target-like use (see Pica, 1983), this replication
study measured accuracy using %SOC to ensure comparability with the
original study. In contrast to the original study, however, we present two
analyses of accuracy. First, following B&K (2010) we analyse “a” and “the”
together. Second, because “a” and “the” are understood to be functionally
distinct, we present separate analyses for “a” and “the” to examine accuracy
similarities and differences over time.
As in the original study, we calculated inter-rater reliability as percentage
agreement. Two raters coded all of the data separately, and their codings
were compared, which revealed 100% agreement on the identification of
errors (accurate vs inaccurate use) and 97% agreement on the categoriza-
tion of errors (first mention error vs subsequent mention error).
As all data sets were normally distributed (according to visual checking of normality using Q–Q plots, and Shapiro–Wilk tests), we present the results of parametric tests (ANOVAs).
First, we tested for parity (one-way ANOVA) at pre-test between all groups, which indicated no differences between the groups (CIs passed through zero, indicating no reliable differences, and Cohen’s d effect sizes were marginal, d = .09, p > .05). Second, a 4 × 3 repeated-measures ANOVA was conducted, with
Group as the between-subjects factor (Direct CF, Indirect CF, Direct CF +
Oral Review, Control) and test point as the within-subjects factor (pre-test,
post-test, delayed post-test). We set the alpha level at .05. If the repeated-measures ANOVA indicated a statistically significant effect, pairwise comparisons with Bonferroni correction were used as the posthoc tests.
For interpreting magnitudes of change, we present Cohen’s d effect
sizes and 95% CIs for d for all between- and within-subjects paired com-
parisons (and not only statistically significant results). Within-subject effect
sizes were calculated using the mean and SD of the pre-test as a baseline
(and the post-test for effect sizes at delayed post-test). CIs that did not pass
through zero were considered reliable indicators of change (Field, 2013).4
Between-group effect sizes are additionally provided for each of B&K’s
(2010) groups using the mean and SD of the relevant group from B&K
(2010) as the comparison/control group. We primarily draw on Plonsky
and Oswald’s (2014)5 Cohen’s d field-specific benchmarks for interpreting
our d values (within-subjects: 0.60 (small), 1.00 (medium), 1.40 (large);
between-subjects: 0.40 (small), 0.70 (medium), 1.00 (large)).
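As a practical aside, the accuracy calculation and normality checks described in this write-up can be sketched in a few lines of Python: %SOC for each learner, then a Shapiro–Wilk test and a Q–Q plot. The counts below are invented for illustration:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical per-learner counts at one test point: correct
# suppliances of "a"/"the" and total obligatory contexts.
correct = np.array([18, 15, 20, 12, 17, 19, 14, 16])
obligatory = np.array([20, 20, 22, 18, 20, 21, 17, 19])

# Accuracy as suppliance in obligatory contexts, in percent (%SOC).
soc = 100 * correct / obligatory

# Shapiro-Wilk test of normality (p > .05 suggests no evidence of
# non-normality at this sample size).
w, p = stats.shapiro(soc)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")

# Q-Q plot as the accompanying visual check.
stats.probplot(soc, dist="norm", plot=plt)
plt.title("Q-Q plot: %SOC accuracy scores")
plt.show()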
7.1.2 Results
Up to this point, we have closely examined the original study’s research ques-
tions, methodology, and analysis. Each time we have critiqued the original study in order to evaluate its design, followed by considerations for executing
and then writing up our replication. In a sense, much of the hard work is now
complete. In this section, our focus is on presenting our results, following the
analytical procedures previously presented. As with the other components of our
replication, we aim to closely follow the original study.
B&K’s results (pp. 213–214) are presented in data tables and a graph.
Statistical tests are reported both as running text and in table form.
Before we proceed, examine the layout and structure of B&K’s results.
For presenting results of this kind, Larson-Hall (2017)6 recommends a series of “data accountable graphics”, which visualize data about central tendency, dispersion, and outliers. For repeated-measures studies (like B&K),
Larson-Hall recommends parallel coordinate plots which can provide informa-
tion about the learning trajectories at both the group and individual levels, as
shown in Figure 7.1.
The parallel coordinate plots in Figure 7.1 provide information about each
individual’s learning trajectory (e.g., do all participants improve, do all partici-
pants trend in the same direction, do participant scores cluster together or is
there a large amount of dispersion?). In visualizing our data, we should consider
using data accountable graphics that do not plot group means only. Although
our choice of data visualization would be different from the original study, both
graphics minimally contain the same information. The main difference is that our
parallel coordinate plots would additionally visualize individual performance as
well as group performance.
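A parallel coordinate plot of this kind is straightforward to produce. The following Python/matplotlib sketch draws one line per learner from pre-test to post-test and overlays the group mean; the scores are simulated for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical accuracy scores (%) for one group of 20 learners.
pre = rng.uniform(30, 70, size=20)
post = np.clip(pre + rng.normal(15, 10, size=20), 0, 100)

fig, ax = plt.subplots()
# One line per individual: this is what makes the graphic
# "data accountable" - it shows trajectories, not just means.
for p0, p1 in zip(pre, post):
    ax.plot([0, 1], [p0, p1], color="grey", alpha=0.6)
# Overlay the group mean for reference.
ax.plot([0, 1], [pre.mean(), post.mean()], color="black",
        linewidth=3, label="group mean")
ax.set_xticks([0, 1])
ax.set_xticklabels(["PRE-TEST", "POST-TEST"])
ax.set_ylim(0, 100)
ax.set_ylabel("Accuracy (%)")
ax.legend()
plt.show()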
Next in B&K’s results is the presentation of the statistical tests (ANOVAs).
As previously mentioned, B&K included in their statistical analyses the three
treatment groups only, excluding the control group (see p. 214). This inclusion
method could be followed in order to ensure comparability with the original study, but a more appropriate method would arguably be to include all groups
to examine the effectiveness of the different treatments (a) in comparison with
the other treatments and (b) in comparison with no treatment.
FIGURE 7.1 Parallel coordinate plot showing individual changes from pre-test to post-test (two panels, pre-test and post-test, with accuracy scores from 0 to 100).
B&K first present the results of a one-way ANOVA, used to compare pre-
test scores, to examine pre-test accuracy differences between the groups. The
results of the ANOVA indicated no statistically different scores between the
treatment groups. That said, inspection of the descriptive statistics and the line
graph suggest some between-group differences.7 As we saw above, calculating
95% CIs and effect sizes would have been helpful in better interpreting this result.
Having determined that the treatment groups’ pre-test scores were not statistically significantly different, B&K’s presentation continued to the results of the repeated-measures ANOVA in Table 2 (p. 214).
B&K state that “there was no significant interaction between Time and written CF type”, although this rests on a very strict interpretation of the result: p = .051 against an alpha level of .05. It seems to us that this borderline effect should not be dismissed quite so quickly; the use of effect sizes and inspection of CIs would indicate the extent to which the interaction between Time and CF type was (or was not) a meaningful result. Furthermore, the main effects of CF type and
between the three groups and significant differences over the three testing times
in terms of accuracy” (p. 214).
As we know, a series of one-way ANOVAs were conducted as posthocs (rather than pairwise contrasts and/or planned comparisons, see Larson-Hall, 2016), in which an adjustment appears not to have been made to the alpha level (e.g., Bonferroni correction) to account for the multiple tests being carried out on the same data set. A Bonferroni correction to the alpha, for example, can reduce the chances of obtaining false-positive results (Type I error). This case provides a strong argument in favour of calculating effect sizes, but none were included, as the following excerpt shows:
One-way ANOVAs revealed that the differences between the three groups
were significant at the time of the immediate post-test (F [3, 52] = 6.69;
p = .001) and the delayed post-test (F [2, 52] = 4.67; p = .006). Tukey’s
post hoc pair-wise comparison (with an alpha level of .05) was performed
to isolate the significant differences among the three groups. These indi-
cated that at the time of the immediate post-test, participants in the three
treatment groups significantly outperformed those in the control group, but
that the three treatment groups did not differ from each other. However,
at the time of the delayed post-test, the participants who received indirect
feedback could not sustain this improvement and therefore did not differ
significantly from those in the control group.
(Bitchener & Knoch, 2010, p. 214)
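For readers wanting to apply such a correction themselves, the following Python sketch runs a Bonferroni adjustment over a hypothetical family of posthoc p values using statsmodels; the p values are invented for illustration:

from statsmodels.stats.multitest import multipletests

# Hypothetical p values from a family of four posthoc comparisons.
pvals = [0.001, 0.006, 0.030, 0.048]

reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05,
                                             method="bonferroni")
print(f"Bonferroni-adjusted alpha: {alpha_bonf:.4f}")   # .05 / 4 = .0125
print("adjusted p values:", [round(p, 3) for p in p_adj])
print("still significant:", list(reject))

Note how the two larger p values, significant against an uncorrected .05, no longer survive correction – exactly the Type I risk described above.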
Furthermore, the results of the posthoc tests were not presented in the analy-
sis, and so the reader is unable to examine the nature of the differences or the
results of the statistical tests. We have also noted that posthoc tests appear to have
included the control group, but this group was absent from all other statisti-
cal tests. For example, B&K state all treatment groups’ scores were statistically
significantly higher than control group scores at post-test, but differences at the
delayed post-test were not statistically significant between the control group and
the Indirect CF group. Because the results of the posthocs were not included, we
also do not know the extent to which differences between the treatment groups
emerged at the delayed post-test. Effect sizes can be extremely helpful in this
regard, calculated using the original study’s descriptive results (means and SD).
To this end, calculating between-group effect sizes (e.g., Cohen’s d, Hedges’
g) is strongly encouraged when statistical tests are not reported. This process
helps the potential replicator get a clear idea of the findings and their interpreta-
tion. Effect size comparisons indicated three results that contrasted with the original study’s claims.
In sum, B&K’s reliance on group means and strict p-value-based interpretations has arguably presented only a partial understanding of performance. When writing up our analysis we will want to take these points into account, and aim to present our results in a way that both advances the original study’s presentation and addresses its limitations (many of which are presented in our Analysis section).
Critically read through Eckerth’s (2009) reporting of the results (pp. 118–120).
Pay particular attention to reporting strategies and the structuring of the results.
Think about:
• What is the purpose of Eckerth’s first paragraph in the results section (p. 118)?
• To what extent do the original study and Eckerth’s replication follow the
same presentation structure?
{ How do we know this?
• What is the purpose of Eckerth’s use of subtitles in the results section?
• How does Eckerth summarize similarities and differences between the
original study and the replication study?
Before proceeding to the writing up of our results, let us first confirm the analyses we will present (as a reminder, see also Section 7.1.1):
•• Table presentation of group means, SDs, and 95% CIs at pre-test, post-test,
and delayed post-test.
•• Data visualization using parallel coordinate plots.
•• Include all groups in statistical analyses.
•• Present results of all statistical tests (omnibus and posthocs).
•• For all tests, provide p values, 95% CIs for the test statistic, effect sizes, and
95% CIs for effect sizes.
•• Interpret our statistical tests using effect sizes, and following interpretations
presented in “Analysis”.
•• Calculate between-group effect sizes using the mean and SD of the relevant
group from B&K as the comparison/control group.
As the above list suggests, we are largely following the same results presenta-
tion structure as the original study. Indeed, Eckerth (2009) also followed the
structure of the original study in presenting his findings, as follows:
In line with the structure of Foster’s paper, the data will be presented
according to the research questions posed earlier: language production
(Tables 2 and 3 in the e-supplement, section 4.1), comprehensible input
(Tables 4 and 5 in the e-supplement, section 4.2) and modified output
(Tables 6 and 7 in the e-supplement, section 4.3).
(Eckerth, 2009, p. 118)
Following the same results presentation structure as the original study aids between-study comparison. This approach might not always be possible, and it
may well depend on the nature of the replication and the extent of variable modi-
fication. Following both our previous critique and this model from Eckerth, we
could describe our reporting of results as follows:
In line with the structure of the original study, our results will be presented
in a very similar manner to B&K: descriptive statistics (Table 7.1), data vis-
ualization (Figure 7.1), statistical tests, and effect size comparisons between
the original study’s groups and this replication study’s groups (Table 7.2).
We are thus preparing our reader for a very similar presentation order/structure.
However, as we know, that will not be sufficient because we will want to indicate
when the original study and replication study’s results are similar and when they
are different. Table 7.1 is an example of how we could present the descriptive
statistics from both the original and the replication, following the same general table structure as the original study:
TABLE 7.1 Descriptive statistics for percentage group mean accuracy scores (mean, 95% CIs [LL, UL], SDs) at pre-test, post-test, and delayed post-test

                                Pre-test             Post-test            Delayed post-test
Group                     n     M [95% CIs]    SD    M [95% CIs]    SD    M [95% CIs]    SD
Direct CF
  B&K                    12
  Replication            20
Indirect CF
  B&K                    27
  Replication            20
Direct CF + Oral Review
  B&K                    12
  Replication            20
Control
  B&K                    12
  Replication            20
In Table 7.1, following the original study, we have used a similar layout and
have presented the same descriptive statistics (means, SDs), but have added 95%
CIs. Our table also presents our results alongside those from the original study to
draw attention to any differences and similarities. Depending on how your results
turned out, it would be helpful to also draw readers’ attention to any important
similarities/differences between the original study and the replication in the text.
For example, Eckerth’s (2009) comparison of language production data in the
original and in the replication was noted as follows: “In relation to the over-
all amount of language production, the extent of output modification is rather
limited, as it is also in Foster’s data” (Eckerth, 2009, p. 120). If our descriptive sta-
tistics showed similarities to the original study (e.g., similar accuracy proportions,
high accuracy levels), we should also indicate that information using a similar
strategy to Eckerth. Our purpose remains to highlight similarities and differences
between our results and those of the original study.
Our visualization of data trends may indicate differences with the original
study, however. This seems plausible because we are going to use parallel coor-
dinate plots, which show individual performance. If our graphic presented a
different patterning of results to the original study, we would want to draw our reader’s attention to the nature of the difference, which may simply reflect our use of a different type of graphic.
Continually making readers aware of the results in the original study as compared
with the replication study is a vital component of your comparative presentation.
In both Eckerth (2009) and McManus and Marsden (2018), readers were
informed how the results were different, which is essential to the replication.
Highlighting a difference is followed by a description of the nature of the difference: “Such a result is in conflict with Foster’s scores, which show the same ratio for the groups, but the reverse for the dyads” (Eckerth, 2009, p. 118); “In contrast, McManus and Marsden’s (2017) L2+L1 group had RTs that were significantly slower in mismatched compared to matched trials at both Post and Delayed (medium Effect Size)” (McManus & Marsden, 2018, p. 467). Not only
is a difference in result important to note, but it is equally important to indicate
whether the difference is perhaps due to different analyses and/or procedures.
For example, we noted that the original study’s statistics excluded the control
group from the omnibus tests but included them in the posthoc tests. Assuming
we analysed all groups together, we would want to note this difference, as it may
lead to a different type of result in the omnibus tests.
Furthermore, since the original study did not present the posthoc statistics in
the results, we would want to note that point. We would also want to clarify that
our interpretation is based on effect size calculations (Cohen’s d with 95% CIs),
whereas the original study’s interpretations were based on p values.
Our last point in this section refers to summarizing findings between stud-
ies. We noted that we will calculate between-group effect sizes using the mean
and SD of the relevant group from B&K as the comparison/control group, as
conducted in McManus and Marsden (2018), which is a helpful way to sum-
marize differences between studies using a standardized effect size, as shown in
Table 7.2.
Table 7.2 summarizes the differences and similarities between the original study and the replication study because (1) it can capture differences between groups at each test phase, and (2) it can capture changes by adjusting for baseline differences (i.e., pre-test differences). This latter point is important because
it is a standardized means of assessing improvement between studies. Although
our Table 7.2 only compares differences between the same treatment groups
TABLE 7.2 Effect size comparisons (Cohen’s d with CIs for d) with treatment groups from Bitchener and Knoch (2010), and effect size changes with effects adjusted for baseline differences

Replication group        vs Control    vs Direct CF    vs Indirect CF    vs Direct CF + Oral
Pre-test
Post-test
Delayed post-test
Pre–post d change
Pre–delayed d change
in each study (i.e., Control vs Control, for illustrative purposes), you would
want to extend it to compare differences between all groups (e.g., Direct CF
vs Control). This approach additionally accounts for measurement differences
by using Cohen’s d, a standardized effect size (see Cumming & Calin-Jageman,
2017). Eckerth (2009) also provides a summary table of raw quantitative results
(see Eckerth, 2009, table 2, p. 120), but this would only be a helpful summary if
the groups were matched (i.e., no pre-test differences) and the data was measured
in the same way.
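As a sketch of how the “d change” rows of a table like Table 7.2 might be filled in, the following Python code computes the between-study d at pre-test and at post-test and then subtracts the former from the latter. This is one simple way of adjusting for baseline differences; the summary statistics are invented for illustration, not B&K’s:

import math

def d(m1, sd1, n1, m2, sd2, n2):
    """Between-group Cohen's d (replication group vs. original group)."""
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

# Illustrative (mean, SD, n) tuples for the two Control groups.
rep = {"pre": (42.0, 14.0, 20), "post": (55.0, 13.0, 20)}
bk  = {"pre": (48.0, 12.0, 12), "post": (52.0, 12.0, 12)}

d_pre = d(*rep["pre"], *bk["pre"])
d_post = d(*rep["post"], *bk["post"])

# Subtracting the pre-test d from the post-test d means the "d change"
# reflects between-study differences in improvement rather than in
# starting point.
print(f"pre-test d = {d_pre:.2f}, post-test d = {d_post:.2f}, "
      f"pre-post d change = {d_post - d_pre:.2f}")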
Alternatively, and perhaps more traditionally, providing a narrative summary
of between-study trends can indicate the important similarities and differences
between studies. A narrative summary also allows for the analytical/procedural
differences to be highlighted, as follows:
As previously noted, the authors call for future research examining “the rela-
tive effectiveness of all types of indirect and direct feedback when given to L2
writers of different proficiency levels” (Bitchener & Knoch, 2010, p. 216). In
interpreting/evaluating the findings, however, B&K offer almost no mention of methodological factors that may limit the generalizability of their findings,
as we discussed in this chapter and Chapter 6, including, for example, unequal
numbers of participants in each of the groups, and little information about how
“advanced” is to be understood and interpreted.
In the next section, we discuss the writing up of a replication study’s discus-
sion and conclusions. While many of the attributes of writing up the discussion
and conclusions of original research apply to a replication study, there are a num-
ber of important attributes that are different.
• To what extent and in what ways does the original study feature in the
replication study’s discussion and conclusions?
For the discussion, re-familiarizing the reader with the original study is an
important feature of the replication study. Here, you will draw out the main
design features of the original study. In short, your discussion will begin with a type of “executive summary”, as modelled in Eckerth (2009).
Eckerth summarizes the original study’s main findings in “plain” English (i.e.,
without statistical details) in order to prepare the ground for the imminent sum-
mary of the replication study’s main findings. In our case, we would want to
summarize B&K’s main findings: essentially that all types of Written CF were
argued to improve L2 written accuracy immediately after instruction, but that
between-treatment effects appeared eight weeks later at delayed post-test. More
specifically, B&K argued that the absence of differences between the indirect CF treatment and the control group at delayed post-test indicated that, compared with direct CF, indirect CF appeared less effective for improving L2 written accuracy among advanced-level learners.
Briefly restating what the replication set out to accomplish can be a helpful
follow-on paragraph, in which the questions driving this replication are stated, as
well as a brief rationale for these particular questions. For example, in our case, our variable modification was L2 proficiency, and a brief statement about the relevance of this particular modification would be well received. In McManus and
Marsden (2018), the general research problem is restated, followed by a unified
set of aims that both the original study and the replication investigated. This can
be a good way of aligning both studies’ aims, as follows:
Whereas the benefits of L2 EI and L2 practice are well researched to date, the
current study addressed the role of L1 practice in L2 learning. Classroom-
based evidence has suggested benefits of L1 EI (González, 2008; Spada
et al., 2005), but has not examined L1 practice in L2 learning. Although
that research had different designs to McManus and Marsden (2017) and
this replication, our findings broadly align with it, extending it to show L1
EI plus practice benefited L2 offline and online performance more than L1
practice alone. In short, L1 practice without L1 EI provided few learning benefits.
(McManus & Marsden, 2018)
Neither the original nor the replication study could confirm the over-
riding effect of task type (required vs. optional information exchange) on
the amount of language production and meaning negotiation established by
former studies (e.g. Pica & Doughty 1985; Doughty & Pica 1986). Foster
explains her results as a consequence of the learners’ adaptation of the tasks
[. . .] With regard to dyadic task completion, such task adaptation strategies
are unequivocally confirmed by the findings of the replication study.
(Eckerth, 2009, p. 121, our emphasis)
•• Did the new data collection site perhaps result in a different patterning of results?
•• Would a different data collection site be a possible explanation for the dif-
ferent results?
•• What are the implications of these differences/similarities for understanding
the phenomenon under investigation?
•• Could certain research design and analysis choices in the original study,
which were ironed out in the replication, be a reason for the contrasting
findings?
Notes
1 Pica, T. (1983). Methods of morpheme quantification: Their effect on the interpretation
of second language data. Studies in Second Language Acquisition, 6.1, 69–78.
2 Larson-Hall, J. (2016). A Guide to Doing Statistics in Second Language Research Using SPSS and R. New York: Routledge.
3 Ellis, R., & Barkhuizen, G.P. (2005). Analysing Learner Language. Oxford: Oxford
University Press.
4 Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics. London: Sage.
5 Plonsky, L., & Oswald, F.L. (2014). How big is “big”? Interpreting effect sizes in L2
research. Language Learning, 64.4, 878–912.
6 Larson-Hall, J. (2017). Moving beyond the bar plot and the line graph to create informative and attractive graphics. Modern Language Journal, 101.1, 244–270.
7 On this point, see below for Cohen’s d effect size comparisons showing differences between
the groups, contrary to B&K’s claims that the groups were not different at pre-test.
8
DISSEMINATING YOUR RESEARCH
Research is clearly of little value if it does not get out to those who should be interested in reading it. This academic community has traditionally been reached through paper publications such as books and academic journals, and increasingly uses electronic communication, blogs, conference papers, and poster sessions to become aware of new research.
It follows that those working in replication research need to have similar outlets made available to help them disseminate their work. Limiting such possibilities only serves to keep the community oblivious to potentially crucial additional information regarding both the original studies and their replications. The importance of replication research is also being further encouraged through dedicated research grants, such as those from the US-based National Science Foundation, which “expects that these activities will aid in verification of prior findings, disambiguate among alternative hypotheses and serve to build a community of practice that engages in thoughtful reproducibility and replicability efforts.”1
Together with the recent calls for more replication research in AL noted in the
Introduction, we have witnessed the consequent demands for further – or more
effective – outlets for dissemination and/or the incorporation of such work into
existing publications. The argument has even been made that there is a place for
a journal specifically dedicated to – in the present case – AL replications. This,
however, runs the risk of taking the important act of replication out of the mainstream of AL research dissemination and potentially marginalizing it – when our
intention is that it gains a much wider readership and greater funding.
8.1 Journals
For the moment – and mainly in response to the recent increased prominence given to such work – a number of high-ranking AL journals devote space or specific sections to replication research.
» Activity 39
Language Learning
Applied Linguistics
TESOL Quarterly
Language Teaching
System
RELC Journal
CALICO Journal
» Activity 40
i) why you think the underlined advice has been included; and
ii) discuss whether you feel any specific guidance below might be usefully added/applied to the advice you discovered above in AL journals.
****
Overview of procedure
1. . . . First, the article is submitted to the replication section of the
journal. The journal sends out the replication study to the original
author(s) asking for feedback within 60 days. At the end of 60 days,
either with or without feedback from the original author, the journal
[. . .]
****
1. Name and author(s) of the study that has been selected for replication.
The study can have been published in any reputable journal, and is not
confined to Human Factors
2. Why the study is worthy of replication. The reasons should be one or
more of the following:
a) The study forms the basis of an important theory, model, interven-
tion or other significant finding in the HFES literature
b) The study is controversial in some way
c) The study is highly cited or often viewed (please provide numbers)
3. Who will do the replication, and whether the researchers will be from a
single lab or multiple labs (multiple labs encouraged)
4. Whether the original author will be included in the research group
(encouraged)
Once you have targeted the journal, you might want to focus more closely
on the kind of papers published (i.e., have they published/do they publish
replication studies?) as well as the participants and/or methodological procedures
of the papers that have been published over the last few years in that journal. For
example, if you carried out your replication using quantitative statistical proce-
dures and you notice that the journal mainly publishes papers with qualitative
approaches to analysis, you might find this journal – initially at least – unwilling
to consider your research for publication.
If you dig deeper into these papers, you will likely find indicators of the kind
of methodological approach favored. The recently inaugurated Journal of
Second Language Pronunciation, for example, encourages:
» Activity 41
Investigate the websites of these journals and compare (i) the specified main focus of the publication, (ii) the editorial board (specialisms),
(iii) any available lists of reviewers (often listed in year-end issues),
(iv) the kind of papers published (stated and as revealed in issues),
(v) the kind of methodological approach favored in research (if stated),
and (vi) any information on length of papers (with references), and
online repositories.
Language Learning
Applied Linguistics
TESOL Quarterly
Language Teaching
System
RELC Journal
CALICO Journal
Impact indices of articles/journals you may cite or, indeed, the target study
itself are also important considerations. It is true that the impact factor remains
to date the yardstick used by many governmental and non-governmental institu-
tions to assess quality in research. Care should be taken, however, in assigning
value to only one index. The AAAL promotion and tenure guidelines mentioned
above also remind us that:
There are many reasons . . . for which impact factor alone cannot adequately
determine the value of a given journal: the time between manuscript sub-
mission and publication in the field of applied linguistics often exceeds
the two years used in the impact calculations, reducing the impact factor
of applied linguistics journals. There is also a growing realization that not
all citations are necessarily positive (and thus cannot be the determinant of
quality), nor does the determination of the impact factor always take into
consideration the practice of self-citation.
» Activity 42
Below are a number of impact indices and factors which are typically
quoted on journal websites. Try to find out more information about each
one, including its strengths and weaknesses, and how it is calculated.
Then investigate which of these is considered of greater importance or
significance in your local publishing context.
Immediacy index
Eigenfactor score
Altmetric
CiteScore (Elsevier)
H-Index
G-Index
The general topic of the original paper is one that continues to generate
much debate.
The original paper is one that continues to generate much debate.
The original paper continues to be cited in publications.
As a potential author for the journal, you face the initial challenge of convinc-
ing both the editor and the reviewing panel that your chosen study is sufficiently
important to have been replicated in the first place. You will need to be very
clear about the perceived need, therefore.
A study that is not only still cited after some time in the literature but is also a continued subject of debate demonstrates its perceived ongoing relevance.
We can assume such a study has some sustained significance for the field at least
and, specifically, for that journal’s readership. Your accompanying argument for
publication will doubtless want to emphasize how your replication endeavors to
further illuminate that significance and contribute to that debate.
At the same time, and given this continued interest in the paper, the edi-
tor could reasonably expect the study already to be “familiar” to the readers.
However, your “new” approach to it will be expected to throw new light on
the outcomes. You will need – in your defense of this quality – to cite this recent literature, indicate which aspects of the original study are the subject of discussion, and show how this led you to decide on replicating it and what you perceive its contribution to be.
The original paper’s findings are not consistent with previous or subsequent
work in the area.
Once again, much depends here on the study in question still being cited or, at
least, remaining on the radar of those working in this field. If an (historically) key
piece of work stands out because its findings did not fit in with the general trend
of those in similar studies, you might reasonably argue for the need to revisit it,
either to shed light on what might have brought about the atypical outcome or
simply to provide more evidence.
Similarly, we might want to note here whether other replications of the same
study have yielded results which in some way also point to the need for our own
“take” on the replication – particularly if these have presented inconsistent results.
The original study is cited as one of the most significant examples in practice
of a particular theory.
Theory feeds into practice and practice can feed back into theory. Both theory and practice, therefore, must be interdependent if we are to advance our understanding.
We identified earlier certain aspects specific to the original study which argue
for the need to replicate. These might now usefully be highlighted in this accom-
panying justification to the journal editor/reviewers. Limitations noted in the
original study often reveal where the original authors feel future research efforts
relating to the current study need to be directed; replications can contribute to the knowledge base by addressing those limitations in the original study and uncovering remaining pieces of the puzzle.
Our baseline here again is that you have chosen a study which remains of
interest to the readership. The additional contribution to emphasize is that your
replication has picked up on an already-observed constraint on the outcomes in
the original and sought to correct this. In this way, the replication is seen to form
part of a continuing cycle of necessary input. For example, perhaps the original
researcher was unable to apply random selection in the learning context chosen
and wonders whether this might have affected the results. Your taking up of this
aspect directly identified by the original author as of interest immediately high-
lights the importance of the contribution.
The statistical analysis used on the original data can now be improved upon.
Effect size data are not presented or are not convincing.
In your accompanying document to the journal you might also want to justify
the importance of your contribution by claiming, for example, how your analy-
ses have provided greater insight into the veracity – or strength – of the claimed
effect. Improving the statistical power of a study of ongoing interest which has
had no further support beyond that original can also be argued to be more urgent
than another which has already seen such replications. As we have seen through-
out this book, you would have tried to ensure sufficient statistical power was
assigned to your replication to enable you to back up the claims you now make to
justify the importance of your contribution. Similarly, you might want to argue
that your replication builds on the original – adding more power – since your
sample size has been significantly increased (some statisticians recommend 2.5
times the original sample size).
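An a priori power analysis can put numbers on this argument. The following Python sketch, using statsmodels, estimates the power achieved by a hypothetical original sample and the per-group n a replication would need for 80% power; the effect size and sample size here are illustrative assumptions, not B&K’s:

from statsmodels.stats.power import TTestIndPower

# Suppose the original study showed an effect of d = 0.55 with
# n = 13 per group (illustrative numbers only).
analysis = TTestIndPower()

# Power achieved by the original sample size, assuming d = 0.55.
original_power = analysis.power(effect_size=0.55, nobs1=13,
                                alpha=0.05, ratio=1.0)

# Per-group n needed for 80% power to detect the same effect.
needed_n = analysis.solve_power(effect_size=0.55, power=0.80,
                                alpha=0.05, ratio=1.0)

print(f"original power: {original_power:.2f}; "
      f"replication needs about {needed_n:.0f} per group for 80% power")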
A good case can also be made for the contribution your replication presents
by arguing that new light is thrown on previous results through the method of
analysis used. While your replication will want to reflect methods and procedures
as close as possible to the original study, the interval between the original pub-
lication and your own replication may also have seen the appearance of other,
more sophisticated, statistical procedures which could potentially further inform
this data.
Clearly, the result of a replication may just as easily not support the findings
of the original study, and your task would now be to convince those reading it
that these are just as worthy of dissemination. Although confirmatory replica-
tions bring useful extra evidence which should be of interest to the readership, an
outcome which does not fit in with previous results may be particularly deserv-
ing of further attention. Indeed, much of the recent publicity we read about in the Introduction surrounding the lack of replication studies in the social sciences was down to the fact that few replications were being carried out and that those failing to produce the same results as the original were not deemed of interest to journals.
In this scenario, you might find yourself needing to justify the “failure” in
other (positive) terms. There are several possible reasons for such an outcome
and these need not reflect negatively on the original study. For example, your
non-confirmatory outcomes might just indicate a simple case of regression to the
mean: results tend to even out over time and if a variable measured presents an
extreme value the first time it is noted, it will likely tend closer to the average the
next time around. Then again, you might also have received interesting insights
from the original authors (see below) which suggest the existence of false nega-
tives amongst the results – perhaps failing to find evidence of effects that with
hindsight they feel might have been real and which have shown themselves to be
so in your replication.
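A small simulation makes regression to the mean easy to see. In the following Python sketch, simulated learners have stable underlying ability and each test adds independent measurement noise; those who scored at the extreme on the first test score closer to the population mean on the second, even though nothing about them has changed:

import numpy as np

rng = np.random.default_rng(42)

# True ability is stable; each test adds independent measurement noise.
true_ability = rng.normal(50, 10, size=10_000)
test1 = true_ability + rng.normal(0, 8, size=10_000)
test2 = true_ability + rng.normal(0, 8, size=10_000)

# Select the "extreme" scorers on the first test (top 5%).
extreme = test1 > np.percentile(test1, 95)

print(f"extreme group, test 1 mean: {test1[extreme].mean():.1f}")
print(f"extreme group, test 2 mean: {test2[extreme].mean():.1f}")
# The test 2 mean falls back toward the population mean (50):
# regression to the mean, with no real change in the learners.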
A strong argument to be made from all this is that a failure to verify an origi-
nal study – or some aspect of it – is just as important a contribution to science.
What we are attempting to do is participate in what we referred to earlier as the
self-correcting route of science. As more and more replications of a study are
presented, it follows that our knowledge (and perhaps our outcomes) will change.
» Activity 43
Imagine you have a replication research study to publish with the fol-
lowing titles.
Search out suitable journals which you think may be best targeted for the
initial submission, and then target ONE according to the criteria you read
about in “Selecting a Suitable Journal” above. Finally, justify your choice by
summarizing why you think the journal in question is the most suitable in
each case.
Language Teaching invites the original study’s author to provide a short response
to the replication at the end of the published replication. Such post-facto obser-
vations are an essential part of the continuous nature of the scientific endeavor
of course: if the original authors notice a difference in a failed replication that
they believe might have been significant, the onus is on them to mention it and
perhaps call for further replications.
Such controversy, and the importance of involving the original author in your
work, should not be understood as our inviting you to hold back from undertak-
ing replications! While that author has the right to be informed about something
which derives directly from his or her own work, they are not the “guardians”
of the outcomes or object of interest here, and they too should embrace the
attempt to advance our knowledge of this area through what you are doing. As
we suggested in several chapters, AL research has been guilty of accumulating
rather than constructing knowledge in many areas, and retaining “key” findings
or basing theory upon single unreplicated studies can eventually be damaging if
this feeds through to the teaching and textbook end of our business.
We are accustomed to seeing our work discussed in literature reviews or quoted by others. The rather singular
situation in replication research comes about because we are often obliged or
encouraged to contact the original author to obtain more details about the
study’s methodology or outcomes than are immediately available in the publi-
cation itself – often due merely to the lack of space available to journals. At a
practical level, therefore, it makes sense to involve the original author as soon as
one decides to set up the replication.
Conflict may arise when the outcomes of a replication fail to confirm the
original findings in some way or are used to question – implicitly or explicitly –
the ability or integrity of either party in the replication. Reputations matter in
any profession, and both those beginning their careers and established “names”
can be negatively affected by outcomes – and by the way these are expressed in
the paper. Much of this questioning arises when a replication fails to confirm
previous results and a spotlight then seems to shift on to that original author.
It is worth remembering that journals – when they do publish replications – traditionally prefer those that reveal a new effect or disconfirm rather than confirm the original findings (Neuliep and Crandall, 1990).3 To a large extent, then, journal policy might well be exacerbating the situation here (although see Language Learning’s previously discussed implementation of Registered Reports, https://onlinelibrary.wiley.com/page/journal/14679922/homepage/registered_reports.htm).
Where researchers are simply not used to having their work revisited and,
in the best sense of the word, “questioned”, there will be a natural tendency to
react. And if a reaction of some kind is guaranteed, the replicator–author relationship is potentially tense from the off, as the latter feels he or she must be on the defensive. At this point the original researcher can easily feel hurt or even bullied if the resulting replication fails to confirm their original findings.
One of the first attempts to do this came from the “Many-Labs” project (www.
manylabs.org). The initial phase of the project involved over 30 sites across 12 countries and over 6,000 participants in attempts to replicate key findings from 13 studies in psychology.
A similar, albeit more delimited, project in AL would be a welcome addition
to the literature. Setting up larger research groups who are to work on close repli-
cations of one study, for example, will require careful planning and organization,
not least in terms of initial recruitment of sites and members.
» Activity 44
• Key decisions:
How will you decide on a target study which will interest different
groups in different sites?
How will you decide which sites to approach for collaboration? National/
international?
What will be your target profile for the participating researchers? Some
people are good at getting things done, while others are natural communi-
cators and networkers – maybe a mixture of both is helpful?
• Attracting members: how and where to publicize the research group for
maximum interest in the replication objective? (email shots – where?;
online presence – where?; in person – where?; use your supervisor – how?)
• Funding: Department? Faculty? University? National? Professional AL
associations?
• How will feedback/reporting take place? What should be written up?
Skype?
• Scheduling work: reasonable and timely cut-off points for . . . data-
gathering, analysis, and reporting back.
» Activity 45
i) Read the whole paper through to get an idea of the study itself.
ii) Now read section 5.1 and select/highlight what you initially think should
be included in a conference presentation of this study’s methodology.
Justify your choice.
iii) Finally read sections 5.2 through 8 and extract/highlight what you initially
think should be included to be summarized in a conference presentation
of this study’s results, discussion, and conclusion. Justify your choice.
There will likely be a need for greater frequency of textual clues to help the
audience follow the argument. A reader can skip ahead or pause to read a printed
section again; an oral presentation does not allow for this. So take your audience
through the paper. Be aware of the importance of linking words and phrases
such as “nevertheless”, “although”, “therefore”, “it then follows . . .” as well as signposting with listing words such as “first”, “second”, “then”, and “finally”
after initially establishing the structure of the paper. Insert frequent “reminders”
through the presentation such as “the second research question led on from the
first in that . . .”, or effective ways of moving into a new section or idea such
as “following on from this we now move on to . . .”, or “I now move on after
presenting these figures to discuss the importance of those pre-test results once
these post-tests were computed . . .”.
Similarly, edit for – or rid the paper of – jargon and verbiage such as excessive adverbs, empty adjectives, or unnecessary preamble. Consider changing from
the typically (printed) passive (and detached) voice to the active (and personal).
Finally, remember the central place and unique importance of comparison (and
comparison language therefore!) in your replication presentation and in any vis-
ual material offered (see Chapter 7). We would expect to be reminded at regular
intervals (but particularly in the discussion and concluding sections) of how your
work grew alongside – and takes its significance from – the original study.
» Activity 46
Look again at the replication study from the previous activity, focus-
ing on section 6. Imagine you want to deliver this discussion section
to a conference audience who is already somewhat familiar with the
literature on L2 writing but not specifically “pre-task planning”. Use
your selected extracts from the previous activity (i.e., what you thought
should be included in a conference presentation of this study’s out-
comes), together with some of the advice given in the previous section
“Writing the Text”, to write out part of your conference paper. Some
initial examples are given.
“By examining the effect of pre-task planning sub-processes on the
written language production of L1 writers of English, who had presumably
achieved the hypothesized threshold of proficiency in English, this study
used measures of fluency, grammatical complexity, and lexical complexity
identical to those in the original study . . .”
“Unlike the study, which found no impact of pre-task planning on grammatical and lexical complexity and a minimal impact of organization . . .”
1. For times when you are not by your poster and therefore unable to answer
questions, the poster should provide detail about the justification and
nature of your replication to permit basic understanding of what you
have done.
2. It should provide a stimulus for discussion when you are around to engage in conversation. An important consideration is that the poster should
not be crammed with detailed text. Remember: a poster is neither a journal
article nor a conference paper. Balancing text, visuals, and space is perhaps
the most challenging aspect of creating a poster.
• General layout:
{ What tools and software packages are recommended for the crea-
tion of a poster presentation?
{ What presentation guidelines are suggested for interacting at a
conference?
• Sample posters:
Now, let us design a poster for our replication of B&K (2010), guided by our
critique of the original study in Chapters 5 and 6.
You previously looked over some resources and guidelines for creating
a poster, as well as some example posters. Similarly to journal articles and
conference papers, posters tend to be separated into different sections,
but with an important difference: a poster aims to visualize information,
with much less text than both journal articles and conference papers.
For our purposes (and as a guide), we are going to plan for our poster
to include five sections: Introduction/Background, Research Questions, Methodology, Results, and Conclusions.
• Background:
• Research questions:
• Methodology:
• Results:
Now that you have some idea of the poster’s content (at least sketched
out), it will be helpful to choose a template. The template will help you
select how much text you can reasonably include. Before selecting our
template, however, we need to know what size mounting board we can
expect the conference to provide. Our poster will be displayed on this
mounting board. Not all conferences will use the same size mounting
boards, so be aware!
For our purposes, let’s assume we’re going to present our poster at the
AAAL annual conference (www.aaal.org).
Those guidelines might typically specify that the mounting boards for
posters will be “four feet by eight feet in size”, and so we must ensure that
our poster does not exceed these dimensions. To fit comfortably in that
space, we will aim to select a template with the dimensions 36” tall by 48” wide (roughly A0 size), making the orientation landscape.
Now, let’s select a template. The following website provides some exam-
ples to get you started, but check your university library as well:
www.posterpresentations.com/html/free_poster_templates.html.
You’ll notice that some posters have different size specifications, so just
be sure to select a template that is going to fit comfortably on the mounting
board. Once you have selected your template, we are ready to begin.
If you have selected a poster template from the links above (which you don’t have
to, but it might be easier if you do), you will also likely be provided with additional
guidelines on the left and right margins (e.g., tips for making sure images display
correctly). These margins do not print, so you don’t need to worry about deleting
them. Figure 8.1 is an example poster of our replication, filled out with content (an
enlarged version can be viewed at www.routledge.com/9781138657359). We are
going to work through how to get something like Figure 8.1.
Before we start filling our poster template with content, we will fol-
low some guidelines from the previously-mentioned Penn State University
Libraries link:
Aim for a total of 300–500 words on your poster. You won’t simply be
pasting large blocks of text from your paper or your abstract onto your
poster; you need to boil it down to the essence, with explanation and visu-
als as needed. Use a font size slightly smaller than your name for the section
headings on your poster. The rest of the text should be approximately
28-point to 36-point font.
(Penn State University Libraries, http://guides.libraries.psu.edu/c.
php?g=435651&p=2970252)
Title
Let’s start at the top. The title is arguably the most important part of your poster.
A catchy title, with a clear and uncluttered poster layout, is going to attract
people to your poster, and that is what we want! In journal articles, and to some
extent in conference presentations, titles can be long and descriptive, but our
poster title needs to be short and snappy to grab the attention of people passing by. We want to avoid a title that is longer than one line.
In Figure 8.1 our title is relatively short. The title includes the crucial word
“replication”: “Proficiency mediates the effectiveness of corrective feedback: A
close replication”. This title contains four main components because, for us,
these are the main components that define our study, and we want to convey these to our audience.
FIGURE 8.1 Example poster. To view this in colour and in closer detail please visit the eResource at www.
routledge.com/9781138657359.
A helpful way to design your poster’s title could be to boil down the study
to three or four key words and then figure out how you can get these to work
together in a title.
Next are the authors’ names and affiliations. That should be relatively straight-
forward. Small images of our university logos on either side of the authors’ names
provide a useful visual marker of affiliation. You will sometimes see that some
posters include email addresses in this top part too, but, in our case, we have a
bottom corner with that information. Decisions like those will come down to
personal preference.
We are now going to work from left to right in filling our poster with con-
tent. Our first decision will be to rename the section titles and maybe resize them.
How you use the space is up to you, and it is going to be influenced by the nature
of your replication study and the messages you want to communicate. In
our first section, “Aims”, we summarize what our close replication carried
over from the original study:
• The same target features were examined, including what these were.
• The research design: ten weeks, pre-test, post-test, and delayed post-test.
• Restated research aims: (1) effects of written CF on L2 writing, and (2)
differences between different types of CF.
Research Questions
We decided to include a separate section titled “Research questions” to make
them stand out. This is a personal preference and they could be included under
“Aims” if you are looking to save space. We used the same research questions as
presented in Chapter 6.
Methodology
Our “Methodology” section is a full column, but it could be shorter depending
on the nature of the replication study. For example, an intervention study, as
in B&K (2010), requires description of the target structures and the treatments,
which can make the methodology section a little longer when compared to stud-
ies that are not interventions. In our poster, this section is organized under
the following subheadings:
• Participants
• Target structures
• Treatments
• Instruments
• Analysis
You will notice that our “Instruments” includes reproductions of the three
images used in the study, together with a very brief statement that these
were used to elicit written descriptions.
Last in this section is our “Analysis”, which as you will remember had to be
very detailed for our write-up in Chapter 7. In the poster, however, we only
need to cover the essential points, so we will use a series of bullet points.
As with our other sections, we state that we followed the same analytical
procedures as in the original study and then list only the essential
analytical details.
Results
Our results section contains perhaps the least text of all our sections, with
preference given to the more visually effective use of figures and tables. We
draw on the following aspects from our Chapter 7 discussion of results:
(1) data-accountable graphs showing each individual’s scores over time in
each group, (2) descriptive results presented in a table, and (3) effect size
comparisons between the original study and the replication to show similarities
and differences.
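For readers who want to see the arithmetic behind point (3), the following
minimal Python sketch (our illustration; the means, SDs, and group sizes are
entirely hypothetical, not B&K’s or our actual data) computes Cohen’s d with
an approximate 95% confidence interval, using the pooled standard deviation
and the standard large-sample formula for the standard error of d, for both
an “original” and a “replication” comparison:

    # A minimal sketch with hypothetical values (not B&K 2010 or our data):
    # Cohen's d with an approximate 95% CI for two studies, so the
    # effects can be set side by side as in an effect size table.
    import math

    def cohens_d_with_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
        """Cohen's d (pooled SD) with a normal-approximation 95% CI."""
        pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                              / (n1 + n2 - 2))
        d = (m1 - m2) / pooled_sd
        # Approximate standard error of d (Hedges & Olkin)
        se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
        return d, d - z * se, d + z * se

    # Hypothetical post-test accuracy: (mean, SD, n), treatment vs. control
    original = cohens_d_with_ci(85.0, 10.0, 13, 70.0, 12.0, 13)
    replication = cohens_d_with_ci(82.0, 11.0, 15, 71.0, 13.0, 15)

    for label, (d, lo, hi) in (("Original", original),
                               ("Replication", replication)):
        print(f"{label}: d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")

Chapter 7 describes the analyses we actually ran; this sketch simply shows
the form of the calculation behind an effect size comparison table.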
Notes
1 www.nsf.gov (Publication NSF 18-053).
2 Kahneman, D. (2014). A new etiquette for replication. Social Psychology, 45(4), 310–311.
3 Neuliep, J., & Crandall, R. (1990). Editorial bias against replication research. Journal of Social Behavior and Personality, 5(4), 85–90.
9
EPILOGUE
Note
1 Adapted from https://undsci.berkeley.edu/article/_0_0/howscienceworks_10.
INDEX
confidence intervals (CIs) 38, 123; bootstrapping 67; poster presentations 174; results 130, 132, 134; routine checking 51, 62, 63–64; writing up 128, 129, 135, 136, 137
confirmation 13, 149, 157
confirmatory power 73
conflict between authors 159, 160–161
constructs, operationalization of 60, 85, 91–93, 122–123, 174
context 27, 85
control agents 33–34
control groups 29–30, 32–36, 39, 45–46, 131; discussion and conclusions 139; effect size 61; poster presentations 173–174; procedures 112–113; results 132–133, 134, 137; writing up 117, 118, 137–138
corrections 51
correlation 56–57
course reading 15, 16–17
critical reviews 16, 20–21
Cronbach’s alpha 36
cross-validation 66, 67
cues 35, 45
Cumming, G. 55
customized calls for replication 16, 21–22
data: access to 24–25, 50, 51, 111; accumulation 13, 14; bootstrapping 67; coding 50, 121, 122; manipulation 14; routine checking 52–53; sharing 49, 111; visualization 130–131, 134, 135, 136
data analysis 10, 14, 27, 120, 121–129, 157; poster presentations 173, 174; writing up 121, 125–129; see also statistical methods
data collection/data-gathering 27, 31, 37, 57; approximate replication 80; close replication 73; conceptual replication 83–84, 85–86; discussion and conclusions 143; feasibility of replication 99; instruments 35; IRIS 106; procedures 111–113; writing up 115
databases 25, 51
de Serres, L. 73–77
dependent variables 61, 124
descriptive data 123, 130, 132; routine checking 50–51, 53–54; writing up 128, 135–136
disconfirmation 13, 149, 161; see also non-confirmatory outcomes
discussion and conclusions 10, 46–47, 127, 138–144; poster presentations 168, 170, 174; writing up 141–144
dissemination 10, 69, 146–175; collaboration 158–160; conference papers 163–166; journals 146–158; poster presentations 166–175; replication research ethics 160–161; research teams 161–162
Eckerth, J. 96, 102, 113; analysis 125–127; discussion and conclusions 140, 143; executive summaries 142; writing up 103–105, 114, 134–138
educational settings 8
effect size 38–39, 45–47; discussion and conclusions 143–144; “multi-lab” replication projects 161; poster presentations 174; results 132, 133, 134; routine checking 55–56, 59–65; writing up 127, 129, 135, 137, 138
equal variance 58
error 7, 14, 37, 44–45, 48–49, 71; data coding 50; discovery of 5; inevitability of 9; inter-rater reliability 123; jackknife procedure 67; margin of error 38, 62; standard error 51, 66, 67; Type I errors 55, 128, 132
eta squared 63
ethics 160–161
etiquette 10, 160
evidence 7, 12, 176–177; approximate replication 78; close replication 73; conceptual replication 94; effect size 64
execution 10, 65, 95–118; feasibility of replication 98–99; methodology 106–118; research questions 100–106
executive summaries 142
extension studies 3, 14, 69, 70–71, 72, 80
external replication 10, 48, 49, 65, 67, 69–94
failures 4, 26, 94, 157, 161
fallibility 1
false negatives 37–38, 157
feasibility 98–99
Ferzli, M. 28–35, 39–40, 41
“file drawer problem” 25
fixed factors 124
follow-up/extension studies 3, 14, 69, 70–71, 72, 80
Foster, P. 96, 103, 126–127, 135–136, 142, 143
future research 144