4 Corpus Linguistics Outcomes and Applications in the Digital Era
4.1 Introduction
Prominent scholars highlight the key role of word use in the development of corpus
linguistics, understood as the study of linguistic phenomena by means of extensive
collections of machine-readable texts, i.e., corpora. The steady evolution of corpus
linguistics has been primarily motivated by linguists’ need to understand how
words are actually used in natural languages, which words tend to be most commonly
used in certain contexts, and what is common and what is uncommon in certain language
varieties (including specialisms). This need led to the first outcomes provided by
corpus-based approaches, i.e., word lists and synonymous terms. Hence, mainstream
literature traces the emergence of corpus-based investigations as far back as 1755,
when, as noted by Biber et al. (2006: 22), Johnson used a corpus of texts to gather
authentic uses of words that he then included as examples in his dictionary, a first
step towards understanding the patterns of use associated with a word.
A less popular branch of linguistics, though widely explored in the 1940s
and 1950s, is statistical linguistics, which has also contributed, through innovative
mathematical theories of information, to the development of corpus linguistics as we
know it today. Yet, owing to the lack of computer-assisted processing tools at that
time, it proved neither productive nor effective.
Another branch of linguistics closely related to the current state of the art of
corpus linguistics was what ‘older linguists, of the heyday in the 1950s’, such as Harris,
Fries, Hill and other American structuralists, regarded as ‘descriptive linguistics’
(Leech, 1992: 106), i.e., the scholars’ aim of describing the corpus under investigation.
Accordingly, endorsing the flexible approach of descriptive linguistics towards theory
construction, Leech (ibidem) highlights the less abstract nature of its outcomes,
particular to one language, where linguistic phenomena are easier to localise,
observe and analyse. In this context, we become aware of one main characteristic of
corpus linguistics, namely that most corpus-based analyses are applied to data
inquiries specific to individual languages.
In the 1980s, linguists witnessed a rebirth of corpus linguistics, which by then
had already shown a close connection with quantitative linguistics, a specialised
research branch that promotes the use of quantitative methods in language
study. According to various linguists, such methods are also frequently used in most
other disciplines, as they can provide reliable outcomes when it comes to the
description of language in terms of frequency and infrequency rates.
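By way of illustration, a description of language in terms of frequency can be sketched in a few lines of Python. The snippet below is only a minimal sketch and not the procedure of any of the authors cited: the tokenisation rule (lower-cased alphabetic strings), the function names and the sample sentence are all assumptions introduced here.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lower-cased word form to its raw frequency."""
    # Assumption: a token is a run of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


def relative_frequency(counts, word, per=1000):
    """Return the frequency of `word` normalised per `per` tokens."""
    total = sum(counts.values())
    return per * counts[word] / total if total else 0.0


# Hypothetical sample text, used only to exercise the functions.
sample = "The corpus shows how words are used. The words that are frequent are few."
counts = word_frequencies(sample)

# Print the three most frequent word forms with raw and normalised frequencies.
for word, n in counts.most_common(3):
    print(word, n, round(relative_frequency(counts, word), 1))
```

Normalising per thousand tokens is one common convention for comparing corpora of different sizes; other bases can be passed through the `per` parameter.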
The central aim of the present paper is not only to highlight the considerable added
value and reliability of the outcomes provided by computer-assisted corpus-based
analysis, but also to pinpoint the highly effective dedicated software and
computer-assisted tools applied in corpora design and compilation.
We should first mention that, paradoxically, the concept of corpus was initially
used to designate various non-linguistic collections, such as the Corpus Juris Civilis
introduced by the emperor Justinian in the sixth century and regarded by numerous
scholars as a compilation of early Roman law and legal principles, which also
illustrated particular cases and provided clarifications of new laws and future
legislation to be put into effect (see Svartvik, 1992). Linguistic corpora, however,
are regarded as collections of texts ‘assumed to be representative of a given language,
dialect, or other subset of a language, to be used for linguistic analysis’, as defined
by Francis (1992: 17). Thus, in language study, corpora have been primarily used for
linguistic analysis, a feature that, according to Francis (ibid.: 19), differentiates them
from other types of corpora or large text collections, such as anthologies, for example,
the Oxford Book of English Verse, whose purpose is literary.
Other examples of corpora types are lexicographical corpora, compiled in the
process of making dictionaries, for example, the Oxford English Dictionary, edited
in the 19th century by Murray, or the Merriam-Webster Dictionary, edited in the 20th;
dialectological corpora, compiled for the purpose of designing dialect atlases, for
example, the Middle English Dialect Atlas, issued in 1981 by Benskin; and grammatical
corpora, among which Quirk’s Survey of English Usage, published in the 20th century,
is the best known.
Since the early 1960s, modern concepts of corpora have generally implied the
use of large collections of texts available in machine-readable form, of which the most
representative is the Brown University Standard Corpus of Present-Day American
English, compiled by Henry Kucera and W. Nelson Francis, which remains a sample
of present-day English for use with digital computers. Three decades later, it was
followed by the creation of the British National Corpus (BNC), started in 1991.
Given the broad and multi-faceted research directions that have emerged in the field of
linguistics, as well as in the closely related domains mentioned previously, where corpora
investigations are applied in order to achieve novel quantitative and qualitative
outcomes, McEnery and Hardie (2012: 12) put forward a series of criteria for
distinguishing between the types of corpus-based investigations:
–– mode of communication
–– corpus-based versus corpus-driven linguistics
–– data collection regime
–– the use of annotated versus unannotated corpora
–– total accountability versus data selection
–– multilingual versus monolingual corpora
Corpora Design and Compilation in the Digital Era
(BNC) that includes field (the subject-matter of written texts), tenor (spoken texts),
and mode (books, periodicals).
A central issue in corpora design, and particularly in the design of specialised
corpora, is how to establish the most appropriate variables and how the dynamics
of the sub-corpora should function, considering that such specialised corpora are
most often multi-dimensional in character. Nevertheless, as argued by
Lüdeling and Kytö (2008), in most situations it is not possible to collect
all the texts that constitute the research object into a corpus. In this respect, the
authors hold that selecting a subset of the texts targeted for analysis is
a reliable method. Moreover, they recommend establishing strict
sampling techniques, which may lead to a reliable and workable corpus design. We would
emphasise that the contents of a corpus designed for research purposes need to
be carefully considered, though some will argue that the design of the corpus will
depend more on what is freely available in an easily converted format than on other
criteria, thus pointing towards the benefits provided by the digital era.
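The idea of selecting a subset of texts by means of a strict sampling technique can be illustrated with a short Python sketch. It assumes stratified random sampling, which is only one possible technique and is not attributed here to Lüdeling and Kytö; the catalogue, the `genre` variable and the sampling fraction are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(texts, key, fraction, seed=0):
    """Draw the same fraction of texts from every stratum (e.g. every genre)."""
    rng = random.Random(seed)  # fixed seed so that the selection is reproducible
    strata = defaultdict(list)
    for text in texts:
        strata[text[key]].append(text)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Keep at least one text per stratum so that no category disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical catalogue of candidate texts: 40 news items and 10 legal texts.
catalogue = (
    [{"id": f"news-{i}", "genre": "news"} for i in range(40)]
    + [{"id": f"law-{i}", "genre": "legal"} for i in range(10)]
)

subset = stratified_sample(catalogue, key="genre", fraction=0.2)
print(len(subset), "texts selected")
```

Sampling within each stratum keeps the proportions of the design variable in the subset, whereas a simple random draw over the whole catalogue could under-represent the smaller category.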
Within this context, several other important aspects need to be
considered in corpora design, i.e., software limitations,
copyright, and text availability.
It is a truism that dedicated software provides an overwhelming capacity to
store data in different forms (written, spoken, images, graphics, etc.);
various linguists have pointed out that readily available software packages, such as
WordSmith Tools (Scott, 2004), can deal with a corpus of tens of millions of tokens
in size. Lüdeling and Kytö (2008: 157) also mention larger corpora
that are often handled by software requiring the tokens to be converted into
numbers, which allows the software to process them more quickly. Be that as it may,
specialists from linguistics and IT alike note that complex operations on corpora
that may involve hundreds of millions of words can take some time to complete.
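The conversion of tokens into numbers mentioned above can be sketched as follows. This is a minimal illustration that assumes a simple vocabulary in which each distinct token receives an integer identifier in order of first appearance; the actual encoding used by any given corpus software is not described in the source, and the function names are hypothetical.

```python
def build_vocabulary(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every token with its integer ID."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """Map integer IDs back to the original tokens."""
    inverse = {i: token for token, i in vocab.items()}
    return [inverse[i] for i in ids]


# Hypothetical six-token text.
tokens = "the cat sat on the mat".split()
vocab = build_vocabulary(tokens)
ids = encode(tokens, vocab)

# Counting a word now compares integers rather than strings.
print(ids.count(vocab["the"]))
```

Once the text is stored as integers, each comparison is a single numeric test and each token occupies a fixed amount of memory, which is one plausible reason why such software processes large corpora more quickly.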
Another aspect that needs to be taken into consideration is that storing and
reusing the electronic version of published texts without copyright permission
remains unlawful in many of the EU Member States and elsewhere in the world.
Thus, since such permissions are often difficult to obtain, the
availability of some corpora may be restricted.
Another aspect, also highlighted by Lüdeling and Kytö (ibid.: 158), refers to the
availability of texts. In this respect, the authors mention the design limitations
faced by corpora designers who wish to include in their corpora ‘spoken texts
from an era before the invention of tape recorders’. The accuracy of such data as a
representation of actual speech is always questionable. Even though written texts
are easier to obtain, there are situations in which extensive written texts are difficult
to incorporate into a corpus, unless substantial resources are available for
scanning and keying.
Application: A Model for Computer-Assisted Corpus Design and Analysis
also provide further visualisations, which can likewise be exported to Excel sheets
in the form of tables or graphics.
Figure 4.2: Print Screen: document portrait generation according to the assigned codes.
4.5 Conclusions
References
Aston, G. & Burnard, L. (1998). The BNC Handbook: Exploring the British National Corpus with Sara.
Edinburgh: Edinburgh University Press.
Benskin, M. (1981). The Middle English Dialect Atlas. In Benskin, M. & Samuels, L. (Eds.). So Meny
People Longages and Tonges: philological essays in Scots and mediaeval English presented to
Angus McIntosh. Edinburgh, xxvi-xli.
Biber, D., Conrad, S. & Reppen, R. (2006). Corpus Linguistics: investigating language structure and
use. Cambridge: Cambridge University Press.
Francis, W. N. (1992). Language Corpora B.C. In Svartvik, J. (Ed.). Directions in Corpus Linguistics:
Proceedings of Nobel Symposium 82 Stockholm, 4-8 August 1991. Berlin: Walter de Gruyter,
17-35.
Kennedy, G. (1998). An Introduction to Corpus Linguistics. London: Longman.
Leech, G. (1992). Corpora and theories of linguistic performance. In Svartvik, J. (Ed.). Directions in
Corpus Linguistics: Proceedings of Nobel Symposium 82 Stockholm, 4-8 August 1991. Berlin:
Walter de Gruyter, 105-123.
Lüdeling, A. & Kytö, M. (2008). Corpus Linguistics: An International Handbook. Vol. 1. Berlin: Walter de
Gruyter.
McEnery, T. & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge:
Cambridge University Press.
Scott, M. (2007). Oxford WordSmith Tools Version 4.0. Oxford: Oxford University Press.
Svartvik, J. (1992). Corpus linguistics comes of age. In Svartvik, J. (Ed.). Directions in Corpus
Linguistics: Proceedings of Nobel Symposium 82 Stockholm, 4-8 August 1991. Berlin: Walter de
Gruyter, 7-17.
Online resources
Maxqda for Windows. Reference Manual. Berlin: VERBI Software. Consult. Sozialforschung.
Available: http://www.maxqda.com [Accessed 2015, July, 30].