Versant™ Writing Test Validation
Table of Contents

1. Introduction
2. Test Description
   2.1 Workplace Emphasis
   2.2 Test Design
   2.3 Test Administration
   2.4 Test Format
      Part A: Typing
      Part B: Sentence Completion
      Part C: Dictation
      Part D: Passage Reconstruction
      Part E: Email Writing
   2.5 Number of Items
7. Validation
   7.1 Validity Study Design
      7.1.1 Validation Sample
   7.2 Structural Validity
      7.2.1 Descriptive Statistics
      7.2.2 Standard Error of Measurement
      7.2.3 Test Reliability
      7.2.4 Dimensionality: Correlations among Subscores
      7.2.5 Machine Accuracy
      7.2.6 Differentiation among Known Populations
8. Conclusion
9. About the Company
10. References
1. Introduction
Pearson's Versant Writing Test, powered by Ordinate technology, is a computer-based assessment
instrument designed to measure how well a person can handle workplace English in written
form. The Versant Writing Test is intended for adults 18 years of age and older and takes about 40
minutes to complete. Because the Versant Writing Test is delivered automatically by the Versant
testing system, the test can be taken at any time, from any location on a computer. A human examiner is
not required. The computerized scoring allows for immediate and objective results that are reliable and
correspond well with traditional measures of English language proficiency.
The Versant Writing Test measures facility in written English in the workplace context. Facility is defined
as how well a person can understand spoken or written English and respond in writing appropriately on everyday
and workplace topics at a functional pace. Versant Writing Test scores provide reliable information that
can be applied to placement, qualification and certification decisions by academic institutions, businesses
and government agencies. The test is also appropriate for monitoring progress as well as measuring
instructional outcomes. (The Versant English Test is also available if spoken English needs to be
evaluated; for more information, refer to Versant English Test: Test Description and Validation Summary.)
2. Test Description
2.1 Workplace Emphasis
The Versant Writing Test is designed to measure the candidate's ability to understand and use English in
workplace contexts. The test does not target language use in one specific industry (e.g., banking,
accounting, travel, health care) or job category (e.g., shop clerk, accountant, tour guide, nurse) because
assessing the candidate's English ability in such specific domains requires both English ability and content
knowledge, such as subject matter knowledge or job-specific terminology. Rather, the Versant Writing
Test is intended to assess how well and how efficiently the candidate can process written English on
general topics such as scheduling, commuting, and training that are commonly found in the workplace
regardless of industry or job category.
The Overall score is a weighted average of the five subscores. Together, these scores describe the
candidate's facility in written English in everyday and workplace contexts. As supplemental information,
Typing Speed and Typing Accuracy are also reported on the score report.
The Versant testing system automatically analyzes the candidate's responses and posts scores to a
secure website usually within minutes of completing the test. Test administrators and score users can
view and print out test results from ScoreKeeper, a password-protected section of Pearson's website
(www.VersantTest.com).
Part A: Typing
In this task, candidates are presented with a short passage to type. The readability of each passage is
estimated with the SMOG formula, whose algorithm factors in the number of polysyllabic words across
sentence samples (McLaughlin, 1969). All passages have a readability score between 10 and 12, which is
at a high school level and can be easily typed by most educated English speakers with adequate typing skills.
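The readability band described above can be approximated in a few lines of code. The sketch below is illustrative only: it applies the published SMOG formula from McLaughlin (1969) with a naive syllable-counting heuristic, not the exact procedure used during test development.

```python
import math
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(passage: str) -> float:
    """Approximate SMOG readability grade (McLaughlin, 1969).

    SMOG counts polysyllabic words (3+ syllables) across a sample of
    sentences and converts that count to a U.S. school-grade level.
    """
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    words = re.findall(r"[A-Za-z']+", passage)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    # Normalize the polysyllable count to a 30-sentence sample, as in SMOG.
    return 3.1291 + 1.0430 * math.sqrt(polysyllables * (30 / len(sentences)))

# Passages retained for the Typing task fall in roughly the grade 10-12 band.
print(round(smog_grade("Whenever you have a fantastic idea, you should "
                       "always write it down. If you don't, it is quite "
                       "possible that you will forget about it."), 1))
```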
Examples:
Whenever you have a fantastic idea, you should always write it down. If you don't,
it is quite possible that you will forget about it. Many creative people have a pen and
paper close at hand at all times. That way, whenever an interesting thought comes to
them, they are prepared to write it down. Later on, when they have time, they sit
down and read through their list of ideas.
You can benefit from this practice, too. Keeping a notebook full of thoughts is a
great way of understanding yourself better, because it tells you how you think. It
allows you to return to an interesting idea when you have the opportunity to do so.
You might find that you've created something that can change the world forever.
This task has several functions. First, since typing is a familiar task to most candidates, it is a comfortable
introduction to the interactive mode of the written test as a whole. Second, it allows candidates to
familiarize themselves with the keyboard. Third, it measures the candidate's typing speed and accuracy.
The Versant Writing Test assumes a basic competence in typing for every candidate. Since it is
important to disambiguate candidates' typing skills from their written English proficiency, it is
recommended that test administrators review each candidate's typing score. If typing speed is below 12
words per minute, and/or accuracy is below 90%, then it is likely that this candidate's written English
proficiency was not properly measured due to poor typing skills. The test administrator should take this
into account when interpreting test scores.
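As a sketch of how an administrator might apply the screening guideline above (the 12 words-per-minute and 90% accuracy thresholds are taken from this section; the function and parameter names are hypothetical):

```python
def typing_flags_low(words_per_minute: float, accuracy_pct: float) -> bool:
    """Return True when a candidate's typing performance is low enough that
    the written-English scores may not reflect actual proficiency."""
    return words_per_minute < 12 or accuracy_pct < 90.0

# Example: this candidate's scores should be interpreted with caution.
print(typing_flags_low(words_per_minute=10.5, accuracy_pct=95.0))  # True
```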
Part B: Sentence Completion
In this task, candidates read a sentence that has a word missing, and they supply an appropriate word to
complete the sentence. Occasionally, two adjacent sentences are presented but still only one word is
missing. Candidates are given 25 seconds for each item. During this time, candidates must read and
understand the sentence, retrieve a lexical item to complete the sentence, and type the word in the text
box provided. Sentences range in length from 4 to 30 words. Across all items in this task, candidates
are exposed to sentences with words missing from various parts of speech (e.g., noun, verb, adjective,
adverb) and from different positions in sentences: sentence-initial, sentence-medial, sentence-final.
Examples:
1. I'm sorry but your bill is long past __________.
2. He arrives __________ and is often the first one here.
3. I asked a coworker to take over my __________ because I wasn't feeling well.
It is sometimes thought that fill-in-the-gap tasks (also called cloze tasks) are more authentic when longer
passages or paragraphs are presented to the candidate, as this enables context-inference strategies.
However, research has shown that candidates rarely need to look beyond the immediate sentence in
order to infer the correct word to fill the gap (Sigott, 2004). This is the case even when test designers
specifically design items to ensure that candidates go beyond sentence-level information (Storey, 1997).
Readers commonly rely on sentence-level comprehension strategies partly because the sentence
surrounding the gap provides clues about the missing word's part of speech and morphology and partly
because sentences are the most common units for transmission of written communication and usually
contain sufficient context for meaning.
Above and beyond knowledge of grammar and semantics, the task requires knowledge of word use and
collocation as they occur in natural language. For example, in the sentence "The police set up a road
____ to prevent the robbers from escaping," some grammatically and semantically correct words that
might fit include "obstacle," "blockage," or "impediment." However, these would seem inappropriate
word choices to a native reader, whose familiarity with word sequences in English would lead them to
expect a word such as "block" or "blockade."
In many Sentence Completion items there is more than one possible correct answer choice. However,
all items have been piloted with native speakers and learners of English and have been carefully reviewed
with reference to content, collocation and syntax. The precise nature of each item and possible answer
choices are quantified in the scoring models.
The sentence completion task draws on interpretation, inference, lexical selection and morphological
encoding, and as such reflects the candidate's mastery of vocabulary in use.
Part C: Dictation
In the Dictation task, each item consists of one sentence. When candidates hear a sentence, they must
type the sentence exactly as they hear it. Candidates have 25 seconds to type each sentence. The
sentences are presented in approximate order of increasing difficulty. Sentences range in length from 3
words to 14 words. The items present a range of grammatical and syntactic structures, including
imperatives, wh-questions, contractions, plurals, possessives, various tenses, and particles. The audio
item prompts are spoken with a natural pace and rhythm by various native speaker voices that are
distinct from the examiner voice.
Examples:
1. There's hardly any paper left.
2. Success is impossible without teamwork.
3. Corporations and companies are staying current with the latest technologies.
Dictation requires the candidate to perform time-constrained processing of the meanings of words in
sentence context. The task is conceived as a test of expectancy grammar (Oller, 1971). An expectancy
grammar is a system that governs the use of a language for someone who has knowledge of that
language. Proficient listeners tend to understand and remember the content of a message, and not the
exact words used; they retain the message rather than the words that carry the message. Therefore,
when writing down what they have heard, candidates need to use their knowledge of the language either
to retain the word string in short-term memory or to reconstruct the sentence that they have
forgotten. Those with good knowledge of English words, phrase structures, and other common
syntactic forms can keep their attention focused on meaning, and fill in the words or morphemes that
they did not attend to directly in order to reconstruct the text accurately (Buck, 2001:78).
The task is a good test of comprehension, language processing, and writing ability. As the sentences
increase in length and complexity, the task becomes increasingly difficult for candidates who are not
familiar with English words and sentence structures. Analysis of errors made during dictation reveals
that the errors relate not only to interpretation of the acoustic signal and phonemic identification, but
also to communicative and productive skills such as syntax and morphology (Oakeshott-Taylor, 1977).
The Dictation task thus taps one part of the grammar construct that is assessed in the Versant Writing
Test.
Part D: Passage Reconstruction
Passage Reconstruction is similar to a task known as free-recall, or immediate-recall: Candidates are
required to read a text, put it aside, and then write what they can remember from the text. In this task,
a short passage is presented for 30 seconds, after which the passage disappears and the candidate has 90
seconds to reconstruct the content of the passage in writing. Passages range in length from 30 to 75
words. The items sample a range of sentence lengths, syntactic variation and complexity.
Two discourse genres are presented in this task: narrative and email. Narrative texts are short stories
about common situations involving characters, actions, events, reasons, consequences, or results. Email
texts are adapted from authentic electronic communication and may be conversational messages to
colleagues or more formal messages to customers.
In order to accurately reconstruct a passage, the candidate must read the passage presented, understand
the concepts and details, and hold them in short-term memory in order to reconstruct the passage.
Individual candidates may naturally employ different strategies when performing the task.
Reconstruction may be somewhat verbatim in some cases, especially for shorter passages answered by
advanced candidates. For longer texts, reconstruction may be accomplished by paraphrasing and
drawing on the candidate's own choice of words. Regardless of strategy, the end result is evaluated
based on the candidate's ability to reproduce the key points and details of the source passage using
grammatical and appropriate writing. The task requires the kinds of skills and core language
competencies that are necessary for activities such as responding to requests in writing, replying to
emails, documenting events or decisions, summarizing documents, or writing the minutes of meetings.
Examples:
(Narrative) Corey is a taxi driver. It is his dream job because he loves driving cars.
He started the job ten years ago and has been saving up money since then. Soon, he
will use this money to start his own taxi company.
(E-Mail) Thank you so much for being so understanding about our delay of
shipment. It has been quite difficult to get materials from our suppliers due to the
recent weather conditions. It is an unusual circumstance. In any case, we should be
able to ship the products to you tomorrow. In the meantime, if you have any
questions, please feel free to contact me.
The Passage Reconstruction task is held to be a purer measure of reading comprehension than, for
example, multiple choice reading comprehension questions, because test questions do not intervene
between the reader and the passage. It is thought that when the passage is reconstructed in the
candidate's mother tongue then the main ability assessed is reading comprehension, but when the
passage is reconstructed in the target language (in this case, English), then it is more an integrated test of
both reading and writing (Alderson, 2000:230). Since the task is a measure of comprehension and
accurate production of sentence-level and paragraph-level writing at functional, workplace speeds,
performance is reflected in the Reading Comprehension and Grammar subscores.
2.5 Number of Items
In each administration, the test draws items from the item pool, taking into consideration, among other
things, each item's level of difficulty and its form and content in relation to other selected items. Table 1
shows the number of items presented in each section.
Table 1. Number of items presented per section.

Task                        Presented
A. Typing
B. Sentence Completion      20
C. Dictation                16
D. Passage Reconstruction
E. Email Writing            2
Total                       43
3. Test Construct
3.1 Facility in Written English
For any language test, it is essential to define the test construct, or the skills and knowledge reflected in
the test scores (Bachman, 1990; Bachman & Palmer, 1996). The Versant Writing Test is designed to
measure a candidate's facility in written English in the workplace context, which is how well the person can
understand spoken or written English and respond in writing appropriately on everyday and workplace topics at a
functional pace.
The constructs that can be observed in the candidate's performances in the Versant Writing Test are
knowledge of the language, such as grammar and vocabulary, and knowledge of writing conventions,
such as organization and tone. Underlying these observable performances are psycholinguistic skills
such as automaticity and anticipation. As candidates operate with texts and select words for constructing
sentences, those who are able to draw on many hours of relevant experience with grammatical
sequences of appropriate words will perform at the most efficient speeds.
The first concept embodied in the definition of facility is how well a candidate understands spoken or
written English. Both input modalities (listening and reading) are covered in the test. Dictation
exposes candidates to spoken English and the remaining sections present written English that candidates
must read and comprehend within given time limits.
Listening dictation requires segmenting the acoustic stream into discrete lexical items and receptively
processing spoken language forms including morphology, phrase structure and syntax in real-time. The
task simulates use of the same skills that are necessary for many real-life written tasks, such as
professional transcribing, listening to a customer over the telephone and inputting information into an
electronic form, and general listening and note-taking. Buck (2001) asserts that dictation is not so much
an assessment of listening skills, as it is sometimes perceived, but rather an assessment of general
language ability, requiring both receptive and productive knowledge. This is because it involves both
comprehension and (re)production of accurate language.
Reading requires fluent word recognition and problem-solving comprehension abilities (Carver, 1991).
Interestingly, the initial and most simple step in the reading process, word recognition, is what
differentiates native readers from even highly proficient second-language readers (Segalowitz et al.,
1991). Native readers have massively over-learned words by encountering them in thousands of
contexts, which means that they can access meanings automatically and also anticipate frequently
occurring surrounding words.
Proficient language users consume fewer cognitive resources when processing spoken English or
analyzing English text visually, and therefore have capacity available for other higher-level comprehension
processes. Comprehension is conceived as parsing sentences, making inferences, resolving ambiguities,
and integrating new information with existing knowledge (Gough et al., 1992). Alderson (2000:43)
suggests that these comprehension skills involve vocabulary, discourse and syntactic knowledge, and are
therefore general linguistic skills which may pertain to listening and writing as much as they do to
reading.
By utilizing integrated listening/reading and written response tasks, the Versant Writing Test taps core
linguistic skills and measures the ability to understand, transform and rework texts. After initial
identification of a word, either as acoustic signal or textual form, candidates who are proficient in the
language move on to higher-level prediction and monitoring processes including anticipation.
Anticipation enables faster and more accurate decoding of language input, and also underlies a
candidate's ability to select appropriate words when producing text. The key skill of anticipation is
assessed in the Sentence Completion and Passage Reconstruction tasks of the Versant Writing Test as
candidates are asked to anticipate missing words and reconstruct textual messages.
The second concept in the definition of facility in written English is how well the candidate can respond
appropriately in writing. The composition tasks in the Versant Writing Test are designed to assess not
only proficiency in the core linguistic skills of grammatical and lexical range and accuracy, as described
above, but also the other essential elements of good writing such as organization, effective expression of
ideas, and voice. These are not solely language skills but are more associated with effective writing and
critical thinking, and must be learned. Assuming these skills have been mastered in the writer's first
language (L1), they may be transferable and applied in the writer's L2, if their core linguistic skills in L2
are sufficiently advanced. Skill in organization may be demonstrated by: presenting information in a
logical sequence of ideas; highlighting salient points with discourse markers; signposting when
introducing new ideas; and giving main ideas before supporting them with details. When responding to an
email, skill in voice and tone may be demonstrated by: properly addressing the recipient; using
conventional expressions of politeness; showing understanding of the recipient's point of view by
rearticulating their opinion or request; and fully responding to each of the recipient's concerns.
Because the most widely used form of written communication is email, the Versant Writing Test
directly assesses the ability to compose informative emails with accuracy and correct word choice, while
also adhering to the modern conventions regarding style, rhetoric, and degree of formality for business
settings.
The last concept in the definition of facility in written English is the candidate's ability to perform the
requested tasks at a functional pace. The rate at which a candidate can process spoken language, read
fluently, and appropriately respond in writing plays a critical role in whether or not that individual can
successfully communicate in a fast-paced work environment. A strict time limit imposed on each item
ensures that proficient language users are advantaged and allows the test to discriminate among
candidates with different levels of automaticity.
The scoring of the Versant Writing Test is grounded in research in applied linguistics. A taxonomy of
the components of language knowledge relevant to writing is presented in a model by Grabe and Kaplan
(1996). Their model divides language knowledge into three types: linguistic knowledge,
discourse knowledge, and sociolinguistic knowledge. These are broadly in line with the Versant Writing
Test subscores of Grammar and Vocabulary (linguistic knowledge), Organization (discourse knowledge),
and Voice & Tone (sociolinguistic knowledge).
Table 2. Taxonomy of Language Knowledge (adapted and simplified from Grabe and Kaplan, 1996:220-221).

1. Linguistic Knowledge
2. Discourse Knowledge
3. Sociolinguistic Knowledge
Aligned with the taxonomy presented in Table 2, linguistic knowledge maps onto a linguistic aspect of
performance in the scoring of the test; whereas discourse and sociolinguistic knowledge relate to a
rhetoric aspect. Comprehension is not mapped explicitly onto the taxonomy because it addresses
language knowledge as opposed to the specific information conveyed by the language. However,
comprehension is recognized as an important factor for facility in written English, and is, therefore,
identified as a unique aspect of the candidate's performance in the scoring.
In sum, there are many processing elements required to participate in a written exchange of
communication: a person has to recognize spoken words or words written in an email or text received,
understand the message, formulate a relevant response, and then compose stylistically appropriate
sentences. Accordingly, the constructs that can be observed in the candidate's performances in the
Versant Writing Test are knowledge of the language, such as grammar and vocabulary, comprehension
of the information conveyed through the language, and knowledge of writing conventions, such as
organization and tone. Underlying these observable performances are psycholinguistic skills such as
automaticity and anticipation. As candidates operate with texts and select words for constructing
sentences, those who are able to draw on many hours of relevant experience with grammatical
sequences of appropriate words will perform at the most efficient speeds.
Scoring related to Discourse and Sociolinguistic Knowledge, however, requires context, awareness of
audience, and functional purpose for communication.
Except for the Email Writing task, all items present context-independent material in English.
Context-independent material is used in the test items for three reasons. First, context-independent items
exercise and measure the most basic meanings of words, phrases, and clauses on which context-dependent
meanings are based (Perry, 2001). Second, when language usage is relatively context-independent, task
performance depends less on factors such as world knowledge and cognitive style and more on the
candidate's facility with the language itself. Thus, test performance relates most closely to language
abilities and is not confounded with other candidate characteristics. Third, context-independent tasks
maximize response density; that is, within the time allotted for the test, the candidate
has more time to demonstrate performance in writing the language because less time is spent presenting
contexts that situate a language sample or set up a task demand. The Dictation, Sentence Completion
and Passage Reconstruction tasks present context-independent material while the Email Writing task
presents a situation with schema that candidates must attune to, for example, the purpose of the writing
and the relationship between themselves and the intended recipient of the email. In this way, Email
Writing allows for the assessment of the grammar and mechanics of writing, as well as knowledge of the
email genre and the rhetorical and cultural norms for organizing information in emails.
Item writers were provided a list of potential topics/activities/situations with regard to the business
domain, such as:
Announcements
Business trips
Complaints
Customer service
Fax/Telephone/E-Mail
Inventory
Scheduling
Marketing/Sales
Item writers were specifically requested to write items so that items would not favor candidates with
work experience or require any work experience to answer correctly. The items are intended to be
within the realm of familiarity of both a typical, educated, native English speaker and an educated adult
who has never lived in an English-speaking country.
Draft items were then reviewed internally by a team of test developers, all with advanced degrees in
language-related fields, to ensure that they conformed to item specifications and English usage in
different English-speaking regions and contained appropriate content. Then, draft items were sent to
external experts on three continents. The pool of expert reviewers included several individuals with
PhDs in applied linguistics and subject matter experts who worked as training and recruitment managers
for large corporations. Expert review was conducted to ensure 1) compliance with the vocabulary
specification, and 2) conformity with current colloquial English usage in different countries. Reviewers
checked that items would be appropriate for candidates trained to standards other than American
English.
All items, including anticipated responses for Sentence Completion, were checked for compliance with
the vocabulary specification. Most vocabulary items that were not present in the lexicon were changed
to other lexical items that were in the corpus and word list. Some off-list words were kept and added
to a supplementary vocabulary list, as deemed necessary and appropriate. The changes proposed by the
different reviewers were then reconciled and the original items were edited accordingly.
For an item to be retained in the test, it had to be understood and responded to appropriately by at
least 90% of a reference sample of educated native speakers of English.
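As an illustration of the retention criterion above, the sketch below checks whether an item reaches the 90% threshold among a reference sample of educated native speakers. The function and data layout are hypothetical and are not the actual development tooling.

```python
def retain_item(native_responses: list[bool], threshold: float = 0.90) -> bool:
    """native_responses holds one entry per native-speaker reference candidate:
    True if the candidate understood and responded to the item appropriately."""
    if not native_responses:
        return False
    return sum(native_responses) / len(native_responses) >= threshold

# An item answered appropriately by 46 of 50 reference speakers (92%) is kept.
print(retain_item([True] * 46 + [False] * 4))  # True
```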
5. Score Reporting
5.1 Scoring and Weighting
Of the 43 items in an administration of the Versant Writing Test, up to 41 responses are used in the
automatic scoring. The first item responses in the Sentence Completion and Dictation sections are
considered practice items and are not incorporated into the final score.
The Versant Writing Test score report comprises an Overall score and five diagnostic subscores
(Grammar, Vocabulary, Organization, Voice & Tone, and Reading Comprehension).
Overall: The Overall score of the test represents the ability to understand English input and
write accurate, appropriate texts at a functional pace for everyday and workplace purposes.
Scores are based on a weighted combination of the five diagnostic subscores. All scores are
reported in the range from 20 to 80.
Grammar: Grammar reflects how well the candidate understands, anticipates and produces a
variety of sentence structures in written English. The score is based on the ability to use
accurate and appropriate words and phrases in meaningful sentences.
Vocabulary: Vocabulary reflects how well the candidate understands and produces a wide
range of words in written English from everyday and workplace situations. The score is based
on accuracy and appropriateness of word use for topic, purpose, and audience.
Organization: Organization reflects how well the candidate presents ideas and information in
written English in a clear and logical sequence. The score is based on the ability to guide readers
through written text and highlight significant points using discourse markers.
Voice & Tone: Voice and Tone reflects how well the candidate establishes an appropriate
relationship with the reader by adopting an appropriate style and level of formality. The score is
based on the writer's ability to address the reader's concern and have an overall positive effect.
Reading Comprehension: Reading reflects how well the candidate understands written
English texts on everyday and workplace topics. The score is based on the ability to operate at
functional speeds to extract meaning, infer the message, and respond appropriately.
Figure 1 illustrates which sections of the test contribute to each of the five subscores. Each vertical
rectangle represents a response from a candidate. The items that are not included in the automatic
scoring are shown in green.
[Figure 1 shows the five test sections (Typing, Sentence Completion, Dictation, Passage Reconstruction,
and Email Writing), completed in about 40 minutes in total, and the subscores that the sections inform
(Grammar, Vocabulary, Organization, Voice & Tone, and Reading Comprehension).]

Figure 1. Relation of subscores to item types.
Table 3 shows how the five subscores are weighted to achieve an Overall score.

Table 3. Subscore weighting in relation to Versant Writing Test Overall score.

Subscore                Weighting
Grammar                 30%
Vocabulary              30%
Organization            10%
Voice & Tone            10%
Reading Comprehension   20%
Overall Score           100%
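A minimal sketch of how the Overall score could be combined from the five subscores using the weights in Table 3 follows; it is illustrative only, and the operational scaling and rounding rules are not described in this report.

```python
WEIGHTS = {
    "Grammar": 0.30,
    "Vocabulary": 0.30,
    "Organization": 0.10,
    "Voice & Tone": 0.10,
    "Reading Comprehension": 0.20,
}

def overall_score(subscores: dict[str, float]) -> float:
    """Weighted average of the five subscores, each reported on the 20-80 scale."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)

print(overall_score({"Grammar": 62, "Vocabulary": 58, "Organization": 55,
                     "Voice & Tone": 57, "Reading Comprehension": 60}))  # 59.2
```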
The subscores are based on several aspects of the candidate's performance: a linguistic aspect (the range
and accuracy of word use), a content aspect (the comprehensiveness of the information given), and a
rhetoric aspect (the organization and presentation of information).
The linguistic aspect is informed by the Grammar and Vocabulary subscores. Combined, these two
dimensions account for 60% of the overall score because knowledge of a wide range of words and the
accuracy of their use are the pre-requisites of successful written communication. If a candidate is unable
to produce coherent sentences that convey the intended meaning in English, then the other dimensions
of content and rhetoric may be of limited value. Conversely, if a candidate is strong in the mechanical
skills of written language, then s/he has a foundation upon which to learn higher order comprehension
and rhetorical skills.
The content aspect, or comprehensiveness of the information given in a candidate's response, is
associated with the Reading Comprehension subscore. This accounts for 20% of the Overall score. It is
not only a measure of how well the candidate is able to understand textual input, but also how well the
candidate then demonstrates understanding by responding to it. Thus, this is not a measure of pure
comprehension in the cognitive sense, but rather of comprehension and usage.
Finally, the rhetoric aspect is informed by the Organization and Voice & Tone subscores. This aspect
also accounts for 20% of the Overall score. Producing accurate lexical and structural content is
important, but effective communication depends on producing clear, succinct writing which allows for
ease of reading and gives a positive impression to the reader.
In the Versant Writing Test scoring logic, the linguistic, content, and rhetoric aspects are weighted
60-20-20, respectively, to reflect their importance for successful written communication.
Table 4. Description of participants in the field testing whose responses were used to
develop automated scoring models (n=1,768).

                         Native              Non-Native
Number of Participants   73                  1,695
Male : Female            31% : 63%           44% : 49%
                         (Unknown = 6%)      (Unknown = 7%)
Age Range                20 - 73             19 - 67
                         (mean = 35.6)       (mean = 28.0)
Languages
Selected item responses to Passage Reconstruction and Email Writing from a subset of candidates were
presented to twenty-one educated native English speakers to be judged for content accuracy and
vocabulary usage. Before the native speakers began rating responses, they were trained to evaluate
responses according to analytical and holistic rating criteria. All raters held a master's degree in either
linguistics or TESOL.
The raters logged in to a web-based rating system and evaluated the written responses to Passage
Reconstruction and Email Writing items for such traits as vocabulary, grammar, organization, voice and
tone, email conventions, and task completion. Rating stopped when each item had been judged by at
least two raters.
7. Validation
7.1 Validity Study Design
A series of validity analyses was conducted to examine the following aspects of the Versant Writing Test
scores:
Structural Validity
1. Reliability: whether or not the Versant Writing Test is structurally reliable and assigns scores
consistently,
2. Dimensionality: whether or not the five different subscores of the Versant Writing Test are
sufficiently distinct,
3. Accuracy: whether or not the automatically scored Versant Writing Test scores are
comparable to the scores that human listeners and raters would assign,
4. Differentiation among known populations: whether or not Versant Writing Test scores reflect
expected differences and similarities among known populations (e.g., natives vs. English
learners).
Concurrent Validity
Relation to scores of tests or frameworks with related constructs: how closely Versant Writing Test
scores predict the reliable information in scores of a well-established English test for a workplace
context (e.g., TOEIC), and how Versant Writing Test scores correspond to the six levels of the
Common European Framework of Reference (CEFR).
7.1.1 Validation Sample
A total of 124 subjects were recruited for a series of validation analyses. These validation subjects were
recruited separately from the field test candidates. Care was taken to ensure that the training dataset
and validation dataset did not overlap for independent validation analyses. This means that the written
performance samples provided by the validation candidates were excluded from the datasets used for
training the scoring models.
Validation subjects were recruited from a variety of countries, first language backgrounds, and
proficiency levels and were representative of the candidate population using the Versant Writing Test.
A total of five native speakers were included in the validation dataset. Table 5 below summarizes the
demographic information of the validation participants.
Table 5. Description of Participants Used to Validate the Scoring Models and Estimate Test Reliability (n=124).

Number of Participants   124
Male : Female            44% : 56%
Age Range                19 - 66 (mean = 30.4)
Languages
7.2 Structural Validity
7.2.1 Descriptive Statistics

Measure              Statistic
Mean                 51.74
Standard Error       1.37
Median               51.55
Standard Deviation   15.27
Sample Variance      233.07
Kurtosis             -0.44
Skewness             0.06
7.2.3 Test Reliability

Score                   Split-half Reliability   Split-half Reliability
                        for Human Scores         for Machine Scores
Overall                 0.93                     0.98
Grammar                 0.97                     0.98
Vocabulary              0.89                     0.91
Organization            0.77                     0.87
Voice & Tone            0.79                     0.90
Reading Comprehension   0.92                     0.93

Note: The possible reliability coefficient range is 0 to 1; the closer the coefficient is to 1.0, the greater the reliability.
The reliability for the Organization and Voice & Tone subscores is lower than the reliability of the other
subscores because these subscores are estimated solely from Email Writing, of which only two items
are presented in the test. However, the agreement between two raters for these subscores was
sufficiently high: inter-rater reliability for Organization was 0.90 and inter-rater reliability for Voice &
Tone was 0.93 at the item level (corrected for under-estimation).
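For readers unfamiliar with the statistic, the sketch below shows one common way to compute a split-half reliability coefficient with a Spearman-Brown correction. It is a generic illustration on synthetic data, not the exact splitting procedure used for the Versant Writing Test.

```python
import numpy as np

def split_half_reliability(item_scores: np.ndarray) -> float:
    """item_scores: candidates x items matrix of item-level scores.

    Splits items into odd- and even-numbered halves, correlates the two
    half-test totals across candidates, and applies the Spearman-Brown
    correction to estimate full-length reliability.
    """
    odd_total = item_scores[:, 0::2].sum(axis=1)
    even_total = item_scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    return 2 * r_half / (1 + r_half)

# Toy data: 124 candidates x 20 items with correlated item scores.
rng = np.random.default_rng(0)
ability = rng.normal(size=(124, 1))
scores = ability + rng.normal(scale=0.8, size=(124, 20))
print(round(split_half_reliability(scores), 2))
```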
7.2.4 Dimensionality: Correlations among Subscores
Ideally, each subscore on a test provides unique information about a specific dimension of the
candidate's ability. For language tests, the expectation is that there will be a certain level of covariance
between subscores given the nature of language learning. This is due to the fact that when language
learning takes place, the candidate's skills tend to improve across multiple dimensions. However, if all
the subscores were to correlate perfectly with one another, then the subscores might not be measuring
different aspects of facility with the language.
Table 8 presents the correlations among the Versant Writing Test subscores and the Overall score for
the same validation sample of 124 candidates, which includes five native English speakers.
Table 8. Inter-correlation between Subscores on the Versant Writing Test (n=124).

                Vocabulary   Organization   Voice & Tone   Reading Comp.   Overall
Grammar         0.81         0.77           0.81           0.91            0.96
Vocabulary                   0.79           0.83           0.88            0.96
Organization                                0.98           0.87            0.89
Voice & Tone                                               0.89            0.91
Reading Comp.                                                              0.96
As expected, test subscores correlate with each other to some extent by virtue of presumed general
covariance within the candidate population between different component elements of written language
skills. The Organization and Voice & Tone subscores correlate highly with one another since they are
both representing the rhetoric aspect of written language from the same set of items. However, the
correlations between the remaining subscores are below unity (i.e., below 1.0), which indicates that the
different scores measure different aspects of the test construct.
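A correlation matrix like the one in Table 8 can be produced directly from the subscore data. The sketch below assumes the subscores are available as columns of a matrix; this is an assumption about data layout for illustration, not a description of the actual analysis scripts.

```python
import numpy as np

SUBSCORES = ["Grammar", "Vocabulary", "Organization",
             "Voice & Tone", "Reading Comprehension"]

def subscore_correlations(scores: np.ndarray) -> np.ndarray:
    """scores: candidates x 5 matrix, one column per subscore.
    Returns the 5 x 5 Pearson correlation matrix."""
    return np.corrcoef(scores, rowvar=False)

# Toy data for 124 candidates; real subscores would be loaded from test results.
rng = np.random.default_rng(1)
toy = rng.normal(size=(124, 1)) + rng.normal(scale=0.5, size=(124, 5))
print(np.round(subscore_correlations(toy), 2))
```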
7.2.5 Machine Accuracy
An analysis of the internal quality of the test involved comparing scores from the Versant Writing Test,
which uses automated language processing technologies, with careful human judgments from expert
raters.
Table 9 presents Pearson Product-Moment correlations between machine scores and human scores,
when both methods are applied to the same performances on the same Versant Writing Test responses.
The candidate sample is the same set of 124 validation candidates that was used in the reliability and
subscore analyses. The human scores in Table 9 were calculated from a single human judgment, which
means that the correlation coefficients are conservative (higher coefficients can be obtained with
multiple human ratings).
Table 9. Correlation Coefficients between Human and Machine Scoring of
Versant Writing Test Responses (n = 124).

Score Type              Correlation
Overall                 0.98
Grammar                 0.99
Vocabulary              0.98
Organization            0.90
Voice & Tone            0.91
Reading Comprehension   0.96
The correlations presented in Table 9 suggest that scoring a Versant Writing Test by machine will yield
scores that closely correspond with human ratings. Among the subscores, the human-machine relation
is closer for the linguistic (Grammar and Vocabulary) and content (Reading Comprehension) aspects of
written language than for the rhetoric aspect (Organization and Voice & Tone), but the relation is close
for all five subscores. At the Overall score level, Versant Writing Test machine-generated scores are
virtually indistinguishable from scoring that is done by multiple independent human judgments.
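As a sketch of the comparison described above, the Pearson correlation between machine scores and a single set of human ratings for the same candidates could be computed as follows; the variable names and toy values are hypothetical, and the actual score files are not shown in this report.

```python
from scipy.stats import pearsonr

# Paired Overall scores for the same candidates: one machine score and one
# human rating each (toy values; n=124 in the actual validation study).
machine_scores = [62.0, 48.5, 71.2, 55.0, 39.8]
human_scores = [61.0, 50.0, 70.5, 56.5, 41.0]

r, p_value = pearsonr(machine_scores, human_scores)
print(f"human-machine correlation: r = {r:.2f}")
```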
7.2.6 Differentiation among Known Populations
The next validity analysis examined whether or not the Versant Writing Test scores reflect expected
differences between native English speakers and English language learners. Overall scores from a total of
400 tests completed by the native speakers and 1709 tests completed by the learners representing a
range of native languages were compared. Figure 2 presents cumulative distributions of Overall scores
for the native and non-native speakers. Note that the range of scores displayed in this figure is from 10
through 90, whereas Versant Writing Test scores are reported on a scale from 20 to 80. Scores
outside the 20 to 80 range are deemed to have saturated the intended measurement range of the test
and are therefore reported as 20 or 80.
The results show that native speakers of English consistently obtain high scores on the Versant Writing
Test. Fewer than 5% of the native sample scored below 70, which was mainly due to performance in
Email Writing (i.e., rhetorical writing skills rather than language skills). Learners of English as a second
or foreign language, on the other hand, are distributed over a wide range of scores. Note also that only
10% of the non-natives scored above 70. In sum, the Overall scores show effective separation between
native and non-native candidates.
Table 10. Pearson Correlation Coefficients for Versant Writing Test and TOEIC (n=55).

                       TOEIC Reading   TOEIC Listening   TOEIC Total
TOEIC Listening        0.84
TOEIC Total            0.96            0.96
Versant Writing Test   0.70            0.68              0.72
Though the sample size is small, this matrix shows an expected pattern of relationships among the
subscores of the tests, bearing in mind that they all relate to English language ability but assess different
dimensions of that ability.
The Versant Writing Test and TOEIC Total correlated moderately at r=0.72, as shown in Figure 3,
indicating that the two tests share variance attributable to general English ability but measure different
aspects of language performance. The Versant Writing Test correlated higher with TOEIC Reading
(r=0.70) than with TOEIC Listening (r=0.68), which is expected because more content is presented
through reading than listening in the Versant Writing Test.
Figure 3. Scatterplot showing the relationship between the Versant Writing Test and TOEIC (n=55).
A benchmarking study was conducted to relate Versant Writing Test scores to the six proficiency levels
of the CEFR. A secondary goal of the study was to empirically demonstrate that two item types found
on the Versant Writing Test, Passage Reconstruction and Email Writing, can be reliably evaluated by
English language testing experts.
Method
A set of analytic descriptors containing six levels was developed from the CEFR scales, corresponding to
CEFR levels A1, A2, B1, B2, C1, and C2. Six English language testing experts were recruited as expert
judges. They were instructed to utilize the CEFR descriptors to grade holistically, and choose the CEFR
level that best fit each response. A response set of written samples was created using the following
procedure: 240 candidates who took a field test version of the Versant Writing Test were selected via
stratified random sampling. This sampling technique was used to ensure that the response set contained
written samples from a wide variety of language backgrounds and equally distributed proficiency levels,
approximately 40 per CEFR level. The candidates came from China, Costa Rica, France, Germany, India,
Iran, Japan, Korea, Mexico, the Netherlands, Russia, Spain, Taiwan, Thailand, and the United States.
Eleven of the candidates were excluded from analysis either before or after the rating process due to
incomplete data (most or all responses were blank), leaving 229 individual candidates in the response
set. Each candidate contributed a total of five written responses from two tasks: three Passage
Reconstruction responses and two Email Writing responses. The response set therefore consisted of
1145 written samples: 687 Passage Reconstruction responses and 458 Email Writing responses.
Results
Raters demonstrated a high level of consistency with one another in their assigned scores (r=0.98). This
high level of inter-rater reliability demonstrates that candidates can be consistently classified into CEFR
levels based on performances elicited by these tasks. The CEFR ratings from the six raters and the
Versant Writing Test scores for each candidate were entered into a Rasch model to produce an ability
estimate for each candidate on a common logit scale. Initial CEFR boundaries were then estimated from
Rasch ability estimates, as shown in Table 11.
Table 11. CEFR Score Boundaries as Logits from a Rasch Model.

Facet Step   CEFR Level   Expectation Measure at CEFR Boundary (Logits)
1            A1           -4.43
2            A2           -2.45
3            B1           -0.68
4            B2           0.88
5            C1           2.39
6            C2           4.22
Candidates' Versant Writing Test scores were then aligned with their CEFR-based ability estimates
to establish the score boundaries. When comparing the aggregated expert judgments with the Versant
Writing Test scores to establish a CEFR Level, 68% of candidates are correctly classified and 99% of
candidates are classified correctly or one step away. Table 12 below provides the final mapping between
the two scales.
Table 12. Mapping of CEFR Levels with Versant Writing Test Scores.

CEFR Level   Versant Writing Test Score
A1           20-29
A2           30-43
B1           44-53
B2           54-66
C1           67-76
C2           77-80
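The mapping in Table 12 can be expressed as a simple lookup. The sketch below only encodes the published score bands; the function name and error handling are hypothetical.

```python
# Score bands from Table 12 (Versant Writing Test score -> CEFR level).
CEFR_BANDS = [
    (20, 29, "A1"),
    (30, 43, "A2"),
    (44, 53, "B1"),
    (54, 66, "B2"),
    (67, 76, "C1"),
    (77, 80, "C2"),
]

def cefr_level(score: int) -> str:
    """Map a Versant Writing Test score (reported range 20-80) to a CEFR level."""
    for low, high, level in CEFR_BANDS:
        if low <= score <= high:
            return level
    raise ValueError("Scores are reported only in the 20-80 range.")

print(cefr_level(58))  # B2
```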
Figure 4 plots the relation between each candidate's Versant Writing Test score (shown on the x-axis)
and their CEFR ability estimate in logits as estimated from the judgments of the six panelists (shown on
the y-axis). The figure also shows the original Rasch-based CEFR boundaries (horizontal dotted lines)
and the slightly adjusted boundaries (vertical dotted lines).
Figure 4. Scatterplot showing Rasch-based CEFR ability estimates as derived from human
judgments and Versant Writing Test scores.
The Pearson correlation coefficient between Versant Writing Test scores and CEFR ability estimates is
0.95, indicating that the Versant Writing Test yields scores which are highly consistent with the judges'
evaluations of written performance using the CEFR scales.
The raters' CEFR ratings were based on two tasks (Email Writing and Passage Reconstruction) which
elicit linguistic, content and rhetorical skills. However, it is important to note that the Versant Writing
Test Overall score is derived not only from performance on these two tasks, but also on Sentence
Completion and Dictation, which assess linguistic skills more reliably. Therefore, some error in CEFR
classification is to be expected when individuals have substantially different linguistic skills than content
and rhetorical skills.
8. Conclusion
This report has provided details of the test development process and validity evidence for the Versant
Writing Test. The information is provided for test users to make an informed interpretive judgment as
to whether test scores would be valid for their purposes. The test development process is documented
and adheres to sound theoretical principles and test development ethics from the field of applied
linguistics and language testing:
- the items were written to specifications and were subjected to a rigorous procedure of qualitative review and psychometric analysis before being deployed to the item pool;
- the content was selected from both pedagogic and authentic material;
- the test has a well-defined construct that is represented in the cognitive demands of the tasks;
- the scores, item weights and scoring logic are explained;
- the items were widely field tested on a representative sample of candidates.
This report provides empirical evidence demonstrating that Versant Writing Test scores are structurally
reliable indications of candidate ability in written English and are suitable for high-stakes decision-making.
10. References
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. & Palmer, A.S. (1996). Language testing in practice. Oxford: Oxford University Press.
Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.
Council of Europe (2001). Common European Framework of Reference for Languages: Learning,
teaching, assessment. Cambridge: Cambridge University Press.
Godfrey, J.J. & Holliman, E. (1997). Switchboard-1 Release 2. LDC Catalog No. LDC97S62.
http://www.ldc.upenn.edu.
Gough, P.B., Ehri, L.C., and Treiman, R. (1992). Reading acquisition. Hillsdale, NJ: Erlbaum.
Grabe, W., and Kaplan, R.B. (1996). Theory and practice of writing. New York: Longman.
Liao, C-w., Qu, Y., and Morgan, R. (2010). The Relationship of Test Scores Measured by the TOEIC
Listening and Reading Test and TOEIC Speaking and Writing Tests (TC-10-13). Retrieved from
Educational Testing Service website: http://www.ets.org/research/policy_research_reports/tc10-13
Luoma, S. (2003). Assessing speaking. Cambridge: Cambridge University Press.
McLaughlin, G.H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639-646.
Oakeshott-Taylor, J. (1977). Information redundancy and listening comprehension. In R. Dirven
(ed.), Hörverständnis im Fremdsprachenunterricht. Listening comprehension in foreign language
teaching. Kronberg/Ts.: Scriptor.
Oller, J. W. (1971). Dictation as a device for testing foreign language proficiency. English Language
Teaching, 25(3), 254-259.
Perry, J. (2001). Reference and reflexivity. Stanford, CA: CSLI Publications.
Segalowitz, N., Poulsen, C., and Komoda, M. (1991). Lower level components of reading skill in
higher level bilinguals: Implications for reading instruction. In J.H. Hulstijn and J.F. Matter (eds.),
Reading in two languages, AILA Review, Vol. 8. Amsterdam: Free University Press, 15-30.
Sigott, G. (2004). Towards identifying the C-test construct. New York: Peter Lang.
Storey, P. (1997). Examining the test-taking process: a cognitive perspective on the discourse cloze
test. Language Testing, 14(2), 214-231.
Version 0313b