Test of Written Language-3 (TOWL-3)
The Buros reviewers expressed concern that, although the sample generally aligns with census data, subgroup proportionality is not documented: “Census Bureau data enabling a comparison of
proportionality of age groups by each of these variables is not provided. The norming sample is marginally adequate for the
purposes it supports. A larger sample, with data showing proportionality of the subgroups to population statistics, would be helpful”
(Hansen, Bucy, & Swerdlik, 1998, p. 1072).
Test Description/Overview:
The TOWL-3 consists of 8 subtests, all of which involve the student engaging in written work. The Buros reviewers write: “The Test
of Written Language--Third Edition (TOWL-3) is represented as a comprehensive test of written language. As the third edition of a
test first published in 1978, the TOWL-3 is substantially improved from earlier versions. Improvements include reduced length of
time for administering and scoring, a tighter and more logically defensible conceptual model of writing, improved norms, improved
reliability and validity, and a bias study” (Hansen, Bucy, & Swerdlik, 1998, p. 1070).
Theory:
The authors define written language: “The term written language refers to the comprehension and expression of thought through the
use of characters, letters, or words that are etched, traced, or formed on the surface of some material. Written language is commonly
considered to be one of the two principle manifestations of language, the other being spoken language” (Hammill & Larsen, 1996, p.
1). The authors have chosen to focus on written language, rather than both reading and writing, although they do discuss the
relationship between reading and writing, and the requisites of writing. They discuss the components of writing: Conventional,
Linguistic, and Cognitive. These appear to be well-known concepts of writing, although few, if any, citations are provided. The
authors also state that in order to effectively evaluate writing, more than one aspect of written language should be assessed. They use
two formats, contrived and spontaneous writing, in order to capture these aspects.
“The underlying model of writing upon which the TOWL-3 is based incorporates three writing components, conventional, linguistic,
and cognitive and two writing sample formats, contrived and spontaneous, to assess different sets of skills essential to effective and
efficient writing. Contrived Writing, in which predetermined stimuli are given to the student and he or she selects or provides the
correct responses, is used to assess Vocabulary, Spelling, Style (which includes punctuation and grammar), Logical Sentences, and
Sentence Combining. Spontaneous Writing uses student-generated writing samples to assess Contextual Conventions, Contextual
Language, and Story Construction. Three composite scores are produced: Spontaneous Writing, Contrived Writing, and Overall
Writing (a combination of the other two composite scores). This separation of writing into two discrete components with an overall
score enables the educator to ascertain how well the individual student understands the conventions and forms of written language
as well as how he or she actually writes. This key feature of the TOWL-3 is, to my knowledge, unique among written language
assessments and should be considered a definite strength” (Hansen, Bucy, & Swerdlik, 1998, p. 1070).
Purpose of Test: The purpose of this test is to identify written language difficulties in students. The authors list four specific purposes in the manual.
Areas Tested: The test consists of 8 subtests: vocabulary, spelling, style (punctuation and grammar), logical sentences, sentence combining, contextual conventions, contextual language, and story construction. The first five subtests follow a contrived writing format; the remaining three follow a spontaneous writing format (the student produces a story that is evaluated for vocabulary, plot, and punctuation).
Areas Tested:
Spelling
Writing: Letter Formation, Capitalization, Punctuation, Conventional Structures, Word Choice, Details
Grammar
Narratives
Other (Please Specify): Sentence formulation
Who can Administer: The authors state that examiners should have formal training in assessment, be proficient in the English
language, and be familiar with the instrument. Interestingly, the authors identify college graduates as potential scorers, as long as
they are oriented to the tasks.
Administration Time: The test has no time limit, although the authors state that the entire test may take up to 80 minutes. Testing
may occur over more than one session. Administration time is estimated at 65 minutes, with an additional 15 minutes for scoring; however, the Buros reviewers estimate that the complex scoring will take longer, at least initially.
The TOWL-3 can be administered individually or to a group of students without modification. The manual specifies basals and ceilings for age intervals, with examples provided. Detailed instructions are found in Chapter 3 of the manual.
Practice items, as well as the examiner’s instructions to the student, are detailed in the examiner’s manual.
Comment: My initial impression of Chapter 3, “Specific Subtest Administration”, was that it is very detailed and, although clearly specific to each subtest, fairly easy to follow. Most of the tasks begin at a more complex level than we are interested in for TELL purposes. For example, in Vocabulary, the student is instructed to make up a sentence using the word “see” (item 1), but quickly the
words become quite complex. “Avoid”, “donate”, and “faithful” are items 7, 8, and 9 (see Hammill & Larsen, 1996, pp. 14-15).
“The examiner's manual contains sample stories and instructions on how to score them. These can be used for examiners to practice.
Nine rules for administering the test are provided. If adhered to, these rules provide at least a minimal level of guidance to ensure
valid interpretation of scores. An entire chapter of the examiner's manual is dedicated to instruction on how to score each of the
eight subtests. Although the TOWL-3 is intended to be untimed, a convenient chart in the manual provides timing guidelines for
administering and scoring the test. Special instructions are provided for group administration” (Hansen, Bucy, & Swerdlik, 1998,
pp. 1070-1071). Twelve sample stories are provided, allowing the examiner to check the reliability of scoring against the models.
The test begins with the spontaneous story, written in response to a detailed black-and-white picture prompt, followed by the contrived writing subtests. Fifteen minutes are given for story writing before the remaining subtests are administered in a set order, beginning with Vocabulary.
Test Interpretation:
The TOWL-3 manual outlines interpretation in Chapter 4. This chapter describes the test scores and quotients and how each is interpreted. Each subtest is briefly described in terms of the skill being assessed. Differences between quotients are discussed in terms of statistical and clinical usefulness.
Testing the limits of a student’s ability is described. Leading questions include: “Tell me about this item. Why did you give this answer?” and “Let me explain how this answer could be better. Can you tell me how to make your next answer better?” In a section on Local Norms (Hammill & Larsen, 1996, p. 40), the authors offer a procedure for developing local norms if the norming sample is believed to be significantly different from the local setting.
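Comment: The manual’s specific procedure is not reproduced here, but the general logic of local norming is to rescale raw scores against the local sample’s own distribution. A minimal sketch of one generic z-score approach in Python, using the TOWL-3 subtest metric (mean 10, SD 3) and a hypothetical local sample:

```python
import statistics

def local_standard_score(raw, local_sample, target_mean=10, target_sd=3):
    """Convert a raw score to a standard score scaled against a local sample,
    here targeting the TOWL-3 subtest metric (mean 10, SD 3)."""
    z = (raw - statistics.mean(local_sample)) / statistics.stdev(local_sample)
    return round(target_mean + z * target_sd)

# Hypothetical local raw scores for one subtest at one age interval
local_sample = [12, 15, 18, 20, 21, 22, 24, 25, 27, 30]
print(local_standard_score(24, local_sample))  # raw 24 -> local standard score
```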
TOWL-3 Chapter 5 introduces “Resources for Further Assessment and Testing” (Hammill & Larsen, 1996, p. 43). The section
begins with a discussion of instructional purpose in each subtest area, followed by a section on remediation. The authors begin with
the reminder that the TOWL-3 results do not point to specific interventions but rather offer a profile of students’ skills. General
guidance for addressing skill development and remediation is provided, with reference to the literature and to available program approaches. Comment: This test is 12 years old; therefore, the research cited is not current.
“Scoring instructions are clear and easy to follow, using the tables in the appendix. Nevertheless, the number of steps involved in
the process could be daunting to many classroom teachers. Given the complex nature of this test and the administration and scoring
procedures, it should not be used by the classroom teacher without training and practice. It would be more appropriate for an
assessment professional, evaluator, or school psychologist to be responsible for the administration and scoring. At the very least,
such a person should conduct training and set high performance standards for those who would administer and score the test. The
cumbersome nature of manual scoring also limits the appeal of this test” (Hansen, Bucy, & Swerdlik, 1998, p. 1071).
As with any measure of written language, a substantial degree of subjectivity is inherent in the scoring.
Standardization: Age equivalent scores, Grade equivalent scores, Percentiles, Standard scores, Stanines
Other (Please Specify): Eight subtest scores and three Composite Quotients are determined: Overall Written Language, Contrived Writing (Vocabulary, Spelling, Style, Logical Sentences, and Sentence Combining), and Spontaneous Writing (Contextual Conventions, Contextual Language, and Story Construction). SEMs are provided: 1 point for all subtests except Contextual Conventions, which has an SEM of 2 points. Composite SEMs are 3 for Contrived Writing, 5 for Spontaneous Writing, and 3 for Overall Written Language.
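Comment: The reported SEMs are consistent with the classical formula SEM = SD√(1 − r). A minimal sketch in Python, assuming the usual Pro-Ed metrics (subtest standard scores with mean 10 and SD 3; composite quotients with mean 100 and SD 15) and illustrative reliability values:

```python
import math

def sem(sd, reliability):
    """Classical standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Subtest standard scores (assumed SD 3): an alpha of .90 yields an SEM of
# about 1; Contextual Conventions' alpha of .70 yields about 2.
print(round(sem(3, 0.90)))   # ~1
print(round(sem(3, 0.70)))   # ~2
# Composite quotients (assumed SD 15): a reliability of .96 yields an SEM of 3.
print(round(sem(15, 0.96)))  # 3
```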
Though age and grade equivalents are provided, the authors offer the usual cautions regarding these scores. Because administrative and legislative requirements often mandate these types of scores, however, the authors provide them reluctantly.
“A limitation of this measure is that a scale score can be computed from raw scores of zero. This is most clearly a problem at
younger ages when, for example, a 7-year-old scoring zero in Sentence Combining earns a Subtest Standard Score of 9. Guidance is
needed from the authors when scoring and interpreting protocols of students who obtain subtest raw scores of zero” (Hansen, Bucy,
& Swerdlik, 1998, p. 1073).
Reliability:
The Buros reviewers state: “Four types of reliability coefficients are provided: coefficient alpha, alternate forms (immediate
administration), test-retest (time-interval not specified), and interscorer. Although most reliability statistics are within industry
standards, the Contextual Conventions subtest provides a notable exception. The average coefficient alpha coefficient for this subtest
is a mere .70, with a range from .60 to .77. Alternate form reliability is .71 and time sampling (test-retest) yields .75. This subtest
'measures the ability to spell words properly and to apply the rules governing punctuation of sentences and capitalization of words in
a spontaneously written composition' (manual, p. 38). Further research should be conducted by the authors to investigate possible
sources of error in this subtest and correct the problem. One such source was mentioned above in the subjectivity of the scoring …” (Hansen, Bucy, & Swerdlik, 1998).
Internal consistency of items: Using the entire normative sample, coefficient alphas were calculated at 11 age intervals. Guilford’s formula was used to calculate coefficients for the composites, and a Z transformation technique was used to average the coefficients. All subtest alphas, with the exception of Contextual Conventions, were .8 or above, and all composite alphas exceeded .9, which is within industry standards. Caution is advised when scoring Contextual Conventions. The authors provide alphas for select subgroups within the sample, thus demonstrating that the TOWL-3 is reliable across all subgroups.
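Comment: The averaging step is presumably Fisher’s r-to-z transformation, the standard technique for averaging correlation-type coefficients. A minimal sketch in Python, using hypothetical alpha values:

```python
import math

def average_r(coefficients):
    """Average correlation-type coefficients via Fisher's r-to-z transform:
    z = atanh(r); average the z values; transform back with tanh."""
    zs = [math.atanh(r) for r in coefficients]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical coefficient alphas for one subtest across the 11 age intervals
alphas = [0.82, 0.85, 0.88, 0.86, 0.84, 0.87, 0.89, 0.90, 0.88, 0.86, 0.85]
print(round(average_r(alphas), 2))
```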
Test-retest: Test-retest reliability was studied in two groups of students after a two-week interval, using both forms of the test. The first group consisted of the youngest students given the TOWL-3 and the second of the oldest. Both groups seem somewhat small: the younger consisted of only 27 second-grade students and the older of only 28 grade 12 students. Subtest coefficient estimates are in the .70s and .80s, with composite coefficients in the .80s to .90s. Although .8 is considered only marginally reliable, the authors state that the test has strong reliability. The Buros reviewers caution that .9 is considered a more acceptable standard, which the TOWL-3 does not meet in most cases.
Inter-rater: Two Pro-Ed staff members independently scored 38 randomly chosen completed protocols from the normative sample. Correlations between the two raters were high (.83 to .97). The Buros reviewers caution: “Staff training was not
described, therefore it is impossible to determine if this level of interscorer reliability can be expected among typical test users”
(Hansen, Bucy, & Swerdlik, 1998, p. 1073).
Other (Please Specify): Less detailed information is provided regarding alternate forms reliability. For example, the Buros reviewer
notes that “… specific procedures such as time between administrations, order and format of administration (individual or group),
and sample size are absent” (Hansen, Bucy, & Swerdlik, 1998, p. 1073), and Hansen cautions that the “data provided by the authors contain observable differences between alternate forms, favoring Form B” (p. 1071). Overall, correlation coefficients were found to be .8 or greater; however, once again, Contextual Conventions did not meet this standard, with a correlation coefficient of .71. In terms of composites, all coefficients were .8 or higher, with the exception of Spontaneous Writing at age 17, which was .76.
Validity:
Chapter 8 of the TOWL-3 manual describes the rationale for, and presents the research behind, the subtests and formats. Qualitative content evidence is thus provided.
The Buros reviewers state: “Three types of validity data: content, criterion related, and construct, are provided. Evidence for content
validity is present in the form of a strong, clear, logical rationale for each subtest. Classical item analysis is used as a means of
screening items. An item inclusion criterion of .3 plus statistical significance was used for point-biserial coefficients. Nevertheless,
median discrimination indices for age groups 7 and 8 are unacceptably low for all five contrived subtests. This finding renders the
TOWL-3 relatively useless for primary grade students” (Hansen, Bucy, & Swerdlik, 1998, p. 1071). Comment: Many of the items seem much too difficult for primary students, and the items become more difficult very early in the subtests.
Content: The qualitative evidence for the content is well argued and makes sense; however, the item pool appropriate to the early years is severely limited. Evidence regarding item bias was provided, and the authors determined the test to have little to no item bias for gender and race.
Classical Item Analysis: Using the entire normative sample, item discrimination coefficients and percentages of difficulty were
calculated. The reported values met the standard.
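Comment: The point-biserial coefficient referenced by the reviewers is the correlation between a dichotomous item score and the total score; items at or above .3 (and statistically significant) were retained. A minimal sketch of the standard computation in Python, with hypothetical data:

```python
import statistics

def point_biserial(item, totals):
    """Point-biserial correlation of a dichotomous item (0/1) with total scores:
    r_pb = (M1 - M0) / s * sqrt(p * q), where s is the population SD of totals."""
    ones = [t for i, t in zip(item, totals) if i == 1]
    zeros = [t for i, t in zip(item, totals) if i == 0]
    p = len(ones) / len(item)
    s = statistics.pstdev(totals)
    return (statistics.mean(ones) - statistics.mean(zeros)) / s * (p * (1 - p)) ** 0.5

# Hypothetical data: item correctness (1/0) paired with each student's total score
item   = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
totals = [28, 25, 14, 22, 11, 16, 27, 24, 12, 20]
print(round(point_biserial(item, totals), 2))  # item kept if r_pb >= .3
```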
Criterion Prediction Validity: Seventy-six elementary students were administered the Comprehensive Scales of Student Abilities
(Hammill & Hresko, 1994) using the writing scales only. Comment: Note that the CSSA is co-authored by the first author of this test,
and also that the authors did not choose to address all age levels covered by this test. This is important because as the Buros
reviewers said, this test is really not a good indicator of writing achievement at the elementary ages. Moderate correlations of .34 to .68 were reported between the two measures. The authors state that this is sufficient evidence of the criterion-related validity of this test (Hammill & Larsen, 1996, p. 72). Comment: How can they make this claim with such low correlations, having looked at only one age group and one test? This seems a bit suspicious. Surely there were more written language tests available for comparison?
Construct Identification Validity: In terms of age differentiation, the authors offer visual inspection of the means and standard
deviations of the subtest scores at the eleven age intervals. The authors claim this inspection “is sufficient to recognize that the
pattern is consistent with our hypothesis about the relationship of age to writing (i.e., the means become progressively larger between
ages 7 and 12, and they level off after age 13)” (Hammill & Larsen, 1996, p. 74). The authors provide correlation coefficients that show strong correlations between ages 7 and 12, but not between ages 13 and 17, thereby providing evidence for their construct validity claims.
Group Differentiation: The authors examine group differentiation by comparing the standard scores of
students with learning disabilities and students with speech impairments against the average for the normative sample. They found the mean scores for these two clinical groups to be below the normative sample average; the pattern was the same for mean composite scores.
Relationship to Tests of Academic Achievement: Again, the authors used the CSSA (co-authored by Hammill). The seventy-six
students used for the Criterion Prediction Validity evidence were given the reading, math, and general facts tests of the CSSA. A
moderate relationship was found with a median of .505. In terms of composite scores, a high relationship was found with a median of
.605.
Relationship to Tests of Intelligence: The authors used another of Hammill’s tests, the Comprehensive Test of Nonverbal Intelligence (Hammill, Pearson, & Wiederholt, 1996). Fifty-two high school students were tested, and a comparison of their scores revealed statistically significant coefficients at the .05 level of significance (.5 to .6, with a median of .505). The authors report that they are confident in the results of the validity testing. Comment: Now the authors have used high school students, yet they are still using a test that the primary author developed. Why not use the WISC, like everyone else does? Also, why not extend testing across age groups?
Factor Analysis: Factor analysis revealed a single factor underlying all subtests, which the authors labeled “Overall Writing Quotient”. To ensure all groups in the normative sample performed similarly, they examined the performance of the subgroups on the subtests and found that the “Overall Writing Quotient” factor held across the normative population. The authors claim that the test is indeed assessing general writing ability.
Differential Item Functioning: Using the delta scores approach, a differential item functioning (DIF) item-bias analysis was conducted. The results demonstrated high coefficients of .90 or higher across all subgroups and subtests; thus, little or no item bias was found.
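Comment: The delta scores approach is presumably Angoff’s delta plot method, in which each item’s proportion correct within a subgroup is transformed to a delta value and the deltas for two subgroups are correlated; a cross-group correlation near 1.0 indicates the items rank similarly in difficulty for both groups. A minimal sketch under that assumption, with hypothetical proportions:

```python
from statistics import NormalDist, correlation

def delta(p_correct):
    """Angoff delta: 13 - 4 * z(p). Harder items (lower p) get higher deltas."""
    return 13 - 4 * NormalDist().inv_cdf(p_correct)

# Hypothetical proportions correct for the same six items in two subgroups
group_a = [0.92, 0.85, 0.73, 0.61, 0.48, 0.35]
group_b = [0.90, 0.81, 0.70, 0.58, 0.45, 0.33]
deltas_a = [delta(p) for p in group_a]
deltas_b = [delta(p) for p in group_b]

# A cross-group delta correlation near 1.0 means the items rank similarly in
# difficulty for both groups, i.e., little or no DIF.
print(round(correlation(deltas_a, deltas_b), 2))
```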
Summary/Conclusions/Observations:
The Buros reviewers state: “The TOWL-3 is substantially improved from earlier versions. As a diagnostic and formative evaluation
tool, it is most useful in identifying student writers who are performing substantially below their peers. Strengths of this test include:
a strong conceptual model of writing, generally acceptable levels of subtest reliabilities (with the exception of Contextual
Conventions subtest), and generally good evidence of validity. Its weaknesses include: complex and time-consuming manual scoring
and interpretation procedures, a need for user training, subjectivity in some of the scoring procedures (especially for Story
Composition), poor discrimination for ages 7-8, and marginally adequate norms” (Hansen, Bucy, & Swerdlik, 1998, p. 1072).
Although the composite scores are generally acceptable, subtest interpretation should be used with caution, especially in light of the
low reliability and validity of such subtests as Contextual Conventions.
Clinical/Diagnostic Usefulness:
Used for its intended purpose, I think this is a useful test, but it is important to heed the caution regarding its utility with the youngest group, 7- and 8-year-olds. For the TELL, I suspect this test has limited relevance. I also retain reservations about the validity of this test, considering how the validity testing was done.
References
Hammill, D. D., & Hresko, W. P. (1994). Comprehensive Scales of Student Abilities. Austin, TX: ProEd, Inc.
Hammill, D. D., & Larsen, S. C. (1996). Test of Written Language-3 (TOWL-3). Austin, TX: ProEd, Inc.
Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (1996). Comprehensive Test of Nonverbal Intelligence. Austin, TX: ProEd, Inc.
Hansen, J. B., Bucy, J. E., & Swerdlik, M. E. (1998). Review of the Test of Written Language – Third Edition. In J. C. Impara & B. S.
Plake (Eds.), The thirteenth mental measurements yearbook (pp. 1069-1074). Lincoln, NE: Buros Institute of Mental
Measurements.
US Bureau of the Census. (1994). Statistical Abstract of the United States: 1994. (114th ed.). Washington, DC: Author.
Hayward, D. V., Stewart, G. E., Phillips, L. M., Norris, S. P., & Lovell, M. A. (2008). Test review: Test of written language-3
(TOWL-3). Language, Phonological Awareness, and Reading Test Directory (pp. 1-9). Edmonton, AB: Canadian Centre for
Research on Literacy. Retrieved [insert date] from http://www.uofaweb.ualberta.ca/elementaryed/ccrl.cfm.