Corpus Use in Teaching Language Arts

Introduction to Corpus Linguistics

for Advanced Structure of American English



Textbooks on English grammar attempt to describe the language as a system of rules that might explain how the child comes to know the language so quickly. This knowledge of language is referred to as grammatical competence. However, the use of language in everyday situations, known as grammatical performance, often affects competence since it provides the data that the child hears. Corpus linguistics aims to look at the actual use of language, written and spoken. The tasks below are designed to familiarize you with this approach and to help you appreciate some of its possibilities for your own research and teaching.


Shifts in Inflectional Forms

Most aspects of language remain fairly constant over time, but irregular inflections tend to regularize unless the irregular form is common, as in drink/drank or sink/sank. For example, what form would you fill in to complete the following sentence?

I _____________ the lamp last night.

Using the Virtual Language Centre (VLC) Web Concordancer, compare the two forms of the past tense of light that we have just considered. To do so, click on "Simple search" under "English". In the VLC Web Concordancer (English), type each form in the second box after "Search string:"; then go to "Select corpus:" and, on the pull-down menu, first select "Brown Corpus" and get your results, then select "Sherlock Holmes stories" and get your results.

The Brown Corpus represents American English usage in 1961, the year its texts were published; the Sherlock Holmes stories, British usage from the late 1880s onward.

Which form occurs more frequently in the Brown Corpus? ___________________

Which occurs more frequently in the Sherlock Holmes stories? _________________ 
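The same comparison can be tried on any plain-text file you have on hand. The following Python sketch counts whole-word occurrences of each form. It is only an illustration: the function name count_forms and the sample sentence are invented for this handout, and, like a concordancer's string search, it counts past participle uses (was lit) along with past tense uses.

```python
import re
from collections import Counter

def count_forms(text, forms):
    """Count whole-word, case-insensitive occurrences of each form in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {form: counts[form.lower()] for form in forms}

# Hypothetical sample text, invented for illustration only.
sample = "She lighted the lamp. He lit a match, and the room was lit at once."
print(count_forms(sample, ["lit", "lighted"]))  # {'lit': 2, 'lighted': 1}
```

To compare two periods, run the function once on each text and set the counts side by side, just as you did with the two corpora above.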


Powerful tea and cooked coffee?

Certain words commonly occur together: coffee is brewed, not cooked; lights are turned off, not closed. These collocational patterns are language-specific and, as such, are often mind-boggling to the language learner. They are learned by constant exposure. A corpus can provide this exposure quickly.

The following sentences are ESL student productions in which the underlined word is not a standard collocate of the following word(s). Using the VLC concordancer, find a better word for each of the underlined words below.

A powerful dollar overseas hurts European markets. (Search Business and Economy and Sort left.)

This was his single chance for success. (Search Brown and Sort left.)

I like being with people, knowing new people, etc. (Search Times for 'new people', and Sort left.)

Stand on line or in line?

Should ESL students be taught to 'stand on line' or 'stand in line', or does it matter? You can check the use of the prepositions at the Collins CoBuild website. In the "Type in your query" box, type stand+1line. The +1 allows a single word to intervene between stand and line. How many times does stand on line occur? _________ stand in line? __________ Do Americans use on or in more often? _________


Part of Speech Identification

Many grammar texts give form criteria by which to judge the part of speech of a word. For example, a word is classified as a noun if it can occur with a plural or possessive ending, or if it has a noun-making morpheme like –ness or –tion, while it is classified as a verb if it can occur with the tense and participial endings, or if it has a verb-making morpheme like –ize or –ate. However, many words in English have the same form as nouns and verbs, e.g., house, button, garden, progress, permit, record. How is part of speech determined in such cases?

To answer this question, look at the usage of the word permit as it occurs in the Brown Corpus. (Sort left.) The word permit can be a noun or a verb in English and, in its base form, there is no formal, morphological way of telling which it is.

Find the first three occurrences of permit as a verb. How do you know permit is being used as a verb here?




If you're not sure about assigning parts of speech, you might want to check out the online Part of Speech Tagger at the University of Colorado.


Testing textbook claims about POS

Klammer, Schulz and Della Volpe's Analyzing English Grammar says that completely, absolutely, totally, extremely, and excessively are adverbs even though they fit the qualifier frame:

The handsome man seems ______ handsome.

These words do pass the adverb form test (they end in -ly), but they fail all the function tests (they can't modify verbs and they can't move within the sentence). What's going on? Looking at the usage of one of these -ly degree words in MICASE, the Michigan Corpus of Academic Spoken English, will help clarify matters.

If we classify adverbs as words that (1) modify verbs and (2) can be moved within a sentence, find an example in the MICASE data of totally used as an adverb.

If we classify qualifiers as words that (1) modify adjectives or adverbs and (2) can fit the slot in the frame sentence The handsome man seems ________________ handsome, find an example in the MICASE data of totally used as a qualifier.


This exercise shows that the part of speech class of a word depends in part on

a. the morphological form of the word

b. the context in which the word occurs

c. the grammatical function of the word


Exemplifying Standard and Non-standard Forms

Adverbs or Adjectives in intransitive sentences?

Many grammar texts claim that there is a usage issue relating to the use of adverbs vs. adjectives following intransitive verbs, as in doing well vs. doing good, with the latter considered informal.

  • Find all instances in MICASE of doing good and doing well. Which form is more common?
  • MICASE classifies its data by speech event. (The speech events are listed to the left of the data.) Is there any correlation between the informal nature of the speech event and doing good, as opposed to the formal nature of the speech event and doing well?

Speech Event: ADV – advising session; COL – colloquia; DIS – discussion section; LAB – lab sections; LEL – large lecture; LES – small lecture; MTG – meetings; OFC – office hours; SEM – seminars; SGR – study groups; TUT – tutorials

  • How about the age and/or status of the speakers? (Speaker characteristics are listed to the right of the data.)


Syntactic Constructions

The passive with by

The often politically motivated claim "Mistakes were made" has recently been dubbed the 'past exonerative'. How often does the passive occur without by? To see how often mistakes occurs with made, with and without the by, in Collins' 56 million word database, type the following in the Collins Cobuild site's query box --


-- and click "Show Concs"

The +2 allows for a maximum of two words to intervene between mistakes and made. (If you want results for mistake and mistakes, type mistake*.)
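If you want to see what a gap search of this kind does, the following Python sketch imitates it with a regular expression. It illustrates the idea only and is not the Collins Cobuild implementation; the function name gap_search and the sample text are invented for this handout. Unlike a real concordancer, it does not stop at sentence boundaries.

```python
import re

def gap_search(text, first, second, max_gap):
    """Find `first` followed by `second` with at most `max_gap` words between.

    A trailing * on `first` matches any word ending, as in mistake*.
    """
    if first.endswith("*"):
        first_pattern = re.escape(first[:-1]) + r"\w*"
    else:
        first_pattern = re.escape(first)
    pattern = re.compile(
        r"\b" + first_pattern + r"\b"
        + r"(?:\W+\w+){0," + str(max_gap) + r"}?"  # up to max_gap intervening words
        + r"\W+" + re.escape(second) + r"\b",
        re.IGNORECASE,
    )
    return [match.group(0) for match in pattern.finditer(text)]

# Hypothetical sample text, invented for illustration only.
sample = (
    "Mistakes were made. A mistake was clearly made by us. "
    "Mistakes that nobody ever really made."
)
print(gap_search(sample, "mistake*", "made", 2))
# ['Mistakes were made', 'mistake was clearly made']
```

The third sentence is not returned because four words separate Mistakes from made, which is more than the gap of two allows.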

Searching for passives is tricky because the past participle used in the passive voice often has the same form as the past tense. For example: I made a mistake (past tense); A mistake was made (past participle).

The Collins Cobuild site distinguishes past tense forms, which it labels VBD, from past participle forms, which it labels VBN, and verbs can be searched for with these tags, for example, made/VBN. (Be aware that the VBN tag will give you all past participles, i.e., both has/have made and am, is, are, was, were, be made.) So you can try the search again as follows --


Of the 40 samples that you see, how many have a by phrase? _________

You can see how often the passive form of a specific verb occurs with a by phrase in the entire 56 million word database by asking for the T-score under "Collocation Sampler". What is the joint frequency of the following items?

mistake*+2made  and   by?  __________    (Type the first term; the by will appear in the table.)

mistake*/NOUN and made? __________

What percentage of the time does by appear with mistakes were made?  _____________
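Once you have the two joint frequencies, the percentage is the first divided by the second, times 100. The Python sketch below shows the arithmetic with made-up counts; substitute the numbers you found. It also includes the common textbook approximation of the t-score, (observed - expected) / sqrt(observed). This handout does not say how the Collocation Sampler computes its T-score, so treat that formula, and the function names used here, as assumptions.

```python
import math

def percentage(joint_with_by, joint_total):
    """Share of passive hits that also contain a by phrase, as a percentage."""
    return 100.0 * joint_with_by / joint_total

def t_score(joint, freq_node, freq_collocate, corpus_size):
    """Textbook t-score approximation (an assumption, not the site's documented method)."""
    # Expected co-occurrences if the two words were independent of each other.
    expected = freq_node * freq_collocate / corpus_size
    return (joint - expected) / math.sqrt(joint)

# Hypothetical counts, invented for illustration only.
print(percentage(12, 150))       # 8.0
print(t_score(4, 10, 10, 100))   # 1.5
```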


Subject and object selection

According to the Collins Cobuild data, what subjects and objects can the verb prove take? Animate, inanimate, abstract, concrete, mass, count?

Verb complementation

Can you pretend something, i.e. a noun or a noun phrase? Type


into the Collins Cobuild box to find out. (Note the occasional errors in the POS tagging of pretend.) What somethings can one pretend? _________________________________________

Can pretend be followed by "-ing" forms, infinitives or anything else? Type pretend* in the box to find out.




Know Your Tools

A concordance, in its simplest form, is an alphabetical listing of the words in a text, given together with the contexts in which they appear. The most common form of concordance today is the Keyword-in-Context (KWIC) index, in which each word is centered in a fixed-length field (e.g., 80 characters). The example given below was produced by Conc 1.70 (Macintosh), from a plain ASCII text version of the first book of Dickens' A Tale of Two Cities. Note that the line numbers are as calculated by Conc.

Figure 3.1.1: Concordance of poor in Tale of Two Cities, Book 1

1320   taste it is that such  poor cattle always have in their mouths
948           of sparing the  poor child the inheritance of any part of
778     small property of my  poor father, whom I never saw--so long
1870    desolate, while your  poor heart pined away, weep for it
947             Miss, if the  poor lady had suffered so intensely
1884          the love of my  poor mother hid his torture from me
1615  stockings, and all his  poor tatters of clothes, had, in a long
1577       faded away into a  poor weak stain.  So sunken and
1001      on your way to the  poor wronged gentleman, and, with a
1036     detachment from the  poor young lady, by laying a brawny hand


A concordancer is a software tool that produces such a list.
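To make the idea concrete, here is a minimal Python sketch of a KWIC concordancer. It is an illustration only and does not reproduce Conc or any of the web tools named in this handout; the function name kwic and the width parameter are invented here. The sample text joins two fragments taken from the Dickens figure above.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()].rstrip()
        right = text[match.end():match.end() + width].lstrip()
        # Right-justify the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}  {match.group(0)}  {right}")
    return lines

sample = (
    "the love of my poor mother hid his torture from me "
    "faded away into a poor weak stain"
)
for line in kwic(sample, "poor", width=20):
    print(line)
```

Because the context is cut at a fixed number of characters, the first or last word on a line may be truncated. A fuller tool would cut at word boundaries and record line numbers as well.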

A collocate list is a list of words that occur in the neighborhood of the keyword. For example, a search for the keyword so in the Hong Kong Web Concordancer, with a request for words that occur within two words of so, returns the following words as the top collocates to its right:

Right collocates for 'so'

The 132

that 127

as 110

to 77

in 57

a 49

and 47

it 46

He 40

of 35
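A right-collocate count like this one can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the Hong Kong Web Concordancer's method: the function name right_collocates and the sample text are invented, the window is taken to be the two words following the keyword, and, unlike the list above (which keeps capitalized forms such as The and He separate), everything is lowercased before counting.

```python
import re
from collections import Counter

def right_collocates(text, keyword, span=2):
    """Count the words that occur within `span` words to the right of keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for position, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[position + 1:position + 1 + span])
    return counts

# Hypothetical sample text, invented for illustration only.
sample = "It was so cold that we left. She was so tired that she slept. So it goes."
print(right_collocates(sample, "so").most_common(1))  # [('that', 2)]
```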


A part-of-speech tagger automatically tags each word in a text with its part of speech. Current taggers are about 97% accurate (as are human experts). The Collins CoBuild Concordancer allows you to search for part of speech strings rather than strings of words.

Searching, in the context of corpus work, means looking in the online text for a specific keyword, phrase, part-of-speech tag, etc.

Browsing means reading through the documents in the corpus. This is a useful activity only if the documents have been classified. For example, the MICASE corpus is categorized by speech event (lab, lecture, office hour, etc.) and by speaker (professor, undergraduate, grad student, native speaker, non-native speaker, male, female, etc.). This classification allows you to get a sense of the differences between one speech event or speaker type and another.

Sorting means listing words in alphabetical order. The Hong Kong Web Concordancer allows you to sort the collocates immediately to the right or to the left of the keyword.
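Sorting left or right can be pictured as follows. This Python sketch orders concordance hits by the word immediately to the left or right of the keyword. It is an illustration only; the function name sort_concordance and the tuple layout are invented for this handout, and the sample hits are shortened from the Dickens figure above.

```python
def sort_concordance(hits, side="right"):
    """Sort (left_context, keyword, right_context) tuples by the adjacent word."""
    def adjacent_word(hit):
        left, _keyword, right = hit
        words = (right if side == "right" else left).split()
        if not words:
            return ""
        # The neighbor is the first word on the right or the last word on the left.
        word = words[0] if side == "right" else words[-1]
        return word.lower()
    return sorted(hits, key=adjacent_word)

hits = [
    ("on your way to the", "poor", "wronged gentleman"),
    ("of sparing the", "poor", "child the inheritance"),
    ("the love of my", "poor", "mother hid his torture"),
]
for left, keyword, right in sort_concordance(hits, side="right"):
    print(f"{left:>20}  {keyword}  {right}")
```

Sorting right groups the hits by what poor modifies (child, mother, wronged); sorting left groups them by what precedes it (my, the, the).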


Websites to get you started.

Corpus Linguistics at the Hong Kong Polytechnic University. Contains a tutorial on corpus linguistics as well as a corpus linguistics course outline with student assignments, etc.


The Internet Grammar of English is an online course in English grammar written primarily for university undergraduates. IGE does not assume any prior knowledge of grammar. It includes interactive exercises.

English Grammar on the Web is a resource designed to support ESL/EFL teachers, but it has valuable lists of links to other web resources on English grammar. Particularly helpful is its Lists of Grammar Lists.

The Hong Kong Web Concordancer is a concordance program that allows you to search several million words of English sampled from many sources.

Alternate URL for The Hong Kong Web Concordancer

The Collins CoBuild Concordancer allows you to search 56 million words of contemporary written and spoken documents. It also allows you to tag these documents for part of speech.

A Ten-step Introduction to Concordancing through the Collins Cobuild Corpus Concordance Sampler. Instructions for using Collins Cobuild effectively.

MICASE, the Michigan Corpus of Academic Spoken English, allows you to search 1,848,364 words of English transcribed from lectures, conversations, service encounters, etc. recorded at the University of Michigan.

POS Tagger is a statistically based part-of-speech tagger from the University of Colorado; it returns your sentence with Penn Treebank part-of-speech tags assigned.

Bookmarks for Corpus Linguists is a list of web resources on using corpora and links to software and corpus collections.

The Web Concordances and Workbooks from the University of Dundee English Department. This site is devoted to the study of literature using literary computer concordancing, a form of text analysis, and aims to help students understand what is meant by literary concordancing.

WordNet is a lexical database for the English language.

Concordances and Corpora. A tutorial by Catherine Ball on the design of corpora, the use of concordances, and available concordancing software.

The Montclair Electronic Language Database (MELD). A collection of ESL student essays and background information on L1, native country, age, gender, and other languages spoken.