Corpus Resources

  1. On-line corpora with query engines
    1. IMS Corpus Workbench (CWB): these corpora are POS-tagged with Penn Treebank tags by TreeTagger. TreeTagger and the rest of the CWB concordancing tools are downloadable for UNIX/Linux platforms; there is also a TreeTagger for Windows.
    2. Web Concordancer is a front end to demo the free concordancer WebApp and the Windows apps ConcApp and ConcGram. It offers a large variety of corpora (Brown, LOB, Sherlock Holmes, KJV Bible, Hitchhiker's Guide, Starr report, NY Times, student writing). It also handles Chinese, French, and Japanese. You can also upload a .txt file for analysis. Well-documented, with some exercises. ConcApp is free (for Windows) and very serviceable: it requires a single .txt file, but will build one from a list of many files--though it does save the big merged file in your directory. It will even do ConcGrams, but not automatically.
    3. TAPoR is a new, multicentered Canadian site with much the same functionality as Web Concordancer. It has a sizeable collection of on-line texts, and you can add and store your own. "TAPoR Recipes" give some ideas of what to do with the tools. English/French, mais oui.
    4. The venerable Web Concordances (of poems of Romantic poets) demonstrate the output of Concordance, and suggest some uses for literary analysis. Concordance is Rob J. C. Watt's fairly full-featured text analysis tool (for Windows), which is free for 30 days and sold at market price after that.
    5. British National Corpus
      • The BNC World edition (aka BNC 2) can (usually) be accessed for simple search (but with CQL power) with randomized hits limited to 50. To use the POS tagging in a query, you need to know the tag set (this is for Sampler, but is basically right).

        The full BNC, retagged and with CQP query syntax, can be accessed in the Leeds suite of corpora.

      • Word Frequencies in Written and Spoken English has many stats about the BNC including freq of different parts of speech, comparison of written and spoken subcorpora, etc.
      • "Phrases in English" Home Site by William Fletcher and Michael Stubbs to search BNC for phrasal strings (Ngrams) and strings of POS categories.  This is very handy.
      • BYU-BNC Interface for a recoded BNC by Mark Davies (BYU) has much of the functionality of PIE and in addition allows you to refer to subcorpora based on selected registers. You can compare two different subcorpora. Made to parallel his Corpus of Contemporary American English. Very powerful and nifty.
      • If you require even more selective power over BNC subcorpora and have access to the BNC, David Lee's BNC Web Indexer allows you to select according to 14 parameters--register (called "genre") and various demographic criteria. It will output a list of the BNC files to load into WordSmith or another concordancer. It implements his article (Lee, 2001)--see Bib.
    6. MICASE--Michigan Corpus of Academic Spoken English lets you browse and search for any word in a highly stratified corpus and to download the lecture or speech or conversation transcripts that contain the word. Currently has 152 transcripts totalling 1.8 million tokens.
    7. Mark Davies, 100M token corpus of Time Magazine from 1923 to present
    8. Mark Davies, Corpus of Contemporary American English: 410+ million words, 1990-2010. Five subcorpora of 70M words (Spoken, Fiction, Magazine, Newspaper, Academic), each of which may be searched separately. No web-original documents. The entire corpus is not available for download. Mark-up is CLAWS7 (same as BNC).
  2. Free corpora for download
    1. BASE, the British Academic Spoken English corpus, is made up of 160 lectures and 39 seminars in various disciplines, totalling about 1.64M tokens, recorded between 2000 and 2005. Transcripts are available from the Warwick Centre for Applied Linguistics and from the Oxford Text Archive, with and without pauses indicated. There is a search interface in the Sketch Engine.
    2. BAWE, British Academic Written English, is the counterpart to BASE and can be accessed (after a harmless free registration). The writing in the corpus is by British university students, and can be sorted by genre and discipline. The full corpus (6.7 M words) is available at the Oxford Text Archive.
    3. The Open American National Corpus (OANC) is the open source portion (14.6 M words) of the ANC, 2nd ed. (22 M words). It is tagged with Penn Treebank tags and chunked for noun and verb phrases. It also has a useful tool (ANCTool). The full 22 M version (much of the balance is New York Times for June 200x) is available for a fee from LDC.
  3. Subscription corpora and tools
    1. Newspapers online: General corpora such as BNC and ANC have a newspaper component, and many newspapers have searchable online archives, but the mother of all newspaper corpora is LexisNexis Academic, which is available to all of us with a UW account. It allows a lot of tuning--which newspapers, which parts of the paper, presentation of results and so on. It responds somewhat to "literals" (" ") around the search term, but lemmatizes slightly, lumping together both present-tense forms of verbs and singular/plural forms of nouns. Its 'best fit' algorithm will occasionally scramble the order of words, and certain words (be and its forms, and the word and) flood it with undesired hits. It will count the search term again when the same article appears in different newspapers. Moral: eyeball the results of the search and adjust them.
    2. Sketch Engine (SkE) gives you preloaded corpora in several languages, the WebBootCat web-corpus builder (available for free as BootCat), Corpus Builder to upload and install your own corpora, and the BASE Plus interface to BASE. 30 days free, then €55.25/ann.
    3. The 550M token Collins WordBank is available via a Sketch interface for a 30-day trial; after that, a jillion pounds sterling/ann. It is a mix of sources from the Inner Circle English-speaking countries, with most texts dating from 2000-2006.
    4. WMatrix "provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains." 30 days free, then £50/ann.
    5. The mother of all subscription corpora is the BNC, installed on your own machine or via the Xaira corpus engine (which is open source and can be used to compile and access your own corpora; GUI for MS Windows, command line for Linux).
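The frequency lists and "keywords method" that WMatrix's description mentions can be illustrated with a small sketch. It assumes the log-likelihood statistic, a common choice for keyword comparison; whether any tool listed here computes keyness in exactly this way is an assumption, and the function names `log_likelihood` and `keywords` are hypothetical. Both token lists are assumed to be non-empty.

```python
import math
from collections import Counter

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood score for one item seen freq_a times in a corpus of
    total_a tokens and freq_b times in a corpus of total_b tokens."""
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def keywords(study_tokens, reference_tokens, top=10):
    """Rank items that are relatively more frequent in the study corpus
    than in the reference corpus, highest score first."""
    study = Counter(study_tokens)      # frequency list for the study corpus
    ref = Counter(reference_tokens)    # frequency list for the reference corpus
    n_study, n_ref = len(study_tokens), len(reference_tokens)
    scored = []
    for word, freq in study.items():
        if freq / n_study > ref[word] / n_ref:
            scored.append((word, log_likelihood(freq, n_study, ref[word], n_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]
```

The same comparison would work on lists of POS tags or semantic tags instead of words, which is roughly what extending keywords to "key grammatical categories" and "key semantic domains" suggests.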
  4. On-line corpora of web texts:
    1. William Fletcher's WebAsCorpus offers two options. English Web Corpus allows searches of either a 518M token corpus of web writings collected in 2007 or one of 100M tokens collected in 2006. It also produces a table of ngrams for the word or words searched. (For the other option, see WebConcordance below.)
    2. VISL's Corpuseye gives a number of English and other corpora, most of which are electronic in origin, like Wikipedia, europarl, Enron, and email (333M in all). It uses CQP as well as a simple query format and there is a good Guided Tour animation which also goes into CQP. Results can be sorted. Can also search by grammatical FUNCTION! Quite a complete analytic tool. HIGHLY RECOMMENDED.
    3. UKWAC-15: the first 20% of a 2G token corpus of web texts in the .uk domain, made with medium-frequency seeds from the BNC and tagged for CQP. UKWAC is a project of the WACKy Group. The whole 2G corpus is said to be available via the Sketch Engine.
    4. Leeds 110 M token Internet Corpora: a nice interface to TreeTagged text. Learn the CQP language and win a big prize. Also here: British News corpora, the Brown Corpus, and an Internet (Creative Commons) corpus which is light on professional news sources and heavy on blogs and fanzines.
  5. Corpora of transcribed spoken English (some audio)
    1. VOICE: Vienna Oxford International Corpus of English. International students speaking English as a lingua franca (no L1 speakers participating)
    2. Hong Kong Corpus of Spoken English. English as spoken in Hong Kong in specific professional contexts.
  6. Web as Corpus (corpora on the fly)
    1. Webcorp collects cites from all over the Web (via Altavista or Google) and produces a sortable concordance display. It takes a while to do this, and is subject to time-outs, but it is a great way to tame the immense data flow from the Web. Searches should be very narrowly drawn, so that the searches and sortings can come to completion and so that other people can use this site. A search can be tuned to a particular domain or subject area.
    2. WebAsCorpus, as Web Concordancer, draws directly on the Web via LiveSearch, which unfortunately does not support wildcards. Its search can be limited (under Options) to a particular country.
  7. Concordancers and Tools
    • Non-free
      • R.J.C. Watt's Concordance, Michael Barlow's MonoConc, and Mike Scott's WordSmith Tools have similar features, and all cost a pretty penny, though time- and/or data-limited versions are available. All are Windows programs. Most of their functionality is available in free tools.
    • FREE
      1. Xaira is the successor to SARA for the BNC+XML. It can be used with other files. The ANC can be modified to be used with Xaira. It is easy to add simple XML markup to a text file--in fact, Xaira Tools will do it for you. Because it indexes the corpus, it is very fast to use, which WSTools is not with a large corpus like ANC. Quite powerful and, did I mention, free.
      2. Kfngram makes ngram indices of any text(s) you give it. Like WSTools' Cluster function, but free. Works on Windows.
      3. As noted above, ConcApp is very serviceable and free.
      4. Laurence Anthony's AntConc is light, cross-platform, and well-documented. It handles Asian as well as European languages. The most useful of the free concordancers. Good Help file.
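
For readers who want to see what these tools do at their simplest, here is a minimal Python sketch of a keyword-in-context (KWIC) concordance and an ngram count. It is not the method of any program listed above: the function names `tokenize`, `kwic`, and `ngrams` are hypothetical, and the tokenizer is a crude assumption (lowercase runs of letters and apostrophes only).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, width=3):
    """Return a (left context, keyword, right context) tuple for every hit,
    with up to `width` tokens of context on each side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

def ngrams(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

Calling `kwic(tokenize(text), "corpus")` on the contents of a .txt file gives the rows of a concordance display, and `ngrams(tokens, 3).most_common(20)` gives a small table of the most frequent three-word strings. Sorting the hits by left or right context is what the full tools add on top of this.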

This is one of four sites of (on-line) Resources for English language study maintained by George Dillon, University of Washington. The others are:

Phonetics Resources, English Syntax Resources, Semantics Resources


Last modified: November 2010