These are most of the free, one-million-word corpora of the ICE Project (see the link to the ICE Corpus Design), tagged with TreeTagger and compiled for the Corpus Workbench concordancer. Half of each corpus is text transcribed from speech; I have reduced the markup of overlaps for the sake of readability but retained the indications of omitted and unclear words (using curly brackets) and foreign words (using italics and [FW] tags). Most of these corpora use INDIG (for "indigenous") to tag words and stretches of local languages.
- ICE-India uses INDIG to mark words and stretches of Hindi and other indigenous languages (over 3800 of them). These words are displayed in italics inside otherwise empty square brackets and include accha, ahn, crore, haan, hai, ki, na, ya, and yaar.
- ICE-Philippines uses INDIG for local languages, mainly Tagalog (4650 instances in all), and FW for words from other languages (Spanish, Latin, French, German). Much code-mixing is evident in the spoken portion, with many instances of ano, ang, di ba, hindi ba, ho, kasi, po, walang, etc. Some TV transcripts (e.g., of cooking shows) seem almost dual-language.
- ICE-Hong Kong uses INDIG, but much less than ICE-India and ICE-Philippines (534 instances). It occurs mainly in discussions of Cantonese versus Mandarin pronunciations of words. Speakers exhibit hesitant speech, repeating the first syllable of English words; these 'stuttered' syllables are nonetheless tagged as words. The annotation often identifies Cantonese speech particles (for example lo, la, sei).
- ICE-Jamaica tags about 600 indigenous words, most of them from Patois.
- ICE-East Africa uses the tag ea/ before East African words.
- ICE-Singapore has most of the well-known discourse particles (aiyah, lah, meh, ma, lor, hor, ya) tagged as interjections (UH). Some (ah, eh, what) are not specially tagged.
- ICE-Canada has over 300 INDIG tags that wrap French words.
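
For readers who want to tally such marks outside the concordancer, here is a minimal Python sketch of one way it might be done. It rests on assumptions that are not confirmed by the corpus files themselves: that INDIG stretches appear as `<indig>...</indig>` spans in a plain-text export, and that East African words appear with a literal `ea/` prefix. The function name `count_indigenous` and the sample sentence are hypothetical and purely illustrative.

```python
import re
from collections import Counter

# ASSUMPTION: INDIG stretches are wrapped as <indig>...</indig> in the text.
INDIG_SPAN = re.compile(r"<indig>(.*?)</indig>", re.IGNORECASE | re.DOTALL)
# ASSUMPTION: East African words carry a literal "ea/" prefix at word start.
EA_WORD = re.compile(r"(?<!\w)ea/([^\W\d_]+)")
# A "word" here is any run of letters (Unicode-aware), ignoring punctuation.
WORD = re.compile(r"[^\W\d_]+")


def count_indigenous(text: str) -> Counter:
    """Count lowercased words marked as indigenous in a transcript string."""
    counts = Counter()
    # Words inside each assumed INDIG span.
    for span in INDIG_SPAN.findall(text):
        counts.update(word.lower() for word in WORD.findall(span))
    # Words introduced by the assumed ea/ prefix.
    counts.update(word.lower() for word in EA_WORD.findall(text))
    return counts


# Invented sample line, not taken from any ICE corpus.
sample = "It costs one <indig>crore</indig> <INDIG>yaar</INDIG> and ea/asante"
print(count_indigenous(sample).most_common())
```

If the actual files mark these words differently, only the two assumed patterns at the top should need adjusting.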