Winter 2012

So what's a word?

"Word", a slippery concept... Linguists prefer to use "lexeme" to denote whatever a word may be. "Lexeme - a word in the abstract sense, an individual distinct item of vocabulary, of which a number of actual forms may exist for use in different syntactic roles." Linguists use "word form" to avoid the ambiguity of word. For example, see, sees, seeing, saw and seen are word forms of the lexeme see. The Oxford Dictionary of English Grammar, 1998

Language is messy, organic and not a system of notation planned ahead of time.

"To understand punctuation, a historical perspective is essential. The modern system is the result of a process of change over many centuries, affecting both the shapes and uses of punctuation marks. Early classical texts were unpunctuated, with no spaces between words." David Crystal, The Cambridge Encyclopedia of the English Language, 1995

Origin of the blank space

In the 7th century Irish monks started using blank spaces, and introduced their script to France. By the 8th or 9th century spacing was being used fairly consistently across Europe. (Wikipedia: Word divider)

Blank spaces help the uneducated find words

What you consider to be the "normal" form for a language was invented by some printer in London in the fifteenth century who put spaces between words because he was trying to sell Bibles to folks who didn't have a classical education.

Inadvertent consequences of the blank space

  • Graphic Words: strings of letters demarcated with spaces that may contain only hyphens and/or apostrophes,
  • Compound Words: strings composed of other words, either appearing as a continuous string, in a hyphenated form or linked with spaces,
  • Merged Words: two words merged to reflect a reduction in spoken language (e.g., he’ll, ’tis, gonna), and
  • Pseudo Words: normally independent words that have been linked together by a hyphen (e.g., "Charles MacArthur-Helen Hayes" contains the pseudo word "MacArthur-Helen").
Francis, W. N. & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston, MA: Houghton Mifflin.

Our language is changing right now!

An orthographic word is a series of alphanumerics between spaces. Are you as clever at finding words as a computer is? How many words in each of these?
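As a rough sketch of how a computer might "find words", the snippet below counts them under two readings of the definition above. The function names and sample strings are assumptions chosen for illustration (the samples are borrowed from elsewhere in this lecture); neither function is claimed to be how any particular retrieval system tokenizes text.

```python
import re

def space_delimited_words(text):
    # Loose reading: everything between runs of whitespace is one "word".
    return text.split()

def alphanumeric_runs(text):
    # Strict reading: only unbroken runs of ASCII letters and digits count,
    # so apostrophes, hyphens and periods all split a string apart.
    return re.findall(r"[A-Za-z0-9]+", text)

samples = [
    "he'll",
    "Charles MacArthur-Helen Hayes",
    "RElease 1.0",
    "try2rite essays",
]

for s in samples:
    print(s, "|", len(space_delimited_words(s)), "vs", len(alphanumeric_runs(s)))
```

The two counts disagree on three of the four samples, which is the point: the answer to "how many words?" depends on a rule somebody had to choose.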

Words are Fonts are Pictures

There are an interesting number of cases where we would have to accept that individual letters, and the way they are presented in typography or handwriting, do permit some degree of semantic or psychological interpretation, analogous to that which is found in sound symbolism, though the element of subjectivity makes it difficult to arrive at uncontroversial explanations. (David Crystal, The Cambridge Encyclopedia of the English Language 1995, p. 268)

This is a logo for a company that supplies computer training to corporations -- what does this logo say?     

This is a logo for a restaurant. What is the name of this restaurant, and why this treatment of the letter "a"?      

This is a logo for a song by a rock band. What is the name of the band and why are the words of the name placed as they are?

Fonts tell stories


Wikipedia: Glyph

Wikipedia: Computer font

Words are Fonts are Pictures are Words are Fonts ...

Distinction is sought through font variation: ConneXions, InformationWEEK and net, or by combining letters, numbers and punctuation: .exe, RElease 1.0, Soft*letter, T.H.E. Journal, I.T.1 Magazine. These latter risk malformation by any normalization process that breaks words apart based on punctuation. Sometimes font and spelling changes become one as in this advertisement: "GRAB THIS VNIQUE BVSINESS OPPORTVNITY". How far from an ordinary orthography is the substitution of v's for u's? Textual creativity is limited only by human imagination. Here is a short list:

  • Grant$ for women and girls, 1993/1994 can be retrieved with the query term grants, but not grant$ (OLUC an 31483793)
  • ;Login: can be retrieved by ignoring the leading semicolon and trailing colon (OLUC an 10959450)
  • *** must be retrieved with asterisk asterisk asterisk (OLUC an 29357394)
  • ? must be retrieved with question mark (OLUC an 28740285)
  • (!) yeah: cover and poems must be retrieved with exclamation mark (OLUC an 3459474)
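The risk named above, that a normalization process which breaks words apart on punctuation will malform such names, can be sketched in a few lines. The function `naive_normalize` is a hypothetical normalizer written for this illustration. It is not the indexing rule used by OLUC or any other catalog.

```python
import re

def naive_normalize(text):
    # Lowercase the text, then treat every character that is not an
    # ASCII letter or digit as a separator between index terms.
    return re.findall(r"[a-z0-9]+", text.lower())

titles = [
    "Grant$ for women and girls",
    ";Login:",
    "Soft*letter",
    "T.H.E. Journal",
    "***",
]

for title in titles:
    print(repr(title), "->", naive_normalize(title))
```

Under this rule "Soft*letter" becomes two unrelated terms, "T.H.E." dissolves into single letters, and "***" produces no index terms at all, so a title made only of punctuation could never be retrieved.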


Spelling It 'Dinsey,' Children on Web Got XXX

Federal prosecutors in Manhattan charged a Florida man yesterday with violating a new law that makes it illegal to use misleading Internet domain names to entice minors onto pornographic Web sites. Prosecutors said that as part of the scheme, the defendant, John Zuccarini, had registered 3,000 domain names that included misspellings or slight variations of popular names like Disneyland, Bob the Builder and Teen magazine. Mr. Zuccarini used more than a dozen variations of the name Britney Spears, the prosecutors said.

A child who accidentally mistyped a name into an Internet browser would be directed to a Web page controlled by Mr. Zuccarini and barraged with X-rated advertising, the authorities said. The child would also be "mousetrapped," they said; that is, unable to exit from the Web site.

Representative Mike Pence, a Republican from Indiana who wrote the domain names law, said by telephone that he saw the issue less as one of indecency than as one of fraud. "I found in sitting down with my kids to do their homework on the Internet," he said, "that you could type in the most innocuous phrases, and that you literally had to cover their eyes before you activated the Web site."

Once a person was directed to a pornographic Web site, Mr. Comey said, "the usual tools that we use to close a Web site would not work." Clicking on the X in the corner, or pressing the back button, he said, would "simply open more screens, bombarding the user with an endless stream of hard-core pornography."
Spelling It 'Dinsey,' Children on Web Got XXX. Benjamin Weiser. The New York Times, September 4, 2003

In Online Auctions, Misspelling in Ads Often Spells Cash

When Holly Marshall wanted to sell a pair of dangling earrings, a popular style these days, she listed them on eBay once, and got no takers. She tried a second time, and still no interest.
Was it the price? The fuzzy picture? Maybe the description: a beautiful pair of chandaleer earrings.
Such is the eBay underworld of misspellers, where the clueless — and sometimes just careless — sell labtop computers, throwing knifes, Art Deko vases, camras, comferters and saphires.
In Online Auctions, Misspelling in Ads Often Spells Cash. Diana Jean Schemo. The New York Times, January 28, 2004
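One way a retrieval system might cope with listings like these is approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library. The vocabulary list, the `suggest` helper and the 0.7 cutoff are assumptions made for this illustration, and nothing here describes how eBay actually searches.

```python
from difflib import get_close_matches

# Hypothetical list of correctly spelled terms a search index might know.
VOCABULARY = [
    "chandelier", "laptop", "knives", "deco",
    "cameras", "comforters", "sapphires",
]

def suggest(term, vocabulary=VOCABULARY, cutoff=0.7):
    # Return the closest known spelling, or None if nothing is similar enough.
    matches = get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for misspelling in ["chandaleer", "labtop", "camras", "comferters", "saphires"]:
    print(misspelling, "->", suggest(misspelling))
```

The catch is that a matcher like this only helps when the intended word is already in the vocabulary; a name it has never seen simply returns nothing.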

"Jew Jersey" has issued its list of this year's most hilarious or egregious newspaper errors and corrections. No. 1 is The Denver Daily News accidentally calling New Jersey "Jew Jersey."
"The News offends an entire state and a major religion and all it can muster is 39 words? Only a newspaper could get away with that," says.
But a Google search for "Jew Jersey" one day this week yielded 888 hits, and most of them weren't racial slurs, but mere typos. Most were not made by newspapers, but by the likes of Best Western Hotels, the BBC and the American Library Association. The "j" key, it appears, is perilously close to the "n" key. Watch those fingers, folks.
"What's Online" Dan Mitchell, December 17, 2005

Emoticons and Smiley Faces

Emoticons are facial expressions made by a certain series of keystrokes.

No longer are they simply the province of the generation that has no memory of record albums, $25 jeans or a world without Nicole Richie. These Starburst-sweet hieroglyphs, arguably as dignified as dotting one’s I’s with kitten faces, have conquered new landscape in the lives of adults, as more of our daily communication shifts from the spoken word to text. Applied appropriately, users say, emoticons can no longer be dismissed as juvenile, because they offer a degree of insurance for a variety of adult social interactions, and help avoid serious miscommunications.
“In a perfect world, we would have time to compose e-mails that made it clear through our language that we are being cheerful and friendly, but we’re doing these things hundreds of times a day under pressure,” said Will Schwalbe, an author of “Send: The Essential Guide to Email for Office and Home” (Knopf, 2007), written with David Shipley, the deputy editorial page editor at The New York Times.
Mr. Schwalbe said that he has seen a proliferation of emoticon use by adults in delicate and significant communications. “People who started using them ironically are now using them regularly,” he said. “It’s really in the last couple of years that the emoticon has come of age.”
"(-: Just Between You and Me ;-)" The NY Times, Sunday, July 29, 2007

Communicating with pictures

Katy, a 9-year-old Ravens (football team) fan from Bel Air, Md., has Angelman syndrome, a rare genetic disorder that affects the nervous system and causes frequent seizures. As a result, she communicates mainly through pictures and gestures.
Last September, Katy composed an essay about training camp by using pictures and symbols, one of many Ravens-related projects she has used to express herself. She carefully placed icons produced by a computer program on paper. Her mother said Katy took 30 minutes to an hour to create each sentence.
"Katy is going to camp with the Ravens as their little cheerleader" The NY Times, July 15, 2007

Distinguishing man from machine with pictures

On the Internet, nobody knows you’re a human — until you fill out a captcha. Captchas are the puzzles on many Web sites that present a string of distorted letters and numbers. These are supposed to be easy for people to read and retype, but hard for computer software to figure out. Most major Internet companies use captchas to keep the automated programs of spammers from infiltrating their sites. There is only one problem. As online mischief makers design better ways to circumvent or defeat captchas, Web companies are responding by making the puzzles more challenging to solve — even for people. They are twisting the letters, distorting the backgrounds, adding a confusing kaleidoscope of colors and generally making it difficult for humans.
Microsoft researchers have developed an alternative captcha that asks Internet users to view nine images of household pets and then select just the cats or the dogs. “For software, this is wildly hard,” said John Douceur, a Microsoft researcher. “Computers are tripped up by all the photos at different angles, with variable lighting conditions and backgrounds and the animals in different positions.”

"A Dog or a Cat? New Tests to Fool Automated Spammers" The NY Times, June 11, 2007

Writing for dance

Rudolf von Laban, a Hungarian-born choreographer and dance theorist, developed his system of notation in the 1920s. (Systems have existed since the 15th century, but Labanotation and Benesh notation, developed in Britain in the 1950s, are the two types most used today.) Like music notation it uses graphic symbols on a staff. But the extreme complexity and detail needed to represent timing, direction, impulse and dynamics make it the province of very few specialists. For this reason, perhaps, along with the expense and the time it takes to compile a score (anything from a few weeks to a year for a big ballet), few companies and choreographers employ notators with any consistency. But those who work in the field of notation are passionate about its importance.
“Dance is not an ephemeral art form,” said Sandra Aberkalns, the senior staff notator at the bureau. “Music is just as ephemeral in performance, but the performer can play that score and read it over and over again, discuss it, debate it. When all you have is video or photographs, what you have is primarily the dancer’s interpretation. Ideally you have those too, but what you get from a score are the choreographer’s intentions, and the nuance and depth that you can capture in the choreography are really phenomenal.”
Some choreographers are skeptical. “The notation is based on an agreed-upon form of moving, which I believe is misleading,” Mark Morris said after his “All Fours” was staged from a score at Ohio State University last year. “It’s nearly impossible to accurately communicate dynamics and phrasing, although I grudgingly admit that it was a far better tool than I had anticipated.”
"All the Right Moves" The NY Times, August 30, 2007

Short Text Messaging

In the first major competition of its kind, the Guardian awarded cash prizes to people who wrote the best poetry on their mobile phones, using the popular short text message service (SMS). People on their way to work, people on their way home, and people just out and about, banged out poems and shot them to the newspaper at an incredible rate.

Because the size of a phone's screen is limited and an SMS message can hold only 160 characters, contestants had rather interesting ways of expressing their thoughts. Check out Hetty Hughes' championship entry:

txtin iz messin,
mi headn'me englis,
try2rite essays,
they all come out txtis.
gran not plsed w
letters shes getn,
swears i wrote better
b4 comin2uni.
&she's african 

The newspaper winnowed the entries down to 100 and then handed them to professional poets who selected seven of the poems for cash prizes. The judges chose a poem written by Julia Bird as the "most creative use of SMS 'shorthand' in a poem":

a txt msg pom.

his is r bunsn brnr bl%,
his hair lyk fe filings
W/ac/dc going thru.
I sit by him in kemistry,
it splits my @oms
wen he :-)s @ me. 


Time out for modern love...

R We D8ting?

Sandra Barron, New York Times, July 24, 2005


Chat room acronyms:


L33t 5p34k (Translation: "Leetspeak")

Now called "Hacker" September 2010

During the early 1980s, hackers that didn't want their websites, newsgroups, etc, to be picked up in a simple keyword search began using numbers to replace certain letters (mostly vowels) such as A = 4 or E = 3.
At this point, l33t speak was only known to a select few and only used when necessary. However, in 1994, id Software began to add Internet connectivity to Doom and Doom II, leading to a revolution in PC gaming and also to the rise of l33t speak.
Megatokyo brought l33t speak into the mainstream with its infamous "speak l33t?" comic. These days l33t speak is very well known to the hardcore Internet community (especially gamers).
An Explanation of l33t Speak
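The substitution described above, and its effect on a simple keyword search, can be sketched as follows. The table holds only the two replacements the text names (A = 4, E = 3), and the function names are made up for this example. Real l33t speak uses many more substitutions and applies them inconsistently.

```python
# Only the two substitutions named in the text; an illustrative subset.
LEET = {"a": "4", "e": "3"}

def to_leet(text, table=LEET):
    # Replace each listed letter (in either case) with its digit.
    return "".join(table.get(ch.lower(), ch) for ch in text)

def from_leet(text, table=LEET):
    # Naive decoding: turn every listed digit back into a letter.
    reverse = {digit: letter for letter, digit in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)

page = to_leet("hacker newsgroup")
print(page)                      # h4ck3r n3wsgroup
print("hacker" in page)          # False: the keyword search misses the page
print(from_leet("l33t sp34k"))   # leet speak
print(from_leet("b4 comin2uni")) # ba comin2uni
```

The last line shows why decoding is not a complete fix: the "4" in the text-message spelling "b4" stands for a sound, not for the letter A, so blindly reversing the table damages it.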


Google in 133t5p33k:


Urban Dictionary

Bottom of the lecture page:

This is a course about the intellectual foundations of informatics: things such as information systems, architecture and retrieval. The essence of this lecture is the variability of words, which are the fundamental unit of human expression and also the fundamental unit of information retrieval systems. What are the implications for information retrieval if the fundamental unit is so arbitrary? What is the future of programming a computer to understand language if language is characterized by this sort of thing? (from the Urban Dictionary):

Would this example illustrate how orthography frustrates information retrieval?

Spoken language (Siri = iPhone 4s)

"I never said she stole my money"

I.B.M. plans to announce Monday that it is in the final stages of completing a computer program to compete against human “Jeopardy!” contestants. If the program beats the humans, the field of artificial intelligence will have made a leap forward. “The big goal is to get computers to be able to converse in human terms,” said the team leader, David A. Ferrucci, an I.B.M. artificial intelligence researcher. “And we’re not there yet.”
The team is aiming not at a true thinking machine but at a new class of software that can “understand” human questions and respond to them correctly. Such a program would have enormous economic implications. Despite more than four decades of experimentation in artificial intelligence, scientists have made only modest progress until now toward building machines that can understand language and interact with humans.
I.B.M. will not reveal precisely how large the system’s internal database would be. The actual amount of information could be a significant fraction of the Web now indexed by Google, but artificial intelligence researchers said that having access to more information would not be the most significant key to improving the system’s performance. Eric Nyberg, a computer scientist at Carnegie Mellon University, is collaborating with I.B.M. on research to devise computing systems capable of answering questions that are not limited to specific topics. The real difficulty, Dr. Nyberg said, is not searching a database but getting the computer to understand what it should be searching for. The system must be able to deal with analogies, puns, double entendres and relationships like size and location, all at lightning speed.
In a demonstration match here at the I.B.M. laboratory against two researchers recently, Watson appeared to be both aggressive and competent, but also made the occasional puzzling blunder. For example, given the statement, “Bordered by Syria and Israel, this small country is only 135 miles long and 35 miles wide,” Watson beat its human competitors by quickly answering, “What is Lebanon?” Moments later, however, the program stumbled when it decided it had high confidence that a “sheet” was a fruit. The way to deal with such problems, Dr. Ferrucci said, is to improve the program’s ability to understand the way “Jeopardy!” clues are offered.
The complexity of the challenge is underscored by the subtlety involved in capturing the exact meaning of a spoken sentence. For example, the sentence “I never said she stole my money” can have seven different meanings depending on which word is stressed. “We love those sentences,” Dr. Nyberg said. “Those are the ones we talk about when we’re sitting around having beers after work.”
"Computer program to take on 'Jeopardy!'" NY Times, April 27, 2009
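The seven readings of “I never said she stole my money” can be enumerated mechanically: one per word, with stress marked by asterisks. This is a small sketch for the lecture. The function name and the asterisk convention are assumptions, and nothing here reflects how Watson or any speech system represents stress.

```python
def stress_variants(sentence):
    # Produce one copy of the sentence per word, marking that word as stressed.
    words = sentence.split()
    variants = []
    for stressed_index in range(len(words)):
        marked = [
            f"*{word}*" if i == stressed_index else word
            for i, word in enumerate(words)
        ]
        variants.append(" ".join(marked))
    return variants

for variant in stress_variants("I never said she stole my money"):
    print(variant)
```

All seven lines contain exactly the same written words, so a system that sees only the orthographic string has no way to tell them apart. The distinguishing information lives in the speech signal, not in the text.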

NELL = Never-Ending Language Learning

Few challenges in computing loom larger than unraveling semantics, understanding the meaning of language. One reason is that the meaning of words and phrases hinges not only on their context, but also on background knowledge that humans learn over years, day after day.

The Never-Ending Language Learning system, or NELL, has made an impressive showing so far. NELL scans hundreds of millions of Web pages for text patterns that it uses to learn facts, 390,000 to date, with an estimated accuracy of 87 percent. These facts are grouped into semantic categories — cities, companies, sports teams, actors, universities, plants and 274 others. The category facts are things like “San Francisco is a city” and “sunflower is a plant.”

NELL, Dr. Mitchell says, is just getting under way, and its growing knowledge base of facts and relations is intended as a foundation for improving machine intelligence. He offers an example of the kind of knowledge NELL cannot manage today, but may someday. Take two similar sentences, he said. “The girl caught the butterfly with the spots.” And, “The girl caught the butterfly with the net.” A human reader, he noted, inherently understands that girls hold nets, and girls are not usually spotted. So, in the first sentence, “spots” is associated with “butterfly,” and in the second, “net” with “girl.” “That’s obvious to a person, but it’s not obvious to a computer,” Dr. Mitchell said. “So much of human language is background knowledge, knowledge accumulated over time. That’s where NELL is headed, and the challenge is how to get that knowledge.”
NY Times, October 4, 2010

Business Glossary: Define a Common Business Language Among Modeling Tools

In a large organization with complex analysis, modeling, and development initiatives spread across multiple projects, standardizing business semantics is key. Without a way to standardize the meanings and definitions of business concepts, each analysis, modeling, or development thread will naturally establish its own semantics. These disparate semantics can compound the already fragmented understanding of the relationship between IT assets and the business concepts they support.

For example, the business side of the house might clearly define the term Customer Tax Status. This enables each IT initiative that supports Customer Tax Status to use the defined meaning, which drives consistency of term name, definition, and related semantics across all the IT initiatives. By contrast, in the absence of such a structure, each IT initiative might naturally come to its own conclusion as to what Customer Tax Status means and how it should be defined. This can result in multiple structures, such as Customer Tax Code, Tax Status, Customer Code, all of which loosely imply the same semantics but differ in name and definition...
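The idea of a shared glossary can be sketched as a lookup that maps each project's local name back to one agreed term. The structure below is a hypothetical illustration built from the names in the example above. It is not the InfoSphere Business Glossary data model or API, and treating the three variant names as exact synonyms is an assumption made for the sketch.

```python
# Hypothetical glossary: one agreed business term and the variant names
# that separate IT initiatives might have invented for it.
CANONICAL_TERMS = {
    "Customer Tax Status": ["Customer Tax Code", "Tax Status", "Customer Code"],
}

def normalize(name):
    # Ignore case and extra whitespace when comparing term names.
    return " ".join(name.lower().split())

def build_lookup(terms):
    # Map every known spelling, canonical or variant, to the canonical term.
    lookup = {}
    for canonical, variants in terms.items():
        for name in [canonical, *variants]:
            lookup[normalize(name)] = canonical
    return lookup

def resolve(name, lookup):
    # Return the agreed term for a local name, or None if it is unknown.
    return lookup.get(normalize(name))

LOOKUP = build_lookup(CANONICAL_TERMS)
print(resolve("customer tax code", LOOKUP))  # Customer Tax Status
print(resolve("Invoice Date", LOOKUP))       # None
```

A table like this only records that names were declared equivalent. Deciding whether "Customer Code" really means the same thing as "Customer Tax Status" is still a human judgment about meaning.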

InfoSphere Business Glossary provides a means to specify business concepts and to manage the relationship among those concepts and the IT structures that support them. However, this content is only useful if it is easy to access. For example, without immediate and efficient access to glossary content, model users, including service analysts, component designers, and logical data modelers, might ignore the glossary and define their own terms. The glossary content should be available within the modeling tools, making the content impossible for the modeler to ignore. Still, there might be complications with model interchange and synchronization as relationships between model structures and glossary terms must be retained as models flow from tool to tool...

These new functions within the modeling platform fundamentally change the capability of an enterprise to define and control business semantics across various modeling domains. These techniques, properly applied, can greatly reduce the variation in business definitions across modeling efforts, across projects, and across line-of-business boundaries...


To do: Find words to talk about death in 10,000 years!