LIS 540 Information Systems, Architecture and Retrieval
Autumn 2007

Week two: Information architecture in the 21st century!

SGML, HTML, DHTML, XML, XHTML, JSON, etc.



This week's lecture begins with this timeless observation about human writing systems:






What's the big deal about separating content from presentation, huh?

For everyone who grew up in a world where content was NOT separated from presentation:



Consider the dynamics of living in a world of billions of web pages, millions of database records, etc., etc., and somebody has to 'manage' all that stuff.


  • "Create standard visual templates that can be automatically applied to new and existing content, creating one central place to change that look across a group of content on a site."
  • "Once your content is separate from the visual presentation of your site, it usually becomes much easier and quicker to edit and manipulate."


Have a look here: "Content management systems" - Wikipedia



But computers don't know the difference between content and presentation.

Computers don't even know the difference between letters and numbers.

Computers process all text - letters or numbers - as series of binary numerical codes - 1's and 0's. When a computer writes the letter 'A' onto your hard drive, it doesn't create an image of the letter 'A', but writes a series of 1's and 0's that represent the letter 'A' from a table of codes. When your computer "reads" the letter 'A' from your hard drive, it really reads a series of 1's and 0's and then consults a font file to select the character shape of 'A' that it shows on the computer monitor.

Bob Bemer developed the American Standard Code for Information Interchange, ASCII. In 1960, there was no such standardization. IBM's equipment alone used nine different character sets. "They were starting to talk about families of computers, which need to communicate. I said, 'Hey, you can't even talk to each other, let alone the outside world,'" says Bemer, who worked at IBM from 1956 to 1962.

ASCII is a seven-bit code that consists of 128 decimal numbers ranging from zero through 127 assigned to letters, numbers, punctuation marks, and the most common special characters. The Extended ASCII Character Set also consists of 128 decimal numbers and ranges from 128 through 255 representing additional special, mathematical, graphic, and foreign characters.
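To make the table-of-codes idea concrete, here is a minimal Python sketch that looks up the ASCII code for a character and prints it in decimal and as seven bits. The helper name ascii_bits is made up for this illustration; ord and format are standard Python built-ins.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary string for a single ASCII character."""
    code = ord(ch)  # the character's numeric code
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return format(code, "07b")  # zero-padded to seven binary digits


for ch in ("A", "a", "7"):
    print(ch, ord(ch), ascii_bits(ch))
# A 65 1000001
# a 97 1100001
# 7 55 0110111
```

Notice that the digit '7' is stored as code 55, not as the number 7: to the computer, letters and digits are both just entries in the table.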

 




UNICODE

During the 1980s, researchers at Xerox began mapping every character to a 16-bit code. They developed a "unique, universal and uniform character encoding" - UNICODE.



  • Universal - encompasses all world languages
  • Uniform - fixed-width codes
  • Unique - each bit sequence has only one interpretation

Unicode provides a consistent way of encoding multilingual text and helps the exchange of text files internationally. The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The original goal was to use a single 16-bit encoding that provides code points for more than 65,000 characters. While 65,000 characters are sufficient for encoding most of the many thousands of characters used in major languages of the world, the Unicode standard and ISO/IEC 10646 now support three encoding forms that use a common repertoire of characters but allow for encoding as many as a million more characters. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world, as well as common notational systems.
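The three encoding forms mentioned above are UTF-8, UTF-16 and UTF-32. As a rough sketch, the Python below encodes the same character three ways and counts the bytes. The function name encoded_sizes is invented for this example, and the "-be" suffix simply asks Python for big-endian output without a byte-order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding form."""
    forms = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(text.encode(name)) for name in forms}


print(encoded_sizes("A"))           # Latin letter, inside the ASCII range
print(encoded_sizes("\u20ac"))      # euro sign, inside the original 16-bit range
print(encoded_sizes("\U0001d11e"))  # musical G clef, beyond the 16-bit range
```

The G clef has a code point above 65,535, so UTF-16 needs two 16-bit units (four bytes) to represent it. That is the "million more characters" at work.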



 



Big Idea: Metadata

Why not create self-referential text?

i.e., Text about Text





"Self-referential" ... a drawn hand drawing a hand ... text that refers to other text ...


Name it "Mark Up"

What is so clever about the name "Mark Up"? Examine the following:

Nobody has used the name "Mark Up" so far. There's "Mary Up" and "Margaret Up", even "Luann Up", but no "Mark Up". Clever, no?



 

A Short History of Text Markup

The digital processing of text requires distinguishing the "content" text from flags or signs embedded in the text that signal how the content text should be processed.

  • 1967 - William Tunnicliffe distinguished the content of documents from their format at a meeting of the Canadian Government Printing Office.
  • 1969 - IBM researchers invent the Generalized Markup Language (GML).
  • 1978 - An American National Standards Institute working group was formed to provide a format for text interchange and a markup language for future processing. It introduced the new concept of structural markup: titles were marked as <title> rather than <bold> and <center>. By marking a title as <title>, database searches could be limited to titles. This was the beginning of Standard Generalized Markup Language, SGML, which represents the structure of a document.
  • 1980 - First draft of SGML
  • 1986 - SGML approved as ISO international standard 8879
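The payoff of structural markup noted in the 1978 entry (searches limited to titles) can be sketched in a few lines of Python. The two tiny documents and the function name search_titles are invented for this illustration, and the sketch uses XML-style tags parsed with Python's standard xml.etree.ElementTree module rather than a real SGML system.

```python
import xml.etree.ElementTree as ET

# Two made-up documents; only the first mentions "markup" in its title.
DOCS = [
    "<doc><title>Markup languages</title><body>A history of text processing.</body></doc>",
    "<doc><title>Text processing</title><body>Notes on markup and fonts.</body></doc>",
]


def search_titles(docs, word):
    """Return the titles that contain the word, ignoring body text."""
    hits = []
    for source in docs:
        title = ET.fromstring(source).findtext("title", default="")
        if word.lower() in title.lower():
            hits.append(title)
    return hits


print(search_titles(DOCS, "markup"))  # ['Markup languages']
```

Had the titles been marked only as <bold> and <center>, the program would have no way to tell a title from any other bold, centered text.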


SGML - Standard Generalized Markup Language

SGML differs from other markup languages in that it does not simply indicate where a change of appearance occurs, or where a new element starts. SGML sets out to clearly identify the boundaries of every part of a document. To allow the computer to do as much of the work as possible, SGML requires users to provide a model of the document being produced. This model, called a Document Type Definition (DTD), describes each element of the document in a form that the computer can understand. The DTD shows how the various elements that make up a document relate to one another.



HTML - HyperText Markup Language

HTML is a document-layout and hyperlink-specification language. It defines the syntax and placement of special, embedded directions that aren't displayed by the browser, but tell it how to display the contents of the document, including text, images, and other support media.

"Yield to the browser. Let it format your document in whatever way it deems best. Recognize that the browser's job is to present your documents to the user in a consistent, usable way. Your job, in turn, is to use HTML effectively to mark up your documents so that the browser can do its job effectively. Spend less time trying to achieve format-oriented goals. Instead, focus your efforts on creating the actual document content and adding the HTML tags to structure that content effectively." Chuck Musciano & Bill Kennedy, HTML: The Definitive Guide, O'Reilly, 1997



Here's a little question: Who is really in control of HTML?

The Reader or the Writer?

Who is the real architect of information?

Required reading: "No Bad Webpages: Reader Empowerment and the Web" by T.A. Brooks

 

Mashing web pages


I.B.M. has posted a tutorial for its mash-up tool, QEDWiki, on YouTube.

Now mash-ups are poised to hit the mainstream, and to spread well beyond music. Yahoo, I.B.M., Microsoft and others are creating systems to let ordinary people who’ve never been near a Java class create useful computer applications by combining, or “mashing up,” different online information sources.
If the technology catches on, many of us may become part-time programmers, instead of waiting for the people in information technology to help.
Here’s just one example: An employee at a chain of hardware stores creates a mash-up that combines inventory data, storm forecasts and the telephone numbers of branch managers. Then, when snow is on the way, the application sends text messages to the managers’ cellphones, telling them how many shovels to order.
Devising that sort of mash-up, which handles multiple data sources to produce a customized solution, is typically the province of a professional. But the new systems are designed, their creators say, so people with modest technical skills can tailor applications to their needs — while writing little or no code.
"Do the Mash (Even if You Don’t Know All the Steps)" The NY Times, September 2, 2007

 

Don't believe Terry?


What I did to the Catalyst Portfolio tool?

What I will do to the Catalyst ePost tool?




DHTML - Dynamic HTML

"Adding effective Dynamic HTML (DHTML) content to your pages requires an understanding of other technologies, specified by additional standards that exist outside the charter of the original HTML Working group... DHTML is an amalgam of specifications that stem from multiple standards efforts and proprietary technologies that are built into the two most popular DHTML-capable browsers, Netscape Navigator and Internet Explorer, beginning with Version 4 of each browser." Danny Goodman, Dynamic HTML: The Definitive Reference, O'Reilly, 1998

Technologies covered by Goodman: (1) Cascading stylesheets and (2) JavaScript.

[Note: This web page is an example of DHTML]



XML - Extensible Markup Language

XML is text-based markup that permits authors to invent their own tags, hence the term Semantic Markup.

<?xml version="1.0" encoding="UTF-8" ?>
<pets>
	<dog>
		<name>Fido</name>
	</dog>
	<cat>
		<name>Fluffy</name>
	</cat>
</pets>


One consequence of permitting authors to invent their own tags is that XML coding must be strictly correct - no broken or missing tags.
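As a small sketch of what "strictly correct" means in practice, the Python below feeds a well-formed and a broken version of the pets document to the parser in the standard xml.etree.ElementTree module. The helper name is_well_formed and the two test strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


GOOD = "<pets><dog><name>Fido</name></dog><cat><name>Fluffy</name></cat></pets>"
BAD = "<pets><dog><name>Fido</dog></pets>"  # the closing </name> tag is missing

print(is_well_formed(GOOD))  # True
print(is_well_formed(BAD))   # False
```

An XML parser refuses the broken document outright; it does not guess what the author meant, as HTML browsers traditionally do.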

Associated technologies are XSLT - Extensible Stylesheet Language Transformations and XML Schemas - schemas act as definitions for XML documents by declaring their structure. An XML schema validates an instance of an XML document. Validation is important because it permits you to be sure that the XML instance you have is correctly structured according to its definition.



Jon Bosak is Sun's XML architect. He organized and led the working group that created XML and served for two years as chair of the W3C XML Coordination Group. He is a founding member of OASIS, the Organization for the Advancement of Structured Information Standards, and of its predecessor, SGML Open. At Sun he holds the position of Distinguished Engineer.




Required reading:   The Birth of XML: A Personal Recollection by Jon Bosak





XHTML - Extensible HyperText Markup Language

XHTML extends HTML by making it XML compliant. This permits standard XML tools to view, edit and validate XHTML documents. "The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility." XHTML 1.0, W3C Recommendation, January 26, 2000
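To illustrate what XML compliance buys you, here is a hedged sketch in which an ordinary XML parser (Python's standard xml.etree.ElementTree) reads a tiny XHTML page. The page content and the function name page_title are invented for this example; the namespace URI is the one XHTML 1.0 uses.

```python
import xml.etree.ElementTree as ET

# A made-up XHTML page. Note the self-closing <br /> that XML requires.
XHTML_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Week two</title></head>
  <body><p>Hello<br /></p></body>
</html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}


def page_title(source: str) -> str:
    """Extract the <title> text from an XHTML document with a generic XML parser."""
    root = ET.fromstring(source)
    return root.findtext("x:head/x:title", default="", namespaces=NS)


print(page_title(XHTML_PAGE))  # Week two
```

The same parser would reject old-style HTML such as <p>Hello<br></p>, because the <br> tag is never closed.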



JSON - JavaScript Object Notation




Like XML, JSON is also used to share information among applications, but it is easy for people to read and machines to parse. "While JSON is often positioned "against" XML, it's not uncommon to see both JSON and XML used in the same application" (Wikipedia: JSON)


An example of a JSON object describing football players and their positions:


{ "players" : [
               { "firstName" : "Ryan", "lastName" : "Campbell", "position" : "S" },
               { "firstName" : "Chris", "lastName" : "Campbell", "position" : "QB" },
               { "firstName" : "Kevin", "lastName" : "Hale", "position" : "DT" }
           ]}
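To show how little work it takes a machine to consume this, here is a minimal Python sketch that parses the object above with the standard json module. The variable names PLAYERS_JSON, data and positions are chosen just for this illustration.

```python
import json

PLAYERS_JSON = """
{ "players" : [
    { "firstName" : "Ryan", "lastName" : "Campbell", "position" : "S" },
    { "firstName" : "Chris", "lastName" : "Campbell", "position" : "QB" },
    { "firstName" : "Kevin", "lastName" : "Hale", "position" : "DT" }
]}
"""

# json.loads turns the text into ordinary Python dictionaries and lists.
data = json.loads(PLAYERS_JSON)

# Build a lookup from each player's full name to his position.
positions = {
    player["firstName"] + " " + player["lastName"]: player["position"]
    for player in data["players"]
}

print(positions["Chris Campbell"])  # QB
```

One call to the parser and the data is ready to use; there are no tags to walk and no stylesheet to apply.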

 

An orientation for the acronym abused

Feeling slightly nauseous with all these acronyms? Head swimming? What you're witnessing is the rapid development of many different information architectures to solve various problems. Some of these architectures are for presentation (e.g. HTML), some are for sending small packages of data that update a presentation (e.g. JSON), some are for heavy-duty information dissemination (e.g. XML).


Some typical scenarios for different information architectures:


  • Showing information in a web browser: HTML is the preferred technology. XHTML is an attempt to make HTML as orderly as XML. DHTML recognizes that static HTML can be made interactive with JavaScript, etc. In Terry's opinion, one would never serve XML to a web browser because you can never guarantee which XML parser will be used in which web browser, hence you can never be sure of the visual presentation of your information. Terry can show you his many scars from trying to serve XML to web browsers, if you ask him nicely. To show information in an XML document in a web browser, you should first use an XSL stylesheet to transform it to HTML. HTML is the preferred technology for information presentation.


  • Serve active content in a web browser: DHTML is the preferred technology. At a user event such as an image rollover, a call can be made to a web server for more information to update just a portion of a visible web page. JSON is the preferred technology for sending small, easy-to-consume packages of information to a web browser. You now understand the heart of the new Web 2.0 initiative called AJAX, Asynchronous JavaScript and XML.

  • Up-stream information providers: XML is the preferred technology to receive information from up-stream information providers. For example, each week your collaborator in Paraguay sends you the latest sales figures. The sales figures come onboard as an XML document. The first thing you do is check that the XML document is well-formed and valid as a 'sales report.' You then shred the XML document and update your databases, etc.

  • Down-stream information consumers: XML is the preferred technology to disseminate information to down-stream information consumers. For example, each week you send sales orders out to your branch offices. You send them XML documents. Analogs of this: RSS feeds.

  • Long-term data storage: XML would be the preferred technology. It is composed of simple text files so it doesn't make any assumptions about technology platform or operating system.
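
The up-stream scenario above (check the document, then shred it) might look something like the Python sketch below. Everything specific here is an assumption made for illustration: the element names salesReport, sale, item and units, the sample data, and the function name shred. The hand-written structure check only stands in for real schema validation, which Python's standard library does not provide.

```python
import xml.etree.ElementTree as ET

# A made-up weekly sales report; the element names are hypothetical.
REPORT = """<salesReport week="2007-W40">
  <sale><item>shovel</item><units>12</units></sale>
  <sale><item>rake</item><units>5</units></sale>
</salesReport>"""


def shred(source: str) -> list:
    """Check a sales report, then break it into (item, units) rows."""
    root = ET.fromstring(source)  # raises ParseError if not well-formed
    if root.tag != "salesReport":
        raise ValueError("not a sales report")
    rows = []
    for sale in root.findall("sale"):
        item = sale.findtext("item")
        units = sale.findtext("units")
        if item is None or units is None:
            raise ValueError("sale is missing item or units")
        rows.append((item, int(units)))
    return rows


print(shred(REPORT))  # [('shovel', 12), ('rake', 5)]
```

Each row is now ready to be written to a database table. In a production setting the structure check would be done by validating against an XML schema rather than by hand.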


 

To do: Martha Stewart calls you from her jail cell. She needs your help to decorate ... her data!