Web page semantics

I quote here from an article that I wrote for the Encyclopedia of Library and Information Science about the W3C.

"Residing behind the W3C’s vision of an all-encompassing, beneficent Web is the more ambition vision of the Web as a network of semantics. As described in a famous article in the Scientific American, the Semantic web would feature metadata that could be harvested mechanically and manipulated by inference tools to solve ad hoc everyday problems such as finding the office hours of the closest doctor, the cheapest available tickets for the theater, and so on. The subtitle of the Scientific American article summarizes this more august vision of a semantic web: “A new form of web content that is meaningful to computers will unleash a revolution of new possibilities.” At this time, however, web content that is meaningful to computers would fall within the web services paradigm, which is a machine-to-machine interaction over a network. Commonly web services exist within a cultural context of known identities, enforced security and shared definitions of concepts and language. A corporate intranet would be an example of such a cultural envelope. The cultural markers for success of web services within corporate intranets should be contrasted to the open Web of unknown identities, no security and doubtful conceptual and linguistic agreement."

Implementing the Semantic Web has been controversial. I have taught LIS 540 Information Systems, Architecture and Retrieval a number of times. "Does that stuff mean anything?" is a page devoted to the discussion of semantics and the web. (Some of you have even taken that course from me!) The cynic would be quick to point out that semantic problems ("What does Moby Dick really mean?") have not been solved in the print domain, and digitizing the print domain probably will do nothing to solve semantic problems. Might even make them worse, who knows.

We do know that Google distrusts metadata that it finds in web pages.

"Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem. This problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit." The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page

So the open Web is characterized by untrustworthy, short-lived web pages, which make an unlikely foundation for a world-wide platform of shared semantics.

There has been an emerging trend to locate semantics in documents that are placed in repositories by trustworthy agents and that will last long enough to justify the economics of time and effort in giving them semantics. [See the article "The Yahoo! Search Open Ecosystem" and its use of "data web" --- hmmm, data web might be the more evolved form of the rhetorical form semantic web.] An example would be a public repository created by the University of Washington of papers written by Husky professors, or articles submitted to peer-reviewed online scholarly journals.

Recently, I got very excited about Eprints DC XML, a new protocol for marking up scholarly articles in open repositories in England. It is an XML version of Dublin Core, which makes machine harvesting much easier. Unfortunately, Eprints DC XML is also a very labor-intensive strategy and will probably have limited application.

The W3C developed a protocol in the late 1990s specifically designed for web semantics: RDF (Resource Description Framework), but there has been resistance to its use because of its complexity.

"Since the initial experiments indicate that RDF data is hard to find, a more targeted search was conducted. During the first experiment, RDF data was found in only sixteen out of over half a million pages from the Open Directory. This number increased to 180 out of 2.9 million pages in the second run.
Overall, with the categories combined, this translates to 1018 out of 541,536 URLs containing RDF, 613 of them correct and 405 with incorrect RDF for the first run. In the second run, out of 2,952,010 pages, 1479 contained valid and 2940 contained invalid RDF.
The results of this survey suggest that RDF has not caught on with a large user community." Survey of RDF Data on the Web by Andreas Eberhart, August 15, 2002
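
To put the quoted counts in proportion, the arithmetic can be sketched in a few lines of JavaScript. The figures are the combined totals quoted above; the object layout and the function name rdfShare are assumptions made purely for illustration.

```javascript
// Combined totals quoted from the survey; the object layout is an assumption.
const runs = [
  { name: "first run", pages: 541536, valid: 613, invalid: 405 },
  { name: "second run", pages: 2952010, valid: 1479, invalid: 2940 },
];

// Share of pages that contained any RDF at all, as a percentage.
function rdfShare(run) {
  return ((run.valid + run.invalid) / run.pages) * 100;
}

for (const run of runs) {
  console.log(run.name + ": " + rdfShare(run).toFixed(2) + "% of pages contained RDF");
}
```

In both runs the share comes out well under one percent of the pages examined.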

So we conclude, from the decade or so that the vision of a Semantic Web has been around, from the availability of RDF for creating web semantics, and from the current lack of much that is concrete, that the right vehicle for embedding semantics cheaply and naturally as a person creates web pages has yet to appear.

Then RDFa appeared.

RDFa: Resource Description Framework in Attributes


The motivation for RDFa is the observation that many web pages have structured data that could appear as attributes of HTML elements. For example, a picture appears in an HTML page as an <img> tag, but it can carry a lot more freight in the form of attributes:

<img src="photo1.jpg"
  rel="license" resource="http://creativecommons.org/licenses/by/2.0/"
  property="dc:creator" content="Mark Birbeck"
/>

Here we find an image tag with its expected attribute "src", but also RDFa attributes such as:

rel="license" expresses the type of relationship between the image and the resource indicated, i.e., the resource tells us how the image is licensed.

resource="http://creativecommons.org/licenses/by/2.0/" is the URI that points to a certain resource.

property="dc:creator" indicates a type of property, here someone who is a "creator" (probably the person who took the picture). Note that the "dc" indicates that creator is being used in the sense specified by the Dublin Core namespace.

content="Mark Birbeck" supplies the content that the property attribute refers to. Aha! The person who took the picture was Mark Birbeck, and he has placed this photograph on the Web under a certain type of Creative Commons license.
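
One way to see what these four attributes add up to is to write them out as subject/predicate/object statements. The sketch below is only an illustration, not an RDFa parser: the function name imageTriples and the plain object standing in for the <img> element are assumptions, and it follows the reading given above by treating the image itself (its src) as the subject when no about attribute is present.

```javascript
// Hypothetical helper: turn the RDFa-style attributes of one element
// into subject/predicate/object statements.
function imageTriples(attrs) {
  // Assumption: without an "about" attribute, the image is the subject.
  const subject = attrs.about || attrs.src;
  const triples = [];

  // rel + resource: a relationship between the subject and another resource.
  if (attrs.rel && attrs.resource) {
    triples.push({ subject: subject, predicate: attrs.rel, object: attrs.resource });
  }

  // property + content: a relationship between the subject and a string.
  if (attrs.property && attrs.content !== undefined) {
    triples.push({ subject: subject, predicate: attrs.property, object: attrs.content });
  }

  return triples;
}

// A plain object standing in for the <img> element shown above.
const photo = {
  src: "photo1.jpg",
  rel: "license",
  resource: "http://creativecommons.org/licenses/by/2.0/",
  property: "dc:creator",
  content: "Mark Birbeck",
};

for (const t of imageTriples(photo)) {
  console.log(t.subject + " -- " + t.predicate + " --> " + t.object);
}
```

The image yields two statements: one about its license and one about its creator.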

Golly, this is beginning to feel like the Semantic Web. Or, maybe the Data Web.

The RDFa Attributes


RDFa exploits a number of XHTML attributes that already exist and adds a few new ones. Here are a few examples:

rel - Expresses the relationship between two resources

content - A string expression

about - Identifies the resource being described

property - Expresses a relationship between the subject and some quality


What is interesting about the use of RDFa in web pages is that two readers are accommodated: (1) The human reader reads the styled web page in the browser, and (2) An application reads the HTML code and finds RDFa marked-up content. Neither reader gets in the way of the other.


Examples


Use your imagination with these examples and consider how some web searchbot could look for these attributes in web pages and harvest them for some index.

The most interesting examples of RDFa occur inline with other elements that are providing content to be styled for the human reader.

<!-- The xmlns:dc attribute defines "dc" as a reference to the Dublin Core namespace -->
<html
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>Jo's Blog</title>
  </head>
  <body>
    <!-- A property attribute sits on a span inside a naturally occurring h1 element -->
    <h1><span property="dc:creator">Jo</span>'s blog</h1>
    <p>
      Welcome to my blog.
    </p>
  </body>
</html>
     



<!-- Declarations of the XHTML namespace and two others: FOAF and Dublin Core -->
<html
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  >
  <head>
    <title>My home-page</title>
    <!-- Dublin Core 'creator' property with the content 'Mark Birbeck' -->
    <meta property="dc:creator" content="Mark Birbeck" />
    <!-- Friend of a Friend 'workplaceHomepage' relationship with its href -->
    <link rel="foaf:workplaceHomepage" href="http://www.formsPlayer.com/" />
  </head>
  <body>...</body>
</html>
     
     



<!-- Declare a biblio namespace alongside the XHTML namespace -->
<html
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:biblio="http://example.org/"
  >
  <head>
    <title>Books by Marco Pierre White</title>
  </head>
  <body>
    I think
    <!-- Points to an ISBN and categorizes it as a 'book' -->
    <span about="urn:ISBN:0091808189" instanceof="biblio:book">
      White's book 'Canteen Cuisine'
    </span>
    is well worth getting since although it's quite advanced stuff, he
    makes it pretty easy to follow. You might also like his
    <!-- Points to an ISBN and categorizes it as a 'book' -->
    <span about="urn:ISBN:1596913614" instanceof="biblio:book">
      autobiography
    </span>.
  </body>
</html>
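
To make the searchbot idea concrete, here is a rough sketch of the indexing half of the job. It assumes the bot has already pulled (page, property, value) records out of pages like the three above; the record shape, the page addresses under example.org, and the function names buildIndex and lookup are all invented for illustration.

```javascript
// Build a two-level index: property -> value -> list of pages.
function buildIndex(records) {
  const index = new Map();
  for (const record of records) {
    if (!index.has(record.property)) {
      index.set(record.property, new Map());
    }
    const byValue = index.get(record.property);
    if (!byValue.has(record.value)) {
      byValue.set(record.value, []);
    }
    byValue.get(record.value).push(record.page);
  }
  return index;
}

// Return the pages on which a property has a given value (empty list if none).
function lookup(index, property, value) {
  const byValue = index.get(property);
  return byValue && byValue.has(value) ? byValue.get(value) : [];
}

// Hypothetical harvest results; the page addresses are made up.
const records = [
  { page: "http://example.org/jo-blog", property: "dc:creator", value: "Jo" },
  { page: "http://example.org/home", property: "dc:creator", value: "Mark Birbeck" },
  { page: "http://example.org/home", property: "foaf:workplaceHomepage", value: "http://www.formsPlayer.com/" },
];

const index = buildIndex(records);
console.log(lookup(index, "dc:creator", "Jo"));
```

With an index like this, a question such as "which pages name Jo as creator?" becomes a single lookup rather than a full-text search.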
          
     

Semantic Radar


As you might suspect, somebody has already written a Firefox plug-in, Semantic Radar, that adds extra functionality to the Firefox browser - it informs you of the semantic markup that may lurk in a web page that you're viewing.




Oh come on! We can do that!


In a recent blog posting (February 4, 2008) Bob DuCharme states:

An important principle has been the ability to make a web page's data readable by both eyeballs and automated processes. This is great, but there are two related issues that I feel need a higher profile: first, RDFa has great potential for storing non-eyeball information in web pages. Secondly, examples like the one above go after microformats on their own turf, where they're dug in pretty well. Being a more generalized, scalable solution, RDFa can do a lot more than microformats, and with many of those other applications having more commercial potential, I see them as the best growth area for the format.

Pricing is a good example. It's a huge area where people would be happy to give away data in the form of extra embedded metadata in their web pages, because it can drive new paying customers to the source of that data (for example, to sell more copies of the book with the ISBN 1930220111).

Obviously one of the places where I should have used RDFa markup was in our class textbook page. This is where there is structured information (i.e., our textbook information) that might be harvested automatically (i.e., pick out the author, pick out the book title).

Let's mark up our textbook page


For the basic architecture of my new textbook page I went to the YUI CSS grids and chose the "2 Column (33/66)" template.

I deleted the header and footer DIVs. I colored the font in the left column red and the center column blue.

Here is the marked up content for the center DIV.

<h3>Textbooks</h3>
<br>

<p>
<span property="dc:creator">Shafer, Dan</span> and 
<span property="dc:creator">Andrew, Rachel</span>. 
<span style="font-weight: bold" property="dc:title">HTML Utopia: Designing without tables using CSS</span>. 
<span property="dc:publisher">Sitepoint</span>, 2006
</p>

<br>
<p>
<span property="dc:creator">Powers, Shelly</span>. 
<span style="font-weight: bold" property="dc:title">Learning JavaScript</span>. 
<span property="dc:publisher">O'Reilly</span>, 2007
</p>
        

We have to declare our use of the "dc" (Dublin Core) namespace, so let's add the following to the <html> element of our page.

<html xmlns:dc="http://purl.org/dc/elements/1.1/">      
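
The declaration matters because a harvester can only interpret "dc:creator" by looking up what "dc" stands for. A minimal sketch of that lookup follows; the function name expandPrefixedName and the prefix table passed to it are assumptions made for illustration, and real RDFa processors handle more cases than this.

```javascript
// Hypothetical helper: expand a prefixed name such as "dc:creator"
// into a full URI, using a table of declared prefixes.
function expandPrefixedName(name, prefixes) {
  const colon = name.indexOf(":");
  if (colon === -1) {
    return name; // no prefix at all, leave unchanged
  }
  const prefix = name.slice(0, colon);
  const local = name.slice(colon + 1);
  if (!Object.prototype.hasOwnProperty.call(prefixes, prefix)) {
    return name; // prefix was never declared, leave unchanged
  }
  return prefixes[prefix] + local;
}

// The one prefix declared on our <html> element.
const prefixes = { dc: "http://purl.org/dc/elements/1.1/" };

console.log(expandPrefixedName("dc:creator", prefixes));
// prints "http://purl.org/dc/elements/1.1/creator"
```

Without the xmlns:dc declaration the prefix table would be empty, and "dc:creator" would stay an uninterpreted string.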
      

After these preliminaries, my page looked like this.



Harvesting semantics


I'll give you an XPath algorithm that harvests the semantics of this page. Can you place the semantics in a nice table in the left column?

[Note: Since I'm using XPath, this example will only work in Firefox!]

Here's the code for a button to put at the bottom of your page:

<input type="button" value="Set of data" onclick="findSetsOfData()"  />

And here's some JavaScript that finds sets of data:

function findSetsOfData()
{
    // Find all the paragraphs, in document order
    var paragraphs = document.evaluate("//p",
        document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

    // For each paragraph
    for (var a = 0; a < paragraphs.snapshotLength; a++)
    {
        // Find all the child nodes
        var theseChildren = paragraphs.snapshotItem(a).childNodes;

        for (var i = 0; i < theseChildren.length; i++)
        {
            var child = theseChildren[i];

            // Keep only <span> elements that carry a 'property' attribute.
            // Text nodes have no hasAttribute method, so test nodeName first.
            if (child.nodeName === "SPAN" && child.hasAttribute("property"))
            {
                // Get the attribute value, e.g. "dc:creator"
                var attValue = child.getAttribute("property");
                // Get the text inside the <span>, trimmed of surrounding whitespace
                var value = child.textContent.trim();

                alert("The attribute " + attValue + " is " + value);
            }
        }
    }
}

Here's a working example:




If you can reach into the web page and find the RDFa semantics, then you can imagine building some application that would automatically harvest the semantics of the web page and post them in an obvious place.
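
As a hedged starting point for that kind of application, the sketch below covers only the presentation step: it turns harvested property/value pairs into the markup for a small table. The names escapeHtml and pairsToTable are invented here, the sample pairs are typed in by hand rather than harvested, and wiring the result into the page is deliberately left out.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn a list of { property, value } pairs into table markup, one row per pair.
function pairsToTable(pairs) {
  const rows = pairs.map(function (pair) {
    return "<tr><td>" + escapeHtml(pair.property) + "</td><td>" +
      escapeHtml(pair.value) + "</td></tr>";
  });
  return "<table>" + rows.join("") + "</table>";
}

// Hand-typed sample pairs, matching the second textbook entry.
const harvested = [
  { property: "dc:creator", value: "Powers, Shelly" },
  { property: "dc:title", value: "Learning JavaScript" },
  { property: "dc:publisher", value: "O'Reilly" },
];

console.log(pairsToTable(harvested));
```

Escaping the values first is a design choice: harvested text is treated as data, never as markup, before it is written back into a page.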

Take a look at the following image. I modified my web page so that it harvested the semantics and then presented them in the left column of the page.




Modify the JavaScript above to harvest the RDFa in the webpage and show those metadata in the left column of the web page. When you're finished, you can add it to your deliverables page. Well done!