Surfing the Semantic Web

This article was first written and published on Krimson.be. It provides an introduction to the Semantic Web and Open Data. The Archipel project aims to exchange data between repositories in a transparent fashion. The project and Work Package 4 rely on semantic technologies like RDF and SPARQL to achieve this goal.

May you live in interesting times.

This ancient proverb isn't far off these days. In the last decade, the Web has become ubiquitous in our daily lives. It has become far more than just a digital library of HTML pages. It's a virtual space where people meet, communicate and interact. The unprecedented expansion and evolution of the Web has spawned vast quantities of data distributed over countless server clusters, domains and databases. So far, we've managed to keep the Web accessible, allowing surfers to tap into it and retrieve relevant information. The big challenge is to make all that information meaningful so we can do more than just query it with simple keywords.

Here's why.

Search engines like Google and Yahoo! have done a sterling job over the past decade of handing us the tools to search the Web for relevant answers to our queries. Still, the Web can do better. Search engines only return lists of relevant results based on a few keywords, factored with a host of parameters and algorithms. The real power of the Web stays dormant. What if not only search engines but all kinds of web applications, like social networks, location-based tools, calendars and banking applications, could fully “understand” content? Then they could aggregate data and hand us results with a higher information density.

On the Semantic Web, searching for movies playing in local theaters becomes a breeze. Not only would a semantics-enabled search engine retrieve a list of results, it could also show when movies are scheduled, give you ratings and reviews retrieved from other online sources and display ticket prices. We wouldn't have to click through to each theater's site to find out which one offers the best deal.

So, how does it work?

The Semantic Web relies heavily on the concept of 'resources'. A resource can be anything from a person (their contact information) to a book (its bibliographic record). Resources are identified on the Semantic Web through URIs. The data they represent are described as triples. These are statements of the form subject-predicate-object. For example, “The New York Times is a newspaper” states what kind of thing The New York Times is. The concept of a newspaper is described in a common machine-readable vocabulary, or ontology, allowing a machine to unambiguously understand what a newspaper is, to know that The New York Times is a newspaper and to know that http://www.nytimes.com refers to The New York Times. Resources can also be interlinked with each other, creating relationships which in turn can also be described.

For instance, the Friend-of-a-Friend ontology (FOAF) allows you to describe persons, their activities and their relations to other people and objects.
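To make the triple model a little more concrete, here is a minimal Python sketch. It stores triples as plain tuples and follows a FOAF "knows" link from one person to another. The FOAF namespace and the terms Person, name and knows are real; the two example.org people are invented for illustration, and a real application would more likely use an RDF library than hand-rolled tuples.

```python
# A triple is modelled here as a plain (subject, predicate, object) tuple.
FOAF = "http://xmlns.com/foaf/0.1/"  # the FOAF vocabulary namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical resource URIs, invented for this example.
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"

triples = [
    (ALICE, RDF_TYPE, FOAF + "Person"),
    (ALICE, FOAF + "name", "Alice"),
    (ALICE, FOAF + "knows", BOB),
    (BOB, RDF_TYPE, FOAF + "Person"),
    (BOB, FOAF + "name", "Bob"),
]


def objects(graph, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for (s, p, o) in graph if s == subject and p == predicate]


def friends_names(graph, person):
    """Follow foaf:knows links, then look up each friend's foaf:name."""
    names = []
    for friend in objects(graph, person, FOAF + "knows"):
        names.extend(objects(graph, friend, FOAF + "name"))
    return names


print(friends_names(triples, ALICE))  # prints ['Bob']
```

Because every statement has the same three-part shape, a relationship between two resources is stored and looked up in exactly the same way as a simple attribute such as a name.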

Watch the "Intro to the Semantic Web" video for a more verbose explanation.

How can we make data “meaningful”? The technologies to do this are already here. They enable us to enrich resources with semantic markup. Microformats allow you to mark up events, contact information, locations and more. But the real workhorse for the Semantic Web is RDFa, a language that makes it easy to embed extra, machine-readable metadata in your HTML markup. This way, a web application can not only retrieve a web page, it can also understand what's on the page. If a web page contains an address marked up with RDFa, a browser can pick that up and ask the user whether to open a map or contact application. RDFa is a W3C Recommendation.
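As a rough illustration of how a machine might pick up such an address, the Python sketch below collects the values of RDFa-style property attributes from an HTML fragment. It is deliberately simplified and is not a conforming RDFa processor: it ignores vocabulary resolution and nested markup. The sample fragment and its schema.org terms are assumptions made for this example, not something taken from the article.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collects (property, value) pairs from RDFa-style `property` attributes.

    Simplified on purpose: no vocab/prefix resolution, and an element's text
    is considered finished at the first closing tag that follows it.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_property = None  # property whose text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:  # value given explicitly
            self.pairs.append((prop, attrs["content"]))
        elif "href" in attrs:  # value is a link target
            self.pairs.append((prop, attrs["href"]))
        else:  # value is the element's text
            self._open_property = prop
            self._text = []

    def handle_data(self, data):
        if self._open_property is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_property is not None:
            value = "".join(self._text).strip()
            self.pairs.append((self._open_property, value))
            self._open_property = None


def extract_properties(html):
    """Parse an HTML fragment and return the collected (property, value) pairs."""
    parser = PropertyCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical page fragment, written for this example.
SAMPLE = """
<div vocab="http://schema.org/" typeof="PostalAddress">
  <span property="streetAddress">1 Example Street</span>,
  <span property="addressLocality">Ghent</span>
</div>
"""

print(extract_properties(SAMPLE))
# prints [('streetAddress', '1 Example Street'), ('addressLocality', 'Ghent')]
```

The human-readable page stays exactly the same; the extra attributes simply tell software which piece of text is the street and which is the city.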

If you want to know more about RDFa, check out this short introductory video.

We'll also need tools to leverage semantically marked-up data. Search engines are already moving towards semantic search. But one of the cornerstones is a query language called SPARQL. Much like SQL allows you to query data in a database, SPARQL lets you query the Semantic Web. SPARQL treats the Semantic Web as one repository of interlinked resources, or linked data. Seen this way, the Semantic Web is also called the Giant Global Graph. A SPARQL query travels through that graph, following links between resources and returning results that match your conditions. So a SPARQL query isn't limited to a single web page: it can retrieve results from across different domains.
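The idea of a query travelling through a graph can be sketched in a few lines of Python. This is a toy matcher, not a SPARQL engine: it takes a list of triple patterns in which names starting with "?" are variables, and returns every combination of values that satisfies all patterns at once. The "ex:" names and the tiny graph are invented for illustration.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.

    Returns the extended binding, or None if the triple does not fit.
    """
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # a variable
            if result.get(term, value) != value:
                return None  # clashes with an earlier binding
            result[term] = value
        elif term != value:  # a constant that does not match
            return None
    return result


def query(graph, patterns):
    """Return all variable bindings that satisfy every pattern (a basic join)."""
    bindings = [{}]
    for pattern in patterns:
        extended_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    extended_bindings.append(extended)
        bindings = extended_bindings
    return bindings


# Hypothetical data: the "ex:" names are made up for this example.
graph = [
    ("ex:NYTimes", "rdf:type", "ex:Newspaper"),
    ("ex:NYTimes", "ex:publishedIn", "ex:NewYork"),
    ("ex:Guardian", "rdf:type", "ex:Newspaper"),
    ("ex:Guardian", "ex:publishedIn", "ex:London"),
    ("ex:Wired", "rdf:type", "ex:Magazine"),
]

patterns = [
    ("?paper", "rdf:type", "ex:Newspaper"),
    ("?paper", "ex:publishedIn", "?city"),
]

for row in query(graph, patterns):
    print(row["?paper"], row["?city"])
# prints:
# ex:NYTimes ex:NewYork
# ex:Guardian ex:London
```

Sharing the variable ?paper between the two patterns is what links them together, much as a join does in SQL.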

Let's clarify this with an example. DBpedia is a project that adds semantic sugar to Wikipedia articles and provides tools to query the data with SPARQL. The end result: an open repository of linked data which is accessible to anyone from anywhere. Now, how do I query this data? Let's take Bob DuCharme's example. He wanted to retrieve a list of the things Bart Simpson writes on the blackboard at the beginning of a collection of Simpsons episodes. Retrieving that information through a regular search engine would be arduous. Creating and running a SPARQL query against DBpedia makes it really easy. This is Bob's query:


SELECT ?episode ?chalkboard_gag WHERE {
?episode skos:subject
.
?episode dbpedia2:blackboard ?chalkboard_gag
}

Now open up your browser, surf to http://dbpedia.org/snorql, paste in the query and run it. It will return a set of chalkboard gags for Season 12 of The Simpsons. Great!
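The same kind of query can also be sent from a program instead of a browser form. The Python sketch below builds an HTTP request that passes the query text in a "query" parameter, as the SPARQL protocol describes, and flattens a result in the standard SPARQL JSON results format. The endpoint address http://dbpedia.org/sparql is an assumption (the article only mentions the SNORQL page), and the query string is a generic placeholder rather than Bob's query.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; the article itself only mentions the SNORQL page.
ENDPOINT = "http://dbpedia.org/sparql"

# Generic placeholder query, used only to show the mechanics.
PLACEHOLDER_QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, sparql):
    """Build a GET request that carries the query in the `query` parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": sparql})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows(results):
    """Flatten a SPARQL JSON result document into plain dicts of values."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint, sparql):
    """Send the query and return the flattened rows (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, sparql)) as response:
        return rows(json.load(response))


# Offline demonstration with a hand-written result document.
sample_results = {
    "head": {"vars": ["episode", "chalkboard_gag"]},
    "results": {
        "bindings": [
            {
                "episode": {"type": "uri", "value": "http://example.org/episode/1"},
                "chalkboard_gag": {"type": "literal", "value": "An example gag"},
            }
        ]
    },
}

print(build_request(ENDPOINT, PLACEHOLDER_QUERY).full_url)
print(rows(sample_results))
```

Calling run_query(ENDPOINT, PLACEHOLDER_QUERY) would perform the actual request; it is left out of the demonstration so the sketch runs without a network connection.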

Now, try to imagine how we could do the same with other kinds of data.

May you come to the attention of those in authority.

Where does Drupal come into play?

Drupal powers roughly 1% of the Web. While that doesn't sound like much, it does represent a lot of data. Moreover, Drupal is used in a variety of ways, ranging from small blogs to mainstream newspaper websites. Manually adding semantic information to this segment of the Web would require a titanic effort. Drupal has made it easy to manage content, and now it can make it just as easy to automatically intersperse RDFa in your published content.

Dries Buytaert has stressed the position of Drupal as a potential major enabler for the Semantic Web. Drupal 7 will support publishing RDFa enriched content out of the box. When you build or upgrade your website to D7, your content will automatically become part of the Semantic Web. It's a part of the new Fields API. When you create new entity types, Drupal will add RDFa to them. It's one of the main reasons why it's so important to get Drupal 7 moving.

Check out this movie from DERI Galway on how Drupal and RDFa can work together.

May you find what you are looking for

Publishing semantically enriched data is only half of the story. Even if we go the extra mile and add RDFa to our content, who is going to notice apart from a select few like the big search engines? Wouldn't it be great if we could also take advantage of semantically published data ourselves? Mashing our own content with information coming from the outside creates added value and drives traffic to our site. The website of a movie theater that uses reviews, actor biographies, public transport timetables and the like to enrich its own content will have an edge over competitors that don't.

Again, Drupal is a great platform to turn your project into a consumer, thanks to its flexible approach of adding functionality through modules. A great example is the SPARQL Views plugin maintained by Lin Clark of DERI Galway. This Drupal module extends the omnipresent Views module with SPARQL support. The Views module allows developers to easily and efficiently create and theme lists of content. It's a heavily used module, but it only queries the local database. The SPARQL Views plugin lifts that restriction while reusing the Views interface and theming functions. Using display plugins for Views like Display Suite or Panels, we can integrate the results seamlessly in our own content.

The project is currently under heavy development and could use a lot of feedback. The advantage of this plugin is the flexibility and ease with which it allows you to query linked data. This enables less experienced developers to also connect with the Semantic Web.

Want to know more about SPARQL Views? Check out this video!

What's in it for me?

So, why hasn't it happened already? If the tools are readily available, why isn't there a massive drive to improve them and get the Semantic Web into the mainstream?

RDFa implementation in most cases involves big projects that have large datasets, enough funding and the knowledge to go the extra mile. The SemWeb is still largely rooted within organizations that invest heavily in R&D, such as academic institutions and large corporations. But what about the rest of the Web?

Although it's relatively easy to publish RDFa-enriched content, writing applications which connect with the Semantic Web is not yet common practice. Support from popular tools like Drupal breaks down an important technological barrier: it enables developers to easily reuse data on their own websites. Yet there are other hurdles which might prove harder to clear.

Why would clients want to invest in connecting with the Semantic Web? What's in it for the client? The Semantic Web is inherently open, which means that published data can easily be reused by other parties. Stakeholders might not be so willing to share data, for a variety of reasons: legal rights, marketing, advertising, image building and so on. If your content gets reused on other websites and applications, why would a user go back to the primary source, one could argue? Doesn't that drive traffic away?

Actually, sharing your content through RDFa levels the playing field. Secondary sources that reuse your content tend to generate backlinks to the original source. Depending on the quality and correctness of your content, your site gains authority. Good examples of authoritative projects are Wikipedia, IMDB and RottenTomatoes. The high quality of their content generates traffic. Since interlinking is one of the foundations of the SemWeb, linking back to the original resource is actually encouraged.

Tim Berners-Lee, inventor of the Web, commented on data sharing:

"The less inviting side of sharing is losing some control. Indeed, at each layer --- Net, Web, or Graph --- we have ceded some control for greater benefits.

People running Internet systems had to let their computer be used for forwarding other people's packets, and connecting new applications they had no control over. People making web sites sometimes tried to legally prevent others from linking into the site, as they wanted complete control of the user experience, and they would not link out as they did not want people to escape. Until after a few months they realized how the web works. And the re-use kicked in. And the payoff started blowing people's minds.

Letting your data connect to other people's data is a bit about letting go in that sense. It is still not about giving to people data which they don't have a right to. It is about letting it be connected to data from peer sites. It is about letting it be joined to data from other applications.

It is about getting excited about connections, rather than nervous."

Stakeholders tend to focus on the content inherent to their own project, unaware of the opportunities that connecting with other sources elsewhere on the Web might bring. The Semantic Web adds a new layer of possibilities, but we need to think about how we can use it and how our projects can benefit from it. So, when we plan a project, we need to think outside the problem domain, prospecting other content sources and establishing viable business cases where reuse adds value to a client's project.

Hard boiled

The fact that consuming the SemWeb isn't trivial has created a chicken-and-egg problem. As long as semantic data aren't being used, implementing RDFa adds little value. But with only a small base of RDFa-enabled content, there is little room for building a variety of consumers.

The Open Graph Protocol is a great example of how this chicken-and-egg problem can be tackled. It's a protocol that allows you to integrate web pages into the social graph, which is a representation of relationships in a social network. Facebook has integrated the Open Graph Protocol. When you implement OGP (which is an RDFa vocabulary) on your own web pages, these can be turned into Facebook Pages. This enables developers to use Facebook's popular 'Like' button outside Facebook, on their own content. If a user clicks on the button, Facebook connects your web page with the user's profile, pulling data from your page (images, a teaser text, the title and so on) and making it available on Facebook for everyone to share and reuse.
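In practice, OGP metadata is a handful of meta elements in the head of a page. The Python sketch below renders such elements from a dictionary. The property names title, type, url and image are the basic Open Graph properties; the movie and its URLs are invented for illustration, and a site built on a CMS would normally let a module generate this markup.

```python
from html import escape


def open_graph_tags(properties):
    """Render a dict of Open Graph properties as <meta> elements, one per line."""
    return "\n".join(
        '<meta property="og:{}" content="{}" />'.format(
            name, escape(value, quote=True)
        )
        for name, value in properties.items()
    )


# Hypothetical page data, invented for this example.
page = {
    "title": "Example Movie",
    "type": "video.movie",
    "url": "http://example.org/movies/example-movie",
    "image": "http://example.org/posters/example-movie.jpg",
}

print(open_graph_tags(page))
```

Escaping the values matters because titles and teasers often contain quotes or ampersands, which would otherwise break the attribute.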

IMDB added OGP support to their dataset. Each movie detail page has a 'Like' button. When clicked, the content (movie title, poster, small teaser) gets exposed on Facebook. A backlink on Facebook drives traffic back to IMDB. Since Drupal 7 already supports RDFa, it shouldn't be much of a stretch to implement OGP in your own web projects too. Stéphane Corlosquet at MIND/MGH has already developed an OGP module for Drupal 7 which can embed OGP metadata automatically in your website.

Conclusion

The Semantic Web is slowly coming of age. Major, well-known players like Facebook are stepping in. As more tools become available, the bar for connecting web projects is lowered. As a major platform for content management and delivery, Drupal can become a great driver for the Semantic Web. While the technologies are there, the big challenge will be to rethink our frame of reference when we model content in a web project, and to learn to see new opportunities and business cases.
