Tuesday, June 07, 2011

Create structure with microdata

HTML was created some time ago to provide a structured way of creating content for the web. HTML makes sense to your browser and enables you as a user to view a website the way its developer intended (or at least that is the developer's intention). Having a structured way to create documents is great, and it is needed so that all browsers show the data in (almost) the same way. Browsers are not the only consumers, however. Companies that run search engines, for example Google Search and Microsoft Bing, deploy index robots that crawl the websites on the web, index what is on them, and store the results in the datastores they later query.

Both browsers and crawl robots need structured data. The current version of HTML has enough structure to display content in a browser, but for a crawl robot it is in some cases not enough. For example, if you create an image gallery you will most likely state where a picture was taken, who is in it, and when it was taken. You will do this in a simple way that a user understands once the browser has rendered the HTML. For a crawl robot, and for the algorithm that searches the index the robots build, this can be quite hard to figure out.

Another example: if you have a store that sells hardware supplies, you will most likely have a page that states where your shop is and when it is open. It would be ideal for a search engine if you stated, in a structured way, which data on this page describes your location, which part describes your opening hours, and which part describes what kind of store you are. The good news is that most modern websites are rendered from data that is already structured, for example in a database or in XML files. Sadly, when we display this data to users on the web via HTML, that structure is abandoned.

Putting this structure back lets applications and search engines benefit greatly. For this reason Microsoft, Google and Yahoo have created schema.org, which can be used as a vocabulary for the microdata format. This format lets you structure your data in such a way that search engines can index your site better and, as a result, may place it higher in the results when someone searches for it. Using the microdata format with the schema.org vocabulary gives you this benefit while keeping the way your data is shown in a browser the same.
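Coming back to the hardware store: if the page is rendered from a structured record anyway, the structure can be carried into the HTML instead of being thrown away. The Python sketch below is only an illustration under my own assumptions. The store record, its field names and the render_store function are hypothetical, and the type and property names (HardwareStore, PostalAddress, streetAddress, addressLocality, openingHours) are the ones schema.org lists as far as I know, so check them against schema.org before relying on them.

```python
from html import escape

# Hypothetical store record, as it might come out of a database row.
STORE = {
    "name": "Example Hardware Supplies",
    "street": "1 Example Street",
    "city": "Hilversum",
    "opening_hours": "Mo-Sa 09:00-18:00",
}


def render_store(store):
    """Render the record as HTML, labelling each field with an itemprop."""
    # Escape every value so the data cannot break the markup.
    name = escape(store["name"])
    street = escape(store["street"])
    city = escape(store["city"])
    hours = escape(store["opening_hours"])
    lines = [
        # The outer item: everything inside this div describes the store.
        '<div itemscope itemtype="http://schema.org/HardwareStore">',
        f'<b itemprop="name">{name}</b>',
        # The address is a nested item with its own scope and type.
        '<br><span itemprop="address" itemscope '
        'itemtype="http://schema.org/PostalAddress">',
        f'<span itemprop="streetAddress">{street}</span>, '
        f'<span itemprop="addressLocality">{city}</span>',
        '</span>',
        f'<br>Open: <span itemprop="openingHours">{hours}</span>',
        '</div>',
    ]
    return "\n".join(lines)


print(render_store(STORE))
```

The visitor still sees a name, an address and the opening hours, while a crawler can tell which piece of text is which.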

The schema.org hierarchy is currently not complete and will most likely never be complete, but it provides a first draft of how you could structure your data. Take the picture with its accompanying text that I mentioned at the beginning of this post. In plain HTML it would most likely look a bit like this:

<html>
<body>
<img src="some_image.png">
<br><b>Description : </b>Johan Louwers inspecting the place before takeoff
<br><b>Location : </b>Hilversum airfield
<br><b>Date : </b>07-MAY-2011
</body>
</html>

This is a very simple page that shows a picture and tells you who is in the picture, where it was taken and when it was taken. For a human this is no problem to interpret. For an algorithm it can be a little harder. This is exactly what microdata is for, and using the definitions from schema.org makes sure you are consistent. To make sure a search index knows what is what, and that it all relates to the picture, you can use the code below instead. The result in your browser will be exactly the same, but the indexing and the resulting position within a search engine can be completely different.

<html>
<body>
<div itemscope itemtype="http://schema.org/ImageObject">
<img src="some_image.png">
<br><b>Description : </b><span itemprop="description"><span itemprop="name">Johan Louwers</span> inspecting the place before takeoff</span>
<br><b>Location : </b><span itemprop="contentLocation">Hilversum airfield</span>
<br><b>Date : </b><time itemprop="datePublished" datetime="2011-05-07">07-MAY-2011</time>
</div>
</body>
</html>

If you look closely at the code you can see that we have placed most of the HTML code within a div:

<div itemscope itemtype="http://schema.org/ImageObject">

</div>

As you can see, we use itemscope to declare that this div starts the scope of an item (everything within the div), and we state that the itemtype is an ImageObject as defined by schema.org. To be precise, the itemtype is identified by its full URL, http://schema.org/ImageObject. If you visit that location you will find more detailed information about the properties you can use to describe an ImageObject.

We have changed the description to the following line:

<br><b>Description : </b><span itemprop="description"><span itemprop="name">Johan Louwers</span> inspecting the place before takeoff</span>

Here we nest two things. First we create an itemprop for the description, which is "Johan Louwers inspecting the place before takeoff". That is correct, but the "Johan Louwers" part is a person, so to point the interpreter of our microdata at that part of the description we place a second itemprop, name, within the itemprop for description. Note that a bare itemprop="name" attaches the name to the enclosing ImageObject; to describe Johan Louwers as a person in his own right, the name would have to sit inside a nested item that carries its own itemscope and the itemtype http://schema.org/Person.

The location is defined as shown below, using the itemprop contentLocation, which tells the search engine where the picture was taken. The great part of this is that more and more data is becoming location aware. By adding a proper location to your markup you contribute to this, and when someone searches for pictures with a search string such as "Hilversum Airfield 2011", your picture is more likely to come up.

<br><b>Location : </b><span itemprop="contentLocation">Hilversum airfield</span>

To make sure the date of this picture is also indexed correctly, we can add it in a machine-readable form (without having to think about which date format you want to display and whether it can be searched on) using the following option:

<br><b>Date : </b><time itemprop="datePublished" datetime="2011-05-07">07-MAY-2011</time>

Many more properties are available, and a property does not have to be displayed; it can be used only to inform the interpreter of your microdata. For example, I think the picture is quite family friendly and nothing inappropriate is shown in it. So I would like to inform the world that it is family friendly, which can be done with the itemprop isFamilyFriendly, a boolean value. I do not, however, want to show this explicitly. With the code below, placed inside the itemscope div, the value is only available in the source of my page and the user will not see anything displayed on screen:

<meta itemprop="isFamilyFriendly" content="true"/>
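To get a feeling for what an interpreter could recover from this markup, here is a small Python sketch that walks the HTML and collects every itemprop it meets. It is a simplification under my own assumptions, not how any real search engine works and not the full microdata algorithm: it treats the page as one flat item and ignores nested itemscopes. The class name ItempropCollector is hypothetical. The sample string repeats the picture markup, written with the full http://schema.org/ImageObject URL, the schema.org property name datePublished, and the hidden flag as a meta element with an itemprop.

```python
from html.parser import HTMLParser

SAMPLE = """
<div itemscope itemtype="http://schema.org/ImageObject">
<img src="some_image.png">
<br><b>Description : </b><span itemprop="description"><span itemprop="name">Johan Louwers</span> inspecting the place before takeoff</span>
<br><b>Location : </b><span itemprop="contentLocation">Hilversum airfield</span>
<br><b>Date : </b><time itemprop="datePublished" datetime="2011-05-07">07-MAY-2011</time>
<meta itemprop="isFamilyFriendly" content="true"/>
</div>
"""

# Elements that never have a closing tag, so they never go on the stack.
VOID_TAGS = {"img", "br", "meta", "link", "input", "hr"}


class ItempropCollector(HTMLParser):
    """Collect itemprop values from a single, flat microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.props = {}
        # One entry per open element: the property whose text is being
        # collected there, or None if the element carries no such property.
        self._open = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.item_type is None:
            self.item_type = attrs.get("itemtype")
        name = attrs.get("itemprop")
        collecting = None
        if name is not None:
            if tag == "meta":
                # meta carries its value in the content attribute.
                self.props[name] = attrs.get("content", "")
            elif tag == "time" and "datetime" in attrs:
                # time carries the machine-readable value in datetime.
                self.props[name] = attrs["datetime"]
            else:
                # Everything else: the value is the text inside the element.
                self.props[name] = ""
                collecting = name
        if tag not in VOID_TAGS:
            self._open.append(collecting)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for name in self._open:
            if name is not None:
                self.props[name] += data


collector = ItempropCollector()
collector.feed(SAMPLE)
print(collector.item_type)
for prop, value in collector.props.items():
    print(prop, "=", value)
```

The description comes out as the full sentence, the name as just "Johan Louwers", and the date as 2011-05-07 rather than the 07-MAY-2011 that is shown to the visitor, which is exactly the point of the datetime attribute.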

Now you can argue that this, along with all the other options you can use in your microdata, is not really useful for some home-built image gallery where you show your holiday pictures. On the other hand, why would you not want to turn up when someone is looking for the pictures of that great vacation you told him about? Is that not the sole reason you placed them online? If you are not willing to add this kind of microdata, that is perfectly fine. If you are running a business, however, it is a whole other matter. For a business, being found on the internet can be the difference between surviving or not, and this will become more and more the case. Someone who is interested in buying a product will most likely go online first and only then come to your shop. If he is unable to find your shop online, he is very unlikely to find your store and make a purchase. In other words, if you want to stay ahead of the rest it is very important that you take a good look at microdata, or you might find that the competition is winning.

1 comment:

Lawrence Woodman said...

Good article, but the following html is problematic and isn't valid microdata:
</span isFamilyFriendly="true"/>

You need to specify the itemprop, so you would be better off doing something like:
<meta itemprop="isFamilyFriendly" content="true"/>