
Working with web search engines

A post about a Google (XML) Sitemaps Generator for WordPress was a good starting point for learning about this technology. Google has defined a sitemaps format to help their web crawlers index websites.

Google has unveiled a new Google Sitemaps program allowing webmasters and site owners to feed it pages they’d like to have included in Google’s web index. Participation is free. Inclusion isn’t guaranteed, but Google’s hoping the new system will help it better gather pages than traditional crawling alone allows. Feeds also let site owners indicate how often pages change or should be revisited.

The technical details of the format are described in Google's sitemap protocol documentation. It is an XML format with six attributes. From the FAQ:

As with all XML files, any data values (including URLs) must use entity escape codes for the following characters: ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>). You should also make sure that all URLs follow the RFC 3986 standard for URIs, the RFC 3987 standard for IRIs, and the XML standard.
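The escaping rule above can be illustrated with a short sketch. This is not an official tool, just a minimal example using Python's standard library: `xml.sax.saxutils.escape` handles `&`, `<`, and `>` by default, and an extra entity map covers the two quote characters the FAQ mentions. The `sitemap_url_entry` helper name is hypothetical.

```python
from xml.sax.saxutils import escape

def sitemap_url_entry(url, lastmod=None):
    """Build one <url> element for a sitemap, escaping &, <, >, ', "."""
    # escape() converts & < > by default; the dict adds the quote entities
    loc = escape(url, {"'": "&apos;", '"': "&quot;"})
    entry = "<url><loc>{}</loc>".format(loc)
    if lastmod:
        # optional last-modification date, e.g. "2005-06-04"
        entry += "<lastmod>{}</lastmod>".format(lastmod)
    return entry + "</url>"

# the raw ampersand in the query string must become &amp;
print(sitemap_url_entry("http://example.com/page?a=1&b=2"))
# → <url><loc>http://example.com/page?a=1&amp;b=2</loc></url>
```

A real sitemap would wrap a list of such entries in a `<urlset>` root element, but the escaping step shown here is the part the FAQ warns about.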

This is one of several forms of website metadata that need to be understood and used in creating a successful website. The "robots.txt" file is explained at Thomas Brunt's Outfront, among other resources.

The /robots.txt is a de-facto standard, and is not owned by any standards body. There are two historical descriptions of it, along with a number of external resources.

The /robots.txt standard is not actively developed. See What about further development of /robots.txt? for more discussion. The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more see also the FAQ.
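One of the simple recipes mentioned above can be sketched directly. The rules file below is a made-up example, and the check uses Python's standard-library `urllib.robotparser`, which implements the /robots.txt convention described in the quote:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: all crawlers may fetch anything
# except pages under /private/ and /tmp/
rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler consults these rules before fetching a URL
print(rp.can_fetch("*", "http://example.com/index.html"))        # → True
print(rp.can_fetch("*", "http://example.com/private/data.html")) # → False
```

Note that, as the quote says, this is a convention rather than an enforced standard: robots.txt only works because crawlers voluntarily honour it.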

Note that this is not content creation or web software development. It is about defining the behind-the-scenes website data that coordinates your website with other web services. From here you can start to look at RSS feeds and similar technologies to improve your website.

This field is changing rapidly. Like much of information technology, it can be hacked at by amateurs, but those amateurs will miss many of the implications of their work until they become familiar with the underlying technological concepts and ideas.
