SEO for Web Developers: Page Construction

Posted Wednesday 6th April, 2011

Following on from “SEO for Web Developers: Keywords and Links”, this next article in my SEO series focuses on page construction. Whilst I’ve previously stated that in-links (i.e. external incoming links) are the fundamental workhorse of good SEO, it is also important to make sure you are constructing your pages in a way that easily exposes your content, and that clearly links it to your identified keywords.

Further to that, it’s important to know what the search bots are looking for when they spider your pages. From URL structure through to semantic markup and page-specific metadata, there is a multitude of features you can build in from the start that will improve the search engines’ insight into your content.

A note on ranking factors

Every two years, SEOmoz conduct a survey across various SEO experts which they use to publish statistical findings in relation to search engine rankings. This data is exceptionally useful in judging what elements to concentrate your efforts on during development.

In 2009 link metrics were worth a whopping 43% of value when calculating rankings. However, in the latest figures, that value has dropped to only 22%, which brings it in line with similar values for domain-level link authority (i.e. your domain is a “trusted” domain for quality). This reduction in size hasn’t resulted in other factors growing in value, but rather new factors have been introduced. These additions are domain-level keyword usage (how the keywords are relevant across the site), domain-level brand metrics (highlighting the importance of “brands” as a whole), and page-level traffic/query metrics.

Understanding the bots

Understanding how the search engine bots evaluate your pages and content is key to learning how page construction affects SEO.

Page vs Site

I have previously written, in my article “Semantics and Structure”, of the importance of remembering that the web is simply a network of single pages. This network has only a vague understanding of the human concept of a site, gained by grouping URLs beneath a fixed domain. Even that doesn’t necessarily translate to a single site, despite being used for authority rankings. Large sections of search bot AI have been dedicated to discerning site structure from the links within your pages.

Whilst we can see from the metrics above that domain-level factors are now more prevalent in the calculations for ranking, it is still important to imagine each page of your site as an independent unit, and structure your content and code accordingly. This means that the h1 element should contain the page title, not the site title. The exception is the home page, where the site title probably does belong in the h1 element.

Site-wide architecture should be situated within non-pertinent markup (i.e. markup that doesn’t apply any semantic emphasis on the content) and placed accordingly in the source order. I’ve heard very peculiar things about content placed in paragraph tags being more pertinent than content that isn’t. In my experience this is a significant fallacy. For more information about “pertinent markup”, read the “On-page optimisation” section later in this article.
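
As a minimal sketch of this idea, here is how an article page might be laid out. The element IDs and text are hypothetical and only there for illustration; the point is that the h1 holds the page title, while the site title sits in plain markup:

```html
<!-- Article page: the h1 is reserved for the page title -->
<h1>SEO for Web Developers: Page Construction</h1>

<p>Article content goes here.</p>

<!-- Site-wide furniture: no heading or emphasis elements -->
<div id="masthead">
    <a href="/">My site</a>
</div>
```

On the home page the arrangement would flip, with the site title taking the h1.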

Block-level analysis

Search engines, for the most part, follow an algorithm of block-level analysis. This means they will break a page down into sections (e.g. masthead, navigation, footer, main content, secondary content etc.) as a signal towards ranking the content within.

It is a good idea to make sure that the content you want to rank for is situated within your main content area. This may sound obvious but in a brave new world of modular development, and pages made up of “modules” of content, it is surprisingly easy to confuse the search engines and suffer for it in the rankings.
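
To illustrate, a page broken into recognisable blocks might look something like the following sketch. The IDs and content are assumptions on my part, chosen because they are common conventions rather than anything the search engines are documented to require:

```html
<div id="content">
    <!-- The content you want to rank for lives here -->
    <h1>Cheap holidays in Majorca</h1>
    <p>Main article or product text.</p>
</div>

<div id="secondary">
    <!-- Supporting modules: related articles, feeds, promotions -->
    <h2>Related articles</h2>
</div>

<div id="navigation">
    <ul>
        <li><a href="/">Home</a></li>
    </ul>
</div>

<div id="footer">
    <p>Copyright and legal links.</p>
</div>
```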

Source order is important!

It’s also important that you are prioritising your main content in your source order. Search engines will only evaluate the link text of the first use of a URI in a page. If you’re repeating a link with more valuable and relevant link text in your content than you are in your navigation, you’ll need to make sure the navigation comes after the main content. This is often the exact opposite of good UI design which will attempt to place navigation in the most obvious place; across the top, or down the left.

Take the following code as an example:

<ul>
    <li><a href="/about/">About me</a></li>
    <li><a href="/blog/">Blog</a></li>
    <li><a href="/contact/">Contact</a></li>
</ul>

<h1>My site</h1>

<p>Welcome to my site. In here you can find
<a href="/about/">information about me and
my career</a> and <a href="/blog/">my personal
blog about web development and the
internet</a>.</p>

Here the navigation links will be the first things evaluated by the bots, but they’re probably not the most contextually relevant links to the content in question. In fact, the second set of links includes good keywords and may add significant value. For this reason the better option would be this:

<h1>My site</h1>

<p>Welcome to my site. In here you can find
<a href="/about/">information about me and
my career</a> and <a href="/blog/">my personal
blog about web development and the
internet</a>.</p>

<ul>
    <li><a href="/about/">About me</a></li>
    <li><a href="/blog/">Blog</a></li>
    <li><a href="/contact/">Contact</a></li>
</ul>

Obviously you can change the visual position of this content with CSS. Whilst the more sophisticated search bots will render CSS in an attempt to detect invisible content (i.e. that which is hidden either through display or visibility rules, or that which is moved offscreen), they are less bothered about the reordering of content visually.
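
A rough sketch of that technique follows, assuming the navigation list from the example above is given a hypothetical ID of “navigation”. The navigation stays last in the source but is positioned at the top of the page. The spacing values are arbitrary:

```html
<!-- In the head of the page -->
<style>
    /* Leave room at the top of the page for the navigation */
    body {
        position: relative;
        margin: 0;
        padding-top: 3em;
    }

    /* Lift the navigation out of the flow and pin it to the top */
    #navigation {
        position: absolute;
        top: 0;
        left: 0;
        margin: 0;
        padding: 0;
    }
</style>

<!-- In the body: main content first, navigation last -->
<h1>My site</h1>

<p>Welcome to my site.</p>

<ul id="navigation">
    <li><a href="/about/">About me</a></li>
    <li><a href="/blog/">Blog</a></li>
    <li><a href="/contact/">Contact</a></li>
</ul>
```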

Creating content

It may sound entirely obvious but it’s exceptionally important to write good textual content for search engines. You will generally rank higher if your pages include a good balance of text and links that are highly relevant to each other.

Duplication is bad

Try to avoid duplication of content. By this I mean avoiding the same content on two different URIs. If your content is duplicated then you will be diluting its value by placing it in two places, even if it’s actually the same page served from two different URIs. I’ll cover the method you should use to declare the One True Version™ of your pages later on.

It’s also important to make sure your pages don’t duplicate content that is elsewhere on the internet. Good examples of this sort of repetition would be travel brochure text or product descriptions which are highly likely to be used on a multitude of affiliate sites; especially if that content is included in some sort of feed. Wherever possible try and create your own content; it will always serve you better and will ultimately separate you from the crowd.

Dynamic vs. static content

Dynamic content is that which updates regularly and hardly ever stays the same from day to day. Good examples of this might be a list of blog posts on a blog index page (where the blog items update regularly), news item indexes, feeds from other sites (e.g. RSS, Twitter etc.), and regularly moderated lists of links.

Static content is that which hardly ever changes. Good examples might be “about me” text on your blog, the description on a product detail page, and the article text on a blog article page.

It is very important to find the right balance of both dynamic and static content on your pages. Some pages will suit more static content (e.g. the article page of a blog) whereas others will suit more dynamic content (e.g. the index of that blog). In either case, make sure you’re including some of both types of content. On the article page, include links to the top articles on the blog, or feeds from Delicious or Twitter; on the index page, include some static “about this blog/author” text.

Design your URIs

Since links are the most important part of your SEO strategy, it’s hardly surprising that the design of the URI is a fundamental ranking factor. It’s important that you use just as much care with the URI as you do with the text of those links. Here are the important factors in good SEO-friendly URI design:

Keywords in your domain

If you can, try to get some keywords in your domain. Often the most successful domains will have one or more relevant keywords in their domain name, and the closer to the left the better. A good example of this might be something like travelsupermarket.com, which is currently number one in Google for “travel”.

Rather interestingly, _exact_ keyword match domains seem to perform marginally better than domains with _some_ keywords. By this I mean domains that are entirely formed of keywords in the order for which they are searched (e.g. cheapmajorcaholidays.com). Hyphenated exact match domains (e.g. cheap-majorca-holidays.com) seem to perform slightly worse than those without hyphens, and domains that contain all query terms but are not an exact match are marginally worse again.

Choose the right TLD

Rather notably the .com TLD (top-level domain) appears to perform better than any other TLD. That’s not to say that you can’t easily outrank a .com domain, but they do seem to gain an advantage in a like-for-like ranking test with pretty much every other TLD I tried.

Subdomains

Subdomains do not count as part of the domain. Cross subdomain links, as previously stated, are deemed internal and URIs containing a keyword subdomain appear to rank similarly to URIs with the same keyword as the first path element. Example:

http://games.mygamessite.com/

Appears to rank similarly to:

http://mygamessite.com/games/

Keep paths shallow

Try to keep your URI path to a minimum. The further down the hierarchy a keyword sits, the less value it carries. This is another good reason to maintain a flat site architecture. Also, remember that the keywords on the left of each path segment (i.e. each section delimited with /) are the ones with the highest value at that level.

The maximum number of layers (or segments) you should use in a path is about 3. Any more than that and you’ve basically lost any SEO benefit from that section of the URI.

Be strict with your characters

There is a limited subset of characters that is permissible in URI syntax. However, that subset still allows a great variation in the style of your characters. In general it is best to keep your URIs strictly lowercase so as to cut down on the chance of creating or generating duplicates through case sensitivity. Also, try and internationalise the character set in your URIs; it’s always going to be easier to match a UTF-8 search term against a UTF-8 keyword in your URI.

Space separators

There are several characters that can be decoded as a space in URI syntax, but only two that will work successfully as part of your path: “%20” and “-”. It is best to enforce the use of a hyphen as a word separator in your paths. Google understands this as a space, and it will ensure your URIs remain readable. Underscores (“_”) are not recognised as a word separator and are therefore of little use. Matt Cutts has previously discussed this in his “dashes vs. underscores” blog post.
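
To make the character and separator rules concrete, here are a few links written with them in mind. The paths are invented for illustration:

```html
<!-- Preferred: lowercase, hyphen-separated words -->
<a href="/football/transfer-news/">Transfer news</a>

<!-- Avoid: mixed case risks duplicate URIs for the same page -->
<a href="/Football/Transfer-News/">Transfer news</a>

<!-- Avoid: underscores are not treated as word separators -->
<a href="/football/transfer_news/">Transfer news</a>

<!-- Avoid: encoded spaces work, but are hard to read -->
<a href="/football/transfer%20news/">Transfer news</a>
```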

Branches and leaves

Do not underestimate the value of human readability when designing your URIs. I have personally experienced, through extensive user research and user testing, that users afford a certain level of confidence to a readable URI. What’s more, some SEO consultants whom I have spoken with recommend applying slightly different rules to branch and leaf URIs.

A branch URI is a node that could potentially lead to more branch URIs or leaf URIs. The recommendation is that these URIs should end with a /.

Branches:

http://sportsnews.com/football/
http://sportsnews.com/football/bundesliga/

A leaf URI is one that is the final node in the path tree. The recommendation is that these URIs should not end with a /, and should include some kind of filetype extension (usually .html). I prefer leaving out the extension myself, since it feels a bit old school and almost enforces a file-type assumption on the resource in question, but I accept I may be clouded by a developer’s understanding of HTTP and REST.

Leaves:

http://sportsnews.com/football/international/teams/england.html
http://sportsnews.com/contact-us.html
http://sportsnews.com/terms-and-conditions

These recommendations amount to a more traditional OS directory-style vision, which makes them more familiar to the general non-techie user. It’s worth remembering that the URI is displayed prominently in the SERPs and as such informs the users’ confidence in clicking the item.

On-page optimisation

Now you’ve hosted the pages on good SEO friendly URIs, and you’ve built a good network of links to those pages, it’s about time we looked at improving the way the search bots spider and evaluate your content:

Page title

The page title should sum up the content in as few words as possible. On left-to-right reading pages the leftmost words are deemed more significant (I’ve no research on right-to-left languages, but one would assume the opposite). A good title will have a healthy sprinkling of keywords whilst remaining human readable:

<title>SEO for Web Developers - Nefarious Designs</title>

Note that, in my titles, I’ve chosen to include the site title following the page title. Firstly, this means the site title will be included in the page title on the SERP, but it also means that the site title is associated as a keyword or keywords.

Semantic content markup

The majority of semantic markup won’t give you a significant boost in rankings. Sorry folks, it’s true; despite the fact that we web devs love some good semantic markup, the search bots are less bothered. Let’s face it, the internet is still full of badly constructed web pages and the bots have to spider, analyse, and rank those too.

However there are some semantic elements (and some presentational elements) that will have a greater influence on denoting keywords in your content to the search engines. These are:

Meta elements

There are plenty of articles misrepresenting the value of meta elements in terms of SEO. For pretty much the last ten years, they have been mostly irrelevant for SEO. In fact, at the moment, this very site isn’t using them (more through laziness than intention—the site templates are old and I intend on migrating away from them sooner or later).

Keywords

The keywords meta element is all but abandoned by today’s search engines. You will see little effect in rankings if you remove it altogether. I have noticed some smaller search engines using it, and it’s conceivable that other search engines may use links from these search engines as authority on your content. In short, you might see a small amount of benefit by proxy, but possibly not enough to warrant making the effort to keep the keywords list up to date and matching your content.

Description

The description meta element is the first port of call for the description of your page shown in the search results. There is also a minor ranking effect from including keywords in this description but, as ever, it’s important not to spam them here.
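
For reference, a description meta element looks like the following and sits in the head of the page. The wording of the content attribute is just an example of mine:

```html
<head>
    <title>SEO for Web Developers - Nefarious Designs</title>
    <meta name="description" content="How URI design, source order and on-page markup affect the way search engines rank your pages.">
</head>
```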

Link elements

The link element is a sneaky little devil. Most web developers use it solely for linking stylesheets to their pages, but it has so many other uses that can help the search engines understand your site architecture better.

The link element allows you to specify another URI that is linked to your page, and also what relationship that URI has with the current one. To specify the relationship, you declare it in the rel attribute. The rel attribute has many possible values, but only a few that are relevant to SEO:

Canonical

At a minimum it’s important to understand the canonical link element. This allows you to declare the One True Version™ of any page on your site. This is probably easier to explain through example. Consider the following two URIs:

http://sportsnews.com/football/

http://sportsnews.com/football/?ads=false

Assuming these are actually the same page, the search engine will consider these to be unique URIs and could penalise you for duplication of your content. To avoid this, it’s important to declare one version as the canonical version. To do this you add the following to the page:

<link rel="canonical" href="http://sportsnews.com/football/">

This method is also true of something more subtle like “topic” style pages. Imagine the following URIs:

http://sportsnews.com/teams/tottenham-hotspur/peter-crouch

http://sportsnews.com/teams/england/peter-crouch

By adding a canonical link that points to the first URI, we can declare that the second is really the same page served from a slightly different URI.
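
As a sketch, assuming the Tottenham Hotspur URI is the version you want indexed, both pages would carry the same element in their head:

```html
<!-- Served on both /teams/tottenham-hotspur/peter-crouch
     and /teams/england/peter-crouch -->
<link rel="canonical" href="http://sportsnews.com/teams/tottenham-hotspur/peter-crouch">
```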

Others

There are a whole host of available options to use in the rel attribute of the link element that may improve the search engines’ understanding of your pages and site structure. I really can’t be bothered listing them all out here, but you can find an interesting explanation of SEO relevant rel values here.

Search bots as knowledge-based systems

Modern search engine bots are incredibly complex programs. Considering the amount of information they gather on each pass of your pages, they should certainly be considered to be knowledge-based systems. As the web changes, so do the methods the bots adopt to navigate your pages in as close to a humanised way as possible.

To this end, both Googlebot and Bingbot are capable of understanding fairly complex navigation systems. What’s more, they are very good at evaluating navigation links within your pages to ascertain different routes to your content. Through this understanding they are able to apply value calculations to specific URIs.

Google announced at the end of 2009 that they were attempting to display site hierarchies as an alternative to the URL in the SERPs where they could surmise it from on-page breadcrumbs. For more information on this, see “New site hierarchies display in search” on the Googleblog.

To do this, Googlebot simply analyses any constructs that look like a breadcrumb within your pages. This can be either a list of links identified by “breadcrumb” as an ID or class, or a set of links separated by the “>” character. In my tests both of these methods have been successfully picked up by Googlebot and displayed as site hierarchies on the Google SERP.
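
The two constructs described above might be marked up roughly like this. The paths and labels are hypothetical, borrowed from the sports news examples earlier in this article:

```html
<!-- Option 1: a list of links identified by a "breadcrumb" class -->
<ul class="breadcrumb">
    <li><a href="/">Home</a></li>
    <li><a href="/football/">Football</a></li>
    <li><a href="/football/bundesliga/">Bundesliga</a></li>
</ul>

<!-- Option 2: a set of links separated by a greater-than character -->
<p>
    <a href="/">Home</a> &gt;
    <a href="/football/">Football</a> &gt;
    <a href="/football/bundesliga/">Bundesliga</a>
</p>
```

Note that the “>” separator is written as the &gt; entity in the source.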

There are other methods of marking up breadcrumbs using “rich snippets” which I will talk about later.

Learning new interaction paradigms

Googlebot has been designed to understand modern in-page navigation paradigms, such as tab, carousel, and accordion widgets. It does this by searching for the presence of known markup configurations and included scripts.

For this reason, it’s sometimes better to use commonly used scripts such as popular jQuery or jQuery UI widgets which are likely to have a large user base across the web. Sadly this often means that badly written front-end interaction scripts may perform better than more bespoke options.

You can also improve Googlebot’s understanding of your pages by adopting common element identifiers on your JavaScript enhanced elements. This means using classnames such as “open”, “closed”, “active”, or “selected” etc.

It is worth noting, however, that in my tests rankings seemed better when pages used prebuilt widgets (i.e. those already in its knowledge base) than with bespoke techniques using predictable classes or IDs.
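
As an illustration of those identifiers, a hand-rolled tab widget might expose its state like this. The structure and IDs are assumptions of mine, not a pattern Google has documented:

```html
<ul class="tabs">
    <li class="active"><a href="#fixtures">Fixtures</a></li>
    <li><a href="#results">Results</a></li>
</ul>

<div id="fixtures" class="panel open">
    <p>This weekend's fixtures.</p>
</div>

<div id="results" class="panel closed">
    <p>Last weekend's results.</p>
</div>
```

Using in-page anchors for the tab links also means the panels remain reachable if the script never runs.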

Rich snippets

These days Google is capable of rendering a large number of “rich snippets”. These are generally pieces of code specifically designed for marking up data of a particular type, e.g. microformats, microdata, and RDFa. Often these snippets simply involve a range of predictable element identifiers or tags.

When displaying search results, Google will try and handle these rich snippets in the best way it sees fit. For example, reviews will appear as a 5 star rating widget just below the title of the search item. Recipes, on the other hand, are awarded their very own search utility:

[Screenshot: Google Recipe search for “Chicken Madras”]

Currently, Google supports several types of rich snippet, including the reviews and recipes mentioned above.

For more information on implementing these snippets for your content, take a look at Google’s excellent rich snippet explanation page.
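
As a small taste, here is a sketch of a review marked up with the hReview microformat. The review itself is invented; the class names are those defined by the hReview specification:

```html
<div class="hreview">
    <span class="item">
        <span class="fn">Chicken Madras at The Curry House</span>
    </span>
    <p>
        Reviewed by
        <span class="reviewer vcard"><span class="fn">Jane Smith</span></span>
        on <abbr class="dtreviewed" title="2011-04-06">6th April 2011</abbr>.
    </p>
    <p>Rating: <span class="rating">4</span> out of 5</p>
    <p class="summary">Hot, rich and well worth the wait.</p>
</div>
```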

JavaScript

Googlebot currently runs a headless browser as it navigates pages. A headless browser is basically a browser with no user interface and, in this context, it means Googlebot constructs the correct DOM, and attempts to render all CSS and some JavaScript. It also means that many DOM events are fired as it follows links through your site, potentially exposing functionality it wouldn’t normally find.

The JavaScript related tests I have run to analyse the bots’ paths through my sites have largely been quite inconclusive; it appears that sometimes Googlebot simply evaluates embedded script in the page (i.e. code inside <script> elements). It also appears to find URLs manipulated into links on click events. However, it is definitely finding links in embedded script tags more regularly than those in external script files.

So far I haven’t managed to get Googlebot to spider content or links added during a DOMready or window.onload event. My tests are ongoing and I shall endeavour to document my findings on this blog once I have a clearer picture.
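
Given those findings, one cautious approach is to keep important links in the markup and let script enhance them, rather than injecting them on load. A minimal sketch follows; the IDs and the showInPanel helper are hypothetical:

```html
<!-- The link exists in the source, so no script is needed to find it -->
<a id="more-news" href="/football/news/">More football news</a>
<div id="news-panel"></div>

<script>
    // Hypothetical helper: a real site might fetch the page with Ajax
    function showInPanel(url) {
        var panel = document.getElementById('news-panel');
        panel.innerHTML = 'Loading ' + url;
    }

    // Enhance the existing link; the real href stays in the markup
    var link = document.getElementById('more-news');
    link.onclick = function () {
        showInPanel(this.href);
        return false; // cancel the default navigation
    };
</script>
```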

Summary

Apologies for the length of this article; it just kept growing! There was so much to cover that I really didn’t want to have to split it out into several sub-articles, especially considering the fact that it’s already part of a series.

Hopefully you’ve now got a head-start when it comes to building your pages for SEO. Tie this to the link building and keyword targeting you’ve been doing following my first article and you’ll be ranking well in the SERPs.

In the following article I’ll be looking at what you can do to help the search engine robots specifically by providing metadata for them, and how proper handling of HTTP can improve crawlability. I’ll also cover some basic tools that can aid your monitoring and improvement of SEO.

Included in: Architecture, Design, Development, HTML, Information Architecture, Reference, SEO, Tutorials, Web, Web Standards
