SEO tips: query strings, multiple languages, forms and other content management system issues

Posted on August 7, 2009 by Mikko Ohtamaa
Filed Under plone, technology

This post is a collection of search engine optimization tips for content management systems, especially for Plone.

Do not index query strings

It is often desirable to make sure that query string pages (http://yoursite/page?query_string_action=something) do not end up in the search indexes. Otherwise search bots might index pages such as the site’s own search engine results (yoursite/search?SearchableText=…), lowering the visibility of the actual content pages.

Googlebot supports wildcard pattern matching (* and $, not full regular expressions) in robots.txt and can be configured to ignore any URL with ? in it. See the example below.

Query string indexing causes the crawler to crawl things like
Also, “almost” human readable query strings look ugly in the address bar…

Top level domains and languages

Using a top level domain name (.fi for Finland, .uk for the United Kingdom, and so on) to distinguish between different languages and areas is the optimal solution from the SEO point of view. Search engines use TLD information to reorder the search results based on where the search query is performed (google.com and google.fi return different results).

Plone doesn’t use any query strings for content pages. Making robots ignore query strings is especially important if you are hosting a multilingual site and you use the top level domain name (TLD) to separate languages: if you don’t configure robots.txt to ignore ?set_language links, only one of your top level domains (.com, .fi, .xxx) will get proper visibility in the search results.

For example, we had a situation where our domain www.twinapex.fi did not get proper visibility because Google considered www.twinapex.com?set_language=fi the primary content source (the Finnish content was reachable through the English site and its language switching links).

Shared forms

Plone has some forms (send to, login) which can appear on any content page. These must be disallowed, or otherwise you might get a search result where the link goes to the form page instead of the actual content page.

Hidden content and content excluded from the navigation

Any content excluded from the sitemap navigation should be disallowed in robots.txt. E.g. if you check “exclude from navigation” for a Plone folder, remember to update robots.txt too.

In our case, our internal image bank must not end up being indexed, though the images themselves are visible on the site. Otherwise you get funny search results: if you search by a person’s name, the photo will be the first hit instead of the biography.
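To see which URLs a Googlebot-style wildcard rule would catch, the matching can be approximated in a few lines of JavaScript. This is only a rough sketch, not Google's actual matcher: the function names `patternToRegExp` and `isDisallowed` are made up for illustration, the two patterns are the ones from the Googlebot section of the example robots.txt in this post, and the sketch ignores Allow rules and rule precedence.

```javascript
// Hypothetical helpers for illustration; not part of Plone or any Google tool.

// Convert a Googlebot-style Disallow pattern ("*" wildcard, optional
// trailing "$" end anchor) into a regular expression that is matched
// from the start of the URL path.
function patternToRegExp(pattern) {
  var anchored = pattern.charAt(pattern.length - 1) === "$";
  var body = anchored ? pattern.slice(0, -1) : pattern;
  var source = body
    .split("*")
    .map(function (part) {
      // Escape regex metacharacters in the literal parts of the pattern
      return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    })
    .join(".*");
  return new RegExp("^" + source + (anchored ? "$" : ""));
}

// Return true if any Disallow pattern matches the path (query string included).
function isDisallowed(pathAndQuery, disallowPatterns) {
  return disallowPatterns.some(function (pattern) {
    return patternToRegExp(pattern).test(pathAndQuery);
  });
}

// Patterns from the Googlebot section of the example robots.txt
var googlebotRules = ["/*?*", "/*folder_factories$"];

console.log(isDisallowed("/?set_language=fi", googlebotRules));
console.log(isDisallowed("/news/folder_factories", googlebotRules));
console.log(isDisallowed("/services/consulting", googlebotRules));
```

Under these assumptions the language switching link and the folder_factories action are reported as blocked, while a plain content URL without a query string is not.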
Sitemap protocol

Crawlers use the Sitemap protocol to help determine the content pages on your site (note: the sitemap seems to be used for hinting only and it is not authoritative). Since version 3.1 Plone can automatically generate sitemap.xml.gz. You still need to register sitemap.xml.gz in Google Webmaster Tools manually.

There exists a sitemap protocol extension for mobile sites.

Webmaster tools

Google Webmaster Tools enables you to monitor your site visibility in Google and do some search engine specific tasks like submitting sitemaps. I do not know what kind of similar functionality other search providers have. Please share your knowledge in the blog comments.

HTML <head> metadata

Search engines mostly ignore <meta> tags besides the title, so there is no point in trying to fine-tune them.

Example robots.txt

Here is our optimized robots.txt for www.twinapex.com:

    # Normal robots.txt body is purely substring match only
    # We exclude lots of general purpose forms which are available in various mount points of the site
    # and internal image bank which is hidden in the navigation tree in any case
    User-agent: *
    Disallow: set_language
    Disallow: login_form
    Disallow: sendto_form
    Disallow: /images

    # Googlebot allows regex in its syntax
    # Block all URLs including query strings (? pattern) - contentish objects expose query string only for actions or status reports which
    # might confuse search results.
    # This will also block ?set_language
    User-Agent: Googlebot
    Disallow: /*?*
    Disallow: /*folder_factories$

    # Allow Adsense bot on entire site
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*

Useful resources
XHTML mobile profile transformer and cleaner for Python

Posted on July 23, 2009 by Mikko Ohtamaa
Filed Under mobile, plone, python, technology

Mobile phones, and especially mobile site validators, are very picky about the validity of XHTML. It cannot be just any XHTML; it must be the special mobile profile XHTML. Also, search engines like Google will punish you in the mobile search results if your site fails to conform to the mobile profile.

This is especially troublesome if you display external content (RSS feeds, ATOM feeds) on your mobile site. Incoming HTML cannot be guaranteed to follow any specification.

To solve this problem, we have created the gomobile.xhtmlmp Python library, which helps you transform any HTML content to valid XHTML MP. The library is piloted on the plonecommunity.mobi site, which uses aggregated content from varying sources. The library is based on lxml.html.Cleaner.

The library is part of the GoMobile project, which aims to create world class Python mobile web development tools.

Highlights
As an example, we integrated gomobile.xhtmlmp into the Feedfeeder Plone add-on product. Enjoy.

Scripting Google analytics for multidomain site

Posted on July 15, 2009 by Mikko Ohtamaa
Filed Under plone, technology

We are running a few Plone sites which use the top level domain (TLD) to identify the site language. Like many other CMSs out there, Plone has only one box for entering the Google Analytics script snippet. It is often desirable to use a different tracker for different domain and language combinations, but Google itself doesn’t provide any fancy generator to create complex page tracking code.

The page tracker code, though it looks a little difficult when spit out by Google Analytics, is just normal JavaScript. You can choose the appropriate page tracker id in JavaScript itself using the document.location property, so you don’t need to mess with your page templates to create separate slots for tracking JavaScript snippets.

Here is an example of what you can toss into Plone site setup -> site -> JavaScript for web statistics support:

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
    // Choose page tracker id according to the top level domain of the host name
    var domains = document.location.hostname.split(".");
    var tld = domains[domains.length - 1];
    var pageTracker;
    if (tld == "fi") {
        // .fi
        pageTracker = _gat._getTracker("UA-8819100-1");
    } else {
        // .com
        pageTracker = _gat._getTracker("UA-8819100-4");
    }
    pageTracker._trackPageview();
} catch (err) {
    // Errors are deliberately ignored so that tracking never breaks the page
}
</script>
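If more domain and language combinations are needed, the if/else branch can be replaced with a lookup table. The following is a minimal sketch under that assumption: `TRACKER_IDS`, `DEFAULT_TRACKER_ID` and `chooseTrackerId` are hypothetical names, and the two tracker ids are simply the ones from the script above.

```javascript
// Hypothetical lookup table: maps a top level domain to a tracker id.
// Add one entry per additional domain.
var TRACKER_IDS = {
  fi: "UA-8819100-1"
};

// Used for .com and any top level domain not listed above
var DEFAULT_TRACKER_ID = "UA-8819100-4";

// Pick the tracker id based on the last label of the host name
function chooseTrackerId(hostname) {
  var parts = hostname.split(".");
  var tld = parts[parts.length - 1].toLowerCase();
  if (Object.prototype.hasOwnProperty.call(TRACKER_IDS, tld)) {
    return TRACKER_IDS[tld];
  }
  return DEFAULT_TRACKER_ID;
}

// In the page itself the result would be passed to the tracker, e.g.
//   var pageTracker = _gat._getTracker(chooseTrackerId(document.location.hostname));
console.log(chooseTrackerId("www.twinapex.fi"));
console.log(chooseTrackerId("www.twinapex.com"));
```

For the two-domain case the inline script above does the same job without the extra function.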
This is used on the www.twinapex.com and www.twinapex.fi sites. To see possible JavaScript errors in Firebug, add console.log(err) inside the catch {} block.

How to optimize your mobile site visibility

Posted on July 14, 2009 by Mikko Ohtamaa
Filed Under Business, mobile

SEOptimize has an interesting post containing lots of resources on mobile internet growth, mobile site search engine optimization and mobile web design. Keep this under your pillow, mobile folks!
