• About

    Twinapex Blog is the voice of mobile and Internet experts. We tell tales about our exciting life in the world where communication methods convergence and you can access whatever information you wish, wherever, on whichever device you want.

    If you find us interesting and talented and you are looking for developers, please contact us and we might just be able to help you.

    Creative Commons License
    This work is licensed under a Creative Commons Attribution 3.0 Unported License.

SEO tips: query strings, multiple languages, forms and other content management system issues



This post is collection of search engine optimization tips for content management systems, especially for Plone.

Do not index query strings

It is often desirable to make sure that query string pages (http://yoursite/page?query_string_action=something) do not end up into the search indexes. Otherwise search bots might index pages like site’s own search engine results  (yoursite/search?SearchableText=…) lowering the visibility of  actual content pages.

GoogleBot has regex support in robots.txt and can be configured to ignore any URL ? in it. See the example below.

Query string indexing causes the crawler crawl things like

  • Various search results (?SearchableText)
  • Keyword lists (?Subject)
  • Language switching code (?set_language)… making set_language appear as the document in the search results

Also, “almost” human readable query strings look ugly in the address bar…

Top level domains and languages

Using top level domain name (.fi for Finland, .uk for United Kingdoms, and so on.) to make distinction between different languages and areas is optimal solution from the SEO point of view. Search engines use TLD information to reorder the search results based on where  the search query is performed  (there is difference between google.com and google.fi results).

Plone doesn’t use any query strings for content pages. Making robots to ignore query strings is especially important if you are hosting multilingual site and you use top level domain name (TLD) to separate languages: if you don’t configure robots.txt to ignore ?set_language links only one of your top level domains (.com, .fi, .xxx) will get proper visibility in the search results. For example we had situation where our domain www.twinapex.fi did not get proper visibility because Google considered www.twinapex.com?set_language=fi as the primary content source (accessing Finnish content through English site and  language switching links).

Shared forms

Plone has some forms (send to, login) which can appear on any content page. These must be disallowed or otherwise you might have a search result where the link goes to the form page instead of the actual content page.

Hidden content and content excluded from the navigation

Any content excluded from the sitemap navigation  should be put under disallowed in robots.txt. E.g. if you check “exclude from navigation” for Plone folder remember to update robots.txt also.

In our case, our internal image bank must not end up being indexed, though images themselves are visible on the site. Otherwise you get funny search result: if you search by person’s name the photo will be the first hit instead of biography.

Sitemap protocol

Crawlers use Sitemap protocol to help determining the content pages on your site (note: sitemap seems to be used for hinting only and it is not authoritative).  Since version 3.1 Plone can automatically generate sitemap.xml.gz. You still need to register sitemap.xml.gz in Google webmaster tools manually.

There exists a sitemap protocol extension for mobile sites.

Webmaster tools

Google Webmaster tools enable you to monitor your site visibility in Google and do some search engine specific tasks like submitting sitemaps.

I do not know what kind of similar functionality other search provides have. Please share your knowledge in the blog comments regarding this.

HTML <head> metadata

Search engines mostly ignore <meta> tags besides title so there is no point of trying fine-tune them.

Example robots.txt

Here is our optimized robots.txt for www.twinapex.com:

# Normal robots.txt body is purely substring match only
# We exclude lots of general purpose forms which are available in various mount points of the site
# and internal image bank which is hidden in the navigation tree in any case
User-agent: *
Disallow: set_language
Disallow: login_form
Disallow: sendto_form
Disallow: /images

# Googlebot allows regex in its syntax
# Block all URLs including query strings (? pattern) - contentish objects expose query string only for actions or status reports which
# might confuse search results.
# This will also block ?set_language
User-Agent: Googlebot
Disallow: /*?*
Disallow: /*folder_factories$

# Allow Adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

Useful resources

XHTML mobile profile transformer and cleaner for Python



Mobile phones, and especially mobile site validators, are very picky about the validy of XHTML. It must not be any XHTML, but special mobile profile XHTML. Also, search engines like Google, will punish you in the mobile search results if your site fails to conform to mobile profile.

This is especially troublesome if you display external content (RSS feeds, ATOM feeds) on your mobile site. Incoming HTML cannot be guaranteed to follow any specification.

To solve this problem, we have created gomobile.xhtmlmp Python library which helps you to transform any HTML to content to valid XHTML MP. The library is piloted on plonecommunity.mobi site which  uses aggregated content from varying sources. The library is based on lxml.html.Cleaner. The library is part of GoMobile project which aims to create world class Python mobile web development tools.

Highlights

  • Turn any incoming HTML/XHTML to mobile profile compatible
  • Enforce ALT text on images – especially useful for external tracking images (feedburner tracker). ALT texts are required by XHTML MP.
  • Protect against Cross-Site Scripting Attacks (XSS) and other nastiness, as provided by lxml.xhtml.clean
  • Unicode compliant – eats funky characters

As an example we integrated gomobile.xhtmlmp  to Feedfeeder Plone add-on product.

Enjoy.

Plone goes mobile



FYI

Plone GoMobile project.

plonecommunity.mobi demo site.

More to come.

Zope Zeo vs. standalone setups



We do some Plone development here at Redi. As known, Plone is a powerful, but unfortunately quite a heavy CMS which is best suited for Intranets. Thus, we are always looking for speed increase.

Enter Zeo cluster – a feature that nowadays comes bundled with Zope and allows one database (practically Data.fs) to be used by multiple Zope instances, or more accurately Zeo clients. In standalone installation only one CPU / CPU core can be used for processing requests (as Zope / Python implementation is single-threaded AFAIK). So if there are any concurrent requests the database (ZODB, the Zope Object Database) usually has to wait for the request processing before it is asked for the data and only part of the processing power is used as requests are queued. Using Zeo server-client architecture however, each Zeo client can do the processing on their own CPU/core (thus efficiently using the whole CPU prosessing power available) and also minimize the hard disk idle time by asking for data in an ~asynchronous manner (in separate queues). Actually ZODB even serves the same object simultaneously to different client processes for performance reasons. This might raise database ConflictErrors, which are nothing to fear of, however, as noted some paragraphs below.

Similarly, you could also deploy Zeo clients on different computers in local network (or wherever you want), but that’s not the scope of this article. Having clients running on different machines is a similar case with the same performance basis, but there are connection lags, bandwith limits and such that decrease performance.

Theory vs. practice

Deploying a Zeo cluster instead of standalone Zope instance should theoretically increase the performance by factor of extra available CPUs / CPU cores. There might be some overheads from this setup though, so we tested it out using ApacheBenchmark – the benchmarking module that comes bundled with Apache nowadays. But first something about…

Setting up Zeo & converting from standalone mode

In the easiest scenario, setting Zeo up is rather easy: the unified installer supports Zeo-server setup out of the box (=there is a recipe for it). Just run the unified installer like:

$ ./install.sh zeo

Luckily, the unified installer uses buildout from Plone 3.1 onwards. Thus, converting your current buildout instances to Zeo cluster is nothing but change of buildout configuration. Where you would normally need ‘instance’ section in your buildout.cfg you will now need the following:

[zeoserver]
recipe = plone.recipe.zope2zeoserver
zope2-location = ${zope2:location}
zeo-address = 127.0.0.1:12000
#effective-user = __EFFECTIVE_USER__
[client1]
recipe = plone.recipe.zope2instance
zope2-location = ${zope2:location}
zeo-client = true
zeo-address = ${zeoserver:zeo-address}
# The line below sets only the initial password. It will not change an
# existing password.
user = admin:mysecretpassword
http-address = 12001
#effective-user = __EFFECTIVE_USER__
#debug-mode = on
#verbose-security = on

# If you want Zope to know about any additional eggs, list them here.
# This should include any development eggs you listed in develop-eggs above,
# e.g. eggs = ${buildout:eggs} ${plone:eggs} my.package
eggs =
    ${buildout:eggs}
    ${plone:eggs}

# If you want to register ZCML slugs for any packages, list them here.
# e.g. zcml = my.package my.other.package
zcml =

products =
    ${buildout:directory}/products
    ${productdistros:location}
    ${plone:products}

To add more clients (which is quite the point here), append as many times the extra client sections like this:

[client2]
recipe = plone.recipe.zope2instance
zope2-location = ${zope2:location}
zeo-client = true
zeo-address = ${zeoserver:zeo-address}
user = ${client1:user}
http-address = 12002
#effective-user = __EFFECTIVE_USER__
#debug-mode = on
#verbose-security = on
eggs = ${client1:eggs}
zcml = ${client1:zcml}
products = ${client1:products}

That minimizes the need for retyping user names, password etc. These examples were taken from Plone unified installer buildout.cfg with ports changed.

Starting, stopping & restarting

Now, to start your Zeo-powered Plon clients you could type:

bin/zeoserver start
bin/client1 start
bin/client2 start
...same for all the clients...

However, the unified installer has a recipe which automatically generates nice and simple shell scripts to control your cluster. In the end of your buildout.cfg, add:

[unifiedinstaller]
recipe = plone.recipe.unifiedinstaller
user = ${client1:user}
primary-port = ${client1:http-address}

That should generate the scripts. In fact, it propably does also something else, something which I’m not aware of. However, I didn’t bump into any problems, yet :) Anyway, to start the whole cluster (server & clients), type:

bin/startcluster.sh

And that does it (it start server and the clients). Shut it down via:

bin/shutdowncluster.sh

And restart:

bin/restartcluster.sh

ConflictErrors – not that errerous

As noted before, in Zeo mode the ZODB might serve the same objects to two more clients at the same time. If one client manipulates the object before others (ie. edits values and saves changes) the other requests will propably fail. This raises ConflicError which looks like this:

ConflictError: database conflict error (oid 0x0f39, class HelpSys.HelpSys.ProductHelp)

In this case ZODB tries to reprocess the failed requests. This should be common database approach and thus a feature, not a bug (although Zope might want to tell that in error message!). For more accurate explanation see Plone discussion.

Parsing it together with web server

The Zeo components (server and clients) talk to each other via standard Internet protocols (TCP or UDP, not sure). In the default setup, the Zeo server listens to port 8100 and Zeo clients to 8080, 8081, etc. Thus, to access the separate clients as ‘one site’ we need to serve the requests to multiple clients. This can be achieved with load balancers. Apache has at least one: mod_proxy_balancer which should do exactly what we need. Apache isn’t the best choice for achieving high requests per second values, but it will do for our tests (compare to more lightweight but also more limited lighttpd). Just remember that there are other alternatives/methods available, like using squid as load balancer.

Our configuration is as follows (inside VirtualHost-directive):

  <Proxy balancer://lb>
    BalancerMember http://127.0.0.1:12001/
    BalancerMember http://127.0.0.1:12002/
    BalancerMember http://127.0.0.1:12003/
    BalancerMember http://127.0.0.1:12004/
  </Proxy>

  <Location /balancer-manager>
    SetHandler balancer-manager
    Order Deny,Allow
    Allow from all
  </Location>

  ProxyPass /balancer-manager !
  ProxyPass             / balancer://lb/http://localhost/VirtualHostBase/http/www.mydomain.com:80/plonesite/VirtualHostRoot/
  ProxyPassReverse      / balancer://lb/http://localhost/VirtualHostBase/http/www.mydomain.com:80/plonesite/VirtualHostRoot/

This setup also allows us to use the balancer-manager (accessible at /balancer-manager) that comes with mod_proxy_balancer. It’s useful for checking if the configuration is working and balancer is dividing the requests equally. In my setup the balancer is using the default Request Counting -algorithm which divides the requests numerically equally between the instances, but you might want to also try Weighted Traffic Counting, which should be for actual use. In our test only the frontpage is accessed however, so each request’s data transfer is equal and the weighted traffic counting isn’t of use.

The test

The server machine

  • Ubuntu 8.04 virtual server
  • Intel Xeon 2.0Ghz (4 cores)
  • 2 GB of RAM
  • Hard disk drive (7200rpm?)

The setup

  • Standalone Plone instance
  • Plone via Zeo server with 4 clients (as many clients as cores in processor)
  • Plone via Zeo server with 6 clients (for curiosity)

The tests where run locally in development environment to minimize the network lag (was 0-1ms).

The test commands

ApacheBenchmark commands:

$ ab -n N -c C myurl

where N was either 1000 or 9000 (requests) and C 1, 10, 100 or 1000 (concurrent requests).

The results

You can download the more in-depth test sheet Plone Standalone vs. Zeo installation (PDF).

To put it simple: theory and practise meet well – Zeo server is a lot more powerful with concurrent requests. On non-concurrent requests the results are about the same.

Having as many Zeo clients as CPUs / CPU cores can boost the performance up to number of extra CPUs/cores. For example, in our quad-core server with Zeo setup we gained nearly 4 times the requests per second of standalone installation (~370% to be accurate). Increasing Zeo clients to 6 didn’t help any as there’s no processing power left from 4 heavily stressed client processes. Also to be noted is that the waiting times for clients nearly tripled (median jumped from 126 to 305 ms) when raising concurrency from 1 to 10. This isn’t bad though – those are still low figures compared to standalone’s median of 1215 ms! Only when raising concurrency to 100 we began to see some 3,6 seconds waiting times (6 seconds for standalone). Increasing concurrency didn’t bring down the requests/second rates much (less than 5%) as expected.

Overall, the results were expected, but now we have evidence of it: under concurrent request load Zeo server is a good option to multiply the performance of your site. With very low traffic sites which rarely get more than 1 request at time this doesn’t matter.

One bad word about the resource requirements though: The used RAM increase for 6 client Zeo setup (standard Plone 3.1.2 + 12 additional Products) was whopping 621 MB (1132 MB -> 1753 MB). That means about 100 MB per Zeo client as the Zeo server memory intake was only about 12-15 MB. Thus, only use as many Zeo clients as absolutely necessary or you might find your beloved server machine under very serious Zope flu!