Google Webmasters Help FAQ

Weblog for the Google Webmasters Forum

Error handling and robots.txt (Apache)

Posted by sebastian on March 9th, 2007

Operating a Web site with an incomplete setup can result in search engine invisibility, because misconfigurations can prevent search engines from crawling it. Hosting services usually don’t install error handling or a robots.txt file for new accounts, so the Webmaster has to implement both. Here is a bullet-proof setup for Apache Web servers.

If you don’t have a robots.txt file yet, create one as a plain text file and upload it to your Web root folder today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file-types or areas of your site, refer to the robots.txt manual.
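To double-check that a robots.txt file really leaves the whole site open, you can feed its text to Python's standard robots.txt parser. This is only a sketch: the helper name is_spiderable and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from this post.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def is_spiderable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URL, used only to exercise the rules.
print(is_spiderable(ROBOTS_TXT, "http://www.example.com/any/page.html"))
```

An empty Disallow line excludes nothing, whereas "Disallow: /" would shut crawlers out of the entire site.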

Now check the .htaccess file in your server’s Web root directory. If your FTP client doesn’t show it in its file list due to the dot which marks hidden files under UNIX, add “-a” to “external mask” in the FTP-settings and reconnect.

If you find complete/absolute URLs in lines starting with “ErrorDocument”, your error handling is screwed up. What happens is that your server does a 302 redirect to the given URL, which probably responds with “200-Ok”, and the actual HTTP error code gets lost in cyberspace. Be aware that most header checkers don’t show the 200 response code. SE crawlers getting a redirect (302/301) response will capture the redirect location and request this URL later on. You can’t redirect a spider, you can only feed it with URLs.

If OTOH you want to redirect for example not-found errors to an external banner farm or popup hell, you must use a complete URL in the 404 error directive. You can’t verify your site with Google then, but who adds nasty sites to the Webmaster console?

Warning: sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM-XT shipped March/8/1983 hosting Google.com! All other error directives pointing to absolute URLs cause trouble of all sorts.
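If you'd rather not eyeball the file, the check for complete URLs in “ErrorDocument” lines can be sketched in a few lines of Python. The function name and the sample text are assumptions for illustration, and the pattern only covers the simple "ErrorDocument code target" form, not quoted message strings.

```python
import re

# Matches "ErrorDocument <code> <target>" lines, ignoring leading whitespace.
ERROR_DOCUMENT = re.compile(r"^\s*ErrorDocument\s+(\d{3})\s+(\S+)", re.IGNORECASE)


def absolute_error_documents(htaccess_text: str) -> list:
    """Return (status code, target) pairs whose target is a complete URL."""
    hits = []
    for line in htaccess_text.splitlines():
        match = ERROR_DOCUMENT.match(line)
        # A target starting with http:// or https:// makes Apache redirect
        # instead of serving the error page with the original status code.
        if match and re.match(r"https?://", match.group(2), re.IGNORECASE):
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Hypothetical .htaccess content: one bad directive, one good one.
sample = (
    "ErrorDocument 404 http://www.example.com/404.html\n"
    "ErrorDocument 410 /410-gone-forever.html\n"
)
print(absolute_error_documents(sample))
```

Anything this reports should be changed to a path starting with a slash, unless you really want the redirect.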

Here is a well formed .htaccess example:

ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
Options -Indexes
<Files ".ht*">
deny from all
</Files>
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]

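With “ErrorDocument” directives you can capture other clumsiness as well, for example 500 errors with /server-too-busy.html or so. You can make the error handling comfortable using /error.php?errno=[insert err#]. In any case avoid relative URLs (src attribute in IMG elements, CSS/feed links, href attributes of A elements …) on all landing pages (pages set in .htaccess). The error page is served under the URL that was originally requested, so relative URLs resolve against that path and break for requests below the Web root; use root-relative (/…) or complete URLs inside the page instead.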
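Why relative URLs are a problem on error pages is easiest to see by resolving a link the way a browser does. A small sketch, with a made-up requested URL and stylesheet name:

```python
from urllib.parse import urljoin

# Hypothetical URL that triggers a 404. The server answers with the content
# of /404-not-found.html, but the browser still sees this address.
requested = "http://www.example.com/deep/missing/page.html"


def resolve(link: str, base: str = requested) -> str:
    """Resolve a link relative to the URL the browser actually requested."""
    return urljoin(base, link)


print(resolve("style.css"))   # relative: looked up inside /deep/missing/
print(resolve("/style.css"))  # root-relative: looked up at the Web root
```

The relative variant points into a directory that doesn't exist, so the stylesheet request fails as well.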

Provide a search facility, a list of top-level categories and other navigational assistance on each error page; your users will appreciate it.

You can test actual HTTP response codes with online header checkers.
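If you prefer to check from your own machine, a single request that does not follow redirects shows what the server really answers. This is a rough sketch using Python's standard http.client; the function name and the User-Agent string are arbitrary choices, and it speaks plain HTTP only.

```python
import http.client


def fetch_status(host: str, path: str = "/", port: int = 80) -> tuple:
    """Return (status, Location header) for one GET; redirects are not followed."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path, headers={"User-Agent": "status-check-sketch"})
        response = conn.getresponse()
        # Location is None unless the server answered with a redirect.
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

For example, fetch_status("www.canonical-server-name.com", "/no-such-page.html") should report 404 with no Location on a properly configured server. A 302 with a Location header suggests an “ErrorDocument” pointing to a complete URL.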

The other statements in the .htaccess example do different things. Options -Indexes disallows directory browsing, the next block makes sure that nobody can read your server directives, and the last three lines redirect invalid server names to your canonical server address. This .htaccess file is safe to use on hosting accounts where multiple domains point to one content directory. Regardless of which server name is used in the URL to request a page, it redirects users as well as crawlers to the canonical URL.
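The decision made by the last three lines can be modelled outside of Apache to see which requests get redirected. This is a simplified sketch, not how mod_rewrite is implemented: the function name is invented, and the pattern escapes the dots so they only match literal dots.

```python
import re
from typing import Optional

CANONICAL_HOST = "www.canonical-server-name.com"
# Same test as the RewriteCond, case-insensitive like the [NC] flag.
CANONICAL_PATTERN = re.compile(r"^www\.canonical-server-name\.com", re.IGNORECASE)


def canonical_redirect(host: str, path: str) -> Optional[str]:
    """Return the 301 target for a request, or None if the host is canonical."""
    if CANONICAL_PATTERN.search(host):
        return None  # condition not met, the request is served as-is
    # Any other server name is sent to the same path on the canonical host.
    return "http://" + CANONICAL_HOST + "/" + path.lstrip("/")


print(canonical_redirect("canonical-server-name.com", "/page.html"))
print(canonical_redirect("www.canonical-server-name.com", "/page.html"))
```

Every server name other than the canonical one ends up at the same path on the canonical host, which is what keeps several domains on one content directory from producing duplicate URLs.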

.htaccess is a plain ASCII file. It can get screwed up when you upload it in binary mode or when you change it with a word processor. Best edit it with an ASCII/ANSI editor (vi, Notepad) as htaccess.txt on your local machine (most FTP clients choose ASCII mode for text files) and rename it to “.htaccess” on the server. Keep in mind that file names are case sensitive.
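A quick way to spot a damaged file is to look at its raw bytes. The sketch below flags a few signs that may point to a binary-mode upload or a word processor. The function name and the particular checks are my own picks, not an exhaustive list, and not every server chokes on each of them.

```python
def htaccess_problems(raw: bytes) -> list:
    """Return warnings for bytes that suggest a damaged .htaccess file."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("starts with a UTF-8 byte order mark")
    if b"\r" in raw:
        problems.append("contains carriage returns (CR or CRLF line endings)")
    if any(byte > 127 for byte in raw):
        problems.append("contains non-ASCII bytes")
    return problems


# Hypothetical content with the curly quotes a word processor likes to insert.
damaged = "ErrorDocument 404 \u201c/404-not-found.html\u201d\n".encode("utf-8")
print(htaccess_problems(damaged))
```

To check a real file, read it in binary mode first, for example with open("htaccess.txt", "rb").read(), and pass the result to the function.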

Reprinted with permission
