Spider Proofing Your Drupal Site

Spiders can be the bitch-goddesses of the Internet: essential for getting your pages indexed; lethal to a precarious web server. We launched a Drupal site a month ago. Since then it has been a running battle: the site limps along until a web spider hits and slows everything to a crawl. The problem is that Drupal requires a lot of processing to output a page. Whether the visitor is a spider or the King of Siam, the same level of processing is required. We don't want to turn away spiders, or give them a bad view of the site, but we can't turn off all of this PHP and database processing. Once you invoke Drupal's core file "bootstrap.inc", it's all done: a cascade of other files is included. Aliases, for example, require three series of database calls and functions to derive the intended destination. How do you avoid all of this processing? You intercept the traffic before it ever reaches Drupal.

You need to do two things: identify the traffic, and intercept it.

Look at your hit logs: you don't need or want to intercept all of the hits, just take the edge off of things. In our case, four pages accounted for 10% of our traffic. We could recreate all of the pages-- but that would mean 10GB of pages-- a lot. So we hit the heavy-hitter pages: we recreated those four in a /temp_pages directory. A 'wget' statement fetches each page as it looks to anonymous users. Anonymous users are key-- spiders and anonymous users look almost exactly the same to the server.

Here is an example of the statement I used:

cd /home/our_dir/temp_pages && wget http://www.comminit.com/es/sections/mhp/37%2C1810/0/18 -q --output-document=%2Fes%2Fsections%2Fmhp%2F37%2C1810%2F0%2F18

- cd ... -- changes to the directory where I want to save the pages
- wget -- fetches the file
- -q -- always good to keep wget calls quiet
- --output-document -- lets you name the saved document

You want all URLs to end up in this one directory-- for example, while the URI reads "/new/events/", you want that page to sit in the same flat directory-- so you need to URL-encode the file names as they go in, e.g. replace "/" with %2F so the whole path becomes a single valid file name. The trick is to get the file name to match the REQUEST_URI after urlencoding.
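To make that concrete, here is a quick sketch of what PHP's urlencode() does to a path (the "/new/events/" URI is just an illustration):

<?php
// illustrative only: the slashes are encoded, so the whole URI
// becomes one flat, filesystem-safe file name
echo urlencode('/new/events/'); // prints: %2Fnew%2Fevents%2F
?>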

Our pages update with some frequency, and I wanted our temp pages to reflect that, so I plugged the cd && wget statement into my crontab. On a recurring basis, these temp pages are blown away and recreated.
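Here is what that might look like as a crontab entry, rebuilding the page at the top of every hour (the schedule is just an illustration). One gotcha: cron treats an unescaped % in a command as a newline, so each one has to be written as \%:

# refresh the cached copy hourly; escape every % as \% inside a crontab
0 * * * * cd /home/our_dir/temp_pages && wget http://www.comminit.com/es/sections/mhp/37\%2C1810/0/18 -q --output-document=\%2Fes\%2Fsections\%2Fmhp\%2F37\%2C1810\%2F0\%2F18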

To intercept the traffic, you need to change index.php-- put code in the way of the requests before they reach the bootstrap.inc call.

So, I inserted this code into my index.php around line 12:


if (!isset($_COOKIE[session_name()])) {
  // no session cookie: this is a spider or an anonymous user
  // map the request URI to a cached file: "/new/events/" -> "temp_pages/%2Fnew%2Fevents%2F"
  $file = 'temp_pages/' . urlencode($_SERVER['REQUEST_URI']);
  if (file_exists($file)) {
    // serve the static copy and skip the Drupal bootstrap entirely
    include($file);
    echo "\n";
    exit();
  }
}

This chunk of code works like this:
- is there a cookie? spiders and anonymous users come in without cookies
- get the URI
- urlencode() it
- if the file exists, then include it and output it
- exit() (get the heck out of there)

This chunk of code is itself a processing hit, but it's a small one, and it only impacts anonymous users.
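If you want to check the shortcut from a spider's point of view, fetch a cached page without cookies and compare it to the file on disk. The URL and path here are illustrative, and the only difference you should see is the extra trailing newline the code above appends:

wget -q -O /tmp/spider_view http://www.comminit.com/new/events/
diff /tmp/spider_view /home/our_dir/temp_pages/%2Fnew%2Fevents%2F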
