Monday, December 17, 2007

no XO, a non-update

I have been following the OLPC News site. Bricks, concerns and lots of people waiting for their laptops.

I am in the last category: I am waiting for my laptop to show up. When I phoned, the window was to be December 14th-24th for people who bought laptops on the first day. So, we are in the window... waiting...

Sunday, December 16, 2007

Neozombies.com

View Auction

After holding onto the neozombies.com domain for a number of years, I've decided to flog it instead. I had the idea of doing a NeoPets send-up, but a Google search showed that "neozombies" is also a term used to describe NeoCon zombies. With 2008 being a U.S. election year, I figure that someone out there may want the domain.

Saturday, December 15, 2007

Optional Module Loading in Drupal

At the core of Drupal you will find its module loading capabilities. The functions module_invoke() and module_invoke_all() fire the hooks of the modules you want to use: core modules, optional modules and modules that you need in support of others. Drupal will load all of the code-- all of the modules-- because the application doesn't know which of them it will use until much later. It's like buying a dozen eggs because you don't know if you will be cooking breakfast or going to the movies; and getting a movie ticket-- and maybe a lion tamer's hat while you're at it, just in case. Some modules never come into play but they still get loaded. There is a logic to this: some modules affect user permissions; some change the display of output-- but a lot of modules only occasionally come into play. Each request to a Drupal site is a complete instantiation-- an inclusion of all of the modules and their code. A narrow usage of Drupal (like an Ajax load) and a broad usage (like an editing page, or loading the main page) both load almost the same amount of code.
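To see why every active module has to be in memory, here is a simplified sketch of how a hook fans out across modules-- loosely modelled on Drupal's module_invoke_all(), not a verbatim copy of the core code:

function module_invoke_all($hook) {
  // Simplified: the real implementation uses module_implements() and caching.
  $args = func_get_args();
  array_shift($args);                    // drop the hook name, keep the hook arguments
  $return = array();
  foreach (module_list() as $module) {
    $function = $module .'_'. $hook;     // e.g. upload_nodeapi(), civicrm_perm()
    if (function_exists($function)) {    // only answers correctly if the module is loaded
      $result = call_user_func_array($function, $args);
      if (is_array($result)) {
        $return = array_merge($return, $result);
      }
    }
  }
  return $return;
}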
What if you know your site, you're short on server capacity and long on knowledge of how your site runs? In my case, I turned to optionally loading some modules-- intercepting their load when module_list() is called-- then qualifying that some modules, though active, should not be loaded in response to some requests. The arg() function is good at judging what is needed, but it uses a lot of processing. The $_SERVER variable is already available, so I chose to use it to suss out what the incoming arguments are. The idea is to err on the side of caution: when in doubt, load all of the modules.

In includes/module.inc, below the else of the if ($bootstrap) {} code, I modified the code to look like this:

$do_not_load = array();

// Requests that never touch user, admin, CiviCRM or subscription pages
// can skip loading the processor-heavy civicrm module.
if ((!(preg_match('/user/', $_SERVER['REQUEST_URI']))) &&
    (!(preg_match('/admin/', $_SERVER['REQUEST_URI']))) &&
    (!(preg_match('/civicrm/', $_SERVER['REQUEST_URI']))) &&
    (!(preg_match('/subscribe/', $_SERVER['REQUEST_URI'])))) {
  // suppress civicrm
  $do_not_load[] = 'civicrm';
}

if ((preg_match('/mainpage/', $_SERVER['REQUEST_URI'])) ||
    (preg_match('/sections/', $_SERVER['REQUEST_URI']))) {
  // suppress modules the main page and section pages never use
  $do_not_load[] = 'archive';
  $do_not_load[] = 'authorship';
  $do_not_load[] = 'form_restore';
  $do_not_load[] = 'helloworld';
  $do_not_load[] = 'maximizer';
  $do_not_load[] = 'moviereview';
  $do_not_load[] = 'upload';
}

// Turn the exclusion list into an extra WHERE clause for the module query.
$dont_load = "";
if (count($do_not_load) > 0) {
  $dont_load = " AND (name != '". implode("' AND name != '", $do_not_load) ."')";
}
$result = db_query("SELECT name, filename, throttle FROM {system} WHERE
  type = 'module' AND status = 1 ". $dont_load ." ORDER BY weight ASC, filename ASC");


This code assesses $_SERVER['REQUEST_URI'] to see if the page to be loaded needs all of the modules. In our case, a lot of pages didn't need the processor-heavy civicrm module, so there are a lot of requests where civicrm doesn't come into play. There are a number of other modules that never get used on some of our frequently used pages, so they do not need to be loaded. This array of unused modules is then passed on to the SELECT statement-- it qualifies a shorter list of modules to load. When the modules are needed (according to the URL), there is no impediment to them being loaded.
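For illustration, a request whose URI contains "sections" but none of the user/admin/civicrm/subscribe strings would end up running a query along these lines (reconstructed from the code above, not captured from a live query log):

SELECT name, filename, throttle FROM {system}
WHERE type = 'module' AND status = 1
  AND (name != 'civicrm' AND name != 'archive' AND name != 'authorship'
       AND name != 'form_restore' AND name != 'helloworld' AND name != 'maximizer'
       AND name != 'moviereview' AND name != 'upload')
ORDER BY weight ASC, filename ASC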

If you consider doing this: tread carefully. Judge the impact of excluding modules from some page loads. Each module can implement many hooks, so be careful to look at all of its hooks and make certain they are really unnecessary. Each site is different: your site may need module "xyz" on a page where another site does not. If a page can do without some modules, then maybe they are candidates for optional loading.
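One rough way to audit a module before adding it to the exclusion list is to check which hooks it actually implements. A minimal sketch, assuming a hand-picked list of hooks to check (module_hook() just tests whether the function exists, so run this on a request where all modules are still being loaded normally):

// Hand-picked hooks to audit; this list is illustrative, not exhaustive.
$hooks_to_check = array('init', 'menu', 'perm', 'nodeapi', 'form_alter', 'user');
$module = 'civicrm';   // the module you are thinking of suppressing
foreach ($hooks_to_check as $hook) {
  if (module_hook($module, $hook)) {
    print $module .' implements hook_'. $hook ."\n";
  }
}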

Friday, December 07, 2007

Spider Proofing Your Drupal Site

Spiders can be the bitch-goddesses of the Internet: essential to document your traffic; lethal to a precarious web server. We launched a Drupal site a month ago. Since then it has been a running battle: the site limps along until a web spider hits and slows it to a crawl. The problem is that Drupal requires a lot of processing to output a page. Whether the hit comes from a spider or the King of Siam, the same level of processing is required. We don't want to turn away spiders, or give them a bad view of the site. And we can't turn off all of this PHP and database processing: once you invoke Drupal's core file "bootstrap.inc" it's all done-- a cascade of other files is included. Aliases, for example, require three rounds of database calls and functions to derive the intended destination. How do you avoid all of this processing? You intercept the traffic.

You need to do two things: identify the traffic, and intercept it.

Look at your hit logs: you don't need or want to intercept all of the hits, just take the edge off of things. In our example, four pages accounted for 10% of our traffic. We could recreate all of the pages, but that would mean 10GB of pages-- a lot. So we hit the heavy-hitter pages: we recreated those four pages in a /temp_pages directory. A 'wget' statement fetches each page as it looks to anonymous users. Anonymous users are key-- spiders and anonymous users look almost exactly the same.

Here is an example of the statement I used:

cd /home/our_dir/temp_pages && wget http://www.comminit.com/es/sections/mhp/37%2C1810/0/18 -q --output-document=%2Fes%2Fsections%2Fmhp%2F37%2C1810%2F0%2F18

- cd ... -- this changes to the directory where I want to save my traffic
- wget -- gets the file
- -q -- always good to keep wget calls quiet
- --output-document -- lets you name the saved document

You want all URLs to end up in this one directory. For example, while the URI reads "/new/events/", you want that page to sit as a file in the same flat directory, so you need to URL-encode the file names as they go in-- e.g. replace "/" with %2F so the whole path becomes a single safe file name. The trick is to get the file name to match the REQUEST_URI after urlencoding, because that is what the interception code will look for.
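If you want to be sure the saved name lines up with what the interception code will compute, you can let PHP build the wget command for you. A small sketch-- the host and path here are placeholders, not the real site:

// Build a wget command whose output file name will match what index.php
// computes with urlencode($_SERVER['REQUEST_URI']).
$host = 'http://www.example.com';   // placeholder host
$uri  = '/new/events/';             // placeholder path pulled from the hit logs
$cmd  = 'cd /home/our_dir/temp_pages && wget -q '. $host . $uri
      .' --output-document='. escapeshellarg(urlencode($uri));
print $cmd ."\n";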

Our pages update with some frequency and I wanted our temp pages to reflect that. I plugged the cd && wget statement into my crontabs. On a recurring basis, these temp pages are blown out and recreated.
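In crontab form, the entry looks something like this-- the schedule and URL are illustrative, not the exact ones we run. Note that cron treats "%" as a newline, so the percent signs in the encoded file name have to be escaped:

# rebuild the cached copy of this page every six hours
0 */6 * * * cd /home/our_dir/temp_pages && wget -q http://www.example.com/new/events/ --output-document=\%2Fnew\%2Fevents\%2F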

To intercept traffic, you need to change index.php-- put this code in the way of the traffic before it gets to the bootstrap.inc call.

So, I inserted this code into my index.php around line 12:


if (!isset($_COOKIE[session_name()])) {
  // no session cookie: spiders and anonymous users land here
  $file = 'temp_pages/'. urlencode($_SERVER['REQUEST_URI']);
  if (file_exists($file)) {
    // serve the pre-built copy and skip the Drupal bootstrap entirely
    include($file);
    echo "\n";
    exit();
  }
}

This chunk of code works like this:
- is there a cookie? spiders and anonymous users come in without cookies
- get the URI
- urlencode() it
- if the file exists, then include it and output it
- exit(); (get the heck out of there)

This chunk of code is a processing hit of its own, but it's a small one and it only impacts anonymous users.