Thursday, December 07, 2006

Drupal: Directly Importing Nodes

I was tasked with developing an import tool to pull over our 40,000+ items from our old database. There is a lot of variety in how our old data is stored, so our nodes had to match the old models and our import tool had to account for this. Also, we have a lot of information that would benefit from being cross-referenced (e.g. 35,000 contacts shared and used as node references by the 40,000 pieces).

So, I tried using the ImportExport API and ran into a wall with it. We needed to be able to turn one piece of source data into as many nodes as necessary (e.g. 1 piece and its 4 supporting nodes). As the export builds its data, it makes node ids, makes nodes for them, and then puts them in as references in the "main" node. My working title for this module is "direct_node_import" and I have written it for 4.7.x with an eye to importing CCK-created content types.

- Expect that command-line executions will be common. When dealing with large record sets, processing could take hours.
- Expect that the import destination will be remote. In our case, the old server and the new server are 7,000 miles apart. The old server is unlikely to handle Drupal well, so the export work is done there and the import work is done on the new server.
- Use the import site as the source for import directives and taxonomy. To this end, I built in a way to export individual taxonomy references. The exporting site will store and use these so that it makes the fewest trips possible.
- While XML is great, I didn't go with it. I used serialization. The import tool publishes schemas that are Drupal node and taxonomy objects that have been serialized and output. The exporting site will read these, unserialize them and use them in the program flow. I have left out authentication, so schemas are publicly available when requested, but I think they are not valuable information in their own right.
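Since the published schemas are just serialized PHP values output as text, any consumer only has to speak PHP's serialize() wire format, which is a simple length-prefixed notation. Here is a minimal sketch in Python (relevant to the non-PHP item on my to-do list) covering only the subset I care about; php_serialize and the example field names are illustrative, not part of the module:

```python
def php_serialize(value):
    """Emit a value in PHP's serialize() wire format.

    Only the subset sketched here (str, int, dict) is handled; the real
    format also covers floats, booleans, NULL, and objects.
    """
    if isinstance(value, bool):
        raise TypeError("booleans not handled in this sketch")
    if isinstance(value, int):
        return "i:%d;" % value
    if isinstance(value, str):
        raw = value.encode("utf-8")          # PHP counts bytes, not characters
        return 's:%d:"%s";' % (len(raw), value)
    if isinstance(value, dict):              # PHP arrays are ordered maps
        body = "".join(php_serialize(k) + php_serialize(v)
                       for k, v in value.items())
        return "a:%d:{%s}" % (len(value), body)
    raise TypeError("unsupported type: %r" % type(value))

# A stripped-down, made-up node "schema" like the ones the import site
# publishes (not the real Drupal node object):
schema = {"type": "contact", "title": "base_example_contact_regular", "nid": 0}
print(php_serialize(schema))
# a:3:{s:4:"type";s:7:"contact";s:5:"title";s:28:"base_example_contact_regular";s:3:"nid";i:0;}
```

That last line is the sort of plain-text blob the exporting script reads in and unserializes.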

Here is where I am at:

The export/import process:
  1. Serializes
    • An example node and all of the nodes that are also used as node references are built up as an example model of how you want your content to look. This means that you have to have example nodes and supporting nodes in place and title them "base_example_" (followed by something that denotes what you intend to name them). The base examples do not need to be published, just present in the db. This means that if you had a "contact" node reference that would likely pick up two references from the export, you could put out two supporting nodes-- one named "base_example_contact_regular" and one named "base_example_contact_emergency"-- and both would be available to be populated by the exporting script.
    • A "seed" node-- the starting position. This is good and dangerous. While the remote site does its work, it will build node ids. These will be used on the return trip. If someone has created nodes in the meantime, this will likely overwrite those nodes. Because the node references need to be internally consistent, I went this way. A workaround is to do smaller export jobs so that others can contribute in and around the import/export process.

  2. Output as plain text - The schema is published as plain text output suitable to be read in by the exporting script.
  3. The exporter reads the serialized data and unserializes it

    • it reads all of the example nodes in
    • it reads the seed number
    • it reads prep vs. publish arguments and information on where to get the source data. This information is more relevant to the exporting script. It can also be tasked to get only so many records. Really, the logic and flow of the export is up to you. As long as it can read the serialized schema and produce a serialized export, the rest is up to you.

  4. The exporter steps through the records it finds in the source data

    • it builds nodes as required
    • it will build any other direct SQL statements that can be passed back
    • it serializes those

      • writes them to an output stream
      • writes them to a text file

  5. The importer can either:

    • read the external source
    • prep an export by calling a URL
    • this can also be tasked to fire the import when the next cron.php run occurs
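As a rough illustration of step 4, the exporter's main loop might look something like the following. This is a hypothetical Python outline, not the module's actual code; export_records, field_contact, and the base-example keys are names I invented for the sketch:

```python
def export_records(records, base_examples, seed_nid):
    """Turn source records into node dicts, numbering from the seed.

    Supporting nodes are built first so the main node can carry their
    ids in its node-reference field, keeping the batch internally
    consistent for the return trip.
    """
    nodes = []
    next_nid = seed_nid
    for rec in records:
        # clone the base example for the supporting node and fill it in
        contact = dict(base_examples["base_example_contact_regular"])
        contact.update(nid=next_nid, title=rec["contact_name"])
        next_nid += 1
        # the "main" node references the supporting node by nid
        main = dict(base_examples["base_example_item"])
        main.update(nid=next_nid, title=rec["title"],
                    field_contact=contact["nid"])
        next_nid += 1
        nodes.extend([contact, main])
    return nodes
```

In the real module, each of those node structures would then be serialized and written to the output stream and a text file for the importer to pick up.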

Here is my to-do list:
  • Do some more testing to make sure the code is as solid as possible
  • Try to introduce authentication in some regard for the published schemas
  • Try to build a perpetual exporter-- one that could handshake with the export source, make a small request for a cluster of nodes, save them, then request more. Repeat until it is done.
  • Try to build a "relative" node number passed back-- in other words, pass back +23 instead of 222 so that the importer can take the next node id and add to it instead of forcing a node id number.
  • Get my "SQL Instruction" part of the functionality working and limit it so that it can do INSERTs and maybe UPDATEs but little else, so that it doesn't become a source of exploits
  • Build a logging system so that you can review which nodes were inserted when and by what import task.
  • PHP serialize/unserialize works fine. See if I can get this to work with non-PHP sites by investigating how other languages can unserialize and serialize this data.
  • Maybe build an XML option into this-- to publish schemas as XML and convert inbound XML into new nodes.
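The "relative" node number item could stay very small. A hedged sketch, again in Python-- resolve_nid and the token convention are assumptions, not existing code: the importer reads its own next free node id at import time and offsets from there, so nodes created on the import site in the meantime are never clobbered.

```python
def resolve_nid(token, next_free_nid):
    """Resolve a node id token from an export.

    "+23" means "the import site's next free id plus 23"; a bare
    number forces that exact id (the current, riskier behavior).
    """
    s = str(token)
    if s.startswith("+"):
        return next_free_nid + int(s[1:])
    return int(s)

print(resolve_nid("+23", 1000))   # 1023
print(resolve_nid(222, 1000))     # 222
```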

Comments? Ideas? Dread warnings?

tags: Drupal, import, export, serialization, futile tasks