Monday, May 17, 2010

Mongo Want Candy!

I have long been intrigued by the concept of NoSQL. After living and breathing relational databases for the last 12 years, I could welcome an alternative. Then Mongo rides into town. MongoDB is a relatively new database system from the NoSQL movement. The NoSQL concept reemerged when Eric Evans and Johan Oskarsson wanted to organize an event to discuss open source distributed databases.

NoSQL architectures provide weak consistency guarantees, such as eventual consistency. That's the "E" in the BASE concept, as opposed to the ACID concept that has long seemed to be the way to go.
NoSQL systems make it easier to deploy a distributed architecture, with the data held redundantly on several servers. That allows the data services to be scaled out easily by adding more servers.

Here's how I got Mongo to come to town:
I went to the Mongo site.
I went to the repository and downloaded the MongoDB engine that I needed. The suggestion is that you put MongoDB in the root of the drive and create C:\data\db to hold the data.
I'm using XAMPP on my Windows machine. I checked my PHP version (5.2.8, so 5.2.x was the flavour of the Mongo driver I wanted).
I got the drivers I needed from the PHP drivers page, choosing the 5.2.x thread-safe driver. I downloaded it and installed it so that the appropriate DLL was in my php/ext directory.
In Apache, I modified the php.ini. I added two lines:
(both lines seem to be necessary).
I restarted Apache.
Most important: I started Mongo. It doesn't run in the background on its own, or fire up in response to the driver calls. I had to go into MongoDB/bin/ and execute mongod.exe (the server; mongo.exe is the console client) to get it to be resident.
Voila: I had Mongo available for tinkering. Next, I did lots of trawling through the slim documentation in the Mongo Reference. Fairly quickly, I was able to store and find data. Some things (see below) eluded me.

Some of the upsides:
  • Schemaless. This is HUGE. When you hold two records side-by-side, one can be an array with two cells and the other an object with 100 properties, both in the same table. This doesn't make data management a cakewalk: if you put inconsistent data into a database, you have to find a way to fish it out later. Every so often, I've been tasked with holding complex data in a database. MS Access has sub-tables, and I groan in harmony with the creaks from its engine when a sub-table is deployed.
  • Fast. Benchmarks differ, but simply put: MongoDB is fast. I think it's easy to figure out why: most of the chaff from MySQL is not a factor in Mongo.
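To make the schemaless point concrete, here is a minimal sketch in plain JavaScript (the language of the mongo console). An ordinary array stands in for a collection, so nothing here is MongoDB's API; the `things` array, the field names, and the `hasField` helper are all made up for illustration.

```javascript
// A plain array standing in for one MongoDB collection (hypothetical data).
const things = [
  { colours: ["red", "blue"] },                       // little more than a two-cell array
  { name: "Acme Corp", city: "Victoria", staff: 12 }, // an object with several properties
  { name: "Globex" },                                 // same "table", fewer fields
];

// Hypothetical helper: true when the record itself carries the given field.
function hasField(record, field) {
  return Object.prototype.hasOwnProperty.call(record, field);
}

// Fishing the data back out means checking each record's shape yourself.
const withCity = things.filter((r) => hasField(r, "city"));
const withName = things.filter((r) => hasField(r, "name"));

console.log(withCity.length); // 1
console.log(withName.length); // 2
```

Nothing stops the three records from sharing a collection; the cost shows up at read time, when every query has to allow for fields that may not be there.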

Some of the downsides:
  • Baby's Got Back. My database has 40 records, none with more than 400 characters of information. The database's file size is 84MB. For less than 16,000 characters of data, that seems like a LOT of overhead. On a host with small account sizes, it would be hard to get this much space for this little data.
  • Libraries in loose clay. I tried to make use of the PHP library of functions. The expanded functions were not recognized in PHP 5.2.8, even after adding the library and extension.
  • I was able to insert and find records. When I tried the $in operator in the MongoDB console client, it worked fine. When I tried the same from PHP, no dice.
  • Like is there no like? LIKE is a really sloppy call in MySQL. Every time I use it I wince a little (net effect: I wince a little a lot). There is no apparent equivalent in MongoDB. Correction: there is a regular expression syntax that's easy to use:
    db.customers.find( { name : /acme.*corp/i } );
    It works well and it's very snappy. No longer do you need to figure out how to do LIKE statements as well as regular expressions. Now, the better question I need to figure out: how to work a phrase like the above into a PHP statement.
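As a rough illustration of how a LIKE pattern maps onto a regular expression of the kind shown above, here is a sketch in plain JavaScript. The `likeToRegExp` helper is hypothetical (it is not part of MongoDB or its drivers). It assumes `%` means any run of characters and `_` means any single character, and it ignores SQL ESCAPE clauses.

```javascript
// Hypothetical helper: convert a SQL LIKE pattern into a JavaScript RegExp.
// Assumes % = any run of characters, _ = exactly one character, no ESCAPE clause.
function likeToRegExp(pattern) {
  const body = pattern
    .split("")
    .map((ch) => {
      if (ch === "%") return ".*";
      if (ch === "_") return ".";
      // Escape characters that are special in regular expressions.
      return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    })
    .join("");
  // Anchored, because LIKE matches the whole value; the "i" flag mimics a
  // case-insensitive collation.
  return new RegExp("^" + body + "$", "i");
}

const re = likeToRegExp("%acme%corp%");
console.log(re.source);                    // ^.*acme.*corp.*$
console.log(re.test("ACME Widget Corp.")); // true
console.log(re.test("Acme Inc."));         // false
```

In the mongo console, a RegExp object like `re` can be used in place of the literal in the query above; that call is not exercised here, since the sketch runs without a server.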

Some of the changes (I think neither good nor bad when you do the calculus):
  • No JOINS. MySQL performance falls apart when you do joins. Joins are the Achilles' heel of Drupal, which relies on them so heavily. It's understandable why, given how much data has to be joined and compared. Yet DBAs the world over always atomize and isolate data, then combine the data for the end result. There are two ways to banish joins:
    a) use MongoDB, which doesn't have the capacity to do joins.
    b) use MySQL and repeat data in different tables to commit to a practice of fewer joins. When you absolutely have to break this rule, you can do so at the cost of a performance hit.
  • No configs. I was intrigued by the concept after having waded through all of the tweaks you can visit upon MyISAM and InnoDB settings. I think MongoDB's "no config tweaking" will go by the wayside within a year, when somebody out there cranks the performance by messing with some environment variables, maybe even a variable outside of MongoDB itself (like putting MongoDB on its own disk).
  • Easily create databases and tables (aka collections). This is one innovation that really balances out as a net zero. By calling a database or collection, you create it if it doesn't exist. That is so very easy. But, how many times do coders trip over typos? If you took out typos, you could remove maybe 20-40% of your debugging time. MongoDB doesn't bleat when you create a database or collection with the wrong name. Worse than that, it allocates megabytes of disk space. Your data could accidentally end up in a sink hole. With MySQL, the errors would bleat out and give you something to repair. This could mean that development may have choppy waters, but the application in production may have an easier go of it because of the MongoDB performance benefits.
    Here's what I think the recipe for disaster could be:
    • 1 coder who names a collection in a client-side variable, or makes it dynamic so that user input can generate it.
    • 1 hacker who finds this numpty practice.
    • 1,000,000 exploits done automatically.
    Bake for a few short minutes when nobody is watching.
    Yields one web server out of disk space.
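One way to defuse that recipe, sketched here in plain JavaScript with no MongoDB calls, is to never use user input as a collection name directly and to check it against a fixed list instead. The collection names and the `resolveCollection` helper are hypothetical.

```javascript
// Hypothetical allow-list of the collections the application really uses.
const KNOWN_COLLECTIONS = new Set(["customers", "orders", "comments"]);

// Return the collection name if it is on the list; otherwise throw, so a
// typo or a hostile request fails loudly instead of creating a new collection.
function resolveCollection(requested) {
  if (typeof requested !== "string" || !KNOWN_COLLECTIONS.has(requested)) {
    throw new Error("Unknown collection: " + String(requested));
  }
  return requested;
}

console.log(resolveCollection("customers")); // customers

try {
  resolveCollection("custmers"); // typo: rejected before it reaches the database
} catch (err) {
  console.log(err.message); // Unknown collection: custmers
}
```

This puts back, in application code, the bleat that MySQL would have given for a misspelled table name.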

How do I do an $in in PHP? The suggested attempt failed.
Can I do a LIKE equivalent in MongoDB?
Dreamhost says it hosts MongoDB. Are there any other places that allow MongoDBs?
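On the $in question: in the console, the operator takes an array of candidate values. The sketch below, in plain JavaScript, builds that query document and runs it through a tiny stand-in matcher so the semantics are visible without a server. The `matchesIn` helper and the sample records are hypothetical, and the matcher handles only scalar fields, not everything MongoDB's `$in` does. The PHP driver call is not shown, since the failing attempt is not reproduced in this post.

```javascript
// The query document as it would be typed in the mongo console:
//   db.customers.find({ city: { $in: ["Victoria", "Vancouver"] } })
const query = { city: { $in: ["Victoria", "Vancouver"] } };

// Hypothetical stand-in for the server: keep records whose field value
// equals any element of the $in array (scalar fields only).
function matchesIn(record, q) {
  return Object.keys(q).every((field) =>
    q[field].$in.includes(record[field])
  );
}

const customers = [
  { name: "Acme Corp", city: "Victoria" },
  { name: "Globex", city: "Calgary" },
  { name: "Initech", city: "Vancouver" },
];

const found = customers.filter((c) => matchesIn(c, query));
console.log(found.map((c) => c.name)); // [ 'Acme Corp', 'Initech' ]
```

One common cause of trouble from PHP, which may or may not be what happened here, is writing "$in" in double quotes: PHP then treats $in as a variable to interpolate. Single quotes ('$in') avoid that.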

ACID - Atomicity, Consistency, Isolation, Durability. A set of properties that guarantee database transactions are processed reliably. ACID is used as a yardstick for evaluating databases and application architecture. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even though that might involve multiple changes (such as debiting one account and crediting another), is a single transaction.
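The bank-transfer example can be sketched in plain JavaScript. This is only an in-memory illustration of the all-or-nothing (atomicity) part, with hypothetical account names and a hypothetical `transfer` helper; a real database also handles isolation and durability, which this does not attempt.

```javascript
// Hypothetical in-memory accounts (balances in cents).
const accounts = { alice: 10000, bob: 2500 };

// Apply the debit and the credit together, or leave the balances untouched.
function transfer(ledger, from, to, amount) {
  if (!(from in ledger) || !(to in ledger)) {
    throw new Error("Unknown account");
  }
  if (!Number.isInteger(amount) || amount <= 0) {
    throw new Error("Invalid amount");
  }
  if (ledger[from] < amount) {
    throw new Error("Insufficient funds");
  }
  // All checks passed: both changes happen as one logical operation.
  ledger[from] -= amount;
  ledger[to] += amount;
}

transfer(accounts, "alice", "bob", 4000);
console.log(accounts); // { alice: 6000, bob: 6500 }

try {
  transfer(accounts, "bob", "alice", 999999); // fails: nothing is changed
} catch (err) {
  console.log(err.message); // Insufficient funds
}
console.log(accounts); // { alice: 6000, bob: 6500 }
```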

BASE - Basically Available, Soft state, Eventually consistent.
BASE, as the acronym denotes, is opposed to ACID. ACID is pessimistic and forces consistency at the end of every operation. BASE is optimistic and accepts that the database consistency will be in flux. That trade-off buys a degree of scalability that is easy to achieve with BASE and impossible to consider with ACID.
BASE can accomplish availability despite partial failures, hence the "Basically Available." Soft state means the data is in flux and non-deterministic. Eventually consistent means that if one data source doesn't report what you'd expect, the data will eventually propagate and become consistent across every copy, no matter where it's replicated.
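A toy model of "eventually consistent", in plain JavaScript: a write lands on one copy first, another copy answers with stale data for a while, and a later propagation step brings it into line. The `Replica` class and `propagate` function are hypothetical and say nothing about how MongoDB itself replicates data.

```javascript
// Hypothetical replica: a key/value store holding one copy of the data.
class Replica {
  constructor() {
    this.data = new Map();
  }
  write(key, value) {
    this.data.set(key, value);
  }
  read(key) {
    return this.data.get(key);
  }
}

// Copy everything from the source replica to the others (the "eventually" step).
function propagate(source, others) {
  for (const replica of others) {
    for (const [key, value] of source.data) {
      replica.write(key, value);
    }
  }
}

const primary = new Replica();
const secondary = new Replica();

primary.write("comment:1", "First!");
console.log(secondary.read("comment:1")); // undefined (stale: not yet propagated)

propagate(primary, [secondary]);
console.log(secondary.read("comment:1")); // First!
```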

ACID vs. BASE - I think that ACID may be impractical to guarantee. And, it may be unimportant at the end of the day. Following the Buddhist concept that "all things are impermanent", you can have inconsistent data today because in 100 years, no one will care about the data; or it will all come out in the wash.
I saw this one annoying talk on the topic. The speaker said, "so your comment goes missing... [exasperated pause] Who cares? [room erupts in laughter and applause]." He didn't care because the comments would appear eventually; or who cares: it was just one comment and could have been lost through network connectivity. I thought it was really amusing that the author of some of the most inane comments I have ever read would be indifferent about comments. When you look at the river of news, you can miss something. If it's important, it will come around again. That arrogance towards data is at the core of BASE, like a technical concept founded on sloppiness. I would prefer to cherry-pick between ACID and BASE. ACID when you're handling real data (transactions, comments, content). BASE when you're dipping into the river (news reproductions, live video, etc.).


Sam Corder said...

Like is done with regular expressions from which you can get a lot more mileage.

Mike DeWolfe said...

Cool, Sam! Thanks!
Do you know of any good online resources for how to put regular expressions into play in MongoDB or better yet, using regular expressions in PHP into Mongo?

Crias said...

A quick comment on the size you noticed:

Mongo appears to use a kind of page-file style layout. You'll probably find that if you add 10 records to your 40 record example, the size is no different.

Once you go over the threshold it will create a second page-file twice as big as the first, and continue going.

Mike DeWolfe said...

I agree, Crias. I think 40 records will take up as much disk space as 1400 records. Pre-allocation of space is common, but I was a little surprised to see it use so much space.

Anonymous said...

according to mongomachine:

the pre-allocated file doubles each time until it hits 2GB, which is the max a pre-allocated file will grow