Tuesday, September 24, 2013

New JSON UDFs for MySQL in Labs



One of the more common complaints I hear from developers about MySQL, when compared to other databases like PostgreSQL or MongoDB, is the lack of decent JSON support.  Consequently, we usually end up storing JSON as opaque blobs in the database.

I know what you're thinking, "Why don't you extract the information and put it into a nice, normalized set of tables?" to which the response is usually "Because that takes more time and effort."

So, assuming you don't want to store your JSON in MongoDB, where you can index it, query it in complex ways, and be totally web-scale, what can you do?  Well, for a while now there have been JSON-related routines in common_schema by Shlomi Noach: both 'extract_json_value' and 'json_to_xml' can be used.  While these do help a little, they are, fundamentally, still just brute-forcing their way through blobs of text.  There have also been a few attempts at UDFs for this over the years, the most extensive, in my opinion, being lib_mysqludf_json by Arnold Daniels.  It has some pretty good documentation, though, admittedly, I have not used it extensively.

However, as I discovered quite by chance, we now have some additional UDFs from the fine folks at MySQL, posted a few days ago on labs.mysql.com.  I have not heard any mention of them yet in any of the usual places (if they were mentioned at MySQL Connect this past weekend, I missed it).  In this article I'll take a closer look at them.

To start with, you have to download and install the UDFs.  This is a fairly straightforward process, and instructions are given in the README inside the download, though the library file name in the README ("libmy_json.so") does not match the name of the library file actually provided ("libmy_json_udf.so"), so you have to account for that when creating the functions.  A small matter, sure, but odd that the meager documentation would get that part wrong.  Then again, this is Labs, not GA yet, so it's not a big deal.
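
For reference, after copying the library into the server's plugin directory, each function is registered with the standard UDF syntax.  A minimal sketch for two of the functions, assuming the corrected library name (the full function list and return types are in the README, so verify against your copy):

-- Register two of the JSON UDFs against the library name as actually shipped
CREATE FUNCTION json_valid RETURNS INTEGER SONAME 'libmy_json_udf.so';
CREATE FUNCTION json_extract RETURNS STRING SONAME 'libmy_json_udf.so';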

It should be noted before moving on that we are still storing JSON in normal MySQL types (varchar, text, blob) rather than getting a new data type for it, which, in my opinion, is fine.  We do essentially the same thing for XML stored in MySQL today.

There are a number of functions provided:

  • json_valid - Gives you a 1 or a 0 depending on the validity of the JSON document.
  • json_search - Finds stuff in your JSON blobs.
  • json_extract - Pulls out individual sections or values based on keys.
  • json_replace - Just what it sounds like and, probably, the most exciting feature, since you no longer have to pull the whole blob back to your application, do your JSON operation, and then store the whole thing again (see the sketch just after this list).
  • json_append - Cool for similar reasons as above.
  • json_remove - Again, similar to the above.
  • json_set - According to the documentation, a sort of 'INSERT ... ON DUPLICATE KEY UPDATE' function that checks and parses the JSON (though not the whole document in this version, for some reason).
  • json_merge - What it sounds like, though I'm not sure you would use it on writes; maybe as a sort of aggregation method when querying the data.
  • json_contains_key - Checks whether a given key exists in the document; sort of like 'IFNULL'.
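
Just to illustrate why json_replace excites me: the whole read-modify-write cycle can stay inside the server.  Here is a sketch of what I expect an update to look like, against the test.jtest table I create below; I have not verified the argument order or quoting conventions yet, so treat the signature (document, key, new value) as an assumption on my part:

-- Bump one document's population in place, no application round-trip
UPDATE test.jtest
   SET jdoc = json_replace(jdoc, 'pop', '"6056"')
 WHERE json_extract(jdoc, '_id') = '35004';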


I have yet to try them all out, but the few I have used so far are quite useful.  Just to show a bit of how this works, let's insert some data and query it.

First, let's create a table; I'm using test.jtest for this:

CREATE TABLE `jtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `jdoc` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Next, I need sample data.  I'm using the zips.json samples from the fine folks at MongoDB:

wget -q http://media.mongodb.org/zips.json

This contains a lot of documents of the following format:

{"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055, "state": "AL", "_id": "35004"}

Now, let's insert them (I know you can do a Perl one-liner for this, but I chose to write a quick two-line script instead since I'll likely be reusing it a few times):

vim json_insert.pl
#!/usr/bin/perl
# Wrap each line of zips.json in an INSERT (assumes no single quotes in the input)
while (<>) {chomp; printf("INSERT INTO test.jtest (jdoc) VALUES('%s');\n", $_);}

cat zips.json | ./json_insert.pl | mysql

We need to verify that our data made it:

SELECT COUNT(*) FROM test.jtest;
+----------+
| COUNT(*) |
+----------+
|    29472 |
+----------+
1 row in set (0.01 sec)


Anyway, let's move on and query the data:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

Aw...  it doesn't work.  Why not?  It looks like valid JSON to my eye (maybe I'm just used to the MongoDB style), but it isn't valid as far as these UDFs are concerned, and the JSON functions only operate on documents they consider valid.  See:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       0 |    29472 |
+---------+----------+
1 row in set (0.84 sec)
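
To narrow down what was being rejected, it helped to feed json_valid some hand-written fragments.  A check along these lines (the results in the comments reflect the behavior I describe below; treat them as my observations, not documented rules):

SELECT json_valid('{"pop": 6055}') bare_number;     -- returns 0: bare number rejected
SELECT json_valid('{"pop": "6055"}') quoted_value;  -- returns 1: quoted value accepted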

OK, so that's disappointing.  I can't just pull data out of MongoDB and drop it directly in here (yet).  After a lot of playing around, it seems the working definition of valid JSON requires both sides of every key-value pair to be quoted (strictly speaking, bare numbers and arrays are perfectly valid JSON, so this is stricter than the spec, and disappointing for numbers, coordinates, etc.).  So how do we have fun here?  We need to butcher our JSON a bit.  In this case I'm taking out the coordinates for "loc" and quoting the population numbers:

vim json_insert.pl
#!/usr/bin/perl
while ($line = <>) {
    chomp $line;
    # Replace the "loc" coordinate array with a quoted placeholder...
    $line =~ s/\[.*\]/"blank"/g;
    # ...and quote the bare population number.
    $line =~ s/pop": (\d+)/pop": "$1"/g;
    printf("INSERT INTO test.jtest (jdoc) VALUES('%s');\n", $line);
}


Now, let's reset and try again:

TRUNCATE TABLE test.jtest;

cat zips.json | ./json_insert.pl | mysql

Now let's check again:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       1 |    29470 |
+---------+----------+
1 row in set (0.11 sec)


Woohoo!  (It looks like two of the original 29,472 documents didn't survive my crude regexes, but that's close enough.)  Can we do things now?

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;
+------------+-------+
| city       | pop   |
+------------+-------+
| ACMAR      | 6055  |
| ADAMSVILLE | 10616 |
| ADGER      | 3205  |
| KEYSTONE   | 14218 |
| NEW SITE   | 19942 |
| ALPINE     | 3062  |
| ARAB       | 13650 |
| BAILEYTON  | 1781  |
| BESSEMER   | 40549 |
| HUEYTOWN   | 39677 |
+------------+-------+
10 rows in set (0.00 sec)

We can!
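
And since json_extract appears to return the bare value (note the unquoted city names above), it should work for filtering too.  A sketch along the same lines, with the string comparison being an assumption on my part:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest WHERE json_extract(jdoc, "state") = 'AL' LIMIT 5;

Bear in mind there is no way to index these extracted values, so every query like this is a full table scan; handy, but not fast.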

Conclusion: I don't think we're quite there yet, but it is a definite start.

