Tuesday, September 30, 2014

JSON UDFs for MySQL in Labs - Take 2


About a year ago I wrote a bit about the MySQL JSON UDFs from MySQL Labs.  You can find the original article and testing steps here.

With the most recent Labs release (0.3.2) that Sveta put out, I decided to try this again and see how far we've come.  For reference, the work is also documented on Sveta's blog - https://blogs.oracle.com/svetasmirnova/

I repeated the exact same steps to create the table and load it with the most recent zips.json data straight from the MongoDB folks.  After running a handful of documents through a quick external validation on the JSONLint website (quick plug: I love this thing, especially when I have to edit IAM policies for AWS), I set about loading them again with the one-and-a-half-line Perl script I used last time.

To start with, the installation and documentation are much improved.  I was impressed with just how simple the install was and how fleshed out that README file is becoming - two things I had previously commented needed some small improvement.

Now, the fun part.

Data loaded normally, cleanly, and nicely using the same steps as before.

wget -q http://media.mongodb.org/zips.json

vim json_insert.pl

#!/usr/bin/perl
while (<>) {chomp; printf("INSERT INTO test.jtest (jdoc) VALUES(\'%s\');\n", $_);}


CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE `jtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `jdoc` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

cat zips.json | ./json_insert.pl | mysql
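
Before going any further, a quick sanity check I like to do: feed one raw document straight from zips.json to json_valid and see whether the UDF agrees with JSONLint (just a spot check, not proof that the whole load is clean):

SELECT json_valid('{"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055, "state": "AL", "_id": "35004"}') AS isValid;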

Ok, so far so good.  Now let's try the queries from last time again:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

Hmm, that looks familiar...  Ok, maybe we're still working with the more formal definition of JSON here and I need to quote both keys and values again.  Let me check with the 'json_valid' UDF:

mysql> SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       1 |    29353 |
+---------+----------+
1 row in set (0.42 sec)


Hmm, a few more zip codes than last time, but otherwise it claims the column contains nice, valid JSON documents.  Last time, to check functionality, I replaced the geographical coordinates with just the word "blank" in quotes to satisfy the requirement for fully quoted key and value pairs, which also removed any list used as a value (the same substitution script shown in the original post below).  This time, repeating that process nets me a familiar:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

So, that brings us right back to where we started.  I admit I haven't played much with the other functions yet.  I keep getting stuck on this one, which, at least for me and my uses, is the big one (and if it works, json_append(), json_replace(), and json_set() will obviously be big ones for me too).

Playing around a little more, I keep finding the same thing.  This is still promising and I still hope it works, but it is not behaving the way I would expect, and none of the other variations I have concocted so far work either.  I think I will next play with the built-in JSON functionality in PostgreSQL 9.3 and see how it compares.  Stay tuned for that in the next few days, I hope.
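
For comparison, here is roughly what I expect the equivalent query to look like in PostgreSQL 9.3, assuming the same jtest table with jdoc stored as text (untested on my side as of this writing):

SELECT jdoc::json->>'city' AS city, jdoc::json->>'pop' AS pop FROM jtest LIMIT 10;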

In the meantime, Sveta - I love your work, I think I'm following the examples in the README just fine, but what am I missing here?  I'd love to show this working here.

Tuesday, July 22, 2014

Copying and Pasting SQL

Lesson learned today - Do not trust auto-formatting on websites / instant messaging apps / e-mail.

When copying and pasting queries, be careful about auto-formatting.  Many editors want to replace a plain hyphen-minus '-' with an en dash '–'.  Some (possibly most) SQL shells interpret these differently; usually only the first is treated as subtraction.
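
A minimal illustration (the exact error message will vary by client):

SELECT 10 - 3;   -- plain hyphen-minus: returns 7
SELECT 10 – 3;   -- en dash pasted in from a formatted document: syntax error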

It took an embarrassing amount of time today for me to catch that one.  I hope someone is spared a few minutes by my wasted time.


Thursday, May 15, 2014

Older MySQL Poems

By unpopular demand, here are some of the ones from the Percona Live 2013 presentation.  I make no claims of quality:


deadlock detected
we rollback transaction two
err one two one three

- a MySQL Haiku


In the style of Edgar Allan Poe's "The Raven"...

Once upon a SQL query
While I joked with Apple's Siri
Formatting many a logical volume on my quad core
Suddenly there came an alert by email
as of some threshold starting to wail
wailing like my SMS tone
"Tis just Nagios", I muttered,
"Sending alerts unto my phone,
Only this - I might have known..."

Ah distinctly I remember
as I documented for each member
of the team just last November
in the wiki that we keep -
write and keep and nothing more...
When my query thus completed
Fourteen duplicate rows deleted
All my replicas then repeated
repeated the changes as before.
I dumped it all to a shared disk,
kept as a backup forever more.



There once was a small database program
It had InnoDB and MyISAM
One did transactions well,
and the other would crash like hell.
Between the two they used all of my RAM

- A MySQL Limerick -



Round and round the disk drive spins
but SSD sits still and grins.
It is randomly fast
for data current and past.
My database upgrade begins.



Flush local tables
Make an LVM snapshot
Backup with rsync

- A Haiku on simple backups -



Oracle is red,
IBM is blue,
I like stuff for free -
MySQL will do.



There was a dba who lived in a shoe,
He had so many tables he didn't know what to do
He optimized some and analyzed a few
then truncated them all and left for the day



And a throwback to Dr. Seuss...

Do you like MyISAM?
I do not like it, Sam-I-am.
I do not like MyISAM.

Would you use it here or there?
I would not use it here or there.
I would not use it anywhere.
I do not like MyISAM.
I do not like it, Sam-I-am.

Would you like it in an e-commerce site?
Would you like it in the middle of the night?
I do not like it for an e-commerce site.
I do not like it in the middle of the night.
I would not use it here or there.
I would not use it anywhere.
I do not like MyISAM.
I do not like it, Sam-I-am.

Would you could you for foreign keys?
Use it, use it, just use it please!
You may like it, you will see
Just convert these tables three…
Not for foreign keys, not for those tables three!
I will not use it, you let me be!

The DBA's Creed

Alright, it's time for another cheesy one, but here goes.  Feel free to modify it in the comments if you have a better version:



This is my database.  There are many like it, but this one is mine.
My database is my best friend.  It is my life.  I must master it as I master my life.
My database, without me, is useless.  Without my database, I am useless.
I must query my database well.  I must make queries respond faster than my customers who are trying to kill me.  I must tune the database before he shoots me. I will.
My database and I know what COUNT(*) means.  It's the JOINs in the query, the temp tables on disk, or the IOPS used.  We know they should be sargable.  We will index properly...

My database is code, even as I, because it is my life.  Thus, I will learn it as a brother.  I will learn its weaknesses, its strength, its changelogs, its plugins, its views and its triggers.
I will keep my buffer_cache filled and ready, even as my dirty writes are flushed.  We will become part of each other.  We will...

Before Codd, I swear this creed.  My database and I are the defenders of my company's data.  We are the masters of our queries. Backups are the saviors of my life.

So be it, until there are no slow queries and all is normalized, and cached.

Friday, November 22, 2013

Data Poem Time!

(Inspired / modified from Robert Frost's "Nothing Gold Can Stay" and Elizabeth Barrett Browning's "How Do I Love Thee")

Nothing Old Can Stay (A poem about data archiving and purging)

Website's first orders are cold,
Its hardest rows to query.
The original hits were a joy;
But only until the second deploy.
Then buffer pages are not marked new,
As the data grew and grew,
Till the web traffic of today.
Nothing old can stay.


How Do I Join Thee

How do I join thee?  Let me count the rows.
I join thee to the maximum join size
My join buffer will permit in bytes
For after that it is written on disk,
Though performance is less brisk.
I join thee first on the left with sighs
I join thee then on the inner sides.
I join thee not only so I can join other tables
But to sum and aggregate over thy labels.
I join thee using thy reference column
I join thee in my reports so solemn.
Select, Delete, Update, or Insert:
I shall join thee until I cause an alert.

Tuesday, September 24, 2013

New JSON UDFs for MySQL in Labs



One of the more common complaints I hear from developers about MySQL when compared to other databases like PostgreSQL or MongoDB is a lack of decent JSON support.  Consequently, we usually end up storing JSON blobs in the database.

I know what you're thinking, "Why don't you extract the information and put it into a nice, normalized set of tables?" to which the response is usually "Because that takes more time and effort."

So, assuming you don't want to store your JSON in MongoDB where you can index it, query it in complex ways, and be totally web-scale, what can you do?  Well, for a while now there have been JSON-related routines in common_schema by Shlomi Noach: both 'extract_json_value' and 'json_to_xml' can be used.  While these help a little, they are fundamentally still brute-forcing their way through blobs of text.  There have also been a few attempts at UDFs for this over the years, the most extensive, in my opinion, being lib_mysqludf_json by Arnold Daniels.  It has some pretty good documentation, though admittedly I have not used it extensively.

However, as I discovered quite by chance, we now have some additional UDFs from the fine folks at MySQL, posted a few days ago on labs.mysql.com.  I have not heard any mention of them yet in the normal places (if they were mentioned at MySQL Connect this past weekend, I missed it).  In this article I'll talk a bit more about them.

To start with, you have to download and install the UDFs.  This is a fairly straightforward process, and instructions are given in the README inside the download, though the library file name "libmy_json.so" given in the README does not match the name of the provided library file itself, "libmy_json_udf.so", so you have to account for that when creating the functions.  A small matter, sure, but odd that the meager documentation would get that part wrong.  Then again, it's in Labs, not GA yet, so not a big deal.
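
For reference, registering the functions boils down to a series of CREATE FUNCTION statements pointing at the shared library.  This is a sketch from memory - the return types and the full list of functions are whatever the README actually specifies - but with the corrected file name it looks roughly like this:

CREATE FUNCTION json_valid RETURNS INTEGER SONAME 'libmy_json_udf.so';
CREATE FUNCTION json_extract RETURNS STRING SONAME 'libmy_json_udf.so';
CREATE FUNCTION json_replace RETURNS STRING SONAME 'libmy_json_udf.so';
-- ...and so on for the rest of the functions listed in the README.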

It should be noted before moving on that we are still storing JSON in normal MySQL types (varchar, text, blob) rather than getting a new data type for it, which, in my opinion, is fine.  We basically do the same thing already for any XML stored in there.
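
To illustrate the XML analogy: MySQL already lets you dig into XML stored in plain text columns with functions like ExtractValue(), for example:

SELECT ExtractValue('<address><city>ACMAR</city></address>', '/address/city');  -- returns 'ACMAR'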

There are a number of functions provided:

  • json_valid - gives you a 1 or a 0 depending on the validity of the JSON document
  • json_search - Find stuff in your JSON blobs.
  • json_extract - Pull out individual sections / values based on keys
  • json_replace - just what it sounds like and probably the most exciting feature, since you no longer have to return the whole blob to your application, do your JSON operation, and then store the whole thing back again.
  • json_append - cool for similar reasons as above.
  • json_remove - again, similar to the above.
  • json_set - according to the documentation, a sort of 'INSERT ... ON DUPLICATE KEY UPDATE' function that checks and parses the JSON (though not the whole document in this version, for some reason).
  • json_merge - what it sounds like, though I'm not sure you would use it for writes; maybe more as an aggregation method when querying the data.
  • json_contains_key - a key-existence check, sort of like 'IFNULL' for keys (see the sketch just below this list).
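
Here is a rough sketch of how I read json_contains_key - I am assuming the key is simply passed as an argument after the document, so treat the exact call signature as my guess rather than gospel:

SELECT json_contains_key('{"city": "ACMAR", "pop": 6055}', "city");   -- expect 1
SELECT json_contains_key('{"city": "ACMAR", "pop": 6055}', "county"); -- expect 0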


I have yet to try all of them, but a few are quite useful so far.  Just to show a bit of how this works, let's insert some data and query it.

First, let's create a table; I'm using test.jtest for this:

CREATE TABLE `jtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `jdoc` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Next, I need sample data.  I'm using the zips.json samples from the fine folks at MongoDB:

wget -q http://media.mongodb.org/zips.json

This contains a lot of documents of the following format:

{"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055, "state": "AL", "_id": "35004"}

Now, let's insert them (I know you could do this with a Perl one-liner, but I chose to write a quick two-line script instead since I'll likely be reusing it a few times):

vim json_insert.pl
#!/usr/bin/perl
while (<>) {chomp; printf("INSERT INTO test.jtest (jdoc) VALUES(\'%s\');\n", $_);}

cat zips.json | ./json_insert.pl | mysql

We need to verify that our data made it:

SELECT COUNT(*) FROM test.jtest;
+----------+
| count(*) |
+----------+
|    29472 |
+----------+
1 row in set (0.01 sec)


Anyway, let's move on and query the data:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

Aw... it doesn't work.  Why not?  It looks like valid JSON to my eye (maybe I'm just used to the MongoDB style), but it isn't valid as far as the UDFs we're using are concerned.  The JSON UDFs only operate on documents they consider valid.  See:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | count(*) |
+---------+----------+
|       0 |    29472 |
+---------+----------+
1 row in set (0.84 sec)

Ok, so that's disappointing.  I can't just pull data out of MongoDB and drop it directly in there (yet).  After a lot of playing around, it seems the working definition of valid JSON here requires key-value pairs to have both sides quoted (yes, that is the strictest reading, but it is disappointing for numbers, coordinates, etc.).  So how do we have fun here?  We need to butcher our JSON a bit.  In this case I'm taking out the coordinates for "loc" and quoting the population numbers:

vim json_insert.pl
#!/usr/bin/perl
while ($line = <>) {
    chomp $line;
    # replace any JSON list value (the "loc" coordinates) with the quoted string "blank"
    $line =~ s/\[.*\]/\"blank\"/g;
    # quote the numeric "pop" values
    $line =~ s/pop\"\: (\d+)/pop\"\: \"$1\"/g;
    printf("INSERT INTO test.jtest (jdoc) VALUES(\'%s\');\n", $line);
}


Now, let's reset and try again:

TRUNCATE TABLE test.jtest;

cat zips.json | ./json_insert.pl | mysql

Now let's check again:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       1 |    29470 |
+---------+----------+
1 row in set (0.11 sec)


Woohoo!  Can we do things now?

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;
+------------+-------+
| city       | pop   |
+------------+-------+
| ACMAR      | 6055  |
| ADAMSVILLE | 10616 |
| ADGER      | 3205  |
| KEYSTONE   | 14218 |
| NEW SITE   | 19942 |
| ALPINE     | 3062  |
| ARAB       | 13650 |
| BAILEYTON  | 1781  |
| BESSEMER   | 40549 |
| HUEYTOWN   | 39677 |
+------------+-------+
10 rows in set (0.00 sec)

We can!

Conclusion: I don't think we're quite there yet, but it is a definite start.


Thursday, August 1, 2013

Random Cool Stuff

It's been a while without activity here but I have not been entirely idle.  I have spent rather a lot of time working lately (entirely too much).  Sometimes I find myself reading on various topics or stumbling across interesting bits as I attempt to research answers to things.  I figured I'd share a few of the interesting reads, projects, or ideas I've come across lately.

http://www.php.net/manual/en/intro.mysqlnd-ms.php - Definitely has potential and makes me excited, but it still needs somewhat more intelligent read/write splitting abilities.

http://mysqldba.blogspot.com/2012/02/asyncronous-shard-queries-in-php-using.html - some bits around asynchronous reads from a sharded set of databases.

http://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html - Switching some tables from MyISAM to InnoDB, I have had to field a few questions from perplexed developers about their auto-increment values.  Sometimes there are massive gaps now.  Why?  In part, they are doing a lot of 'INSERT IGNORE' queries, which effectively reserve an auto-increment value (unless you are specifying it manually) in the event the insert succeeds.  You do not get those values back when the result is "ignore".  You can, however, influence this a bit with the 'innodb_autoinc_lock_mode' variable.  In short, you should not rely on an auto-increment column being gap-free.
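
To make that concrete, here is a small sketch of the sort of gap I mean (the table and values are made up, and the exact ids depend on innodb_autoinc_lock_mode; this is what I would expect with the default setting of 1):

CREATE TABLE t (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  col int NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (col)
) ENGINE=InnoDB;

INSERT INTO t (col) VALUES (1);          -- gets id 1
INSERT IGNORE INTO t (col) VALUES (1);   -- ignored as a duplicate, but still burns an auto-inc value
INSERT INTO t (col) VALUES (2);          -- gets id 3, not 2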

http://www.percona.com/doc/percona-server/5.5/reliability/crash_resistant_replication.html - This one is fun.  I have not benchmarked the cost of enabling this yet but I have experienced the problems it is trying to solve.  Note that MySQL 5.6 also fixes this problem and the Percona 5.6 option is replaced by the Oracle MySQL solution.

Since it doesn't come up often, I sometimes have to refer folks to this pretty good explanation of (and solution to) the many-to-many relationship problem, which a normal RDBMS cannot represent directly - http://www.tonymarston.net/php-mysql/many-to-many.html .  Good read.

Lately I have been conned into supporting PostgreSQL as well.  There is a bit to get used to, like permissions.  Permissions are wildly different and, I dare say, not well documented in the manual or in many of the other places I found on the wide internet.  Remember to grant usage on the schema itself when needed, and remember to check for expired accounts.  http://stackoverflow.com/questions/6799224/pgsql-grant-every-single-right-to-a-user-on-a-schema
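
For the schema-usage piece, this is the sort of thing I mean (schema and user names are made up for the example):

GRANT USAGE ON SCHEMA myschema TO myuser;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA myschema TO myuser;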

http://www.tokutek.com/2011/07/indexing-the-director’s-cut/ - Not really a read so much as an excellent presentation.  Make some popcorn, put it on the big screen, and enjoy.

http://www.npr.org/blogs/health/2013/07/18/203298244/worlds-biggest-virus-may-have-ancient-roots - Not related at all to databases, but still crazy cool.

http://www.oracle.com/us/corporate/press/1967380 - Yeah, it's Oracle, not MySQL, but hey, they are doing some really nifty things too: pluggable (portable) databases, data heat maps for auto-tiering storage, compression policies, detailed heuristics for index stats, and MapReduce features.  I do always wish MySQL would handle index statistics more intelligently.  For example, just because a column has three possible values doesn't mean 33% of the rows hold value 'a'.

Did I mention I passed the first part of the Oracle OCA exam?  Yep, I now know the basics of Oracle SQL.  It is, in a word, very different from MySQL's SQL.  So are many of the principles, like aliasing rules and whether or not you should use subqueries (obviously the optimizer is a bit different; you can nest an "unlimited" number of subqueries in some parts of a query, and up to 256 deep in the WHERE clause if you're crazy enough).  Next step: the actual Administration part.  There is a lot of material for someone who hardly ever uses it, but soon...

In case you have a lot to delete from a very large table (let's just assume you "sharded" it, or don't care and are trying to clean up old garbage), you may notice that a simple "DELETE FROM tbl WHERE col1='value';" is a bit too slow to run in production, particularly with InnoDB if col1 isn't the primary key.  Ideally, you want to do small, incremental deletes by primary key reference if possible, to avoid hosing everything in the process.  Here is a nice explanation of how to do it: http://mysql.rjweb.org/doc.php/deletebig
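
As a very simplified sketch of the idea (the linked article walks the primary key in ranges, which is better still; the table and column names here are placeholders):

-- delete a small chunk at a time; re-run from a script or an event until ROW_COUNT() reports 0
DELETE FROM tbl WHERE col1 = 'value' ORDER BY id LIMIT 1000;
SELECT ROW_COUNT();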

Other neat things:  https://code.google.com/p/flexviews/ and https://sdm.lbl.gov/fastbit/ and http://www.arubin.org/blog/2012/07/16/mysql-pam-ldap-authentication-module-configuration/ and https://github.com/mcafee/mysql-audit

I'll leave you searching for the reference with the following table, keeping in mind that one must never attempt to normalize Lewis Carroll:

CREATE TABLE OtherThings (
   Speaker varchar(100) not null default "Walrus",
   Listener varchar(100) not null default "Carpenter",
   Topic enum('shoes', 'ships', 'sealing wax', 'cabbages', 'kings'),
   SeaIsHost boolean,
   PigsHaveWings boolean
) ENGINE=BLACKHOLE;


I'll try to pick a more particular topic for the next post...