Saturday, April 11, 2015

MySQL and SSL on Amazon RDS


Warning, this is a longer post.  Feel free to skip chunks as dictated by your level of ADHD.

Amazon's relational database service recently had to rotate SSL certificates for all their RDS instances when the existing certificate expired, as they all eventually do.  This got me a little curious since I remembered that MySQL can make use of SSL certificates to both secure connections and also to authenticate users.

I decided to see if I could use client-based certificates to authenticate clients in RDS so I spun up a quick RDS instance (perhaps I'll write a post about that later) and started to tinker.  However, I forgot to fully consider the problem and focused instead on what I had set out as the requirement for my project - simply making client certificates work in RDS.

First, a little theory...


   What is this SSL thing anyway?

In a word, encryption.
In a few more words it is encryption that includes a method of both doing the key exchange through use of asymmetric ciphers such as RSA as well as the more efficient block cipher for actual transport once a key and cipher-suite has been negotiated.  Now, it's not just that (the Confidentiality by means of encryption) but also includes the HMAC (hashed message authentication code, or a digest to verify the validity of the message) for the Integrity part and also a means of Authentication by means of verifying identities using a trusted issuer of the certificates that signs them so you know they are who they say (provided you remember to check the signature and can also trust that third party).
If you want more than that there is this awesome site called Wikipedia that has many more words and some diagrams and nerdy stuff.  It is not my intention to explain how awesome RSA is, cause it's cool.
   Here's the high-level of how it all fits together, though.
   First, there is this entity called a Certificate Authority.  This is someone everybody involved can trust.  For example, when you're dealing with websites you don't know this might be VeriSign or DigiCert or someone like that.  Your web browser typically comes with some Certificate Authority certificates built-in so that when you browse to a site that has a certificate that is signed by or issued by one of those authorities your browser can check the signature and tell that it's a valid certificate and should be trusted.
   Next, one party that wants to be trusted too will generate a key plus a certificate signing request.  He or she will keep the key private and never share that, then send the certificate signing request that has all the details about him or her to the certificate authority to sign.  The most important information in the request is the "Subject" line which contains details about the who, where, etc of the person / server.
   The Certificate Authority (CA) then decides if it's willing to trust and certify that person or server who made the request (if they are a commercial CA like VeriSign then they usually want some money to convince them that they trust you).  So, thinking over the CA whips out that private key it keeps tightly tucked away for just such occasions and signs the request.  Now it has not just the Subject line but an Issuer line of information included in it.  This is given back to the requester who now has a proper pair of keys, private and public (called a key and a cert here).

Now, how does MySQL use these in a normal setup?

   On the server side you have the CA cert and a key/cert pair for the server itself, configured with the following parameters:
--ssl-ca=ca.pem
--ssl-cert=server-cert.pem
--ssl-key=server-key.pem

So, here is where I should point out that you need those in "PEM" format instead of some other weird thing like PKCS#7 or something that nobody cares about.

At this point you're good to go from the server side and your client can connect.  Now, the MySQL client will currently ignore the fact that the server has a certificate configured unless you explicitly tell it to care (now changed in 5.7.7) so you don't have to worry about something terrible happening to all your normal connections.  However, you also can now connect to the server with an SSL encrypted connection!  All you need is a copy of the public CA cert from the server.  So, let's try it -

# mysql -uroot --ssl-ca=ca.pem
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.6.24-log MySQL Community Server (GPL)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Ok, cool, it didn't error, but how do we know we're using SSL here?  The simple method is to run the 'status' command

mysql> status;
--------------
mysql  Ver 14.14 Distrib 5.6.24, for Linux (x86_64) using  EditLine wrapper

Connection id:          3
Current database:
Current user:           root@localhost
SSL:                    Cipher in use is DHE-RSA-AES256-SHA
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server version:         5.6.24-log MySQL Community Server (GPL)
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    utf8
Conn.  characterset:    utf8
UNIX socket:            /var/lib/mysql/mysql.sock
Uptime:                 22 min 12 sec

Threads: 1  Questions: 12  Slow queries: 0  Opens: 67  Flush tables: 1  Open tables: 60  Queries per second avg: 0.009
--------------

See that lovely orange line that's terribly hard to miss when it's highlighted?  Yep, that tells us that we're using SSL and what the negotiated cipher-suite is for this connection.

But wait, there's more!  In a standard MySQL setup and following that manual, we can also give one of these nifty certs to the client.  Why?  So not only can we verify the server is trusted but so that the server can verify that we the client are trusted, too.  Just repeat the process of creating the server's key/cert pair before except call it the client key/cert.  Then make a user and require one of three things from the user as part of the create user / grant statement.  Now use some more flags on the command line when connecting:

 GRANT SELECT ON *.* TO 'bjim'@'localhost' REQUIRE X509;

# mysql -ubjim --ssl-ca=ca-cert.pem --ssl-cert=client-cert.pem --ssl-key=client-key.pem  Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.6.24-log MySQL Community Server (GPL)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

"Require X509" just means that the client should have a properly formatted cert signed by the CA specified.  Kind of like having a company badge.  Nobody ever looks at the details, but if you have on that looks like the company made it then you're good to go.

The other options are to require either the Issuer or Subject string to match.  Remember those from a few paragraphs ago?  You can do things like this:

GRANT SELECT ON *.* TO 'bjim'@'localhost' IDENTIFIED BY 'pass123' REQUIRE ISSUER '/C=US/ST=Texas/L=Dallas/O=Banana Corp/CN=CA/emailAddress=ca@example.com';

Or

GRANT SELECT ON *.* TO 'bjim'@'localhost' IDENTIFIED BY 'pass123' REQUIRE SUBJECT '/C=US/ST=Texas/L=Dallas/O=Banana Stand/CN=client/emailAddress=client@example.com';

Those just match the properties of the certificate and they can be combined.  Cool, right?  There is something deceptive here that is simply not stated but which leads you to make a terrible assumption which is where I went wrong which we'll see later...

Why do we want to secure MySQL connections?

   Well, now that's an excellent question.  Basically, it's mostly to encrypt the network traffic to prevent the dangers of someone snooping on the traffic or playing the fun little Man-in-the-Middle game.  Most of the time you rely on other options for "securing" the network aspects of MySQL such as keeping your databases in a private network in their own VLAN and allowing access from certain controlled, or seemingly controlled, servers such as your web / app servers or some bastion servers you might hop through for administrative or monitoring purposes with connections to that being secured by means of a VPN or SSH.  So what does SSL bring to the table past that?  Well, not a ton, in my opinion, just the option to have something akin to 2-factor authentication.

Enough talk, let's do this!


How to generate all the certs

   We're basically going to follow the standard method of generating a CA cert, then using it to sign our own server and client certs which we will also generate.  Anyone who has ever done this for a web server is already familiar with the process.  Just to make sure we're doing it the MySQL way we'll follow the manual.

You fill in your own blanks when prompted
# make a CA key
   openssl genrsa 2048 > ca-key.pem
# make the CA cert
   openssl req -new -x509 -nodes -days 3600 -key ca-key.pem -out ca.pem
# make the Server's key and CSR
   openssl req -newkey rsa:2048 -days 3600 -nodes -keyout server-key.pem -out server-req.pem
# Fix some formatting (necessary for newer versions of OpenSSL)
   openssl rsa -in server-key.pem -out server-key.pem
# sign the Server's CSR and generate the Cert
   openssl x509 -req -in server-req.pem -days 3600 -CA ca.pem -CAkey ca-key.pem -set_serial 01 -out server-cert.pem
# make the Client's key and CSR
   openssl req -newkey rsa:2048 -days 3600 -nodes -keyout client-key.pem -out client-req.pem
# Fix the formatting on this one, too
   openssl rsa -in client-key.pem -out client-key.pem
# Sign the client's CSR and produce the client's cert
   openssl x509 -req -in client-req.pem -days 3600 -CA ca.pem -CAkey ca-key.pem -set_serial 01 -out client-cert.pem

Verify the issuer and subject (here's what mine say):
# openssl x509 -subject -noout < client-cert.pem
subject= /C=US/ST=TX/L=Dallas/O=Hopscotch Pros/OU=Analytics/CN=client1.example.com/emailAddress=analysts@example.com
# openssl x509 -issuer -noout < client-cert.pem
issuer= /C=US/ST=TX/L=Dallas/O=examples cert authority/OU=mr it man/CN=ca.example.com/emailAddress=root@example.com


Cool, now we have a bunch of files, we plug them in the blanks, create some users to match against the subject, and poof!

So I make the user:

GRANT SELECT ON *.* TO 'jim'@'localhost' IDENTIFIED BY 'pass123' REQUIRE SUBJECT '/C=US/ST=TX/L=Dallas/O=Hopscotch Pros/OU=Analytics/CN=client1.example.com/emailAddress=analysts@example.com';

# mysql -ujim -ppass123 --ssl-ca=ca.pem --ssl-cert=client-cert.pem --ssl-key=client-key.pem
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.6.24-log MySQL Community Server (GPL)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Yay!


Errors!?!



Ok, ok, I got this!  Now let's create the same user on my RDS instance (except with the host part as % instead of localhost) and try with the same certs:



# mysql -ujim -ppass123 -h myinstance.myid.us-west2.rds.amazonaws.com --ssl-ca=rds-combined-ca-bundle.pem --ssl-cert=client-cert.pem --ssl-key=client-key.pem
ERROR 2026 (HY000): SSL connection error: protocol version mismatch

Wait, what?  What does that even mean?

After some digging around I found that there are some issues when using newer versions of OpenSSL because by default it uses a slightly different header on the pem file.  Solution is to convert from pem to pem, which is what we did in our creation steps as you may now note.  So that wasn't it.

Some more digging and newer versions of OpenSSL also default to sha256 instead of sha1 (for decent reasons) as the MAC on the certificate.  Ok, so I check and yes, the CA cert here has sha1 and my generated certs are using sha256.  That could be causing a problem (though, honestly, you'd think each cert could be validated independently, regardless of actual format of the cert itself)
# openssl x509 -text -noout < rds-combined-ca-bundle.pem | grep sha
    Signature Algorithm: sha1WithRSAEncryption
    Signature Algorithm: sha1WithRSAEncryption
# openssl x509 -text -noout < client-cert.pem | grep sha
    Signature Algorithm: sha256WithRSAEncryption
    Signature Algorithm: sha256WithRSAEncryption

Trying again... and again...



Ok, I can fix this.  Redo the whole cert generation process with the additional option passed to OpenSSL each time '-sha1'.  Now my certs use the same old hashing algorithm sha1 as the CA and RDS's server certs.  Let's try again:

# mysql -ujim -ppass123 -h myinstance.myid.us-west2.rds.amazonaws.com --ssl-ca=rds-combined-ca-bundle.pem --ssl-cert=client-cert.pem --ssl-key=client-key.pem
ERROR 2026 (HY000): SSL connection error: protocol version mismatch

Son of a caterpillar!  What is going on here?  More googling.  Couple of bugs, nothing helpful.

At this point I try locally (see above working solution) and it works fine.  Huh.  Then it dawns on me - the one thing that's pretty different here between what I'm doing locally and against RDS is that I am generating the keys from the CA cert on the server (it's all done with the same CA cert locally).  With the RDS instance I don't have that option since I don't have access to the CA key.  Still, I should at least be able to generate a cert with a new CA and try locally.  So, I create a new CA key/cert pair different from the one used in the my.cnf file and use that to create and sign a new client key/cert pair.  Testing that locally, but which CA cert are you supposed to reference on the CLI?  Manual says it's the one for the server's cert.  Ok.

# mysql -ujim -ppass123 --ssl-ca=ca.pem --ssl-cert=client-cert2.pem --ssl-key=client-key2.pem
ERROR 2026 (HY000): SSL connection error: ASN: bad other signature confirmation

Well, that doesn't help.  What if I use the new CA cert?
# mysql -ujim -ppass123 --ssl-ca=ca2.pem --ssl-cert=client-cert2.pem --ssl-key=client-key2.pem
ERROR 2026 (HY000): SSL connection error: ASN: bad other signature confirmation

Alright.  This isn't working.  Maybe it's because I'm using OpenSSL to generate these certs and MySQL is using yaSSL (default in all community builds).  The Enterprise Edition of MySQL uses OpenSSL still for some reason (Still curious about that difference, Oracle...) so I install that for testing and try it.

mysql -ujim -ppass123 --ssl-ca=ca.pem --ssl-cert=client-cert2.pem --ssl-key=client-key2.pem
ERROR 2026 (HY000): SSL connection error: error:00000001:lib(0):func(0):reason(1)

Oh, now I see why yaSSL is nice.  OpenSSL based builds give even more cryptic errors.

More googling, even less useful stuff.

The "well duh" moment



What the manual forgets to mention but what is obvious when you stop to think about it as I was forced to do when I hit this wall and had to slow down and use my own head for a few minutes is this:  It needs to be the same CA for both server and client certs.  Yes, needs to be.  The whole point is trust.  MySQL allows you to specify an Issuer as a requirement which led me to initially assume that you could let any old client cert work regardless of origin.  However, as you've seen above you can actually input ANYTHING into those fields when generating a CA cert then sign stuff.  Regardless of actual validity you can spoof any Issuer line.  You can use any self-made CA cert to sign any self-generated client certificate as well thus spoofing any Subject line, too.  This becomes really kind of pointless when it comes to security.  As user 'bob' on the instance I can just do a SHOW GRANTS FOR 'jim'@'%'; command and see what to spoof in the Subject line if that's what is required.

Conclusion - a single CA certificate must be used to sign both the server's certificate AND the client's certificate or it will not work.

What this means for RDS.

   Since RDS comes with a certificate provided by AWS and it is not a setting that can be changed, in other words, you cannot upload your own certs) and everyone has the same CA cert it means that what you're verifying when you specify the RDS CA cert is that you're connecting to an RDS MySQL instance.  Since you don't have the CA key for AWS you cannot create the matching client certificates to use.  Generating your own does you no good.  In the end, if authentication is part of your goal for MySQL then you must use the passwords and you should really be using network level access control by means of VPC / EC2 Security Groups.

Moral of the story



   There are three real lessons I take away from this (with a bit of additional experience added in):
1 - If the RDS manual doesn't explicitly say you can do something with a MySQL feature you can probably assume that it either doesn't work or has some caveats that you'll come across eventually.
2 - Just because you can't enjoy all the security benefits of SSL doesn't mean you can't enjoy some of them.  If you like nicely encrypted connections then go for it.
3 - The most important lesson - think about what you're doing and why you're doing it.  Does it accomplish what you are asking?  Does what you're asking to do even make sense?  I fear too often that we paint ourselves into intractable corners because we skip this.  I know I'm guilty of it as I've shown with my headlong attempt to do something nobody was doing without first really thinking about why.

The point of this story, and indeed this entire blog, is learning.  If I had to learn it then someone else will have to learn it, too.  I hope this saves someone some frustration.

That said, if someone is able to make this work with two separate CA's for server and client and disprove my conclusion I am excited to hear about it.  Please let me know.  I am certainly not above being wrong.

Tuesday, September 30, 2014

JSON UDF's for MySQL in Labs - Take 2


About a year ago I wrote a little on the MySQL JSON UDF's from the MySQL Labs.  You can find the original article and testing steps here.

I decided with the most recent Labs release (0.3.2) that Sveta put out I would try this again and see how far we've come.  For reference, the work is documented also here on Sveta's blog - https://blogs.oracle.com/svetasmirnova/

I repeated my exact same steps to create the table and load it full of the most recent zips.json data straight from the MongoDB folks.  After running a handful of documents through a quick external validation via the JSONLint website (quick plug, I love this thing, especially when I have to edit IAM policies for AWS stuff) I set about loading them in again using the perl 1.5 liner I used last time.

To start with, the installation and documentation are much improved.  I was impressed with just how simple it was and how flushed out that README file is becoming, two things I had previously commented on that needed some small improvement.

Now, the fun part.

Data loaded in normally, cleanly, nicely using the same steps as before.

wget -q http://media.mongodb.org/zips.json

vim json_insert.pl

#!/usr/bin/perl
while (<>) {chomp; printf("INSERT INTO test.jtest (jdoc) VALUES(\'%s\');\n", $_);}


CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE `jtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `jdoc` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

cat zips.json | ./json_insert.pl | mysql

Ok, so far so good.  Now let's try the queries from last time again:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

Hmm, that looks familiar...  Ok, maybe we're still working with the more formal definition of JSON here and I need to quote both key and value again.  I'll check with that whole 'json_valid' UDF here:

mysql> SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       1 |    29353 |
+---------+----------+
1 row in set (0.42 sec)


Hmm, a few more zip codes than last time, but otherwise it claims the column contains nice, valid JSON documents.  Last time to check functionality I went ahead and replaced the geographical coordinates with just the word "blank" in quotes to satisfy the requirement of the fully quoted key and value pairs, also removing any list as a value.  This time, repeating the process nets me a familiar:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

So, that brings us right back where we started.  I haven't played too much with the other functions yet, I admit.  I keep getting stuck with this one, which at least to me and for my uses is the big one (if it works, obviously json_append(), json_replace(), and json_set() will be big ones to use for me)

Playing around a little more I am finding the same thing again here.  This is still promising and I still hope it works, but it is not working the way I would expect it to and it is not working in any of the other variations I have concocted so far.  I think I am next going to play with the built-in PostgreSQL functionality for JSON in 9.3 and see how it compares.  Stay tuned for that in the next few days, I hope.

In the meantime, Sveta - I love your work, I think I'm following the examples in the README just fine, but what am I missing here?  I'd love to show this working here.

Tuesday, July 22, 2014

Copying and Pasting SQL

Lesson learned today - Do not trust auto-formatting on websites / instant messaging apps / e-mail.

When copying and pasting queries be careful about auto-formatting.  Many editors want to replace an n-dash '-' with an m-dash '–' .  Some (or possibly most) SQL shells interpret these differently.  Often, only the first is synonymous with subtraction.

It took an embarrassing amount of time today for me to catch that one.  I hope someone is spared a few minutes by my wasted time.


Thursday, May 15, 2014

Older MySQL Poems

By unpopular demand, here are some of the ones from the Percona Live 2013 presentation.  I make no claims of quality:


deadlock detected
we rollback transaction two
err one two one three

- a MySQL Haiku


In the style of Edgar Allan Poe's "The Raven"...

Once upon a SQL query
While I joked with Apple's Siri
Formatting many a logical volume on my quad core
Suddenly there came an alert by email
as of some threshold starting to wail
wailing like my SMS tone
"Tis just Nagios", I muttered,
"Sending alerts unto my phone,
Only this - I might have known..."

Ah distinctly I remember
as I documented for each member
of the team just last November
in the wiki that we keep -
write and keep and nothing more...
When my query thus completed
Fourteen duplicate rows deleted
All my replicas then repeated
repeated the changes as before.
I dumped it all to a shared disk,
kept as a backup forever more.



There once was a small database program
It had InnoDB and MyISAM
One did transactions well,
and the other would crash like hell.
Between the two they used all of my RAM

- A MySQL Limerick -



Round and round the disk drive spins
but SSD sits still and grins.
It is randomly fast
for data current and past.
My database upgrade begins.



Flush local tables
Make and LVM snapshot
Backup with rsync

- A Haiku on simple backups -



Oracle is red,
IBM is blue,
I like stuff for free -
MySQL will do.



There was a dba who lived in a shoe,
He had so many tables he didn't know what to do
He optimized some and analyzed a few
then truncated them all and left for the day



And a throwback to Dr. Seuss...

Do you like MyISAM?
I do not like it, Sam-I-am.
I do not like MyISAM.

Would you use it here or there?
I would not use it here or there.
I would not use it anywhere.
I do not like MyISAM.
I do not like it, Sam-I-am.

Would you like it in an e-commerce site?
Would you like it with in the middle of the night?
I do not like it for an e-commerce site.
I do not like it in the middle of the night.
I would not use it here or there.
I would not use it anywhere.
I do not like MyISAM.
I do not like it Sam-I-am.

Would you could you for foreign keys?
Use it, use it, just use it please!
You may like it, you will see
Just convert these tables three…
Not for foreign keys, not for those tables three!
I will not use it, you let me be!

The DBA's Creed

Alright, it's time for another cheesy one, but here goes.  Feel free to modify it in the comments if you have better:



This is my database.  There are many like it, but this one is mine.
My database is my best friend.  It is my life.  I must master it as I master my life.
My database, without me, is useless.  Without my database, I am useless.
I must query my database well.  I must make queries respond faster than my customers who are trying to kill me.  I must tune the database before he shoots me. I will.
My database and I know what COUNT(*) means.  It's the JOIN's in the query, the temp tables on disk, or the IOPS used.  We know they should be sargable.  We will index properly...

My database is code, even as I, because it is my life.  Thus, I will learn it as a brother.  I will learn its weaknesses, its strength, its changelogs, its plugins, its views and its triggers.
I will keep my buffer_cache filled and ready, even and my dirty writes are flushed.  We will become part of each other.  We will...

Before Codd, I swear this creed.  My database and I are the defenders of my company's data.  We are the masters of our queries. Backups are the saviors of my life.

So be it, until there are no slow queries and all is normalized, and cached.

Friday, November 22, 2013

Data Poem Time!

(Inspired / modified from Robert Frost's "Nothing Gold Can Stay" and Elizabeth Barrett Browning's "How Do I Love Thee")

Nothing Old Can Stay (A poem about data archiving and purging)

Website's first orders are cold,
Its hardest rows to query.
The original hits were a joy;
But only until the second deploy.
Then buffer pages are not marked new,
As the data grew and grew,
Till the web traffic of today.
Nothing old can stay.


How Do I Join Thee

How do I join thee?  Let me count the rows.
I join thee to the maximum join size
My join buffer will permit in bytes
For after that it is written on disk,
Though performance is less brisk.
I join thee first on the left with sighs
I join thee then on the inner sides.
I join thee not only so I can join other tables
But to sum and aggregate over thy labels.
I join thee using thy reference column
I join thee in my reports so solemn.
Select, Delete, Update, or Insert:
I shall join thee until I cause an alert.

Tuesday, September 24, 2013

New JSON UDF's for MySQL in Labs



One of the more common complaints I hear from developers about MySQL when compared to other databases like PostgreSQL or MongoDB is a lack of decent JSON support.  Consequently, we usually end up storing JSON blobs in the database.

I know what you're thinking, "Why don't you extract the information and put it into a nice, normalized set of tables?" to which the response is usually "Because that takes more time and effort."

So, assuming you don't want to store your JSON in MongoDB where you can index it, query it in complex ways, and be totally web-scale what can you do?  Well, for a while now there have been JSON related functions in common-schema by Shlomi Noach.  There are both the 'extract_json_value' and 'json_to_xml' procedures that can be used.  While these do help a little they are, fundamentally, still just brute-forcing their way through these blobs of text.  There have also been a few attempts at UDF's for this over the years with the most extensive, in my opinion, being lib_mysqludf_json by Arnold Daniels.  There is some pretty good documentation on this and, admittedly, I have not used it extensively.

However, as I discovered quite by chance, we now have some additional UDF's from the fine folks at MySQL up since a few days ago on labs.mysql.com.  I have not heard any mention of them yet in any of the normal places (if they were mentioned at MySQL Connect this past weekend I did not hear about it).  In this article I'll talk a bit more about them.

To start with you have to download and install the UDF's which is a fairly straightforward process and instructions are given in the README inside the download, though the name of the library file "libmy_json.so" inside the README does not match the name of the provided library file itself "libmy_json_udf.so" so you have to account for that when creating the functions.  A small matter, sure, but odd that the meager documentation would get that part wrong.  However, it's in Labs, not GA yet so not a big deal.

It should be noted before moving on that we are still storing JSON in normal MySQL types (varchar, text, blob) rather than having a new data-type for this, which, in my opinion, is fine.  We basically have to do the same thing for the XML stored in there.

There are a number of functions provided:

  • json_valid - gives you a 1 or a 0 depending on the validity of the JSON document
  • json_search - Find stuff in your JSON blobs.
  • json_extract - Pull out individual sections / values based on keys
  • json_replace - just what it sounds like and, probably, the most exciting feature since you don't have to now return the whole blob to your application, do your JSON operation, then store the whole thing back again.
  • json_append - cool for similar reasons as above.
  • json_remove - again, similar to the above.
  • json_set - according to the documentation a sort of 'INSERT... ON DUPLICATE KEY UPDATE' function which checks and parses the JSON (but not the whole document in this version for some reason).
  • json_merge - What it sounds like, though I'm not sure you would use this on any sorts of writes, though maybe as a sort of aggregation method when querying the data.
  • json_contains_key - sort of like 'IFNULL'.


I have yet to try out all of them but a few of them are quite useful so far.  Just to show you a bit of how it works let's insert some stuff and query it.

First, let's create a table, I'm using test.jtest for this:

CREATE TABLE `jtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `jdoc` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Next, I need sample data.  I'm using the zips.json samples from the fine folks at MongoDB:

wget -q http://media.mongodb.org/zips.json

This contains a lot of documents of the following format:

{"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055, "state": "AL", "_id": "35004"}

Now, let's insert them (I know you can do a perl one-liner for this but I chose to write a quick 2 line script instead since I'll likely be reusing this a few times:

vim json_insert.pl
#!/usr/bin/perl
while (<>) {chomp; printf("INSERT INTO test.jtest (jdoc) VALUES(\'%s\');\n", $_);}

cat zips.json | ./json_insert.pl | mysql

We need to verify that our data made it:

SELECT COUNT(*) FROM test.jtest;
+----------+
| count(*) |
+----------+
|    29472 |
+----------+
1 row in set (0.01 sec)


Anyway, let's move on and query the data:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

Aw...  it doesn't work.  Why not?  It looks like valid JSON to my eye (maybe I'm just used to the MongoDB style)  but it isn't for the UDF's we're using.  The JSON UDF functions only operate on valid JSON documents.  See:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | count(*) |
+---------+----------+
|       0 |    29472 |
+---------+----------+
1 row in set (0.84 sec)

Ok, so, that's disappointing.  I can't just pull out of MongoDB and drop directly in there (yet).  After a lot of playing around it seems the working definition of valid JSON here requires key-value pairs to have both sides properly quoted (yes, that is most correct, but it is disappointing for numbers, coordinates, etc).  So how do we have fun here?  We need to butcher our JSON a bit.  In this case I'm taking out the coordinates for "loc" and quoting the numbers of the populations:

vim json_insert.pl
#!/usr/bin/perl
while ($line = <>) {
chomp $line ;
$line =~ s/\[.*\]/\"blank\"/g;
$line =~ s/pop\"\: (\d+)/pop\"\: \"$1\"/g;
printf("INSERT INTO test.jtest (jdoc) VALUES(\'%s\');\n", $line);
}


Now, let's reset and try again:

TRUNCATE TABLE test.jtest;

cat zips.json | ./json_insert.pl | mysql

Now let's check again:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       1 |    29470 |
+---------+----------+
1 row in set (0.11 sec)


Woohoo!  Can we do things now?

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;
+------------+-------+
| city       | pop   |
+------------+-------+
| ACMAR      | 6055  |
| ADAMSVILLE | 10616 |
| ADGER      | 3205  |
| KEYSTONE   | 14218 |
| NEW SITE   | 19942 |
| ALPINE     | 3062  |
| ARAB       | 13650 |
| BAILEYTON  | 1781  |
| BESSEMER   | 40549 |
| HUEYTOWN   | 39677 |
+------------+-------+
10 rows in set (0.00 sec)

We can!

Conclusion: I don't think we're quite there yet, but it is a definite start.