Friday, November 22, 2013

Data Poem Time!

(Inspired by / modified from Robert Frost's "Nothing Gold Can Stay" and Elizabeth Barrett Browning's "How Do I Love Thee")

Nothing Old Can Stay (A poem about data archiving and purging)

Website's first orders are cold,
Its hardest rows to query.
The original hits were a joy;
But only until the second deploy.
Then buffer pages are not marked new,
As the data grew and grew,
Till the web traffic of today.
Nothing old can stay.


How Do I Join Thee

How do I join thee?  Let me count the rows.
I join thee to the maximum join size
My join buffer will permit in bytes
For after that it is written on disk,
Though performance is less brisk.
I join thee first on the left with sighs
I join thee then on the inner sides.
I join thee not only so I can join other tables
But to sum and aggregate over thy labels.
I join thee using thy reference column
I join thee in my reports so solemn.
Select, Delete, Update, or Insert:
I shall join thee until I cause an alert.

Tuesday, September 24, 2013

New JSON UDFs for MySQL in Labs



One of the more common complaints I hear from developers about MySQL when compared to other databases like PostgreSQL or MongoDB is a lack of decent JSON support.  Consequently, we usually end up storing JSON blobs in the database.

I know what you're thinking, "Why don't you extract the information and put it into a nice, normalized set of tables?" to which the response is usually "Because that takes more time and effort."

So, assuming you don't want to store your JSON in MongoDB, where you can index it, query it in complex ways, and be totally web-scale, what can you do?  Well, for a while now there have been JSON-related routines in common_schema by Shlomi Noach: both the 'extract_json_value' and 'json_to_xml' procedures can be used.  While these do help a little, they are, fundamentally, still just brute-forcing their way through these blobs of text.  There have also been a few attempts at UDFs for this over the years, with the most extensive, in my opinion, being lib_mysqludf_json by Arnold Daniels.  There is some pretty good documentation for it and, admittedly, I have not used it extensively.

However, as I discovered quite by chance, the fine folks at MySQL put some additional UDFs up on labs.mysql.com a few days ago.  I have not heard any mention of them yet in any of the usual places (if they were mentioned at MySQL Connect this past weekend, I did not hear about it).  In this article I'll talk a bit more about them.

To start with, you have to download and install the UDFs, which is a fairly straightforward process; instructions are given in the README inside the download.  Note, though, that the library file name given in the README ("libmy_json.so") does not match the name of the provided library file itself ("libmy_json_udf.so"), so you have to account for that when creating the functions.  A small matter, sure, but odd that the meager documentation would get that part wrong.  However, it's in Labs, not GA yet, so not a big deal.
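
For reference, creating the functions looks roughly like this (a minimal sketch using the library name as actually shipped; the RETURNS types are my assumption based on what each function gives back):

CREATE FUNCTION json_valid RETURNS integer SONAME 'libmy_json_udf.so';
CREATE FUNCTION json_extract RETURNS string SONAME 'libmy_json_udf.so';
-- ...and so on for each of the functions listed below.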

It should be noted before moving on that we are still storing JSON in normal MySQL types (varchar, text, blob) rather than having a new data type for it, which, in my opinion, is fine.  We basically have to do the same thing for XML stored in the database.
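
(For comparison, querying XML stored in a text column leans on the built-in ExtractValue() function, which brute-forces its way through the text in much the same fashion; the table and column names here are hypothetical:)

SELECT ExtractValue(xdoc, '/zip/city') AS city FROM test.xtest LIMIT 10;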

There are a number of functions provided:

  • json_valid - Returns 1 or 0 depending on the validity of the JSON document.
  • json_search - Finds things in your JSON blobs.
  • json_extract - Pulls out individual sections / values based on keys.
  • json_replace - Just what it sounds like and probably the most exciting feature, since you no longer have to return the whole blob to your application, do your JSON operation, then store the whole thing back again (see the sketch after this list).
  • json_append - Cool for similar reasons as above.
  • json_remove - Again, similar to the above.
  • json_set - According to the documentation, a sort of 'INSERT... ON DUPLICATE KEY UPDATE' function which checks and parses the JSON (though not the whole document in this version, for some reason).
  • json_merge - What it sounds like, though I'm not sure you would use it on writes; maybe as a sort of aggregation method when querying the data.
  • json_contains_key - Sort of like 'IFNULL'.
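
To illustrate, something like this should be possible with json_replace (a hypothetical sketch; the exact argument and quoting semantics are my assumptions from the documentation, as I have not exercised this function yet):

UPDATE test.jtest
   SET jdoc = json_replace(jdoc, "pop", "6056")    -- change one value in place
 WHERE json_extract(jdoc, "_id") = "35004";        -- no round-trip through the application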


I have yet to try out all of them, but a few are quite useful so far.  Just to show you a bit of how it works, let's insert some stuff and query it.

First, let's create a table, I'm using test.jtest for this:

CREATE TABLE `jtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `jdoc` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Next, I need sample data.  I'm using the zips.json samples from the fine folks at MongoDB:

wget -q http://media.mongodb.org/zips.json

This contains a lot of documents of the following format:

{"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055, "state": "AL", "_id": "35004"}

Now, let's insert them (I know you can do a perl one-liner for this but I chose to write a quick two-line script instead since I'll likely be reusing it a few times):

vim json_insert.pl
#!/usr/bin/perl
# Read one JSON document per line and emit an INSERT statement for each.
# Note: this assumes the documents contain no single quotes to escape.
while (<>) { chomp; printf("INSERT INTO test.jtest (jdoc) VALUES('%s');\n", $_); }

cat zips.json | ./json_insert.pl | mysql

We need to verify that our data made it:

SELECT COUNT(*) FROM test.jtest;
+----------+
| count(*) |
+----------+
|    29472 |
+----------+
1 row in set (0.01 sec)


Anyway, let's move on and query the data:

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;

+------+------+
| city | pop  |
+------+------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+------+------+
10 rows in set (0.00 sec)

Aw...  it doesn't work.  Why not?  It looks like valid JSON to my eye (maybe I'm just used to the MongoDB style), but it isn't valid to the UDFs we're using, and the JSON UDFs only operate on valid JSON documents.  See:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | count(*) |
+---------+----------+
|       0 |    29472 |
+---------+----------+
1 row in set (0.84 sec)

Ok, so, that's disappointing.  I can't just pull documents out of MongoDB and drop them directly in there (yet).  After a lot of playing around, it seems the working definition of valid JSON here requires both sides of each key-value pair to be quoted (which is actually stricter than the JSON spec, since numbers and arrays are legal values without quotes, and it is disappointing for numbers, coordinates, etc.).  So how do we have fun here?  We need to butcher our JSON a bit.  In this case I'm taking out the coordinates for "loc" and quoting the population numbers:

vim json_insert.pl
#!/usr/bin/perl
# Same idea as before, but first massage each document so json_valid accepts it.
while ($line = <>) {
    chomp $line;
    $line =~ s/\[.*\]/"blank"/g;            # replace the "loc" coordinate array with a quoted placeholder
    $line =~ s/pop": (\d+)/pop": "$1"/g;    # quote the numeric "pop" values
    printf("INSERT INTO test.jtest (jdoc) VALUES('%s');\n", $line);
}


Now, let's reset and try again:

TRUNCATE TABLE test.jtest;

cat zips.json | ./json_insert.pl | mysql

Now let's check again:

SELECT json_valid(jdoc) isValid, COUNT(*) FROM jtest GROUP BY isValid;
+---------+----------+
| isValid | COUNT(*) |
+---------+----------+
|       1 |    29470 |
+---------+----------+
1 row in set (0.11 sec)


Woohoo!  Can we do things now?

SELECT json_extract(jdoc, "city") city, json_extract(jdoc, "pop") pop FROM test.jtest LIMIT 10;
+------------+-------+
| city       | pop   |
+------------+-------+
| ACMAR      | 6055  |
| ADAMSVILLE | 10616 |
| ADGER      | 3205  |
| KEYSTONE   | 14218 |
| NEW SITE   | 19942 |
| ALPINE     | 3062  |
| ARAB       | 13650 |
| BAILEYTON  | 1781  |
| BESSEMER   | 40549 |
| HUEYTOWN   | 39677 |
+------------+-------+
10 rows in set (0.00 sec)

We can!

Conclusion: I don't think we're quite there yet, but it is a definite start.


Thursday, August 1, 2013

Random Cool Stuff

It's been a while without activity here but I have not been entirely idle.  I have spent rather a lot of time working lately (entirely too much).  Sometimes I find myself reading on various topics or stumbling across interesting bits as I attempt to research answers to things.  I figured I'd share a few of the interesting reads, projects, or ideas I've come across lately.

http://www.php.net/manual/en/intro.mysqlnd-ms.php - Definitely has potential and makes me excited, but it still needs somewhat more intelligent read/write splitting abilities.

http://mysqldba.blogspot.com/2012/02/asyncronous-shard-queries-in-php-using.html - some bits around asynchronous reads from a sharded set of databases.

http://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html - Switching some tables from MyISAM to InnoDB, I have had to field a few questions from perplexed developers about their auto-increment values.  Sometimes there are massive gaps now.  Why?  In part, they are doing a lot of 'INSERT IGNORE' queries, which reserve an auto-inc value (unless you specify one manually) in the event the row can be inserted.  You don't get those values back if the result is "ignore".  You can influence this a bit with the 'innodb_autoinc_lock_mode' variable, but in short, you should not rely on an auto-inc column being gap-free.
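
For instance, a minimal sketch of the gap behavior (table and values are hypothetical, and the exact allocation depends on your innodb_autoinc_lock_mode setting):

CREATE TABLE t (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  u  VARCHAR(10) NOT NULL UNIQUE
) ENGINE=InnoDB;

INSERT IGNORE INTO t (u) VALUES ('a');  -- inserted with id 1
INSERT IGNORE INTO t (u) VALUES ('a');  -- ignored as a duplicate, but an auto-inc value is still consumed
INSERT IGNORE INTO t (u) VALUES ('b');  -- inserted with id 3, leaving a gap at 2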

http://www.percona.com/doc/percona-server/5.5/reliability/crash_resistant_replication.html - This one is fun.  I have not benchmarked the cost of enabling this yet but I have experienced the problems it is trying to solve.  Note that MySQL 5.6 also fixes this problem and the Percona 5.6 option is replaced by the Oracle MySQL solution.

Since it doesn't come up often, I sometimes have to refer folks to this pretty good explanation of, and solution to, the many-to-many relationship problem, which cannot be represented directly in a normal RDBMS - http://www.tonymarston.net/php-mysql/many-to-many.html .  Good read.
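
For reference, a minimal sketch of the standard junction-table solution (all names hypothetical), which turns one many-to-many into two one-to-many relationships:

CREATE TABLE student (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE class (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE student_class (
  student_id INT UNSIGNED NOT NULL,
  class_id   INT UNSIGNED NOT NULL,
  PRIMARY KEY (student_id, class_id),  -- one row per student/class pairing
  FOREIGN KEY (student_id) REFERENCES student (id),
  FOREIGN KEY (class_id)   REFERENCES class (id)
) ENGINE=InnoDB;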

I have been conned into supporting PostgreSQL as well, lately.  There is a bit to get used to, like permissions.  Permissions are wildly different and, I dare say, not well documented in the manual or in many other places I found on the wide internet.  Remember to grant usage on the schema itself when needed, and remember to check for expired accounts.  http://stackoverflow.com/questions/6799224/pgsql-grant-every-single-right-to-a-user-on-a-schema
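
The statements that usually trip people up look something like this (a sketch; the schema, user, and privilege list are hypothetical):

GRANT USAGE ON SCHEMA myschema TO myuser;   -- without this, nothing in the schema is reachable
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA myschema TO myuser;
ALTER USER myuser VALID UNTIL 'infinity';   -- clears an expired account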

http://www.tokutek.com/2011/07/indexing-the-director’s-cut/ - Not really a read so much as an excellent presentation.  Make some popcorn, put it on the big screen, enjoy.

http://www.npr.org/blogs/health/2013/07/18/203298244/worlds-biggest-virus-may-have-ancient-roots - Not related at all to databases, but still crazy cool.

http://www.oracle.com/us/corporate/press/1967380 - Yeah, it's Oracle, not MySQL, but hey, they are doing some really nifty things, too.  Pluggable (portable) databases, data heat-maps for auto-tiering storage, compression policies, detailed heuristics for index stats, and MapReduce features.  I do always wish MySQL would do index stats more intelligently.  For example, just because I have three possible values for a column doesn't mean 33% of the rows have value 'a'.

Did I mention I passed the first part of the Oracle OCA exam?  Yep, I now know the basics of Oracle SQL.  It is, in a word, very different from MySQL's SQL.  So are many of the principles, like aliasing rules and whether or not you should use subqueries (obviously the optimizer is a bit different; you can nest an "unlimited" number of subqueries in some parts of the query, and up to 256 deep in the WHERE clause if you're crazy enough).  Next step: the actual Administration part.  That is a lot of material for someone who hardly ever uses it, but soon...

In case you have a lot to delete out of a very large table (let's just assume you "sharded it", or don't care and are trying to clean up old garbage), you may notice that a simple "DELETE FROM tbl WHERE col1='value';" is a bit too slow to be done in production, particularly with InnoDB if col1 isn't the primary key.  Ideally, you want to do small, incremental deletes by primary key reference if possible, to avoid hosing everything in the process.  Here is a nice explanation of how to do it: http://mysql.rjweb.org/doc.php/deletebig
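
The simplest chunked variant looks something like this (a sketch; the table, column, and chunk size are hypothetical):

-- Repeat until ROW_COUNT() reports 0 rows affected, committing between chunks:
DELETE FROM tbl
 WHERE col1 = 'value'
 ORDER BY id
 LIMIT 1000;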

Other neat things:  https://code.google.com/p/flexviews/ and https://sdm.lbl.gov/fastbit/ and http://www.arubin.org/blog/2012/07/16/mysql-pam-ldap-authentication-module-configuration/ and https://github.com/mcafee/mysql-audit

I'll leave you searching for the reference with the following table, keeping in mind that one must never attempt to normalize Lewis Carroll:

CREATE TABLE OtherThings (
   Speaker varchar(100) not null default "Walrus",
   Listener varchar(100) not null default "Carpenter",
   Topic enum('shoes', 'ships', 'sealing wax', 'cabbages', 'kings'),
   SeaIsHost boolean,
   PigsHaveWings boolean
) ENGINE=BLACKHOLE;


I'll try to pick a more particular topic for the next post...

Monday, June 3, 2013

MySQL Integer Types and Sizes

Just a quick note on integer type sizing in MySQL since I am coming across some very interesting column definitions in some odd corners of my inherited database schemas.

First, a cute picture to soften everybody up:



We'll start with a question - What is the difference between a column of type INT(8) and a column of type INT(11), or even INT(64)?

Think about it for a second…  Think you know the answer?  Scroll down to find out!


The max unsigned value for an INT column is 2^32 - 1 regardless of what sits in the parentheses.  That is 4294967295.  You may notice that number is 10 characters long (incidentally, the same as for signed).

Now, let's create a quick table and see what happens with it:

CREATE TABLE `myints` (
  `a` int(4) unsigned NOT NULL,
  `b` int(8) unsigned NOT NULL,
  `c` int(10) unsigned NOT NULL,
  `d` int(16) unsigned NOT NULL,
  `e` int(32) unsigned NOT NULL,
  `f` int(64) unsigned NOT NULL,
  PRIMARY KEY (`a`)
);

Let's try a nice 4 digit number:
insert into myints values (1024, 1024, 1024, 1024, 1024, 1024);
Query OK, 1 row affected (0.00 sec)

Let's try another one with 8 digits:
insert into myints values (10242048, 10242048, 10242048, 10242048, 10242048, 10242048);
Query OK, 1 row affected (0.00 sec)

Let's try again with 12 digits:
insert into myints values (102420484096, 102420484096, 102420484096, 102420484096, 102420484096, 102420484096);
ERROR 1264 (22003): Out of range value for column 'a' at row 1

Aw, failure…  Yes, those are bigger than the max value of a 32-bit integer.  Let's back off a bit; maybe just column 'a' didn't like it since it's an int(4), after all:

insert into myints values (4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295);
Query OK, 1 row affected (0.00 sec)

The max value works.  Now let's try incrementing just the 'f' column (remember, we said int(64) there) and see:
insert into myints values (4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967296);
ERROR 1264 (22003): Out of range value for column 'f' at row 1

So, the number in the parentheses doesn't seem to limit what values you can insert.  What about when you select values out?

mysql> select * from myints;
+------------+------------+------------+------------+------------+------------+
| a          | b          | c          | d          | e          | f          |
+------------+------------+------------+------------+------------+------------+
|       1024 |       1024 |       1024 |       1024 |       1024 |       1024 |
|   10242048 |   10242048 |   10242048 |   10242048 |   10242048 |   10242048 |
| 4294967295 | 4294967295 | 4294967295 | 4294967295 | 4294967295 | 4294967295 |
+------------+------------+------------+------------+------------+------------+

Nope.  Not there either.


So, in the words of Yzma from The Emperor's New Groove: "Kronk!  Why do we even have that lever?!?"


It has one and only one particular use: formatting, when you enable the ZEROFILL option on the column.  However, a note from the manual:

"If you specify ZEROFILL for a numeric column, MySQL automatically adds the UNSIGNED attribute to the column."

Now, a demonstration.

mysql> alter table myints modify column a int(4) unsigned zerofill not null;
mysql> alter table myints modify column b int(8) unsigned zerofill not null;
mysql> alter table myints modify column c int(10) unsigned zerofill not null;
mysql> alter table myints modify column d int(16) unsigned zerofill not null;

We'll leave 'e' and 'f' alone for now for the sake of sane formatting.

mysql> select * from myints;
+------------+------------+------------+------------------+------------+------------+
| a          | b          | c          | d                | e          | f          |
+------------+------------+------------+------------------+------------+------------+
|       1024 |   00001024 | 0000001024 | 0000000000001024 |       1024 |       1024 |
|       2048 |   00002048 | 0000002048 | 0000000000002048 |       2048 |       2048 |
|   10242048 |   10242048 | 0010242048 | 0000000010242048 |   10242048 |   10242048 |
| 4294967295 | 4294967295 | 4294967295 | 0000004294967295 | 4294967295 | 4294967295 |
+------------+------------+------------+------------------+------------+------------+


That's what ZEROFILL does, and that's the only actual use for the number inside the parentheses with an integer type in MySQL.  Without ZEROFILL it is effectively meaningless.  You can either leave it off or set it to whatever you like.
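
Put another way, these definitions all store exactly the same range of values; only the display width available to ZEROFILL differs:

CREATE TABLE widths (
  a INT UNSIGNED NOT NULL,      -- MySQL fills in the default display width: int(10) unsigned
  b INT(4) UNSIGNED NOT NULL,   -- same storage, same range
  c INT(64) UNSIGNED NOT NULL   -- still the same storage and range
) ENGINE=InnoDB;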

Please feel free to read more details in the documentation.


So, did you guess right?  If so the next picture is for you:



Monday, April 29, 2013

Reflections on the Percona Live Conference

I think I had more fun at Percona Live this year than last year. In part, I mostly knew what to expect, but even more, I was able to meet up with many folks from both last year and previous jobs. MySQL definitely hasn't gone away for them (although one company still has the database detractors who believe PostgreSQL is the best thing since Y Combinator, and I honestly do hope they make heavy use of it for at least a few apps and let me know what their ultimate conclusion is after living with it for a year or so).

I definitely enjoyed the tutorials this year and spent the day doing the Percona XtraDB Cluster training with Jay Janssen from Percona. This was the right kind of format for me and a good pace. I did forget to bring my laptop to that one and ended up building out a three-node cluster on a single remote virtual server from my iPad. Overall it was harder, to be sure, but the thrill of setting it up from scratch and keeping up was fun.

Just a few more notable thoughts from the rest of the time...

The Oracle-hosted MySQL Community Reception (aka the 5.6 very-post-launch party) was a bit lacking in structure, bumpin' tunes, free-verse database poetry, or even social interaction. However, the ice dolphin was a pretty nice touch.

Tuesday morning saw a great keynote by Tomas Ulin of Oracle, who did an excellent job of showing why Oracle is bringing a maturity and stability to MySQL which it had not previously seen. I definitely applaud what they are doing there (yes, we all have bugs we wish they'd get around to fixing, but I think that is a different story, personally). To follow up on that, he led an important Birds of a Feather session that evening on 5.6 upgrade experiences and heard many pain points, gripes, and questions. I was quite impressed that he brought a number of key engineers with him and easily took control of the situation, fielding hard questions and then turning to his guys and asking what was being done to address the particular issue, when they expected to have fixes, what the difficulties really were, and what we could all expect. While I'm sure it was less fun to be one of those engineers for that hour, this was a major win for transparency and demonstrated a serious commitment to improvement. My hat really goes off to Oracle for trying hard with us as a community (yes, we beggars who use free software are also very choosy). It is also clear that they are basically just nerds like the rest of us. :)

This all stood in stark contrast to the session on MariaDB 10.0 with Monty, which went into some additional detail about their methodology and goals, although they seemed to be discovering much of it as they went. It is clear that the two camps are diverging at greater rates in most conceivable ways.

Wednesday I attended a talk by Paul Vallee entitled "Mission Critical, not Mission Impossible". This may have been one of the more inspiring talks at the conference, on just how much room for improvement there is in the execution portion of IT these days. An eye-opener and, hopefully, a life-changer. I would really like to hear that one again.

Thursday saw some great moments from the Facebook guys on how they manage to run efficiently at scale, and also from Jeremy Cole and Davi Arnaut on InnoDB file structures. A lot of hard work went into that one, and it is evident that the Facebook team has done an amazing job of architecting a solid setup that most of us will only really dream about.

At any rate, these are my favorites, the items that really stuck out. Add to that the open-sourcing of the TokuDB engine and the demonstration from Robert Hodges of Continuent of how easy things can / should be, and this really was a great place to spend the week.

However... as a nerd who likes his food, I really have to take a quick jab at the Hyatt catering team. I do not understand what you folks were doing or thinking with those menus and the dishes served. Sometimes they were ok; mostly they elicited a desire to stoop to eating Taco Bell, and if there had been one nearby I definitely would have done so. There is normal food there as well. I saw it for the other conferences being held in other parts of the center, just not for us. Oh well. Perhaps I will just brown-bag it next year.


Tuesday, April 23, 2013

Linux Filesystems and MySQL

Percona Live 2013 is underway in Santa Clara, CA.  I presented tonight on Linux Filesystems and MySQL.

I think, overall, it went well, though I may have underestimated the technical nature of my audience a little, judging by their questions.  Admittedly, it was geared toward the somewhat newer DBA / System Administrator who has been pressed into MySQL responsibilities.  As a result, I chose to cover a lot of theory, which I think made the first half of the presentation a bit "boring" for those folks who already have the basics of SAS vs SATA, IO scheduler differences, and filesystem differences down.  This is, of course, a tribute to the rather bright nature of most attendees at this conference.  The caliber is, on average, a good bit higher than at other technical conferences I've been to in the past.

As part of it, I decided to lighten the mood a bit with some bad poetry sprinkled throughout.  In a few days Percona will upload the slides and I'll update this with the link.

I will do a more detailed review of the conference at the end, but so far everything is good except the food the hotel is serving.

[Update] Slides for the talk can be found here.

Wednesday, April 10, 2013

Percona Live 2013 is coming soon

I will be speaking at Percona Live 2013 later this month in Santa Clara, CA.  My session is on the evening of day 1 and the title is "Linux Filesystems and MySQL".  It has been all kinds of fun doing the research for this, and I've learned more about benchmarking than I had guessed I would as I attempted to generate the numbers for it.

I hope to see a lot of people there, some of them from last year, and I hope to meet quite a few more.

Ammon

Tuesday, April 9, 2013

MySQL Duplicate Table Aliases

So it turns out that MySQL will, in certain circumstances, allow duplicate table aliases to be used, e.g.:

SELECT
  t1.a,
  t2.b
FROM
  test1.t1 AS tbl
  JOIN test2.t2 AS tbl
WHERE
  t1.a = 5
  AND t2.b = 'foo';

If you look at what the parser sees this as (using a 'show warnings;' immediately after an 'explain extended'), you will see that it actually replaces those aliases, so it kind of works.  However, this is still ugly and makes me cringe a little.  I would never have thought it would be allowed until I came across it in a live query yesterday.
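
To see the rewrite for yourself, something like this works (a sketch against the hypothetical test1.t1 / test2.t2 tables from above):

EXPLAIN EXTENDED
SELECT t1.a, t2.b
FROM test1.t1 AS tbl JOIN test2.t2 AS tbl
WHERE t1.a = 5 AND t2.b = 'foo';

SHOW WARNINGS;  -- the rewritten query, as the parser actually sees it, shows up here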

Ultimately, it may be documented in some dark corner as a feature, but I filed a bug report on it anyway.

So the rules appear to be these:
1 - The two tables must be in different databases.
2 - There may not be any duplicate column names referenced in any part of the query.


Ammon

Saturday, April 6, 2013

Shamallû

For those who have not had the privilege of poring through ancient Babylonian texts, I need to explain the title of this blog for a moment.  Shamallû is the Akkadian (Babylonian and Assyrian) word for "student" or "pupil", typically with regard to a scribe in training.  While the word is Akkadian, the written form most often goes back to the older Sumerian version, being LÚ.SHAB.TUR.

I chose this as my title because I am still learning MySQL.  While I have been using it for years and was certified in it several years back, I find that with every new setup and every new challenge I have much more to learn, and there are those who are far more knowledgeable than I am.  I hope this blog will be a place where I can share some of the random knowledge I have collected for weird problems, observations, or testing, especially the things I could not easily find on Google.  Perhaps others who know more can continue to instruct or correct me in comments or feedback as I go, and I will continue to learn from the wealth of experience among my peers.

Also, sometimes I like to have fun with MySQL, languages, and food.  Some mixing of those subjects may appear, for which I make no apologies.

Ammon