MySQL 5.6 InnoDB Fulltext Round 2: Stopwords

A few months ago, in MySQL 5.6.4, I took a brief look at MySQL 5.6’s new InnoDB Fulltext support; see Initial impressions of InnoDB Fulltext. There have been quite a few new point releases since then, and we’re into RC releases of MySQL 5.6 now, so I thought I’d take another look. One of my major concerns in my first look was the security implication of the way any user could see fulltext index details for the table set in the global innodb_ft_aux_table
variable. Fortunately, this is «fixed», but not quite in the way I’d like. Now, a user needs the PROCESS
privilege to be able to read from theINFORMATION_SCHEMA
tables that may contain InnoDB data. So, a user needs SUPER
to be able to set innodb_ft_aux_table
and then PROCESS
to be able to read from the various tables in INFORMATION_SCHEMA
. The PROCESS
privilege «pertains to display of information about the threads executing within the server», according to the manual, so I’m not really seeing the relationship between this privilege and InnoDB Fulltext index data, but at least the security hole I identified originally no longer exists. Still, though, the requirement that a user have SUPER
and now PROCESS
privileges in order simply to see information about fulltext indexes on their own tables is very disappointing. A user with privileges to create tables with fulltext indexes cannot get any details about those indexes unless they are given not one but two global privileges completely unrelated to the analysis of InnoDB Fulltext index information. This is not a reasonable interface to this data nor a reasonable approach for making it available to users. I got caught up last time in wrapping my head around this INFORMATION_SCHEMA
functionality, so this time I wanted to take a look at some of the other functionality available with InnoDB Fulltext. I didn’t make it very far, though… One of the things that caught my attention this time around is all of the «auxiliary» tables that are required for InnoDB Fulltext. When you create a very simple InnoDB table with a single fulltext index, there are 8 additional auxiliary tables created, each with its own .ibd
file on the filesystem (if you have innodb_file_per_table
enabled, of course), and another 6 that don’t have an accompanying .ibd
.
mysql 5.6.9-rc (root) [test]> create table i1 (id int unsigned not null auto_increment primary key, c varchar(32), fulltext(c)); Query OK, 0 rows affected (0.06 sec) mysql 5.6.9-rc (root) [test]> select table_id, name from information_schema.innodb_sys_tables where name like concat('%FTS_',lpad(conv((select table_id from information_schema.innodb_sys_tables where name='test/i1'),10,16),16,'0'),'%'); +----------+----------------------------------------------------+ | table_id | name | +----------+----------------------------------------------------+ | 202 | test/FTS_00000000000000c2_0000000000000114_DOC_ID | | 203 | test/FTS_00000000000000c2_0000000000000114_INDEX_1 | | 204 | test/FTS_00000000000000c2_0000000000000114_INDEX_2 | | 205 | test/FTS_00000000000000c2_0000000000000114_INDEX_3 | | 206 | test/FTS_00000000000000c2_0000000000000114_INDEX_4 | | 207 | test/FTS_00000000000000c2_0000000000000114_INDEX_5 | | 208 | test/FTS_00000000000000c2_0000000000000114_INDEX_6 | | 195 | test/FTS_00000000000000c2_ADDED | | 198 | test/FTS_00000000000000c2_BEING_DELETED | | 199 | test/FTS_00000000000000c2_BEING_DELETED_CACHE | | 200 | test/FTS_00000000000000c2_CONFIG | | 196 | test/FTS_00000000000000c2_DELETED | | 197 | test/FTS_00000000000000c2_DELETED_CACHE | | 201 | test/FTS_00000000000000c2_STOPWORDS | +----------+----------------------------------------------------+ 14 rows in set (0.00 sec) kolbe@prosimmon 5.6 $ ls ./data/test/ FTS_00000000000000c2_0000000000000114_DOC_ID.ibd FTS_00000000000000c2_CONFIG.ibd i1.frm FTS_00000000000000c2_ADDED.ibd FTS_00000000000000c2_DELETED.ibd i1.ibd FTS_00000000000000c2_BEING_DELETED.ibd FTS_00000000000000c2_DELETED_CACHE.ibd i1_stopwords.frm FTS_00000000000000c2_BEING_DELETED_CACHE.ibd FTS_00000000000000c2_STOPWORDS.ibd i1_stopwords.ibd
There are some concerns of course with all of these extra things being created for each InnoDB table where you use fulltext. Additional file descriptors, additional InnoDB data dictionary overhead, et cetera. But even more strange is the fact that not all of these even appear to be used by InnoDB. For some reason I was particularly drawn to the «STOPWORDS» table (FTS_00000000000000c2_STOPWORDS.ibd
). There are two sources of stopwords for MySQL 5.6 InnoDB Fulltext: the default stopwords, viewable in INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD
(this is from a static array compiled into the server, apparently based on some minimally multilingual amalgam of some elements of «Google Stopwords»), and «user-defined stopwords», controlled with yet more system variables: innodb_ft_server_stopword_table
andinnodb_ft_user_stopword_table
. I mostly want to look atinnodb_ft_user_stopword_table
, because trying to control per-table behavior with MySQL system variables is pretty cumbersome and error-prone and I wanted to see how well this works for InnoDB Fulltext. If you are creating a new table and you want to use some custom set of stopwords, you would do so like this:
mysql 5.6.9-rc (root) [test]> create table i1_stopwords (value varchar(255)) engine=innodb; Query OK, 0 rows affected (0.02 sec) mysql 5.6.9-rc (root) [test]> insert into i1_stopwords values ('apple'),('banana'),('coconut'),('durian'); Query OK, 4 rows affected (0.00 sec) Records: 4 Duplicates: 0 Warnings: 0 mysql 5.6.9-rc (root) [test]> set innodb_ft_user_stopword_table='test/i1_stopwords'; Query OK, 0 rows affected (0.00 sec) mysql 5.6.9-rc (root) [test]> create table i1 (id int unsigned not null auto_increment primary key, c varchar(32), fulltext(c)); Query OK, 0 rows affected (0.06 sec) mysql 5.6.9-rc (root) [test]> insert into i1 (c) values ('a crabby apple'),('the banana bonanza'),('some crazy coconuts'),('one daring durian'); Query OK, 4 rows affected (0.01 sec) Records: 4 Duplicates: 0 Warnings: 0
We can use our old friends innodb_ft_aux_table
andINFORMATION_SCHEMA.INNODB_FT_TABLE_INDEX
/CACHE
to see what made it into our index.
mysql 5.6.9-rc (root) [test]> set global innodb_ft_aux_table='test/i1'; Query OK, 0 rows affected (0.00 sec) mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_index_cache union all select * from information_schema.innodb_ft_index_table; +----------+--------------+-------------+-----------+--------+----------+ | WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION | +----------+--------------+-------------+-----------+--------+----------+ | bonanza | 2 | 2 | 1 | 2 | 11 | | coconuts | 3 | 3 | 1 | 3 | 11 | | crabby | 1 | 1 | 1 | 1 | 2 | | crazy | 3 | 3 | 1 | 3 | 5 | | daring | 4 | 4 | 1 | 4 | 4 | | one | 4 | 4 | 1 | 4 | 0 | | some | 3 | 3 | 1 | 3 | 0 | | the | 2 | 2 | 1 | 2 | 0 | +----------+--------------+-------------+-----------+--------+----------+ 8 rows in set (0.01 sec)
We can also see the stopword table in INFORMATION_SCHEMA.INNODB_FT_CONFIG
:
mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_config; +---------------------------+-------------------+ | KEY | VALUE | +---------------------------+-------------------+ | optimize_checkpoint_limit | 180 | | synced_doc_id | 0 | | last_optimized_word | | | deleted_doc_count | 0 | | total_word_count | | | optimize_start_time | | | optimize_end_time | | | stopword_table_name | test/i1_stopwords | | use_stopword | 1 | | table_state | 0 | +---------------------------+-------------------+ 10 rows in set (0.00 sec)
Alright, it looks like our user-defined stopword table worked just fine. So, what’s the problem? What if you need to dump/reload this table? You have to manually verify whether or not the table uses a user-defined stopword table, which requires consultingINFORMATION_SCHEMA.INNODB_FT_CONFIG
, which requires setting innodb_ft_aux_table
, which requires the SUPER
privilege. You have to make sure you also dump/reload the stopword table, if you’re moving things to another machine, and you are responsible for manually setting innodb_ft_user_stopword_table
as appropriate when you import. But it’s even worse than that. OPTIMIZE TABLE
discards the stopword list and re-builds the entire fulltext index according to the current value ofinnodb_ft_user_stopword_table
.
mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_config; +---------------------------+-------------------+ | KEY | VALUE | +---------------------------+-------------------+ | optimize_checkpoint_limit | 180 | | synced_doc_id | 0 | | last_optimized_word | | | deleted_doc_count | 0 | | total_word_count | | | optimize_start_time | | | optimize_end_time | | | stopword_table_name | test/i1_stopwords | | use_stopword | 1 | | table_state | 0 | +---------------------------+-------------------+ 10 rows in set (0.00 sec) mysql 5.6.9-rc (root) [test]> optimize table i1; +---------+----------+----------+-------------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------+----------+----------+-------------------------------------------------------------------+ | test.i1 | optimize | note | Table does not support optimize, doing recreate + analyze instead | | test.i1 | optimize | status | OK | +---------+----------+----------+-------------------------------------------------------------------+ 2 rows in set (0.07 sec) mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_config; +---------------------------+-------+ | KEY | VALUE | +---------------------------+-------+ | optimize_checkpoint_limit | 180 | | synced_doc_id | 0 | | last_optimized_word | | | deleted_doc_count | 0 | | total_word_count | | | optimize_start_time | | | optimize_end_time | | | stopword_table_name | | | use_stopword | 1 | | table_state | 0 | +---------------------------+-------+ 10 rows in set (0.00 sec) mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_index_cache union all select * from information_schema.innodb_ft_index_table; +----------+--------------+-------------+-----------+--------+----------+ | WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION | +----------+--------------+-------------+-----------+--------+----------+ | apple | 1 | 1 | 1 | 1 | 9 | | banana | 2 | 2 | 1 | 2 | 4 | | bonanza | 2 | 2 | 1 | 2 | 11 | | coconuts | 3 | 3 | 1 | 3 | 11 | | crabby | 1 | 1 | 1 | 1 | 2 | | crazy | 3 | 3 | 1 | 3 | 5 | | daring | 4 | 4 | 1 | 4 | 4 | | durian | 4 | 4 | 1 | 4 | 11 | | one | 4 | 4 | 1 | 4 | 0 | | some | 3 | 3 | 1 | 3 | 0 | +----------+--------------+-------------+-----------+--------+----------+ 10 rows in set (0.00 sec)
A regular user (one without SUPER
or PROCESS
) cannot consultINFORMATION_SCHEMA.INNODB_FT_CONFIG
to see the current stopword_table_name
for the table and thus cannot safely optimize or dump/reload their own tables without external accounting (i.e. write down which stopword table(s) your tables use) if the table was created with a user-defined stopword table. But remember that «FTS_00000000000000c2_STOPWORDS
» auxiliary table we saw earlier? Well, that doesn’t appear to be used at all.
kolbe@prosimmon 5.6 $ strings -a ./data/test/i1_stopwords.ibd infimum supremum apple banana .coconut =durian kolbe@prosimmon 5.6 $ strings -a ./data/test/FTS_00000000000000c2_STOPWORDS.ibd P'=+ infimum supremum
Maybe there’s some planned future support for more visible and manageable stopwords, but for now the facility for these is quite cumbersome and problematic. Beyond these basic usability problems, there’s a bug that’s causing some memory related to the innodb_ft_user_stopword_table
variable to be corrupted. So far it’s made my server lock up and crash quite a few times while doing my simple testing. Here’s some of what I saw:
mysql 5.6.9-rc (root) [test]> set innodb_ft_user_stopword_table='test/no_stopwords'; ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql 5.6.9-rc (root) [test]> select @@innodb_ft_user_stopword_table; ERROR 2006 (HY000): MySQL server has gone away
mysql 5.6.9-rc (root) [test]> show variables like 'innodb_ft_user_stopword_table'; +-------------------------------+-------+ | Variable_name | Value | +-------------------------------+-------+ | innodb_ft_user_stopword_table | | +-------------------------------+-------+ 1 row in set, 1 warning (0.00 sec) mysql 5.6.9-rc (root) [test]> show warnings; +---------+------+--------------------------------------------------------------------------------------+ | Level | Code | Message | +---------+------+--------------------------------------------------------------------------------------+ | Warning | 1366 | Incorrect string value: 'x98xACx99x1AxE3' for column 'VARIABLE_VALUE' at row 1 | +---------+------+--------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
mysql 5.6.9-rc (root) [test]> select @@global.innodb_ft_user_stopword_table; +----------------------------------------+ | @@global.innodb_ft_user_stopword_table | +----------------------------------------+ | O?o? | +----------------------------------------+ 1 row in set (0.00 sec) mysql 5.6.9-rc (root) [test]> show variables like 'innodb_ft_user_stopword_table'; +-------------------------------+-------+ | Variable_name | Value | +-------------------------------+-------+ | innodb_ft_user_stopword_table | O | +-------------------------------+-------+ 1 row in set, 1 warning (0.00 sec) mysql 5.6.9-rc (root) [test]> show warnings; +---------+------+------------------------------------------------------------------------------+ | Level | Code | Message | +---------+------+------------------------------------------------------------------------------+ | Warning | 1366 | Incorrect string value: 'xB9oxFEx07' for column 'VARIABLE_VALUE' at row 1 | +---------+------+------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
I’ve filed bug #67960 and the folks at Oracle have already been very good about analyzing it. I trust it will be fixed before the next release of MySQL 5.6.
Post a Comment
Log into your MariaDB ID account to post a comment.