You should avoid using MySQL collation utf8_unicode_ci as it can be quite slow

MySQL
2016-09-14 18:44 (8 years ago) ytyng

This isn't a completely quantitative discussion, but...

We added a full-text index to a field in a MyISAM table on a certain service, using bigrams for search indexing. This was for about 200,000 records.

Until now, the collation for the character set of that field was utf8_general_ci (the default), but since we wanted to match both Katakana and Hiragana in Japanese, we decided to change the collation to utf8_unicode_ci.

However, the performance drastically worsened, and the service stopped working altogether. When I checked with SHOW FULL PROCESSLIST;, I saw that the search queries were getting stuck.

So, we reverted from utf8_unicode_ci back to the original setting. We decided to normalize the search data upon insertion to handle the variations between Katakana and Hiragana in Japanese.

The takeaway is that it's better to avoid utf8_unicode_ci. In fact, for search-related operations, it's probably better to use Elasticsearch or Cloudsearch instead of relying on MySQL with full-text indexing.

Currently unrated
The author runs the application development company Cyberneura.
We look forward to discussing your development needs.

Comments

Archive

2025
2024
2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011