I don't have rigorous measurements to back this up, but here is what happened.
On a certain service, we added a full-text index to a field in a MyISAM table of about 200,000 records, using bigrams as the search tokens.
Until then, the collation for that field was utf8_general_ci (the default for the utf8 character set). We wanted searches to match Japanese text regardless of whether it was written in Katakana or Hiragana, so we changed the collation to utf8_unicode_ci.
Performance then degraded drastically, and the service stopped working altogether. When I checked with SHOW FULL PROCESSLIST;, I saw that the search queries were piling up without finishing.
So we reverted from utf8_unicode_ci to the original utf8_general_ci. To handle the Katakana/Hiragana variation, we decided instead to normalize the search data at insertion time.
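The normalization step could look something like the following. This is only a minimal sketch in Python, not the code we actually run: the function name normalize_for_search is made up for illustration, and it assumes that folding Katakana into Hiragana (plus NFKC to unify half-width Katakana) is enough for the search use case.

```python
import unicodedata

# Katakana U+30A1..U+30F6 sit exactly 0x60 above their Hiragana
# counterparts U+3041..U+3096, so a fixed offset maps one to the other.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def normalize_for_search(text: str) -> str:
    """Hypothetical helper: fold kana variations before indexing.

    NFKC first converts half-width Katakana to full-width,
    then every Katakana letter is mapped to Hiragana.
    The prolonged sound mark (U+30FC) is outside the mapped range
    and is left unchanged.
    """
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(KATAKANA_TO_HIRAGANA)
```

The same function would have to be applied to the user's search terms as well, so that the stored text and the query are compared in the same normalized form.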
The takeaway is that it's better to avoid utf8_unicode_ci, at least on a column with a full-text index like this one. In fact, for search-related features it's probably better to use Elasticsearch or CloudSearch than to rely on MySQL full-text indexing.