Independent Digital

Creating and managing dynamic websites using Php, MySql, HTML and Wordpress with a bit of PEAR on the side

Entries Comments



Email This Post    Print This Post   

MySQL character sets and collation

31 October, 2006 (14:23) | MySQL | By: clive





The following document serves to introduce character sets and collation which play an inportant part in MySQL databases. It would be to your advantage to know about these before creating your database so as to ensure that the data stored and retrieved is according to your expectations.

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. For example, each letter in the alphabet is a symbol which is then encoded. The letter A could be represented by the number 0. A is the symbol and 0 is its encoding. Together the symbols and their encodings are known as the character set.

We can compare the values of symbols in a string by collating their encodings. This is known as collation which is a set of rules. The simplest of all collations is the binary collation. In the above example where one encoding is collated (compared) to another using one collation rule namely, comparing the values of the symbols. This, the simplest of all possible collations is a binary collation. A case-insensitive collation is a little more complex in that it uses two collation rules, by first treating the upper and lower case symbols as equal and then by comparing the values of the symbols.

In reality most character sets have many characters, special symbols and punctuation marks, requiring the collations to have many rules. These include case insensitivity and accent insensitivity (an ?accent? is a mark attached to a character as in the German ?) and multiple-character mappings (such as the rule that ? = OE in one of the two German collations)

Western European character sets cover most West European languages and have latin1 as the default character set with latin1_swedish_ci as the default collation. This is the most common setting for MySQL users.

The single character set model (pre MySQL 4.1)

Here character set values are interpreted with respect to the server?s character set. This is the default character set selected when the server was built and is usually latin1. This can be overridden at runtime by using the default-character-set option.

The problem with this is that it limits the database to one character set at a time and can lead to index-related problems. The index values for the columns are stored in the sorted order according to the collating sequence of the character set in force at the time that the indexes were created. If the server?s character set is changed after having created the tables, the indexed-based queries might not work correctly as the character sets have different collating sequences. This is especially apparent with the accents (such as ?) which are ordered differently in the various character sets.

The solution is to rebuild the indexes for each existing table that has a character-based index, using the collating order of the new character set. There are 2 ways that you can do this:

  • First dump the table using mysqldump, then drop the table and finally reload it. The indexes will be rebuilt as the file is loaded.
  • Drop the indexes then add them again. You will need to know the exact index definitions so that you can re-create them.

Although there are methods to reorder the index, it is still a bother to have to do so. Fortunately this is all taken care of as of MySQL 4.1

The multi character-set model (MySQL 4.1 and above)

  • The MySQL server supports the use of multiple character sets simultaneously (more than 70 collations for more than 30 character sets). However, you cannot mix character sets within a string or use different character sets for different rows in a given column.
  • Character sets can have more than one collation.
  • Unicode support is provided giving the multilingual support that you may need.
  • Functions are available for converting and determining the character set.
  • The COLLATE operator alters the collation of a string and the COLLATION ( ) function returns the collation of a string.
  • The SHOW CHARACTER SET and SHOW COLLATION statements list the available character sets and collations.
  • The server automatically reorders indexes when you change the collation of an indexed character column.

 

There are default settings for character sets and collations at four levels:

  • server
  • database
  • table
  • column

1. Server Character Set and Collation

The server character set and collation are determined:

  • According to the option settings in effect when the server starts. You can use –default-character-set for the character set and –default-collation for the collation. If you don?t specify a character set, then the default is set to latin1. Similarly by not specifying a collation, the default is set to latin1_swedish_ci because this is the default collation for latin1.
  • According to the values set at runtime. The current character set and collation variables can be changed at runtime.

You can override the default character set and collation at server start-up by using character-set-server and collation -server options. If you specify only the character set, its default collation becomes the server?s default collation. If you specify a collation, it must be compatible with the character set. (A collation is compatible with a character set if its name begins with the character set name. For example, latin1_swedish_ci is compatible with latin1.

2. Database Character Set and Collation

Each database has a database character set and a database collation. The CREATE DATABASE and ALTER DATABASE statements have optional clauses for specifying the database character set and collation. For example:

CREATE DATABASE db_name DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

To change the database?s default character set and collation , use:
ALTER DATABASE db_name CHARACTER SET charset COLLATE collation;

Bear in mind that if both character set and collation were specified then these will be the current settings. If only the character set is defined, then the default collation for that character set will be allocated. If neither is specified then the server character set and collation are chosen.

It is possible to create databases with different character sets and collations on the same MySQL server. The database character set and collation are used as default values for tables if these are not specified when the table is created. The character set and collation for the default database are available as the values of the character_set_database and collation _database system variables. The server sets these variables whenever the default database changes. If there is no default database then the variables have the same value as the corresponding server-level variables, character_set_server and collation _server.

3. Table Character Set and Collation

Every table has a table character set and table collation. The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation. For example:

CREATE TABLE abc ( … ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

Bear in mind that if both character set and collation were specified then these will be the current settings. If only the character set is defined, then the default collation for that character set will be allocated. If neither is specified then the database character set and collation are used.

4. Column Character Set and Collation

Every ?character? column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation. Column definition syntax has optional clauses for specifying the column character set and collation. For example:

CREATE TABLE abc (column1 VARCHAR (5) CHARACTER SET latin1 COLLATE latin1_german1_ci);

Bear in mind that if both character set and collation were specified then these will be the current settings. If only the character set is defined, then the default collation for that character set will be allocated. If neither is specified then the table character set and collation are chosen.

5. Examples of Character Set and Collation Assignment

The following examples show how MySQL determines default character set and collation values.

1. Table and Column Definition

CREATE TABLE abc (column1 CHAR (10) CHARACTER SET latin1 COLLATE latin1_german1_ci) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;

Here the column has a latin1 character set and a latin1_german1_ci collation while the table has a latin2 character set and a latin2_bin collation.

2. Table and Column Definition

CREATE TABLE abc (column1 CHAR (10) CHARACTER SET latin1) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

Here we have a column with a latin1 character set and a default collation for this character set (which is always latin1_swedish_ci - note that it is not taken from the table level - latin1_danish_ci).

3. Table and Column Definition

CREATE TABLE abc (column1 CHAR (10)) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

Here we have a column with a default character set and a default collation (taken from the table level). In this case the character set for column abc is latin1 and its collation is latin1_danish_ci.

4. Database, Table, and Column Definition

CREATE DATABASE dbase1 DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;

USE dbase1;
CREATE TABLE abc
(
column1 CHAR (10)
);

Here we create a column (no character set and collation is specified at column or table level). The database’s settings become the table’s settings, and thereafter become the column’s setting. So, the character set for column1 is latin2 and its collation is latin2_czech_ci.

6. Connection Character Sets and Collations

Some character set and collation variables play a role in the connection between a client and the server. With queries and result sets being passed between client and server, it is important to know:

  • That the server takes the character_set_client variable to be the character set in which queries are sent by the client.
  • That the server converts queries sent by the client from character_set_client to character_set_connection, similarly with the collation (where collation _connection is important for the comparisons of literal strings (char, varchar and text)).
  • That the character_set_results variable determines the character set in which the server returns query results to the client

There are two statements that affect the connection character sets:

  1. SET NAMES ‘charset_name‘ ? tells the server that all incoming messages from the client are in ‘charset_name’ character set. This is also the character set that the server needs to return results back to the client in.
  2. SET CHARACTER SET ‘charset_name‘ - sets the connection character set and collation to be those of the default database.

On connection, the client indicates to the server the name of the character set that it wants to use. The server sets the character_set_client, character_set_results, and character_set_connection variables to that character set. Conversion may be lossy if there are characters that are not in both character sets. If you do not want the server to perform any conversion, set character_set_results to NULL:

mysql> SET character_set_results = NULL;

7. Character String Literal Character Set and Collation

Every character string literal (char, varchar and text) has a character set and a collation and may have an optional character set introducer and COLLATE clause:
[_charset_name]’string’ [COLLATE collation _name]

For example:

SELECT _latin1 ’string’ COLLATE latin1_danish_ci;

For the simple statement, SELECT ’string’, the string has the character set and collation defined by the character_set_connection and collation _connection system variables. The _charset_name expression is called an introducer. It notifies the parser that, ?the following string uses character set abc? An introducer is simply an indicator and does not do any conversion.

Bear in mind that if both character set and collation were specified then these will be the current settings. If only the character set is defined, then the default collation for that character set will be allocated. If neither is specified then the character set and collation given by the character_set_connection and collation _connection system variables are used.

8. Using COLLATE in SQL Statements

You can override whatever the default collation is by using the COLLATE clause. For example:

SELECT abc FROM table ORDER BY abc COLLATE latin1_german2_ci;

SELECT abc COLLATE latin1_german2_ci AS abc1 FROM table1 ORDER BY abc1;

SELECT abc FROM table1 GROUP BY abc COLLATE latin1_german2_ci;

SELECT MAX(abc COLLATE latin1_german2_ci) FROM table1;

SELECT DISTINCT abc COLLATE latin1_german2_ci FROM table1;

SELECT * FROM table1 WHERE _latin1 ‘M?ller’ COLLATE latin1_german2_ci = abc;

SELECT * FROM table1 WHERE abc LIKE _latin1 ‘M?ller’ COLLATE latin1_german2_ci;

SELECT abc FROM table1 GROUP BY abc HAVING abc = _latin1 ‘M?ller’ COLLATE latin1_german2_ci;

9. The BINARY Operator

Each character set has a binary collation. The BINARY operator is a shorter form of the COLLATE clause. For example, the binary collation for the latin1 character set is latin1_bin, so if column1 is of character set latin1, the following have the same effect:

SELECT * FROM table1 ORDER BY BINARY column1;
SELECT * FROM table1 ORDER BY column1 COLLATE latin1_bin;

10. Coercibility rules

When collating, it is sometimes difficult to see which collation takes precedence. Hence the following coercibility rules:

An explicit COLLATE clause = coercibility of 0. (Not coercible at all.)
Concatenating two strings with different collations = coercibility of 1.
Column’s collation = coercibility of 2.
A ?system constant? (the string returned by functions) = coercibility of 3.
A literal’s (char, varchar and text) collation = coercibility of 4.
NULL or an expression that is derived from NULL = coercibility of 5.

Use the collation with the lowest coercibility value. If both sides have the same coercibility and the collations are not the same, then there is an error.

An Example of the Effect of Collation

Suppose that column1 in table1 has these latin1 column values:

Muffler
M?ller
MX Systems
MySQL

And suppose that the column values are retrieved using the following statement:

SELECT column1 FROM table1 ORDER BY column1 COLLATE collation _name;

The resulting order of the values for different collations is shown in this table:

latin1_swedish_ci

latin1_german1_ci

latin1_german2_ci

Muffler

Muffler

M?ller

MX Systems

M?ller

Muffler

M?ller

MX Systems

MX Systems

MySQL

MySQL

MySQL

 

Here the U with two dots over it (?) is collated differently depending on the collation used. The first column shows the result of using the Swedish/Finnish collating rule (U-umlaut sorts with Y). The second column shows the result of using the German DIN-1 rule (U-umlaut sorts with U). The third column shows the result of using the German DIN-2 rule (U-umlaut sorts with UE).

Unicode support

There are many different character sets because of the different encodings required by the various languages. Different languages also require different numbers of bytes to represent a character. All characters in the latin1 character set can be represented by a single byte. Other languages may need more than one byte per character.

Unicode provides a single character-encoding system wherein character sets from all languages can be represented in a consistent manner. Within MySQL there are two Unicode sets:

  1. ucs2 Which corresponds to the Unicode UCS-2 encoding. Each character is represented by two bytes.
  2. utf8 Which has a variable length format, representing characters with from 1 to 3 bytes.

 

You can store text in about 650 languages using these character sets.

Please refer to the MySQL documentation for more detailed information.

Have a look at www.idig.za.net for interesting articles and tutorials on PEAR, WordPress, Search Engine Optimization and many more.

Bookmark and Share

Sorry, there are no related posts but check these out

Comments

Comment from Skaah
Time: May 22, 2007, 2:59 pm

Thanks for this interesting article. I found it quite strange that changing a table’s character set does nothing, you have to change all the columns manually. Quite annoying.

Comment from Roshan Shahare
Time: October 19, 2007, 11:45 am

I have faced the same problem. I have chnaged charset, collate of whole db using above query. query successfully worked. but i have to chnage columns manually.

Comment from werutzb
Time: October 8, 2008, 4:59 am

Hi!

I would like extend my SQL capabilities.
I red that many SQL resources and want to
get more about SQL for my work as mysql database manager.

What can you recommend?

Thanks,
Werutz