Last updated: August 21, 2025
Anyone working with databases will encounter collation issues at some point, especially when dealing with multilingual data or international applications. Collation affects how strings are compared and ordered, which is critical for accurate data retrieval and presentation.
In this tutorial, we’ll explore the concept of collation in SQL databases. We’ll also pay special attention to UTF-8 character encoding (charset), because the interplay between charsets and collations is critical for internationalized database applications.
We’ll focus on MySQL, PostgreSQL, and MSSQL.
Collation in SQL databases defines the set of rules for comparing and sorting character data. It determines how characters are evaluated, taking into account factors such as case sensitivity (e.g., A vs. a), accent sensitivity (e.g., e vs. é), and locale-specific sorting rules (e.g., ß vs. ss in German).
Here are the main differences in how MySQL, PostgreSQL, and MSSQL handle collation: MySQL ties collations to character sets and lets us change them at the database, table, or column level after creation; PostgreSQL derives its collation behavior from operating-system locales chosen when a database is created, and changing them afterward is non-trivial; MSSQL encodes both the character set and the comparison rules in the collation name and applies collations from the server level down to individual expressions.
Understanding these differences is important for database administrators and developers to ensure consistent behavior across different database systems.
We added names with accents, diacritics, and non-English characters to the sample data from the Student table in our University database. Then, to illustrate the different behavior of SQL queries based on collation, we inserted the same names with ASCII-only equivalents:
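For illustration, here is a minimal sketch of what such inserts look like, using two of the name pairs that appear in the queries later in this tutorial (the column list is assumed; the actual script may insert more columns and rows):

```sql
-- Illustrative subset of addInternationalStudents.sql: each accented name
-- is paired with an ASCII-only equivalent
INSERT INTO Student (name) VALUES
  ('José Garcia'),
  ('Jose Garcia'),
  ('Zoë Johnson'),
  ('Zoe Johnson');
```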
The query for adding these names is in the file addInternationalStudents.sql. Before executing it, however, we need to make sure that our database uses UTF-8, as it’s a requirement to store multilingual content.
Let’s run the following query in MySQL to check the charset of the currently selected database:
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = DATABASE();
+-------------+----------------------------+
| SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME |
+-------------+----------------------------+
| University | utf8mb3 |
+-------------+----------------------------+
Unfortunately, utf8mb3 is problematic and deprecated because it only covers the Basic Multilingual Plane, i.e., most but not all Unicode characters. Nevertheless, this old encoding is still widely used for historical and backward-compatibility reasons: many legacy systems, applications, and scripts still rely on utf8mb3 due to a lack of updates. But if we need to support emoji, supplementary CJK ideographs, and other characters outside the Basic Multilingual Plane, utf8mb4 is a requirement.
So let’s upgrade to utf8mb4 to future-proof our database for modern applications and standards:
ALTER DATABASE University
CHARACTER SET utf8mb4;
ALTER TABLE Student
CONVERT TO CHARACTER SET utf8mb4;
We can repeat the second command for all the other tables in the current database. We should note, however, that if we change the charset, as we did in this case, we generally need to change the collation as well, but we’ll look at that in more detail shortly.
In some cases, it may also be necessary to take action on the charset of individual table columns if it doesn’t match the table charset.
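To spot such mismatches, we can list the charset of every character column in the current database. A query along these lines should work in MySQL:

```sql
-- Character columns whose charset still differs from utf8mb4
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND CHARACTER_SET_NAME IS NOT NULL
  AND CHARACTER_SET_NAME <> 'utf8mb4';
```

An empty result means all character columns already use utf8mb4.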
Let’s check the charset of the current PostgreSQL database:
SELECT datname, pg_encoding_to_char(encoding) AS encoding
FROM pg_database
WHERE datname = current_database();
datname | encoding
----------+----------
postgres | UTF8
The UTF8 result is the desired one.
If the database has a different charset, it’s not trivial to change it, as PostgreSQL requires several steps to modify the charset on an existing database, typically exporting and importing a database dump.
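As a rough sketch of that procedure (database and locale names are placeholders, and we assume a local server with suitable privileges):

```shell
# Dump the existing database in custom format
pg_dump -Fc University > university.dump

# Recreate it with UTF8 encoding and the desired locale;
# template0 is required when overriding the template's encoding/locale
dropdb University
createdb -E UTF8 --locale=en_US.UTF-8 -T template0 University

# Restore the data into the new database
pg_restore -d University university.dump
```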
In MSSQL, the charset used is specified in the collation name:
SELECT name, collation_name
FROM sys.databases
WHERE name = DB_NAME();
name collation_name
------------------------------------------
University SQL_Latin1_General_CP1_CI_AS
In SQL_Latin1_General_CP1_CI_AS, Latin1_General refers to the Latin1 charset (Western European languages), and CP1 refers to Code Page 1252 (Windows-1252), a single-byte encoding for Western European languages only.
Let’s change the collation to explicitly support UTF-8:
ALTER DATABASE University
COLLATE Latin1_General_100_CI_AS_SC_UTF8;
As far as character encoding is concerned, this is sufficient. We’ll see later what this collation means in detail.
MySQL provides a wide range of collation options for precise control over text comparison and sorting. Collations are tied to charsets, each of which has one or more sorting rules. Since we chose the utf8mb4 charset earlier because it supports all characters, it’s the only charset we need to focus on.
Here are some generic collations we could use: utf8mb4_general_ci, utf8mb4_unicode_ci, utf8mb4_0900_ai_ci, and utf8mb4_bin.
While utf8mb4_general_ci is generally faster, the other collations may provide better accuracy in multilingual scenarios. For example, different collations differ in the way they compare ß and ss. We can take a look at the official documentation for an accurate comparison.
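For instance, according to the MySQL documentation, utf8mb4_unicode_ci applies Unicode expansion rules and treats ß as equal to ss, while utf8mb4_general_ci skips expansions for speed and treats ß like s. We can verify this directly (assuming a utf8mb4 connection charset):

```sql
-- utf8mb4_unicode_ci applies expansion rules: 'ß' compares equal to 'ss'
SELECT 'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_ci_result;
-- Returns 1 (equal)

-- utf8mb4_general_ci skips expansions: 'ß' compares like 's', not 'ss'
SELECT 'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_ci_result;
-- Returns 0 (not equal)
```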
On the other hand, if we have very specific linguistic needs, we can use the locale code in a structured query to easily identify the most appropriate collation. For example, if we’re interested in Bulgarian, the locale code is bg:
SHOW COLLATION
WHERE Charset = 'utf8mb4'
AND Collation LIKE '%bg%';
+-----------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+-----------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_bg_0900_ai_ci | utf8mb4 | 318 | | Yes | 0 | NO PAD |
| utf8mb4_bg_0900_as_cs | utf8mb4 | 319 | | Yes | 0 | NO PAD |
+-----------------------+---------+-----+---------+----------+---------+---------------+
The difference between utf8mb4_bg_0900_ai_ci and utf8mb4_bg_0900_as_cs lies in their sensitivity: the ai_ci suffix means accent-insensitive and case-insensitive, while as_cs means accent-sensitive and case-sensitive.
However, not all languages have dedicated collations.
To set collation at different levels in MySQL, we can use the ALTER statements. As mentioned above, charset and collation should usually be changed together:
# Database Level
ALTER DATABASE University
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
# Table Level
ALTER TABLE Student
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
# Column Level
ALTER TABLE Student
MODIFY name VARCHAR(255)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
It’s helpful to use the same collation for all character columns in a database to ensure proper sorting, searching, and comparing. MySQL supports multiple collations in the same database, but using them simultaneously in queries can lead to implicit conversions or errors in some cases.
We’re currently using utf8mb4_unicode_ci. Let’s do a test:
SELECT name
FROM Student
WHERE name = 'Jose Garcia';
+--------------+
| name |
+--------------+
| José Garcia |
| Jose Garcia |
+--------------+
The result is as expected, since utf8mb4_unicode_ci is case-insensitive and accent-insensitive. However, for testing purposes, we could force the use of a different collation within a query without altering the database or table settings. For example, let’s test utf8mb4_bin, which is case-sensitive and accent-sensitive:
SELECT name
FROM Student
WHERE name = 'Jose Garcia'
COLLATE utf8mb4_bin;
+-------------+
| name |
+-------------+
| Jose Garcia |
+-------------+
This allows us to examine the impact of different collations on specific cases.
Unlike MySQL, where collation settings are tightly integrated into the database and can be changed at the table or column level after creation, PostgreSQL requires us to specify the collation when we create the database, table, or column. Changing it later can require several steps.
In PostgreSQL, collation is determined by locales that are set when the database cluster is initialized. A database cluster in PostgreSQL refers to a collection of databases managed by a single PostgreSQL server instance. Each database in the cluster can have its own collation settings.
Here is an example of some UTF-8 locales: en_US.UTF-8 and it_IT.UTF-8.
To check the current locale settings of the database, we can use this SQL query:
SELECT datname,
datcollate AS "Collation",
datctype AS "Character Type"
FROM pg_database
WHERE datname = current_database();
datname | Collation | Character Type
----------+-------------+----------------
postgres | it_IT.UTF-8 | it_IT.UTF-8
In this case, our collation corresponds to the Italian locale. Since we’re using Linux, we can also check the server’s locale settings from the terminal:
$ locale
LANG=it_IT.UTF-8
[...]
This way we got the locale settings of the operating system that PostgreSQL uses by default.
To force a case-sensitive and accent-sensitive binary comparison, ignoring the current locale-specific rules, we can use the COLLATE clause with the default C collation:
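For example, reusing the Student table from before:

```sql
-- COLLATE "C" compares raw byte values, so 'Jose' no longer matches 'José'
SELECT name
FROM Student
WHERE name = 'Jose Garcia' COLLATE "C";
```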
Case-insensitive comparison is more tricky, because PostgreSQL doesn’t provide a built-in case-insensitive collation. However, we can perform a case-insensitive (but accent-sensitive) comparison using the LOWER() function or the ILIKE operator:
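Both approaches look like this:

```sql
-- Case-insensitive match via LOWER(): 'José Garcia' matches 'josé garcia',
-- but the accents must still match exactly
SELECT name
FROM Student
WHERE LOWER(name) = LOWER('josé garcia');

-- Equivalent pattern-based match with ILIKE, PostgreSQL's
-- case-insensitive variant of LIKE
SELECT name
FROM Student
WHERE name ILIKE 'josé garcia';
```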
To perform an accent-insensitive comparison, we can use the unaccent extension:
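A minimal sketch, assuming we have the privileges to install the extension:

```sql
-- unaccent() strips diacritics, so 'José' and 'Jose' compare equal
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT name
FROM Student
WHERE unaccent(name) = unaccent('Jose Garcia');
```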
Of course, the unaccent extension can be combined with the LOWER() function or the ILIKE operator:
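For a comparison that is both accent-insensitive and case-insensitive:

```sql
-- Strip diacritics, then compare case-insensitively
SELECT name
FROM Student
WHERE LOWER(unaccent(name)) = LOWER(unaccent('jose garcia'));

-- Or with ILIKE, which is already case-insensitive
SELECT name
FROM Student
WHERE unaccent(name) ILIKE unaccent('jose garcia');
```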
Finally, we can list all the collations we can use with the COLLATE clause:
SELECT collname
FROM pg_collation;
collname
------------------------
default
C
POSIX
ucs_basic
C.utf8
en_AG
en_AG.utf8
[...]
However, we can only use the locales that are actually installed on our system.
Microsoft SQL Server uses collations at different levels: server, database, column, and expression.
Choosing the same collations at different levels ensures consistency and prevents errors.
MSSQL provides a wide range of collations to accommodate different languages and regional settings. Common examples include SQL_Latin1_General_CP1_CI_AS, the legacy default on many installations, and Latin1_General_100_CI_AS, its modern Windows-collation counterpart.
After understanding some common collations, it’s essential to recognize the broader range of collations available in MSSQL, especially those that support UTF-8 encoding, and how they differ.
Collation names in MSSQL follow a specific format that describes their properties:
<CollationDesignator>_<LocaleVersion>_<ComparisonStyle>[_SC][_UTF8]
Let’s break it down: CollationDesignator identifies the base language or alphabet (e.g., Latin1_General); LocaleVersion indicates the version of the sorting rules (e.g., 100); ComparisonStyle combines flags such as CI/CS (case-insensitive/sensitive), AI/AS (accent-insensitive/sensitive), and optionally KS (kana-sensitive) and WS (width-sensitive); the optional _SC suffix enables supplementary-character support; and _UTF8 indicates UTF-8 encoding.
Let’s note that in modern multilingual applications, UTF-8 is a requirement. Here are some examples of UTF-8 collations: Latin1_General_100_CI_AS_SC_UTF8 and Albanian_100_CI_AI_SC_UTF8.
To get the list of all available UTF-8 collations, we can use this query:
SELECT name
FROM sys.fn_helpcollations()
WHERE name LIKE '%UTF8';
[...]
Albanian_100_CI_AI_SC_UTF8
Albanian_100_CI_AI_WS_SC_UTF8
Albanian_100_CI_AI_KS_SC_UTF8
Albanian_100_CI_AI_KS_WS_SC_UTF8
Albanian_100_CI_AS_SC_UTF8
[...]
There are 1553 collations in total. By understanding the naming conventions, we can make informed decisions about which collation best suits the needs of our database.
Using the COLLATE clause, we can test some collations according to their naming convention:
SELECT name
FROM Student
WHERE name = 'Zoë Johnson'
COLLATE Latin1_General_100_CI_AI_SC_UTF8;
name
------------
Zoë Johnson
Zoe Johnson
SELECT name
FROM Student
WHERE name = 'Zoë Johnson'
COLLATE Latin1_General_100_CI_AS_SC_UTF8;
name
------------
Zoë Johnson
The results are as expected.
In this article, we explored the concept of collation in SQL databases and how it affects text comparison and sorting in MySQL, PostgreSQL, and MSSQL. We discussed setting up databases with appropriate character encodings, examined collation options, and tested their impact on queries.