Last updated: August 21, 2025
Anyone working with databases will encounter collation issues at some point, especially when dealing with multilingual data or international applications. Collation affects how strings are compared and ordered, which is critical for accurate data retrieval and presentation.
In this tutorial, we’ll explore the concept of collation in SQL databases. We’ll also pay special attention to UTF-8 character encoding (charset), because the interplay between charsets and collations is critical for internationalized database applications.
We’ll focus on MySQL, PostgreSQL, and MSSQL.
Collation in SQL databases defines the set of rules for comparing and sorting character data. It determines how characters are evaluated, taking into account factors such as case sensitivity (e.g., A vs. a), accent sensitivity (e.g., e vs. é), and locale-specific sorting rules (e.g., ß vs. ss in German).
Here are the main differences in how MySQL, PostgreSQL, and MSSQL handle collation: MySQL ties collations to character sets and lets us change them at the database, table, or column level after creation; PostgreSQL derives its collation behavior from operating-system locales chosen when a database is created, and changing them afterward is non-trivial; MSSQL encodes both the character set and the comparison rules in the collation name and applies collations from the server level down to individual expressions.
Understanding these differences is important for database administrators and developers to ensure consistent behavior across different database systems.
We added names with accents, diacritics, and non-English characters to the sample data from the Student table in our University database. Then, to illustrate the different behavior of SQL queries based on collation, we inserted the same names with ASCII-only equivalents:
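For illustration, here is a minimal sketch of what such inserts look like, using two of the name pairs that appear in the queries later in this tutorial (the column list is assumed; the actual script may insert more columns and rows):

```sql
-- Illustrative subset of addInternationalStudents.sql: each accented name
-- is paired with an ASCII-only equivalent
INSERT INTO Student (name) VALUES
  ('José Garcia'),
  ('Jose Garcia'),
  ('Zoë Johnson'),
  ('Zoe Johnson');
```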
The query for adding these names is in the file addInternationalStudents.sql. Before executing it, however, we need to make sure that our database uses UTF-8, as it’s a requirement to store multilingual content.
Let’s run the following query in MySQL to check the charset of the currently selected database:
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = DATABASE();
+-------------+----------------------------+
| SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME |
+-------------+----------------------------+
| University | utf8mb3 |
+-------------+----------------------------+
Unfortunately, utf8mb3 is problematic and deprecated because it only covers the Basic Multilingual Plane, i.e., most but not all Unicode characters. Nevertheless, this old encoding is still widely used for historical and backward-compatibility reasons: many legacy systems, applications, and scripts still rely on utf8mb3 due to a lack of updates. But if we need to support emoji, supplementary CJK ideographs, and other characters outside the Basic Multilingual Plane, utf8mb4 is a requirement.
So let’s upgrade to utf8mb4 to future-proof our database for modern applications and standards:
ALTER DATABASE University
CHARACTER SET utf8mb4;
ALTER TABLE Student
CONVERT TO CHARACTER SET utf8mb4;
We can repeat the second command for all the other tables in the current database. We should note, however, that if we change the charset, as we did in this case, we generally need to change the collation as well, but we’ll look at that in more detail shortly.
In some cases, it may also be necessary to take action on the charset of individual table columns if it doesn’t match the table charset.
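To spot such mismatches, we can list the charset of every character column in the current database. A query along these lines should work in MySQL:

```sql
-- Character columns whose charset still differs from utf8mb4
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND CHARACTER_SET_NAME IS NOT NULL
  AND CHARACTER_SET_NAME <> 'utf8mb4';
```

An empty result means all character columns already use utf8mb4.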
Let’s check the charset of the current PostgreSQL database:
SELECT datname, pg_encoding_to_char(encoding) AS encoding
FROM pg_database
WHERE datname = current_database();
datname | encoding
----------+----------
postgres | UTF8
The UTF8 result is the desired one.
If the database has a different charset, it’s not trivial to change it, as PostgreSQL requires several steps to modify the charset on an existing database, typically exporting and importing a database dump.
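As a rough sketch of that procedure (database and locale names are placeholders, and we assume a local server with suitable privileges):

```shell
# Dump the existing database in custom format
pg_dump -Fc University > university.dump

# Recreate it with UTF8 encoding and the desired locale;
# template0 is required when overriding the template's encoding/locale
dropdb University
createdb -E UTF8 --locale=en_US.UTF-8 -T template0 University

# Restore the data into the new database
pg_restore -d University university.dump
```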
In MSSQL, the charset used is specified in the collation name:
SELECT name, collation_name
FROM sys.databases
WHERE name = DB_NAME();
name collation_name
------------------------------------------
University SQL_Latin1_General_CP1_CI_AS
In SQL_Latin1_General_CP1_CI_AS, Latin1_General refers to the Latin1 charset (Western European languages), and CP1 refers to Code Page 1252 (Windows-1252), a single-byte encoding for Western European languages only.
Let’s change the collation to explicitly support UTF-8:
ALTER DATABASE University
COLLATE Latin1_General_100_CI_AS_SC_UTF8;
As far as character encoding is concerned, this is sufficient. We’ll see later what this collation means in detail.
MySQL provides a wide range of collation options for precise control over text comparison and sorting. Collations are tied to charsets, each of which has one or more sorting rules. Since we chose the utf8mb4 charset earlier because it supports all characters, it’s the only charset we need to focus on.
Here are some generic collations we could use: utf8mb4_general_ci, utf8mb4_unicode_ci, utf8mb4_0900_ai_ci, and utf8mb4_bin.
While utf8mb4_general_ci is generally faster, the other collations may provide better accuracy in multilingual scenarios. For example, different collations differ in the way they compare ß and ss. We can take a look at the official documentation for an accurate comparison.
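For instance, according to the MySQL documentation, utf8mb4_unicode_ci applies Unicode expansion rules and treats ß as equal to ss, while utf8mb4_general_ci skips expansions for speed and treats ß like s. We can verify this directly (assuming a utf8mb4 connection charset):

```sql
-- utf8mb4_unicode_ci applies expansion rules: 'ß' compares equal to 'ss'
SELECT 'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_ci_result;
-- Returns 1 (equal)

-- utf8mb4_general_ci skips expansions: 'ß' compares like 's', not 'ss'
SELECT 'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_ci_result;
-- Returns 0 (not equal)
```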
On the other hand, if we have very specific linguistic needs, we can use the locale code in a structured query to easily identify the most appropriate collation. For example, if we’re interested in Bulgarian, the locale code is bg:
SHOW COLLATION
WHERE Charset = 'utf8mb4'
AND Collation LIKE '%bg%';
+-----------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+-----------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_bg_0900_ai_ci | utf8mb4 | 318 | | Yes | 0 | NO PAD |
| utf8mb4_bg_0900_as_cs | utf8mb4 | 319 | | Yes | 0 | NO PAD |
+-----------------------+---------+-----+---------+----------+---------+---------------+
The difference between utf8mb4_bg_0900_ai_ci and utf8mb4_bg_0900_as_cs lies in their sensitivity: the ai_ci suffix means accent-insensitive and case-insensitive, while as_cs means accent-sensitive and case-sensitive.
However, not all languages have dedicated collations.
To set collation at different levels in MySQL, we can use the ALTER statements. As mentioned above, charset and collation should usually be changed together:
# Database Level
ALTER DATABASE University
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
# Table Level
ALTER TABLE Student
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
# Column Level
ALTER TABLE Student
MODIFY name VARCHAR(255)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
It’s helpful to use the same collation for all character columns in a database to ensure proper sorting, searching, and comparing. MySQL supports multiple collations in the same database, but using them simultaneously in queries can lead to implicit conversions or errors in some cases.
We’re currently using utf8mb4_unicode_ci. Let’s do a test:
SELECT name
FROM Student
WHERE name = 'Jose Garcia';
+--------------+
| name |
+--------------+
| José Garcia |
| Jose Garcia |
+--------------+
The result is as expected, since utf8mb4_unicode_ci is case-insensitive and accent-insensitive. However, for testing purposes, we could force the use of a different collation within a query without altering the database or table settings. For example, let’s test utf8mb4_bin, which is case-sensitive and accent-sensitive:
SELECT name
FROM Student
WHERE name = 'Jose Garcia'
COLLATE utf8mb4_bin;
+-------------+
| name |
+-------------+
| Jose Garcia |
+-------------+
This allows us to examine the impact of different collations on specific cases.
Unlike MySQL, where collation settings are tightly integrated into the database and can be changed at the table or column level after creation, PostgreSQL requires us to specify the collation when we create the database, table, or column. Changing it later can require several steps.
In PostgreSQL, collation is determined by locales that are set when the database cluster is initialized. A database cluster in PostgreSQL refers to a collection of databases managed by a single PostgreSQL server instance. Each database in the cluster can have its own collation settings.
Here is an example of some UTF-8 locales: en_US.UTF-8 and it_IT.UTF-8.
To check the current locale settings of the database, we can use this SQL query:
SELECT datname,
datcollate AS "Collation",
datctype AS "Character Type"
FROM pg_database
WHERE datname = current_database();
datname | Collation | Character Type
----------+-------------+----------------
postgres | it_IT.UTF-8 | it_IT.UTF-8
In this case, our collation corresponds to the Italian locale. Since we’re using Linux, we can also check the server’s locale settings from the terminal:
$ locale
LANG=it_IT.UTF-8
[...]
This way we got the locale settings of the operating system that PostgreSQL uses by default.
To force a case-sensitive and accent-sensitive binary comparison, ignoring the current locale-specific rules, we can use the COLLATE clause with the default C collation:
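For example, reusing the Student table from before:

```sql
-- COLLATE "C" compares raw byte values, so 'Jose' no longer matches 'José'
SELECT name
FROM Student
WHERE name = 'Jose Garcia' COLLATE "C";
```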
Case-insensitive comparison is more tricky, because PostgreSQL doesn’t provide a built-in case-insensitive collation. However, we can perform a case-insensitive (but accent-sensitive) comparison using the LOWER() function or the ILIKE operator:
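Both approaches look like this:

```sql
-- Case-insensitive match via LOWER(): 'José Garcia' matches 'josé garcia',
-- but the accents must still match exactly
SELECT name
FROM Student
WHERE LOWER(name) = LOWER('josé garcia');

-- Equivalent pattern-based match with ILIKE, PostgreSQL's
-- case-insensitive variant of LIKE
SELECT name
FROM Student
WHERE name ILIKE 'josé garcia';
```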
To perform an accent-insensitive comparison, we can use the unaccent extension:
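A minimal sketch, assuming we have the privileges to install the extension:

```sql
-- unaccent() strips diacritics, so 'José' and 'Jose' compare equal
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT name
FROM Student
WHERE unaccent(name) = unaccent('Jose Garcia');
```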
Of course, the unaccent extension can be combined with the LOWER() function or the ILIKE operator:
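For a comparison that is both accent-insensitive and case-insensitive:

```sql
-- Strip diacritics, then compare case-insensitively
SELECT name
FROM Student
WHERE LOWER(unaccent(name)) = LOWER(unaccent('jose garcia'));

-- Or with ILIKE, which is already case-insensitive
SELECT name
FROM Student
WHERE unaccent(name) ILIKE unaccent('jose garcia');
```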
Finally, we can list all the collations we can use with the COLLATE clause:
SELECT collname
FROM pg_collation;
collname
------------------------
default
C
POSIX
ucs_basic
C.utf8
en_AG
en_AG.utf8
[...]
However, we can only use the locales that are actually installed on our system.
Microsoft SQL Server uses collations at different levels: server, database, column, and expression.
Choosing the same collations at different levels ensures consistency and prevents errors.
MSSQL provides a wide range of collations to accommodate different languages and regional settings. Common examples include SQL_Latin1_General_CP1_CI_AS, the legacy default on many installations, and Latin1_General_100_CI_AS, its modern Windows-collation counterpart.
After understanding some common collations, it’s essential to recognize the broader range of collations available in MSSQL, especially those that support UTF-8 encoding, and how they differ.
Collation names in MSSQL follow a specific format that describes their properties:
<CollationDesignator>_<LocaleVersion>_<ComparisonStyle>[_SC][_UTF8]
Let’s break it down: CollationDesignator identifies the base language or alphabet (e.g., Latin1_General); LocaleVersion indicates the version of the sorting rules (e.g., 100); ComparisonStyle combines flags such as CI/CS (case-insensitive/sensitive), AI/AS (accent-insensitive/sensitive), and optionally KS (kana-sensitive) and WS (width-sensitive); the optional _SC suffix enables supplementary-character support; and _UTF8 indicates UTF-8 encoding.
Let’s note that in modern multilingual applications, UTF-8 is a requirement. Here are some examples of UTF-8 collations: Latin1_General_100_CI_AS_SC_UTF8 and Albanian_100_CI_AI_SC_UTF8.
To get the list of all available UTF-8 collations, we can use this query:
SELECT name
FROM sys.fn_helpcollations()
WHERE name LIKE '%UTF8';
[...]
Albanian_100_CI_AI_SC_UTF8
Albanian_100_CI_AI_WS_SC_UTF8
Albanian_100_CI_AI_KS_SC_UTF8
Albanian_100_CI_AI_KS_WS_SC_UTF8
Albanian_100_CI_AS_SC_UTF8
[...]
There are 1553 collations in total. By understanding the naming conventions, we can make informed decisions about which collation best suits the needs of our database.
Using the COLLATE clause, we can test some collations according to their naming convention:
SELECT name
FROM Student
WHERE name = 'Zoë Johnson'
COLLATE Latin1_General_100_CI_AI_SC_UTF8;
name
------------
Zoë Johnson
Zoe Johnson
SELECT name
FROM Student
WHERE name = 'Zoë Johnson'
COLLATE Latin1_General_100_CI_AS_SC_UTF8;
name
------------
Zoë Johnson
The results are as expected.
In this article, we explored the concept of collation in SQL databases and how it affects text comparison and sorting in MySQL, PostgreSQL, and MSSQL. We discussed setting up databases with appropriate character encodings, examined collation options, and tested their impact on queries.