{"id":36761,"date":"2024-08-05T11:48:12","date_gmt":"2024-08-05T01:48:12","guid":{"rendered":"https:\/\/database.guide\/?p=36761"},"modified":"2024-08-05T11:48:14","modified_gmt":"2024-08-05T01:48:14","slug":"check-whether-a-unicode-string-is-normalized-in-postgresql","status":"publish","type":"post","link":"https:\/\/database.guide\/check-whether-a-unicode-string-is-normalized-in-postgresql\/","title":{"rendered":"Check Whether a Unicode String is Normalized in PostgreSQL"},"content":{"rendered":"\n<p class=\"\">In PostgreSQL, we can use the <code>... IS NORMALIZED<\/code> expression to check whether or not a given Unicode string is in a specified Unicode normalization form.<\/p>\n\n\n\n<p class=\"\">By default, it checks whether it&#8217;s in the <code>NFC<\/code> form, but we also have the option of specifying <code>NFD<\/code>, <code>NFKC<\/code>, or <code>NFKD<\/code>.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Syntax<\/h2>\n\n\n\n<p class=\"\">The syntax goes like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>text IS &#91;NOT] &#91;form] NORMALIZED<\/code><\/pre>\n\n\n\n<p class=\"\">So we can use <code>NOT<\/code> to negate the operation. We can also specify which normalization form to check for.<\/p>\n\n\n\n<p class=\"\">The expression returns a boolean value (either true or false).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Example<\/h2>\n\n\n\n<p class=\"\">Here&#8217;s a quick example of how to use the expression:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT '\u00f6' IS NORMALIZED;<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">t<\/pre>\n\n\n\n<p class=\"\">In this case we get <code>t<\/code> for true. In other words, the Unicode character is normalized in the <code>NFC<\/code> form. We know it&#8217;s the <code>NFC<\/code> form, because that&#8217;s the default.<\/p>\n\n\n\n<p class=\"\">Here it is again, but this time we&#8217;ll check it for another form:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT '\u00f6' IS NFD NORMALIZED;<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">f<\/pre>\n\n\n\n<p class=\"\">This time we get <code>f<\/code> for false. That&#8217;s because, although it&#8217;s normalized, it&#8217;s not normalized in the <code>NFD<\/code> form.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Negating the Operation<\/h2>\n\n\n\n<p class=\"\">We can use <code>NOT<\/code> to negate the operation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT \n    '\u00f6' IS NORMALIZED AS \"Normalized\",\n    '\u00f6' IS NOT NORMALIZED AS \"Not Normalized\";<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> Normalized | Not Normalized <br>------------+----------------<br> t          | f<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Understanding Normalization<\/h2>\n\n\n\n<p class=\"\">To get a proper understanding of how the <code>... IS NORMALIZED<\/code> expression works, we need to understand why we might need it.<\/p>\n\n\n\n<p class=\"\">Take the following example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT '\u00f6' = 'o\u0308';<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">f<\/pre>\n\n\n\n<p class=\"\">Here, I compared two apparently identical characters, but the result indicates that they&#8217;re not identical (we got <code>f<\/code> for false).<\/p>\n\n\n\n<p class=\"\">The reason for this is because, although they are the same character, I actually used different Unicode code points to construct them. In my case I constructed them using HTML and then copied and pasted them into my <code><a href=\"https:\/\/database.guide\/what-is-psql\/\" data-type=\"post\" data-id=\"15024\">psql<\/a><\/code> terminal.<\/p>\n\n\n\n<p class=\"\">Many Unicode characters can be constructed from more than one set of code points, and that&#8217;s the case with the above character, known in Unicode as <em>LATIN SMALL LETTER O WITH DIAERESIS<\/em>. This can be represented as a single code point (<code>U+00F6<\/code>) or as two separate code points (<code>U+006F<\/code> and <code>U+0308<\/code>). Regarding the second one, <code>U+006F<\/code> is the code point for <em>LATIN SMALL LETTER O<\/em> and <code>U+0308<\/code> is the code point for the <em>COMBINING DIAERESIS<\/em>. These two code points are combined to form the same character that the single code point <code>U+00F6<\/code> would produce.<\/p>\n\n\n\n<p class=\"\">So given these are constructed with different code points, PostgreSQL sees them as not equal.<\/p>\n\n\n\n<p class=\"\">So given they&#8217;re not equal, let&#8217;s check each character with the <code>... IS NORMALIZED<\/code> expression:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT \n    '\u00f6' IS NORMALIZED, \n    'o\u0308' IS NORMALIZED;<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> is_normalized | is_normalized <br>---------------+---------------<br> t             | f<\/pre>\n\n\n\n<p class=\"\">So we can see that the first one is normalized in <code>NFC<\/code> form but the second one isn&#8217;t (remember, <code>NFC<\/code> is the default). <\/p>\n\n\n\n<p class=\"\">Let&#8217;s specify a different form:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT \n    '\u00f6' IS NFD NORMALIZED, \n    'o\u0308' IS NFD NORMALIZED;<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> is_normalized | is_normalized <br>---------------+---------------<br> f             | t<\/pre>\n\n\n\n<p class=\"\">This time we get the opposite result &#8211; the first column is false and the second column is true. <\/p>\n\n\n\n<p class=\"\">If we want to avoid this confusion, we can use the <a href=\"https:\/\/database.guide\/understanding-the-normalize-function-in-postgresql\/\" data-type=\"post\" data-id=\"36295\"><code>normalize()<\/code> function<\/a> to normalize both characters using the same form. The default form for this function is <code>NFC<\/code>, so we can do the following to normalize them in that form:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT \n    normalize('\u00f6') IS NORMALIZED,\n    normalize('o\u0308') IS NORMALIZED;<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> is_normalized | is_normalized <br>---------------+---------------<br> t             | t<\/pre>\n\n\n\n<p class=\"\">This time both characters returned true when checking for normalization.<\/p>\n\n\n\n<p class=\"\">We can specify a different form if required:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT \n    normalize('\u00f6', NFD) IS NFD NORMALIZED,\n    normalize('o\u0308', NFD) IS NFD NORMALIZED;<\/code><\/pre>\n\n\n\n<p class=\"\">Result:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> is_normalized | is_normalized <br>---------------+---------------<br> t             | t<\/pre>\n\n\n\n<p class=\"\"><\/p>\n\n\n\n<p class=\"\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In PostgreSQL, we can use the &#8230; IS NORMALIZED expression to check whether or not a given Unicode string is in a specified Unicode normalization form. By default, it checks whether it&#8217;s in the NFC form, but we also have the option of specifying NFD, NFKC, or NFKD.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40],"tags":[10,84],"class_list":["post-36761","post","type-post","status-publish","format-standard","hentry","category-postgresql","tag-how-to","tag-string"],"_links":{"self":[{"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/posts\/36761","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/comments?post=36761"}],"version-history":[{"count":5,"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/posts\/36761\/revisions"}],"predecessor-version":[{"id":36776,"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/posts\/36761\/revisions\/36776"}],"wp:attachment":[{"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/media?parent=36761"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/categories?post=36761"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/database.guide\/wp-json\/wp\/v2\/tags?post=36761"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}