
There are a few ways to get the list of all Unicode character names: for example, using the Python module unicodedata, as explained in "List of unicode character names", or using the website https://unicode.org/charts/charindex.html. The website is incomplete, however, and you have to open and parse PDFs to find the names.

But what is the official source or repository of all Unicode character names? I'm looking for the original source of these names, in a machine-readable format, so that the list is updated whenever a new character is added.

I'm looking for a list with just code point and name, in CSV or any other format:

code   character name
...
0102   LATIN CAPITAL LETTER A WITH BREVE
0103   LATIN SMALL LETTER A WITH BREVE
...
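
For comparison, the unicodedata approach mentioned above can produce this two-column layout directly. This is only a minimal sketch: the helper name name_table is made up for illustration, and the output reflects the Unicode version bundled with the running Python interpreter, which may lag behind the latest release.

```python
import unicodedata


def name_table(first, last):
    """Return (hex code, name) pairs for named code points in [first, last]."""
    rows = []
    for cp in range(first, last + 1):
        # unicodedata.name() raises ValueError for code points without a name
        # (for example control characters), so pass a default and skip those.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            rows.append((f"{cp:04X}", name))
    return rows


for code, name in name_table(0x0102, 0x0103):
    print(code, name)
```

Control characters such as U+0000 have no name in unicodedata, so they are simply left out of the result.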
  • What has this to do with "python", "string" and "utf-8"? Commented Dec 5, 2020 at 16:26
  • @AmigoJack I initially wanted to use unicodedata docs.python.org/3/library/unicodedata.html, as mentioned in the question, but you're right this aspect is secondary. Commented Dec 5, 2020 at 16:28
  • How about editing your question so unicodedata links to Python (because it can mean something different) and removing the other two tags? I came here for "utf-8" just to find out the encoding is nowhere involved. Commented Dec 5, 2020 at 16:33
  • There isn't any one source of data. You can look in the following files: UnicodeData.txt, NameAliases.txt, NamedSequences.txt, the short names in Jamo.txt as an alias to each combining jamo. On top of that you have emoji sequences and Han ideographs. Commented Jul 8, 2024 at 4:04

2 Answers


The official source for the actual character data (which includes the character names and many, many other details) is the Unicode Character Database.

The latest version of the data files can be accessed via http://www.unicode.org/Public/UCD/latest/.

Names specifically can be found in the file NamesList.txt. The format of that file is described here.

This is the list in a CSV-like, semicolon-separated format: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt


4 Comments

The official names are in UnicodeData.txt, which is much easier to parse. OTOH your files contain other names (from NameAliases.txt), which are all "official" and in the same namespace.
This CSV file contains 34627 lines. Yet Wikipedia claims there are 144697 characters in Unicode. This is also backed by the official page: unicode.org/versions/stats/charcountv14_0.html
@Ginden UnicodeData.txt doesn't include the "Unihan" CJK data for Chinese, Japanese & Korean characters. This is deployed separately in Unihan.zip.
@moon This explains the discrepancy, but there are 39493 characters in "Alphabetics, Symbols". 39493-34627 = 4866

The CSV file located at

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

has one record for each named code point, in a format that looks like this:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
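
As a rough illustration of that record layout, here is a minimal Python sketch that splits one line into its fields. The function name parse_unicode_data_line is hypothetical. The field positions used are 0 for the code point, 1 for the name, and 10 for the old Unicode 1.0 name, which is where entries such as <control> keep a readable label.

```python
def parse_unicode_data_line(line):
    """Split one UnicodeData.txt record into (code point, name, old name)."""
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)  # field 0: hexadecimal code point
    name = fields[1]                 # field 1: character name or <placeholder>
    old_name = fields[10]            # field 10: Unicode 1.0 name, often empty
    return code_point, name, old_name


sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
print(parse_unicode_data_line(sample))
```

Python's str.split keeps empty trailing fields, so indexing by position stays stable even when most fields in a record are empty.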

If you want to parse the latest database of Unicode character names, here is a Ruby script that does that:

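#!/usr/bin/env ruby

require 'net/http'

uri = URI('https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt')
txt = Net::HTTP.get(uri)
txt.split(/\R/).each do |line|
  # Field 0 is the code point, field 1 the name, field 10 the Unicode 1.0 name.
  # The -1 limit keeps empty trailing fields so the indexes stay fixed.
  fields = line.split(';', -1)
  if fields[1].match?(/\A<.*>\z/)
    # Placeholder names such as <control>: append the old name, if there is one.
    puts "#{fields[0]} #{fields[1]} #{fields[10]}"
  else
    puts "#{fields[0]} #{fields[1]}"
  end
end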

Or with curl and awk (using Bash process substitution):

awk -F";" '
{   sub(/;*$/,""); $1=$1
    if ($2~"^<.*>$") 
        printf "%s %s %s\n", $1, $2, ($NF~"^N$") ? "" : $NF
    else
        printf "%s %s\n", $1, $2
}' <(curl -s "https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt") 

Prints:

0000 <control> NULL
0001 <control> START OF HEADING
0002 <control> START OF TEXT
...
0041 LATIN CAPITAL LETTER A
0042 LATIN CAPITAL LETTER B
0043 LATIN CAPITAL LETTER C
...
00C0 LATIN CAPITAL LETTER A WITH GRAVE
00C1 LATIN CAPITAL LETTER A WITH ACUTE
00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3 LATIN CAPITAL LETTER A WITH TILDE
...
