42

The W3C "HTML5 differences from HTML4" working draft states:

For the HTML syntax, authors are required to declare the character encoding.

What does "required" mean?

Obviously, a browser will still render HTML5 without the charset meta attribute. If no encoding is specified, which encoding will a browser use?

Basically, I want to know if it is actually necessary to include <meta charset="">, or if 99% of the time browsers will use the correct encoding anyway.

2
  • I guess charset can also be considered "declared" if present in Content-Type response header. Commented Feb 3, 2013 at 4:07
  • 2
    If anyone is interested, I also happened to come across a page that explains how excluding the encoding can result in an XSS vulnerability: openmya.hacker.jp/hasegawa/security/utf7cs.html Commented Feb 9, 2013 at 0:35

4 Answers

43

It is not necessary to include <meta charset="blah">. As the specification says, the character set may also be specified by the server using the HTTP Content-Type header or by including a Unicode BOM at the beginning of the downloaded file.

Most web servers today send a character set in the Content-Type header for HTML responses by default, even if you have not configured one. If the web server does not send a character set in the Content-Type header, the file does not include a BOM, and the page does not include a <meta charset="blah"> declaration, the browser falls back to a default encoding that is usually based on the language settings of the host computer. If this does not match the actual character encoding of the file, some characters will be displayed incorrectly.
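To make the mismatch concrete, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded with a different fallback. The choice of windows-1252 (`cp1252`) as the fallback is an assumption for illustration; the real default depends on the browser and the user's locale.

```python
# The same bytes decoded under two different assumptions.
text = "café"
data = text.encode("utf-8")  # what a UTF-8 editor would write to disk

as_utf8 = data.decode("utf-8")     # the browser guesses correctly
as_cp1252 = data.decode("cp1252")  # assumed Western-locale fallback (illustrative)

print(as_utf8)    # the original text
print(as_cp1252)  # the accented letter turns into two unrelated characters
```

Plain ASCII letters survive either way, which is why English-only pages often appear fine even when the encoding is never declared.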

Will browsers use the proper encoding 99% of the time? If your page is UTF-8, probably. If not, probably not.

The W3C provides a document outlining the precedence rules for the three methods, which says the order is HTTP header, BOM, followed by in-document specification (meta tag).
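The order described above can be sketched in Python roughly as follows. This is only an illustration, not how any browser is implemented: the function names, the simplified regular expression, and the `windows-1252` fallback are all assumptions. It follows the order quoted in this answer; the current HTML Standard's sniffing algorithm checks the BOM before the HTTP header, so treat the order itself as an assumption too.

```python
import codecs
import re
from email.message import Message

# Byte-order marks and the encodings they signal.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern covering both <meta charset="..."> and the http-equiv form.
META_RE = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def charset_from_bom(body):
    """Return the encoding signalled by a leading BOM, or None."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None


def charset_from_meta(body):
    """Return the charset named in a meta tag near the start of the body, or None."""
    match = META_RE.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(content_type, body, fallback="windows-1252"):
    """Apply the order from the answer: HTTP header, then BOM, then meta tag."""
    return (
        charset_from_header(content_type)
        or charset_from_bom(body)
        or charset_from_meta(body)
        or fallback  # hypothetical locale default
    )
```

For example, `pick_encoding("text/html", b'<meta charset="utf-8">')` falls through to the meta tag because the header carries no charset parameter.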

7 Comments

So what would be the order of precedence if the Content-type, BOM, and <meta charset=""> all had different values?
HTTP header, BOM, followed by meta tag. I'll update the answer with a link I found from W3C answering this very question.
That's really interesting. I would have thought that the purpose of the meta tag would be to override everything else. It seems like it would actually be rather difficult to have a situation where the meta tag would be necessary. Am I missing something?
@twiz, it is necessary to use a meta tag to declare encoding when the server sends a Content-Type header without charset parameter and you cannot affect this (and you are not using UTF-8). This is not an uncommon scenario. Moreover, the meta tag is relevant if a page is saved locally by a user. (When opened later, there will be no HTTP headers.)
@JukkaK.Korpela I don't know a lot about encoding, so just wondering, what would be an example of a common scenario where the charset might be left out?
5

According to the Google PageSpeed browser extension, declaring a charset in a meta element "disables IE8's lookahead feature" which apparently forces it to download everything in serial.

My understanding was that <meta charset="utf-8"> was required for valid HTML5, which is why I started browsing here.

That draft of the spec seems pretty clear to me, and since I add the HTTP header via .htaccess, I am going to start leaving it out... even though I'm tempted not to, just to make IE8 users suffer a bit more.

Thanks.

@Jules Mazur, do you have any references for those points? Most of what I do is SEO, and accessibility is important to me, so if that is the case I am more than receptive to leaving the meta declaration in.

3

The short answer is NO, the charset tag is not required, but recommended.

Modern HTML5 browsers generally expect UTF-8 (it is the standard encoding for HTML5), and nearly all UTF-8 decoding routines also work with text written in older single-byte schemes such as ASCII and the ASCII range of Latin-1, because UTF-8 stores those code points as the same single byte. UTF-8 was designed to address backwards-compatibility issues like this, which is why HTML5 defaults to it. Many HTTP servers also deliver the correct charset for HTML5 pages anyway, which is UTF-8. If you leave the declaration off your pages, you should only see issues with exotic upper-plane Unicode characters, or with languages where the page's bytes were encoded incorrectly and the browser cannot recover the right code points for some characters. But again, modern browsers assume UTF-8 for HTML5, and most delivered pages, past and present, are decoded correctly by the user agent as UTF-8.

MORE DETAILS BELOW...

Since 1998, when most of these W3C HTML and encoding specifications we use today came out, the standards bodies have pushed vendors (makers of servers and browsers and document applications) to follow encoding rules and use meta tags to help determine intent.

But due to greed, poor browser design, and other factors very few have followed the specifications consistently over the years. As a result, we have a fractured system. Some vendors, like Mozilla, have followed the standards since 2001 for meta tags while others, like Microsoft and Google, have not.

For that reason, if you want your web pages viewable in 99.9% of the user agents still around, use contingency design: include meta tags and other standard markup that declare the character encoding the page was written in, despite inconsistent support for such tags. In other words, use both meta tag forms. Why? The short "charset" form works well in modern HTML5 browsers, while the longer http-equiv form is needed in many browsers released before 2010 that defaulted to older standards, like Latin-1 and ASCII, but started to support UTF-8 after 2000. Example:

<meta charset="utf-8">
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

...though in reality such markup above will rarely decide how modern web pages are decoded or interpreted by web browsers, past and present.
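If you want to check which charset a page declares through either meta form, something like the following Python sketch would do it with only the standard library. The class and function names are made up for this example, and it only reports what the markup says, not what a browser will actually use.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Records the first charset declared by a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Short HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: the charset is a parameter inside the content attribute.
            content = attrs.get("content")
            if content:
                msg = Message()
                msg["Content-Type"] = content
                self.charset = msg.get_content_charset()


def declared_charset(html):
    """Return the charset named in the page's meta tags, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset
```

Both forms in the example above resolve to the same value, so `declared_charset` returns `"utf-8"` for either line on its own.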

The encoding the browser uses to interpret the page often depends on the software used to create the page (as someone above mentioned), which increasingly saves as UTF-8 but is often still an ASCII text editor. UTF-8 is just a standard encoding scheme of Unicode that is currently popular for building HTML5 web sites. The user's browser will then likely skip over the meta tags and inspect the page to guess the author's intended encoding.

You will also notice that, in a typical HTML5 page, when you reference external files with <link> or <script> tags, you can suggest an encoding through the tag's attributes. But like the meta tag, those are just "hints": they do not fully control which encoding the browser decides the files are really in, or what the server headers say they are in.

The main driver of the encoding used is the web server, whose HTTP response header will often tell the browser the encoding, which again for HTML5 pages is always UTF-8. Because old ASCII (the first 128 characters) used in older web pages decodes unchanged as UTF-8, which covers almost everything written with English characters, users in the West rarely have issues between old and new web page encodings. Because of all these fallback designs, meta tags are often not needed at all today and are completely ignored in modern web page parsing for the reasons outlined above.
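The ASCII point is easy to check for yourself. Below is a small Python sketch (the helper name is made up) showing that ASCII bytes are already valid UTF-8, while a non-ASCII character saved in a legacy Windows code page is not:

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_page = "Hello, world".encode("ascii")
legacy_page = "café".encode("cp1252")  # one byte per character, 0xE9 for the accented letter
utf8_page = "café".encode("utf-8")     # two bytes for the accented letter

print(is_valid_utf8(ascii_page))   # ASCII bytes are identical in UTF-8
print(is_valid_utf8(legacy_page))  # the lone 0xE9 byte is not a valid UTF-8 sequence
print(is_valid_utf8(utf8_page))
```

So the "free" compatibility only covers the ASCII range; anything beyond it needs the encoding to be known.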

JavaScript using UTF-16 is a different story...

ADDITIONAL OLD BROWSER HISTORY

Some more history of meta tags: in 2000 this whole meta tag debate was much worse than it is today. Using HTML 4 with embedded Unicode characters often meant pages that were neither encoded nor rendered correctly, despite server HTTP headers, character entities, and meta tags, simply because the modern browsers of the day did not follow the standards and didn't look at meta tags, page encoding, or encoded character entities. Even today, old web pages encoded in old Windows ANSI still cannot be decoded as UTF-8 or UTF-16. That is why, to battle all the complex combinations of support and systems left by failed standards adoption, it's best to use every combination of optional HTML tag technology to increase the likelihood of your web pages being rendered correctly.

We learned a valuable lesson back then: web standards would never be followed consistently by companies. When standards are not adopted consistently by private industry, it's always best to use every form and version of tagging, all the time, to maximize the chance that your pages are viewed correctly across many different devices using various forms of those standards, even if today they don't matter (as browsers now parse pages and determine encoding themselves).

This is why I say yes, you should use the charset meta tags, even if many browsers ignore them today. It can only help with cross-browser issues and maximize the chance that user agents created over the past 20 years can read your valuable web content.

That should be the strategy for all web page design until we somehow enforce universal adoption of web standards, which is increasingly unlikely now that mobile user agents and HTML5 have forced us to abandon, yet again, many of the XML standards that would have enforced better markup design.

2

It's important to specify the document's character set as early as possible (either through the Content-Type header or the META tag); otherwise the browser is left to determine the encoding before parsing the document, which may negatively impact page load time.
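As a rough way to check "early enough" for the meta tag route, here is a Python sketch. The 1024-byte limit comes from the HTML specification's requirement that the declaration fit within the first 1024 bytes of the document, not from this answer; the function name and the simplified pattern are assumptions.

```python
import re

# Simplified pattern: a whole <meta ...> tag that mentions charset=.
META_CHARSET_TAG = re.compile(rb"<meta[^>]*charset\s*=[^>]*>", re.IGNORECASE)

PRESCAN_LIMIT = 1024  # bytes; the limit stated in the HTML specification


def declares_charset_early(body, limit=PRESCAN_LIMIT):
    """Return True if a meta charset tag ends within the first `limit` bytes."""
    match = META_CHARSET_TAG.search(body)
    return match is not None and match.end() <= limit


early = b'<!doctype html><html><head><meta charset="utf-8"><title>x</title>'
late = b"<!doctype html><html><head><!--" + b"x" * 2000 + b'--><meta charset="utf-8">'

print(declares_charset_early(early))
print(declares_charset_early(late))
```

Putting the meta tag first inside `<head>`, before long comments or inline scripts, keeps it inside that window.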
