HTML5 defines the fifth major revision of the core language of the Web. In a recently published document, the W3C details the differences between HTML4 and HTML5 and provides insight into the thinking behind the changes.

What is HTML5?

HTML5 is a revision of the HTML standard in the works at the W3C. HTML5 will (eventually) replace HTML4 and XHTML1, as it defines a single language that can be written using either HTML or XML syntax.

There are separate conformance requirements for authors and user agents. Those writing web browsers and other clients that work over the web are required to support older elements and attributes for backwards compatibility. Web page authors will work with a slightly simplified language with some elements and attributes moved into CSS, and will no longer encounter the term deprecated since older code will still be supported.

Syntax Differences

HTML5 differs from its predecessors in many ways, as outlined by the W3C in a recent document. HTML5 syntax is compatible with both HTML4 and XHTML1, except for "the more esoteric SGML features of HTML4." These are mostly features that most user agents never supported, such as processing instructions and shorthand markup. Other syntax differences include:

  • Most HTML documents will be served with the text/html media type
  • HTML5 defines detailed parsing rules, including error handling, that user agents must use with the text/html media type
  • A text/html-sandboxed media type for HTML syntax documents where you're hosting untrusted content
  • XML syntax documents must be served with an XML media type such as application/xml, with elements in the XHTML namespace following the XML specifications
  • The HTML syntax of HTML5 now has native support for Internationalized Resource Identifiers (IRIs) if the encoding is UTF-8 or UTF-16
  • The lang attribute can now take an empty string in addition to a valid language identifier, mirroring xml:lang in XML
  • The HTML syntax of HTML5 requires a DOCTYPE declaration of <!DOCTYPE html> "to ensure that the browser renders the page in standards mode," but the XML syntax doesn't, since XML is always rendered in this mode
  • HTML5's HTML syntax allows for MathML and SVG elements
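
Several of the points above can be seen together in a small document. The following is a minimal sketch in the HTML syntax, not a canonical template; the title, text, and circle are placeholders made up for illustration.

```html
<!DOCTYPE html>
<!-- The short DOCTYPE asks the browser for standards mode -->
<html lang="en">
  <head>
    <title>Example page (placeholder)</title>
  </head>
  <body>
    <p>A paragraph of ordinary HTML content.</p>
    <!-- SVG elements can appear inline in the HTML syntax -->
    <svg width="100" height="100" viewBox="0 0 100 100">
      <circle cx="50" cy="50" r="40" fill="teal"></circle>
    </svg>
  </body>
</html>
```

A document like this would normally be served with the text/html media type.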

Character Encoding Differences

The HTML syntax of HTML5 offers three options for declaring the character encoding, according to the HTML5 differences from HTML4 document:

  1. At the transport level, for example by using the HTTP Content-Type header
  2. Starting the file with a Unicode Byte Order Mark (BOM) character, which provides a signature for the type of encoding used
  3. Including a meta element (for example, <meta charset="UTF-8">) with a charset attribute specifying the encoding within the first 512 bytes of the document, significantly shortening the syntax required previously

Authors working in XML syntax will use the rules already set in the XML specifications.
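
To show how much shorter the third option is, here is a sketch of a document head that uses the new meta element, with the older HTML4-style declaration kept in a comment for comparison. The title is a placeholder, and only one of the two declarations would appear in a real page.

```html
<head>
  <!-- HTML4-style declaration, shown for comparison only:
       <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> -->

  <!-- HTML5 form: keep it near the top of the head so that it falls
       within the first bytes of the document that the parser scans -->
  <meta charset="UTF-8">
  <title>Encoding example (placeholder)</title>
</head>
```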

Changes in Elements

There are three groups of changes to the language of HTML from version 4 to version 5: new, changed, and removed elements.

While too numerous to list out fully here, you can find the full list of changes in the HTML5 differences from HTML4 document. Some of the more interesting new elements (subject to change) are:

  • article: An independent piece of content within a larger document, such as a blog entry or newspaper article
  • embed: Place a piece of plugin content
  • figure: A piece of "self-contained flow content," such as an image or video (though video and audio are also both new elements, supported by APIs allowing authors to script user interfaces and more)
  • mark: Highlighted text in a document for reference purposes, due to relevance in another context
  • nav: A section designed for navigation
  • progress: How complete a task is, such as downloading or a long sequence of operations
  • section: A generic document or application section, to be used in conjunction with the header tags (h1, h2, etc.) to define structure

The input element's type attribute now also has some new values, such as date, month, time, and range.
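
As a rough sketch of how the new elements and input types described above might fit together, consider the following fragment. The URLs, file names, headings, and field names are all hypothetical, and since the elements are still subject to change, treat this as an illustration rather than a reference.

```html
<nav>
  <a href="/">Home</a>
  <a href="/archive">Archive</a>
</nav>

<article>
  <h1>Blog entry title</h1>
  <section>
    <h2>First section</h2>
    <p>Some text with a <mark>highlighted phrase</mark>.</p>
    <figure>
      <img src="chart.png" alt="Placeholder chart">
    </figure>
  </section>
  <!-- The text inside progress is fallback content for older browsers -->
  <p>Upload status: <progress value="30" max="100">30%</progress></p>
</article>

<form action="/search">
  <label>Date <input type="date" name="d"></label>
  <label>Month <input type="month" name="m"></label>
  <label>Time <input type="time" name="t"></label>
  <label>Volume <input type="range" name="v"></label>
</form>
```

Browsers that do not recognize a new input type treat the field as an ordinary text field, so forms like this degrade gracefully.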

Some HTML4 elements were changed at a deeper level for HTML5 "to better reflect how they are used on the Web or to make them more useful." One example of these changes is cite, which now only represents the title of a work.

Also, some elements were completely removed from HTML5. Authors aren't supposed to use them, but user agents will have to support them for backwards compatibility. Those elements removed because "their function is better handled by CSS" are basefont, big, center, font, s, strike, tt and u. Elements removed "because their usage affected usability and accessibility for the end user in a negative way" are frame, frameset and noframes. Those removed because "they have not been used often, created confusion, or their function can be handled by other elements" are acronym, applet, dir and isindex. And the noscript element, while still in the HTML syntax for HTML5, is not included in the XML syntax since it relies on an HTML parser.

New Attributes in HTML5

Along with new elements, there are numerous new attributes, including:

  • A media attribute for the a and area elements, to be used in the same sense as on link
  • New autocomplete, max, min, multiple, pattern and step attributes for the input element
  • New type, label and contextmenu attributes for the menu element for UI work
  • An async attribute for the script element, for influencing script loading and execution

Also, a number of existing HTML4 attributes that before applied only to specific elements are now global, applying to all elements: class, dir, id, lang, style, tabindex and title. New global attributes were also created, such as draggable for working with the new drag and drop API.
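
A few of these attributes are sketched in the fragment below. It covers only a subset (the input attributes, async, and draggable); the script file name, form action, and field names are hypothetical.

```html
<!-- async lets the script download without blocking the parser -->
<script async src="analytics.js"></script>

<form action="/signup">
  <label>Quantity
    <input type="range" name="qty" min="0" max="10" step="2">
  </label>
  <label>Postal code
    <input type="text" name="zip" pattern="[0-9]{5}" autocomplete="off"
           title="Five digits">
  </label>
  <label>Attachments
    <input type="file" name="files" multiple>
  </label>
</form>

<!-- Marked as draggable; a script would still need to handle the drag events -->
<p draggable="true">This paragraph is marked as draggable.</p>
```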

Some attributes were changed at a deeper level as well. The W3C states that while they are still allowed, "authors are discouraged from using them and instead strongly encouraged to use an alternative solution." For example, the img element's border attribute is now required to have the value "0" when it's present, but the W3C suggests using CSS for this purpose instead.

Numerous attributes were removed from HTML5 as well. For example, the link and a elements no longer allow the rev and charset attributes, the html element no longer allows the version attribute and the meta element no longer allows the scheme attribute. Among those removed because their functions are better served by CSS are the body element's background attribute, the table element's frame and rules attributes, and the hr element's noshade attribute.
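
For authors wondering what the CSS alternatives look like, the sketch below replaces a few of the removed presentational elements and attributes with style rules. The class name, image files, and text are invented for the example, and these are only one possible set of equivalents.

```html
<style>
  /* Instead of <body background="paper.png"> */
  body { background-image: url("paper.png"); }

  /* Instead of wrapping text in <center> and <font color="red"> */
  .notice { text-align: center; color: red; }

  /* Instead of <img border="0"> */
  img { border: 0; }
</style>

<p class="notice">Site maintenance tonight.</p>
<img src="logo.png" alt="Logo">
```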

APIs

HTML5 also introduces new APIs for creating web applications, including:

  • One for playing video and audio, which is used with the video and audio elements
  • One for enabling offline web applications
  • One for letting web applications register for protocols or media types
  • One for handling editing through the new global contenteditable attribute
  • One for creating drag and drop interfaces with the draggable attribute
  • One for exposing the history and letting pages add to it, to avoid "breaking the back button"
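
Three of these APIs are sketched below: scripting a custom control for a video element, in-place editing with contenteditable, and adding a history entry with pushState(). The element ids, video file, and URLs are hypothetical, and a real application would do more work at each step.

```html
<video id="clip" src="movie.webm"></video>
<button id="toggle">Play / pause</button>

<div contenteditable="true">Readers can edit this text in place.</div>

<a id="next" href="/page/2">Next page</a>

<script>
  // Script a custom play/pause control through the media API
  var clip = document.getElementById("clip");
  document.getElementById("toggle").onclick = function () {
    if (clip.paused) { clip.play(); } else { clip.pause(); }
  };

  // Add a history entry without a full page load, so Back still works
  document.getElementById("next").onclick = function (event) {
    event.preventDefault();
    history.pushState({ page: 2 }, "", "/page/2");
    // A real application would also fetch and render the new content here
  };
</script>
```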

Interface Extensions

Two interfaces were also extended: HTMLDocument and HTMLElement. HTMLDocument is now implemented on all objects using the Document interface. Additions to HTMLDocument are:

  • The ability to select elements by their class name with getElementsByClassName()
  • Parse and serialize HTML or XML documents with innerHTML
  • Determine which element currently has focus with activeElement, and whether the Document itself has focus with hasFocus()
  • Return an object representing the current selection(s) with getSelection()

HTMLElement gains its own versions of getElementsByClassName() and innerHTML, plus classList, an accessor for className.
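
The fragment below is a small sketch of these additions as they are exposed to scripts in a browser. The ids, class names, and list items are made up, and innerHTML is shown on an element rather than on the document.

```html
<ul id="list">
  <li class="item done">Write draft</li>
  <li class="item">Review</li>
</ul>
<input id="name" type="text">

<script>
  // All elements in the document carrying the class "item"
  var items = document.getElementsByClassName("item");

  // The same lookup scoped to a single element
  var list = document.getElementById("list");
  var done = list.getElementsByClassName("done");

  // classList gives token-level access to className
  items[1].classList.add("done");
  items[0].classList.remove("done");

  // Reading innerHTML serializes markup; assigning to it parses markup
  list.innerHTML += '<li class="item">Publish</li>';

  // Focus and selection state
  document.getElementById("name").focus();
  var focused = document.activeElement;   // the input, if focusing succeeded
  var pageHasFocus = document.hasFocus();
  var selectedText = String(document.getSelection());
</script>
```

Both collections returned by getElementsByClassName() are live, so they reflect the class changes and the appended list item without being queried again.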

When Will HTML5 Be Ready?

The creation of a new standard, especially for the bedrock of the web, takes years. According to the FAQ published by the Web Hypertext Application Technology Working Group (WHATWG), the group expects HTML5 to reach the W3C Candidate Recommendation stage in 2012 and to graduate to W3C Recommendation in 2022.

Explaining why this work is expected to take twenty years, the document states: "Work on HTML4 started in the mid 90s, and HTML4 still, more than ten years later, hasn't reached the level that we want to reach with HTML5. There is no real test suite, there are many parts of the spec that are lacking real implementations, there are big parts that aren't interoperable, and the spec has hundreds if not thousands of known errors that haven't been fixed."

The requirements for a specification to reach Recommendation level have also changed since HTML4 was released. The document points out that, "For a spec to become a REC today, it requires two 100% complete and fully interoperable implementations, which is proven by each successfully passing literally thousands of test cases (20,000 tests for the whole spec would probably be a conservative estimate)."

So, double the time that it took HTML4 to reach Recommendation status might not seem so out of line. In the meantime, each section of the HTML5 specification is marked by how stable it is for those who want to experiment with it. Some aspects of the standard are already implemented in multiple browsers.

Is all of this process overkill? Given how big of a deal it is to evolve to a new standard for working on the web, the W3C's approach to HTML5 is probably a smart one. I suspect by the time the standard reaches full Recommendation status, much of it will already be in use. Official transition by that point might be in many ways a formality.