Alfreds Genkins

Written by: Alfreds Genkins


Wednesday, October 17, 2018

HTML Parsing Quirks

How can source HTML differ from parsed HTML?

Tag Content Categories

Every tag in HTML is a member of a content category. There are three main types of content categories: main, form-related and content-specific.

When developers discuss the main content category type, they are usually referring to three subtypes: Flow Content, Sectioning Content and Phrasing Content.

Flow content — usually contains text or embedded content.

Sectioning content — usually creates sections in the current outline.

Phrasing content — defines the text and the mark-up it contains.
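To make the categories concrete, here is a minimal Python sketch. The `CATEGORIES` table and the `allowed_in` helper are illustrative names, not part of any library, and the table covers only a handful of elements, assuming the category assignments given in the HTML Standard.

```python
# Hypothetical lookup table: a few elements and the main categories they
# belong to. Real elements can belong to more categories than shown here.
CATEGORIES = {
    "div": {"flow"},
    "p": {"flow"},
    "span": {"flow", "phrasing"},
    "em": {"flow", "phrasing"},
    "section": {"flow", "sectioning"},
    "article": {"flow", "sectioning"},
}


def allowed_in(content_model, tag):
    """Return True if `tag` belongs to the category a parent accepts."""
    return content_model in CATEGORIES.get(tag, set())


# A parent that accepts only phrasing content can hold <span> but not <div>.
print(allowed_in("phrasing", "span"))  # True
print(allowed_in("phrasing", "div"))   # False
```

Most elements are flow content, while only some of them are also phrasing or sectioning content, which is why the sets overlap.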

How can parsed HTML be different from the source?

Sometimes, elements have special tag omission rules. These are usually applied if the **immediately following** children of an element do not follow a specific rule.

According to the specification, the `<p>` tag allows only Phrasing content to be entered into it:

The start tag is required. The end tag may be omitted if the `<p>` element is immediately followed by [list of elements which accept Flow content]

It’s important to clarify that in this case, **immediately followed** applies not only to the first child, but to any first-level child. So any element from that list placed inside of `<p>` ends the paragraph at that point, as if the end tag had been omitted there. For example:


Will be compiled into:


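As a rough illustration of this rule, the Python sketch below re-serializes a fragment and closes an open `<p>` whenever one of the listed elements starts. It is a toy model built on the standard library's `html.parser`, which does not apply these rules itself. `P_CLOSERS`, `ParagraphCloser` and `reparse` are hypothetical names, the element set is a partial assumption, and attributes and still-open inline elements are ignored.

```python
from html.parser import HTMLParser

# Assumption: a partial list of elements whose start tag ends an open <p>.
P_CLOSERS = {
    "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
    "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr",
    "main", "nav", "ol", "p", "pre", "section", "table", "ul",
}


class ParagraphCloser(HTMLParser):
    """Toy serializer that models only the implied </p> rule."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in P_CLOSERS and self.p_open:
            self.out.append("</p>")  # the omitted end tag is implied here
            self.p_open = False
        if tag == "p":
            self.p_open = True
        self.out.append(f"<{tag}>")  # attributes are dropped for brevity

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                # A stray </p> produces an empty paragraph.
                self.out.append("<p>")
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def reparse(source):
    parser = ParagraphCloser()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


print(reparse("<p>Hello<div>World</div></p>"))
# <p>Hello</p><div>World</div><p></p>
```

The `<div>` ends the paragraph early, so the original `</p>` no longer has a matching start tag and turns into an empty `<p></p>`. A browser produces the same tree for this fragment.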
Why is parsing quirky?

Let’s take a look at the following example. For the `<html>` element, the end tag may be omitted if the element is not immediately followed by a comment, and if it contains a `<body>` element that is either not empty or whose start tag is present.

<html><body></body><!-- document end --></html>

Will be parsed into:

<html><body></body><!-- document end --></html>

Notice there’s no difference. However:

<html>
<body>
<!-- document end --></html>

Will be parsed into:

<html>
<body><!-- document end --></body>
</html>

In both cases, the comment sits immediately before the `</html>` end tag, the `<body>` is empty and its start tag is present. So why are the results different? There are no tabs or spaces that the browser could treat as a text node (new lines are not treated as one at all). So it’s not clear why these two examples behave differently.
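One possible reading, assuming the tree-construction rules of the HTML Standard: a comment that arrives while `<body>` is still open is inserted into `<body>`, whereas a comment that arrives after an explicit `</body>` is appended to `<html>`. The Python sketch below models only that distinction. `TOKEN` and `comment_parents` are hypothetical names, the tokenizer is deliberately simplified, and anything before the `<body>` start tag is not modelled faithfully.

```python
import re

# Matches comments and start/end tags; a deliberately simplified tokenizer.
TOKEN = re.compile(r"<!--.*?-->|</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>", re.S)


def comment_parents(source):
    """For each comment, report which element would become its parent."""
    body_open = False
    parents = []
    for match in TOKEN.finditer(source):
        token = match.group(0)
        name = (match.group(1) or "").lower()
        is_end_tag = token.startswith("</")
        if token.startswith("<!--"):
            parents.append("body" if body_open else "html")
        elif name == "body" and not is_end_tag:
            body_open = True
        elif name in ("body", "html") and is_end_tag:
            # An explicit </body> (or </html>) ends the body, so later
            # comments are attached to <html> instead.
            body_open = False
    return parents


print(comment_parents("<html><body></body><!-- document end --></html>"))
# ['html']
print(comment_parents("<html><body><!-- document end --></html>"))
# ['body']
```

Under this reading, the explicit `</body>` is what makes the difference: without it, the body is still open when the comment is reached.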

How do semantic elements complicate tag omission?

With the introduction of new elements, like
