manakai

Whatpm::HTML::Parser

An HTML parser

SYNOPSIS

  use Whatpm::HTML::Parser;
  use Message::DOM::DOMImplementation;
  $parser = Whatpm::HTML::Parser->new;
  $dom = Message::DOM::DOMImplementation->new;
  $doc = $dom->create_document;
  
  $parser->parse_char_string ($chars => $doc);
  $parser->parse_byte_string ($encoding, $bytes => $doc);

  ## Or, just use the DOM attributes:
  $doc->manakai_is_html (1);
  $doc->inner_html ($chars);

DESCRIPTION

The Whatpm::HTML::Parser module is an implementation of the HTML parser. It implements the HTML parsing algorithm as defined by the HTML Living Standard. Its parsing behavior is therefore intended to be fully compatible with Web browsers that have an HTML5 parser enabled.

METHODS

Where possible, it is recommended to use a standard DOM interface, such as the inner_html method of the Document object, to parse an HTML string. The Whatpm::HTML::Parser module, which is in fact used to implement the inner_html method, offers more control over how the parser behaves. That control is unlikely to be useful unless you are writing a complex user agent such as a browser or a validator.

The Whatpm::HTML::Parser module provides the following methods:

$parser = Whatpm::HTML::Parser->new

Create a new parser.

$parser->parse_char_string ($chars => $doc)

Parse a string of characters (i.e. a possibly utf8-flagged string) as HTML and construct the DOM tree.

The first argument to the method must be a string to parse. It may or may not be a valid HTML document.

The second argument to the method must be a DOM Document object (Message::DOM::Document). Any child nodes of the document are first removed by the parser.

$parser->parse_byte_string ($encoding, $bytes => $doc)

Parse a string of bytes as HTML and construct the DOM tree.

The first argument to the method must be the label of a (character) encoding, as specified by the Encoding Standard. The undef value can be specified if the encoding is not known.

The second argument to the method must be a string to parse. It may or may not be a valid HTML document.

The third argument to the method must be a DOM Document object (Message::DOM::Document). Any child nodes of the document are first removed by the parser.

$parser->set_inner_html ($node, $chars)

Parse a string of characters in the context of a node. If the node is a Document, this is equivalent to the parse_char_string method. If the node is an Element, parsing is performed in fragment mode.

The first argument to the method must be a DOM Node object (Message::DOM::Node) that is either a Document (Message::DOM::Document) or an Element (Message::DOM::Element). The node gives the parser its context and receives the parsed subtree. Any existing child nodes of the node are removed first.

The second argument to the method must be a string of characters.
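
The two modes described above can be sketched as follows. This is a minimal illustration, not a definitive usage pattern. It assumes that the Document object provides the DOM createElementNS method under the Perl-style name create_element_ns; that name does not appear elsewhere in this document.

```perl
use strict;
use warnings;
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;

my $parser = Whatpm::HTML::Parser->new;
my $doc = Message::DOM::DOMImplementation->new->create_document;
$doc->manakai_is_html (1);

## Assumption: create_element_ns is the Perl-style name of the DOM
## createElementNS method on Message::DOM::Document.
my $div = $doc->create_element_ns ('http://www.w3.org/1999/xhtml', 'div');

## Element context: the string is parsed in fragment mode and the
## resulting nodes replace the children of $div.
$parser->set_inner_html ($div, '<p>Hello <em>world</em>');

## Document context: equivalent to parse_char_string.
$parser->set_inner_html ($doc, '<!DOCTYPE html><title>Test</title><p>Hello');
```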

$code = $parser->onerror
$parser->onerror ($new_code)

Get or set the error handler for the parser. Parse errors, as well as warnings and informational messages, are reported to the handler. See Whatpm::Errors for more information.
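
As a rough sketch, a handler can simply collect whatever the parser reports. The shape of the arguments passed to the handler is defined by Whatpm::Errors and is not described here, so this example stores the argument list unmodified rather than assuming particular keys.

```perl
use strict;
use warnings;
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;

my $parser = Whatpm::HTML::Parser->new;
my $doc = Message::DOM::DOMImplementation->new->create_document;

## Collect every report; the argument list is kept as-is because its
## exact structure is documented in Whatpm::Errors, not here.
my @reports;
$parser->onerror (sub {
  push @reports, [@_];
});

## Deliberately sloppy input, likely to trigger at least one report.
$parser->parse_char_string ('<p>unclosed <b>text' => $doc);

printf "%d report(s) received\n", scalar @reports;
```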

The parsed document structure is reflected in the Document object passed as an argument to the parse methods. The character encoding used to parse the document can be retrieved with the input_encoding method of the Document.
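
For example, when the encoding is not known in advance, undef can be passed to parse_byte_string and the encoding the parser settled on can be read back afterwards. This is a minimal sketch; the sample byte string is made up for illustration, and no particular return value is assumed.

```perl
use strict;
use warnings;
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;

my $parser = Whatpm::HTML::Parser->new;
my $doc = Message::DOM::DOMImplementation->new->create_document;

## Illustrative input: UTF-8 bytes with a charset declaration.
my $bytes = qq{<!DOCTYPE html><meta charset="utf-8"><title>caf\xC3\xA9</title>};

## undef means the encoding is not known to the caller.
$parser->parse_byte_string (undef, $bytes => $doc);

my $encoding = $doc->input_encoding;
print defined $encoding ? $encoding : '(not set)', "\n";
```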

Although the parser is intended to be fully conformant to the HTML Living Standard, it might not yet implement the latest spec changes. See the list of bugs on the HTML parser <http://manakai.g.hatena.ne.jp/task/2/> for the current implementation status.

SEE ALSO

Message::DOM::Document, Message::DOM::Element.

Whatpm::HTML::Serializer.

Whatpm::ContentChecker.

Whatpm::XML::Parser.

SPECIFICATIONS

[HTML]

HTML Living Standard - Parsing HTML documents <http://www.whatwg.org/specs/web-apps/current-work/#parsing>.

HTML Living Standard - Parsing HTML fragments <http://www.whatwg.org/specs/web-apps/current-work/#parsing-html-fragments>.

[ENCODING]

Encoding Standard <http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>.

AUTHOR

Wakaba <w@suika.fam.cx>.

LICENSE

Copyright 2007-2012 Wakaba <w@suika.fam.cx>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.