Whatpm::HTML::Parser
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;
$parser = Whatpm::HTML::Parser->new;
$dom = Message::DOM::DOMImplementation->new;
$doc = $dom->create_document;
$parser->parse_char_string ($chars => $doc);
$parser->parse_byte_string ($encoding, $bytes => $doc);
## Or, just use DOM attribute:
$doc->manakai_is_html (1);
$doc->inner_html ($chars);
The Whatpm::HTML::Parser module is an implementation of the HTML parser. It implements the HTML parsing algorithm as defined by the HTML Living Standard. Its parsing behavior is therefore compatible with that of Web browsers with an HTML5 parser enabled.
It is recommended to use a standard DOM interface, such as the inner_html method of the Document object, to parse an HTML string where possible. The Whatpm::HTML::Parser module, which is in fact used to implement the inner_html method, offers more control over how the parser behaves; this is unlikely to be useful unless you are writing a complex user agent such as a browser or a validator.
The Whatpm::HTML::Parser module provides the following methods:
$parser = Whatpm::HTML::Parser->new
Create a new parser.
$parser->parse_char_string ($chars => $doc)
Parse a string of characters (i.e. a possibly utf8-flagged string) as HTML and construct the DOM tree.
The first argument to the method must be a string to parse. It may or may not be a valid HTML document.
The second argument to the method must be a DOM Document object (Message::DOM::Document). Any child nodes of the document are removed by the parser first.
$parser->parse_byte_string ($encoding, $bytes => $doc)
Parse a string of bytes as HTML and construct the DOM tree.
The first argument to the method must be the label of a (character) encoding, as specified by the Encoding Standard. The undef value can be specified if the encoding is not known.
The second argument to the method must be a string to parse. It may or may not be a valid HTML document.
The third argument to the method must be a DOM Document object (Message::DOM::Document). Any child nodes of the document are removed by the parser first.
$parser->set_inner_html ($node, $chars)
Parse a string of characters in the context of a node. If the node is a Document, this is equivalent to the parse_char_string method. If the node is an Element, parsing is performed in the fragment mode.
The first argument to the method must be a DOM Node object (Message::DOM::Node) that is also a Document (Message::DOM::Document) or an Element (Message::DOM::Element). The node gives the parser its context and receives the parsed subtree. Any existing child nodes of the node are removed first.
The second argument to the method must be a string of characters.
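As an illustration of the fragment mode, the following is a minimal sketch that parses a string in the context of a div element. The create_element call is an assumption about the Message::DOM::Document interface and is not described in this document; the markup is made up for the example.

```perl
use strict;
use warnings;
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;

my $dom = Message::DOM::DOMImplementation->new;
my $doc = $dom->create_document;
$doc->manakai_is_html (1);

## Assumption: the Document object offers a DOM-style create_element method.
my $div = $doc->create_element ('div');

my $parser = Whatpm::HTML::Parser->new;

## The element is the context node; its existing children (none here)
## are removed and replaced by the parsed subtree.
$parser->set_inner_html ($div, '<p>Hello, <b>world</b>!</p>');
```

Passing $doc instead of $div as the first argument would parse the string as a whole document, as parse_char_string does.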
$code = $parser->onerror
$parser->onerror ($new_code)
Get or set the error handler for the parser. Parse errors, as well as warnings and informational messages, are reported to the handler. See Whatpm::Errors for more information.
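A minimal sketch of installing an error handler follows. It assumes the handler is a code reference, as the $code name above suggests, and it deliberately stores the handler arguments without interpreting them, since their layout is described in Whatpm::Errors rather than here. The input string is made up for the example.

```perl
use strict;
use warnings;
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;

my $dom = Message::DOM::DOMImplementation->new;
my $doc = $dom->create_document;
my $parser = Whatpm::HTML::Parser->new;

## Collect every report; the argument list is kept opaque on purpose.
my @errors;
$parser->onerror (sub {
  push @errors, [@_];
});

## Malformed input, so that the parser has something to report.
$parser->parse_char_string ('<p>Unclosed <b>markup' => $doc);

printf "%d message(s) reported\n", scalar @errors;
```

Collecting the reports in an array rather than printing them immediately lets the caller decide after parsing how to filter or present them.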
The parsed document structure is reflected in the Document object passed as an argument to the parse methods. The character encoding used to parse the document can be retrieved with the input_encoding method of the Document.
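For example, when the encoding of a byte string is not known in advance, undef can be passed and the encoding actually used can be read back afterwards. This is a sketch using only the methods described above; the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Whatpm::HTML::Parser;
use Message::DOM::DOMImplementation;

my $dom = Message::DOM::DOMImplementation->new;
my $doc = $dom->create_document;
my $parser = Whatpm::HTML::Parser->new;

## Hypothetical input: a byte string that declares its own encoding.
my $bytes = qq{<!DOCTYPE html><meta charset="utf-8"><title>Test</title>};

## undef: the encoding is not known, so the parser has to determine it.
$parser->parse_byte_string (undef, $bytes => $doc);

my $encoding = $doc->input_encoding;
print defined $encoding
    ? "Parsed as: $encoding\n"
    : "No encoding reported\n";
```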
Although the parser is intended to be fully conformant to the HTML Living Standard, it might not yet implement the latest changes to the spec. See the list of bugs on the HTML parser <http://manakai.g.hatena.ne.jp/task/2/> for the current implementation status.
HTML Living Standard - Parsing HTML documents <http://www.whatwg.org/specs/web-apps/current-work/#parsing>.
HTML Living Standard - Parsing HTML fragments <http://www.whatwg.org/specs/web-apps/current-work/#parsing-html-fragments>.
Encoding Standard <http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>.
Wakaba <w@suika.fam.cx>.
Copyright 2007-2012 Wakaba <w@suika.fam.cx>.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.