Whatpm::HTML::Parser - An HTML parser
    use Whatpm::HTML::Parser;
    use Message::DOM::DOMImplementation;

    $parser = Whatpm::HTML::Parser->new;
    $dom = Message::DOM::DOMImplementation->new;
    $doc = $dom->create_document;

    $parser->parse_char_string ($chars => $doc);
    $parser->parse_byte_string ($encoding, $bytes => $doc);

    ## Or, just use DOM attribute:
    $doc->manakai_is_html (1);
    $doc->inner_html ($chars);
The Whatpm::HTML::Parser module is an implementation of the HTML parser. It implements the HTML parsing algorithm as defined by the HTML Living Standard. Therefore, its parsing behavior is fully compatible with Web browsers that have an HTML5 parser enabled.
Where possible, it is recommended to use a standard DOM interface, such as the inner_html method of the Document object, to parse an HTML string. The Whatpm::HTML::Parser module, which is in fact used to implement the inner_html method, offers more control over how the parser behaves. That control is unlikely to be useful unless you are writing a complex user agent such as a browser or a validator.
The Whatpm::HTML::Parser module provides the following methods:
$parser = Whatpm::HTML::Parser->new
Create a new parser.
$parser->parse_char_string ($chars => $doc)
Parse a string of characters (i.e. a possibly utf8-flagged string) as HTML and construct the DOM tree.
The first argument to the method must be the string to parse. It may or may not be a valid HTML document.
The second argument to the method must be a DOM Document object (Message::DOM::Document). Any child nodes of the document are removed by the parser first.
$parser->parse_byte_string ($encoding, $bytes => $doc)
Parse a string of bytes as HTML and construct the DOM tree.
The first argument to the method must be the label of a (character) encoding, as specified by the Encoding Standard. The undef value can be specified if the encoding is not known.
The second argument to the method must be the string to parse. It may or may not be a valid HTML document.
The third argument to the method must be a DOM Document object (Message::DOM::Document). Any child nodes of the document are removed by the parser first.
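The choice between the two parse methods comes down to whether the input has already been decoded. The following sketch uses only the core Encode module, not this parser, to show the same markup once as a byte string and once as a character string. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might be read from a file or a socket: the "e with acute"
# is the two-octet UTF-8 sequence C3 A9.
my $bytes = "<p>caf\xC3\xA9</p>";

# Decoding produces a character string, in which the same letter is a
# single character (U+00E9).
my $chars = decode ('UTF-8', $bytes);

printf "bytes: %d octets\n", length $bytes;
printf "chars: %d characters\n", length $chars;
```

A string like $bytes would be passed to parse_byte_string, with "utf-8" as the encoding label or undef if the encoding is not known, while $chars would be passed to parse_char_string.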
Parse a string of characters in the context of a node. If the node is a Document, this is equivalent to the parse_char_string method. If the node is an Element, parsing is performed in the fragment mode.
The first argument to the method must be a DOM Node object (Message::DOM::Node) that is also a Document (Message::DOM::Document) or an Element (Message::DOM::Element). The node gives the context to the parser and receives the parsed subtree. Any existing child nodes of the node are removed first.
The second argument to the method must be a string of characters.
Get or set the error handler for the parser. Parse errors, as well as warnings and informational messages, are reported to the handler. See Whatpm::Errors for more information.
The parsed document structure is reflected in the Document object passed as an argument to the parse methods. The character encoding used to parse the document can be retrieved with the input_encoding method of the Document.
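As a hedged illustration, the sketch below parses a byte string without an encoding label and then asks the Document which encoding was used. It uses only calls that appear in this document. The helper name detect_html_encoding is hypothetical, and the modules are loaded at run time so that the helper returns undef where Whatpm or Message::DOM is not installed. The value reported depends on the parser's encoding detection and is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: parse $bytes as HTML without an encoding label and
# return whatever the Document reports as its input encoding.
sub detect_html_encoding {
  my ($bytes) = @_;

  # Load the modules at run time; give up quietly if they are missing.
  my $loaded = eval {
    require Whatpm::HTML::Parser;
    require Message::DOM::DOMImplementation;
    1;
  };
  return unless $loaded;

  my $parser = Whatpm::HTML::Parser->new;
  my $doc = Message::DOM::DOMImplementation->new->create_document;

  # undef as the first argument means the encoding is not known.
  $parser->parse_byte_string (undef, $bytes => $doc);

  return $doc->input_encoding;
}

my $encoding = detect_html_encoding
    ('<!DOCTYPE html><meta charset="utf-8"><p>Hello</p>');
if (defined $encoding) {
  print "parsed as: $encoding\n";
} else {
  print "encoding not available\n";
}
```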
Although the parser is intended to be fully conformant to the HTML Living Standard, it might not yet implement the latest changes to the specification. See the list of bugs on the HTML parser <http://manakai.g.hatena.ne.jp/task/2/> for the current implementation status.
the Message::DOM::Document manpage, the Message::DOM::Element manpage.
the Whatpm::HTML::Serializer manpage.
the Whatpm::ContentChecker manpage.
the Whatpm::XML::Parser manpage.
HTML Living Standard - Parsing HTML documents <http://www.whatwg.org/specs/web-apps/current-work/#parsing>.
HTML Living Standard - Parsing HTML fragments <http://www.whatwg.org/specs/web-apps/current-work/#parsing-html-fragments>.
Encoding Standard <http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>.
Wakaba <w@suika.fam.cx>.
Copyright 2007-2012 Wakaba <w@suika.fam.cx>.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.