Whatpm::HTML - An HTML Parser
use Whatpm::HTML;
my $s = q<<!DOCTYPE html><html>...</html>>;
# $doc = an empty DOM |Document| object
my $onerror = sub {
  my $error_code = shift;
  warn $error_code, "\n";
};
Whatpm::HTML->parse_string ($s => $doc, $onerror);
## Then, |$doc| is the DOM representation of |$s|.
The Whatpm::HTML module contains an HTML parser and an HTML serializer.
The HTML parser can be used to construct the DOM tree representation of an HTML document. Parsing and tree construction are done as described in the Web Applications 1.0 specification.
The HTML serializer can be used to obtain the HTML document representation
of a DOM tree (or a tree fragment thereof). The serialization
is performed as described in the Web Applications 1.0 specification
for the innerHTML DOM attribute.
This module is part of Whatpm - Perl Modules for Web Hypertext Application Technologies.
The first argument of parse_string, $s, MUST be a string. It is parsed as a sequence of characters representing an HTML document.
The second argument, $doc, MUST be an empty read-write
DOM Document object. The HTML DOM tree is constructed
onto this Document object.
The third argument, $onerror, MUST be a reference to
the error handler code. Whenever a parse error is detected,
this code is invoked with one argument, a string that
might describe what is wrong (it is not guaranteed to be useful).
The code MAY throw an exception, so that the whole parsing
process aborts. Otherwise, the parser will continue to
process the input. The code MUST NOT modify $s or $doc.
If it does, then the result is undefined.
This argument is optional; if it is omitted, the string for
each parse error is emitted as a warning.
NOTE: To be a conforming user agent, the code MUST either abort the processing by throwing an exception at the first invocation or MUST continue the processing until the parser stops.
The method returns the DOM Document object (i.e. the second argument).
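The two error handling policies described above (continue after every error, or abort at the first one) might be sketched as follows. This is only an illustration: the names $collect and $abort and the error strings are made up for the example, and the handlers are called directly here rather than by the parser, so that the snippet runs without Whatpm installed.

```perl
use strict;
use warnings;

# Policy 1: record every parse error and let the parser continue.
my @errors;
my $collect = sub {
  my $error_code = shift;   # the string passed in by the parser
  push @errors, $error_code;
};

# Policy 2: abort the whole parsing process at the first parse error
# by throwing an exception.
my $abort = sub {
  my $error_code = shift;
  die "parse error: $error_code\n";
};

# Stand-alone demonstration.  The error strings below are hypothetical;
# real strings are supplied by the parser.
$collect->($_) for 'hypothetical-error-1', 'hypothetical-error-2';
printf "%d error(s) recorded\n", scalar @errors;

# An aborting handler is normally paired with eval in the caller.
my $ok = eval { $abort->('hypothetical-error-1'); 1 };
print "aborted: $@" unless $ok;
```

Either handler can be passed as the third argument, as in Whatpm::HTML->parse_string ($s => $doc, $collect). Neither of them touches $s or $doc, in line with the requirement above.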
Note that the Whatpm::NanoDOM module provides a non-conforming
implementation of DOM that only implements a subset that
is necessary for the purpose of Whatpm::HTML's parsing and
serializing.
With this module, creating a new HTML Document object
from a string containing an HTML document might be coded as:
use Whatpm::HTML;
use Whatpm::NanoDOM;
my $doc = Whatpm::HTML->parse_string
($s => Whatpm::NanoDOM::Document->new, $onerror);
For the serializer method, the first argument, $node, MUST be
a DOM Document, Element, or DocumentFragment node.
The second argument, $onerror, MUST be a reference to the
error handling code. This code will be invoked if a descendant
of $node is not an Element, Text, CDATASection,
Comment, DocumentType, or EntityReference node, that is,
in the case for which an INVALID_STATE_ERR exception MUST be thrown.
The code will be invoked with an argument, which is the node
whose type is invalid.
The argument $onerror is optional; if missing, any erroneous
node is simply ignored.
The method returns a reference to the inner_html attribute
value, i.e. the HTML serialization of $node.
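An error handler for the serializer might simply collect the offending nodes so that the caller can inspect them afterwards. The sketch below is an assumption-laden illustration: the handler is called directly with a stand-in object of a hypothetical class, whereas in real use the serializer passes the DOM node whose type is invalid.

```perl
use strict;
use warnings;

# Collect every node that the serializer reports as having an invalid type.
my @invalid_nodes;
my $onerror = sub {
  my $node = shift;   # the node whose type is invalid
  push @invalid_nodes, $node;
};

# Stand-in for an unsupported node.  The class name is hypothetical;
# a real handler receives a DOM node from the serializer.
my $fake_node = bless {}, 'Hypothetical::ProcessingInstruction';
$onerror->($fake_node);

printf "%d unsupported node(s); the first is a %s\n",
    scalar @invalid_nodes, ref $invalid_nodes[0];
```

Since the returned value is a reference to the serialization, the caller would dereference it (for example with $$result, where $result is whatever name the caller chose) to obtain the HTML string.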
Tokenizer should emit a sequence of character tokens as one token to improve performance.
A method that accepts a byte stream as an input.
Charset detection algorithm.
Documentation for the setter of inner_html.
And there are many ``TODO''s and ``ISSUE''s in the source code.
Whatpm <http://suika.fam.cx/www/markup/html/whatpm/readme>
Web Applications 1.0 Working Draft (aka HTML5) <http://whatwg.org/html5>. (Revision 792, 1 May 2007)
Wakaba <w@suika.fam.cx>.
Copyright 2007 Wakaba <w@suika.fam.cx>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.