=head1 NAME

What::HTML - An HTML Parser

=head1 SYNOPSIS

  use What::HTML;

  my $s = q<<!DOCTYPE html><html>...</html>>;
  # $doc = an empty DOM |Document| object
  my $onerror = sub {
    my $error_code = shift;
    warn $error_code, "\n";
  };

  What::HTML->parse_string ($s => $doc, $onerror);

  ## Then, |$doc| is the DOM representation of |$s|.

=head1 DESCRIPTION

The C<What::HTML> module is an experimental implementation
of the HTML5 parsing specification.

=head1 METHODS

=over 4

=item [I<$doc> =] What::HTML->parse_string (I<$s>, I<$doc>[, I<$onerror>]);

Parse a string I<$s> as an HTML document.

The first argument, I<$s>, MUST be a string.  It is parsed
as a sequence of characters representing an HTML document.

The second argument, I<$doc>, MUST be an empty read-write
DOM C<Document> object.  The HTML DOM tree is constructed
onto this C<Document> object.

The third argument, I<$onerror>, MUST be a reference to
the error handler code.  Whenever a parse error is detected,
this code is invoked with one argument, a string that might
describe what is wrong; its content is not meant to be relied upon.
The code MAY throw an exception to abort the whole parsing
process.  Otherwise, the parser continues to process the input.
The code MUST NOT modify I<$s> or I<$doc>; if it does, the
result is undefined.  This argument is optional; if it is
omitted, the string is passed to C<warn> for each parse error.

The method returns the DOM C<Document> object (i.e. the second argument).

Note that the C<What::NanoDOM> module provides a non-conforming
DOM implementation that supports only the subset needed for
C<What::HTML>'s parsing and serialization.
With this module, a new HTML C<Document> object can be created
from a string containing an HTML document as follows:

  use What::HTML;
  use What::NanoDOM;
  my $doc = What::HTML->parse_string ($s => What::NanoDOM->new, $onerror);

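The error handler contract described above (invoked once per parse
error, allowed to throw an exception to abort) might be exercised as in
the following minimal sketch.  The C<make_onerror> helper and its error
limit are hypothetical names introduced only for illustration, and the
handler is called directly here as a stand-in for the parser, so the
sketch runs without C<What::HTML> installed:

  use strict;
  use warnings;

  # Hypothetical helper (not part of What::HTML): builds an $onerror
  # callback that records each error string and dies once more than
  # $limit errors have been recorded.
  sub make_onerror {
    my ($limit, $errors) = @_;
    return sub {
      my $error = shift;
      push @$errors, $error;
      die "too many parse errors\n" if @$errors > $limit;
    };
  }

  my @errors;
  my $onerror = make_onerror (2, \@errors);

  # Stand-in for the parser: report three errors to the handler.
  # The third call exceeds the limit, so the handler dies and the
  # |eval| block returns a false value.
  my $ok = eval { $onerror->("error $_") for 1 .. 3; 1 };
  print "aborted after ", scalar @errors, " errors\n" unless $ok;

In real use, the same C<$onerror> would be passed as the third argument
of C<parse_string>, with the call wrapped in C<eval> so that the
exception thrown by the handler can be caught by the caller.
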
=item I<$s> = What::HTML->get_inner_html (I<$node>[, I<$onerror>]);

Return the HTML serialization of a DOM node I<$node>.

The first argument, I<$node>, MUST be a DOM C<Document>,
C<Node>, or C<DocumentFragment> object.

The second argument, I<$onerror>, MUST be a reference to the
error handling code.  This code is invoked when a descendant
of I<$node> is not an C<Element>, C<Text>, C<CDATASection>,
C<Comment>, C<DocumentType>, or C<EntityReference> node, that is,
in a case where C<INVALID_STATE_ERR> MUST be thrown.
The code is invoked with one argument, the node
whose type is invalid.
This argument is optional; if it is omitted, any such
node is simply ignored.

The method returns the C<inner_html> attribute
value, i.e. the HTML serialization of I<$node>.

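As a rough sketch of how parsing and serialization might be combined,
the following example parses a string and serializes the result again,
collecting any nodes reported to the error handler.  It assumes that
C<What::HTML> and C<What::NanoDOM> are installed and behave as
documented here; the C<@invalid> array and the C<$status> variable are
illustrative names only.  If the modules cannot be loaded, the example
merely reports that it was skipped:

  use strict;
  use warnings;

  my $status;
  # Assumption: both modules are available; otherwise skip the example.
  if (eval { require What::HTML; require What::NanoDOM; 1 }) {
    my $s = '<!DOCTYPE html><html><head><title>t</title></head>'
          . '<body><p>Hello</p></body></html>';
    my $doc = What::HTML->parse_string ($s => What::NanoDOM->new);

    # Collect nodes that cannot be serialized instead of ignoring them.
    my @invalid;
    my $html = What::HTML->get_inner_html
        ($doc, sub { push @invalid, shift });

    print $html, "\n";
    warn scalar (@invalid), " node(s) of an unexpected type\n" if @invalid;
    $status = 'serialized';
  } else {
    $status = 'skipped';
    warn "What::HTML or What::NanoDOM is not available\n";
  }
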
=back

=head1 TODO

The tokenizer should emit a sequence of character tokens as a
single token to improve performance.

A method that accepts a byte stream as input.

Charset detection algorithm.

Setting C<inner_html>.

There are also many "TODO"s and "ISSUE"s in the source code.

=head1 SEE ALSO

Web Applications 1.0 Working Draft (aka HTML5)
<http://whatwg.org/html5>.  (Revision 792, 1 May 2007)

L<What::NanoDOM>

=head1 AUTHOR

Wakaba <w@suika.fam.cx>.

=head1 LICENSE

Copyright 2007 Wakaba <w@suika.fam.cx>

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

# $Date: 2007/04/28 14:31:34 $