Whatpm — Perl modules for Web hypertext application technologies (beta)

Introduction

Whatpm is a work-in-progress set of Perl modules for Web hypertext application technologies. It is part of the manakai project.

Whatpm supports various Web standard technologies, including HTML, XHTML, XML, CSS, HTTP, and URL.

What's new?

An Atom feed for ChangeLog diffs is available.

Modules

Note that all of these modules are work in progress and have a number of unresolved problems.

Note also that some modules have no documentation yet.

Modules for HTML and XML

Modules related to HTML and XHTML are as follows:

Whatpm::HTML
An implementation of the HTML5 document and fragment parsing algorithms. It can be used to convert an arbitrary string into a DOM. (See also demo.)
Whatpm::HTML::Serializer
An implementation of the HTML5 fragment serialization algorithm. (See also demo.)
Whatpm::HTMLTable
An implementation of the HTML5 table algorithm. It can be used to extract a table structure from a DOM table element node. (See also demo.)

Modules for XML support are as follows:

Whatpm::XML::Parser

An XML parser with non-draconian error handling. It can construct a DOM tree from XML 1.0/1.1 documents that do not rely on external entities (including the external subset entity) and that do not contain a general entity reference to an entity whose replacement text contains the character & or <. It also supports XML namespaces.

It does not stop constructing the DOM tree even if it detects a well-formedness or namespace well-formedness error. It recovers from errors in a manner similar to HTML5's tokenization algorithm. It is expected that the combination of this module and a future extension to the Whatpm::ContentChecker framework will provide a means to detect all well-formedness and validity errors, if desired.

(See also demo.)

Whatpm::XMLSerializer
A simple XML serializer. It performs namespace prefix fixups and is suitable for serializing a carefully built XML DOM tree. It does not guarantee that the output is well-formed.

The module for conformance checking of a DOM tree (i.e. an in-memory representation of an HTML or XML document) is as follows:

Whatpm::ContentChecker
A DOM5 HTML (in-memory representation of a document) conformance checker with partial support for Atom 1.0. (See also demo and application.)

These modules need a DOM implementation that supports manakai's Perl binding of DOM in order to represent a document in memory. The manakai-core package contains such an implementation, Message::DOM::DOMImplementation, but it should also be possible to use any other implementation that supports the binding.

Modules for CSS

Modules for CSS and related technologies are as follows:

Whatpm::CSS::Cascade
A media-independent implementation of CSS cascading and value computations. (See also demo.)
Whatpm::CSS::MediaQueryParser
A media query parser. Note that only CSS 2.1 media types are supported at the moment.
Whatpm::CSS::MediaQuerySerializer
A media query serializer. Note that only CSS 2.1 media types are supported at the moment.
Whatpm::CSS::Parser
A CSS parser that constructs CSSOM trees from style sheets. (See also demo.)
Whatpm::CSS::SelectorsParser
A parser for groups of selectors. (See also demo.)
Whatpm::CSS::SelectorsSerializer
A serializer for groups of selectors. (See also specification and demo.)
Whatpm::CSS::Tokenizer
A CSS tokenizer. (See also demo.)

To represent a CSSOM tree, the Whatpm::CSS::Parser module uses modules in the manakai-core package. Those modules also provide the serializer for the CSSOM tree, in the form of the standard css_text CSSOM attribute.

Modules for HTTP

Modules for HTTP and related technologies are as follows:

Whatpm::ContentType
An implementation of the HTML5 Content Type sniffing algorithm.
Whatpm::IMTChecker
An Internet Media Type (aka MIME type) label conformance checker.

Support for parsing HTTP headers and the like is not yet available.

Module for URL

The module for URL support is as follows:

Whatpm::URIChecker
An IRI reference conformance checker.

Support for HTML5's realistic definition of URL is not available yet.

Modules for other technologies

The following modules provide support for other Web-related technologies:

Whatpm::CacheManifest
An HTML5 cache manifest parser.
Whatpm::Charset::DecodeHandle
A filehandle-like wrapper interface for decoding a byte stream encoded in some character encoding.
Whatpm::Charset::UnicodeChecker
A Unicode character string checker.
Whatpm::Charset::UniversalCharDet
A Perl interface to the universalchardet character encoding detection library.
Whatpm::LangTag
A language tag parser and conformance checker, supporting both the older RFC 3066 definition and the latest RFC 4646 definition. (See also demo.)
Whatpm::RDFXML
An implementation of RDF/XML by which RDF triples can be extracted from RDF/XML documents.
Whatpm::WebIDL
A WebIDL fragment parser. It parses an IDL fragment, whether conforming or not, and constructs a DOM-like object model for further processing. A non-conforming (or broken) IDL fragment-like string is parsed using CSS-like error-tolerant parsing rules, e.g. ignoring anything until the next ; character.
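
The "ignore anything until the next ; character" style of error recovery mentioned for Whatpm::WebIDL can be illustrated with a small, self-contained sketch. This is not the module's actual code or grammar: the function name parse_tolerantly, the toy "const NAME = DIGITS;" statement form, and the shape of the error callback are all assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical illustration, not Whatpm::WebIDL's implementation: accept
# only statements of the form "const NAME = DIGITS;". On anything else,
# report an error and skip ahead to just past the next ";".
sub parse_tolerantly {
    my ($input, $onerror) = @_;
    my @constants;
    pos($input) = 0;
    while (1) {
        $input =~ /\G\s+/gc;                       # skip whitespace, if any
        last if pos($input) >= length $input;      # end of input
        if ($input =~ /\Gconst\s+(\w+)\s*=\s*(\d+)\s*;/gc) {
            push @constants, { name => $1, value => $2 };
        } else {
            my $start = pos($input);
            # Recovery: discard everything up to and including the next ";"
            # (or the rest of the input if there is none).
            $input =~ /\G[^;]*;?/gc;
            $onerror->($start, substr($input, $start, pos($input) - $start));
        }
    }
    return \@constants;
}

# Usage: the misspelled second statement is reported and skipped, and
# parsing resumes with the third statement.
my @errors;
my $result = parse_tolerantly(
    'const A = 1; cnost B = 2; const C = 3;',
    sub { push @errors, [@_] },
);
print join(', ', map { "$_->{name}=$_->{value}" } @$result), "\n";
print scalar(@errors), " statement(s) skipped\n";
```

Because recovery always consumes at least one character, the loop cannot stall on broken input, and a single bad statement does not prevent later statements from being parsed.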

How to use modules

The modules listed above, which are included in the Whatpm package, can be used by loading them directly with use or require and then invoking their native interfaces. For more information on those native interfaces, see the documentation and the source code of those modules.
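
As a minimal sketch of loading modules at run time, the following checks whether some of the modules named above can be loaded with require. The helper module_available is hypothetical and not part of Whatpm. The sketch deliberately stops after loading, because the native interfaces are described only in each module's own documentation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Whatpm): try to load a module by name
# at run time and report whether loading succeeded.
sub module_available {
    my ($name) = @_;
    # Accept only plain package names such as "Whatpm::HTML".
    return 0 unless $name =~ /\A[A-Za-z_]\w*(?:::\w+)*\z/;
    (my $file = "$name.pm") =~ s{::}{/}g;      # Whatpm::HTML -> Whatpm/HTML.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(Whatpm::HTML Whatpm::CSS::Parser Whatpm::URIChecker)) {
    printf "%-22s %s\n", $module,
        module_available($module) ? 'loaded' : 'not available';
}
```

When a module is known to be installed, a plain "use Whatpm::HTML;" at the top of the program is the simpler choice. The run-time check is mainly useful for optional features.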

In addition, some of the functionality provided by those modules can be accessed via standardized DOM interfaces implemented by modules included in the manakai-core package. See the documentation of the Message::DOM::DOMImplementation module for how to access the DOM interfaces.

The table below summarizes the relationship between Whatpm modules and DOM methods/attributes implemented by manakai-core modules:

Whatpm module DOM methods/attributes
Whatpm::CSS::Cascade get_computed_style (ViewCSS), current_style (ElementCSS)
Whatpm::CSS::Parser CSSStyleDeclaration's attributes and methods, css_text (CSSOM interfaces)
Whatpm::CSS::Serializer
Whatpm::CSS::SelectorsParser query_selector, query_selector_all (DocumentSelector, ElementSelector)
selector_text (CSSStyleRule)
Whatpm::CSS::SelectorsSerializer
Whatpm::HTML inner_html (HTMLDocument, Element)
Whatpm::HTML::Serializer
Whatpm::XML::Parser
Whatpm::XMLSerializer

Documents

For a description of the functionality provided by each module, see the pod documentation of the module. HTML versions of the pod documentation are linked from the list of modules above.

There are also documents on some specific topics:

Standards supported by WebHACC
List and description of the Web standards supported by the WebHACC conformance checker. Although it documents WebHACC, it also applies to Whatpm in general (note that WebHACC is an interactive user interface for the conformance checking feature provided by Whatpm).
List of error types
Description of the errors reported to callback functions by Whatpm modules.
Selectors object
Description of the data structure for Selectors, as produced by Whatpm::CSS::SelectorsParser (as output) and consumed by Whatpm::CSS::SelectorsSerializer (as input).
List of predefined user data names
List of user data names defined by Whatpm modules.
Handle objects
Description of character or byte stream input handle interfaces.

The following specifications define Whatpm-specific formats and extensions:

SSFT Specification
The specification for the serialization format used for testing Selectors-related modules.
manakai's CSS extensions
The specification for -manakai-* properties and property values implemented by CSS-related modules.
manakai's Selectors extensions
The specification for :-manakai-* pseudo-classes implemented by Selectors-related modules.

Demo

Applications

See also a list of applications using modules in the manakai-core package; some of them indirectly use Whatpm modules via DOM interfaces provided by manakai-core.

Dependency

Perl 5.8 or later
A recent stable release of Perl 5.8 (or later) is recommended.
Some modules require the Encode modules, which are part of the standard Perl distribution.
Modules from manakai-core
Error
Module Whatpm::HTML requires Error, which is bundled in manakai-core.
Message::IMT::InternetMediaType
Module Whatpm::IMTChecker depends on Message::IMT::InternetMediaType, which is part of manakai-core.
Message::URI::URIReference
Modules Whatpm::URIChecker and Whatpm::CacheManifest depend on Message::URI::URIReference, which is part of manakai-core.
Message::Charset::Info
Module Whatpm::ContentChecker depends on Message::Charset::Info, which is part of manakai-core.
Message::DOM::DOMImplementation
Module Whatpm::URIChecker depends on Message::DOM::DOMImplementation, which is part of manakai-core.
Message::DOM::DOMImplementation and related modules
Testing of the Whatpm::ContentChecker module depends on Message::DOM::DOMImplementation and related modules in manakai-core. They are not required for any practical use of that module.
manakai charlib
Module Whatpm::Charset::DecodeHandle depends on modules in manakai charlib for decoding of Japanese character encodings. See the documentation for manakai charlib for more information.
Python, Perl Inline::Python module, and Universal Encoding Detector
For the Whatpm::Charset::UniversalCharDet module to be useful, this software must be installed on the system. See the documentation for more information.
JSON
Testing of the Whatpm::HTML and Whatpm::CSS::Tokenizer modules depends on JSON and related modules. They are not required for any practical use of those modules.

Distribution

The development version of Whatpm may be found in the CVS repository.

The latest development version of Whatpm is also available as a tarball.

TO DO

See also the bug tracking system.

Acknowledgments

Thanks to the html5lib team for their HTML5 parser test data.

Author

.

License

Copyright 2007‐2009 Wakaba <>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.