manakai

Whatpm::LangTag

Language Tag Parsing, Conformance Checking, and Normalization

SYNOPSIS

  use Whatpm::LangTag;
  
  $parsed = Whatpm::LangTag->parse_rfc5646_tag ($tag, $onerror);
  Whatpm::LangTag->check_rfc5646_parsed_tag ($parsed, $onerror);
  $tag = Whatpm::LangTag->normalize_rfc5646_tag ($tag);

DESCRIPTION

The Whatpm::LangTag module contains methods to handle language tags as defined by BCP 47. It can be used to parse, validate, or normalize language tags according to relevant standard.

METHODS

For the following strings, if an input or output is a language tag or a language range, it is a character string (or possibly utf8 flagged string of characters), not a byte string. Although language tags and ranges are specified as a string of ASCII characters, illegal tags and ranges can always contain any non-ASCII characters.

PARSING

$parsed = Whatpm::LangTag->parse_tag ($tag, $onerror)

Parses a language tag into subtags. This method interprets the language tag using the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$parsed = Whatpm::LangTag->parse_rfc5646_tag ($tag, $onerror)

Parses a language tag into subtags. This method interprets the language tag using the definition in RFC 5646.

Any errors and warnings would be reported to the code refeference specified as the second argument.

$parsed = Whatpm::LangTag->parse_rfc4646_tag ($tag, $onerror)

Parses a language tag into subtags. This method interprets the language tag using the definition in RFC 4646.

Any errors and warnings would be reported to the code refeference specified as the second argument.

These methods return a hash reference, which contains one or more key-value pairs from the following list:

language (string)

The language subtag. There is always a language subtag, even if the input is illegal, unless there is grandfathered tag. E.g. 'ja' for input ja-JP.

extlang (arrayref of strings)

The extlang subtags. E.g. 'yue' for input zh-yue.

script (string or undef)

The script subtag. E.g. 'Latn' for input ja-Latn-JP.

region (string or undef)

The region subtag. E.g. 'JP' for input en-JP.

variant (arrayref of strings)

The variant subtags. E.g. ['fonipa'] for input en-JP-fonipa.

extension (arrayref of arrayrefs of strings)

The extension subtags. E.g. [['u', 'islamCal']] for input en-US-u-islamCal.

privateuse (arrayref of strings)

The privateuse subtags. E.g. ['x', 'pig', 'latin'] for input x-pig-latin.

illegal (arrayref of strings)

Illegal (syntactically non-conforming) string fragments. E.g. ['1234', 'xyz', 'abc'] for input 1234-xyz-abc.

grandfathered (string or undef)

"Grandfathered" language tag. E.g. 'i-default' for input i-default.

u

If the tag contains a u extension, parse result of the extension is contained here. The value is an array reference of array references of strings. The first inner array reference contains the attributes in the extension. The remaining inner array references, if any, represents the keywords (i.e. the key-type pairs) in the extension in order. E.g. [[], ['ca', 'japanese'], ['va', '0061', '0061']] for input ja-u-ca-japanese-va-0061-0061.

SERIALIZATION

$tag = Whatpm::LangTag->serialize_parsed_tag ($parsed_tag)

Convert a parsed language tag into a language tag string. The argument must be a parsed tag as defined in the previous section; a broken value would not be processed properly.

If the given parsed tag does not represent a well-formed language tag, the result string would not be a well-formed language tag.

CONFORMANCE CHECKING

Whatpm::LangTag->check_parsed_tag ($parsed, $onerror)

Checks for conformance errors in the parsed language tag, against the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

Whatpm::LangTag->check_rfc5646_parsed_tag ($parsed, $onerror)

Checks for conformance errors in the parsed language tag, against RFC 5646.

This method does not report any parse erros, as this method receives a parsed language tag.

Any errors and warnings would be reported to the code refeference specified as the second argument.

The method returns a hash reference with two keys: well-formed and valid. They represent whether the given language tag is well-formed or valid or not as per RFC 5646.

Whatpm::LangTag->check_rfc4646_parsed_tag ($parsed, $onerror)

Checks for conformance errors in the parsed language tag, against RFC 4646.

This method does not report any parse erros, as this method receives a parsed language tag.

Any errors and warnings would be reported to the code refeference specified as the second argument.

The method returns a hash reference with two keys: well-formed and valid. They represent whether the given language tag is well-formed or valid or not as per RFC 4646.

Whatpm::LangTag->check_rfc3066_tag ($tag, $onerror)

Parses and checks for conformance errors in the parsed language tag, against RFC 3066.

Any errors and warnings would be reported to the code refeference specified as the second argument.

Whatpm::LangTag->check_rfc1766_tag ($tag, $onerror)

Parses and checks for conformance errors in the parsed language tag, against RFC 1766.

Any errors and warnings would be reported to the code refeference specified as the second argument.

Note that specs sometimes contain semantic or contextual conformance rules, such as: "strongly RECOMMENDED that users not define their own rules for language tag choice" (RFC 4646 4.1.), "Subtags SHOULD only be used where they add useful distinguishing information" (RFC 4646 4.1.), and "Use as precise a tag as possible, but no more specific than is justified" (RFC 4646 4.1. 1.). These kinds of requirements cannot be tested without human interpretation, and therefore the methods in this module do not (or cannot) try to detect violation to these rules.

NORMALIZATION

$tag = Whatpm::LangTag->normalize_tag ($tag_orig)

Normalize the language tag by folding cases, following the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$tag = Whatpm::LangTag->normalize_rfc5646_tag ($tag_orig)

Normalize the language tag by folding cases, following RFC 5646 2.1. and 2.2.6. Note that this method does not replace any subtag into its preferred alternative; this method does not rearrange ordering of subtags.

Although this method does not completely convert language tags into their canonical form, its result will be good enough for comparison in most usual situations.

$tag = Whatpm::LangTag->canonicalize_tag ($tag_orig)

Normalize the language tag into its canonicalized form, as per the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$tag = Whatpm::LangTag->canonicalize_rfc5646_tag ($tag_orig)

Normalize the language tag into its canonicalized form, as per RFC 5646 4.5. That is, replace any subtag into its Preferred-Value form if possible and sort any extension subtags. Note that this method does NOT do any case folding. In addition, the "canonicalized form" of a langauge tag is not necessary a fully canonicalized form at all - for example, variant subtags might not be in the recommended order.

Note that if the input is not a well-formed language tag according to RFC 5646, the result string might not be a well-formed language tag as well. Sometimes the canonicalization would turn a valid langauge tag into an invalid language tag.

$tag = Whatpm::LangTag->to_extlang_form_tag ($tag_orig)

Normalize the language tag into its extlang form, as per the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$tag = Whatpm::LangTag->to_extlang_form_rfc5646_tag ($tag_orig)

Normalize the language tag into its extlang form, as per RFC 5646 4.5. The extlang form is same as the canonicalized form, except that use of extlang subtags is preferred to language-only (or extlang-free) representation.

Note that if the input is not a well-formed language tag according to RFC 5646, the result string might not be a well-formed language tag as well. Sometimes the canonicalization would turn a valid langauge tag into an invalid language tag.

COMPARISON

BOOL = Whatpm::LangTag->basic_filtering_range ($range, $tag)

Compares a basic language range to a language tag, according to the latest version of the language range specification. At the time of writing, the latest version is RFC 4645.

BOOL = Whatpm::LangTag->basic_filtering_rfc4647_range ($range, $tag)

Compares a basic language range to a language tag, according to RFC 4647 Section 3.3.1. This method returns whether the range matches to the tag or not.

A basic language range is either a language tag or *. (For more information, see RFC 4647 Section 2.1.).

BOOL = Whatpm::LangTag->match_rfc3066_range ($range, $tag)

Compares a language-range to a language tag according to RFC 3066 Section 2.5. This method returns whether the range matches to the tag or not. Note that RFC 3066 is obsoleted by RFC 4647.

A language range is either a language tag or *. (For more information, see RFC 3066 2.5).

Note that this method is equivalent to basic_filtering_rfc4647_range by definition.

BOOL = Whatpm::LangTag->extended_filtering_range ($range, $tag)

Compares an extended language range to a language tag, according to the latest version of the language range specification. At the time of writing, the latest version is RFC 4647.

BOOL = Whatpm::LangTag->extended_filtering_rfc4647_range ($range, $tag)

Compares an extended language range to a language tag, according to RFC 4647 Section 3.3.2. This method returns whether the range matches to the tag or not.

An extended language range is a language tag whose subtags can be *s. (For more information, see RFC 4647 Section 2.2.).

ERRORS

For methods with argument $onerror, any error and warning detected during the parsing or conformance checking would be reporeted by invoking the specified code reference with the description of the error or warning.

Following name-value pairs describing the error are given to the code reference as arguments:

type (string, always specified)

A short string describing the kind of the error. Descriptions of error types are available at <https://suika.suikawiki.org/gate/2007/html/error-description#{type}>, where {type} is an error type string.

For the list of error types, see <https://suika.suikawiki.org/gate/2007/html/error-description#langtag-errors>.

text (string, possibily missing)

A short string, which arguments the error type. Its semantics depends on the error type.

value (string, possibly missing)

A part of the input, in which an error is detected.

level (string, always specified)

A character representing the level or severity of the error, which is one of the following characters: m (violation to a MUST-level requirement), s (violation to a SHOULD-level requirement), w (a warning), and i (an informational notification).

SEE ALSO

RFC 1766 <http://tools.ietf.org/html/rfc1766>.

RFC 3066 <http://tools.ietf.org/html/rfc3066>.

RFC 4646 <http://tools.ietf.org/html/rfc4646>.

RFC 4647 <http://tools.ietf.org/html/rfc4647>.

RFC 5646 <http://tools.ietf.org/html/rfc5646>.

IANA Language Subtag Registry <http://www.iana.org/assignments/language-subtag-registry>.

Language Tag Extensions Registry <http://www.iana.org/assignments/language-tag-extensions-registry>.

UTS #35: Unicode Locale Data Markup Language <http://unicode.org/reports/tr35/>.

Unicode Locale Extension (ā€˜uā€™) for BCP 47 <http://cldr.unicode.org/index/bcp47-extension>, <http://unicode.org/repos/cldr/trunk/common/bcp47/>.

SuikaWiki:Language Tags <https://suika.suikawiki.org/~wakaba/wiki/sw/n/language%20tags>

HISTORY

2007-09-09

First version.

2011-09-24

Implemented RFC 5646. Implemented comparison. Implemented RFC 1766.

2011-10-01

Implemented the u extension.

2011-10-02

Implemented full validation of RFC 3066 and RFC 1766 language tags. Added unversioned aliases for operations.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2011 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.