PERLUNICOOK(1) Perl Programmers Reference Guide PERLUNICOOK(1)
NAME
perlunicook - cookbookish examples of handling Unicode in Perl
DESCRIPTION
This manpage contains short recipes demonstrating how to handle common
Unicode operations in Perl, plus one complete program at the end. Any
undeclared variables in individual recipes are assumed to have been
given an appropriate value earlier.
EXAMPLES
℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the "#!" adjusted to work on your system:
#!/usr/bin/env perl
use v5.36; # or later to get "unicode_strings" feature,
# plus strict, warnings
use utf8; # so literals and identifiers can be in UTF-8
use warnings qw(FATAL utf8); # fatalize encoding glitches
use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short); # unneeded in v5.16
This does make even Unix programmers "binmode" your binary streams, or
open them with ":raw", but that's the only way to get at them portably
anyway.
WARNING: "use autodie" (pre 2.26) and "use open" do not get along with
each other.
℞ 1: Generic Unicode-savvy filter

Always decompose on the way in, then recompose on the way out.
use Unicode::Normalize;
while (<>) {
    $_ = NFD($_);       # decompose + reorder canonically
    ...
} continue {
    print NFC($_);      # recompose (where possible) + reorder canonically
}
℞ 2: Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.
use v5.14; # subwarnings unavailable any earlier
no warnings "nonchar"; # the 66 forbidden non-characters
no warnings "surrogate"; # UTF-16/CESU-8 nonsense
no warnings "non_unicode"; # for codepoints over 0x10_FFFF
℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical "use utf8" declaration, putting UTF-8 in your
literals and identifiers won't work right. If you used the standard
preamble just given above, this already happened. If you did, you can
do things like this:
use utf8;
my $measure = "Ångström";
my @μsoft = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ μεγας );
my @鯉 = qw( koi8-f koi8-u koi8-r );
my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
If you forget "use utf8", high bytes will be misunderstood as separate
characters, and nothing will work right.
℞ 4: Characters and their numbers

The "ord" and "chr" functions work transparently on all codepoints, not
just on ASCII alone, and in fact not even on Unicode alone.
# ASCII characters
ord("A")
chr(65)
# characters from the Basic Multilingual Plane
ord("Σ")
chr(0x3A3)
# beyond the BMP
ord("𝑛") # MATHEMATICAL ITALIC SMALL N
chr(0x1D45B)
# beyond Unicode! (up to MAXINT)
ord("\x{20_0000}")
chr(0x20_0000)
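As a quick check that these calls really are inverses across every range
above, here is a minimal sketch; it silences the beyond-Unicode warning
described in ℞ 2:

use v5.14;
for my $cp (0x41, 0x3A3, 0x1D45B, 0x20_0000) {
    no warnings "non_unicode";           # the last one is past U+10FFFF
    die "chr/ord round-trip failed" unless ord(chr($cp)) == $cp;
}
say "chr/ord round-trip ok";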
℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a regex,
you may specify a character by its number using the "\x{HHHHHH}" escape.
String: "\x{3a3}"
Regex: /\x{3a3}/
String: "\x{1d45b}"
Regex: /\x{1d45b}/
# even non-BMP ranges in regex work fine
/[\x{1D434}-\x{1D467}]/
℞ 6: Get character name by number

use charnames ();
my $name = charnames::viacode(0x03A3);    # "GREEK CAPITAL LETTER SIGMA"
℞ 7: Get character number by name

use charnames ();
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");    # 0x3A3
℞ 8: Unicode named characters

Use the "\N{charname}" notation to get the character by that name for
use in interpolated literals (double-quoted strings and regexes). As of
v5.16, there is an implicit
use charnames qw(:full :short);
But prior to v5.16, you must be explicit about which set of charnames you
want. The ":full" names are the official Unicode character name, alias,
or sequence, which all share a namespace.
use charnames qw(:full :short latin greek);
"\N{MATHEMATICAL ITALIC SMALL N}" # :full
"\N{GREEK CAPITAL LETTER SIGMA}" # :full
Anything else is a Perl-specific convenience abbreviation. Specify one
or more scripts by names if you want short names that are script-
specific.
"\N{Greek:Sigma}" # :short
"\N{ae}" # latin
"\N{epsilon}" # greek
The v5.16 release also supports a ":loose" import for loose matching of
character names, which works just like loose matching of property names:
that is, it disregards case, whitespace, and underscores:
"\N{euro sign}" # :loose (from v5.16)
Starting in v5.32, you can also use
qr/\p{name=euro sign}/
to get official Unicode named characters in regular expressions. Loose
matching is always done for these.
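Here is a minimal sketch putting these lookup styles side by side; it
only assumes the euro sign, U+20AC:

use v5.32;            # for \p{name=...} in regexes
use utf8;
use charnames qw(:loose);

say "\N{EURO SIGN}" eq "\x{20AC}" ? "ok" : "not ok";   # exact name
say "\N{euro sign}" eq "\x{20AC}" ? "ok" : "not ok";   # :loose matching
say "€" =~ /\p{name=euro sign}/   ? "ok" : "not ok";   # v5.32 regex property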
℞ 9: Unicode named sequences

These look just like character names but return multiple codepoints.
Notice the "%vx" vector-print functionality in "printf".
use charnames qw(:full);
my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
printf "U+%v04X\n", $seq;
U+0100.0300
℞ 10: Custom named characters

Use ":alias" to give your own lexically scoped nicknames to existing
characters, or even to give unnamed private-use characters useful names.
use charnames ":full", ":alias" => {
ecute => "LATIN SMALL LETTER E WITH ACUTE",
"APPLE LOGO" => 0xF8FF, # private use character
};
"\N{ecute}"
“\N{APPLE LOGO}” #
℞ 11: Names of CJK codepoints

Sinograms like "東京" come back with character names of "CJK UNIFIED
IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their "names"
vary. The CPAN Unicode::Unihan module has a large database for decoding
these (and a whole lot more), provided you know how to understand its
output.
# cpan -i Unicode::Unihan
use Unicode::Unihan;
my $str = "東京";
my $unhan = Unicode::Unihan->new;
for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
    printf "CJK $str in %-12s is ", $lang;
    say $unhan->$lang($str);
}
prints:
CJK 東京 in Mandarin is DONG1JING1
CJK 東京 in Cantonese is dung1ging1
CJK 東京 in Korean is TONGKYENG
CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
If you have a specific romanization scheme in mind, use the specific
module:
# cpan -i Lingua::JA::Romanize::Japanese
use Lingua::JA::Romanize::Japanese;
my $k2r = Lingua::JA::Romanize::Japanese->new;
my $str = "東京";
say "Japanese for $str is ", $k2r->chars($str);
prints
Japanese for 東京 is toukyou
℞ 12: Explicit encode/decode

On rare occasion, such as a database read, you may be given encoded
text you need to decode.
use Encode qw(encode decode);
my $chars = decode("shiftjis", $bytes, 1);
# OR
my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
For streams all in the same encoding, don't use encode/decode; instead
set the file encoding when you open the file or immediately after with
"binmode" as described later below.
℞ 13: Decode program arguments as utf8

$ perl -CA ...
or
$ export PERL_UNICODE=A
or
use Encode qw(decode);
@ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
℞ 14: Decode program arguments as locale encoding

# cpan -i Encode::Locale
use Encode qw(decode);
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_, 1) } @ARGV;
℞ 15: Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else call
"binmode" explicitly:
$ perl -CS ...
or
$ export PERL_UNICODE=S
or
use open qw(:std :encoding(UTF-8));
or
binmode(STDIN, ":encoding(UTF-8)");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding

# cpan -i Encode::Locale
use Encode;
use Encode::Locale;
# or as a stream for binmode or open
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
℞ 17: Make file I/O default to utf8

Files opened without an encoding argument will be in UTF-8:
$ perl -CD ...
or
$ export PERL_UNICODE=D
or
use open qw(:encoding(UTF-8));
℞ 18: Make all I/O and args default to utf8

$ perl -CSDA ...
or
$ export PERL_UNICODE=SDA
or
use open qw(:std :encoding(UTF-8));
use Encode qw(decode);
@ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
℞ 19: Open file with specific encoding

Specify stream encoding. This is the normal way to deal with encoded
text, not by calling low-level functions.
# input file
open(my $in_file, "< :encoding(UTF-16)", "wintext");
OR
open(my $in_file, "<", "wintext");
binmode($in_file, ":encoding(UTF-16)");
THEN
my $line = <$in_file>;
# output file
open(my $out_file, "> :encoding(cp1252)", "wintext");
OR
open(my $out_file, ">", "wintext");
binmode($out_file, ":encoding(cp1252)");
THEN
print $out_file "some text\n";
More layers than just the encoding can be specified here. For example,
the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF
handling.
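As one concrete example of such a stack, this sketch reads a
little-endian UTF-16 file written with CRLF line endings, handing back
ordinary \n-terminated lines (the filename is the same illustrative
"wintext" as above):

open(my $in_file, "<:raw:encoding(UTF-16LE):crlf", "wintext")
    || die "cannot open wintext: $!";
while (my $line = <$in_file>) {
    print $line;    # already abstract characters, \n-terminated
}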
℞ 20: Unicode casing

Unicode casing is very different from ASCII casing.
uc("henry ⅷ") # "HENRY Ⅷ"
uc("tschüß") # "TSCHÜSS" notice ß => SS
# both are true:
"tschüß" =~ /TSCHÜSS/i # notice ß => SS
"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
℞ 21: Unicode case-insensitive comparisons

Also available in the CPAN Unicode::CaseFold module, the "fc" foldcase
function introduced in v5.16 grants access to the same Unicode
casefolding that the "/i" pattern modifier has always used:
use feature "fc"; # fc() function is from v5.16
# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;
# both are true:
fc("tschüß") eq fc("TSCHÜSS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
℞ 22: Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF grapheme or any of
seven vertical whitespace characters. Good for dealing with textfiles
coming from different operating systems.
\R
s/\R/\n/g; # normalize all linebreaks to \n
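A complete line-ending canonicalizer is hardly any longer; a minimal
sketch:

while (<>) {
    s/\R/\n/g;   # CRLF or any single vertical whitespace becomes \n
    print;
}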
℞ 23: Get character category

Find the general category of a numeric codepoint.
use Unicode::UCD qw(charinfo);
my $cat = charinfo(0x3A3)->{category}; # "Lu"
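The hash reference returned by charinfo() carries more fields than just
the category; a short sketch:

use v5.14;
use Unicode::UCD qw(charinfo);

my $info = charinfo(0x3A3);     # undef for unassigned codepoints
say "$info->{name} is $info->{category}" if $info;
# GREEK CAPITAL LETTER SIGMA is Lu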
℞ 24: Disabling Unicode-awareness in builtin charclasses

Disable "\w", "\b", "\s", "\d", and the POSIX classes from working
correctly on Unicode either in this scope, or in just one regex.
use v5.14;
use re "/a";
# OR #
my($num) = $str =~ /(\d+)/a;
Or use specific un-Unicode properties, like "\p{ahex}" and
"\p{POSIX_Digit}". Properties still work normally no matter what
charset modifiers ("/d /u /l /a /aa") should be in effect.
℞ 25: Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given property. Use "\P" in
place of "\p" to match one codepoint lacking that property.
\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script_extensions=Latin}, \p{scx=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
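For instance, counting matches of two such properties; a brief sketch:

use v5.14;
use utf8;
use Unicode::Normalize;

my $str     = "Ångström";
my $letters = () = $str      =~ /\pL/g;   # 8: every letter, accented or not
my $marks   = () = NFD($str) =~ /\pM/g;   # 2: ring and diaeresis, once decomposed
say "$letters letters, $marks marks";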
℞ 26: Custom character properties

Define at compile-time your own custom character properties for use in
regexes.
# using private-use characters
sub In_Tengwar { "E000\tE07F\n" }
if (/\p{In_Tengwar}/) { ... }
# blending existing properties
sub Is_GraecoRoman_Title {<<'END_OF_SET'}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET
if (/\p{Is_GraecoRoman_Title}/) { ... }
℞ 27: Unicode normalization

Typically render into NFD on input and NFC on output. Using the NFKC or
NFKD functions improves recall on searches, assuming you've already done
the same to the text being searched. Note that this is about much more
than just precombined compatibility glyphs; it also reorders marks
according to their canonical combining classes and weeds out singletons.
use Unicode::Normalize;
my $nfd = NFD($orig);
my $nfc = NFC($orig);
my $nfkd = NFKD($orig);
my $nfkc = NFKC($orig);
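The four forms differ only in their codepoint sequences, not in the text
they represent; a quick sketch of NFC versus NFD lengths:

use v5.12;
use utf8;
use Unicode::Normalize;

my $nfc = NFC("re\x{301}sume\x{301}");    # "résumé" as 6 codepoints
my $nfd = NFD($nfc);                      # same text as 8 codepoints
say length($nfc), " vs ", length($nfd);   # 6 vs 8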
℞ 28: Convert non-ASCII Unicode numerics

Unless you've used "/a" or "/aa", "\d" matches more than ASCII digits
only, but Perl's implicit string-to-number conversion does not currently
recognize these. Here's how to convert such strings manually.
use v5.14; # needed for num() function
use Unicode::UCD qw(num);
my $str = "got Ⅻ and ४५६७ and ⅞ and here";
my @nums = ();
while ($str =~ /(\d+|\N)/g) {   # not just ASCII!
    push @nums, num($1);
}
say "@nums"; # 12 4567 0.875
use charnames qw(:full);
my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
℞ 29: Match Unicode grapheme cluster in regex

Programmer-visible "characters" are codepoints matched by "/./s", but
user-visible "characters" are graphemes matched by "/\X/".
# Find vowel *plus* any combining diacritics, underlining, etc.
my $nfd = NFD($orig);
$nfd =~ / (?=[aeiou]) \X /xi
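Applying that pattern, this sketch pulls out each vowel together with
whatever marks ride on it:

use v5.12;
use utf8;
use Unicode::Normalize;

my $nfd    = NFD("crème brûlée");
my @vowels = $nfd =~ /( (?=[aeiou]) \X )/xgi;
say scalar @vowels;   # 5: è, e, û, é, e (each with its mark, if any)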
℞ 30: Extract by grapheme instead of by codepoint (regex)

# match and grab five first graphemes
my($first_five) = $str =~ /^ ( \X{5} ) /x;
℞ 31: Extract by grapheme instead of by codepoint (substr)

# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $first_five = $gcs->substr(0, 5);
℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
"crème brûlée" into "éel̂urb em̀erc" instead of into "eélûrb emèrc"; so
reverse by grapheme instead. Both these approaches work right no matter
what normalization the string is in:
$str = join("", reverse $str =~ /\X/g);
# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$str = reverse Unicode::GCString->new($str);
℞ 33: String length in graphemes

The string "brûlée" has six graphemes but up to eight codepoints. This
counts by grapheme, not by codepoint:
my $str = "brûlée";
my $count = 0;
while ($str =~ /\X/g) { $count++ }
# OR: cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $count = $gcs->length;
℞ 34: Unicode column-width for printing

Perl's "printf", "sprintf", and "format" think all codepoints take up 1
print column, but many take 0 or 2. Here, to show that normalization
makes no difference, we print out both forms:
use Unicode::GCString;
use Unicode::Normalize;
my @words = qw/crème brûlée/;
@words = map { NFC($_), NFD($_) } @words;
for my $str (@words) {
    my $gcs  = Unicode::GCString->new($str);
    my $cols = $gcs->columns;
    my $pad  = " " x (10 - $cols);
    say $str, $pad, " |";
}
generates this to show that it pads correctly no matter the
normalization:
crème |
crème |
brûlée |
brûlée |
℞ 35: Unicode collation

Text sorted by numeric codepoint follows no reasonable alphabetic order;
use the UCA for sorting text.
use Unicode::Collate;
my $col = Unicode::Collate->new();
my @list = $col->sort(@old_list);
See the ucsort program from the Unicode::Tussle CPAN module for a
convenient command-line interface to this module.
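Besides sort(), a collator object can compare strings one pair at a time
through its eq, lt, and cmp methods; a small sketch:

use v5.12;
use utf8;
use Unicode::Collate;

my $col = Unicode::Collate->new();
say $col->lt("cafe", "café") ? "cafe first" : "café first";  # cafe first
say $col->cmp("résumé", "resume");                           # 1, like cmp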
℞ 36: Case- and accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and diacritics,
only looking at the basic character.
use Unicode::Collate;
my $col = Unicode::Collate->new(level => 1);
my @list = $col->sort(@old_list);
℞ 37: Unicode locale collation

Some locales have special sorting rules.
# either use v5.12, OR: cpan -i Unicode::Collate::Locale
use Unicode::Collate::Locale;
my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
my @list = $col->sort(@old_list);
The ucsort program mentioned above accepts a "--locale" parameter.
℞ 38: Making "cmp" work on text instead of codepoints

Instead of this:
@srecs = sort {
    $b->{AGE}  <=>  $a->{AGE}
               ||
    $a->{NAME} cmp $b->{NAME}
} @recs;
Use this instead; computing a sort key once per record is much cheaper
than calling the collator from inside the comparison:
my $coll = Unicode::Collate->new();
for my $rec (@recs) {
    $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
}
@srecs = sort {
    $b->{AGE}      <=>  $a->{AGE}
                   ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;
℞ 39: Case- and accent-insensitive comparisons

Use a collator object to compare Unicode text by character instead of
by codepoint.
use Unicode::Collate;
my $es = Unicode::Collate->new(
    level         => 1,
    normalization => undef
);
# now both are true:
$es->eq("García", "GARCIA" );
$es->eq("Márquez", "MARQUEZ");
℞ 40: Case- and accent-insensitive locale comparisons

Same, but in a specific locale.
my $de = Unicode::Collate::Locale->new(
    locale => "de__phonebook",
);
# now this is true:
$de->eq("tschüß", "TSCHUESS"); # notice ü => UE, ß => SS
℞ 41: Unicode linebreaking

Break up text into lines according to Unicode rules.
# cpan -i Unicode::LineBreak
use Unicode::LineBreak;
use charnames qw(:full);
my $para = "This is a super\N{HYPHEN}long string. " x 20;
my $fmt = Unicode::LineBreak->new;
print $fmt->break($para), "\n";
℞ 42: Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM hash will
trigger a wide character exception if any codepoints won't fit into a
byte. Here's how to manually manage the translation:
use DB_File;
use Encode qw(encode decode);
tie %dbhash, "DB_File", "pathname";
# STORE #
# assume $uni_key and $uni_value are abstract Unicode strings
my $enc_key = encode("UTF-8", $uni_key, 1);
my $enc_value = encode("UTF-8", $uni_value, 1);
$dbhash{$enc_key} = $enc_value;
# FETCH #
# assume $uni_key holds a normal Perl string (abstract Unicode)
my $enc_key = encode("UTF-8", $uni_key, 1);
my $enc_value = $dbhash{$enc_key};
my $uni_value = decode("UTF-8", $enc_value, 1);
℞ 43: Unicode text in DBM hashes, the easy way

Here's how to implicitly manage the translation; all encoding and
decoding is done automatically, just as with streams that have a
particular encoding attached to them:
use DB_File;
use DBM_Filter;
my $dbobj = tie %dbhash, "DB_File", "pathname";
$dbobj->Filter_Value("utf8"); # this is the magic bit
# STORE #
# assume $uni_key and $uni_value are abstract Unicode strings
$dbhash{$uni_key} = $uni_value;
# FETCH #
# $uni_key holds a normal Perl string (abstract Unicode)
my $uni_value = $dbhash{$uni_key};
℞ 44: PROGRAM: Demo of Unicode collation and printing

Here's a full program showing how to make use of locale-sensitive
sorting, Unicode casing, and managing print widths when some of the
characters take up zero or two columns, not just one column each time.
When run, the following program produces this nicely aligned output:
Crème Brûlée....... €2.00
Éclair............. €1.60
Fideuà............. €4.20
Hamburger.......... €6.00
Jamón Serrano...... €4.45
Linguiça........... €7.00
Pâté............... €4.15
Pears.............. €2.00
Pêches............. €2.25
Smørbrød........... €5.75
Spätzle............ €5.50
Xoriço............. €3.00
Γύρος.............. €6.50
막걸리............. €4.00
おもち............. €2.65
お好み焼き......... €8.00
シュークリーム..... €1.85
寿司............... €9.99
包子............... €7.50
Here's that program.
#!/usr/bin/env perl
# umenu - demo sorting and printing of Unicode food
#
# (obligatory and increasingly long preamble)
#
use v5.36;
use utf8;
use warnings qw(FATAL utf8); # fatalize encoding faults
use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short); # unneeded in v5.16
# std modules
use Unicode::Normalize; # std perl distro as of v5.8
use List::Util qw(max); # std perl distro as of v5.10
use Unicode::Collate::Locale; # std perl distro as of v5.14
# cpan modules
use Unicode::GCString; # from CPAN
my %price = (
    "γύρος"             => 6.50, # gyros
    "pears"             => 2.00, # like um, pears
    "linguiça"          => 7.00, # spicy sausage, Portuguese
    "xoriço"            => 3.00, # chorizo sausage, Catalan
    "hamburger"         => 6.00, # burgermeister meisterburger
    "éclair"            => 1.60, # dessert, French
    "smørbrød"          => 5.75, # sandwiches, Norwegian
    "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
    "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
    "jamón serrano"     => 4.45, # country ham, Spanish
    "pêches"            => 2.25, # peaches, French
    "シュークリーム"    => 1.85, # cream-filled pastry like eclair
    "막걸리"            => 4.00, # makgeolli, Korean rice wine
    "寿司"              => 9.99, # sushi, Japanese
    "おもち"            => 2.65, # omochi, rice cakes, Japanese
    "crème brûlée"      => 2.00, # crema catalana
    "fideuà"            => 4.20, # more noodles, Valencian
                                 # (Catalan=fideuada)
    "pâté"              => 4.15, # gooseliver paste, French
    "お好み焼き"        => 8.00, # okonomiyaki, Japanese
);
my $width = 5 + max map { colwidth($_) } keys %price;
# So the Asian stuff comes out in an order that someone
# who reads those scripts won't freak out over; the
# CJK stuff will be in JIS X 0208 order that way.
my $coll = Unicode::Collate::Locale->new(locale => "ja");
for my $item ($coll->sort(keys %price)) {
    print pad(entitle($item), $width, ".");
    printf " €%.2f\n", $price{$item};
}
sub pad ($str, $width, $padchar) {
    return $str . ($padchar x ($width - colwidth($str)));
}

sub colwidth ($str) {
    return Unicode::GCString->new($str)->columns;
}

sub entitle ($str) {
    $str =~ s{ (?=\pL)(\S) (\S*) }
             { ucfirst($1) . lc($2) }xge;
    return $str;
}
SEE ALSO
See these manpages, some of which are CPAN modules: perlunicode,
perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
Lingua::KO::Romanize::Hangul.
The Unicode::Tussle CPAN module includes many programs to help with
working with Unicode, including these programs to fully or partly
replace standard utilities: tcgrep instead of egrep, uniquote instead
of cat -v or hexdump, uniwc instead of wc, unilook instead of look,
unifmt instead of fmt, and ucsort instead of sort. For exploring
Unicode character names and character properties, see its uniprops,
unichars, and uninames programs. It also supplies these programs, all
of which are general filters that do Unicode-y things: unititle and
unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd,
and nfkc; and uc, lc, and tc.
Finally, see the published Unicode Standard (page numbers are from
version 6.0.0), including these specific annexes and technical reports:
§3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120–122; Case
Mappings, pages 166–172, especially Caseless Matching starting on page
170.
UAX #44: Unicode Character Database
UTS #18: Unicode Regular Expressions
UAX #15: Unicode Normalization Forms
UTS #10: Unicode Collation Algorithm
UAX #29: Unicode Text Segmentation
UAX #14: Unicode Line Breaking Algorithm
UAX #11: East Asian Width
AUTHOR
Tom Christiansen <tchrist@perl.com> wrote this, with occasional
kibbitzing from Larry Wall and Jeffrey Friedl in the background.
COPYRIGHT AND LICENCE
Copyright © 2012 Tom Christiansen.
This program is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.
Most of these examples were taken from the current edition of the "Camel
Book"; that is, from the 4ᵗʰ Edition of Programming Perl, Copyright ©
2012 Tom Christiansen <et al.>, 2012-02-13 by O'Reilly Media. The code
itself is freely redistributable, and you are encouraged to transplant,
fold, spindle, and mutilate any of the examples in this manpage however
you please for inclusion into your own programs without any encumbrance
whatsoever. Acknowledgement via code comment is polite but not required.
REVISION HISTORY
v1.0.0 – first public release, 2012-02-27
perl v5.36.3 2023-02-15 PERLUNICOOK(1)