This is a note for people who are using the Mason system for high-performance, dynamic web site authoring with Apache, mod_perl, and a relational database like PostgreSQL accessed through DBI, and who want to be utf-8 Unicode clean in all their data.
You want to be able to write accented letters in any language in your web pages. You want your users to be able to enter any characters in web forms, and you want that data to get in and out of your relational database and still display correctly and be handled correctly by perl.
That is, unfortunately, not how it works out of the box, at least not on Red Hat Enterprise Linux 5 or on Fedora 10. This article shows how we made it work right.
Our objective
Our objective is to use utf-8 encoded strings everywhere. We are not concerned with other character encodings. This may not exactly match your requirements so consider your own situation before copying our setup.
We want to store text in the database in utf-8, we want to display our web pages as utf-8, and we want all form input to be utf-8 clean.
We want this to work on our Fedora 10 development machines and our RHEL 5 production servers. Our Mason configuration in Apache is fairly standard and along the lines of:
<LocationMatch "(\.html|\.pl)$">
SetHandler modperl
PerlResponseHandler HTML::Mason::ApacheHandler
</LocationMatch>
<Perl>
use Apache2::Const -compile => qw(NOT_FOUND);
</Perl>
<LocationMatch "(\.m(html|txt|pl|css)|dhandler|autohandler)$">
SetHandler perl-script
PerlInitHandler Apache2::Const::NOT_FOUND
</LocationMatch>
Our solution
TL;DR Summary for the impatient
This is the quick summary for the impatient. If something doesn’t make sense, then see below for the details.
- Make sure you send the right HTTP, XML, and HTML information:
  - Make sure you send charset=utf-8 in your HTTP Content-Type header.
  - Make sure you have encoding="utf-8" in any <?xml ?> preambles (if you use application/xhtml+xml or any other XML content type).
  - For good measure, add charset=utf-8 to a <meta http-equiv="Content-Type" ... > line in your HTML HEAD section.
- Add PerlSetVar MasonPreamble "use utf8;" to your Apache configuration file to make your Mason source files utf-8 safe.
- Add PerlAddVar MasonPlugins "MasonX::Plugin::UTF8" to your Apache configuration file to make your forms utf-8 safe.
- Add { pg_enable_utf8 => 1 } (or mysql_enable_utf8 => 1, unicode => 1, or similar depending on your RDBMS) to your DBI->connect call to make your database utf-8 safe.
1. Send the web pages with the right character set
You are probably doing this already, but it is better to be safe. You have to send the right charset in the HTTP headers and the HTML head sections. If you are using xhtml or another XML format, then you also need the right encoding attribute on your <?xml ?> tag.
HTTP headers
The HTTP headers first. Somewhere in your setup, probably in code called from your Mason autohandler, you are presumably already sending the Content-Type: header. You need to ensure that you have charset=utf-8 at the end of that. For example:
<%init>
# ...
$r->content_type(q{text/html; charset=utf-8});
# ...
</%init>
Our full code allows us to override the default content type and character encoding in individual sections:
<%init>
my $self = $m->request_comp();
my $encoding = $self->attr('encoding') || "utf-8";
my $content_type = $self->attr('content_type');
if ( !defined($content_type) ) {
    $content_type = "text/html"; # Fallback type
    my $accept_header = $r->headers_in->{'Accept'} || q{application/xhtml+xml};
    my $a = HTTP::Headers::Accept->new(header => $accept_header);
    if ( $HTTP::Headers::Accept::double_wildcard < $a->match_media(q{application/xhtml+xml}) ) {
        $content_type = "application/xhtml+xml";
    }
}
$r->content_type(qq{$content_type; charset=$encoding});
# ...
Alternatively, you may be able to simply add to your Apache configuration file:
AddDefaultCharset utf-8
However, this only works for text/plain and text/html content so you lose some flexibility.
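The content-type selection described in this section can also be sketched in plain perl, without any helper module. This is a simplified illustration, not our production code: the function names explicit_q and content_type_for are made up for this example, wildcards such as */* are deliberately ignored, and a missing Accept header is treated as accepting application/xhtml+xml, as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the q-value that an Accept header gives to one
# exact media type, or 0 if the type is not listed explicitly.
sub explicit_q {
    my ( $accept, $wanted ) = @_;
    foreach my $item ( split /\s*,\s*/, $accept ) {
        my ( $type, @params ) = split /\s*;\s*/, $item;
        next unless defined $type;
        $type =~ s/^\s+|\s+$//g;    # trim stray whitespace
        next unless lc $type eq lc $wanted;
        foreach my $param (@params) {
            return $1 + 0 if $param =~ /^q=([0-9]*\.?[0-9]+)$/i;
        }
        return 1;    # no q parameter means q=1
    }
    return 0;
}

# Hypothetical helper: build the full Content-Type header value.
sub content_type_for {
    my ( $accept, $encoding ) = @_;
    $encoding ||= 'utf-8';
    $accept = 'application/xhtml+xml' unless defined $accept && length $accept;
    my $type =
        explicit_q( $accept, 'application/xhtml+xml' ) > 0
        ? 'application/xhtml+xml'
        : 'text/html';    # fallback type
    return "$type; charset=$encoding";
}

print content_type_for('text/html,application/xhtml+xml,*/*;q=0.8'), "\n";
print content_type_for('text/html,*/*;q=0.8'), "\n";
```

In a Mason autohandler the result would be passed to $r->content_type(...). A real Accept parser handles more cases (wildcards, parameters other than q), so treat this as a starting point only.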
XML and HTML headers
Next, if you are sending your pages as any XML format you need to ensure that your XML preamble has the right encoding parameter:
<?xml version="1.0" encoding="utf-8" ?>
And whatever you do, you probably want to copy your $r->content_type value as a META tag in the HTML HEAD section:
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
<!-- ... -->
</head>
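One way to keep the XML preamble and the META tag in step with the HTTP header is to generate all of them from the same two values. The following is only a sketch: the function name head_markup is invented for this example, and it assumes that any content type ending in /xml or +xml wants an XML preamble.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the XML preamble (if any) and the META tag
# from the same content type and encoding that go into the HTTP header.
sub head_markup {
    my ( $content_type, $encoding ) = @_;

    # Assumption: only XML content types get a preamble.
    my $xml_preamble =
        $content_type =~ m{[/+]xml$}
        ? qq{<?xml version="1.0" encoding="$encoding" ?>\n}
        : '';

    my $meta =
        qq{<meta http-equiv="Content-Type" content="$content_type; charset=$encoding" />};

    return ( $xml_preamble, $meta );
}

my ( $preamble, $meta ) = head_markup( 'application/xhtml+xml', 'utf-8' );
print $preamble, $meta, "\n";
```

The two returned strings can then be printed from the autohandler at the top of the page and inside the HEAD section respectively.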
2. Fix Mason source files
Now you should be all done, but, alas, your problems are just starting. I assume that you have some sort of autohandler that manages your site’s templates and you can just create a new file, say test.html, with the content
<p>Copyright © by me</p>
When I do that on my system and display the page in Firefox (or another recent browser) I see
Copyright Â© by me
Eeeek!
The problem is good old perl. And I mean old as in ancient, pre Unicode. Perl doesn’t do Unicode very well, or at least not very intuitively.
When you are using Mason your .html files are no longer just HTML. Instead they are really perl code snippets that Mason magically runs for you. And, by default, perl source code is not utf-8 safe! (This is true at least up to perl v.5.10 and probably will remain true until perl v6.) You need to add the magic incantation use utf8; before you can use utf-8 characters in your perl source code strings (which, remember, is what your .html file really is). So try changing test.html to:
<p>Copyright © by me</p>
<%once>
use utf8;
</%once>
And you will see it works.
This would obviously be a pain to do (and remember to do) in every file. Fortunately, you can simply add to your Apache configuration file
PerlSetVar MasonPreamble "use utf8;"
And all your source files are utf-8 safe. Cool!
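If you are curious where the stray Â comes from, the usual cause is double encoding. The sketch below reproduces the effect outside Mason. It assumes that, without use utf8, perl reads the two utf-8 bytes of the copyright sign as two separate Latin-1 characters and that they get encoded to utf-8 once more on the way out; exactly where that second encoding happens in a given Mason setup is not something this example shows.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Format a byte string as hex, e.g. "C2 A9".
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, shift }

# The copyright sign as it sits in a utf-8 encoded source file: two bytes.
my $source_bytes = "\xC2\xA9";

# Without "use utf8" perl sees two Latin-1 characters (Â and ©).
# Encoding those two characters to utf-8 yields four bytes: the mojibake.
my $double_encoded = encode( 'UTF-8', $source_bytes );

# With "use utf8" perl sees one character, U+00A9 (simulated here by decoding).
my $one_char = decode( 'UTF-8', $source_bytes );
my $correct  = encode( 'UTF-8', $one_char );

printf "without use utf8: %d characters, sent as %s\n",
    length($source_bytes), hex_bytes($double_encoded);    # 2 characters, C3 82 C2 A9
printf "with use utf8:    %d character,  sent as %s\n",
    length($one_char), hex_bytes($correct);               # 1 character, C2 A9
```

A browser decoding C3 82 C2 A9 as utf-8 shows Â followed by ©, which is the symptom described above.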
3. Fix form inputs
I am assuming that all your HTML forms are already looking a little like this:
<form action="#" method="post" accept-charset="UTF-8" enctype="multipart/form-data">
<!-- ... -->
</form>
You obviously want to tell the client (browser) that you accept utf-8 strings, and since there is currently no safe, standard way to URL-encode Unicode characters, you are left with the POST method and the multipart/form-data encoding.
And it should all work out of the box. But it doesn’t. Make a form and add our favourite test string “Copyright © by me” to it and preview the result. Our friend the ‘Â’ is back. Sigh.
So we create a plugin module called MasonX::Plugin::UTF8 essentially as:
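package MasonX::Plugin::UTF8;
use base qw(HTML::Mason::Plugin);
use warnings;
use strict;
# Decode every request argument from utf-8 bytes to perl characters
# before any component sees it.
sub start_request_hook {
    my ( $self, $context ) = @_;
    my $args_ref = $context->args();
    foreach my $arg ( @{$args_ref} ) {
        # Leave strings alone that are already character strings
        utf8::is_utf8($arg) || utf8::decode($arg);
    }
    return;
}
1;    # a module file must return a true value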
And we add it to our Apache configuration with
<Perl>
use lib '/some/path';
</Perl>
PerlAddVar MasonPlugins "MasonX::Plugin::UTF8"
And it just works. As it should have done out of the box.
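To see what the one-line loop body in the plugin does, here it is run on its own, outside Mason. The @args array is a made-up stand-in for the argument list that Mason hands to the hook; everything else uses only the built-in utf8:: functions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the request arguments: a field name and a value
# that arrived from the browser as raw utf-8 bytes.
my @args = ( 'text', "Copyright \xC2\xA9 by me" );

my $bytes_before = length $args[1];    # 18: the copyright sign is two bytes

# Same logic as in the plugin: decode in place unless already decoded.
foreach my $arg (@args) {
    utf8::is_utf8($arg) || utf8::decode($arg);
}

my $chars_after = length $args[1];     # 17: the copyright sign is one character

# Running the loop a second time changes nothing, because of the is_utf8 guard.
foreach my $arg (@args) {
    utf8::is_utf8($arg) || utf8::decode($arg);
}

print "before: $bytes_before bytes, after: $chars_after characters\n";
```

The guard matters because decoding an already decoded string a second time could corrupt it, so the hook stays safe even if some arguments arrive as character strings.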
4. Fix your database
So now you save your form data in your relational database system, get it back out, display it and find your old friend ‘Â’ is back. Sigh. Did I mention that perl is old?
I am assuming that you are using the DBI module to access the database. This is currently not utf-8 clean by default because most of the underlying drivers are not utf-8 clean by default. (And that is for “backwards compatibility”.)
The fix depends on your database driver, but many of them have a magic attribute you can pass to the DBI->connect call to make them utf-8 safe. Some of them are listed below:
Driver | Database | Utf-8 attribute |
---|---|---|
DBD::Pg | PostgreSQL | pg_enable_utf8 => 1 |
DBD::mysql | MySQL | mysql_enable_utf8 => 1 |
DBD::SQLite | SQLite | sqlite_unicode => 1 |
(Warning: The MySQL option is currently “experimental and may change in future versions” and SQLite requires special handling of blobs with the unicode flag enabled.)
So for PostgreSQL you would do something like:
my $dbh = DBI->connect( $dsn, $user, $pass,
    { AutoCommit => 1, RaiseError => 0, pg_enable_utf8 => 1 } );
This of course assumes that you have created the database as a utf-8 database in the first instance! Sticking with PostgreSQL as the example, you would have done createdb -E utf8 ... from the command line or used the SQL command CREATE DATABASE ... ENCODING 'UTF8';.
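If the same code has to run against more than one of the drivers listed above, the attribute can be picked from the DSN instead of being hard-coded. This is a sketch only: the helper name utf8_connect_attrs and the example DSN are invented for illustration, and the lookup covers just the three drivers from the table.

```perl
use strict;
use warnings;

# Driver-specific utf-8 attributes, taken from the table above.
my %UTF8_ATTRIBUTE = (
    Pg     => 'pg_enable_utf8',
    mysql  => 'mysql_enable_utf8',
    SQLite => 'sqlite_unicode',
);

# Hypothetical helper: build the attribute hash for DBI->connect from a DSN,
# adding the right utf-8 attribute to whatever other attributes are given.
sub utf8_connect_attrs {
    my ( $dsn, %attrs ) = @_;
    my ($driver) = $dsn =~ /^dbi:(\w+)/i
        or die "Cannot find a driver name in DSN '$dsn'\n";
    my $utf8_attr = $UTF8_ATTRIBUTE{$driver}
        or die "No known utf-8 attribute for DBD::$driver\n";
    return { %attrs, $utf8_attr => 1 };
}

# Example DSN, made up for this sketch.
my $attrs = utf8_connect_attrs( 'dbi:Pg:dbname=mydb', AutoCommit => 1, RaiseError => 0 );
print join( ', ', map { "$_ => $attrs->{$_}" } sort keys %{$attrs} ), "\n";
```

The resulting hash reference goes in as the fourth argument, as in DBI->connect( $dsn, $user, $pass, $attrs ). Dying on an unknown driver is a deliberate choice here: silently connecting without the attribute would bring the encoding problem straight back.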
If your database driver does not support utf-8 directly, you might want to consider the UTF8DBI module as a workaround.
Done!
And now you are done! Enjoy and let me know of your experiences.