4 easy steps to make Mason utf-8 Unicode clean with Apache, mod_perl, and DBI

Published 28 January 2009

This is a note for people who are using the Mason system for high-performance, dynamic web site authoring with Apache, mod_perl, and a relational database like PostgreSQL accessed through DBI, and who want to be utf-8 Unicode clean in all their data.

You want to be able to write accented letters in any language in your web pages. You want your users to be able to enter any characters in web forms, and you want that data to get in and out of your relational database and still display correctly and be handled correctly by perl.

That is, unfortunately, not how it works out of the box, at least not on Red Hat Enterprise Linux 5 or on Fedora 10. This article shows how we made it work right.

Our objective

Our objective is to use utf-8 encoded strings everywhere. We are not concerned with other character encodings. This may not exactly match your requirements so consider your own situation before copying our setup.

We want to store text in the database in utf-8, we want to display our web pages as utf-8, and we want all form input to be utf-8 clean.

We want this to work on our Fedora 10 development machines and our RHEL 5 production servers. Our Mason configuration in Apache is fairly standard and along the lines of:

<LocationMatch "(\.html|\.pl)$">
  SetHandler modperl
  PerlResponseHandler HTML::Mason::ApacheHandler
</LocationMatch>
<Perl>
  use Apache2::Const -compile => qw(NOT_FOUND);
</Perl>
<LocationMatch "(\.m(html|txt|pl|css)|dhandler|autohandler)$">
  SetHandler perl-script
  PerlInitHandler Apache2::Const::NOT_FOUND
</LocationMatch>

Our solution

TL;DR Summary for the impatient

Here are the four steps in brief. If something doesn’t make sense, see the detailed sections below.

  1. Make sure you send the right HTTP, XML, and HTML information:

    1. Make sure you send charset=utf-8 in your HTTP Content-Type header.

    2. Make sure you have encoding="utf-8" in any <?xml ?> preambles (if you use application/xhtml+xml or any other XML content type).

    3. For good measure, add charset=utf-8 to a <meta http-equiv="Content-Type" ... > line in your HTML HEAD section.

  2. Add PerlSetVar MasonPreamble "use utf8;" to your Apache configuration file to make your Mason source files utf-8 safe.

  3. Add PerlAddVar MasonPlugins "MasonX::Plugin::UTF8" to your Apache configuration file to make your forms utf-8 safe.

  4. Add { pg_enable_utf8 => 1 } (or mysql_enable_utf8 => 1, sqlite_unicode => 1, or similar depending on your RDBMS and driver) to your DBI->connect call to make your database utf-8 safe.

1. Send the web pages with the right character set

You are probably doing this already, but it is better to be safe. You have to send the right charset in the HTTP headers and the HTML head sections. If you are using xhtml or another XML format, then you also need the right encoding attribute on your <?xml ?> tag.

HTTP headers

The HTTP headers first. Somewhere in your setup, probably in code called from your Mason autohandler, you are presumably already sending the Content-Type: header. You need to ensure that you have charset=utf-8 at the end of that. For example:

<%init>
# ...
$r->content_type(q{text/html; charset=utf-8});
# ...
</%init>

Our full code allows us to override the default content type and character encoding in individual sections:

<%init>
# The requested component may override the defaults through <%attr> blocks.
# attr_if_exists returns undef for a missing attribute, whereas attr dies.
my $self         = $m->request_comp();
my $encoding     = $self->attr_if_exists('encoding') || "utf-8";
my $content_type = $self->attr_if_exists('content_type');
if ( !defined($content_type) ) {
  $content_type = "text/html";                          # Fallback type
  my $accept_header = $r->headers_in->{'Accept'} || q{application/xhtml+xml};
  my $accept = HTTP::Headers::Accept->new(header => $accept_header);
  if ( $HTTP::Headers::Accept::double_wildcard < $accept->match_media(q{application/xhtml+xml}) ) {
      $content_type = "application/xhtml+xml";
  }
}
$r->content_type(qq{$content_type; charset=$encoding});
# ...
</%init>

Alternatively, you may be able to simply add to your Apache configuration file:

AddDefaultCharset utf-8 

However, this only works for text/plain and text/html content so you lose some flexibility.

XML and HTML headers

Next, if you are sending your pages in any XML format, you need to ensure that your XML preamble has the right encoding parameter:

<?xml version="1.0" encoding="utf-8" ?>

And whatever you do, you probably want to copy your $r->content_type value as a META tag in the HTML HEAD section:

<head>
 <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
 <!-- ... -->
</head>

2. Fix Mason source files

Now you should be all done, but, alas, your problems are just starting. I assume that you have some sort of autohandler that manages your site’s templates and you can just create a new file, say test.html with the content

<p>Copyright © by me</p>

When I do that on my system and display the page in Firefox (or another recent browser) I see

Copyright Â© by me

Eeeek!

The problem is good old perl. And I mean old as in ancient, pre-Unicode. Perl doesn’t do Unicode very well, or at least not very intuitively.

When you are using Mason your .html files are no longer just HTML. Instead they are really perl code snippets that Mason magically runs for you. And, by default, perl source code is not utf-8 safe! (This is true at least up to perl v5.10 and probably will remain true until perl v6.) You need to add the magic incantation use utf8; before you can use utf-8 characters in your perl source code strings (which, remember, is what your .html file really is). So try changing test.html to:

<p>Copyright © by me</p>
<%once>
use utf8;
</%once>

And you will see it works.

This would obviously be a pain to do (and to remember to do) in every file. Fortunately, you can simply add to your Apache configuration file

PerlSetVar MasonPreamble "use utf8;"

And all your source files are utf-8 safe. Cool!
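
To see where the stray ‘Â’ comes from, here is a small standalone sketch in plain perl, outside Mason. It assumes only the core Encode module, and the variable names are ours for illustration. It shows that the two utf-8 bytes of ‘©’ count as two characters until they are decoded, and that encoding the undecoded bytes a second time yields the byte sequence a browser renders as ‘Â©’. Exactly where the second encoding happens in a given Mason setup may vary, so treat this as an illustration of the mechanism rather than a trace of what Mason does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# "©" is U+00A9; its utf-8 encoding is the two bytes C2 A9.
my $bytes = "\xC2\xA9";     # roughly what perl sees without "use utf8"
my $chars = $bytes;
utf8::decode($chars);       # roughly what "use utf8" gives you: one character

# Encoding the undecoded bytes treats C2 and A9 as two Latin-1 characters
# ("Â" and "©") and encodes each of them again.
my $double = encode( 'UTF-8', $bytes );
my $single = encode( 'UTF-8', $chars );

# Helper for this demo only: show a byte string as hex pairs.
sub hex_bytes {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $string;
}

printf "undecoded length: %d\n", length $bytes;      # 2
printf "decoded length:   %d\n", length $chars;      # 1
printf "double encoded:   %s\n", hex_bytes($double); # C3 82 C2 A9
printf "encoded once:     %s\n", hex_bytes($single); # C2 A9
```

The four bytes C3 82 C2 A9 are the utf-8 encoding of ‘Â©’, which is the damage seen in the browser.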

3. Fix form inputs

I am assuming that all your HTML forms are already looking a little like this:

<form action="#" method="post" accept-charset="UTF-8" enctype="multipart/form-data">
<!-- ... -->
</form>

You obviously want to tell the client (browser) that you accept utf-8 strings, and since there is currently no safe, standard way to URL-encode Unicode characters, you are left with the POST method and the multipart/form-data encoding.

And it should all work out of the box. But it doesn’t. Make a form and add our favourite test string “Copyright © by me” to it and preview the result. Our friend the ‘Â’ is back. Sigh.

So we create a plugin module called MasonX::Plugin::UTF8 essentially as:

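package MasonX::Plugin::UTF8;
use base qw(HTML::Mason::Plugin);
use warnings;
use strict;

# Decode every request argument from utf-8 bytes to perl characters
# before the top-level component sees it.
sub start_request_hook {
    my ( $self, $context ) = @_;
    my $args_ref = $context->args();
    foreach my $arg ( @{$args_ref} ) {
        utf8::is_utf8($arg) || utf8::decode($arg);
    }
    return;
}

1;    # a module file must end with a true value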

And we add it to our Apache configuration with

<Perl>
  use lib '/some/path';
</Perl>
PerlAddVar MasonPlugins "MasonX::Plugin::UTF8"

And it just works. As it should have done out of the box.
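
One thing to watch: Mason passes a form field with several values (a group of checkboxes, say) as an array reference, and the loop in the plugin only looks at the top-level list of names and values. The following standalone sketch shows one possible way to decode nested values as well. The name decode_in_place is a hypothetical helper of ours, not part of Mason or of the plugin as we run it, and the sample arguments are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode utf-8 byte strings in place, descending
# into array references. The elements of @_ alias the caller's values,
# so the caller's data is modified directly.
sub decode_in_place {
    for my $item (@_) {
        if ( ref $item eq 'ARRAY' ) {
            decode_in_place( @{$item} );
        }
        elsif ( defined $item && !ref $item ) {
            utf8::is_utf8($item) || utf8::decode($item);
        }
        # Other references (upload handles and the like) are left alone.
    }
    return;
}

# Made-up arguments shaped like a Mason argument list: a flat list of
# names and values, with one multi-valued field.
my @args = (
    title => "Copyright \xC2\xA9 by me",
    tags  => [ "caf\xC3\xA9", "na\xC3\xAFve" ],
);
decode_in_place(@args);

printf "title has %d characters\n", length $args[1];
printf "first tag has %d characters\n", length $args[3][0];
```

Inside the plugin, the body of the foreach loop could then be replaced by a single call such as decode_in_place( @{$args_ref} ), again assuming this helper suits your data.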

4. Fix your database

So now you save your form data in your relational database system, get it back out, display it and find your old friend ‘Â’ is back. Sigh. Did I mention that perl is old?

I am assuming that you are using the DBI module to access the database. This is currently not utf-8 clean by default because most of the underlying drivers are not utf-8 clean by default. (And that is for “backwards compatibility”.)

The fix depends on your database driver, but many of them have a magic attribute you can pass to the DBI->connect call to make them utf-8 safe. Some of them are listed below:

Driver       Database    Utf-8 attribute
DBD::Pg      PostgreSQL  pg_enable_utf8 => 1
DBD::mysql   MySQL       mysql_enable_utf8 => 1
DBD::SQLite  SQLite      sqlite_unicode => 1

(Warning: The MySQL option is currently “experimental and may change in future versions” and SQLite requires special handling of blobs with the unicode flag enabled.)

So for PostgreSQL you would do something like:

my $dbh = DBI->connect( $dsn, $user, $pass, 
                        { AutoCommit => 1, RaiseError => 0, pg_enable_utf8 => 1 } );

This of course assumes that you have created the database as a utf-8 database in the first instance! Sticking with PostgreSQL as the example, you would have done createdb -E utf8 ... from the command line or used the SQL command CREATE DATABASE ... ENCODING 'UTF8';.

If your database driver does not support utf-8 directly, you might want to consider the UTF8DBI module as a workaround.
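
A quick way to check the database step in isolation is a round-trip test: store a string with non-ASCII characters, read it back, and compare. The sketch below uses an in-memory SQLite database so that it needs no server. It assumes DBD::SQLite is installed, the table name notes is made up for the test, and newer driver versions may prefer a different attribute than sqlite_unicode. For PostgreSQL you would swap in your own DSN and pg_enable_utf8 => 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database with the driver's unicode flag switched on.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
                        { RaiseError => 1, sqlite_unicode => 1 } );

# Hypothetical scratch table, used only for this check.
$dbh->do("CREATE TABLE notes (body TEXT)");

# A copyright sign and a euro sign, written as escapes so that the
# script does not depend on the encoding of its own source file.
my $text = "\x{A9} \x{20AC}";
$dbh->do( "INSERT INTO notes (body) VALUES (?)", undef, $text );

my ($back) = $dbh->selectrow_array("SELECT body FROM notes");

# With the flag on, we expect the same three characters back.
print $back eq $text ? "round trip ok\n" : "round trip FAILED\n";
printf "characters read back: %d\n", length $back;

$dbh->disconnect;
```

If the comparison fails or the length comes back larger than the number of characters you stored, the driver is most likely handing you raw bytes rather than decoded characters.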

Done!

And now you are done! Enjoy and let me know of your experiences.