On 2009-01-28 17:57:00, Allan Engelhardt wrote in CYBAEA Technology Notes:
This is a note for people who are using the Mason system for high-performance, dynamic web site authoring with Apache, mod_perl, and a relational database like PostgreSQL accessed through DBI, and who want to be utf-8 Unicode clean in all their data.
You want to be able to write accented letters in any language in your web pages. You want your users to be able to enter any characters in web forms, and you want that data to get in and out of your relational database and still display correctly and be handled correctly by perl.
That is, unfortunately, not how it works out of the box, at least not on Red Hat Enterprise Linux 5 or on Fedora 10. This article shows how we made it work right.
Our objective is to use utf-8 encoded strings everywhere. We are not concerned with other character encodings. This may not exactly match your requirements so consider your own situation before copying our setup.
We want to store text in the database in utf-8, we want to display our web pages as utf-8, and we want all form input to be utf-8 clean.
We want this to work on our Fedora 10 development machines and our RHEL 5 production servers. Our Mason configuration in Apache is fairly standard and along the lines of:
<LocationMatch "(\.html|\.pl)$">
    SetHandler modperl
    PerlResponseHandler HTML::Mason::ApacheHandler
</LocationMatch>
<Perl>
    use Apache2::Const -compile => qw(NOT_FOUND);
</Perl>
<LocationMatch "(\.m(html|txt|pl|css)|dhandler|autohandler)$">
    SetHandler perl-script
    PerlInitHandler Apache2::Const::NOT_FOUND
</LocationMatch>
This is the quick summary for the impatient. If something doesn't make sense, then see below for the details.
Use charset=utf-8 in your HTTP Content-Type header.
Use encoding="utf-8" in any <?xml ?> preambles (if you use application/xhtml+xml or any other XML content type).
Add charset=utf-8 to a <meta http-equiv="Content-Type" ... > line in your HTML HEAD section.
Add PerlSetVar MasonPreamble "use utf8;" to your Apache configuration file to make your Mason source files utf-8 safe.
Add PerlAddVar MasonPlugins "MasonX::Plugin::UTF8" to your Apache configuration file to make your forms utf-8 safe.
Add { pg_enable_utf8 => 1 } (or mysql_enable_utf8=>1, unicode=>1, or similar depending on your RDBMS) to your DBI->connect call to make your database utf-8 safe.
You are probably doing this already, but it is better to be safe. You have to send the right charset in the HTTP headers and the HTML head sections. If you are using xhtml or another XML format, then you also need the right encoding attribute on your <?xml ?> tag.
The HTTP headers first. Somewhere in your setup, probably in code called from your Mason autohandler, you are presumably already sending the Content-Type: header. You need to ensure that you have charset=utf-8 at the end of that. For example:
<%init>
# ...
$r->content_type(q{text/html; charset=utf-8});
# ...
</%init>
Our full code allows us to override the default content type and character encoding in individual sections:
<%init>
my $self = $m->request_comp();
my $encoding = $self->attr('encoding') || "utf-8";
my $content_type = $self->attr('content_type');
if ( !defined($content_type) ) {
    $content_type = "text/html"; # Fallback type
    my $accept_header = $r->headers_in->{'Accept'} || q{application/xhtml+xml};
    my $a = HTTP::Headers::Accept->new(header => $accept_header);
    if ( $HTTP::Headers::Accept::double_wildcard < $a->match_media(q{application/xhtml+xml}) ) {
        $content_type = "application/xhtml+xml";
    }
}
$r->content_type(qq{$content_type; charset=$encoding});
# ...
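If HTTP::Headers::Accept is not available to you, the negotiation step can be approximated with a few lines of plain perl. The following is only a rough sketch under stated assumptions: choose_content_type is a hypothetical helper, not part of Mason or mod_perl; it only looks for an explicit application/xhtml+xml entry with a non-zero q value and ignores wildcards; and, unlike the code above, it falls back to text/html when there is no Accept header at all.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mason): pick a content type from the
# value of the Accept request header. Assumption: we only honour an explicit
# application/xhtml+xml entry and ignore wildcards such as */*.
sub choose_content_type {
    my ($accept) = @_;
    $accept = '' if !defined $accept;

    foreach my $range ( split /\s*,\s*/, $accept ) {
        my ( $type, @params ) = split /\s*;\s*/, $range;
        next if !defined $type || lc($type) ne 'application/xhtml+xml';

        # A missing q parameter means q=1.
        my ($q) = map { /^q=([01](?:\.\d{0,3})?)$/i ? $1 : () } @params;
        $q = 1 if !defined $q;
        return 'application/xhtml+xml' if $q > 0;
    }
    return 'text/html';    # Fallback type
}

my $browser = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
my $type    = choose_content_type($browser);
print "$type; charset=utf-8\n";
```

In a Mason autohandler you would presumably pass in $r->headers_in->{'Accept'} and hand the result to $r->content_type as shown earlier.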
Alternatively, you may be able to simply add to your Apache configuration file:
AddDefaultCharset utf-8
However, this only works for text/plain and text/html content and you lose some flexibility.
Next, if you are sending your pages as any XML format you need to ensure that your XML preamble has the right encoding parameter:
<?xml version="1.0" encoding="utf-8" ?>
And whatever you do, you probably want to copy your $r->content_type value as a META tag in the HTML HEAD section:
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
<!-- ... -->
</head>
Now you should be all done, but, alas, your problems are just starting. I assume that you have some sort of autohandler that manages your site's templates and you can just create a new file, say test.html with the content
<p>Copyright © by me</p>
When I do that on my system and display the page in Firefox (or another recent browser) I see
Copyright Â© by me
Eeeek!
The problem is good old perl. And I mean old as in ancient, pre-Unicode. Perl doesn't do Unicode very well, or at least not very intuitively.
When you are using Mason your .html files are no longer just HTML. Instead they are really perl code snippets that Mason magically runs for you. And, by default, perl source code is not utf-8 safe! (This is true at least up to perl v.5.10 and probably will remain true until perl v6.) You need to add the magic incantation use utf8; before you can use utf-8 characters in your perl source code strings (which, remember, is what your .html file really is). So try changing test.html to:
<p>Copyright © by me</p>
<%once>
use utf8;
</%once>
And you will see it works.
This would obviously be a pain to do (and remember to do) in every file. Fortunately, you can simply add to your Apache configuration file:
PerlSetVar MasonPreamble "use utf8;"
And all your source files are utf-8 safe. Cool!
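One plausible way to see where the stray ‘Â’ comes from is to reproduce the effect in a stand-alone script using the core Encode module. This is a sketch of the general mechanism, not a trace of what Mason does internally: without use utf8, perl reads the two utf-8 bytes of the © sign as two separate Latin-1 characters, and if that string is later encoded as utf-8 on its way out, each of those characters is encoded again.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The utf-8 encoding of U+00A9 COPYRIGHT SIGN is the two bytes C2 A9.
# Without "use utf8" this is what perl sees in the source file: a string
# of two characters, U+00C2 (Â) and U+00A9 (©).
my $bytes = "\xC2\xA9";

# Encoding that two-character string as utf-8 produces four bytes,
# C3 82 C2 A9, which a browser displays as "Â©".
my $double_encoded = encode( 'UTF-8', $bytes );

# Decoding first (roughly what "use utf8" does for source literals) gives
# a single character, which encodes back to the original two bytes.
my $chars   = decode( 'UTF-8', $bytes );
my $correct = encode( 'UTF-8', $chars );

printf "double encoded: %s\n", join ' ', map { sprintf '%02X', ord } split //, $double_encoded;
printf "correct:        %s\n", join ' ', map { sprintf '%02X', ord } split //, $correct;
```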
I am assuming that all your HTML forms are already looking a little like this:
<form action="#" method="post" accept-charset="UTF-8" enctype="multipart/form-data">
<!-- ... -->
</form>
You obviously want to tell the client (browser) that you accept utf-8 strings, and since there is currently no safe, standard way to URL-encode Unicode characters, you are left with POST methods and the multipart/form-data encoding.
And it should all work out of the box. But it doesn't. Make a form and add our favourite test string “Copyright © by me” to it and preview the result. Our friend the ‘Â’ is back. Sigh.
So we create a plugin module called MasonX::Plugin::UTF8 essentially as:
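package MasonX::Plugin::UTF8;
use base qw(HTML::Mason::Plugin);
use warnings;
use strict;

# Runs at the start of each request, before the first component is called.
sub start_request_hook {
    my ( $self, $context ) = @_;
    # In scalar context args() returns a reference to the argument list,
    # so changes made through it are seen by the component.
    my $args_ref = $context->args();
    foreach my $arg ( @{$args_ref} ) {
        # Decode in place anything not already flagged as a character string.
        utf8::is_utf8($arg) || utf8::decode($arg);
    }
    return;
}

1; # A module file must end with a true value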
And we add it to our Apache configuration with
<Perl>
    use lib '/some/path';
</Perl>
PerlAddVar MasonPlugins "MasonX::Plugin::UTF8"
And it just works. As it should have done out of the box.
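The decoding loop in the plugin can be tried outside Mason on an ordinary array. The sketch below is illustrative only: decode_args is a hypothetical stand-in for the body of start_request_hook, the sample arguments are made up, and skipping references and undefined values is my own assumption (multi-valued parameters may arrive as array references), not something the plugin above does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the loop in start_request_hook: decode, in
# place, every plain string that is not already a character string.
sub decode_args {
    my ($args_ref) = @_;
    foreach my $arg ( @{$args_ref} ) {
        next if !defined $arg;    # assumption: leave undefined values alone
        next if ref $arg;         # assumption: leave references alone
        utf8::is_utf8($arg) || utf8::decode($arg);
    }
    return $args_ref;
}

# Made-up request arguments: "name" holds the raw utf-8 bytes for "Café"
# as a browser would send them; "note" is already a character string.
my @args = ( name => "Caf\xC3\xA9", note => "\x{263A}" );
decode_args( \@args );

printf "name has %d characters\n", length $args[1];
printf "note has %d characters\n", length $args[3];
```

Because foreach aliases the array elements, the strings are changed in place, which is why the plugin does not need to write anything back.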
So now you save your form data in your relational database system, get it back out, display it and find your old friend ‘Â’ is back. Sigh. Did I mention that perl is old?
I am assuming that you are using the DBI module to access the database. This is currently not utf-8 clean by default because most of the underlying drivers are not utf-8 clean by default. (And that is for “backwards compatibility”.)
The fix depends on your database driver, but many of them have a magic attribute you can pass to the DBI->connect call to make them utf-8 safe. Some of them are listed below:
| Driver | Database | Utf-8 attribute |
|---|---|---|
| DBD::Pg | PostgreSQL | pg_enable_utf8 => 1 |
| DBD::mysql | MySQL | mysql_enable_utf8 => 1 |
| DBD::SQLite | SQLite | sqlite_unicode => 1 |
(Warning: The MySQL option is currently “experimental and may change in future versions” and SQLite requires special handling of blobs with the unicode flag enabled.)
So for PostgreSQL you would do something like:
my $dbh = DBI->connect( $dsn, $user, $pass,
{ AutoCommit => 1, RaiseError => 0, pg_enable_utf8 => 1 } );
This of course assumes that you have created the database as a utf-8 database in the first instance! Sticking with PostgreSQL as the example, you would have done createdb -E utf8 ... from the command line or used the SQL command CREATE DATABASE ... ENCODING 'UTF8';.
If your database driver does not support utf-8 directly, you might want to consider the UTF8DBI module as a workaround.
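When you are hunting for the layer that still hands you raw bytes (form input, database rows, or something else), a small diagnostic can help. The following is a heuristic sketch using only the core Encode module: looks_undecoded is a hypothetical helper name, and it merely reports whether a value is an unflagged string containing high bytes that happen to form valid utf-8. It cannot detect text that has already been double encoded.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical diagnostic helper: returns 1 if $s looks like utf-8 bytes
# that were never decoded into perl characters, 0 otherwise.
sub looks_undecoded {
    my ($s) = @_;
    return 0 if !defined $s || utf8::is_utf8($s);    # already characters
    return 0 if $s !~ /[\x80-\xFF]/;                 # plain ASCII, nothing to decode
    my $copy = $s;    # decode() may modify its argument when a CHECK is given
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };
    return $ok ? 1 : 0;
}

# Made-up sample values standing in for data fetched from a form or database.
my %samples = (
    raw_utf8 => "Caf\xC3\xA9",    # undecoded utf-8 bytes
    latin1   => "Caf\xE9",        # Latin-1 bytes, not valid utf-8
    decoded  => "\x{263A}",       # already a character string
    ascii    => "Copyright",
);

foreach my $name ( sort keys %samples ) {
    printf "%-8s %s\n", $name, looks_undecoded( $samples{$name} ) ? 'needs decoding' : 'ok';
}
```

Calling it on a value fetched through DBI, for example, should tell you whether the driver attribute from the table above has taken effect for that connection.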
And now you are done! Enjoy and let me know of your experiences.
Join the discussion
One more thing to watch for...
I haven't narrowed the issue down completely, but I ran into a case today where an object that overrides stringification was breaking encoding for the entire page. Narsty bug to hunt down, but if you can't find any other problems, try putting this on a test page and see if it breaks your encoding (note: to fix, use $obj->value):
<%perl>
my $obj = Data::Currency->new;
</%perl>
<% $obj %>