Ovid's Journal: Oh god, please, no.

Struggling all day with Gutenberg. Someone (not naming them as I don't have permission) sent me code to let me use Redland for my RDF parsing and it looks lovely. Too bad Redland doesn't compile for anyone. Didn't compile for me, either.

I put this aside for a bit and tried parsing result pages.

Tried to use the Web::Scraper module to at least pull results from Web pages, but I'm too stupid to figure out its syntax. Learning a new API, CSS selectors and battling strange "don't know what to do with undef" errors proved too much. Embarrassing.

I thought to use HTML::TableParser for some stuff, but that doesn't seem to let me at the attributes I need.

I thought XPath would be good, but it's not well-formed XML. Someone mentioned to me that there might be an XPath module which might have an option which might let you parse malformed XML. I didn't follow up on that.

I finally switched to my HTML::TokeParser::Simple module for this. It's not a good fit for this problem. No, scratch that. It's a bad fit for this problem, but it worked. Then I turned back to search. For this, I used WWW::Mechanize. Notice anything, um, crap about these damned results?

sub search {
    my $self = shift;
    my $mech = WWW::Mechanize->new(
        agent     => 'App::Gutenberg (perl)',
        autocheck => 1,
    );

    $mech->get(App::Gutenberg->search_url);

    $mech->submit_form(
        form_number => 1,
        fields      => {
            'author' => ($self->author || ''),
            'title'  => ($self->title  || ''),
        }
    );

    my $uri = $mech->uri;
    if ( $uri =~ /#([[:word:]]+)\z/ ) {
        # you have got to
    }
    else {
        # be kidding me
    }
}

If that URL matches, you're indexing into a list of <li> elements. Otherwise, you're parsing a table. Either way, it's a right pain to get the data you want. Oh, and the two give you subtly different sets of data, and the criteria for getting one type of result or the other are unclear.

This is why I want to see REST for just about anything today. It's simple. It's straightforward. It doesn't make me cry. Now I know why you don't see Perl command line clients for Gutenberg. Everything I'm writing is so damned fragile it will break if you look at it funny. *sniff*

Update: it looks like any search with an author will return a list, but all other searches (only tested the basic form) return tables.
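
A minimal sketch of how the two branches might be kept apart, so that the fragile bits at least live in one place. The helper name result_kind and the example.org URLs are made up for illustration; only the regex comes from the search sub above.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which parser a results page needs, based
# only on whether the final URI ends in a "#fragment".
sub result_kind {
    my ($uri) = @_;
    if ( $uri =~ /#([[:word:]]+)\z/ ) {
        # Fragment present: the results are a list of <li> elements.
        return { kind => 'list', anchor => $1 };
    }
    # No fragment: the results are in a table.
    return { kind => 'table', anchor => undef };
}

# Made-up URLs, for illustration only.
for my $uri ( 'http://example.org/results#a123', 'http://example.org/results' ) {
    my $result = result_kind($uri);
    printf "%s => %s\n", $uri, $result->{kind};
}
```

The caller can then dispatch on the kind and keep the list parser and the table parser as separate subs.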

Ovid's Journal: Gutenberg RDF

Gutenberg's complete and total lack of an API is killing me. I decided to play with this today, only to find that I can't even parse the core RDF file. Well, ok, so they have an API. RDF::Core::Model::Parser is too slow to parse it. I'll have to rip it apart and shove it in a database. I liked inkdroid's suggestion of SRU, but that requires a 'net connection. I want a resource I can grab and use offline. For the time being, though, I'll have to rely on some 'net connection for searching. Still, SRU eventually just directed me back to the Gutenberg Web pages and that forces me to scrape the HTML. Since I have to do that anyway, I'll skip SRU and go straight to the Gutenberg site.

Perlbuzz: Parrot 1.0 will be out in March 2009

At the first Parrot Developer Summit in Mountain View, CA, core Parrot developers got together and worked on the plan for Parrot language, including a release schedule for the next three years. From the summary posted by Allison Randal:

  • March 2009, 1.0, stable API for language developers
  • July 2009, 1.5, integration
  • January 2010, 2.0, production use
  • July 2010, 2.5, portability
  • January 2011, 3.0, independence
  • July 2011, 3.5, green fields

Very cool that they'll be stabilizing the API so that language development can have a solid base to work with. I'm just a little disappointed that "production use" is 14 months away. Does that mean that the soonest Rakudo will be available for "production use" will be January 2010?

Alias's Journal: Broken Gravatars - Why you should test with filters...

One of the more interesting aspects of the web being so universal is that it gets pulled in a lot of different directions, so sites can stop looking like what you think they look like.

One example is the YAPC::NA/OSDC website, where none of the sponsor logos on the website were working for me, yet it looked fine for other people on the committee.

The problem turned out to be Ad-Block Plus. More specifically, it was because the guy doing the website had, naturally enough, put the images into an /images/sponsors/ directory.

And of course, since tons of ads are in things with the word "sponsors" in them, all the conference sponsor logos were blocked.
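
To illustrate the failure mode, here is a rough sketch of a naive substring filter. The pattern list, the is_blocked helper and the example.org URLs are all hypothetical; real Ad-Block Plus filter lists are more elaborate than this, but the false positive works the same way.

```perl
use strict;
use warnings;

# Hypothetical block list: any URL containing one of these is dropped.
my @blocked_substrings = ( '/sponsors/', '/ads/' );

# Returns true if the URL contains any blocked substring.
sub is_blocked {
    my ($url) = @_;
    return scalar grep { index( $url, $_ ) >= 0 } @blocked_substrings;
}

# Made-up URLs: a legitimate conference logo and an ordinary image.
for my $url (
    'http://example.org/images/sponsors/logo.png',
    'http://example.org/images/supporters/logo.png',
  )
{
    printf "%-50s %s\n", $url, is_blocked($url) ? 'BLOCKED' : 'ok';
}
```

Under this kind of filter, simply renaming the directory would be enough to make the logos show up again.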

We currently have a similar problem on http://search.cpan.org/ with broken gravatars. Because gravatars are most popular in social network type websites, they fall into the "social networking" category of the corporate fun filter here at work, along with MySpace, FaceBook and all that junk.

Result, ugly broken image square on the search.cpan.org module pages :(

So note for future, if you ever use gravatars or other similar systems with social networking background for more serious purposes, always make sure to put a squid proxy or some other form of obfuscation on a serious domain in front of the raw service, so that fun filters won't generate false-positives on them.

Perlipse Development: 0.9.8 released

hot off the compiler i present to you perlipse 0.9.8. some items of interest:
  • java 1.5 is now required
  • todo task tags
  • improved syntax highlighting
  • detail formatters for debugging
head on over to the project page for more details.

Perlipse Development: syntax highlighting improvements

syntax highlighting improvements have landed, here's a list of things that can now be highlighted:
  • keywords, barewords, and file handles
  • variables (arrays, hashes, scalars)
  • quote like/regular expression operators
  • package/subroutine name declarations
here's a quick screen shot of the highlighting in action:

syntaxColors.png

i'm sure there are a plethora of edge cases that have yet to be discovered, so please feel free to file a bug report should you encounter one.

Ovid's Journal: Test::Most::Exception - Important, But You Won't Need It

Recently I uploaded the new Test::Most. From the end user perspective, the only real change (aside from being a touch easier to install), is that if you ask it to die when a test fails, it no longer just dies. Instead, it throws a Test::Most::Exception. The vast majority of people will never, ever need this feature. However, our test suite gets a bit tricky at times and we do things like this:

foreach my $test (@tests) {
    my $tests_finished = eval { $test->run };
    if ( my $error = $@ ) {
        report_error( $test, $error );
    }
    else {
        ...
    }
}

Internally, the way we report failures depends very much upon whether or not the tests halted because Test::Most was told to halt on failures, or whether they really died. Now I can just do this in the &report_error sub:

if ( eval { $error->isa('Test::Most::Exception') } ) { ... }

And much grief is saved.
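
For anyone who hasn't used exception objects this way, here is a self-contained sketch of the pattern. It uses a stand-in class, My::Halt::Exception, and a hypothetical classify_error sub rather than Test::Most itself, so none of the names below are part of the Test::Most API.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

# Stand-in exception class (hypothetical); in the real test suite the
# class to check for is Test::Most::Exception.
package My::Halt::Exception {
    sub new   { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub throw { my $class = shift; die $class->new(@_) }
}

# Tell a deliberate halt apart from a genuine death.
sub classify_error {
    my ($error) = @_;
    return 'halted'
      if blessed($error) && $error->isa('My::Halt::Exception');
    return 'died';
}

for my $code (
    sub { My::Halt::Exception->throw( message => 'test failed' ) },
    sub { die "something really broke\n" },
  )
{
    eval { $code->(); 1 } or print classify_error($@), "\n";
}
```

The blessed() guard does the same job as wrapping ->isa in an eval: it avoids calling a method on a plain string error.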

In other news, I've been asked to add timing data to Test::Aggregate and I've thought that prove's state mechanism should possibly be extended to capture aggregated state information. In other words, while I didn't really intend to, I'm writing yet another new test harness. You would think I've learned my lesson after writing the new Test::Harness (also by accident, I might add).

And yes, I've toyed with colored test output for it ...

Shlomi Fish's Journal: Why Dist-Zilla is Probably Not For Me

rjbs' introduction to Dist-Zilla piqued my interest, and I used CPANPLUS-Dist-Mdv to prepare .rpm's for it and its dependencies and install them. However, I wondered about a potential problem with it, before I even tried it, and speaking with rjbs on IRC confirmed that it exists.

Dist-Zilla generates the resulting .pm, scripts, etc. from templates, and as a result the lines that are reported by errors and warnings are not the same as the ones you've edited. This makes tracing lines back to their source much more difficult. Since most of my time is spent debugging and handling errors (whether I encounter them or I find them on CPAN testers or in bug reports), and I want to edit the source directly, I think that Dist-Zilla is not for me. What I'd like instead is a way to change common, repeated text (like version numbers, licensing, URLs, etc.) consistently across all the files in a certain distribution, or in many distributions. That, and for module-starter (or whatever) to provide scaffolding for creating a new .pm file.
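
As a rough idea of what such a tool might do for version numbers, here is a minimal sketch. The bump_version sub is hypothetical and assumes every module declares its version as our $VERSION = '...'; on a single line. It is not taken from module-starter or Dist-Zilla. Because it replaces text on the same line, the line numbers in the edited files stay the same.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: rewrite "our $VERSION = '...';" in every .pm file
# under $dir, editing the real source files in place. Returns the list
# of files that were changed.
sub bump_version {
    my ( $dir, $new_version ) = @_;
    my @changed;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                my $path = $File::Find::name;
                return unless -f $path && $path =~ /\.pm\z/;

                open my $in, '<', $path or die "Cannot read $path: $!";
                my $source = do { local $/; <$in> };
                close $in;

                # Keep the original quote character; swap only the value.
                my $count = ( $source
                      =~ s/^(our\s+\$VERSION\s*=\s*)(['"])[^'"]*\2/$1$2$new_version$2/mg );
                return unless $count;

                open my $out, '>', $path or die "Cannot write $path: $!";
                print {$out} $source;
                close $out or die "Cannot close $path: $!";
                push @changed, $path;
            },
        },
        $dir,
    );
    return @changed;
}

# Usage: perl bump.pl lib 0.02
if ( @ARGV == 2 ) {
    print "$_\n" for bump_version(@ARGV);
}
```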

I appreciate rjbs's efforts, but I'm going to stick with module-starter (and contribute to it).

acme's bits: Perl 5.8.9 RC1 is out

As Nicholas pointed out on the front page of use.perl.org, Perl 5.8.9 Release Candidate 1 has been uploaded to CPAN. It contains a whole bunch of changes since Perl 5.8.8. We have tested it on our machines. We have tested it on build farms. We have tested it with as much of CPAN as we could. What we haven't tested it with is the DARKPAN: your code, your computer and servers, your work code, or any of your modules. Blame Transfer Protocol initiated, as Nicholas points out slightly more informally on the London.pm mailing list: we would be very grateful indeed if you could test the release candidate with all the code that you have to hand, and especially with work code. Think of it as your civic duty, if you lived in the city of Perl.


Ovid's Journal: Gutenberg API

As far as I can tell from reading the archives and checking their Web site, Project Gutenberg does not appear to have an API. The closest I've found is an RSS feed and an RDF document. These don't really constitute an API, but the latter can be parsed for adding to an SQLite database. Still trying to figure this out, though. Trying to grab one version of their catalog in RDF format:

gutenberg $ tar -xjf catalog.rdf.bz2
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers
tar: Error exit delayed from previous errors

I was able to unzip their .zip version of the same file, but I was disappointed to learn that their Perl examples are rather old and no longer appear to properly parse the data.
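
The tar errors above are what you would expect if catalog.rdf.bz2 is a single bzip2-compressed file rather than a bzipped tarball. That is an assumption based on the filename and the error messages, not something checked against the download. If it holds, a sketch like this, using the core IO::Uncompress::Bunzip2 module, should decompress it without tar. The decompress_bz2 sub is a hypothetical wrapper.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical wrapper: decompress a plain .bz2 file (not a tarball)
# from $in to $out, dying with the library's error message on failure.
sub decompress_bz2 {
    my ( $in, $out ) = @_;
    bunzip2( $in => $out )
      or die "bunzip2 failed: $Bunzip2Error\n";
    return $out;
}

# Usage: perl unbz2.pl catalog.rdf.bz2 catalog.rdf
if ( @ARGV == 2 ) {
    decompress_bz2(@ARGV);
}
```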

But why would you care? Because I think I want to make this happen:

gutenberg --read "Art of War"

You know, sometimes I worry about posting neat ideas to use.perl for fear that someone would jump the gun and Just Do It. I realize now that this is foolish for two reasons. First, they Won't Just Do It. Second, if they did, I'd be happy just to have the project done :)

Suggestions welcome. There needs to be an easy way to update the database, track what a user has read, allow them to "bookmark" a book (or better yet, "annotate" a document), etc. I've never used an eReader. I never gave a damn about them, really, because I like the feeling of a book in my hands. Still, this seems worthwhile.

Alias's Journal: My 2nd generation release automation hits 1000 releases

After moving from CVS to SVN a number of years ago, I heavily modified my original first generation release script to take advantage of some of the basic features of SVN, like everything being accessible by HTTP by default.

For example, you can see the current incarnation of my release automation directly in the repository here:

http://svn.ali.as/cpan/tools/release.pl

Because my release automation is so intensely tuned to the way I work, I've never felt that (as a whole process) it was worth foisting on the CPAN. I find it quite interesting that of all the people with 100+ CPAN modules, no two of them use the same release system, and all of them (except maybe SMUELLER) rolled their own.

So I release bits and pieces, the reusable components used to implement single tasks, and that's been enough for me.

After 1000 releases, however, my current system is definitely feeling the strain. Although it makes releasing extremely quick, the lack of awareness of the overall repository is becoming a problem.

With so many packages, and several dozen external committers, it has become impossible to know which modules in the trunk match the most recent release, which have updates, and whether or not those updates are significant.

It's also impossible to do any form of bulk updates, bulk incremental release, or bulk releases (something that has always bothered me when I maintain Module::Install).

So to mark the 1000th release by my second generation release automation, I've commenced development on a new third generation "release" automation system that could perhaps be more accurately described as a "repository automation system".

Having recommitted to using svn for my repository (for reasons that may become more obvious later) and with the code likely to be tied very closely to the way I work, it will most likely be of no direct use to anyone else. And the "release" packages won't be going up to the CPAN.

But if you are interested in the whole release automation area, you may wish to watch and see what comes out of it.

You can see the beginnings of this work at the following path:

http://svn.ali.as/cpan/trunk/ADAMK-Repository/

The initial first few "releases" will be focused on building up a cohesive object model of my own repository. The intent is to understand what modules are maintained in the repository, what releases have been done and WHEN (in repository revision terms) and what SVN path/revision pairs map to which conceptual places.

With this in place, a second layer of logic can focus on issues of release state vs repository state, on change (and who made those changes), and on knowing if modules are passing tests, and whether or not arbitrary external patches will break those tests.

And finally, now that I have a good Perl editor in sight (albeit 6 months or a year from being ideal for me), I have a chance to tie the repository automation directly to my editor with a personal plugin that lets me just look at any file within a package, run a "release" command from a menu, and have everything needed to SAFELY release modules.

And man, THAT would be awesome.

Sebastian Riedel - Perl and the Web: Environments for Mojolicious

Back in the days when i was working on Catalyst, i thought it would be a great idea to have a special debug environment. You could activate it with a simple -Debug flag in the import list. But what if your development process has more stages than development and production? Right, you are screwed. That's why we've decided to use a more ambitious concept we call Modes in Mojolicious, where you can have an unlimited number of different environments.
package MyApp;
use base 'Mojolicious';

sub production_mode {
    my $self = shift;

    # Production templates
    $self->renderer->root('/Users/production/templates');
}

sub development_mode {
    my $self = shift;

    # Development templates
    $self->renderer->root('/Users/dev/templates');
}

sub startup {
    my $self = shift;

    # Default templates for everything else
    $self->renderer->root('/Users/default/templates');
}

1;
As you can see we've chosen a very minimalistic approach that feels in line with the rest of Mojolicious. Switching between modes is as easy as changing an environment variable.
% MOJO_MODE=production bin/my_app daemon
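
To show the general idea of mode dispatch without needing Mojolicious installed, here is a minimal plain-Perl sketch. The My::ModeApp class, its boot method and the template paths are all hypothetical, and this is not how Mojolicious itself is implemented. In particular, the sketch assumes startup runs first so that a mode hook can override it, which the text above does not spell out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an application class with per-mode hooks.
package My::ModeApp {
    sub new { return bless { root => undef }, shift }

    # Combined getter/setter for the template root.
    sub root {
        my $self = shift;
        $self->{root} = shift if @_;
        return $self->{root};
    }

    sub startup          { $_[0]->root('templates/default') }
    sub production_mode  { $_[0]->root('templates/production') }
    sub development_mode { $_[0]->root('templates/dev') }

    # Run the defaults, then the "<mode>_mode" hook if one is defined.
    sub boot {
        my ( $self, $mode ) = @_;
        $self->startup;
        my $hook = $self->can( $mode . '_mode' );
        $self->$hook() if $hook;
        return $self;
    }
}

# Pick the mode from the environment, falling back to development.
my $mode = $ENV{MOJO_MODE} || 'development';
print My::ModeApp->new->boot($mode)->root, "\n";
```

Because the hook is looked up by name with can, adding a staging mode is just a matter of defining a staging_mode method; unknown modes fall through to the defaults.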

Alias's Journal: Subtle improvements in Windows Vista

I've moved my home 4-core parallel-coding/gaming machine over to Vista, following the typical "wait for the first service pack" approach that most people take when considering new Microsoft operating systems.

I have to say that it pretty much WORKSFORME and apart from a few different dialog boxes and what not I'm settling in mostly fine. And of course, being a developer, I've jealously turned off all the eye candy to save the resources for myself :)

One of the more subtle changes in Vista that I hadn't paid attention to before but have noticed since I started doing some Padre coding on it is that Vista has changed the directory layout for home directories.

Instead of the previous (horrible) style...

C:\Documents and Settings\Adam\My Documents

We now have a much cleaner...

C:\Users\Adam\Documents

The obvious improvement here is for the server hackers and command line users, who have much less to type.

But one additional and more interesting benefit is that this change removes all of the whitespace from these paths. What attracts me most about this seemingly simple change is that in a single stroke it lets us avoid a ton of deeply buried bugs in certain very old Perl modules that require lists passed as whitespace-separated strings.
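
A small sketch of the kind of bug in question. The naive_roundtrip sub is hypothetical and does not come from any particular module; it just joins a list of paths into one whitespace-separated string and splits it again, which is what such interfaces effectively do.

```perl
use strict;
use warnings;

# Hypothetical illustration: pass a list of paths through a
# whitespace-separated string and see what comes out the other end.
sub naive_roundtrip {
    my @paths  = @_;
    my $joined = join ' ', @paths;
    return split ' ', $joined;
}

for my $path (
    'C:\Documents and Settings\Adam\My Documents',    # XP style
    'C:\Users\Adam\Documents',                        # Vista style
  )
{
    my @pieces = naive_roundtrip($path);
    printf "%s\n  => %d piece(s)\n", $path, scalar @pieces;
}
```

The XP-style path comes back as several fragments, none of which is a real directory, while the Vista-style path survives intact.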

Avoiding these bugs is one of the reasons that Strawberry Perl is locked down to C:\strawberry, because if we allowed installation to arbitrary paths it would be too easy for people to enter whitespace-containing paths that might break things in unexpected ways. (The other reason is the lack of power in InnoSetup, which I'm in the process of fixing by moving to WiX.)

It also simplifies things somewhat for people with whitespace escaping troubles.

Now granted, it doesn't FIX the problems as such. The escaping bug fixes still need to be done. But now that the bug is not EXPOSED on quite so many operating systems, the scope of the bug is limited and its impact is reduced between now and such time as it is resolved more thoroughly in all the various places fixes are needed.

The ultimate effect of this "whitespace tweak" is that Vista (and I assume Server 2003/2008 as well) are much better platforms for building Unix-originated Open Source applications on than XP was.

As much as I had grown to like "My Documents" (and File::HomeDir will continue to use the terminology in ->my_documents) I think this change represents a win for engineering over marketing, potentially at a slight cost to usability for computer novices.

But I guess that as a ratio of the population, computer novices are in a slow inevitable decline (at least in the developed world where the bulk of Microsoft's revenue comes from) so this does make some sense...
