Pixel Pedals of Tomakomai

The everyday life of a middle-aged guy from Tomakomai, Hokkaido

Algorithm::NaiveBayes

I gave it a try.


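This program takes the URL of an RSS feed as its argument and guesses which generation wrote it. The per-generation training data is borrowed from the Blogmura RSS feeds.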

#!/usr/local/bin/perl
use strict;
use utf8;
use XML::RSS;
use LWP::Simple;
use Text::MeCab;
use Encode;
use Algorithm::NaiveBayes;

my $_mecab = Text::MeCab->new();
sub get_words($){
    my $ref_text = shift;
    my %ret = ();

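    for (my $node = $_mecab->parse($$ref_text); $node; $node = $node->next()) {
        my $word = $node->surface();
        # Skip the sentence-boundary (BOS/EOS) nodes, whose surface is empty.
        next unless defined $word && length $word;
        $ret{$word}++;
    }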

    return \%ret;
}


sub get_descs($){
    my $url = shift;
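    my $data = get($url);
    # LWP::Simple::get returns undef when the fetch fails.
    die "could not fetch $url\n" unless defined $data;
    my $rss = XML::RSS->new();
    $rss->parse($data);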

    return map { $_->{description} } @{$rss->{items}};
}


my @categories = map {[sprintf('http://%s.blogmura.com/recent.rdf', $_), $_]} (
'oyaji',
'senior',
'housewife',
'ol',
'salaryman',
'university',
'specialschool',
'highschool',
'juniorschool',
'school',
);


my $bayes = Algorithm::NaiveBayes->new();
foreach(@categories){
    my ($url, $label) = @$_;

    warn "searching for $label infomation...\n";
    my @descs = get_descs($url);
    warn "finished\n";

    my $word_cnt = 0;
    foreach (@descs){
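        my $ref_words = get_words(\ $_);
        $bayes->add_instance(attributes => $ref_words,
                             label => $label);

        # Count distinct words. A bare hash in numeric context only yields
        # the key count on perl 5.26 and later, so ask for keys explicitly.
        $word_cnt += scalar(keys %$ref_words);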
    }
    warn "$word_cnt words was done.\n";
}

$bayes->train();

print "RESULTS\n", "\n";

my $check_rss = $ARGV[0] or die "usage: $0 RSS_URL\n";
my @check_descs = get_descs($check_rss);

foreach my $i (0 .. $#check_descs){
    my $ref_words = get_words(\ $check_descs[$i]);
    my $result = $bayes->predict(attributes => $ref_words);
    print "ENTRY " . ($i + 1) . "\n";
    foreach(keys %$result){
        printf("%s    %f\n", $_, $result->{$_} * 100);
    }
    print "\n";
}
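
As an aside on what the train and predict calls in the script above are doing: here is a rough sketch of a multinomial naive Bayes classifier over word-count hashes of the same shape that get_words returns. It is only an illustration of the general technique, not the code inside Algorithm::NaiveBayes. The function names (train_nb, score_nb, best_label), the add-one smoothing, and the toy English training words are all assumptions made up for this example.

```perl
use strict;
use warnings;

# Collect per-label statistics. Each instance is [ $label, \%word_counts ].
sub train_nb {
    my ($instances) = @_;
    my (%docs, %words, %total, %vocab);
    for my $inst (@$instances) {
        my ($label, $counts) = @$inst;
        $docs{$label}++;
        for my $w (keys %$counts) {
            $words{$label}{$w} += $counts->{$w};
            $total{$label}     += $counts->{$w};
            $vocab{$w} = 1;
        }
    }
    return {
        docs       => \%docs,
        words      => \%words,
        total      => \%total,
        vocab_size => scalar(keys %vocab),
        n_docs     => scalar(@$instances),
    };
}

# Log score of every label for one word-count hash:
# log P(label) + sum over words of count * log P(word | label).
sub score_nb {
    my ($model, $counts) = @_;
    my %score;
    for my $label (keys %{ $model->{docs} }) {
        my $seen  = $model->{words}{$label} || {};
        my $total = $model->{total}{$label} || 0;
        # Prior: share of training documents carrying this label.
        my $s = log($model->{docs}{$label} / $model->{n_docs});
        for my $w (keys %$counts) {
            # Add-one smoothing (an assumption) so unseen words never give log(0).
            my $p = (($seen->{$w} || 0) + 1) / ($total + $model->{vocab_size});
            $s += $counts->{$w} * log($p);
        }
        $score{$label} = $s;
    }
    return \%score;
}

# Label with the highest log score (ties broken alphabetically).
sub best_label {
    my ($score) = @_;
    my ($best) = sort { $score->{$b} <=> $score->{$a} || $a cmp $b } keys %$score;
    return $best;
}

# Toy data, invented for illustration only.
my $model = train_nb([
    [ senior     => { garden => 2, pension => 1 } ],
    [ highschool => { exam   => 2, club    => 1 } ],
]);
print best_label(score_nb($model, { exam => 1, club => 1 })), "\n";
```

With so few training documents per label, a handful of words that happen to appear in one feed can dominate the score, which is one plausible reason a classifier like this swings so hard toward a single label.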


Frankly, the accuracy is so low that it is completely useless (lol). Feeding it more RSS data might make it a bit better.

% perl test.pl http://d.hatena.ne.jp/hiratara/rss
searching for oyaji infomation...
finished
438 words was done.
searching for senior infomation...
finished
399 words was done.
searching for housewife infomation...
finished
370 words was done.
searching for ol infomation...
finished
313 words was done.
searching for salaryman infomation...
finished
328 words was done.
searching for university infomation...
finished
337 words was done.
searching for specialschool infomation...
finished
344 words was done.
searching for highschool infomation...
finished
375 words was done.
searching for juniorschool infomation...
finished
316 words was done.
searching for school infomation...
finished
318 words was done.
RESULTS

ENTRY 1
university    0.000000
salaryman    0.001270
specialschool    0.000000
oyaji    0.098629
ol    0.000000
highschool    0.000004
juniorschool    0.000000
senior    99.997384
housewife    0.716548
school    0.000000

ENTRY 2
university    0.000001
salaryman    0.019688
specialschool    0.000001
oyaji    0.104211
ol    0.007454
highschool    0.076126
juniorschool    0.000000
senior    99.999915
housewife    0.000018
school    0.000000

ENTRY 3
university    0.000001
salaryman    2.746964
specialschool    0.000024
oyaji    99.962177
ol    0.000059
highschool    0.005061
juniorschool    0.023965
senior    0.001732
housewife    0.129087
school    0.000000

ENTRY 4
university    0.000000
salaryman    0.002674
specialschool    0.000028
oyaji    9.121914
ol    0.000000
highschool    0.000093
juniorschool    0.000238
senior    99.583068
housewife    0.056186
school    0.000000

ENTRY 5
university    0.000000
salaryman    0.000159
specialschool    0.000000
oyaji    1.250409
ol    0.000000
highschool    0.000000
juniorschool    0.000003
senior    99.298917
housewife    11.754211
school    0.000000

ENTRY 6
university    0.000000
salaryman    0.386313
specialschool    0.000000
oyaji    97.542844
ol    0.000000
highschool    0.000044
juniorschool    0.000004
senior    22.021109
housewife    0.561419
school    0.000000

ENTRY 7
university    0.000000
salaryman    0.224335
specialschool    0.000010
oyaji    74.503064
ol    0.000009
highschool    0.002550
juniorschool    0.008731
senior    0.006324
housewife    66.702646
school    0.000000
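
One thing worth noting about the figures above: within an entry they do not add up to 100 (ENTRY 7 has 74.5 and 66.7 side by side), while their squares, after dividing by 100, add up to roughly 1. That suggests the scores come back scaled to unit length rather than as percentages. A minimal sketch, under that assumption, of turning such scores into shares that sum to 1 and listing the labels from most to least likely follows. The helper names to_shares and ranked_labels are made up for this example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Rescale a label => score hash so that the values add up to 1.
sub to_shares {
    my ($scores) = @_;
    my $total = sum(values %$scores) || 1;   # guard against empty or all-zero input
    return { map { $_ => $scores->{$_} / $total } keys %$scores };
}

# Labels ordered from highest to lowest score (ties broken alphabetically).
# Call this in list context.
sub ranked_labels {
    my ($scores) = @_;
    return sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
}

# The three largest values of ENTRY 7 above, divided by 100 again.
my %entry7 = (
    oyaji     => 0.74503064,
    housewife => 0.66702646,
    salaryman => 0.00224335,
);
my $shares = to_shares(\%entry7);
printf "%-12s %6.2f%%\n", $_, $shares->{$_} * 100 for ranked_labels($shares);
```

In the original script the same two helpers could be applied to the hash reference returned by predict before printing, so each entry would read as a ranked list instead of hash order.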