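I gave it a try.
Pass an RSS URL as an argument, and this program guesses which generation wrote it. The training data for each generation is borrowed from Blogmura's (ブログ村) RSS feeds.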
#!/usr/local/bin/perl
use strict;
use utf8;
use XML::RSS;
use LWP::Simple;
use Text::MeCab;
use Encode;
use Algorithm::NaiveBayes;

# Fail early if no RSS URL was given.
my $check_rss = $ARGV[0] or die "usage: $0 RSS_URL\n";

my $_mecab = Text::MeCab->new();

# Split the referenced text into words with MeCab and count each word.
sub get_words($){
    my $ref_text = shift;
    my %ret = ();
    for (my $node = $_mecab->parse($$ref_text); $node; $node = $node->next()) {
        my $word = $node->surface();
        next unless defined $word && length $word;  # skip nodes with an empty surface
        $ret{$word}++;
    }
    return \%ret;
}

# Fetch an RSS feed and return the description of every item.
sub get_descs($){
    my $url = shift;
    my $data = get($url);
    die "failed to fetch $url\n" unless defined $data;  # get() returns undef on failure
    my $rss = XML::RSS->new();
    $rss->parse($data);
    return map { $_->{description} } @{$rss->{items}};
}

# One [feed URL, label] pair per Blogmura category.
my @categories = map {[sprintf('http://%s.blogmura.com/recent.rdf', $_), $_]} (
    'oyaji',
    'senior',
    'housewife',
    'ol',
    'salaryman',
    'university',
    'specialschool',
    'highschool',
    'juniorschool',
    'school',
);

# Train: every item description becomes one instance labeled with its category.
my $bayes = Algorithm::NaiveBayes->new();
foreach my $category (@categories){
    my ($url, $label) = @$category;
    warn "searching for $label infomation...\n";
    my @descs = get_descs($url);
    warn "finished\n";

    my $word_cnt = 0;
    foreach my $desc (@descs){
        my $ref_words = get_words(\ $desc);
        $bayes->add_instance(attributes => $ref_words, label => $label);
        # Count distinct words; a bare hash in numeric context is not a
        # reliable key count on older perls.
        $word_cnt += scalar keys %$ref_words;
    }
    warn "$word_cnt words was done.\n";
}
$bayes->train();

# Classify each entry of the feed given on the command line.
print "RESULTS\n", "\n";
my @check_descs = get_descs($check_rss);
foreach my $i (0 .. $#check_descs){
    my $ref_words = get_words(\ $check_descs[$i]);
    my $result = $bayes->predict(attributes => $ref_words);
    print "ENTRY " . ($i + 1) . "\n";
    foreach my $label (keys %$result){
        printf("%s %f\n", $label, $result->{$label} * 100);
    }
    print "\n";
}
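Frankly, the accuracy is far too low for this to be of any use (lol). Each category is trained on just one feed, only a few hundred words as the log below shows, so feeding it more RSS might make it a bit better.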
% perl test.pl http://d.hatena.ne.jp/hiratara/rss
searching for oyaji infomation...
finished
438 words was done.
searching for senior infomation...
finished
399 words was done.
searching for housewife infomation...
finished
370 words was done.
searching for ol infomation...
finished
313 words was done.
searching for salaryman infomation...
finished
328 words was done.
searching for university infomation...
finished
337 words was done.
searching for specialschool infomation...
finished
344 words was done.
searching for highschool infomation...
finished
375 words was done.
searching for juniorschool infomation...
finished
316 words was done.
searching for school infomation...
finished
318 words was done.
RESULTS

ENTRY 1
university 0.000000
salaryman 0.001270
specialschool 0.000000
oyaji 0.098629
ol 0.000000
highschool 0.000004
juniorschool 0.000000
senior 99.997384
housewife 0.716548
school 0.000000

ENTRY 2
university 0.000001
salaryman 0.019688
specialschool 0.000001
oyaji 0.104211
ol 0.007454
highschool 0.076126
juniorschool 0.000000
senior 99.999915
housewife 0.000018
school 0.000000

ENTRY 3
university 0.000001
salaryman 2.746964
specialschool 0.000024
oyaji 99.962177
ol 0.000059
highschool 0.005061
juniorschool 0.023965
senior 0.001732
housewife 0.129087
school 0.000000

ENTRY 4
university 0.000000
salaryman 0.002674
specialschool 0.000028
oyaji 9.121914
ol 0.000000
highschool 0.000093
juniorschool 0.000238
senior 99.583068
housewife 0.056186
school 0.000000

ENTRY 5
university 0.000000
salaryman 0.000159
specialschool 0.000000
oyaji 1.250409
ol 0.000000
highschool 0.000000
juniorschool 0.000003
senior 99.298917
housewife 11.754211
school 0.000000

ENTRY 6
university 0.000000
salaryman 0.386313
specialschool 0.000000
oyaji 97.542844
ol 0.000000
highschool 0.000044
juniorschool 0.000004
senior 22.021109
housewife 0.561419
school 0.000000

ENTRY 7
university 0.000000
salaryman 0.224335
specialschool 0.000010
oyaji 74.503064
ol 0.000009
highschool 0.002550
juniorschool 0.008731
senior 0.006324
housewife 66.702646
school 0.000000
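
One thing to keep in mind when reading these numbers: the scores within an entry do not add up to 100. In ENTRY 7, oyaji (74.5) and housewife (66.7) alone come to about 141, so they are better read as relative scores than as probabilities. Below is a minimal sketch, not part of the script above, of how such a score hash could be rescaled into shares that sum to 1 and then ranked. The helpers `to_shares` and `ranked_labels` are hypothetical names made up for this illustration, and the input is assumed to be a plain `{ label => score }` hash like the one printed above.

```perl
use strict;
use warnings;

# Hypothetical helper: rescale a { label => score } hash so the values sum to 1.
sub to_shares {
    my ($scores) = @_;
    my $total = 0;
    $total += $_ for values %$scores;
    my %shares;
    $shares{$_} = $total ? $scores->{$_} / $total : 0 for keys %$scores;
    return \%shares;
}

# Hypothetical helper: labels ordered from highest to lowest score
# (ties broken alphabetically so the order is stable).
sub ranked_labels {
    my ($scores) = @_;
    return sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
}

# Scores for ENTRY 7 from the run above, divided by 100
# (labels with near-zero scores are left out for brevity).
my %entry7 = (
    oyaji     => 0.74503064,
    housewife => 0.66702646,
    salaryman => 0.00224335,
);

my $shares = to_shares(\%entry7);
my ($best, $second) = ranked_labels($shares);
printf "%s %.1f%% vs %s %.1f%%\n",
    $best,   $shares->{$best}   * 100,
    $second, $shares->{$second} * 100;
```

Seen this way, ENTRY 7 is close to a coin flip between oyaji and housewife, which fits the impression that the classifier has too little data to separate the categories.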