Reading notes on perldoc WWW::Mechanize.
There is also a WWW::Mechanize::FAQ.
my $mech = WWW::Mechanize->new()
In addition to everything LWP::UserAgent does, the WWW::Mechanize constructor sets up the User-Agent string and a cookie jar as follows:
agent => "WWW-Mechanize/#.##"
cookie_jar => {} # an empty, memory-only HTTP::Cookies object
To change the user agent:
my $mech = WWW::Mechanize->new( agent=>"wonderbot 1.01" );
To disable cookies:
my $mech = WWW::Mechanize->new( cookie_jar => undef );
The following parameters, which LWP::UserAgent does not have, can also be set:
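autocheck => [0|1] :: checks each request for success; a failure is reported through onerror
onerror => \&func :: reference to the function called when an error occurs (Carp::croak by default)
quiet => [0|1] :: set to 1 to suppress warnings (default 0)
stack_depth => $value :: number of pages kept in the history used by back(); 0 keeps no history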
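A minimal sketch of passing these parameters to the constructor. The error handler that collects messages into @errors is a hypothetical choice for illustration, and the values are arbitrary; no request is made, so it runs offline.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical error handler: collect messages instead of dying.
my @errors;

my $mech = WWW::Mechanize->new(
    agent       => 'wonderbot 1.01',              # custom User-Agent string
    autocheck   => 0,                             # do not check each request
    onerror     => sub { push @errors, "@_" },    # called when an error is reported
    quiet       => 1,                             # suppress warnings
    stack_depth => 5,                             # keep at most 5 pages of history
);

print $mech->agent, "\n";          # User-Agent that will be sent
print $mech->stack_depth, "\n";    # configured history depth
print $mech->quiet ? "quiet\n" : "noisy\n";
```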
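$mech->agent_alias( $alias ) :: sets the User-Agent to the full browser string registered for $alias; for example, 'Windows IE 6' becomes
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)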
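A short sketch of switching to an alias. It assumes the installed WWW::Mechanize still ships the 'Windows IE 6' alias; known_agent_aliases() lists the names that are actually available.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# List the alias names this installed version knows about.
print "$_\n" for $mech->known_agent_aliases();

# Replace the default "WWW-Mechanize/#.##" string with a browser-like one.
$mech->agent_alias('Windows IE 6');
print $mech->agent, "\n";
```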
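$mech->content( format => "text" ) :: returns the page as plain text with the HTML markup stripped (needs HTML::TreeBuilder)
$mech->content( base_href => [$base_href|undef] ) :: returns the HTML with a <base href="$base_href"> tag added to the head; with undef, $mech->base() is used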
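A sketch of both content() variants, run against a temporary local file so that no network is needed. The page contents and the http://example.com/ base are made up for illustration, and HTML::TreeBuilder is assumed to be installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Write a tiny page to a temporary file so the sketch runs offline.
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} '<html><head><title>Demo</title></head>'
          . '<body><p>Hello <b>world</b></p></body></html>';
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs($path)->as_string );

# Plain text: the markup is stripped.
my $text = $mech->content( format => 'text' );

# HTML with a <base> tag inserted after <head>.
my $based = $mech->content( base_href => 'http://example.com/' );

print "$text\n";
print "$based\n";
```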
$mech->follow_link( text => "download", n => 3 );
$mech->follow_link( url_regex => qr/download/i );
$mech->follow_link( url_regex => qr/(?i:download)/ ); # same thing
$mech->follow_link( n => 3 );
"text => 'string'," ::
- the link text matches 'string' exactly
<pre class="example">
$mech->find_link( text => "download" );
</pre>
"text_regex => qr/regex/," ::
- the regular expression matches the link text
<pre class="example">
$mech->find_link( text_regex => qr/download/i );
</pre>
"url => 'string'," ::
- the url matches string
"url_regex => qr/regex/," ::
"url_abs => string" ::
- the absolute url (url_abs) matches string
"url_abs_regex => regex" ::
"name => string" ::
- the name matches string
"name_regex => regex" ::
"tag => string" ::
"tag_regex => regex" ::
$mech->find_link( tag_regex => qr/^(a|frame)$/ );
$mech->find_link() : link format
If n is not specified with any of these, it defaults to 1 and the first match is returned.
Specifying several criteria at once combines them with AND.
$mech->find_link( text => "News", url_regex => qr/cnn\.com/ );
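find_link returns a single WWW::Mechanize::Link object, or undef when nothing matches. To get every match, use find_all_links, which in scalar context returns a reference to an array of WWW::Mechanize::Link objects.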
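The link criteria above can be tried offline against a temporary file. This is only a sketch: the page and its URLs are invented for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Made-up page with three links, written to a temporary file.
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} <<'HTML';
<html><body>
<a href="http://www.cnn.com/news">News</a>
<a href="http://example.com/download/a.zip">download</a>
<a href="http://example.com/DOWNLOAD/b.zip">Download mirror</a>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs($path)->as_string );

# Exact text match; n defaults to 1, so the first hit is returned.
my $exact = $mech->find_link( text => 'download' );

# Case-insensitive regex on the URL, taking the second match.
my $second = $mech->find_link( url_regex => qr/download/i, n => 2 );

# Two criteria are combined with AND.
my $news = $mech->find_link( text => 'News', url_regex => qr/cnn\.com/ );

# Every match at once, as a list of WWW::Mechanize::Link objects.
my @all = $mech->find_all_links( url_regex => qr/download/i );

print $_->url, "\n" for $exact, $second, $news;
print scalar(@all), " download links\n";
```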
$mech->find_image( alt => "News", url_regex => qr/cnn\.com/ );
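$mech->set_fields( $name => $value ... ) ::
- sets several fields of the current form at once
- if several fields share a name, the value goes to the first one found
- to choose which of the same-named fields to set, pass an anonymous array of the value and the field's number, as below
<pre class="example">
$mech->set_fields( $name => [ 'foo', 2 ] ) ;
</pre>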
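A sketch of set_fields on a form with a duplicated field name, again using a temporary local file. The form and its field names (user, tag) are hypothetical; nothing is submitted.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Made-up form with two fields that share the name "tag".
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} <<'HTML';
<html><body>
<form action="http://example.com/login" method="post">
<input type="text" name="user">
<input type="text" name="tag">
<input type="text" name="tag">
<input type="submit" name="go" value="Login">
</form>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs($path)->as_string );

# Set "user" and the second of the two "tag" fields in one call.
$mech->set_fields(
    user => 'fred',
    tag  => [ 'foo', 2 ],
);

# Read the values back; the second argument selects the nth same-named field.
print $mech->value('user'), "\n";
print $mech->value( 'tag', 2 ), "\n";
```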
$mech->set_visible( $username, $password ) ;
$mech->set_visible( [ radio => "KCRW" ] ) ;
$mech->set_visible( "fred", "secret", [ option => "Checking" ] ) ;
The first element of a two-element array is the field type, which is one of "text", "password", "hidden", "textarea", "file", "image", "submit", "radio", "checkbox" and "option".
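$mech->click_button( ... ) takes one of the following:
name => name :: clicks the button named name in the current form
number => n :: clicks the nth button in the current form
value => value :: clicks the button whose value is value in the current form
input => $inputobject :: clicks the button referenced by $inputobject
- the input object is an instance of HTML::Form::SubmitInput
- obtain one like this:
$mech->current_form()->find_input(undef, "submit")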
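A sketch of looking up the submit input object on a local form. The form is made up, and the click_button call is left commented out because it would send a request to the form's action URL.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Made-up login form written to a temporary file.
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} <<'HTML';
<html><body>
<form action="http://example.com/login" method="post">
<input type="text" name="user">
<input type="submit" name="go" value="Login">
</form>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs($path)->as_string );

# First input of type "submit" in the current form.
my $button = $mech->current_form()->find_input( undef, 'submit' );
print ref($button), "\n";
print $button->name, ' = ', $button->value, "\n";

# This would submit the form over the network:
# $mech->click_button( input => $button );
```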
$mech->submit() ::
- submits the page without clicking any button
- it used to be a synonym for the call below, but no longer is
<pre class="example">
$mech->click("submit")
</pre>
"x => x", "y => y" :: click coordinates
$mech->add_header( name => $value [, name => $value... ] )
$mech->add_header( Encoding => 'text/klingon' );
$mech->add_header( Referer => undef );
# Don't send a Referer for this URL
$mech->add_header( Referer => undef );
# Get the URL
$mech->get( $url );
# Back to the default behavior
$mech->delete_header( 'Referer' );
$mech->quiet(0); # turns on warnings (the default)
$mech->quiet(1); # turns off warnings
$mech->quiet(); # returns the current quietness status
Spidering Hacks, by Kevin Hemenway and Tara Calishain
Spidering Hacks from O'Reilly (<http://www.oreilly.com/catalog/spiderhks/>) is a great book for anyone wanting to know more about screen-scraping and spidering.
There are six hacks that use Mech or a Mech derivative:
* #21 WWW::Mechanize 101
* #22 Scraping with WWW::Mechanize
* #36 Downloading Images from Webshots
* #44 Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
* #64 Super Author Searching
* #73 Scraping TV Listings
The book was also positively reviewed on Slashdot: <http://books.slashdot.org/article.pl?sid=03/12/11/2126256>
ONLINE RESOURCES
* WWW::Mechanize Development mailing list
Hosted at Sourceforge, this is where the contributors to Mech discuss
things. <http://sourceforge.net/mail/?group_id=83309>
* LWP mailing list
The LWP mailing list is at
<http://lists.perl.org/showlist.cgi?name=libwww>, and is more user-
oriented and well-populated than the WWW::Mechanize Development
list. This is a good list for Mech users, since LWP is the basis
for Mech.
* WWW::Mechanize::Examples
A random array of examples submitted by users, included with the
Mechanize distribution.
* <http://www.oreilly.com/catalog/googlehks2/chapter/hack84.pdf>
Leland Johnson's hack #84 in Google Hacks, 2nd Edition is an example of a production script that uses WWW::Mechanize and HTML::TableContentParser. It takes in keywords and returns the estimated price of these keywords on Google's AdWords program.
* <http://www.perl.com/pub/a/2004/06/04/recorder.html>
Linda Julien writes about using HTTP::Recorder to create WWW::Mechanize scripts.
* <http://www.developer.com/lang/other/article.php/3454041>
Jason Gilmore's article on using WWW::Mechanize for scraping sales information from Amazon and eBay.
* <http://www.perl.com/pub/a/2003/01/22/mechanize.html>
Chris Ball's article about using WWW::Mechanize for scraping TV listings.
* <http://www.stonehenge.com/merlyn/LinuxMag/col47.html>
Randal Schwartz's article on scraping Yahoo News for images. It's already out of date: He manually walks the list of links hunting for matches, which wouldn't have been necessary if the "find_link()" method existed at press time.
* <http://www.perladvent.org/2002/16th/>
WWW::Mechanize on the Perl Advent Calendar, by Mark Fowler.
* <http://www.linux-magazin.de/Artikel/ausgabe/2004/03/perl/perl.html>
Michael Schilli's article on Mech and WWW::Mechanize::Shell for the German magazine Linux Magazin.
Other modules that use Mechanize
Here are modules that use or subclass Mechanize. Let me know of any others:
* Finance::Bank::LloydsTSB
* HTTP::Recorder
Acts as a proxy for web interaction, and then generates WWW::Mechanize scripts.
* Win32::IE::Mechanize
Just like Mech, but using Microsoft Internet Explorer to do the work.
* WWW::Bugzilla
* WWW::CheckSite
* WWW::Google::Groups
* WWW::Hotmail
* WWW::Mechanize::Cached
* WWW::Mechanize::FormFiller
* WWW::Mechanize::Shell
* WWW::Mechanize::Sleepy
* WWW::Mechanize::SpamCop
* WWW::Mechanize::Timed
* WWW::SourceForge
* WWW::Yahoo::Groups