NAME

HTML::PullParser - Alternative HTML::Parser interface

SYNOPSIS

 use HTML::PullParser;

 $p = HTML::PullParser->new(file => "index.html",
                            start => 'event, tagname, @attr',
                            end   => 'event, tagname',
                            ignore_elements => [qw(script style)],
                           ) || die "Can't open: $!";
 while (my $token = $p->get_token) {
     #...do something with $token
 }

DESCRIPTION

HTML::PullParser is an alternative interface to the HTML::Parser class. It turns HTML::Parser inside out: instead of registering handlers that the parser calls as it encounters events, you ask the parser for the next token whenever you want one. You associate a file (or any IO::Handle object or string) with the parser at construction time and then repeatedly call $parser->get_token to obtain the tags and text found in the parsed document.

The following methods are provided:

$p = HTML::PullParser->new( file => $file, %options )

$p = HTML::PullParser->new( doc => \$doc, %options )

An HTML::PullParser can parse from either a file or a literal document, depending on whether the file or doc option is passed to the parser's constructor.

The file passed in can be either a file name or a file handle object. If a file name is passed and it can't be opened for reading, the constructor returns an undefined value and $! tells you why it failed. Otherwise the argument is taken to be some object that HTML::PullParser can read() from when it needs more data. The stream will be read() until EOF, but not closed.
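
The two forms of the file option might be exercised as in the sketch below. The path and the sample markup are made up for illustration, and an in-memory handle stands in for a real file so that the example is self-contained.

```perl
use strict;
use warnings;
use HTML::PullParser;

# A file name that cannot be opened makes the constructor return undef;
# $! then holds the reason. The path here is hypothetical.
my $missing = HTML::PullParser->new(
    file  => "/nonexistent/hypothetical.html",
    start => 'event, tagname',
);
my $error = defined $missing ? "" : "$!";
print "Can't open: $error\n" unless defined $missing;

# An already-open handle is accepted as well. The parser reads it until
# EOF but does not close it, so closing stays the caller's job.
my $html = "<p>Hello <b>world</b></p>";
open(my $fh, "<", \$html) or die "Can't open in-memory handle: $!";

my $p = HTML::PullParser->new(
    file  => $fh,
    start => 'event, tagname',
    end   => 'event, tagname',
);

my @events;
while (my $token = $p->get_token) {
    push @events, join(" ", @$token);   # e.g. "start p"
}
close($fh);

print "$_\n" for @events;
```

Checking the return value of new matters only for the file-name form; with a handle, any open failure has already been dealt with by the caller.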

A doc can be passed as a plain string or as a reference to a scalar. If a reference is passed, the value of that scalar should not be changed until all tokens have been extracted.
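
As a small illustration, the sketch below parses the same document once as a plain string and once by reference. The helper tag_names and the sample markup are assumptions made for this example, not part of the module.

```perl
use strict;
use warnings;
use HTML::PullParser;

my $html = "<ul><li>one<li>two</ul>";

# Hypothetical helper: drain a parser and return the first element
# of every token (here, the tag name).
sub tag_names {
    my ($parser) = @_;
    my @names;
    while (my $token = $parser->get_token) {
        push @names, $token->[0];
    }
    return @names;
}

# The document passed as a plain string ...
my @plain = tag_names(
    HTML::PullParser->new(doc => $html, start => 'tagname')
);

# ... and as a reference to a scalar. $html must stay unchanged
# until the last token has been read.
my @byref = tag_names(
    HTML::PullParser->new(doc => \$html, start => 'tagname')
);

print "plain: @plain\n";
print "byref: @byref\n";
```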

Next, the information to be returned for the different token types must be set up. This is done by associating an argspec (as defined in HTML::Parser) with each event you are interested in. For instance, if you want start tokens to be reported as the string 'S' followed by the tagname and the attributes, you might pass a start option like this:

   $p = HTML::PullParser->new(
          doc   => $document_to_parse,
          start => '"S", tagname, @attr',
          end   => '"E", tagname',
        );

Finally, other HTML::Parser options, such as ignore_tags and unbroken_text, can be passed in. Note that you should not use the event_h options to set up parser handlers, as that would confuse the inner logic of HTML::PullParser.
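
A minimal sketch of passing extra HTML::Parser options alongside an argspec follows. It assumes you only want the visible text of a page; the sample markup is invented, and ignore_elements is used in the same way as in the synopsis.

```perl
use strict;
use warnings;
use HTML::PullParser;

my $html = "<style>p { color: red }</style>"
         . "<p>Visible text</p>"
         . "<script>var x = 1;</script>";

my $p = HTML::PullParser->new(
    doc             => \$html,
    text            => 'dtext',             # report decoded text only
    unbroken_text   => 1,                   # HTML::Parser option
    ignore_elements => [qw(script style)],  # HTML::Parser option
);

# Each token is a one-element array because the argspec has one item.
my $text = "";
while (my $token = $p->get_token) {
    $text .= $token->[0];
}

print "$text\n";
```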

$token = $p->get_token

This method returns the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The contents of this array match the argspec set up during HTML::PullParser construction.
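
To show how the array contents follow the argspec, here is a sketch of a token loop. The "T" marker for text and the sample markup are assumptions for this example; the "S" and "E" markers follow the constructor example earlier in this section.

```perl
use strict;
use warnings;
use HTML::PullParser;

my $p = HTML::PullParser->new(
    doc   => '<a href="index.html">Home</a>',
    start => '"S", tagname, @attr',
    end   => '"E", tagname',
    text  => '"T", dtext',
);

my @lines;
while (my $token = $p->get_token) {
    # The first element is the literal string chosen in the argspec.
    my ($type, @rest) = @$token;
    if ($type eq "S") {
        # @attr flattens the attributes into key/value pairs.
        my ($tag, %attr) = @rest;
        my $attrs = join "", map { " $_=$attr{$_}" } sort keys %attr;
        push @lines, "start $tag$attrs";
    }
    elsif ($type eq "E") {
        push @lines, "end $rest[0]";
    }
    else {
        push @lines, "text $rest[0]";
    }
}

print "$_\n" for @lines;
```

Using a short literal as the first argspec item is one convenient way to tell token types apart; the event argspec would serve the same purpose.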

$p->unget_token( @tokens )

If you find that you have read too many tokens, you can push them back so that they are returned again the next time $p->get_token is called.
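
One common use of pushing tokens back is a one-token lookahead. The helper peek_token below is hypothetical, written for this sketch; it is not a method of the module.

```perl
use strict;
use warnings;
use HTML::PullParser;

my $p = HTML::PullParser->new(
    doc   => "<h1>Title</h1><p>Body</p>",
    start => '"S", tagname',
    end   => '"E", tagname',
    text  => '"T", dtext',
);

# Hypothetical helper: look at the next token without consuming it.
sub peek_token {
    my ($parser) = @_;
    my $token = $parser->get_token;
    $parser->unget_token($token) if defined $token;
    return $token;
}

my $peeked = peek_token($p);   # inspect the upcoming token
my $next   = $p->get_token;    # the same token is returned again

print "peeked: @$peeked\n";
print "next:   @$next\n";
```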

EXAMPLES

The 'eg/hform' script shows how we might parse the form section of HTML documents using HTML::PullParser.
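
The following is not the 'eg/hform' script itself, only a rough sketch of the same idea under simple assumptions: it collects the attributes of the first form and the names of its input fields. The sample markup and variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::PullParser;

my $html = <<'HTML';
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" name="go" value="Search">
</form>
HTML

my $p = HTML::PullParser->new(
    doc   => \$html,
    start => '"S", tagname, attr',   # attr is a hash reference
    end   => '"E", tagname',
);

my (%form, @fields);
while (my $token = $p->get_token) {
    my ($type, $tag, $attr) = @$token;
    if ($type eq "S" && $tag eq "form") {
        %form = %$attr;                       # remember action, method, ...
    }
    elsif ($type eq "S" && $tag eq "input") {
        push @fields, $attr->{name} if defined $attr->{name};
    }
    elsif ($type eq "E" && $tag eq "form") {
        last;                                 # stop after the first form
    }
}

print "action: $form{action}\n";
print "fields: @fields\n";
```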

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
