RFC: Google Groups grabber in Perl

  • Thread starter Arthur J. O'Dwyer

Arthur J. O'Dwyer

Just a request-for-comments and minor self-promotion:
http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt

I've written a little Perl script that goes through Google Groups
Classic's archive and extracts the texts of all my Usenet posts
(that is, all the posts found with a given "author:foo" search string).
(As a bonus, 'ls' in the resulting directory produces a Message-ID list
ready to submit to http://groups-beta.google.com/groups/msgs_remove !)
I would appreciate comments on:

(1) Does it have any security holes? I'm leery of the way it calls
Lynx, even with '' quotes present, and I wonder what could go wrong
by using the Message-ID itself as a unique filename for each message
text.

(2) How is it bad Perl code? I'm not even a semi-regular Perl user,
so I know this source code is naive and clumsy. If you have the
inclination to critique Perl code, go right ahead.

(3) [more for comp.programming] Is it useful? What other tools exist
for Usenet scraping? What more sensible naming scheme or file format
could I use to store Usenet posts and headers, numbered in the low
thousands? (How do Usenet servers do it?)

I release the code to the public domain; please take it, use it, and
promulgate it if you find it useful. But hurry --- the door of opportunity
is shutting quickly!

http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt

Thanks,
-Arthur
http://www.contrib.andrew.cmu.edu/~ajo/dont-be-evil.html
 
John W. Krahn

Three problems I can see right off the bat:

printf STDOUT ("from url ${url}\n");

The variable $url may contain printf formatting characters, which printf
would then interpret as part of the format. Use print instead:

print STDOUT "from url $url\n";

And because STDOUT is the default output filehandle:

print "from url $url\n";
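
To illustrate the printf point, here is a minimal sketch. The URL and the helper name describe_url are made up for the example; they are not taken from retrieve.txt. If printf is wanted at all, the format should be a constant and the data an argument:

```perl
use strict;
use warnings;

# Hypothetical helper: build the log line with a constant format string,
# so a '%' inside the URL is treated as data, never as a conversion.
sub describe_url {
    my ($url) = @_;
    return sprintf "from url %s\n", $url;
}

# Hypothetical search URL containing a percent-escape (%3A is ':').
my $url = 'http://example.com/groups?q=author%3Afoo';

# Risky form, left commented out: the interpolated string becomes the
# format, so "%3A" is parsed as a conversion instead of printed.
# printf STDOUT ("from url ${url}\n");

print describe_url($url);
```

Plain print remains the simpler fix when no formatting is needed.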

open $outfile, '>', "$msgid";
print { $outfile } $msgtext;
close $outfile;

You are trying to open a file and print to it. You should test these
functions for error conditions.

open my $outfile, '>', $msgid or die "Error: $!";
print $outfile $msgtext or die "Error: $!";
close $outfile or die "Error: $!";
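
This is also the spot where the Message-ID becomes a filename, which the original post asks about. The following is only a sketch, not what retrieve.txt does: msgid_to_filename and save_message are hypothetical names, and the whitelist is an assumption about which characters are safe. Everything else is percent-encoded, so a '/' cannot escape the directory and a name cannot start with '.' or '-':

```perl
use strict;
use warnings;

# Hypothetical helper: map a Message-ID to a safe filename.
# Characters outside a conservative whitelist are percent-encoded.
sub msgid_to_filename {
    my ($msgid) = @_;
    (my $name = $msgid) =~ s/([^A-Za-z0-9._+\@-])/sprintf '%%%02X', ord $1/ge;
    # Also encode a leading '.' or '-' (no dotfiles, no option-like names).
    $name =~ s/^([.-])/sprintf '%%%02X', ord $1/e;
    return $name;
}

# Hypothetical helper: write one message, checking every step and
# naming the file in the error message.
sub save_message {
    my ($msgid, $msgtext) = @_;
    my $file = msgid_to_filename($msgid);
    open my $out, '>', $file or die "Cannot open '$file' for writing: $!";
    print {$out} $msgtext    or die "Cannot write to '$file': $!";
    close $out               or die "Cannot close '$file': $!";
    return $file;
}
```

Message-IDs made only of letters, digits, and the characters . _ + @ - pass through unchanged, so for those the 'ls' output is still a usable Message-ID list.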

for my $msgid ($indexpage =~ m{<a href=/groups.*selm=(.*)&rnum=.*>}g) {

The '*' quantifier is greedy, so you _may_ be matching more than you intended.
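
A small sketch of what can go wrong, using a made-up one-line index page (the real Google Groups markup may differ). When two links share a line, the greedy pattern swallows the first one; negated character classes are one way to stop at the next delimiter:

```perl
use strict;
use warnings;

# Hypothetical index-page fragment with two links on a single line.
my $indexpage = '<a href=/groups?selm=aaa@x.example&rnum=1>one</a> '
              . '<a href=/groups?selm=bbb@y.example&rnum=2>two</a>';

# Greedy: each .* runs as far as it can, so the whole line is one match
# and only the last Message-ID is captured.
my @greedy = $indexpage =~ m{<a href=/groups.*selm=(.*)&rnum=.*>}g;

# Negated classes cannot cross '>' or '&', so each link matches separately.
my @tight = $indexpage =~ m{<a href=/groups[^>]*?selm=([^&>]*)&rnum=[^>]*>}g;

print scalar(@greedy), " greedy match: @greedy\n";  # 1 greedy match: bbb@y.example
print scalar(@tight),  " tight matches: @tight\n";  # 2 tight matches: aaa@x.example bbb@y.example
```

If every link sits on its own line, the original pattern happens to work, since '.' does not match a newline by default.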


Aside from that, there is a lot of superfluous punctuation, and variables are
file-scoped where they could be locally scoped.
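
On the Lynx question from the original post: quoting an argument for the shell is fragile, and one common way around it is to skip the shell altogether with the list form of a pipe open. This is a sketch under assumptions: capture_output is a hypothetical name, and the commented-out lynx call shows an intended use, not what retrieve.txt actually does. The variables are lexically scoped inside the sub.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its
# standard output. With the list form of a pipe open, each argument
# reaches the program as-is, so quotes or semicolons in a URL are
# never interpreted by a shell.
sub capture_output {
    my @cmd = @_;
    open my $pipe, '-|', @cmd or die "Cannot run '$cmd[0]': $!";
    local $/;                              # slurp mode: read everything at once
    my $output = <$pipe>;
    $output = '' unless defined $output;
    close $pipe or die "'$cmd[0]' failed: " . ($! || "exit status $?");
    return $output;
}

# Intended use (assumes lynx is installed; -source prints the raw page):
# my $indexpage = capture_output('lynx', '-source', $url);
```

A hostile string such as x'; echo pwned # passed this way arrives as a single argument and is not executed.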



John
 
muthu.gvmuthu

Fetching information from Usenet groups is not a big issue. Many
Usenet groups are in fact set up with this facility, and a number of
third-party freeware scripts and applications are available to get
information from Usenet, with filters.
 
Tad McClellan

muthu.gvmuthu said:
Fetching information from Usenet groups is not a big issue.

Neither is it the topic of this thread.

Google Groups is not Usenet, it is an archive of Usenet postings.
 
