RFC: Google Groups grabber in Perl

Arthur J. O'Dwyer · May 6, 2005

Just a request-for-comments and minor self-promotion:
http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt

I've written a little Perl script that goes through Google Groups
Classic's archive and extracts the texts of all my Usenet posts
(that is, all the posts found with a given "author:foo" search string).
(As a bonus, 'ls' in the resulting directory produces a Message-ID list
ready to submit to http://groups-beta.google.com/groups/msgs_remove !)
I would appreciate comments on:

(1) Does it have any security holes? I'm leery of the way it calls
Lynx, even with '' quotes present, and I wonder what could go wrong
by using the Message-ID itself as a unique filename for each message
text.

(2) How is it bad Perl code? I'm not even a semi-regular Perl user,
so I know this source code is naive and clumsy. So if you have the
inclination to critique Perl code, go right ahead.

(3) [more for comp.programming] Is it useful? What other tools exist
for Usenet scraping? What more sensible naming scheme or file format
could I use to store Usenet posts and headers, numbered in the low
thousands? (How do Usenet servers do it?)
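
On question (1), one common way to sidestep shell quoting altogether is the list form of a piped open, which hands each argument to the program directly instead of building a command line for the shell. The sketch below is only an illustration under assumptions: run_capture is a hypothetical helper name, the way the actual script invokes Lynx is not shown in this thread, and the list form of piped open is assumed to be available (it is on Unix-like systems).

```perl
use strict;
use warnings;

# Hypothetical helper: run a command WITHOUT a shell and return its output.
# With the list form of a piped open, each argument reaches the program
# verbatim, so quotes, semicolons, or backticks in a URL are never parsed
# by /bin/sh.
sub run_capture {
    my @cmd = @_;
    open my $pipe, '-|', @cmd or die "Cannot run $cmd[0]: $!";
    local $/;                      # slurp mode: read everything at once
    my $output = <$pipe> // '';
    close $pipe or die "$cmd[0] failed: " . ($! || "exit status $?");
    return $output;
}

# Demonstration using the running perl as the child process: the hostile
# string comes back as plain data instead of being executed.
my $hostile = q{'; echo pwned; '};
print run_capture($^X, '-e', 'print $ARGV[0]', $hostile), "\n";

# Assumed usage in the script (-source makes Lynx print the raw page):
#   my $page = run_capture('lynx', '-source', $url);
```

Because no shell is involved, the '' quotes around the URL become unnecessary rather than something to get exactly right.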

I release the code to the public domain; please take it, use it, and
promulgate it if you find it useful. But hurry --- the door of opportunity
is shutting quickly!

http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt

Thanks,
-Arthur
http://www.contrib.andrew.cmu.edu/~ajo/dont-be-evil.html

John W. Krahn · May 6, 2005

Three problems I can see right off the bat:

printf STDOUT ("from url ${url}\n");

The variable $url may contain printf formatting characters (URLs often
carry '%' escapes), so use print instead:

print STDOUT "from url $url\n";

And because STDOUT is the default output filehandle:

print "from url $url\n";
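
A small sketch of why the printf form is risky. It uses sprintf so the result can be inspected, and the URL is a made-up example, not one taken from the script.

```perl
use strict;
use warnings;

# Hypothetical URL containing a percent-escape, as search URLs often do.
my $url = 'http://example.com/search?q=a%20b';

# Risky: the interpolated string becomes the FORMAT, so "%20b" is parsed
# as a conversion with no matching argument. Warnings are silenced here
# only to keep the demonstration quiet.
my $as_format = do { no warnings; sprintf("from url $url\n") };

# Safe: the URL is passed as data, not as part of the format.
my $as_data = sprintf("from url %s\n", $url);

print $as_data;
print "the format-string version came out different\n"
    if $as_format ne $as_data;
```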

open $outfile, '>', "$msgid";
print { $outfile } $msgtext;
close $outfile;

You are trying to open a file and print to it. You should test these
functions for error conditions.

open my $outfile, '>', $msgid or die "Error: $!";
print $outfile $msgtext or die "Error: $!";
close $outfile or die "Error: $!";
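
On the related question of using the Message-ID as a filename: the three-argument open above already keeps characters such as '>' or '|' in $msgid from being read as a mode, but a Message-ID may legally contain '/', which would let a crafted ID point outside the output directory. One possible approach, sketched here with a hypothetical helper name and a whitelist that is only an assumption, is to percent-encode everything outside a conservative character set.

```perl
use strict;
use warnings;

# Hypothetical helper: map a Message-ID to a filename that cannot leave
# the current directory. Every character outside a conservative whitelist
# becomes %XX; '%' itself is encoded too, so the mapping stays reversible.
sub msgid_to_filename {
    my ($msgid) = @_;
    die "empty Message-ID\n" unless defined $msgid && length $msgid;
    (my $name = $msgid) =~ s/([^A-Za-z0-9@._-])/sprintf('%%%02X', ord $1)/ge;
    $name =~ s/^\./%2E/;    # no ".", "..", or hidden dot-files
    return $name;
}

print msgid_to_filename('a/../b@x.com'), "\n";
```

The result could then be passed to open in place of the raw $msgid.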

for my $msgid ($indexpage =~ m{<a href=/groups.*selm=(.*)&rnum=.*>}g) {

The '*' quantifier is greedy, so you _may_ be matching more than you intended.
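
To illustrate the greediness problem, here is a sketch run against a made-up fragment with two links on one line. The sample HTML and the tighter pattern are assumptions for demonstration; the real index page may be laid out differently.

```perl
use strict;
use warnings;

# Hypothetical index-page fragment: two result links on ONE line.
my $indexpage = '<a href=/groups?selm=abc%40example.com&rnum=1>first</a> '
              . '<a href=/groups?selm=def%40example.com&rnum=2>second</a>';

# Original pattern: each .* runs as far as it can, so a single match
# swallows both links and only the last Message-ID is captured.
my @greedy = $indexpage =~ m{<a href=/groups.*selm=(.*)&rnum=.*>}g;

# Tighter pattern: negated character classes cannot cross '>' or '&',
# so each match stays inside one tag.
my @tight = $indexpage =~ m{<a href=/groups[^>]*?selm=([^&>]*)&rnum=[^>]*>}g;

print scalar(@greedy), " match(es) with the greedy pattern\n";
print scalar(@tight),  " match(es) with the tighter pattern\n";
print "  $_\n" for @tight;
```

If every link sits on its own line, the original pattern may happen to work, since '.' does not match a newline by default.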

Aside from that, there is a lot of superfluous punctuation, and the script
uses file-scoped variables where locally scoped ones would do.

John

Christopher Nehren · May 6, 2005

Arthur said:
Just a request-for-comments and minor self-promotion:
http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt

No warnings? No way to allow the user to specify a different URL
retrieval method? Why not even just use LWP? That way, it's much more
portable (and easily so), because you keep everything in Perl.
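
A minimal sketch of what the LWP route might look like, with warnings enabled. It assumes the LWP::UserAgent module from CPAN is installed; fetch_page is a hypothetical helper name and the agent string is made up.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch one URL entirely in Perl, with no external
# browser and no shell involved.
sub fetch_page {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# The agent string and timeout are arbitrary example values.
my $ua = LWP::UserAgent->new(agent => 'usenet-grabber-sketch/0.1', timeout => 30);

# Assumed usage, where $url is whatever the script currently hands to Lynx:
#   my $indexpage = fetch_page($ua, $url);
```

Swapping in a different retrieval method then means replacing one subroutine rather than editing every call site.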

Best Regards,
Christopher Nehren

muthu.gvmuthu · May 10, 2005

Fetching information from Usenet groups is not a big issue. Actually,
many Usenet groups are developed with this facility only. A number of
third-party freeware scripts and applications are available to get
information from Usenet, with filters.

Tad McClellan · May 10, 2005

muthu.gvmuthu said:
Fetching information from Usenet groups is not a big issue.

Neither is it the topic of this thread.

Google Groups is not Usenet, it is an archive of Usenet postings.
