Arthur J. O'Dwyer
Just a request-for-comments and minor self-promotion:
http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt
I've written a little Perl script that goes through Google Groups
Classic's archive and extracts the texts of all my Usenet posts
(that is, all the posts found with a given "author:foo" search string).
(As a bonus, 'ls' in the resulting directory produces a Message-ID list
ready to submit to http://groups-beta.google.com/groups/msgs_remove !)
I would appreciate comments on:
(1) Does it have any security holes? I'm leery of the way it calls
Lynx, even with '' quotes present, and I wonder what could go wrong
by using the Message-ID itself as a unique filename for each message
text.
(2) How is it bad Perl code? I'm not even a semi-regular Perl user,
so I know this source code is naive and clumsy. If you have the
inclination to critique Perl code, go right ahead.
(3) [more for comp.programming] Is it useful? What other tools exist
for Usenet scraping? What more sensible naming scheme or file format
could I use to store Usenet posts and headers, numbered in the low
thousands? (How do Usenet servers do it?)
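To make question (1) concrete, here is a minimal sketch of the two usual
precautions. It is not taken from retrieve.txt: the helper names
msgid_to_filename and fetch_with_lynx are hypothetical, and the use of
Lynx's -source option is an assumption about how the page gets fetched.
The idea is to percent-encode anything in a Message-ID that is not plainly
safe in a filename, and to run Lynx through a list-form open so that no
shell ever parses the URL.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a Message-ID into a flat, safe filename.
# Every character outside a small whitelist is percent-encoded, so '/',
# whitespace, and shell metacharacters never reach the filesystem.
# '%' itself is encoded too, which keeps distinct IDs distinct.
sub msgid_to_filename {
    my ($msgid) = @_;
    $msgid =~ s/^<|>$//g;    # strip the surrounding angle brackets
    $msgid =~ s/([^A-Za-z0-9._\@-])/sprintf("%%%02X", ord($1))/ge;
    $msgid =~ s/^\./%2E/;    # no hidden files, no "." or ".."
    return length($msgid) ? $msgid : '_empty';
}

# Hypothetical helper: fetch a page with Lynx without involving a shell.
# Assumption: "lynx -source URL" is an acceptable way to get the raw page.
sub fetch_with_lynx {
    my ($url) = @_;
    # Reject anything that is not an HTTP(S) URL, so the argument can
    # never be mistaken for a Lynx option or a local file.
    die "refusing non-HTTP URL: $url\n" unless $url =~ m{^https?://}i;
    # List-form pipe open: Perl execs lynx directly, so quotes and
    # semicolons in $url are passed through as plain data.
    open(my $fh, '-|', 'lynx', '-source', $url)
        or die "cannot run lynx: $!\n";
    local $/;                # slurp the whole output at once
    my $page = <$fh>;
    close($fh) or warn "lynx exited with status $?\n";
    return $page;
}
```

With this scheme a hostile ID such as <../../etc/passwd> is stored as
%2E.%2F..%2Fetc%2Fpasswd in the current directory instead of escaping it.
The encoding is reversible, so the original Message-ID list can still be
recovered from 'ls'.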
I release the code to the public domain; please take it, use it, and
promulgate it if you find it useful. But hurry --- the door of opportunity
is shutting quickly!
http://www.contrib.andrew.cmu.edu/~ajo/usenet_archive/retrieve.txt
Thanks,
-Arthur
http://www.contrib.andrew.cmu.edu/~ajo/dont-be-evil.html