How to separate a big text file (say 400 news stories) into many small text files?


william

I have downloaded a big text file from Lexis-Nexis. This text file has
412 news stories. Each story begins with, for example, 1 of 412
DOCUMENTS or 2 of 412 DOCUMENTS and close to the end it has LOAD-DATE:
information. I want to process this one big text file into 412
separate text files and name them according to the LOAD-DATE, for
example, DELL_20081205_1.txt or DELL_20081205_2.txt. Can anybody give
me some hint on how to achieve this using Perl? The actual text file
is attached below.

Thank you very much.

William

1 of 412 DOCUMENTS

Austin American-Statesman (Texas)

December 5, 2008 Friday
Final Edition

CENTRAL TEXAS DIGEST

BYLINE: FROM STAFF REPORTS

SECTION: BUSINESS; Pg. B08

LENGTH: 375 words


   COMPUTER MAKERS

   Dell stockholders ask for investigation

   Dell Inc. said that it received a shareholder-demand letter asking
the board
to investigate allegations that some current and former directors and
officers
imprudently invested and managed funds in the 401(k) plan.
.......

LOAD-DATE: December 5, 2008

LANGUAGE: ENGLISH

PUBLICATION-TYPE: Newspaper


2 of 412 DOCUMENTS


Contra Costa Times (California)

August 29, 2008 Friday

Stocks on the move: International Paper, Magma Design, PetSmart

BYLINE: wire

SECTION: BUSINESS

LENGTH: 677 words


   By Fabio Alves

   Bloomberg News

   The following companies are having unusual price changes in U.S.
markets this
afternoon.

   Dell Inc. (DELL US) dropped 12 percent, the most since November, to
$22.08.
The world's second-largest personal-computer said the U.S. slump in
technology
spending has moved abroad. Second-quarter profit before one-time items
was 33
cents a share, short of the 36-cent average projection compiled by
Bloomberg.

LOAD-DATE: August 29, 2008

LANGUAGE: ENGLISH
 

A. Sinan Unur

I have downloaded a big text file from Lexis-Nexis. This text file has
412 news stories. Each story begins with, for example, 1 of 412
DOCUMENTS or 2 of 412 DOCUMENTS and close to the end it has LOAD-DATE:
information. I want to process this one big text file into 412
separate text files and name them according to the LOAD-DATE, for
example, DELL_20081205_1.txt or DELL_20081205_2.txt. Can anybody give
me some hint on how to achieve this using Perl? The actual text file
is attached below.

This group is for programmers to get help with *code*. In particular,
this is not a place to get ready-made scripts to solve your specific
problem.

You need to show what you have so far and what you need help with.

There are many ways of getting what you want. All follow the same basic
structure: Find the start of a record, save all content up to the start
of the next record and do this until there are no more records left.

There is a loop or two, a couple of regexes and very simple command line
processing involved.
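
The structure described above can be sketched in Perl as follows. This is only a minimal illustration, assuming the whole export fits in memory and that every record header sits at the start of a line; split_records is a hypothetical name, not something from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a whole export into records. Each record starts at a line such as
# "1 of 412 DOCUMENTS"; the lookahead keeps that header line in the record.
sub split_records {
    my ($text) = @_;
    my @chunks = split /^(?=[ \t]*\d+ of \d+ DOCUMENTS)/m, $text;
    # Drop anything that precedes the first header (for example a cover page).
    return grep { /\A[ \t]*\d+ of \d+ DOCUMENTS/ } @chunks;
}

# Made-up sample input with two short records and some leading text.
my $sample = join '',
    "cover page\n",
    "1 of 2 DOCUMENTS\n\nfirst story\n\nLOAD-DATE: December 5, 2008\n\n",
    "2 of 2 DOCUMENTS\n\nsecond story\n\nLOAD-DATE: August 29, 2008\n";

my @records = split_records($sample);
printf "%d records\n", scalar @records;
```

For a very large file, reading line by line and buffering one record at a time is probably the safer choice.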

If you want someone to write this for you, you should post the job at
http://jobs.perl.org/

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 

Tad J McClellan

william said:
I have downloaded a big text file from Lexis-Nexis. This text file has
412 news stories. Each story begins with, for example, 1 of 412
DOCUMENTS or 2 of 412 DOCUMENTS and close to the end it has LOAD-DATE:
information. I want to process this one big text file into 412
separate text files and name them according to the LOAD-DATE, for
example, DELL_20081205_1.txt or DELL_20081205_2.txt. Can anybody give
me some hint on how to achieve this using Perl?


Read the file.

Buffer up an article, making note of the number and date when they go by.

When at end of article print it to a file.

Which of those are you having trouble with?


Have a fish.

When you come back asking how to convert "December_5_2008"
into "20081205" we will expect to see the code that you've
written so far.


------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my($article, $num, $date);
while ( <DATA> ) {
    if ( /(\d+) of \d+ DOCUMENTS/ ) {
        if ( $article ) { # output then empty the article buffer
            open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
            print $ARTICLE $article;
            close $ARTICLE;
            $article = '';
        }
        $num = $1;
    }

    if ( /LOAD-DATE: (.*)/ ) {
        $date = $1;
        $date =~ tr/ ,/_/s;
    }

    $article .= $_;
}

if ( $article ) {
    open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
    print $ARTICLE $article;
    close $ARTICLE;
}

__DATA__
1 of 412 DOCUMENTS

Austin American-Statesman (Texas)

December 5, 2008 Friday
Final Edition

CENTRAL TEXAS DIGEST

BYLINE: FROM STAFF REPORTS

SECTION: BUSINESS; Pg. B08

LENGTH: 375 words


   COMPUTER MAKERS

   Dell stockholders ask for investigation

   Dell Inc. said that it received a shareholder-demand letter asking
the board
to investigate allegations that some current and former directors and
officers
imprudently invested and managed funds in the 401(k) plan.
.......

LOAD-DATE: December 5, 2008

LANGUAGE: ENGLISH

PUBLICATION-TYPE: Newspaper


2 of 412 DOCUMENTS


Contra Costa Times (California)

August 29, 2008 Friday

Stocks on the move: International Paper, Magma Design, PetSmart

BYLINE: wire

SECTION: BUSINESS

LENGTH: 677 words


   By Fabio Alves

   Bloomberg News

   The following companies are having unusual price changes in U.S.
markets this
afternoon.

   Dell Inc. (DELL US) dropped 12 percent, the most since November, to
$22.08.
The world's second-largest personal-computer said the U.S. slump in
technology
spending has moved abroad. Second-quarter profit before one-time items
was 33
cents a share, short of the 36-cent average projection compiled by
Bloomberg.

LOAD-DATE: August 29, 2008

LANGUAGE: ENGLISH
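
The script above writes a file in two places: once inside the loop and once after it. One possible variation, sketched here under a few assumptions, moves that step into a helper and reads from any open filehandle instead of DATA. The names save_article and split_articles, the output-directory argument, and the 'unknown' placeholder date are all hypothetical; text before the first header is skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Write one buffered article to "<date>_<num>.txt" inside $dir and
# return the path. save_article() is a hypothetical helper name.
sub save_article {
    my ($dir, $date, $num, $article) = @_;
    my $path = File::Spec->catfile($dir, "${date}_$num.txt");
    open my $out, '>', $path or die "could not open $path: $!";
    print {$out} $article;
    close $out or die "could not close $path: $!";
    return $path;
}

# Read articles from an open filehandle and write each one into $dir.
# Returns the list of paths that were written.
sub split_articles {
    my ($in, $dir) = @_;
    my ($article, $num, $date) = ('', 0, 'unknown');
    my @written;
    while ( my $line = <$in> ) {
        if ( $line =~ /^\s*(\d+) of \d+ DOCUMENTS/ ) {
            # A new header ends the previous article, if there was one.
            push @written, save_article($dir, $date, $num, $article) if $num;
            ($article, $num, $date) = ('', $1, 'unknown');
        }
        if ( $line =~ /^LOAD-DATE: (.*?)\s*$/ ) {
            # "December 5, 2008" becomes "December_5_2008"
            ($date = $1) =~ tr/ ,/_/s;
        }
        $article .= $line if $num;   # skip anything before the first header
    }
    push @written, save_article($dir, $date, $num, $article) if $num;
    return @written;
}
```

It could be called as split_articles($fh, '.') after opening the downloaded file with a three-argument open.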
 

william

Tad,

Thank you very much for providing a script for me!

I only modified one place to ignore anything before _ of _ DOCUMENTS.
It would be neat to have the date converted to the format you
mentioned. Since I'm not a Perl expert, I may have to do it later
using SAS. Below is the script I used. (I'm fascinated and overwhelmed
by Perl's capabilities. I still have many questions related to
Perl.)

I really appreciate your help.

William

#!/usr/bin/perl
use warnings;
use strict;
open IN,"dell" or die "could not open $!";
my($article, $num, $date);
$num=0;
while ( <IN> ) {
    if ( /(\d+) of \d+ DOCUMENTS/ ) {
        if ( $article && $num ) { # output then empty the article buffer
            print "Output for file $num.\n";
            open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
            print $ARTICLE $article;
            close $ARTICLE;
            $article = '';
        }
        $num = $1;
    }
    if ( /LOAD-DATE: (.*)/ ) {
        $date = $1;
        $date =~ tr/ ,/_/s;
        chop($date);
    }
    $article .= $_;
}

print "Output for the last file $num.\n";

if ( $article ) {
    open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
    print $ARTICLE $article;
    close $ARTICLE;
}
 

Tad J McClellan

It would be neat to have the date converted to the format you
mentioned. Since I'm not a Perl expert, I may have to do it later
using SAS.


You do not need to be an expert, as the expert-level work has
probably already been done and packaged up for others to use.

There are many modules on CPAN (perldoc -q CPAN) that can parse
dates for you such as DateTime::Format::Natural.

Using such a module, it should take about 3 lines of code to
convert "December 5, 2008" to "20081205".
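
As one illustration of how short the conversion can be, here is a sketch that uses Time::Piece rather than DateTime::Format::Natural, because Time::Piece has shipped with Perl since 5.10. The helper names compact_date and article_name are hypothetical, and the sketch assumes the LOAD-DATE value always has the "Month D, YYYY" form shown in the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;

# Convert a LOAD-DATE value such as "December 5, 2008" to "20081205".
# strptime() dies if the text does not match the format.
sub compact_date {
    my ($text) = @_;
    my $t = Time::Piece->strptime($text, '%B %d, %Y');
    return $t->strftime('%Y%m%d');
}

# Build a name like DELL_20081205_1.txt; the "DELL" prefix is taken from
# the example names in the question.
sub article_name {
    my ($prefix, $load_date, $num) = @_;
    return sprintf '%s_%s_%d.txt', $prefix, compact_date($load_date), $num;
}

print article_name('DELL', 'December 5, 2008', 1), "\n";   # DELL_20081205_1.txt
```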

open IN,"dell" or die "could not open $!";


You should use the 3-argument form of open() and a lexical filehandle
like I did in my code.

open my $IN, '<', 'dell' or die "could not open $!";

while ( <IN> ) {


chop($date);


Don't use chop() for removing newlines.

chomp($date); # much safer than chop()
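
The difference can be seen in a small sketch. The helper names are hypothetical. The last helper rests on an assumption: if the download has Windows line endings, the captured date may end in a carriage return, which chomp() leaves in place on a Unix system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# chomp removes a trailing newline if there is one, and nothing else.
sub chomped { my ($s) = @_; chomp $s; return $s }

# chop removes the last character, whatever it is.
sub chopped { my ($s) = @_; chop $s; return $s }

# Remove any trailing whitespace, including a stray carriage return.
sub trim_end { my ($s) = @_; $s =~ s/\s+\z//; return $s }

for my $date ( "December_5_2008\n", "December_5_2008" ) {
    printf "chomp: [%s]  chop: [%s]\n", chomped($date), chopped($date);
}
printf "trim:  [%s]\n", trim_end("December_5_2008\r");
```

With no trailing newline, chop() eats the final "8" of the year, while chomp() leaves the string alone.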
 

Gunnar Hjalmarsson

Tad said:
There are many modules on CPAN (perldoc -q CPAN) that can parse
dates for you such as DateTime::Format::Natural.

Using such a module, it should take about 3 lines of code to
convert "December 5, 2008" to "20081205".

I prefer Date::Parse over a heavyweight module like that.

$ perl -MDate::Parse -e '
($d, $m, $y) = (strptime "December 5, 2008")[3..5]; # 1
printf "%d%02d%02d\n", $y+1900, $m+1, $d; # 2
'
20081205
$
 

william

Tad and Gunnar,

Thank you very much for your replies. I have largely achieved one of
my goals: to divide a big file into separate small files. But I still
have a huge task before this step. Here is a little background.

I want to search news stories from Lexis-Nexis database through my
library server (ezproxy). Then download the news stories and separate
them. The last step would be to classify them into various groups.
Right now, I have some clue on how to finish the last two steps. It is
the first step that is giving me a lot of trouble. I tried to write a
Perl script to automatically log in to Lexis-Nexis through my
university library using WWW::Mechanize, Crypt::SSLeay, HTTP::Cookies,
and LWP::UserAgent. I think that I can log into Lexis-Nexis. But to do
the news search, Lexis-Nexis uses JavaScript. I still need to figure
out how to handle these queries through JavaScript.

Anyway, thank you again for your code and suggestions!

William
 

Gunnar Hjalmarsson

william said:
Tad and Gunnar,

Thank you very much for your replies. I have largely achieved one of
my goals: to divide a big file into separate small files. But I still
have a huge task before this step. Here is a little background.

I want to search news stories from Lexis-Nexis database through my
library server (ezproxy). Then download the news stories and separate
them. The last step would be to classify them into various groups.
Right now, I have some clue on how to finish the last two steps. It is
the first step that is giving me a lot of trouble. I tried to write a
Perl script to automatically log in to Lexis-Nexis through my
university library using WWW::Mechanize, Crypt::SSLeay, HTTP::Cookies,
and LWP::UserAgent. I think that I can log into Lexis-Nexis. But to do
the news search, Lexis-Nexis uses JavaScript. I still need to figure
out how to handle these queries through JavaScript.

Please note that this is a Perl group, not a group for JavaScript, and
its purpose is to discuss Perl, possibly answering specific questions,
not to write programs from vague specifications.

Sounds to me as if you need a consultant.
 

Tad J McClellan

william said:
I want to


But are you _allowed_ to do what you "want"?

search news stories from Lexis-Nexis database through my
library server (ezproxy). Then download the news stories


http://www.lexisnexis.com/terms/

...

You may not decompile, reverse engineer, disassemble, rent,
lease, loan, sell, sublicense, or create derivative works
from this Web Site or the Content. Nor may you use any
network monitoring or discovery software to determine the
site architecture, or extract information about usage,
individual identities or users. You may not use any robot,
spider, other automatic software or device, or manual process
to monitor or copy our Web Site or the Content without
Provider's prior written permission.
 

william

Tad,

I think that you have a very good point. Although my purpose is to
analyze news effects on the financial markets, purely for academic
uses, I'd rather be a bit careful and maybe discard my first step of
crawling over the Lexis-Nexis database.

Gunnar,
Your suggestion of hiring a consultant is fine with me. When I get
permission from Lexis-Nexis for data-mining their news database, I
will definitely need some help from experts like you and Tad. It would
be a project too big for me to keep under control.

Thanks again for all your suggestions for solving my step 2 problem.

Best,
William
 
