How to separate a big text file (say 400 news stories) into many small text files?

william · Mar 22, 2009

I have downloaded a big text file from Lexis-Nexis. This text file has
412 news stories. Each story begins with, for example, 1 of 412
DOCUMENTS or 2 of 412 DOCUMENTS and close to the end it has LOAD-DATE:
information. I want to process this one big text file into 412
separate text files and name them according to the LOAD-DATE, for
example, DELL_20081205_1.txt or DELL_20081205_2.txt. Can anybody give
me some hint on how to achieve this using Perl? The actual text file
is attached below.

Thank you very much.

William

1 of 412 DOCUMENTS

Austin American-Statesman (Texas)

December 5, 2008 Friday
Final Edition

CENTRAL TEXAS DIGEST

BYLINE: FROM STAFF REPORTS

SECTION: BUSINESS; Pg. B08

LENGTH: 375 words

   COMPUTER MAKERS

   Dell stockholders ask for investigation

   Dell Inc. said that it received a shareholder-demand letter asking
the board
to investigate allegations that some current and former directors and
officers
imprudently invested and managed funds in the 401(k) plan.
.......

LOAD-DATE: December 5, 2008

LANGUAGE: ENGLISH

PUBLICATION-TYPE: Newspaper

2 of 412 DOCUMENTS

Contra Costa Times (California)

August 29, 2008 Friday

Stocks on the move: International Paper, Magma Design, PetSmart

BYLINE: wire

SECTION: BUSINESS

LENGTH: 677 words

   By Fabio Alves

   Bloomberg News

   The following companies are having unusual price changes in U.S.
markets this
afternoon.

   Dell Inc. (DELL US) dropped 12 percent, the most since November, to
$22.08.
The world's second-largest personal-computer said the U.S. slump in
technology
spending has moved abroad. Second-quarter profit before one-time items
was 33
cents a share, short of the 36-cent average projection compiled by
Bloomberg.

LOAD-DATE: August 29, 2008

LANGUAGE: ENGLISH

A. Sinan Unur · Mar 22, 2009

william said:
I have downloaded a big text file from Lexis-Nexis. This text file has
412 news stories. Each story begins with, for example, 1 of 412
DOCUMENTS or 2 of 412 DOCUMENTS and close to the end it has LOAD-DATE:
information. I want to process this one big text file into 412
separate text files and name them according to the LOAD-DATE, for
example, DELL_20081205_1.txt or DELL_20081205_2.txt. Can anybody give
me some hint on how to achieve this using Perl? The actual text file
is attached below.

This group is for programmers to get help with *code*. In particular,
this is not a place to get ready-made scripts to solve your specific
problem.

You need to show what you have so far and what you need help with.

There are many ways of getting what you want. All follow the same basic
structure: Find the start of a record, save all content up to the start
of the next record and do this until there are no more records left.

There is a loop or two, a couple of regexes and very simple command line
processing involved.
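
The structure described above can be sketched in a few lines. This is only one of the many possible ways, and it assumes that every record starts with a line of the form "N of M DOCUMENTS", as in the posted sample. The name split_records is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one big string into records.
# Assumes every record begins with a line like "12 of 412 DOCUMENTS".
sub split_records {
    my ($text) = @_;

    # Split at the start of each header line; the lookahead keeps the
    # header itself as part of the record that follows it.
    my @chunks = split /^(?=[ \t]*\d+ of \d+ DOCUMENTS)/m, $text;

    # Drop anything that precedes the first header (e.g. a cover page).
    return grep { /\A[ \t]*\d+ of \d+ DOCUMENTS/ } @chunks;
}

my $sample = "cover page\n1 of 2 DOCUMENTS\nfirst story\n"
           . "2 of 2 DOCUMENTS\nsecond story\n";

my @records = split_records($sample);
printf "%d records found\n", scalar @records;
```

Each element of the returned list holds one complete story, header included, so writing the files becomes a separate and simpler loop.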

If you want someone to write this for you, you should post the job at
http://jobs.perl.org/

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Tad J McClellan · Mar 22, 2009

william said:
I have downloaded a big text file from Lexis-Nexis. This text file has
412 news stories. Each story begins with, for example, 1 of 412
DOCUMENTS or 2 of 412 DOCUMENTS and close to the end it has LOAD-DATE:
information. I want to process this one big text file into 412
separate text files and name them according to the LOAD-DATE, for
example, DELL_20081205_1.txt or DELL_20081205_2.txt. Can anybody give
me some hint on how to achieve this using Perl?

Read the file.

Buffer up an article, making note of the number and date when they go by.

When at end of article print it to a file.

Which of those are you having trouble with?

Have a fish.

When you come back asking how to convert "December_5_2008"
into "20081205" we will expect to see the code that you've
written so far.

------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my($article, $num, $date);
while ( <DATA> ) {
    if ( /(\d+) of \d+ DOCUMENTS/ ) {
        if ( $article ) { # output then empty the article buffer
            open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
            print $ARTICLE $article;
            close $ARTICLE;
            $article = '';
        }
        $num = $1;
    }

    if ( /LOAD-DATE: (.*)/ ) {
        $date = $1;
        $date =~ tr/ ,/_/s;
    }

    $article .= $_;
}

if ( $article ) {
    open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
    print $ARTICLE $article;
    close $ARTICLE;
}

__DATA__
1 of 412 DOCUMENTS

Austin American-Statesman (Texas)

December 5, 2008 Friday
Final Edition

CENTRAL TEXAS DIGEST

BYLINE: FROM STAFF REPORTS

SECTION: BUSINESS; Pg. B08

LENGTH: 375 words

   COMPUTER MAKERS

   Dell stockholders ask for investigation

   Dell Inc. said that it received a shareholder-demand letter asking
the board
to investigate allegations that some current and former directors and
officers
imprudently invested and managed funds in the 401(k) plan.
.......

LOAD-DATE: December 5, 2008

LANGUAGE: ENGLISH

PUBLICATION-TYPE: Newspaper

2 of 412 DOCUMENTS

Contra Costa Times (California)

August 29, 2008 Friday

Stocks on the move: International Paper, Magma Design, PetSmart

BYLINE: wire

SECTION: BUSINESS

LENGTH: 677 words

   By Fabio Alves

   Bloomberg News

   The following companies are having unusual price changes in U.S.
markets this
afternoon.

   Dell Inc. (DELL US) dropped 12 percent, the most since November, to
$22.08.
The world's second-largest personal-computer said the U.S. slump in
technology
spending has moved abroad. Second-quarter profit before one-time items
was 33
cents a share, short of the 36-cent average projection compiled by
Bloomberg.

LOAD-DATE: August 29, 2008

LANGUAGE: ENGLISH

william · Mar 22, 2009

Tad,

Thank you very much for providing a script for me!

I only modified one place to ignore anything before _ of _ DOCUMENTS.
It would be neat to have the date converted to the format you
mentioned. Since I'm not a Perl expert, I may have to do it later
using SAS. Below is the script I used. (I'm fascinated and overwhelmed
by Perl's capabilities. I still have many questions related to
Perl.)

I really appreciate your help.

William

#!/usr/bin/perl
use warnings;
use strict;
open IN,"dell" or die "could not open $!";
my($article, $num, $date);
$num=0;
while ( <IN> ) {
    if ( /(\d+) of \d+ DOCUMENTS/ ) {
        if ( $article && $num ) { # output then empty the article buffer
            print "Output for file $num.\n";
            open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
            print $ARTICLE $article;
            close $ARTICLE;
            $article = '';
        }
        $num = $1;
    }
    if ( /LOAD-DATE: (.*)/ ) {
        $date = $1;
        $date =~ tr/ ,/_/s;
        chop($date);
    }
    $article .= $_;
}

print "Output for the last file $num.\n";

if ( $article ) {
    open my $ARTICLE, '>', "${date}_$num.txt" or die "could not open $!";
    print $ARTICLE $article;
    close $ARTICLE;
}

Tad J McClellan · Mar 22, 2009

william said:
It would be neat to have the date converted to the format you
mentioned. Since I'm not a Perl expert, I may have to do it later
using SAS.

You do not need to be an expert, as the expert-level work has
probably already been done and packaged up for others to use.

There are many modules on CPAN (perldoc -q CPAN) that can parse
dates for you such as DateTime::Format::Natural.

Using such a module, it should take about 3 lines of code to
convert "December 5, 2008" to "20081205".
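
Where installing a CPAN module is not convenient, the same conversion can also be sketched with a hand-written month table. This is not one of the modules named in this thread, just a dependency-free alternative. It assumes the date always looks like "December 5, 2008" with an English month name. The names load_date_to_ymd and story_filename are hypothetical, and the "DELL" prefix is taken from the file names requested in the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month names as they appear in the sample LOAD-DATE lines.
my %MONTH_NUM;
@MONTH_NUM{qw(January February March April May June July
              August September October November December)} = (1 .. 12);

# Hypothetical helper: "December 5, 2008" -> "20081205".
# Returns undef (in scalar context) when the string is not in that format.
sub load_date_to_ymd {
    my ($text) = @_;
    my ($month, $day, $year) =
        $text =~ /^\s*([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})\s*$/
        or return;
    my $mon = $MONTH_NUM{ ucfirst lc $month } or return;
    return sprintf '%04d%02d%02d', $year, $mon, $day;
}

# Hypothetical helper: build a name such as "DELL_20081205_1.txt".
sub story_filename {
    my ($prefix, $load_date, $num) = @_;
    my $ymd = load_date_to_ymd($load_date);
    return unless defined $ymd;
    return sprintf '%s_%s_%d.txt', $prefix, $ymd, $num;
}

print story_filename('DELL', 'December 5, 2008', 1), "\n";
```

Run as is, this prints DELL_20081205_1.txt. A date module remains the better choice as soon as the input format varies.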

open IN,"dell" or die "could not open $!";

You should use the 3-argument form of open() and a lexical filehandle
like I did in my code.

open my $IN, '<', 'dell' or die "could not open $!";

while ( <IN> ) {

chop($date);

Don't use chop() for removing newlines.

chomp($date); # much safer than chop()
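
Put together, the two suggestions above might look like the sketch below. The helper name read_load_dates is made up for illustration. The carriage-return handling is an assumption about why chop() was used in the first place (a file with CRLF line ends); the thread does not say.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the LOAD-DATE values found in a file.
sub read_load_dates {
    my ($path) = @_;

    # 3-argument open with a lexical filehandle.
    open my $IN, '<', $path or die "could not open '$path': $!";

    my @dates;
    while ( my $line = <$IN> ) {    # note <$IN>, not <IN>
        chomp $line;                # removes a trailing newline, nothing else
        $line =~ s/\r\z//;          # assumption: the file may have CRLF line ends
        push @dates, $1 if $line =~ /LOAD-DATE:\s*(.*)/;
    }
    close $IN;
    return @dates;
}

# chop() removes the last character whatever it is; chomp() only a newline.
my $chopped = "December 5, 2008";
my $chomped = "December 5, 2008";
chop  $chopped;    # now "December 5, 200"
chomp $chomped;    # still "December 5, 2008"
print "chop:  $chopped\nchomp: $chomped\n";
```

The last few lines show the risk: on a string that has no trailing newline, chop() silently eats the final digit of the year.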

Gunnar Hjalmarsson · Mar 22, 2009

Tad said:
There are many modules on CPAN (perldoc -q CPAN) that can parse
dates for you such as DateTime::Format::Natural.

Using such a module, it should take about 3 lines of code to
convert "December 5, 2008" to "20081205".

I prefer Date::Parse before a heavy-weight module like that.

$ perl -MDate::Parse -e '
($d, $m, $y) = (strptime "December 5, 2008")[3..5]; # 1
printf "%d%02d%02d\n", $y+1900, $m+1, $d; # 2
'
20081205
$

william · Mar 22, 2009

Tad and Gunnar,

Thank you very much for your replies. I have largely achieved one of
my goals: to divide a big file into separate small files. But I still
have a huge task before this step. Here is a little background.

I want to search news stories from Lexis-Nexis database through my
library server (ezproxy). Then download the news stories and separate
them. The last step would be to classify them into various groups.
Right now, I have some clue on how to finish the last two steps. It is
the first step that is giving me a lot of trouble. I tried to write a
Perl script to log in to Lexis-Nexis automatically through my
university library using WWW::Mechanize, Crypt::SSLeay, HTTP::Cookies,
and LWP::UserAgent. I think that I can log into Lexis-Nexis. But to do
the news search, Lexis-Nexis uses JavaScript. I still need to figure
out how to handle these queries made through JavaScript.

Anyway, thank you again for your code and suggestions!

William

Gunnar Hjalmarsson · Mar 22, 2009

william said:
Tad and Gunnar,

Thank you very much for your replies. I have largely achieved one of
my goals: to divide a big file into separate small files. But I still
have a huge task before this step. Here is a little background.

I want to search news stories from Lexis-Nexis database through my
library server (ezproxy). Then download the news stories and separate
them. The last step would be to classify them into various groups.
Right now, I have some clue on how to finish the last two steps. It is
the first step that is giving me a lot of trouble. I tried to write a
Perl script to log in to Lexis-Nexis automatically through my
university library using WWW::Mechanize, Crypt::SSLeay, HTTP::Cookies,
and LWP::UserAgent. I think that I can log into Lexis-Nexis. But to do
the news search, Lexis-Nexis uses JavaScript. I still need to figure
out how to handle these queries made through JavaScript.

Please note that this is a Perl group, not a group for JavaScript, and
its purpose is to discuss Perl, possibly answering specific questions,
not writing programs out from vague specifications.

Sounds to me as if you need a consultant.

Tad J McClellan · Mar 23, 2009

william said:
I want to

But are you _allowed_ to do what you "want"?

search news stories from Lexis-Nexis database through my
library server (ezproxy). Then download the news stories

http://www.lexisnexis.com/terms/

...

You may not decompile, reverse engineer, disassemble, rent,
lease, loan, sell, sublicense, or create derivative works
from this Web Site or the Content. Nor may you use any
network monitoring or discovery software to determine the
site architecture, or extract information about usage,
individual identities or users. You may not use any robot,
spider, other automatic software or device, or manual process
to monitor or copy our Web Site or the Content without
Provider's prior written permission.

william · Mar 23, 2009

Tad,

I think that you have a very good point. Although my purpose is to
analyze news effects on the financial markets, purely for academic
uses, I'd rather be a bit careful and maybe discard my first step of
crawling over the Lexis-Nexis database.

Gunnar,
Your suggestion of hiring a consultant is fine with me. When I get
permission from Lexis-Nexis for data-mining their news database, I
will definitely need some help from experts like you and Tad. It would
be a project too big to be under my control.

Thanks again for all your suggestions for solving my step 2 problem.

Best,
William
