Cleaning files using Python

Sez

Hi,

I'm not a programmer. I have started working as a text miner, and as a
first task I have been given 1000 dirty files that need to be cleaned
before classification tasks. I have been told Python is the best tool
for this job.

Each file's structure is as below:

Comments: This is article 1965 obtained from the website
Title: Banana Report #65, September 2003
Author: dylab
Date: 1st September 2003
Section: pulse

In the past month:
A mass hit North America, cutting electricity to 50 million people
across the North east


I'm expected to run a Python script so that the file looks like
this:

pulse, In, the, past, month, A, mass, hit, North, America, cutting,
electricity, to, 50, million, people, across, the, North east, dylab

Could you please point me in the right direction here, or provide some
example code? In the meantime I'll be searching myself. I know you
guys hate novice people like me, but I would appreciate it if you could
provide a little help here.

Thanks & regards,
Sez
 
Peter Hansen

Sez sez:
Each file's structure is as below:
Comments: This is article 1965 obtained from the website
Title: Banana Report #65, September 2003
Author: dylab
Date: 1st September 2003
Section: pulse

In the past month:
A mass hit North America, cutting electricity to 50 million people
across the North east


I'm expected to run a Python script so that the file looks like
this:

pulse, In, the, past, month, A, mass, hit, North, America, cutting,
electricity, to, 50, million, people, across, the, North east, dylab

You'll need either more examples or a more detailed description. The
above could be interpreted as something like "put the pulse section
first, then exactly 19 words from the following text, removing
punctuation and line breaks, and taking the last two words together as
one, then add the 'author' field, and write them all out together with a
field separator of ', ' (comma plus space)".
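
For illustration only, here is a minimal sketch of roughly that reading. It assumes each file has "Name: value" header lines, then a blank line, then the body. The function name clean_article is hypothetical, and the sketch does not reproduce the "North east" quirk (keeping two words as one field), since no rule for it was given.

```python
import string


def clean_article(text):
    """Return 'section, word, ..., author' for one article (assumed layout)."""
    # Assumption: headers and body are separated by the first blank line.
    header_part, _, body = text.partition("\n\n")

    # Header lines look like "Name: value"; split on the first colon only,
    # so values that contain colons or commas stay intact.
    headers = {}
    for line in header_part.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    # Split the body on whitespace (this also removes line breaks), then
    # strip punctuation from both ends of each token and drop empty tokens.
    words = [w.strip(string.punctuation) for w in body.split()]
    words = [w for w in words if w]

    fields = [headers.get("section", "")] + words + [headers.get("author", "")]
    return ", ".join(fields)


sample = (
    "Comments: This is article 1965 obtained from the website\n"
    "Title: Banana Report #65, September 2003\n"
    "Author: dylab\n"
    "Date: 1st September 2003\n"
    "Section: pulse\n"
    "\n"
    "In the past month:\n"
    "A mass hit North America, cutting electricity to 50 million people\n"
    "across the North east\n"
)
print(clean_article(sample))
```

On the sample this prints the section, the body words and the author separated by ', ', with "North" and "east" as separate fields. Whether that matches what is actually wanted is exactly the open question.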

On the other hand, it could be interpreted a large number of other ways,
and since none of us have any idea what you are trying to do with the
results, we can't use our own intuition or experience to help.

I also personally find it hard to respond to questions like this with
real code when there are things about the task which I find very
surprising. For example, you're throwing away the date information
entirely, along with the comments and title. Is that really intended?

And are the author and section fields always exactly one word, with no
punctuation? (What would happen if an author's name was "Hansen,
Peter"? How would you format that in the output without getting the
first name confused with the next field?)
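
One possible way around the "Hansen, Peter" problem, assuming the output is meant to be read back as comma-separated data (which has not been confirmed), is to let the standard csv module quote any field that contains the separator. The helper name format_row is hypothetical. Note that csv uses a single-character delimiter, so the separator here is ',' rather than ', '.

```python
import csv
import io


def format_row(fields):
    """Join fields with commas, quoting any field that contains a comma."""
    buffer = io.StringIO()
    # The default quoting mode (QUOTE_MINIMAL) only quotes fields that need it.
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


row = format_row(["pulse", "In", "the", "Hansen, Peter"])
print(row)

# Reading the line back recovers the author as a single field.
print(next(csv.reader([row])))
```

The author's name comes back as one field because it was written inside double quotes.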

Could you please point me in the right direction here, or provide some
example code? In the meantime I'll be searching myself. I know you
guys hate novice people like me, but I would appreciate it if you could
provide a little help here.

We don't "hate" novice people by any means... I suspect you are either
trying to be self-deprecating or maybe you just haven't read this
newsgroup for long. c.l.p actually *loves* novices; it just doesn't
care for questions that aren't very clear. Keep trying (and improving!)
and you'll definitely get the help you need.

And your comment about Python being the best language for this is pretty
close to the mark... but there are certainly a variety of ways to go
about the task and the best might depend on a lot of unanswered questions.

-Peter
 
Simon Brunning

Could you please point me in the right direction here, or provide some
example code? In the meantime I'll be searching myself. I know you
guys hate novice people like me, but I would appreciate it if you could
provide a little help here.

Oh, we don't hate novices here, not at all. On the other hand, we
aren't going to write your script for you. ;-) Why not take a look at
the Python beginners guide (at
<http://www.python.org/moin/BeginnersGuide>), and come back to us when
you have a specific problem.
 
