text analyzer

martinus

I have created a little text analysis tool that tries to extract
words that are important in a given text. I have implemented one of my
strange ideas, and to my own surprise, it works. I have no idea if any
similar tool exists, so I do not know where to post this. It is written
in Ruby, so I'm just posting it here :)

To use this tool, you first have to index a large number of text files.
This generates an index, which is later used when analyzing a text.
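
A typical session looks like this (the c argument builds the index from
standard input and stores it in wordcount.dat; the a argument analyzes a
new text against that index; training/*.txt and mystery.txt are just
placeholder names):

$ cat training/*.txt | ruby textanalyze.rb c
$ cat mystery.txt | ruby textanalyze.rb a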

For example, I have indexed several fairy tales, and used this index to
extract important words. Here are some results:

Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
Alladin.txt: aladdin, lamp, genie, sultan, wizard

The algorithm works with HTML files and probably any other format that
contains text. Here is an example of the analysis results when HTML
files are indexed:

SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
META-FAQ.html: newsgroup, comp, sunsite, questions, announce
TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive

And now my question: Does anyone know where to find such tools or
algorithms?

You can get it from here, it's public domain:
http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer

martinus
 
Alexey Verkhovsky

I have created a little text analysis tool that tries to extract
words that are important in a given text.

Would you care to explain what one could use this for?

Alex
 
Thomas E Enebo

Would you care to explain what one could use this for?

I am not the author, but I can think of two...

I think it could be useful for classification of spam. Apply this
filter first, and then do the Bayesian stuff. I bet it would
significantly help in classifying wordy spam as spam (Bayes will not do
so well with things like Nigerian spam messages).
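
Something along these lines (a purely hypothetical sketch; the counts
would come from labeled training mail, and the word list from the
analyzer):

# Hypothetical sketch: score only the analyzer's characteristic words
# with a naive Bayes log-odds ratio. spam_counts/ham_counts map words
# to how often they appeared in labeled spam/ham training mail.
def spam_log_odds(words, spam_counts, ham_counts, spam_total, ham_total)
  words.inject(0.0) do |sum, w|
    # Add-one smoothing so a word unseen in training can't zero things out.
    p_spam = (spam_counts.fetch(w, 0) + 1.0) / (spam_total + 2.0)
    p_ham  = (ham_counts.fetch(w, 0) + 1.0) / (ham_total + 2.0)
    sum + Math.log(p_spam / p_ham)
  end
end

# A positive score suggests spam, a negative one ham.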

Another place I bet something like this is used is in Google page
ranking. They need algorithms for cutting out the noise.

-Tom
 
Shashank Date

Hi Martin,

--- martinus said:
And now my question: Does anyone know where to find such tools or
algorithms?

Word (Text) analysis is a very active branch of
Information Theory.

Just Google for "word entropy" and spend the rest of
your life surfing ;-)
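
For a taste, the Shannon entropy of a text's word distribution takes
only a few lines of Ruby (a toy illustration, nothing to do with
Martin's tool):

# Shannon entropy of the word distribution: H = -sum p(w) * log2 p(w).
# High entropy means the text spreads its usage over many words.
words  = STDIN.read.downcase.scan(/[a-z']+/)
counts = Hash.new(0)
words.each { |w| counts[w] += 1 }
n = words.size.to_f
entropy = counts.values.inject(0.0) do |h, c|
  p = c / n
  h - p * Math.log(p) / Math.log(2)   # divide to get log base 2 (bits)
end
printf("%.3f bits per word\n", entropy)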

HTH,
-- shanko
 
jm

Thought I'd give this simple program a go and review it for those
curious as to how well it works. Two tests were carried out: for the
first I used only the following texts; the second repeated the first
with additional training texts. The texts are from Project Gutenberg
(except openbsd35.readme.txt, which for some reason was in the same
directory).

$ ls *.txt
8ldvc10.txt openbsd35.readme.txt
grimm10.txt sunzu10.txt

$ cat *.txt |ruby textanalyze.rb c
reading...
Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per second
Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per second
Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per second
Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per second
Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per second
Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per second
Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per second
Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per second
storing into wordcount.dat...
Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per second

I then fed it a text version of my marketing essay, which should have
very little, if anything, in common with the training texts.

$ cat ../assignment1.txt|ruby textanalyze.rb a
loading wordcount.dat...
reading...
analyzing...
most characteristic words:
marketing, customers, customer, purchase, interaction

I then added more texts:

$ ls *.txt
8ldvc10.txt openbsd35.readme.txt tprnc11.txt
dracu13.txt repub11.txt warw12.txt
grimm10.txt sunzu10.txt

and re-ran the above create and analyze steps to get:

most characteristic words:
marketing, customers, customer, 4ps, interaction

So, not bad for such a simple algorithm. I would have picked the
keywords as relationship, marketing, 4Ps, and customer retention. I'm
surprised coffee didn't show up, as I kept using it in examples. It
doesn't do too badly in this simple test, especially considering that
the training text was chosen at random and is not related to the text
analyzed. A dictionary of plurals or some other means of dealing with
plurals would be my only suggestion.
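
Even something naive would go a long way, e.g. folding a plural's count
into its singular when both occur (a rough sketch, not part of the tool):

# Naive plural folding: merge "customers" into "customer" (and
# "stories" into "story") whenever the singular form also occurs,
# so near-duplicates don't crowd the top-five list.
def fold_plurals(counts)
  counts.keys.each do |word|
    singular = if word =~ /ies\z/
                 word.sub(/ies\z/, "y")
               elsif word =~ /s\z/
                 word.sub(/s\z/, "")
               end
    if singular && counts.key?(singular)
      counts[singular] += counts.delete(word)
    end
  end
  counts
end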

Jeff.
 
Markus

I decided to do a somewhat more ambitious test. After training on
a thousand arbitrary .doc files and a thousand arbitrary .html files
(and tweaking it to return the top 15 words instead of just the top 5) I
fed it Why the lucky stiff's latest opus:


loading wordcount.dat...
reading...
analyzing...
most characteristic words:
he, his, cham, dr, said, ruby, goat, method,
irb, ree, paij, sentence, him, had, end

Not bad at all. Although I haven't read it yet myself, this looks like
a quite reasonable summary. I'm a little surprised at the absence of
flugel and trisomatic, but perhaps WTLS has gotten less predictable in
his vocabulary since the last time I read him.

-- Markus
 
martinus

The goal is to automatically create summaries of a text. For example,
if you have a large text file and you have no idea what it is about,
the analyzer should be able to give you a short summary of the file.
Another nice idea might be to add such a feature to a blogging webpage:
each entry could show a short summary, or at least the most important
words.

martinus
 
Kaspar Schiess

Markus wrote:

| Not bad at all. Although I haven't read it yet myself, this looks like
| a quite reasonable summary. I'm a little surprised at the absence of
| flugel and trisomatic, but perhaps WTLS has gotten less predictable in
| his vocabulary since the last time I read him.

That post made me smile, since it was ambiguous in its heading at the
very least. Do you actually like reading WTLS as much as you seem to?

--
kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby
 
Markus

Kaspar Schiess wrote:

| That post made me smile, since it was ambiguous in its heading at the
| very least. Do you actually like reading WTLS as much as you seem to?

I love reading his stuff. I have, however, been asked to wait a
day or two after reading anything he wrote before writing any
documentation or client proposals. I almost made one of our lawyers
turn blue once, but the shade did not suit him.

-- Markus
 
martinus

For best results, you should use training material that is similar to
the text you want to analyze. I don't think it is useful to train on
.doc documents when you want to analyze HTML files.

martinus
 
Hal Fulton

martinus said:
For best results, you should use training material that is similar to
the text you want to analyze. I don't think it is useful to train on
.doc documents when you want to analyze HTML files.

Can you clarify this? Do you mean:

1. The text is not pulled from the format but retains some residue from
where it came from (JuliusCaesar.doc will train differently from
JuliusCaesar.html).

2. The material should be of the same general type, coming from the same
type of source; but the actual format does not affect training.

3. Something else?


Hal
 
Michael Campbell

I think it could be useful for classification of spam. Apply this
filter first, and then do the Bayesian stuff. I bet it would
significantly help in classifying wordy spam as spam (Bayes will not do
so well with things like Nigerian spam messages).

Not sure why you'd think this, but POPFile (a "pure", i.e.
non-Grahamesque) Bayesian filter does extraordinarily well with 419s,
Nigerian, and similar spam.

For me, anyway.
 
martinus

The text is never pulled out of any format. If you train on only HTML
files and then analyze HTML files, the HTML tags are treated just like
normal words. They simply don't show up in the results, because they
are used about equally often in both the training texts and the
analyzed text.
The algorithm is very simple and makes absolutely no assumptions about
the input.
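
In other words, the principle is roughly this (a simplified sketch of
the idea, not the actual textanalyze.rb code):

# Sketch of the principle: a word is characteristic when it occurs much
# more often in the analyzed text than the training index predicts.
# Words common in both (HTML tags, "the", "and") score close to 1.
def characteristic_words(text_counts, index_counts, text_total, index_total, top = 5)
  scored = text_counts.map do |word, count|
    observed = count.to_f / text_total
    expected = (index_counts.fetch(word, 0) + 1.0) / index_total  # smoothed
    [word, observed / expected]
  end
  scored.sort_by { |_, score| -score }.first(top).map { |word, _| word }
end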
 
bruno modulix

Michael Campbell wrote:
Not sure why you'd think this, but POPFile (a "pure", i.e.
non-Grahamesque) Bayesian filter does extraordinarily well with 419s,

Pardon my profound ignorance, but what do you call '419s'?
TIA
Bruno
 
bruno modulix

Markus wrote:
I love reading his stuff. I have, however, been asked to wait a
day or two after reading anything he wrote before writing any
documentation or client proposals. I almost made one of our lawyers
turn blue once, but the shade did not suit him.
LOL
 
