text analyzer

martinus

I have created a little text analysis tool that tries to extract
words that are important in a given text. I have implemented one of my
strange ideas, and to my own surprise, it works. I have no idea if any
similar tool exists, so I do not know where to post this. It is written
in Ruby, so I'm just posting it here :)

To use this tool, you first have to index a large number of text files.
This generates an index, which is later used when analyzing a text.
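
A typical session looks like this (the c argument builds the index from
standard input and stores it in wordcount.dat; the a argument analyzes a
new text against that index; training/*.txt and mystery.txt are just
placeholder names):

$ cat training/*.txt | ruby textanalyze.rb c
$ cat mystery.txt | ruby textanalyze.rb a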

For example, I have indexed several fairy tales, and used this index to
extract important words. Here are some results:

Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
Alladin.txt: aladdin, lamp, genie, sultan, wizard

The algorithm works with HTML files and probably any other format that
contains text. Here is an example of the analysis results when HTML
files are indexed:

SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
META-FAQ.html: newsgroup, comp, sunsite, questions, announce
TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive

And now my question: Does anyone know where to find such tools or
algorithms?

You can get it from here, it's public domain:
http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer

martinus
 
Alexey Verkhovsky

I have created a little text analysis tool that tries to extract
words that are important in a given text.

Would you care to explain what one could use this for?

Alex
 
Thomas E Enebo

Would you care to explain what one could use this for?

I am not the author, but I can think of two...

I think it could be useful for classification of spam. Apply this
filter first, and then do the Bayesian stuff. I bet it would
significantly help in classifying wordy spam as spam (Bayes will not do
so well with things like Nigerian spam messages).
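
Something along these lines (a purely hypothetical sketch; the counts
would come from labeled training mail, and the word list from the
analyzer):

# Hypothetical sketch: score only the analyzer's characteristic words
# with a naive Bayes log-odds ratio. spam_counts/ham_counts map words
# to how often they appeared in labeled spam/ham training mail.
def spam_log_odds(words, spam_counts, ham_counts, spam_total, ham_total)
  words.inject(0.0) do |sum, w|
    # Add-one smoothing so a word unseen in training can't zero things out.
    p_spam = (spam_counts.fetch(w, 0) + 1.0) / (spam_total + 2.0)
    p_ham  = (ham_counts.fetch(w, 0) + 1.0) / (ham_total + 2.0)
    sum + Math.log(p_spam / p_ham)
  end
end

# A positive score suggests spam, a negative one ham.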

Another place I bet something like this is used is in Google page
ranking. They need algorithms for cutting out the noise.

-Tom
 
Shashank Date

Hi Martin,

--- martinus said:
And now my question: Does anyone know where to find such tools or
algorithms?

Word (Text) analysis is a very active branch of
Information Theory.

Just Google for "word entropy" and spend the rest of
your life surfing ;-)
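
For a taste, the Shannon entropy of a text's word distribution takes
only a few lines of Ruby (a toy illustration, nothing to do with
Martin's tool):

# Shannon entropy of the word distribution: H = -sum p(w) * log2 p(w).
# High entropy means the text spreads its usage over many words.
words  = STDIN.read.downcase.scan(/[a-z']+/)
counts = Hash.new(0)
words.each { |w| counts[w] += 1 }
n = words.size.to_f
entropy = counts.values.inject(0.0) do |h, c|
  p = c / n
  h - p * Math.log(p) / Math.log(2)   # divide to get log base 2 (bits)
end
printf("%.3f bits per word\n", entropy)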

HTH,
-- shanko
 
jm

Thought I'd give this simple program a go and review it for those
curious as to how well it works. Two tests were carried out: for the
first I used only the following texts; the second repeated the first
with additional training texts. The texts are from Project Gutenberg
(except openbsd35.readme.txt, which for some reason was in the same
directory).

$ ls *.txt
8ldvc10.txt openbsd35.readme.txt
grimm10.txt sunzu10.txt

$ cat *.txt |ruby textanalyze.rb c
reading...
Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per second
Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per second
Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per second
Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per second
Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per second
Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per second
Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per second
Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per second
storing into wordcount.dat...
Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per second

I then fed it a text version of my marketing essay, which should have
very little, if anything, in common with the training texts.

$ cat ../assignment1.txt|ruby textanalyze.rb a
loading wordcount.dat...
reading...
analyzing...
most characteristic words:
marketing, customers, customer, purchase, interaction

I then added more texts:

$ ls *.txt
8ldvc10.txt openbsd35.readme.txt tprnc11.txt
dracu13.txt repub11.txt warw12.txt
grimm10.txt sunzu10.txt

and re-ran the above create and analyze steps to get:

most characteristic words:
marketing, customers, customer, 4ps, interaction

So, not bad for such a simple algorithm. I would have picked the
keywords as relationship, marketing, 4Ps, and customer retention. I'm
surprised coffee didn't show up, as I kept using it in examples. It
doesn't do too badly in this simple test, especially considering that
the training text was chosen at random and is not related to the text
analyzed. A dictionary of plurals or some other means of dealing with
plurals would be my only suggestion.
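
Even something naive would go a long way, e.g. folding a plural's count
into its singular when both occur (a rough sketch, not part of the tool):

# Naive plural folding: merge "customers" into "customer" (and
# "stories" into "story") whenever the singular form also occurs,
# so near-duplicates don't crowd the top-five list.
def fold_plurals(counts)
  counts.keys.each do |word|
    singular = if word =~ /ies\z/
                 word.sub(/ies\z/, "y")
               elsif word =~ /s\z/
                 word.sub(/s\z/, "")
               end
    if singular && counts.key?(singular)
      counts[singular] += counts.delete(word)
    end
  end
  counts
end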

Jeff.
 
Markus

I decided to do a somewhat more ambitious test. After training on
a thousand arbitrary .doc files and a thousand arbitrary .html files
(and tweaking it to return the top 15 words instead of just the top 5) I
fed it Why the lucky stiff's latest opus:


loading wordcount.dat...
reading...
analyzing...
most characteristic words:
he, his, cham, dr, said, ruby, goat, method,
irb, ree, paij, sentence, him, had, end

Not bad at all. Although I haven't read it yet myself, this looks like
a quite reasonable summary. I'm a little surprised at the absence of
flugel and trisomatic, but perhaps WTLS has gotten less predictable in
his vocabulary since the last time I read him.

-- Markus
 
martinus

The goal is to automatically create summaries of a text. For example,
if you have a large text file and you have no idea what it is about,
the analyzer should be able to give you a short summary of the file.
Another nice idea might be to add such a feature to a blogging webpage:
each entry could show a short summary, or at least the most important
words.

martinus
 
Kaspar Schiess

Markus wrote:

| Not bad at all. Although I haven't read it yet myself, this looks like
| a quite reasonable summary. I'm a little surprised at the absence of
| flugel and trisomatic, but perhaps WTLS has gotten less predictable in
| his vocabulary since the last time I read him.

That post made me smile, since it was ambiguous in its heading at the
very least. Do you actually like reading WTLS as much as you seem to?

--
kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby
 
Markus

Kaspar Schiess wrote:

| That post made me smile, since it was ambiguous in its heading at the
| very least. Do you actually like reading WTLS as much as you seem to?

I love reading his stuff. I have, however, been asked to wait a
day or two after reading anything he wrote before writing any
documentation or client proposals. I almost made one of our lawyers
turn blue once, but the shade did not suit him.

-- Markus
 
martinus

For best results, you should use training material that is similar to
the text you want to analyze. I don't think it is useful to train on
.doc documents when you want to analyze HTML files.

martinus
 
Hal Fulton

martinus said:
For best results, you should use training material that is similar to
the text you want to analyze. I don't think it is useful to train on
.doc documents when you want to analyze HTML files.

Can you clarify this? Do you mean:

1. The text is not pulled from the format but retains some residue from
where it came from (JuliusCaesar.doc will train differently from
JuliusCaesar.html).

2. The material should be of the same general type, coming from the same
type of source; but the actual format does not affect training.

3. Something else?


Hal
 
Michael Campbell

I think it could be useful for classification of spam. Apply this
filter first, and then do the Bayesian stuff. I bet it would
significantly help in classifying wordy spam as spam (Bayes will not do
so well with things like Nigerian spam messages).

Not sure why you'd think this, but POPFile (a "pure", i.e.
non-Grahamesque) Bayesian filter does extraordinarily well with 419s,
Nigerian, and similar spam.

For me, anyway.
 
martinus

The text is never pulled out of any format. If you train on only HTML
files and then analyze HTML files, the HTML tags are treated just like
normal words. They simply don't show up in the results, because they
are used about equally often in both the training texts and the
analyzed text.
The algorithm is very simple and makes absolutely no assumptions about
the input.
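
In other words, the principle is roughly this (a simplified sketch of
the idea, not the actual textanalyze.rb code):

# Sketch of the principle: a word is characteristic when it occurs much
# more often in the analyzed text than the training index predicts.
# Words common in both (HTML tags, "the", "and") score close to 1.
def characteristic_words(text_counts, index_counts, text_total, index_total, top = 5)
  scored = text_counts.map do |word, count|
    observed = count.to_f / text_total
    expected = (index_counts.fetch(word, 0) + 1.0) / index_total  # smoothed
    [word, observed / expected]
  end
  scored.sort_by { |_, score| -score }.first(top).map { |word, _| word }
end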
 
bruno modulix

Michael Campbell wrote:
Not sure why you'd think this, but POPFile (a "pure", i.e.
non-Grahamesque) Bayesian filter does extraordinarily well with 419s,

Pardon my profound ignorance, but what do you call '419s'?
TIA
Bruno
 
bruno modulix

Markus wrote:
I love reading his stuff. I have, however, been asked to wait a
day or two after reading anything he wrote before writing any
documentation or client proposals. I almost made one of our lawyers
turn blue once, but the shade did not suit him.
LOL
 
