martinus
I have created a little text analysis tool that tries to extract the
words that are important in a given text. I have implemented one of my
strange ideas, and to my own surprise, it works. I have no idea whether
any similar tool exists, so I do not know where to post this. It is
written in Ruby, so I am just posting it here.
To use this tool, you first have to index a large number of text files.
This generates an index, which is later used when analyzing text.
For example, I have indexed several fairy tales, and used this index to
extract important words. Here are some results:
Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
Alladin.txt: aladdin, lamp, genie, sultan, wizard
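The post does not say how the tool scores words, so the following is only
a rough illustration of the index-then-analyze workflow. It assumes a
TF-IDF-style score (a word counts as important when it is frequent in the
analyzed text but rare across the indexed files), which may differ from
what the tool actually does. The names tokenize, build_index and
important_words are hypothetical and not taken from the tool.

```ruby
# Hypothetical sketch: TF-IDF-style scoring, NOT the tool's actual algorithm.

# Split a text into lowercase words (ASCII letters only, for simplicity).
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Index step: count in how many of the given texts each word occurs.
def build_index(texts)
  doc_freq = Hash.new(0)
  texts.each do |text|
    tokenize(text).uniq.each { |word| doc_freq[word] += 1 }
  end
  { doc_count: texts.size, doc_freq: doc_freq }
end

# Analysis step: rank the words of one text by their count in that text
# multiplied by a smoothed inverse document frequency from the index.
def important_words(text, index, top = 5)
  counts = Hash.new(0)
  tokenize(text).each { |word| counts[word] += 1 }
  scores = counts.map do |word, count|
    idf = Math.log((index[:doc_count] + 1.0) / (index[:doc_freq][word] + 1.0))
    [word, count * idf]
  end
  # Highest score first; ties are broken alphabetically.
  scores.sort_by { |word, score| [-score, word] }.first(top).map(&:first)
end

# Usage: index a few texts, then analyze another one.
index = build_index([
  "the wolf ate the grandma",
  "the mermaid loved the sea",
  "the genie lived in the lamp"
])
puts important_words("the wolf saw the wolf and the wolf", index, 3).join(", ")
```

Under this kind of scoring, a word that occurs in every indexed text (such
as "the" above) gets a score of zero, so it never outranks words that are
specific to the analyzed text.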
The algorithm works with HTML files and probably any other format that
contains text. Here is an example of analysis results when HTML
files are indexed:
SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
META-FAQ.html: newsgroup, comp, sunsite, questions, announce
TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive
And now my question: Does anyone know where to find such tools or
algorithms?
You can get it here; it is public domain:
http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer
martinus