Saving web pages: charset problems and symbol problems

Sak Na rede

Hi all!

I think a lot of Ruby scripts are written for web crawling, web scraping,
and many other web applications. I'm working with the web too: I'm
trying to save the text of many different websites. At the moment I'm
trying to solve two problems:

1 - How to standardize the charset of web pages. There are a lot of
different charsets, and I think there must be a solution other than
checking every charset and converting to the proper charset each time.
(By the way, what is the best method to detect the charset of a file? The
file command is not very good, I think.)
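
To make problem 1 concrete, here is a minimal sketch using only Ruby's built-in String encoding methods. It assumes the charset name has already been read from the HTTP Content-Type header or a meta tag; the helper name to_utf8 and the choice of Windows-1252 as the last-resort fallback are assumptions for illustration, not an established method.

```ruby
# Hypothetical helper: normalize raw page bytes to UTF-8.
# `declared` is the charset name the page claims to use (may be nil or wrong).
def to_utf8(raw, declared = nil)
  # Try the declared charset first, then UTF-8.
  [declared, "UTF-8"].compact.each do |name|
    begin
      text = raw.dup.force_encoding(name)
      # Accept the first candidate under which the bytes are well formed.
      return text.encode("UTF-8") if text.valid_encoding?
    rescue ArgumentError, EncodingError
      next # unknown charset name or failed conversion: try the next one
    end
  end
  # Assumed last resort: read the bytes as Windows-1252 and replace
  # anything that still cannot be converted.
  raw.dup.force_encoding("Windows-1252")
     .encode("UTF-8", invalid: :replace, undef: :replace)
end
```

With this, to_utf8(bytes, "ISO-8859-1") and to_utf8(bytes) both return a UTF-8 string, so everything saved afterwards shares one charset. Guessing the charset of a file that declares nothing is a separate problem that this sketch does not solve.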

2 - How to convert HTML to plain text. I use Hpricot, but a lot of very
strange symbols remain in the output, like "€" or "”". Which is the
most commonly used method?
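
To make problem 2 concrete, here is a rough sketch in plain Ruby, without Hpricot. It strips tags with regular expressions and decodes character references. The helper name html_to_text and the small ENTITIES table are assumptions for illustration; a real HTML parser handles malformed markup far better than these patterns. Symbols like "€" usually mean UTF-8 bytes were read as Windows-1252, so the input is assumed to be correctly decoded UTF-8 already.

```ruby
# Tiny table of named entities; illustrative only, far from complete.
ENTITIES = { "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"',
             "apos" => "'", "nbsp" => " ", "euro" => "\u20AC" }.freeze

# Hypothetical helper: crude HTML-to-text conversion for UTF-8 input.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ") # drop non-visible blocks
  text = text.gsub(/<!--.*?-->/m, " ")                    # drop comments
  text = text.gsub(/<[^>]+>/, " ")                        # drop remaining tags
  # Decode numeric and named character references in a single pass.
  text = text.gsub(/&(#x?\h+|\w+);/i) do
    ref = Regexp.last_match(1)
    case ref
    when /\A#x(\h+)\z/i then [$1.hex].pack("U")   # hexadecimal, e.g. &#x20AC;
    when /\A#(\d+)\z/   then [$1.to_i].pack("U")  # decimal, e.g. &#8364;
    else ENTITIES.fetch(ref, "&#{ref};")          # unknown names are left as-is
    end
  end
  text.gsub(/\s+/, " ").strip                     # collapse whitespace
end
```

Numeric references are assumed to hold valid code points; an out-of-range value would raise an error in pack.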

Thanks a lot
 
