character encoding question


Amishera Amishera

I have an HTML file encoded in UTF-8. The file contains the
following text:

It&#39;s a wonderful life

Character code 39 is the apostrophe in UTF-8. Suppose I extract
the 39 from the text using:

s = "It&#39;s a wonderful life"

s.gsub(/&#(\d+);/, '\1')

The output is

It39s a wonderful life

First, I am having trouble turning that into

It\39s a wonderful life

Secondly I manually did this in test_utf8.rb:

puts "It\39s a wonderful life"

and ran it

ruby test_utf8.rb > utf8.txt

but when I open the file in OpenOffice with the encoding set to UTF-8,
the output is

It#9s a wonderful life

So how do I correctly parse the HTML character references, convert
them to UTF-8 encoded characters, and then save the file?

Thanks.
 

David Springer

Try something like this:
-------------------------------------
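require 'cgi'
# Each &#NNNN; is a decimal character reference that CGI.unescapeHTML decodes.
s = "UPPERCASE Russian Alphabet\n".encode('utf-8')
s += CGI.unescapeHTML("&#1040;&#1041;&#1042;&#1043;".encode('utf-8'))
s += CGI.unescapeHTML("&#1044;&#1045;&#1046;&#1047;".encode('utf-8'))
s += CGI.unescapeHTML("&#1048;&#1049;&#1050;&#1051;".encode('utf-8'))
s += CGI.unescapeHTML("&#1052;&#1053;&#1054;&#1055;".encode('utf-8'))
s += CGI.unescapeHTML("&#1056;&#1057;&#1058;&#1059;".encode('utf-8'))
s += CGI.unescapeHTML("&#1060;&#1061;&#1062;&#1063;".encode('utf-8'))
s += CGI.unescapeHTML("&#1064;&#1065;&#1066;&#1067;".encode('utf-8'))
s += CGI.unescapeHTML("&#1068;&#1069;&#1070;&#1071;".encode('utf-8'))
puts s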
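
Applied to the string from the question, the same idea might look like the sketch below. It assumes a Ruby version where require 'cgi' provides CGI.unescapeHTML and where the source file is UTF-8. The names decoded, manual and path are only illustrative, and the file goes to the system temp directory rather than next to the script, which is my choice and not something from the question.

```ruby
require 'cgi'
require 'tmpdir'

# The sample text from the question, apostrophe written as &#39;.
s = "It&#39;s a wonderful life"

# Option 1: let the standard library decode the character reference.
decoded = CGI.unescapeHTML(s)

# Option 2: decode decimal references by hand. The block receives each
# match and turns the captured code point into a UTF-8 character.
manual = s.gsub(/&#(\d+);/) { $1.to_i.chr(Encoding::UTF_8) }

puts decoded
puts manual

# Save the decoded text as UTF-8 (hypothetical location in the temp dir).
path = File.join(Dir.tmpdir, 'utf8.txt')
File.write(path, decoded + "\n", encoding: 'UTF-8')
```

There is no need to build an intermediate "It\39s" string. In a double-quoted Ruby string, "\39" is read as the octal escape \3 followed by a literal 9, which would explain the stray character you saw in OpenOffice.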
 
