How to use String.split to split a mixed encoding string (part encoded in GBK, part encoded in UTF-8)

Stanley Xu · Mar 23, 2011

[Note: parts of this message were removed to make it a legal post.]

Dear Buddies,

Yesterday I sent a mail asking how to make split ignore invalid UTF-8 byte
sequences. I have since checked the string I want to parse in Java and found
that part of it is encoded in GBK and part of it is encoded in UTF-8.

I am wondering whether I can still split the string with the split method
and then call force_encoding on the parts that might be encoded in GBK to
resolve the problem.

Is there a way to do this without getting the "invalid byte sequence"
error?

Thanks.

Best wishes,
Stanley Xu

Robert Klemme · Mar 23, 2011

A string with a mixed encoding is difficult to handle. I think you
have these options:

1. Ensure that the string does *not* contain mixed encoding (this
would be the first and best choice IMHO).

2. If you can't because you get the data from somewhere else, use
encoding BINARY as a workaround: split the raw bytes first, then label
each piece with its real encoding:

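# Treat the whole string as raw bytes so split does not validate UTF-8.
mixed_content.force_encoding(Encoding::BINARY)
chunks = mixed_content.split(/\t/)
# Relabel each field with the encoding it was actually written in.
chunks[0].force_encoding(Encoding::UTF_8)
chunks[1].force_encoding(Encoding::GBK)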

or

# Same approach, assigning the two fields to variables directly.
mixed_content.force_encoding(Encoding::BINARY)
a, b = mixed_content.split(/\t/)
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)
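
If you do not know in advance which field uses which encoding, the same idea can be sketched roughly as below. The sample data and the to_utf8 helper are made up for illustration and are not from the original data. The helper treats a field as UTF-8 when its bytes are valid UTF-8 and otherwise assumes GBK, which is only a heuristic, since some GBK byte sequences also happen to be valid UTF-8.

```ruby
# Hypothetical sample data: one UTF-8 field and one GBK field.
utf8_part = "caf\u00E9"                            # "cafe" with an accented e, UTF-8
gbk_part  = "\u4E2D\u6587".encode(Encoding::GBK)   # two Chinese characters, GBK bytes

# Build the mixed line as raw bytes, the way it might arrive from a file.
mixed_content =
  utf8_part.dup.force_encoding(Encoding::BINARY) +
  "\t".dup.force_encoding(Encoding::BINARY) +
  gbk_part.dup.force_encoding(Encoding::BINARY)

# Hypothetical helper (not part of Ruby): return the bytes as a UTF-8 string.
# Tries UTF-8 first and falls back to GBK. This is a heuristic, not a detector.
def to_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Relabel as GBK, then transcode the characters to UTF-8.
  bytes.dup.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
end

# Split on raw bytes, then normalize every field to UTF-8.
mixed_content.force_encoding(Encoding::BINARY)
fields = mixed_content.split(/\t/).map { |chunk| to_utf8(chunk) }

puts fields.inspect
```

Splitting on the tab byte is safe here because 0x09 never occurs inside a multi-byte character in either UTF-8 or GBK. Note that force_encoding only relabels the bytes, while encode actually converts them.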

Kind regards

robert

Stanley Xu · Mar 23, 2011

Thanks a lot, Robert. Your solution really helps.

Best wishes,
Stanley Xu
