How to read a gzipped UTF-8 file in Python?

John Nagle

I have a large (gigabytes) file which is encoded in UTF-8 and then
compressed with gzip. I'd like to read it with the "gzip" module
and "utf8" decoding. The obvious approach is

fd = gzip.open(fname, 'rb',encoding='utf8')

But "gzip.open" doesn't support an "encoding" parameter. (It
probably should, for consistency.) Is there some way to do this?
Is it possible to express "unzip, then decode utf8" via
"codecs.open"?

John Nagle
 
Martin v. Löwis

> I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip. I'd like to read it with the "gzip" module
> and "utf8" decoding.

You didn't specify the processing you want to perform. For example,
this should work just fine:

import gzip

fd = gzip.open(fname, 'rb')
for line in fd:
    pass

For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
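To make the "ASCII superset" point concrete, here is a minimal sketch that counts lines in a gzipped UTF-8 file without ever decoding it. The helper names (make_sample, count_lines_bytes) and the sample text are made up for illustration; they are not from the thread. It relies on the fact that in UTF-8 every byte of a multi-byte character is 0x80 or above, so the newline byte cannot appear inside a character.

```python
import gzip
import os
import tempfile


def make_sample(text):
    """Hypothetical helper: write text as gzip-compressed UTF-8 to a temp file."""
    handle, path = tempfile.mkstemp(suffix=".gz")
    os.close(handle)
    with gzip.open(path, "wb") as out:
        out.write(text.encode("utf-8"))
    return path


def count_lines_bytes(path):
    """Count lines by iterating over the decompressed byte stream."""
    count = 0
    with gzip.open(path, "rb") as fd:
        for line in fd:  # each item is a byte string, not decoded text
            count += 1
    return count


# Three lines, each containing non-ASCII characters.
sample_path = make_sample(u"caf\u00e9\nna\u00efve\n\u65e5\u672c\u8a9e\n")
line_count = count_lines_bytes(sample_path)
os.remove(sample_path)
```

Iterating like this reads one line at a time, so it should stay usable on a file of several gigabytes.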

> The obvious approach is
>
> fd = gzip.open(fname, 'rb',encoding='utf8')
>
> But "gzip.open" doesn't support an "encoding" parameter. (It
> probably should, for consistency.)

I think I disagree. The built-in open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.

> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?

If that's the processing you want to do, sure:

import codecs
import gzip

fd0 = gzip.open(fname, 'rb')
fd = codecs.getreader("utf-8")(fd0)
data = fd.readline()

You can combine that into a single expression:

fd = codecs.getreader("utf-8")(gzip.open(fname))
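For anyone who wants to try this end to end, here is a self-contained sketch of the same idea: wrap the gzip byte stream in a UTF-8 stream reader and iterate over decoded lines. The function name read_gzip_utf8_lines and the sample data are assumptions for illustration only; the wrapping itself is the codecs.getreader approach described above.

```python
import codecs
import gzip
import os
import tempfile


def read_gzip_utf8_lines(path):
    """Yield decoded text lines from a gzip-compressed UTF-8 file."""
    with gzip.open(path, "rb") as raw:
        # The stream reader pulls bytes from the gzip stream and decodes them.
        reader = codecs.getreader("utf-8")(raw)
        for line in reader:
            yield line


# Build a small sample file so the sketch can be run as-is.
handle, sample_path = tempfile.mkstemp(suffix=".gz")
os.close(handle)
with gzip.open(sample_path, "wb") as out:
    out.write(u"caf\u00e9\n\u65e5\u672c\u8a9e\n".encode("utf-8"))

decoded = list(read_gzip_utf8_lines(sample_path))
os.remove(sample_path)
```

The generator keeps memory use bounded for large files; list() is used here only because the sample is tiny. On Python 3.3 and later, gzip.open(fname, 'rt', encoding='utf-8') should give decoded text directly, which was not available in the 2.x versions discussed in this thread.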

HTH,
Martin
 
