Filtering out non-readable characters

MKoool

I have a file with binary and ASCII characters in it. I massage the
data and convert it to a more readable format, but it still comes
up with some binary characters mixed in. I'd like to write something
to just replace all non-printable characters with '' (I want to delete
non-printable characters).

I am having trouble figuring out an easy Python way to do this... is
the easiest way to just write some regular expression that does
something like replace [^\p] with ''?

Or is it better to go through every character, do ord(character), and
check the ASCII values?

What's the easiest way to do something like this?

thanks
 
Bengt Richter

>>> import string
>>> string.printable
>>> identity = ''.join([chr(i) for i in xrange(256)])
>>> unprintable = ''.join([c for c in identity if c not in string.printable])
>>>
>>> def remove_unprintable(s):
...     return s.translate(identity, unprintable)
...
>>> set(remove_unprintable(identity)) - set(string.printable)
set([])
>>> set(remove_unprintable(identity))
set(['\x0c', ' ', '$', '(', ',', '0', '4', '8', '<', '@', 'D', 'H', 'L', 'P', 'T', 'X', '\\',
'`', 'd', 'h', 'l', 'p', 't', 'x', '|', '\x0b', '#', "'", '+', '/', '3', '7', ';', '?', 'C', 'G',
'K', 'O', 'S', 'W', '[', '_', 'c', 'g', 'k', 'o', 's', 'w', '{', '\n', '"', '&', '*', '.', '2',
'6', ':', '>', 'B', 'F', 'J', 'N', 'R', 'V', 'Z', '^', 'b', 'f', 'j', 'n', 'r', 'v', 'z', '~',
'\t', '\r', '!', '%', ')', '-', '1', '5', '9', '=', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y', ']', 'a',
'e', 'i', 'm', 'q', 'u', 'y', '}'])
True

After that, to get clean file text, something like

cleantext = remove_unprintable(file('unclean.txt').read())

should do it. Or you should be able to iterate by lines something like (untested)

for uncleanline in file('unclean.txt'):
    cleanline = remove_unprintable(uncleanline)
    # ... do whatever with clean line

If there is something in string.printable that you don't want included, just use your own
string of printables. BTW,
>>> help(str.translate)
Help on method_descriptor:

translate(...)
    S.translate(table [,deletechars]) -> string

    Return a copy of the string S, where all characters occurring
    in the optional argument deletechars are removed, and the
    remaining characters have been mapped through the given
    translation table, which must be a string of length 256.
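
For anyone reading this on Python 3, where str.translate no longer accepts a deletechars argument, the same idea can be sketched on bytes objects instead. This is only an adaptation under that assumption, not the code from the session above, and the names PRINTABLE and UNPRINTABLE are made up for the example.

```python
import string

# Assumption: the file is read in binary mode, so we work on bytes (Python 3).
PRINTABLE = string.printable.encode("ascii")

# Every byte value 0..255 that does not appear in string.printable.
UNPRINTABLE = bytes(b for b in range(256) if b not in PRINTABLE)


def remove_unprintable(data):
    """Return a copy of the bytes object with all unprintable bytes deleted."""
    # A table of None leaves the surviving bytes unmapped; the second
    # argument lists the bytes to delete.
    return data.translate(None, UNPRINTABLE)
```

Something like remove_unprintable(open('unclean.txt', 'rb').read()) should then give the cleaned bytes.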

Regards,
Bengt Richter
 
Raymond Hettinger

Wow, that was the most thorough answer to a comp.lang.python question
since the Martellibot got busy in the search business.
 
Peter Hansen

Bengt said:
identity = ''.join([chr(i) for i in xrange(256)])
unprintable = ''.join([c for c in identity if c not in string.printable])

And note that with Python 2.4, in each case the above square brackets
are unnecessary (though harmless), because of the arrival of "generator
expressions" in the language. (Bengt knows this already, of course, but
his brain is probably resisting the reprogramming. :) )
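
A small sketch of the equivalence, written for a current Python 3 (so range stands in for xrange); on 2.4 or later the xrange form should behave the same way. The variable names are invented for the comparison.

```python
import string

# List comprehension: the square brackets build a list that join consumes.
identity_from_list = ''.join([chr(i) for i in range(256)])

# Generator expression: same result, and the call's own parentheses suffice.
identity_from_genexp = ''.join(chr(i) for i in range(256))

# The second line from the quoted code, again without the square brackets.
unprintable = ''.join(c for c in identity_from_genexp if c not in string.printable)
```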

-Peter
 
Steven D'Aprano

Bengt said:
identity = ''.join([chr(i) for i in xrange(256)])
unprintable = ''.join([c for c in identity if c not in string.printable])

And note that with Python 2.4, in each case the above square brackets
are unnecessary (though harmless), because of the arrival of "generator
expressions" in the language.

But to use generator expressions, wouldn't you need an extra pair of round
brackets?

eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

with the extra spaces added for clarity.

That is, the brackets after join make the function call, and the nested
brackets make the generator. That, at least, is my understanding.
 
Peter Hansen

Steven said:
Bengt said:
identity = ''.join([chr(i) for i in xrange(256)])

And note that with Python 2.4, in each case the above square brackets
are unnecessary (though harmless), because of the arrival of "generator
expressions" in the language.

But to use generator expressions, wouldn't you need an extra pair of round
brackets?

eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

Come on, Steven. Don't tell us you didn't have access to a Python
interpreter to check before you posted:

c:\>python
Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

-Peter
 
Bengt Richter

Bengt said:
identity = ''.join([chr(i) for i in xrange(256)])
unprintable = ''.join([c for c in identity if c not in string.printable])

And note that with Python 2.4, in each case the above square brackets
are unnecessary (though harmless), because of the arrival of "generator
expressions" in the language. (Bengt knows this already, of course, but
his brain is probably resisting the reprogramming. :) )

Thanks for the nudge. Actually, I know about generator expressions, but
at some point I must have misinterpreted some bug in my code to mean
that join in particular didn't like generator expression arguments,
and wanted lists. Actually it seems to like anything at all that can
be iterated to produce a sequence of strings. So I'm glad to find that
join is fine after all, and to get that misap[com?:)]prehension
out of my mind ;-)

Regards,
Bengt Richter
 
Steven D'Aprano

Steven said:
Bengt Richter wrote:

identity = ''.join([chr(i) for i in xrange(256)])

And note that with Python 2.4, in each case the above square brackets
are unnecessary (though harmless), because of the arrival of "generator
expressions" in the language.

But to use generator expressions, wouldn't you need an extra pair of round
brackets?

eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

Come on, Steven. Don't tell us you didn't have access to a Python
interpreter to check before you posted:

Er, as I wrote in my post:

"Steven
who is still using Python 2.3, and probably will be for quite some time"

So, no, I didn't have access to a Python interpreter running version 2.4.

I take it then that generator expressions work quite differently
than list comprehensions? The equivalent "implied delimiters" for a list
comprehension would be something like this:
>>> L = [1, 2, 3]
>>> L[ i for i in range(2) ]
  File "<stdin>", line 1
    L[ i for i in range(2) ]
           ^
SyntaxError: invalid syntax

which is a very different result from:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: list indices must be integers

In other words, a list comprehension must have the [ ] delimiters to be
recognised as a list comprehension, EVEN IF the square brackets are there
from some other element. But a generator expression doesn't care where the
round brackets come from, so long as they are there: they can be part of
the function call.

I hope that makes sense to you.
 
Steven D'Aprano

George said:
Bengt Richter said:
identity = ''.join([chr(i) for i in xrange(256)])

Or equivalently:
identity = string.maketrans('','')

Wow! That's handy, not to mention undocumented. (At least in the
string module docs.) Where did you learn that, George?

I can't answer for George, but I also noticed that behaviour. I discovered
it by trial and error. I thought, oh what a nuisance that the arguments
for maketrans had to include all 256 characters, then I wondered what
error you would get if you left some out, and discovered that you didn't
get an error at all.

That actually disappointed me at the time, because I was looking for
behaviour where the missing characters weren't filled in, but I've come to
appreciate it since.
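
For what it's worth, the same trick can be checked on a modern interpreter. This sketch assumes Python 3, where string.maketrans is gone and bytes.maketrans plays the corresponding role; the variable names are just for illustration.

```python
# Assumption: Python 3, where bytes.maketrans replaces string.maketrans.
identity = bytes.maketrans(b'', b'')

# With nothing to remap, the table is just the 256 byte values in order.
same_as_explicit = identity == bytes(range(256))

# Only the listed characters are remapped; the rest are filled in unchanged.
table = bytes.maketrans(b'a', b'A')
shouted = b'banana'.translate(table)
```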
 
Steven D'Aprano

Replying to myself... this is getting to be a habit.

I hope that makes sense to you.

That wasn't meant as a snide little dig at Peter, and I'm sorry if anyone
reads it that way. I found myself struggling to explain simply the
different behaviour between list comps and generator expressions, and
couldn't be sure I was explaining myself as clearly as I wanted. It might
have been better if I had left off the "to you".
 
Peter Hansen

Steven said:
Er, as I wrote in my post:

"Steven
who is still using Python 2.3, and probably will be for quite some time"

Sorry, missed that! I don't generally notice signatures much, partly
because Thunderbird is smart enough to "grey them out" (the main text is
displayed as black, quoted material in blue, and signatures in a light
gray.)

I don't have a firm answer (though I suspect the language reference
does) about when "dedicated" parentheses are required around a generator
expression. I just know that, so far, they just work when I want them
to. Like most of Python. :)
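
One way to probe this is to ask the compiler directly. As far as I can tell, the dedicated parentheses may be dropped only when the generator expression is the sole argument of a call. A rough sketch, assuming a recent Python 3 (the helper name compiles is made up):

```python
def compiles(source):
    """Return True if the source string is a syntactically valid expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# Sole argument: the call's own parentheses are enough.
sole = compiles("sum(x for x in range(3))")

# A second argument means the generator expression needs its own pair.
bare_with_extra_arg = compiles("sum(x for x in range(3), 10)")
wrapped_with_extra_arg = compiles("sum((x for x in range(3)), 10)")
```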

-Peter
 
Steven Bethard

Bengt said:
Thanks for the nudge. Actually, I know about generator expressions, but
at some point I must have misinterpreted some bug in my code to mean
that join in particular didn't like generator expression arguments,
and wanted lists.

I suspect this is bug 905389 [1]:
>>> def gen():
...     yield 1
...     raise TypeError('from gen()')
...
>>> ''.join([x for x in gen()])
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
TypeError: sequence expected, generator found

I run into this every month or so, and have to remind myself that it
means that my generator is raising a TypeError, not that join doesn't
accept generator expressions...
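
A minimal sketch for checking what a given interpreter does, assuming Python 3 syntax (join_error is a made-up helper). It only reports whatever TypeError message surfaces, since that is exactly the part that may differ between versions:

```python
def gen():
    yield '1'
    raise TypeError('from gen()')


def join_error(iterable):
    """Return the TypeError message raised while joining, or None."""
    try:
        ''.join(iterable)
    except TypeError as exc:
        return str(exc)
    return None


# The generator raises part-way through, so join reports some TypeError.
message_from_bad_gen = join_error(gen())

# A well-behaved generator expression joins without error.
message_from_good_gen = join_error(c for c in 'abc')
```

If the message printed is the generator's own, the interpreter is not masking the error; if it complains about the argument type instead, the problem is still inside the generator.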

STeVe

[1] http://www.python.org/sf/905389
 
Michael Ströder

Peter said:
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Wouldn't this be a candidate for making the Python language stricter?

Do you remember old Python versions treating l.append(n1,n2) the same
way as l.append((n1,n2))? I'm glad this is forbidden now.

Ciao, Michael.
 
Peter Hansen

Michael said:
Wouldn't this be a candidate for making the Python language stricter?

Why would that be true? I believe str.join() takes any iterable, and a
generator (as returned by a generator expression) certainly qualifies.

-Peter
 
Robert Kern

Michael said:
Wouldn't this be a candidate for making the Python language stricter?

Do you remember old Python versions treating l.append(n1,n2) the same
way as l.append((n1,n2))? I'm glad this is forbidden now.

That wasn't a syntax issue; it was an API issue. list.append() allowed
multiple arguments and interpreted them as if they were a single tuple.
That was confusing and unnecessary.

Allowing generator expressions to forgo extra parentheses where they
aren't required is something different, and in my opinion, a good thing.
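
To illustrate the distinction with a sketch (assuming any reasonably current Python; the variable names are invented): the append case is rejected at call time by the method's signature, not by the parser.

```python
items = []

# Today append takes exactly one argument, so a tuple must be explicit.
items.append((1, 2))

try:
    items.append(3, 4)  # parses fine, but the call itself is rejected
    append_error = None
except TypeError as exc:
    append_error = exc

# A generator expression as the sole argument needs no extra parentheses.
joined = ''.join(str(n) for n in items[0])
```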

--
Robert Kern

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Ross

Easiest way is open the file with EdXor (freeware editor), select all,
Format > Wipe Non-Ascii.

Ok it's not python, but it's the easiest.
 
Steven D'Aprano

Ross said:
Easiest way is open the file with EdXor (freeware editor), select all,
Format > Wipe Non-Ascii.

Ok it's not python, but it's the easiest.

1 Open Internet Explorer
2 Go to Google
3 Search for EdXor
4 Browser locks up
5 Force quit with ctrl-alt-del
6 Run anti-virus program
7 Download new virus definitions
8 Remove viruses
9 Run anti-spyware program
10 Download new definitions
11 Remove spyware
12 Open Internet Explorer
13 Download Firefox
14 Install Firefox
15 Open Firefox
16 Go to Google
17 Search for EdXor
18 Download application
19 Run installer
20 Reboot
21 Run EdXor
22 Open file
23 Select all
24 Select Format>Wipe Non-ASCII
25 Select Save
26 Quit EdXor

Hmmm. Perhaps not *quite* the easiest way :)
 
Bengt Richter

Bengt said:
Thanks for the nudge. Actually, I know about generator expressions, but
at some point I must have misinterpreted some bug in my code to mean
that join in particular didn't like generator expression arguments,
and wanted lists.

I suspect this is bug 905389 [1]:
>>> def gen():
...     yield 1
...     raise TypeError('from gen()')
...
>>> ''.join([x for x in gen()])
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
TypeError: sequence expected, generator found

I run into this every month or so, and have to remind myself that it
means that my generator is raising a TypeError, not that join doesn't
accept generator expressions...

STeVe

[1] http://www.python.org/sf/905389

That must have been it, thanks.

Regards,
Bengt Richter
 
