Strip control characters in a file

Marc Girod

Hello,

I was asked for a way to strip control characters from a text file.
It soon became clear that newlines must be kept, as well as
(probably) tabs.
The context, however, was Unix only.

Inspired in part by recent posts in this group, I came up with the
following one-liner:

perl -pi2 -e 'BEGIN{$rep{chr($_)}=q() for 0..31,127;$rep{chr(10)}=chr(10)}s/([[:cntrl:]])/$rep{$1}/g' /tmp/fff

...assuming the file was /tmp/fff, and keeping a backup of it.

I would now humbly turn to you for critique and improvements.

Thanks,
marc
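
For readers who find the one-liner dense, here is a minimal sketch of the same lookup-table idea written out as a script. The sub name strip_with_table is hypothetical, and keeping tabs is an assumption based on the "(probably) tabs" remark; the posted one-liner keeps only newlines. The sketch also matches an explicit ASCII range instead of [[:cntrl:]], so that every matched character is guaranteed to have an entry in the table.

```perl
use strict;
use warnings;

# Replacement table: every ASCII control character maps to the empty
# string, except newline and (assumption) tab, which map to themselves.
my %rep = map { ( chr($_) => '' ) } 0 .. 31, 127;
$rep{"\n"} = "\n";
$rep{"\t"} = "\t";    # assumption: tabs are kept too

# Hypothetical helper: returns a copy of the line with controls removed.
sub strip_with_table {
    my ($line) = @_;
    # Match only characters that exist in %rep, so the replacement is
    # never an undefined value.
    $line =~ s/([\x00-\x1F\x7F])/$rep{$1}/g;
    return $line;
}

print strip_with_table("bell:\x{07} tab:\t done\n");
```

In the one-liner, -p supplies the read/print loop and -i2 edits the file in place, leaving the original in a backup named with the suffix 2.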
 
Uri Guttman

MG> I was asked a way to strip control characters from a text file.
MG> Soon, it became clear that newlines must be kept, as well as
MG> (probably) tabs. The context was however unix only.

MG> Inspired in part by recent posts in this group, I came up with the
MG> following one-liner:

MG> perl -pi2 -e 'BEGIN{$rep{chr($_)}=q() for 0..31,127;$rep{chr(10)}=chr
MG> (10)}s/([[:cntrl:]])/$rep{$1}/g' /tmp/fff

MG> ...assuming the file was /tmp/fff, and keeping a backup of it.

MG> I would now humbly turn to you for critique and improvements.

use tr///. untested (need to check the chars):

perl -pi2 -e 'tr/0x00-0x090x11-0x1f//d'

uri
 
Marc Girod

UG> use tr///. untested (need to check the chars):
UG> perl -pi2 -e 'tr/0x00-0x090x11-0x1f//d'

Thanks Uri.
I couldn't make tr work with hexadecimal...
But it grokked octal very nicely:

perl -pi2 -e 'tr/\000-\011\013-\037\177//d'
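
A note on the hexadecimal attempt: inside tr///, a bare 0x00 is read as the literal characters '0', 'x', '0', '0', so hex codes need the \xHH escape form. The sketch below uses that form. It is an illustration, not a tested replacement for the command above, and the sub name is hypothetical. It also keeps tabs, which is an assumption: the octal range \000-\011 above includes \011, the tab character, so that command deletes tabs.

```perl
use strict;
use warnings;

# Hypothetical helper: delete ASCII control characters with tr///,
# keeping tab (\x09) and newline (\x0A).
sub strip_ascii_controls {
    my ($text) = @_;
    # \x00-\x08 : NUL through backspace
    # \x0B-\x1F : vertical tab through unit separator (includes CR)
    # \x7F      : DEL
    $text =~ tr/\x00-\x08\x0B-\x1F\x7F//d;
    return $text;
}

my $sample = "a\x{08}b\tc\x{0D}d\x{00}e\nf\x{7F}g";
print strip_ascii_controls($sample) eq "ab\tcde\nfg" ? "ok\n" : "not ok\n";
```

As a one-liner under the same assumptions: perl -pi.bak -e 'tr/\x00-\x08\x0B-\x1F\x7F//d' /tmp/fff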

Marc
 
sln

Or, you could use that newfangled \K thing:

perl -pi2 -e 's/(?:\t|\n)\K|([[:cntrl:]])//g' filename

As a program:

use strict;
use warnings;
require 5.010_000;

$_ = " bs: '\x{08}', tab: '\t', cr: '\x{0d}', zero: '\x{00}', newline: '\n', 127: '\x{7f}'";
s/(?:\t|\n)\K|([[:cntrl:]])//g;
print "$_\n\n";

Output:
bs: '', tab: ' ', cr: '', zero: '', newline: '
', 127: ''
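
For Perl versions older than 5.10, where \K is not available, a negative lookahead can express the same "any control character except tab and newline" rule. This is a sketch under that assumption, with a hypothetical sub name, not something posted in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: remove control characters except tab and newline.
# The lookahead (?![\t\n]) rejects tab and newline before the POSIX
# class is tried, so no \K and no capture group are needed.
sub strip_controls {
    my ($text) = @_;
    $text =~ s/(?![\t\n])[[:cntrl:]]//g;
    return $text;
}

my $in = "bs: '\x{08}', tab: '\t', cr: '\x{0d}', newline: '\n', del: '\x{7f}'";
print strip_controls($in), "\n";
```

The \K form works the other way round: it matches the tab or newline first and then discards it from the matched text, so the substitution replaces an empty string and the character survives.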

-sln
 
sln

SLN> Or, you could use that newfangled \K thing:
SLN>
SLN> perl -pi2 -e 's/(?:\t|\n)\K|([[:cntrl:]])//g' filename

You don't even need the capture parentheses:

s/(?:\t|\n)\K|[[:cntrl:]]//g;

-sln
 
RedGrittyBrick

I guess you don't need to worry about character sets other than ASCII or
its simpler supersets?

ISO-8859 has "control characters" assigned to 0x80-0x9F
Unicode has "control characters" assigned to U+0080 - U+009F and U+2029?
EBCDIC?
Others?
 
sln

RGB> I guess you don't need to worry about character sets other than ASCII or
RGB> its simpler supersets?
RGB>
RGB> ISO-8859 has "control characters" assigned to 0x80-0x9F
RGB> Unicode has "control characters" assigned to U+0080 - U+009F and U+2029?
RGB> EBCDIC?
RGB> Others?

Shouldn't [[:cntrl:]] read these as Unicode control chars?
Coerced to utf8:

$_ = " u1200: '\x{1200}', bs: '\x{08}', tab: '\t', cr: '\x{0d}', zero: '\x{00}', 81h: '\x{81}', newline: '\n', 7fh: '\x{7f}', u009F: '\x{009F}', u2029: '\x{2029}' ";

s/(?:\t|\n)\K|[[:cntrl:]]//g;

binmode(STDOUT, ":encoding(UTF-8)");
print "$_\n\n";

Gives:
u1200: 'ሀ', bs: '', tab: ' ', cr: '', zero: '', 81h: '', newline: '
', 7fh: '', u009F: '', u2029: '<U+2029>'

All are stripped except U+2029, which seems kind of strange.
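
A likely explanation, offered as a sketch rather than a definitive diagnosis: U+2029 PARAGRAPH SEPARATOR has the Unicode general category Zp (and U+2028 LINE SEPARATOR has Zl), not Cc, so a class that matches "control" characters leaves it alone. The snippet below checks the categories directly and shows one way to strip those separators as well. The sub name is hypothetical, and whether the separators should be stripped at all is an assumption.

```perl
use strict;
use warnings;

# Check the general category of U+2029 directly.
my $ps = "\x{2029}";
print "U+2029 is Cc\n" if $ps =~ /\p{Cc}/;    # control
print "U+2029 is Zp\n" if $ps =~ /\p{Zp}/;    # paragraph separator

# Hypothetical helper: strip controls (Cc) plus the line and paragraph
# separators (Zl, Zp), keeping tab and newline.
sub strip_controls_and_separators {
    my ($text) = @_;
    $text =~ s/(?![\t\n])[\p{Cc}\p{Zl}\p{Zp}]//g;
    return $text;
}
```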

-sln
 
sln

MG> That was not in the request I got, no.

Request or not, your first judgement to use '[[:cntrl:]]' was
correct, because it generally recognises the control characters of the
host platform's character set, and of files in different encodings.
Since you don't care about encoding, just use the tr/// form and
deal with rewriting the whole thing when it fails on a Unicode file.
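
If encoding does turn out to matter, one possible approach is to decode the bytes first, strip on characters, and encode again. The following is only a sketch: the sub name is hypothetical, UTF-8 as the default encoding is an assumption, and \p{Cc} is used instead of [[:cntrl:]] so that the match follows Unicode rules regardless of how the string is stored internally.

```perl
use 5.010;
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strip control characters from raw bytes in a known
# encoding (UTF-8 assumed when none is given), keeping tab and newline.
sub strip_controls_bytes {
    my ($bytes, $encoding) = @_;
    $encoding //= 'UTF-8';
    my $text = decode($encoding, $bytes);    # bytes -> characters
    $text =~ s/(?![\t\n])\p{Cc}//g;          # C0 and C1 controls, DEL
    return encode($encoding, $text);         # characters -> bytes
}
```

With the tr/// form applied to raw UTF-8 bytes, a C1 control such as U+0085 would pass through untouched, since it is stored as the two bytes 0xC2 0x85.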

-sln
 
