strip DOS ^Ms?

Dick Davies

Me again. I have chosen one of the crappiest websites on God's
green Earth to scrape here...

For no good reason I have buttloads of '^M's all over the file
when I check it in vi. What would I feed gsub to strip them out?
Otherwise I'm looking at a very very long regex to yank out the
fields I need....
 
Robert Klemme

Dick Davies said:
Me again. I have chosen one of the crappiest websites on God's
green Earth to scrape here...

For no good reason I have buttloads of '^M's all over the file
when I check it in vi. What would I feed gsub to strip them out?
Otherwise I'm looking at a very very long regex to yank out the
fields I need....

while ( line = gets )
  line.chomp!
  p line # no more \r\n at end
end
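
A minimal sketch of what chomp does and does not cover, using a hypothetical array of lines in place of gets so that it runs on its own. It only illustrates that chomp strips a trailing line ending and leaves anything inside the line alone:

```ruby
# Hypothetical input lines; the second one has a carriage return mid-line.
lines = ["first line\r\n", "second\r line\r\n", "third line\n"]

# chomp removes one trailing "\r\n", "\n" or "\r" and nothing else.
chomped = lines.map(&:chomp)

# The embedded "\r" in the second line is still there after chomp.
chomped.each { |line| p line }
```

So chomp is enough when the carriage returns are part of DOS line endings, but not when they sit in the middle of a line.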

robert
 
Kaspar Schiess

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dick Davies wrote:

| Me again. I have chosen one of the crappiest websites on God's
| green Earth to scrape here...
|
| For no good reason I have buttloads of '^M's all over the file
| when I check it in vi. What would I feed gsub to strip them out?
| Otherwise I'm looking at a very very long regex to yank out the
| fields I need....
|

I recommend the tool suite called
dos2unix
unix2dos
mac2unix
unix2mac
on unixes. They are even included in the msys toolkit. The ^M's are
superfluous carriage returns ('\r'; the line feed is '\n').

Just convert your file to unix line endings using
dos2unix

The character set conversion tool recode also supports this, as does
nearly every editor. gvim and vim have the setting
set ff=[one of 'dos', 'unix' or 'mac'].

So typing ':set ff=unix' and then writing the file with ':w' will convert it.

Don't gsub. That is the way to madness.

kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAxunDFifl4CA0ImQRAoCpAJ9rzJhh/fWtmoXrtguVn6xqBFiCmwCdGKOC
Wb0CEe0AEVk2IwoSdG7PwKs=
=OvJj
-----END PGP SIGNATURE-----
 
Dick Davies

* Robert Klemme said:
while ( line = gets )
  line.chomp!
  p line # no more \r\n at end
end

They're not at the end of the line though, they're just scattered through
the lines... and vi/dos2unix isn't an option; I'm doing this on a string
on its way from the webserver to REXML...
 
Dick Davies

Fixed my own problem again. This is getting to be a habit.

shitty_html.gsub!(/\r/, '')

Thanks for suggestions!
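
For comparison, a small sketch of three equivalent ways to drop every carriage return from a string in Ruby. The sample string and variable names are made up for illustration; only the String methods themselves are standard:

```ruby
# Hypothetical sample: carriage returns scattered mid-line, as described above.
html = "<td>Na\rme</td>\r\n<td>Va\rlue</td>\n"

via_gsub   = html.gsub(/\r/, '')   # regex substitution
via_delete = html.delete("\r")     # delete every listed character
via_tr     = html.tr("\r", '')     # tr with an empty replacement also deletes

# The bang variants return nil when nothing was changed,
# so their return value should not be chained on.
already_clean = "no carriage returns here"
bang_result = already_clean.dup.gsub!(/\r/, '')

p via_gsub
p bang_result
```

All three non-bang calls return a new string and leave html untouched; gsub! modifies the receiver in place, which is what the one-liner above relies on.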
 
Daniel Berger

Dick Davies said:
Me again. I have chosen one of the crappiest websites on God's
green Earth to scrape here...

For no good reason I have buttloads of '^M's all over the file
when I check it in vi. What would I feed gsub to strip them out?
Otherwise I'm looking at a very very long regex to yank out the
fields I need....

See "ptools", available on the RAA or at http://ruby-miscutils.sf.net.
Look for the File.nl_convert method. :)

Regards,

Dan
 
Sean Russell

Dick Davies said:
They're not at the end of the line though, they're just scattered through
the lines... and vi/dos2unix isn't an option; I'm doing this on a string
on its way from the webserver to REXML...

Use tr.

I generally strip DOS idiocy on the command line with:

tr -d '\r' < in > out

You can use the same thing in Ruby:

string.tr!( "\r", '' )   # double quotes needed: '\r' in single quotes is not a carriage return

Or, you can do it in vim:

:%s/^M//g

To get the ^M down there, use visual mode to select one of the ^Ms and
yank it; then do :%s/ and type ^R" to paste it. Then finish the
replace with //g, and voila. It might even work if you do:

:%s/^V^M//g

(^V in vim lets you insert control characters). Sorry if I'm telling
you stuff about vim that you already know.

If you're slurping the files with Ruby, the easiest thing is to use
String#tr. If you're slurping the files with wget or curl, 'tr' is
probably the easiest.
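
One detail to watch with String#tr in Ruby: escape sequences such as \r are only interpreted inside double-quoted strings, whereas the shell's tr interprets '\r' itself. A minimal sketch with a made-up sample string:

```ruby
# Hypothetical sample text with carriage returns in it.
text = "carriage\r return\r\n"

single = '\r'   # two characters: a backslash and the letter r
double = "\r"   # one character: carriage return (0x0D)

puts single.length
puts double.length

# With the double-quoted form, only carriage returns are removed.
stripped = text.tr("\r", '')

# With the single-quoted form, tr treats the backslash as an escape,
# so the letter r is deleted and the carriage returns stay.
mangled = text.tr('\r', '')

p stripped
p mangled
```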

--- SER
 
Kaspar Schiess

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Philipp Kern wrote:
|
| Any argument for this thesis?
|
| Bye,
| phil

My point being that a custom hacked tool for a momentary need is seldom
as clever as the tools that are specially designed for your problem.

Often the need for a small patch to bring things together is a sign of
a lack of design on a larger scale. To just gsub the problem away is the
very best way of having to gsub again tomorrow.

Now of course we all like to write code; it's just that more code is not
always the answer. Madness is a harsh word for that on a short time
scale, but in the long run, I think it actually applies.

best regards,
kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAyB/0Fifl4CA0ImQRAkqDAJ9a0xcdBEteqc7xBn8LKZJEJE0FFwCdGbPa
2P0meGzpWZr2Zteq1ftSz58=
=Qa2m
-----END PGP SIGNATURE-----
 
Kaspar Schiess

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

| I generally strip DOS idiocy on the command line with:
|
| tr -d '\r' < in > out

Why would a thing grown from history be called idiocy in any context?
It just is that way. If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?

k
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAyCDdFifl4CA0ImQRAgg3AJ0a9zVRhwmIzmBM84HOiytNmKBrEgCgs7au
bg0QR3yw1iiPBrHa+bBhAvA=
=llI9
-----END PGP SIGNATURE-----
 
Dick Davies

* Kaspar Schiess said:
| I generally strip DOS idiocy on the command line with:
|
| tr -d '\r' < in > out

Why would a thing grown from history be called idiocy in any context?
It just is that way. If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?

As far as I'm concerned you can have any line terminator you like in
the privacy of your own filesystem, but when you transfer a file to
me, I'd like it in a useful format.

The server in question is the one that returns no output without a
valid user-agent, remember, and these ^Ms are in the middle of lines.
We're not talking DOS end-of-lines here.
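
To check which case a given response falls into, a rough sketch like the following counts carriage returns that belong to a CRLF pair separately from stray ones. The sample string is hypothetical, standing in for whatever the server returned:

```ruby
# Hypothetical sample mixing DOS line endings with stray mid-line carriage returns.
sample = "<p>He\rllo</p>\r\n<p>Wor\rld</p>\r\n"

# Carriage returns that are part of a CRLF line ending.
crlf_count = sample.scan(/\r\n/).length

# Carriage returns NOT followed by a line feed (negative lookahead).
stray_count = sample.scan(/\r(?!\n)/).length

puts "CRLF line endings: #{crlf_count}"
puts "stray carriage returns: #{stray_count}"
```

A non-zero stray count suggests that line-ending converters alone will not clean the data.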
 
Samuel Kvarnbrink

Kaspar said:
| I generally strip DOS idiocy on the command line with:
|
| tr -d '\r' < in > out

Why would a thing grown from history be called idiocy in any context?
It just is that way. If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?

In the days of the "classic" Mac OS (9.x and earlier) that was true; \r
was the default line ending. But that changed during the migration to OS
X, so (with the possible exception of old, badly ported Carbon apps)
it's not an issue anymore.

//samuel
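
When the carriage returns really are line endings (DOS "\r\n" or classic Mac "\r"), a single substitution can normalize all three conventions to "\n". This is only a sketch with a made-up sample; note that it would turn stray mid-line carriage returns into newlines rather than removing them, so it does not fit the case described earlier in the thread:

```ruby
# Hypothetical sample with all three line-ending conventions in one string.
mixed = "dos line\r\nclassic mac line\runix line\n"

# /\r\n?/ matches CRLF as one unit, or a lone CR; each becomes a single LF.
normalized = mixed.gsub(/\r\n?/, "\n")

p normalized.lines
```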
 
Kaspar Schiess

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


| In the days of the "classic" Mac OS (9.x and earlier) that was true; \r
| was the default line ending. But that changed during the migration to OS
| X, so (with the possible exception of old, badly ported Carbon apps)
| it's not an issue anymore.
|
| //samuel
|
|

So that would mean Dick Davies' 'wrong' input HTML came from a pre-OS X
editor? See what I meant by saying 'historical context'?

--
kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAyDP2Fifl4CA0ImQRAtttAJ4vgneNCc2tk+zbiDQRdc2wN2nK4wCfQNpj
QigwDF3rKkF00rVS6uLPgUo=
=P81A
-----END PGP SIGNATURE-----
 
Bill Kelly

Kaspar Schiess said:
| I generally strip DOS idiocy on the command line with:
|
| tr -d '\r' < in > out

Why would a thing grown from history be called idiocy in any context?
It just is that way.

If MS-DOS had come out in the 1950's it might be forgivable.

In 1981 elegant solutions to DOS' stupidities had existed
for at least a decade... And yet DOS debuts with:

- lack of shell globbing (STILL doesn't)
- shell can't escape its own metacharacters (STILL can't)
- ^Z character terminating text files even though this
  was a historical artifact from an OS that needed them
  because its filesize was measured in blocks... POINTLESS
  when your OS knows the exact file size in bytes, which MS-DOS
  always has
- ignorance of "a file is a bag of bytes, and everything is
  a file" metaphor
- TWO end of line characters for text files when one would do
  (in 1981 we had CRT screens not teletypes) <grin>
- special illegal filenames "nul" "com1" "lpt1" etc.
- no piping of program output/input streams (they finally
  grafted this on haphazardly)
- can't unlink an open file and continue to read from it
  (this is why ruby -i doesn't work for in-place edits in DOS)

It's "idiocy" because better solutions already existed for
a decade... Yes I've been using MS-DOS since 1981... :(

The history of MS-DOS is the very embodiment and manifestation
of the cute turn of phrase,

Those who do not understand Unix are condemned to reinvent it, poorly.
-- Henry Spencer

...Sadness.

If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?

At least the old macs just used a single character for
newline, even if it was CR instead of LF... :)


Regards,

BILLKE~1.TXT
 
Kaspar Schiess

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


|
| Regards,
|
| BILLKE~1.TXT
|

I agree with most of your post, but that last bit was really funny ;)

kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAyLAbFifl4CA0ImQRAgswAJoCT6ebZ6pEN8xF0dR7op3qJKjslwCgk3hW
pL0A/3n0bgvaw98pn/CnvDI=
=qeG3
-----END PGP SIGNATURE-----
 
