gsub and backslashes

Ralph Shnelvar

[Note: parts of this message were removed to make it a legal post.]

Consider the string
\1\2\3
that is
"\\1\\2\\3"

I feel really stupid ... but this simple substitution pattern does not do what I expect.

"\\1\\2\\3".gsub(/\\/,"\\\\")

What I want is to change single backslashes to double backslashes. The result of the above substitution is "no change"

On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want ... but I am clueless as to why.
 
Ammar Ali

Consider the string
\1\2\3
that is
"\\1\\2\\3"

I feel really stupid ... but this simple substitution pattern does not do what I expect.

"\\1\\2\\3".gsub(/\\/,"\\\\")

What I want is to change single backslashes to double backslashes. The result of the above substitution is "no change"

On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want ... but I am clueless as to why.

Backslashes are tricky. What's happening here is that each escaped
backslash "\\" yields one backslash, which affects (escapes) what
comes after it, in this case another escaped backslash that in turn
yields one backslash. In other words, four backslashes yield two
backslashes, which is an escaped backslash (i.e. one backslash).
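
A minimal sketch of the counting described above. The variable names are made up for illustration; the only assumption is that gsub treats a pair of backslashes in the replacement string as one literal backslash, which is what the results in this thread show.

```ruby
# What is typed in source vs. what the String actually holds
source     = "\\1\\2\\3"     # holds \1\2\3   (6 characters)
two_chars  = "\\\\"          # holds \\       (2 characters)
four_chars = "\\\\\\\\"      # holds \\\\     (4 characters)

# gsub reads the replacement string a second time: each pair of
# backslashes in it stands for one literal backslash in the result.
unchanged = source.gsub(/\\/, two_chars)    # every \ replaced by one \
doubled   = source.gsub(/\\/, four_chars)   # every \ replaced by two \

puts source      # \1\2\3
puts unchanged   # \1\2\3
puts doubled     # \\1\\2\\3
```

Using puts rather than the IRB echo avoids a third layer of escaping in the displayed output.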

HTH,
Ammar
 
Ammar Ali

Backslashes are tricky. What's happening here is that each escaped
backslash "\\" yields one backslash, which affects (escapes) what
comes after it, in this case another escaped backslash that in turn
yields one backslash. In other words, four backslashes yield two
backslashes, which is an escaped backslash (i.e. one backslash).

I should have added that you can get the same result with 3
backslashes. So 6 of them will give you two.
=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]
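
The command that produced this output was stripped from the post. The following is a guess at what it may have looked like, assuming six typed backslashes in the replacement; it is a reconstruction, not the poster's original line.

```ruby
source    = "\\1\\2\\3"   # holds \1\2\3
six_typed = "\\\\\\"      # 6 typed backslashes, 3 characters in the String

# In the replacement, the first two characters form an escaped backslash.
# The lone trailing backslash has nothing left to escape and is kept as-is.
result = source.gsub(/\\/, six_typed)

p six_typed.length   # 3
p result.scan(/./)   # ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]
```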

Regards,
Ammar
 
botp

[Note: parts of this message were removed to make it a legal post.]

What I want is to change single backslashes to double backslashes. The
result of the above substitution is "no change"
On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want ... but I am clueless as to why.

there are many ways,

#1
"\\1\\2\\3".gsub(/(\\)/,"\\1\\1").scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#2
"\\1\\2\\3".gsub(/(\\)/,'\1\1').scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#3
"\\1\\2\\3".gsub(/\\/){"\\\\"}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#4
"\\1\\2\\3".gsub(/(\\)/){$1+$1}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]


#1 & #2 samples use group backreferences, ruby may need a second parsing pass
for this feature to work...

#3 & #4 use code blocks. may not need a second pass. backreferences can be
had using $n notation.
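
The four variants can be checked against each other in one go. This is only a sketch (the variable names are illustrative), but it shows that the string forms and the block forms land on the same result.

```ruby
source = "\\1\\2\\3"   # holds \1\2\3

# String replacement: gsub scans the replacement for \1, \\ and similar.
via_backref_dq = source.gsub(/(\\)/, "\\1\\1")   # double-quoted, typed \\1
via_backref_sq = source.gsub(/(\\)/, '\1\1')     # single-quoted, typed \1

# Block form: the block's return value is inserted as-is, with no second
# round of backslash interpretation.
via_block     = source.gsub(/\\/) { "\\\\" }
via_block_ref = source.gsub(/(\\)/) { $1 + $1 }

results = [via_backref_dq, via_backref_sq, via_block, via_block_ref]
puts results.uniq   # prints \\1\\2\\3 once, since all four agree
```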

best regards -botp
 
Ammar Ali

What I want is to change single backslashes to double backslashes. The
result of the above substitution is "no change"
On the other hand
"\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
does do what I want ... but I am clueless as to why.

there are many ways,

#1
"\\1\\2\\3".gsub(/(\\)/,"\\1\\1").scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#2
"\\1\\2\\3".gsub(/(\\)/,'\1\1').scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#3
"\\1\\2\\3".gsub(/\\/){"\\\\"}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]

#4
"\\1\\2\\3".gsub(/(\\)/){$1+$1}.scan /./
#=> ["\\", "\\", "1", "\\", "\\", "2", "\\", "\\", "3"]


#1 & #2 samples use group backreferences, ruby may need a second parsing pass
for this feature to work...

#3 & #4 use code blocks. may not need a second pass. backreferences can be
had using $n notation.

botp's excellent suggestions reminded me of another one:
=> "\\\\1\\\\2\\\\3"

Regards,
Ammar
 
Brian Candler

Ralph Shnelvar wrote in post #962847:
Consider the string
\1\2\3
that is
"\\1\\2\\3"

I feel really stupid ... but this simple substitution pattern does not
do what I expect.

"\\1\\2\\3".gsub(/\\/,"\\\\")

Here you are replacing one backslash with one backslash.

The trouble is, in the *replacement* string, '\1' has a special meaning
(insert the value of the first capture). Because of this, a literal
backslash is backslash-backslash.

So to replace with *two* backslashes you need
backslash-backslash-backslash-backslash. And inside a double or single
quoted string, a single backslash is represented as "\\" or '\\'

irb(main):001:0> "\\1\\2\\3".gsub(/\\/,"\\\\\\\\")
=> "\\\\1\\\\2\\\\3"

The second level of backslashing isn't used with the block form, since
if you want to use captured subexpressions you can use #{$1} instead of
\1. Hence as an alternative:

irb(main):002:0> "\\1\\2\\3".gsub(/\\/) { "\\\\" }
=> "\\\\1\\\\2\\\\3"
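
To make the special meaning of the replacement string concrete, here is a small sketch with a made-up input ("x-y") and made-up variable names. It contrasts a real backreference, an escaped literal backslash, and the block form.

```ruby
text = "x-y"

# \1 in the replacement string inserts the first capture group.
swapped = text.gsub(/(\w)-/, '-\1')

# To emit a literal backslash followed by "1", the replacement must hold
# two backslashes before the 1; in source that is typed as four.
literal = text.gsub(/(\w)-/, '\\\\1')

# The block form skips the replacement-level escaping entirely.
from_block = text.gsub(/(\w)-/) { "\\1" }

puts swapped      # -xy
puts literal      # \1y
puts from_block   # \1y
```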
 
Ammar Ali

Ralph Shnelvar wrote in post #962847:

Here you are replacing one backslash with one backslash.

The trouble is, in the *replacement* string, '\1' has a special meaning
(insert the value of the first capture). Because of this, a literal
backslash is backslash-backslash.

That's a keen observation, but the fact that they happen to be
back-references doesn't seem to play a part in this situation.
=> "\\\\a\\\\b\\\\c"
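
The expression behind this result was removed from the post. One way to arrive at it, assuming the input was the same kind of string with letters in place of digits, is sketched here.

```ruby
source = "\\a\\b\\c"                    # holds \a\b\c, no digits anywhere

one = source.gsub(/\\/, "\\\\")         # replacement holds 2 backslashes
two = source.gsub(/\\/, "\\\\\\\\")     # replacement holds 4 backslashes

p one   # "\\a\\b\\c"        (unchanged)
p two   # "\\\\a\\\\b\\\\c"  (doubled)
```

The count needed in the replacement is the same as in the digit case, because the escaping happens inside the replacement string itself, whatever follows the backslash in the subject.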

Regards,
Ammar
 
Robert Klemme

That's a keen observation, but the fact that they happen to be
back-references doesn't seem to play a part in this situation.

=> "\\\\a\\\\b\\\\c"

The key point to understand IMHO is that a backslash is special in
replacement strings. So, whenever one wants to have a literal
backslash in a replacement string one needs to escape it and hence
have two backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.

Additionally people are often confused by the fact that IRB by default
uses #inspect for showing expression values which will display twice
as many backslashes as are present in the string. :)
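
A quick way to see the #inspect effect outside of IRB, as a sketch:

```ruby
s = "\\\\"       # typed with 4 backslashes, holds 2 characters

puts s.length    # 2
puts s           # \\       (the raw characters)
puts s.inspect   # "\\\\"   (what IRB echoes: quoted and re-escaped)
p s.chars        # ["\\", "\\"]
```

When in doubt about how many backslashes a string really holds, puts or String#length gives the answer without the extra layer.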

<grumpy>Can we please make a big red sticker and put it on every Ruby
installer and source tar to inform people of this and the local
variable method ambiguity. These two seem to be the issues that pop
up most of the time.</grumpy>

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Ammar Ali

The key point to understand IMHO is that a backslash is special in
replacement strings. So, whenever one wants to have a literal
backslash in a replacement string one needs to escape it and hence
have two backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.

Actually, 3 backslashes will yield one backslash. The first two result
in one (escaped), and the third one, escaped by the previous escaped
backslash, ends up being one. My second example showed this, using 6
backslashes instead of 8. Using 4 backslashes works because the second
pair yields an escaped backslash, but it is not necessary.

Regards,
Ammar
 
Robert Klemme

Actually, 3 backslashes will yield one backslash. The first two result
in one (escaped), and the third one, escaped by the previous escaped
backslash, ends up being one. My second example showed this, using 6
backslashes instead of 8. Using 4 backslashes works because the second
pair yields an escaped backslash, but it is not necessary.

That does not work reliably under all circumstances though:

irb(main):006:0> "abc".gsub /./, "\\\n"
=> "\\\n\\\n\\\n"
irb(main):007:0> puts("abc".gsub /./, "\\\n")
\
\
\
=> nil
irb(main):008:0> "abc".gsub /./, "\\\\n"
=> "\\n\\n\\n"
irb(main):009:0> puts("abc".gsub /./, "\\\\n")
\n\n\n
=> nil

It is safer to use 4 backslashes. This is the only robust way to do
this even though sometimes you can simply use a single backslash (e.g.
\1 instead of \\1) because string parsing is a bit tolerant under some
circumstances:

irb(main):014:0> '\1'
=> "\\1"
irb(main):015:0> '\\1'
=> "\\1"

but

irb(main):019:0> "\n"
=> "\n"
irb(main):020:0> "\\n"
=> "\\n"
irb(main):021:0> "\1"
=> "\x01"
irb(main):022:0> "\\1"
=> "\\1"
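
The same comparison written as equality checks, which sidesteps the #inspect display altogether. This is a sketch with illustrative variable names.

```ruby
single_one = '\1'    # backslash kept: 2 characters, \ and 1
single_two = '\\1'   # \\ collapses to one backslash: the same 2 characters
double_one = "\1"    # octal escape: 1 character with code point 1
double_two = "\\1"   # escaped backslash, then 1: 2 characters

p single_one == single_two   # true
p double_one.bytes           # [1]
p double_two == single_one   # true
```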


Kind regards

robert

Ammar Ali

That does not work reliably under all circumstances though:

irb(main):006:0> "abc".gsub /./, "\\\n"
=> "\\\n\\\n\\\n"
irb(main):007:0> puts("abc".gsub /./, "\\\n")
\
\
\
=> nil
irb(main):008:0> "abc".gsub /./, "\\\\n"
=> "\\n\\n\\n"
irb(main):009:0> puts("abc".gsub /./, "\\\\n")
\n\n\n
=> nil

I think these examples are somewhat misleading, because the escaped
newline (\n) normally includes a backslash. Taking that into account,
i.e. not counting the one that is part of the newline character, the first
example is only using 2 backslashes, and the second example is using
3. The same goes for its friends, \a, \r, \f, etc.

It is safer to use 4 backslashes. This is the only robust way to do
this even though sometimes you can simply use a single backslash (e.g.
\1 instead of \\1) because string parsing is a bit tolerant under some
circumstances:

I don't think this is tolerance from the string parser, it is
recognition of the \1 as a valid octal value.

irb(main):014:0> '\1'
=> "\\1"
irb(main):015:0> '\\1'
=> "\\1"

Here the single quotes are coming into play. Octal escapes are not
recognized within them. But it outputs the string in double quotes,
"forcing" the backslash to be escaped in the output. Backslashes need
to be escaped in single quoted string, just like they do in double
quoted ones, so in the second example ('\\1'), it's just one
backslash, again.

but

irb(main):019:0> "\n"
=> "\n"
irb(main):020:0> "\\n"
=> "\\n"
irb(main):021:0> "\1"
=> "\x01"
irb(main):022:0> "\\1"
=> "\\1"

Here the double quotes are taking effect. The first correctly prints a
newline, the second an escaped one, the third gets recognized as an
octal escape, and the last escapes the meaning of the backslash that
would otherwise cause the 1 to be interpreted as an octal value.

Maybe using 4 backslashes is safer, overall, but I wouldn't make it a
rule. At least not without explaining these special cases that include
a leading backslash in their normal representation.

Regards,
Ammar
 
Robert Klemme

I think these examples are somewhat misleading, because the escaped
newline (\n) normally includes a backslash. Taking that into account,
i.e. not counting the one that is part of newline character, the first
example is only using 2 backslashes, and the second example is using
3. The same goes for its friends, \a, \r, \f, etc.

That is the very point of my posting: you cannot always use three
backslashes reliably because - ooops - all of a sudden the last one may be
part of something else. In other cases, it happens to work:

irb(main):002:0> "abc".gsub /./, "\\\y"
=> "\\y\\y\\y"
irb(main):003:0> "abc".gsub /./, "\\\\y"
=> "\\y\\y\\y"

Now if someone changes "y" to "n" in the first case the (probably
unintended) effect is dramatic. Or consider a replacement string 'foo
\1 bar' which at some point in time is changed to "foo \1 bar \n"
unsuspectingly and which suddenly does not only yield a newline but some
weird octal character. This would have been avoided if the original
string did contain two backslashes already.
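
The quote-switching hazard can be reproduced with a toy input. This is a sketch; "abc" and the capture on "b" are made up for illustration.

```ruby
text = "abc"

safe_single = text.gsub(/(b)/, 'foo \1 bar')       # \1 reaches gsub intact
broken      = text.gsub(/(b)/, "foo \1 bar \n")    # "\1" is already "\x01"
safe_double = text.gsub(/(b)/, "foo \\1 bar \n")   # \\1 survives the quotes

p safe_single               # "afoo b barc"
p broken.include?("\x01")   # true: the backreference is gone
p safe_double               # "afoo b bar \nc"
```

With two backslashes from the start, the replacement behaves the same in single and double quotes.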

I don't think this is tolerance from the string parser, it is
recognition of the \1 as a valid octal value.


Here the single quotes are coming into play. Octal escapes are not
recognized within them. But it outputs the string in double quotes,
"forcing" the backslash to be escaped in the output. Backslashes need
to be escaped in single quoted string, just like they do in double
quoted ones, so in the second example ('\\1'), it's just one
backslash, again.

Apparently I was not clear enough. The point is, that there is some
tolerance. Both sequences (line 14 and 15) produce the *same* output
although they differ in backslash usage. This does not work if you try
to write '\' to get a single backslash. For that you need '\\'. If you
use two backslashes in both cases it's clear what happens and there is
no room for errors.

Here the double quotes are taking effect. The first correctly prints a
newline, the second an escaped one,

This is not an "escaped newline" but merely a backslash followed by
character "n". Whether that is considered "escaped" in some way depends
on the code that processes this string. If at all this is an escaped
"n". :)

the third gets recognized as an
octal escape, and the last escapes the meaning of the backslash that
would otherwise cause the 1 to be interpreted as an octal value.

Correct.

Maybe using 4 backslashes is safer, overall, but I wouldn't make it a
rule. At least not without explaining these special cases that include
a leading backslash in their normal representation.

My precise reason to make it a rule is that it is simple and beginners
do not have to remember all these special cases that you find so worthy
mentioning.

Actually I do not like those special cases and would rather suggest to
remove them since they make things unnecessarily complicated. The
repeated occurrence of newbie confusion and the very discussion we are
having here proves that the logic creates more confusion than clarity.
The only reason I do not suggest to change this is the fact that this
might break a lot of code.

Kind regards

robert
 
Ammar Ali

----8<----

Apparently I was not clear enough. The point is, that there is some
tolerance. Both sequences (line 14 and 15) produce the *same* output
although they differ in backslash usage. This does not work if you try to
write '\' to get a single backslash. For that you need '\\'. If you use
two backslashes in both cases it's clear what happens and there is no room
for errors.

I guess I took issue with the word tolerance. I don't think of lexers
and parsers as tolerant. They are quite ruthless and dictatorial. It's
either their way, or their way in a way one did not expect. :)

This is not an "escaped newline" but merely a backslash followed by
character "n". Whether that is considered "escaped" in some way depends on
the code that processes this string. If at all this is an escaped "n". :)

You are correct sir. For someone who was nitpicking, I misspoke. :)

My precise reason to make it a rule is that it is simple and beginners do
not have to remember all these special cases that you find so worthy
mentioning.

This might be six of one, half a dozen of the other kind of situation.
People would start to ask if the backslash in the \n case would count
in the "just add 4" rule, or not? 4 backslashes in total or 5? It
seems to only shift the issue slightly, and temporarily, until one has
to actually understand what is really going on.

Actually I do not like those special cases and would rather suggest to
remove them since they make things unnecessarily complicated. The repeated
occurrence of newbie confusion and the very discussion we are having here
proves that the logic creates more confusion than clarity. The only reason I
do not suggest to change this is the fact that this might break a lot of
code.

I agree, but this long "heritage" that goes back to the 60s is
probably very hard to shake. Maybe a new language can break away from
it.

Out of curiosity, what could these beasts be replaced with? Constants?

Cheers,
Ammar
 
Robert Klemme

I guess I took issue with the word tolerance. I don't think of lexers
and parsers as tolerant. They are quite ruthless and dictatorial. It's
either their way, or their way in a way one did not expect. :)

:) But rules can be made to allow for some flexibility (just think
of method calls with or without brackets in Ruby).
:)

You are correct sir. For someone who was nitpicking, I misspoke. :)

No problem. Apparently we both enjoy nitpicking. :))

This might be six of one, half a dozen of the other kind of situation.
People would start to ask if the backslash in the \n case would count
in the "just add 4" rule, or not? 4 backslashes in total or 5? It
seems to only shift the issue slightly, and temporarily, until one has
to actually understand what is really going on.

Hmm... Maybe.

I agree, but this long "heritage" that goes back to the 60s is
probably very hard to shake. Maybe a new language can break away from
it.

In Ruby's case the heritage does not go back to the sixties but rather
to the nineties (1997) if I am not mistaken.

Out of curiosity, what could these beasts be replaced with? Constants?

I'd leave everything as is except drop special cases like '\1' (this
would either be an octal escape as in a double quoted string or rather
just "1"). In single quoted strings only ' would be special if
preceded by a backslash. In double quoted strings I would have those
characters which are special currently (", n, r, a, t and probably
others I'm not thinking of right now). I am undecided whether I would
make all others errors or tolerant (e.g. "\z" would either be a syntax
error or just "z"). I have a slight tendency to the more strict
variant though because otherwise people might be left wondering what
\z means when it is just "z"; also, this would help detect typing
errors (maybe someone wanted to type "\t" which is just a key away on
my German keyboard).

Kind regards

robert


Ammar Ali

[Note: parts of this message were removed to make it a legal post.]

:) But rules can be made to allow for some flexibility (just think
of method calls with or without brackets in Ruby).


That's a good example, and I now understand what you meant by tolerance.


No problem. Apparently we both enjoy nitpicking. :))

:)


I agree, but this long "heritage" that goes back to the 60s is

In Ruby's case the heritage does not go back to the sixties but rather
to the nineties (1997) if I am not mistaken.


I was thinking of C, which I believe introduced these escapes, but I'm not
sure.


I'd leave everything as is except drop special cases like '\1' (this
would either be an octal escape as in a double quoted string or rather
just "1"). In single quoted strings only ' would be special if
preceded by a backslash. In double quoted strings I would have those
characters which are special currently (", n, r, a, t and probably
others I'm not thinking of right now). I am undecided whether I would
make all others errors or tolerant (e.g. "\z" would either be a syntax
error or just "z"). I have a slight tendency to the more strict
variant though because otherwise people might be left wondering what
\z means when it is just "z"; also, this would help detect typing
errors (maybe someone wanted to type "\t" which is just a key away on
my German keyboard).



I like the idea of treating unnecessary escapes as syntax errors, or at
least warnings. I see this a lot in regular expressions, especially in
character sets. Characters that don't need to be escaped (like ? and *) are
preceded with a backslash, just to be safe I guess, making for harder to
read code, as you noted.
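
For the character-set case, a small sketch (sample strings are made up) suggesting that the escaped and unescaped forms match the same inputs:

```ruby
escaped   = /[\?\*]/   # backslashes are redundant inside a character class
unescaped = /[?*]/     # same set: a literal ? or a literal *

samples = ["a?b", "a*b", "ab"]
p samples.map { |s| !!(s =~ escaped) }     # [true, true, false]
p samples.map { |s| !!(s =~ unescaped) }   # [true, true, false]
```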

Regards,
Ammar
 
Robert Klemme

I was thinking of C, which I believe introduced these escapes, but I'm not
sure.

Yeah, but I don't want to change \n, \t etc. in double quoted strings.
I mostly want to get rid of '\1' which is something completely
specific to Ruby.

I like the idea of treating unnecessary escapes as syntax errors, or at
least warnings. I see this a lot in regular expressions, especially in
character sets. Characters that don't need to be escaped (like ? and *) are
preceded with a backslash, just to be safe I guess, making for harder to
read code, as you noted.

Exactly. I would not want to get rid of optional brackets for example
because lack of brackets can make code much more readable (apart from
foo.bar=(123) looking weird). It's always a question of balance. I
have to say that Matz did a remarkable job at this in Ruby in general.
This is just one of very few things that could be better (class
variables is another one I can think of right now).

Kind regards

robert
