regex help

Chris Morris

I need a re such that:

' /* comment */ String s = "***/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "***/"; '

I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?
 
Simon Strandgaard

possible.. but difficult.

I spent this evening making some _broken_ experiments.

--
Simon Strandgaard


server> ruby test_main.rb
Loaded suite TestMain
Started
test_balance_bad1(TestMain): .
test_balance_bad2(TestMain): F
test_balance_bad3(TestMain): F
test_balance_ok1(TestMain): F
test_balance_ok2(TestMain): .

Finished in 0.030393 seconds.

1) Failure:
test_balance_bad2(TestMain)
[test_main.rb:25:in `assert_x'
test_main.rb:33:in `test_balance_bad2']:
<["xx ", "*/"]> expected but was
<nil>.

2) Failure:
test_balance_bad3(TestMain)
[test_main.rb:25:in `assert_x'
test_main.rb:37:in `test_balance_bad3']:
<["xx /* /* */ ", "*/"]> expected but was
<["/* /* "]>.

3) Failure:
test_balance_ok1(TestMain)
[test_main.rb:25:in `assert_x'
test_main.rb:41:in `test_balance_ok1']:
<nil> expected but was
<["/* "]>.

5 tests, 5 assertions, 3 failures, 0 errors
server> expand -t2 test_main.rb
require 'test/unit'

class TestMain < Test::Unit::TestCase
  def mk_re
    comment_begin = '\/\*' # /*
    comment_end = '\*\/' # */
    re = /
      (
        #{comment_begin}
        .*?
      )
      #{comment_end}
      .*?
      (?! #{comment_begin} )
      (?= #{comment_end} )
    /x
    re
  end
  def assert_x(expected, input)
    actual = mk_re.match(input)
    if actual
      actual = actual.to_a
      actual.shift
    end
    assert_equal(expected, actual)
  end
  def test_balance_bad1
    s = 'xx /* comment */ String s = "***/"; '
    assert_x(['/* comment '], s)
  end
  def test_balance_bad2
    s = 'xx */ */ String s = "***/"; '
    assert_x(['xx ', '*/'], s)
  end
  def test_balance_bad3
    s = 'xx /* /* */ */'
    assert_x(['xx /* /* */ ', '*/'], s)
  end
  def test_balance_ok1
    s = ' /* */ /* */ '
    assert_x(nil, s)
  end
  def test_balance_ok2
    s = 'xx /* /* */ '
    assert_x(nil, s)
  end
end

if $0 == __FILE__
  require 'test/unit/ui/console/testrunner'
  Test::Unit::UI::Console::TestRunner.run(TestMain, 3)
end
server>
 
Chris Morris

Simon said:
possible.. but difficult.

I spent this evening making some _broken_ experiments.

I thought it might be. I've fallen back and written a simple
char-by-char parser that ignores pieces inside quotes.
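
A minimal sketch of what such a char-by-char pass could look like, for illustration only. The parser Chris actually wrote is not shown in the thread; the method name flip_outside_quotes and the handling of backslash escapes inside strings are assumptions.

```ruby
# Hypothetical helper: replaces "*/" with "*\" everywhere except inside
# double-quoted strings. Backslash escapes inside strings are copied
# verbatim (an assumption; the thread does not say how Chris handled them).
def flip_outside_quotes(src)
  out = +''
  in_string = false
  i = 0
  while i < src.length
    ch = src[i]
    if in_string
      out << ch
      if ch == '\\' && i + 1 < src.length
        out << src[i + 1] # copy the escaped character untouched
        i += 1
      elsif ch == '"'
        in_string = false # closing quote
      end
    elsif ch == '"'
      in_string = true # opening quote
      out << ch
    elsif ch == '*' && src[i + 1] == '/'
      out << '*\\' # the replacement: an asterisk and one backslash
      i += 1
    else
      out << ch
    end
    i += 1
  end
  out
end

puts flip_outside_quotes(' /* comment */ String s = "***/"; ')
```

The single boolean is all the state the scan needs, which is why this stays simple where a pure lookaround regex gets awkward.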
 
Robert Klemme

str.gsub(%r{"[^"]*"|\*/}) {|m| m == '*/' ? '*\\' : m}

If you need single quotes as well, take this one:
str.gsub(%r{"[^"]*"|'[^']*'|\*/}) {|m| m == '*/' ? '*\\' : m}

Note: the order of the different alternatives matters.

Regards

robert
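
For readers who want to try this, here is a small runnable sketch of the pattern above. The regex and block are taken from Robert's post; the method name flip_comment_ends and the second sample input are made up for the example. The idea is that the regex engine takes the leftmost match, so a quoted string is consumed whole (and handed back unchanged by the block) before any */ inside it can be matched on its own.

```ruby
# Robert's pattern with single quotes included: quoted strings are matched
# first and returned as-is; only a bare "*/" is rewritten.
RE = %r{"[^"]*"|'[^']*'|\*/}

# Hypothetical wrapper name, used only for this example.
def flip_comment_ends(str)
  str.gsub(RE) { |m| m == '*/' ? '*\\' : m }
end

puts flip_comment_ends(' /* comment */ String s = "***/"; ')
puts flip_comment_ends("x = '*/'; */")
```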
 
Nikolai Weibull

* Chris Morris said:
Or is this one of those problems ill-suited for a regex?

there are better things, yes...

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '\*' ? '*\\' : m }

will find strings and your pattern fast and efficient while avoiding
prematurely terminated strings (those that contain escaped quotes that
is),
nikolai
 
Robert Klemme

Nikolai Weibull said:
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '\*' ? '*\\' : m }

will find strings and your pattern fast and efficient while avoiding
prematurely terminated strings (those that contain escaped quotes that
is),
nikolai

Did you test that? I'm afraid, it doesn't work:

irb(main):001:0> str=' /* comment */ String s = "***/"; '
=> " /* comment */ String s = \"***/\"; "
irb(main):002:0> str.gsub(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '\*' ? '*\\' : m }
=> " /* comment */ String s = \"***/\"; "
irb(main):003:0> str == str.gsub(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '\*' ? '*\\' : m }
=> true

If you want to allow quotes to be escaped, this one is the way to go:

irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
/* comment *\ String s = "***/";

Regards

robert
 
Nikolai Weibull

* Robert Klemme said:
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '\*' ? '*\\' : m }
Did you test that? I'm afraid, it doesn't work:

yes, but I seem to have made a mistake in copying it over for some
reason. The problem is the test, not the regex; it should be

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*/' ? '*\\' : m }
                                                    ^^

* Robert Klemme said:
If you want to allow quotes to be escaped, this one is the way to go:

irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
/* comment *\ String s = "***/";

well, this isn't really correct: it only allows escaped quotes, so
escaped backslashes in your strings aren't handled. It is of
course trivial to mend. Do note that my version is a lot faster; see
"Mastering Regular Expressions" by Jeffrey E. F. Friedl on why this is
so. Anyway, thanks for pointing out that something was wrong,
nikolai
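
A small sketch of the escaped-backslash point, assuming a made-up input in which a string literal ends with an escaped backslash. The two patterns are the ones quoted in this thread; the constant names, the helper flip and the sample input are hypothetical.

```ruby
# Pattern that only recognises \" as an escape inside a string.
QUOTE_ONLY = %r{"([^"\\]|\\")*"|\*/}
# Pattern that accepts a backslash followed by any character.
ANY_ESCAPE = /"[^"\\]*(\\.[^"\\]*)*"|\*\//

# Hypothetical helper applying the same block as in the thread.
def flip(str, re)
  str.gsub(re) { |m| m == '*/' ? '*\\' : m }
end

# The text is: "a\\" */ "b"
# (the first string literal ends with an escaped backslash)
SRC = '"a\\\\" */ "b"'

puts flip(SRC, QUOTE_ONLY) # the */ is left alone
puts flip(SRC, ANY_ESCAPE) # the */ becomes *\
```

With the quote-only pattern the first literal fails to match, so the engine later pairs its closing quote with the opening quote of "b" and swallows the */ in between as if it were string content.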
 
Robert Klemme

Nikolai Weibull said:
yes, but i seem to have made a mistake in copying it over for some
reason, the problem is the test, not the regex, it should be

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*/' ? '*\\' : m }

Oops, yes you're right.

Nikolai Weibull said:
do note that my version is a lot faster...see
"Mastering Regular Expressions" by Jeffrey E. F. Friedl on why this is
so.

I've added escaping of arbitrary chars and put it into a benchmark
(attached). The differences don't look too big:

18:07:21 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.776000)
yours  1.719000   0.000000   1.719000 (  1.732000)
18:07:26 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.785000)
yours  1.704000   0.000000   1.704000 (  1.712000)
18:07:31 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.800000)
yours  1.719000   0.000000   1.719000 (  1.706000)
18:07:36 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.788000)
yours  1.703000   0.000000   1.703000 (  1.693000)
18:07:43 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.766000   0.000000   1.766000 (  1.788000)
yours  1.719000   0.000000   1.719000 (  1.707000)
18:07:52 [ruby]:

That's certainly not what I'd call "a lot faster". Maybe the effects of
GC dominate the rx timing. Here's the output of the second benchmark:

18:15:22 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.234000   0.000000   3.234000 (  3.227000)
yours  3.157000   0.016000   3.173000 (  3.188000)
18:15:33 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.203000   0.000000   3.203000 (  3.218000)
yours  3.125000   0.016000   3.141000 (  3.231000)
18:15:44 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.187000   0.000000   3.187000 (  3.257000)
yours  3.156000   0.000000   3.156000 (  3.186000)

Doesn't look so much different. Any ideas or enlightening comments from
the aforementioned book?

Kind regards

robert
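
The attached benchmark scripts are not reproduced in this thread. The following is only a guess at what such a comparison might look like, using Ruby's standard Benchmark module. The input text, the repetition counts and the helper flip are assumptions, and "mine" is taken to be the alternation form with arbitrary escapes, ([^"\\]|\\.)*, that Robert describes.

```ruby
require 'benchmark'

# "mine": alternation inside the loop, any character may be escaped.
MINE  = %r{"([^"\\]|\\.)*"|\*/}
# "yours": the unrolled form posted by Nikolai.
YOURS = %r{"[^"\\]*(\\.[^"\\]*)*"|\*/}

# Hypothetical workload: one Java-like line repeated many times.
LINE = ' /* comment */ String s = "a \\"quoted\\" ***/"; '
TEXT = LINE * 200
RUNS = 50

# Hypothetical helper applying the same block as in the thread.
def flip(str, re)
  str.gsub(re) { |m| m == '*/' ? '*\\' : m }
end

Benchmark.bm(5) do |bm|
  bm.report('mine')  { RUNS.times { flip(TEXT, MINE) } }
  bm.report('yours') { RUNS.times { flip(TEXT, YOURS) } }
end
```

Timings from a sketch like this depend heavily on the input, since short strings with few escapes give the two forms little room to differ.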
 
Nikolai Weibull

* Robert Klemme said:
Doesn't look so much different. Any ideas or enlightening comments from
the aforementioned book?

'my' version avoids a lot of unnecessary backtracking under certain
conditions. I can't really delve into it further, but if you haven't
got the book, it's really worth buying. It's really very entertaining
and full of good knowledge.
nikolai
 
Nikolai Weibull

* Zach Dennis said:
I've got the book Nikolai, and tonight after the office closes my goal
is to dive into your code and Robert's code and find out why. Your
knowledge on this amazes me!

hehe, eh, thanks I suppose. It's covered in Chapter 6. It's a use of
what Friedl calls "Unrolling-the-Loop" for regexes. It's one of the
coolest regex optimizations ever devised in my opinion.
nikolai

P.S.
Sorry for not being able to explain why (or more interestingly how) in
more detail, but Friedl spends some 60-70 pages on this, so it wouldn't
really be possible.
D.S.
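
As a rough illustration of the shape of the technique only (not a summary of the chapter): the unrolled form follows the template opening normal* (special normal*)* closing, where here "normal" is any character other than a quote or backslash and "special" is a backslash plus one character. The sketch below builds both forms from the same pieces and checks that they accept the same strings; the constant names and sample strings are made up, and it says nothing about speed.

```ruby
NORMAL  = /[^"\\]/ # any character that is neither a quote nor a backslash
SPECIAL = /\\./    # a backslash followed by any one character

# Alternation inside a single loop.
ALTERNATION = /"(?:#{NORMAL}|#{SPECIAL})*"/
# Unrolled: opening normal* (special normal*)* closing
UNROLLED    = /"#{NORMAL}*(?:#{SPECIAL}#{NORMAL}*)*"/

SAMPLES = [
  '"plain"',
  '"with \\" quote"',            # contains an escaped quote
  '"ends with backslash \\\\"',  # ends with an escaped backslash
  '"unterminated \\"'            # final quote is escaped, so no match
]

SAMPLES.each do |s|
  puts "#{s.inspect}: #{s[ALTERNATION].inspect} / #{s[UNROLLED].inspect}"
end
```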
 
Robert Klemme

Nikolai Weibull said:
'my' version avoids a lot of unnecessary backtracking under certain
conditions. I can't really delve into it further, but if you haven't
got the book, it's really worth buying. It's really very entertaining
and full of good knowledge.
nikolai

Hm.... Maybe it's because of the alternative in the first part:
([^"\\]|\\.). But the rx engine can detect at the first char which of the
two alternatives it has to take. Hmm... It seems I'll have to get
that book... :)

Thanks anyway!

robert
 