Help with a regexp


Phrogz

Daniel said:
If you've got comments, bring
'em on, but remember that I only just started this today.

One comment: you seem to have intermixed the places for decoding
(module methods) and encoding (instance methods of classes). It would
seem cleaner, to me, to either add class methods to the classes
(Array.from_bencode instead of Bencode.decode_list) or only use the
module (Bencode.from_array instead of Array#bencode).
 

Daniel Schierbeck

Phrogz said:
One comment: you seem to have intermixed the places for decoding
(module methods) and encoding (instance methods of classes). It would
seem cleaner, to me, to either add class methods to the classes
(Array.from_bencode instead of Bencode.decode_list) or only use the
module (Bencode.from_array instead of Array#bencode).

Actually, I'm imitating the behavior of YAML. I think it's very
intuitive that an object creates a bencoded copy of itself, while the
parser methods are gathered at one place. Maybe make the /decode(_.+)?/
methods private?
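
The YAML convention Daniel refers to can be seen in a few lines. This is only a sketch of the pattern using Ruby's standard yaml library; the `data` hash is an arbitrary example.

```ruby
require 'yaml'

data = { "name" => "example", "sizes" => [1, 2, 3] }

# Encoding: the object serialises itself via an instance method.
text = data.to_yaml

# Decoding: parsing lives on the module.
copy = YAML.load(text)

puts copy == data
```

A Bencode library following the same shape would pair `obj.bencode` with a module-level parse method.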


Cheers,
Daniel
 

Seth Thomas Rasmussen

Hi Daniel,

Daniel said:
I'm trying to write a regular expression that matches bencoded strings,
i.e. strings of the form x:y, where x is the numeric length of y.

This is valid:

6:foobar

while this is not:

4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point.

I think you can do what you want there, but if you're using captures
within the regex they are captured in, you denote them as \1, \2, etc.

%r{<(foo)></\1>} # should match a pair of empty "foo" tags
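
A quick check of that backreference behaviour, as a small sketch. Using `\w+` instead of the literal `foo` is a variation introduced here so the mismatch case is visible.

```ruby
# \1 must repeat exactly what the first group captured.
tag_pair = %r{<(\w+)></\1>}

matched    = "<foo></foo>" =~ tag_pair   # index of the match
mismatched = "<foo></bar>" =~ tag_pair   # nil, closing tag differs

puts matched.inspect
puts mismatched.inspect
```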

Peas,
 

dblack

Hi --

Hi Daniel,



I think you can do what you want there, but if you're using captures
within the regex they are captured in, you denote them as \1, \2, etc.

%r{<(foo)></\1>} # should match a pair of empty "foo" tags

It does, but the issue would be getting it to interpolate and be
pre-processed as a quantifier:

/(\d):\w{#{\1}}/ or something

which doesn't seem to be possible, at least as far as I can tell.
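
One way around this, sketched under the assumption that two passes are acceptable: read the length with one regexp, then build a second regexp once the length is known. The helper name `bencoded_string?` is hypothetical. Note that `.` counts characters rather than bytes, and very large lengths may exceed the regexp engine's repeat limit.

```ruby
# Hypothetical helper: true when str is exactly one well-formed bencoded string.
def bencoded_string?(str)
  # Pass 1: an ordinary regexp pulls out the declared length.
  prefix = str[/\A(\d+):/, 1]
  return false unless prefix

  # Pass 2: the length is now a plain Ruby value, so it can be interpolated.
  # The /m flag lets "." match newlines as well.
  !!(str =~ /\A#{prefix}:.{#{prefix.to_i}}\z/m)
end

puts bencoded_string?("6:foobar")
puts bencoded_string?("4:foo")
```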


David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
http://www.manning.com/black => RUBY FOR RAILS (reviewed on
Slashdot, 7/12/2006!)
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
(e-mail address removed) => me
 

Daniel Martin

Daniel Schierbeck said:
I'm trying to write a regular expression that matches bencoded
strings, i.e. strings of the form x:y, where x is the numeric length
of y.

This is valid:

6:foobar

while this is not:

4:foo

I don't think that what you want to do is possible with a mere regular
expression.

It might be possible using perl's special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn't allow regular expression evaluations to escape
back into the host language.

The problem is that you want to leave crucial portions of the regexp
uncompiled until the moment that half of the regular expression has
matched, and this is not possible.

But matching bencoded data isn't that hard; here's something I just
whipped up that should handle bencoded data:

require 'strscan'

class BencodeScanner
  def BencodeScanner.scan(str)
    scan = StringScanner.new(str)
    r = BencodeScanner.doscan_internal(scan,false)
    raise "Malformed Bencoded String" unless scan.eos?
    r
  end

  # Note: "private" does not apply to methods defined as BencodeScanner.name,
  # so doscan_internal remains publicly callable.
  private

  # Cache of one regexp per string length, built on first use.
  @@string_regexps = Hash.new {|h,k| h[k] = /:.{#{k}}/m}

  # Returns the next decoded value, or nil on the "e" that ends a list or dict.
  def BencodeScanner.doscan_internal(scanner, allow_e=true)
    tok = scanner.scan(/\d+|[ilde]/)
    case tok
    when nil
      raise "Malformed Bencoded String"
    when 'e'
      raise "Malformed Bencoded String" unless allow_e
      return nil
    when 'l'
      retval = []
      while arritem = BencodeScanner.doscan_internal(scanner)
        retval << arritem
      end
      return retval
    when 'd'
      retval = {}
      while key = BencodeScanner.doscan_internal(scanner)
        val = BencodeScanner.doscan_internal(scanner,false)
        retval[key] = val
      end
      return retval
    when 'i'
      raise "Malformed Bencoded String" unless scanner.scan(/-?\d+e/)
      return scanner.matched[0,scanner.matched.length-1].to_i
    else
      # tok is the length prefix; match the colon plus exactly that many characters
      raise "Malformed Bencoded String" unless scanner.scan(@@string_regexps[tok])
      return scanner.matched[1,tok.to_i]
    end
  end
end
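
The regexp cache in the code above relies on the block form of `Hash.new`, which runs only for missing keys. A minimal standalone sketch of that idiom, with a local variable standing in for the class variable:

```ruby
# One regexp per requested length, built on first use and then reused.
string_regexps = Hash.new { |cache, len| cache[len] = /:.{#{len}}/m }

first  = string_regexps["3"]
second = string_regexps["3"]

puts first.source          # the pattern after interpolation
puts first.equal?(second)  # same object, so it was compiled only once
```

This sidesteps the "uncompiled until match time" problem by compiling a separate regexp for each length after the length token has been read.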
 

Chris Erdal

Here's what I consider a slightly cleaner, more robust version:

require 'strscan'
inp = "3:ab23:cat5:sheep"
s = StringScanner.new( inp )
words = []
until s.eos?
  begin
    unless digits = s.scan( /\d+(?=:)/ )
      raise "I can't find an integer followed by a colon"
    end
    s.pos += 1 # skip the colon we know is there
    digits = digits.to_i
    unless s.rest_size >= digits
      raise "I ran out of characters; looking for #{digits} characters, #{s.rest_size} left"
    end
    words << s.peek( digits )
    s.pos += digits
  rescue RuntimeError => e
    warn "Looking at #{s.rest.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words
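
For readers new to strscan, here is a small sketch of the four StringScanner calls that loop depends on, applied to a single item. The input string is an arbitrary example.

```ruby
require 'strscan'

s = StringScanner.new("5:sheep3:cat")

digits = s.scan(/\d+(?=:)/)  # the lookahead leaves the colon unconsumed
s.pos += 1                   # step over the colon
word = s.peek(digits.to_i)   # peek reads ahead without advancing
s.pos += digits.to_i         # now advance past the word

puts word
puts s.rest                  # what is still unread
```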

I've been experimenting with Ruby since Tuesday, and I'd like to thank
you all for sharing code with us here - it really speeds us forwards in
picking up the spirit of Ruby coding.

I believe I've simplified your code by using the incredibly full set of
built-in methods in the string object, rather than depending on
"require 'strscan'":
-----------------------8<----------------------------
s = "3:a:23:cat5:sheep"
words = []
until s.empty?
  begin
    unless digits = s.slice!(/\d+(?=:)/)
      raise "I can't find an integer followed by a colon"
    end
    words << s.slice!(0..digits.to_i)
    unless words.last.size >= digits.to_i
      raise "I ran out of characters; looking for #{digits} characters, #{s.size} left"
    end
  rescue RuntimeError => e
    warn "Looking at #{s.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words
-----------------------8<----------------------------
 

Chris Erdal

I've been experimenting with Ruby ... Tuesday, ...

I never even noticed! Talk about showing your age :)

Sorry, my version kept the colon at the start of each word. It should have been:

s = "3:a:23:cat5:sheep"
words = []
until s.empty?
  begin
    # \A anchors the match, so stray characters before the length are an error
    unless digits = s.slice!(/\A\d+(?=:)/)
      raise "I can't find an integer followed by a colon"
    end
    s.slice!(0) # drop the colon
    # slice!(0, n) also copes with a length of zero, where 0..-1 would take everything
    words << s.slice!(0, digits.to_i)
    unless words.last.size >= digits.to_i
      raise "I ran out of characters; looking for #{digits} characters, #{s.size} left"
    end
  rescue RuntimeError => e
    warn "Looking at #{s.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words

Goodbye,
 

Daniel Schierbeck

Thank you all for your responses!

I've been away for the last two days, so I've only just got an
opportunity to reply.

Daniel, I've further developed your solution:

module Bencode
  class BencodingError < StandardError; end

  class << self
    def dump(obj)
      obj.bencode
    end

    def parse(benc)
      require 'strscan'

      scanner = StringScanner.new(benc)
      obj = scan(scanner)
      raise BencodingError unless scanner.eos?
      return obj
    end

    alias_method :load, :parse

    private

    def scan(scanner)
      case token = scanner.scan(/[ild]|\d+:/)
      when nil
        raise BencodingError
      when "i"
        number = scanner.scan(/0|(-?[1-9][0-9]*)/)
        raise BencodingError unless number
        raise BencodingError unless scanner.scan(/e/)
        return number
      when "l"
        ary = []
        until scanner.peek(1) == "e"
          ary.push(scan(scanner))
        end
        scanner.pos += 1
        return ary
      when "d"
        hsh = {}
        until scanner.peek(1) == "e"
          hsh.store(scan(scanner), scan(scanner))
        end
        scanner.pos += 1
        return hsh
      when /\d+:/
        length = token.chop.to_i
        str = scanner.peek(length)
        scanner.pos += length
        return str
      else
        raise BencodingError
      end
    end
  end
end
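
The `dump` method above delegates to `obj.bencode`, which is not shown in this thread. As a rough sketch of what the encoding side involves, here is a standalone function rather than per-class instance methods. The name `bencode` and its structure are assumptions made for illustration, not Daniel's implementation.

```ruby
# Hypothetical helper, not from the thread: encode a Ruby object as bencode.
def bencode(obj)
  case obj
  when Integer
    "i#{obj}e"
  when String
    # The length prefix counts bytes, not characters.
    "#{obj.bytesize}:#{obj}"
  when Array
    "l" + obj.map { |item| bencode(item) }.join + "e"
  when Hash
    # Dictionary keys are strings and are written in sorted order.
    pairs = obj.map { |k, v| [k.to_s, v] }.sort_by { |k, _| k }
    "d" + pairs.map { |k, v| bencode(k) + bencode(v) }.join + "e"
  else
    raise ArgumentError, "cannot bencode #{obj.class}"
  end
end

puts bencode({ "spam" => ["a", "b"], "count" => 3 })
```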


Cheers, and thank you all for helping me out!
Daniel Schierbeck
 

Daniel Schierbeck

Daniel said:
when "i"
number = scanner.scan(/0|(-?[1-9][0-9]*)/)
raise BencodingError unless number
raise BencodingError unless scanner.scan(/e/)
return number

That last line should of course read

return number.to_i
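
The integer pattern in that branch is stricter than `-?\d+`: it rejects leading zeros and negative zero. A small sketch that shows the effect, using an anchored copy of the pattern with the surrounding `i` and `e` included (the helper name `decode_int` is hypothetical):

```ruby
# Hypothetical helper: decode a standalone bencoded integer, or return nil.
def decode_int(text)
  match = /\Ai(0|-?[1-9][0-9]*)e\z/.match(text)
  match && match[1].to_i
end

puts decode_int("i42e").inspect
puts decode_int("i-0e").inspect
puts decode_int("i03e").inspect
```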


Daniel
 

dblack

Hi --

I don't think that what you want to do is possible with a mere regular
expression.

It might be possible using perl's special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn't allow regular expression evaluations to escape
back into the host language.

Is ??{ code } in Perl different from #{...} in Ruby? (Not that I was
able to solve Daniel's problem with #{...}, but I'm just curious about
the comparison.)


David

 

Logan Capaldo

Hi --



Is ??{ code } in Perl different from #{...} in Ruby? (Not that I was
able to solve Daniel's problem with #{...}, but I'm just curious about
the comparison.)

According to _Programming Perl_ yes indeedy it is. ??{ } is "Match
Time Pattern Interpolation", and it lets you do all sorts of evil
(like matching nested parens with a regexp).

So in perl his code would be something like:

% cat mtpi.pl
$s1 = "3:abc";
$s2 = "24:abc";

print "Good\n" if ( $s1 =~ /(\d+):(??{'\w{' . $1 . '}'})/);
print "Bad\n" if ( $s2 =~ /(\d+):(??{'\w{' . $1 . '}'})/);

% perl mtpi.pl
Good

I apologize to any perlers if this isn't idiomatic (or clean) perl; I
never had to use this kind of magic in my perl days, and I had
difficulty getting it to work when I stored the regexp in a variable.
But the point is that it does work. Which is kind of scary.

 

Daniel Martin

I apologize to any perlers if this isn't idiomatic (or clean) perl; I
never had to use this kind of magic in my perl days, and I had
difficulty getting it to work when I stored the regexp in a variable.
But the point is that it does work. Which is kind of scary.

I think you probably had trouble with the \ when you tried storing it
in a variable because of quoting issues. So use qr, the perl
equivalent of ruby's %r:

$s1 = "3:abc";
$s2 = "24:abc";

$regexp = qr/(\d+):(??{'\w{' . $1 . '}'})/;

print "Good\n" if ( $s1 =~ $regexp);
print "Bad\n" if ( $s2 =~ $regexp);

Although, since bencoded strings can contain any character, and not
just word characters, what you really want is:

$regexp = qr/(\d+):(??{".{$1}"})/;

Perl allows bunches of special constructs in regular expressions that
sane languages don't; those languages like to keep the matching of
regular expressions from jumping back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn't execute arbitrary code.)

For more examples, google "perlre".

Incidentally, I've just been able to reproduce as much of bencoding as
I implemented in ruby earlier in a pair of nasty perl regular
expressions.

I won't post it, since this is ruby-talk and not
perl-regex-nastiness-talk, but people who really want to see it can
look at http://paste.lisp.org/display/22637

It doesn't technically decode every possible bencoded string, because
limitations in perl's regexp engine don't let me say .{n} where "n" is
larger than about 32000 while a bencoded string can in theory have a
length up to 65535. But other than that, it should implement the
entire bencode spec.
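
A quantifier limit like that does not arise if the string body is taken by slicing rather than by a `.{n}` pattern. The following Ruby sketch assumes the length prefix counts bytes; the helper name `take_bencoded_string` is hypothetical.

```ruby
# Hypothetical helper: split "<len>:<bytes>" off the front of data.
# Returns [string, remainder], or nil when data is malformed or too short.
def take_bencoded_string(data)
  prefix = data[/\A\d+(?=:)/]
  return nil unless prefix

  length = prefix.to_i
  start  = prefix.bytesize + 1          # skip the digits and the colon
  return nil if data.bytesize < start + length

  [data.byteslice(start, length), data.byteslice((start + length)..-1)]
end

big = "x" * 70_000
string, rest = take_bencoded_string("70000:" + big + "3:cat")
puts string.bytesize
puts rest
```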
 

dblack

Hi --

Perl allows bunches of special constructs in regular expressions that
sane languages don't; those languages like to keep the matching of
regular expressions from jumping back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn't execute arbitrary code.)

Ruby does let you jump back, though, with #{...}. But it looks like
Perl does an extra level of compilation. (Unless I've got that
backwards.)


David

 

Daniel Martin

Ruby does let you jump back, though, with #{...}. But it looks like
Perl does an extra level of compilation. (Unless I've got that
backwards.)

Ruby lets you do string interpolation with an easy syntax, and lets
you use that syntax even when writing a string that is being compiled
into a regular expression because it's surrounded by //. Perl has
that, too. This, however, is something different: the execution of
code in the host language at regular expression match time, long after
the expression has been compiled, in the middle of doing a match.

As I said, most languages don't allow this. The usual pattern is:
1. Make a string in a language-specific way.
2. Hand the string to the regexp engine and get back some compiled structure.
3. Store a handle to the compiled structure in the host language.
4. Get a string to match.
5. Hand the string and the compiled structure to the regexp engine.
6. The regexp engine walks the string and the compiled structure to determine
   if there's a match.

Now, for speed, step 6 is generally done in C. Sometimes, so is step
2. (I haven't looked at the ruby code, but python does steps 1-5 in
python, and only 6 in C. Java's regexp engine of course does all
those steps in java)

Perl, however, lets you interrupt step 6 and evaluate some perl code
in the midst of the C-based matching code.
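
The difference is easy to observe from Ruby: `#{...}` is resolved when the regexp literal is evaluated, so the resulting pattern is fixed before any matching begins. A small sketch, with arbitrary variable names:

```ruby
length = 3
pattern = /\A\d+:.{#{length}}\z/   # "#{length}" is substituted right here

length = 6                          # changing the variable later has no effect
puts pattern.source                 # still contains {3}

# A new length needs a new regexp object.
longer = Regexp.new("\\A\\d+:.{#{length}}\\z")
puts longer.source

puts(("6:foobar" =~ pattern).inspect)
puts(("6:foobar" =~ longer).inspect)
```

Nothing inside `pattern` can look at what `\d+` matched and adjust the quantifier; that is the step Perl's `(??{ })` adds.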
 

dblack

Hi --

Ruby lets you do string interpolation with an easy syntax, and lets
you use that syntax even when writing a string that is being compiled
into a regular expression because it's surrounded by //. Perl has
that, too. This, however, is something different: the execution of
code in the host language at regular expression match time, long after
the expression has been compiled, in the middle of doing a match.

Right, I see what you mean. I think with "extra level of compilation"
that's what I was groping toward -- really a post-compilation
evaluation in a later context.

Thanks --


David

 
