Random Access using IO#pos in code blocks

Arun Kumar · Apr 28, 2009

Hello everyone,
I'm 20 days new to Ruby, so please forgive any mistakes. I'm working
on a project that indexes certain words in a text document, and I also
want to store the file position at which each word occurs. The problem
is that IO#pos points to the end of the file the whole time. Below is
the code I'm working on:

File.open(file_name) do |f|
  f.readlines("\r\n\r\n").each do |para|
    para.scan(/\b\w+\b/).each do |word|
      word = word.downcase.stem
      # excludes empty and frequent words
      if (!stoplist.include? word) && (!word.empty?)
        # freq is a hash that stores an array containing index,
        # position of word (THE PROBLEM)
        unless freq.has_key?(word)
          freq[word] = [1, f.pos, file_name]
        else
          freq[word].to_a[0] += 1
          freq[word].to_a << f.pos << file_name
        end

        unless wfreq.has_key?(word)
          wfreq[word] = [1, f.pos, file_name]
        else
          wfreq[word].to_a[0] += 1
          wfreq[word].to_a << f.pos << file_name
        end
      end
    end
  end
end

File.open(file_name + ".yaml", "w") { |f| YAML.dump(freq, f) }

Also, it would be great if someone could tell me the replacement for
the deprecated 'to_a' method used above.

Any help is greatly appreciated.

---------------

--
|| श्री जानकीरघुनाथो विजयते ||

Robert Klemme · Apr 29, 2009

Arun said:
File.open(file_name) do |f|

f.readlines("\r\n\r\n").each do |para|

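The reason is in the line above: readlines reads the entire file before
the block runs, so f.pos is already at the end of the file on every
iteration.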

Also it would be great if someone told me the replacement for the
deprecated 'to_a' method used above

Why do you convert an Array into an Array?

Kind regards

robert
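
To make Robert's point concrete, here is a minimal sketch (not code from the thread). It assumes a StringIO standing in for the real file and a made-up three-paragraph text. It shows that pos has already reached the end once readlines returns, and that reading one paragraph at a time with gets lets you record pos before each read.

```ruby
require 'stringio'

SEP = "\r\n\r\n"
text = "first para#{SEP}second para#{SEP}third"   # hypothetical sample input

# readlines consumes the whole stream before the block ever runs,
# so pos is the same (end of data) for every paragraph.
io = StringIO.new(text)
positions_after_readlines = io.readlines(SEP).map { io.pos }

# Reading one paragraph at a time lets us note pos before each read.
io = StringIO.new(text)
paragraph_starts = []
until io.eof?
  start = io.pos          # byte offset where this paragraph begins
  para  = io.gets(SEP)    # the paragraph, including its trailing separator
  paragraph_starts << [start, para]
end

p positions_after_readlines
p paragraph_starts.map(&:first)
```

The same loop should work on a File opened in binary mode; whether newline translation on Windows affects pos in text mode is something to check separately.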

Brian Candler · Apr 29, 2009

Arun said:
# freq is a hash that stores an array containing index,
# position of word (THE PROBLEM)
unless freq.has_key?(word)
  freq[word] = [1, f.pos, file_name]
else
  freq[word].to_a[0] += 1
  freq[word].to_a << f.pos << file_name
end

BTW, you can replace all that by:

freq[word] ||= [0]
freq[word][0] += 1
freq[word] << f.pos << file_name

As for the pos, since you've already slurped in the data you'll need to
remember where you are within your buffer. Your outer loop could become
something like this:

para_pos = 0
f.readlines("\r\n\r\n").each do |para|
  ...
  para_pos += para.size   # para already includes its trailing "\r\n\r\n"
end

Unfortunately, I don't think String#scan will give you the offsets of
the matches it finds within the string.

In ruby 1.8 you can write this:

pos = 0
while md = /\b\w+\b/.match(para[pos..-1])
  word = md[0]
  puts "Match #{word} at #{para_pos+pos+md.begin(0)}"
  pos += md.end(0)
  ...
end

In ruby 1.9 (but not 1.8.6/1.8.7), Regexp#match takes a start pos, so
you could optimise it to this:

pos = 0
while md = /\b\w+\b/.match(para, pos)
  word = md[0]
  puts "Match #{word} at #{para_pos+md.begin(0)}"
  pos = md.end(0)
  ...
end

However in ruby 1.9 the offsets used will be in terms of number of
characters, not number of bytes. It would be up to you to convert this
back into byte offsets into the file, if that's what you're after.
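
One way to do that conversion, as a rough sketch rather than anything from the original posts: take the part of the string before the match and ask for its bytesize. The sample string and the [[:alpha:]]+ pattern are assumptions chosen so that a multi-byte character makes the two offsets differ.

```ruby
para = "caf\u00e9 au lait"   # hypothetical sample; the accented e is 2 bytes in UTF-8

offsets = []
para.scan(/[[:alpha:]]+/) do |word|
  char_off = $~.begin(0)                  # offset counted in characters
  byte_off = para[0, char_off].bytesize   # same position counted in bytes
  offsets << [word, char_off, byte_off]
end

offsets.each { |word, c, b| puts "#{word}: char #{c}, byte #{b}" }
```

Slicing the prefix for every match costs time proportional to the paragraph length, so for very long paragraphs it may be worth keeping a running byte count instead.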

Robert Klemme · Apr 29, 2009

2009/4/29 Brian Candler said:
Unfortunately, I don't think string#scan will give you offsets into the
strings found.

String#scan is likely faster than manually matching portions with
#match. In both versions of Ruby you can do this to get the
/character/ offset:

irb(main):001:0> s=%{foo bar baz}
=> "foo bar baz"
irb(main):002:0> s.scan(/\w+/) { p $`.length }
0
4
8
=> "foo bar baz"

However in ruby 1.9 the offsets used will be in terms of number of
characters, not number of bytes. It would be up to you to convert this
back into byte offsets into the file, if that's what you're after.

This is an important point to remember!

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Arun Kumar · Apr 29, 2009

Thank you so much for the kind responses! I'm pleased to be part of such a
kind community.

Arun Kumar · Apr 29, 2009

Brian said:
BTW, you can replace all that by:

freq[word] ||= [0]
freq[word][0] += 1
freq[word] << f.pos << file_name

I've tried doing it but since 'freq' is a hash it gives the following error:

preprocessor.rb:32:in `calc_frequency_word_list': undefined method `[]=' for 0:Fixnum (NoMethodError)
        from copy of preprocessor.rb:25:in `scan'
        from copy of preprocessor.rb:25:in `calc_frequency_word_list'
        from copy of preprocessor.rb:23:in `each'
        from copy of preprocessor.rb:23:in `calc_frequency_word_list'
        from copy of preprocessor.rb:61

Brian Candler · Apr 29, 2009

Arun said:
freq[word] ||= [0]
freq[word][0] += 1
freq[word] << f.pos << file_name

I've tried doing it but since 'freq' is a hash it gives the following
error:

Show your actual code. The following code works just fine:

freq = {}
%w{foo bar baz bar}.each do |word|
  freq[word] ||= [0]
  freq[word][0] += 1
  freq[word] << "pos" << "name"
end
puts freq.inspect

The error suggests that you have initialized freq[word] to 0, not to
[0].

Or perhaps you set freq = Hash.new(0), which is wrong in this case,
because the default element needs to be [0] not 0.

An alternative is to auto-initialize each hash element like this:

freq = Hash.new { |h,k| h[k] = [0] }
%w{foo bar baz bar}.each do |word|
  freq[word][0] += 1
  freq[word] << "pos" << "name"
end
puts freq.inspect
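
For comparison, a small sketch of the failure mode suspected above, next to the block form. The variable names bad and good and the use of the word's index as a stand-in position are illustrative assumptions, not part of Arun's code.

```ruby
# A default *value* of 0 hands back the Integer 0 for missing keys.
bad = Hash.new(0)
error = begin
  bad["foo"][0] += 1   # 0[0] reads a bit, but Integer has no []= to write back
  nil
rescue NoMethodError => e
  e
end

# A default *block* creates and stores a fresh [0] for each new key.
good = Hash.new { |h, k| h[k] = [0] }
%w[foo bar baz bar].each_with_index do |word, i|
  good[word][0] += 1
  good[word] << i      # i stands in for a real file position
end

puts error.class
puts good.inspect
```

On the Ruby of the time the message names Fixnum, as in the backtrace quoted earlier; newer versions report Integer instead.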

Arun Kumar · Apr 29, 2009

Or perhaps you set freq = Hash.new(0), which is wrong in this case,
because the default element needs to be [0] not 0.

An alternative is to auto-initialize each hash element like this:

freq = Hash.new { |h,k| h[k] = [0] }
%w{foo bar baz bar}.each do |word|
  freq[word][0] += 1
  freq[word] << "pos" << "name"
end
puts freq.inspect


Robert Klemme · Apr 29, 2009

This is a typical case where I would introduce a separate class or
even multiple classes because it makes life so much more readable.

WordPosition = Struct.new :file, :pos

WordStats = Struct.new :word, :positions do
  def count; positions.size; end
end

freq = Hash.new {|h,word| h[word.freeze] = WordStats.new(word, [])}
...
freq[word].positions << WordPosition.new(file_name, pos)
...

Then you can do

freq.sort_by {|w,stat| stat.count}

Kind regards

robert
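
A runnable sketch of the Struct-based idea, filled in with assumptions. The paragraphs array (start offset plus text), the file name "sample.txt" and the tie-breaking sort key are all made up for illustration. Positions come from $~.begin(0) inside String#scan.

```ruby
WordPosition = Struct.new(:file, :pos)
WordStats = Struct.new(:word, :positions) do
  # Number of recorded occurrences of the word.
  def count
    positions.size
  end
end

freq = Hash.new { |h, word| h[word] = WordStats.new(word, []) }

# Hypothetical input: each paragraph paired with its starting offset.
paragraphs = [[0, "foo bar "], [8, "baz foo"]]
paragraphs.each do |start, para|
  para.scan(/\w+/) do |word|
    freq[word].positions << WordPosition.new("sample.txt", start + $~.begin(0))
  end
end

# Most frequent first; the word itself breaks ties so the order is stable.
ranked = freq.sort_by { |word, stats| [-stats.count, word] }
             .map { |word, stats| [word, stats.count] }
p ranked
```

Keeping the count derived from positions.size means there is no separate counter to keep in sync with the list of positions.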


Brian Candler · Apr 30, 2009

Robert said:
String#scan is likely faster than manually matching portions with
#match. In both versions of Ruby you can do this to get the
/character/ offset:

irb(main):001:0> s=%{foo bar baz}
=> "foo bar baz"
irb(main):002:0> s.scan(/\w+/) { p $`.length }
0
4
8
=> "foo bar baz"

Well, my guess is that would be *less* efficient for large paragraphs,
since $` forces allocation of a new string containing all the text from
the start to the current point. But that reminds me, there is a global
variable containing a MatchData object: $~

So you can write:

irb(main):001:0> s=%{foo bar baz}
=> "foo bar baz"
irb(main):002:0> s.scan(/\w+/) { p $~.begin(0) }
0
4
8
=> "foo bar baz"

Regards,

Brian.

Arun Kumar · Apr 30, 2009

Thank you so much for all the responses

Robert Klemme · Apr 30, 2009

2009/4/30 Brian Candler said:
Well, my guess is that would be *less* efficient for large paragraphs,
since $` forces allocation of a new string containing all the text from
the start to the current point.

Last time I checked the actual string buffer was shared so the
overhead is just a single instance. I do have to admit though that I
do not know when the object is allocated (i.e. at time of match or
when referencing $`).

But that reminds me, there is a global
variable containing a MatchData object: $~

So you can write:

irb(main):001:0> s=%{foo bar baz}
=> "foo bar baz"
irb(main):002:0> s.scan(/\w+/) { p $~.begin(0) }
0
4
8
=> "foo bar baz"

Also a good variant! (Btw, MatchData might be even more heavyweight
than a substring.)

Kind regards

robert


Brian Candler · Apr 30, 2009

Robert said:
I do have to admit though that I
do not know when the object is allocated (i.e. at time of match or
when referencing $`).

Experiment suggests the MatchData is created immediately on the match,
and the string is instantiated lazily from that. This makes sense; it
would be very inefficient to allocate strings for $`, $1, $2, $3, ... $'
when maybe none of them will be used. But the MatchData object has the
original string plus all the offsets.

def count(klass)
  c = 0
  ObjectSpace.each_object(klass) { c += 1 }
  c
end

str = " foo bar baz "

c1 = [count(MatchData), count(String)]

str =~ /(\w+)/

c2 = [count(MatchData), count(String)]

x = $~

c3 = [count(MatchData), count(String)]

y = $`

c4 = [count(MatchData), count(String)]

puts [c1,c2,c3,c4].inspect
# [[0, 188], [1, 188], [1, 188], [1, 189]]

Robert Klemme · Apr 30, 2009

2009/4/30 Brian Candler said:
Experiment suggests the MatchData is created immediately on the match,
and the string is instantiated lazily from that. This makes sense; it
would be very inefficient to allocate strings for $`, $1, $2, $3, ... $'
when maybe none of them will be used. But the MatchData object has the
original string plus all the offsets.

Ah, good to know! Thanks for the experimenting!

"Tune in next week when you'll hear Dr. Brian say: what's this fuse for?"
;-)

Kind regards

robert
