Best practices resource/guidance for strings

Cs Webgrl

Hello,

I am scraping quite a bit of data and I would like to make
sure that I'm following some best practices for string manipulation. I
would like to be sure to take into account any speed and garbage
collection issues.

Does anyone know of any posts, websites, books or other resources that
provide "do this, not that" types of guidance?

For example, my understanding is that globbing everything into one line
when manipulating a string is not the best use of resources.

not good
"string+var".gsub('+','').strip.capitalize


better
s = "string+var"
s.gsub('+','')
s.strip!
s.capitalize
s => 'String Var'

Are there resources that explain why one is better than the other and
that also provide more best practices like this?

Thanks.
 
Peter Hickman

[Note: parts of this message were removed to make it a legal post.]

Personally, I don't think doing things on one line is a sin in itself, only
when it is overdone! What counts as overdone depends on your reading ability.

Splitting things onto individual lines allows you to insert logging at
various points without fear of breaking the code, which the one-line approach
does not.

However the multiline approach can make an insignificant part of the code
take up lots of screen real estate which can make the larger code harder to
read.

For example, x.downcase.gsub(/\s+/, ' ').strip.capitalize is a fairly
easy-to-read cleanup of a string, but if it goes multiline

x.downcase!
x.gsub!(/\s+/, ' ')
x.strip!
x.capitalize!

not only does it take up more of the screen but it has also altered x,
something that the single line version did not.

Of course if things get really silly you could just create a function and
stuff all the code in there.
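
As a rough sketch of that last suggestion, the chain could be wrapped in a small helper. The method name clean_up and the sample string are made up for illustration; they are not from anyone's real code. It uses the non-bang chain, so the caller's string is left alone.

```ruby
# Hypothetical helper: the name clean_up is an assumption, not from the thread.
# It uses the non-bang methods, so the argument itself is not modified.
def clean_up(text)
  text.downcase.gsub(/\s+/, ' ').strip.capitalize
end

raw = "  hELLO   wORLD \n"
tidy = clean_up(raw)

puts tidy.inspect  # => "Hello world"
puts raw.inspect   # => "  hELLO   wORLD \n"
```

The one-liner then lives in exactly one place, and logging can be added inside the method without touching the callers.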
 
Brian Candler

Cs said:
> better
> s = "string+var"
> s.gsub('+','')
> s.strip!
> s.capitalize
> s => 'String Var'

(You need gsub! and capitalize! of course)

> Are there resources that explain why one is better than the other and
> that also provide more best practices like this?

Methods like capitalize! work on the existing string buffer in memory.
The non-bang methods create a whole new string, which involves work
copying it, and then later garbage-collecting the original.

Most of the non-bang methods are implemented as a dup followed by
calling the bang method on the copy. They're written in C, but are
effectively like this:

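class String
  # Illustrative only: the real methods are implemented in C.
  def capitalize
    copy = dup
    copy.capitalize!  # may return nil, so return the copy explicitly
    copy
  end

  def capitalize!
    # scan the string and modify it in place
  end
end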

Of course, in most apps the original chained code you wrote will be just
fine, and it's easy to write and understand. If you will be processing
files which are hundreds of megabytes long then it may be worthwhile
rewriting to the second form.
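
One way to see the difference for yourself is to check object identity with equal?. This is only a small sketch; the variable names are illustrative.

```ruby
original = String.new("hello world")

# Non-bang: returns a new String object; the receiver is untouched.
copy = original.capitalize
puts copy.equal?(original)    # => false
puts original                 # => hello world

# Bang: modifies the receiver's own buffer and returns the receiver.
result = original.capitalize!
puts result.equal?(original)  # => true
puts original                 # => Hello world
```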

Other thoughts:

* for large files, process them in chunks or lines rather than reading
them all in at once

* use block form when opening a file, to ensure it's closed as soon as
you've finished with it

File.open("/path/to/file","rb") do |f|
  f.each_line do |line|
    ...
  end
end
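
A self-contained version of the same idea might look like this. The temporary file, its contents and the clean-up step are all made up for illustration, so treat it as a sketch rather than a recipe.

```ruby
require 'tempfile'

# Create a small throwaway file so the example can run on its own.
# The contents and variable names are assumptions for illustration.
tmp = Tempfile.new('scrape')
tmp.write("first+line\nsecond+line\n")
tmp.close

cleaned = []

# Block form: the file is closed when the block exits, even on an exception.
File.open(tmp.path, 'rb') do |f|
  f.each_line do |line|
    # Only the current line is held here, not the whole file.
    cleaned << line.chomp.gsub('+', ' ').capitalize
  end
end

tmp.unlink
puts cleaned.inspect  # => ["First line", "Second line"]
```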
 
Cs Webgrl

Thanks so much for the help and guidance. Most of my data is parsed
from mechanize and broken into smaller chunks that will be manipulated to
get the final format. From my understanding, I should be OK. I
definitely agree that the conciseness of fewer lines of code is easier
to read. I just wanted to make sure that I'm not compromising speed or
garbage collection for readability on these types of methods.
 
Josh Cheek

I don't know about a specific site, but if you do not need to keep the value
of string, then string << var is better than string + var, since it mutates
string, rather than creating a new object. I once read benchmarks about
this, but I can't remember where I read them, and I can't seem to recreate
them, so maybe I am wrong.

# plus returns a new String
string , var = 'abc' , 'def'
string + var # => "abcdef"
string # => "abc"

# << mutates the receiver
string << var # => "abcdef"
string # => "abcdef"
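
If you want to check this on your own machine, a rough timing sketch along these lines could be used. The iteration count and the helper name time_it are made up for illustration, and the timings vary by machine and Ruby version, so none are shown.

```ruby
# Rough timing sketch. The count and the helper name are assumptions.
iterations = 20_000

# Returns the elapsed wall-clock seconds for the given block.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

plus = String.new('')
plus_time = time_it do
  iterations.times { plus = plus + 'x' }  # builds a new String each pass
end

append = String.new('')
append_time = time_it do
  iterations.times { append << 'x' }      # grows the same String each pass
end

puts "+  took #{plus_time.round(4)}s"
puts "<< took #{append_time.round(4)}s"
puts plus == append  # both end up with the same content
```

Both loops build the same string; only the number of intermediate objects differs.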



You can use s.delete('+') instead of s.gsub('+','') and it will be faster,
prettier, and more expressive.



I expect the reason you heard that it is better to do it on multiple lines
is that it then lets you use the bang methods, which return nil if they
don't change the object and so are not safe to chain. In general, it is
faster to say s.capitalize! than s.capitalize: the bang version mutates s
itself, while the non-bang version creates a new, modified object. If we are
not interested in keeping the original value of s, creating all these
extra objects adds up.

# capitalize returns the capital version regardless of the original string
# so you can use it in the middle of a method chain
'Abc'.capitalize # => "Abc"
'abc'.capitalize # => "Abc"

# don't use capitalize! in the middle of a method chain because it can
# return nil
'Abc'.capitalize! # => nil
'abc'.capitalize! # => "Abc"

# capitalize creates a new string, so is less efficient if you don't care
# about the original
# also does not modify the receiver, so you have to capture its result
s = 'abc'
s.capitalize # => "Abc"
s # => "abc"

# capitalize! mutates the original string, so is more efficient if you don't
# care about the original
# does modify the receiver, so don't have to capture its result
# in fact, _don't_ capture its result, because as shown above, result could
# be nil
s = 'abc'
s.capitalize! # => "Abc"
s # => "Abc"
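
Putting the nil behaviour together with the multi-line style, a minimal sketch could look like the following. The sample strings and variable names are made up for illustration.

```ruby
s = String.new('Abc ')
broke = false

# Chaining bang methods is fragile: capitalize! finds nothing to change
# here and returns nil, so the next call is made on nil.
begin
  s.capitalize!.strip!
rescue NoMethodError => e
  broke = true
  puts "chain broke: #{e.class}"
end

# Safer: call each bang method as its own statement and ignore its result.
t = String.new(' abc+def ')
t.gsub!('+', ' ')
t.strip!
t.capitalize!
puts t.inspect  # => "Abc def"
```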
 
Cs Webgrl

Awesome guidelines. Thank you so much for taking the time to write this
up and help me understand how everything works.

Much appreciated Josh!
 
Josh Cheek

> You can use s.delete('+') instead of s.gsub('+','') and it will be faster,
> prettier, and more expressive.

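This is wrong in general: delete treats its argument as a set of characters
(and with several arguments removes only their intersection), so it is not a
drop-in replacement for gsub when the pattern is longer than one character.
For anything beyond a single character you do need gsub. I guess the speed
comparison is not relevant there; gsub is still uglier and less expressive,
but more correct.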
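
To make the difference concrete, here is a small sketch. The sample strings are made up for illustration. For a single character such as '+' the two calls give the same result; they only diverge with longer arguments.

```ruby
s = 'string+var'

# Single-character argument: both remove every '+'.
puts s.delete('+')     # => stringvar
puts s.gsub('+', '')   # => stringvar

# Multi-character argument: delete treats it as a set of characters,
# while gsub with a String pattern matches that exact substring.
t = 'a+_b+c_d'
puts t.delete('+_')    # => abcd
puts t.gsub('+_', '')  # => ab+c_d

# Several arguments: delete removes only characters present in every set.
puts 'hello'.delete('l', 'lo')  # => heo
```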
 
