#sort_by and #sort_obj

E

Eric Hodel

I haven't seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can't easily use the more-efficient #sort_by because you'll need
to know how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don't know if the name #sort_obj is appropriate, but I'm not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj
 
T

Trans

I haven't seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can't easily use the more-efficient #sort_by because you'll need
to know how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don't know if the name #sort_obj is appropriate, but I'm not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

Perhaps a better name is #indice or #sort_index, or something like
that.

T.
 
D

David A. Black

Hi --

I haven't seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can't easily use the more-efficient #sort_by because you'll need to know
how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don't know if the name #sort_obj is appropriate, but I'm not coming up with
something better.

Maybe #sort_key.


David

--
Upcoming training from Ruby Power and Light, LLC:
* Intro to Ruby on Rails, Edison, NJ, October 23-26
* Advancing with Rails, Edison, NJ, November 6-9
Both taught by David A. Black.
See http://www.rubypal.com for more info!
 
D

David A. Black

Hi --

I haven't seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can't easily use the more-efficient #sort_by because you'll need
to know how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don't know if the name #sort_obj is appropriate, but I'm not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

I think Eric intends it to return a kind of sort key, not to actually
specify the sorting algorithm.
Perhaps a better name is #indice or #sort_index, or something like
that.

I'm afraid "indice" isn't a word :) The singular of indices is index.


David

--
Upcoming training from Ruby Power and Light, LLC:
* Intro to Ruby on Rails, Edison, NJ, October 23-26
* Advancing with Rails, Edison, NJ, November 6-9
Both taught by David A. Black.
See http://www.rubypal.com for more info!
 
S

Stefan Rusterholz

David said:
Hi --


Maybe #sort_key.


David

I'd support that. I wrote a nat_cmp for String and added
String#nat_cmp_key to use sort_by.
It would be nice if the object returned allowed #-@ to be called, to
enable reverse sorting by easily doing: enum.sort_by { |o| -o.sort_key }

Regards
Stefan
 
T

Trans

I think Eric intends it to return a kind of sort key, not to actually
specify the sorting algorithm.

I realize. But clearly a standard definition of <=> would be based on
it. So the "coupling" method of Sortable could be #sort_key instead of
I'm afraid "indice" isn't a word :) The singular of indices is index.

Ah, well. I'm an American; dropping 's' is good enough for me ;)
#sort_key is a perfectly good name.

T.
 
A

ara.t.howard

I don't know if the name #sort_obj is appropriate, but I'm not
coming up with something better.

i prefer #order, #order_by since the concept is extends to much more
that sorting, for instance iterating or returning elements from a
collection such as a priority queue. in any case the idea is a good
one.


cheers.

a @ http://codeforpeople.com/
 
R

Rudi Cilibrasi

i prefer #order, #order_by since the concept is extends to much more
that sorting, for instance iterating or returning elements from a
collection such as a priority queue. in any case the idea is a good
one.

its a great idea and i have been using it for years in sql by the
name "sortkey" although after reading ara's point i think i prefer the
term "orderkey" now.
another name in collation docs is "weight string". i think i like
weight_string the best because it is less verby and confrontational
than order_by and order. both of these are sort of "in your face"
sounding. cheers,
-r.
 
J

Joel VanderWerf

Eric said:
somethings.sort_by { |s| s.sort_obj }

Since this is more efficient than #sort, maybe the implementation of
#sort should first try to do this, and only if that fails (#sort_obj not
defined, or two sort_obj not comparable), revert to the normal <=>
algorithm.

This would make a difference when some library does a #sort on some
collection, and you can't tell it to #sort_by instead (and you can't
redefine #sort in the collection, either).

In the cases where you have a collection of objects that don't respond
to #sort_obj, #sort will quickly hit a failure and use the fall-back.
(The worst case would be if, say, all but the last object in the
collection had #sort_obj.)

It should also fall back if <=> fails on the sort objs. I'm not sure how
to detect that reliably though.

Anyway, here's a half-baked attempt:

class NoSortObjMethod < StandardError; end

class Object
def sort_obj
raise NoSortObjMethod
end
end

class Array
alias old_sort sort
def sort
if block_given?
old_sort {|*args| yield(*args)}
else
begin
rslt = sort_by {|s| s.sort_obj}
puts "DEBUG: used sort_by with sort_obj"
rslt
rescue NoSortObjMethod
puts "DEBUG: NoSortObjMethod, using old_sort"
old_sort
rescue TypeError => ex # ?
puts "DEBUG: Comparison failed (#{ex}), using old_sort"
old_sort
end
end
end
end

a = [1, 3, 2]
p a
p a.sort


class C
def initialize n
@n = n
end

def sort_obj
@n
end
end

b = [C.new(1), C.new(3), C.new(2)]
p b
p b.sort

__END__

Output:

[1, 3, 2]
DEBUG: NoSortObjMethod, using old_sort
[1, 2, 3]
[#<C:0xb7c3d248 @n=1>, #<C:0xb7c3d234 @n=3>, #<C:0xb7c3d220 @n=2>]
DEBUG: used sort_by with sort_obj
[#<C:0xb7c3d248 @n=1>, #<C:0xb7c3d220 @n=2>, #<C:0xb7c3d234 @n=3>]
 
R

Rick DeNatale

Since this is more efficient than #sort, maybe the implementation of
#sort should first try to do this, and only if that fails (#sort_obj not
defined, or two sort_obj not comparable), revert to the normal <=>
algorithm.

Yes, but sort_by is not always more efficient than sort:

$ qri sort_by
----------------------------------------------------- Enumerable#sort_by
enum.sort_by {| obj | block } => array
------------------------------------------------------------------------
Sorts enum using a set of keys generated by mapping the values in
enum through the given block.

%w{ apple pear fig }.sort_by {|word| word.length}
#=> ["fig", "pear", "apple"]

The current implementation of sort_by generates an array of tuples
containing the original collection element and the mapped value.
This makes sort_by fairly expensive when the keysets are simple

require 'benchmark'
include Benchmark

a = (1..100000).map {rand(100000)}

bm(10) do |b|
b.report("Sort") { a.sort }
b.report("Sort by") { a.sort_by {|a| a} }
end

produces:

user system total real
Sort 0.180000 0.000000 0.180000 ( 0.175469)
Sort by 1.980000 0.040000 2.020000 ( 2.013586)

Also adding the startup overhead to check for an implementation of
sort_obj is an expense.

There's also the issue that you really need to check if ALL of the
elements being sorted respond to sort_obj.

It seems better to me to either reserve this technique as a pattern,
or at most add a new sort_by_sort_obj method or the like, and leave
sort as it is.
 
J

Joel VanderWerf

Rick said:
Yes, but sort_by is not always more efficient than sort:

Quite right, #sort should stay as it is, and #sort_by should only be
called explicitly.
 
E

Eric Hodel

I haven't seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can't easily use the more-efficient #sort_by because you'll need
to know how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don't know if the name #sort_obj is appropriate, but I'm not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

My primary goal was reducing the number of method calls and objects
created, boosting speed. The downside to this is that you can't sort
heterogeneous collections using #sort_obj. When sorting using #<=>
this would be easier.
 
E

Eric Hodel

its a great idea and i have been using it for years in sql by the
name "sortkey" although after reading ara's point i think i prefer the
term "orderkey" now.
another name in collation docs is "weight string". i think i like
weight_string the best because it is less verby and confrontational
than order_by and order. both of these are sort of "in your face"
sounding. cheers,

I like order_key too.
 
R

Robert Klemme

2007/10/8 said:
I haven't seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can't easily use the more-efficient #sort_by because you'll need
to know how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don't know if the name #sort_obj is appropriate, but I'm not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

My primary goal was reducing the number of method calls and objects
created, boosting speed.

I'm afraid the performance issue is not that easy: I can imagine where
the <=> approach is more efficient. Here's the reason: with #sort_by
you need to keep *all* keys in memory for the whole sorting process
which can cost a lot of memory. Creating those for pairwise
comparison (as with <=>) is more CPU intensive because more temp
arrays have to be created but the overall memory usage might be lower
because they are kept during a single comparison operation only (see
also Rick's benchmark). Allocating that much memory from the OS for
the #sort_by case might actually slow the sorting down dramatically.
The downside to this is that you can't sort
heterogeneous collections using #sort_obj. When sorting using #<=>
this would be easier.

Not necessarily because the <=> operators might not be able to work
with elements from the other types and thus sorting might fail as
miserably. :)

Kind regards

robert
 
E

Eric Hodel

Yes, but sort_by is not always more efficient than sort:

require 'benchmark'
include Benchmark

a = (1..100000).map {rand(100000)}

bm(10) do |b|
b.report("Sort") { a.sort }
b.report("Sort by") { a.sort_by {|a| a} }
end

produces:

user system total real
Sort 0.180000 0.000000 0.180000 ( 0.175469)
Sort by 1.980000 0.040000 2.020000 ( 2.013586)

Also adding the startup overhead to check for an implementation of
sort_obj is an expense.

There's also the issue that you really need to check if ALL of the
elements being sorted respond to sort_obj.

It seems better to me to either reserve this technique as a pattern,
or at most add a new sort_by_sort_obj method or the like, and leave
sort as it is.

I'm actually impressed #sort_by does that well. Array#sort cheats.
 
E

Eric Hodel

I'm afraid the performance issue is not that easy: I can imagine where
the <=> approach is more efficient. Here's the reason: with #sort_by
you need to keep *all* keys in memory for the whole sorting process
which can cost a lot of memory. Creating those for pairwise
comparison (as with <=>) is more CPU intensive because more temp
arrays have to be created but the overall memory usage might be lower
because they are kept during a single comparison operation only (see
also Rick's benchmark). Allocating that much memory from the OS for
the #sort_by case might actually slow the sorting down dramatically.

Profiling indicated a significant amount of time was spent in #<=>
which created more objects and called more methods than #sort_by.
Replacing #sort with #sort_by and #sort_obj reduced both. Creating
fewer objects and using less memory aren't the same thing.

(Also, Rick's benchmark shows Array#sort cheating more than any other
effect.)
Not necessarily because the <=> operators might not be able to work
with elements from the other types and thus sorting might fail as
miserably. :)

#<=> usually does the right thing, raising ArgumentError or returning
nil on incomparable objects, where if you use #sort_obj with
incomparable objects you may get bogus results.
 
R

Rick DeNatale

Profiling indicated a significant amount of time was spent in #<=>
which created more objects and called more methods than #sort_by.
Replacing #sort with #sort_by and #sort_obj reduced both. Creating
fewer objects and using less memory aren't the same thing.

(Also, Rick's benchmark shows Array#sort cheating more than any other
effect.)

First of all it's not my benchmark, it comes directly from the ri doc
for sort_by

Second, I don't understand the concept of Array#sort "cheating" this
isn't a contest.

Some things perform better with sort, others with sort_by. The
performance of sort is generally determined by the cost of doing the
<=> comparison, while the performace of sort_by is generally
determined by the cost in time and space of creating the side array.

As the English say, Horse for courses.
 
E

Eric Hodel

First of all it's not my benchmark, it comes directly from the ri doc
for sort_by

Second, I don't understand the concept of Array#sort "cheating" this
isn't a contest.

Array#sort doesn't call Fixnum#<=> when comparing fixnums, even if
you redefine Fixnum#<=>. If you want a benchmark that shows when
#sort is faster than #sort_by in several general cases you should use
like implementations, not the special cases where Array#sort has been
optimized.
Some things perform better with sort, others with sort_by. The
performance of sort is generally determined by the cost of doing the
<=> comparison, while the performace of sort_by is generally
determined by the cost in time and space of creating the side array.

For sorting Fixnum there is much lower cost since you never call a #<=>.

While I didn't mention it, I was profiling to speed up RubyGems. I
don't blindly assume, and neither should anyone else.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,241
Members
46,831
Latest member
RusselWill

Latest Threads

Top