set_intersection

J

John

Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

Thanks.
 
V

Victor Bazarov

John said:
Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

I believe that if both items in 'set_intersection' are *sorted* the
linear search time can be achieved. The Standard does not contain
any complexity requirements for 'set_intersection' that I could find.

V
 
?

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

I believe that if both items in 'set_intersection' are *sorted* the
linear search time can be achieved. The Standard does not contain
any complexity requirements for 'set_intersection' that I could find.

Section 25.3.5 requires that both sets are sorted. The complexity of
set_intersection is O(n+m), where n and m are the sizes of each set (if
you have the same version of the standard as me it's at the top of p 560).

To the OP: if you cross-post, please add a note about it in the message,
my first reply got stuck in c.l.c++.m's moderation system, it might get
through later.
 
V

Victor Bazarov

Erik said:
Section 25.3.5 requires that both sets are sorted. The complexity of
set_intersection is O(n+m), where n and m are the sizes of each set
(if you have the same version of the standard as me it's at the top
of p 560).

Apparently the "version" of the Standard I have is different, I did
find the complexity requirement, it slipped to the next page after the
description of 'set_intersection', so I just couldn't find it easily.

To all: the complexity of O(n+m) is deduced from the formula given by
the Standard in 25.3.5.3/4. BTW, numbering of chapters, clauses and
even paragraphs is usually kept up from the very first "version" of the
standard, so it's actually better to give the location of any text by
the chapter/clause/paragraph instead of page (since numbering of pages
is different in different "versions" and doesn't necessarily correspond
to that in the PDF file so it's unclear which "page" you mean).

V
 
P

Pete Becker

BTW, numbering of chapters, clauses and
even paragraphs is usually kept up from the very first "version" of the
standard, so it's actually better to give the location of any text by
the chapter/clause/paragraph instead of page (since numbering of pages
is different in different "versions" and doesn't necessarily correspond
to that in the PDF file so it's unclear which "page" you mean).

There's no effort to maintain chapter, clause, or paragraph numbers.
The way to cite portions of the standard is by the section's tag and
paragraph number, with enough context so that if the paragraph number
changes, readers can find the one of interest. So the complexity
requirements for set_intersection in the current working draft are in
[set.intersection]/4. (In the current standard, they're in
[lib.set.intersecion]/4; I've removed the lib. prefix from all the
section tags).
 
?

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

It's specified as O(n+m), where n is the size of the first set and m is
the size of the second. This complexity is achieved by requiring that
both sets are sorted, so it's probably using your algorithm.

--
Erik Wikström


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 
M

Martin Bonner

Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

25.3.5.3 [lib.set.intersection] / 4
Complexity: At most
2 * ((last1 - first1) + (last2 - first2)) - 1
comparisons.

The standard almost /never/ specifies the algorithm just a maximum
complexity (which may imply an algorithm, but doesn't prevent a clever
implementer coming up with something better).
 
J

Jerry Coffin

[ ... ]
Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

The algorithm isn't specified. The complexity is ($25.3.5.3/4):
"Comlexity: At most 2 * ((last1 - first1) + (last2 - first2)) - 1
comparisons."

--
Later,
Jerry.

The universe is a figment of its own imagination.

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 
C

Carl Barron

Martin Bonner said:
Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

25.3.5.3 [lib.set.intersection] / 4
Complexity: At most
2 * ((last1 - first1) + (last2 - first2)) - 1
comparisons.

The standard almost /never/ specifies the algorithm just a maximum
complexity (which may imply an algorithm, but doesn't prevent a clever
implementer coming up with something better).
It does specify that equivalent entries are copied from the first
sequence. and if there are k equivalent members of both sets those k are
copied. so it must iterate over the first set, but since it can start
the iteration of the second sequence where it left off last time as
soon as either iterator reaches its end the process stops. That is
if you want the elements from the large set copied it goes first, if
you want the copies of the smaller set copied it goes first. It will
stop with the same # of compares. [not stated. but all I see is a
permutation of the actual compares if the sequences are interchanged]

In short it makes no difference as far as speed which sequence is
presented first. It might make a difference if the comparison does not
ensure equiv(A,B) means they are identical. Example
struct foo
{
int a;
int b;
};
bool operator < (const foo &x,const foo &y) {return x.a < y.a;}

std::vector<foo> f1,f2,i1,i2;
// fill f1 and f2 sorted on op <.
std::set_intersect(f1.begin(),f1.end(),f2.begin(),f2.begin(),
std::back_inserter(i1));
std::set_intersetion(f2.begin(),f2.end(),f1.begin(),f1.end,
std::back_inserter(i2));
does not ensure i1 == i2.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,291
Messages
2,571,455
Members
48,132
Latest member
KatlynC08

Latest Threads

Top