set_intersection

John · Jul 20, 2007

Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

Thanks.

Victor Bazarov · Jul 20, 2007

John said:
Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

I believe that if both items in 'set_intersection' are *sorted* the
linear search time can be achieved. The Standard does not contain
any complexity requirements for 'set_intersection' that I could find.

V

=?ISO-8859-1?Q?Erik_Wikstr=F6m?= · Jul 20, 2007

I believe that if both items in 'set_intersection' are *sorted* the
linear search time can be achieved. The Standard does not contain
any complexity requirements for 'set_intersection' that I could find.

Section 25.3.5 requires that both sets are sorted. The complexity of
set_intersection is O(n+m), where n and m are the sizes of each set (if
you have the same version of the standard as me it's at the top of p 560).

To the OP: if you cross-post, please add a note about it in the message,
my first reply got stuck in c.l.c++.m's moderation system, it might get
through later.

Victor Bazarov · Jul 20, 2007

Erik said:
Section 25.3.5 requires that both sets are sorted. The complexity of
set_intersection is O(n+m), where n and m are the sizes of each set
(if you have the same version of the standard as me it's at the top
of p 560).

Apparently the "version" of the Standard I have is different, I did
find the complexity requirement, it slipped to the next page after the
description of 'set_intersection', so I just couldn't find it easily.

To all: the complexity of O(n+m) is deduced from the formula given by
the Standard in 25.3.5.3/4. BTW, numbering of chapters, clauses and
even paragraphs is usually kept up from the very first "version" of the
standard, so it's actually better to give the location of any text by
the chapter/clause/paragraph instead of page (since numbering of pages
is different in different "versions" and doesn't necessarily correspond
to that in the PDF file so it's unclear which "page" you mean).

V

Pete Becker · Jul 20, 2007

BTW, numbering of chapters, clauses and
even paragraphs is usually kept up from the very first "version" of the
standard, so it's actually better to give the location of any text by
the chapter/clause/paragraph instead of page (since numbering of pages
is different in different "versions" and doesn't necessarily correspond
to that in the PDF file so it's unclear which "page" you mean).

There's no effort to maintain chapter, clause, or paragraph numbers.
The way to cite portions of the standard is by the section's tag and
paragraph number, with enough context so that if the paragraph number
changes, readers can find the one of interest. So the complexity
requirements for set_intersection in the current working draft are in
[set.intersection]/4. (In the current standard, they're in
[lib.set.intersecion]/4; I've removed the lib. prefix from all the
section tags).

=?ISO-8859-1?Q?Erik_Wikstr=F6m?= · Jul 20, 2007

Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

It's specified as O(n+m), where n is the size of the first set and m is
the size of the second. This complexity is achieved by requiring that
both sets are sorted, so it's probably using your algorithm.

--
Erik Wikström

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Martin Bonner · Jul 20, 2007

Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

25.3.5.3 [lib.set.intersection] / 4
Complexity: At most
2 * ((last1 - first1) + (last2 - first2)) - 1
comparisons.

The standard almost /never/ specifies the algorithm just a maximum
complexity (which may imply an algorithm, but doesn't prevent a clever
implementer coming up with something better).

Jerry Coffin · Jul 21, 2007

[ ... ]

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

The algorithm isn't specified. The complexity is ($25.3.5.3/4):
"Comlexity: At most 2 * ((last1 - first1) + (last2 - first2)) - 1
comparisons."

--
Later,
Jerry.

The universe is a figment of its own imagination.

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Carl Barron · Jul 21, 2007

Martin Bonner said:
Suppose you have a large set of strings like for say 1 million
elements.

Say you have a pretty small set of strings like say 20 elements.

You want to find which of the 20 elements are also in the 1 million
element set.

To do this you can iterate through each of the 20 elements and look
them each up in the bigger set.

Now if you have another set say 900,000 elements this is not the best
way to do the
set intersection. Better way is to linearly iterator through both
sets in order keeping the matching
elements.

How does the algorithm set_intersection work ? Does it take into
account the above type of analysis ?

Does the standard tell the running time for set_intersection and what
the algorithm is or is this implementation dependent ?

Click to expand...

25.3.5.3 [lib.set.intersection] / 4
Complexity: At most
2 * ((last1 - first1) + (last2 - first2)) - 1
comparisons.

The standard almost /never/ specifies the algorithm just a maximum
complexity (which may imply an algorithm, but doesn't prevent a clever
implementer coming up with something better).

It does specify that equivalent entries are copied from the first
sequence. and if there are k equivalent members of both sets those k are
copied. so it must iterate over the first set, but since it can start
the iteration of the second sequence where it left off last time as
soon as either iterator reaches its end the process stops. That is
if you want the elements from the large set copied it goes first, if
you want the copies of the smaller set copied it goes first. It will
stop with the same # of compares. [not stated. but all I see is a
permutation of the actual compares if the sequences are interchanged]

In short it makes no difference as far as speed which sequence is
presented first. It might make a difference if the comparison does not
ensure equiv(A,B) means they are identical. Example
struct foo
{
int a;
int b;
};
bool operator < (const foo &x,const foo &y) {return x.a < y.a;}

std::vector<foo> f1,f2,i1,i2;
// fill f1 and f2 sorted on op <.
std::set_intersect(f1.begin(),f1.end(),f2.begin(),f2.begin(),
std::back_inserter(i1));
std::set_intersetion(f2.begin(),f2.end(),f1.begin(),f1.end,
std::back_inserter(i2));
does not ensure i1 == i2.

set_intersection of two interval sets	1	Jul 7, 2007
STL:: Questions on set_intersection	3	Oct 11, 2005
Calculate rang and derang of ordering of subsets	0	Feb 6, 2024
SENTINEL CONTROL LOOP WHEN DEALING WITH TWO ARRAYS	1	Oct 26, 2023
Text box simply do not stand out against the wall paper.	3	Feb 7, 2025
Incrementing getelementbyid	1	Jan 6, 2022
Search Results with Pagination	1	Oct 25, 2024
C code String And Comparison	2	Dec 27, 2022

set_intersection

John

Victor Bazarov

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

Victor Bazarov

Pete Becker

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

Martin Bonner

Jerry Coffin

Carl Barron

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads