reduce() anomaly?

  • Thread starter Stephen C. Waterbury
  • Start date
B

Ben Finney

Your OS needs a better 'man uniq', since that description doesn't say
what the expected behaviour is with unsorted input. The GNU 'man uniq'
doesn't mention sorted input, but applies to any input.

Bah, I spoke too soon; you're looking only at the 'NAME" section. Read
the 'DESCRIPTION' section and I'll bet it's clear what happens on
unsorted input.
 
E

Erik Max Francis

Curt said:
curty@einstein:~$ less uniq.txt
flirty
curty
flirty
curty

curty@einstein:~$ uniq uniq.txt
flirty
curty
flirty
curty

curty@einstein:~$ sort uniq.txt | uniq
curty
flirty

Maybe my uniq is unique.

No, that's expected behavior, and consistent with what I said. uniq
doesn't really care whether its input is sorted, it just takes
consecutive sequences of identical lines (identically by the criteria
you specify on the command line) and collapses them into at most one.

In your sample, there were no consecutive lines that were identical, so
uniq did nothing. Change the order of them, and despite still being
non-sorted, you'll see that uniq is working:

max@oxygen:~/tmp% cat > uniq.txt
flirty
curty
curty
flirty
^D
max@oxygen:~/tmp% uniq uniq.txt
flirty
curty
flirty

The duplicate consecutive "curty" lines got collapsed into one.
NAME
uniq - remove duplicate lines from a sorted file
******

That's true that's in the one-line description of uniq on some systems,
such as GNU, since that's the most common usage. But if you look at the
description of what it actually does, you'll see its behavior doesn't
require sorted input:

DESCRIPTION
Discard all but one of successive identical lines from
INPUT (or standard input), writing to OUTPUT (or standard
output).

And on some systems, the summary doesn't mention sorting at all; Solaris
8, for instance:

NAME
uniq - report or filter out repeated lines in a file

and sort is only mentioned in the "SEE ALSO" section, nowhere in the
main descpription.

For an example where uniq would be useful despite the input deliberately
not being sorted, consider processing a log file with a lot of duplicate
entries, and you only want to see the first of each series of
consecutive duplicates. (This is actually not unheard of; syslog for
instance will do this automatically.)

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ Extremes meet.
-- John Hall Wheelock
 
C

Curt

Erik Max Francis said:
In your sample, there were no consecutive lines that were identical, so
uniq did nothing. Change the order of them, and despite still being
non-sorted, you'll see that uniq is working:

Well, changing the order of the lines in my sample to ensure the
contiguity of identical entries _is_ sorting. I don't know what else
one could call that procedure, but "non-sorted" appears to me to be
a rather provocative description of the modified sample which you were
constrained to alter in order that it meet a criterion whose existence
you deny.
 
B

Ben Finney

Well, changing the order of the lines in my sample to ensure the
contiguity of identical entries _is_ sorting.

Erik wasn't talking about the *process* of rearranging lines; he was
talking about the *state* of the input list.

This list is sorted (in ascending alphanumeric order):

curty
curty
flirty
flirty

This list is not sorted:

flirty
curty
curty
flirty

Both contain contiguous identical lines. The 'uniq' filter will have an
effect on oth these lists.


This list is sorted:

cooty
curty
flippy
flirty

This list is not sorted:

flippy
curty
cooty
flirty

Neither contain contiguous identical lines. The 'uniq' filter will not
have an effect on either of these lists.
"non-sorted" appears to me to be a rather provocative description of
the modified sample which you were constrained to alter in order that
it meet a criterion whose existence you deny.

The criterion has been stated several times already: uniq will act on
consecutive identical lines. As shown above, this property is
orthogonal to "sorted".

It's unfortunate (and probably worthy of a bug report) that man pages
refer to "sorted" in the 'NAME' section for 'uniq', because the sorted
or non-sorted state is unrelated to the behaviour of 'uniq'.
 
B

Ben Finney

man uniq on GNU:

DESCRIPTION
Discard all but one of successive identical lines from
INPUT (or standard input), writing to OUTPUT (or standard
output).

This says nothing about sorting.

Sadly, the 'NAME' section does:

NAME
uniq - remove duplicate lines from a sorted file

I suspect this is a source of confusion; certainly, it doesn't help.
 
E

Erik Max Francis

Curt said:
Well, changing the order of the lines in my sample to ensure the
contiguity of identical entries _is_ sorting.

The tweak I made to your sample file wasn't sorted. It just had two
identical adjacent lines. The modified sample again was:

max@oxygen:~/tmp% cat > uniq.txt
flirty
curty
curty
flirty
^D
max@oxygen:~/tmp% uniq uniq.txt
flirty
curty
flirty

You don't really think the sequence [flirty, curty, curty, flirty] is
sorted, do you?
I don't know what else
one could call that procedure, but "non-sorted" appears to me to be
a rather provocative description of the modified sample which you were
constrained to alter in order that it meet a criterion whose existence
you deny.

man uniq on GNU:

DESCRIPTION
Discard all but one of successive identical lines from
INPUT (or standard input), writing to OUTPUT (or standard
output).

This says nothing about sorting.

man uniq on Solaris 8:

DESCRIPTION
The uniq utility will read an input file comparing adjacent
lines, and write one copy of each input line on the output.
The second and succeeding copies of repeated adjacent input
lines will not be written.

Repeated lines in the input will not be detected if they are
not adjacent.

Neither of these detailed descriptions makes any reference to sorting
whatsoever; uniq acts completely locally and doesn't care whether its
input is sorted or not.

As a more extended example, consider processing by uniq with a
hypothetical log file:

max@oxygen:~/tmp% cat > uniq2.txt
startup
connect from A
message from A
message from A
message from A
message from A
disconnect from B
mark
mark
connect from B
message from B
disconnect from B
shutdown
^D
max@oxygen:~/tmp% uniq uniq2.txt
startup
connect from A
message from A
disconnect from B
mark
connect from B
message from B
disconnect from B
shutdown
max@oxygen:~/tmp% uniq -c uniq2.txt # to see the number of duplicates
1 startup
1 connect from A
4 message from A
1 disconnect from B
2 mark
1 connect from B
1 message from B
1 disconnect from B
1 shutdown

I hope you'd agree that this input is obviously not sorted in any way.
Yet uniq works precisely as described.

Yes, obviously sending uniq sorted input is a common way it is invoked.
But it is by no means required.

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ We are victims of our circumstance.
-- Sade Adu
 
E

Erik Max Francis

Ben said:
Sadly, the 'NAME' section does:

Yes, I quoted that earlier and commented on it. It's not universal in
uniq man pages. The Solaris 8 man page, for instance, doesn't have this
bug.

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ They make it a desert and call it peace.
-- Tacitus
 
C

Curt

Erik Max Francis said:
The tweak I made to your sample file wasn't sorted. It just had two
identical adjacent lines. The modified sample again was:
max@oxygen:~/tmp% cat > uniq.txt
flirty
curty
curty
flirty
^D
max@oxygen:~/tmp% uniq uniq.txt
flirty
curty
flirty
You don't really think the sequence [flirty, curty, curty, flirty] is
sorted, do you?

Well, you did do _something_ to the sample for which you fail to find
a more descriptive word than "tweak". I certainly do think that the
proper word for the modified sample is "sorted"; yes, you sorted the
file on the word "curty", by which I mean that you performed "an
operation that segregates items into groups according to a specified
criterion" (WordNet). You segregated the item "curty" into a group
and therefore sorted the file by what we will now refer to as the
"curty criterion". That you didn't apply the "flirty criterion" to
your sort (or simply apply an alphabetical criterion) does not
demonstrate anything other than your cleverly disguised reticence to
employ the "sort" command. ;-)
 
B

Ben Finney

Erik Max Francis said:
You don't really think the sequence [flirty, curty, curty, flirty] is
sorted, do you?

Well, you did do _something_ to the sample for which you fail to find
a more descriptive word than "tweak".

He contrived an example that demonstrated his point. You seem to be
fascinated with finding some definition of "sort" that can be bent to
this.
I certainly do think that the
proper word for the modified sample is "sorted"; yes, you sorted the
file on the word "curty", by which I mean that you performed "an
operation that segregates items into groups according to a specified
criterion" (WordNet).

This is ridiculous.

What makes you think he applied "the curty criterion", presuming there
can be some meaningful definition of that? Why could he not, perhaps,
have "sorted" it based on the arrangement of monitor toys he could see?
Or on the output of /dev/random ?

Are you saying that *any* list which has had *any* concievable criterion
applied to its generation, must therefore have been "sorted"?


None of this is the case.

The contrived example showed that 'uniq' *does* have an effect on a list
which has not been sorted (i.e. not sorted into the default order one
would assume when hearing the term "sorted", i.e. alphanumeric
ascending).

Writhing to attempt to force the term "sorted" to apply to an unsorted
list is rather psychotic.
 
H

Hannu Kankaanp??

Terry Reedy said:
The above was a minimal 'concept' proposal to test the aesthetics of
something structurally different from current 'lambda's. I think I
would make all identifiers params by default, since I believe this to
be more common, and 'tag' non-locals, perhaps with one of the
reserved, as yet unused symbols. Example: lambda x: x + y == `x + @y`
or `x+y@`. Since expressions cannot assign, no global declaration is
needed.

A pretty simplistic way to do something similar would be this alias:

@ <=> lambda X=None, Y=None, Z=None:

Thusly:

seq.sort(lambda x, y: cmp(y, x)) =>
seq.sort(@cmp(Y, X))

map(lambda x: x[::-1], seq) =>
map(@X[::-1], seq)

It could replace lambda for most cases. Half of the
function definition wouldn't be "useless trash" anymore,
and redundancy in describing the parameter names would
go away (lambdas don't usually need describing parameter
names).

But @ isn't exactly describing symbol (though "lambda" or "def"
don't scream out "Here's a function definition!" either)
and the 3 implicit arguments don't really fit with
Python's explicity philosophy.... And from what
I've gathered, Guido would rather remove lambda than
introduce syntax that encourages small anonymous functions
(in a weird way to boot). Oh well.
 
C

Curt

Ben Finney said:
Erik Max Francis said:
You don't really think the sequence [flirty, curty, curty, flirty] is
sorted, do you?
Well, you did do _something_ to the sample for which you fail to find
a more descriptive word than "tweak".
He contrived an example that demonstrated his point. You seem to be

No, he didn't contrive an example. Please don't invent things. He tooked
my perfectly good and reasonable example of a file containing redundant
entries and "tweaked" it in order to make the entries of type "curty"
contiguous.
fascinated with finding some definition of "sort" that can be bent to
this.

I did not bend the definition of sort--I got it out of the WordNet dictionary
and quoted one of its senses directly.
This is ridiculous.

No, it isn't.
What makes you think he applied "the curty criterion", presuming there
can be some meaningful definition of that? Why could he not, perhaps,

He grouped the "curty" entries in my sample, is what makes me think he
applied the "curty criterion", and this a perfectly meaningful definition
of the "curty criterion", to wit: "take the sample and change the order
of the lines so that the items of type "curty" are grouped". The fact
that he grouped the latter in the middle of the list changes nothing. That
was a conscious decision on his part; he could have put the curty items at
the top, or at the bottom, or at any position in the list of his choosing,
if the list was long enough.
have "sorted" it based on the arrangement of monitor toys he could see?
Or on the output of /dev/random ?
Are you saying that *any* list which has had *any* concievable criterion
applied to its generation, must therefore have been "sorted"?

I'm saying exactly what I said, i.e. that any arbitrary list upon which
one performs an operation which segregates items into groups according
to a specified criterion is sorted. If you have a list of cows and
cats and dogs, and perform an operation on said list which groups all
the cows together, you have sorted that list "by cows". This appears
to me to be a standard definition; I have yet to see an argument from
you that would dissuade me from believing this to be true, but I would
love to see it if it exists.
 
M

Michael Geary

Curt said:
I'm saying exactly what I said, i.e. that any arbitrary
list upon which one performs an operation which
segregates items into groups according to a specified
criterion is sorted. If you have a list of cows and cats
and dogs, and perform an operation on said list which
groups all the cows together, you have sorted that list
"by cows". This appears to me to be a standard
definition; I have yet to see an argument from you
that would dissuade me from believing this to be
true, but I would love to see it if it exists.

Curt, you have a point that in conventional English usage, "sorted" does not
necessarily mean "ordered". For example, if you hand me a bag of mixed fruit
and ask me to sort the fruit, I won't need to ask you what order to put them
in. I will merely separate out the apples, oranges, bananas, pears, and what
have you.

But you must realize that in the software world, when someone says "sorted"
they usually do mean "sorted and ordered". That's why people have been
having trouble understanding your point.

Even so, you are quite mistaken about the uniq command. It really has
nothing to do with the input being sorted at all. Consider this file:

up
down
up
up
down
down
down
up
up
up
down
down

This file is not sorted by any definition of the word. Agreed? Now run uniq
on it and you'll get this result:

up
down
up
down
up
down

As you can see, uniq simply removes consecutive duplicate lines, nothing
more and nothing less. It does this whether the file is ordered, sorted, or
unsorted.

-Mike
 
E

Erik Max Francis

Curt said:
Well, you did do _something_ to the sample for which you fail to find
a more descriptive word than "tweak". I certainly do think that the
proper word for the modified sample is "sorted"; yes, you sorted the
file on the word "curty", by which I mean that you performed "an
operation that segregates items into groups according to a specified
criterion" (WordNet).

It seems at this point you're conceding that the file is not globally
sorted, and so retracting your original claim. If uniq does something
meaningful and predictable, according to its documentation, on
non-sorted text, then obviously it does not require sorted input as you
once claimed. All despite your goalpost shifting.

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ Golf is a good walk spoiled.
-- Mark Twain
 
E

Erik Max Francis

Curt said:
No, he didn't contrive an example. Please don't invent things.

I posted another example, totally unrelated to your flirty/curty
nonsense, that demonstrated that uniq could do something meaningful with
a totally unsorted file. Please read things.

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ Golf is a good walk spoiled.
-- Mark Twain
 
C

Curt

Curt wrote:
It seems at this point you're conceding that the file is not globally
sorted, and so retracting your original claim. If uniq does something

My original demonstration removed _all_ the duplicates. To remove all
the duplicates from my sample with "uniq", it must be sorted in either
ascending or descending alphabetical order.

My sample file in its modified state is not "globally sorted", but it is
partially sorted (the one implies the other, I'm afraid), and "uniq" worked
solely on that part of the file which you "tweaked" in this way. Seems okay
to me!
meaningful and predictable, according to its documentation, on
non-sorted text, then obviously it does not require sorted input as you
once claimed. All despite your goalpost shifting.

I haven't shifted goalposts, whatever that means. You shifted the
emphasis of my example by removing only one of the duplicates.

I concede that "uniq" does something "meaningful" on unsorted input, if
that input contains identical, successive lines. I do not concede,
however, that you may never be required to sort its input in order to
arrive at that state, which is exactly what my demonstration "proved".

If I've just shifted goalposts, well, you can all shoot me at dawn, if
you feel it's a capital offense. Don't worry, we'll give you the real
bullet.
 
C

Curt

Curt wrote:
I posted another example, totally unrelated to your flirty/curty

Oh yes, you did. Was that the contrived example he was referring to?

I actually thought he was alluding to your contrived example which was
a variation on the theme of my contrived example. Why did I think that?

Let's go back to the context of the whole shebang which you've cut.

He quotes you, then me, then speaks out himself:

You:
"You don't really think the sequence [flirty, curty, curty, flirty]
is sorted, do you?"

Me:
"Well, you did do something to the sample for which you fail to find
a more descriptive word than "tweak"."

Him:
"He contrived an example that demonstrated his point."

Then I say the following thing, which you truncated:

"No, he didn't contrive an example. Please don't invent things. He
tooked my perfectly good and reasonable example of a file containing
redundant entries and "tweaked" it in order to make the entries of type
"curty" contiguous.

Well, the whole thing is not as clearly a case of bad reading as you say
it is.
 
M

Michael Geary

Curt said:
I haven't shifted goalposts, whatever that means. You shifted the
emphasis of my example by removing only one of the duplicates.

I concede that "uniq" does something "meaningful" on unsorted
input, if that input contains identical, successive lines. I do not
concede, however, that you may never be required to sort its
input in order to arrive at that state, which is exactly what my
demonstration "proved".

If I've just shifted goalposts, well, you can all shoot me at dawn,
if you feel it's a capital offense. Don't worry, we'll give you the
real bullet.

Curt...

It is unfortunate that both the program name and the one-line description of
uniq are misleading:

uniq - remove duplicate lines from a sorted file

Well, yes: If a file is sorted, then uniq removes all duplicate lines, and
the resulting file contains only unique lines (thus the name 'uniq').

But the fact remains that what uniq actually does has nothing to do with
sorting. It removes consecutive duplicate lines. That's all.

If I flip a coin 100 times and record the results, will there be consecutive
"heads" or "tails"? Most likely. Is that file sorted? Of course not--it's
completely random. Will uniq remove the consecutive duplicates? Yes.

It's really that simple.

-Mike
 
B

Bengt Richter

Yes, I quoted that earlier and commented on it. It's not universal in
uniq man pages. The Solaris 8 man page, for instance, doesn't have this
bug.
Just looked at the version that comes with msys:

----
[11:26] ~>uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Discard all but one of successive identical lines from INPUT (or
standard input), writing to OUTPUT (or standard output).

-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines
-D, --all-repeated print all duplicate lines
-f, --skip-fields=N avoid comparing the first N fields
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-w, --check-chars=N compare no more than N characters in lines
-N same as -f N
+N same as -s N
--help display this help and exit
--version output version information and exit

A field is a run of whitespace, then non-whitespace characters.
Fields are skipped before chars.

Report bugs to <[email protected]>.
----

Where in the past have I seen an option for writing line repetions as a count message? I.e.,

hoo
hee
hee
haw
haw
haw
hoopla

becomes

hoo
hee
*** above line repeated 1 more time ***
haw
*** above line repeated 2 more times ***
hoopla

(Useful for collapsing error log content
such as a traceback from exceeding recursion depth ;-)

Whether to make a message for a single repeat or just allow two-in-row might be nice
to differentiate in e.g., a -g vs -G option.

Regards,
Bengt Richter
 
E

Erik Max Francis

Curt said:
Oh yes, you did. Was that the contrived example he was referring to?

I actually thought he was alluding to your contrived example which was
a variation on the theme of my contrived example. Why did I think
that?

Let's go back to the context of the whole shebang which you've cut.

This is exceptionally tiring. I don't really care what the other poster
said. You're concentrating on the one example I gave that you insist
involves sorting (which does not involve sorting the file, it involves
rearranging two lines which still leave the file in an unsorted state).

The fact is I posted an entirely different example -- which you have
completely ignored -- which demonstrated the functionality of uniq as
document and has absolutely, unequivocally, undeniably, nothing at all
to do with sorting.

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ I'm trying to forget / But I can't act as if we never met
-- Chante Moore
 
C

Curt

Erik Max Francis said:
Curt wrote:
This is exceptionally tiring. I don't really care what the other poster
said. You're concentrating on the one example I gave that you insist

You don't care what the other poster said. I did, however, because I was
responding to his post. Get it? I made a reasonable assumption in the
context of his article; you insulted me gratuitously and dishonestly because
you cut that context which made it clear that my remark was in good faith.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,172
Messages
2,570,933
Members
47,473
Latest member
ChristelPe

Latest Threads

Top