a splitting headache

Mensanator

Even if the docs do describe the behavior adequately, he has a point
that the documents should emphasize the counterintuitive split
personality of the method better.
s.split() and s.split(sep) do different things, and there is no string
sep that can make s.split(sep) behave like s.split(). That's not
unheard of but it does go against our typical expectations. It would
have been a better library design if s.split() and s.split(sep) were
different methods.
That they are the same method isn't the end of the world but the
documentation really ought to emphasize its dual nature.

I would also offer that the example

  '1,,2'.split(',') returns ['1', '', '2']

could be improved by including a sep instance at the
beginning or end of the string, like

  '1,,2,'.split(',') returns ['1', '', '2', '']

since that illustrates another difference between the
sep and non-sep cases.

But if '1,,2'.split(',') returned ['1', '', '2', ''],
then ','.join(['1', '', '2', '']) would return
'1,,2,' which is not what you started with, so I would
hardly call that an improvement.

The split function works fine, either version, I just
think it could be explained better.
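
A minimal sketch of the two behaviours being contrasted here, using only str.split and str.join; the variable names are made up for illustration.

```python
# With an explicit separator, every delimiter counts, so leading,
# trailing and consecutive delimiters all produce empty strings.
with_sep = '1,,2,'.split(',')

# With no separator (or sep=None), runs of whitespace act as a single
# delimiter and leading/trailing whitespace produces nothing.
without_sep = ' 1  2  '.split()

# Splitting on an explicit separator is reversible with join;
# whitespace splitting is not, because the run lengths are lost.
round_trip = ','.join(with_sep)
lossy = ' '.join(without_sep)

print(with_sep)      # ['1', '', '2', '']
print(without_sep)   # ['1', '2']
print(round_trip)    # 1,,2,
print(lossy)         # 1 2
```

The trailing-comma input is the case suggested above for the docs: it shows the final empty string that the whitespace version never produces.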
 
Mensanator

18:09 -0700, Mensanator wrote:
All I wanted to do is split a binary number into two lists, a list of
blocks of consecutive ones and another list of blocks of consecutive
zeroes.
But no, you can't do that.
c = '0010000110'
c.split('0')
['', '', '1', '', '', '', '11', '']
Ok, the consecutive delimiters appear as empty strings for reasons
unknown (except for the first one). Except when they start or end the
string in which case the first one is included.
Maybe there's a reason for this inconsistent behaviour but you won't
find it in the documentation.
Wanna bet? I'm not sure whether you're claiming that the behavior is
not specified in the docs or the reason for it. The behavior certainly
is specified. I conjecture you think the behavior itself is not
specified,
The problem is that the docs give a single example
'1,,2'.split(',')
['1','','2']
ignoring the special case of leading/trailing delimiters. Yes, if you
think it through, ',1,,2,'.split(',') should return ['','1','','2','']
for exactly the reasons you give.
Trouble is, we often find ourselves doing ' 1  2  '.split() which
returns
['1','2'].
I'm not saying either behaviour is wrong, it's just not obvious that the
one behaviour doesn't follow from the other and the documentation could
be a little clearer on this matter. It might make a bit more sense to
actually mention the split(sep) behavior that split() doesn't do.
Have you _read_ the docs?
They're quite clear on the difference
between no sep (or sep=None) and sep=something:
I disagree that they are "quite clear". The first paragraph makes no
mention of leading or trailing delimiters and they show no example
of such usage. An example would at least force me to think about it
if it isn't specifically mentioned in the paragraph.
One could infer from the second paragraph that, as it doesn't return
empty strings from leading and trailing whitespace, split(sep) does
for leading/trailing delimiters. Of course, why would I even be
reading this paragraph when I'm trying to understand split(sep)?

Now there you have an excellent point.

At the start of the documentation for every function and method
they should include the following:

Note: If you want to understand completely how this
function works you may need to read the entire documentation.

When I took Calculus, I wasn't required to read the
entire book before doing the chapter 1 homework.
Has teaching changed since I was in school?
And of course they should precede that in every instance with

Note: Read the next sentence.

And don't forget to add:

We can't be bothered to show any examples of how
this actually works, work out all the special
cases for yourself.
The splitting of real strings is just as important, if not more so,
than the behaviour of splitting empty strings. Especially when the
behaviour is radically different.
'010000110'.split('0')
['', '1', '', '', '', '11', '']
is a perfect example. It shows the empty strings generated from the
leading and trailing delimiters, and also that you get 3 empty
strings
between the '1's, not 4. When creating documentation, it is always a
good idea to document such cases.
And you'll then want to compare this to the equivalent whitespace
case:
' 1    11 '.split()
['1', '11']
And it wouldn't hurt to point this out:

and note that it won't work with the whitespace version.
No, I have not submitted a request to change the documentation, I was
looking for some feedback here. And it seems that no one else
considers
the documentation wanting.
"If sep is given, consecutive delimiters are not grouped together and are
deemed to delimit empty strings (for example, '1,,2'.split(',') returns
['1', '', '2']). The sep argument may consist of multiple characters (for
example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an
empty string with a specified separator returns [''].
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start or
end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace with
a None separator returns []."
because your description of what's happening,
"consecutive delimiters appear as empty strings for reasons
unknown (except for the first one). Except when they start or end the
string in which case the first one is included"
is at best an awkward way to look at it. The delimiters are not
appearing as empty strings.
You're asking to split '0010000110' on '0'. So you're asking for
strings a, b, c, etc such that
(*) '0010000110' = a + '0' + b + '0' + c + '0' + etc
The sequence of strings you're getting as output satisfies (*) exactly;
the first '' is what appears before the first delimiter, the second ''
is what's between the first and second delimiters, etc.

David C. Ullrich

"Understanding Godel isn't about following his formal proof.
That would make a mockery of everything Godel was up to."
(John Jones, "My talk about Godel to the post-grads."
in sci.logic.)
 
Mensanator

18:09 -0700, Mensanator wrote:
All I wanted to do is split a binary number into two lists, a list of
blocks of consecutive ones and another list of blocks of consecutive
zeroes.
But no, you can't do that.
c = '0010000110'
c.split('0')
['', '', '1', '', '', '', '11', '']
Ok, the consecutive delimiters appear as empty strings for reasons
unknown (except for the first one). Except when they start or end the
string in which case the first one is included.
Maybe there's a reason for this inconsistent behaviour but you won't
find it in the documentation.
Wanna bet? I'm not sure whether you're claiming that the behavior is
not specified in the docs or the reason for it. The behavior certainly
is specified. I conjecture you think the behavior itself is not
specified,
The problem is that the docs give a single example
'1,,2'.split(',')
['1','','2']
ignoring the special case of leading/trailing delimiters. Yes, if you
think it through, ',1,,2,'.split(',') should return ['','1','','2','']
for exactly the reasons you give.
Trouble is, we often find ourselves doing ' 1  2  '.split() which
returns
['1','2'].
I'm not saying either behaviour is wrong, it's just not obvious that the
one behaviour doesn't follow from the other and the documentation could
be a little clearer on this matter. It might make a bit more sense to
actually mention the split(sep) behavior that split() doesn't do.
Have you _read_ the docs?
They're quite clear on the difference
between no sep (or sep=None) and sep=something:
I disagree that they are "quite clear". The first paragraph makes no
mention of leading or trailing delimiters and they show no example
of such usage. An example would at least force me to think about it
if it isn't specifically mentioned in the paragraph.
One could infer from the second paragraph that, as it doesn't return
empty strings from leading and trailing whitespace, split(sep) does
for leading/trailing delimiters. Of course, why would I even be
reading this paragraph when I'm trying to understand split(sep)?

A slightly less sarcastic answer than what I just posted:

And a slightly less sarcastic reply.
I don't see why you _should_ need to read the second paragraph
to infer that leading delimiters will return empty strings when
you do split(sep). That's exactly what one would expect!

Yes, AFTER you read the docs. But prior to opening them, coupled
with a long history of using split(), there is no reason to expect
such behaviour at all.
As I pointed out the other day, if you're splitting ',,p' with
sep = ',' that means you're looking for strings _separated by_
commas. That means you're asking for [s1, s2, ...] where
s1 is the part of the string preceding the first comma,
s2 is the part of the string after the first comma but
before the second comma, etc. And that means s1 = ''
here.

It behaves much like the CSV module, which I'm very familiar
with from Excel. But when importing into Excel, you have the
option of treating consecutive delimiters as one, but unlike
split(), a single leading delimiter will NOT be thrown away.
I would wager that the body of Excel users is vastly greater
than the body of Python programmers. It doesn't hurt to
explicitly point out the obvious, because what's obvious may
differ from people's experience.
That's what "split on commas" _means_. It's also exactly
what you want in typical applications, for example
parsing comma-separated data. The fact that s.split()
does _not_ include an empty string at the start if s
begins with whitespace is that counterintuitive part;
that's why it's specified in the second paragraph
(whether you believe it or not, _that's_ what
confused _me_ once. At which point I read the docs...)
I suppose it makes sense given a typical use case of
s.split(), where s is text and we want to find a list of
the words in s.

Right, what I wanted was to extract the 'words' consisting
of blocks of contiguous 1-bits from a binary number and simply
discard the 0's. I was then going to do the same process only
delimiting on 1's to get blocks of 0's. What I was expecting
was for split(sep) to work similarly to split(), as it is somewhat
unusual for the algorithm to change. I still think the
documentation could do a better job explaining this.
Really. I can't understand why you would _expect_
s.split(sep) to do anything other than

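def split(s, sep):
  # Reference behaviour for a single-character separator.
  res = []
  acc = ''
  for c in s:
    if c == sep:
      # Delimiter: emit whatever was collected, even if it is empty.
      res.append(acc)
      acc = ''
    else:
      acc = acc + c
  # The piece after the last delimiter is always emitted.
  res.append(acc)
  return res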

A very good example, it should be in the docs. Have a look at
the itertools module docs. There they do a wonderful job of
explaining with numerous cases of "itertools.x is equivalent
to ..."
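
In the spirit of those "equivalent to" notes, here is a hedged sketch that checks a hand-written splitter against str.split for single-character separators. The name naive_split is hypothetical, and the comparison says nothing about multi-character separators or the maxsplit argument.

```python
def naive_split(s, sep):
    """Split s on a single-character separator, keeping empty pieces."""
    res = []
    acc = ''
    for c in s:
        if c == sep:
            # Every delimiter closes the current piece, empty or not.
            res.append(acc)
            acc = ''
        else:
            acc += c
    # The text after the last delimiter is always one more piece.
    res.append(acc)
    return res


# Compare against the built-in on inputs with leading, trailing and
# consecutive delimiters, plus the empty string.
samples = ['0010000110', ',1,,2,', '', 'abc', ',,p']
for s in samples:
    for sep in ('0', ','):
        assert naive_split(s, sep) == s.split(sep)

print(naive_split('0010000110', '0'))
# ['', '', '1', '', '', '', '11', '']
```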
Really. You're used to the idea that sum_{j=1}^0 c_j
should be 0, right? That's for exactly the same reason -
the obvious thing for sum_{j=a}^b c_j to return is
given by

def sum(c, lower, upper):
  res = 0
  j = lower
  while j <= upper:
    res = res + c[j]
    j = j + 1
  return res




The splitting of real strings is just as important, if not more so,
than the behaviour of splitting empty strings. Especially when the
behaviour is radically different.
'010000110'.split('0')
['', '1', '', '', '', '11', '']
is a perfect example. It shows the empty strings generated from the
leading and trailing delimiters, and also that you get 3 empty
strings
between the '1's, not 4. When creating documentation, it is always a
good idea to document such cases.
And you'll then want to compare this to the equivalent whitespace
case:
' 1    11 '.split()
['1', '11']
And it wouldn't hurt to point this out:

and note that it won't work with the whitespace version.
No, I have not submitted a request to change the documentation, I was
looking for some feedback here. And it seems that no one else
considers
the documentation wanting.
"If sep is given, consecutive delimiters are not grouped together and are
deemed to delimit empty strings (for example, '1,,2'.split(',') returns
['1', '', '2']). The sep argument may consist of multiple characters (for
example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an
empty string with a specified separator returns [''].
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start or
end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace with
a None separator returns []."
because your description of what's happening,
"consecutive delimiters appear as empty strings for reasons
unknown (except for the first one). Except when they start or end the
string in which case the first one is included"
is at best an awkward way to look at it. The delimiters are not
appearing as empty strings.
You're asking to split '0010000110' on '0'. So you're asking for
strings a, b, c, etc such that
(*) '0010000110' = a + '0' + b + '0' + c + '0' + etc
The sequence of strings you're getting as output satisfies (*) exactly;
the first '' is what appears before the first delimiter, the second ''
is what's between the first and second delimiters, etc.

David C. Ullrich

"Understanding Godel isn't about following his formal proof.
That would make a mockery of everything Godel was up to."
(John Jones, "My talk about Godel to the post-grads."
in sci.logic.)
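
The identity (*) in the quoted text can be checked directly. This is only a sketch for the single example discussed in the thread; the variable names are illustrative.

```python
s = '0010000110'
sep = '0'
pieces = s.split(sep)

# (*) says s == a + '0' + b + '0' + c + ..., which is exactly the
# string that sep.join builds from the pieces.
rebuilt = sep.join(pieces)

print(pieces)
print(rebuilt == s)                     # True

# n delimiters separate n + 1 pieces, some of which may be empty.
print(len(pieces) == s.count(sep) + 1)  # True
```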
 
rurpy

18:09 -0700, Mensanator wrote: [...]
Have you _read_ the docs?
Yes.

They're quite clear on the difference
between no sep (or sep=None) and sep=something:

I disagree that they are "quite clear". The first paragraph makes no
mention of leading or trailing delimiters and they show no example
of such usage. An example would at least force me to think about it
if it isn't specifically mentioned in the paragraph.

One could infer from the second paragraph that, as it doesn't return
empty strings from leading and trailing whitespace, split(sep) does
for leading/trailing delimiters. Of course, why would I even be
reading this paragraph when I'm trying to understand split(sep)?

A slightly less sarcastic answer than what I just posted:

I don't see why you _should_ need to read the second paragraph
to infer that leading delimiters will return empty strings when
you do split(sep). That's exactly what one would expect!

It is very dangerous for a documentation writer
to decide not to document something because, "that's
what everyone would expect". It relies on the
writer actually knowing what everyone would expect
which, given the wide range of backgrounds and
experience of the readers, is asking a lot.
Just look at the different expectation of how
Python should work expressed in posts to this
group by people coming from other languages.
 
rurpy

And they _state_ quite clearly that they do different things!
I don't see what your complaint could possibly be.


_That_ may be so. But claiming that there's a problem with the
docs here seems silly, since the docs say exactly what happens.

If saying exactly what happens was sufficient there
would be no need for docs, the code would suffice.
The purpose of docs is to *effectively* convey to
a human reader a complete and accurate description
of how the documented object works. How effectively
it does that is a quality-of-implementation issue.

Many of us feel that the Python docs quality has a
lot of room for improvement.
 
jhermann

All I wanted to do is split a binary number into two lists,
a list of blocks of consecutive ones and another list of
blocks of consecutive zeroes.

Back to the OP's problem, the obvious (if you know the std lib) and
easy solution is:
c = '001010111100101'
filter(None, re.split("(1+)", c))
['00', '1', '0', '1', '0', '1111', '00', '1', '0', '1']

In production code, you compile the regex once, of course.
 
Vlastimil Brom

2009/10/26 jhermann said:
All I wanted to do is split a binary number into two lists,
a list of blocks of consecutive ones and another list of
blocks of consecutive zeroes.

Back to the OP's problem, the obvious (if you know the std lib) and
easy solution is:
c = '001010111100101'
filter(None, re.split("(1+)", c))
['00', '1', '0', '1', '0', '1111', '00', '1', '0', '1']

In production code, you compile the regex once, of course.

Maybe one can even forget the split function here entirely
re.findall(r"0+|1+", '001010111100101')
['00', '1', '0', '1', '0', '1111', '00', '1', '0', '1']

vbr
 
Mensanator

2009/10/26 jhermann:

Back to the OP's problem, the obvious (if you know the std lib) and
easy solution is:
c = '001010111100101'
filter(None, re.split("(1+)", c))
['00', '1', '0', '1', '0', '1111', '00', '1', '0', '1']
In production code, you compile the regex once, of course.

Maybe one can even forget the split function here entirely
re.findall(r"0+|1+", '001010111100101')
['00', '1', '0', '1', '0', '1111', '00', '1', '0', '1']

Very good. Thanks to both of you. Now if I could only remember
why I wanted to do this.
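
For reference, a sketch of how the original goal (one list of blocks of ones, another of blocks of zeroes) might be reached from the suggestions above. The itertools.groupby variant and the variable names are additions for illustration, not something proposed in the thread. On Python 3, filter() returns an iterator, so the filter(None, ...) form quoted above would need wrapping in list().

```python
import re
from itertools import groupby

c = '001010111100101'

# All maximal runs, in order, using the regex quoted above.
runs = re.findall(r'0+|1+', c)

# The same runs without a regex: groupby groups equal adjacent characters.
runs_groupby = [''.join(group) for _, group in groupby(c)]

# Two separate lists, as originally wanted.
ones = [r for r in runs if r[0] == '1']
zeros = [r for r in runs if r[0] == '0']

# Alternative: split on the other digit and drop the empty strings
# that leading, trailing and consecutive delimiters produce.
ones_via_split = [p for p in c.split('0') if p]

print(runs)    # ['00', '1', '0', '1', '0', '1111', '00', '1', '0', '1']
print(ones)    # ['1', '1', '1111', '1', '1']
print(zeros)   # ['00', '0', '0', '00', '0']
```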
 
