Frankenstring


Thomas Lotze

Hi,

I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
...     print c
...     if c == "2":
...         break
...
0
1
2
...     print c
...
7
8
9
It's definitely no help that file-like objects are iterable; I do want
to get a character, not a complete line, at a time.

I can think of more than one clumsy way to implement the desired
behaviour in Python; I'd rather like to know whether there's an
implementation somewhere that does it fast. (Yes, it's me and speed
considerations again; this is for a tokenizer at the core of a library,
and I'd really like it to be fast.) I don't think there's anything like
it in the standard library, at least not anything that would be obvious
to me.

I don't care whether this is more of a string iterator with seeking and
telling, or a file-like object with a single-character iterator; as long
as it does both efficiently, I'm happy.
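A minimal sketch of the requested interface, in modern Python 3 syntax (the class name and details are illustrative, not an existing library):

```python
class FrankenString:
    """Character iterator over a string that also supports
    file-like seek() and tell().  Illustrative sketch only."""

    def __init__(self, s):
        self._s = s
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._s):
            raise StopIteration
        c = self._s[self._pos]
        self._pos += 1
        return c

    def seek(self, pos):
        self._pos = pos

    def tell(self):
        return self._pos


f = FrankenString("0123456789")
first = []
for c in f:
    first.append(c)
    if c == "2":
        break
f.seek(7)
rest = list(f)
print(first, rest)  # ['0', '1', '2'] ['7', '8', '9']
```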

I'd even consider writing such a beast in C, albeit more as a learning
exercise than as a worthwhile measure to speed up some code.

Thanks for any hints.
 

Mike C. Fletcher

Thomas said:
Hi,

I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
Okay, first off, this is never going to be *fast* compared to something
coded in C and wrapped with Python. You are dealing with every single
character as a Python object, so let's forget fast for the moment and do
a straightforward implementation:

class Franken( str ):
    frankenIndex = 0
    def __iter__( self ):
        while self.frankenIndex < len(self):
            yield self[ self.frankenIndex ]
            self.frankenIndex += 1
        self.frankenIndex = 0
    def seek( self, index ):
        self.frankenIndex = index
    def tell( self ):
        return self.frankenIndex

if __name__ == "__main__":
    f = Franken( 'abcdefg' )
    for c in f:
        print 'char', c
        if c == 'c':
            break
    f.seek( 5 )
    l1 = list( f )
    l2 = list( f )
    assert l1 == [ 'f','g' ]
    assert l2 == list(str(f))
    print 'first list', l1
    print 'second list', l2

If you want to speed it up, you can optimise for various string sizes
(eg using a slice of the string and the built-in iterator when
appropriate), but in the end, it's not going to be anywhere near as fast
as a C-engine tokeniser, so I'd personally spend more time on elegance
than on speed...
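That slice-based optimisation might look roughly like this (a sketch in modern Python 3; the class name is made up, and seeking only takes effect when a fresh iterator is started):

```python
class SliceFranken:
    """Seek/tell on top of the fast built-in str iterator.
    Sketch only: each call to __iter__ starts from the current position."""

    def __init__(self, s):
        self.s = s
        self.pos = 0

    def __iter__(self):
        # hand out characters from a slice; enumerate keeps tell() honest
        for i, c in enumerate(self.s[self.pos:], self.pos + 1):
            self.pos = i
            yield c

    def seek(self, pos):
        self.pos = pos

    def tell(self):
        return self.pos


f = SliceFranken("0123456789")
first = []
for c in f:
    first.append(c)
    if c == "2":
        break
f.seek(7)
rest = list(f)   # a fresh __iter__ picks up the new position
```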

Anywho, have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com
 

Bengt Richter

Hi,

I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:

...     print c
...     if c == "2":
...         break
...
0
1
2
...     print c
...
7
8
9

It's definitely no help that file-like objects are iterable; I do want
to get a character, not a complete line, at a time.

I can think of more than one clumsy way to implement the desired
behaviour in Python; I'd rather like to know whether there's an
implementation somewhere that does it fast. (Yes, it's me and speed
considerations again; this is for a tokenizer at the core of a library,
and I'd really like it to be fast.) I don't think there's anything like
it in the standard library, at least not anything that would be obvious
to me.

I don't care whether this is more of a string iterator with seeking and
telling, or a file-like object with a single-character iterator; as long
as it does both efficiently, I'm happy.

I'd even consider writing such a beast in C, albeit more as a learning
exercise than as a worthwhile measure to speed up some code.

Thanks for any hints.
I'd probably subclass file to buffer in good-sized chunks, override the
iteration to go character by character through the buffer (refilling the
buffer when you reach its end), and override seek and tell to do the right
thing with respect to the buffer and your position in it for the character
iteration via next.

Regards,
Bengt Richter
 

Thomas Lotze

jay said:
see StringIO or cStringIO in the standard library.

Just as with files, iterating over them returns whole lines, which is
unfortunately not what I want.
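For example (modern Python, where io.StringIO plays the role of cStringIO):

```python
import io

s = io.StringIO("ab\ncd\n")
line = next(iter(s))
print(repr(line))  # 'ab\n' -- a whole line, not a single character
```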
 

Scott David Daniels

Thomas said:
Hi,
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:



...     print c
...     if c == "2":
...         break
...
OK, frankenstring can be approximated by nothing:
for c in "0123456789":
    print c


Now if you want to do it for a file, you could do:

for c in thefile.read():
    ...

Or, if you did not want to do a full read:
def filechars(afile):
    for line in afile:
        for char in line:
            yield char

--Scott David Daniels
(e-mail address removed)
 
T

Thomas Lotze

Scott said:
Now if you want to do it for a file, you could do:

for c in thefile.read():
    ...

The whole point of the exercise is that seeking on a file doesn't
influence iteration over its content. In the loop you suggest, I can
seek() on thefile to my heart's content and will always get its content
iterated over exactly from beginning to end. It had been read before any
of this started, after all. Similarly, thefile.tell() will always tell me
thefile's size or the place I last seek()'ed to instead of the position of
the next char I will get.
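The effect can be demonstrated directly (a sketch in modern Python, with io.StringIO standing in for a real file):

```python
import io

thefile = io.StringIO("0123456789")
seen = []
for c in thefile.read():      # the whole content was read up front...
    seen.append(c)
    if c == "2":
        thefile.seek(7)       # ...so this seek cannot affect the loop

print("".join(seen))          # 0123456789, not 012789
print(thefile.tell())         # 7: where we seeked, not where the loop is
```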
 

Bengt Richter

The whole point of the exercise is that seeking on a file doesn't
influence iteration over its content. In the loop you suggest, I can
seek() on thefile to my heart's content and will always get its content
iterated over exactly from beginning to end. It had been read before any
of this started, after all. Similarly, thefile.tell() will always tell me
thefile's size or the place I last seek()'ed to instead of the position of
the next char I will get.
What I suggested in my other post (untested beyond what you see, so you
may want to add to the test):

----< lotzefile.py >--------------------------------------------------
class LotzeFile(file):
    BUFSIZE = 4096
    def __init__(self, path, mode='r'):
        self.f = file(path, mode)
        self.pos = self.bufbase = 0
        self.buf = ''
    def __iter__(self): return self
    def next(self):
        if not self.buf[self.pos:]:
            self.bufbase += len(self.buf)
            self.pos = 0
            self.buf = self.f.read(self.BUFSIZE)
            if not self.buf:
                self.close()
                raise StopIteration
        byte = self.buf[self.pos]
        self.pos += 1
        return byte
    def seek(self, pos, ref=0):
        self.f.seek(pos, ref)
        self.bufbase = self.f.tell()
        self.pos = 0
        self.buf = ''
    def tell(self):
        return self.bufbase + self.pos
    def close(self):
        self.f.close()

def test():
    f = file('lotzedata.txt', 'w')
    for s in (' %3d' % i for i in xrange(1000)): f.write(s)
    f.close()

    it = iter(LotzeFile('lotzedata.txt'))

    hold4 = [0, 0, 0, 0]
    for i, c in enumerate(it):
        hold4[i%4] = c
        if i%4 == 3:
            print hold4
            assert (i-3)/4 == int(''.join(hold4))
        if i == 99: break
    print it.tell()
    it.seek(52)
    for i in xrange(8): print it.next(),
    print
    it.seek(990*4)
    for c in it: print c,

if __name__ == '__main__':
    test()
----------------------------------------------------------------------

Result:

[20:53] C:\pywk\clp>py24 lotze.py
[' ', ' ', ' ', '0']
[' ', ' ', ' ', '1']
[' ', ' ', ' ', '2']
[' ', ' ', ' ', '3']
[' ', ' ', ' ', '4']
[' ', ' ', ' ', '5']
[' ', ' ', ' ', '6']
[' ', ' ', ' ', '7']
[' ', ' ', ' ', '8']
[' ', ' ', ' ', '9']
[' ', ' ', '1', '0']
[' ', ' ', '1', '1']
[' ', ' ', '1', '2']
[' ', ' ', '1', '3']
[' ', ' ', '1', '4']
[' ', ' ', '1', '5']
[' ', ' ', '1', '6']
[' ', ' ', '1', '7']
[' ', ' ', '1', '8']
[' ', ' ', '1', '9']
[' ', ' ', '2', '0']
[' ', ' ', '2', '1']
[' ', ' ', '2', '2']
[' ', ' ', '2', '3']
[' ', ' ', '2', '4']
100
1 3 1 4
9 9 0 9 9 1 9 9 2 9 9 3 9 9 4 9 9 5 9 9 6 9 9 7 9 9 8 9 9 9

I suspect you could get better performance if you made LotzeFile instances able to
return iterators over buffer chunks and get characters from them, which would
be string iterators supplying the characters rather than the custom .next, but
the buffer chunks would have to be of some size to make that pay. Testing is
the only way to find out what the crossing point is, if you really have to.
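That chunk-iterator idea might be sketched like this in modern Python (names are illustrative; note that an accurate tell() is exactly what this approach gives up, since characters already handed to the string iterator are no longer counted):

```python
import io

def chunk_chars(f, bufsize=4096):
    """Yield characters from a file-like object by reading good-sized
    chunks and letting the fast built-in string iterator hand them out."""
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            return
        yield from chunk


chars = list(chunk_chars(io.StringIO("0123456789"), bufsize=4))
```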

Regards,
Bengt Richter
 

Roland Heiber

Thomas said:
It's definitely no help that file-like objects are iterable; I do want
to get a character, not a complete line, at a time.

Hi,

if I understood what you mean, what about using mmap? Iterating over the
characters in a file would look like this:

# -*- coding: iso-8859-1 -*-
import os
import mmap

f = open("file.txt", "r+")
size = os.path.getsize("file.txt")
m = mmap.mmap(f.fileno(), size)

for x in m:
    print x

m.close()
f.close()
 

Peter Otten

Thomas said:
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
...     def next(self):
...         c = self.read(1)
...         if not c:
...             raise StopIteration
...         return c
...
...     print c
...     if c == "2":
...         break
...
0
1
2
...     print c
...
7
8
9
A non-intrusive alternative:
...     def char(): return instream.read(1)
...     return iter(char, "")
...
...     print c
...     if c == "2": break
...
0
1
2
3

Performance is probably not so good, but if you really want to do it in C,
with cStringIO you might be /almost/ there.

Peter
 

Thomas Lotze

Roland said:
if i did understand what you mean, what about using mmap?

AIUI (and as a little experimenting seems to confirm), you can't
reposition an iterator over an mmap'ed file by seeking. True, you have
both iterating by characters and seeking/telling, but the two
functionalities don't play together.
 

Thomas Lotze

Bengt said:
----< lotzefile.py >--------------------------------------------------
Thanks.

[...]
byte = self.buf[self.pos]

This is the place where the thing is basically a str whose items are
accessed as sequence elements. It has some iterator behaviour and file
management which makes it nice to use, of course, and to most this will
be enough (and it is a lot indeed). But it loses the efficiency of

for c in "asdf": do_something(c)

Actually, relying on string[index] behind the scenes is one of the ways
of implementing frankenstring I labelled "clumsy" in the original
posting ;o)
I suspect you could get better performance if you made LotzeFile instances
able to return iterators over buffer chunks and get characters from them,
which would be string iterators supplying the characters rather than the
custom .next, but the buffer chunks would have to be of some size to make
that pay. Testing is the only way to find out what the crossing point is,
if you really have to.

If I understand this correctly, you'd have to switch to using a new iterator
after seeking, which would make this impossible:

f = LotzeFile('something')
for c in iter(f):
    do_something(c)
    if some_condition:
        f.seek(somewhere)
        # the next iteration reads from the new position

And it would break telling since the class can't know how many
characters have been read from an iterator once it returned one after
seeking or switching to another buffer chunk.
 

Thomas Lotze

Peter said:
...     def next(self):
...         c = self.read(1)
...         if not c:
...             raise StopIteration
...         return c

Repeated read(1) on a file-like object is one of the ways of doing it with
existing tools I labelled "clumsy" in the original posting ;o)

Thanks anyway.
 

Roland Heiber

Thomas said:
AIUI (and as a little experimenting seems to confirm), you can't
reposition an iterator over an mmap'ed file by seeking. True, you have
both iterating by characters and seeking/telling, but the two
functionalities don't play together.

A quick and dirty hack!? Maybe I'm missing what it is that you need ...

class MmapWithSeekAndTell(object):
    def __init__(self, m, size):
        self._m = m
        self._pos = 0
        self._size = size-1

    def __iter__(self):
        return self

    def next(self):
        if self._pos <= self._size:
            char = self._m[self._pos]
            self._pos += 1
            return char
        raise StopIteration

    def seek(self, offset, whence=0):
        if whence == 0 and 0 <= offset < self._size:
            self._pos = offset
        elif whence == 1 and 0 <= (self._pos + offset) < self._size:
            self._pos += offset
        elif whence == 2 and 0 <= (self._size - offset) < self._size:
            self._pos = self._size - offset

    def tell(self):
        return self._pos


HtH, Roland
 

Andreas Lobinger

Aloha,

Thomas said:
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
...     print c
...     if c == "2":
...         break
...
0
1
2
...     print c
...
7
8
9
I can think of more than one clumsy way to implement the desired
behaviour in Python; I'd rather like to know whether there's an
implementation somewhere that does it fast. (Yes, it's me and speed
considerations again; this is for a tokenizer at the core of a library,
and I'd really like it to be fast.)

You can probably already guess my answer, because I'm doing this
at the core of a similar library, but I'll spell it out to give others
the chance to discuss.
>>> f = "0123456789"
>>> p = 0
>>> t2 = f.find('2')+1
>>> for c in f[p:t2]:
...     print c
...
0
1
2
...     print c
...
7
8
9

A string, and a pointer into that string. If you give up the
requirement to tell() backwards, you can start to eat up
the string via f = f[p:]. There was a performance difference
with that; in fact it was ~4% faster on Python 2.2.

I don't expect any iterator solution to be faster than
that.
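That string-plus-pointer tokenizing style can be sketched as a generator (modern Python; the function name and the "token ends at the separator" convention are made up for illustration). str.find, which runs in C, moves the pointer, and the token comes out as one slice:

```python
def split_after(s, ch):
    """Yield slices of s, each ending just past an occurrence of ch,
    plus any trailing remainder.  Illustrative tokenizer skeleton."""
    p = 0
    while True:
        t = s.find(ch, p)
        if t == -1:
            if p < len(s):
                yield s[p:]
            return
        yield s[p:t + 1]
        p = t + 1


print(list(split_after("0123456789", "2")))  # ['012', '3456789']
```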

Wishing a happy day
LOBI
 

Peter Otten

Thomas said:
Repeated read(1) on a file-like object is one of the ways of doing it with
existing tools I labelled "clumsy" in the original posting ;o)

Not clumsy, just slow. I hope you'll let us know how much faster your final
approach turns out to be. By the way, I'll consider anything that doesn't
implement seek() and tell() cheating :)

Peter
 

Thomas Lotze

This is indeed faster than going through a string char by char. It doesn't
make for a nice character-based state machine, but of course it avoids
making Python objects for every character and uses the C implementation of
str for searching.

However, it's only fine if you are looking for single characters. As soon
as you're looking for classes of characters, you need the (slower) regex
machinery (as you well know, but for the sake of discussion...).
A string, and a pointer into that string. If you give up the requirement
to tell() backwards, you can start to eat up the string via f =
f[p:]. There was a performance difference with that; in fact it was ~4%
faster on Python 2.2.

When I tried it just now, it was the other way around. Eating up the
string was slower, which makes sense to me since it involves creating new
string objects all the time.
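The two variants under discussion, sketched side by side (modern Python): the eat-up version builds a brand-new string object on every step, which is where the extra cost comes from:

```python
def chars_pointer(s):
    # keep the string whole and advance an index into it
    out, p = [], 0
    while p < len(s):
        out.append(s[p])
        p += 1
    return out

def chars_eatup(s):
    # slice the consumed prefix off: each s = s[1:] allocates a new string
    out = []
    while s:
        out.append(s[0])
        s = s[1:]
    return out
```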
I don't expect any iterator solution to be faster than that.

It's not so much an issue of iterators, but handling Python objects
for every char. Iterators would actually be quite helpful for searching: I
wonder why there doesn't seem to be an str.iterfind or str.itersplit
thing. And I wonder whether there shouldn't be str.findany and
str.iterfindany, which takes a sequence as an argument and returns the
next match on any element of it.
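A hypothetical str.iterfind is easy to sketch on top of repeated str.find calls (this is not a real str method; the name just mirrors the suggestion above):

```python
def iterfind(s, sub):
    """Yield the offset of each non-overlapping occurrence of sub in s."""
    pos = s.find(sub)
    while pos != -1:
        yield pos
        pos = s.find(sub, pos + len(sub))


print(list(iterfind("754356184756046104564", "5")))  # [1, 4, 10, 18]
```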
 

Thomas Lotze

Peter said:
Not clumsy, just slow.

As you wish ;o) I didn't mean clumsy as in "clumsy-looking Python code"
anyway, rather as in "clumsy to use the Python machinery for operations
that are straightforward and efficient in C, in which language str and
cStringIO are implemented already".
I hope you'll let us know how much faster your
final approach turns out to be.

I'm pretty convinced that implementing an algorithmically nice state
machine that goes through a string char by char won't get any faster than
using s[index] all the time unless I do a frankenstring in C. Failing
that, a more pragmatic approach is what Andreas suggests; see the other
subthread.
By the way, I'll consider anything that
doesn't implement seek() and tell() cheating :)

An implementation of frankenstring would have to have seek and tell;
that's the point of doing it. But for half-way simple state machines,
hiding the index handling in a Python class that slows things down is just
not worth it. Doing index += 1 here and there is fine if it happens only
half a dozen times. I know it's not beautiful; that's why I started this
thread ;o)
 

Andreas Lobinger

Aloha,

Thomas said:
A string, and a pointer into that string. If you give up the requirement
to tell() backwards, you can start to eat up the string via f =
f[p:]. There was a performance difference with that; in fact it was ~4%
faster on Python 2.2.
When I tried it just now, it was the other way around. Eating up the
string was slower, which makes sense to me since it involves creating new
string objects all the time.

I expected the f[p:] variant to be slower, too; the 4% I only measured on one
platform. Most probably the GC and memory management aren't the same.
It's not so much an issue of iterators, but handling Python objects
for every char. Iterators would actually be quite helpful for searching: I
wonder why there doesn't seem to be an str.iterfind or str.itersplit
thing. And I wonder whether there shouldn't be str.findany and
str.iterfindany, which takes a sequence as an argument and returns the
next match on any element of it.

There is a finditer in the re module. I'm currently rewriting a few pattern
matching things and find it quite valuable.
>>> import re
>>> pat = re.compile('[57]')
>>> f = "754356184756046104564"
>>> for a in pat.finditer(f):
...     print a.start(), f[a.start()]
...
0 7
1 5
4 5
9 7
10 5
18 5

Wishing a happy day
LOBI
 
