Frankenstring

Thomas Lotze

Thomas said:
And I wonder whether there shouldn't be str.findany and
str.iterfindany, which take a sequence as an argument and return the
next match on any element of it.

On second thought, that wouldn't gain much over a loop that searches for
each sequence in turn, but would add more complexity than it is worth.
What would be more useful, especially thinking of a C implementation, is
str.findanyof and str.findnoneof. They take a string as an argument and
find the first occurrence of any char in that string or of any char not
in that string, respectively. Finding any char not among a given few in
particular needs a hoop to jump through now, if I didn't miss anything.
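
For illustration, the behaviour described above can be approximated with the re module. This is only a minimal sketch, assuming Python 3: find_any_of and find_none_of are hypothetical module-level stand-ins for the proposed methods (they are not existing str methods), and returning -1 when nothing is found is an assumption modelled on str.find.

```python
import re


def find_any_of(text, chars, start=0):
    """Index of the first character at or after start that is in chars, else -1."""
    if not chars:
        return -1  # an empty set of characters can never match
    # re.escape keeps characters such as ']' or '-' literal inside the class.
    match = re.compile("[" + re.escape(chars) + "]").search(text, start)
    return match.start() if match else -1


def find_none_of(text, chars, start=0):
    """Index of the first character at or after start that is NOT in chars, else -1."""
    if not chars:
        # Nothing is excluded, so the character at start (if any) qualifies.
        return start if start < len(text) else -1
    match = re.compile("[^" + re.escape(chars) + "]").search(text, start)
    return match.start() if match else -1


space_index = find_any_of("hello world", " ,")
first_non_blank = find_none_of("   abc", " ")
```

Building a character class is the "hoop" in question: it works, but the set has to be escaped and compiled before it can be searched for.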
 
ucntcme

Just as with files, iterating over them returns whole lines, which is
unfortunately not what I want.


Then why not subclass it and alter the iteration scheme to do a read(1)
or something?

from StringIO import StringIO


class FrankenString(StringIO):
    lastidx = 0
    atEnd = False

    def __iter__(self):
        while not self.atEnd:
            char = self.read(1)
            idx = self.tell()
            if self.lastidx == idx:
                self.atEnd = True
            self.lastidx = idx
            yield char
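
For readers on Python 3, where the StringIO module became io.StringIO, the same idea might be sketched as follows. The class name CharStream is hypothetical. Unlike the class above, this sketch keeps no atEnd flag, so iteration can resume after a seek(), and it stops without yielding a trailing empty string.

```python
import io


class CharStream(io.StringIO):
    """Hypothetical name: a StringIO whose iterator yields single characters."""

    def __iter__(self):
        return self

    def __next__(self):
        # read(1) returns '' at the end of the buffer, which ends the iteration.
        char = self.read(1)
        if not char:
            raise StopIteration
        return char


stream = CharStream("abc")
first = next(stream)        # consume one character
position = stream.tell()    # remember where we are
rest = list(stream)         # consume the remainder
stream.seek(position)       # go back to the remembered position
again = list(stream)        # the remainder is available again
```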
 
ucntcme

Here is a cStringIO based version:
from cStringIO import StringIO


class FrankenString:
    def __init__(self, string=None):
        self.str = StringIO(string)
        self.atEnd = False
        self.lastidx = 0
        self.seek = self.str.seek
        self.tell = self.str.tell

    def __iter__(self):
        while not self.atEnd:
            char = self.str.read(1)
            idx = self.str.tell()
            if self.lastidx == idx:
                self.atEnd = True
            self.lastidx = idx
            yield char

On a string 1024*1024 characters long, the StringIO version takes 10 s
to iterate over it while doing nothing in the loop, whereas this one
takes 3.1 s on my system. Those figures include creating the string,
creating the instance and looping over it. But since each variant is
doing the same thing I figure it's even. ;)
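
A measurement of this kind could be reproduced roughly as below. This is only a sketch, assuming Python 3 and io.StringIO rather than the Python 2 modules used in this thread; the helper name time_char_iteration is hypothetical, and the numbers it produces will not be comparable to the ones quoted above.

```python
import io
import time


def time_char_iteration(text):
    """Read text one character at a time; return (characters read, seconds taken)."""
    stream = io.StringIO(text)
    started = time.perf_counter()
    count = 0
    while stream.read(1):  # '' at the end of the buffer stops the loop
        count += 1
    elapsed = time.perf_counter() - started
    return count, elapsed


# A smaller string than 1024*1024 keeps the example quick to run.
count, elapsed = time_char_iteration("x" * (64 * 1024))
print("read %d characters in %.3f s" % (count, elapsed))
```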
 
ucntcme

Well, that appears to have been munged up...
The tell() belongs immediately after "self.str.", making it
self.str.tell().


class FrankenString:
    def __init__(self, string=None):
        self.str = StringIO(string)
        self.atEnd = False
        self.lastidx = 0
        self.seek = self.str.seek
        self.tell = self.str.tell

    def __iter__(self):
        while not self.atEnd:
            char = self.str.read(1)
            idx = self.str.tell()
            if self.lastidx == idx:
                self.atEnd = True
            self.lastidx = idx
            yield char
 
Thomas Lotze

Peter said:
I hope you'll let us know how much faster your
final approach turns out to be

OK, here's a short report on the current state. Such code as there is can
be found at <http://svn.thomas-lotze.de/PyASDF/pyasdf/_frankenstring.c>,
with a Python mock-up in the same directory.

Thinking about it (Andreas, thank you for the reminder :o)), doing
character-by-character scanning in Python is stupid, both in terms of
speed and, given some more search capabilities than str currently has,
elegance.

So what I have done so far (apart from working my way into writing
extensions in C) is give the evolving FrankenString some search methods
that find the first occurrence in the string of any character out of a
set of characters given as a string, or of any character not in such a
set. This has nothing to do yet with iterators and seeking/telling.

Just letting C do the "while data[index] not in whitespace: index += 1"
part speeds up my PDF tokenizer by a factor between 3 and 4. I have
never compared that directly to using regular expressions, though... As
a bonus, even with this minor addition the Python code looks a little
cleaner already:

c = data[cursor]

while c in whitespace:
    # Whitespace tokens.
    cursor += 1

    if c == '%':
        # We're just inside a comment, read beyond EOL.
        while data[cursor] not in "\r\n":
            cursor += 1
        cursor += 1

    c = data[cursor]

becomes

cursor = data.skipany(whitespace, start)
c = data[cursor]

while c == '%':
    # Whitespace tokens: comments till EOL and whitespace.
    cursor = data.skipother("\r\n", cursor)
    cursor = data.skipany(whitespace, cursor)
    c = data[cursor]

(removing '%' from the whitespace string, in case you wonder).
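
To make the intended semantics of the two methods concrete, here is a rough pure-Python stand-in. It is a sketch under assumptions, not the C extension or the mock-up linked above: the class name ScanString is hypothetical, and returning len(data) when the scan runs off the end of the string is a guess at the behaviour.

```python
class ScanString(str):
    """Hypothetical stand-in for the search methods described above."""

    def skipany(self, chars, start=0):
        """Index of the first character at or after start that is NOT in chars."""
        index = start
        end = len(self)
        while index < end and self[index] in chars:
            index += 1
        return index  # assumption: len(self) if only chars remain

    def skipother(self, chars, start=0):
        """Index of the first character at or after start that IS in chars."""
        index = start
        end = len(self)
        while index < end and self[index] not in chars:
            index += 1
        return index  # assumption: len(self) if none of chars occurs


whitespace = " \t\r\n"  # '%' is deliberately not part of this set
data = ScanString("  % a comment\r\n  /Name")

cursor = data.skipany(whitespace, 0)
while cursor < len(data) and data[cursor] == '%':
    # Skip the comment up to the end of the line, then any whitespace after it.
    cursor = data.skipother("\r\n", cursor)
    cursor = data.skipany(whitespace, cursor)
token_start = cursor
```

In this sketch the loops still run in Python, so it shows only the interface; the speed-up reported above comes from doing the same scan in C.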

The next thing to do is make FrankenString behave. Right now there's too
much copying of string content going on every time a FrankenString is
initialized; I'd like it to share string content with other
FrankenStrings or strs much like cStringIO does. I hope it's just a
matter of learning from cStringIO. To justify the "franken" part of the
name some more, I am considering mixing in yet another ingredient and
making the thing behave like a buffer, in that it should be possible to
make a FrankenString from only part of a string without copying data.
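
As an illustration of what buffer-like sharing means, the built-in memoryview in Python 3 gives a view onto part of a bytes object without copying it. This is only an analogy under assumptions (Python 3, bytes rather than str, made-up sample data); it says nothing about how FrankenString itself will be implemented.

```python
data = b"%PDF-1.4 example"  # made-up sample data

view = memoryview(data)     # no copy: a view onto the bytes object
version = view[5:8]         # slicing a view does not copy either

# The slice still refers to the original object's storage.
shares_storage = version.obj is data

# A copy is made only when one is asked for explicitly.
version_bytes = version.tobytes()
```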

After that, the thing about seeking and telling iterators over
characters or search results comes in. I don't think it will make much
difference in performance now that the stupid character searching has
been done in C, but it'll hopefully make for more elegant Python code.
 
