Frankenstring

Thomas Lotze · Jul 14, 2005

Thomas said:
And I wonder whether there shouldn't be str.findany and
str.iterfindany, which takes a sequence as an argument and returns the
next match on any element of it.

On second thought, that wouldn't gain much over a loop that searches for
each sequence in turn, but would add more complexity than it is worth.
What would be more useful, especially thinking of a C implementation, is
str.findanyof and str.findnoneof. They take a string as an argument and
find the first occurrence of any char in that string or any char not in
that string, respectively. Especially finding any char not among a given
few needs a hoop to jump through now, if I didn't miss anything.
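
For illustration, here is a minimal pure-Python sketch of what the two proposed methods might do. The names findanyof and findnoneof come from the post; everything else is an assumption: they are written as free functions rather than str methods, and the optional start argument and the -1 return value (borrowed from str.find) are guesses.

```python
def findanyof(s, chars, start=0):
    """Index of the first character of s, at or after start, that is in chars.

    Returns -1 if there is none (assumed, by analogy with str.find).
    """
    wanted = set(chars)
    for i in range(start, len(s)):
        if s[i] in wanted:
            return i
    return -1


def findnoneof(s, chars, start=0):
    """Index of the first character of s, at or after start, that is NOT in chars.

    Returns -1 if every remaining character is in chars.
    """
    unwanted = set(chars)
    for i in range(start, len(s)):
        if s[i] not in unwanted:
            return i
    return -1


sep = findanyof("key = value", "=:")       # 4, the position of '='
indent = findnoneof("   indented", " ")    # 3, the first non-space character
```

A loop like this is exactly the per-character work that the post suggests moving into C.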

ucntcme · Jul 14, 2005

Just as with files, iterating over them returns whole lines, which is
unfortunately not what I want.

Then why not subclass it and alter the iteration scheme to do a read(1)
or something?

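from StringIO import StringIO  # Python 2 module

class FrankenString(StringIO):
    lastidx = 0
    atEnd = False

    def __iter__(self):
        # Yield one character at a time instead of whole lines.
        while not self.atEnd:
            char = self.read(1)
            idx = self.tell()
            if self.lastidx == idx:
                # The position did not advance, so we are at the end;
                # stop here rather than yielding an empty string.
                self.atEnd = True
                break
            self.lastidx = idx
            yield char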
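
The same idea can be sketched without a subclass on a current Python. This is not the code from the post: it assumes Python 3's io.StringIO and uses the two-argument form of iter(), which keeps calling read(1) until it returns the empty string. The helper name iter_chars is made up for the example.

```python
import io
from functools import partial


def iter_chars(stream):
    """Return an iterator that calls stream.read(1) until it returns ''."""
    return iter(partial(stream.read, 1), "")


buf = io.StringIO("spam")
chars = iter_chars(buf)
first = next(chars)   # 's'
buf.seek(3)           # the iterator follows the stream position
rest = list(chars)    # ['m']
```

Because the iterator holds no position of its own, seek() and tell() on the underlying stream keep working while iterating.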

ucntcme · Jul 14, 2005

Here is a cStringIO based version:
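from cStringIO import StringIO  # Python 2 module

class FrankenString:
    # cStringIO.StringIO is a factory function and cannot be subclassed,
    # so wrap an instance and delegate to it.
    def __init__(self, string=None):
        self.str = StringIO(string)
        self.atEnd = False
        self.lastidx = 0
        self.seek = self.str.seek
        self.tell = self.str.tell

    def __iter__(self):
        while not self.atEnd:
            char = self.str.read(1)
            idx = self.str.tell()
            if self.lastidx == idx:
                # The position did not advance, so we are at the end.
                self.atEnd = True
                break
            self.lastidx = idx
            yield char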

On a string 1024*1024 characters long, the StringIO version takes 10 s
to iterate over it while doing nothing in the loop, whereas this one
takes 3.1 s on my system. Those times include creating the string,
creating the instance and looping over it. But since each variant is
doing the same thing I figure it's even.
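
To repeat this kind of measurement on a current Python, a rough sketch could look like the following. It is not the benchmark from the post: it assumes Python 3's io.StringIO instead of the StringIO and cStringIO modules, uses a shorter string to keep the run brief, and the function names are invented for the example. No timings are claimed here; they depend on the machine.

```python
import io
import timeit


def count_via_read(text):
    """Walk the text one character at a time with read(1) calls."""
    buf = io.StringIO(text)
    n = 0
    while buf.read(1):
        n += 1
    return n


def count_via_str(text):
    """Walk the text with the built-in str iterator, for comparison."""
    n = 0
    for _ in text:
        n += 1
    return n


text = "x" * (64 * 1024)  # the post used 1024*1024 characters
t_read = timeit.timeit(lambda: count_via_read(text), number=3)
t_str = timeit.timeit(lambda: count_via_str(text), number=3)
print("read(1): %.3f s   str iteration: %.3f s" % (t_read, t_str))
```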

ucntcme · Jul 14, 2005

Well, that appears to have been munged up...
The tell() belongs immediately after "self.str.", making it
self.str.tell():

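from cStringIO import StringIO  # Python 2 module

class FrankenString:
    # cStringIO.StringIO is a factory function and cannot be subclassed,
    # so wrap an instance and delegate to it.
    def __init__(self, string=None):
        self.str = StringIO(string)
        self.atEnd = False
        self.lastidx = 0
        self.seek = self.str.seek
        self.tell = self.str.tell

    def __iter__(self):
        while not self.atEnd:
            char = self.str.read(1)
            idx = self.str.tell()
            if self.lastidx == idx:
                # The position did not advance, so we are at the end.
                self.atEnd = True
                break
            self.lastidx = idx
            yield char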

Thomas Lotze · Jul 18, 2005

Peter said:
I hope you'll let us know how much faster your
final approach turns out to be

OK, here's a short report on the current state. Such code as there is can
be found at <http://svn.thomas-lotze.de/PyASDF/pyasdf/_frankenstring.c>,
with a Python mock-up in the same directory.

Thinking about it (Andreas, thank you for the reminder), doing
character-by-character scanning in Python is stupid, both in terms of
speed and, given some more search capabilities than str currently has,
elegance.

So what I have done so far (apart from working my way into writing
extensions in C) is give the evolving FrankenString some search methods.
They find the first occurrence in the string of any character out of a
set of characters given as a string, or of any character not in such a
set. This has nothing to do yet with iterators and seeking/telling.

Just letting C do the "while data[index] not in whitespace: index += 1"
part speeds up my PDF tokenizer by a factor between 3 and 4. I have
never compared that directly to using regular expressions, though... As
a bonus, even with this minor addition the Python code looks a little
cleaner already:

c = data[cursor]

while c in whitespace:
    # Whitespace tokens.
    cursor += 1

    if c == '%':
        # We're just inside a comment, read beyond EOL.
        while data[cursor] not in "\r\n":
            cursor += 1
        cursor += 1

    c = data[cursor]

becomes

cursor = data.skipany(whitespace, start)
c = data[cursor]

while c == '%':
    # Whitespace tokens: comments till EOL and whitespace.
    cursor = data.skipother("\r\n", cursor)
    cursor = data.skipany(whitespace, cursor)
    c = data[cursor]

(removing '%' from the whitespace string, in case you wonder).
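
For readers without the C extension, the behaviour of the two methods can be approximated in plain Python. This is only a guess at the semantics based on how they are used above, not the mock-up from the repository: the functions are free-standing instead of methods, returning len(data) when nothing is found is an assumption, and the whitespace constant and the helper skip_whitespace_and_comments are invented for the example.

```python
whitespace = " \t\r\n\f"  # assumed set; '%' is deliberately not in it


def skipany(data, chars, start=0):
    """Skip characters that are in chars; return the index of the first one that is not."""
    i = start
    while i < len(data) and data[i] in chars:
        i += 1
    return i


def skipother(data, chars, start=0):
    """Skip characters that are NOT in chars; return the index of the first one that is."""
    i = start
    while i < len(data) and data[i] not in chars:
        i += 1
    return i


def skip_whitespace_and_comments(data, start=0):
    """Mirror the rewritten loop above, with an added end-of-data check."""
    cursor = skipany(data, whitespace, start)
    while cursor < len(data) and data[cursor] == "%":
        cursor = skipother(data, "\r\n", cursor)   # run to the end of the comment line
        cursor = skipany(data, whitespace, cursor) # then past the EOL and any whitespace
    return cursor


data = "  % a comment\n  /Name"
cursor = skip_whitespace_and_comments(data)  # index of '/'
```

The speedup reported in the post comes from running these inner while loops in C; the Python version only shows what they compute.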

The next thing to do is make FrankenString behave. Right now there's too
much copying of string content going on every time a FrankenString is
initialized; I'd like it to share string content with other
FrankenStrings or strs much like cStringIO does. I hope it's just a
matter of learning from cStringIO. To justify the "franken" part of the
name some more, I'm considering mixing in yet another ingredient and
making the thing behave like a buffer, in that it should be possible to
make a FrankenString from only part of a string without copying data.
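
As a rough analogy for the "part of a string without copying" idea, current Python offers memoryview, which plays the role the buffer type had in Python 2. The sketch below is not FrankenString and assumes Python 3 with bytes data, since memoryview does not work on str; the sample data is made up.

```python
data = b"%PDF-1.4 some bytes of a larger document"

view = memoryview(data)  # wraps the bytes object without copying it
part = view[5:8]         # slicing a memoryview does not copy either

backing = part.obj       # still the original bytes object
version = part.tobytes() # a copy is made only here: b"1.4"
```

A type that shares content this way has to keep the original object alive for as long as any view onto it exists, which memoryview does through its obj reference.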

After that, the thing about seeking and telling iterators over
characters or search results comes in. I don't think it will make much
difference in performance now that the stupid character searching has
been done in C, but it'll hopefully make for more elegant Python code.
