Enumerating formatting strings

Steve Holden

I was messing about with formatting and realized that the right kind of
object could quite easily tell me exactly what accesses are made to the
mapping in a string % mapping operation. This is a fairly well-known
technique, modified to tell me what keys would need to be present in any
mapping used with the format.

class Everything:
    def __init__(self, format="%s", discover=False):
        self.names = {}
        self.values = []
        self.format=format
        self.discover = discover
    def __getitem__(self, key):
        x = self.format % key
        if self.discover:
            self.names[key] = self.names.get(key, 0) + 1
        return x
    def nameList(self):
        if self.names:
            return ["%-20s %d" % i for i in self.names.items()]
        else:
            return self.values
    def __getattr__(self, name):
        print "Attribute", name, "requested"
        return None
    def __repr__(self):
        return "<Everything object at 0x%x>" % id(self)

def nameCount(template):
    et = Everything(discover=True)
    p = template % et
    nlst = et.nameList()
    nlst.sort()
    return nlst

for s in nameCount("%(name)s %(value)s %(name)s"):
    print s

The result of this effort is:

name                 2
value                1
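
For anyone trying this on Python 3, here is a minimal sketch of the same key-discovery trick. KeyRecorder and name_count are names made up for this sketch, not part of the code above. Returning 0 for every key is an assumption, chosen so that numeric conversions such as %d and %c succeed as well as %s.

```python
from collections import Counter


class KeyRecorder:
    """Hypothetical stand-in for a mapping: records each key a %-format requests."""

    def __init__(self):
        self.names = Counter()

    def __getitem__(self, key):
        self.names[key] += 1
        # Assumption: 0 is acceptable to every conversion type (%s, %d, %f, %c, ...).
        return 0


def name_count(template):
    """Return sorted (key, count) pairs for the keys template % mapping looks up."""
    recorder = KeyRecorder()
    template % recorder  # the formatted result is discarded; only the lookups matter
    return sorted(recorder.names.items())


for name, count in name_count("%(name)s %(value)s %(name)s"):
    print("%-20s %d" % (name, count))
```

The original returns a string from __getitem__, which is fine for %s but would fail if the template used a numeric conversion for a named key.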

I've been wondering whether it's possible to perform a similar analysis
on non-mapping-type format strings, so as to know how long a tuple to
provide, or whether I'd be forced to lexical analysis of the form string.

regards
Steve
 
Bengt Richter

Steve said:
I was messing about with formatting and realized that the right kind of
object could quite easily tell me exactly what accesses are made to the
mapping in a string % mapping operation. This is a fairly well-known
technique, modified to tell me what keys would need to be present in any
mapping used with the format.

I've been wondering whether it's possible to perform a similar analysis
on non-mapping-type format strings, so as to know how long a tuple to
provide, or whether I'd be forced to lexical analysis of the form string.

When I was playing with formatstring % mapping I thought it could
be useful if you could get the full format specifier info and do your own
complete formatting, even for invented format specifiers. This could be
done without breaking backwards compatibility if str.__mod__ looked for
a __format__ method on the otherwise mapping-or-tuple object. If found,
it would call the method, which would expect

def __format__(self,
        ix,     # index from 0 counting every %... format
        name,   # from %(name) or ''
        width,  # from %width.prec
        prec,   # ditto
        fc,     # the format character F in %(x)F
        all     # just a copy of whatever is between % and including F
    ): ...

This would obviously let you handle non-mapping as you want, and more.

The most popular use would probably be intercepting width in %(name)<width>s
and doing custom formatting (e.g. centering in available space) for the object
and returning the right size string.

Since ix is an integer and doesn't help find the right object without the normal
tuple, you could give your formatting object's __init__ method keyword arguments
to specify arguments for anonymous slots in the format string, conventionally
naming them a0, a1, a2 etc. Then later when you get an ix with no name, you could
write self.kw.get('a%s'%ix) to get the value, as in use like

'%(name)s %s' % Formatter(a1=thevalue) # Formatter as base class knows how to do name lookup

Or is this just idearrhea?

Regards,
Bengt Richter
 
Peter Otten

Steve said:
I was messing about with formatting and realized that the right kind of
object could quite easily tell me exactly what accesses are made to the
mapping in a string % mapping operation. This is a fairly well-known
technique, modified to tell me what keys would need to be present in any
mapping used with the format.
....

I've been wondering whether it's possible to perform a similar analysis
on non-mapping-type format strings, so as to know how long a tuple to
provide, or whether I'd be forced to lexical analysis of the form string.

PyString_Format() in stringobject.c determines the tuple length, then starts
the formatting process and finally checks whether all items were used -- so
no, it's not possible to feed it a tweaked (auto-growing) tuple like you
did with the dictionary.

Here's a brute-force equivalent to nameCount(), inspired by a post by Hans
Nowak (http://mail.python.org/pipermail/python-list/2004-July/230392.html).

def countArgs(format):
    args = (1,) * (format.count("%") - 2*format.count("%%"))
    while True:
        try:
            format % args
        except TypeError, e:
            args += (1,)
        else:
            return len(args)

samples = [
    ("", 0),
    ("%%", 0),
    ("%s", 1),
    ("%%%s", 1),
    ("%%%*.*d", 3),
    ("%%%%%*s", 2),
    ("%s %*s %*d %*f", 7)]
for f, n in samples:
    f % ((1,)*n)
    assert countArgs(f) == n

Not tested beyond what you see.
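
A rough Python 3 sketch of the same brute-force idea, with one change that is not in the code above: an upper bound on the loop. The name count_args and the bound len(fmt) are assumptions of this sketch. The bound relies on each argument consuming at least one character of the format, so a mapping-style format such as "%(a)s", which raises TypeError for every tuple, ends in a ValueError instead of looping forever.

```python
def count_args(fmt):
    """Return how many positional arguments fmt % tuple needs (brute force)."""
    # Lower bound: every '%' that is not part of a literal '%%'.
    n = fmt.count("%") - 2 * fmt.count("%%")
    # Upper bound (assumption of this sketch): each argument uses at least
    # one character of fmt, so more than len(fmt) arguments can never fit.
    while n <= len(fmt):
        try:
            fmt % ((1,) * n)
        except TypeError:
            n += 1  # too few arguments, or a mapping format: try one more
        else:
            return n
    raise ValueError("not a purely positional format: %r" % (fmt,))


samples = [
    ("", 0),
    ("%%", 0),
    ("%s", 1),
    ("%%%s", 1),
    ("%%%*.*d", 3),
    ("%%%%%*s", 2),
    ("%s %*s %*d %*f", 7),
]
for f, expected in samples:
    print(repr(f), count_args(f), expected)
```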

Peter
 
Greg Ewing

Steve said:
I've been wondering whether it's possible to perform a similar analysis
on non-mapping-type format strings, so as to know how long a tuple to
provide,

I just tried an experiment, and it doesn't seem to be possible.

The problem seems to be that it expects the arguments to be
in the form of a tuple, and if you give it something else,
it wraps it up in a 1-element tuple and uses that instead.

This seems to happen even with a custom subclass of tuple,
so it must be doing an exact type check.

So it looks like you'll have to parse the format string.
 
Peter Otten

Greg said:
I just tried an experiment, and it doesn't seem to be possible.

The problem seems to be that it expects the arguments to be
in the form of a tuple, and if you give it something else,
it wraps it up in a 1-element tuple and uses that instead.

This seems to happen even with a custom subclass of tuple,
so it must be doing an exact type check.

No, it doesn't do an exact type check, but always calls the tuple method:

...     def __getitem__(self, index):
...         return 42
...
"'a' 'b'"

So it looks like you'll have to parse the format string.

Indeed.
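
A small self-contained check of that behaviour, written for Python 3 and, as far as I can tell, specific to CPython. The class name Loud is made up for this sketch. A tuple subclass is accepted as the argument tuple, but the overridden __getitem__ is bypassed during formatting.

```python
class Loud(tuple):
    """Hypothetical tuple subclass whose item access always returns 42."""

    def __getitem__(self, index):
        return 42


pair = Loud(("a", "b"))

# Ordinary indexing goes through the override.
print(pair[0])

# %-formatting reads the underlying tuple items, not the override.
print("%r %r" % pair)
```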

Peter
 
Bengt Richter

No, it doesn't do an exact type check, but always calls the tuple method:

...     def __getitem__(self, index):
...         return 42
...
"'a' 'b'"


Indeed.
Parse might be a big word for

(if it works in general ;-)

Or maybe clearer and faster:
3

Regards,
Bengt Richter
 
Peter Otten

Bengt said:
Parse might be a big word for


(if it works in general ;-)

Which it doesn't:
fmt.split('%%')))
....
Traceback (most recent call last):
Or maybe clearer and faster:

3

Mixed formats show some "interesting" behaviour:
Traceback (most recent call last):
...     def __getitem__(self, key):
...         return "D[%s]" % key
...
Traceback (most recent call last):
'<__main__.D instance at 0x402aad8c> D[x] D[y]'

That is as far as I got. So under what circumstances is
'%s this %(x)s not %% but %s' a valid format string?

Peter
 
Bengt Richter

Which it doesn't:
D'oh. (My subconscious knew that one, and prompted the "if" ;-)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: not enough arguments for format string
But that one it totally spaced on ;-/
Or maybe clearer and faster:

Mixed formats show some "interesting" behaviour:
Traceback (most recent call last):
...     def __getitem__(self, key):
...         return "D[%s]" % key
...
Traceback (most recent call last):
'<__main__.D instance at 0x402aad8c> D[x] D[y]'

That is as far as I got. So under what circumstances is
'%s this %(x)s not %% but %s' a valid format string?

Yeah, I got that far too, some time ago playing with % mapping, and
I thought they just didn't allow for mixed formats. My thought then
was that they could pass integer positional keys to another method
(say __format__) on a mapping object that wants to handle mixed formats.
If you wanted the normal str or repr representation of a mapping
object that had a __format__ method, you'd have to do it on the args
side with str(theobject), but you'd have a way. And normal mapping objects
would need no special handling for '%s' in a mixed format context.

Regards,
Bengt Richter
 
Michael Spencer

Bengt said:

My experiments suggest that you can have at most one unnamed argument in a
mapping template; this unnamed value evaluates to the map itself.

As for when '%s this %(x)s not %% but %s' is a valid format string:
based on the above experiments, never.
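
A quick way to reproduce this on Python 3 (a sketch; the variable names are mine). One leading unnamed spec receives the whole mapping, and a second unnamed spec fails.

```python
data = {"x": 1}

# One unnamed spec ahead of the named one: it is formatted from the mapping itself.
one_unnamed = "%s and %(x)s" % data
print(one_unnamed)

# Two unnamed specs: the second has nothing left to consume.
try:
    "%s %s and %(x)s" % data
except TypeError as exc:
    two_unnamed_error = str(exc)
else:
    two_unnamed_error = None
print(two_unnamed_error)
```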

I have wrapped up my current understanding in the class below; these are
its reprs for a positional and a mapping template:

POSITIONAL Template: %s %*.*d %*s
Arguments: ('s', 'width', 'precision', 'd', 'width', 's')

MAPPING Template: %(arg1)s %% %(arg2).*f %()s %s
Arguments: {'': 's', 'arg1': 's', 'arg2': 'f', None: 's'}

import re

class StringFormatInfo(object):
    """Wraps a template string and provides information about the number and
    kinds of arguments that must be supplied. Call with % to apply the
    template to data"""

    parse_format = re.compile(r'''
        \%                                  # placeholder
        (?:\((?P<name>[\w]*)\))?            # 0 or 1 named groups
        (?P<conversion>[\#0\-\+]?)          # 0 or 1 conversion flags
        (?P<width>[\d]* | \*)               # optional minimum conversion width
        (?:\.(?P<precision>[\d]+ | \*))?    # optional precision (literal dot)
        (?P<lengthmodifier>[hlL]?)          # optional length modifier
        (?P<type>[diouxXeEfFgGcrs]{1})      # conversion type - note %% omitted
        ''',
        re.VERBOSE
    )

    def __init__(self, template):
        self.template = template
        self.formats = formats = [m.groupdict() for m in
            self.parse_format.finditer(template)]

        for format in formats:
            if format['name']:
                self.format_type = "MAPPING"
                self.format_names = dict((format['name'], format['type'])
                    for format in formats)
                break
        else:
            self.format_type = "POSITIONAL"
            format_names = []
            for format in formats:
                if format['width'] == '*':
                    format_names.append('width')
                if format['precision'] == '*':
                    format_names.append('precision')
                format_names.append(format['type'])
            self.format_names = tuple(format_names)

    def __mod__(self, values):
        return self.template % values

    def __repr__(self):
        return "%s Template: %s\nArguments: %s" % \
            (self.format_type, self.template, self.format_names)



Michael
 
Andrew Dalke

Michael said:
I have wrapped up my current understanding in the following class:

I see you assume that only \w+ can fit inside of a %()
in a format string. The actual Python code allows anything
up to the balanced closed parens.

...     def __getitem__(self, text):
...         print "Want", repr(text)
...
Want 'this(is)a.--test!'
'None'

I found this useful for a templating library I once wrote
that allowed operations through a simple pipeline, like

%(doc.text|reformat(68)|indent(4))s
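
To see the balanced-parenthesis rule in action on Python 3, a sketch along these lines should do. The class name Wanted is hypothetical; it just records each key it is asked for and returns None.

```python
class Wanted:
    """Hypothetical mapping stand-in that records every key it is asked for."""

    def __init__(self):
        self.keys = []

    def __getitem__(self, key):
        self.keys.append(key)
        return None


probe = Wanted()
result = "%(this(is)a.--test!)s / %(doc.text|reformat(68)|indent(4))s" % probe

# Each key is everything up to the matching close paren, nested parens included.
print(probe.keys)
print(result)
```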

Andrew
(e-mail address removed)
 
Michael Spencer

Andrew said:
I see you assume that only \w+ can fit inside of a %()
in a format string. The actual Python code allows anything
up to the balanced closed parens.
Gah! I guess that torpedoes the regexp approach, then.

Thanks for looking at this

Michael
 
Greg Ewing

Peter said:
No, it doesn't do an exact type check, but always calls the tuple method:

I guess you mean len(). On further investigation,
this seems to be right, except that it doesn't
invoke a __len__ defined in a custom subclass.
So there's something in there hard-coded to
expect a built-in tuple.

In any case, the original idea isn't possible.
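
Here is a short Python 3 sketch of both observations, assuming CPython behaviour; the class name Stretchy is made up. An overridden __len__ on a tuple subclass is ignored by %, and a non-tuple sequence is treated as a single argument rather than unpacked.

```python
class Stretchy(tuple):
    """Hypothetical tuple subclass that lies about its length."""

    def __len__(self):
        return 99


args = Stretchy((1, 2))
print(len(args))        # the override is visible to len()
print("%s %s" % args)   # formatting still sees exactly two items

# A list is not unpacked: it is consumed whole by the first spec.
print("%s" % [1, 2])
try:
    "%s %s" % [1, 2]
except TypeError as exc:
    list_error = str(exc)
else:
    list_error = None
print(list_error)
```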
 
Steve Holden

Michael said:
Gah! I guess that torpedoes the regexp approach, then.

Thanks for looking at this

Michael
While Andrew may have found the "fatal flaw" in your scheme, it's worth
pointing out that it works just fine for my original use case.

regards
Steve
 
Bengt Richter

Gah! I guess that torpedoes the regexp approach, then.

Thanks for looking at this

I brute-forced a str subclass that will call a mapping object's __getitem__ for both
kinds of format spec and for '*' specs, just to see what it would take. I didn't go the
whole way and look for a __format__ method on the mapping object, along the lines I
suggested in a previous post. Someone else's turn again ;-)
This has not been tested thoroughly...

The approach is to scan the original format string, collect pieces in an out list,
and ''.join that for the final output. The pieces are the non-format parts and the
strings produced by formatting each spec as it is found. A %(name) format arg is
retrieved from the mapping object by name as usual, and saved as the arg for a
rewritten plain format made from the tail after %(name), which is the same tail
as in %tail, except that the value is already retrieved. Next, '*' or decimal strings
are packed into the rewritten format. The '*' values are retrieved by integer
keys passed to mapobj[i], with i incremented each time. If the arg value was not
retrieved by name, that's another mapobj[i] lookup. Then the conversion is done with
the plain format. The tests use MixFmt(fmt, verbose=True) % MapObj(*position_params, **namedict),
and verbose prints each rewritten format, arg, and result as it appends them to out.


----< mixfmt.py >------------------------------------------------------------------------
# mixfmt.py -- a string subclass with __mod__ permitting mixed '%(name)s %s' formatting
import re
class MixFmtError(Exception): pass

class MixFmt(str):
    def __new__(cls, s, **kw):
        return str.__new__(cls, s)
    def __init__(self, *a, **kw):
        self._verbose = kw.get('verbose')

    # Michael Spencer's regex, slightly modded, but only for reference, since XXX note
    parse_format = re.compile(r'''
        (
        \%                  # placeholder
        (?:\(\w*\))?        # 0 or 1 "named" groups XXX "%( (any)(balanced) parens )s" is legal!
        [\#0\-\+]?          # 0 or 1 conversion flags
        (?:\* | \d+)?       # optional minimum conversion width
        (?:\.\* | \.\d+)?   # optional precision
        [hlL]?              # optional length modifier
        [diouxXeEfFgGcrs]   # conversion type - note %% omitted
        )
        ''',
        re.VERBOSE)

    def __mod__(self, mapobj):
        """
        The '%' MixFmt string operation allowing both %(whatever)fmt and %fmt
        by calling mapobj[whatever] for named args, and mapobj[i], sequentially
        counting i, for each '*' width or precision spec, and unnamed args.
        It is up to the mapobj to handle this. See MapObj example used in tests.
        """
        out = []
        iarg = 0
        pos, end = 0, len(self)
        sentinel = object()
        while pos<end:
            pos, last = self.find('%', pos), pos
            while pos>=0 and self[pos:pos+2] == '%%':
                pos+=2
                pos = self.find('%', pos)
            if pos<0: out.append(self[last:].replace('%%','%')); break
            # here we have start of fmt with % at pos
            out.append(self[last:pos].replace('%%','%'))
            last = pos
            plain_arg = sentinel
            pos = pos+1
            if pos<end and self[pos]=='(':
                # scan for balanced matching ')'
                brk = 1; pos+=1
                while brk>0:
                    nextrp = self.find(')',pos)
                    if nextrp<0: raise MixFmtError, 'no match for "(" at %s'%(pos+1)
                    nextlp = self.find('(', pos)
                    if nextlp>=0:
                        if nextlp<nextrp:
                            brk+=1; pos = nextlp+1
                        else:
                            pos = nextrp+1
                            brk-=1
                    else:
                        brk-=1
                        pos = nextrp+1
                plain_arg = mapobj[self[last+2:pos-1]]
            # else: normal part starts here, at pos
            plain_fmt = '%'
            # [\#0\-\+]? # 0 or 1 conversion flags
            if pos<end and self[pos] in '#0-+':
                plain_fmt += self[pos]; pos+=1
            # (?:\* | \d+)? # optional minimum conversion width
            if pos<end and self[pos]=='*':
                plain_fmt += str(mapobj[iarg]); pos+=1; iarg+=1
            elif pos<end and self[pos].isdigit():
                eod = pos+1
                while eod<end and self[eod].isdigit(): eod+=1
                plain_fmt += self[pos:eod]
                pos = eod
            #(?:\.\* | \.\d+)? # optional precision
            if pos<end and self[pos] == '.':
                plain_fmt += '.'
                pos +=1
                if pos<end and self[pos]=='*':
                    plain_fmt += str(mapobj[iarg]); pos+=1; iarg+=1
                elif pos<end and self[pos].isdigit():
                    eod = pos+1
                    while eod<end and self[eod].isdigit(): eod+=1
                    plain_fmt += self[pos:eod]
                    pos = eod
            #[hlL]? # optional length modifier
            if pos<end and self[pos] in 'hlL': plain_fmt += self[pos]; pos+=1
            #[diouxXeEfFgGcrs] # conversion type - note %% omitted
            if pos<end and self[pos] in 'diouxXeEfFgGcrs': plain_fmt += self[pos]; pos+=1
            else: raise MixFmtError, 'Bad conversion type %r at %s' %(self[pos:pos+1], pos)
            if plain_arg is sentinel: # need arg
                plain_arg = mapobj[iarg]; iarg+=1
            result = plain_fmt % (plain_arg,)
            if self._verbose:
                print ' -> %r %% %r => %r' % (plain_fmt, (plain_arg,), result)
            out.append(result)
        return ''.join(out)

class MapObj(object):
    """
    Example for test.
    Handles both named and positional (integer) keys
    for MixFmt(fmtstring) % MapObj(posargs, namedict)
    """
    def __init__(self, *args, **kw):
        self.args = args
        self.kw = kw
    def __getitem__(self, i):
        if isinstance(i, int): return self.args[i]
        else:
            try: return self.kw[i]
            except KeyError: return '<KeyError:%r>'%i

def test(fmt, *args, **namedict):
    print '\n==== test with:\n %r\n %s\n %s' %(fmt, args, namedict)
    print MixFmt(fmt, verbose=True) % MapObj(*args, **namedict)

def testseq():
    test('(no %%)')
    test('%s', *['first'])
    test('%(sym)s',**dict(sym='second'))
    test('%s %*.*d %*s', *['third -- expect " 012 ab" after colon:', 5, 3, 12, 4, 'ab'])
    test('%(arg1)s %% %(arg2).*f %()s %s', *[3, 'last'], **{
        'arg1':'fourth -- expect " % 2.220 NULL? last" after colon:', 'arg2':2.22, '':'NULL?'})
        #'%s %*.*d %*s', *['expect " 345 ab"??:', 2, 1, 12345, 4, 'ab'])
    test('fifth -- non-key name: %(this(is)a.--test!)s')

if __name__ == '__main__':
    import sys
    if not sys.argv[1:]:
        raise SystemExit,'Usage: python24 mixfmt.py -test | fmt ([key =] (s | (-i|-f) num)+ )*'
    fmt, rawargs = sys.argv[1], iter(sys.argv[2:])
    if fmt == '-test': testseq(); raise SystemExit
    args = []
    namedict = {}; to_name_dict=False
    for arg in rawargs:
        if arg == '-i': arg = int(rawargs.next())
        if arg == '-f': arg = float(rawargs.next())
        if arg == '=': to_name_dict = True
        elif to_name_dict: namedict[args.pop()] = arg; to_name_dict=False
        else: args.append(arg)
    test(fmt, *args, **namedict)
-----------------------------------------------------------------------------------------
Result of py24 mixfmt.py -test:

[10:06] C:\pywk\pymods>py24 mixfmt.py -test

==== test with:
'(no %%)'
()
{}
(no %)

==== test with:
'%s'
('first',)
{}
-> '%s' % ('first',) => 'first'
first

==== test with:
'%(sym)s'
()
{'sym': 'second'}
-> '%s' % ('second',) => 'second'
second

==== test with:
'%s %*.*d %*s'
('third -- expect " 012 ab" after colon:', 5, 3, 12, 4, 'ab')
{}
-> '%s' % ('third -- expect " 012 ab" after colon:',) => 'third -- expect " 012 ab" after colon:'
-> '%5.3d' % (12,) => '  012'
-> '%4s' % ('ab',) => '  ab'
third -- expect " 012 ab" after colon:   012   ab

==== test with:
'%(arg1)s %% %(arg2).*f %()s %s'
(3, 'last')
{'': 'NULL?', 'arg1': 'fourth -- expect " % 2.220 NULL? last" after colon:', 'arg2': 2.2200000000000002}
-> '%s' % ('fourth -- expect " % 2.220 NULL? last" after colon:',) => 'fourth -- expect " % 2.220 NULL? last" after colon:'
-> '%.3f' % (2.2200000000000002,) => '2.220'
-> '%s' % ('NULL?',) => 'NULL?'
-> '%s' % ('last',) => 'last'
fourth -- expect " % 2.220 NULL? last" after colon: % 2.220 NULL? last

==== test with:
'fifth -- non-key name: %(this(is)a.--test!)s'
()
{}
-> '%s' % ("<KeyError:'this(is)a.--test!'>",) => "<KeyError:'this(is)a.--test!'>"
fifth -- non-key name: <KeyError:'this(is)a.--test!'>

You can also run it interactively with one format and some args, e.g.,

[10:25] C:\pywk\pymods>py24 mixfmt.py
Usage: python24 mixfmt.py -test | fmt ([key =] (s | (-i|-f) num)+ )*

[10:25] C:\pywk\pymods>py24 mixfmt.py "%*.*f %(hi)s" -i 6 -i 3 -f 3.5 hi = hello

==== test with:
'%*.*f %(hi)s'
(6, 3, 3.5)
{'hi': 'hello'}
-> '%6.3f' % (3.5,) => ' 3.500'
-> '%s' % ('hello',) => 'hello'
3.500 hello


Regards,
Bengt Richter
 
Michael Spencer

Steve said:
While Andrew may have found the "fatal flaw" in your scheme, it's worth
pointing out that it works just fine for my original use case.

regards
Steve

Thanks. Here's a version that overcomes the 'fatal' flaw.

class StringFormatInfo(object):

    def __init__(self, template):
        self.template = template
        self.parse()

    def tokenizer(self):
        lexer = TinyLexer(self.template)
        self.format_type = "POSITIONAL"
        while lexer.search("\%"):
            if lexer.match("\%"):
                continue
            format = {}
            name = lexer.takeparens()
            if name is not None:
                self.format_type = "MAPPING"
            format['name'] = name
            format['conversion'] = lexer.match("[\#0\-\+]")
            format['width'] = lexer.match("\d+|\*")
            format['precision'] = lexer.match("\.") and \
                lexer.match("\d+|\*") or None
            format['lengthmodifier'] = lexer.match("[hlL]")
            ftype = lexer.match('[diouxXeEfFgGcrs]')
            if not ftype:
                raise ValueError
            else:
                format['type'] = ftype
            yield format

    def parse(self):
        self.formats = formats = list(self.tokenizer())
        if self.format_type == "MAPPING":
            self.format_names = dict((format['name'], format['type'])
                for format in formats)
        else:
            format_names = []
            for format in formats:
                if format['width'] == '*':
                    format_names.append('width')
                if format['precision'] == '*':
                    format_names.append('precision')
                format_names.append(format['type'])
            self.format_names = tuple(format_names)

    def __mod__(self, values):
        return self.template % values

    def __repr__(self):
        return "%s Template: %s\nArguments: %s" % \
            (self.format_type, self.template, self.format_names)
    __str__ = __repr__

SFI = StringFormatInfo

def tests():
    print SFI('%(arg1)s %% %(arg2).*f %()s %s')
    print SFI('%s %*.*d %*s')
    print SFI('%(this(is)a.--test!)s')


import re

class TinyLexer(object):
    def __init__(self, text):
        self.text = text
        self.ptr = 0
        self.len = len(text)
        self.re_cache = {}

    def match(self, regexp, consume = True, anchor = True):
        # Compile string patterns once and cache them.
        if isinstance(regexp, basestring):
            cache = self.re_cache
            if regexp not in cache:
                cache[regexp] = re.compile(regexp)
            regexp = cache[regexp]
        matcher = anchor and regexp.match or regexp.search
        match = matcher(self.text, self.ptr)
        if not match:
            return None
        if consume:
            self.ptr = match.end()
        return match.group()

    def search(self, regexp, consume = True):
        return self.match(regexp, consume=consume, anchor=False)

    def takeparens(self):
        # Return the text inside balanced parens at ptr, or None if there are none.
        start = self.ptr
        if start >= self.len or self.text[start] != '(':
            return None
        out = ''
        level = 1
        self.ptr += 1
        while self.ptr < self.len:
            nextchar = self.text[self.ptr]
            level += (nextchar == '(') - (nextchar == ')')
            self.ptr += 1
            if level == 0:
                return out
            out += nextchar
        raise ValueError, "Unmatched parentheses"
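
For readers on Python 3, the same scan can be condensed into one generator. This is only a sketch under my own assumptions: format_slots and _TAIL are hypothetical names, the flag set is a guess at what matters in practice, and only the conversion types listed in the regex are recognised.

```python
import re

# Everything after the optional %(name) part: flags, width, precision,
# length modifier, conversion type. '%%' is handled separately below.
_TAIL = re.compile(r"""
    [\#0\-\ +]*                       # conversion flags
    (?P<width>\*|\d+)?                # minimum width, '*' takes an argument
    (?:\.(?P<precision>\*|\d+)?)?     # precision, '*' takes an argument
    [hlL]?                            # length modifier (ignored)
    (?P<type>[diouxXeEfFgGcrsa])      # conversion type
""", re.VERBOSE)


def format_slots(template):
    """Yield (name, kind) for every argument a %-template would consume."""
    pos, end = 0, len(template)
    while True:
        pos = template.find("%", pos)
        if pos < 0:
            return
        pos += 1
        if template.startswith("%", pos):  # literal '%%' consumes nothing
            pos += 1
            continue
        name = None
        if template.startswith("(", pos):
            # The key runs to the balancing ')', nested parens included.
            depth, start = 1, pos + 1
            pos = start
            while depth:
                if pos >= end:
                    raise ValueError("unbalanced parentheses in key")
                depth += (template[pos] == "(") - (template[pos] == ")")
                pos += 1
            name = template[start:pos - 1]
        m = _TAIL.match(template, pos)
        if m is None:
            raise ValueError("bad conversion at offset %d" % pos)
        if m.group("width") == "*":
            yield None, "width"
        if m.group("precision") == "*":
            yield None, "precision"
        yield name, m.group("type")
        pos = m.end()


print([kind for _, kind in format_slots("%s %*.*d %*s")])
print(list(format_slots("%(this(is)a.--test!)s")))
```

The length of the positional list answers the original question of how long a tuple to supply; the names of a mapping template are the first element of each pair.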
 
