DTD Parsing

Asun Friere

Now that PyXML (and thus xmlproc) is defunct, does anyone know any
handy modules (apart from re :) for parsing DTDs?
 
Felipe Bastos Nunes

I'd like to know too. I work with Java and JDOM, but I'm doing
personal things in Python, and plan to go full Python in the next 2
years. XML is my first option for configuration files and simple
storage.
 
Asun Friere

Christian said:
On 10.11.2010 03:44, Felipe Bastos Nunes wrote:
[...]


Don't repeat the mistakes of others and use XML as a configuration
language. XML isn't meant to be edited by humans.

Yes but configuration files are not necessarily meant to be edited by
humans either!

Having said that, I'm actually old school and prefer "setting=value"
human editable config files which are easily read into a dict via some
code something like this:

import re
import sys

def read_config(file_obj):
    """Reads a config file and returns values as a dictionary

    Config file is a series of lines in the format:
        #comment
        name=value
        name:value
        name = value #comment
    Neither name nor value may contain '#', '=', ':' nor any spaces.

    """
    config = {}
    # name, then '=' or ':', then value, then an optional trailing comment
    nameval = re.compile(
        r'^\s*([^=:#\s]+)\s*(?:=|:)\s*([^=:#\s]*)\s*(?:#.*)?\s*$').search
    # blank lines and lines holding only a comment
    comment = re.compile(r'^\s*($|#)').search
    for line in file_obj:
        if comment(line):
            continue
        try:
            name, value = nameval(line).groups()
        except AttributeError:
            # nameval() returned None: the line did not match
            sys.stderr.write('WARNING: suspect entry: %s\n' % line)
            continue
        config[name] = value
    file_obj.close()
    return config

Thanks Christian, I might check out 'configobj', but my needs are
rarely more complicated than the above will satisfy.

In any case Felipe, whether you intend to use XML for config or not
(or for any other reason), there are good tools for XML parsing in
python including with DTD validation. Try the modules 'libxml2',
'lxml', or even, if your needs are modest, the poorly named
'HTMLParser'.

What I'm looking for instead is something to parse a DTD, such as
xmlproc's DTDConsumer. It might even exist in the modules I've
mentioned, but I can't find it. In the event, I think I'll use a
DTD-to-xsd conversion script and then simply use HTMLParser. Unless someone
can point me in the way of a simple DTD parser, that is.
 
Asun Friere

Christian said:
Back to the initial question: I highly recommend LXML for any kind of
XML processing, validation, XPath etc.

Sorry Christian, didn't realise at first that that was a response to
MY initial question. But does lxml actually have something for parsing
DTDs, as opposed to parsing XML and validating it against a DTD?
 
Stefan Behnel

Asun Friere, 10.11.2010 04:42:
Sorry Christian, didn't realise at first that that was a response to
MY initial question. But does lxml actually have something for parsing
DTDs, as opposed to parsing XML and validating it against a DTD?

What's your interest in parsing a DTD if you're not up to validating XML?

Stefan
 
Asun Friere

What's your interest in parsing a DTD if you're not up to validating XML?

Spitting out boilerplate code.

Just at the moment I'm creating a stub XSLT sheet, which creates a
template per element (from a 3rd party DTD with 143 elements, yuk!)
containing nothing more than an apply-templates line listing all
possible child elements and a comment saying 'NOT IMPLEMENTED: %s' %
element_name. This saves not only typing, but helps me work through
and guards against any clumsy oversight on my part in writing a
translation sheet for an IMO overly large schema.
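
The stub generation described above can be sketched roughly as follows. This is only an illustration, not the script used in the thread: the function name xslt_stubs and the element-to-children mapping are made up for the example, and the mapping is assumed to have been extracted from the DTD by some other means.

```python
from xml.sax.saxutils import quoteattr

def xslt_stubs(children_by_element):
    """Return a stub XSLT sheet with one template per element.

    children_by_element (hypothetical input) maps each element name to
    the list of child element names its content model allows.
    """
    lines = [
        '<?xml version="1.0"?>',
        '<xsl:stylesheet version="1.0"',
        '    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">',
    ]
    for name in sorted(children_by_element):
        children = children_by_element[name]
        lines.append('  <xsl:template match=%s>' % quoteattr(name))
        lines.append('    <!-- NOT IMPLEMENTED: %s -->' % name)
        if children:
            # One apply-templates line listing every possible child.
            select = ' | '.join(children)
            lines.append('    <xsl:apply-templates select=%s/>'
                         % quoteattr(select))
        lines.append('  </xsl:template>')
    lines.append('</xsl:stylesheet>')
    return '\n'.join(lines)

# Tiny made-up schema, just to show the shape of the output.
sheet = xslt_stubs({'book': ['title', 'chapter'], 'title': []})
print(sheet)
```

Elements with no child elements get a template holding only the comment, so every element of the schema still shows up in the sheet as a reminder.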

A few years back I used a similar technique to write some boiler plate
python code where xml was isomorphically represented on a class per
element basis (which will no doubt offend some people's sense of
generalisation, but is none the less an interesting way to work with
XML).

While I'm here and just for the record, (as I don't imagine anyone
would want to use the code I posted above), the line
"file_obj.close()" has no place in a function which is passed an open
file_object. My apologies.
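
A minimal sketch of that point: the parsing function only iterates over what it is given, and whoever opened the file is responsible for closing it. The names parse_settings and load_settings are hypothetical, and the parsing is deliberately simpler than the regex version above (only "name=value" with optional "#" comments).

```python
import io

def parse_settings(lines):
    """Parse 'name=value' lines from any iterable of strings.

    The function never closes its argument; it does not own it.
    """
    config = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()   # drop comments and padding
        if line:
            name, _, value = line.partition('=')
            config[name.strip()] = value.strip()
    return config

def load_settings(path):
    """The caller that opens the file is the one that closes it."""
    with open(path) as handle:
        return parse_settings(handle)

# An in-memory file stays usable after parsing, because nothing closed it.
buf = io.StringIO("a=1\n# a comment\nb = 2 # trailing comment\n")
settings = parse_settings(buf)
print(settings, buf.closed)
```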
 
Stefan Behnel

Asun Friere, 10.11.2010 06:41:
What's your interest in parsing a DTD if you're not up to validating XML?

Spitting out boilerplate code.
[...]
A few years back I used a similar technique to write some boiler plate
python code where xml was isomorphically represented on a class per
element basis (which will no doubt offend some people's sense of
generalisation, but is none the less an interesting way to work with
XML).

Give lxml.objectify a try. It doesn't use DTDs, but does what you want.

There are also some other similar tools like gnosis.objectify or Amara. I
never benchmarked them in comparison, but I'd be surprised if
lxml.objectify wasn't the fastest. I'd be interested in seeing the margin,
though, in case anyone wants to give it a try.

It's generally a good idea to state what you want to achieve, rather than
just describing the failure of an intermediate step of one possible path
towards your hidden goal. This list has a huge history of finding shortcuts
that the OPs didn't think of.

Stefan
 
r0g

Yes but configuration files are not necessarily meant to be edited by
humans either!

Having said that, I'm actually old school and prefer "setting=value"
human editable config files which are easily read into a dict via some
code something like this:
<snippetysnip>


Me too when possible, TBH if I only needed strings and there was no
pressing security issue I'd just do this...

config = {}
for line in open("config.txt", 'r'):
    if line.strip() and line[0] != "#":
        param, value = line.rstrip().split("=", 1)
        config[param] = value

There is a place for XML settings though, they're nice and portable and
for some apps you probably don't want end users editing their
configurations by hand in a text editor anyway, you would prefer them to
use the nice consistency preserving config interface you have lovingly
built for them. You have built them a nice GUI config interface haven't
you ??? ;)

Roger
 
Ian Kelly

Me too when possible, TBH if I only needed strings and there was no
pressing security issue I'd just do this...

config = {}
for line in open("config.txt", 'r'):
    if line.strip() and line[0] != "#":
        param, value = line.rstrip().split("=", 1)
        config[param] = value

That's five whole lines of code. Why go to all that trouble when you
can just do this:

import config

I kid, but only partially. Where this really shines is when you're
prototyping something and you need to configure complex object
hierarchies. No need to spend time writing parsers to generate the
hierarchies; you just construct the objects directly in the config.
When the project becomes mature enough that configuration security is a
concern, then you can replace the config with XML or whatever, and in
the meantime you can focus on more important things, like the actual
project.

Cheers,
Ian
 
r0g

Me too when possible, TBH if I only needed strings and there was no
pressing security issue I'd just do this...

config = {}
for line in open("config.txt", 'r'):
    if line.strip() and line[0] != "#":
        param, value = line.rstrip().split("=", 1)
        config[param] = value

That's five whole lines of code. Why go to all that trouble when you can
just do this:

import config


Heh, mainly because I figure the config module will have a lot more
options than I have use for right now and therefore the docs will take
me longer to read than I will save by not just typing in the above ;)

Having said that, you've just prompted me to take a look... there goes
another 10 minutes of my life!

Roger
 
Asun Friere

Give lxml.objectify a try. It doesn't use DTDs, but does what you want.

Yes I should take the time to familiarise myself with the lxml API in
general. I mostly use libxml2 and libxslt nowadays. For simple stuff
(like this) I use a StateParser which is your common-or-garden variety
State Pattern built on HTMLParser. (For the record it took 3 trivial
state definitions and one hackish one :)

However, my issue was not with any particular Python technology
for XML processing, but with eating a DTD. Once it's in
xsd, it's all downhill from there! So the answer to my question turned
out to be dtd2xsd.pl :)

It's generally a good idea to state what you want to achieve, rather than
just describing the failure of an intermediate step of one possible path
towards your hidden goal. This list has a huge history of finding shortcuts
that the OPs didn't think of.

It's very simple really. I would like to know whether there is some
generally used DTD parser around which could function as a replacement
for xmlproc's DTDParser/DTDConsumer, the existence of which might have
evaded my attention. I would still like to know.

Without wanting to appear ungrateful, I'm not after any shortcut to
any goal, hidden or otherwise, nor is the reason I want a DTD Parser
(I only told you because you asked so nicely) strictly pertinent to my
question. I simply meant to ask, precisely what I did ask.
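
For simple DTDs, one stdlib route is the expat binding, which reports each element declaration through its ElementDeclHandler callback. The sketch below is not a replacement for xmlproc's DTDParser/DTDConsumer. It assumes the DTD text can be pasted into an internal subset, which as far as I know rules out DTDs that use parameter-entity references inside declarations or conditional sections. The function names and the sample DTD are made up for the example.

```python
from xml.parsers import expat

def child_names(content_model):
    """Collect element names from an expat content model tuple.

    expat hands each model over as (type, quantifier, name, children).
    """
    kind, _quant, name, children = content_model
    found = [name] if kind == expat.model.XML_CTYPE_NAME else []
    for child in children:
        for child_name in child_names(child):
            if child_name not in found:
                found.append(child_name)
    return found

def dtd_children(dtd_text, root="root"):
    """Map each declared element to the child elements it may contain."""
    declared = {}

    def on_element_decl(name, content_model):
        declared[name] = child_names(content_model)

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    # Wrap the declarations in an internal subset plus an empty root
    # element, so expat has a complete document to parse.
    document = "<!DOCTYPE %s [\n%s\n]>\n<%s/>" % (root, dtd_text, root)
    parser.Parse(document, True)
    return declared

SAMPLE_DTD = """
<!ELEMENT book (title, chapter+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT chapter (title, para*)>
<!ELEMENT para (#PCDATA | em)*>
<!ELEMENT em (#PCDATA)>
"""

declared = dtd_children(SAMPLE_DTD, root="book")
for element, children in declared.items():
    print(element, children)
```

expat does not validate, so the empty root element is only there to make the wrapper well-formed. Attribute declarations can be collected the same way through AttlistDeclHandler.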
 
Asun Friere

That's five whole lines of code.  Why go to all that trouble when you
can just do this:

import config

I kid, but only partially.  

For myself, generally because I only become aware of the module, or
the module is only written after I have written some stuff myself.

I wrote a Date object before the standard one either existed or I knew
of it and I'll keep on using it till you pry it from my cold ... on
2nd thoughts don't pry, just let me use it.
 
Felipe Bastos Nunes

I'll look at the options. Anyway, just to give an example of the
configs I mentioned: the ShoX project (at sourceforge.net) uses XML for
its config files. I'm not talking about ordinary users editing the XML;
it's the developers who edit it :) I'm working on a wireless sensor
network simulator in Python, and some of Python's built-in functions
will make this much easier.

Does either libxml2 or lxml collect children the way JDOM does in Java?
List<Element> children = myRoot.getChildren();

Or do I have to write a handler to find the children?

 
Ian Kelly

Heh, mainly because I figure the config module will have a lot more
options than I have use for right now and therefore the docs will take
me longer to read than I will save by not just typing in the above ;)

I think you misunderstand me. There is no config module and there are
no docs to read. It's just the configuration file itself written as a
Python script, containing arbitrary settings like:

SESSION_TIMEOUT = 900

or:

DOMAIN_OBJECTS = [
    ObjectType1(
        option1 = 'foo',
        option2 = 'bar',
    ),
    ObjectType2(
        option1 = 'foo',
        option2 = 'baz',
        option3 = 42,
    ),
    ...
]

Cheers,
Ian
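
One way to read such a file without putting it on the import path is runpy.run_path, which executes a Python file and returns its globals. This is a sketch only: the helper name load_python_config and the rule of keeping only UPPER_CASE names are assumptions made for the example. As noted elsewhere in the thread, this executes whatever code the file contains.

```python
import os
import runpy
import tempfile

def load_python_config(path):
    """Run a Python file and return its UPPER_CASE names as settings."""
    namespace = runpy.run_path(path)
    return {name: value for name, value in namespace.items()
            if name.isupper()}

# Demonstration with a throwaway config file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
    handle.write("SESSION_TIMEOUT = 900\nscratch = 'ignored'\n")
settings = load_python_config(handle.name)
os.remove(handle.name)
print(settings)
```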
 
Stefan Behnel

Felipe Bastos Nunes, 10.11.2010 13:34:
Does either libxml2 or lxml collect children the way JDOM does in Java?
List<Element> children = myRoot.getChildren();

Bah, that's *so* Java. ;)

ElementTree and lxml.etree do it like this:

children = list(myRoot)

lxml also supports XPath and lots of other helpful stuff.

Stefan
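
A short illustration of the same idea with the stdlib ElementTree, which shares this part of the API with lxml.etree. The sample document and its tag names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical config document for a simulator.
root = ET.fromstring(
    "<config><node id='1'/><node id='2'/><radio/></config>")

children = list(root)                  # direct children, in document order
tags = [child.tag for child in children]
nodes = root.findall("node")           # only the children named 'node'

print(tags)
print([node.get("id") for node in nodes])
```

An element behaves like a sequence of its children, so len(root), root[0] and plain iteration work as well.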
 
Lawrence D'Oliveiro

Christian said:
Don't repeat the mistakes of others and use XML as a configuration
language. XML isn't meant to be edited by humans.

My principle is: anything automatically generated by machine is not fit for
viewing or editing by humans. There’s nothing special about XML in this
regard.

I have successfully got a nontechnical client to put together XML control
files to drive some software I wrote for him. He used KXMLEditor (part of
KDE 3.x, seems to be defunct nowadays), and of course he always started with
an existing example and modified that. But it wasn’t too long before he was
regularly doing it without needing to ask me questions.
 
Lawrence D'Oliveiro

Christian said:
I'm sorry but every time I read XML and configuration in one sentence, I
see the horror of TomCat or Shibboleth XML configs popping up.

Tomcat I know is written in Java; let me guess—Shibboleth is too?
 
Lawrence D'Oliveiro

Ian Kelly said:
config = {}
for line in open("config.txt", 'r'):
    if line.strip() and line[0] != "#":
        param, value = line.rstrip().split("=", 1)
        config[param] = value

That's five whole lines of code. Why go to all that trouble when you
can just do this:

import config

Not a good idea. If there are any mistakes in the config, you would like
to print useful explanatory error messages to help the writer of the config
file figure out what they’ve done wrong, rather than relying on them to
understand Python exception tracebacks. Also your config validation rules
may not map easily to Python language rules.
 
Steve Holden

Ian Kelly said:
On 11/9/2010 11:14 PM, r0g wrote:

config = {}
for line in open("config.txt", 'r'):
    if line.strip() and line[0] != "#":
        param, value = line.rstrip().split("=", 1)
        config[param] = value

That's five whole lines of code. Why go to all that trouble when you
can just do this:

import config

Lawrence D'Oliveiro said:
Not a good idea.
[...]

Sure, you wouldn't want users editing imported Python files, it would be
leaving your program hostage to all the things an ignorant or malevolent
user might do.

But it is practical for, say, experimental work. I seem to remember
that the major server project in "Python Web Programming" has all
configurable modules importing a common Config module, though in my own
defense that *was* almost ten years ago now. (That wasn't intended to be
a production server though it ran sporadically on a computer in my
basement for several years and produced a traceback occasionally).

regards
Steve
 
r0g

I think you misunderstand me. There is no config module and there are
no docs to read. It's just the configuration file itself written as a
Python script, containing arbitrary settings like:


So you're not talking about this then?...

http://www.red-dove.com/config-doc/

I see. You're suggesting writing config files IN the language you're
already writing in?

Indeed that's what I do in many situations, it has the advantage of
working in any scripting language (I do the same in PHP fairly often, as
do several big projects like Drupal) and of course it spares you a bit
of code.

However, if your config file is meant to be distributed / editable by
end users you don't necessarily want them to need a full understanding
of python syntax to do it.

Also, actually parsing config files (rather than just importing
namespaces) gives you an opportunity to deal with any syntax errors on a
case by case basis i.e. skip, fail, issue warning etc. Just importing
code gives you a kind of 100% consistency or death situation and while
that might be exactly what you want many times there may be situations
where you want a bit more fine grained control!
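
A rough sketch of that kind of fine-grained control: each bad line is handled according to a policy chosen by the caller. The function name parse_config, the on_error parameter and its three values are invented for this example.

```python
import sys

def parse_config(lines, on_error="warn"):
    """Parse 'name=value' lines; on_error is 'skip', 'warn' or 'fail'."""
    config = {}
    for number, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                   # blank line or comment
        name, sep, value = line.partition("=")
        if not sep or not name.strip():
            message = "line %d: expected name=value, got %r" % (number, line)
            if on_error == "fail":
                raise ValueError(message)
            if on_error == "warn":
                sys.stderr.write("WARNING: %s\n" % message)
            continue                   # 'skip' and 'warn' both move on
        config[name.strip()] = value.strip()
    return config

sample = ["a=1", "oops", "# comment", "b = 2"]
print(parse_config(sample, on_error="skip"))
```

With on_error="fail" the same input raises ValueError naming the offending line, which is closer to the all-or-nothing behaviour of importing a Python file but with a message aimed at the person editing the config.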

Roger
 
