programming: SAX and get content between open and close tag?

Rui Maciel

Is it possible to, using the SAX approach, extract the XML content between
an opening and closing tag as if it was a continuous string of text?

For example, let's say we have the following document:

<first>
<second>
<alpha>foo</alpha>
<beta>bar</beta>
</second>
</first>

Is it possible to directly extract the content between <first> and </first>
as if it was a text string?


Thanks in advance
Rui Maciel
 
Juergen Kahrs

Rui said:
<first>
<second>
<alpha>foo</alpha>
<beta>bar</beta>
</second>
</first>

Is it possible to directly extract the content between <first> and </first>
as if it was a text string?

I don't think it's possible. Do you expect the tags inside
<first> to appear as text too? Or do you expect only the
character data between the tags to appear? What
_exactly_ do you expect the result of your example to be?
 
Rui Maciel

Juergen said:
I think it's not possible. Do you expect the tags inside
<first> to appear as text also ? Or do you expect the
character data between the tags to appear only ? What
exactly do you expect to be the result of your example ?

What I had in mind was to extract the literal text which is enclosed in the
<first> and </first> tags, where the child tags would appear also as if
they were text. To put it in other words, extract the XML subsection
enclosed by the <first> and </first> tags.

Is it possible?


Thanks and best regards
Rui Maciel
 
Joe Kesselman

Rui said:
Is it possible to directly extract the content between <first> and </first>
as if it was a text string?

Not using standard SAX. Run those events back through a SAX serializer
to regenerate the text from them.
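
For illustration, here is a minimal sketch of that idea in Python, using the standard library's xml.sax parser and its XMLGenerator serializer. The class InnerXml and the helper inner_xml are hypothetical names made up for this example, not part of any SAX API. Note that the regenerated text is equivalent XML, not necessarily a byte-for-byte copy of the original (attribute quoting and entity references may come out differently).

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class InnerXml(xml.sax.ContentHandler):
    """Forward every event found inside <target> to an XMLGenerator."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0  # element depth relative to <target>; 0 means outside
        self.buffer = io.StringIO()
        self.writer = XMLGenerator(self.buffer, encoding="utf-8")

    def startElement(self, name, attrs):
        if self.depth > 0:
            # Already inside <target>: reserialize this start tag.
            self.writer.startElement(name, attrs)
            self.depth += 1
        elif name == self.target:
            # Entering <target>; its own tag is not written.
            self.depth = 1

    def endElement(self, name):
        if self.depth > 0:
            self.depth -= 1
            if self.depth > 0:
                # Still inside <target>, so this is a child's end tag.
                self.writer.endElement(name)

    def characters(self, content):
        if self.depth > 0:
            self.writer.characters(content)


def inner_xml(data, target):
    """Return the reserialized content of every <target> element in data (bytes)."""
    handler = InnerXml(target)
    xml.sax.parseString(data, handler)
    return handler.buffer.getvalue()


if __name__ == "__main__":
    doc = b"<first><second><alpha>foo</alpha><beta>bar</beta></second></first>"
    print(inner_xml(doc, "first"))
```

Comments and processing instructions are not forwarded in this sketch; a fuller version would also need to handle those events.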
 
rui.maciel

Joe said:
Not using standard SAX. Run those events back through a SAX serializer
to regenerate the text from them.

I see what you mean. But that seems to be a bit redundant, doesn't it?
I mean, run an XML text through a parser, decompose it and then generate
the exact same information from the parser's information... It looks
like too much trouble just to end up practically where we were before.
It would be a lot simpler if it was possible to extract the original
content which is enclosed by certain tags.


Rui Maciel
 
Joe Kesselman

Rui said:
It would be a lot simpler if it was possible to extract the original
content which is enclosed by certain tags.

The parser has to grovel through all the bytes anyway, to make sure it
has found the correct matching close-tag.

And this is a relatively uncommon case. Normally if folks are reading an
XML document at all, it's because they want its meaning, not its markup.
(For example, note that the meaning of the text is indeterminate without
knowing what namespace declarations it inherits from its surrounding
context.)
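
A small sketch of the namespace point, using Python's xml.etree.ElementTree. The document, the prefix "a", the URI and the helper parses_alone are all invented for this example. The fragment is well-formed while it sits inside its parent, but once it is cut out as raw text it no longer carries the declaration it inherited, and a namespace-aware parser rejects it.

```python
import xml.etree.ElementTree as ET

# The prefix "a" is declared on <root>, outside the part we want to extract.
document = (
    '<root xmlns:a="urn:example:a">'
    "<first><a:item>foo</a:item></first>"
    "</root>"
)

# Cut the <first> element out as plain text, tags included.
start = document.index("<first>")
end = document.index("</first>") + len("</first>")
fragment = document[start:end]


def parses_alone(text):
    """Return True if text parses as a standalone, namespace-aware document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(parses_alone(document))  # the whole document, prefix bound on <root>
    print(parses_alone(fragment))  # the raw fragment, prefix declaration lost
```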

There are special cases where this could be useful... but SAX is
designed for the most general cases.
 
Greger

Rui said:
I see what you mean. But that seems to be a bit redundant, doesn't it?
I mean, run an XML text through a parser, decompose it and then generate
the exact same information from the parser's information... It looks
like too much trouble just to end up practically where we were before.
It would be a lot simpler if it was possible to extract the original
content which is enclosed by certain tags.

See http://www.saxproject.org/quickstart.html for Java.
What language do you use?
 
William Park

Rui Maciel said:
What I had in mind was to extract the literal text which is enclosed in the
<first> and </first> tags, where the child tags would appear also as if
they were text. To put it in other words, extract the XML subsection
enclosed by the <first> and </first> tags.

Is it possible?

If the <first> tag is not nested, then treat the XML file as one long string:
find the first <first>, then find the first </first>. Otherwise,
you have to do some bookkeeping.
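
A rough sketch of the non-nested case in Python. The function name between_tags is made up for this example, and it assumes the opening tag carries no attributes and that the tag text never appears inside a comment or CDATA section.

```python
def between_tags(text, tag):
    """Return the raw text between the first <tag> and the first </tag> after it.

    Returns None if either tag is missing. Assumes <tag> is never nested
    and is written without attributes.
    """
    open_tag = f"<{tag}>"
    close_tag = f"</{tag}>"

    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)  # skip past the opening tag itself

    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end]


if __name__ == "__main__":
    doc = "<first><second><alpha>foo</alpha><beta>bar</beta></second></first>"
    print(between_tags(doc, "first"))
```

Unlike the SAX route, this returns the original characters untouched, but it does no well-formedness checking at all.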

--
William Park, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
Joe Kesselman

William said:
If <first> tag is not nested, then treat the XML file as long string.
So, find the first <first>, then find the first </first>. Otherwise,
you have to do some bookkeeping.

In other words, text-based rather than XML-based processing, the
"desperate PERL hacker" solution. Doable. Ugly. Sometimes worth
considering, but often means you're asking the wrong questions or
optimizing the wrong things.
 
Joe Kesselman

Malcolm said:
You have to provide the my_print_xxx_as_text routines, and of course the
above is completely pseudo code, but I think you might get the idea.

That's the "reserialize SAX events into text form" solution, which Rui
was objecting to.
 
Malcolm Dew-Jones

Joe Kesselman wrote:
: William Park wrote:
: > If <first> tag is not nested, then treat the XML file as long string.
: > So, find the first <first>, then find the first </first>. Otherwise,
: > you have to do some bookkeeping.

: In other words, text-based rather than XML-based processing, the
: "desperate PERL hacker" solution. Doable. Ugly. Sometimes worth
: considering, but often means you're asking the wrong questions or
: optimizing the wrong things.

No, I think he means that your SAX event handler code does something like
the following:

    # how many <first> elements we are currently inside; 0 means outside
    global variable first_depth = 0;

    sub start_element( the_element_as_an_object )
    {
        # count the opening <first> before printing, so the tag itself is printed
        if (the_element_as_an_object->its_name == 'first')
        {
            first_depth++;
        }

        if (first_depth > 0)
        {
            my_print_element_as_text( the_element_as_an_object );
        }
    }

    sub end_element( the_element_end_as_an_object )
    {
        # print before decrementing, so the closing </first> is printed too
        if (first_depth > 0)
        {
            my_print_element_end_as_text( the_element_end_as_an_object );
        }

        if (the_element_end_as_an_object->its_name == 'first')
        {
            first_depth--;
        }
    }

    # character data, comments, processing instructions, and so on
    sub handle_everything_else( the_thing_as_an_object )
    {
        if (first_depth > 0)
        {
            my_print_thing_as_text( the_thing_as_an_object );
        }
    }


You have to provide the my_print_xxx_as_text routines, and of course the
above is completely pseudo code, but I think you might get the idea.
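
As a rough illustration, the same bookkeeping could look like this in Python with the standard xml.sax module. The class name PrintInsideFirst and its element_as_text helper are invented here, standing in for the my_print_xxx_as_text routines. Only start tags, end tags and character data are handled, so comments and processing instructions are silently dropped in this sketch.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class PrintInsideFirst(xml.sax.ContentHandler):
    """Collect, as text, every event seen while inside the named element."""

    def __init__(self, name="first"):
        super().__init__()
        self.name = name
        self.first_depth = 0  # how many <first> elements we are inside
        self.parts = []       # pieces of regenerated text

    def startElement(self, name, attrs):
        # Increment before printing so the opening <first> is included.
        if name == self.name:
            self.first_depth += 1
        if self.first_depth > 0:
            self.parts.append(self.element_as_text(name, attrs))

    def endElement(self, name):
        # Print before decrementing so the closing </first> is included.
        if self.first_depth > 0:
            self.parts.append(f"</{name}>")
        if name == self.name:
            self.first_depth -= 1

    def characters(self, content):
        if self.first_depth > 0:
            self.parts.append(escape(content))  # re-escape &, < and >

    @staticmethod
    def element_as_text(name, attrs):
        """Rebuild a start tag from its name and attributes."""
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        return f"<{name}{attr_text}>"

    def text(self):
        return "".join(self.parts)


if __name__ == "__main__":
    handler = PrintInsideFirst("first")
    xml.sax.parseString(b"<root><first><alpha>foo</alpha></first></root>", handler)
    print(handler.text())
```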
 
Greger

Rui said:
I'm using C++ at the moment with Qt's XML library.

That site seems rather nice. I'll read it to see if I can finally get the
hang of this XML parsing thing.

Thanks for your help
Rui Maciel

I have never used SAX myself (I use the libxml2 tree API in my project), but
what you'd probably need to do is "trigger" the function that processes the
contents of a tag when the tag type you are looking for occurs.
Better: see the Qt documentation. I am sure there are simple ways to achieve
what you are trying to do.
 
