Mixed Content XML pattern matching

phaeton123 · Aug 25, 2006

I am trying to use XQuery to do pattern matching over mixed structured
and unstructured content. For example, consider the following XML
fragment:

.....
<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
.....

Suppose I wanted to find all articles that contain a reference to a
state, such as Massachusetts, followed later in the text by Microsoft.
In other words, something like "<type=state>.*Microsoft".
What would be the easiest way to accomplish this with XQuery or XPath,
if it is in fact possible? If it is possible, can these "mixed regular
expressions" incorporate arbitrarily nested structures and regular
text? The problem I am having is conceptualizing how one can naturally
combine searches over structure and content jointly, as opposed to
doing it in two passes: one pass to search for paths of the form
//state and extract all of the matching bodies, and a second pass to
search those bodies for plain-text regular expression matches of
"Microsoft".

Martin Honnen · Aug 25, 2006

phaeton123 wrote:

<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
....

Suppose I wanted to find all articles that contain a reference to a
state, such as Massachusetts, followed later in the text by Microsoft.

//article[body[state[. = 'Mass.' and (some $sibling in
following-sibling::node() satisfies contains($sibling, 'Microsoft'))]]]

should select those article elements that have a body child element
with a state child element whose content is 'Mass.' and which is
followed by some sibling node containing 'Microsoft'. Note that the
"some ... satisfies" expression requires XPath 2.0 or XQuery.
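
For anyone who wants to try the same logic without an XPath 2.0 engine,
here is a rough Python sketch of what that expression checks, using only
the standard library. The <articles> wrapper element, the second article
(id 778) and the function names are assumptions made up for
illustration; the original fragment is elided on both sides.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the thread's article plus a made-up counter-example (778).
SAMPLE = """<articles>
<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
<article id="778">
<title>Unrelated</title>
<body>Microsoft opened an office in <state>Mass.</state> last year.</body>
</article>
</articles>"""


def state_then_word(body, state, word):
    """True if a <state> child equal to `state` has a following sibling
    node (text or element) whose string value contains `word`."""
    children = list(body)
    for i, child in enumerate(children):
        if child.tag != "state" or "".join(child.itertext()) != state:
            continue
        # The text node right after <state> is stored as its tail.
        if word in (child.tail or ""):
            return True
        # Later siblings: each element's string value, then its tail text.
        for later in children[i + 1:]:
            if word in "".join(later.itertext()) or word in (later.tail or ""):
                return True
    return False


def matching_article_ids(xml_text, state="Mass.", word="Microsoft"):
    """Ids of <article> elements whose <body> satisfies state_then_word."""
    root = ET.fromstring(xml_text)
    return [
        a.get("id")
        for a in root.iter("article")
        if any(state_then_word(b, state, word) for b in a.findall("body"))
    ]


if __name__ == "__main__":
    print(matching_article_ids(SAMPLE))
```

Article 778 is there only to show that order matters: "Microsoft"
appears before the state element, so it is not a following sibling and
the article is not selected.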

Alain Frisch · Aug 29, 2006

FWIW, you can express such a mixed structure/content query in CDuce
(http://www.cduce.org/) with a single pattern. E.g., for your example:

[ _* <state>_ _* 'Microsoft' _* ]

-- Alain
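
The single-pattern idea can also be approximated outside CDuce. The
sketch below is not CDuce and not anything proposed in this thread; it
is a hypothetical Python approach that flattens a body's mixed content
into one string, marks element boundaries with control characters, and
then runs an ordinary regular expression over the result. The marker
characters and all function names are assumptions chosen for
illustration.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that cannot occur in XML 1.0 text, so they are
# safe to use as element-boundary markers.
OPEN, CLOSE = "\x01", "\x02"


def flatten(elem):
    """Serialize elem's mixed content in document order, with markers
    around every nested element at any depth."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(f"{OPEN}{child.tag}{CLOSE}")
        parts.append(flatten(child))
        parts.append(f"{OPEN}/{child.tag}{CLOSE}")
        parts.append(child.tail or "")
    return "".join(parts)


def tag(name):
    """Regex fragment matching the opening marker of element `name`."""
    return re.escape(f"{OPEN}{name}{CLOSE}")


# One reading of "<type=state>.*Microsoft": a <state> start, then
# anything, then the literal word.
STATE_THEN_MICROSOFT = re.compile(tag("state") + r".*Microsoft", re.DOTALL)


def articles_matching(xml_text, pattern=STATE_THEN_MICROSOFT):
    """Ids of <article> elements whose flattened <body> matches pattern."""
    root = ET.fromstring(xml_text)
    return [
        a.get("id")
        for a in root.iter("article")
        if any(pattern.search(flatten(b)) for b in a.findall("body"))
    ]
```

Unlike the XPath answer, this matches any state element at any nesting
depth, and it would also match if "Microsoft" appeared inside the state
element itself, so treat it as a starting point rather than an exact
equivalent.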
