phaeton123
I am trying to use XQuery to do pattern matching over mixed
structured and unstructured content. For example, consider the following
XML fragment:
.....
<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
.....
Suppose I want to find all articles that contain a reference to a
state, such as Massachusetts, followed later in the text by "Microsoft".
In other words, something like "<type=state>.*Microsoft".
What would be the easiest way to accomplish this with XQuery or XPath,
if it is possible at all? If it is, can these "mixed regular
expressions" incorporate arbitrarily nested structures as well as
regular text? The problem I am having is conceptualizing how one can
naturally combine searches over structure and content jointly, as
opposed to doing it in two passes: one pass to search for paths of the
form //state and extract all of the matching bodies, and a second pass
to search those bodies for plain-text regular expression matches of
"Microsoft".
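To make the two-pass approach concrete, here is a rough sketch of what I mean, written in Python with the standard library's ElementTree rather than XQuery. The wrapping <articles> root, the second article (id 778), and the helper names text_after and two_pass_matches are made up for illustration, and the sketch assumes <state> appears only as a direct child of <body>.

```python
import re
import xml.etree.ElementTree as ET

# Sample data: article 777 is the fragment above; the <articles> wrapper and
# article 778 are hypothetical, added so there is a non-matching case.
SAMPLE = """<articles>
<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
<article id="778">
<title>Hypothetical counterexample</title>
<body>Microsoft opened an office in <state>Ohio</state> last year.</body>
</article>
</articles>"""


def text_after(body, target):
    """Collect the text inside `body` that comes after the child `target`."""
    parts = []
    seen = False
    for child in body:
        if child is target:
            seen = True
        elif seen:
            # Text nested inside a later sibling element.
            parts.append("".join(child.itertext()))
        if seen:
            # Plain text that directly follows this element.
            parts.append(child.tail or "")
    return "".join(parts)


def two_pass_matches(root, pattern=r"Microsoft"):
    """Return ids of articles where a <state> is followed later by `pattern`."""
    hits = []
    for article in root.iter("article"):
        body = article.find("body")
        if body is None:
            continue
        # Pass 1: structural search (assumes <state> is a direct child of <body>).
        states = body.findall("state")
        # Pass 2: plain-text regex search over the text following each match.
        if any(re.search(pattern, text_after(body, s)) for s in states):
            hits.append(article.get("id"))
    return hits


print(two_pass_matches(ET.fromstring(SAMPLE)))
```

This works for the flat case, but the structural step and the text step are clearly separate, which is exactly what I would like to avoid, and it does not extend naturally to arbitrarily nested patterns.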