Getting kind of abstract text snippets from text nodes

  • Thread starter Andreas W. Wylach

Andreas W. Wylach

Hi everybody,

I am implementing a small search engine that searches for a phrase in
XML text nodes. That part works fine, but I do not want the results to
show the complete text of each text node. I would like an abstract-like
result list, similar to the output you get from a Google search.

For example:

... I am the <b>substring</b> from a complete text node ...

where "substring" is the search term.

The problem is simple (I think): I want to extract every part of the
complete text node where the search term occurs, with the term
highlighted and surrounded by about 30 characters of text.

I found an interesting post, "cut down text", which is almost what I am
looking for, but there the text is simply trimmed to x characters.

Does anybody here have an "elegant" way to solve this, or some hints
that would get me to a solution? I am not able to use regular
expressions (though that would be nice). My parser is Sablotron, so I am
restricted to the functions it provides (XSLT 1.0).


Any help is greatly appreciated.


regards,
Andreas W Wylach
 

Joe Kesselman

Think about dividing the text into three parts: before your target, the
target itself, and after the target. Process each appropriately. If you
want to report multiple instances within the same block of text, look at
the standard examples of recursive text processing.
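
A minimal sketch of that three-part split, written in Python only for
brevity (the thread itself targets XSLT 1.0 on Sablotron). The function
name `snippets` and its parameters are hypothetical and not taken from any
library. It uses only plain substring operations and recursion, no regular
expressions; in XSLT 1.0 the same steps would presumably map onto a
recursive named template built from `substring-before`, `substring-after`,
`substring` and `string-length`, but that mapping is an assumption and is
not shown here.

```python
def snippets(text, term, context=30, start=0):
    """Return a list of highlighted snippets, one per occurrence of term.

    Each call splits the text into three parts (before the target, the
    target itself, after the target) and then recurses on the remainder.
    Names and defaults are illustrative assumptions.
    """
    if not term:
        return []                      # an empty search term matches nothing
    pos = text.find(term, start)
    if pos == -1:
        return []                      # base case: no further occurrence
    end = pos + len(term)

    # Keep at most `context` characters on each side of the target.
    before = text[max(0, pos - context):pos]
    after = text[end:end + context]

    # Mark the sides where text was actually cut off.
    lead = "..." if pos - context > 0 else ""
    trail = "..." if end + context < len(text) else ""

    # Note: the text is not escaped, so markup characters pass through as-is.
    snippet = f"{lead}{before}<b>{term}</b>{after}{trail}"

    # Recurse on everything after the current target.
    return [snippet] + snippets(text, term, context, end)


if __name__ == "__main__":
    sample = "I am the substring from a complete text node"
    for line in snippets(sample, "substring", context=5):
        print(line)
```

The recursion continues from the end of each hit, so overlapping matches
are not reported, and the left context of a later hit may include an
earlier hit without highlighting it. A text node with a very large number
of hits could also exceed Python's recursion limit; a loop would avoid
that, at the cost of looking less like the recursive template style.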
 

Dimitre Novatchev

Andreas W. Wylach said:
Hi everybody,

I am implementing a small search engine that searches for a phrase in
XML text nodes. That part works fine, but I do not want the results to
show the complete text of each text node. I would like an abstract-like
result list, similar to the output you get from a Google search.

For example:

... I am the <b>substring</b> from a complete text node ...

where "substring" is the search term.

The problem is simple (I think): I want to extract every part of the
complete text node where the search term occurs, with the term
highlighted and surrounded by about 30 characters of text.


FXSL gives you exactly that (look for testConcordance.xsl).

As first shown here a year and a half ago:


http://www.stylusstudio.com/xsllist/200511/post00560.html

this was used to create a concordance of the text of the New Testament for
every word longer than three characters whose frequency count in the
document does not exceed a given parameter (1280, which in practice leaves
out mainly pronouns).

The code itself is 95 lines. On a 3 GHz Pentium IV PC with 2 GB of RAM,
Saxon 8.6 (the current version at that time) needed less than 92 seconds to
produce the complete (huge) concordance. The source XML document,
"ot Ending Spaces.xml", is almost 50,000 lines long.

This is just one illustration of what can really be done with XSLT,
dispelling the myths that "XSLT cannot do this or that
efficiently/elegantly".

Hope this helped.


Cheers,
Dimitre Novatchev
 
