PDF index builder

Giovanni Azua

Hello!

I have a strong need to do the following. Given a set of PDF files
scattered across multiple directories, build a global index that lists,
for every index term, the file names and corresponding pages where that
term occurs. A really nice-to-have would be to "parse" formulas, but I
guess these are stored as images ...

Before I go ahead and build a solution using Apache's PDFBox and/or iText,
can anyone advise whether such a solution already exists, even a commercial
one? I have googled for this already ...

My use case for this is a very critical open-book exam, but there are no
books, just a bunch of dense PDF papers and lectures (a lot of them). If I
get such an index I might get an edge here :)

TIA,
Best regards,
Giovanni

-- Giovanni
 
Arved Sandstrom

Presumably you don't want to get as high-powered (and costly and
complicated) as something like CBR (content-based retrieval) in IBM
FileNet P8. :)

AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and
indexing. If you're in control of the entire Alfresco system you'd have
access to the indexing data in its raw form. But I don't see the point;
I would simply run PDFBox and Lucene standalone if all you want is a
global index. Granted, Alfresco is not a complicated install.
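
For a sense of what the standalone "global index" boils down to, here is a minimal sketch in plain Java of the term -> file -> pages structure described in the question. It assumes the text of each page has already been extracted (by PDFBox, pdftotext, or similar). The class name GlobalPdfIndex, its methods, and the three-character minimum for a term are illustrative assumptions, not part of any library, and Lucene would handle tokenizing and persistence far better than this does.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of an inverted index: term -> file name -> page numbers. */
final class GlobalPdfIndex {
    // TreeMap/TreeSet keep terms, file names, and pages sorted for printing.
    private final Map<String, Map<String, SortedSet<Integer>>> index = new TreeMap<>();

    /** Records every word on one page under the given file name and 1-based page number. */
    void addPage(String fileName, int pageNumber, String pageText) {
        // Lower-case the text, then split on anything that is not a letter or digit.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.length() < 3) {
                continue; // assumption: very short tokens are noise, not index terms
            }
            index.computeIfAbsent(token, t -> new TreeMap<>())
                 .computeIfAbsent(fileName, f -> new TreeSet<>())
                 .add(pageNumber);
        }
    }

    /** Returns file name -> pages for a term, or an empty map if the term was never seen. */
    Map<String, SortedSet<Integer>> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        GlobalPdfIndex idx = new GlobalPdfIndex();
        // Made-up file names and page text, standing in for real extracted pages.
        idx.addPage("lecture1.pdf", 1, "Markov chains and hidden Markov models");
        idx.addPage("lecture1.pdf", 4, "Training a hidden Markov model");
        idx.addPage("paper.pdf", 2, "Markov decision processes");
        System.out.println(idx.lookup("markov")); // prints {lecture1.pdf=[1, 4], paper.pdf=[2]}
    }
}
```

Lookups are case-insensitive because both the indexed text and the query are lower-cased; stemming ("model" vs. "models") is deliberately left out of this sketch.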

One note: a number of commentators report that PDFBox is slow in the
Alfresco environment. For all I know it's slow, period. You might want
to consider pdftotext. There are some decent articles on using it instead
of PDFBox with Alfresco.
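
If you go the pdftotext route, page numbers are still recoverable, because pdftotext writes a form feed after each page unless it is run with -nopgbrk. A rough sketch, assuming pdftotext is installed and on the PATH; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: runs pdftotext and splits its output into per-page text. */
final class PdfToTextPages {
    /** Runs "pdftotext -enc UTF-8 <pdf> -" and returns everything it writes to stdout. */
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdf.toString(), "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let pdftotext warnings show up
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("pdftotext failed for " + pdf);
        }
        return text;
    }

    /** Splits pdftotext output on form feeds; element i holds the text of page i + 1. */
    static List<String> splitPages(String pdftotextOutput) {
        List<String> pages = new ArrayList<>();
        if (pdftotextOutput.isEmpty()) {
            return pages;
        }
        // Blank pages in the middle stay as empty strings so page numbers do not shift.
        for (String page : pdftotextOutput.split("\f")) {
            pages.add(page.trim());
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfToTextPages some.pdf
        List<String> pages = splitPages(extractText(Paths.get(args[0])));
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i).length() + " chars");
        }
    }
}
```

Each returned page can then be fed to whatever does the indexing, together with the file name and the page number i + 1.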

AHS
 
Roedy Green

Giovanni Azua wrote:
> Before I go ahead and build a solution using Apache's PDFBox and/or iText,
> can anyone advise whether such a solution already exists, even a commercial
> one? I have googled for this already ...

There are a ton of PDF utilities. Have a browse at
http://mindprod.com/jgloss/pdf.html

I would be quite surprised if what you want does not exist.
--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 
