pdf index builder

Giovanni Azua · Dec 5, 2011

Hello!

I have a strong need to do the following. Given a set of PDF files
scattered across multiple directories, build a global index that lists,
for every index term, the file names and corresponding pages where that
term occurs. A really nice-to-have would be to "parse" formulas, but I
guess these are stored as images ...
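
To make the target concrete, here is a minimal sketch of the index as a data structure: term -> file name -> page numbers. The class name `PdfTermIndex`, the tokenizer, and the three-character minimum are illustrative assumptions, and the page text is assumed to come from some extractor (PDFBox, iText, pdftotext) that is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: in-memory inverted index, term -> file -> pages.
class PdfTermIndex {
    private final Map<String, Map<String, TreeSet<Integer>>> index = new TreeMap<>();

    // Record every term found on one page. Text extraction is assumed to
    // happen elsewhere; this method only sees the page's plain text.
    void addPage(String fileName, int pageNumber, String pageText) {
        String lower = pageText.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.length() < 3) {
                continue; // skip empty and very short tokens (arbitrary cutoff)
            }
            index.computeIfAbsent(token, t -> new TreeMap<>())
                 .computeIfAbsent(fileName, f -> new TreeSet<>())
                 .add(pageNumber);
        }
    }

    // Files and pages on which the term occurs; empty map if never seen.
    Map<String, TreeSet<Integer>> lookup(String term) {
        Map<String, TreeSet<Integer>> hit = index.get(term.toLowerCase(Locale.ROOT));
        return hit != null ? hit : Collections.<String, TreeSet<Integer>>emptyMap();
    }

    // One printable line per term, terms and files in sorted order.
    List<String> render() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<String, TreeSet<Integer>>> e : index.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append(": ");
            String sep = "";
            for (Map.Entry<String, TreeSet<Integer>> f : e.getValue().entrySet()) {
                sb.append(sep).append(f.getKey()).append(' ').append(f.getValue());
                sep = "; ";
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

Sorted maps and sets keep the printed index in alphabetical term order with ascending page numbers, which is what a paper index would need. Stop words and stemming are left out of this sketch.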

Before I go ahead and build a solution using Apache's PDFBox and/or iText,
can anyone advise whether such a solution already exists, even a
commercial one? I have already googled for this ...

My use case for this is a very critical open-book exam, but there are no
books, just a bunch of dense PDF papers and lectures (a lot of them). If I
get such an index, I might get an edge here.

TIA,
Best regards,
Giovanni

Arved Sandstrom · Dec 6, 2011

Presumably you don't want to get as high-powered (and costly and
complicated) as something like CBR (content based retrieval) in IBM
FileNet P8.

AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and
indexing. If you're in control of the entire Alfresco system you'd have
access to the indexing data in its raw form. But I don't see the point;
I'd simply run PDFBox and Lucene standalone if all you want is a global
index. Granted, Alfresco is not a complicated install.
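
A standalone run has to locate the PDFs itself, since they are scattered across several directories. The following is only a sketch of that first step, using the JDK alone; the class name `PdfFinder` and the choice to match on a `.pdf` suffix are assumptions, and extraction and indexing would follow as separate steps.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical helper: collect PDF files below several root directories.
class PdfFinder {
    // Walks each root recursively and returns regular files whose name
    // ends in ".pdf" (case-insensitive), sorted within each root.
    static List<Path> findPdfs(List<Path> roots) throws IOException {
        List<Path> result = new ArrayList<>();
        for (Path root : roots) {
            // Files.walk holds directory handles open, so close the stream.
            try (Stream<Path> walk = Files.walk(root)) {
                walk.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                                  .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .forEach(result::add);
            }
        }
        return result;
    }
}
```

Each path returned can then be handed to whichever extractor is chosen, one file at a time.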

One note: PDFBox is noted by a number of commentators to be slow in the
Alfresco environment. For all I know it's slow, period. You might want
to consider pdftotext. There are some decent articles on using it vice
PDFBox with Alfresco.
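
For what it's worth, a rough sketch of driving pdftotext from Java follows. It assumes the `pdftotext` binary (Xpdf or Poppler) is on the PATH, and it relies on pdftotext's default behaviour of writing a form feed at each page break, which is what lets page numbers be recovered. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the external pdftotext tool.
class PdftotextPages {
    // Runs "pdftotext -enc UTF-8 <file> -" and returns its standard output.
    // Assumption: pdftotext is installed and reachable via the PATH.
    static String extract(Path pdf) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdf.toString(), "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        if (p.waitFor() != 0) {
            throw new IOException("pdftotext failed for " + pdf);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Splits the output on form feeds; element i holds page i + 1.
    // Blank pages in the middle are kept so page numbers stay aligned.
    static List<String> splitPages(String text) {
        List<String> pages = new ArrayList<>();
        for (String page : text.split("\f", -1)) {
            pages.add(page);
        }
        // The form feed after the last page leaves one empty trailing chunk.
        if (!pages.isEmpty() && pages.get(pages.size() - 1).trim().isEmpty()) {
            pages.remove(pages.size() - 1);
        }
        return pages;
    }
}
```

If the tool is run with `-nopgbrk`, the form feeds are suppressed and this page split no longer works.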

AHS

Roedy Green · Dec 9, 2011

There are a ton of PDF utilities. Have a browse at
http://mindprod.com/jgloss/pdf.html

I would be quite surprised if what you want does not exist.
--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
