PDF to text converter?

dare ruby

Dear all,

Could anyone explain how to convert a PDF to text format?

Thanks in advance

Regards,
Jose Martin
 
Axel Etzold

-------- Original Message --------
Date: Mon, 11 Aug 2008 18:41:51 +0900
From: dare ruby <[email protected]>
To: (e-mail address removed)
Subject: PDF to text converter?
Dear all,

Could anyone explain how to convert a PDF to text format?

Thanks in advance

Regards,
Jose Martin

Dear Jose,

it depends on whether your PDF actually contains text or only images that a human can recognize as
text.
In the first case, you can try a tool like pdftotext (http://en.wikipedia.org/wiki/Pdftotext), at least on
Linux and Mac. On Windows, there are also some PDF viewers that offer a "Save as text" option.

In the second case, you'll have to use OCR (optical character recognition) software. There are some
good commercial products available. I've liked ABBYY's FineReader (on Windows).
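
One rough way to tell the two cases apart from a script is to look at whatever text an extractor such as pdftotext returned. The sketch below is only a heuristic: the method name `probably_needs_ocr?` and the threshold of 20 characters are assumptions made for illustration, not part of any tool mentioned here.

```ruby
# Heuristic sketch (an assumption, not a rule): treat a PDF as image-only
# when the text extracted from it contains fewer than `min_chars` letters
# or digits. `extracted_text` is the string some extractor returned.
def probably_needs_ocr?(extracted_text, min_chars: 20)
  extracted_text.scan(/[[:alnum:]]/).size < min_chars
end
```

A scanned document typically yields little more than page breaks and whitespace, so it would be flagged for OCR, while a document with a real text layer would not. Short documents may need a lower threshold.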

Best regards,

Axel
 
Kouhei Sutou

Hi,

In <[email protected]>
"PDF to text converter?" on Mon, 11 Aug 2008 18:41:51 +0900,
dare ruby said:
Could anyone explain how to convert a PDF to text format?

It seems that Ruby/Poppler(*1), the Ruby bindings of
Poppler(*2), is what you're looking for.
http://ruby-gnome2.svn.sourceforge..../trunk/poppler/sample/pdf2text.rb?view=markup

(*1) http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby/Poppler
(*2) http://poppler.freedesktop.org/

pdftotext is a command-line application bundled with Poppler.
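
For reference, a minimal sketch of the Ruby/Poppler approach might look like the following. It assumes the `poppler` gem is installed and that `Poppler::Document.new` accepts a plain file path; older releases may need a `file://` URI instead. The helper names `pages_to_text` and `pdf_to_text` are made up for this example.

```ruby
# Collect the text of every page and join the pages with a newline.
# `doc` is any object whose #each yields pages that respond to #get_text,
# which is how Poppler::Document is used here.
def pages_to_text(doc)
  texts = []
  doc.each { |page| texts << page.get_text.to_s }
  texts.join("\n")
end

# Hypothetical helper: open a PDF with Ruby/Poppler and return its text.
# Assumption: the constructor accepts a file path directly.
def pdf_to_text(path)
  require "poppler" # loaded lazily, so pages_to_text works without the gem
  pages_to_text(Poppler::Document.new(path))
end

# Usage (requires the poppler gem and a real file):
#   puts pdf_to_text("notes.pdf")
```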


Thanks,
 
dare ruby

I have some study materials as PDF documents. I need to convert the
PDFs to a text format that I can open in Microsoft Word or a text editor
on Windows. I need to do the parsing with a Ruby program. Could anyone
suggest how to do this?

Thanks in advance

Regards,
Jose Martin
 
Martin DeMello

I have some study materials as PDF documents. I need to convert the
PDFs to a text format that I can open in Microsoft Word or a text editor
on Windows. I need to do the parsing with a Ruby program. Could anyone
suggest how to do this?

Your best bet is a Ruby script that calls out to xpdf to do the actual
PDF-to-text conversion and then parses the text. There's a Windows port
of the xpdf command-line utilities.

http://gnuwin32.sourceforge.net/packages/xpdf.htm
http://www.perlmonks.org/?node_id=298041
http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/
http://forjournalists.com/cookbook/index.php?title=XPDF
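
A minimal sketch of such a script, assuming the `pdftotext` executable from xpdf (or Poppler) is on the PATH. The method names are made up for illustration, and the "parsing" step is only a placeholder that splits the output into pages and non-empty lines.

```ruby
require "open3"

# Build the argument list for pdftotext. A "-" as the output file tells
# pdftotext to write the text to standard output.
def pdftotext_command(pdf_path, layout: false)
  cmd = ["pdftotext"]
  cmd << "-layout" if layout # keep the physical layout of the page
  cmd + [pdf_path, "-"]
end

# Run pdftotext and return the extracted text.
# Raises Errno::ENOENT if the executable cannot be found.
def extract_text(pdf_path, layout: false)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path, layout: layout))
  raise "pdftotext failed: #{err.strip}" unless status.success?
  out
end

# Placeholder for the parsing step: pdftotext separates pages with a form
# feed, so split on it and keep the non-empty, stripped lines of each page.
def parse_pages(text)
  text.split("\f")
      .map { |page| page.lines.map(&:strip).reject(&:empty?) }
      .reject(&:empty?)
end

# Usage (requires pdftotext and a real file):
#   parse_pages(extract_text("notes.pdf")).each_with_index do |lines, i|
#     puts "Page #{i + 1}: #{lines.first}"
#   end
```

Passing the command as an array avoids shell quoting problems with file names that contain spaces, which are common on Windows.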

martin
 
