Converting pdf to text

Chandramohan Neelakantan · Sep 10, 2003

Hello all,

Problem:

I need to extract text information from a PDF file and write the text
to a file for a hardware project.
The text is contained in a table and holds the width and height
information of different layers for a chip.
The width and height information will be used to create test layouts
for different layers using Cadence SKILL.

OS: HP-UX

Other tools used: Cadence SKILL

I wanted to do this initial PDF parsing in Perl because:

- it comes with the OS
- there is no point in writing the PDF parsing tool myself (which would
be an independent project in itself)
- someone must have run into this parsing problem before

I hope I'm clear so far.

Searching:

I tried a module search on search.cpan.org, but as far as I have seen,
there isn't one that extracts the text information from a PDF file.

I also tried searching on Google, but all I found seems to be pdf2text
for Linux.

Solutions:

- I would appreciate it if someone could point me to a module/script
that converts PDF to text

- any other suggestions for tackling the problem are welcome

Many thanks
CM

David Efflandt · Sep 10, 2003

Chandramohan Neelakantan said:
I also tried searching on Google, but there seems to be pdf2text for
Linux

My system calls it pdf2ascii, which is one of the utilities included with
ghostscript (PostScript and PDF language interpreter and previewer). You
might see if 'gs' is either on your system or if ghostscript could be
compiled for HP-UX. See if 'apropos pdf' (or ghostscript) turns up
anything.

Whether that would work depends on whether the PDF was created from a
text-based source. If the text is in an image (scanned, etc.) you would
need some sort of OCR software to interpret the graphical text.
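
A quick way to see which of the tools named in this thread are installed is to look them up on the PATH. The following is only a sketch, written in Python for illustration; `find_extractors` is a made-up helper name, and the list of candidates is just the set of tool names mentioned in this thread, so any of them may be missing or named differently on a given system.

```python
import shutil

# Tool names taken from this thread; which ones exist depends on the system.
CANDIDATES = ("pdftotext", "pdf2ascii", "gs")


def find_extractors(names=CANDIDATES):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_extractors().items():
        print(f"{name}: {path or 'not found'}")
```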

Vlad Tepes · Sep 10, 2003

Chandramohan Neelakantan said:
Hello all,

Need to extract text information from a pdf file, write the text
to a file for a hardware project.

You could try the command line utility pdftotext from the xpdf
distribution. I've had better experience with that tool than with
pdf2ascii (which comes with ghostscript).

Just my two cents,
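
To make the pdftotext suggestion concrete, here is a minimal sketch (in Python, for illustration) that runs pdftotext with `-layout`, so that table columns stay aligned, and then picks out the table rows. The row format is an assumption: the thread never shows the table, so the sketch guesses that each row is a layer name followed by a numeric width and height. The function names and the regular expression are hypothetical and would need adjusting to the real document.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text.

    '-layout' keeps the physical page layout; '-' as the output file
    sends the text to stdout instead of writing a .txt file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# ASSUMPTION: each table row looks like "<layer name> <width> <height>".
ROW = re.compile(r"^\s*(\S+)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$")


def parse_layer_table(text):
    """Return (layer, width, height) tuples for lines that match ROW."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((m.group(1), float(m.group(2)), float(m.group(3))))
    return rows


def write_rows(rows, out_path):
    """Write one 'layer width height' line per row to a plain text file."""
    with open(out_path, "w") as out:
        for layer, width, height in rows:
            out.write(f"{layer} {width} {height}\n")
```

Lines that do not match the assumed pattern (headers, notes, blank lines) are skipped, so it is worth comparing the row count against the table in the PDF before using the output.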

Chandramohan Neelakantan · Sep 15, 2003

Many thanks for the tips.

-CM
