Converting PDF to text

  • Thread starter Chandramohan Neelakantan

Chandramohan Neelakantan

Hello all,

Problem:

I need to extract text from a PDF file and write it to a file for a
hardware project.
The text is in a table that lists the width and height of the
different layers of a chip.
The width and height values will be used to create test layouts
for the different layers using Cadence SKILL.


OS: HP-UX

Other tools used: Cadence SKILL



I want to do this initial PDF parsing in Perl because:

- it comes with the OS
- there is no point in writing a PDF parsing tool myself (that would be an
independent project in its own right)
- someone must have run into this parsing problem before

I hope I'm clear so far.


Searching:

I tried a module search on search.cpan.org, but as far as I can see
there is no module that extracts the text from a PDF file.


I also tried searching on Google, but there only seems to be pdf2text for
Linux.



Solutions:

- I would appreciate it if someone could point me to a module or script
that converts PDF to text

- any other suggestions for tackling the problem are welcome



Many thanks
CM
 
David Efflandt

Chandramohan Neelakantan said:
I also tried searching on Google, but there only seems to be pdf2text for
Linux.

My system calls it pdf2ascii, which is one of the utilities included with
ghostscript (PostScript and PDF language interpreter and previewer). You
might see if 'gs' is either on your system or if ghostscript could be
compiled for HP-UX. See if 'apropos pdf' (or ghostscript) turns up
anything.

Whether that would work depends on whether the PDF was created from a
text-based source. If the text is in an image (scanned, etc.) you would need
some sort of OCR software to interpret the graphical text.
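
The two checks above could be scripted roughly as follows. This is only a sketch, written in Python for illustration; the command names are the ones mentioned in this post, and the 20-character threshold in `looks_image_only` is an arbitrary assumption, not a rule.

```python
import shutil

# Command names taken from this post; whether either exists on a given
# HP-UX box is something to verify locally. Extend the tuple as needed.
CANDIDATES = ("gs", "pdf2ascii")


def available_tools(candidates=CANDIDATES, which=shutil.which):
    """Return {name: full_path} for every candidate found on PATH."""
    found = {}
    for name in candidates:
        path = which(name)
        if path:
            found[name] = path
    return found


def looks_image_only(extracted_text, min_chars=20):
    """Heuristic: if almost no visible characters were extracted, the PDF
    may consist of scanned images and would need OCR instead.
    min_chars is an assumed threshold."""
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


if __name__ == "__main__":
    for name, path in available_tools().items():
        print(f"{name}: {path}")
```

If `available_tools()` comes back empty, none of the listed commands is on the PATH and something would have to be installed or compiled first.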
 
Vlad Tepes

Chandramohan Neelakantan said:
Hello all,

I need to extract text from a PDF file and write it to a file for a
hardware project.

You could try the command-line utility pdftotext from the xpdf
distribution. I've had better results with that tool than with
pdf2ascii (which comes with ghostscript).
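
A minimal sketch of how that could be wired up, in Python for illustration (the same steps translate directly to Perl). It runs pdftotext with `-layout` so table columns stay aligned, then picks out rows. The row shape (a layer name followed by two numbers) and the names `parse_layers` and `write_layers` are assumptions made for this example, so adjust the pattern to the real table.

```python
import re
import subprocess
import sys

# Assumed row shape: layer name, width, height, separated by whitespace,
# e.g. "METAL1   0.23   0.50". Lines that do not fit are skipped.
ROW = re.compile(r"^\s*(\S+)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$")


def pdf_to_text(pdf_path):
    """Run pdftotext, preserving the physical layout, and return the text."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes to stdout
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_layers(text):
    """Return a list of (layer, width, height) tuples found in the text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((match.group(1),
                         float(match.group(2)),
                         float(match.group(3))))
    return rows


def write_layers(rows, out_path):
    """Write one 'layer width height' line per row."""
    with open(out_path, "w") as handle:
        for name, width, height in rows:
            handle.write(f"{name} {width} {height}\n")


if __name__ == "__main__":
    # Usage: script.py input.pdf output.txt
    write_layers(parse_layers(pdf_to_text(sys.argv[1])), sys.argv[2])
```

The output file is plain whitespace-separated text, which should be easy to read back from SKILL or any other tool.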

Just my two cents,
 
Chandramohan Neelakantan

Many thanks for the tips.


-CM



 
