Peter Jamieson
I am trying to extract data from the table in a large number of rtf files.
I tried RTF::Tokenizer and RTF::Parser but could not make progress
so have decided to try regular expressions.
My project is to get the tabular data into a db for further analysis.
My problem is that I cannot see how to parse the data rows so
that they match the correct field headings.
Any advice or suggestions appreciated!
###########################################
# Perl code to parse table in rtf files #
###########################################
#!/usr/bin/perl -w
use strict;
use warnings;
use Time::Local;
use Win32::ODBC;
# use RTF::Tokenizer; # unsuccessful
# use RTF::Parser; # unsuccessful
use DBI;
use Getopt::Long;
my $ett = localtime();
print "\n Time : $ett \n";
my $file_ = 'BURN_RDX_01.rtf';
my @lines;
open(INFO, '<', $file_) || die("Unable to open $file_: $!");
@lines = <INFO>;
close(INFO);
# get the useful line data
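# start with an empty string so the append below does not warn
my $useful_data = '';
foreach my $line (@lines) {
    if ($line =~ /\\pard\\intbl/) {
        # append the matching line (it already ends in a newline)
        $useful_data .= $line;
    }
}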
print "useful_data are: $useful_data \n";
Inspection of the table headings reveals they may vary (sometimes there is no
telemetry data for a particular range, or a table has different
ranges), but typical headings look like this:
\pard\intbl {\b\f1\fs24\qc Propellant Burn Times \cell }\pard\intbl
{\f1\fs20\qc 22000m\par 20000m\cell
20000m\par 18000m\cell 18000m\par 16000m\cell 16000m\par 14000m\cell
14000m\par 12000m\cell 12000m\par
10000m\cell 10000m\par 8000m\cell 8000m\par 6000m\cell 6000m\par 4000m\cell
4000m\par 2000m\cell
2000m\par BURN CUT OFF\cell }\pard\intbl {\b\f1\qc 17812\cell }\pard\intbl
{\row }
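A minimal sketch of one way to split such a row into cell texts, assuming the whole row (everything up to \row) has first been joined into a single string, since a row can wrap over several physical lines. The helper name row_cells is hypothetical, and the control-word stripping is deliberately crude: it does not handle escaped characters such as \'e9 or \{.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of each cell in one RTF table row.
# $row is the row text up to (not including) the \row control word.
sub row_cells {
    my ($row) = @_;
    my @cells;
    # A cell's content is whatever precedes each \cell control word.
    while ($row =~ /(.*?)\\cell\b/gs) {
        my $text = $1;
        $text =~ s/\s*\\par\b\s*/-/g;    # "22000m\par 20000m" -> "22000m-20000m"
        $text =~ s/\\[a-z]+-?\d*\s?//g;  # strip control words (\pard, \fs20, \qc ...)
        $text =~ s/[{}]//g;              # strip group braces
        $text =~ s/^\s+|\s+$//g;         # trim surrounding whitespace
        push @cells, $text;
    }
    return @cells;
}

# Shortened version of the heading row shown above.
my $header = join "\n",
    '\pard\intbl {\b\f1\fs24\qc Propellant Burn Times \cell }\pard\intbl',
    '{\f1\fs20\qc 22000m\par 20000m\cell',
    '20000m\par 18000m\cell }\pard\intbl {\b\f1\qc 17812\cell }\pard\intbl {';

my @cells = row_cells($header);
print join(' | ', @cells), "\n";
```

Empty cells come back as empty strings rather than being dropped, so the position of each cell within the row is preserved.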
There may be 6 to 30 data rows in the table; a typical row looks like this:
\pard\intbl {\b\f1\fs20\qc 1\cell 40\cell Composition (RDX1)\cell \b0\fs16
\cell \b \cell \cell
1319\cell [90]\cell 1293\cell [90]\cell 1321\cell [90]\cell 1273\cell
[90]\cell 1245\cell [90]\cell
1173\cell [90]\cell 1117\cell [100]\cell 1102\cell [70]\cell 1119\cell
[10]\cell 1218\cell [10]\cell
17817 \cell }\pard\intbl {\row }
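A rough sketch of pulling the numeric value and the bracketed figure that follows it out of a data row with a single regular expression. The helper name value_pairs is hypothetical, and the sketch assumes every value cell is immediately followed by a cell of the form [n], as in the sample row above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [value, bracketed figure] pairs from one row.
sub value_pairs {
    my ($row) = @_;
    my @pairs;
    # A digits-only cell directly followed by a "[digits]" cell.
    while ($row =~ /(\d+)\s*\\cell\s*\[(\d+)\]\s*\\cell\b/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

# The sample data row from above, joined across its wrapped lines.
my $row = join "\n",
    '\pard\intbl {\b\f1\fs20\qc 1\cell 40\cell Composition (RDX1)\cell \b0\fs16',
    '\cell \b \cell \cell',
    '1319\cell [90]\cell 1293\cell [90]\cell 1321\cell [90]\cell 1273\cell',
    '[90]\cell 1245\cell [90]\cell',
    '1173\cell [90]\cell 1117\cell [100]\cell 1102\cell [70]\cell 1119\cell',
    '[10]\cell 1218\cell [10]\cell',
    '17817 \cell }\pard\intbl {';

my @pairs = value_pairs($row);
printf "%d pairs; first = %s [%s], last = %s [%s]\n",
    scalar @pairs, @{ $pairs[0] }, @{ $pairs[-1] };
```

The sample row yields ten pairs while the sample heading lists eleven ranges, so matching pairs to headings purely by position is an assumption that would need checking. If the files contain row definitions (\trowd with \cellx boundaries), those boundaries are probably a more reliable way to line cells up with headings.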