Need expert help matching a line

  • Thread starter Ramon F Herrera

Ramon F Herrera

This is really a parsing question, but I figure that nobody knows more
about regex and pattern matching than Perl programmers.

I have many files which contain multiple lines of variable-value pair
assignments. I need to break down each line into its 3 constituent
components.

Variable Name = Variable Value

IOW, each line contains 3 parts:

VariableName
Equal Sign
VariableValue

Unlike the variable names used by many programming languages,
my variable names accept embedded spaces.

Here are some examples of the lines I am trying to match:

My Favorite Baseball Player = George Herman "Babe" Ruth
What did your do on Christmas = I rested, computed the % mortgage and
visited my brother + sister.
Favorite Curse = That umpire is a #&*%!

What I need is a way to specify valid characters.

VariableName: Alphanumeric (and perhaps underscore), blank space.
VariableValue: Pretty much anything is valid on the RHS except an '='
sign (I guess)

Thanks for your kind assistance.

-Ramon
 
Ramon F Herrera

Just to make the exercise a little harder -and fun- the assignment
syntax should be able to support continuation lines, where the RHS is
very long:

Describe your summer vacation = Well, we traveled to the beach
  and to the mountains, and debated whether we
  should go to the Grand Canyon and Niagara falls.
  The GPS you gave me turned out to be very useful!

A continuation line always starts with blank space.

TIA,

-Ramon
 
Lucius Sanctimonious

perlre (the manpage for Perl regular expressions) is your friend.
Seriously.  It will answer all the questions you raised.


Thanks Don, seriously.

You are essentially telling me to RTFM. I have already RTFM.

The question remains open...

Thx,

-Ramon
 
Charlton Wilbur

LS> You are essentially telling me to RTFM. I have already RTFM.

Your question shows no evidence of this.

LS> The question remains open...

Post what you've already tried, and let us know what you're having
problems with.

Also, review the posting guidelines that are posted here frequently, or
online at http://www.rehabitation.com/clpmisc/clpmisc_guidelines.html --
they're a summary of what works best if you really want to get help,
instead of just wanting to stir up drama.

Charlton
 
ccc31807

CODE:
use strict;
use warnings;

my ($var, $val);
my %variables;
while (<DATA>)
{
    chomp;
    # A line containing '=' starts a new variable.
    if (/=/) { ($var, $val) = split /=/; }
    # A line starting with blank space continues the previous value.
    elsif (/^ +\w+/) { $val .= $_; }
    else { next; }
    $var =~ s/^\s+//;
    $var =~ s/\s+$//;
    $variables{$var} = $val;
}

foreach my $key (keys %variables) { print "$key => $variables{$key}\n"; }
exit(0);

__DATA__
My Favorite Baseball Player = George Herman "Babe" Ruth
What did your do on Christmas = I rested, computed the % mortgage and
 visited my brother + sister.
Describe your summer vacation = Well, we traveled to the beach
 and to the mountains, and debated whether we
 should go to the Grand Canyon and Niagara falls.
 The GPS you gave me turned out to be very useful!
Favorite Curse = That umpire is a #&*%!

OUTPUT:
My Favorite Baseball Player => George Herman "Babe" Ruth
Describe your summer vacation => Well, we traveled to the beach and
to the mountains, and debated whether we should go to the Grand
Canyon and Niagara falls. The GPS you gave me turned out to be very
useful!
Favorite Curse => That umpire is a #&*%!
What did your do on Christmas => I rested, computed the % mortgage
and visited my brother + sister.
 
sln

CODE:
use strict;
use warnings;

my ($var, $val) = ('','');
my %variables;
while (<DATA>)
{
chomp;
if (/=/) { ($var, $val) = split /=/; }
elsif (/^ +\w+/) { $val .= $_; }
else { next; }
$var =~ s/^\s+//;
$var =~ s/\s+$//;
$variables{$var} = $val;
}

foreach my $key (keys %variables) { print "$key => $variables{$key}\n"; }
exit(0);
Looks good. I like the way you did this.
Might need initial condition check
elsif (/^ +\w+/ and length($var)) { $val .= $_; }

-sln
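
Since the original poster's program is in C++, the guard suggested above can be sketched there as well. This is only an illustration under assumptions: `trim` and `parse_assignments` are hypothetical names, plain `std::string` searching is used instead of a regex, and an indented line is always treated as a continuation, even if it contains an '='.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: strip leading and trailing blanks and tabs.
static std::string trim(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Collect "name = value" records. A line that starts with a blank is
// appended to the value of the most recently opened name.
std::map<std::string, std::string> parse_assignments(const std::string& text) {
    std::map<std::string, std::string> variables;
    std::string current;  // stays empty until the first assignment is seen
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const bool continuation =
            !line.empty() && (line[0] == ' ' || line[0] == '\t');
        const std::string::size_type eq = line.find('=');
        if (continuation && !current.empty()) {
            // The guard: only append when a variable is already open.
            variables[current] += " " + trim(line);
        } else if (!continuation && eq != std::string::npos) {
            current = trim(line.substr(0, eq));
            if (!current.empty())
                variables[current] = trim(line.substr(eq + 1));
        }
        // Everything else (blank lines, stray text, orphan
        // continuation lines) is skipped.
    }
    return variables;
}
```

Without the `!current.empty()` check, a continuation line at the very top of a file would be attached to a variable that does not exist yet, which is the case the initial condition check is meant to catch.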
 
sln

-sln

use strict;
use warnings;

my $buf = '';

while (<DATA>)
{
    if (/=/ or eof) {
        if ($buf =~ /\s*([\w ]+)\s*=\s*((?:.+(?:\n .+)*)|)/)
        {
            my ($var,$val) = ($1,$2);
            $val =~ s/\n +/\n/g;
            print "$var => $val\n\n";
        }
        $buf = '';
    }
    $buf .= $_;
}
__DATA__

My Favorite Baseball Player = George Herman = "Babe" Ruth
What did your do on Christmas = I rested, computed the % mortgage and
 visited my brother + sister.
 asdfasdf=
Favorite Curse = That umpire is a #&*%!
errnngsf
sngdnsdg
Describe your summer vacation = Well, we traveled to the beach
  and to the mountains, and debated whether we
  should go to the Grand Canyon and Niagara falls.
  The GPS you gave me turned out to be very useful!
 
Ramon F Herrera

Thank you, sln!

I have to clarify that my program is not written in Perl (a language
that I haven't used in ages) but in C++. The reason I posted my
question in this NG will be understood by reading this:

http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax.html

I am sticking with the default (Perl) Regex syntax.

This is the relevant code that I have so far. As you can see it is
rather simplistic. I am not implementing continuation lines yet.

const string variable = "([\\w ]+)";
const char equal_sign = '=';
const string value = "([\\w ]+)";

const string assignment = variable + equal_sign + value;

The question that I have is this: how do I restrict the LHS to begin
with an alphabetic character? IOW: The LHS may contain blanks but
they cannot be the first character of the line. I will also be
accepting digits, periods and underscores on the LHS but again, the
variable name cannot begin with any of them.

TIA,

-Ramon
 
Ramon F Herrera

I have made some progress here:

const string variable = "(\\w+[\\w\\d\\. ]*)";
const char equal_sign = '=';
const string value = "(.+)";

I think the above will cover most real cases, but I am not sure what
will happen if the RHS contains an '=' sign.

-RFH
 
sln

That page uses Perl 5.8 as the reference for describing the syntax. Is that the library's default?
http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
Adding those restrictions just requires this:
const string variable = "([a-zA-Z][\\w. ]+)";

There is nothing in the regex that is not in Perl 5.8, if
that's what they will be using.

-sln
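
For anyone who wants to try the letter-first restriction from C++, here is a minimal sketch. It assumes `std::regex` (ECMAScript grammar) in place of Boost.Regex, and `split_assignment` is a hypothetical name. It also uses `*` rather than `+` after the first character so that one-letter names are accepted; that is a deliberate change from the pattern above.

```cpp
#include <regex>
#include <string>

// Name: one letter, then any mix of word characters, periods and blanks.
// '*' (rather than '+') also admits single-letter names.
const std::string variable   = "([a-zA-Z][\\w. ]*)";
const std::string equal_sign = "=";
const std::string value      = "(.+)";
const std::regex  assignment(variable + equal_sign + value);

// Returns true and fills name/val when the whole line is an assignment.
// Blanks around the '=' are NOT trimmed by this pattern.
bool split_assignment(const std::string& line,
                      std::string& name, std::string& val) {
    std::smatch m;
    if (!std::regex_match(line, m, assignment)) return false;
    name = m.str(1);
    val  = m.str(2);
    return true;
}
```

Because the name class cannot contain '=', the match always breaks at the first '=' on the line, so any further '=' characters end up in the value. Lines that begin with a blank, a digit, a period or an underscore are rejected, since `std::regex_match` has to match the whole line from its first character.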
 
Ramon F Herrera

This uses Perl 5.8 as a reference to describe the syntax.
That is its library default?

Don't know, but frankly, my expressions will be so simple (the only
ones I am capable of writing :) that I doubt the syntax version will
make any difference.

All these questions are really to refresh my memory, since all this
stuff is coming back to me. My biggie Perl pattern matching project
was as follows. I used to manage a multi-thousand subscriber mailing
list at MIT. In those days e-mail traffic could really bog down a server
and network, and many of my subscribers graduated or went on a summer
vacation, etc., and forgot to unsubscribe. What I did was to
pattern-match every conceivable bounce received and extract the e-mail
address
of the subscriber. There were lots of mail servers then, BITNET, UUCP,
DECNET and many versions of sendmail. At least today e-mail addresses
are pretty much standard.

-Ramon
 
sln

"(\\w+[\\w\\d\\. ]*)";
  ^ don't you want alpha first character?
"(\\w+[\\w\\d\\. ]*)";
          ^ this is redundant

Otherwise, it looks ok. Since Boost is using Perl 5.8, you may
be able to do some validation and trimming all in the regex components.

// VAR Capture: alpha start char, other chars alphanumeric, space and '.',
// Trim (do not capture) trailing white spaces before 'equal_sign'
const string variable = "([a-zA-Z](?:(?!\\s*=)[\\w. ])*)";
// Breakdown:
// (                # start capture group
//   [a-zA-Z]       # first char, alpha
//   (?:            # pseudo group
//     (?!\s*=)     # IF NOT whitespace(*) followed by equal sign
//     [\w. ]       # AND this char is in this class
//                  # THEN consume character
//                  # ELSE fail (or trim) on this character
//   )*             # end group, do none or many times
// )                # finish capture, done once


// Separator: whitespace, equal, whitespace (non-capture, considered trim)
const string equal_sign = "\\s*=\\s*";

// VAL Capture: any character up until a newline.
// Trim (do not capture) trailing white spaces before either
// equal sign (invalid separator), newline or end of string.
const string value = "((?:(?!\\s*(?:=|\\n|$)).)+)";
// Breakdown:
// (                # start capture group
//   (?:            # pseudo group
//     (?!          # IF NOT
//       \s*        # whitespace(*) followed by
//       (?:=|\n|$) # equal or newline or end of string
//     )
//     .            # AND this char is not newline
//                  # THEN consume character
//                  # ELSE fail (or trim) on this character
//   )+             # end group, do once or many times
// )                # finish capture, done once

Combined it looks something like this -
/([a-zA-Z](?:(?!\s*=)[\w. ])*)\s*=\s*((?:(?!\s*(?:=|\n|$)).)+)/

I am guilty of too much info. It looks worse than it really is.
Thanks for that Boost info.

-sln
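
The combined pattern can be tried out from C++ roughly as follows. This sketch assumes `std::regex` (ECMAScript grammar, which supports the `(?!...)` lookahead) rather than Boost.Regex, and `parse_line` is a hypothetical name. A raw string literal is used so the backslashes do not have to be doubled.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns {name, value}; both are empty when the line holds no assignment.
std::pair<std::string, std::string> parse_line(const std::string& line) {
    // Same pattern as the combined Perl regex above, written as a raw string.
    static const std::regex assignment(
        R"re(([a-zA-Z](?:(?!\s*=)[\w. ])*)\s*=\s*((?:(?!\s*(?:=|\n|$)).)+))re");
    std::smatch m;
    if (!std::regex_search(line, m, assignment)) return {};
    return {m.str(1), m.str(2)};
}
```

On a single line such as `Favorite Curse   =   That umpire is a #&*%!   ` the lookaheads trim the blanks on both sides of the '=' and at the end of the value. A second '=' in the value ends the capture there, so the rest of the line is dropped. Note that `std::regex_search` is unanchored, so leading characters that cannot start a name are skipped; prefixing the pattern with `^` would be one way to reject such lines instead.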
 
Jürgen Exner

See 'perldoc -f split':

($variable_name, $variable_value) = split (/=/, $line);

As long as there isn't an equal sign in either name or value this will
work just fine.

jue
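
If the value may contain an equal sign, one option is to split at the first '=' only. In Perl that is the optional third (limit) argument of split, e.g. split(/=/, $line, 2). A rough C++ counterpart, with the hypothetical name `split_once` and no trimming of blanks, might look like this:

```cpp
#include <string>

// Split at the first '=' only; later '=' characters stay in the value.
// Returns false when the line contains no '=' at all.
bool split_once(const std::string& line,
                std::string& name, std::string& value) {
    const std::string::size_type eq = line.find('=');
    if (eq == std::string::npos) return false;
    name  = line.substr(0, eq);
    value = line.substr(eq + 1);
    return true;
}
```

This still assumes that the name itself never contains an '=', which matches the character set the original poster listed for the LHS.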
 
