Need expert help matching a line

Ramon F Herrera · Sep 8, 2009

This is really a parsing question, but I figure that nobody knows more
about regex and pattern matching than Perl programmers.

I have many files which contain multiple lines of variable-value pair
assignments. I need to break down each line into its 3 constituent
components.

Variable Name = Variable Value

IOW, each line contains 3 parts:

VariableName
Equal Sign
VariableValue

Unlike the variable names used by many programming languages,
my variable names accept embedded spaces.

Here are some examples of the lines I am trying to match:

My Favorite Baseball Player = George Herman "Babe" Ruth
What did your do on Christmas = I rested, computed the % mortgage and
    visited my brother + sister.
Favorite Curse = That umpire is a #&*%!

What I need is a way to specify valid characters.

VariableName: Alphanumeric (and perhaps underscore), blank space.
VariableValue: Pretty much anything is valid on the RHS except an '='
sign (I guess)

Thanks for your kind assistance.

-Ramon

Ramon F Herrera · Sep 8, 2009

Just to make the exercise a little harder -and fun- the assignment
syntax should be able to support continuation lines, where the RHS is
very long:

Describe your summer vacation = Well, we traveled to the beach
    and to the mountains, and debated whether we
    should go to the Grand Canyon and Niagara falls.
    The GPS you gave me turned out to be very useful!

A continuation line always starts with blank space.

TIA,

-Ramon
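
One way to handle the continuation rule is to fold the physical lines into logical lines before any matching is attempted. The following is only a sketch in C++ (the language of the poster's program); the function name `fold_continuations` is hypothetical, and it assumes that a continuation line starts with a space or a tab and is joined to the previous line with a single space.

```cpp
#include <string>
#include <vector>

// Join continuation lines (lines starting with a space or tab) onto the
// preceding logical line, separated by a single space.
// Assumption: empty lines and orphan continuation lines are skipped.
std::vector<std::string> fold_continuations(const std::vector<std::string>& lines) {
    std::vector<std::string> logical;
    for (const std::string& line : lines) {
        if (line.empty()) {
            continue;
        }
        if (line[0] == ' ' || line[0] == '\t') {
            if (logical.empty()) {
                continue;  // continuation with nothing to continue
            }
            // Drop the leading blanks and append to the current record.
            const std::size_t first = line.find_first_not_of(" \t");
            if (first != std::string::npos) {
                logical.back() += ' ';
                logical.back() += line.substr(first);
            }
        } else {
            logical.push_back(line);
        }
    }
    return logical;
}
```

Each element of the result is then a complete "name = value" record that a single-line pattern can be applied to.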

Lucius Sanctimonious · Sep 8, 2009

perlre (the manpage for Perl regular expressions) is your friend.
Seriously. It will answer all the questions you raised.

Thanks Don, seriously.

You are essentially telling me to RTFM. I have already RTFM.

The question remains open...

Thx,

-Ramon

Charlton Wilbur · Sep 8, 2009

LS> You are essentially telling me to RTFM. I have already RTFM.

Your question shows no evidence of this.

LS> The question remains open...

Post what you've already tried, and let us know what you're having
problems with.

Also, review the posting guidelines that are posted here frequently, or
online at http://www.rehabitation.com/clpmisc/clpmisc_guidelines.html --
they're a summary of what works best if you really want to get help,
instead of just wanting to stir up drama.

Charlton

ccc31807 · Sep 8, 2009

CODE:
use strict;
use warnings;

my ($var, $val);
my %variables;
while (<DATA>)
{
    chomp;
    if (/=/) { ($var, $val) = split /=/; }
    elsif (/^ +\w+/) { $val .= $_; }
    else { next; }
    $var =~ s/^\s+//;
    $var =~ s/\s+$//;
    $variables{$var} = $val;
}

foreach my $key (keys %variables) { print "$key => $variables{$key}\n"; }
exit(0);

__DATA__
My Favorite Baseball Player = George Herman "Babe" Ruth
What did your do on Christmas = I rested, computed the % mortgage and
    visited my brother + sister.
Describe your summer vacation = Well, we traveled to the beach
    and to the mountains, and debated whether we
    should go to the Grand Canyon and Niagara falls.
    The GPS you gave me turned out to be very useful!
Favorite Curse = That umpire is a #&*%!

OUTPUT:
My Favorite Baseball Player => George Herman "Babe" Ruth
Describe your summer vacation => Well, we traveled to the beach and
to the mountains, and debated whether we should go to the Grand
Canyon and Niagara falls. The GPS you gave me turned out to be very
useful!
Favorite Curse => That umpire is a #&*%!
What did your do on Christmas => I rested, computed the % mortgage
and visited my brother + sister.

sln · Sep 8, 2009

CODE:
use strict;
use warnings;

my ($var, $val) = ('','');
my %variables;
while (<DATA>)
{
    chomp;
    if (/=/) { ($var, $val) = split /=/; }
    elsif (/^ +\w+/) { $val .= $_; }
    else { next; }
    $var =~ s/^\s+//;
    $var =~ s/\s+$//;
    $variables{$var} = $val;
}

foreach my $key (keys %variables) { print "$key => $variables{$key}\n"; }
exit(0);

Looks good. I like the way you did this.
It might need an initial-condition check:
    elsif (/^ +\w+/ and length($var)) { $val .= $_; }

-sln

sln · Sep 8, 2009

-sln

use strict;
use warnings;

my $buf = '';

# Note: on the last line, eof is already true, so the buffer is flushed
# before that final line is appended to it.
while (<DATA>)
{
    if (/=/ or eof) {
        if ($buf =~ /\s*([\w ]+)\s*=\s*((?:.+(?:\n .+)*)|)/)
        {
            my ($var,$val) = ($1,$2);
            $val =~ s/\n +/\n/g;
            print "$var => $val\n\n";
        }
        $buf = '';
    }
    $buf .= $_;
}
__DATA__

My Favorite Baseball Player = George Herman = "Babe" Ruth
What did your do on Christmas = I rested, computed the % mortgage and
    visited my brother + sister.
asdfasdf=
Favorite Curse = That umpire is a #&*%!
errnngsf
sngdnsdg
Describe your summer vacation = Well, we traveled to the beach
    and to the mountains, and debated whether we
    should go to the Grand Canyon and Niagara falls.
    The GPS you gave me turned out to be very useful!

Ramon F Herrera · Sep 8, 2009

Thank you, sln!

I have to clarify that my program is not written in Perl (a language
that I haven't used in ages) but in C++. The reason I posted my
question in this NG will be understood by reading this:

http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax.html

I am sticking with the default (Perl) Regex syntax.

This is the relevant code that I have so far. As you can see it is
rather simplistic. I am not implementing continuation lines yet.

const string variable = "([\\w ]+)";
const char equal_sign = '=';
const string value = "([\\w ]+)";

const string assignment = variable + equal_sign + value;

The question that I have is this: how do I restrict the LHS to begin
with an alphabetic character? IOW: The LHS may contain blanks but
they cannot be the first character of the line. I will also be
accepting digits, periods and underscores on the LHS but again, the
variable name cannot begin with any of them.

TIA,

-Ramon

Ramon F Herrera · Sep 8, 2009

I have made some progress here:

const string variable = "(\\w+[\\w\\d\\. ]*)";
const char equal_sign = '=';
const string value = "(.+)";

I think the above will cover most real cases, but I am not sure what
will happen if the RHS contains an '=' sign.

-RFH
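
The '=' question can be checked with a small experiment. This is only a sketch: it uses `std::regex` (ECMAScript syntax) as a stand-in for Boost.Regex, and the function name `match_assignment` is hypothetical. Because the name class cannot contain '=', the name has to end at the first '=', and the greedy `(.+)` then takes everything after it, including any later '=' signs.

```cpp
#include <regex>
#include <string>
#include <utility>

// The three components from the post, concatenated into one pattern and
// written as a raw string literal to avoid double escaping.
// Returns {name, value}, or two empty strings when the line does not match.
std::pair<std::string, std::string> match_assignment(const std::string& line) {
    static const std::regex assignment(R"((\w+[\w\d\. ]*)=(.+))");
    std::smatch m;
    if (std::regex_match(line, m, assignment)) {
        return {m[1].str(), m[2].str()};
    }
    return {"", ""};
}
```

For the input `Player = George = Ruth` this yields the name `Player ` (with a trailing blank) and the value ` George = Ruth` (with a leading blank), so the blanks around the separator still have to be trimmed separately.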

sln · Sep 8, 2009

I have to clarify that my program is not written in Perl (language
that I haven't used in ages) but in C++. The reason I posted my
question in this NG will be understood by reading this:

http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax.html

I am sticking with the default (Perl) Regex syntax.

This uses Perl 5.8 as a reference to describe the syntax. That is its library default?
http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

This is the relevant code that I have so far. As you can see it is
rather simplistic. I am not implementing continuation lines yet.

const string variable = "([\\w ]+)";
const char equal_sign = '=';
const string value = "([\\w ]+)";

const string assignment = variable + equal_sign + value;

The question that I have is this: how do I restrict the LHS to begin
with an alphabetic character? IOW: The LHS may contain blanks but
they cannot be the first character of the line. I will also be
accepting digits, periods and underscores on the LHS but again, the
variable name cannot begin with any of them.

Adding those restrictions requires only this:
const string variable = "([a-zA-Z][\\w. ]+)";

There is nothing in the regex that is not in Perl 5.8, if
that's what they will be using.

-sln
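
As a quick check of the name pattern on its own, here is a sketch using `std::regex` instead of Boost.Regex (an assumption made so the example needs no external library; `is_valid_name` is a hypothetical helper). Note that the `+` after the character class means a name must be at least two characters long.

```cpp
#include <regex>
#include <string>

// One ASCII letter, then one or more characters from [\w. ].
bool is_valid_name(const std::string& name) {
    static const std::regex pattern(R"([a-zA-Z][\w. ]+)");
    // regex_match requires the whole string to match the pattern.
    return std::regex_match(name, pattern);
}
```

With this, `Favorite Curse` and `v1.2 beta` are accepted, while `9 lives`, `_private`, `.hidden`, a name with a leading blank, and the single letter `x` are rejected. Changing `+` to `*` would admit one-letter names.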

Ramon F Herrera · Sep 9, 2009

This uses Perl 5.8 as a reference to describe the syntax.
That is its library default?

Don't know, but frankly, my expressions will be so simple (the only
ones I am capable of writing) that I doubt the syntax version will
make any difference.

All these questions are really to refresh my memory, since all this
stuff is coming back to me. My biggie Perl pattern matching project
was as follows. I used to manage a multi-thousand subscriber mailing
list at MIT. In those days e-mail traffic could really bog down a server
and network, and many of my subscribers graduated or went on a summer
vacation, etc., and forgot to unsubscribe. What I did was to
pattern-match every conceivable bounce received and extract the e-mail
address of the subscriber. There were lots of mail servers then: BITNET, UUCP,
DECNET and many versions of sendmail. At least today e-mail addresses
are pretty much standard.

-Ramon

sln · Sep 9, 2009

I have made some progress here:

const string variable = "(\\w+[\\w\\d\\. ]*)";
const char equal_sign = '=';
const string value = "(.+)";

I think the above will cover most real cases, but not sure what will
happen if the RHS contains an '=' sign?

-RFH

"(\\w+[\\w\\d\\. ]*)";
  ^ don't you want alpha first character?
"(\\w+[\\w\\d\\. ]*)";
          ^ this is redundant

Otherwise, it looks ok. Since Boost is using Perl 5.8, you may
be able to do some validation and trimming all in the regex components.

// VAR Capture: alpha start char, other chars alphanumeric, space and '.',
// Trim (do not capture) trailing white spaces before 'equal_sign'
const string variable = "([a-zA-Z](?:(?!\\s*=)[\\w. ])*)";
// Breakdown:
// (                 # start capture group
//   [a-zA-Z]        # first char, alpha
//   (?:             # pseudo group
//     (?!\s*=)      # IF NOT whitespace(*) followed by equal sign
//     [\w. ]        # AND this char is in this class
//                   # THEN consume character
//                   # ELSE fail (or trim) on this character
//   )*              # end group, do none or many times
// )                 # finish capture, done once

// Separator: whitespace, equal, whitespace (non-capture, considered trim)
const string equal_sign = "\\s*=\\s*";

// VAL Capture: any character up until a newline.
// Trim (do not capture) trailing white spaces before either
// equal sign (invalid separator), newline or end of string.
const string value = "((?:(?!\\s*(?:=|\\n|$)).)+)";
// Breakdown:
// (                   # start capture group
//   (?:               # pseudo group
//     (?!             # IF NOT
//       \s*           # whitespace(*) followed by
//       (?:=|\n|$)    # equal or newline or end of string
//     )
//     .               # AND this char is not newline
//                     # THEN consume character
//                     # ELSE fail (or trim) on this character
//   )+                # end group, do once or many times
// )                   # finish capture, done once

Combined it looks something like this -
/([a-zA-Z](?:(?!\s*=)[\w. ])*)\s*=\s*((?:(?!\s*(?:=|\n|$)).)+)/

I am guilty of too much info. It looks worse than it really is.
Thanks for that Boost info.

-sln
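
For anyone who wants to try the combined pattern from C++, here is a sketch using `std::regex` in its default ECMAScript mode, which supports the `(?:...)` and `(?!...)` constructs used here. It is not the Boost.Regex API, and the function name `parse_assignment` is hypothetical. The pattern is unanchored, so `regex_search` is used and leading characters that cannot start a name are simply skipped.

```cpp
#include <regex>
#include <string>

// Apply the combined name/separator/value pattern to one logical line.
// On success, name and value receive the two captures (already trimmed by
// the lookaheads) and the function returns true.
bool parse_assignment(const std::string& line, std::string& name, std::string& value) {
    static const std::regex assignment(
        R"(([a-zA-Z](?:(?!\s*=)[\w. ])*)\s*=\s*((?:(?!\s*(?:=|\n|$)).)+))");
    std::smatch m;
    if (!std::regex_search(line, m, assignment)) {
        return false;
    }
    name = m[1].str();
    value = m[2].str();
    return true;
}
```

With this pattern a second '=' on the line ends the value, so `My Favorite Baseball Player = George Herman = "Babe" Ruth` gives the value `George Herman`, and a line such as `asdfasdf=` with an empty right-hand side does not match at all.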

Jürgen Exner · Sep 16, 2009

Ramon F Herrera said:
This is really a parsing question, but I figure that nobody knows more
about regex and pattern matching than Perl programmers.

I have many files which contain multiple lines of variable-value pair
assignments. I need to break down each line into its 3 constituent
components.

Variable Name = Variable Value

IOW, each line contains 3 parts:

VariableName
Equal Sign
VariableValue

See 'perldoc -f split':

($variable_name, $variable_value) = split (/=/, $line);

As long as there isn't an equal sign in either name or value this will
work just fine.

jue
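
If the value may contain an equal sign, splitting only at the first '=' avoids losing the rest of the line. In Perl that is the third (LIMIT) argument to split, as in split(/=/, $line, 2). Since the poster's program is in C++, here is a rough equivalent as a sketch; the helper names `trim` and `split_assignment` are hypothetical, and only spaces and tabs are treated as blanks.

```cpp
#include <string>

// Strip leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) {
        return "";
    }
    const std::size_t last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split at the first '=' only, so later '=' signs stay in the value.
// Returns false when the line contains no '=' at all.
bool split_assignment(const std::string& line, std::string& name, std::string& value) {
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos) {
        return false;
    }
    name = trim(line.substr(0, eq));
    value = trim(line.substr(eq + 1));
    return true;
}
```

This does no validation of the name, so a character-class check would still be needed if names have to start with a letter.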
