Parsing a chemical formula

Luotao Fu

Hi all,
This is my first post to this group, so sorry for any possible stupidity :)
I have been writing a Perl program for several days. It contains a small
routine which should parse a chemical formula and return the name and
count of each atom in the material as an array (or a hash). My code
looks like this:

my @literals = split /([A-Z])/, $molecule;

for (my $i = 0; $i <= $#literals; $i++) {
    my @atom;
    print "Literal: ", $literals[$i], "\n";
    push(@atom, $literals[$i]);
    if ($literals[$i+1] !~ /[A-Z]/) {
        push(@atom, $literals[$i+1]);
        $i++;
    }
    push(@atoms, join("", @atom));
}

$molecule contains the formula (e.g. H2O, FeCl3 or CaCl). The first
letter of every element is written in upper case. As you can see, I first
split $molecule on upper-case letters, which means FeCl3 turns into
{F,e,C,l3}. Then I scan the resulting list, which is stored in the array
@literals, for capital letters. Every capital letter is pushed onto a
temporary array. If the following item in the array is not upper case,
which means that the name of the atom contains more than one letter, it
is pushed onto the same temporary array, which is later joined and put
into the output array. The final result for the formula H2O should be
{H2,O}, for FeCl3 {Fe,Cl3} and so on.

This works so far, but I'm far from satisfied with this solution. There
must be better ways to solve it, with a more intelligent regexp and so
on. But I'm not very familiar with regexps in Perl, so I can't think of
any better solution.

Does anyone have an idea how I can write this routine more elegantly?

Thanx A lot
Cheers
Luotao Fu
 
Mark Clements

Luotao said:
Hi all,
This is my first post to this group, so sorry for any possible stupidity :)
I have been writing a Perl program for several days. It contains a small
routine which should parse a chemical formula and return the name and
count of each atom
This works so far, but I'm far from satisfied with this solution. There
must be better ways to solve it, with a more intelligent regexp and so
on. But I'm not very familiar with regexps in Perl, so I can't think of
any better solution.

You need to check out

man perlre

But you may want to check out Chemistry::FormulaPattern
(search.cpan.org) if this is a real-life problem rather than just a
programming exercise.

Try this (it is pretty rough, and I am sure I have missed edge cases) to
get you started:

bob 881 $ cat testformula.pl
#!/usr/local/bin/perl

use strict;
use warnings;

use Data::Dumper;

my $formula = shift;

my @elements = ();

while ($formula =~ s/([A-Z][a-z]?[0-9]*)//) {
    push @elements, $1;
}
print Dumper \@elements;

bob 882 $ ./testformula.pl H2SO4
$VAR1 = [
          'H2',
          'S',
          'O4'
        ];
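
Since the question mentions an array or a hash, here is a minimal sketch of the hash variant. It uses the same pattern, but captures the symbol and the count separately. The sub name count_atoms is made up for illustration, and it assumes a flat formula without parentheses.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of element symbol => atom count.
# Assumes a flat formula (no parentheses) built from valid symbols.
sub count_atoms {
    my ($formula) = @_;
    my %count;
    while ($formula =~ /([A-Z][a-z]?)(\d*)/g) {
        my ($symbol, $n) = ($1, $2);
        # A symbol with no digits after it stands for one atom.
        $count{$symbol} += (length($n) ? $n : 1);
    }
    return %count;
}

my %h2so4 = count_atoms('H2SO4');
print "$_ => $h2so4{$_}\n" for sort keys %h2so4;
```

Using += rather than = means a symbol that appears twice in one formula is summed instead of overwritten.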
 
GreenLeaf

Abigail said:
I wouldn't use split, just parse what you want to keep. What you want is
very simple: exactly one capital letter, followed by zero or more lower
case letters, followed by zero or more numbers. Written as a regex, this
is:

to OP:

If this is an exercise, then with the real-world scenario in mind you
might want to consider the rule that an element name is always exactly
one capital letter followed by _exactly zero or one lowercase letter_,
with the exception of elements that start with Uu. I'm assuming here that
yours is a program for learning, since you admitted to writing it 'since
days' :). Taking these facts into account will make your regex more robust.

You might also want to consider radicals (such as hydroxyl, -OH),
because they are sure to lead to incorrect results if you just ignore
parentheses: for instance, Fe(OH)3. You can handle this by first
capturing each parenthesized group together with the number that follows
it, then applying the same simple rules that you used for the
no-parentheses case to the token inside each pair of parentheses.
Something along the lines of

my @atoms = /((?:\(.+\)|Uu.|[A-Z][a-z]?)\d*)/g;

would work here.
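
One thing to watch with that pattern: .+ is greedy, so a formula with two parenthesized groups, such as (NH4)2Fe(SO4)2, would come back as a single token. Below is a minimal sketch of a variant that uses [^()]+ instead. It assumes groups are never nested, and the sub name tokenize is made up for illustration.

```perl
use strict;
use warnings;

# Sketch: split a formula into tokens, where a token is either a
# parenthesized group or an element symbol, plus an optional count.
# [^()]+ stops at the first closing parenthesis, so two groups in one
# formula stay separate. Nested parentheses are not handled.
sub tokenize {
    my ($formula) = @_;
    my @tokens = $formula =~ /((?:\([^()]+\)|Uu.|[A-Z][a-z]?)\d*)/g;
    return @tokens;
}

print join(', ', tokenize('Fe(OH)3')), "\n";         # Fe, (OH)3
print join(', ', tokenize('(NH4)2Fe(SO4)2')), "\n";  # (NH4)2, Fe, (SO4)2
```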

Since Abigail's post clearly gave you almost everything you need to
know, it would be quite straightforward to implement these simple
changes. Good luck! :)

Hope this helps,
sat
 
Luotao Fu

Hi,

@Abigail:
Fancy idea! Now the famous question to myself: if this is so simple, why
didn't I think of it myself? ;-) Works like a charm, thanx a lot.
to OP:

If this is an exercise, then with the real-world scenario in mind you
might want to consider the rule that an element name is always exactly
one capital letter followed by _exactly zero or one lowercase letter_,
with the exception of elements that start with Uu. I'm assuming here that
yours is a program for learning, since you admitted to writing it 'since
days' :). Taking these facts into account will make your regex more robust.

;-) Actually it's not an exercise. The Perl script should format database
files for my C program, which deals with CT scanners. On the other hand,
I am indeed learning Perl through writing this. I could also have written
it in C, but I chose Perl to refresh my memory of regexps.
You might also want to consider radicals (such as hydroxyl, -OH),
because they are sure to lead to incorrect results if you just ignore
parentheses: for instance, Fe(OH)3. You can handle this by first
capturing each parenthesized group together with the number that follows
it, then applying the same simple rules that you used for the
no-parentheses case to the token inside each pair of parentheses.
Something along the lines of

my @atoms = /((?:\(.+\)|Uu.|[A-Z][a-z]?)\d*)/g;

would work here.

Thanx for the advice, I hadn't thought about this one. However, it might
not be a serious problem for me. We have limited the input to materials
containing only the first 100 elements of the periodic table. More
importantly, I define the format rules for the input files. I'll note
in the README that such formats are forbidden :).
Since Abigail's post clearly gave you almost everything you need to
know, it would be quite straightforward to implement these simple
changes. Good luck! :)

Hope this helps,

Thanx a lot

Cheers
Luotao Fu
 
Ted Zlatanov

I have been writing a Perl program for several days. It contains a small
routine which should parse a chemical formula and return the name and
count of each atom
in the material as an array (or a hash) ....
$molecule contains the formula (e.g. H2O, FeCl3 or CaCl). The first
letter of every element is written in upper case. As you can see, I
first split $molecule on upper-case letters, which means FeCl3 turns
into {F,e,C,l3}. Then I scan the resulting list, which is stored in
the array @literals, for capital letters. Every capital letter is
pushed onto a temporary array. If the following item in the array is
not upper case, which means that the name of the atom contains more
than one letter, it is pushed onto the same temporary array, which is
later joined and put into the output array. The final result for the
formula H2O should be {H2,O}, for FeCl3 {Fe,Cl3} and so on.

I think you are not doing this correctly.

You are not parsing random letters, you are parsing chemical
elements' names in sequence. So don't just say "split on a letter."
Build a dictionary of element names (it's a finite list, although you
can anticipate new elements may need to be added at the end).
Something like this:

my %elements = ( H  => { number => 1, extra => 'data you need' },
                 He => { number => 2, ... },
                 ...
               );

Then, build your regular expression to match elements from your
%elements hash.
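
A minimal sketch of that idea, under some assumptions: the table below is a tiny illustrative subset, and parse_formula is a made-up name. The alternation is built from the hash keys with the longest symbols first, so that Cl is tried before C.

```perl
use strict;
use warnings;

# Deliberately tiny table for illustration; a real one would list
# every element the program accepts.
my %elements = (
    H  => { number => 1  },
    C  => { number => 6  },
    O  => { number => 8  },
    Cl => { number => 17 },
    Ca => { number => 20 },
    Fe => { number => 26 },
);

# Longest symbols first, so that "Cl" is tried before "C".
my $symbols = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } keys %elements;
my $element_re = qr/($symbols)(\d*)/;

# Returns a list of [symbol, count] pairs, or dies on unknown text.
sub parse_formula {
    my ($formula) = @_;
    my @parts;
    # \G anchors each match where the previous one ended; /c keeps the
    # position when a match fails, so leftovers can be reported.
    while ($formula =~ /\G$element_re/gc) {
        push @parts, [ $1, (length($2) ? $2 : 1) ];
    }
    my $pos = pos($formula);
    $pos = 0 unless defined $pos;
    die "Cannot parse '$formula' at offset $pos\n"
        if $pos != length($formula);
    return @parts;
}

print join(' ', map { "$_->[0]:$_->[1]" } parse_formula('CaCl2')), "\n";  # Ca:1 Cl:2
```

Because every match has to start where the last one stopped, a symbol that is missing from the table makes the parse fail rather than being skipped silently.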

This may be a LOT easier with the Parse::RecDescent module, which I
think is the right tool for this task. It can parse formulas like the
ones you describe, as long as you write a suitable grammar (you can
write a rule that will match elements from the %elements hash). It
will generate a suitable parse tree for you, which will be a lot more
functional than your {H2,O} format. For more information and help,
mail the recdescent list at (e-mail address removed) after you've read the
Parse::RecDescent documentation :)

Ted
 
John Bokma

Ted said:
I think you are not doing this correctly.

You are not parsing random letters, you are parsing chemical
elements' names in sequence. So don't just say "split on a letter."
Build a dictionary of element names (it's a finite list, although you
can anticipate new elements may need to be added at the end).
Something like this:

my %elements = ( H  => { number => 1, extra => 'data you need' },
                 He => { number => 2, ... },
                 ...
               );

Then, build your regular expression to match elements from your
%elements hash.

If you can assume that only valid formulas are given to the program,
[A-Z][a-z]?\d* sounds sufficient to me.

If you really want to check validity, you can capture the [A-Z][a-z]?
part and look it up in a hash. Moreover, if some letters are not
possible (for example x), you could remove them from the character class
(making the program harder to read, I guess).
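
A rough sketch of the capture-and-look-up approach. The %known hash holds only an illustrative handful of symbols, and unknown_symbols is a made-up name.

```perl
use strict;
use warnings;

# Illustrative subset only; fill in whatever symbols the input may use.
my %known = map { $_ => 1 } qw(H C N O Na S Cl Ca Fe);

# Returns the symbols found in a formula that are not in %known.
sub unknown_symbols {
    my ($formula) = @_;
    my @bad = grep { !$known{$_} } $formula =~ /([A-Z][a-z]?)\d*/g;
    return @bad;
}

for my $formula (qw(H2SO4 FeCl3 NaXe2)) {
    my @bad = unknown_symbols($formula);
    print "$formula: ", (@bad ? "unknown @bad" : "ok"), "\n";
}
```

This only checks the symbols that the pattern finds. Stray characters between matches (lowercase junk, punctuation) go unnoticed, so it is a validity check on element names, not on the whole input line.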
will generate a suitable parse tree for you, which will be a lot more
functional than your {H2,O} format.

One lesson I learned the hard way: never make your program more
functional than the requirements. I.e. if you need cat, don't write
OpenOffice :-D.
 
GreenLeaf

Luotao said:
Thanx for the advice, I hadn't thought about this one. However, it might
not be a serious problem for me. We have limited the input to materials
containing only the first 100 elements of the periodic table. More
importantly, I define the format rules for the input files.

Last time I checked, Fe, O and H were all below 100. :o)

However, since this is a real program as you said, it _may be_ better to
handle the parentheses, because if you do not, somebody else will have
to rewrite Fe(OH)3 as FeO3H3, or you will be limiting the usefulness of
your program. Be nice and do them a favor, since it does not need _too
much_ additional work on your side. A couple more lines make it able to
handle stuff like Fe2(SO4)3, as you can see in the sub processToken() ;)

I agree with John's idea though: _no need to bother_ if you will _never_
get such formulae in the first place, and KISS.


use strict;
use warnings;

while (<DATA>) {
    my @atoms = /((?:\(.+\)|[A-Z][a-z]?)\d*)/g;
    my %total; # total count of each element
    foreach (@atoms) {
        my %stuff = processToken($_);
        while (my ($element, $count) = each %stuff) {
            $total{$element} += $count;
        }
    }
    # here, you have all elements with their respective counts.
    while (my ($element, $count) = each %total) {
        print "$element$count\n";
    }
}

sub processToken {
    my $token = shift;
    if ($token =~ /\(/) { # we have groups
        my ($elempart, $numpart) = $token =~ m/\((\w+)\)(\d*)/;
        my %grpcounts = processToken($elempart);
        $grpcounts{$_} *= ($numpart ? $numpart : 1)
            foreach (keys %grpcounts);
        return %grpcounts;
    } else {
        my @atoms = split /(?=[A-Z][a-z]?[0-9]*)/ => $token;
        my %atomcounts;
        foreach (@atoms) {
            my ($element, $count) = /([A-Za-z]+)(\d*)/;
            $atomcounts{$element} += $count ? $count : 1;
        }
        return %atomcounts;
    }
}

__DATA__
H2O
FeCl3
NaOH
Fe(OH)3
Fe2(SO4)3
 
