Roedy Green
I have been beating my head against a wall trying to get a regex to
work that spans several lines. The sales tax data for the cities in
Avoyelles Parish, Louisiana look like this:
<p class=MsoNormal align=center style='text-align:center'>Bunkie</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>9 %</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>5%</p>
</td>
</tr>
<tr style='mso-yfti-irow:2'>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center
style='text-align:center'>Hessmer</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>8%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
I want to find a regex that will match a pattern for each row so I can
extract the fields
Bunkie,9
and
Hessmer,8
Here is one of my many failed attempts:
(?m)center'>([a-zA-Z ]+)</p>$.+$.+$.+?center'>([0-9\.]+)([ %]+)</p>
This is the string passed to Pattern.compile after fudging to get it
past the command interpreter.
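A minimal sketch for reproducing this outside the utility, assuming the page text is held in one string with `\n` line ends. The class name `MultilineSketch`, the `extract` helper and the second pattern are illustrative and not part of the original utility. In `java.util.regex`, `(?m)` only changes where `^` and `$` match; dot still refuses to match a line terminator unless `(?s)` (DOTALL) is set, so `$.+` can never get past the end of a line. The second pattern is one possible variant that relies on `(?s)` and a reluctant `.*?`; it is not a tested fix for every parish page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineSketch {
    // The two sample rows from the post, joined with '\n' (an assumption about line ends).
    static final String SAMPLE = String.join("\n",
        "<p class=MsoNormal align=center style='text-align:center'>Bunkie</p>",
        "</td>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>9 %</p>",
        "</td>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>4%</p>",
        "</td>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>5%</p>",
        "</td>",
        "</tr>",
        "<tr style='mso-yfti-irow:2'>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center",
        "style='text-align:center'>Hessmer</p>",
        "</td>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>8%</p>",
        "</td>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>4%</p>",
        "</td>",
        "<td style='padding:.75pt .75pt .75pt .75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>4%</p>",
        "</td>");

    // The failed attempt, as a Java string literal. After '$' the next
    // character is a line terminator, which '.' will not match under (?m) alone.
    static final String FAILED =
        "(?m)center'>([a-zA-Z ]+)</p>$.+$.+$.+?center'>([0-9\\.]+)([ %]+)</p>";

    // Illustrative variant: (?s) lets '.' cross line ends, and the reluctant
    // '.*?' stops at the first following cell that starts with a number.
    static final String DOTALL_VARIANT =
        "(?s)center'>([a-zA-Z ]+)</p>.*?center'>([0-9.]+)[ %]+</p>";

    // Collects "group1,group2" for every match of regex in text.
    static List<String> extract(String regex, String text) {
        List<String> rows = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            rows.add(m.group(1) + "," + m.group(2));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(extract(FAILED, SAMPLE));
        System.out.println(extract(DOTALL_VARIANT, SAMPLE));
    }
}
```

On the sample rows the first pattern finds nothing, while the variant yields Bunkie,9 and Hessmer,8. One caveat with the variant: if a row had no numeric cell, the reluctant `.*?` would run on into the next row and pair the city with the wrong rate.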
Ideally I don't want to have to specify all the bubblegum. I would
like to skip over most of it with dot. You might recommend I use an
HTML parser instead of regex, but the HTML is rife with syntax
errors.
A bit of background on the problem. I am updating sales tax tables
for every county and city in the USA for the American Sales Tax
calculator. See http://mindprod.com/applet/americantax.html. Louisiana
makes gathering this data particularly difficult. They don't even have
a PDF document to describe the rules, much less something sane like
CSV format. They told me they hope to get organised some time this
October. They have privatised sales tax collecting. Each
parish (county) is handled by a different business or private
individual. Each parish has a web page with the rules, usually in
table form, but every one is different. I have downloaded all the web
pages and I am trying to develop a regex for each parish to extract
the raw data, one regex group for city and one for tax rate.
I wrote a utility that takes a regex and a file and extracts the
groups it finds to a CSV file for further processing. I have no
problem when all the data are on a single line. I think I must have
some misconception about how multiline regexes work.
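The utility itself is not shown here. As a rough sketch of the idea only, the following assumes the whole file is read into a single string before matching (matching one line at a time could never find a pattern that spans lines) and assumes UTF-8 input. The names `RegexToCsv` and `toCsv` and the three command-line arguments are hypothetical, and fields are written without CSV quoting.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexToCsv {
    // Returns one CSV line per match, one field per capturing group.
    // Fields are trimmed but not quoted, so embedded commas are not handled.
    static String toCsv(String regex, String text) {
        StringBuilder csv = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (g > 1) {
                    csv.append(',');
                }
                String field = m.group(g); // null if the group did not participate
                csv.append(field == null ? "" : field.trim());
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    // Hypothetical usage: java RegexToCsv <regex> <input.html> <output.csv>
    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: RegexToCsv <regex> <input> <output.csv>");
            return;
        }
        // Read the entire file as one string so a match can span several lines.
        String text = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[2]), toCsv(args[0], text).getBytes(StandardCharsets.UTF_8));
    }
}
```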