Newbie Unicode question

Scottie

How do I make a Unicode Perl script that uses:
perl zapotec.pl zapotecUnicode.txt > asdf.txt
where "zapotecUnicode.txt" is a UTF-8 file?

In the zapotec.pl I have:
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
use encoding "latin2";
at the very top.

Any help would be appreciated.

Scott
 
Ben Morrow

> How do I make a Unicode Perl script that uses:
> perl zapotec.pl zapotecUnicode.txt > asdf.txt
> where "zapotecUnicode.txt" is a UTF-8 file?
>
> In the zapotec.pl I have:
> binmode(STDOUT, ":utf8");
> binmode(STDIN, ":utf8");
> use encoding "latin2";

Why? Is your source in latin2?

> at the very top.
>
> Any help would be appreciated.

Err... what does your script do, and in what way is it not working?

Ben
 
Scottie

Ben,
> Why? Is your source in latin2?

I'm sorry. The 3rd line is:
use encoding "latin1";

> Err... what does your script do, and in what way is it not working?

I started with GAWK and used a2p to change it to Perl. I think I know
that the @Fld line isn't allowing it to be Unicode. I have hunted
through the Perl docs concerning my problem and I haven't come up with
an answer. What do you think?

# Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
# Scott Starker

binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
use encoding "latin1";

# ${^WIDE_SYSTEM_CALLS} = 1;
$[ = 1; # set array base to 1
$, = " "; # set output field separator
$\ = "\n"; # set output record separator

$AlreadyGN = 0;
$notes = 0;
$gnsgnFirstLine = 0;
$anyline = 0;
$position = 0;
$lxline = '';
$mldef = '';
$seline = '';
$line = '';
$beg = '';
$end = '';

# This program takes out the "lx"'s that are alone on the line ("\k").
while (<>) {
chomp; # strip record separator
@Fld = split("\x{0020}", $_, 9999); # " "
print "\x{002a}";
# if ($Fld[1] eq " \\ l x") {
# if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
print "\x{002a}\x{002a}";
$s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
# Make "tone" un-bolded
$Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
$Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
$position = index($Fld[$#Fld], "\x{005d}");
$lxline = $_;
..
..
..

Scott
 
Ben Morrow

> Ben,
>
> I'm sorry. The 3rd line is:
> use encoding "latin1";
>
> I started with GAWK and used a2p to change it to Perl. I think I know
> that the @Fld line isn't allowing it to be Unicode. I have hunted
> through the Perl docs concerning my problem and I haven't come up with
> an answer. What do you think?
>
> # Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
> # Scott Starker
>
> binmode(STDOUT, ":utf8");
> binmode(STDIN, ":utf8");
> use encoding "latin1";

This is unnecessary because a. latin1 is the default anyway and b. your
source is all ASCII.

> # ${^WIDE_SYSTEM_CALLS} = 1;
> $[ = 1; # set array base to 1

Aaarg... run away... $[ is highly deprecated and double-plus-ungood.
Yes, I know it's not your code :).

> $, = " "; # set output field separator
> $\ = "\n"; # set output record separator
>
> $AlreadyGN = 0;
> $notes = 0;
> $gnsgnFirstLine = 0;
> $anyline = 0;
> $position = 0;
> $lxline = '';
> $mldef = '';
> $seline = '';
> $line = '';
> $beg = '';
> $end = '';
>
> # This program takes out the "lx"'s that are alone on the line ("\k").
> while (<>) {
> chomp; # strip record separator
> @Fld = split("\x{0020}", $_, 9999); # " "
> print "\x{002a}";
> # if ($Fld[1] eq " \\ l x") {
> # if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
> if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
> print "\x{002a}\x{002a}";
> $s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
> # Make "tone" un-bolded
> $Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
> s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
> s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
> s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
> $Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
> $position = index($Fld[$#Fld], "\x{005d}");
> $lxline = $_;

Right, let's attempt to translate that into Perl... (untested)

#!/usr/bin/perl

use strict;
use warnings;

$, = " ";
$\ = "\n";

binmode STDIN, ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';
# this is better as you get fallback if the input is invalid

my $ced = "\xb8";

while (<>) {
chomp;
my ($a, $b, $c) = split " ";
if ($a eq '\\\lx') { # this comes out as two \
print '**';
s/-/^~/g;
$b = "|b$b";
s/\[/|r[/g;
s/]/]|b/g;
s/]\|b$ced /]$ced |b/g; # move the cedilla and space back in front of "|b"

....etc. (Bog, that code's making my eyes hurt!) You can carry on, and
finish it (what you posted wasn't complete, right?).
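
One caveat on the binmode calls above: binmode on STDIN only affects standard input. When a file name is given on the command line, as in "perl zapotec.pl zapotecUnicode.txt", <> reads the file through the ARGV handle instead, so the layer on STDIN never sees the data. A minimal sketch of decoding on the handle that actually reads the file follows. The temporary file and the sample word are made up for illustration, and an explicit open is only one way to do this, not necessarily what the original script should use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 test file (the word is only sample data).
my ($tmp, $path) = tempfile();
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "\\lx be\x{00e9}\n";    # U+00E9 takes two bytes in UTF-8
close $tmp;

# Decode while reading: the layer goes on the handle that reads the
# file, not on STDIN.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $decoded = <$in>;
close $in;
chomp $decoded;

# The same file read as raw bytes, for comparison.
open my $raw, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = <$raw>;
close $raw;
chomp $bytes;

binmode STDOUT, ':encoding(UTF-8)';
print length($decoded), " characters, ", length($bytes), " bytes\n";
unlink $path;
```

The decoded string is one unit shorter than the raw one because the accented letter is a single character but two bytes; regexes and index() only count characters correctly on the decoded form.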

Now, I can't really see what this is supposed to do, so what do you want
it to do, and what is it in fact doing?

Ben
 
Scottie

Ben,
> ... (what you posted wasn't complete, right?).

It wasn't nearly all of it!

> Now, I can't really see what this is supposed to do, so what do you want
> it to do, and what is it in fact doing?

Well, zapotecUnicode.txt is a file that contains a "dictionary" of
Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
definitions. It's almost a database type of thing. The program is
called Shoebox. There are different lines for each record. Each record
starts with "\lx" (lexicon). Then the definition(s) (\gn) follow.
There might be at least one subentry (\se) along with its definition(s)
(\sgn). There are more fields than these. (The Perl line print "**";
was for testing purposes.) So I need a @Fld = split(" ", $_, 9999);
that turns a line like this into an array. Can you help me out? I need
to know how to get the line into @Fld.

Scott
 
Ben Morrow

> Well, zapotecUnicode.txt is a file that contains a "dictionary" of
> Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
> definitions. It's almost a database type of thing. The program is
> called Shoebox. There are different lines for each record. Each record
> starts with "\lx" (lexicon). Then the definition(s) (\gn) follow.
> There might be at least one subentry (\se) along with its definition(s)
> (\sgn). There are more fields than these. (The Perl line print "**";
> was for testing purposes.) So I need a @Fld = split(" ", $_, 9999);
> that turns a line like this into an array. Can you help me out? I need
> to know how to get the line into @Fld.

Well, that's easy:

my @F = split ' ';

if the records on each line are space-separated. Alternatively,

my @F = split /\\/;

may work better, as it will split the line on the backslashes. There are
two 'unfortunately's here: firstly, you'll get an initial empty field,
before the first backslash; secondly, the actual backslashes themselves
will be removed, so you'll have to remember to put them back in.
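
To make the difference between the two splits concrete, here is a small sketch. The sample line is invented for illustration; real Shoebox data may be laid out differently.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the real dictionary file.
my $line = '\lx beh \gn mariposa';

# Split on whitespace: the markers keep their backslashes.
my @by_space = split ' ', $line;
# ('\lx', 'beh', '\gn', 'mariposa')

# Split on backslashes: an empty leading field appears and the
# backslashes themselves are consumed.
my @by_slash = split /\\/, $line;
# ('', 'lx beh ', 'gn mariposa')

# Drop the empty first field, then put the backslash back when printing.
shift @by_slash if @by_slash && $by_slash[0] eq '';
print "\\$_\n" for @by_slash;
```

Splitting on whitespace keeps each marker intact but also breaks multi-word definitions apart; splitting on backslashes keeps each field's text together at the cost of the two inconveniences just described.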

It's probably easiest if you then iterate over the fields, and do
whatever you need to based on the field type:

#!/usr/bin/perl -lanF\\
# see perldoc perlrun for the above: it automagically iterates over all
# lines and splits them into @F

BEGIN {
$\ = '';
binmode STDIN, ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';
}

for (@F) {
/^lx/ and next;

/^gn/ and do {
s/\xB8/|b/; # or whatever it is you want to do
next;
};
}
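continue {
# this makes sure each entry gets printed, with its backslash,
# when you're done with it. The empty field before the first
# backslash is skipped.

print '\\' . $_ if length;
}

# -l chomped the newline and $\ was cleared in BEGIN, so end the line here
print "\n";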
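
If each field sits on its own line, as the description of the Shoebox file suggests, another option is to group lines into records before transforming them. The following is only a sketch under that assumption: the sample entries are made up, and the names @lines and @records are hypothetical.

```perl
use strict;
use warnings;

# Made-up sample lines, not real dictionary entries. The layout (one
# "\marker value" pair per line) is an assumption based on the
# description in this thread.
my @lines = (
    '\lx beh',
    '\gn mariposa',
    '\se beh yaas',
    '\sgn mariposa negra',
    '\lx gueela',
    '\gn noche',
);

my @records;
for my $line (@lines) {
    # Capture the marker (without its backslash) and the rest of the line;
    # lines that do not start with a marker are skipped.
    my ($marker, $value) = $line =~ /^\\(\w+)\s*(.*)$/ or next;

    # Every \lx line starts a new record.
    push @records, [] if $marker eq 'lx' || !@records;
    push @{ $records[-1] }, [ $marker, $value ];
}

# Print each record back out, one field per line, with a blank line
# between records.
for my $record (@records) {
    print "\\$_->[0] $_->[1]\n" for @$record;
    print "\n";
}
```

With the fields held as [marker, value] pairs, per-field edits (such as the substitutions on \gn lines) can be applied to the value alone, and the backslash is restored in one place when printing.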

Ben
 
