Correct use of Unicode in RegExp

mike blamires

I am having great difficulty using Unicode characters in a regular
expression. I am trying to match extended Unicode characters.

I want to split a large dump file (containing only JPEGs). I have used
a hex editor to manually extract one file, just to show it can be done, so I
know the input is intact.

Each JPEG starts with the Unicode characters \u00FF \u00D8 \u00FF \u00E1,
and there are plenty of these to be found within the file.

open(DUMPFILE, "/pathtodumpfile");
my $line;
while(<DUMPFILE>) {
    $line = $line.$_;
}
@files = split(/\x{00FF}\x{00D8}\x{00FF}\x{00E1}/, $line);

(As you may see from the above style, I am relatively new to the
Perl side of programming ;)

I have tried inserting the Unicode characters in various ways (\xFF, \x{FF},
etc.), but it just doesn't seem to find the pattern. I am at a bit of a loss as
to whether the problem is my regexp, my use of Unicode characters,
or my use of extended Unicode characters.

Many thanks for your help.

cheers
Mike
 
mike blamires

Apologies, I posted to the wrong newsgroup the first time round. Please see above.
cheers
Mike
 
Daniel N. Andersen

First of all, I've never worked with Unicode characters.

I see you've tried \xFF and \x{FF} without success. Have you tried
\\xFF and \\x\{FF\} instead (notice the '\' before every character
that isn't alphabetic or numeric)?

Good luck,
DNA
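
For anyone landing on this thread: one possible approach, offered as a minimal sketch and not as a confirmed fix for the original poster's setup, is to treat the dump as raw bytes instead of Unicode text. The sketch assumes the marker bytes quoted in the question (FF D8 FF E1); the names split_jpegs and slurp_bytes are hypothetical helpers made up for illustration. It reads the file with binmode so nothing is decoded or newline-translated, and splits on a lookahead so each chunk keeps its own start marker.

```perl
use strict;
use warnings;

# Start marker from the question, written as raw bytes.
my $MARKER = "\xFF\xD8\xFF\xE1";

# Split a byte string into chunks that each begin with the marker.
# The zero-width lookahead splits *before* each marker, so the marker
# stays attached to the chunk it introduces.
sub split_jpegs {
    my ($data) = @_;
    my @chunks = split /(?=\xFF\xD8\xFF\xE1)/, $data;
    # Drop anything that precedes the first marker.
    return grep { index($_, $MARKER) == 0 } @chunks;
}

# Read a whole file as bytes (hypothetical helper, not called below).
sub slurp_bytes {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # raw bytes: no decoding, no newline translation
    local $/;        # slurp mode: read everything in one go
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Small in-memory demonstration instead of a real dump file.
my $sample = "junk" . $MARKER . "first" . $MARKER . "second";
my @files  = split_jpegs($sample);
printf "%d chunks\n", scalar @files;    # prints "2 chunks"
```

With a real dump, something like `my @files = split_jpegs(slurp_bytes($path));` would be the usage, where $path is whatever the dump file is called. As a side note on the escaping suggestion above: inside a Perl pattern, \\xFF matches a literal backslash followed by the text "xFF", not the byte 0xFF, so the single-backslash forms \xFF and \x{FF} are the ones that refer to the byte value.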
 
