Correct use of Unicode in RegExp

mike blamires · Apr 22, 2004

I am having great difficulty using Unicode characters in a regular
expression; I am trying to match extended Unicode characters.

I want to split a large dump file (containing only JPEGs). I have used
a hex editor to manually extract one file, just to show it can be done, so I
know the input is intact.

Each JPEG starts with the Unicode characters \u00FF \u00D8 \u00FF \u00E1
and there are plenty of these to be found within the file.

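open(my $dumpfile, '<', "/pathtodumpfile") or die "Cannot open dump file: $!";
my $line = '';
while (<$dumpfile>) {
    $line .= $_;    # append each chunk read to one big buffer
}
close($dumpfile);
# split the buffer wherever the four-character JPEG marker occurs
my @files = split(/\x{00FF}\x{00D8}\x{00FF}\x{00E1}/, $line);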

(As you may see from the above style, I am relatively inexperienced on the
Perl side of programming.)

I have tried inserting the Unicode characters in various ways (\xFF, \x{FF},
etc.), but the pattern is never found. I am at a bit of a loss as to whether
the problem is my regexp, my use of Unicode characters, or my use of
extended Unicode characters.
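
A minimal sketch of one way to approach the split, under two assumptions that the post does not confirm: that the JPEG marker is the raw byte sequence FF D8 FF E1 rather than encoded Unicode text, and that the dump should be read with no I/O layer translating it. The helper names slurp_raw and split_jpegs are hypothetical. The lookahead is there because a plain split consumes the separator, which would strip the marker from every extracted file.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    # ':raw' disables newline translation and any encoding layer
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;               # slurp mode: read everything in one go
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Hypothetical helper: cut the data into chunks that each start with
# the byte sequence FF D8 FF E1.
sub split_jpegs {
    my ($data) = @_;
    # The zero-width lookahead splits *before* each marker, so the
    # marker stays at the front of its chunk instead of being removed.
    return grep { length } split /(?=\xFF\xD8\xFF\xE1)/, $data;
}

# Small in-memory demonstration with two fake "files".
my $demo   = "\xFF\xD8\xFF\xE1AAA" . "\xFF\xD8\xFF\xE1BBB";
my @chunks = split_jpegs($demo);
printf "%d chunks\n", scalar @chunks;   # prints "2 chunks"
```

To run it against the real dump, something like split_jpegs(slurp_raw("/pathtodumpfile")) should work, with each element then written out through a filehandle that is also in binmode. Any data before the first marker comes back as its own leading chunk.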

many thanks for your help.

cheers
Mike

mike blamires · Apr 23, 2004

Apologies, incorrect newsgroup first time round. Please see above.
cheers
Mike

Daniel N. Andersen · Apr 23, 2004

First of all, I've never worked with Unicode characters.

I see you've tried \xFF and \x{FF} without success. Have you tried
\\xFF and \\x\{FF\} instead (notice the '\' before every character
that isn't alphabetic or numeric)?
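
Whether the doubled backslash helps depends on what the pattern is meant to match. The short check below is only an illustrative sketch (the variable names are made up for it): inside a Perl pattern, \xFF stands for the single character with ordinal 255, while \\xFF stands for a literal backslash followed by the text xFF.

```perl
use strict;
use warnings;

my $byte    = "\xFF";   # one character, ordinal 255
my $literal = '\xFF';   # four characters: backslash, x, F, F

# Evaluate each match once and store the result as 1 or 0.
my $single_vs_byte    = ($byte    =~ /\xFF/)  ? 1 : 0;
my $double_vs_byte    = ($byte    =~ /\\xFF/) ? 1 : 0;
my $double_vs_literal = ($literal =~ /\\xFF/) ? 1 : 0;

print "/\\xFF/   matches the byte 0xFF:         $single_vs_byte\n";
print "/\\\\xFF/  matches the byte 0xFF:         $double_vs_byte\n";
print "/\\\\xFF/  matches the four-char text:    $double_vs_literal\n";
```

So the escaped form is the one to use when the file literally contains the text \xFF, and the unescaped form when it contains the byte itself.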

Good luck,
DNA
