Validating a UTF-8, I18N field in Struts using regular expressions

L

los

Hi,

I've been trying to find a regular expression that will validate a
UTF-8 field in Struts.

So far I have:

<constant-name>user_name</constant-name>
<constant-value>^([a-zA-Z0-9_\x81-\xFF])*$</constant-value>

and this works for some UTF-8 characters such as é, ã, ó, etc., but
it doesn't work for other symbols such as Æ, ß, G, g, etc.

I'm wondering if someone has come across this before and knows what I
need to do in order to validate these characters as well.

Thanks,

Los
 
R

Roedy Green

I've been trying to find a regular expression that will validate a
UTF-8 field in Struts.

UTF-8 is a byte encoding. Java regexes work on 16-bit Strings, not on raw bytes.

You want to validate that a group of UTF-8 bytes is a legitimate
encoding?

I think the tool you want is a finite state automaton, not a regex.

But first, is there any such thing as an invalid UTF-8 encoding
according to the standards people?

What are the rules for what you consider a valid encoding?

See http://mindprod.com/finitestateautomaton.html
 
O

Oliver Wong

Roedy Green said:
But first, is there any such thing as an invalid UTF-8 encoding
according to the standards people?

Yes. The following bitstreams ('?' means can be either 1 or 0) are
invalid:

10?????? (The most significant bit of a single character should always be 0)
110????? (Should be followed by a second byte)
110????? 0??????? (The second byte should have 10 as its two most
significant bits)
110????? 11?????? (Ditto)

And there are many other examples. The UTF-8 standard is described in
RFC3629: http://www.ietf.org/rfc/rfc3629.txt
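
    Those rules are simple enough to check mechanically. A rough structural
check in Java might look like this (a sketch only; it ignores finer points
such as overlong encodings and the surrogate range):

static boolean looksLikeUtf8( byte[] bytes )
{
    int pending = 0;   // continuation bytes still expected for the current character
    for ( int i = 0; i < bytes.length; i++ )
    {
        int u = bytes[ i ] & 0xFF;
        if ( pending > 0 )
        {
            if ( ( u & 0xC0 ) != 0x80 ) return false;   // must be 10??????
            pending--;
        }
        else if ( u < 0x80 )             { /* 0???????: single-byte character */ }
        else if ( ( u & 0xE0 ) == 0xC0 ) pending = 1;   // 110?????: expect 1 more byte
        else if ( ( u & 0xF0 ) == 0xE0 ) pending = 2;   // 1110????: expect 2 more bytes
        else if ( ( u & 0xF8 ) == 0xF0 ) pending = 3;   // 11110???: expect 3 more bytes
        else return false;               // stray 10?????? or invalid lead byte
    }
    return pending == 0;   // data must not end in the middle of a sequence
}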


Incidentally, Java doesn't actually use the official UTF-8 standard.
Rather, they use what they call "Modified UTF-8" (see
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8 )

The main difference between standard UTF-8 and Java's UTF-8 is how the
null character is encoded (00000000 versus 11000000 10000000 respectively),
and how characters outside the Basic Multilingual Plane (BMP) are encoded.
Basically, Unicode characters range from 0x0 to 0x10FFFF (in hexadecimal),
which is more characters than can be represented by only 16 bits (the size
of Java's 'char' primitive type). As such, Sun essentially had to "hack" in
support for the higher numbered characters, encoding them as surrogate
pairs.

This isn't an issue for most developers, because characters outside the
BMP are rarely used. They contain characters for "dead" languages (e.g.
Shavian, Ugaritic, Deseret, Kharosthi, etc.), and for musical notation
(including ancient Greek musical notation). But in case you were wondering
what all that "char charAt()" versus "int codePointAt()" nonsense is all
about, now you know.
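
    For instance, here's what that looks like with an arbitrary character
from outside the BMP (a quick sketch):

// MUSICAL SYMBOL G CLEF, U+1D11E, lies outside the BMP.
String s = new String( Character.toChars( 0x1D11E ) );
System.out.println( s.length() );                                // 2 -- stored as a surrogate pair
System.out.println( s.codePointCount( 0, s.length() ) );         // 1 -- but only one code point
System.out.println( Integer.toHexString( s.codePointAt( 0 ) ) ); // 1d11e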

If the OP is actually analyzing bitstreams on the bit level and needs to
validate the stream as UTF-8 or otherwise, then these differences may become
important. Specifically, is the OP intending to use standard UTF-8, or
Java's UTF-8?

- Oliver
 
O

Oliver Wong

Oliver Wong said:
This isn't an issue for most developers, because characters outside
the BMP are rarely used. They contain characters for "dead" languages
(e.g. Shavian, Ugaritic, Deseret, Kharosthi, etc.), and for musical
notation (including ancient Greek musical notation).

In retrospect, I shouldn't have generalized like this. If you're
developing software for archeologists or other people in similar fields,
they may be very interested in whether or not your application supports
these "dead languages".

- Oliver
 
L

los

In the software I've written, I've set up Struts to use a UTF-8 filter
which does the correct thing. It reads the values from the form
fields, and at the Java code level the values contain the special
characters. It's all explained on this page:
http://www.javaworld.com/javaworld/jw-05-2004/jw-0524-i18n.html.

My only issue is one form where I want to allow the inserted values
to contain special characters from other languages, but not symbols such
as (, <, +, }, etc. Finding a regular expression that handles these
values is proving quite hard. I found some information on this
page:

http://www-128.ibm.com/developerworks/library/j-i18n.html

and tried changing my regex to: ^([a-zA-Z0-9_\u0081-\u1FFF])*$ and
^([a-zA-Z0-9_\x0081-\x1FFF])*$ in hopes that if I extended the range of
unicode chars it would work. But it still doesn't.
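
A candidate mask can also be tried out in plain Java, outside of Struts, to
see whether the regex itself is at fault -- a sketch, using the Unicode
letter category \p{L} rather than an explicit range:

// Pattern is java.util.regex.Pattern; \p{L} matches any Unicode letter.
Pattern p = Pattern.compile( "^[\\p{L}0-9_]*$" );
System.out.println( p.matcher( "José" ).matches() );        // true  -- accented letters pass
System.out.println( p.matcher( "user<name>" ).matches() );  // false -- symbols are rejected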

I believe that these characters are UTF-8 "valid" because they're read
by the UTF-8 filter used in Struts, and at the Java level the values
read from the forms contain the correct values. I'm just not sure if
there are limitations in regular expressions that limit the characters
that can be parsed.

-Los
 
O

Oliver Wong

It turns out I've completely misinterpreted what you were trying to do.
Oh well, here goes a second attempt...

los said:
In the software I've written, I've set up Struts to use a UTF-8 filter
which does the correct thing. It reads the values from the form
fields, and at the Java code level the values contain the special
characters. It's all explained on this page:
http://www.javaworld.com/javaworld/jw-05-2004/jw-0524-i18n.html.

My only issue is one form where I want to allow the inserted values
to contain special characters from other languages, but not symbols such
as (, <, +, }, etc. Finding a regular expression that handles these
values is proving quite hard. I found some information on this
page:

http://www-128.ibm.com/developerworks/library/j-i18n.html

and tried changing my regex to: ^([a-zA-Z0-9_\u0081-\u1FFF])*$ and
^([a-zA-Z0-9_\x0081-\x1FFF])*$ in hopes that if I extended the range of
unicode chars it would work. But it still doesn't.

So basically, you have a web application in which you're accepting a
username, but you don't want the username to contain certain characters. So
you want to do this checking via regular expressions.

Did you try: "^([\p{javaLowerCase}\p{javaUpperCase])+$"? Note that I
changed your * to + because I assume you want the username to consist of at
least 1 character. If you want the username to be at least 4 characters
long, for example, you might want:
"^([\p{javaLowerCase}\p{javaUpperCase]){4,}$"

- Oliver
 
L

los

Oliver said:
It turns out I've completely misinterpreted what you were trying to do.
Oh well, here goes a second attempt...
:) It happens to me all the time.
Did you try: "^([\p{javaLowerCase}\p{javaUpperCase])+$"? Note that I
changed your * to + because I assume you want the username to consist of at
least 1 character. If you want the username to be at least 4 characters
long, for example, you might want:
"^([\p{javaLowerCase}\p{javaUpperCase]){4,}$"

Even after fixing a small typo you had in closing the "}" for
javaUpperCase, I tried what you suggested but the program didn't like
it :(

I think it might have something to do with the fact that Struts reads
this regex from an XML file and dumps it as JavaScript into the
web page. I think it's JavaScript code that validates the string in the
form, in which case javaUpperCase/javaLowerCase might not be recognized
by JavaScript?!

My feeling is that it needs to be something similar to what I had
before.

-Los
 
O

Oliver Wong

los said:
Did you try: "^([\p{javaLowerCase}\p{javaUpperCase])+$"? Note that I
changed your * to + because I assume you want the username to consist of
at
least 1 character. If you want the username to be at least 4 characters
long, for example, you might want:
"^([\p{javaLowerCase}\p{javaUpperCase]){4,}$"

Even after fixing a small typo you had in closing the "}" for
javaUpperCase, I tried what you suggested but the program didn't like
it :(

I think it might have something to do with the fact that Struts reads
this regex from an XML file and dumps it as JavaScript into the
web page. I think it's JavaScript code that validates the string in the
form, in which case javaUpperCase/javaLowerCase might not be recognized
by JavaScript?!

My feeling is that it needs to be something similar to what I had
before.

Okay, I was assuming you were using Java's RegExp facilities. If you're
using JavaScript, then you might want to post the question on a JavaScript
newsgroup (e.g. comp.lang.javascript).

JavaScript and Java are completely different things.
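
    For the record, if the check does end up server-side in Java, the
suggestion above (with the missing "}" restored) can be exercised roughly
like this -- a sketch only:

// Pattern is java.util.regex.Pattern; \p{javaUpperCase}/\p{javaLowerCase}
// delegate to Character.isUpperCase()/isLowerCase(), so accented letters pass.
Pattern p = Pattern.compile( "^([\\p{javaLowerCase}\\p{javaUpperCase}]){4,}$" );
System.out.println( p.matcher( "José" ).matches() );   // true
System.out.println( p.matcher( "Jo<é" ).matches() );   // false -- '<' is rejected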

- Oliver
 
R

Roedy Green

Oliver Wong said:
As such, Sun essentially had to "hack" in
support for the higher numbered characters, encoding them as surrogate
pairs.

Real Unicode uses some 3- or 4-byte format that more or less directly
encodes the character, whereas Java uses two 16-bit chars in a magic range,
where each char piggybacks some of the bits of the real character??
 
R

Roedy Green

Oliver Wong said:
In retrospect, I shouldn't have generalized like this. If you're
developing software for archeologists or other people in similar fields,
they may be very interested in whether or not your application supports
these "dead languages".

there is some fun stuff up there -- not likely to be supported widely
though.

0x1D400 is the start of the Mathematical alphabets, needed to typeset any
university math textbook.

There is a fair bit of Chinese, hardly a dead language. (I have
sometimes considered using Chinese for icons. The only catch is, I
might pick a word with meaning quite different from what I think it
should mean iconically.)

See http://mindprod.com/jgloss/unicode.html

Check out Ugaritic -- cuneiform.
 
R

Roedy Green

Here is a very fast way to solve your problem.

You use a java.util.BitSet. You initialise it so that characters you
like have a 1 bit and characters you don't have a 0. It requires a
64K bitset, i.e. 8k bytes.

Now you check a string like this:

// b is the initialised BitSet, s is the String being checked.
boolean acceptable = true;
for ( int i = 0; i < s.length(); i++ )
{
    if ( !b.get( s.charAt( i ) ) )
    {
        acceptable = false;
        break;
    }
}
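
For completeness, the initialisation (done once, up front) might look
something like this. This is just a sketch -- the accept-rule here
(letters, digits and underscore) is an example; substitute whatever rule
you actually need:

// b is the java.util.BitSet used above; 0x10000 bits = 64K bits = 8K bytes.
BitSet b = new BitSet( 0x10000 );
for ( int c = 0; c <= 0xFFFF; c++ )
{
    if ( Character.isLetterOrDigit( (char) c ) || c == '_' )
    {
        b.set( c );
    }
}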
 
L

los

That's an interesting approach, Roedy, but I'd prefer something cleaner,
like the regular expression that is already in place. I've posted a
similar question on the JavaScript group to see if someone has an
explanation for this. If I don't find what I'm looking for, I think
I'll have to rely on Java code (like the one you posted) to solve this
problem.

Thanks,

-Los
 
R

Roedy Green

los said:
But I'd prefer something cleaner

Conceptually, I consider the BitSet cleaner than a regex. There is
always some doubt precisely what any given regex will do. The BitSet
approach allows an arbitrarily complex rule with no decrease in speed
of the actual verification, unlike a regex. The BitSet will be
considerably faster than a regex.

The problem with a regex is it may solve your problem, but if you need
something slightly different, you can be sent back to square one if
there is not some canned bit of magic in the regex parser to do what
you want.

A regex is a bigger hammer than you need. It is for detecting patterns
of characters, not simply classifying characters.
 
C

Chris Uppal

Oliver said:
Incidentally, Java doesn't actually use the official UTF-8 standard.
Rather, they use what they call "Modified UTF-8" (see
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8
)

This might be misleading. It is certainly true (and entirely
reprehensible) that some of Java's APIs use the name UTF-8 where they don't
actually use UTF-8 (ObjectOutputStream being probably the worst offender).
However, the Charset and/or Charset{En/De}coder for UTF-8 does read/write
real UTF-8.
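
For instance, a decoder configured to REPORT malformed input rejects bad
byte sequences outright -- a rough sketch, where suspectBytes stands in for
whatever bytes you want to test:

// Uses java.nio.charset.*; CharacterCodingException signals malformed UTF-8.
CharsetDecoder strict = Charset.forName( "UTF-8" ).newDecoder()
        .onMalformedInput( CodingErrorAction.REPORT )
        .onUnmappableCharacter( CodingErrorAction.REPORT );
try
{
    strict.decode( java.nio.ByteBuffer.wrap( suspectBytes ) );
    System.out.println( "well-formed UTF-8" );
}
catch ( CharacterCodingException e )
{
    System.out.println( "not valid UTF-8" );
}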

-- chris
 
O

Oliver Wong

Roedy Green said:
there is some fun stuff up there -- not likely to be supported widely
though.

0x1D400 is the start of the Mathematical alphabets, needed to typeset any
university math textbook.

"Needed" is a bit of a strong word here, particular for (dead-tree)
textbook environments. Most of the characters in that region are the latin
or greek characters in varied font (e.g. "a" written in cursive, or in "a"
italics) and could be "simulated" by using the latin character "a" (Unicode
0x0061 in Hex) with font styles applied via a word processor.

It might make more sense to use these characters in an e-book where,
when doing a "Search" for a particular character, the mathematical 'a written
in italics' character has completely different semantics from an 'a written
in cursive', so that the search returns results from one set of occurrences,
and not the other.

Roedy Green said:
There is a fair bit of Chinese, hardly a dead language. (I have
sometimes considered using Chinese for icons. The only catch is, I
might pick a word with meaning quite different from what I think it
should mean iconically.)

Most of the Asian characters actually in common use by the Chinese and
Japanese (don't know about other Asian languages) are in the region 0x4E00
to 0x9FBF, which is in the BMP (Basic Multilingual Plane). The region
0x20000 to 0x2A6D6 contains "ancient" characters which are pretty much
never used, except perhaps in family names for very old families, or by
anthropologists interested in ancient Asian culture.

If I had to design a unicode compliant font, and I was "on a budget", in
addition to the BMP (which I think *ALL* fonts should support at a minimum),
I'd consider throwing in the musical notation characters in 0x1D100 to
0x1D1DD. All the other regions are too esoteric (read: I'm too lazy) to
bother implementing.

- Oliver
 
O

Oliver Wong

Roedy Green said:
Real Unicode uses some 3- or 4-byte format that more or less directly
encodes the character, whereas Java uses two 16-bit chars in a magic range,
where each char piggybacks some of the bits of the real character??

There's a lot of confusion about this issue that stems from the fact
that Unicode is only a mapping from numbers to characters. It does not, for
example, explain how to map bytes or bitstreams into characters. So in
addition to Unicode, you need an encoding, such as UTF-8 or UTF-16. UTF-8
takes a bitstream and maps it into a sequence of numbers. Then you use
Unicode to take that sequence of numbers and map them onto a sequence of
characters.

I think part of the misunderstanding happens because in ASCII, both the
mapping from bits to numbers and the mapping from numbers to characters are
glossed over into a direct mapping from bits into characters.

The biggest number which has a defined mapping onto a character in
Unicode is 0x10FFFF (in hex). One encoding system might be to always use 3
bytes for every character, so that the largest number you could represent
with this encoding is 0xFFFFFF, which is enough to represent all the defined
unicode characters and then some. But of course, this encoding system would
not be "backwards compatible" with ASCII.

UTF-8 is a variable-length encoding which is backwards compatible with
ASCII. This is one of the reasons why it's a very popular encoding; all
valid ASCII documents are also valid UTF-8 documents. For characters whose
Unicode numbers are in the range 0x000000 to 0x00007F, only 1 byte is
required to encode them. However, as a tradeoff, for some characters (those
whose Unicode numbers are in the range 0x010000 to 0x10FFFF) 4 bytes are
required to encode them.
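
    You can see the variable lengths from Java (a quick sketch;
getBytes("UTF-8") declares UnsupportedEncodingException, so wrap it
accordingly):

System.out.println( "A".getBytes( "UTF-8" ).length );        // 1 byte  (U+0041)
System.out.println( "\u00E9".getBytes( "UTF-8" ).length );   // 2 bytes (U+00E9, é)
System.out.println( "\u20AC".getBytes( "UTF-8" ).length );   // 3 bytes (U+20AC, the euro sign)
String gclef = new String( Character.toChars( 0x1D11E ) );   // U+1D11E, outside the BMP
System.out.println( gclef.getBytes( "UTF-8" ).length );      // 4 bytes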

UTF-16 is variable-length as well, and NOT backwards compatible with
ASCII. I'm not all that familiar with this encoding, so I may be mistaken
here, but apparently it uses 2 bytes to encode all the characters whose
Unicode numbers fall between 0x000000 and 0x00FFFF, and 4 bytes for the rest
(I don't see how this is possible, and couldn't find a reference
implementation).

Java uses UTF-16 internally. A Java "char" is 16 bits, so it handles
the first case (requiring 2 bytes) just fine. The problem is when you try to
represent a character that requires 4 bytes under UTF-16. Apparently what
happened was that the Unicode Standard "changed": previously the BMP (Basic
Multilingual Plane) was all of Unicode, but then Unicode added a few
more characters.

From the JavaDoc for java.lang.Character:

<quote>
The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which defined
characters as fixed-width 16-bit entities. The Unicode standard has since
been changed to allow for characters whose representation requires more than
16 bits. The range of legal code points is now U+0000 to U+10FFFF
</quote>

So Sun, based on the original Unicode specification, had assumed that 16
bits was enough, and then got screwed over.

For what it's worth, "real" UTF-8 can support arbitrarily many
characters (i.e. no matter how many characters the next version of Unicode
defines, UTF-8 will have a sequence of bytes to represent it), because it
first encodes the length of the byte representation in unary, and then
encodes the actual bytes in binary.

- Oliver
 
R

Roedy Green

"Needed" is a bit of a strong word here, particular for (dead-tree)
textbook environments. Most of the characters in that region are the latin
or greek characters in varied font (e.g. "a" written in cursive, or in "a"
italics) and could be "simulated" by using the latin character "a" (Unicode
0x0061 in Hex) with font styles applied via a word processor.

things may have changed since I studied math, but back then
mathematicians needed a new alphabet for every new "class".

Both in physics and math they almost never used a simple roman
alphabet variable name.

They both liked single-letter variable and constant names.
 
R

Roedy Green

Oliver Wong said:
For characters whose
Unicode numbers are in the range 0x000000 to 0x00007F, only 1 byte is
required to encode them.

how are chars in the range 0x80 to 0xff encoded?
 
O

Oliver Wong

Roedy Green said:
how are chars in the range 0x80 to 0xff encoded?

This is off the top of my head so the details may be a bit off, but
here's the general idea for UTF-8. First encode the number of bytes needed
to represent the character in unary. For the character 0xFF (or 11111111 in
binary), you'd need 2 bytes under UTF-8, so we encode 2 in unary:

110????? ????????

Here I've used ? to mean we haven't yet determined what the values of those
bits are.

Every supplementary byte starts with a header of '10', so our bitstream
now looks like:

110????? 10??????

We now have 11 bits free to actually encode our character. 0xFF, in a
bitstream of size 11 looks like: 00011111111, so we fill in the ?s with this
data:

11000011 10111111

The question is, given 0xFF, how could we have known ahead of time that
we would have needed 2 bytes to encode this in UTF-8? Well, you sort of have
to do the above steps in reverse order. Convert 0xFF into a bitstream
(11111111). Notice that for every supplementary byte, 2 bits are header and
6 bits are "actual" info, so we need 2 bytes (for 12 bits of "actual info"),
but then you can use some of the bits from the first byte if the unary
length encoding doesn't take up a full byte, so really what I describe is
just a heuristic.

I think in practice, they precompute this stuff and just hardcode a
table into the source code (i.e. 0x00 to 0x7F -> 1 byte, 0x80 to 0x7FF -> 2
bytes, etc.)
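
    If you want to double-check the worked example above from Java (same
caveat about UnsupportedEncodingException):

byte[] bytes = "\u00FF".getBytes( "UTF-8" );   // ÿ is U+00FF
for ( int i = 0; i < bytes.length; i++ )
{
    System.out.println( Integer.toBinaryString( bytes[ i ] & 0xFF ) );
}
// prints 11000011 then 10111111 -- the same two bytes derived by hand above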

- Oliver
 
R

Roedy Green

Oliver Wong said:
This is off the top of my head so the details may be a bit off, but
here's the general idea for UTF-8. First encode the number of bytes needed
to represent the character in unary. For the character 0xFF (or 11111111 in
binary), you'd need 2 bytes under UTF-8, so we encode 2 in unary:

110????? ????????

I think you are using the term "unary" in an unusual way. Could you
please define it?

I understand the unary number system as the simplest numeral system to
represent natural numbers: in order to represent a number N, an
arbitrarily chosen symbol is repeated N times. For example, using the
symbol "|" (a tally mark), the number 6 is represented as "||||||".
The standard method of counting on one's fingers is effectively a
unary system. Unary is most useful in counting or tallying ongoing
results, such as the score in a game of sports, since no intermediate
results need to be erased or discarded.

This is the number system used in Professor Melzak's Q machines, a
simplification of Turing machines (basically a set of pits containing
stones).
 
