Conrad said:
Yes, I was assuming simple English sentences, where \b will usually work
(and it doesn't matter when toUpperCase is applied to digits or the
underscore).
It matters because it would be needlessly inefficient.
In this case, my earlier example could even be simplified to:
v1 = v1.replace(/\b\w/g, function (c) {
  return c.toUpperCase();
});
Correct, \b would match the empty string before the \w then.
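A minimal sketch for illustration (the function name capitalizeAscii is hypothetical): in ECMAScript, \w covers only [A-Za-z0-9_], so the shortcut does what is wanted on plain English text, but it finds a "word boundary" inside a word as soon as that word starts with a non-ASCII letter.

```javascript
// Hypothetical helper wrapping the replacement quoted above.
function capitalizeAscii(s) {
  // \b\w matches the first ASCII word character after a word boundary.
  return s.replace(/\b\w/g, function (c) {
    return c.toUpperCase();
  });
}

// Plain English: every word gets an initial capital.
console.log(capitalizeAscii("hello world"));      // "Hello World"

// Digits and the underscore are matched too, but toUpperCase leaves them alone.
console.log(capitalizeAscii("2 be_or not"));      // "2 Be_or Not"

// "\u00e9" (e with acute) is not a \w character, so the boundary is found
// between it and the "l", and the wrong letter is uppercased.
console.log(capitalizeAscii("\u00e9lan vital"));  // "\u00e9Lan Vital"
```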
Your character class approach (in your other post) would work if the
character set is known and rather small. Latin1, for example, could use
[a-zàáâãäåæçèéêëìíîïðñòóôõöøßùúûüý]. But if we're assuming a random
international setting, this is going to be a lot harder.
Harder, granted.
Creating a character class that would work on the complete Unicode set
would be almost impossible, and also error prone.
I do not think any of the above would apply, though. ISTM you are
unaware of the fact that, while the Unicode Standard (4.0) already defines a
finite character set of which ECMAScript implementations only support the
Basic Multilingual Plane (U+0000 to U+FFFF), the number of characters that
can be subject to case switching is even more limited, and that character
ranges can be used in regular expressions, whereas their boundaries can also
be written as Unicode escape sequences.
All it takes is a bit of research on the defined Unicode character ranges
and the scripts (as in writing) they provide support for. Take some Latin
character ranges for example:
/[a-z\u00c0-\u00f6\u00f8-\u00ff\u0100-\u017f\u0180-\u01bf\u01c4-\u024f]/i
(This can be optimized, of course, but it helps [you] to get the picture.)
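As a rough sketch of how such a range-based class could be put to use for the capitalization at hand: the names LATIN and capitalizeWords are made up for this example, the ranges are the unoptimized ones from above with A-Z spelled out instead of relying on the i flag, and only Latin-script letters are assumed to occur.

```javascript
// Hypothetical: the Latin ranges from above as a character class body.
var LATIN = 'a-zA-Z\\u00c0-\\u00f6\\u00f8-\\u00ff' +
            '\\u0100-\\u017f\\u0180-\\u01bf\\u01c4-\\u024f';

// A letter counts as word-initial if it is at the start of the string
// or directly preceded by a character outside the class.
var WORD_START = new RegExp('(^|[^' + LATIN + '])([' + LATIN + '])', 'g');

function capitalizeWords(s) {
  return s.replace(WORD_START, function (m, before, first) {
    return before + first.toUpperCase();
  });
}

// "\u00e9" and "\u00fc" are now treated as letters of their words.
console.log(capitalizeWords("\u00e9lan vital \u00fcber alles"));
// "\u00c9lan Vital \u00dcber Alles"
```

Note that anything outside the class acts as a separator here, so an apostrophe inside a word ("don't") would still trigger a capital after it; that would have to be handled separately.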
It would be simpler to define custom "word boundary" characters, and just
let JavaScript uppercase everything following them:
Would it? ISTM the punctuation of languages is a lot more complicated than
their letters; take Spanish, for example. But then ISTM capitalizing titles
is not something that is common in other languages than English, and some
even consider it deprecated there already. However, for uniformity one
might be inclined to apply this formatting to non-English (song) titles as
well; I have seen that before.
var wBound = '\\s,.;:?!\'"';
var rex = new RegExp('(^|[' + wBound + '])([^' + wBound + '])', 'g');
v1 = v1.replace(rex, function (s, g1, g2) {
  return g1 + g2.toUpperCase();
});
That does not make much sense, though, since with the exception of white
space, and single and double quote, none of those (punctuation) characters
is likely to occur directly before something that can be considered a word
character. In fact, it is customary to have (white) space between those
characters and the word character to be uppercased, so there would never be
a match then.
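To make that concrete, here is a small sketch (capitalizeAfter and the two boundary strings are names chosen for this example only) that runs the quoted approach once with the full wBound and once with only white space and the two quote characters. On conventionally punctuated text the results are identical, because the character that actually precedes the word is the space, not the comma or colon.

```javascript
// Hypothetical wrapper around the quoted approach, with the
// boundary characters passed in as a character class body.
function capitalizeAfter(bound, s) {
  var rex = new RegExp('(^|[' + bound + '])([^' + bound + '])', 'g');
  return s.replace(rex, function (m, g1, g2) {
    return g1 + g2.toUpperCase();
  });
}

var full = '\\s,.;:?!\'"';   // wBound as quoted
var reduced = '\\s\'"';      // white space and quotes only

var title = 'hello, world: a "new" day';

console.log(capitalizeAfter(full, title));     // 'Hello, World: A "New" Day'
console.log(capitalizeAfter(reduced, title));  // 'Hello, World: A "New" Day'
```

The two only differ when a punctuation character is directly followed by a letter, as in "a.b".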
wBound would still have to be adjusted as required to include, for
example, different types of quotes, or the Japanese/Chinese full stop
character (。).
I am afraid it would have to be rewritten entirely anyway.
PointedEars