Regexp: Case-insensitive matching | N factorial

gentsquash

In a setting where I can specify only a JS regular
expression, but not the JS code that will use it, I seek
a regexp component that matches a string of letters,
ignoring case. E.g., for "cat" I'd like the effect of

([Cc][Aa][Tt])

but without having to have many occurrences of [Xx].


Secondly, what is an efficient regexp that matches a
string exactly when ALL words in a certain list occur in
the string. I'd like the effect of

(cat.*nip|nip.*cat)

except that there are N words rather than just the two
words "cat" and "nip". (I can assume that no word in the
list is a prefix of any other.) Naturally, I'm looking for
a regexp-solution that does not involve listing all
N factorial
many orderings.

--Jonathan LF King, Mathematics dept, Univ. of Florida
 
RobG

In a setting where I can specify only a JS regular
expression, but not the JS code that will use it, I seek
a regexp component that matches a string of letters,
ignoring case. E.g., for "cat" I'd like the effect of

([Cc][Aa][Tt])

but without having to have many occurrences of [Xx].

var reA = /cat/i;

Will match the string 'cat' anywhere it appears regardless of case.
If you want to match the word cat exactly, then:

var reA = /\bcat\b/i;

Sample use:

if (reA.test(string)) {
  // the pattern was found
}
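A small sketch of the two patterns above, plus a fallback for the case where the 'i' flag really cannot be supplied. The helper name caseInsensitiveSource is hypothetical (not from this thread), and it assumes the word contains only ASCII letters and no regular expression metacharacters:

```javascript
// With the flag available, the engine ignores case for us.
var anywhere = /cat/i;       // also matches inside longer words
var wholeWord = /\bcat\b/i;  // requires a word boundary on both sides

// Hypothetical helper for when only the pattern source can be given:
// expands each ASCII letter into a two-character class,
// e.g. "cat" becomes "[Cc][Aa][Tt]".
function caseInsensitiveSource(word) {
  return word.replace(/[A-Za-z]/g, function (ch) {
    return '[' + ch.toUpperCase() + ch.toLowerCase() + ']';
  });
}

var source = caseInsensitiveSource('cat');
var noFlag = new RegExp(source); // note: no 'i' flag here

console.log(source);                         // [Cc][Aa][Tt]
console.log(noFlag.test('CaT'));             // true
console.log(anywhere.test('Concatenate'));   // true
console.log(wholeWord.test('Concatenate'));  // false
```

The generated source is still a run of [Xx] classes, but at least nobody has to type them by hand.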

Secondly, what is an efficient regexp that matches a
string exactly when ALL words in a certain list occur in
the string. I'd like the effect of

(cat.*nip|nip.*cat)

I'm not sure what you mean by "matches a string exactly". Do you mean
the word?

If you meant you want a single RegExp to match a set of patterns in
any order (i.e. in the above example either cat then nip or nip then
cat), I don't think that can be done.

Javascript regular expressions have an alternative operator '|' (kind
of an OR operator), but no equivalent for AND. Lookahead doesn't help
either, as it still requires an order to the patterns.

It can easily be done in a loop using RegExp as a constructor, but I
don't think that's what you want, e.g.

function matchWords(s, wordArray) {
  var i = wordArray.length;
  var result = true;
  // Walk the list backwards and stop at the first word that is missing
  while (i-- && result) {
    var re = new RegExp('\\b' + wordArray[i] + '\\b', 'i');
    result = re.test(s);
  }
  return result;
}

alert( matchWords('The cat ate some cat nip', ['nip','cat']) );


Note that when using RegExp to construct a regular expression from a
string, the backslash '\' character denoting a special character must
itself be escaped and so becomes '\\'. Also, the regular expression's
idea of a word boundary might be different to what you expect.
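To illustrate the escaping point, here is a rough sketch comparing a literal with the constructor form. The escapeRegExp helper is an assumption added for illustration (it is not part of the code above); it guards against words that happen to contain regular expression metacharacters:

```javascript
var literal = /\bcat\b/i;
var built = new RegExp('\\bcat\\b', 'i'); // same pattern as the literal
var wrong = new RegExp('\bcat\b', 'i');   // '\b' in a string is a backspace (U+0008)

console.log(literal.source === built.source); // true
console.log(built.test('a cat here'));        // true
console.log(wrong.test('a cat here'));        // false

// Hypothetical helper: backslash-escape metacharacters in a word
// before it is concatenated into a pattern string.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var unescaped = new RegExp('\\b' + 'c.t' + '\\b', 'i');
var escaped = new RegExp('\\b' + escapeRegExp('c.t') + '\\b', 'i');

console.log(unescaped.test('cat')); // true, '.' matches any character
console.log(escaped.test('cat'));   // false, only a literal "c.t" matches
```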

except that there are N words rather than just the two
words "cat" and "nip". (I can assume that no word in the
list is a prefix of any other.) Naturally, I'm looking for
a regexp-solution that does not involve listing all
N factorial
many orderings.

I don't think you can do that with a single regular expression.
 
Lasse Reichstein Nielsen

RobG said:
If you meant you want a single RegExp to match a set of patterns in
any order (i.e. in the above example either cat then nip or nip then
cat), I don't think that can be done.
Javascript regular expressions have an alternative operator '|' (kind
of an OR operator), but no equivalent for AND. Lookahead doesn't help
either, as it still requires an order to the patterns.

How about:

(?=.*\bcat\b)(?=.*\bnip\b)(?=.*\bfoo\b)(?=.*\bbar\b)(?=.*\bbaz\b)

I.e., several lookaheads.
It won't be pretty, and it definitely won't perform very well, but
it should be correct.
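A quick sketch of why order stops mattering: each lookahead is checked from the same starting position and consumes nothing. The sample strings are made up. The anchored variant is my assumption about one way to cut down the retries on a failed match; it is not something measured here:

```javascript
// Every lookahead scans ahead from the same position,
// so the words may appear in any order.
var allWords = /(?=.*\bcat\b)(?=.*\bnip\b)(?=.*\bfoo\b)/i;

// Assumption: anchoring at the start means a failing match is tried
// once, instead of once per character position in the string.
var anchored = /^(?=.*\bcat\b)(?=.*\bnip\b)(?=.*\bfoo\b)/i;

var samples = ['foo nip cat', 'Cat, then FOO, then nip', 'cat and nip only'];
var results = samples.map(function (s) {
  return [allWords.test(s), anchored.test(s)];
});

console.log(results); // [[true, true], [true, true], [false, false]]
```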

/L
 
RobG

How about:

 (?=.*\bcat\b)(?=.*\bnip\b)(?=.*\bfoo\b)(?=.*\bbar\b)(?=.*\bbaz\b)

I.e., several lookaheads.
It won't be pretty, and it definitely won't perform very well, but
it should be correct.

Cool, I thought that order would still matter. For the OP: the string
needs to be a single line with no line feeds etc., because '.' does
not match line terminators. Some play code:


<script type="text/javascript">

function getRE(wordArray) {
  var re = [];
  for (var i=0, len=wordArray.length; i<len; i++) {
    // One lookahead per word, each tested from the same position
    re.push('(?=.*\\b' + wordArray[i] + '\\b)');
  }
  return new RegExp(re.join(''), 'i');
}

</script>

<textarea id="ta">The cat sat on the mat and
drank the milk</textarea>
<input id="inp0" type="text" value="milk cat sat">
<input type="button" value="Test" onclick="

// Make sure s is a single line of text
var s = document.getElementById('ta').value.replace(/\s/g,' ');
var words = document.getElementById('inp0').value.split(' ');
var re = getRE(words);
alert(
'String: ' + s +
'\n\nExpression: ' + re +
'\n\nTest: ' + re.test(s)
);

">


PS. Putting many statements inside the value of an onclick attribute
is not good form, but OK for play code. :)
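If flattening the text first is not an option, one possible variation is to swap '.' for the class [\s\S], which also matches line terminators. This is only a sketch: getMultilineRE is a hypothetical name, and it assumes the words contain no regular expression metacharacters:

```javascript
// Variation on getRE: [\s\S] matches any character, including line
// feeds, so the lookaheads can see past the end of the first line.
function getMultilineRE(wordArray) {
  var parts = [];
  for (var i = 0; i < wordArray.length; i++) {
    parts.push('(?=[\\s\\S]*\\b' + wordArray[i] + '\\b)');
  }
  return new RegExp('^' + parts.join(''), 'i');
}

var text = 'The cat sat on the mat and\ndrank the milk';

var dotVersion = /^(?=.*\bmilk\b)(?=.*\bcat\b)(?=.*\bsat\b)/i;
var anyCharVersion = getMultilineRE(['milk', 'cat', 'sat']);

console.log(dotVersion.test(text));     // false, "milk" is on the second line
console.log(anyCharVersion.test(text)); // true
```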
 
Thomas 'PointedEars' Lahn

RobG said:
If you want to match the word cat exactly, then:

var reA = /\bcat\b/i;

That depends on how you define a word. If you define a word as a sequence
of word characters as specified in the ECMAScript Language Specification,
Ed. 3 Final, section 15.10.2.6 (i.e. those matching /[0-9A-Za-z_]/), you are
right.

However, for example "Menü" is a word in German, and

var reA = /\bmen\b/i;

will (only) match the "Men" in "Menü" there. Because `ü' is not considered
a word character per the Specification, and so the empty word ε between "n"
and "ü" constitutes a word boundary matched by /\b/ (as e.g.

"Menü".match(/\bmen\b/i)

shows).

So for matching Unicode words in strings, you have to use

var reA = /(^|\s)cat(\s|$)/i;

instead; that is, a character sequence (here: without whitespace in-between)
bounded by whitespace, or one or two input boundaries.
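The difference can be checked with a short sketch (the second sample sentence is made up; "\u00fc" is the escape for "ü", used so that the file encoding does not matter):

```javascript
var asciiBoundary = /\bmen\b/i;
var whitespaceBounded = /(^|\s)men(\s|$)/i;

var word = 'Men\u00fc'; // "Menü"

// \b sees a boundary between "n" and "ü", since "ü" is not in [0-9A-Za-z_]
console.log(asciiBoundary.test(word));      // true
console.log(word.match(asciiBoundary)[0]);  // "Men"

// The whitespace-bounded form needs whitespace or the end of input after "men"
console.log(whitespaceBounded.test(word));           // false
console.log(whitespaceBounded.test('das Men hier')); // true
```

Note that the whitespace-bounded form consumes the surrounding whitespace, so the matched text includes it.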


PointedEars
 
RobG

RobG said:
If you want to match the word cat exactly, then:
var reA = /\bcat\b/i;

That depends on how you define a word. If you define a word as a sequence
of word characters as specified in the ECMAScript Language Specification,
Ed. 3 Final, section 15.10.2.6 (i.e. those matching /[0-9A-Za-z_]/), you are
right.

However, for example "Menü" is a word in German, and

var reA = /\bmen\b/i;

will (only) match the "Men" in "Menü" there. Because `ü' is not considered
a word character per the Specification,

Hence I included the sentence "Also, the regular expression's idea of
a word boundary might be different to what you expect."

and so the empty word ε between "n"
and "ü" constitutes a word boundary matched by /\b/ (as e.g.

"Menü".match(/\bmen\b/i)

shows).

So for matching Unicode words in strings, you have to use

var reA = /(^|\s)cat(\s|$)/i;

That expression is commonly used for matching values in the HTML class
attribute where the separator is specified as being whitespace. It is
not sufficient for matching words in general where they may be
followed by punctuation marks such as commas, semi-colons, colons,
dashes, periods and so on.
 
Thomas 'PointedEars' Lahn

RobG said:
Thomas said:
RobG said:
If you want to match the word cat exactly, then:
var reA = /\bcat\b/i;
That depends on how you define a word. If you define a word as a sequence
of word characters as specified in the ECMAScript Language Specification,
Ed. 3 Final, section 15.10.2.6 (i.e. those matching /[0-9A-Za-z_]/), you are
right.

However, for example "Menü" is a word in German, and

var reA = /\bmen\b/i;

will (only) match the "Men" in "Menü" there. Because `ü' is not considered
a word character per the Specification,

Hence I included the sentence "Also, the regular expression's idea of
a word boundary might be different to what you expect."

It was easy to overlook and provides no explanation as to what should be
expected instead.

That expression is commonly used for matching values in the HTML class
attribute where the separator is specified as being whitespace. It is
not sufficient for matching words in general where they may be
followed by punctuation marks such as commas, semi-colons, colons,
dashes, periods and so on.

Good point. However, a character class can take care of that. For example,
in Unicode text that uses only ASCII and Latin-1 punctuation:

var reA = /(^|[\s,;:.-])cat([\s,;:.-]|$)/i;

But whether a punctuation mark really delimits a word is a matter of
language, interpretation, and personal taste. For example, the HYPHEN-MINUS
character ("-") may have been used as hyphen in compounds.

An alternative would be to use the \w escape sequence to build your own
character class:

var reA = /(^|[^\wäöü])cat([^\wäöü]|$)/i;
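As a sketch of how such a class could be built for an arbitrary word: boundedWord and its extraLetters parameter are hypothetical names, and the word is assumed to contain no regular expression metacharacters.

```javascript
// Builds /(^|[^\w<extra>])word([^\w<extra>]|$)/i, where <extra> lists
// the additional letters that should count as word characters.
function boundedWord(word, extraLetters) {
  var nonWord = '[^\\w' + extraLetters + ']';
  return new RegExp('(^|' + nonWord + ')' + word + '(' + nonWord + '|$)', 'i');
}

// "\u00e4\u00f6\u00fc" is "äöü", written as escapes to avoid encoding trouble
var re = boundedWord('cat', '\u00e4\u00f6\u00fc');

console.log(re.test('Kater, cat.')); // true, bounded by a space and a period
console.log(re.test('cat-nip'));     // true, "-" is not a word character
console.log(re.test('cat\u00fc'));   // false, "ü" now counts as a word character
console.log(re.test('concat'));      // false, preceded by "n"
```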


PointedEars
 
Dr J R Stockton

In comp.lang.javascript message
<6aa0c1c4-b785-4da1-9107-b681df097261@c58g2000hsc.googlegroups.com>,
Wed, 25 Jun 2008 15:31:37, (e-mail address removed) posted:
In a setting where I can specify only a JS regular
expression, but not the JS code that will use it, I seek
a regexp component that matches a string of letters,
ignoring case. E.g., for "cat" I'd like the effect of

([Cc][Aa][Tt])

but without having to have many occurrences of [Xx].

If all else fails, read the manual. There are links in
<URL:http://www.merlyn.demon.co.uk/js-valid.htm>.


Note that the average intellectual level of those who post with @gmail
addresses is so low that readers may kill-file it /in toto/.

Secondly, what is an efficient regexp that matches a
string exactly when ALL words in a certain list occur in
the string. I'd like the effect of

(cat.*nip|nip.*cat)

except that there are N words rather than just the two
words "cat" and "nip". (I can assume that no word in the
list is a prefix of any other.) Naturally, I'm looking for
a regexp-solution that does not involve listing all
N factorial
many orderings.

I doubt whether one exists to do a direct match, at least if it is to be
compatible with any user agent that knows RegExps.

But one could use S2 = S1.replace(/cat|nip/gi, "") and see whether the
difference in length between S1 and S2 equals the total length of the
words, provided that no word can occur more than once and matchable
words cannot overlap.
--Jonathan LF King, Mathematics dept, Univ. of Florida
DSS.
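A rough sketch of that length comparison, including a case where the stated proviso is violated. The name containsEachOnce is hypothetical, and the words are assumed to contain no regular expression metacharacters:

```javascript
// Remove every occurrence of any listed word, then compare the number
// of characters removed with the combined length of the words.
function containsEachOnce(s, words) {
  var total = 0;
  for (var i = 0; i < words.length; i++) {
    total += words[i].length;
  }
  var stripped = s.replace(new RegExp(words.join('|'), 'gi'), '');
  return s.length - stripped.length === total;
}

var words = ['cat', 'nip'];

console.log(containsEachOnce('Some nip for the cat', words)); // true
console.log(containsEachOnce('The cat sat', words));          // false

// Proviso violated: "cat" occurs twice and "nip" not at all, yet six
// characters are removed, so the check wrongly reports true.
console.log(containsEachOnce('cat cat', words));              // true
```

The last call shows why the "no word can occur more than once" condition matters.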
 
