RegExp as Finite State Machine

Thomas 'PointedEars' Lahn

kangax said:
Thomas said:
[1] `/(a|b|c)/` and `/(a|b\|c)/` produce `/(a|c)/` instead of a proper `a`
I don't see how this can be accomplished using .source.split
(/.../) since we don't have negative lookbehind (?<!) in ECMAScript
implementations with which you could exclude `\|' as a delimiter; so
it probably needs to be solved with RegExp-based string parsing.

Not sure about RegExp-based parsing, since escape sequences could be of
arbitrary length (I don't think it's possible to detect whether a
character is preceded by an odd number, `2n+1`, of `\` and so is escaped).

A character `x' can only be preceded by one `\' -- `\x' -- because `\\\x'
means that there is a literal `\' before the `\x' in the expression. So it
suffices to exclude cases where special characters like `|' are preceded by
one backslash.
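
For reference, the "odd number of backslashes" rule discussed above can be checked outside a regular expression with a short backward scan. This is only a minimal sketch; `isEscapedAt` is a hypothetical helper name that does not appear in the thread, and it assumes `\` is the only escape character.

```javascript
// Hypothetical helper (not from the thread): reports whether the character
// at `index` is escaped, i.e. preceded by an odd number of backslashes.
function isEscapedAt(s, index) {
  var count = 0;

  // walk backwards over the run of backslashes directly before `index`
  for (var i = index - 1; i >= 0 && s.charAt(i) === "\\"; i--) {
    count++;
  }

  return count % 2 === 1;
}

isEscapedAt("a|b", 1);       // false: no backslash before "|"
isEscapedAt("a\\|b", 2);     // true:  one backslash before "|"
isEscapedAt("a\\\\|b", 3);   // false: two backslashes pair up into a literal "\"
```

In the last call the string contains two backslashes followed by `|`, so the helper reports the `|` as unescaped.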
A simple parser, on the other hand, seems to solve the problem nicely
(although I'm sure it can't compare in speed with a `split`-based approach):

function split(string, separator) {
  var arr = string.split(''),
      result = [],
      IS_ESC = false,
      ESC_CHAR = '\\',
      chr,
      lastIdx = 0;
  for (var i = 0, len = arr.length; i < len; i++) {
    chr = arr[i];
    if (IS_ESC) {
      // the previous character was an unescaped backslash,
      // so this character is literal and cannot be a separator
      IS_ESC = false;
      continue;
    }
    if (chr == ESC_CHAR) {
      IS_ESC = true;
      continue;
    }
    if (chr === separator) {
      result.push(string.substring(lastIdx, i));
      lastIdx = i + 1;
    }
  }
  // remainder after the last unescaped separator
  result.push(string.substring(lastIdx));
  return result;
}


With RegExp-based parsing, it would be

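function split(s, separator)
{
  var
    // A separator counts only if the character before it is not a backslash.
    // The separator is escaped by prefixing "\", which is only safe for a
    // single non-alphanumeric character such as "|".
    rx = new RegExp("[^\\\\]\\" + separator, "g"),
    m,
    a = [],
    i = 0;

  while ((m = rx.exec(s)))
  {
    // keep the character matched by [^\\], drop the separator itself
    a.push(s.substring(i, rx.lastIndex - 1));
    i = rx.lastIndex;
  }

  // remainder after the last match
  a.push(s.substring(i, s.length));

  return a;
}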

That's just a quick hack, though. It doesn't work unchanged with arbitrary
separators.
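
One way to lift those restrictions, offered here only as a sketch and not as the approach either poster settled on, is to let the pattern consume every backslash-plus-character pair first and treat the separator as a split point only when it is matched on its own. The names `escapeForRegExp` and `splitUnescaped` are hypothetical; the sketch assumes a non-empty separator and `\` as the only escape character.

```javascript
// Hypothetical helper: escape a string for literal use inside a RegExp.
function escapeForRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical sketch: split `s` at every unescaped occurrence of `separator`.
function splitUnescaped(s, separator) {
  if (!separator) {
    // an empty separator would produce zero-length matches and loop forever
    throw new Error("separator must not be empty");
  }

  var
    // first alternative: a backslash plus the character it escapes;
    // second alternative (group 1): the literal separator
    rx = new RegExp("\\\\[\\s\\S]|(" + escapeForRegExp(separator) + ")", "g"),
    m,
    a = [],
    i = 0;

  while ((m = rx.exec(s))) {
    if (m[1]) {
      // group 1 matched, so this is a real (unescaped) separator
      a.push(s.substring(i, m.index));
      i = rx.lastIndex;
    }
    // otherwise an escape pair was consumed; nothing to split here
  }

  a.push(s.substring(i));

  return a;
}

splitUnescaped("a|b\\|c", "|");   // ["a", "b\\|c"]
splitUnescaped("a\\\\|b", "|");   // ["a\\\\", "b"]
```

Because escape pairs are consumed left to right, a run of two backslashes is read as one escaped backslash, and a separator that follows it is still recognized.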


PointedEars
 
Thomas 'PointedEars' Lahn

kangax said:
I'm not sure I understand.

Wouldn't you want to account for escape character *itself* being escaped
(forcing following separator to be interpreted as a meta character again
- following RegExp semantics - and so on, arbitrary amount of times)?

Yes, I would. However, what you seem to miss is the following:
I.e. -

"|" - interpreted as meta character
"\\|" - interpreted as literal "|" (since it is escaped with one "\")

Full ACK since we are talking strings to be passed to RegExp() in the end.
"\\\\|" - interpreted as literal "\" (since it is escaped with one "\")
followed by (this time - unescaped) "|" meta character

So what we have here is just the first case preceded by a literal backslash;
as a RegExp: `/\\|/'. There is no apparent need to handle this case as a
special one. Of course, the passed delimiter always has to be escaped
before being reasonably usable as a search pattern in RegExp-based parsing.
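
The three string cases quoted above can be tried directly by passing them to RegExp(). This is a small illustration only; the surrounding `^(?:a ... b)$` wrapper and the variable names are assumptions added here so that the effect of each `|` is visible.

```javascript
// Each string literal is what would be passed to RegExp();
// the trailing comment shows the resulting pattern.
var alternation      = new RegExp("^(?:a|b)$");      // /^(?:a|b)$/
var literalPipe      = new RegExp("^(?:a\\|b)$");    // /^(?:a\|b)$/
var literalBackslash = new RegExp("^(?:a\\\\|b)$");  // /^(?:a\\|b)$/

alternation.test("a");            // true:  "|" is a meta character
alternation.test("a|b");          // false

literalPipe.test("a|b");          // true:  "\|" is a literal "|"
literalPipe.test("a");            // false

literalBackslash.test("a\\");     // true:  "a" followed by a literal "\"
literalBackslash.test("b");       // true:  "|" is a meta character again
literalBackslash.test("a|b");     // false
```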


PointedEars
 
