Jacob Fugal
Okay. The thing making this difficult is handling things that span across
tags. Running a gsub that matches something entirely within a single
tag won't produce problems, nor will reversing it, nor will anything else
you do to it that I can think of. Matching a pattern across tags I'm
pretty sure can be done, but it'll probably be a pain to do, and I'm starting
to wonder if there's any point to it. Substitution across tags is probably
doable if you can solve the pattern matching problem, but how do you
decide sensibly what ends up in what tag?
A solution, similar to that employed by ncurses and many other UI
systems, is to use the concept of an extended character. Each
character in the string is flagged with applicable attributes.
Translating marked up ASCII to a list of extended characters is easy
enough: maintain a bitmask of attributes and turn them on/off as you
encounter tags; apply the current bitmask to each character
encountered.
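A minimal sketch of that decoding step, under some loud assumptions: the
only tags are <C name> and </C>, they are properly nested, and a flag is
just the colour name as a string. ExtChar and decode are hypothetical
names, and an array stands in for the bitmask for readability.

```ruby
# Hypothetical extended character: one ASCII character plus the flags
# (here: colour names) that were active when it was read.
ExtChar = Struct.new(:ascii_char, :flags)

# Decode markup such as "A <C red>red</C> bat." into extended characters.
# Assumption: only <C name> and </C> tags occur, and they nest properly.
def decode(markup)
  active = []  # flags currently switched on, innermost last
  chars  = []
  markup.scan(/<C (\w+)>|(<\/C>)|(.)/m) do |open, close, ch|
    if open
      active.push(open)                    # opening tag: turn the flag on
    elsif close
      active.pop                           # closing tag: turn the innermost flag off
    else
      chars << ExtChar.new(ch, active.dup) # plain character: stamp it with the current flags
    end
  end
  chars
end
```

Each character gets its own copy of the flag list (active.dup), so later
tags cannot change characters that were already decoded.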
Translating back from an extended character string to ASCII markup can
be accomplished with an algorithm like the following (I'm using an
array instead of a bitmask for readability):
def encode(extended_chars, start_flags = [], clean = false)
  current_flags = start_flags
  encoded_ascii = ''
  extended_chars.each do |char|
    # close tags for flags that no longer apply to this character
    (current_flags - char.flags).each do |flag|
      encoded_ascii << flag.close_tag
    end
    # open tags for flags that newly apply to this character
    (char.flags - current_flags).each do |flag|
      encoded_ascii << flag.open_tag
    end
    current_flags = char.flags
    encoded_ascii << char.ascii_char
  end
  # clean defaults to false; a default of 0 would always be true in Ruby
  if clean
    # close whatever is still open so the output stands on its own
    current_flags.each do |flag|
      encoded_ascii << flag.close_tag
    end
    current_flags = []
  end
  return encoded_ascii, current_flags.clone
end
Ideally, the list of extended characters would be encapsulated in an
object that acts like a string (implementing gsub, reverse, etc.). The
operations would rearrange/remove individual extended characters from
the object without changing any of the flags associated with any one
character.
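A rough sketch of such a wrapper, showing only reverse and a single-match
sub. ExtChar, ExtString, plain and chars are hypothetical names, and
flags are plain strings here; a real implementation would need gsub,
back-references and the rest of the String interface.

```ruby
# Hypothetical extended character: an ASCII character plus its flags.
ExtChar = Struct.new(:ascii_char, :flags)

# Hypothetical string-like wrapper around a list of extended characters.
class ExtString
  attr_reader :chars

  def initialize(chars)
    @chars = chars
  end

  # The text with all flags ignored; patterns are matched against this.
  def plain
    @chars.map(&:ascii_char).join
  end

  # Reorders the characters; each one keeps its own flags.
  def reverse
    ExtString.new(@chars.reverse)
  end

  # Replaces the first match of pattern with the characters of
  # replacement (another ExtString), which bring their own flags.
  def sub(pattern, replacement)
    m = pattern.match(plain)
    return self unless m
    ExtString.new(@chars[0...m.begin(0)] + replacement.chars + @chars[m.end(0)..-1])
  end
end
```

Because matching happens on the plain text and the splice happens on the
character list, a pattern can span what used to be tag boundaries
without any special handling.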
As an example application, your string would decode as follows:
something = decode("A <C red>red</C> and <C blue>blue</C> baseball bat.")
# => A, ' ', r|red, e|red, d|red, ' ', a, n, d, ' ', b|blue, l|blue,
#    u|blue, e|blue, ' ', b, a, s, ...
The regex /red and blue/ would match this substring
# r|red, e|red, d|red, ' ', a, n, d, ' ', b|blue, l|blue, u|blue, e|blue
That substring is replaced with the replacement's characters, which carry
no flags since the replacement wasn't marked up:
# o, s, t, r, i, c, h
And the result is:
# A, ' ', o, s, t, r, i, c, h, ' ', b, a, s, ...
Obviously, no part is red or blue. Assume we'd actually marked up "ostrich"
as
# "<C red>os</C>tri<C green>ch</C>"
# => o|red, s|red, t, r, i, c|green, h|green
and matched against the shorter substring "d and bl"; then the result would
be:
# A, ' ', r|red, e|red, o|red, s|red, t, r, i, c|green, h|green,
#    u|blue, e|blue, ' ', b, a, s, ...
# => "A <C red>reos</C>tri<C green>ch</C><C blue>ue</C> baseball bat."
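To check the "d and bl" example end to end, here is a self-contained
sketch that decodes, splices and re-encodes. It assumes only <C name> and
</C> tags and uses plain colour strings as flags; ExtChar, decode, encode
and sub_chars are hypothetical, simplified stand-ins for the ideas above.

```ruby
ExtChar = Struct.new(:ascii_char, :flags)

# Markup -> extended characters (assumes properly nested <C name>...</C>).
def decode(markup)
  active = []
  chars  = []
  markup.scan(/<C (\w+)>|(<\/C>)|(.)/m) do |open, close, ch|
    if open
      active.push(open)
    elsif close
      active.pop
    else
      chars << ExtChar.new(ch, active.dup)
    end
  end
  chars
end

# Extended characters -> markup, closing any tags still open at the end.
def encode(chars)
  out = +''
  current = []
  chars.each do |c|
    (current - c.flags).each { out << '</C>' }          # flags that ended
    (c.flags - current).each { |f| out << "<C #{f}>" }  # flags that started
    current = c.flags
    out << c.ascii_char
  end
  current.each { out << '</C>' }
  out
end

# Replace the first match of pattern with the given extended characters.
def sub_chars(chars, pattern, replacement)
  m = pattern.match(chars.map(&:ascii_char).join)
  return chars unless m
  chars[0...m.begin(0)] + replacement + chars[m.end(0)..-1]
end

bat     = decode("A <C red>red</C> and <C blue>blue</C> baseball bat.")
ostrich = decode("<C red>os</C>tri<C green>ch</C>")
result  = encode(sub_chars(bat, /d and bl/, ostrich))
puts result
# prints: A <C red>reos</C>tri<C green>ch</C><C blue>ue</C> baseball bat.
```

Note that the surviving "re" and the inserted "os" end up in a single red
tag: the encoder only looks at flag changes between neighbouring
characters, so adjacent runs with the same flags are merged.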
Jacob Fugal
DISCLAIMER: None of the above is intended to be complete, bug-free or
efficient. An actual implementation would need all of those. This is
just meant to be an example algorithm that would make the discussed
operations possible.