William Ahern wrote:
BTW: I have an iovec based buffer tokenizer that utilizes a source,
source_len, delimiter array, and optional character class flags (the
standard ctypes) if you're interested. Approx 3M iovec/sec tokenization
on a P (celeron) without modifying any of the source buffer. Basically,
it's like this:
I have something similar (almost exact, actually), which I call splitv()
(after the example split() function included in the strtok man page on many
Unices). But I never put in character class support; it just takes an
optional (1 << CHAR_BIT)-character map for selecting delimiters.

I'd still be interested to see yours. I had a flag in mine to return empty
fields, but it's broken, which means something in my loop needs to be smarter.
I've never gone back to it because so far I've never wanted the empty fields.
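
Roughly, the shape of the thing, as a minimal sketch; the names, the exact
signature, and the SPLITV_EMPTY flag here are illustrative only, not my
actual code:

#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

#define SPLITV_EMPTY 0x01	/* hypothetical flag: also return zero-length fields */

size_t
splitv(struct iovec *iov, size_t iovcnt,
       const void *src, size_t srclen,
       const unsigned char delim[1 << CHAR_BIT], int flags)
{
	const unsigned char *p = src, *end = p + srclen;
	size_t n = 0;

	while (p <= end && n < iovcnt) {
		const unsigned char *field;

		if (!(flags & SPLITV_EMPTY)) {
			/* strtok-style: a run of delimiters is one separator,
			 * so skip it and never emit an empty field */
			while (p < end && delim[*p])
				p++;
			if (p == end)
				break;
		}

		/* the field runs to the next delimiter or the end of the buffer */
		field = p;
		while (p < end && !delim[*p])
			p++;

		iov[n].iov_base = (void *)field;	/* points into src; src is never written */
		iov[n].iov_len  = (size_t)(p - field);
		n++;

		if (p == end)
			break;
		p++;	/* step over the single delimiter just consumed */
	}

	return n;
}

Splitting on whitespace is then just a matter of setting delim[' '] =
delim['\t'] = delim['\n'] = 1 and handing splitv() an iovec array to fill;
the caller owns all of the storage, and nothing in the source buffer is ever
modified.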
Then, if I wanted to push my luck, I'd deprecate all the wide-character
interfaces and include a small suite of functions for manipulating UTF-8
encoded Unicode strings. I might start by adding a struct uvec, something
like: struct uvec { uchar *uv_base; [a bunch of opaque members,
conspicuously and intentionally missing uv_len] }.
Absolutely do NOT remove the length.
Well, the notion is that given the history of C strings, the terminology
is overloaded. Is the "length" the "sizeof" of the object, where the units
are C's notion of bytes, or are we talking about the number of graphemes,
etc.? I would remove something like uv_len and replace it with two or more
members or functions, each with a more precise, less ambiguous name. The
problem with Unicode string processing is that it violates many of the
supposedly intrinsic properties of strings that have variously benefited
and plagued programmers for years.

Take the tokenizer above, for instance. In written Thai there generally
aren't any word or symbol delimiters at all, just one long string of
syllables. To tokenize you actually need a language dictionary, and you
parse from right to left trying to form words; when the next syllable
couldn't possibly result in a legitimate word, you break. So with Thai it's
an all-or-nothing proposition: you either manipulate the text using a
sophisticated and capable interface, or you treat it as an opaque chunk of
memory. All this "wide-char" nonsense is completely and utterly useless.

We must permanently put to rest the practice in text processing of
conflating "characters" and "bytes" (indeed, at many levels we should remove
the notion of characters altogether; in one sense their usefulness rests
solely with parsing so-called human-readable network protocols which employ
ASCII, but then that's not really "text").
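
To make the distinction concrete, here is roughly the sort of thing I have
in mind; the names are purely hypothetical, the members shown would really
be opaque, and the grapheme count is left as a bare declaration because it
genuinely needs Unicode segmentation data behind it:

#include <stddef.h>

typedef unsigned char uchar;

/* hypothetical: the real structure would keep everything but uv_base private */
struct uvec {
	uchar  *uv_base;	/* UTF-8 encoded data; not dependent on a NUL terminator */
	size_t  uv_size;	/* private bookkeeping: storage size in bytes */
};

/* storage size in C bytes -- what malloc(), memcpy() and write() care about */
size_t
uvec_bytes(const struct uvec *uv)
{
	return uv->uv_size;
}

/* number of Unicode code points: count every byte that is not a
 * UTF-8 continuation byte (10xxxxxx) */
size_t
uvec_codepoints(const struct uvec *uv)
{
	size_t i, n = 0;

	for (i = 0; i < uv->uv_size; i++) {
		if ((uv->uv_base[i] & 0xc0) != 0x80)
			n++;
	}

	return n;
}

/* grapheme clusters -- the closest thing to what a reader calls a
 * "character" -- can't be counted without segmentation tables */
size_t uvec_graphemes(const struct uvec *uv);

The point being that none of those three numbers deserves to monopolize a
name as loaded as uv_len.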
Thus, this is also why I'd deprecate the wide-character support. It fails
to provide any sort of usefulness at the memory management level, nor does
it comprehensively solve the internationalization issue (heck, technically
it doesn't even allow you the privilege of conflating characters and bytes
if you so chose, which will always be useful for historical and technical
reasons, as mentioned above). Also, the whole notion of environment locales
is broken. But I digress.

In any event, I think the answer to both of those is to standardize on
Unicode strings, and a UTF-8 encoding specifically, so when you need to you
can fall back to more traditional string and memory management, while also
opening the door to sophisticated and comprehensive internationalized text
processing. This, I think, proceeds from and enhances the best qualities of
C. And if the idea of including such a large string interface with C were
too repugnant, I'd still implement everything else: rip out wide characters,
fix the char signedness headaches, etc. They stand on their own merits.
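
And falling back really does work, precisely because of how UTF-8 was
designed: every byte of a multibyte sequence is 0x80 or above, so it can
never collide with an ASCII delimiter. A small illustration, reusing the
hypothetical splitv() sketch from earlier in this message:

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

/* the splitv() sketch from earlier in this message */
size_t splitv(struct iovec *, size_t, const void *, size_t,
              const unsigned char[1 << CHAR_BIT], int);

int
main(void)
{
	/* "naïve:café" in UTF-8; the accented letters are two bytes each */
	const char *s = "na\xc3\xafve:caf\xc3\xa9";
	unsigned char delim[1 << CHAR_BIT] = { 0 };
	struct iovec fields[8];
	size_t i, n;

	delim[':'] = 1;
	n = splitv(fields, 8, s, strlen(s), delim, 0);

	/* prints "naïve" and "café" intact: no code point is ever cut,
	 * since continuation bytes (0x80-0xBF) can never equal ':' */
	for (i = 0; i < n; i++)
		printf("%.*s\n", (int)fields[i].iov_len,
		       (char *)fields[i].iov_base);

	return 0;
}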