regexp for matching a string with mandatory underscores

Peter J. Holzer

RTFS. tr/// is pretty highly optimised; in particular, the 'count
characters' case has its own implementation that does no copying, except
when counting non-SvUTF8 characters in a SvUTF8 string. In that case
obviously every character of the string being counted has to be
individually converted to UTF-32, so there's no allocation but there is
effectively copying.

This sort of inefficiency is unavoidable when using UTF-8 as an internal
representation, which is why certain people are trying so hard to make
perl's internal representation opaque. Everyone now knows that using
UTF-8 was a mistake, but it can't be fixed until people get used to
keeping their fingers out.

In Pike (like Perl, a vaguely C-like interpreted language) strings always
consist of elements of equal width: all characters in a given string are
either 1 byte, 2 bytes, or 4 bytes wide. That may waste some space if you
have a string with lots of ASCII characters and one 💩 in it, but it makes
most string operations simpler.
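
To make the space trade-off concrete, here is a minimal sketch in Python of a fixed-width model like the one described above. The function names are hypothetical, and the thresholds (1 byte up to U+00FF, 2 bytes up to U+FFFF, 4 bytes otherwise) are an assumption about how such a scheme would pick its width, not a statement about Pike's actual implementation.

```python
def element_width(s: str) -> int:
    """Bytes per element so that every character of s fits in one element.

    Assumed thresholds: 1 byte up to U+00FF, 2 bytes up to U+FFFF, else 4.
    """
    widest = max(map(ord, s), default=0)
    if widest <= 0xFF:
        return 1
    if widest <= 0xFFFF:
        return 2
    return 4


def fixed_width_size(s: str) -> int:
    """Storage for the characters alone (no header) in the fixed-width model."""
    return len(s) * element_width(s)


ascii_only = "a" * 100
with_poo = ascii_only + "\U0001F4A9"  # 100 ASCII characters plus one 💩

print(fixed_width_size(ascii_only))   # every element is 1 byte wide
print(fixed_width_size(with_poo))     # one wide character widens all elements
print(len(with_poo.encode("utf-8")))  # UTF-8 size of the same string
```

The single wide character quadruples the storage of the whole string, but in exchange the n-th character is always at offset n times the element width, so indexing and length need no scanning.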

Theoretically, Perl could switch to such a model without breaking
programs (except XS code). Practically ...

hp
 
Ilya Zakharevich

RTFS. tr/// is pretty highly optimised; in particular, the 'count
characters' case has its own implementation that does no copying, except
when counting non-SvUTF8 characters in a SvUTF8 string. In that case
obviously every character of the string being counted has to be
individually converted to UTF-32, so there's no allocation but there is
effectively copying.

What makes this "obvious"? I see absolutely no need for this...
Unless you mean "copying one char at a time", not copying the whole
string. And such things MUST be documented (since in the presence of
tie()ing they are not implementation details).

This sort of inefficiency is unavoidable when using UTF-8 as an internal
representation, which is why certain people are trying so hard to make
perl's internal representation opaque. Everyone now knows that using
UTF-8 was a mistake, but it can't be fixed until people get used to
keeping their fingers out.

Why do you think this is an inefficiency? Today's machines are even more
constrained by memory than machines 10 years ago... (In proportion to the
amount of data one may [and so does] store on disk.)

In the tied (or more generally magic) case, perl calls FETCH to update
the string stored in the scalar, does the tr/// on that string, then
calls STORE to update the magic. A tie implementation that's being
careful about copying will have no additional problems because of tr///.

Now I'm absolutely confused... Are you still discussing tr/foo//
here? Do you say it WOULD call STORE?

And "being careful about copying" conveys nothing concrete to me. What
EXACTLY do you mean by that?

IMO, an operation which has the semantics of reading should NOT call STORE
on tied data...

Ilya
 
Rainer Weikusat

[...]
Using UTF-8 certainly makes the code a lot hairier, and I suspect
that costs more than the memory. You end up converting
character-at-a-time to UTF-32 practically every time you do anything
with that string, rather than being able to use fast
interfaces like wmemchr(3).

Sometimes, life is just mean. Couldn't the people who invented UTF-8
in 1993 for use on their incredibly fast machines have foreseen how
much slower hardware was going to become in the next 19 years?
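
For readers who want to see what "converting character-at-a-time" amounts to, here is a rough sketch in Python. It is not perl's implementation: the function names are hypothetical, the decoder assumes well-formed UTF-8 and does no error checking, and the fixed-width version simply stands in for the kind of flat array scan that an interface like wmemchr(3) performs.

```python
def decode_one(buf: bytes, i: int):
    """Decode the UTF-8 sequence starting at buf[i].

    Returns (code_point, index_of_next_sequence).
    Assumes buf is well-formed UTF-8; malformed input is not detected.
    """
    b = buf[i]
    if b < 0x80:                 # 1-byte sequence (ASCII)
        return b, i + 1
    if b < 0xE0:                 # 2-byte sequence
        n, cp = 1, b & 0x1F
    elif b < 0xF0:               # 3-byte sequence
        n, cp = 2, b & 0x0F
    else:                        # 4-byte sequence
        n, cp = 3, b & 0x07
    for k in range(1, n + 1):    # fold in the continuation bytes
        cp = (cp << 6) | (buf[i + k] & 0x3F)
    return cp, i + n + 1


def count_in_range_utf8(buf: bytes, lo: int, hi: int) -> int:
    """Count characters with lo <= code point <= hi, decoding one at a time."""
    count = 0
    i = 0
    while i < len(buf):
        cp, i = decode_one(buf, i)
        if lo <= cp <= hi:
            count += 1
    return count


def count_in_range_fixed(code_points, lo: int, hi: int) -> int:
    """The same count over fixed-width elements: a plain scan, no decoding."""
    return sum(1 for cp in code_points if lo <= cp <= hi)


text = "a_b_\u00e9_\U0001F4A9"
utf8 = text.encode("utf-8")
fixed = [ord(c) for c in text]

print(count_in_range_utf8(utf8, ord("_"), ord("_")))
print(count_in_range_fixed(fixed, ord("_"), ord("_")))
```

Nothing is allocated in the UTF-8 loop, but every character is still reassembled from its bytes before it can be compared, which is the cost being argued about in this thread.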
 
