sln
-----------------------------------------------------
Hey, first off, I really appreciate your responses, especially on Unicode.
I am newly interested in this and am learning, but understand a good portion.
Below, I'm just going to briefly clarify some of my previous statements.
Thanks!
-sln
-----------------------------------------------------
I guess this phrasing skipped a few things. 'Streams' is not really a
stand-alone definition of anything, but shorthand for doing operations
on file descriptors in the kernel, via an API (POSIX). Certainly there
is no regular expression engine, or anything like that, in the kernel.
You don't need regexps at all to parse XML (or any other language).
And you certainly don't need to do them on streams, since you can always
read the next block or line from the stream and append it to your
buffer.
You certainly don't need regexps to parse XML, and you certainly don't need
regexps to do string comparisons on XML. 'Stream processing' however, has a
more abstract meaning. Basically it means processing locally disposable data,
while traversing a buffer of a kernel file descriptor and not waiting for the
end of file/low-level i/o, device, pipe, or whatever the descriptor refers to.
You certainly can't do that in the kernel. The key is that a small user buffer
is populated as the 'stream' passes through it. The buffer is either fixed size
or expands and contracts slightly as necessary to process events as they are
parsed, in computer time not necessarily real time.
The machinations of 'buffering', which seem to indicate some delineation in
your mind, have nothing to do with 'stream' parsing or processing, only with the notion
of incremental processing.
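
A minimal sketch of that kind of incremental processing, under some loud assumptions: scan_tags() is a hypothetical helper, an in-memory filehandle stands in for a real file, pipe or socket, and the pattern is a naive one that treats anything between '<' and '>' as a tag (so a '>' inside an attribute value would fool it). The point is only that the user buffer holds a small unfinished tail, never the whole input:

```perl
use strict;
use warnings;

# Hypothetical helper: report every <...> tag seen on $fh, reading
# $chunk characters at a time and keeping only an unfinished tail in $buf.
sub scan_tags {
    my ($fh, $chunk, $on_tag) = @_;
    my $buf = '';
    # read() appends at offset length($buf) and returns 0 at end of file.
    while (read($fh, $buf, $chunk, length $buf)) {
        # Consume each complete tag, discarding the text in front of it.
        while ($buf =~ s/\A.*?(<[^<>]*>)//s) {
            $on_tag->($1);
        }
        # Keep only a possibly incomplete tag for the next round.
        my $i = rindex($buf, '<');
        $buf = $i < 0 ? '' : substr($buf, $i);
    }
    return;
}

# An in-memory filehandle stands in for a file, pipe or socket.
my $xml = '<root><item id="1">text</item><item id="2"/></root>';
open(my $fh, '<', \$xml) or die "open: $!";
my @tags;
scan_tags($fh, 8, sub { push @tags, $_[0] });
close $fh;
print "$_\n" for @tags;
```

With a chunk size of 8 the buffer never grows much beyond one tag, whatever the length of the input.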
Some foolish people obfuscate XML parsing and regular expressions in some high
abstractness of language, which totally misses the point.
Using regular expressions to parse XML is no different than simple string comparison
of token punctuation. It is for that reason I made my statement.
There are many examples of push/pull stream-oriented processors.
Some references on stream-oriented processing of XML (SAX or near SAX-compliant):
http://en.wikipedia.org/wiki/Expat_(XML)
http://en.wikipedia.org/wiki/Streaming_Transformations_for_XML
[paragraph moved]
On the other hand, I think you don't know what a stream is:
open(my $fh, '<', 'test.xml');
Now $fh refers to a stream.
No, not really, it refers to a file descriptor.
Please show me how you can apply a regexp to
this stream. Solutions which don't count:
As I said, there is 'no formal definition' of a stream. By all accounts
a 'stream' is an abstract concept akin to a tree watching water flow by,
a near static observer of fluidic motion.
* reading chunks from the stream into a scalar variable and then
applying the regexp to this variable (because then you apply it to a
string (as I wrote), not a stream).
Again, what is a stream? In this use, it's an abstraction consisting of
buffering and processing layers in fluidic motion, in a continuous manner.
A 'string' has nothing to do with anything.
* writing your own regexp engine (since Perl is a general purpose
programming language, you can of course write that but we were
talking about Perl's builtin regexps).
But regex has nothing to do with streams per se; there is only a limited,
fixed API (soon to be expanded) that deals with file descriptors
(or Microsoft's FILE *). So, you can skip this process.
pack and unpack are Perl functions. They can only be applied to strings,
not streams. If you don't mean these functions but something else, be
more specific. And I have no idea what a "regex stream" might be. A
stream composed of regexps? A stream with special support for regexps?
A stream split into records with a regexp?
Remember, 'stream' is an abstract concept, and so is a 'record'.
For the record, stream parsing/processing is grabbing from 1 to a user-defined
number of characters/data, using an API that works on the file descriptor's kernel data,
to match a pattern on which to process. This requires user-space buffering.
The concept of 'stream' processing is the antithesis of processing a complete data set.
Stream-parsing XML can be as simple as reading 1 character at a time, buffering until
a key character is found that may close a statement, processing that possibility, and
then either clearing the buffer or continuing to buffer. It can also
depend on the parsing state of the XML processor. The result is the same:
cars are taken off the track and processed. Most XML 'state' processors will stop at
(or near) the first syntax error (MSXML does this). Regular expressions offer
a distinct advantage in this regard: they can continue processing to report other errors
and advance the stream, but they do not enjoy the speed that, say, Expat does. Stream processing
XML has unique advantages over trees (although trees are now windowed) and enables
multi-level filters.
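
The one-character-at-a-time idea can be sketched roughly as follows. Everything here is an assumption made for illustration: tokenize_chars() is a hypothetical name, the three tag patterns are deliberately simplistic, and this is nowhere near a conforming XML parser. It does show the two points made above: the buffer is cleared each time a closing '>' has been processed, and a malformed tag is reported as an error while processing carries on.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: read one character at a time, buffer until '>'
# (the character that may close a tag), classify the buffer, then clear it.
sub tokenize_chars {
    my ($fh) = @_;
    my @events;
    my $buf = '';
    while (defined(my $c = getc($fh))) {
        if ($c eq '<' && length $buf) {
            # A new tag starts: whatever was buffered is plain text,
            # or an unterminated tag if it began with '<'.
            push @events, [ ($buf =~ /\A</ ? 'error' : 'text') => $buf ];
            $buf = '';
        }
        $buf .= $c;
        # Keep buffering unless this '>' may close a tag.
        next unless $c eq '>' && $buf =~ /\A</;
        if    ($buf =~ m{\A</(\w+)\s*>\z})      { push @events, [ close => $1 ] }
        elsif ($buf =~ m{\A<(\w+)[^<>]*/>\z})   { push @events, [ empty => $1 ] }
        elsif ($buf =~ m{\A<(\w+)[^<>]*>\z})    { push @events, [ open  => $1 ] }
        else                                    { push @events, [ error => $buf ] }
        $buf = '';    # processed: clear the buffer and carry on
    }
    # Anything left at end of input is text or an unterminated tag.
    push @events, [ ($buf =~ /\A</ ? 'error' : 'text') => $buf ] if length $buf;
    return @events;
}

# An in-memory filehandle stands in for the real descriptor.
my $doc = '<a><b>hi</b><?bad><c/></a>';
open(my $in, '<', \$doc) or die "open: $!";
my @events = tokenize_chars($in);
close $in;
print "$_->[0]: $_->[1]\n" for @events;
```

The malformed `<?bad>` shows up as one error event and the tokenizer still reports the tags that follow it.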
Ah, back to my original argument, Unicode! I, ah, think you're missing what Unicode is.
I know quite well what Unicode is - I found characterset issues
fascinating ever since I turned on an Apple ][ in 1984 and it identified
itself as "Apple ÜÄ". I've read Rob Pike's paper in the early 90s and
the full unicode standard (version 2.0) in the late 90s. And I've
discussed character encoding matters (including Unicode) a lot on
various newsgroups and mailing lists over the years and fixed a few
encoding-related problems in various pieces of software.
It was not a beef with Unicode, not at all, but it got me very interested in it.
I didn't want to use pack/unpack templates that had no variability.
I needed to do pattern searches on 32-bit integers, plain and simple.
Had nothing to do with Unicode at all. For instance, if I found a numeric
256 (32-bit integer) in a stream of 32-bit integers, I wanted to grab
the 5th following 32-bit integer in the stream no matter what its value was.
This is the simple explanation, the real one involved complex variability.
So I looked at Unicode, and Perl's utf-8 as the internal default,
as character representations of 32-bit integers, to be used in regular expressions.
I didn't start with 'encodings'. In other words, encoding had nothing to do with
what I wanted to do. I understand there are encodings that translate to the code points,
in the particular Unicode form you want (8/16/32), with endianness and byte order mark.
The octets are the 1-6 bytes (8-bit) result of the encoding.
The code points run from 0 to (2**32 - 1), but they run in ranges (utf-32 has no code points).
Between those ranges you run into Unicode internal control and reserved attributes (BOM, endianness, etc.).
I guess I don't care about encoding if I could internalize (Perl's utf-8) the full range
of 32-bit integers as characters to be used in regular expressions, then extracted back to 32-bit
integers to be used elsewhere.
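
A rough sketch of that round trip, under stated assumptions: the integers arrive as big-endian 32-bit values (the 'N' template), ints_to_chars() is a hypothetical helper, and the sample values are made up. One of them (0x200000) is deliberately above the Unicode maximum of 0x10FFFF, which Perl will still hold as a character internally; the 'utf8' warnings are switched off for that reason.

```perl
use strict;
use warnings;
no warnings 'utf8';   # values above 0x10FFFF are not Unicode; Perl still stores them

# Hypothetical helper: map each 32-bit integer to one Perl character.
sub ints_to_chars { return join '', map { chr($_) } @_ }

# Big-endian 32-bit integers as they might arrive from a file or socket.
my $bytes = pack 'N*', 7, 256, 10, 20, 30, 40, 0x200000, 99;
my $chars = ints_to_chars(unpack 'N*', $bytes);

# Find 256, skip four integers, capture the fifth. The /s modifier lets
# '.' match chr(10), which would otherwise be treated as a newline.
my $fifth;
if ($chars =~ /\x{100}.{4}(.)/s) {
    $fifth = ord $1;   # back from character to integer
}
print "fifth integer after 256: $fifth\n" if defined $fifth;
```

No encoding is chosen anywhere in this sketch: the string never leaves the program, so only the integer-to-character mapping matters.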
......
I thought I had posted some code when I responded to this one ^^^^. Guess I didn't.
I will post a clipped follow-up code sample.
Code is always nice because it is unambiguous (unlike the English
language). However, keep in mind that this is a discussion group, not a
code repository. Any code example longer than 50 lines or so is unlikely
to be read.
I've read that several times (and criticized it here, too).
If you think this is a fight where one of us has to win and the other to
capitulate, I'll stop now.
hp
I hope you understand what my meaning is now, 'capitulate' is just a word.
Thank you!
-sln