Copying data between kernel and user space is indeed a cause of
slowness. In at least the traditional OS models it still happens
more than it should. I don't know how good modern OSes have become at
avoiding that copying, but the nature of many system calls means there
will still be some.
There has been effort put into reducing the number of copies needed, but
in many cases it is simply impossible to avoid copying entirely.
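To make that concrete, here's a minimal user-space sketch (POSIX assumed;
the file name "data.bin" is made up, and the file is assumed to exist and
be non-empty). read() has the kernel copy data out of the page cache into
the caller's buffer, while mmap() maps those cached pages straight into
the process, so nothing is copied unless the process writes to them.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);        /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* read(): the kernel copies from the page cache into this buffer. */
    char *buf = malloc(st.st_size);
    if (read(fd, buf, st.st_size) < 0) { perror("read"); return 1; }

    /* mmap(): the same cached pages are mapped into user space; no copy
       is made unless the process writes to them. */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte via read: %d, via mmap: %d\n", buf[0], map[0]);

    munmap(map, st.st_size);
    free(buf);
    close(fd);
    return 0;
}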
Changing address spaces doesn't prevent there being some parts of
memory mapped in common, so an OS could use a common mapping for the
transfer and thus avoid double copying via a bounce buffer.
Since the data could be (nearly) anywhere in user space, unless all of
user space is available to kernel code, you're probably going to need
bounce buffers. Even if you don't happen to need them in every case,
it's probably faster to use them anyway than to spend the extra cycles
figuring out whether you need them.
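As a rough illustration, here's a toy model of the double copy in plain C.
"User" and "kernel" memory are just separate buffers in one process, and
the bounce buffer stands in for the small region mapped into both address
spaces; the function name and the 256-byte size are invented.

#include <stdio.h>
#include <string.h>

#define BOUNCE_SIZE 256

static char bounce[BOUNCE_SIZE];      /* mapped into both address spaces */

/* Copy len bytes from a user buffer to a kernel buffer in
   BOUNCE_SIZE chunks: user -> bounce, then bounce -> kernel. */
static void copy_from_user_via_bounce(char *kdst, const char *usrc, size_t len)
{
    while (len > 0) {
        size_t chunk = len < BOUNCE_SIZE ? len : BOUNCE_SIZE;
        memcpy(bounce, usrc, chunk);  /* copy 1: with user mappings live */
        memcpy(kdst, bounce, chunk);  /* copy 2: with kernel mappings live */
        usrc += chunk;
        kdst += chunk;
        len  -= chunk;
    }
}

int main(void)
{
    char user_data[] = "payload from the application";
    char kernel_buf[sizeof user_data];

    copy_from_user_via_bounce(kernel_buf, user_data, sizeof user_data);
    printf("kernel side received: %s\n", kernel_buf);
    return 0;
}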
Changing page tables usually requires flushing the old TLB, which is a
cache, but only a cache of page table entries.
I thought if you flushed the page table entries, you had to flush all
the affected data cache lines as well. Some CPUs might simply flush
everything rather than spend the silicon needed to figure out which
lines _don't_ need to be flushed.
Yes, it can hurt performance, and it was an issue on x86_32 as well.
However, to reduce the impact some pages can be marked as Global;
their entries are not flushed from the TLB with the rest. These are
usually pages containing kernel code and data which are meant to appear
in all address spaces.
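For reference, on x86 the Global flag is bit 8 of a page-table entry and
only takes effect when CR4.PGE is enabled. A tiny sketch of composing
such an entry; the physical frame address here is made up.

#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITABLE (1ULL << 1)
#define PTE_GLOBAL   (1ULL << 8)   /* only honoured when CR4.PGE is set */

int main(void)
{
    uint64_t frame = 0x00200000ULL;            /* hypothetical physical frame */
    uint64_t pte = frame | PTE_PRESENT | PTE_WRITABLE | PTE_GLOBAL;

    printf("kernel PTE: %#llx (global bit %s)\n",
           (unsigned long long)pte,
           (pte & PTE_GLOBAL) ? "set" : "clear");
    return 0;
}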
... but the proposal above was to have no memory that appeared in both
address spaces, presumably aside from the necessary bounce buffers.
As an aside, and I'm not sure which CPUs implement this, instead of
marking pages as global some processors allow TLB entries to be
tagged with the address space id. Then no flushing is required when
address spaces are changed. The CPU just ignores any entries which
don't have the current address space id (and which are not global)
and refills itself as needed. It's a better scheme because it doesn't
force flushing of entries unnecessarily.
I'm not aware of any x86(-64) chips that do that, but perhaps some have
recently added that.
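Here's a toy model of such a tagged TLB (ARM calls the tag an ASID; the
closest x86-64 feature is the PCID). A context switch just changes the
current tag, and the lookup ignores entries whose tag doesn't match
unless they are global, so nothing needs to be flushed. Sizes and field
names below are purely illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 8

struct tlb_entry {
    bool     valid;
    bool     global;       /* matches in every address space */
    uint16_t asid;         /* owning address space, ignored if global */
    uint64_t vpn;          /* virtual page number */
    uint64_t pfn;          /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint16_t current_asid;

/* An entry hits only if it is global or belongs to the running address space. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn &&
            (tlb[i].global || tlb[i].asid == current_asid)) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;              /* miss: hardware would walk the page tables */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ true, false, 1, 0x400, 0x1234 }; /* asid 1 */
    tlb[1] = (struct tlb_entry){ true, true,  0, 0x800, 0x5678 }; /* kernel, global */

    uint64_t pfn;
    current_asid = 1;
    printf("asid 1, vpn 0x400: %s\n", tlb_lookup(0x400, &pfn) ? "hit" : "miss");

    current_asid = 2;          /* "context switch": no flush needed */
    printf("asid 2, vpn 0x400: %s\n", tlb_lookup(0x400, &pfn) ? "hit" : "miss");
    printf("asid 2, vpn 0x800: %s\n", tlb_lookup(0x800, &pfn) ? "hit" : "miss");
    return 0;
}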
The x86_32 application address space is strictly 4GB in size.
Addresses may go through segment mapping, where you can have 4GB for
each of the six segment registers, but they all end up as 32 bits
(which then go through paging). So 4GB+4GB would have to multiplex onto
the same 4GB address space (which would then multiplex onto memory). I
don't know what Red Hat did, but they could have reserved a tiny bit of
the applications' 4GB to help the transition and switched memory
mappings whenever going from user to kernel space.
That's exactly what they did; the (very small) data area mapped into
both user and kernel address spaces is the bounce buffer referred to
above. Kernel code could not "see" user space, just a bounce buffer.
The performance hit was enormous.
Such a scheme wouldn't have been used on all systems because it
would have been slower than normal, not just because of the transition
but also because wherever the kernel needed access to user space it
would have to set up a mapping of its own.
Exactly my point.
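To put numbers on the earlier segmentation point: the segment base plus
the offset is truncated to 32 bits before paging ever sees it, so no
amount of segment juggling gets you past one 4GB linear space. A small
sketch with made-up segment bases.

#include <stdint.h>
#include <stdio.h>

/* Logical (base:offset) -> linear address, as 32-bit segmentation does it. */
static uint32_t to_linear(uint32_t seg_base, uint32_t offset)
{
    return (uint32_t)(seg_base + offset);   /* wraps modulo 2^32 */
}

int main(void)
{
    /* Two segments, each allowed to span the full 4GB... */
    uint32_t code_base = 0x00000000u;
    uint32_t data_base = 0x40000000u;       /* hypothetical 1GB base */

    /* ...but any offset still lands somewhere in the same 32-bit space. */
    printf("code offset 0xFFFFFFFF -> linear %#010x\n",
           (unsigned)to_linear(code_base, 0xFFFFFFFFu));
    printf("data offset 0xF0000000 -> linear %#010x\n",
           (unsigned)to_linear(data_base, 0xF0000000u));  /* wraps around */
    return 0;
}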
OSes often reserve some addresses so that they run in the same
address space as the apps. That makes communication much easier and a
little bit faster, but IMHO they often reserve far more than they need
to, having been designed in the days when no app would require 4GB.
The normal split is 2GB+2GB. You can configure Windows to do 3GB+1GB,
but many drivers crash in such a configuration and the OS can easily run
out of kernel space to hold page tables. That's why it's not the default.
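For concreteness, a small sketch of what the split boundary means. The
0x80000000 and 0xC0000000 constants are the usual 2GB+2GB and 3GB+1GB
boundaries; the helper name is invented.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPLIT_2G_2G 0x80000000u   /* default: user 0..2GB, kernel 2GB..4GB */
#define SPLIT_3G_1G 0xC0000000u   /* /3GB: user 0..3GB, kernel 3GB..4GB */

static bool is_kernel_address(uint32_t addr, uint32_t split)
{
    return addr >= split;
}

int main(void)
{
    uint32_t split = SPLIT_3G_1G;
    uint32_t kernel_bytes = 0u - split;   /* 4GB - split, wraps correctly */

    printf("kernel virtual space: %u MB\n", (unsigned)(kernel_bytes >> 20));
    printf("0xB0000000 is %s space\n",
           is_kernel_address(0xB0000000u, split) ? "kernel" : "user");
    return 0;
}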
As an aside, the old x86_16 could genuinely split all of an app's
addressable memory into 64KB of data and 64KB of code because it had a
20-bit address space.
You mean real mode.
The same was also possible in 286 protected mode (16-bit segments in a
24-bit address space), but not 386 protected mode (32-bit segments in a
32-bit address space).
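A quick sketch of the real-mode arithmetic: each 16-bit segment register
selects a 64KB window starting at segment*16 inside the 1MB (20-bit)
physical space, so code and data really can occupy disjoint 64KB
regions. The segment values below are made up.

#include <stdint.h>
#include <stdio.h>

/* physical = segment * 16 + offset, truncated to 20 bits on an 8086 */
static uint32_t real_mode_addr(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}

int main(void)
{
    uint16_t cs = 0x1000;   /* code segment: physical 0x10000..0x1FFFF */
    uint16_t ds = 0x3000;   /* data segment: physical 0x30000..0x3FFFF */

    printf("CS:0x0000 -> %#07x\n", (unsigned)real_mode_addr(cs, 0x0000));
    printf("CS:0xFFFF -> %#07x\n", (unsigned)real_mode_addr(cs, 0xFFFF));
    printf("DS:0x0000 -> %#07x\n", (unsigned)real_mode_addr(ds, 0x0000));
    return 0;
}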