Discussion:
[ORLinux] MMU/TLB/Huge pages braindump
Jonas Bonn
2013-07-30 12:35:01 UTC
Permalink
Hi,

This is a bit of a braindump but hopefully it's reasonably coherent.

A discussion on IRC leads me to believe that we can get "huge pages" on
OpenRISC given what we have today. This is without using the ATB mechanism.

The arch spec allows PTEs to be located either in a level-1 page
directory or in a level-2 page table. A bit "L" in the page directory
entry (level 1) indicates whether the entry points to a page containing
a page table or whether it points to a "huge page". A huge page has a
24-bit offset and is thus 16MB in size.
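As a sanity check of the arithmetic, here is a small sketch (the helper names are mine, not from the spec) of how the two page sizes split a 32-bit virtual address: an 8kB page leaves a 19-bit virtual page number above its 13-bit offset, while a 16MB huge page leaves only the top 8 bits above its 24-bit offset:

```c
#include <stdint.h>

/* 8kB base pages: 13-bit offset, 19-bit virtual page number */
#define PAGE_SHIFT       13
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)   /* 0x00001fff */

/* 16MB huge pages: 24-bit offset, 8-bit index */
#define HUGE_SHIFT       24
#define HUGE_OFFSET_MASK ((1u << HUGE_SHIFT) - 1)   /* 0x00ffffff */

static inline uint32_t vpn(uint32_t vaddr)         { return vaddr >> PAGE_SHIFT; }
static inline uint32_t page_offset(uint32_t vaddr) { return vaddr & PAGE_OFFSET_MASK; }

static inline uint32_t huge_index(uint32_t vaddr)  { return vaddr >> HUGE_SHIFT; }
static inline uint32_t huge_offset(uint32_t vaddr) { return vaddr & HUGE_OFFSET_MASK; }
```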

When a page is "huge", the TLB needs to know about it. That's what the
PL1 bit is for. I'd like to see this bit renamed HUGE in order to
indicate that it just matches the high 8 bits of the page frame when
looking for a translation.

An example user of the "huge page" mechanism would be the Linux kernel
which maps itself into contiguous physical memory from 0 to
end_of_kernel. If we carefully manage the fact that it's not using 16MB
of physical memory, we could use the "huge page" mechanism to prevent a
lot of TLB misses when accessing kernel space code and data.

(Of course, 16MB might actually be too large for reasonable huge
pages... 2MB or 1MB might be better, see end of this mail)

For this to work, the PL1 bit would need to be implemented... the fact
that it's not today is a bug in all our implementations as it's not an
optional feature.

Some changes along these lines that may be needed in the arch spec are:

8.4.1 DMMUCR
PTBP should be bits 31-13, not 31-10... page frames are always 8kB in
size and need to be page aligned

8.4.2 DMMUPR
Drop this register altogether (see 8.8 below). 4 bits in each set gives
16 combinations, but many of these really don't make sense so this
flexibility really isn't needed.

8.4.3 IMMUCR
PTBP should be bits 31-13, not 31-10... page frames are always 8kB in
size and pretty much need to be page aligned

8.4.4 IMMUPR
Drop this register altogether (see 8.8 below)

Note that this register is overdimensioned... it has 7 sets with 2 bits
each.

8.4.6
Change name of PL1 to HUGE with description:
0: normal page, 8kB
1: huge page, 16MB (or 2MB, see below)

Change LRU from "last recently used" to "least recently used" (cosmetic)

8.4.9 - 8.4.11
Drop ATBs altogether. We can get 16MB pages without them and the 32GB
pages aren't realistic anyway.


8.8 PTE

Change PPN size to 19 bits (bits 31-13).

PPI: Why only 7 sets of protection bits? Why not 8? Because value 0
is overloaded to mean the entry is invalid, but this prevents the field
from being used as a sane bitmask. Change the PPI field to 3 individual
bits indicating Writable, User access, and Executability and drop the
Protection Registers altogether.

As per Stefan's earlier mail, make PTE something like this:

| 31 ... 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| PPN |OS-specific|Present| L | X | W | U | D | A |WOM|WBC|CI |CC |

...and we need a VALID bit in there somewhere.
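A sketch of accessors for the proposed layout (bit positions follow the table above; the macro names themselves are illustrative, not spec-mandated — OS-specific takes bits 12-11, Present bit 10, and so on down to CC at bit 0):

```c
#include <stdint.h>

/* Bit positions taken from the proposed PTE layout above. */
#define PTE_PPN_SHIFT 13
#define PTE_PPN_MASK  (0x7ffffu << PTE_PPN_SHIFT)  /* bits 31-13, 19 bits */
#define PTE_PRESENT   (1u << 10)
#define PTE_L         (1u << 9)   /* last level / huge page, not a table pointer */
#define PTE_X         (1u << 8)   /* executable */
#define PTE_W         (1u << 7)   /* writable */
#define PTE_U         (1u << 6)   /* user-accessible */
#define PTE_D         (1u << 5)   /* dirty */
#define PTE_A         (1u << 4)   /* accessed */

static inline uint32_t pte_ppn(uint32_t pte)  { return (pte & PTE_PPN_MASK) >> PTE_PPN_SHIFT; }
static inline int      pte_present(uint32_t pte) { return !!(pte & PTE_PRESENT); }
```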

----------------------

So how do we get 2MB huge pages... here's my suggestion.

Top-level page directory
---------------------------
0x0000... | 8-bit index entry, L=0 |
---------------------------
0x0020... | Empty |
to ~ ... ~
0x00e0... | Empty |
---------------------------
0x0100... | Next 8-bit index entry, L=1 |
---------------------------
0x0120... | 2 MB page entry (L=1) |
to ~ ... ~
0x01e0... | 2 MB page entry (L=1) |
---------------------------
0x0200... | Next 8-bit index entry, L=0 |
| ~~~ |
~ ~

The top-level page directory is an 8kB page, and its 8-bit indexing
makes it sparsely populated. If we find that the L bit (huge page) is
set on an 8-bit indexed entry, then we could do a second indexing on the
remaining three bits (11 bit index total) to find the entry to the 2MB
huge page in the "free space".
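Under the assumptions above (an 8kB directory of 2048 4-byte entries, with a 2MB page covered by the top 11 address bits), the two index computations could look like this (hypothetical helpers, not from the spec):

```c
#include <stdint.h>

/* First lookup: the top 8 address bits select a 16MB region; its entry
 * sits at directory index (top8 << 3), leaving 7 "free" slots between
 * consecutive 8-bit indexed entries. */
static inline uint32_t pgd_index_8(uint32_t vaddr)  { return (vaddr >> 24) << 3; }

/* Second lookup, taken when the 8-bit indexed entry has L=1: the top 11
 * bits select a 2MB region, landing in the free space.  Note that when
 * address bits 23-21 are zero the two indices coincide, so the 8-bit
 * indexed slot must itself hold a valid 2MB page entry. */
static inline uint32_t pgd_index_11(uint32_t vaddr) { return vaddr >> 21; }
```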

This could get us 2MB huge pages and we could then keep the ATB stuff
around for the less useful 16MB huge pages.

This all plays reasonably nicely with the arch spec we've got today.
What would need clarifying is that these huge pages are 2MB and not
16MB, but this is all so vague in the spec as it stands and otherwise
unimplemented in practice that it ought to be doable.

Looking forward to comments!

/Jonas
Stefan Kristiansson
2013-07-31 07:48:58 UTC
Permalink
Post by Jonas Bonn
The arch spec allows PTE's to be located either in a level-1 page
directory or in a level-2 page table. A bit "L" in the page
directory entry (level 1) indicates whether the entry points to a
page containing a page table or whether it points to a "huge page".
A huge page has a 24-bit offset and is thus 16MB in size.
When a page is "huge", the TLB needs to know about it. That's what
the PL1 bit is for. I'd like to see this bit renamed HUGE in order
to indicate that it's just matching the high 8-bits of the page
frame when looking for a translation.
An example user of the "huge page" mechanism would be the Linux
kernel which maps itself into contiguous physical memory from 0 to
end_of_kernel. If we carefully manage the fact that it's not using
16MB of physical memory, we could use the "huge page" mechanism to
prevent a lot of TLB misses when accessing kernel space code and
data.
(Of course, 16MB might actually be too large for reasonable huge
pages... 2MB or 1MB might be better, see end of this mail)
For this to work, the PL1 bit would need to be implemented... the
fact that it's not today is a bug in all our implementations as it's
not an optional feature.
Yes, this seems to be a valid explanation of what the arch spec describes
(although the arch specs way of saying this is a lot more unclear).
Post by Jonas Bonn
8.4.1 DMMUCR
PTBP should be bits 31-13, not 31-10... page frames are always 8kB
in size and need to be page aligned
8.4.2 DMMUPR
Drop this register altogether (see 8.8 below). 4 bits in each set
gives 16 combinations, but many of these really don't make sense so
this flexibility really isn't needed.
8.4.3 IMMUCR
PTBP should be bits 31-13, not 31-10... page frames are always 8kB
in size and pretty much need to be page aligned
8.4.4 IMMUPR
Drop this register altogether (see 8.8 below)
Note that this register is overdimensioned... it has 7 sets with 2
bits each.
8.4.6
0: normal page, 8kB
1: huge page, 16MB (or 2MB, see below)
Change LRU from "last recently used" to "least recently used" (cosmetic)
8.4.9 - 8.4.11
Drop ATB's altogether. We can get 16MB pages without them and the
32GB pages aren't realistic anyway.
8.8 PTE
Change PPN size to 19 bits (bits 31-13).
PPI: Why only 7 sets of protection bits? Why not 8? Because value
0 is overloaded to mean the entry is invalid, but this prevents the
field from being used as a sane bitmask. Change the PPI field to 3
individual bits indicating Writable, User access, and Executability
and drop the Protection Registers altogether.
| 31 ... 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| PPN |OS-specific|Present| L | X | W | U | D | A |WOM|WBC|CI |CC |
...and we need a VALID bit in there somewhere.
I agree on close to all of the above, except I think we should keep the xMMUPR
registers, but change the indexing from 0-7 (so 8 sets of protection bits).
Not so much because I think that the flexibility is needed, but because
using a "lookup table" to translate the X/W/U bits into
SRE/SXE/SWE/URE/UXE/UWE actually makes sense.

e.g. for the DTLB case:

X | W | U
---------
0 | 0 | 0 = SRE0
0 | 0 | 1 = SRE1 | URE1
0 | 1 | 0 = SRE2 | SWE2
0 | 1 | 1 = SRE3 | SWE3 | URE3 | UWE3
1 | 0 | 0 = SRE4
1 | 0 | 1 = SRE5 | URE5
1 | 1 | 0 = SRE6 | SWE6
1 | 1 | 1 = SRE7 | SWE7 | URE7 | UWE7

Software would need a pair of shift and mask operations to pick entries
out of the "lookup table" and hardware can easily do bitfield table lookups
directly from the register.
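For illustration, here is a sketch of how the DTLB table above could be packed into one 32-bit DMMUPR and read back with a single shift and mask (the bit assignments within a 4-bit set are my assumption, not from the spec):

```c
#include <stdint.h>

/* Per-set enable bits (assumed ordering within each 4-bit set). */
#define SRE (1u << 0)
#define SWE (1u << 1)
#define URE (1u << 2)
#define UWE (1u << 3)

/* DMMUPR value encoding the table in the mail: set i holds the enable
 * bits for X/W/U == i (index = X<<2 | W<<1 | U). */
static const uint32_t dmmupr =
      (SRE)                   << (0 * 4)   /* X=0 W=0 U=0 */
    | (SRE | URE)             << (1 * 4)   /* X=0 W=0 U=1 */
    | (SRE | SWE)             << (2 * 4)   /* X=0 W=1 U=0 */
    | (SRE | SWE | URE | UWE) << (3 * 4)   /* X=0 W=1 U=1 */
    | (SRE)                   << (4 * 4)   /* X=1 W=0 U=0 */
    | (SRE | URE)             << (5 * 4)   /* X=1 W=0 U=1 */
    | (SRE | SWE)             << (6 * 4)   /* X=1 W=1 U=0 */
    | (SRE | SWE | URE | UWE) << (7 * 4);  /* X=1 W=1 U=1 */

/* One shift and one mask pick a protection set out of the register. */
static inline uint32_t prot_lookup(uint32_t xwu /* 0..7 */)
{
    return (dmmupr >> (xwu * 4)) & 0xfu;
}
```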

While we're at it, the bit order in DMMUPR should probably be changed
to match the one in DTLBTR, as it is now they don't match.
i.e. DMMUPR = UWE|URE|SWE|SRE and DTLBTR = SWE|SRE|UWE|URE

Regarding the PTE, isn't PRESENT == VALID?
Post by Jonas Bonn
----------------------
So how do we get 2MB huge pages... here's my suggestion.
Top-level page directory
---------------------------
0x0000... | 8-bit index entry, L=0 |
---------------------------
0x0020... | Empty |
to ~ ... ~
0x00e0... | Empty |
---------------------------
0x0100... | Next 8-bit index entry, L=1 |
---------------------------
0x0120... | 2 MB page entry (L=1) |
to ~ ... ~
0x01e0... | 2 MB page entry (L=1) |
---------------------------
0x0200... | Next 8-bit index entry, L=0 |
| ~~~ |
~ ~
The top-level page directory is an 8kB page, and its 8-bit indexing
makes it sparsely populated. If we find that the L bit (huge page)
is set on an 8-bit indexed entry, then we could do a second indexing
on the remaining three bits (11 bit index total) to find the entry
to the 2MB huge page in the "free space".
This could get us 2MB huge pages and we could then keep the ATB
stuff around for the less useful 16MB huge pages.
This all plays reasonably nicely with the arch spec we've got today.
What would need clarifying is that these huge pages are 2MB and not
16MB, but this is all so vague in the spec as it stands and
otherwise unimplemented in practice that it ought to be doable.
This is of course bending the meaning of the L bit, perhaps that should
be renamed then?
Because now you always have a two-level structure, but with the difference
that you are pointing back into the page directory in the second level.
Would it *have* to point into the page directory though?
Or could we use the entry fetched via the 8-bit index to get the
table pointer (and this could happen to point back into the pgd on
Linux as a memory saving optimization)?

This all is of course "breaking" (fixing) the arch spec a bit, but as you
said, there are no (known*) implementations using this and there will never be
any implementations using it if it isn't useful.

* In the unlikely event that there would be any unknown implementations
actually using this stuff, this conversation is kept public,
so they are free to join in and raise their voices ;)

Stefan
Stefan Kristiansson
2013-07-31 13:11:57 UTC
Permalink
On 31 July 2013 09:48, Stefan Kristiansson
Post by Stefan Kristiansson
I agree on close to all of the above, except I think we should keep the xMMUPR
registers, but change the indexing from 0-7 (so 8 sets of protection bits).
Not so much because I think that the flexibility is needed, but because
using a "lookup table" to translate the X/W/U bits into
SRE/SXE/SWE/URE/UXE/UWE actually makes sense.
X | W | U
---------
0 | 0 | 0 = SRE0
0 | 0 | 1 = SRE1 | URE1
0 | 1 | 0 = SRE2 | SWE2
0 | 1 | 1 = SRE3 | SWE3 | URE3 | UWE3
1 | 0 | 0 = SRE4
1 | 0 | 1 = SRE5 | URE5
1 | 1 | 0 = SRE6 | SWE6
1 | 1 | 1 = SRE7 | SWE7 | URE7 | UWE7
Software would need a pair of shift and mask operations to pick entries
out of the "lookup table" and hardware can easily do bitfield table lookups
directly from the register.
So we need two 32-bit registers to implement a look-up table for 3
bits with a well-defined meaning (X/W/U)?
Strictly speaking, one 32-bit and one 16-bit register.
I'm not going to put up a big fight about keeping the xMMUPR
registers, but they _are_ in the arch spec as of now and even though
X/W/U are well defined and sufficient for the Linux case,
is that true for all other cases?

It's of course still possible to do the lookup table approach without
the actual registers by using a static map to X/W/U,
both in software and hardware.
In the software approach you most likely would do it without the
registers anyway.
Post by Stefan Kristiansson
While we're at it, the bit order in DMMUPR should probably be changed
to match the one in DTLBTR, as it is now they don't match.
i.e. DMMUPR = UWE|URE|SWE|SRE and DTLBTR = SWE|SRE|UWE|URE
If we keep the PR around, then I agree.
Post by Stefan Kristiansson
Regarding the PTE, isn't PRESENT == VALID?
Yes, it would seem to be.
Post by Stefan Kristiansson
Post by Jonas Bonn
----------------------
So how do we get 2MB huge pages... here's my suggestion.
Top-level page directory
---------------------------
0x0000... | 8-bit index entry, L=0 |
---------------------------
0x0020... | Empty |
to ~ ... ~
0x00e0... | Empty |
---------------------------
0x0100... | Next 8-bit index entry, L=1 |
---------------------------
0x0120... | 2 MB page entry (L=1) |
to ~ ... ~
0x01e0... | 2 MB page entry (L=1) |
---------------------------
0x0200... | Next 8-bit index entry, L=0 |
| ~~~ |
~ ~
The top-level page directory is an 8kB page, and its 8-bit indexing
makes it sparsely populated. If we find that the L bit (huge page)
is set on an 8-bit indexed entry, then we could do a second indexing
on the remaining three bits (11 bit index total) to find the entry
to the 2MB huge page in the "free space".
This could get us 2MB huge pages and we could then keep the ATB
stuff around for the less useful 16MB huge pages.
This all plays reasonably nicely with the arch spec we've got today.
What would need clarifying is that these huge pages are 2MB and not
16MB, but this is all so vague in the spec as it stands and
otherwise unimplemented in practice that it ought to be doable.
This is of course bending the meaning of the L bit, perhaps that should
be renamed then?
Because now you always have a two-level structure, but with the difference
that you are pointing back into the page directory in the second level.
Would it *have* to point into the page directory though?
Or could we use the entry fetched via the 8-bit index to get the
table pointer (and this could happen to point back into the pgd on
Linux as a memory saving optimization)?
No, the 8-bit indexed PPN can't point to another page-table in the L=1
case because the case where the next 3 bits are 0 (11-bit index ==
8-bit index) needs to point to a valid page.
My suggestion may be too complex, though; getting 1MB pages into the
second-level table may make more sense.
/Jonas
I kept the full quote of your message in my response, since I believe
you by mistake pressed reply instead of reply all.

Stefan
Jonas Bonn
2013-08-01 07:37:36 UTC
Permalink
Post by Stefan Kristiansson
On 31 July 2013 09:48, Stefan Kristiansson
Post by Stefan Kristiansson
I agree on close to all of the above, except I think we should keep the xMMUPR
registers, but change the indexing from 0-7 (so 8 sets of protection bits).
Not so much because I think that the flexibility is needed, but because
using a "lookup table" to translate the X/W/U bits into
SRE/SXE/SWE/URE/UXE/UWE actually makes sense.
X | W | U
---------
0 | 0 | 0 = SRE0
0 | 0 | 1 = SRE1 | URE1
0 | 1 | 0 = SRE2 | SWE2
0 | 1 | 1 = SRE3 | SWE3 | URE3 | UWE3
1 | 0 | 0 = SRE4
1 | 0 | 1 = SRE5 | URE5
1 | 1 | 0 = SRE6 | SWE6
1 | 1 | 1 = SRE7 | SWE7 | URE7 | UWE7
Software would need a pair of shift and mask operations to pick entries
out of the "lookup table" and hardware can easily do bitfield table lookups
directly from the register.
So we need two 32-bit registers to implement a look-up table for 3
bits with a well-defined meaning (X/W/U)?
Strictly speaking, one 32-bit and one 16-bit register.
Heh... 48 bits then. Still, 48 > 3.
Post by Stefan Kristiansson
I'm not going to put up a big fight about keeping the xMMUPR
registers, but they _are_ in the arch spec as of now and even though
X/W/U are well defined and sufficient for the Linux case,
is that true for all other cases?
I say "yes, it's sufficient". I hereby put the challenge to the list to
come up with a _reasonable_ combination that's not possible with the
above 3 bits, along with an explanation of the use case that requires it.

I agree, they are in the arch spec but they are also (presumably)
implemented nowhere. Dropping them is a net win in terms of
architectural complexity. We may need to bump the version register and
refuse to run on "old versions w/ hardware TLB reload", but such
implementations don't exist anyway so it's a simple check that will
never trigger. (TLB HW reload can be checked in one of the CFGR regs, I
noticed).
Post by Stefan Kristiansson
It's of course still possible to do the lookup table approach without
the actual registers by using a static map to X/W/U,
both in software and hardware.
That must be a simpler HW implementation... right? For SW it's
definitely better.
Post by Stefan Kristiansson
In the software approach you most likely would do it without the
registers anyway.
Post by Stefan Kristiansson
While we're at it, the bit order in DMMUPR should probably be changed
to match the one in DTLBTR, as it is now they don't match.
i.e. DMMUPR = UWE|URE|SWE|SRE and DTLBTR = SWE|SRE|UWE|URE
If we keep the PR around, then I agree.
Post by Stefan Kristiansson
Regarding the PTE, isn't PRESENT == VALID?
Yes, it would seem to be.
Post by Stefan Kristiansson
Post by Jonas Bonn
----------------------
So how do we get 2MB huge pages... here's my suggestion.
Top-level page directory
---------------------------
0x0000... | 8-bit index entry, L=0 |
---------------------------
0x0020... | Empty |
to ~ ... ~
0x00e0... | Empty |
---------------------------
0x0100... | Next 8-bit index entry, L=1 |
---------------------------
0x0120... | 2 MB page entry (L=1) |
to ~ ... ~
0x01e0... | 2 MB page entry (L=1) |
---------------------------
0x0200... | Next 8-bit index entry, L=0 |
| ~~~ |
~ ~
The top-level page directory is an 8kB page, and its 8-bit indexing
makes it sparsely populated. If we find that the L bit (huge page)
is set on an 8-bit indexed entry, then we could do a second indexing
on the remaining three bits (11 bit index total) to find the entry
to the 2MB huge page in the "free space".
This could get us 2MB huge pages and we could then keep the ATB
stuff around for the less useful 16MB huge pages.
This all plays reasonably nicely with the arch spec we've got today.
What would need clarifying is that these huge pages are 2MB and not
16MB, but this is all so vague in the spec as it stands and
otherwise unimplemented in practice that it ought to be doable.
This is of course bending the meaning of the L bit, perhaps that should
be renamed then?
Because now you always have a two-level structure, but with the difference
that you are pointing back into the page directory in the second level.
Would it *have* to point into the page directory though?
Or could we use the entry fetched via the 8-bit index to get the
table pointer (and this could happen to point back into the pgd on
Linux as a memory saving optimization)?
No, the 8-bit indexed PPN can't point to another page-table in the L=1
case because the case where the next 3 bits are 0 (11-bit index ==
8-bit index) needs to point to a valid page.
My suggestion may be too complex, though; getting 1MB pages into the
second-level table may make more sense.
/Jonas
I kept the full quote of your message in my response, since I believe
you by mistake pressed reply instead of reply all.
Good catch! Thank you!

/Jonas
Stefan Kristiansson
2013-08-01 09:23:59 UTC
Permalink
Post by Jonas Bonn
Post by Stefan Kristiansson
I'm not going to put up a big fight about keeping the xMMUPR
registers, but they _are_ in the arch spec as of now and even though
X/W/U are well defined and sufficient for the Linux case,
is that true for all other cases?
I say "yes, it's sufficient". I hereby put the challenge to the list to
come up with a _reasonable_ combination that's not possible with the above 3
bits, along with an explanation of the use case that requires it.
A reasonable combination would be superuser read & write, but
user read only.
As for use cases... /me shrugs... 3-bits seems to be enough for x86, so
why wouldn't it for us?
Post by Jonas Bonn
Post by Stefan Kristiansson
It's of course still possible to do the lookup table approach without
the actual registers by using a static map to X/W/U,
both in software and hardware.
That must be a simpler HW implementation... right? For SW it's definitely
better.
Probably slightly simpler (in terms of written code and generated logic), yes.

But we digress into this not-so-important discussion about keeping or not
keeping the xMMUPRs, when there are more interesting things to decide on.
I personally will not miss them much, so let's say we wipe them,
unless someone else has an objection when all the more gory things
have been handled.

Stefan
Stefan Kristiansson
2013-09-25 05:13:20 UTC
Permalink
Post by Jonas Bonn
For this to work, the PL1 bit would need to be implemented... the fact
that it's not today is a bug in all our implementations as it's not an
optional feature.
Just to show that everything discussed in this thread hasn't been
forgotten: the PL1 bit is nowadays properly implemented in mor1kx,
and the TLB reload follows what was agreed upon in this thread.
I also have a set of kernel patches that implements the changes to the PTE
discussed here.
I have been using them since this discussion, so they have had a fair
amount of testing.
I'll try to find a spare moment to clean them up and post them as well in
the near future.

Stefan