Discussion:
[ORLinux] Hardware assisted tlb reload in mor1kx
Stefan Kristiansson
2013-07-28 04:02:16 UTC
Good news everybody!

We've got some hardware tlb reload going on in the hottest OpenRISC 1000
implementation there is, so no more wasting instructions on tlb miss
exceptions when running Linux.

As a rough estimate (by looking at simulation waveforms and comparing
the time spent in the tlb miss exception handler against the time spent
doing a hw reload), the hardware tlb reload should be about 7.5 times faster
than the software reload.
The hardware reload isn't completely optimized, so you could still shave off
a couple of cycles there.
Perhaps the same is true for the tlb miss handler in Linux, so the rough
estimate is probably a good enough indicator of what kind of speedup we
can expect from this.

Another rough estimate of how much time is spent in the tlb miss vectors
was done by running 'gcc hello_world.c -o hello_world' in the jor1k
emulator (http://s-macke.github.io/jor1k/); using the stats from that run,
we saw that (momentarily) up to roughly 25% of the time was spent in the
dtlb miss exception handler.
This could of course also be improved by increasing the number of sets and
ways used in the mmus, but that's another topic that might be addressed in
the future.

As always, you can find it in the github repos at:
https://github.com/openrisc/mor1kx

But before we bring out the champagne and start celebrating, here are some
notes about the implementation that need some discussion.

First, it doesn't exactly follow the arch specification's definition of the
pagetable entries (pte); instead, it uses the pte layout that our Linux port
defines.

Let me illustrate the differences.
or1k arch spec pte layout:
| 31 ... 10 | 9 | 8 ... 6  | 5 | 4 | 3 | 2 | 1 | 0 |
|    PPN    | L | PP INDEX | D | A |WOM|WBC|CI |CC |

Linux pte layout:
| 31 ... 12 |  11  | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |   0   |
|    PPN    |SHARED|EXEC|SWE|SRE|UWE|URE| D | A |WOM|WBC|CI |PRESENT|
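
(For reference, the Linux layout above expressed as C masks; the names and
values here are just illustrative, the real definitions live in our pgtable.h
and may be spelled differently.)

#define PTE_PRESENT 0x00000001  /* bit 0: present */
#define PTE_CI      0x00000002  /* cache inhibit */
#define PTE_WBC     0x00000004  /* write-back cache */
#define PTE_WOM     0x00000008  /* weakly-ordered memory */
#define PTE_A       0x00000010  /* accessed */
#define PTE_D       0x00000020  /* dirty */
#define PTE_URE     0x00000040  /* user read enable */
#define PTE_UWE     0x00000080  /* user write enable */
#define PTE_SRE     0x00000100  /* supervisor read enable */
#define PTE_SWE     0x00000200  /* supervisor write enable */
#define PTE_EXEC    0x00000400  /* executable */
#define PTE_SHARED  0x00000800  /* shared */
#define PTE_PPN     0xfffff000  /* physical page number, bits 31..12 */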

The biggest difference is that the arch spec defines a separate register
(xMMUPR) which holds a table of protection bits, and the PP INDEX field
of the pte is used to pick out the "right" protection flags from that.
Our Linux port, on the other hand, doesn't follow this and instead embeds
the protection bits straight into the pte (which of course is perfectly
fine, since it was designed for software tlb reload).
So, the question is: should we change Linux to be compliant with the
arch spec's definition of the ptes and start using a PP INDEX field, or
change the arch spec to allow usage of the Linux definition?

Second, there are naturally a couple of changes needed in Linux for this to
work.
The changes are minor, but they need commenting before proper patches are
sent out.
The full diff is available at the end of this mail, but I'll first comment on
the changes to each file.

arch/openrisc/include/asm/spr_defs.h:
The defines for the bitfields of xMMUCR are wrong in all of our copies of
spr_defs.h. I tried to dig into where those defines come from, but the arch
spec and spr_defs.h have differed since the beginning of time (or at least
as far back as the commit histories go, some time in the year 2000).

arch/openrisc/kernel/head.S:
The mor1kx implementation works so that if the xMMUCR register is 0,
it generates tlb miss exceptions. We therefore have to make sure that it
is zero when the MMUs are enabled, so that the boot tlb miss handlers are
used until paging is set up.

arch/openrisc/mm/init.c:
arch/openrisc/mm/tlb.c:
The pagetable base pointer is written to the xMMUCR registers right after
paging is initially set up and on each switch_mm.

arch/openrisc/mm/fault.c:
do_page_fault is called a bit differently when it is called from the pagefault
exception vectors than when it is called from the tlb miss exception vectors.
I've put in a hack there to make that difference disappear, but this has
to be addressed properly, and as I see it there are two ways:

1) Do the necessary checks in do_page_fault to see if it should handle a
protection fault or a missing page fault (a rough sketch follows below).
2) Make mor1kx generate a tlb miss exception instead of a pagefault when the
pte table pointer is zero or the PRESENT bit is not set.
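
To make option 1 a bit more concrete, here is a rough sketch of the kind of
check I have in mind (illustrative only, not a tested patch; it assumes the
usual folded pud/pmd page table helpers and the helper name is made up):

static int page_is_mapped(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd = pgd_offset(mm, address);
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return 0;
	pud = pud_offset(pgd, address);
	if (pud_none(*pud) || pud_bad(*pud))
		return 0;
	pmd = pmd_offset(pud, address);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return 0;
	pte = pte_offset_kernel(pmd, address);
	return pte_present(*pte);
}

If the page turns out not to be mapped we are looking at a missing page fault
(and the vmalloc_fault path stays valid); if it is mapped, the access violated
the protection bits and we should signal a protection fault instead.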

Some thoughts and comments on those issues, please!

Stefan

--- >8 ---
diff --git a/arch/openrisc/include/asm/spr_defs.h b/arch/openrisc/include/asm/spr_defs.h
index 5dbc668..1d20915 100644
--- a/arch/openrisc/include/asm/spr_defs.h
+++ b/arch/openrisc/include/asm/spr_defs.h
@@ -226,19 +226,15 @@
* Bit definitions for the Data MMU Control Register
*
*/
-#define SPR_DMMUCR_P2S 0x0000003e /* Level 2 Page Size */
-#define SPR_DMMUCR_P1S 0x000007c0 /* Level 1 Page Size */
-#define SPR_DMMUCR_VADDR_WIDTH 0x0000f800 /* Virtual ADDR Width */
-#define SPR_DMMUCR_PADDR_WIDTH 0x000f0000 /* Physical ADDR Width */
+#define SPR_DMMUCR_PTBP 0xfffffc00 /* Page Table Base Pointer */
+#define SPR_DMMUCR_DTF 0x00000001 /* DTLB Flush */

/*
* Bit definitions for the Instruction MMU Control Register
*
*/
-#define SPR_IMMUCR_P2S 0x0000003e /* Level 2 Page Size */
-#define SPR_IMMUCR_P1S 0x000007c0 /* Level 1 Page Size */
-#define SPR_IMMUCR_VADDR_WIDTH 0x0000f800 /* Virtual ADDR Width */
-#define SPR_IMMUCR_PADDR_WIDTH 0x000f0000 /* Physical ADDR Width */
+#define SPR_IMMUCR_PTBP 0xfffffc00 /* Page Table Base Pointer */
+#define SPR_IMMUCR_ITF 0x00000001 /* ITLB Flush */

/*
* Bit definitions for the Data TLB Match Register
diff --git a/arch/openrisc/kernel/head.S b/arch/openrisc/kernel/head.S
index 1d3c9c2..59a3263 100644
--- a/arch/openrisc/kernel/head.S
+++ b/arch/openrisc/kernel/head.S
@@ -541,6 +541,15 @@ flush_tlb:

enable_mmu:
/*
+ * Make sure the page table base pointer is cleared
+ * ( = hardware tlb fill disabled)
+ */
+ l.movhi r30,0
+ l.mtspr r0,r30,SPR_DMMUCR
+ l.movhi r30,0
+ l.mtspr r0,r30,SPR_IMMUCR
+
+ /*
* enable dmmu & immu
* SR[5] = 0, SR[6] = 0, 6th and 7th bit of SR set to 0
*/
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index e2bfafc..4c07a20 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -78,7 +78,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
*/

if (address >= VMALLOC_START &&
- (vector != 0x300 && vector != 0x400) &&
+ /*(vector != 0x300 && vector != 0x400) &&*/
!user_mode(regs))
goto vmalloc_fault;

diff --git a/arch/openrisc/mm/init.c b/arch/openrisc/mm/init.c
index e7fdc50..d8b8068 100644
--- a/arch/openrisc/mm/init.c
+++ b/arch/openrisc/mm/init.c
@@ -191,6 +191,14 @@ void __init paging_init(void)
mtspr(SPR_ICBIR, 0x900);
mtspr(SPR_ICBIR, 0xa00);

+ /*
+ * Update the pagetable base pointer, to enable hardware tlb refill if
+ * supported by the hardware
+ */
+ mtspr(SPR_IMMUCR, __pa(current_pgd) & SPR_IMMUCR_PTBP);
+ mtspr(SPR_DMMUCR, __pa(current_pgd) & SPR_DMMUCR_PTBP);
+
+
/* New TLB miss handlers and kernel page tables are in now place.
* Make sure that page flags get updated for all pages in TLB by
* flushing the TLB and forcing all TLB entries to be recreated
diff --git a/arch/openrisc/mm/tlb.c b/arch/openrisc/mm/tlb.c
index 683bd4d..96e6df3 100644
--- a/arch/openrisc/mm/tlb.c
+++ b/arch/openrisc/mm/tlb.c
@@ -151,6 +151,14 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
*/
current_pgd = next->pgd;

+ /*
+ * Update the pagetable base pointer with the new pgd.
+ * This only has an effect on implementations with hardware tlb refill
+ * support.
+ */
+ mtspr(SPR_IMMUCR, __pa(current_pgd) & SPR_IMMUCR_PTBP);
+ mtspr(SPR_DMMUCR, __pa(current_pgd) & SPR_DMMUCR_PTBP);
+
/* We don't have context support implemented, so flush all
* entries belonging to previous map
*/
--- >8 ---
Sebastian Macke
2013-07-28 17:16:16 UTC
Great, this will definitely speed up things.

I would suggest enabling this hardware tlb refill with bit 17 in the
supervision register SR, and not with a zero or nonzero DMMUCR or IMMUCR
register. That would be more consistent with how the specification controls
such features.
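
Something like this, roughly (SPR_SR_HTR and the bit value are made up here,
just to illustrate the idea, nothing like it exists in spr_defs.h today):

#define SPR_SR_HTR 0x00020000 /* hypothetical: hardware tlb refill enable, SR[17] */

	/* program the page table base pointers as in the patch ... */
	mtspr(SPR_DMMUCR, __pa(current_pgd) & SPR_DMMUCR_PTBP);
	mtspr(SPR_IMMUCR, __pa(current_pgd) & SPR_IMMUCR_PTBP);
	/* ... and then explicitly switch on hardware refill */
	mtspr(SPR_SR, mfspr(SPR_SR) | SPR_SR_HTR);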
Post by Stefan Kristiansson
[original mail and diff quoted in full - snipped]
Stefan Kristiansson
2013-07-29 13:13:55 UTC
Post by Sebastian Macke
I would suggest to enable this hardware tlb refill by bit 17 in the
supervision register SR and not by a zero or nonzero DMMUCR or IMMUCR
register. Then it would be more consistent with the specification to
control such a feature.
I have no strong feelings for or against having a flag in SR, a flag in
xMMUCR, or treating xMMUCR[31:10] == 0 as the tlb miss fallback.
Perhaps the last is the least intrusive to the arch spec, but I can't see
any of them breaking anything.
What do others think about this?

Stefan

Henrik Nordström
2013-07-28 20:31:47 UTC
Post by Stefan Kristiansson
The biggest difference is that the arch spec defines a seperate register
(xMMUPR) which holds a table of protection bits, and the PP INDEX field
of the pte is used to pick out the "right" protection flags from that.
In our Linux port on the other hand, it has been chosen to not follow
this and embed the protection bits straight into the pte (which of
course is perfectly fine as it was designed for software tlb reload).
So, the question is, should we change Linux to be compliant with the
arch specs definition of the ptes and start using a PP index field or
change the arch spec to allow usages of the Linux definition?
Are there any pros/cons to the different approaches?

What approach do other architectures use?

Regards
Henrik
Stefan Kristiansson
2013-07-29 13:04:47 UTC
Post by Henrik Nordström
Post by Stefan Kristiansson
The biggest difference is that the arch spec defines a seperate register
(xMMUPR) which holds a table of protection bits, and the PP INDEX field
of the pte is used to pick out the "right" protection flags from that.
In our Linux port on the other hand, it has been chosen to not follow
this and embed the protection bits straight into the pte (which of
course is perfectly fine as it was designed for software tlb reload).
So, the question is, should we change Linux to be compliant with the
arch specs definition of the ptes and start using a PP index field or
change the arch spec to allow usages of the Linux definition?
Is there any pros/cons for the different approaches?
You tell me ;)
Perhaps the whole original idea behind the PP INDEX field was
to save bits in the PTE and make it flexible, but IMO it just
makes things more complicated.
Post by Henrik Nordström
What approach do other architectures use?
The closest thing I could find to the PP INDEX field in any other
architecture is the ACC bit field in sparc v8 [1], but if I understand things
correctly, that is mapped to a static table.

And if I read the ARM code right, I think they have separate definitions
for the PTEs on the Linux and hardware side, residing in different memory
areas. I don't think we should do it like that though.

Stefan

[1] http://www.sparc.org/standards/V8.pdf, page 248
Jonas Bonn
2013-07-29 08:19:37 UTC
Hi Stefan,

On 28 July 2013 06:02, Stefan Kristiansson
Post by Stefan Kristiansson
Good news everybody!
We've got some hardware tlb reload going on in the hottest OpenRISC 1000
implementation there is, no more wasting instructions on tlb miss exceptions
when running Linux.
Grand!
Post by Stefan Kristiansson
As a rough estimate (by looking at simulation waveforms and comparing
the time spent in the tlb miss exception handler and the time spent when
doing a hw reload), the hardware tlb reload should be about 7.5 times faster
than the software reload.
The hardware reload isn't completely optimized, so you could still shave off
a couple of cycles there.
Perhaps that is true for the tlb miss handler in Linux too, so the rough
estimate is probably a good enough indicator at what kind of speedup we
can estimate from this.
Walking the page table is pretty much the same operation whether it be
in software or hardware... the savings are the context switch
associated with the exception handler.
Post by Stefan Kristiansson
Another rough estimate of how much time is spent in the tlb miss vectors
was done by running 'gcc hello_world.c -o hello_world' in the jor1k
emulator (http://s-macke.github.io/jor1k/) and by using the stats from that
we saw that (momentarily) roughly up to 25% of the time was spent in the
dtlb miss exception handler.
This could of course also be improved by increasing the number of sets and ways
used in the mmus, but that's another topic that might be addressed in the
future.
Process start-up is always going to be dominated by a flood of
TLB-misses in order to populate the initially clean TLB. Similarly
after a context switch where the TLB has been flushed. There are ways
to mitigate this, with trade-offs: address space indexes to allow the
TLB to be "shared" between contexts (and thus not flushed);
speculatively preloading the TLB but at the cost of possibly flushing
out other entries that might be needed shortly; et cetera.
Post by Stefan Kristiansson
https://github.com/openrisc/mor1kx
But before we bring out the champagne and start celebrating, some notes about
the implementation that needs some discussion.
First, it doesn't exactly follow the arch specifications definition of the
pagetable entries (pte), instead it uses the pte layout that our Linux port
defines.
Let me illustrate the differences.
| 31 ... 10 | 9 | 8 ... 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| PPN | L |PP INDEX | D | A |WOM|WBC|CI |CC |
We have 8K pages... why so many bits for the PPN? Bits 31, 30, and 29
are never used?
Post by Stefan Kristiansson
| 31 ... 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| PPN |SHARED|EXEC|SWE|SRE|UWE|URE| D | A |WOM|WBC|CI |PRESENT|
The biggest difference is that the arch spec defines a seperate register
(xMMUPR) which holds a table of protection bits, and the PP INDEX field
of the pte is used to pick out the "right" protection flags from that.
In our Linux port on the other hand, it has been chosen to not follow
this and embed the protection bits straight into the pte (which of
course is perfectly fine as it was designed for software tlb reload).
So, the question is, should we change Linux to be compliant with the
arch specs definition of the ptes and start using a PP index field or
change the arch spec to allow usages of the Linux definition?
What are the protection combinations that are actually used:

SR|SW|SX /* really? */
SR|SW
SR|SX
SR

/* We may want to drop the SX's here */
SR|SW|SX|UR|UW|UX
SR|SW|SX|UR|UX
SR|SW|SX|UR|UW
SR|SW|SX|UR

Is that all? If yes, then SR is always set, and UR is always set for
user pages.

So we have:

USERPAGE?
WRITABLE?
EXECUTABLE?

...that's 3 bits, which maps nicely into the 3 bits available for PP
INDEX. So the hardware and software implementations aren't
contradictory there.
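
In code, the translation could be as simple as this (just a sketch, the bit
positions within the index are arbitrary here and the PTE_* masks simply
follow the Linux pte layout quoted above):

/* URE is bit 6, UWE bit 7, SWE bit 9, EXEC bit 10 in the Linux pte layout */
#define PTE_URE   0x040
#define PTE_UWE   0x080
#define PTE_SWE   0x200
#define PTE_EXEC  0x400

static inline unsigned int pte_to_ppi(unsigned long pte)
{
	unsigned int ppi = 0;

	if (pte & PTE_URE)			/* USERPAGE? */
		ppi |= 0x1;
	if (pte & (PTE_UWE | PTE_SWE))		/* WRITABLE? */
		ppi |= 0x2;
	if (pte & PTE_EXEC)			/* EXECUTABLE? */
		ppi |= 0x4;

	return ppi;	/* index into the eight protection groups */
}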

L ("link") isn't intresenting for the software implementation, so
reusing that for SHARED is fine there, but the HW implementation wants
L... where does SHARED go in that case and, furthermore, what is it
even used for? Somebody please check what that SHARED flag is doing.

Finally, we play games with the CC bit since we don't have an SMP
implementation of OpenRISC. It's used to indicate that a page is
swapped out. And the WBC bit is aliased to distinguish page cache via
the PAGE_FILE flag. For the HW implementation we can't do this, so
where do we put these? Bits 31 and 30 and have the HW mapper mask
them out?
Post by Stefan Kristiansson
Second, naturally there are a couple of changes needed to Linux for this to
work.
The changes are minor but needs commenting before proper patches are sent out.
The full diff is available in the end of this mail, but I'll first comment the
changes to each file.
The defines for the bitfields of xMMUCR are wrong in all of our spr_defs.h,
I tried to dig into where those defines come from, but both the arch spec
and spr_defs.h have been different since the beginning of time (or as long
back as the commit histories date back, some time in year 2000).
I think that file has even more errors than that. Didn't somebody fix
this file up in or1ksim but not sync the kernel version?
Post by Stefan Kristiansson
The implementation in mor1kx works so, that if the xMMUCR register is 0,
it will generate tlb miss exceptions, so we have to make sure that it
is zero when the MMUs are enabled, so the boot tlb miss handlers are used
until paging is set up.
I think that's a sound solution... requires a minor documentation
change to the arch spec. We might be able to do even better though
and set up the PTEs early so that the boot handlers aren't needed at
all.
Post by Stefan Kristiansson
The correct value of the pagetable base pointer is updated to the xMMUCR
registers right after paging is initially set up and on each switch_mm.
do_pagefault is called a bit differently when it is called from the pagefault
exception vectors and when it is called from the tlb miss exception vectors.
I've put in a hack there to make that difference disappear, but this has
to be addressed properly and as I see it there are two ways.
1) Do the necessary checks in do_pagefault to see if it should handle a
protection fault, or a missing page fault.
I think this is the right approach.
Post by Stefan Kristiansson
2) Make mor1kx generate a tlb miss exception instead of a pagefault when the
pte table pointer is zero or the PRESENT bit is not set.
[... full diff snipped down to the fault.c hunk ...]
if (address >= VMALLOC_START &&
- (vector != 0x300 && vector != 0x400) &&
+ /*(vector != 0x300 && vector != 0x400) &&*/
!user_mode(regs))
goto vmalloc_fault;
This won't work as things stand today...
/Jonas
Stefan Kristiansson
2013-07-29 11:44:12 UTC
Post by Jonas Bonn
Post by Stefan Kristiansson
As a rough estimate (by looking at simulation waveforms and comparing
the time spent in the tlb miss exception handler and the time spent when
doing a hw reload), the hardware tlb reload should be about 7.5 times faster
than the software reload.
The hardware reload isn't completely optimized, so you could still shave off
a couple of cycles there.
Perhaps that is true for the tlb miss handler in Linux too, so the rough
estimate is probably a good enough indicator at what kind of speedup we
can estimate from this.
Walking the page table is pretty much the same operation whether it be
in software or hardware... the savings are the context switch
associated with the exception handler.
+ the overhead of doing shift and mask operations.
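
Roughly, the software handler has to do something like this on every miss
(a simplified C rendering; the real code is assembly in head.S and also has
to pick the right set/way and translate the pte flags properly):

	unsigned long ea = mfspr(SPR_EEAR_BASE);	/* faulting address */
	unsigned long *pgd = (unsigned long *)current_pgd;
	unsigned long *pte_page, pte;

	/* level 1: shift + mask to index the pgd */
	pte_page = (unsigned long *)(pgd[ea >> PGDIR_SHIFT] & PAGE_MASK);
	/* level 2: shift + mask to index the pte page */
	pte = pte_page[(ea >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)];

	/* write the match and translate registers (way 0 shown here) */
	mtspr(SPR_DTLBMR_BASE(0), (ea & PAGE_MASK) | SPR_DTLBMR_V);
	mtspr(SPR_DTLBTR_BASE(0), pte);

The hardware walker does the same lookups without fetching and executing any
of these instructions, on top of skipping the exception entry/exit.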
Post by Jonas Bonn
Post by Stefan Kristiansson
First, it doesn't exactly follow the arch specifications definition of the
pagetable entries (pte), instead it uses the pte layout that our Linux port
defines.
Let me illustrate the differences.
| 31 ... 10 | 9 | 8 ... 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| PPN | L |PP INDEX | D | A |WOM|WBC|CI |CC |
We have 8K pages... why so many bits for the PPN? Bits 31, 30, and 29
are never used?
My guess is that it's another oversight in the arch spec. The page size was probably
not decided when they defined the PTE.
Bits 12, 11, and 10 are not used.
Post by Jonas Bonn
Post by Stefan Kristiansson
| 31 ... 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| PPN |SHARED|EXEC|SWE|SRE|UWE|URE| D | A |WOM|WBC|CI |PRESENT|
So, I was sloppy here; the PPN is actually bits 31-13 and bit 12 is not used.
Post by Jonas Bonn
Post by Stefan Kristiansson
The biggest difference is that the arch spec defines a seperate register
(xMMUPR) which holds a table of protection bits, and the PP INDEX field
of the pte is used to pick out the "right" protection flags from that.
In our Linux port on the other hand, it has been chosen to not follow
this and embed the protection bits straight into the pte (which of
course is perfectly fine as it was designed for software tlb reload).
So, the question is, should we change Linux to be compliant with the
arch specs definition of the ptes and start using a PP index field or
change the arch spec to allow usages of the Linux definition?
SR|SW|SX /* really? */
SR|SW
SR|SX
SR
/* We may want to drop the SX's here */
SR|SW|SX|UR|UW|UX
SR|SW|SX|UR|UX
SR|SW|SX|UR|UW
SR|SW|SX|UR
Is that all? If yes, then SR is always set, and UR is always set for
user pages.
I'm not sure, but it doesn't sound too far fetched.
Post by Jonas Bonn
USERPAGE?
WRITABLE?
EXECUTABLE?
...that's 3 bits, which maps nicely into the 3 bits available for PP
INDEX. So the hardware and software implementations aren't
contradictory there.
Yes, that sounds good.
And we could set up a static mapping in the PPI registers for that, and
leave it up to implementations to either actually implement the PPI
registers or hardwire the same static mapping.
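
Something like this is the kind of static mapping I have in mind (just a
sketch with made-up PROT_* names; how it gets encoded into the xMMUPR fields,
or hardwired in implementations that skip xMMUPR, is the open question):

/* allowed accesses per 3-bit index (bit0 = user, bit1 = writable, bit2 = executable) */
#define PROT_SR 0x01
#define PROT_SW 0x02
#define PROT_SX 0x04
#define PROT_UR 0x08
#define PROT_UW 0x10
#define PROT_UX 0x20

static const unsigned char ppi_prot[8] = {
	[0] = PROT_SR,					/* kernel ro  */
	[1] = PROT_SR | PROT_UR,			/* user ro    */
	[2] = PROT_SR | PROT_SW,			/* kernel rw  */
	[3] = PROT_SR | PROT_SW | PROT_UR | PROT_UW,	/* user rw    */
	[4] = PROT_SR | PROT_SX,			/* kernel rx  */
	[5] = PROT_SR | PROT_SX | PROT_UR | PROT_UX,	/* user rx    */
	[6] = PROT_SR | PROT_SW | PROT_SX,		/* kernel rwx */
	[7] = PROT_SR | PROT_SW | PROT_SX |
	      PROT_UR | PROT_UW | PROT_UX,		/* user rwx   */
};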
Post by Jonas Bonn
L ("link") isn't intresenting for the software implementation, so
reusing that for SHARED is fine there, but the HW implementation wants
L... where does SHARED go in that case and, furthermore, what is it
even used for? Somebody please check what that SHARED flag is doing.
We can do without the L, if we just assume a two-level page table,
but it sounds better to actually do it properly and set L ("last" not "link")
on the actual PTE.

I'm not sure about SHARED; looking around, many architectures seem to set
it to a combination of PRESENT, USER, RW and ACCESSED.
I'll continue investigating this.
Post by Jonas Bonn
Finally, we play games with the CC bit since we don't have an SMP
Implementation of OpenRISC. It's used to indicate that a page is
swapped out. And the WBC bit is aliased to distinguish page cache via
the PAGE_FILE flag. For the HW implementation we can't do this, so
where do we put these? Bits 31 and 30 and have the HW mapper mask
them out?
So, under the assumption all of the above turns out ok (and the SHARED
flag has to have its own bit), our Linux PTE could look something like this.

| 31 ... 13 | 12 |  11  |   10  | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|    PPN    |FILE|SHARED|PRESENT| L | X | W | U | D | A |WOM|WBC|CI |CC |
Post by Jonas Bonn
Post by Stefan Kristiansson
The defines for the bitfields of xMMUCR are wrong in all of our spr_defs.h,
I tried to dig into where those defines come from, but both the arch spec
and spr_defs.h have been different since the beginning of time (or as long
back as the commit histories date back, some time in year 2000).
I think that file has even more errors that that. Didn't somebody fix
this file up in or1ksim but not sync the kernel version?
I think they are pretty much in sync, but I'm not sure.
There might be more errors in it, but at least this was in all versions I
looked at.
Post by Jonas Bonn
Post by Stefan Kristiansson
The correct value of the pagetable base pointer is updated to the xMMUCR
registers right after paging is initially set up and on each switch_mm.
do_pagefault is called a bit differently when it is called from the pagefault
exception vectors and when it is called from the tlb miss exception vectors.
I've put in a hack there to make that difference disappear, but this has
to be addressed properly and as I see it there are two ways.
1) Do the necessary checks in do_pagefault to see if it should handle a
protection fault, or a missing page fault.
I think this is the right approach.
I think so too, so let's do that. I'll take a look at what needs to be done.
Post by Jonas Bonn
Post by Stefan Kristiansson
if (address >= VMALLOC_START &&
- (vector != 0x300 && vector != 0x400) &&
+ /*(vector != 0x300 && vector != 0x400) &&*/
!user_mode(regs))
goto vmalloc_fault;
This won't work as things stand today...
Certainly not, but it served as a good example of the problem I wanted to show.

Stefan