Stefan Kristiansson
2013-07-28 04:02:16 UTC
Good news everybody!
We've got hardware tlb reload going in the hottest OpenRISC 1000
implementation there is: no more wasting instructions on tlb miss exceptions
when running Linux.
As a rough estimate (by looking at simulation waveforms and comparing
the time spent in the tlb miss exception handler and the time spent when
doing a hw reload), the hardware tlb reload should be about 7.5 times faster
than the software reload.
The hardware reload isn't completely optimized, so you could still shave off
a couple of cycles there.
Perhaps that is true for the tlb miss handler in Linux too, so the rough
estimate is probably a good enough indicator of what kind of speedup we
can expect from this.
Another rough estimate of how much time is spent in the tlb miss vectors
was done by running 'gcc hello_world.c -o hello_world' in the jor1k
emulator (http://s-macke.github.io/jor1k/). Using the stats from that,
we saw that (momentarily) up to roughly 25% of the time was spent in the
dtlb miss exception handler.
This could of course also be improved by increasing the number of sets and ways
used in the mmus, but that's another topic that might be addressed in the
future.
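Putting the two rough numbers above together with Amdahl's law gives a feel
for the overall gain. This is only a back-of-the-envelope sketch: it assumes
the momentary 25% peak and the 7.5x figure both hold for the whole run, which
they most likely don't, and overall_speedup is a made-up helper name.

```c
/*
 * Overall speedup when a fraction f of the run time is made s times
 * faster (Amdahl's law). The inputs used in the text are the rough
 * estimates from this mail, not measurements of a full workload.
 */
double overall_speedup(double f, double s)
{
	return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.25 and s = 7.5 this comes out at roughly 1.28, so at best
somewhere under 30% faster overall during such a tlb-miss-heavy phase.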
As always, you can find it in the github repos at:
https://github.com/openrisc/mor1kx
But before we bring out the champagne and start celebrating, some notes about
the implementation that need some discussion.
First, it doesn't exactly follow the arch specification's definition of the
pagetable entries (pte); instead it uses the pte layout that our Linux port
defines.
Let me illustrate the differences.
or1k arch spec pte layout:
| 31 ... 10 | 9 | 8 ... 6  | 5 | 4 |  3  |  2  | 1  | 0  |
|    PPN    | L | PP INDEX | D | A | WOM | WBC | CI | CC |
Linux pte layout:
| 31 ... 12 |   11   |  10  |  9  |  8  |  7  |  6  | 5 | 4 |  3  |  2  | 1  |    0    |
|    PPN    | SHARED | EXEC | SWE | SRE | UWE | URE | D | A | WOM | WBC | CI | PRESENT |
The biggest difference is that the arch spec defines a separate register
(xMMUPR) which holds a table of protection bits, and the PP INDEX field
of the pte is used to pick out the "right" protection flags from that.
In our Linux port, on the other hand, we chose not to follow this and
instead embed the protection bits straight into the pte (which of
course is perfectly fine, as it was designed for software tlb reload).
So, the question is: should we change Linux to be compliant with the
arch spec's definition of the ptes and start using a PP INDEX field, or
change the arch spec to allow use of the Linux definition?
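To make the difference concrete, here is a rough sketch of how the protection
flags would be obtained in each scheme. The bit positions are taken from the
two layouts above; the EX_/ex_ names are made up for this example, and the
eight-entry prot_table is an assumption that follows only from PP INDEX being
three bits wide (the real packing of xMMUPR is not reproduced here).

```c
#include <stdint.h>

/* Linux pte: protection bits sit directly in the pte (bits 9..6). */
#define EX_PTE_URE        (1u << 6)
#define EX_PTE_UWE        (1u << 7)
#define EX_PTE_SRE        (1u << 8)
#define EX_PTE_SWE        (1u << 9)
#define EX_PTE_PROT_MASK  (EX_PTE_URE | EX_PTE_UWE | EX_PTE_SRE | EX_PTE_SWE)
#define EX_PTE_PROT_SHIFT 6

/* Arch spec pte: bits 8..6 hold PP INDEX, an index into a protection table. */
#define EX_SPEC_PPI_SHIFT 6
#define EX_SPEC_PPI_MASK  0x7u

/* Linux style: returns SWE:SRE:UWE:URE as a 4-bit value (URE in bit 0). */
uint32_t ex_linux_prot(uint32_t pte)
{
	return (pte & EX_PTE_PROT_MASK) >> EX_PTE_PROT_SHIFT;
}

/* Arch spec style: one extra lookup through the (hypothetical) table. */
uint32_t ex_spec_prot(uint32_t pte, const uint32_t prot_table[8])
{
	return prot_table[(pte >> EX_SPEC_PPI_SHIFT) & EX_SPEC_PPI_MASK];
}
```

The Linux variant needs nothing but the pte itself, while the arch spec variant
depends on the table being kept in sync with whatever the ptes refer to.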
Second, naturally there are a couple of changes needed to Linux for this to
work.
The changes are minor but need commenting before proper patches are sent out.
The full diff is available at the end of this mail, but I'll first comment on
the changes to each file.
arch/openrisc/include/asm/spr_defs.h:
The defines for the bitfields of xMMUCR are wrong in all of our spr_defs.h
files. I tried to dig into where those defines come from, but the arch spec
and spr_defs.h have disagreed since the beginning of time (or at least as far
back as the commit histories go, some time in 2000).
arch/openrisc/kernel/head.S:
The implementation in mor1kx works such that if the xMMUCR register is 0,
it will generate tlb miss exceptions. We therefore have to make sure that it
is zero when the MMUs are enabled, so that the boot tlb miss handlers are used
until paging is set up.
arch/openrisc/mm/init.c:
arch/openrisc/mm/tlb.c:
The correct value of the pagetable base pointer is written to the xMMUCR
registers right after paging is initially set up and on each switch_mm.
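As a small illustration of what ends up in the register, the sketch below
applies the PTBP mask from the diff to a physical pgd address. The ex_ names
are invented for this example, and treating "PTBP field is zero" as "hardware
refill off" is my reading of the head.S comment in the diff, not something
checked against the RTL.

```c
#include <stdint.h>

/* Same value as SPR_DMMUCR_PTBP / SPR_IMMUCR_PTBP in the diff. */
#define EX_MMUCR_PTBP 0xfffffc00u

/* Value written to xMMUCR: the low 10 bits of the address are dropped. */
uint32_t ex_mmucr_value(uint32_t pgd_phys)
{
	return pgd_phys & EX_MMUCR_PTBP;
}

/* Assumption: a zero base pointer means tlb misses still raise exceptions. */
int ex_hw_refill_enabled(uint32_t mmucr)
{
	return (mmucr & EX_MMUCR_PTBP) != 0;
}
```

Since the mask clears the low 10 bits, the pgd has to be at least 1 KiB
aligned for the masked value to still point at it.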
arch/openrisc/mm/fault.c:
do_page_fault is called a bit differently when it is called from the pagefault
exception vectors and when it is called from the tlb miss exception vectors.
I've put in a hack there to make that difference disappear, but this has
to be addressed properly, and as I see it there are two ways:
1) Do the necessary checks in do_page_fault to see if it should handle a
protection fault or a missing page fault.
2) Make mor1kx generate a tlb miss exception instead of a pagefault when the
pte table pointer is zero or the PRESENT bit is not set.
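For option 1, the check could look roughly like the sketch below. It is not
taken from the kernel or from mor1kx: the ex_ names are invented, and it simply
reuses the two conditions from option 2 (table pointer zero, PRESENT bit clear)
to tell a missing page apart from a protection fault.

```c
#include <stdint.h>

#define EX_PTE_PRESENT (1u << 0) /* bit 0 in the Linux pte layout */

enum ex_fault_kind {
	EX_FAULT_MISSING_PAGE,
	EX_FAULT_PROTECTION,
};

/*
 * pte_table_ptr: the first-level entry that points to the pte table
 * pte:           the pte itself (ignored when pte_table_ptr is zero)
 */
enum ex_fault_kind ex_classify_fault(uint32_t pte_table_ptr, uint32_t pte)
{
	if (pte_table_ptr == 0 || !(pte & EX_PTE_PRESENT))
		return EX_FAULT_MISSING_PAGE;
	return EX_FAULT_PROTECTION;
}
```

Doing this in software means walking the page table once more in the fault
path, which is the cost option 2 would avoid by moving the decision into
hardware.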
Some thoughts and comments on those issues, please!
Stefan
--- >8 ---
diff --git a/arch/openrisc/include/asm/spr_defs.h b/arch/openrisc/include/asm/spr_defs.h
index 5dbc668..1d20915 100644
--- a/arch/openrisc/include/asm/spr_defs.h
+++ b/arch/openrisc/include/asm/spr_defs.h
@@ -226,19 +226,15 @@
* Bit definitions for the Data MMU Control Register
*
*/
-#define SPR_DMMUCR_P2S 0x0000003e /* Level 2 Page Size */
-#define SPR_DMMUCR_P1S 0x000007c0 /* Level 1 Page Size */
-#define SPR_DMMUCR_VADDR_WIDTH 0x0000f800 /* Virtual ADDR Width */
-#define SPR_DMMUCR_PADDR_WIDTH 0x000f0000 /* Physical ADDR Width */
+#define SPR_DMMUCR_PTBP 0xfffffc00 /* Page Table Base Pointer */
+#define SPR_DMMUCR_DTF 0x00000001 /* DTLB Flush */
/*
* Bit definitions for the Instruction MMU Control Register
*
*/
-#define SPR_IMMUCR_P2S 0x0000003e /* Level 2 Page Size */
-#define SPR_IMMUCR_P1S 0x000007c0 /* Level 1 Page Size */
-#define SPR_IMMUCR_VADDR_WIDTH 0x0000f800 /* Virtual ADDR Width */
-#define SPR_IMMUCR_PADDR_WIDTH 0x000f0000 /* Physical ADDR Width */
+#define SPR_IMMUCR_PTBP 0xfffffc00 /* Page Table Base Pointer */
+#define SPR_IMMUCR_ITF 0x00000001 /* ITLB Flush */
/*
* Bit definitions for the Data TLB Match Register
diff --git a/arch/openrisc/kernel/head.S b/arch/openrisc/kernel/head.S
index 1d3c9c2..59a3263 100644
--- a/arch/openrisc/kernel/head.S
+++ b/arch/openrisc/kernel/head.S
@@ -541,6 +541,15 @@ flush_tlb:
enable_mmu:
/*
+ * Make sure the page table base pointer is cleared
+ * ( = hardware tlb fill disabled)
+ */
+ l.movhi r30,0
+ l.mtspr r0,r30,SPR_DMMUCR
+ l.movhi r30,0
+ l.mtspr r0,r30,SPR_IMMUCR
+
+ /*
* enable dmmu & immu
* SR[5] = 0, SR[6] = 0, 6th and 7th bit of SR set to 0
*/
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index e2bfafc..4c07a20 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -78,7 +78,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
*/
if (address >= VMALLOC_START &&
- (vector != 0x300 && vector != 0x400) &&
+ /*(vector != 0x300 && vector != 0x400) &&*/
!user_mode(regs))
goto vmalloc_fault;
diff --git a/arch/openrisc/mm/init.c b/arch/openrisc/mm/init.c
index e7fdc50..d8b8068 100644
--- a/arch/openrisc/mm/init.c
+++ b/arch/openrisc/mm/init.c
@@ -191,6 +191,14 @@ void __init paging_init(void)
mtspr(SPR_ICBIR, 0x900);
mtspr(SPR_ICBIR, 0xa00);
+ /*
+ * Update the pagetable base pointer, to enable hardware tlb refill if
+ * supported by the hardware
+ */
+ mtspr(SPR_IMMUCR, __pa(current_pgd) & SPR_IMMUCR_PTBP);
+ mtspr(SPR_DMMUCR, __pa(current_pgd) & SPR_DMMUCR_PTBP);
+
+
/* New TLB miss handlers and kernel page tables are in now place.
* Make sure that page flags get updated for all pages in TLB by
* flushing the TLB and forcing all TLB entries to be recreated
diff --git a/arch/openrisc/mm/tlb.c b/arch/openrisc/mm/tlb.c
index 683bd4d..96e6df3 100644
--- a/arch/openrisc/mm/tlb.c
+++ b/arch/openrisc/mm/tlb.c
@@ -151,6 +151,14 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
*/
current_pgd = next->pgd;
+ /*
+ * Update the pagetable base pointer with the new pgd.
+ * This only has an effect on implementations with hardware tlb refill
+ * support.
+ */
+ mtspr(SPR_IMMUCR, __pa(current_pgd) & SPR_IMMUCR_PTBP);
+ mtspr(SPR_DMMUCR, __pa(current_pgd) & SPR_DMMUCR_PTBP);
+
/* We don't have context support implemented, so flush all
* entries belonging to previous map
*/
--- >8 ---