Version 0.1-draft-20230321
This document is a sort of whitepaper on virtual memory for large memory systems and a sort of rough draft specification/proposal for discussion.
In the author’s opinion, the existing RISC‑V virtual to physical translation mechanisms are unlikely to perform well on systems with very large memories. Much of this is due to paging, and so this document begins with the problems of RISC‑V virtual to physical translation and paging, especially translation cache (TLB) miss rates and penalties. These issues motivate the proposed solution.
This document also proposes a simplified version of the same mechanism for second-level translation, which the hypervisor specifies to translate guest operating system physical addresses to system physical addresses. This portion is preliminary. The author expects this to be particularly helpful, as most guest operating systems are given rather small guest physical address spaces, and this proposal allows a small single-level guest page table to be sufficient.
The solution to the translation performance problem involves providing a bit of structure to the address space that can also be useful in smaller systems. In particular, the structured address space can be used to provide more efficient Garbage Collection and sandboxing.
RISC‑V currently supports virtual address spaces up to 2^57 bytes with physical addresses up to 2^56 bytes. Eventually it will be necessary to support larger virtual and physical address spaces, for example for High Performance Computing (HPC). Currently 64‑bit RISC‑V has Sv39, Sv48, and Sv57 translation models for its supervisors using 3, 4, and 5‑level page tables with 512 PTEs per level for address spaces of −2^38..2^38−1, of −2^47..2^47−1, and of −2^56..2^56−1 respectively. An obvious extension to Sv64 using a 6‑level page table for an address space of −2^63..2^63−1 is likely someday. This is illustrated in the figures below.
63 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||
extend | VPN0 | VPN1 | VPN2 | byte | ||||||||||
25 | 9 | 9 | 9 | 12 |
63 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | ||||||
extend | VPN0 | VPN1 | VPN2 | VPN3 | byte | ||||||||||||
16 | 9 | 9 | 9 | 9 | 12 |
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
extend | VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | byte | ||||||||||||||
7 | 9 | 9 | 9 | 9 | 9 | 12 |
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | VPN5 | byte | ||||||||||||||
7 | 9 | 9 | 9 | 9 | 9 | 12 |
Neither Sv57 nor a hypothetical Sv64 as illustrated above is the best choice for large address space applications (e.g. HPC). A major issue with these virtual address spaces is page size, along with the organization of page tables and translation cache miss rates and miss penalties. This document therefore begins with a long discussion of paging issues. It then goes on to outline a proposal for an alternative 64‑bit address space translation mechanism for RISC‑V, called Ssv64, that improves efficiency for large address spaces and provides additional features that will benefit RISC‑V in the future.
A very brief summary of the pros and cons of the Ssv64 proposal may help set the stage for the reader.
63 | 48 | 47 | 0 | ||
region | interpretation depends on region descriptor | ||||
16 | 48 |
63 | 48 | 47 | 0 | ||||||||
region | fill | tableindex0 | offset | ||||||||
16 | 48−RS | PTS | RS−PTS |
A critical processor design decision is the choice of a page size or page sizes. If minimizing memory overhead is the criterion, it is well known that the optimal page size for an area of virtual memory is proportional to the square root of that memory size. In the 1960s, 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size to minimize the memory wasted by allocating in page units plus the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in the 2020s than in the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today with addresses twice as wide. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty, is then a factor in the page size consideration that did not exist in the 1960s.
In addition, the regions of memory in today’s computer systems vary wildly in size, with many processes having fairly small code regions, a small stack region, and a heap that may be small, large, or huge, sometimes depending on input parameters. Even in processors that support multiple page sizes, the size is often set for the entire system. When page size is variable at runtime, there may be only one value for the entire process virtual address space, which makes that value sub-optimal for code, stack, or heap, depending on which is chosen for optimization. Further, memory overhead is not the only criterion of importance. Larger page sizes minimize translation cache misses and therefore improve performance at the cost of memory wastage. Larger page sizes may also reduce the translation cache miss penalty when multi-level page tables are used (as is common today), by potentially reducing the number of levels to be read on a miss.
A major advantage of dividing the address space into regions is that it becomes possible to choose different paging structures on a per-region basis. Each shared library and the main program are individual mapped files containing code, and each could have a page size and levels appropriate to its size. The stack and heap regions can likewise have different page sizes from the code mapped files and each other. Choosing a page size based on the square root of the region size not only minimizes memory wastage, it can keep the page table a single level (just the root), which minimizes the translation cache miss penalty.
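As an illustration of the square-root rule, the following sketch is not part of the proposal; the candidate page sizes (4 KiB, 16 KiB, 256 KiB) and the 8 B PTE size are the ones discussed in this document, and the function name is my own.

```c
#include <stdint.h>

/* Candidate page sizes discussed in this document: 4 KiB, 16 KiB, 256 KiB. */
static const uint64_t page_sizes[] = { 1ull << 12, 1ull << 14, 1ull << 18 };

/* Return the smallest candidate page size P whose single-level table of
 * 8-byte PTEs (P/8 entries) covers the whole region, i.e.
 * region_size <= (P/8) * P.  This is the power-of-two analogue of
 * "page size proportional to sqrt(region size)". */
static uint64_t pick_page_size(uint64_t region_size)
{
    for (unsigned i = 0; i < sizeof page_sizes / sizeof page_sizes[0]; i++)
        if (region_size <= (page_sizes[i] / 8) * page_sizes[i])
            return page_sizes[i];
    return page_sizes[2];   /* larger regions need a multi-level table */
}
```

This picks 4 KiB pages for regions up to 2 MiB, 16 KiB pages up to 32 MiB, and 256 KiB pages up to 8 GiB, matching the table later in this section.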
There is a cost to implementing multiple page sizes in the operating system. A simple operating system may support only a single page size. This proposal supports such an operating system, but provides functionality for more sophisticated operating systems. In such systems, typically free lists are maintained for each page size, and when a smaller page free list is empty, a large page is split up. The reverse process, of coalescing pages, is more involved, as it may be necessary to migrate one or more small pages to put back together what was split apart. This however has been implemented in operating systems and made to work well.
There is also a cost to implementing multiple page sizes in translation caches (typically called TLBs though that is a terrible name). The most efficient hardware for translation caches would prefer a single page size, or failing that, a fairly small number of page sizes. Page size flexibility can affect critical processor timing paths (particularly in L1 translation caches). Despite this, the trend has been toward supporting a small number of page sizes. The RISC‑V vector architecture helps to address this issue, as vector loads and stores are not as latency sensitive as scalar loads and stores, and therefore can go directly to an L2 translation cache, which is both larger and, as a result of being larger, slower, and is therefore better able to absorb the cost of multiple page size matching. Much of the need for larger sizes occurs in applications with huge memory needs, and these applications are often able to exploit the vector architecture.
It may help to consider what page size options historical architectures have provided. According to Wikipedia, other 64‑bit architectures have supported the following page sizes:
Architecture | 4 KiB | 8 KiB | 16 KiB | 64 KiB | 2 MiB | 1 GiB | Other |
---|---|---|---|---|---|---|---|
MIPS | ✔ | | ✔ | ✔ | | | 256 KiB, 1 MiB, 4 MiB, 16 MiB |
x86-64 | ✔ | | | | ✔ | ✔ | |
ARM | ✔ | | ✔ | ✔ | ✔ | ✔ | 32 MiB, 512 MiB |
RISC‑V | ✔ | | | | ✔ | ✔ | 512 GiB, 256 TiB |
Power | ✔ | | | ✔ | | | 16 MiB, 16 GiB |
UltraSPARC | | ✔ | | ✔ | | | 512 KiB, 4 MiB, 32 MiB, 256 MiB, 2 GiB, 16 GiB |
IA-64 | ✔ | ✔ | ✔ | | | | 256 KiB, 1 MiB, 4 MiB, 16 MiB, 256 MiB |
Ssv64 | ✔ | | ✔ | ? | | | 256 KiB, 16 MiB? |
The only very common page size is 4 KiB, with 64 KiB, 2 MiB, and 1 GiB being somewhat common second page sizes. I suspect that 4 KiB has been carried forward from the 1960s for compatibility reasons, as there probably exists some application software with page size assumptions. It would be interesting to know how often UltraSPARC encountered porting problems with its 8 KiB minimum page size. Today 8 KiB or 16 KiB pages make more technical sense for a minimum page size, but application assumptions may suggest keeping the old 4 KiB minimum, and introducing at least one larger page size to reduce translation cache miss rates. Processors targeted at HPC will likely need at least a third page size (more on HPC page size below).
RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads (they are all derived from the 4 KiB page being used at each level of the table walk). These early outs do reduce translation cache miss penalties, but they do complicate TLB matching, as mentioned earlier. To RISC‑V’s credit, it introduced a new PTE format (under the Svnapot extension) that communicates to processors that can take advantage of it that groups of PTEs are consistent and can be implemented with a larger unit in the translation cache. Ssv64 incorporates this as a required feature (which saves a bit).
Even a huge memory system (e.g. HPC) will have many small regions (e.g. files mapped for libraries and the main program, stack and heap for medium-sized processes such as editors, command line interpreters, etc.), and a smaller page size, such as 8 KiB or 16 KiB, may be appropriate for these regions. However, 4 KiB is probably not so sub-optimal as to warrant incompatibility by not supporting this size. Therefore the question is what is the most appropriate other page size, or page sizes, besides 4 KiB (which supports up to 2 MiB with one level, and up to 1 GiB with two levels). If only one other page size were possible for all implementations, 256 KiB might be a good choice, since this supports region sizes up to 2^33 bytes with one level, and region sizes of 2^34 to 2^48 bytes with two levels. But not all implementations need to support physical memory appropriate to a ≥2^48‑byte working set. It is more appropriate to target an intermediate page size >4 KiB but <256 KiB, and then add the 256 KiB page size for processors targeted at huge processes.
As mentioned earlier, the page size that optimizes memory wastage for a single-level page table is proportional to the square root of the region size, and a single-level page table also minimizes the TLB miss penalty, with a 2-level page table being second best for TLB miss penalty. Ssv64’s goal is to allow the operating system to choose page sizes per region that keep the page tables to 1 or 2 levels. It is therefore interesting to consider what region sizes are supported with this criteria with various page sizes. This is illustrated in the following table, assuming an 8 B PTE:
Page Size | 1-Level | 2-Level | 3-Level | 1-Level bits | 2-Level bits | 3-Level bits |
---|---|---|---|---|---|---|
4 KiB | 2 MiB | 1 GiB | 512 GiB | 21 | 30 | 39 |
16 KiB | 32 MiB | 64 GiB | 128 TiB | 25 | 36 | 47 |
64 KiB | 512 MiB | 4 TiB | 32 PiB | 29 | 42 | 55 |
256 KiB | 8 GiB | 256 TiB | 8 EiB | 33 | 48 | 63 |
2 MiB | 512 GiB | 128 PiB | | 39 | 57 | 75 |
16 MiB | 32 TiB | | | 45 | 66 | 87 |
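The table entries follow from a simple closed form: with page/table size P and 8 B PTEs, each level holds P/8 entries, so an n-level table maps (P/8)^n pages of P bytes each. A small sketch of that calculation (illustrative only; the function name is mine):

```c
#include <stdint.h>

/* Bytes mapped by an n-level page table with page/table size `page` and
 * 8 B PTEs: (page/8)^n pages of `page` bytes each.  For example
 * bytes_mapped(16384, 2) = 2048 * 2048 * 16384 = 64 GiB, matching the
 * 16 KiB / 2-Level cell above.  (The blank cells overflow 64 bits.) */
static uint64_t bytes_mapped(uint64_t page, unsigned levels)
{
    uint64_t bytes = page;
    for (unsigned i = 0; i < levels; i++)
        bytes *= page / 8;      /* entries per level */
    return bytes;
}
```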
To recapitulate, it makes sense to choose a second page size in addition to the 4 KiB compatibility size to extend the range of 1 and 2‑level page tables for simple operating systems, and then allow implementations targeted at huge physical memories to employ even larger page sizes and page table sizes. In particular, Ssv64 proposes a 4 KiB page size intended for backward compatibility, but based on the above, the suggested page size is 16 KiB. Sophisticated operating systems that can do arbitrary power of two allocation will use single-level page tables and a page size per region based on the square root of the region size. Operating systems with intermediate levels of sophistication may primarily operate with a pool of 16 KiB pages, with a mechanism to split these into 4 KiB pages and coalesce these back for applications that require the smaller page size. Intermediate operating systems targeted at huge memory configurations will add a 256 KiB pool with splitting to and coalescing from the 16 KiB pool. The least sophisticated operating systems will continue to use the 4 KiB compatibility page size.
Ssv64 proposes three improvements on paging found in recent architectures. First, it allows region size specifications to reduce page table walk latency. Just because the maximum region size is 2^61 bytes doesn't mean that every region requires six levels of 4 KiB tables. Second, it allows the operating system to specify the sizes of tables used at each level of the page table walk, rather than tying this to the page size used in translation caches. Decoupling the non-leaf table sizes from the leaf page sizes provides a mechanism that sophisticated operating systems may use for better performance, and on such systems this reduces some of the pressure for larger page sizes. Large leaf page sizes are still however useful for reducing TLB miss rates, and as the third improvement, Ssv64 incorporates Svnapot and allows the operating system to indicate where larger pages can be exploited by translation caches to reduce miss rates, but without requiring that all implementations do so.
Region descriptors and non-leaf page tables give the table size to be used at the next level, which allows the operating system to employ larger or smaller tables to optimize tradeoffs appropriate to the implementation and the application. The table size of the leaf page table implies the page size of the PTEs therein. When the leaf page table is reached, the Svnapot feature allows portions to use larger page sizes. Some implementations may support additional page sizes beyond these basic two recommendations in their translation cache matching hardware, such as 64 KiB and 256 KiB, whereas others may simply synthesize smaller pages for the L1 translation caches when page tables specify larger pages. Implementations targeting huge memory systems and applications (e.g. HPC) may add even larger pages to target further reduced TLB miss rates. The paging architecture allows this flexibility with Page Table Size (PTS) encoding in region descriptors and non-leaf PTEs, and for leaf PTEs with Svnapot encoding that allows enabled translation caches to take advantage of multiple consistent page table entries.
As an example illustrating the above, given a region of 2^26 bytes, a sophisticated operating system might choose a single-level (just the root) page table of 4096 entries, each specifying pages of 2^14 bytes. There would be one region lookup followed by the root page table. On an Sv64 system, an operating system with a large-memory process would be forced to use a 5 or 6-level page table for this region.
High Performance Computing often performs operations on large two-dimensional matrices. For example, multiplying N×N matrices (e.g. A = A + B × C) requires O(N^3) floating-point multiply-add operations on O(N^2) data. These matrix calculations on paged memory can be challenging for translation caches, and page size determines how well translation caches can handle matrix operations. Matrix algorithms typically operate on smaller sub-blocks of the matrices to maximize data reuse (O(N^3) operations on O(N^2) data means O(N) data reuse is possible) and to fit into the more constraining of the L1 TLB and L2 data cache (with other, larger blocking done to fit into the L2 TLB and L3, and smaller blocking to fit into the register file). Matrices are often large enough that each row is in a different page for small page sizes. For an algorithm with 8 B or 16 B per element, each row is in a different page at the following column dimension:
Page size | Columns per page (8 B elem) | Columns per page (16 B elem) | Rows per page, 1024 columns (8 B elem) | Rows per page, 1024 columns (16 B elem) |
---|---|---|---|---|
4 KiB | 512 | 256 | 0.5 | 0.25 |
8 KiB | 1024 | 512 | 1 | 0.5 |
16 KiB | 2048 | 1024 | 2 | 1 |
64 KiB | 8192 | 4096 | 8 | 4 |
256 KiB | 32768 | 16384 | 32 | 16 |
For large computations (e.g. ≥1024 columns of 16 B elements), every row increment is going to require a new TLB entry for page sizes ≤16 KiB. Even a 16 KiB page with 16 B per element results in a TLB entry per row. For an L1 TLB of 32 entries and three matrices (e.g. matrix multiply A = A + B × C), the blocking needs to be limited to only 8 rows of each matrix (e.g. 8×8 blocking), which is on the low side for the best performance. In contrast, the 64 KiB page size fits 4 rows in a single page, and so allows 32×32 blocking for three matrices using 24 entries.
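The arithmetic behind these numbers can be captured in a small model (illustrative only; the 32-entry L1 TLB, three matrices, 1024 columns, and 16 B elements are the example parameters used above):

```c
#include <stdint.h>

/* TLB entries needed to map a `rows`-row block of one matrix, where each
 * full matrix row is `columns * elem_bytes` bytes and rows are separated by
 * that full stride, so consecutive rows share a page only when several
 * whole rows fit in one page. */
static uint64_t entries_for_block(uint64_t rows, uint64_t page,
                                  uint64_t columns, uint64_t elem_bytes)
{
    uint64_t row_bytes = columns * elem_bytes;
    if (row_bytes >= page)
        return rows;                    /* at least one entry per row */
    uint64_t rows_per_page = page / row_bytes;
    return (rows + rows_per_page - 1) / rows_per_page;
}

/* Example from the text: 1024 columns of 16 B elements.  With <=16 KiB pages,
 * three matrices of 8 rows each already need 3 * 8 = 24 of a 32-entry L1 TLB;
 * with 64 KiB pages (4 rows per page), 32-row blocks also need 3 * 8 = 24. */
```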
If the vector unit is able to use the L2 TLB rather than the L1 TLB for its translation, which is plausible, then these larger page sizes are not quite as critical. An L2 TLB is likely to have 128 or 256 entries, and so is able to hold 32 or 64 rows of 1024-column matrices of 16 B elements.
HPC experts might want to suggest an appropriate analysis for three dimensional data.
A possible goal for page size might be to balance the TLB and L2 cache sizes for matrix blocking. For example, a L2 cache size of 512 KiB can fit up to 100×100 blocks of three matrices of 16 B elements (total 480 KiB) given sufficient associativity. To fit 100 rows of 3 matrices in the L2 TLB requires ≥300 entries when pages are ≤16 KiB, but only ≥75 entries when pages ≥64 KiB. A given implementation should make similar tradeoffs based on the target applications and candidate TLB and cache sizes, and page size is another parameter that factors into the tradeoffs here. What is clear is that the architecture should allow implementations to efficiently support multiple page sizes if the translation cache timing allows it.
Because multiple page sizes do affect timing critical paths in the translation caches, and the timing paths of L1 translation caches are particularly critical for microprocessor clock rate, it is worth pointing out that implementations are able to reduce the page size stored in translation caches to match their matching hardware. An implementation could for example synthesize 16 KiB pages for the L1 translation cache even when the operating system specifies a 64 KiB page. This will however increase the miss rate. Conversely, some hardware may support an even larger set of page sizes. Ssv64 adopts the NAPOT encoding from RISC‑V’s PMPs and PTEs (with the Svnapot extension) to allow the TLB to use larger matching for groups of consistent PTEs without requiring it. Thus it is up to implementations whether to adopt larger page matching to lower the TLB miss rate at the cost of a potential TLB critical path. The cost of this feature is one bit in the PTE (taken from the bits reserved for software).
It may be helpful to consider how paging might work in a straight-forward six-level Sv64 (basically Sv57 with an additional first level of 128 entries). This would not perform well due to the six-level translation cache miss penalty. Very likely a system with applications requiring this huge address space would use a final 2 MiB page, reducing it to five levels. These two options are illustrated in the two figures below.
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | VPN5 | byte | ||||||||||||||
7 | 9 | 9 | 9 | 9 | 9 | 12 |
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 0 | ||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | byte | ||||||||||||
7 | 9 | 9 | 9 | 9 | 21 |
Changing the page size to 8 KiB allows the reduction from six/five levels to five/four as illustrated below:
63 | 53 | 52 | 43 | 42 | 33 | 32 | 23 | 22 | 13 | 12 | 0 | ||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | byte | ||||||||||||
11 | 10 | 10 | 10 | 10 | 13 |
63 | 53 | 52 | 43 | 42 | 33 | 32 | 23 | 22 | 0 | |||||
VPN0 | VPN1 | VPN2 | VPN3 | byte | ||||||||||
11 | 10 | 10 | 10 | 23 |
We can get to three levels by using a 256 KiB page size in a straight-forward Sv64 as illustrated below:
63 | 48 | 47 | 33 | 32 | 18 | 17 | 0 | ||||
VPN1 | VPN2 | VPN3 | byte | ||||||||
16 | 15 | 15 | 18 |
While the 256 KiB page works well for huge memory applications, it is not appropriate for all processes that would run on these processors, or even for some portions of the address space of huge memory applications. What would be appropriate is being able to specify the page size to be used for different regions of the 64‑bit address space.
This page size discussion attempts to justify a 4 KiB compatibility page size and a 16 KiB preferred page size, with some large-memory (e.g. HPC) targeted processors adding support for 256 KiB pages. Processors might support still other page sizes, but L1 translation cache timing considerations suggest minimizing the number of choices. There are advantages to using different page sizes in various regions of a process address space, and it is advantageous to support decoupling of the non-leaf table sizes from the page size for sophisticated operating systems. It is also advantageous to reduce the number of levels of page table to reduce translation cache miss penalties, and this is possible if different regions of the address space have their own size.
Should it become possible to eliminate the 4 KiB compatibility page size in favor of a 16 KiB minimum page size, it may be appropriate to use the extra two bits to increase the physical address width to 66 bits.
Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). At some point this should be converted to SAIL syntax.
This section outlines a different 64‑bit virtual address space translation mechanism that solves the problems of Sv57 and a hypothetical Sv64. Later sections will go into more depth.
The above is achieved by dividing the 64‑bit address space into 65536 regions based on the top 16 bits of the address. These upper bits index a descriptor table, which controls the interpretation of the lower 48 bits. Each region is of variable size of up to 2^61 bytes, where regions >2^48 bytes require the supervisor to specify multiple consistent descriptors. After this level of the translation, either a direct mapping is used, or RISC‑V-like page tables are used. Direct-mapping is especially useful for I/O regions. Because region descriptors include a size, direct-mapping can be used for regions as small as 4 KiB, or as large as 2^61 bytes.
Given the region descriptors, it is possible to support new features that will not fit into the limited bits available in Page Table Entries (PTEs). Two features that can take advantage of region descriptors are support for garbage collection and generalization of the two levels of nesting in RISC‑V (supervisor and user) to four to eight levels with nesting of Read, Write, and Execute permission. Some of these levels may be used in user mode for things like sandboxing untrusted code or implementing concurrent garbage collection. In addition, it may be useful to have PMA overrides more general than RISC‑V’s Svpbmt, which is limited by the number of PTE bits available.
The model for the above features is actually taken from a 1960s processor architecture called Multics where regions were called segments and the permission nesting was called rings. I have avoided the words segments and rings in the exposition above to avoid preconceived notions the reader might have from some early microprocessors trying to extend their address spaces from tiny to small, which is quite different from the Multics approach. Multics segmentation is about better managing the existing address space, and that is what Ssv64 seeks to accomplish as well. There seems to be an impression among many in the computer architecture world that Multics virtual memory and protection were complex, when in fact they are simple, easy to implement, and general. Computer architecture from the 1980s to present has often been based on an oversimplification of Multics. For example, segmentation in Multics served to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) puts pressure to keep the number of bits devoted to access control minimal, when security considerations might suggest a more robust set. As another example, many contemporary processor architectures (e.g. RISC‑V) have two rings (User and Supervisor), with a single bit in PTEs (the U bit in RISC‑V) serving as a ring bracket. Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. It is true that rings were not well utilized on Multics, but we now have more uses for multiple rings, such as sandboxing and concurrent garbage collection.
For those familiar with Multics, the primary thing to know is that segment sizes are powers of two. In addition, Ssv64 has the option to support either 4 or 7 rings and inverts ring numbers so that ring 0 is the least privileged (ring 0 was the most privileged in Multics). Inverting ring numbers means that applications are unaware of how many rings exist above them and allows implementations to choose either 4 or 7 rings without affecting applications. The only reason for an implementation to use the smaller ring count is to save 3 bits in translation caches, as the ring mechanism is very low-cost except for the number of bits in the TLB. The two tables below illustrate possible ring assignments for the two options.
ring | use |
---|---|
0 | JIT sandbox (e.g. browser downloaded code, e.g. code being debugged) |
1 | non-JIT sandbox |
2 | user (e.g. browser, debugger) |
3 | supervisor |
ring | use |
---|---|
0 | JIT sandbox (e.g. browser downloaded code, e.g. code being debugged) |
1 | non-JIT sandbox |
2 | user (e.g. browser, debugger) |
3 | supervisor device drivers |
4 | supervisor |
5 | hypervisor device drivers |
6 | hypervisor |
7 | reserved for other purposes |
Page size flexibility and translation cache miss penalty are major design considerations, but these were addressed earlier. Here we look at a couple of other design considerations.
The maximum number of segments (65536) is chosen to be compatible with RISC-V Sv48 with up to four levels of page table when 4 KiB pages and tables are used. This number of segments is likely far more than required (which is likely to be as few as 2048). Since the tables (or hardware via WARL) can reduce the number, the large number of segments isn’t a real issue. However, if Sv48 compatibility is not needed, one might make other choices, e.g. 8 KiB pages, 2048 segments, and levels of 2^23, 2^33, 2^43, and 2^53 bytes for fixed table size operating systems.
At this point, rings and gates are not a required component of this proposal, but rather a feature that is enabled by this proposal. Rings and gates are not yet fully developed here. If this goes further, whether to include rings and gates would need to be considered. One advantage of these is that transitions between privilege levels can be accomplished without an exception, and exceptions have performance costs that a simple JALR does not. If rings are used for sandboxing, then gates eliminate the need for exception handlers for each ring, which would be helpful if sandboxing is desired. Rings are independent of modes (e.g. user-mode might encompass several rings). One issue is that 64‑bit pointers lack the bits to add a ring number for pointer parameters passed from lower privilege code to gates (Multics pointers had a ring number). For simple pointer arguments this is not a difficult issue to address. Linux handles this case by simply testing the sign of the pointers passed to system calls. Adding instructions to generalize this test might be sufficient. Testing of pointer arguments becomes more involved when interfaces involve pointers to pointers in complex data structures when pointers are passed to a higher privilege level and then passed in turn to a yet higher privilege level. Most interfaces between privilege levels have avoided this and instructions for testing access may suffice.
To explore the addition of rings and gates, let’s start with a ring CSR that holds the current ring in read-only bits 2:0 and the previous ring in read-write bits 6:4. A gate transition copies bits 2:0 to 6:4 and sets 2:0 to the R1 of the target segment. Gate code is aligned on 128 B boundaries in the first 512 KiB of the segment. It will begin by switching to a new stack, which will require a new scratch register to save the caller’s sp and some location from which to read the new sp, either a fixed address in memory or new per-ring CSRs. That done, the gate will execute a RINGR rd, rs1 for each pointer passed to the gate for reading and a RINGW rd, rs1 for each pointer argument passed to the gate for writing. RINGR sets rd to ring[6:4] < SDE1.R2 and RINGW sets rd to ring[6:4] < SDE1.R1 where R1 and R2 come from the translation cache for rs1. Any non-zero value indicates that the caller does not have access to read or write one of the parameters and signals an access fault. Once the arguments are verified, the gate calls the actual function written in a high-level language (or C++). When it returns, the gate restores the caller’s sp and the ring CSR, and returns with a new JALR that restores ring 2:0 from 6:4 (but not higher than ring 2:0). Exceptions to supervisor and machine mode would have to set the ring CSR appropriately (e.g. saving the old value in bits 10:8). This section outlines what ring and gate support might look like, but I am sure that there are plenty of details that need to be filled in. Plus Linux would require a fair bit of support (e.g. to save/restore the ring CSR, put the ring number into sigcontext, and so on), and disallow EBREAK from sandboxed code, and probably much more.
31 | 16 | 15 | 14 | 12 | 11 | 10 | 8 | 7 | 6 | 4 | 3 | 2 | 0 | |||||
0 | 0 | mepc | 0 | sepc | 0 | caller | 0 | current | ||||||||||
16 | 1 | 3 | 1 | 3 | 1 | 3 | 1 | 3 |
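To make the exploration above concrete, the following is a minimal behavioral model in C of the gate transition and the RINGR/RINGW tests (the C representation and names are purely illustrative; the CSR field positions follow the figure above, and the R1/R2 brackets come from the target segment’s descriptor via the translation cache):

```c
#include <stdbool.h>

/* ring CSR fields from the figure: current ring in bits 2:0 (read-only),
 * caller's (previous) ring in bits 6:4 (read-write). */
struct ring_csr { unsigned current : 3, pad : 1, caller : 3; };

/* Gate transition: remember the caller's ring and drop to the target
 * segment's R1 bracket. */
static void gate_enter(struct ring_csr *r, unsigned target_r1)
{
    r->caller = r->current;
    r->current = target_r1;
}

/* RINGR rd,rs1 / RINGW rd,rs1: rd is non-zero when the caller may NOT read
 * (respectively write) through the pointer in rs1, using the R2/R1 brackets
 * of the segment containing rs1 (reads allowed in [R2:6], writes in [R1:6]). */
static bool ringr_denied(struct ring_csr r, unsigned seg_r2) { return r.caller < seg_r2; }
static bool ringw_denied(struct ring_csr r, unsigned seg_r1) { return r.caller < seg_r1; }
```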
Ssv64 virtual addresses are interpreted as follows, where PS is the page size implied by the segment size (the ssize field of the SDE) and by the PTS fields of the page tables created by the operating system. In particular, PS is ssize−PTS0 for a PTE in a single-level page table, ssize−PTS0−PTS1 for a PTE in a two-level page table, and ssize−PTS0−PTS1−…−PTSn−1 for a PTE in an n‑level page table (1≤n≤4). This page size can be increased for a subset of pages within the last level by using the Svnapot feature. The resulting PS must be in the range 47..12. Translation caches may reduce the calculated PS to the next lower supported value. Page sizes ≥48 bits are not supported because segment direct-mapping would be used instead.
63 | 61 | 60 | 48 | 47 | 0 | |||||||||
sg | segment | fill | VPN | byte | ||||||||||
3 | 13 | 48−ssize | ssize−PS | PS |
where ssize is the segment size for the segment, PS is the page size given by the segment mapping, and fill is all 0s for upward growing segments and all 1s for downward growing segments.
Field | Width | Bits | Description |
---|---|---|---|
sg | 3 | 63:61 | Segment group |
segment | 13 | 60:48 | Segment in group |
fill | max(0,48−ssize) | 47:48−ssize | Must be downward^(48−ssize) (all 0s for upward growing, all 1s for downward growing) if ssize < 48 |
VPN | ssize−PS | ssize−1:PS | Page in segment |
byte | PS | PS−1:0 | Byte in page |
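A sketch of the leaf page size calculation just described (the PTS values are those encountered along the walk; clamping the result to a supported size is the implementation option mentioned above, and the function name is mine):

```c
/* Leaf page size (as a power of two) implied by a page table walk:
 * ssize minus the PTS value used at each of the 1..4 levels walked.
 * The result must be in 12..47 or the translation page faults; a
 * translation cache may then round it down to a size it supports. */
static int leaf_page_bits(int ssize, const int pts[], int levels)
{
    int ps = ssize;
    for (int i = 0; i < levels; i++)
        ps -= pts[i];
    return (ps >= 12 && ps <= 47) ? ps : -1;   /* -1 models the page fault */
}
```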
For example, for a ssize of 39 and the 4 KiB compatibility page size, the interpretation of a virtual address is as follows:
63 | 61 | 60 | 48 | 47 | 39 | 38 | 12 | 11 | 0 | |||||
sg | segment | fill | VPN | byte |
3 | 13 | 9 | 27 | 12 |
Translation begins with the eight sdtp registers (sdtp[0] to sdtp[7]), which serve a similar function to satp in RISC‑V’s Sv48. Each register provides a pointer to a Segment Descriptor Table for one eighth of the address space, the size of that table encoded in NAPOT-fashion, and an Address Space Identifier (ASID).
63 | 12 | 11 | 0 | |||||
paddr63..13+SGS | 2^SGS | ASID |
51−SGS | 1+SGS | 12 |
Field | Width | Bits | Description |
---|---|---|---|
ASID | 12 | 11:0 | Address Space Identifier for the Segment Group |
paddr63..13+SGS | 51-SGS | 63:13+SGS | Physical address of SDT for Segment Group |
The size of the segment group is 512×2^SGS segments, where SGS is given by the number of zero bits starting at bit 12. Some implementations might reduce the number of bits in their TLBs by hardwiring bit 12 to 1, thus allowing only 512 segments per segment group. SGS must be ≤4 or a page fault occurs. Segment groups may be disabled entirely using the sgen register. The Segment Descriptor Table for the group is specified in a table of 16 B entries at the specified physical address, which is aligned to the size of the group.
Specifically, the segment group bounds check is vaddr60..48 < 2^(9+SGS). If the bounds check fails, a page fault exception is taken. If the bounds check succeeds, the 16 B Segment Descriptor Entry is read from (sdtp[vaddr63..61]63..13+SGS ∥ 0^(SGS+13)) | (vaddr60..48 ∥ 0^4).
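A minimal model of this lookup (the helper names are mine; the field positions are exactly those given above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract bits hi..lo of a 64-bit value (hi > lo, hi < 64). */
static inline uint64_t bits(uint64_t v, int hi, int lo)
{
    return (v >> lo) & ((1ull << (hi - lo + 1)) - 1);
}

/* Bounds-check the segment number against the group size (SGS <= 4) and
 * form the physical address of the 16 B Segment Descriptor Entry. */
static bool sde_address(uint64_t vaddr, uint64_t sdtp_value, int sgs,
                        uint64_t *sde_pa)
{
    uint64_t segment = bits(vaddr, 60, 48);
    if (segment >= (1ull << (9 + sgs)))
        return false;                              /* page fault */
    uint64_t sdt_base = bits(sdtp_value, 63, 13 + sgs) << (13 + sgs);
    *sde_pa = sdt_base | (segment << 4);           /* 16 B entries */
    return true;
}
```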
The sdtp registers can be used to match the address space usage common in other architectures. Consider an architecture with just two levels, user and supervisor, where user addresses are ≥0 and supervisor addresses are <0. Supervisor addresses can use pages with a Global bit set to ignore ASID matching for supervisor common data, or leave Global clear for per-process supervisor data. All user addresses have Global clear. To match such usage, ASID=0 is used instead of Global=1. Segment group 0 is used for user addresses with the temporarily assigned ASID, and segment group 6 is used with the same ASID for supervisor per-process data. Segment group 7 is used for supervisor common addresses with ASID 0. Such a system might set sdtp[7] at initialization, change sdtp[0] and sdtp[6] on process switch, and leave the other five groups disabled.
A more sophisticated supervisor might attempt to get Instruction TLB sharing between user processes by mapping shared libraries using segment group 1 and ASID 0, while leaving segment group 0 for per-process data. Segment group 1 would be identical in every user process address space, so sdtp[1] would not be changed on process switch.
This section is very preliminary at this point.
The sgen CSR controls which modes or rings (which is TBD) can write the various sdtp registers, with three bits per sdtp register. Reads or writes to the sdtp[i] register or its shadows trap if the current mode or ring number is less than sgen bits i×8+2..i×8. It is possible to provide read access separate from write access, but the need for this is unclear.
It is TBD whether to implement eight fields of 4 bits (allowing expansion to sixteen Segment Groups in the future) or eight fields of 8 bits (allowing new Segment Group Enable functionality per group in the future). The following illustrates the sgen CSR with 8‑bit fields and control by ring number. The alternatives are left to the imagination of the reader.
If sgen enables/disables on ring numbers, setting the ring number to 7 disables the corresponding sdtp register altogether.
Each Segment Group field is WARL, and some may be hardwired, including to be disabled. If some sgen fields are hardwired to 7, then the corresponding sdtp registers need not exist, nor the TLB bits required for that.
63 | 56 | 55 | 48 | 47 | 40 | 39 | 32 | 31 | 24 | 23 | 16 | 15 | 8 | 7 | 0 | ||||||||
sg7 | sg6 | sg5 | sg4 | sg3 | sg2 | sg1 | sg0 | ||||||||||||||||
8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
7 | 6 | 3 | 2 | 0 | ||
L | F | ring | ||||
1 | 4 | 3 |
Field | Width | Bits | Description |
---|---|---|---|
ring | 3 | 2:0 | Ring for which sdtp is enabled |
F | 4 | 6:3 | Reserved for future use |
L | 1 | 7 | Lock |
It may be appropriate to implement a lock bit in bit 7 of each field, similar to pmpcfg.
The segment descriptor can be thought of as the root page table of the translation, but with a 16 B descriptor instead of an 8 B PTE. The first 8 B of the descriptor is made very similar to the PTE format, with the extra permissions, attributes, etc. in the second 8 B of the descriptor. The PTE format in turn is made mostly compatible with the Sv39, Sv48, and Sv57 PTE formats by using two reserved XWR values (2 and 6).
Segment Descriptor Table (SDT) entries consist of two doublewords. The first doubleword has a format similar to Page Table Entries (PTEs), and the second doubleword is used for segment size and permissions. The XWR field of the Segment Descriptor Entry (SDE) is used to distinguish direct-mapped segments (primarily intended for mapping I/O regions of the address space) and paged memory translation. In both cases the ssize field is used to bounds check the reference when ssize < 48 (no check is performed if ssize ≥ 48). In particular, bits 47..ssize are tested. The test depends on the Fill Check (FC) field of the SDE. If FC is 0, then bits 47..ssize must all be zero. If FC is 1, then bits 47..ssize must all be set. If FC is 2, then bits 47..ssize are ignored.
When ssize > 48, the supervisor is required to make all 2^(ssize−48) SDEs for the segment identical. (An alternative would be to require that paddrssize−1..48 match the corresponding bits of the segment number for direct-mapped translation?)
63 | 3 | 2 | 1 | 0 | ||||
paddr63..ssize | 2^(ssize−4) | 1 | V |
64−ssize | ssize−3 | 2 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 127..1 for software use 1 ⇒ Valid, bits 127..1 interpreted as follows |
WR | 2 | 2:1 |
0 Reserved 1 ⇒ Direct-mapped (this case) 2 ⇒ Paged, pointer to root page table for the segment (see below) 3 Reserved |
2^(ssize−4) | ssize−3 | ssize−1:3 | Same encoding as page table size, but must match segment size ssize |
paddr63..ssize | 64−ssize | 63:ssize | Physical address of direct-mapping |
The direct-mapping is defined as:
    ssize ← SDE1[5..0]
    downward ← (SDE1[7..6] = 1)
    if ssize < 12 | ssize > 61 | SDE0[ssize−1..3] ≠ 1∥0^(ssize−4) then
        Exception(PageFault)
    else if ssize < 48 & vaddr[47..ssize] ≠ downward^(48−ssize) then
        Exception(PageFault)
    else
        paddr ← SDE0[63..ssize] ∥ vaddr[ssize−1..0]
    endif

Here SDE0 and SDE1 denote the first and second doublewords of the Segment Descriptor Entry (the first doubleword holds the physical address and size encoding, the second holds ssize, FC, and permissions).
63 | 3 | 2 | 1 | 0 | ||||
paddr63..4+PTS | 2^PTS | 2 | V |
60−PTS | 1+PTS | 2 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 127..1 for software use 1 ⇒ Valid, bits 127..1 interpreted as follows
WR | 2 | 2:1 |
0 Reserved 1 ⇒ Direct-mapped (see above) 2 ⇒ Paged, pointer to root page table for the segment (this case) 3 Reserved
2^PTS | 1+PTS | 3+PTS:3 |
Root page table size encoding. Table size of root page table is 2^(1+PTS) entries (2^(4+PTS) bytes).
paddr63..4+PTS | 60−PTS | 63:4+PTS | Physical address of direct-mapping or root page table |
Note: If it reduces implementation cost, it seems reasonable to change PTS ≥ 32 to be reserved.
63 | 56 | 55 | 48 | 47 | 46 | 45 | 44 | 43 | 32 | 31 | 30 | 27 | 26 | 24 | 23 | 22 | 20 | 19 | 18 | 16 | 15 | 11 | 10 | 8 | 7 | 6 | 5 | 0 | ||||||||||
0 | PMAO | G1 | G0 | Gates | B | 0 | R3 | 0 | R2 | 0 | R1 | 0 | XWR | FC | ssize | |||||||||||||||||||||||
8 | 8 | 2 | 2 | 12 | 1 | 4 | 3 | 1 | 3 | 1 | 3 | 5 | 3 | 2 | 6 |
Field | Width | Bits | Description |
---|---|---|---|
ssize | 6 | 5:0 |
log2 of segment size in bytes (12..61); 0..11 Reserved; 62..63 Reserved
FC | 2 | 7:6 |
Fill check (must be 0 if ssize ≥ 48) 0 ⇒ address bits 47..ssize must be 0 (upward growing segment) 1 ⇒ address bits 47..ssize must be 1 (downward growing segment) 2 ⇒ address bits 47..ssize ignored (e.g. may be used for HWASAN) 3 Reserved (e.g. could be HWACHK or sign-extend check)
XWR | 3 | 10:8 |
Execute Write Read permission
R1 | 3 | 18:16 | Ring bracket 1 (as described below) |
R2 | 3 | 22:20 | Ring bracket 2 (as described below) |
R3 | 3 | 26:24 | Ring bracket 3 (as described below) |
B | 1 | 31 |
PTE backward compatibility: 0 ⇒ bits 4..5 interpreted as GC dirty as described below 1 ⇒ bit 4 is a Sv39-compatible U bit, and bit 5 is a Sv39-compatible G bit
Gates | 12 | 43:32 |
Gate count. Gate transition only allowed if target6..0 = 0 & target47..7 < Gates.
G0 | 2 | 45:44 | Garbage collection generation | ||||||||||||||||
G1 | 2 | 47:46 | Garbage collection dirty | ||||||||||||||||
PMAO | 8 | 55:48 | PMA override, addition, hints, etc. (e.g. PBMT) |
Each segment has three 3‑bit ring numbers, R1, R2, and R3, stored in the segment descriptor table and used for bracketing accesses by the ring of execution, in addition to the Read, Write, and Execute permissions from the segment descriptor table; they also specify gate access permission. To reiterate, Ssv64 inverts the ring number to privilege mapping chosen by Multics: ring 6 is the most privileged and ring 0 the least privileged. Typically R3≤R2≤R1. Writes are permitted when the current ring of execution is in [R1:6], reads in [R2:6], execution in [R2:R1], and calls to gates in [R3:R2−1]*. Because not all eight rings are required, Ssv64 reserves the value 7 for other uses.
* The ring number of the caller and the ring brackets of the target segment are used to calculate the new ring number of execution, as per the Multics documentation, modified for the inverted ring order.
The gate test criterion cited above requires that the target address be 128 B aligned (bits 6..0 are zero) and that bits 47..7 be less than the segment’s gate count field in the segment descriptor entry.
What | R1,R2,R3 | seg RWX | R bracket | W bracket | X bracket | G bracket | Ring 0 | Ring 1 | Ring 2 | Ring 3 | Rings 4 to 6 |
---|---|---|---|---|---|---|---|---|---|---|---|
User code | 2,2,2 | R-X | [2,6] | - | [2,2] | - | ---- | ---- | R-X- | R--- | R--- |
User execute only | 2,2,2 | --X | - | - | [2,2] | - | ---- | ---- | --X- | ---- | ---- |
User stack or heap | 2,2,2 | RW- | [2,6] | [2,6] | - | - | ---- | ---- | RW-- | RW-- | RW-- |
User read-only file | 2,2,2 | R-- | [2,6] | - | - | - | ---- | ---- | R--- | R--- | R--- |
Compiler library | 6,0,0 | R-X | [0,6] | - | [0,6] | - | R-X- | R-X- | R-X- | R-X- | R-X- |
Supervisor driver code | 4,3,3 | R-X | [3,6] | - | [3,4] | - | ---- | ---- | ---- | R-X- | R-X- |
Supervisor driver data | 3,3,3 | RW- | [3,6] | [3,6] | - | - | ---- | ---- | ---- | RW-- | RW-- |
Supervisor code | 4,3,4 | R-X | [3,6] | - | [3,4] | - | ---- | ---- | ---- | R-X- | R-X- |
Supervisor heap or stack | 4,4,4 | RW- | [4,6] | [4,6] | - | - | ---- | ---- | ---- | ---- | RW-- |
Supervisor gates for user | 4,4,2 | R-X | [4,6] | - | [4,4] | [2,3] | ---- | ---- | ---G | ---G | R-X- |
Sandboxed JIT code | 1,0,0 | RWX | [0,6] | [1,6] | [0,1] | - | R-X- | RWX- | RW-- | RW-- | RW-- |
Sandboxed JIT stack or heap | 0,0,0 | RW- | [0,6] | [0,6] | - | - | RW-- | RW-- | RW-- | RW-- | RW-- |
Sandboxed non-JIT code | 1,1,1 | R-X | [1,6] | - | [1,1] | - | ---- | R-X- | R--- | R--- | R--- |
User gates for sandboxes | 2,2,0 | R-X | [2,6] | - | [2,2] | [0,1] | ---G | ---G | R-X- | R--- | R--- |
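The rows of the table above can be generated mechanically from the bracket rules; the following sketch (illustrative only, with names of my own choosing) evaluates them, with the segment’s XWR permissions ANDed in:

```c
#include <stdbool.h>

/* Ring brackets and segment permissions from the second doubleword of an SDE. */
struct seg_access { unsigned r1, r2, r3; bool r, w, x; };

/* Bracket rules from the text: writes permitted in [R1:6], reads in [R2:6],
 * execution in [R2:R1], and gate calls in [R3:R2-1]; the segment's XWR
 * permissions further restrict reads, writes, and execution. */
static bool can_read (struct seg_access s, unsigned ring) { return s.r && ring >= s.r2; }
static bool can_write(struct seg_access s, unsigned ring) { return s.w && ring >= s.r1; }
static bool can_exec (struct seg_access s, unsigned ring) { return s.x && ring >= s.r2 && ring <= s.r1; }
static bool can_gate (struct seg_access s, unsigned ring) { return ring >= s.r3 && ring <  s.r2; }
```

For instance, the “Supervisor gates for user” row (R1,R2,R3 = 4,4,2) yields gate access exactly from rings 2 and 3, as shown in the table.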
Page Table Entries (PTEs) are similar to Sv57, Sv48, and Sv39 PTE formats except for the following changes:
63 | 3 | 2 | 1 | 0 | ||||
paddr63..4+PTS | 2^PTS | 2 | V |
60−PTS | 1+PTS | 2 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 63..1 for software use 1 ⇒ Valid, next-level page table specified by paddr63..4+PTS
2^PTS | 1+PTS | 3+PTS:3 |
NAPOT next level table size encoding. Table size of next level is 2^(1+PTS) entries (2^(4+PTS) bytes).
paddr63..4+PTS | 60−PTS | 63:4+PTS | Physical address of next level page table |
Note: If it reduces implementation cost, it seems reasonable to change PTS ≥ 32 to be reserved.
63 | 11 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 1 | 0 | ||||||
paddr63..12+S | 2^S | RSW | D | A | GC | XWR | V |
52−S | 1+S | 3 | 1 | 1 | 2 | 3 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 63..1 for software use 1 ⇒ Valid, bits 63..1 as described below |
XWR | 3 | 3:1 |
Execute, Write, Read permissions 0, 2, 6 Reserved These further restrict (logical-and with) the permissions in the Segment Descriptor ring brackets. |
GC | 2 | 5:4 |
Garbage Collection dirty A page fault trap results if a pointer to a generation more recent than this value is stored to the page. Setting this field to 3 prevents GC traps. If the backward compatibility bit is set in the Segment Descriptor Entry, then this field reverts to a more Sv48-compatible interpretation: U-mode software may only access the page when bit 4 is 1. ASID matching is ignored when bit 5 is 1 (implementations of Ssv64-only without an explicit G bit in the TLB may implement this by setting ASID to 0). |
A | 1 | 6 | Accessed |
D | 1 | 7 | Dirty |
RSW | 3 | 10:8 | For software use |
2^S | 1+S | 11+S:11 | NAPOT page size encoding |
paddr63..12+S | 52−S | 63:12+S | Physical Page Number |
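Decoding the NAPOT size field of a leaf PTE might look like the following sketch (illustrative; the function name is mine, and S is bounded because the resulting page size must not exceed 2^47 bytes):

```c
#include <stdint.h>

/* Decode the NAPOT size field of a leaf PTE: S is the number of zero bits
 * starting at bit 11 below the terminating 1 bit, giving a 2^(12+S)-byte
 * page whose physical page number starts at bit 12+S. */
static int leaf_pte_napot_s(uint64_t pte)
{
    int s = 0;
    while (s < 36 && ((pte >> (11 + s)) & 1) == 0)
        s++;
    return s;
}
/* Example: bit 11 set means S = 0, a 4 KiB page; bits 12..11 clear with
 * bit 13 set means S = 2, a 16 KiB page. */
```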
This section compares translation caching between various choices. There are two sides to a translation cache (TLB): the match side and the translation side. For comparison purposes, this comparison uses an ASID length of 12 bits on the match side and a physical address size of 47 (Sv39 only) or 56 bits on the translation side. The following table lists what needs to be stored in the translation cache. The Ssv64 GC and Gates features are assumed not present. The three Ssv64 configurations are: Ssv64min with 4 rings, 4 segment groups, and 512 segments per group; Ssv64max with 7 rings, 8 segment groups, and 8192 segments per group; and finally Ssv64max+, which adds all the possible features Ssv64 might allow, including GC and Gates. The Ssv64 configuration chosen in the rows below is sized to be as minimal in size as the Sv entries it is paired with. Svnapot is ignored for this comparison. The PS column represents the number of bits required for page sizes for Sv configurations, and desired for Ssv configurations. Set associative translation caches would of course not store the virtual address bits used to select the set. This table also does not include the bits used for replacement (e.g. pseudo-LRU bits for set-associative caches). Other bits not included might be associated with hypervisor VMID, machine (M) mode unmapped entries (if any), parity or ECC, page size reduction, etc. Note that Ssv64 has the potential (but no definition as of yet) to extend PBMT from 2 bits up to 8.
One other caveat concerns the common micro-architecture choice to split translation caches into separate structures for instruction and data. Instruction and data translation caches require different subsets of the bits enumerated below. For example, the RW, D, and GC fields would be data only and the R3, X, and Gates fields would be instruction only.
Mode | PS | G | ASID | vaddr | Match total | U | AD | XWR | PBMT | paddr | Rn | other | Translation total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sv39 only | 2 | 1 | 12 | 38:12 | 42 | 1 | 2 | 3 | 2 | 46:12 | 0 | 0 | 43 |
Sv48 only | 2 | 1 | 12 | 47:12 | 51 | 1 | 2 | 3 | 2 | 55:12 | 0 | 0 | 52 |
Sv57 only | 3 | 1 | 12 | 56:12 | 61 | 1 | 2 | 3 | 2 | 55:12 | 0 | 0 | 52 |
Sv64 only | 3 | 1 | 12 | 63:12 | 68 | 1 | 2 | 3 | 2 | 55:12 | 0 | 0 | 52 |
Sv39 + Ssv64min | 3 | 1 | 12 | 63,61,56:48,38:12 | 54 | 1 | 2 | 3 | 2 | 46:12 | 6 | 0 | 49 |
Sv48 + Ssv64min | 3 | 1 | 12 | 63,61,56:12 | 63 | 1 | 2 | 3 | 2 | 55:12 | 6 | 0 | 58 |
Sv57 + Ssv64min | 3 | 1 | 12 | 63,61,56:12 | 63 | 1 | 2 | 3 | 2 | 55:12 | 6 | 0 | 58 |
Sv64 + Ssv64max | 3 | 1 | 12 | 63:12 | 68 | 1 | 2 | 3 | 2 | 55:12 | 9 | 0 | 61 |
Ssv64min only | 2 | 0 | 12 | 63:61,56:12 | 62 | 0 | 2 | 3 | 2 | 46:12 | 6 | 0 | 48 |
Ssv64max only | 2 | 0 | 12 | 63:12 | 66 | 0 | 2 | 3 | 2 | 55:12 | 9 | 0 | 60 |
Ssv64max+ only | 2 | 0 | 12 | 63:12 | 66 | 0 | 2 | 0 | 8 | 55:12 | 9 | 16 | 79 |
The hypervisor should be able to use a PTE mechanism compatible with the first level, but does not need the segmentation and ring mechanisms. What is proposed here is therefore a simplified version of the first-level translation. The segment group is eliminated in favor of a CSR that gives the size of the guest physical address space as the gsize field of the hgas CSR (analogous to ssize in Segment Descriptor Entries) and a single table for that 2^gsize guest physical address space. The range of gsize is 19..64. It is unlikely that gsize = 64 needs to be supported, and so this field of hgas should be WARL to allow implementations to choose a smaller size. There is no direct-mapping option in the hgapt CSR, as typically the guest would be given I/O and main memory regions, which requires a page table.
63 | 7 | 6 | 0 | ||
0 | gsize | ||||
57 | 7 |
63 | 12 | 11 | 3 | 2 | 1 | 0 | ||
paddr63..4+PTS | 2^PTS | 2 | V |
60−PTS | 1+PTS | 2 | 1 |
The second-level non-leaf PTE format is identical to the first-level non-leaf PTE format. The second-level leaf PTE format is identical to the first-level leaf PTE format except that the GC bits are redefined as PBMT bits (details TBD).
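A sketch of the bounds check implied by hgas.gsize (analogous to the first-level ssize check; the function name is mine):

```c
#include <stdbool.h>
#include <stdint.h>

/* Guest physical addresses must fit in the 2^gsize guest physical address
 * space given by hgas.gsize before the single second-level page table rooted
 * at hgapt is walked. */
static bool gpa_in_range(uint64_t gpa, unsigned gsize)
{
    return gsize >= 64 || (gpa >> gsize) == 0;
}
```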
AddressSanitizer is a method for detecting programming errors. It has been useful enough that processor additions have been made to reduce its performance penalty and make it more usable. One is Hardware-assisted AddressSanitizer (HWASAN), which uses unused bits in pointers to add a tag that is compared to the tag of the memory pointed to. The hardware assist refers to ISA features that allow pointer bits to be ignored (e.g. ARM’s Top Byte Ignore (TBI) option). This raises the question of whether this is possible with Ssv64.
Given that the point of a full Ssv64 implementation is to provide a 64‑bit address space, the only bits that can be ignored are the fill bits between the segment number and the segment size (bits 47..ssize). However, these bits would still be used for matching in translation caches, and can only be cheaply ignored on translation cache misses, and thus having different tags for the same VPN would result in many translation cache entries for the same page. This makes ignoring these bits for HWASAN not useful on Ssv64-only processors unless additional flexibility in virtual address matching is added to translation caches, potentially introducing new critical paths. Processors that support HWASAN and also implement Sv48 or Sv57 will necessarily have this extra ability to ignore 9 or 18 bits during the TLB matching process. This creates the opportunity to use bits 47..39 for HWASAN with Ssv64 when ssize ≤ 39.
Ssv64 is still useful when less than a 64‑bit address space is implemented by a processor, due to its other features. As an example, a fairly small Ssv64-only implementation might support only 50 bits as illustrated below: only ssize ≤ 39, with only 4 segment groups (sgen[2] to sgen[5] hardwired to 7) and only 512 segment numbers per segment group, for a total of 2048 segments. In this case, only bits 63, 61, 56..48, and 38..12 participate in translation cache matching, and it is possible for HWASAN to use 14 bits (bits 62, 60..57, and 47..39) without adding to the translation cache critical path. These bits are marked C as potential color bits in the figure. This is more bits than ARM’s TBI feature provides. However, without introducing critical paths in translation, the number of HWASAN bits decreases as the maximum address space implemented by a processor increases, making HWASAN either non-portable or requiring new critical paths.
63 | 62 | 61 | 60 | 57 | 56 | 48 | 47 | 39 | 38 | 0 | ||||
H | C | G | C | segment | C | offset | ||||||||
1 | 1 | 1 | 4 | 9 | 9 | 39 |
While HWASAN has proved useful, it is worth considering whether it is possible to go further, and provide a feature that avoids the additional checking code associated with HWASAN. Such a HWCHK feature might entail three registers giving the tag comparison table to use for three different segments (e.g. heap, stack, and globals), and implementing a hardware cache for tags by address, with misses in this case being filled from the specified tables. This is fairly expensive, as the tag cache involved would probably need to be at least 4 KiB. This feature would be sufficiently performant that it might be used in production code, rather than only in debugging code.
At times it can be useful to be able to execute untrusted code in an environment where that code has no direct access to the rest of the system, but where it can communicate with the system efficiently. Hierarchical protection domains (aka protection rings) provide an efficient way to provide such an environment. Imagine a web browser that wants to be able to download code from an untrusted source, perhaps use Just-In-Time Compilation to generate native code, and then execute it to provide some service as part of displaying the web page. The downloaded code should not be able to access any files or the state of the user browser. For this scenario on Ssv64, where ring 0 is the least privileged and ring 6 the most privileged (the opposite of the usual convention), the web browser might execute in ring 2, generate machine code to a segment that is writeable from ring 2, but only Read and Execute to ring 0, and then transfer to that ring 0 code. All rings share the same address space and TLB entries for a given process, but the ring brackets stored in the TLB change access to data based on the current ring of execution. Ring 0 would have access only to its code, stack, and heap segments, and nothing else. It would not be able to make system calls or access files, except indirectly by making requests to ring 2. The only access ring 0 would have outside of its three segments might be to call a limited set of gates in ring 2, causing a ring transition. Interrupts and such would be delivered to the browser in ring 2, allowing it to regain control in the event that the ring 0 code does not terminate. The browser and the rest of the system are completely protected from the code executing in ring 0. Because ring 0 is a subset of the address space of ring 2, ring 2 has complete access to all the data in ring 0, but ring 0 has access only to the segments granted to it by ring 2. Ring 2 has the option to grow or not grow the code, heap, and stack segments of ring 0 as appropriate.
One advantage of a more structured address space with segment descriptors is the room to support features that can take advantage of bits that won’t fit in Page Table Entries (PTEs). Languages such as Java, Julia, and Lisp rely on garbage collection (GC), which eliminates many programming errors that introduce bugs and vulnerabilities, and is therefore both a programming convenience and security feature. However, GC needs to be realtime and low-overhead, which can be achieved by including features for pointer tracking by generation and barriers to allow concurrent GC to be performed by one processor while another continues to run the application. This section outlines how Ssv64 can add support for efficient, realtime GC. One potentially new micro-architectural structure and a few new instructions are required.
For generational GC, new allocations are done in an area of memory that is analyzed frequently without scanning older allocations. Over time as this data ages, it may be moved to an area for older generation data. To work correctly, the pointers in the older areas that point to recent ones need to be known and used as roots for the areas containing more recent allocation. The processor hardware helps this process by taking an exception when a pointer to a newer generation is stored to an older area; the trap handler can note this pointer and then continue. The translation cache access for the store will provide both the generation dirty level for the target page and the generation number of the target segment. New load and store instructions are added that the compiler generates only for pointers to dynamically allocated memory. For the pointer store instruction the small Segment Attribute Cache provides the generation number of the pointer being stored. If the page does not yet contain more recent data than the pointer being stored, an exception occurs, similar to the exception used to transition the PTE dirty bit from 0→1. Ssv64 has support for 4 generations. Whether all such stores trap or only the first may depend on the GC algorithm; the trap handler can turn off traps to this page after the first trap by lowering the PTE GC field to the generation of the pointer being stored, or it can leave it unchanged to be informed of every such store.
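A behavioral sketch of that store-time check (illustrative only; I assume larger generation numbers are more recent, and a page GC value of 3 disables the check as stated above):

```c
#include <stdbool.h>

/* Pointer-store (SP/SG) check: trap when the pointer being stored refers to
 * a generation more recent than the target page's GC dirty level.  "More
 * recent" is modeled here as a larger generation number (an assumption);
 * a GC field of 3 disables the check, as stated in the text. */
static bool gc_store_traps(unsigned page_gc, unsigned ptr_generation)
{
    return page_gc != 3 && ptr_generation > page_gc;
}
```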
For concurrent GC, the scan and reclamation run in another thread, potentially on another processor, while the application continues its work. This creates the potential for the application to change certain data that the GC is operating on. To prevent this, application loads or stores to that data are noticed and handled differently (the details depend on the GC algorithm). Rather than introduce a new special-purpose check for this as on some other architectures, Ssv64 has the option to use segment rings, which is more general-purpose. Consider an example where the GC algorithm employs 4 generations, with two segments for each generation. When it is time to migrate the live data from the older segment of a generation to a newer segment, its ring is raised to the ring of the GC thread so that references by the application ring trap. When GC has completed its movement of live data, the ring number is lowered and eventually this segment can be used for new allocation, and eventually the roles of the segments can be reversed. When the amount of live data after GC becomes too large, either the segment size can be increased or data can be migrated to a segment with older generational data, thereby decreasing the frequency of GC for the current generation.
The Segment Attribute Cache suggested above can be rather small. If used for both segment bounds checking and GC, as an example, it might be only 1024 entries, 4‑way set associative, with the tag being the ASID and segment number, and the data being the segment size (6 bits) and the GC generation (2 bits). If segment bounds checking is not required and it is only used for GC, then this cache might be even smaller, perhaps just 64 entries, as not many segments would typically be used for GC-managed allocation. The load and store instructions proposed above might be named LP and SP or LG and SG depending on whether pointer or GC is being emphasized.