Version 0.1-draft-20230321
This document is a sort of whitepaper on virtual memory for large memory systems and a sort of rough draft specification/proposal for discussion.
In the author’s opinion, the existing RISC‑V virtual to physical translation mechanisms are unlikely to perform well on systems with very large memories. Much of this is due to paging, and so this document begins with the problems of RISC‑V virtual to physical translation and paging, especially translation cache (TLB) miss rates and penalties. These issues motivate the proposed solution.
This document also proposes a simplified version of the same mechanism for second-level translation, which the hypervisor specifies to translate guest operating system physical addresses to system physical addresses. This portion is preliminary. The author expects this to be particularly helpful, as most guest operating systems are given rather small guest physical address spaces, and this proposal allows a small single-level guest page table to be sufficient.
The solution to the translation performance problem involves providing a bit of structure to the address space that can also be useful in smaller systems. In particular, the structured address space can be used to provide more efficient Garbage Collection and sandboxing.
RISC‑V currently supports virtual address spaces up to 2^57 bytes with physical addresses up to 2^56 bytes. Eventually it will be necessary to support larger virtual and physical address spaces, for example for High Performance Computing (HPC). Currently 64‑bit RISC‑V has Sv39, Sv48, and Sv57 translation models for its supervisors using 3, 4, and 5‑level page tables with 512 PTEs per level for address spaces of −2^38..2^38−1, of −2^47..2^47−1, and of −2^56..2^56−1 respectively. An obvious extension to Sv64 using a 6‑level page table for an address space of −2^63..2^63−1 is likely someday. This is illustrated in the figures below.
63 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||
extend | VPN0 | VPN1 | VPN2 | byte | ||||||||||
25 | 9 | 9 | 9 | 12 |
63 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | ||||||
extend | VPN0 | VPN1 | VPN2 | VPN3 | byte | ||||||||||||
16 | 9 | 9 | 9 | 9 | 12 |
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
extend | VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | byte | ||||||||||||||
7 | 9 | 9 | 9 | 9 | 9 | 12 |
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | VPN5 | byte | ||||||||||||||
7 | 9 | 9 | 9 | 9 | 9 | 12 |
Neither Sv57 nor a hypothetical Sv64 as illustrated above is the best choice for large address space applications (e.g. HPC). A major issue with these virtual address spaces is page size, along with the organization of page tables and translation cache miss rates and miss penalties. This document therefore begins with a long discussion of paging issues. It then goes on to outline a proposal for an alternative 64‑bit address space translation mechanism for RISC‑V, called Ssv64, that improves efficiency for large address spaces and provides additional features that will benefit RISC‑V in the future.
A very brief summary of the pros and cons of the Ssv64 proposal may help set the stage for the reader.
63 | 48 | 47 | 0 | ||
region | interpretation depends on region descriptor | ||||
16 | 48 |
63 | 48 | 47 | 0 | ||||||||
region | fill | tableindex0 | offset | ||||||||
16 | 48−RS | PTS | RS−PTS |
A critical processor design decision is the choice of a page size or page sizes. If minimizing memory overhead is the criterion, it is well known that the optimal page size for an area of virtual memory is proportional to the square root of that memory size. In the 1960s, 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size to minimize the memory wasted by allocating in page units plus the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in the 2020s than in the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today with addresses twice as wide. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty, is then a factor in the page size consideration that did not exist in the 1960s.
In addition, the regions of memory in today’s computer systems vary wildly in size, with many processes having fairly small code regions, a small stack region, and a heap that may be small, large, or huge, sometimes depending on input parameters. Even in processors that support multiple page sizes, the size is often set for the entire system. When page size is variable at runtime, there may be only one value for the entire process virtual address space, which makes that value sub-optimal for code, stack, or heap, depending on which is chosen for optimization. Further, memory overhead is not the only criterion of importance. Larger page sizes minimize translation cache misses and therefore improve performance at the cost of memory wastage. Larger page sizes may also reduce the translation cache miss penalty when multi-level page tables are used (as is common today), by potentially reducing the number of levels to be read on a miss.
A major advantage of dividing the address space into regions is that it becomes possible to choose different paging structures on a per-region basis. Each shared library and the main program are individual mapped files containing code, and each could have a page size and levels appropriate to its size. The stack and heap regions can likewise have different page sizes from the code mapped files and each other. Choosing a page size based on the square root of the region size not only minimizes memory wastage, it can keep the page table a single level (just the root), which minimizes the translation cache miss penalty.
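As an illustration of the square-root rule, the following sketch is not part of the proposal; the candidate page sizes (4 KiB, 16 KiB, 256 KiB) and the 8 B PTE size are the ones discussed in this document, and the function name is my own.

```c
#include <stdint.h>

/* Candidate page sizes discussed in this document: 4 KiB, 16 KiB, 256 KiB. */
static const uint64_t page_sizes[] = { 1ull << 12, 1ull << 14, 1ull << 18 };

/* Return the smallest candidate page size P whose single-level table of
 * 8-byte PTEs (P/8 entries) covers the whole region, i.e.
 * region_size <= (P/8) * P.  This is the power-of-two analogue of
 * "page size proportional to sqrt(region size)". */
static uint64_t pick_page_size(uint64_t region_size)
{
    for (unsigned i = 0; i < sizeof page_sizes / sizeof page_sizes[0]; i++)
        if (region_size <= (page_sizes[i] / 8) * page_sizes[i])
            return page_sizes[i];
    return page_sizes[2];   /* larger regions need a multi-level table */
}
```

This picks 4 KiB pages for regions up to 2 MiB, 16 KiB pages up to 32 MiB, and 256 KiB pages up to 8 GiB, matching the table later in this section.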
There is a cost to implementing multiple page sizes in the operating system. A simple operating system may support only a single page size. This proposal supports such an operating system, but provides functionality for more sophisticated operating systems. In such systems, typically free lists are maintained for each page size, and when a smaller page free list is empty, a large page is split up. The reverse process, of coalescing pages, is more involved, as it may be necessary to migrate one or more small pages to put back together what was split apart. This however has been implemented in operating systems and made to work well.
There is also a cost to implementing multiple page sizes in translation caches (typically called TLBs though that is a terrible name). The most efficient hardware for translation caches would prefer a single page size, or failing that, a fairly small number of page sizes. Page size flexibility can affect critical processor timing paths (particularly in L1 translation caches). Despite this, the trend has been toward supporting a small number of page sizes. The RISC‑V vector architecture helps to address this issue, as vector loads and stores are not as latency sensitive as scalar loads and stores, and therefore can go directly to an L2 translation cache, which is both larger and, as a result of being larger, slower, and is therefore better able to absorb the cost of multiple page size matching. Much of the need for larger sizes occurs in applications with huge memory needs, and these applications are often able to exploit the vector architecture.
It may help to consider what page size options historical architectures have provided. According to Wikipedia, other 64‑bit architectures have supported the following page sizes:
Architecture | 4 KiB | 8 KiB | 16 KiB | 64 KiB | 2 MiB | 1 GiB | Other |
---|---|---|---|---|---|---|---|
MIPS | ✔ | | ✔ | ✔ | | | 256 KiB, 1 MiB, 4 MiB, 16 MiB |
x86-64 | ✔ | | | | ✔ | ✔ | |
ARM | ✔ | | ✔ | ✔ | ✔ | ✔ | 32 MiB, 512 MiB |
RISC‑V | ✔ | | | | ✔ | ✔ | 512 GiB, 256 TiB |
Power | ✔ | | | ✔ | | | 16 MiB, 16 GiB |
UltraSPARC | | ✔ | | ✔ | | | 512 KiB, 4 MiB, 32 MiB, 256 MiB, 2 GiB, 16 GiB |
IA-64 | ✔ | ✔ | ✔ | | | | 256 KiB, 1 MiB, 4 MiB, 16 MiB, 256 MiB |
Ssv64 | ✔ | | ✔ | ? | | | 256 KiB, 16 MiB? |
The only very common page size is 4 KiB, with 64 KiB, 2 MiB, and 1 GiB being somewhat common second page sizes. I suspect that 4 KiB has been carried forward from the 1960s for compatibility reasons, as there probably exists some application software with page size assumptions. It would be interesting to know how often UltraSPARC encountered porting problems with its 8 KiB minimum page size. Today 8 KiB or 16 KiB pages make more technical sense for a minimum page size, but application assumptions may suggest keeping the old 4 KiB minimum, and introducing at least one larger page size to reduce translation cache miss rates. Processors targeted at HPC will likely need at least a third page size (more on HPC page size below).
RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads (they are all derived from the 4 KiB page being used at each level of the table walk). These early outs do reduce translation cache miss penalties, but they do complicate TLB matching, as mentioned earlier. To RISC‑V’s credit, it introduced a new PTE format (under the Svnapot extension) that communicates to processors that can take advantage of it that groups of PTEs are consistent and can be implemented with a larger unit in the translation cache. Ssv64 incorporates this as a required feature (which saves a bit).
Even a huge memory system (e.g. HPC) will have many small regions (e.g. files mapped for libraries and the main program, stack and heap for medium-sized processes such as editors, command line interpreters, etc.), and a smaller page size, such as 8 KiB or 16 KiB, may be appropriate for these regions. However, 4 KiB is probably not so sub-optimal as to warrant incompatibility by not supporting this size. Therefore the question is what is the most appropriate other page size, or page sizes, besides 4 KiB (which supports up to 2 MiB with one level, and up to 1 GiB with two levels). If only one other page size were possible for all implementations, 256 KiB might be a good choice, since this supports region sizes up to 2^33 bytes with one level, and region sizes of 2^34 to 2^48 bytes with two levels. But not all implementations need to support physical memory appropriate to a ≥2^48‑byte working set. It is more appropriate to target an intermediate page size >4 KiB but <256 KiB, and then add the 256 KiB page size for processors targeted at huge processes.
As mentioned earlier, the page size that optimizes memory wastage for a single-level page table is proportional to the square root of the region size, and a single-level page table also minimizes the TLB miss penalty, with a 2-level page table being second best for TLB miss penalty. Ssv64’s goal is to allow the operating system to choose page sizes per region that keep the page tables to 1 or 2 levels. It is therefore interesting to consider what region sizes are supported with this criteria with various page sizes. This is illustrated in the following table, assuming an 8 B PTE:
Page Size | 1-Level | 2-Level | 3-Level | 1-Level bits | 2-Level bits | 3-Level bits |
---|---|---|---|---|---|---|
4 KiB | 2 MiB | 1 GiB | 512 GiB | 21 | 30 | 39 |
16 KiB | 32 MiB | 64 GiB | 128 TiB | 25 | 36 | 47 |
64 KiB | 512 MiB | 4 TiB | 32 PiB | 29 | 42 | 55 |
256 KiB | 8 GiB | 256 TiB | 8 EiB | 33 | 48 | 63 |
2 MiB | 512 GiB | 128 PiB | | 39 | 57 | 75 |
16 MiB | 32 TiB | | | 45 | 66 | 87 |
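The table entries follow from a simple closed form: with page/table size P and 8 B PTEs, each level holds P/8 entries, so an n-level table maps (P/8)^n pages of P bytes each. A small sketch of that calculation (illustrative only; the function name is mine):

```c
#include <stdint.h>

/* Bytes mapped by an n-level page table with page/table size `page` and
 * 8 B PTEs: (page/8)^n pages of `page` bytes each.  For example
 * bytes_mapped(16384, 2) = 2048 * 2048 * 16384 = 64 GiB, matching the
 * 16 KiB / 2-Level cell above.  (The blank cells overflow 64 bits.) */
static uint64_t bytes_mapped(uint64_t page, unsigned levels)
{
    uint64_t bytes = page;
    for (unsigned i = 0; i < levels; i++)
        bytes *= page / 8;      /* entries per level */
    return bytes;
}
```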
To recapitulate, it makes sense to choose a second page size in addition to the 4 KiB compatibility size to extend the range of 1 and 2‑level page tables for simple operating systems, and then allow implementations targeted at huge physical memories to employ even larger page sizes and page table sizes. In particular, Ssv64 proposes a 4 KiB page size intended for backward compatibility, but based on the above, the suggested page size is 16 KiB. Sophisticated operating systems that can do arbitrary power of two allocation will use single-level page tables and a page size per region based on the square root of the region size. Operating systems with intermediate levels of sophistication may primarily operate with a pool of 16 KiB pages, with a mechanism to split these into 4 KiB pages and coalesce these back for applications that require the smaller page size. Intermediate operating systems targeted at huge memory configurations will add a 256 KiB pool with splitting to and coalescing from the 16 KiB pool. The least sophisticated operating systems will continue to use the 4 KiB compatibility page size.
Ssv64 proposes three improvements on paging found in recent architectures. First, it allows region size specifications to reduce page table walk latency. Just because the maximum region size is 2^61 bytes doesn't mean that every region requires six levels of 4 KiB tables. Second, it allows the operating system to specify the sizes of tables used at each level of the page table walk, rather than tying this to the page size used in translation caches. Decoupling the non-leaf table sizes from the leaf page sizes provides a mechanism that sophisticated operating systems may use for better performance, and on such systems this reduces some of the pressure for larger page sizes. Large leaf page sizes are still however useful for reducing TLB miss rates, and as the third improvement, Ssv64 incorporates Svnapot and allows the operating system to indicate where larger pages can be exploited by translation caches to reduce miss rates, but without requiring that all implementations do so.
Region descriptors and non-leaf page tables give the table size to be used at the next level, which allows the operating system to employ larger or smaller tables to optimize tradeoffs appropriate to the implementation and the application. The table size of the leaf page table implies the page size of the PTEs therein. When the leaf page table is reached, the Svnapot feature allows portions to use larger page sizes. Some implementations may support additional page sizes beyond these basic two recommendations in their translation cache matching hardware, such as 64 KiB and 256 KiB, whereas others may simply synthesize smaller pages for the L1 translation caches when page tables specify larger pages. Implementations targeting huge memory systems and applications (e.g. HPC) may add even larger pages to target further reduced TLB miss rates. The paging architecture allows this flexibility with Page Table Size (PTS) encoding in region descriptors and non-leaf PTEs, and for leaf PTEs with Svnapot encoding that allows enabled translation caches to take advantage of multiple consistent page table entries.
As an example illustrating the above, given a region of 2^26 bytes, a sophisticated operating system might choose a single-level (just the root) page table of 4096 entries, each specifying pages of 2^14 bytes. There would be one region lookup followed by the root page table. On an Sv64 system, an operating system with a large-memory process would be forced to use a 5 or 6-level page table for this region.
High Performance Computing often performs operations on large two-dimensional matrices. For example, multiplying N×N matrices (e.g. A = A + B × C) requires O(N^3) floating-point multiply-add operations on O(N^2) data. These matrix calculations on paged memory can be challenging for translation caches, and page size determines how well translation caches can handle matrix operations. Matrix algorithms typically operate on smaller sub-blocks of the matrices to maximize data reuse (O(N^3) operations on O(N^2) data means O(N) data reuse is possible) and to fit into the more constraining of the L1 TLB and L2 data cache (with other, larger blocking done to fit into the L2 TLB and L3, and smaller blocking to fit into the register file). Matrices are often large enough that each row is in a different page for small page sizes. For an algorithm with 8 B or 16 B per element, each row is in a different page at the following column dimension:
Page size | Columns per page (8 B elem) | Columns per page (16 B elem) | Rows per page, 1024 columns (8 B elem) | Rows per page, 1024 columns (16 B elem) |
---|---|---|---|---|
4 KiB | 512 | 256 | 0.5 | 0.25 |
8 KiB | 1024 | 512 | 1 | 0.5 |
16 KiB | 2048 | 1024 | 2 | 1 |
64 KiB | 8192 | 4096 | 8 | 4 |
256 KiB | 32768 | 16384 | 32 | 16 |
For large computations (e.g. ≥1024 columns of 16 B elements), every row increment is going to require a new TLB entry for page sizes ≤16 KiB. Even a 16 KiB page with 16 B per element results in a TLB entry per row. For an L1 TLB of 32 entries and three matrices (e.g. matrix multiply A = A + B × C), the blocking needs to be limited to only 8 rows of each matrix (e.g. 8×8 blocking), which is on the low side for the best performance. In contrast, the 64 KiB page size fits 4 rows in a single page, and so allows 32×32 blocking for three matrices using 24 entries.
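The arithmetic behind these numbers can be captured in a small model (illustrative only; the 32-entry L1 TLB, three matrices, 1024 columns, and 16 B elements are the example parameters used above):

```c
#include <stdint.h>

/* TLB entries needed to map a `rows`-row block of one matrix, where each
 * full matrix row is `columns * elem_bytes` bytes and rows are separated by
 * that full stride, so consecutive rows share a page only when several
 * whole rows fit in one page. */
static uint64_t entries_for_block(uint64_t rows, uint64_t page,
                                  uint64_t columns, uint64_t elem_bytes)
{
    uint64_t row_bytes = columns * elem_bytes;
    if (row_bytes >= page)
        return rows;                    /* at least one entry per row */
    uint64_t rows_per_page = page / row_bytes;
    return (rows + rows_per_page - 1) / rows_per_page;
}

/* Example from the text: 1024 columns of 16 B elements.  With <=16 KiB pages,
 * three matrices of 8 rows each already need 3 * 8 = 24 of a 32-entry L1 TLB;
 * with 64 KiB pages (4 rows per page), 32-row blocks also need 3 * 8 = 24. */
```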
If the vector unit is able to use the L2 TLB rather than the L1 TLB for its translation, which is plausible, then these larger page sizes are not quite as critical. An L2 TLB is likely to have 128 or 256 entries, and so is able to hold 32 or 64 rows of 1024-column matrices of 16 B elements.
HPC experts might want to suggest an appropriate analysis for three dimensional data.
A possible goal for page size might be to balance the TLB and L2 cache sizes for matrix blocking. For example, a L2 cache size of 512 KiB can fit up to 100×100 blocks of three matrices of 16 B elements (total 480 KiB) given sufficient associativity. To fit 100 rows of 3 matrices in the L2 TLB requires ≥300 entries when pages are ≤16 KiB, but only ≥75 entries when pages ≥64 KiB. A given implementation should make similar tradeoffs based on the target applications and candidate TLB and cache sizes, and page size is another parameter that factors into the tradeoffs here. What is clear is that the architecture should allow implementations to efficiently support multiple page sizes if the translation cache timing allows it.
Because multiple page sizes do affect timing critical paths in the translation caches, and the timing paths of L1 translation caches are particularly critical for microprocessor clock rate, it is worth pointing out that implementations are able to reduce the page size stored in translation caches to match their matching hardware. An implementation could for example synthesize 16 KiB pages for the L1 translation cache even when the operating system specifies a 64 KiB page. This will however increase the miss rate. Conversely, some hardware may support an even larger set of page sizes. Ssv64 adopts the NAPOT encoding from RISC‑V’s PMPs and PTEs (with the Svnapot extension) to allow the TLB to use larger matching for groups of consistent PTEs without requiring it. Thus it is up to implementations whether to adopt larger page matching to lower the TLB miss rate at the cost of a potential TLB critical path. The cost of this feature is one bit in the PTE (taken from the bits reserved for software).
It may be helpful to consider how paging might work in a straight-forward six-level Sv64 (basically Sv57 with an additional first level of 128 entries). This would not perform well due to the six-level translation cache miss penalty. Very likely a system with applications requiring this huge address space would use a final 2 MiB page, reducing it to five levels. These two options are illustrated in the two figures below.
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | VPN5 | byte | ||||||||||||||
7 | 9 | 9 | 9 | 9 | 9 | 12 |
63 | 57 | 56 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 0 | ||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | byte | ||||||||||||
7 | 9 | 9 | 9 | 9 | 21 |
Changing the page size to 8 KiB allows the reduction from six/five levels to five/four as illustrated below:
63 | 53 | 52 | 43 | 42 | 33 | 32 | 23 | 22 | 13 | 12 | 0 | ||||||
VPN0 | VPN1 | VPN2 | VPN3 | VPN4 | byte | ||||||||||||
11 | 10 | 10 | 10 | 10 | 13 |
63 | 53 | 52 | 43 | 42 | 33 | 32 | 23 | 22 | 0 | |||||
VPN0 | VPN1 | VPN2 | VPN3 | byte | ||||||||||
11 | 10 | 10 | 10 | 23 |
We can get to three levels by using a 256 KiB page size in a straight-forward Sv64 as illustrated below:
63 | 48 | 47 | 33 | 32 | 18 | 17 | 0 | ||||
VPN1 | VPN2 | VPN3 | byte | ||||||||
16 | 15 | 15 | 18 |
While the 256 KiB page works well for huge memory applications, it is not appropriate for all processes that would run on these processors, or even for some portions of the address space of huge memory applications. What would be appropriate is being able to specify the page size to be used for different regions of the 64‑bit address space.
This page size discussion attempts to justify a 4 KiB compatibility page size and a 16 KiB preferred page size, with some large-memory (e.g. HPC) targeted processors adding support for 256 KiB pages. Processors might support still other page sizes, but L1 translation cache timing considerations suggest minimizing the number of choices. There are advantages to using different page sizes in various regions of a process address space, and it is advantageous to support decoupling of the non-leaf table sizes from the page size for sophisticated operating systems. It is also advantageous to reduce the number of levels of page table to reduce translation cache miss penalties, and this is possible if different regions of the address space have their own size.
Should it become possible to eliminate the 4 KiB compatibility page size in favor of a 16 KiB minimum page size, it may be appropriate to use the extra two bits to increase the physical address width to 66 bits.
Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). At some point this should be converted to SAIL syntax.
This section outlines a different 64‑bit virtual address space translation mechanism that solves the problems of Sv57 and a hypothetical Sv64. Later sections will go into more depth.
The above is achieved by dividing the 64‑bit address space into 65536 regions based on the top 16 bits of the address. These upper bits index a descriptor table, which controls the interpretation of the lower 48 bits. Each region is of variable size of up to 2^61 bytes, where regions >2^48 bytes require the supervisor to specify multiple consistent descriptors. After this level of the translation, either a direct mapping is used, or RISC‑V-like page tables are used. Direct-mapping is especially useful for I/O regions. Because region descriptors include a size, direct-mapping can be used for regions as small as 4 KiB, or as large as 2^61 bytes.
Given the region descriptors, it is possible to support new features that will not fit into the limited bits available in Page Table Entries (PTEs). Two features that can take advantage of region descriptors are support for garbage collection and generalization of the two levels of nesting in RISC‑V (supervisor and user) to four to eight levels with nesting of Read, Write, and Execute permission. Some of these levels may be used in user mode for things like sandboxing untrusted code or implementing concurrent garbage collection. In addition, it may be useful to have PMA overrides more general than RISC‑V’s Svpbmt, which is limited by the number of PTE bits available.
The model for the above features is actually taken from a 1960s processor architecture called Multics where regions were called segments and the permission nesting was called rings. I have avoided the words segments and rings in the exposition above to avoid preconceived notions the reader might have from some early microprocessors trying to extend their address spaces from tiny to small, which is quite different from the Multics approach. Multics segmentation is about better managing the existing address space, and that is what Ssv64 seeks to accomplish as well. There seems to be an impression among many in the computer architecture world that Multics virtual memory and protection were complex, when in fact they are simple, easy to implement, and general. Computer architecture from the 1980s to present has often been based on an oversimplification of Multics. For example, segmentation in Multics served to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) puts pressure to keep the number of bits devoted to access control minimal, when security considerations might suggest a more robust set. As another example, many contemporary processor architectures (e.g. RISC‑V) have two rings (User and Supervisor), with a single bit in PTEs (the U bit in RISC‑V) serving as a ring bracket. Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. It is true that rings were not well utilized on Multics, but we now have more uses for multiple rings, such as sandboxing and concurrent garbage collection.
For those familiar with Multics, the primary thing to know is that segment sizes are powers of two. In addition, Ssv64 has the option to support either 4 or 7 rings and inverts ring numbers so that ring 0 is the least privileged (ring 0 was the most privileged in Multics). Inverting ring numbers means that applications are unaware of how many rings exist above them and allows implementations to choose either 4 or 7 rings without affecting applications. The only reason for an implementation to use the smaller ring count is to save 3 bits in translation caches, as the ring mechanism is very low-cost except for the number of bits in the TLB. The two tables below illustrate possible ring assignments for the two options.
ring | use |
---|---|
0 | JIT sandbox (e.g. browser downloaded code, e.g. code being debugged) |
1 | non-JIT sandbox |
2 | user (e.g. browser, debugger) |
3 | supervisor |
ring | use |
---|---|
0 | JIT sandbox (e.g. browser downloaded code, e.g. code being debugged) |
1 | non-JIT sandbox |
2 | user (e.g. browser, debugger) |
3 | supervisor device drivers |
4 | supervisor |
5 | hypervisor device drivers |
6 | hypervisor |
7 | reserved for other purposes |
Page size flexibility and translation cache miss penalty are major design considerations, but these were addressed earlier. Here we look at a couple of other design considerations.
The maximum number of segments (65536) is chosen to be compatible with RISC-V Sv48 with up to four levels of page table when 4 KiB pages and tables are used. This number of segments is likely far more than required (which is likely to be as few as 2048). Since the tables (or hardware via WARL) can reduce the number, the large number of segments isn’t a real issue. However, if Sv48 compatibility is not needed, one might make other choices, e.g. 8 KiB pages, 2048 segments, and levels of 2^23, 2^33, 2^43, and 2^53 bytes for fixed table size operating systems.
At this point, rings and gates are not a required component of this proposal, but rather a feature that is enabled by this proposal. Rings and gates are not yet fully developed here. If this goes further, whether to include rings and gates would need to be considered. One advantage of these is that transitions between privilege levels can be accomplished without an exception, and exceptions have performance costs that a simple JALR does not. If rings are used for sandboxing, then gates eliminate the need for exception handlers for each ring, which would be helpful if sandboxing is desired. Rings are independent of modes (e.g. user-mode might encompass several rings). One issue is that 64‑bit pointers lack the bits to add a ring number for pointer parameters passed from lower privilege code to gates (Multics pointers had a ring number). For simple pointer arguments this is not a difficult issue to address. Linux handles this case by simply testing the sign of the pointers passed to system calls. Adding instructions to generalize this test might be sufficient. Testing of pointer arguments becomes more involved when interfaces involve pointers to pointers in complex data structures when pointers are passed to a higher privilege level and then passed in turn to a yet higher privilege level. Most interfaces between privilege levels have avoided this and instructions for testing access may suffice.
To explore the addition of rings and gates, let’s start with a ring CSR that holds the current ring in read-only bits 2:0 and the previous ring in read-write bits 6:4. A gate transition copies bits 2:0 to 6:4 and sets 2:0 to the R1 of the target segment. Gate code is aligned on 128 B boundaries in the first 512 KiB of the segment. It will begin by switching to a new stack, which will require a new scratch register to save the caller’s sp and some location from which to read the new sp, either a fixed address in memory or new per-ring CSRs. That done, the gate will execute a RINGR rd, rs1 for each pointer passed to the gate for reading and a RINGW rd, rs1 for each pointer argument passed to the gate for writing. RINGR sets rd to ring[6:4] < SDE1.R2 and RINGW sets rd to ring[6:4] < SDE1.R1 where R1 and R2 come from the translation cache for rs1. Any non-zero value indicates that the caller does not have access to read or write one of the parameters and signals an access fault. Once the arguments are verified, the gate calls the actual function written in a high-level language (or C++). When it returns, the gate restores the caller’s sp and the ring CSR, and returns with a new JALR that restores ring 2:0 from 6:4 (but not higher than ring 2:0). Exceptions to supervisor and machine mode would have to set the ring CSR appropriately (e.g. saving the old value in bits 10:8). This section outlines what ring and gate support might look like, but I am sure that there are plenty of details that need to be filled in. Plus Linux would require a fair bit of support (e.g. to save/restore the ring CSR, put the ring number into sigcontext, and so on), and disallow EBREAK from sandboxed code, and probably much more.
31 | 16 | 15 | 14 | 12 | 11 | 10 | 8 | 7 | 6 | 4 | 3 | 2 | 0 | |||||
0 | 0 | mepc | 0 | sepc | 0 | caller | 0 | current | ||||||||||
16 | 1 | 3 | 1 | 3 | 1 | 3 | 1 | 3 |
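To make the exploration above concrete, the following is a minimal behavioral model in C of the gate transition and the RINGR/RINGW tests (the C representation and names are purely illustrative; the CSR field positions follow the figure above, and the R1/R2 brackets come from the target segment’s descriptor via the translation cache):

```c
#include <stdbool.h>

/* ring CSR fields from the figure: current ring in bits 2:0 (read-only),
 * caller's (previous) ring in bits 6:4 (read-write). */
struct ring_csr { unsigned current : 3, pad : 1, caller : 3; };

/* Gate transition: remember the caller's ring and drop to the target
 * segment's R1 bracket. */
static void gate_enter(struct ring_csr *r, unsigned target_r1)
{
    r->caller = r->current;
    r->current = target_r1;
}

/* RINGR rd,rs1 / RINGW rd,rs1: rd is non-zero when the caller may NOT read
 * (respectively write) through the pointer in rs1, using the R2/R1 brackets
 * of the segment containing rs1 (reads allowed in [R2:6], writes in [R1:6]). */
static bool ringr_denied(struct ring_csr r, unsigned seg_r2) { return r.caller < seg_r2; }
static bool ringw_denied(struct ring_csr r, unsigned seg_r1) { return r.caller < seg_r1; }
```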
Ssv64 virtual addresses are interpreted as follows, where PS is the page size implied by the segment size (the ssize field of the SDE) and by the PTS fields of the page tables created by the operating system. In particular, PS is ssize−PTS0 for a PTE in a single-level page table, ssize−PTS0−PTS1 for a PTE in a two-level page table, and ssize−PTS0−PTS1−…−PTSn−1 for a PTE in an n‑level page table (1≤n≤4). This page size can be increased for a subset of pages within the last level by using the Svnapot feature. The resulting PS must be in the range 47..12. Translation caches may reduce the calculated PS to the next lower supported value. Page sizes ≥48 bits are not supported because segment direct-mapping would be used instead.
63 | 61 | 60 | 48 | 47 | 0 | |||||||||
sg | segment | fill | VPN | byte | ||||||||||
3 | 13 | 48−ssize | ssize−PS | PS |
where ssize is the segment size for the segment, PS is the page size given by the segment mapping, and fill is all 0s for upward growing segments and all 1s for downward growing segments.
Field | Width | Bits | Description |
---|---|---|---|
sg | 3 | 63:61 | Segment group |
segment | 13 | 60:48 | Segment in group |
fill | max(0,48−ssize) | 47:48−ssize | Must be downward^(48−ssize) (all 0s for upward growing, all 1s for downward growing) if ssize < 48 |
VPN | ssize−PS | ssize−1:PS | Page in segment |
byte | PS | PS−1:0 | Byte in page |
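A sketch of the leaf page size calculation just described (the PTS values are those encountered along the walk; clamping the result to a supported size is the implementation option mentioned above, and the function name is mine):

```c
/* Leaf page size (as a power of two) implied by a page table walk:
 * ssize minus the PTS value used at each of the 1..4 levels walked.
 * The result must be in 12..47 or the translation page faults; a
 * translation cache may then round it down to a size it supports. */
static int leaf_page_bits(int ssize, const int pts[], int levels)
{
    int ps = ssize;
    for (int i = 0; i < levels; i++)
        ps -= pts[i];
    return (ps >= 12 && ps <= 47) ? ps : -1;   /* -1 models the page fault */
}
```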
For example, for a ssize of 39 and the 4 KiB compatibility page size, the interpretation of a virtual address is as follows:
63 | 61 | 60 | 48 | 47 | 39 | 38 | 12 | 11 | 0 | |||||
sg | segment | fill | VPN | byte |
3 | 13 | 9 | 27 | 12 |
Translation begins with the eight sdtp registers (sdtp[0] to sdtp[7]), which serve a similar function to satp in RISC‑V’s Sv48. Each register provides a pointer to a Segment Descriptor Table for one eighth of the address space, the size of that table encoded in NAPOT-fashion, and an Address Space Identifier (ASID).
63 | 12 | 11 | 0 | |||||
paddr63..13+SGS | 2^SGS | ASID |
51−SGS | 1+SGS | 12 |
Field | Width | Bits | Description |
---|---|---|---|
ASID | 12 | 11:0 | Address Space Identifier for the Segment Group |
paddr63..13+SGS | 51-SGS | 63:13+SGS | Physical address of SDT for Segment Group |
The size of the segment group is 512×2^SGS segments, where SGS is given by the number of zero bits starting at bit 12. Some implementations might reduce the number of bits in their TLBs by hardwiring bit 12 to 1, thus allowing only 512 segments per segment group. SGS must be ≤4 or a page fault occurs. Segment groups may be disabled entirely using the sgen register. The Segment Descriptor Table for the group is specified in a table of 16 B entries at the specified physical address, which is aligned to the size of the group.
Specifically, the segment group bounds check is vaddr60..48 < 2^(9+SGS). If the bounds check fails, a page fault exception is taken. If the bounds check succeeds, the 16 B Segment Descriptor Entry is read from (sdtp[vaddr63..61]63..13+SGS ∥ 0^(SGS+13)) | (vaddr60..48 ∥ 0^4).
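A minimal model of this lookup (the helper names are mine; the field positions are exactly those given above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract bits hi..lo of a 64-bit value (hi > lo, hi < 64). */
static inline uint64_t bits(uint64_t v, int hi, int lo)
{
    return (v >> lo) & ((1ull << (hi - lo + 1)) - 1);
}

/* Bounds-check the segment number against the group size (SGS <= 4) and
 * form the physical address of the 16 B Segment Descriptor Entry. */
static bool sde_address(uint64_t vaddr, uint64_t sdtp_value, int sgs,
                        uint64_t *sde_pa)
{
    uint64_t segment = bits(vaddr, 60, 48);
    if (segment >= (1ull << (9 + sgs)))
        return false;                              /* page fault */
    uint64_t sdt_base = bits(sdtp_value, 63, 13 + sgs) << (13 + sgs);
    *sde_pa = sdt_base | (segment << 4);           /* 16 B entries */
    return true;
}
```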
The sdtp registers can be used to match the address space usage common in other architectures. Consider an architecture with just two levels, user and supervisor, where user addresses are ≥0 and supervisor addresses are <0. Supervisor addresses can use pages with a Global bit set to ignore ASID matching for supervisor common data, or leave Global clear for per-process supervisor data. All user addresses have Global clear. To match such usage, ASID=0 is used instead of Global=1. Segment group 0 is used for user addresses with the temporarily assigned ASID, and segment group 6 is used with the same ASID for supervisor per-process data. Segment group 7 is used for supervisor common addresses with ASID 0. Such a system might set sdtp[7] at initialization, change sdtp[0] and sdtp[6] on process switch, and leave the other five groups disabled.
A more sophisticated supervisor might attempt to get Instruction TLB sharing between user processes by mapping shared libraries using segment group 1 and ASID 0, while leaving segment group 0 for per-process data. Segment group 1 would be identical in every user process address space, so sdtp[1] would not be changed on process switch.
This section is very preliminary at this point.
The sgen CSR controls which modes or rings (which is TBD) can write the various sdtp registers, with three bits per sdtp register. Reads or writes to the sdtp[i] register or its shadows trap if the current mode or ring number is less than sgen bits i×8+2..i×8. It is possible to provide read access separate from write access, but the need for this is unclear.
It is TBD whether to implement eight fields of 4 bits (allowing expansion to sixteen Segment Groups in the future) or eight fields of 8 bits (allowing new Segment Group Enable functionality per group in the future). The following illustrates the sgen CSR with 8‑bit fields and control by ring number. The alternatives are left to the imagination of the reader.
If sgen enables/disables on ring numbers, setting the ring number to 7 disables the corresponding sdtp register altogether.
Each Segment Group field is WARL, and some may be hardwired, including to be disabled. If some sgen fields are hardwired to 7, then the corresponding sdtp registers need not exist, nor the TLB bits required for that.
63 | 56 | 55 | 48 | 47 | 40 | 39 | 32 | 31 | 24 | 23 | 16 | 15 | 8 | 7 | 0 | ||||||||
sg7 | sg6 | sg5 | sg4 | sg3 | sg2 | sg1 | sg0 | ||||||||||||||||
8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
7 | 6 | 3 | 2 | 0 | ||
L | F | ring | ||||
1 | 4 | 3 |
Field | Width | Bits | Description |
---|---|---|---|
ring | 3 | 2:0 | Ring for which sdtp is enabled |
F | 4 | 6:3 | Reserved for future use |
L | 1 | 7 | Lock |
It may be appropriate to implement a lock bit in bit 7 of each field, similar to pmpcfg.
The segment descriptor can be thought of as the root page table of the translation, but with a 16 B descriptor instead of an 8 B PTE. The first 8 B of the descriptor is made very similar to the PTE format, with the extra permissions, attributes, etc. in the second 8 B of the descriptor. The PTE format in turn is made mostly compatible with the Sv39, Sv48, and Sv57 PTE formats by using two reserved XWR values (2 and 6).
Segment Descriptor Table (SDT) entries consist of two doublewords. The first doubleword has a format similar to Page Table Entries (PTEs), and the second doubleword is used for segment size and permissions. The XWR field of the Segment Descriptor Entry (SDE) is used to distinguish direct-mapped segments (primarily intended for mapping I/O regions of the address space) and paged memory translation. In both cases the ssize field is used to bounds check the reference when ssize < 48 (no check is performed if ssize ≥ 48). In particular, bits 47..ssize are tested. The test depends on the Fill Check (FC) field of the SDE. If FC is 0, then bits 47..ssize must all be zero. If FC is 1, then bits 47..ssize must all be set. If FC is 2, then bits 47..ssize are ignored.
When ssize > 48, the supervisor is required to make all 2^(ssize−48) SDEs for the segment identical. (An alternative would be to require that paddrssize−1..48 match the corresponding bits of the segment number for direct-mapped translation?)
63 | 3 | 2 | 1 | 0 | ||||
paddr63..ssize | 2^(ssize−4) | 1 | V |
64−ssize | ssize−3 | 2 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 127..1 for software use 1 ⇒ Valid, bits 127..1 interpreted as follows |
WR | 2 | 2:1 |
0 Reserved 1 ⇒ Direct-mapped (this case) 2 ⇒ Paged, pointer to root page table for the segment (see below) 3 Reserved |
2^(ssize−4) | ssize−3 | ssize−1:3 | Same encoding as page table size, but must match segment size ssize |
paddr63..ssize | 64−ssize | 63:ssize | Physical address of direct-mapping |
The direct-mapping is defined as:
    ssize ← SDE1[5..0]
    downward ← (SDE1[7..6] = 1)
    if ssize < 12 | ssize > 61 | SDE0[ssize−1..3] ≠ 1∥0^(ssize−4) then
        Exception(PageFault)
    else if ssize < 48 & vaddr[47..ssize] ≠ downward^(48−ssize) then
        Exception(PageFault)
    else
        paddr ← SDE0[63..ssize] ∥ vaddr[ssize−1..0]
    endif

Here SDE0 and SDE1 denote the first and second doublewords of the Segment Descriptor Entry (the first doubleword holds the physical address and size encoding, the second holds ssize, FC, and permissions).
63 | 3 | 2 | 1 | 0 | ||||
paddr63..4+PTS | 2^PTS | 2 | V |
60−PTS | 1+PTS | 2 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 127..1 for software use 1 ⇒ Valid, bits 127..1 interpreted as follows
WR | 2 | 2:1 |
0 Reserved 1 ⇒ Direct-mapped (see above) 2 ⇒ Paged, pointer to root page table for the segment (this case) 3 Reserved
2^PTS | 1+PTS | 3+PTS:3 |
Root page table size encoding. Table size of root page table is 2^(1+PTS) entries (2^(4+PTS) bytes).
paddr63..4+PTS | 60−PTS | 63:4+PTS | Physical address of direct-mapping or root page table |
Note: If it reduces implementation cost, it seems reasonable to change PTS ≥ 32 to be reserved.
63 | 56 | 55 | 48 | 47 | 46 | 45 | 44 | 43 | 32 | 31 | 30 | 27 | 26 | 24 | 23 | 22 | 20 | 19 | 18 | 16 | 15 | 11 | 10 | 8 | 7 | 6 | 5 | 0 | ||||||||||
0 | PMAO | G1 | G0 | Gates | B | 0 | R3 | 0 | R2 | 0 | R1 | 0 | XWR | FC | ssize | |||||||||||||||||||||||
8 | 8 | 2 | 2 | 12 | 1 | 4 | 3 | 1 | 3 | 1 | 3 | 5 | 3 | 2 | 6 |
Field | Width | Bits | Description |
---|---|---|---|
ssize | 6 | 5:0 |
log2 of segment size in bytes (12..61); 0..11 Reserved; 62..63 Reserved
FC | 2 | 7:6 |
Fill check (must be 0 if ssize ≥ 48) 0 ⇒ address bits 47..ssize must be 0 (upward growing segment) 1 ⇒ address bits 47..ssize must be 1 (downward growing segment) 2 ⇒ address bits 47..ssize ignored (e.g. may be used for HWASAN) 3 Reserved (e.g. could be HWACHK or sign-extend check)
XWR | 3 | 10:8 |
Execute Write Read permission
R1 | 3 | 18:16 | Ring bracket 1 (as described below) |
R2 | 3 | 22:20 | Ring bracket 2 (as described below) |
R3 | 3 | 26:24 | Ring bracket 3 (as described below) |
B | 1 | 31 |
PTE backward compatibility: 0 ⇒ bits 4..5 interpreted as GC dirty as described below 1 ⇒ bit 4 is a Sv39-compatible U bit, and bit 5 is a Sv39-compatible G bit
Gates | 12 | 43:32 |
Gate count. Gate transition only allowed if target6..0 = 0 & target47..7 < Gates.
G0 | 2 | 45:44 | Garbage collection generation | ||||||||||||||||
G1 | 2 | 47:46 | Garbage collection dirty | ||||||||||||||||
PMAO | 8 | 55:48 | PMA override, addition, hints, etc. (e.g. PBMT) |
Each segment has three 3‑bit ring numbers, R1, R2, and R3, stored in the segment descriptor table and used for bracketing accesses by the ring of execution, in addition to the Read, Write, and Execute permissions from the segment descriptor table; they also specify gate access permission. To reiterate, Ssv64 inverts the ring number to privilege mapping chosen by Multics: ring 6 is the most privileged and ring 0 the least privileged. Typically R3≤R2≤R1. Writes are permitted when the current ring of execution is in [R1:6], reads in [R2:6], execution in [R2:R1], and calls to gates in [R3:R2−1]*. Because not all eight rings are required, Ssv64 reserves the value 7 for other uses.
* The ring number of the caller and the ring brackets of the target segment are used to calculate the new ring number of execution, as per the Multics documentation, modified for the inverted ring order.
The gate test criterion cited above requires that the target address be 128 B aligned (bits 6..0 are zero) and that bits 47..7 be less than the segment’s gate count field in the segment descriptor entry.
What | R1,R2,R3 | seg RWX | R bracket | W bracket | X bracket | G bracket | Ring 0 | Ring 1 | Ring 2 | Ring 3 | Rings 4 to 6 |
---|---|---|---|---|---|---|---|---|---|---|---|
User code | 2,2,2 | R-X | [2,6] | - | [2,2] | - | ---- | ---- | R-X- | R--- | R--- |
User execute only | 2,2,2 | --X | - | - | [2,2] | - | ---- | ---- | --X- | ---- | ---- |
User stack or heap | 2,2,2 | RW- | [2,6] | [2,6] | - | - | ---- | ---- | RW-- | RW-- | RW-- |
User read-only file | 2,2,2 | R-- | [2,6] | - | - | - | ---- | ---- | R--- | R--- | R--- |
Compiler library | 6,0,0 | R-X | [0,6] | - | [0,6] | - | R-X- | R-X- | R-X- | R-X- | R-X- |
Supervisor driver code | 4,3,3 | R-X | [3,6] | - | [3,4] | - | ---- | ---- | ---- | R-X- | R-X- |
Supervisor driver data | 3,3,3 | RW- | [3,6] | [3,6] | - | - | ---- | ---- | ---- | RW-- | RW-- |
Supervisor code | 4,3,4 | R-X | [3,6] | - | [3,4] | - | ---- | ---- | ---- | R-X- | R-X- |
Supervisor heap or stack | 4,4,4 | RW- | [4,6] | [4,6] | - | - | ---- | ---- | ---- | ---- | RW-- |
Supervisor gates for user | 4,4,2 | R-X | [4,6] | - | [4,4] | [2,3] | ---- | ---- | ---G | ---G | R-X- |
Sandboxed JIT code | 1,0,0 | RWX | [0,6] | [1,6] | [0,1] | - | R-X- | RWX- | RW-- | RW-- | RW-- |
Sandboxed JIT stack or heap | 0,0,0 | RW- | [0,6] | [0,6] | - | - | RW-- | RW-- | RW-- | RW-- | RW-- |
Sandboxed non-JIT code | 1,1,1 | R-X | [1,6] | - | [1,1] | - | ---- | R-X- | R--- | R--- | R--- |
User gates for sandboxes | 2,2,0 | R-X | [2,6] | - | [2,2] | [0,1] | ---G | ---G | R-X- | R--- | R--- |
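The rows of the table above can be generated mechanically from the bracket rules; the following sketch (illustrative only, with names of my own choosing) evaluates them, with the segment’s XWR permissions ANDed in:

```c
#include <stdbool.h>

/* Ring brackets and segment permissions from the second doubleword of an SDE. */
struct seg_access { unsigned r1, r2, r3; bool r, w, x; };

/* Bracket rules from the text: writes permitted in [R1:6], reads in [R2:6],
 * execution in [R2:R1], and gate calls in [R3:R2-1]; the segment's XWR
 * permissions further restrict reads, writes, and execution. */
static bool can_read (struct seg_access s, unsigned ring) { return s.r && ring >= s.r2; }
static bool can_write(struct seg_access s, unsigned ring) { return s.w && ring >= s.r1; }
static bool can_exec (struct seg_access s, unsigned ring) { return s.x && ring >= s.r2 && ring <= s.r1; }
static bool can_gate (struct seg_access s, unsigned ring) { return ring >= s.r3 && ring <  s.r2; }
```

For instance, the “Supervisor gates for user” row (R1,R2,R3 = 4,4,2) yields gate access exactly from rings 2 and 3, as shown in the table.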
Page Table Entries (PTEs) are similar to Sv57, Sv48, and Sv39 PTE formats except for the following changes:
63 | 3 | 2 | 1 | 0 | ||||
paddr63..4+PTS | 2^PTS | 2 | V |
60−PTS | 1+PTS | 2 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 63..1 for software use 1 ⇒ Valid, next-level page table specified by paddr63..4+PTS
2^PTS | 1+PTS | 3+PTS:3 |
NAPOT next level table size encoding. Table size of next level is 2^(1+PTS) entries (2^(4+PTS) bytes).
paddr63..4+PTS | 60−PTS | 63:4+PTS | Physical address of next level page table |
Note: If it reduces implementation cost, it seems reasonable to change PTS ≥ 32 to be reserved.
63 | 11 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 1 | 0 | ||||||
paddr63..12+S | 2^S | RSW | D | A | GC | XWR | V |
52−S | 1+S | 3 | 1 | 1 | 2 | 3 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Page Fault, bits 63..1 for software use 1 ⇒ Valid, bits 63..1 as described below |
XWR | 3 | 3:1 |
Execute, Write, Read permissions 0, 2, 6 Reserved These further restrict (logical-and with) the permissions in the Segment Descriptor ring brackets. |
GC | 2 | 5:4 |
Garbage Collection dirty A page fault trap results if a pointer to a generation more recent than this value is stored to the page. Setting this field to 3 prevents GC traps. If the backward compatibility bit is set in the Segment Descriptor Entry, then this field reverts to a more Sv48-compatible interpretation: U-mode software may only access the page when bit 4 is 1. ASID matching is ignored when bit 5 is 1 (implementations of Ssv64-only without an explicit G bit in the TLB may implement this by setting ASID to 0). |
A | 1 | 6 | Accessed |
D | 1 | 7 | Dirty |
RSW | 3 | 10:8 | For software use |
2^S | 1+S | 11+S:11 | NAPOT page size encoding |
paddr63..12+S | 52−S | 63:12+S | Physical Page Number |
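Decoding the NAPOT size field of a leaf PTE might look like the following sketch (illustrative; the function name is mine, and S is bounded because the resulting page size must not exceed 2^47 bytes):

```c
#include <stdint.h>

/* Decode the NAPOT size field of a leaf PTE: S is the number of zero bits
 * starting at bit 11 below the terminating 1 bit, giving a 2^(12+S)-byte
 * page whose physical page number starts at bit 12+S. */
static int leaf_pte_napot_s(uint64_t pte)
{
    int s = 0;
    while (s < 36 && ((pte >> (11 + s)) & 1) == 0)
        s++;
    return s;
}
/* Example: bit 11 set means S = 0, a 4 KiB page; bits 12..11 clear with
 * bit 13 set means S = 2, a 16 KiB page. */
```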
This section compares translation caching between various choices. There are two sides to a translation cache (TLB): the match side and the translation side. For comparison purposes, this comparison uses an ASID length of 12 bits on the match side and a physical address size of 47 (Sv39 only) or 56 bits on the translation side. The following table lists what needs to be stored in the translation cache. The Ssv64 GC and Gates features are assumed not present. The three Ssv64 configurations are: Ssv64min with 4 rings, 4 segment groups, and 512 segments per group; Ssv64max with 7 rings, 8 segment groups, and 8192 segments per group; and finally Ssv64max+, which adds all the possible features Ssv64 might allow, including GC and Gates. The Ssv64 configuration chosen in the rows below is sized to be as minimal in size as the Sv entries it is paired with. Svnapot is ignored for this comparison. The PS column represents the number of bits required for page sizes for Sv configurations, and desired for Ssv configurations. Set associative translation caches would of course not store the virtual address bits used to select the set. This table also does not include the bits used for replacement (e.g. pseudo-LRU bits for set-associative caches). Other bits not included might be associated with hypervisor VMID, machine (M) mode unmapped entries (if any), parity or ECC, page size reduction, etc. Note that Ssv64 has the potential (but no definition as of yet) to extend PBMT from 2 bits up to 8.
One other caveat concerns the common micro-architecture choice to split translation caches into separate structures for instruction and data. Instruction and data translation caches require different subsets of the bits enumerated below. For example, the RW, D, and GC fields would be data only and the R3, X, and Gates fields would be instruction only.
Mode | PS | G | ASID | vaddr | Match total | U | AD | XWR | PBMT | paddr | Rn | other | Translation total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sv39 only | 2 | 1 | 12 | 38:12 | 42 | 1 | 2 | 3 | 2 | 46:12 | 0 | 0 | 43 |
Sv48 only | 2 | 1 | 12 | 47:12 | 51 | 1 | 2 | 3 | 2 | 55:12 | 0 | 0 | 52 |
Sv57 only | 3 | 1 | 12 | 56:12 | 61 | 1 | 2 | 3 | 2 | 55:12 | 0 | 0 | 52 |
Sv64 only | 3 | 1 | 12 | 63:12 | 68 | 1 | 2 | 3 | 2 | 55:12 | 0 | 0 | 52 |
Sv39 + Ssv64min | 3 | 1 | 12 | 63,61,56:48,38:12 | 54 | 1 | 2 | 3 | 2 | 46:12 | 6 | 0 | 49 |
Sv48 + Ssv64min | 3 | 1 | 12 | 63,61,56:12 | 63 | 1 | 2 | 3 | 2 | 55:12 | 6 | 0 | 58 |
Sv57 + Ssv64min | 3 | 1 | 12 | 63,61,56:12 | 63 | 1 | 2 | 3 | 2 | 55:12 | 6 | 0 | 58 |
Sv64 + Ssv64max | 3 | 1 | 12 | 63:12 | 68 | 1 | 2 | 3 | 2 | 55:12 | 9 | 0 | 61 |
Ssv64min only | 2 | 0 | 12 | 63:61,56:12 | 62 | 0 | 2 | 3 | 2 | 46:12 | 6 | 0 | 48 |
Ssv64max only | 2 | 0 | 12 | 63:12 | 66 | 0 | 2 | 3 | 2 | 55:12 | 9 | 0 | 60 |
Ssv64max+ only | 2 | 0 | 12 | 63:12 | 66 | 0 | 2 | 0 | 8 | 55:12 | 9 | 16 | 79 |
The hypervisor should be able to use a PTE mechanism compatible with the first level, but does not need the segmentation and ring mechanisms. What is proposed here is therefore a simplified version of the first-level translation. The segment group is eliminated in favor of a CSR that gives the size of the guest physical address space as the gsize field of the hgas CSR (analogous to ssize in Segment Descriptor Entries) and a single table for that 2^gsize guest physical address space. The range of gsize is 19..64. It is unlikely that gsize = 64 needs to be supported, and so this field of hgas should be WARL to allow implementations to choose a smaller size. There is no direct-mapping option in the hgapt CSR, as typically the guest would be given I/O and main memory regions, which requires a page table.
63 | 7 | 6 | 0 | ||
0 | gsize | ||||
57 | 7 |
63 | 12 | 11 | 3 | 2 | 1 | 0 | ||
paddr63..4+PTS | 2^PTS | 2 | V |
60−PTS | 1+PTS | 2 | 1 |
The second-level non-leaf PTE format is identical to the first-level non-leaf PTE format. The second-level leaf PTE format is identical to the first-level leaf PTE format except that the GC bits are redefined as PBMT bits (details TBD).
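A sketch of the bounds check implied by hgas.gsize (analogous to the first-level ssize check; the function name is mine):

```c
#include <stdbool.h>
#include <stdint.h>

/* Guest physical addresses must fit in the 2^gsize guest physical address
 * space given by hgas.gsize before the single second-level page table rooted
 * at hgapt is walked. */
static bool gpa_in_range(uint64_t gpa, unsigned gsize)
{
    return gsize >= 64 || (gpa >> gsize) == 0;
}
```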
AddressSanitizer is a method for detecting programming errors. It has been useful enough that processor additions have been made to reduce its performance penalty and make it more usable. One is Hardware-assisted AddressSanitizer (HWASAN), which uses unused bits in pointers to add a tag that is compared to the tag of the memory pointed to. The hardware assist refers to ISA features that allow pointer bits to be ignored (e.g. ARM’s Top Byte Ignore (TBI) option). This raises the question of whether this is possible with Ssv64.
Given that the point of a full Ssv64 implementation is to provide a 64‑bit address space, the only bits that can be ignored are the fill bits between the segment number and the segment size (bits 47..ssize). However, these bits would still be used for matching in translation caches, and can only be cheaply ignored on translation cache misses, and thus having different tags for the same VPN would result in many translation cache entries for the same page. This makes ignoring these bits for HWASAN not useful on Ssv64-only processors unless additional flexibility in virtual address matching is added to translation caches, potentially introducing new critical paths. Processors that support HWASAN and also implement Sv48 or Sv57 will necessarily have this extra ability to ignore 9 or 18 bits during the TLB matching process. This creates the opportunity to use bits 47..39 for HWASAN with Ssv64 when ssize ≤ 39.
Ssv64 is still useful when less than a 64‑bit address space is implemented by a processor, due to its other features. As an example, a fairly small Ssv64-only implementation might support only 50 bits as illustrated below: only ssize ≤ 39, with only 4 segment groups (sgen[2] to sgen[5] hardwired to 7) and only 512 segment numbers per segment group, for a total of 2048 segments. In this case, only bits 63, 61, 56..48, and 38..12 participate in translation cache matching, and it is possible for HWASAN to use 14 bits (bits 62, 60..57, and 47..39) without adding to the translation cache critical path. These bits are marked C as potential color bits in the figure. This is more bits than ARM’s TBI feature provides. However, without introducing critical paths in translation, the number of HWASAN bits decreases as the maximum address space implemented by a processor increases, making HWASAN either non-portable or requiring new critical paths.
63 | 62 | 61 | 60 | 57 | 56 | 48 | 47 | 39 | 38 | 0 | ||||
H | C | G | C | segment | C | offset | ||||||||
1 | 1 | 1 | 4 | 9 | 9 | 39 |
While HWASAN has proved useful, it is worth considering whether it is possible to go further, and provide a feature that avoids the additional checking code associated with HWASAN. Such a HWCHK feature might entail three registers giving the tag comparison table to use for three different segments (e.g. heap, stack, and globals), and implementing a hardware cache for tags by address, with misses in this case being filled from the specified tables. This is fairly expensive, as the tag cache involved would probably need to be at least 4 KiB. This feature would be sufficiently performant that it might be used in production code, rather than only in debugging code.
At times it can be useful to be able to execute untrusted code in an environment where that code has no direct access to the rest of the system, but where it can communicate with the system efficiently. Hierarchical protection domains (aka protection rings) provide an efficient way to provide such an environment. Imagine a web browser that wants to be able to download code from an untrusted source, perhaps use Just-In-Time Compilation to generate native code, and then execute it to provide some service as part of displaying the web page. The downloaded code should not be able to access any files or the state of the user browser. For this scenario on Ssv64, where ring 0 is the least privileged and ring 6 the most privileged (the opposite of the usual convention), the web browser might execute in ring 2, generate machine code to a segment that is writeable from ring 2, but only Read and Execute to ring 0, and then transfer to that ring 0 code. All rings share the same address space and TLB entries for a given process, but the ring brackets stored in the TLB change access to data based on the current ring of execution. Ring 0 would have access only to its code, stack, and heap segments, and nothing else. It would not be able to make system calls or access files, except indirectly by making requests to ring 2. The only access ring 0 would have outside of its three segments might be to call a limited set of gates in ring 2, causing a ring transition. Interrupts and such would be delivered to the browser in ring 2, allowing it to regain control in the event that the ring 0 code does not terminate. The browser and the rest of the system are completely protected from the code executing in ring 0. Because ring 0 is a subset of the address space of ring 2, ring 2 has complete access to all the data in ring 0, but ring 0 has access only to the segments granted to it by ring 2. Ring 2 has the option to grow or not grow the code, heap, and stack segments of ring 0 as appropriate.
One advantage of a more structured address space with segment descriptors is the room to support features that can take advantage of bits that won’t fit in Page Table Entries (PTEs). Languages such as Java, Julia, and Lisp rely on garbage collection (GC), which eliminates many programming errors that introduce bugs and vulnerabilities, and is therefore both a programming convenience and security feature. However, GC needs to be realtime and low-overhead, which can be achieved by including features for pointer tracking by generation and barriers to allow concurrent GC to be performed by one processor while another continues to run the application. This section outlines how Ssv64 can add support for efficient, realtime GC. One potentially new micro-architectural structure and a few new instructions are required.
For generational GC, new allocations are done in an area of memory that is analyzed frequently without scanning older allocations. Over time as this data ages, it may be moved to an area for older generation data. To work correctly, the pointers in the older areas that point to recent ones need to be known and used as roots for the areas containing more recent allocation. The processor hardware helps this process by taking an exception when a pointer to a newer generation is stored to an older area; the trap handler can note this pointer and then continue. The translation cache access for the store will provide both the generation dirty level for the target page and the generation number of the target segment. New load and store instructions are added that the compiler generates only for pointers to dynamically allocated memory. For the pointer store instruction the small Segment Attribute Cache provides the generation number of the pointer being stored. If the page does not yet contain more recent data than the pointer being stored, an exception occurs, similar to the exception used to transition the PTE dirty bit from 0→1. Ssv64 has support for 4 generations. Whether all such stores trap or only the first may depend on the GC algorithm; the trap handler can turn off traps to this page after the first trap by lowering the PTE GC field to the generation of the pointer being stored, or it can leave it unchanged to be informed of every such store.
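A behavioral sketch of that store-time check (illustrative only; I assume larger generation numbers are more recent, and a page GC value of 3 disables the check as stated above):

```c
#include <stdbool.h>

/* Pointer-store (SP/SG) check: trap when the pointer being stored refers to
 * a generation more recent than the target page's GC dirty level.  "More
 * recent" is modeled here as a larger generation number (an assumption);
 * a GC field of 3 disables the check, as stated in the text. */
static bool gc_store_traps(unsigned page_gc, unsigned ptr_generation)
{
    return page_gc != 3 && ptr_generation > page_gc;
}
```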
For concurrent GC, the scan and reclamation run in another thread, potentially on another processor, while the application continues its work. This creates the potential for the application to change certain data that the GC is operating on. To prevent this, application loads or stores to that data are noticed and handled differently (the details depend on the GC algorithm). Rather than introduce a new special-purpose check for this as on some other architectures, Ssv64 has the option to use segment rings, which is more general-purpose. Consider an example where the GC algorithm employs 4 generations, with two segments for each generation. When it is time to migrate the live data from the older segment of a generation to a newer segment, its ring is raised to the ring of the GC thread so that references by the application ring trap. When GC has completed its movement of live data, the ring number is lowered and eventually this segment can be used for new allocation, and eventually the roles of the segments can be reversed. When the amount of live data after GC becomes too large, either the segment size can be increased or data can be migrated to a segment with older generational data, thereby decreasing the frequency of GC for the current generation.
The Segment Attribute Cache suggested above can be rather small. If used for both segment bounds checking and GC, as an example, it might be only 1024 entries, 4‑way set associative, with the tag being the ASID and segment number, and the data being the segment size (6 bits) and the GC generation (2 bits). If segment bounds checking is not required and it is only used for GC, then this cache might be even smaller, perhaps just 64 entries, as not many segments would typically be used for GC-managed allocation. The load and store instructions proposed above might be named LP and SP or LG and SG depending on whether pointer or GC is being emphasized.