Submit #3389 » 0001-vkernel-Restore-MAP_VPAGETABLE-support-with-COW-VPTE.patch

doc/vpagetable_analysis.txt
---------------------------

MAP_VPAGETABLE Re-implementation Analysis
==========================================

Date: December 2024
Context: Analysis of commit 4d4f84f5f26bf5e9fe4d0761b34a5f1a3784a16f, which
removed MAP_VPAGETABLE support, breaking vkernel functionality.

TABLE OF CONTENTS
-----------------
1. Background
2. Why MAP_VPAGETABLE Was Removed
3. Current VM Architecture
4. The Reverse-Mapping Problem
5. Cost Analysis of Current Mechanisms
6. Proposed Solutions
7. Recommendation
8. Open Questions

==============================================================================
1. BACKGROUND
==============================================================================

MAP_VPAGETABLE was a DragonFly BSD feature that allowed the vkernel (virtual
kernel) to implement software page tables without requiring hardware
virtualization support (Intel VT-x / AMD-V).

The vkernel runs as a userspace process but provides a full kernel
environment.  It needs to manage its own "guest" page tables for processes
running inside the vkernel.  MAP_VPAGETABLE allowed this by:

1. Creating an mmap region with the MAP_VPAGETABLE flag
2. Letting the vkernel write software page table entries (VPTEs) into this
   region
3. Having the host kernel, on page faults, walk these VPTEs to translate
   guest virtual addresses to host physical addresses

The key advantage was lightweight virtualization - no hypervisor, no special
CPU features required.  The vkernel was just a process with some extra kernel
support for the virtual page tables.
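For concreteness, here is a minimal sketch of that setup sequence from the
vkernel side, written against the historical interface (mmap() with
MAP_VPAGETABLE plus mcontrol(MADV_SETMAP), both of which this analysis and
the patched fault code refer to).  The helper name and the exact encoding of
the page-directory value passed to MADV_SETMAP are illustrative assumptions,
not taken from the vkernel sources:

    /*
     * Sketch only: how a vkernel might create its guest-RAM region and
     * register the master page directory.  guest_memory_setup() and the
     * MADV_SETMAP value encoding are assumptions for illustration.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stddef.h>

    void *
    guest_memory_setup(int memfd, size_t bytes, off_t pagedir_value)
    {
            void *base;

            /*
             * Guest "RAM": accesses through this region are resolved via
             * the VPTEs the vkernel writes into it.
             */
            base = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_VPAGETABLE, memfd, 0);
            if (base == MAP_FAILED)
                    return (NULL);

            /*
             * Point the host at the top-level guest page directory; from
             * here on, host page faults walk the software page table.
             */
            if (mcontrol(base, bytes, MADV_SETMAP, pagedir_value) < 0) {
                    munmap(base, bytes);
                    return (NULL);
            }
            return (base);
    }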

==============================================================================
2. WHY MAP_VPAGETABLE WAS REMOVED
==============================================================================

From the commit message:

    "The basic problem is that the VM system is moving to an extent-based
    mechanism for tracking VM pages entered into PMAPs and is no longer
    indexing individual terminal PTEs with pv_entry's.

    This means that the VM system is no longer able to get an exact list of
    PTEs in PMAPs that a particular vm_page is using.  It just has a flag
    'this page is in at least one pmap' or 'this page is not in any pmaps'.

    To track down the PTEs, the VM system must run through the extents via
    the vm_map_backing structures hanging off the related VM object.

    This mechanism does not work with MAP_VPAGETABLE.  Short of scanning
    the entire real pmap, the kernel has no way to reverse-index a page
    that might be indirected through MAP_VPAGETABLE."

The core issue: DragonFly optimized memory by removing per-page tracking
(pv_entry lists) in favor of extent-based tracking (vm_map_backing lists).
This works for normal mappings but breaks VPAGETABLE.

==============================================================================
3. CURRENT VM ARCHITECTURE
==============================================================================

3.1 Key Data Structures
-----------------------

vm_object:
  - Contains pages (rb_memq tree)
  - Has a backing_list: TAILQ of vm_map_backing entries
  - Each vm_map_backing represents an extent that maps part of this object

vm_map_backing:
  - Links a vm_map_entry to a vm_object
  - Contains: pmap, start, end, offset
  - Tracks "pages [offset, offset+size) of this object are mapped at
    virtual addresses [start, end) in this pmap"

vm_page:
  - PG_MAPPED flag: "this page MIGHT be mapped somewhere"
  - PG_WRITEABLE flag: "this page MIGHT have a writable mapping"
  - md.interlock_count: race detection between pmap_enter/pmap_remove_all

3.2 Reverse-Mapping Mechanism
-----------------------------

The PMAP_PAGE_BACKING_SCAN macro (sys/platform/pc64/x86_64/pmap.c:176-220)
finds all PTEs mapping a given physical page:

    for each vm_map_backing in page->object->backing_list:
        if page->pindex is within backing's range:
            compute va = backing->start + (pindex - offset) * PAGE_SIZE
            look up PTE at va in backing->pmap
            if PTE maps our physical page:
                found it!

This works because for NORMAL mappings, the relationship between object
pindex and virtual address is fixed and computable.
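In C terms, the scan amounts to the following.  This is a paraphrase for a
single page, not the literal macro text: locking, the multi-mapping handling,
and the exact linkage/field names are simplified (kernel context, vm/*
headers assumed):

    /*
     * Paraphrase of the backing_list walk for one vm_page.  Field and
     * linkage names are approximations of the real vm_map_backing layout.
     */
    static void
    backing_scan_sketch(vm_page_t m)
    {
            struct vm_map_backing *ba;
            vm_offset_t va;

            TAILQ_FOREACH(ba, &m->object->backing_list, entry) {
                    if (m->pindex < ba->offset ||
                        m->pindex >= ba->offset + atop(ba->end - ba->start))
                            continue;
                    /* Normal mappings: VA is computable from the extent. */
                    va = ba->start + ptoa(m->pindex - ba->offset);
                    /*
                     * Look up the PTE for va in ba->pmap and verify that it
                     * really points at m's physical address before acting
                     * on it (remove it, test/clear A/M bits, etc.).
                     */
            }
    }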

3.3 Why This Doesn't Work for VPAGETABLE
----------------------------------------

With VPAGETABLE:
  - One vm_map_backing covers the entire VPAGETABLE region
  - The vkernel's software page tables can map ANY physical page to
    ANY virtual address within that region
  - The formula "va = start + (pindex - offset) * PAGE_SIZE" is WRONG
  - The actual VA depends on what the vkernel wrote into its guest PTEs

Example:
  - VPAGETABLE region: VA 0x1000000-0x2000000
  - Physical page at object pindex 42
  - Expected VA by formula: 0x1000000 + 42*4096 = 0x102a000
  - Actual VA per guest PTEs: 0x1500000 (and maybe also 0x1800000!)
  - The scan looks at 0x102a000, finds nothing, misses the real mappings

==============================================================================
4. THE REVERSE-MAPPING PROBLEM
==============================================================================

4.1 When Reverse-Mapping is Needed
----------------------------------

The backing_list scan is used by these functions (7 call sites in pmap.c):

  pmap_remove_all()      - Remove page from ALL pmaps (page reclaim, COW)
  pmap_remove_specific() - Remove page from ONE specific pmap
  pmap_testbit()         - Check if Modified bit is set
  pmap_clearbit()        - Clear Access/Modified/Write bits
  pmap_ts_referenced()   - Check/clear reference bits for page aging

4.2 When Reverse-Mapping is NOT Needed
--------------------------------------

Normal page faults do NOT use backing_list scans.  They:

  1. Look up vm_map_entry by faulting VA
  2. Walk the vm_map_backing chain to find/create the page
  3. Call pmap_enter() to install PTE

This is O(1) with respect to other mappings - no scanning.

4.3 The vkernel's Existing Cooperative Mechanism
------------------------------------------------

The vkernel already has a way to notify the host of PTE changes:

    madvise(addr, len, MADV_INVAL)

This tells the host kernel: "I've modified my guest page tables, please
invalidate your cached PTEs for this range."

The host responds with pmap_remove() on the range (vm_map.c:2361-2374).
This mechanism still exists in the codebase and could be leveraged.
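From the vkernel side, the contract looks roughly like the sketch below.
The wrapper name, the single-page granularity, and the vpte_t typedef are
illustrative assumptions; the real vkernel pmap code batches and wraps these
invalidations differently:

    /*
     * Sketch of the cooperative-invalidation contract as seen from the
     * vkernel (userspace).  MADV_INVAL is the documented DragonFly hint.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stdint.h>

    typedef uint64_t vpte_t;        /* assumption: matches the VPTE width */

    static void
    guest_pte_set(vpte_t *ptep, vpte_t newpte, void *uva, size_t len)
    {
            *ptep = newpte;         /* update the software page table */

            /*
             * Ask the host to drop any real PTEs it instantiated for this
             * guest range; the host answers with pmap_remove() over
             * [uva, uva + len) as described above.
             */
            if (madvise(uva, len, MADV_INVAL) < 0) {
                    /* treat as fatal, or fall back to a full-region flush */
            }
    }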

==============================================================================
5. COST ANALYSIS OF CURRENT MECHANISMS
==============================================================================

5.1 The O(N) in backing_list Scan
---------------------------------

N = number of vm_map_backing entries on the object's backing_list.

For typical objects:
  - Private anonymous memory: N = 1 (only the owner maps it)
  - Small private files: N = 1-10
  - Shared libraries (libc.so): N = hundreds to thousands

The scan itself is cheap (pointer chasing + range check), but for shared
objects with many mappings, N can be significant.

5.2 Early Exit Optimizations
----------------------------

pmap_ts_referenced() stops after finding 4 mappings - it doesn't need all.
The PG_MAPPED flag check allows skipping pages that are definitely unmapped.

5.3 When Scans Actually Happen
------------------------------

Scans are triggered by:
  - Page reclaim (pageout daemon) - relatively rare per-page
  - COW fault resolution - once per COW page
  - msync/fsync - when writing dirty pages
  - Process exit - when cleaning up the address space

They do NOT happen on every fault, read, or write.  The common paths
(fault-in, access to an already-mapped page) are O(1).

==============================================================================
6. PROPOSED SOLUTIONS
==============================================================================

6.1 Option A: Cooperative Invalidation Only (Simplest)
------------------------------------------------------

Concept: Don't do reverse-mapping for VPAGETABLE at all.  Rely entirely
on the vkernel calling MADV_INVAL when it modifies guest PTEs.

Implementation:
  1. Re-add VM_MAPTYPE_VPAGETABLE and vm_fault_vpagetable()
  2. Add a PG_VPTMAPPED flag to vm_page
  3. Set PG_VPTMAPPED when a page is mapped via VPAGETABLE
  4. In pmap_remove_all() etc., skip the backing_list scan for VPAGETABLE
     entries (it won't find anything anyway; sketched below)
  5. When reclaiming a PG_VPTMAPPED page, send a signal/notification
     to all vkernel processes, or do a full TLB flush for them

Pros:
  - Minimal code changes
  - No per-mapping memory overhead
  - Fast path stays fast

Cons:
  - Relies on the vkernel being well-behaved with MADV_INVAL
  - May need a "big hammer" (full flush) when reclaiming pages
  - Race window between the vkernel modifying PTEs and calling MADV_INVAL

Cost: O(1) normal case, O(vkernels) for VPTMAPPED page reclaim
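On the reverse-mapping side, Option A reduces to a guard near the top of
each backing_list consumer.  The sketch below is simplified (interlock and
retry logic omitted) and mirrors the direction the pmap.c hunks later in
this patch take; it is not the literal kernel code:

    /*
     * Simplified sketch of the Option A guard in a backing_list consumer
     * such as pmap_remove_all() (kernel context, vm/* headers assumed).
     */
    static void
    pmap_remove_all_sketch(vm_page_t m)
    {
            if ((m->flags & PG_MAPPED) == 0)
                    return;

            /*
             * VPAGETABLE mappings are invisible to the backing_list scan.
             * Conservatively treat the page as modified and referenced and
             * rely on the vkernel's MADV_INVAL (or a reclaim-time
             * broadcast) to remove any real PTEs.
             */
            if (m->flags & PG_VPTMAPPED) {
                    vm_page_dirty(m);
                    vm_page_flag_set(m, PG_REFERENCED);
            }

            /* ... normal PMAP_PAGE_BACKING_SCAN loop for other mappings ... */

            vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE | PG_VPTMAPPED);
    }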

6.2 Option B: Per-Page VPAGETABLE Tracking List
-----------------------------------------------

Concept: Add per-page reverse-map tracking, but ONLY for VPAGETABLE
mappings.  Normal mappings continue using backing_list.

Implementation:
  1. Extend struct md_page:

        struct vpte_rmap {
                pmap_t                  pmap;
                vm_offset_t             va;
                TAILQ_ENTRY(vpte_rmap)  link;
        };

        struct md_page {
                long                    interlock_count;
                TAILQ_HEAD(, vpte_rmap) vpte_list;
        };

  2. In vm_fault_vpagetable(), when establishing a mapping:
     - Allocate a vpte_rmap entry
     - Add it to the page's vpte_list
  3. In pmap_remove() for VPAGETABLE regions:
     - Remove the corresponding vpte_rmap entries
  4. In pmap_remove_all() etc.:
     - After the backing_list scan, also walk the page's vpte_list

Pros:
  - Precise tracking of all VPAGETABLE mappings
  - Works with existing pmap infrastructure
  - No reliance on vkernel cooperation

Cons:
  - Memory overhead: ~24 bytes per VPAGETABLE mapping
  - Requires vpte_rmap allocation/free on every mapping change
  - Adds complexity to the fault path

Cost: O(k) where k = number of VPAGETABLE mappings for this page
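A sketch of the bookkeeping helpers Option B implies is shown below.  The
helper names, the M_VPTERMAP malloc type, and the locking assumption (vm_page
spin lock held by the caller) are hypothetical:

    /*
     * Hypothetical Option B helpers.  vpte_rmap and md_page are as
     * declared above; M_VPTERMAP would need a MALLOC_DEFINE elsewhere.
     */
    static void
    vpte_rmap_add(vm_page_t m, pmap_t pmap, vm_offset_t va)
    {
            struct vpte_rmap *rm;

            rm = kmalloc(sizeof(*rm), M_VPTERMAP, M_WAITOK | M_ZERO);
            rm->pmap = pmap;
            rm->va = va;
            TAILQ_INSERT_TAIL(&m->md.vpte_list, rm, link);
            vm_page_flag_set(m, PG_VPTMAPPED);
    }

    static void
    vpte_rmap_remove(vm_page_t m, pmap_t pmap, vm_offset_t va)
    {
            struct vpte_rmap *rm;

            TAILQ_FOREACH(rm, &m->md.vpte_list, link) {
                    if (rm->pmap == pmap && rm->va == va) {
                            TAILQ_REMOVE(&m->md.vpte_list, rm, link);
                            kfree(rm, M_VPTERMAP);
                            break;
                    }
            }
            if (TAILQ_EMPTY(&m->md.vpte_list))
                    vm_page_flag_clear(m, PG_VPTMAPPED);
    }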

6.3 Option C: Lazy Tracking with Bloom Filter
---------------------------------------------

Concept: Use a probabilistic data structure to quickly determine whether a
page MIGHT be VPAGETABLE-mapped, avoiding expensive scans in the common case.

Implementation:
  1. Each VPAGETABLE pmap has a Bloom filter
  2. When mapping a page via VPAGETABLE, add its PA to the filter
  3. When checking reverse-maps:
     - Test each VPAGETABLE pmap's Bloom filter
     - If negative: definitely not mapped there (skip)
     - If positive: might be mapped, do a full scan of that pmap

Pros:
  - Very fast negative lookups (~O(1))
  - Low memory overhead (fixed-size filter per pmap)
  - No per-mapping tracking needed

Cons:
  - False positives require falling back to a full scan
  - A Bloom filter cannot handle deletions (needs rebuilding or counting)
  - Still requires some form of scan on a positive match

Cost: O(1) for a negative, O(pmap_size) for a positive (with ~1% false
positive rate)
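A per-pmap filter can be a small fixed bit array.  The generic sketch below
illustrates the idea only; the size, the hash mixing, and wherever the filter
would live in struct pmap are assumptions, not from the patch:

    /*
     * Generic Bloom-filter sketch for Option C (kernel context assumed).
     * A counting filter or periodic rebuild would be needed for removals.
     */
    #define VPT_BLOOM_BITS  4096            /* 512 bytes per VPAGETABLE pmap */

    struct vpt_bloom {
            uint64_t bits[VPT_BLOOM_BITS / 64];
    };

    static inline uint32_t
    vpt_bloom_hash(vm_paddr_t pa, uint32_t salt)
    {
            uint64_t h = (pa >> PAGE_SHIFT) * 0x9E3779B97F4A7C15ULL + salt;

            return ((h >> 32) % VPT_BLOOM_BITS);
    }

    static inline void
    vpt_bloom_add(struct vpt_bloom *bf, vm_paddr_t pa)
    {
            uint32_t h1 = vpt_bloom_hash(pa, 0x1234);
            uint32_t h2 = vpt_bloom_hash(pa, 0x9abc);

            bf->bits[h1 / 64] |= 1ULL << (h1 % 64);
            bf->bits[h2 / 64] |= 1ULL << (h2 % 64);
    }

    static inline int
    vpt_bloom_maybe_mapped(const struct vpt_bloom *bf, vm_paddr_t pa)
    {
            uint32_t h1 = vpt_bloom_hash(pa, 0x1234);
            uint32_t h2 = vpt_bloom_hash(pa, 0x9abc);

            return ((bf->bits[h1 / 64] & (1ULL << (h1 % 64))) &&
                    (bf->bits[h2 / 64] & (1ULL << (h2 % 64))));
    }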

6.4 Option D: Shadow PTE Table
------------------------------

Concept: Maintain a kernel-side shadow of the vkernel's page tables,
indexed by physical address for reverse lookups.

Implementation:
  1. Per-VPAGETABLE pmap, maintain an RB-tree or hash table:
       Key:   physical page address
       Value: list of (guest_va, vpte_pointer) pairs
  2. Intercept all writes to VPAGETABLE regions:
     - Make VPAGETABLE regions read-only initially
     - On write fault, update shadow table and allow write
  3. For reverse-mapping:
     - Look up physical address in shadow table
     - Get all VAs directly

Pros:
  - O(log n) or O(1) reverse lookups
  - Precise tracking
  - No vkernel cooperation required

Cons:
  - High overhead for intercepting every PTE write
  - Memory overhead for shadow table
  - Complexity of keeping shadow in sync

Cost: O(1) lookup, but O(1) overhead on every guest PTE modification
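For the shadow index itself, a per-pmap RB-tree keyed by (physical page,
guest VA) would be a natural fit with the kernel's <sys/tree.h> macros.
Everything named below is hypothetical; only the idea (a reverse index keyed
by physical page) comes from the text above:

    /*
     * Hypothetical Option D shadow-index structures.  vpte_t comes from
     * the vkernel headers; kernel vm types assumed.
     */
    #include <sys/tree.h>

    struct vpte_shadow {
            RB_ENTRY(vpte_shadow)   node;
            vm_paddr_t              pa;     /* key: physical page address */
            vm_offset_t             gva;    /* guest VA mapping that page */
            vpte_t                  *vptep; /* location of the guest PTE */
    };

    static int
    vpte_shadow_cmp(struct vpte_shadow *a, struct vpte_shadow *b)
    {
            if (a->pa != b->pa)
                    return ((a->pa < b->pa) ? -1 : 1);
            if (a->gva != b->gva)
                    return ((a->gva < b->gva) ? -1 : 1);
            return (0);
    }

    RB_HEAD(vpte_shadow_tree, vpte_shadow);     /* one per VPAGETABLE pmap */
    RB_GENERATE_STATIC(vpte_shadow_tree, vpte_shadow, node, vpte_shadow_cmp);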

6.5 Option E: Hardware Virtualization (Long-term)
-------------------------------------------------

Concept: Use Intel EPT or AMD NPT for the vkernel, as suggested in the
original commit message.

Implementation:
  - vkernel runs as a proper VM guest
  - Hardware handles guest-to-host address translation
  - Host kernel manages EPT/NPT tables
  - Normal backing_list mechanism works

Pros:
  - Native hardware performance
  - Clean architecture
  - Industry-standard approach

Cons:
  - Requires VT-x/AMD-V CPU support
  - vkernel becomes a "real" VM, loses its lightweight process nature
  - Significant implementation effort
  - Different architecture than the original vkernel design

Cost: Best possible performance, but changes vkernel's nature

==============================================================================
7. RECOMMENDATION
==============================================================================

For re-enabling VPAGETABLE with minimal disruption, I recommend a
HYBRID APPROACH combining Options A and B:

Phase 1: Cooperative + Flag (Quick Win)
---------------------------------------
  1. Re-add VM_MAPTYPE_VPAGETABLE
  2. Add a PG_VPTMAPPED flag to track "might be VPAGETABLE-mapped"
  3. Restore vm_fault_vpagetable() to walk guest page tables
  4. In reverse-mapping functions, for PG_VPTMAPPED pages:
     - Skip the normal backing_list scan (it won't find anything)
     - Call an MADV_INVAL equivalent on all VPAGETABLE regions
       that MIGHT contain this page
  5. Require the vkernel to be cooperative with MADV_INVAL

This gets the vkernel working again with minimal changes.

Phase 2: Optional Per-Page Tracking (If Needed)
-----------------------------------------------
If Phase 1 proves insufficient (too many unnecessary invalidations,
race conditions, etc.), add Option B's per-page vpte_list:

  1. Track (pmap, va) pairs for each VPAGETABLE mapping
  2. Use them for precise invalidation instead of broad MADV_INVAL
  3. Memory cost is bounded by actual VPAGETABLE usage

Phase 3: Long-term Hardware Support (Optional)
----------------------------------------------
If demand exists for better vkernel performance:

  1. Implement EPT/NPT support as in Option E
  2. Keep VPAGETABLE as a fallback for non-VT-x systems
  3. Auto-detect and use the best available method

==============================================================================
8. OPEN QUESTIONS
==============================================================================

Q1: How important is precise tracking vs. over-invalidation?
  - If we can tolerate occasional unnecessary TLB flushes for vkernel
    processes, Option A alone may be sufficient.
  - Need to understand vkernel workload characteristics.

Q2: How many active VPAGETABLE regions would typically exist?
  - Usually one vkernel with one region
  - Or multiple vkernels running simultaneously?
  - Affects cost of the "scan all VPAGETABLE regions" approach

Q3: Is the vkernel already disciplined about calling MADV_INVAL?
  - The mechanism exists and was used before
  - Need to verify vkernel code still does this properly
  - If so, cooperative invalidation is viable

Q4: What are the performance expectations for vkernel?
  - Is it acceptable to be slower than native?
  - How much slower is acceptable?
  - This affects whether we need precise tracking

Q5: Is hardware virtualization an acceptable long-term direction?
  - Would change vkernel's nature from "lightweight process" to "VM"
  - May or may not align with project goals
  - Affects investment in software VPAGETABLE solutions

==============================================================================
APPENDIX A: Key Source Files
==============================================================================

sys/vm/vm_fault.c                      - Page fault handling, vm_fault_vpagetable removed
sys/vm/vm_map.c                        - Address space management, MADV_INVAL handling
sys/vm/vm_map.h                        - vm_map_entry, vm_map_backing structures
sys/vm/vm_object.h                     - vm_object with backing_list
sys/vm/vm_page.h                       - vm_page, md_page structures
sys/vm/vm.h                            - VM_MAPTYPE_* definitions
sys/platform/pc64/x86_64/pmap.c        - Real kernel pmap, PMAP_PAGE_BACKING_SCAN
sys/platform/pc64/include/pmap.h       - Real kernel md_page (no pv_list)
sys/platform/vkernel64/platform/pmap.c - vkernel pmap (HAS pv_list!)
sys/platform/vkernel64/include/pmap.h  - vkernel md_page with pv_list
sys/sys/vkernel.h                      - vkernel definitions
sys/sys/mman.h                         - MAP_VPAGETABLE definition

==============================================================================
APPENDIX B: Relevant Commit
==============================================================================

Commit: 4d4f84f5f26bf5e9fe4d0761b34a5f1a3784a16f
Author: Matthew Dillon <dillon@apollo.backplane.com>
Date:   Thu Jan 7 11:54:11 2021 -0800

    kernel - Remove MAP_VPAGETABLE

    * This will break vkernel support for now, but after a lot of mulling
      there's just no other way forward.  MAP_VPAGETABLE was basically a
      software page-table feature for mmap()s that allowed the vkernel
      to implement page tables without needing hardware virtualization
      support.

    * The basic problem is that the VM system is moving to an extent-based
      mechanism for tracking VM pages entered into PMAPs and is no longer
      indexing individual terminal PTEs with pv_entry's.

    [... see full commit message for details ...]

    * We will need actual hardware mmu virtualization to get the vkernel
      working again.

==============================================================================
APPENDIX C: Implementation Progress (Phase 1)
==============================================================================

Branch: vpagetable-analysis

COMPLETED CHANGES:
-----------------

1. sys/vm/vm.h
   - Changed VM_MAPTYPE_UNUSED02 back to VM_MAPTYPE_VPAGETABLE (value 2)

2. sys/sys/mman.h
   - Updated comment to indicate MAP_VPAGETABLE is supported

3. sys/vm/vm_page.h
   - Added PG_VPTMAPPED flag (0x00000001) using existing PG_UNUSED0001 slot
   - Documents that it tracks pages mapped via VPAGETABLE regions

4. sys/vm/vm_fault.c
   - Added forward declaration for vm_fault_vpagetable()
   - Added struct vm_map_ilock and didilock variables to vm_fault()
   - Added full vm_fault_vpagetable() function (~140 lines)
   - Added VPAGETABLE check in vm_fault() before vm_fault_object()
   - Added VPAGETABLE check in vm_fault_bypass() to return KERN_FAILURE
   - Added VPAGETABLE check in vm_fault_page()
   - Added VM_MAPTYPE_VPAGETABLE case to vm_fault_wire()

5. sys/vm/vm_map.c (COMPLETE)
   - Added VM_MAPTYPE_VPAGETABLE to switch statements:
       * vmspace_swap_count()
       * vmspace_anonymous_count()
       * vm_map_backing_attach()
       * vm_map_backing_detach()
       * vm_map_entry_dispose()
       * vm_map_clean() (first switch)
       * vm_map_delete()
       * vm_map_backing_replicated()
       * vmspace_fork()
   - Restored MADV_SETMAP functionality
   - vm_map_insert(): Skip prefault for VPAGETABLE
   - vm_map_madvise(): Allow VPAGETABLE for MADV_INVAL (critical for
     cooperative invalidation)
   - vm_map_lookup(): Recognize VPAGETABLE as object-based
   - vm_map_backing_adjust_start/end(): Include VPAGETABLE for clipping
   - vm_map_protect(): Include VPAGETABLE in vnode write timestamp update
   - vm_map_user_wiring/vm_map_kernel_wiring(): Include VPAGETABLE for
     shadow setup
   - vm_map_copy_entry(): Accept VPAGETABLE in assert

NOT RESTORING (Strategic decisions):

1. vm_map_entry_shadow/allocate_object large object (0x7FFFFFFF)
   OLD CODE: Created absurdly large objects because the vkernel could map
   any page to any VA.
   WHY NOT:  With cooperative invalidation (MADV_INVAL), we don't need this
   hack.  Normal-sized objects work because the vkernel invalidates mappings
   when it changes its page tables.

2. vm_map_clean() whole-object flush for VPAGETABLE
   OLD CODE: Flushed the entire object for VPAGETABLE instead of a range.
   WHY NOT:  With cooperative invalidation, range-based cleaning works.
   The vkernel is responsible for calling MADV_INVAL after PTE changes.

3. vmspace_fork_normal_entry() backing chain collapse skip
   OLD CODE: Skipped the backing chain optimization for VPAGETABLE.
   WHY NOT:  The optimization should work fine.  If issues arise, the
   vkernel will call MADV_INVAL.

FUTURE WORK (After vm_map.c):
-----------------------------

1. Update pmap reverse-mapping (sys/platform/pc64/x86_64/pmap.c)
   - Handle PG_VPTMAPPED pages in PMAP_PAGE_BACKING_SCAN
   - Skip normal scan, use cooperative invalidation instead

2. Track active VPAGETABLE pmaps
   - Mechanism to broadcast invalidation to all vkernels (see the sketch
     after this list)
   - Needed when reclaiming PG_VPTMAPPED pages

3. Verify vkernel code
   - Check that sys/platform/vkernel64/ properly calls MADV_INVAL
   - Ensure the cooperative invalidation contract is maintained

4. Test compilation and runtime
   - Build kernel with changes
   - Test vkernel functionality
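For FUTURE WORK item 2, the registry plus broadcast could look roughly like
the sketch below.  Every name here (the list head, the pm_vptentry linkage
in struct pmap, the helper functions, the locking) is invented for
illustration, and the actual per-pmap flush is elided:

    /*
     * Hypothetical registry of active VPAGETABLE pmaps and a reclaim-time
     * broadcast (kernel context assumed).
     */
    static TAILQ_HEAD(, pmap) vpt_pmap_list =
            TAILQ_HEAD_INITIALIZER(vpt_pmap_list);
    static struct spinlock vpt_pmap_spin;   /* spin_init()'d at boot, not shown */

    void
    vpt_pmap_register(pmap_t pmap)          /* called at MADV_SETMAP time */
    {
            spin_lock(&vpt_pmap_spin);
            TAILQ_INSERT_TAIL(&vpt_pmap_list, pmap, pm_vptentry);
            spin_unlock(&vpt_pmap_spin);
    }

    void
    vpt_pmap_broadcast_inval(vm_page_t m)   /* reclaiming a PG_VPTMAPPED page */
    {
            pmap_t pmap;

            spin_lock(&vpt_pmap_spin);
            TAILQ_FOREACH(pmap, &vpt_pmap_list, pm_vptentry) {
                    /*
                     * Big hammer: remove any real PTEs this vkernel pmap
                     * still holds for m's physical address (full pmap scan
                     * or a per-pmap invalidation request to the vkernel).
                     */
            }
            spin_unlock(&vpt_pmap_spin);
    }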

share/man/man7/vkernel.7
------------------------
|
.\" OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
||
|
.\" SUCH DAMAGE.
|
||
|
.\"
|
||
|
.Dd September 7, 2021
|
||
|
.Dd December 18, 2025
|
||
|
.Dt VKERNEL 7
|
||
|
.Os
|
||
|
.Sh NAME
|
||
| ... | ... | |
|
.Op Fl hstUvz
|
||
|
.Op Fl c Ar file
|
||
|
.Op Fl e Ar name Ns = Ns Li value : Ns Ar name Ns = Ns Li value : Ns ...
|
||
|
.Op Fl i Ar file
|
||
|
.Op Fl I Ar interface Ns Op Ar :address1 Ns Oo Ar :address2 Oc Ns Oo Ar /netmask Oc Ns Oo Ar =mac Oc
|
||
|
.Op Fl l Ar cpulock
|
||
|
.Op Fl m Ar size
|
||
|
.Fl m Ar size
|
||
|
.Op Fl n Ar numcpus Ns Op Ar :lbits Ns Oo Ar :cbits Oc
|
||
|
.Op Fl p Ar pidfile
|
||
|
.Op Fl r Ar file Ns Op Ar :serno
|
||
| ... | ... | |
|
This option can be specified more than once.
|
||
|
.It Fl h
|
||
|
Shows a list of available options, each with a short description.
|
||
|
.It Fl i Ar file
|
||
|
Specify a memory image
|
||
|
.Ar file
|
||
|
to be used by the virtual kernel.
|
||
|
If no
|
||
|
.Fl i
|
||
|
option is given, the kernel will generate a name of the form
|
||
|
.Pa /var/vkernel/memimg.XXXXXX ,
|
||
|
with the trailing
|
||
|
.Ql X Ns s
|
||
|
being replaced by a sequential number, e.g.\&
|
||
|
.Pa memimg.000001 .
|
||
|
.It Fl I Ar interface Ns Op Ar :address1 Ns Oo Ar :address2 Oc Ns Oo Ar /netmask Oc Ns Oo Ar =MAC Oc
|
||
|
Create a virtual network device, with the first
|
||
|
.Fl I
|
||
| ... | ... | |
|
Locking the vkernel to a set of cpus is recommended on multi-socket systems
|
||
|
to improve NUMA locality of reference.
|
||
|
.It Fl m Ar size
|
||
|
Specify the amount of memory to be used by the kernel in bytes,
|
||
|
Specify the amount of memory for the virtual kernel in bytes,
|
||
|
.Cm K
|
||
|
.Pq kilobytes ,
|
||
|
.Cm M
|
||
| ... | ... | |
|
and
|
||
|
.Cm G
|
||
|
are allowed.
|
||
|
This option is mandatory.
|
||
|
.It Fl n Ar numcpus Ns Op Ar :lbits Ns Oo Ar :cbits Oc
|
||
|
.Ar numcpus
|
||
|
specifies the number of CPUs you wish to emulate.
|
||
| ... | ... | |
|
to the virtual kernel's
|
||
|
.Xr init 8
|
||
|
process.
|
||
|
.Sh MEMORY MANAGEMENT
|
||
|
The virtual kernel's memory is backed by a temporary file created in
|
||
|
.Pa /var/vkernel
|
||
|
and immediately unlinked.
|
||
|
The file descriptor is kept open for the lifetime of the virtual kernel process.
|
||
|
Both the
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
mapping, which implements the virtual kernel's page tables, and the
|
||
|
direct memory access
|
||
|
.Pq DMAP
|
||
|
region reference this backing store.
|
||
|
.Pp
|
||
|
When the virtual kernel exits, the file descriptor is closed and the
|
||
|
backing store is automatically reclaimed by the operating system.
|
||
|
This ensures proper cleanup of memory resources even if the virtual
|
||
|
kernel terminates abnormally.
|
||
|
.Pp
|
||
|
The following
|
||
|
.Xr sysctl 8
|
||
|
variables provide statistics and debugging for the
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
mechanism:
|
||
|
.Bl -tag -width "vm.vpagetable_faults" -compact
|
||
|
.It Va vm.vpagetable_mmap
|
||
|
Number of
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
mappings created.
|
||
|
.It Va vm.vpagetable_setmap
|
||
|
Number of
|
||
|
.Dv MADV_SETMAP
|
||
|
operations performed.
|
||
|
.It Va vm.vpagetable_inval
|
||
|
Number of
|
||
|
.Dv MADV_INVAL
|
||
|
operations performed.
|
||
|
.It Va vm.vpagetable_faults
|
||
|
Number of page faults handled through VPTE translation.
|
||
|
.It Va vm.debug_vpagetable
|
||
|
Enable debug output for
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
operations.
|
||
|
.El
|
||
|
.Sh DEBUGGING
|
||
|
It is possible to directly gdb the virtual kernel's process.
|
||
|
It is recommended that you do a
|
||

sys/kern/kern_timeout.c
-----------------------
|
/*
|
||
|
* Double check the validity of the callout, detect
|
||
|
* if the originator's structure has been ripped out.
|
||
|
*
|
||
|
* Skip the address range check for virtual kernels
|
||
|
* since vkernel addresses are in host user space.
|
||
|
*/
|
||
|
#ifndef _KERNEL_VIRTUAL
|
||
|
if ((uintptr_t)c->verifier < VM_MAX_USER_ADDRESS) {
|
||
|
spin_unlock(&wheel->spin);
|
||
|
panic("_callout %p verifier %p failed "
|
||
|
"func %p/%p\n",
|
||
|
c, c->verifier, c->rfunc, c->qfunc);
|
||
|
}
|
||
|
#endif
|
||
|
if (c->verifier->toc != c) {
|
||
|
spin_unlock(&wheel->spin);
|
||
|
panic("_callout %p verifier %p failed "
|
||
|
panic("_callout %p verifier %p toc %p (expected %p) "
|
||
|
"func %p/%p\n",
|
||
|
c, c->verifier, c->rfunc, c->qfunc);
|
||
|
c, c->verifier, c->verifier->toc, c,
|
||
|
c->rfunc, c->qfunc);
|
||
|
}
|
||
|
/*
|
||

sys/platform/pc64/x86_64/pmap.c
-------------------------------
|
if ((m->flags & PG_MAPPED) == 0)
|
||
|
return;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan because the VA formula doesn't apply. The vkernel is
|
||
|
* responsible for calling MADV_INVAL to remove real PTEs when it
|
||
|
* modifies its page tables.
|
||
|
*
|
||
|
* For PG_VPTMAPPED pages, conservatively assume the page is
|
||
|
* modified and referenced. The backing_list scan below won't
|
||
|
* find these mappings, but that's OK - the vkernel should have
|
||
|
* already removed them via MADV_INVAL.
|
||
|
*
|
||
|
* Clear PG_VPTMAPPED along with PG_MAPPED at the end.
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED) {
|
||
|
vm_page_dirty(m);
|
||
|
vm_page_flag_set(m, PG_REFERENCED);
|
||
|
}
|
||
|
retry = ticks + hz * 60;
|
||
|
again:
|
||
|
PMAP_PAGE_BACKING_SCAN(m, NULL, ipmap, iptep, ipte, iva) {
|
||
| ... | ... | |
|
m, m->md.interlock_count);
|
||
|
}
|
||
|
}
|
||
|
vm_page_flag_clear(m, PG_MAPPED | PG_MAPPEDMULTI | PG_WRITEABLE);
|
||
|
vm_page_flag_clear(m, PG_MAPPED | PG_MAPPEDMULTI | PG_WRITEABLE |
|
||
|
PG_VPTMAPPED);
|
||
|
}
|
||
|
/*
|
||
| ... | ... | |
|
if (bit == PG_M_IDX && (m->flags & PG_WRITEABLE) == 0)
|
||
|
return FALSE;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan because the VA formula doesn't apply (vkernel can map any
|
||
|
* physical page to any VA). Return TRUE conservatively - the page
|
||
|
* may have the bit set in a vkernel's mapping. The vkernel is
|
||
|
* responsible for calling MADV_INVAL when it modifies its page
|
||
|
* tables.
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED)
|
||
|
return TRUE;
|
||
|
/*
|
||
|
* Iterate the mapping
|
||
|
*/
|
||
| ... | ... | |
|
if ((m->flags & (PG_MAPPED | PG_WRITEABLE)) == 0)
|
||
|
return;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan. The vkernel is responsible for calling MADV_INVAL when it
|
||
|
* modifies its page tables.
|
||
|
*
|
||
|
* For the RW bit: conservatively mark page dirty, clear WRITEABLE.
|
||
|
* For other bits: cannot clear, just return (vkernel handles this).
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED) {
|
||
|
if (bit_index == PG_RW_IDX) {
|
||
|
vm_page_dirty(m);
|
||
|
vm_page_flag_clear(m, PG_WRITEABLE);
|
||
|
}
|
||
|
return;
|
||
|
}
|
||
|
/*
|
||
|
* Being asked to clear other random bits, we don't track them
|
||
|
* so we have to iterate.
|
||
| ... | ... | |
|
if (__predict_false(!pmap_initialized || (m->flags & PG_FICTITIOUS)))
|
||
|
return rval;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan. Return non-zero conservatively to indicate the page may
|
||
|
* be referenced. The vkernel is responsible for calling MADV_INVAL
|
||
|
* when it modifies its page tables.
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED)
|
||
|
return 1;
|
||
|
PMAP_PAGE_BACKING_SCAN(m, NULL, ipmap, iptep, ipte, iva) {
|
||
|
if (ipte & ipmap->pmap_bits[PG_A_IDX]) {
|
||
|
npte = ipte & ~ipmap->pmap_bits[PG_A_IDX];
|
||

sys/platform/vkernel64/include/md_var.h
---------------------------------------
|
extern char cpu_vendor[]; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_vendor_id; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_id; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_feature; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_feature2; /* XXX belongs in pc64 */
|
||
|
extern struct vkdisk_info DiskInfo[VKDISK_MAX];
|
||
|
extern int DiskNum;
|
||

sys/platform/vkernel64/include/pmap.h
-------------------------------------
|
#define __VM_MEMATTR_T_DEFINED__
|
||
|
typedef char vm_memattr_t;
|
||
|
#endif
|
||
|
#ifndef __VM_PROT_T_DEFINED__
|
||
|
#define __VM_PROT_T_DEFINED__
|
||
|
typedef u_char vm_prot_t;
|
||
|
#endif
|
||
|
void pmap_bootstrap(vm_paddr_t *, int64_t);
|
||
|
void *pmap_mapdev (vm_paddr_t, vm_size_t);
|
||

sys/platform/vkernel64/platform/copyio.c
----------------------------------------
|
#include <cpu/lwbuf.h>
|
||
|
#include <vm/vm_page.h>
|
||
|
#include <vm/vm_extern.h>
|
||
|
#include <vm/pmap.h>
|
||
|
#include <assert.h>
|
||
|
#include <sys/stat.h>
|
||

sys/platform/vkernel64/platform/init.c
--------------------------------------
|
void *dmap_min_address;
|
||
|
void *vkernel_stack;
|
||
|
u_int cpu_feature; /* XXX */
|
||
|
u_int cpu_feature2; /* XXX */
|
||
|
int tsc_present;
|
||
|
int tsc_invariant;
|
||
|
int tsc_mpsync;
|
||
| ... | ... | |
|
int eflag;
|
||
|
int real_vkernel_enable;
|
||
|
int supports_sse;
|
||
|
uint32_t mxcsr_mask;
|
||
|
size_t vsize;
|
||
|
size_t msize;
|
||
|
size_t kenv_size;
|
||
| ... | ... | |
|
tsc_oneus_approx = ((tsc_frequency|1) + 999999) / 1000000;
|
||
|
/*
|
||
|
* Check SSE
|
||
|
* Check SSE and get the host's MXCSR mask. The mask must be set
|
||
|
* before init_fpu() because npxprobemask() may not work correctly
|
||
|
* in userspace context.
|
||
|
*/
|
||
|
vsize = sizeof(supports_sse);
|
||
|
supports_sse = 0;
|
||
|
sysctlbyname("hw.instruction_sse", &supports_sse, &vsize, NULL, 0);
|
||
|
sysctlbyname("hw.mxcsr_mask", &mxcsr_mask, &msize, NULL, 0);
|
||
|
msize = sizeof(npx_mxcsr_mask);
|
||
|
sysctlbyname("hw.mxcsr_mask", &npx_mxcsr_mask, &msize, NULL, 0);
|
||
|
init_fpu(supports_sse);
|
||
|
if (supports_sse)
|
||
|
cpu_feature |= CPUID_SSE | CPUID_FXSR;
|
||
| ... | ... | |
|
/*
|
||
|
* Initialize system memory. This is the virtual kernel's 'RAM'.
|
||
|
*
|
||
|
* We always use an anonymous memory file (created in /tmp or /var/vkernel
|
||
|
* and immediately unlinked). This ensures proper cleanup of PG_VPTMAPPED
|
||
|
* pages when the vkernel exits - the backing object is destroyed and all
|
||
|
* pages are freed.
|
||
|
*
|
||
|
* The -i option is deprecated but still accepted for compatibility.
|
||
|
*/
|
||
|
static
|
||
|
void
|
||
|
init_sys_memory(char *imageFile)
|
||
|
{
|
||
|
struct stat st;
|
||
|
int i;
|
||
|
int fd;
|
||
|
char *tmpfile;
|
||
|
/*
|
||
|
* Warn if -i was specified (deprecated)
|
||
|
*/
|
||
|
if (imageFile != NULL) {
|
||
|
fprintf(stderr,
|
||
|
"WARNING: -i option is deprecated and ignored.\n"
|
||
|
" Memory is now always anonymous (unlinked file).\n");
|
||
|
}
|
||
|
/*
|
||
|
* Figure out the system memory image size. If an image file was
|
||
|
* specified and -m was not specified, use the image file's size.
|
||
|
* Require -m to be specified
|
||
|
*/
|
||
|
if (imageFile && stat(imageFile, &st) == 0 && Maxmem_bytes == 0)
|
||
|
Maxmem_bytes = (vm_paddr_t)st.st_size;
|
||
|
if ((imageFile == NULL || stat(imageFile, &st) < 0) &&
|
||
|
Maxmem_bytes == 0) {
|
||
|
errx(1, "Cannot create new memory file %s unless "
|
||
|
"system memory size is specified with -m",
|
||
|
imageFile);
|
||
|
if (Maxmem_bytes == 0) {
|
||
|
errx(1, "System memory size must be specified with -m");
|
||
|
/* NOT REACHED */
|
||
|
}
|
||
| ... | ... | |
|
}
|
||
|
/*
|
||
|
* Generate an image file name if necessary, then open/create the
|
||
|
* file exclusively locked. Do not allow multiple virtual kernels
|
||
|
* to use the same image file.
|
||
|
*
|
||
|
* Don't iterate through a million files if we do not have write
|
||
|
* access to the directory, stop if our open() failed on a
|
||
|
* non-existant file. Otherwise opens can fail for any number
|
||
|
* Create an anonymous memory backing file. We create a temp file
|
||
|
* and immediately unlink it. The file descriptor keeps the file
|
||
|
* alive until the vkernel exits, at which point all pages are
|
||
|
* properly freed (including clearing PG_VPTMAPPED).
|
||
|
*/
|
||
|
if (imageFile == NULL) {
|
||
|
for (i = 0; i < 1000000; ++i) {
|
||
|
asprintf(&imageFile, "/var/vkernel/memimg.%06d", i);
|
||
|
fd = open(imageFile,
|
||
|
O_RDWR|O_CREAT|O_EXLOCK|O_NONBLOCK, 0644);
|
||
|
if (fd < 0 && stat(imageFile, &st) == 0) {
|
||
|
free(imageFile);
|
||
|
continue;
|
||
|
}
|
||
|
break;
|
||
|
}
|
||
|
} else {
|
||
|
fd = open(imageFile, O_RDWR|O_CREAT|O_EXLOCK|O_NONBLOCK, 0644);
|
||
|
}
|
||
|
fprintf(stderr, "Using memory file: %s\n", imageFile);
|
||
|
if (fd < 0 || fstat(fd, &st) < 0) {
|
||
|
err(1, "Unable to open/create %s", imageFile);
|
||
|
/* NOT REACHED */
|
||
|
}
|
||
|
asprintf(&tmpfile, "/var/vkernel/.memimg.%d", (int)getpid());
|
||
|
fd = open(tmpfile, O_RDWR|O_CREAT|O_EXCL, 0600);
|
||
|
if (fd < 0)
|
||
|
err(1, "Unable to create %s", tmpfile);
|
||
|
unlink(tmpfile);
|
||
|
free(tmpfile);
|
||
|
fprintf(stderr, "Using anonymous memory (%llu MB)\n",
|
||
|
(unsigned long long)Maxmem_bytes / (1024 * 1024));
|
||
|
/*
|
||
|
* Truncate or extend the file as necessary. Clean out the contents
|
||
|
* of the file, we want it to be full of holes so we don't waste
|
||
|
* time reading in data from an old file that we no longer care
|
||
|
* about.
|
||
|
* Size the file. It will be sparse (no actual disk space used
|
||
|
* until pages are faulted in).
|
||
|
*/
|
||
|
ftruncate(fd, 0);
|
||
|
ftruncate(fd, Maxmem_bytes);
|
||
|
if (ftruncate(fd, Maxmem_bytes) < 0) {
|
||
|
err(1, "Unable to size memory backing file");
|
||
|
/* NOT REACHED */
|
||
|
}
|
||
|
MemImageFd = fd;
|
||
|
Maxmem = Maxmem_bytes >> PAGE_SHIFT;
|
||
| ... | ... | |
|
"\t-c\tSpecify a readonly CD-ROM image file to be used by the kernel.\n"
|
||
|
"\t-e\tSpecify an environment to be used by the kernel.\n"
|
||
|
"\t-h\tThis list of options.\n"
|
||
|
"\t-i\tSpecify a memory image file to be used by the virtual kernel.\n"
|
||
|
"\t-i\t(DEPRECATED) Memory is now always anonymous.\n"
|
||
|
"\t-I\tCreate a virtual network device.\n"
|
||
|
"\t-l\tSpecify which, if any, real CPUs to lock virtual CPUs to.\n"
|
||
|
"\t-m\tSpecify the amount of memory to be used by the kernel in bytes.\n"
|
||
|
"\t-m\tSpecify the amount of memory to be used by the kernel in bytes (required).\n"
|
||
|
"\t-n\tSpecify the number of CPUs and the topology you wish to emulate:\n"
|
||
|
"\t\t\tnumcpus - number of cpus\n"
|
||
|
"\t\t\tlbits - specify the number of bits within APICID(=CPUID)\n"
|
||

sys/platform/vkernel64/platform/pmap.c
--------------------------------------
|
psize = x86_64_btop(size);
|
||
|
if ((object->type != OBJT_VNODE) ||
|
||
|
((limit & MAP_PREFAULT_PARTIAL) && (psize > MAX_INIT_PT) &&
|
||
|
((limit & COWF_PREFAULT_PARTIAL) && (psize > MAX_INIT_PT) &&
|
||
|
(object->resident_page_count > MAX_INIT_PT))) {
|
||
|
return;
|
||
|
}
|
||
| ... | ... | |
|
* don't allow an madvise to blow away our really
|
||
|
* free pages allocating pv entries.
|
||
|
*/
|
||
|
if ((info->limit & MAP_PREFAULT_MADVISE) &&
|
||
|
if ((info->limit & COWF_PREFAULT_MADVISE) &&
|
||
|
vmstats.v_free_count < vmstats.v_free_reserved) {
|
||
|
return(-1);
|
||
|
}
|
||

sys/platform/vkernel64/x86_64/cpu_regs.c
----------------------------------------
|
char *sp;
|
||
|
regs = lp->lwp_md.md_regs;
|
||
|
oonstack = (lp->lwp_sigstk.ss_flags & SS_ONSTACK) ? 1 : 0;
|
||
|
/* Save user context */
|
||
| ... | ... | |
|
do_cpuid(1, regs);
|
||
|
cpu_feature = regs[3];
|
||
|
cpu_feature2 = regs[2];
|
||
|
/*
|
||
|
* The vkernel uses fxsave64/fxrstor64 for FPU state management,
|
||
|
* not xsave/xrstor. Mask out AVX/XSAVE features that we don't
|
||
|
* support, otherwise userland (libc/libm) may try to use AVX
|
||
|
* instructions and the FPU state won't be properly saved/restored,
|
||
|
* leading to FPE or corrupted state.
|
||
|
*/
|
||
|
cpu_feature2 &= ~(CPUID2_XSAVE | CPUID2_OSXSAVE | CPUID2_AVX |
|
||
|
CPUID2_FMA | CPUID2_F16C);
|
||
|
}
|
||

sys/platform/vkernel64/x86_64/exception.c
-----------------------------------------
|
int save;
|
||
|
save = errno;
|
||
|
#if 0
|
||
|
kprintf("CAUGHT SIG %d RIP %08lx ERR %08lx TRAPNO %ld "
|
||
|
"err %ld addr %08lx\n",
|
||
|
signo,
|
||
|
ctx->uc_mcontext.mc_rip,
|
||
|
ctx->uc_mcontext.mc_err,
|
||
|
ctx->uc_mcontext.mc_trapno & 0xFFFF,
|
||
|
ctx->uc_mcontext.mc_trapno >> 16,
|
||
|
ctx->uc_mcontext.mc_addr);
|
||
|
#endif
|
||
|
kern_trap((struct trapframe *)&ctx->uc_mcontext.mc_rdi);
|
||
|
splz();
|
||
|
errno = save;
|
||

sys/platform/vkernel64/x86_64/npx.c
-----------------------------------
|
#define fnstcw(addr) __asm __volatile("fnstcw %0" : "=m" (*(addr)))
|
||
|
#define fnstsw(addr) __asm __volatile("fnstsw %0" : "=m" (*(addr)))
|
||
|
#define frstor(addr) __asm("frstor %0" : : "m" (*(addr)))
|
||
|
#define fxrstor(addr) __asm("fxrstor %0" : : "m" (*(addr)))
|
||
|
#define fxsave(addr) __asm __volatile("fxsave %0" : "=m" (*(addr)))
|
||
|
#define fxrstor(addr) __asm("fxrstor64 %0" : : "m" (*(addr)))
|
||
|
#define fxsave(addr) __asm __volatile("fxsave64 %0" : "=m" (*(addr)))
|
||
|
#define ldmxcsr(csr) __asm __volatile("ldmxcsr %0" : : "m" (csr))
|
||
|
static void fpu_clean_state(void);
|
||
| ... | ... | |
|
* fnsave are broken, so our treatment breaks fnclex if it is the
|
||
|
* first FPU instruction after a context switch.
|
||
|
*/
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~0xFFBF) && cpu_fxsr) {
|
||
|
krateprintf(&badfprate,
|
||
|
"FXRSTOR: illegal FP MXCSR %08x didinit = %d\n",
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr, didinit);
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= 0xFFBF;
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~npx_mxcsr_mask) && cpu_fxsr) {
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= npx_mxcsr_mask;
|
||
|
lwpsignal(curproc, curthread->td_lwp, SIGFPE);
|
||
|
}
|
||
|
fpurstor(curthread->td_savefpu, 0);
|
||
| ... | ... | |
|
if (td == mdcpu->gd_npxthread)
|
||
|
npxsave(td->td_savefpu);
|
||
|
bcopy(mctx->mc_fpregs, td->td_savefpu, sizeof(*td->td_savefpu));
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~0xFFBF) &&
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~npx_mxcsr_mask) &&
|
||
|
cpu_fxsr) {
|
||
|
krateprintf(&badfprate,
|
||
|
"pid %d (%s) signal return from user: "
|
||
| ... | ... | |
|
td->td_proc->p_pid,
|
||
|
td->td_proc->p_comm,
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr);
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= 0xFFBF;
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= npx_mxcsr_mask;
|
||
|
}
|
||
|
td->td_flags |= TDF_USINGFP;
|
||
|
break;
|
||

sys/platform/vkernel64/x86_64/trap.c
------------------------------------
|
eva = frame->tf_addr;
|
||
|
else
|
||
|
eva = 0;
|
||
|
#if 0
|
||
|
kprintf("USER_TRAP AT %08lx xflags %ld trapno %ld eva %08lx\n",
|
||
|
frame->tf_rip, frame->tf_xflags, frame->tf_trapno, eva);
|
||
|
#endif
|
||
|
/*
|
||
|
* Everything coming from user mode runs through user_trap,
|
||
|
* including system calls.
|
||
| ... | ... | |
|
*/
|
||
|
gd = mycpu;
|
||
|
gd->gd_flags |= GDF_VIRTUSER;
|
||
|
r = vmspace_ctl(id, VMSPACE_CTL_RUN, tf,
|
||
|
&curthread->td_savevext);
|
||
| ... | ... | |
|
}
|
||
|
crit_exit();
|
||
|
gd->gd_flags &= ~GDF_VIRTUSER;
|
||
|
#if 0
|
||
|
kprintf("GO USER %d trap %ld EVA %08lx RIP %08lx RSP %08lx XFLAGS %02lx/%02lx\n",
|
||
|
r, tf->tf_trapno, tf->tf_addr, tf->tf_rip, tf->tf_rsp,
|
||
|
tf->tf_xflags, frame->if_xflags);
|
||
|
#endif
|
||
|
/* DEBUG: Only log errors and FPU-related traps */
|
||
|
if (r < 0) {
|
||
|
if (errno == EFAULT) {
|
||
|
panic("vmspace_ctl failed with EFAULT");
|
||
|
}
|
||
|
if (errno != EINTR)
|
||
|
panic("vmspace_ctl failed error %d", errno);
|
||
|
} else {
|
||

sys/sys/mman.h
--------------
|
/*
|
||
|
* Mapping type
|
||
|
*
|
||
|
* NOTE! MAP_VPAGETABLE is no longer supported and will generate a mmap()
|
||
|
* error.
|
||
|
* NOTE! MAP_VPAGETABLE is used by vkernels for software page tables.
|
||
|
*
|
||
|
*/
|
||
|
#define MAP_FILE 0x0000 /* map from file (default) */
|
||
|
#define MAP_ANON 0x1000 /* allocated from memory, swap space */
|
||

sys/vm/vm.h
-----------
|
*/
|
||
|
#define VM_MAPTYPE_UNSPECIFIED 0
|
||
|
#define VM_MAPTYPE_NORMAL 1
|
||
|
#define VM_MAPTYPE_UNUSED02 2 /* was VPAGETABLE */
|
||
|
#define VM_MAPTYPE_VPAGETABLE 2 /* vkernel software page table */
|
||
|
#define VM_MAPTYPE_SUBMAP 3
|
||
|
#define VM_MAPTYPE_UKSMAP 4 /* user-kernel shared memory */
|
||

sys/vm/vm_fault.c
-----------------
|
SYSCTL_INT(_vm, OID_AUTO, debug_fault, CTLFLAG_RW, &debug_fault, 0, "");
|
||
|
__read_mostly static int debug_cluster = 0;
|
||
|
SYSCTL_INT(_vm, OID_AUTO, debug_cluster, CTLFLAG_RW, &debug_cluster, 0, "");
|
||
|
/* VPAGETABLE debugging - counts and optional verbose output */
|
||
|
static long vpagetable_fault_count = 0;
|
||
|
SYSCTL_LONG(_vm, OID_AUTO, vpagetable_faults, CTLFLAG_RW,
|
||
|
&vpagetable_fault_count, 0, "Number of VPAGETABLE faults");
|
||
|
__read_mostly int debug_vpagetable = 0;
|
||
|
SYSCTL_INT(_vm, OID_AUTO, debug_vpagetable, CTLFLAG_RW,
|
||
|
&debug_vpagetable, 0, "Debug VPAGETABLE operations");
|
||
|
#if 0
|
||
|
static int virtual_copy_enable = 1;
|
||
|
SYSCTL_INT(_vm, OID_AUTO, virtual_copy_enable, CTLFLAG_RW,
|
||
| ... | ... | |
|
vm_pindex_t first_count, int *mextcountp,
|
||
|
vm_prot_t fault_type);
|
||
|
static int vm_fault_object(struct faultstate *, vm_pindex_t, vm_prot_t, int);
|
||
|
static int vm_fault_vpagetable(struct faultstate *, vm_pindex_t *,
|
||
|
vpte_t, int, int);
|
||
|
static void vm_set_nosync(vm_page_t m, vm_map_entry_t entry);
|
||
|
static void vm_prefault(pmap_t pmap, vm_offset_t addra,
|
||
|
vm_map_entry_t entry, int prot, int fault_flags);
|
||
| ... | ... | |
|
struct proc *p;
|
||
|
#endif
|
||
|
thread_t td;
|
||
|
struct vm_map_ilock ilock;
|
||
|
int mextcount;
|
||
|
int didilock;
|
||
|
int growstack;
|
||
|
int retry = 0;
|
||
|
int inherit_prot;
|
||
| ... | ... | |
|
if (vm_fault_bypass_count &&
|
||
|
vm_fault_bypass(&fs, first_pindex, first_count,
|
||
|
&mextcount, fault_type) == KERN_SUCCESS) {
|
||
|
didilock = 0;
|
||
|
fault_flags &= ~VM_FAULT_BURST;
|
||
|
goto success;
|
||
|
}
|
||
| ... | ... | |
|
fs.first_ba_held = 1;
|
||
|
/*
|
||
|
* The page we want is at (first_object, first_pindex).
|
||
|
* The page we want is at (first_object, first_pindex), but if the
|
||
|
* vm_map_entry is VM_MAPTYPE_VPAGETABLE we have to traverse the
|
||
|
* page table to figure out the actual pindex.
|
||
|
*
|
||
|
* NOTE! DEVELOPMENT IN PROGRESS, THIS IS AN INITIAL IMPLEMENTATION
|
||
|
* ONLY
|
||
|
*/
|
||
|
didilock = 0;
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
++vpagetable_fault_count;
|
||
|
if (debug_vpagetable) {
|
||
|
kprintf("VPAGETABLE fault: vaddr=%lx pde=%lx type=%02x pid=%d\n",
|
||
|
vaddr, fs.entry->aux.master_pde, fault_type,
|
||
|
(curproc ? curproc->p_pid : -1));
|
||
|
}
|
||
|
vm_map_interlock(fs.map, &ilock, vaddr, vaddr + PAGE_SIZE);
|
||
|
didilock = 1;
|
||
|
result = vm_fault_vpagetable(&fs, &first_pindex,
|
||
|
fs.entry->aux.master_pde,
|
||
|
fault_type, 1);
|
||
|
if (result == KERN_TRY_AGAIN) {
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
++retry;
|
||
|
goto RetryFault;
|
||
|
}
|
||
|
if (result != KERN_SUCCESS) {
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
goto done;
|
||
|
}
|
||
|
}
|
||
|
/*
|
||
|
* Now we have the actual (object, pindex), fault in the page. If
|
||
|
* vm_fault_object() fails it will unlock and deallocate the FS
|
||
|
* data. If it succeeds everything remains locked and fs->ba->object
|
||
| ... | ... | |
|
}
|
||
|
if (result == KERN_TRY_AGAIN) {
|
||
|
if (didilock)
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
++retry;
|
||
|
goto RetryFault;
|
||
|
}
|
||
|
if (result != KERN_SUCCESS) {
|
||
|
if (didilock)
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
goto done;
|
||
|
}
|
||
| ... | ... | |
|
KKASSERT(fs.lookup_still_valid != 0);
|
||
|
vm_page_flag_set(fs.mary[0], PG_REFERENCED);
|
||
|
/*
|
||
|
* Mark pages mapped via VPAGETABLE so the pmap layer knows
|
||
|
* that the backing_list scan won't find these mappings.
|
||
|
* The vkernel is responsible for calling MADV_INVAL when
|
||
|
* it modifies its page tables.
|
||
|
*/
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
for (n = 0; n < mextcount; ++n)
|
||
|
vm_page_flag_set(fs.mary[n], PG_VPTMAPPED);
|
||
|
}
|
||
|
for (n = 0; n < mextcount; ++n) {
|
||
|
pmap_enter(fs.map->pmap, vaddr + (n << PAGE_SHIFT),
|
||
|
fs.mary[n], fs.prot | inherit_prot,
|
||
|
fs.wflags & FW_WIRED, fs.entry);
|
||
|
}
|
||
|
if (didilock)
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
/*
|
||
|
* If the page is not wired down, then put it where the pageout daemon
|
||
|
* can find it.
|
||
| ... | ... | |
|
if (fs->fault_flags & VM_FAULT_WIRE_MASK)
|
||
|
return KERN_FAILURE;
|
||
|
/*
|
||
|
* Can't handle VPAGETABLE - requires vm_fault_vpagetable() to
|
||
|
* translate the pindex.
|
||
|
*/
|
||
|
if (fs->entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
#ifdef VM_FAULT_QUICK_DEBUG
|
||
|
++vm_fault_bypass_failure_count1;
|
||
|
#endif
|
||
|
return KERN_FAILURE;
|
||
|
}
|
||
|
/*
|
||
|
* Ok, try to get the vm_page quickly via the hash table. The
|
||
|
* page will be soft-busied on success (NOT hard-busied).
|
||
| ... | ... | |
|
fs.vp = vnode_pager_lock(fs.first_ba); /* shared */
|
||
|
/*
|
||
|
* The page we want is at (first_object, first_pindex).
|
||
|
* The page we want is at (first_object, first_pindex), but if the
|
||
|
* vm_map_entry is VM_MAPTYPE_VPAGETABLE we have to traverse the
|
||
|
* page table to figure out the actual pindex.
|
||
|
*
|
||
|
* NOTE! DEVELOPMENT IN PROGRESS, THIS IS AN INITIAL IMPLEMENTATION
|
||
|
* ONLY
|
||
|
*/
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
result = vm_fault_vpagetable(&fs, &first_pindex,
|
||
|
fs.entry->aux.master_pde,
|
||
|
fault_type, 1);
|
||
|
first_count = 1;
|
||
|
if (result == KERN_TRY_AGAIN) {
|
||
|
++retry;
|
||
|
goto RetryFault;
|
||
|
}
|
||
|
if (result != KERN_SUCCESS) {
|
||
|
*errorp = result;
|
||
|
fs.mary[0] = NULL;
|
||
|
goto done;
|
||
|
}
|
||
|
}
|
||
|
/*
|
||
|
* Now we have the actual (object, pindex), fault in the page. If
|
||
|
* vm_fault_object() fails it will unlock and deallocate the FS
|
||
|
* data. If it succeeds everything remains locked and fs->ba->object
|
||
| ... | ... | |
|
* modifications made by ptrace().
|
||
|
*/
|
||
|
vm_page_flag_set(fs.mary[0], PG_REFERENCED);
|
||
|
/*
|
||
|
* Mark pages mapped via VPAGETABLE so the pmap layer knows
|
||
|
* that the backing_list scan won't find these mappings.
|
||
|
*/
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE)
|
||
|
vm_page_flag_set(fs.mary[0], PG_VPTMAPPED);
|
||
|
#if 0
|
||
|
pmap_enter(fs.map->pmap, vaddr, fs.mary[0], fs.prot,
|
||
|
fs.wflags & FW_WIRED, NULL);
|
||
| ... | ... | |
|
pmap_remove(fs.map->pmap,
|
||
|
vaddr & ~PAGE_MASK,
|
||
|
(vaddr & ~PAGE_MASK) + PAGE_SIZE);
|
||
|
#ifdef _KERNEL_VIRTUAL
|
||
|
/*
|
||
|
* For the vkernel, we must also call pmap_enter() to install
|
||
|
* the new page in the software page table (VPTE) after COW.
|
||
|
* The native kernel doesn't need this because the hardware
|
||
|
* MMU will fault again, but the vkernel writes via DMAP and
|
||
|
* the guest reads via the VPTE, so the VPTE must be updated
|
||
|
* immediately.
|
||
|
*/
|
||
|
pmap_enter(fs.map->pmap, vaddr, fs.mary[0],
|
||
|
fs.prot, fs.wflags & FW_WIRED, NULL);
|
||
|
#endif
|
||
|
}
|
||
|
/*
|
||
| ... | ... | |
|
return(fs.mary[0]);
|
||
|
}
|
||
|
/*
|
||
|
* Translate the virtual page number (first_pindex) that is relative
|
||
|
* to the address space into a logical page number that is relative to the
|
||
|
* backing object. Use the virtual page table pointed to by (vpte).
|
||
|
*
|
||
|
* Possibly downgrade the protection based on the vpte bits.
|
||
|
*
|
||
|
* This implements an N-level page table. Any level can terminate the
|
||
|
* scan by setting VPTE_PS. A linear mapping is accomplished by setting
|
||
|
* VPTE_PS in the master page directory entry set via mcontrol(MADV_SETMAP).
|
||
|
*/
|
||
|
static
|
||
|
int
|
||
|
vm_fault_vpagetable(struct faultstate *fs, vm_pindex_t *pindex,
|
||
|
vpte_t vpte, int fault_type, int allow_nofault)
|
||
|
{
|
||
|
struct lwbuf *lwb;
|
||
|
struct lwbuf lwb_cache;
|
||
|
int vshift = VPTE_FRAME_END - PAGE_SHIFT; /* index bits remaining */
|
||
|
int result;
|
||
|
vpte_t *ptep;
|
||
|
ASSERT_LWKT_TOKEN_HELD(vm_object_token(fs->first_ba->object));
|
||
|
for (;;) {
|
||
|
/*
|
||
|
* We cannot proceed if the vpte is not valid, not readable
|
||
|
* for a read fault, not writable for a write fault, or
|
||
|
* not executable for an instruction execution fault.
|
||
|
*/
|
||
|
if ((vpte & VPTE_V) == 0) {
|
||
|
unlock_things(fs);
|
||
|
return (KERN_FAILURE);
|
||
|
}
|
||
|
if ((fault_type & VM_PROT_WRITE) && (vpte & VPTE_RW) == 0) {
|
||
|
unlock_things(fs);
|
||
|
return (KERN_FAILURE);
|
||
|
}
|
||
|
if ((fault_type & VM_PROT_EXECUTE) && (vpte & VPTE_NX)) {
|
||
|
unlock_things(fs);
|
||
|
return (KERN_FAILURE);
|
||
|
}
|
||
|
if ((vpte & VPTE_PS) || vshift == 0)
|
||
|
break;
|
||
|
/*
|
||
|
* Get the page table page. Nominally we only read the page
|
||
|
* table, but since we are actively setting VPTE_M and VPTE_A,
|
||
|
* tell vm_fault_object() that we are writing it.
|
||
|
*
|
||
|
* There is currently no real need to optimize this.
|
||
|
*/
|
||