Submit #3389 » 0001-vkernel-Restore-MAP_VPAGETABLE-support-with-COW-VPTE.patch

doc/vpagetable_analysis.txt
---------------------------

MAP_VPAGETABLE Re-implementation Analysis
==========================================

Date: December 2024
Context: Analysis of commit 4d4f84f5f26bf5e9fe4d0761b34a5f1a3784a16f, which
removed MAP_VPAGETABLE support, breaking vkernel functionality.

TABLE OF CONTENTS
-----------------
1. Background
2. Why MAP_VPAGETABLE Was Removed
3. Current VM Architecture
4. The Reverse-Mapping Problem
5. Cost Analysis of Current Mechanisms
6. Proposed Solutions
7. Recommendation
8. Open Questions

==============================================================================
1. BACKGROUND
==============================================================================

MAP_VPAGETABLE was a DragonFly BSD feature that allowed the vkernel (virtual
kernel) to implement software page tables without requiring hardware
virtualization support (Intel VT-x / AMD-V).

The vkernel runs as a userspace process but provides a full kernel
environment.  It needs to manage its own "guest" page tables for processes
running inside the vkernel.  MAP_VPAGETABLE allowed this by:

1. Creating an mmap region with the MAP_VPAGETABLE flag
2. Letting the vkernel write software page table entries (VPTEs) into this
   region
3. Having the host kernel, on page faults, walk these VPTEs to translate
   guest virtual addresses to host physical addresses

The key advantage was lightweight virtualization - no hypervisor, no special
CPU features required.  The vkernel was just a process with some extra kernel
support for the virtual page tables.
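For concreteness, here is a minimal sketch of that setup sequence from the
vkernel side, written against the historical interface (mmap() with
MAP_VPAGETABLE plus mcontrol(MADV_SETMAP), both of which this analysis and
the patched fault code refer to).  The helper name and the exact encoding of
the page-directory value passed to MADV_SETMAP are illustrative assumptions,
not taken from the vkernel sources:

    /*
     * Sketch only: how a vkernel might create its guest-RAM region and
     * register the master page directory.  guest_memory_setup() and the
     * MADV_SETMAP value encoding are assumptions for illustration.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stddef.h>

    void *
    guest_memory_setup(int memfd, size_t bytes, off_t pagedir_value)
    {
            void *base;

            /*
             * Guest "RAM": accesses through this region are resolved via
             * the VPTEs the vkernel writes into it.
             */
            base = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_VPAGETABLE, memfd, 0);
            if (base == MAP_FAILED)
                    return (NULL);

            /*
             * Point the host at the top-level guest page directory; from
             * here on, host page faults walk the software page table.
             */
            if (mcontrol(base, bytes, MADV_SETMAP, pagedir_value) < 0) {
                    munmap(base, bytes);
                    return (NULL);
            }
            return (base);
    }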

==============================================================================
2. WHY MAP_VPAGETABLE WAS REMOVED
==============================================================================

From the commit message:

    "The basic problem is that the VM system is moving to an extent-based
    mechanism for tracking VM pages entered into PMAPs and is no longer
    indexing individual terminal PTEs with pv_entry's.

    This means that the VM system is no longer able to get an exact list of
    PTEs in PMAPs that a particular vm_page is using.  It just has a flag
    'this page is in at least one pmap' or 'this page is not in any pmaps'.

    To track down the PTEs, the VM system must run through the extents via
    the vm_map_backing structures hanging off the related VM object.

    This mechanism does not work with MAP_VPAGETABLE.  Short of scanning
    the entire real pmap, the kernel has no way to reverse-index a page
    that might be indirected through MAP_VPAGETABLE."

The core issue: DragonFly optimized memory by removing per-page tracking
(pv_entry lists) in favor of extent-based tracking (vm_map_backing lists).
This works for normal mappings but breaks VPAGETABLE.

==============================================================================
3. CURRENT VM ARCHITECTURE
==============================================================================

3.1 Key Data Structures
-----------------------

vm_object:
  - Contains pages (rb_memq tree)
  - Has a backing_list: TAILQ of vm_map_backing entries
  - Each vm_map_backing represents an extent that maps part of this object

vm_map_backing:
  - Links a vm_map_entry to a vm_object
  - Contains: pmap, start, end, offset
  - Tracks "pages [offset, offset+size) of this object are mapped at
    virtual addresses [start, end) in this pmap"

vm_page:
  - PG_MAPPED flag: "this page MIGHT be mapped somewhere"
  - PG_WRITEABLE flag: "this page MIGHT have a writable mapping"
  - md.interlock_count: race detection between pmap_enter/pmap_remove_all

3.2 Reverse-Mapping Mechanism
-----------------------------

The PMAP_PAGE_BACKING_SCAN macro (sys/platform/pc64/x86_64/pmap.c:176-220)
finds all PTEs mapping a given physical page:

    for each vm_map_backing in page->object->backing_list:
        if page->pindex is within backing's range:
            compute va = backing->start + (pindex - offset) * PAGE_SIZE
            look up PTE at va in backing->pmap
            if PTE maps our physical page:
                found it!

This works because for NORMAL mappings, the relationship between object
pindex and virtual address is fixed and computable.
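In C terms, the scan amounts to the following.  This is a paraphrase for a
single page, not the literal macro text: locking, the multi-mapping handling,
and the exact linkage/field names are simplified (kernel context, vm/*
headers assumed):

    /*
     * Paraphrase of the backing_list walk for one vm_page.  Field and
     * linkage names are approximations of the real vm_map_backing layout.
     */
    static void
    backing_scan_sketch(vm_page_t m)
    {
            struct vm_map_backing *ba;
            vm_offset_t va;

            TAILQ_FOREACH(ba, &m->object->backing_list, entry) {
                    if (m->pindex < ba->offset ||
                        m->pindex >= ba->offset + atop(ba->end - ba->start))
                            continue;
                    /* Normal mappings: VA is computable from the extent. */
                    va = ba->start + ptoa(m->pindex - ba->offset);
                    /*
                     * Look up the PTE for va in ba->pmap and verify that it
                     * really points at m's physical address before acting
                     * on it (remove it, test/clear A/M bits, etc.).
                     */
            }
    }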

3.3 Why This Doesn't Work for VPAGETABLE
----------------------------------------

With VPAGETABLE:
  - One vm_map_backing covers the entire VPAGETABLE region
  - The vkernel's software page tables can map ANY physical page to
    ANY virtual address within that region
  - The formula "va = start + (pindex - offset) * PAGE_SIZE" is WRONG
  - The actual VA depends on what the vkernel wrote into its guest PTEs

Example:
  - VPAGETABLE region: VA 0x1000000-0x2000000
  - Physical page at object pindex 42
  - Expected VA by formula: 0x1000000 + 42*4096 = 0x102a000
  - Actual VA per guest PTEs: 0x1500000 (and maybe also 0x1800000!)
  - The scan looks at 0x102a000, finds nothing, misses the real mappings

==============================================================================
4. THE REVERSE-MAPPING PROBLEM
==============================================================================

4.1 When Reverse-Mapping is Needed
----------------------------------

The backing_list scan is used by these functions (7 call sites in pmap.c):

  pmap_remove_all()      - Remove page from ALL pmaps (page reclaim, COW)
  pmap_remove_specific() - Remove page from ONE specific pmap
  pmap_testbit()         - Check if Modified bit is set
  pmap_clearbit()        - Clear Access/Modified/Write bits
  pmap_ts_referenced()   - Check/clear reference bits for page aging

4.2 When Reverse-Mapping is NOT Needed
--------------------------------------

Normal page faults do NOT use backing_list scans.  They:

  1. Look up vm_map_entry by faulting VA
  2. Walk the vm_map_backing chain to find/create the page
  3. Call pmap_enter() to install PTE

This is O(1) with respect to other mappings - no scanning.

4.3 The vkernel's Existing Cooperative Mechanism
------------------------------------------------

The vkernel already has a way to notify the host of PTE changes:

    madvise(addr, len, MADV_INVAL)

This tells the host kernel: "I've modified my guest page tables, please
invalidate your cached PTEs for this range."

The host responds with pmap_remove() on the range (vm_map.c:2361-2374).
This mechanism still exists in the codebase and could be leveraged.
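From the vkernel side, the contract looks roughly like the sketch below.
The wrapper name, the single-page granularity, and the vpte_t typedef are
illustrative assumptions; the real vkernel pmap code batches and wraps these
invalidations differently:

    /*
     * Sketch of the cooperative-invalidation contract as seen from the
     * vkernel (userspace).  MADV_INVAL is the documented DragonFly hint.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stdint.h>

    typedef uint64_t vpte_t;        /* assumption: matches the VPTE width */

    static void
    guest_pte_set(vpte_t *ptep, vpte_t newpte, void *uva, size_t len)
    {
            *ptep = newpte;         /* update the software page table */

            /*
             * Ask the host to drop any real PTEs it instantiated for this
             * guest range; the host answers with pmap_remove() over
             * [uva, uva + len) as described above.
             */
            if (madvise(uva, len, MADV_INVAL) < 0) {
                    /* treat as fatal, or fall back to a full-region flush */
            }
    }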

==============================================================================
5. COST ANALYSIS OF CURRENT MECHANISMS
==============================================================================

5.1 The O(N) in backing_list Scan
---------------------------------

N = number of vm_map_backing entries on the object's backing_list.

For typical objects:
  - Private anonymous memory: N = 1 (only the owner maps it)
  - Small private files: N = 1-10
  - Shared libraries (libc.so): N = hundreds to thousands

The scan itself is cheap (pointer chasing + range check), but for shared
objects with many mappings, N can be significant.

5.2 Early Exit Optimizations
----------------------------

pmap_ts_referenced() stops after finding 4 mappings - it doesn't need all.
The PG_MAPPED flag check allows skipping pages that are definitely unmapped.

5.3 When Scans Actually Happen
------------------------------

Scans are triggered by:
  - Page reclaim (pageout daemon) - relatively rare per-page
  - COW fault resolution - once per COW page
  - msync/fsync - when writing dirty pages
  - Process exit - when cleaning up the address space

They do NOT happen on every fault, read, or write.  The common paths
(fault-in, access to an already-mapped page) are O(1).

==============================================================================
6. PROPOSED SOLUTIONS
==============================================================================

6.1 Option A: Cooperative Invalidation Only (Simplest)
------------------------------------------------------

Concept: Don't do reverse-mapping for VPAGETABLE at all.  Rely entirely
on the vkernel calling MADV_INVAL when it modifies guest PTEs.

Implementation:
  1. Re-add VM_MAPTYPE_VPAGETABLE and vm_fault_vpagetable()
  2. Add a PG_VPTMAPPED flag to vm_page
  3. Set PG_VPTMAPPED when a page is mapped via VPAGETABLE
  4. In pmap_remove_all() etc., skip the backing_list scan for VPAGETABLE
     entries (it won't find anything anyway; sketched below)
  5. When reclaiming a PG_VPTMAPPED page, send a signal/notification
     to all vkernel processes, or do a full TLB flush for them

Pros:
  - Minimal code changes
  - No per-mapping memory overhead
  - Fast path stays fast

Cons:
  - Relies on the vkernel being well-behaved with MADV_INVAL
  - May need a "big hammer" (full flush) when reclaiming pages
  - Race window between the vkernel modifying PTEs and calling MADV_INVAL

Cost: O(1) normal case, O(vkernels) for VPTMAPPED page reclaim
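On the reverse-mapping side, Option A reduces to a guard near the top of
each backing_list consumer.  The sketch below is simplified (interlock and
retry logic omitted) and mirrors the direction the pmap.c hunks later in
this patch take; it is not the literal kernel code:

    /*
     * Simplified sketch of the Option A guard in a backing_list consumer
     * such as pmap_remove_all() (kernel context, vm/* headers assumed).
     */
    static void
    pmap_remove_all_sketch(vm_page_t m)
    {
            if ((m->flags & PG_MAPPED) == 0)
                    return;

            /*
             * VPAGETABLE mappings are invisible to the backing_list scan.
             * Conservatively treat the page as modified and referenced and
             * rely on the vkernel's MADV_INVAL (or a reclaim-time
             * broadcast) to remove any real PTEs.
             */
            if (m->flags & PG_VPTMAPPED) {
                    vm_page_dirty(m);
                    vm_page_flag_set(m, PG_REFERENCED);
            }

            /* ... normal PMAP_PAGE_BACKING_SCAN loop for other mappings ... */

            vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE | PG_VPTMAPPED);
    }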

6.2 Option B: Per-Page VPAGETABLE Tracking List
-----------------------------------------------

Concept: Add per-page reverse-map tracking, but ONLY for VPAGETABLE
mappings.  Normal mappings continue using backing_list.

Implementation:
  1. Extend struct md_page:

        struct vpte_rmap {
                pmap_t                  pmap;
                vm_offset_t             va;
                TAILQ_ENTRY(vpte_rmap)  link;
        };

        struct md_page {
                long                    interlock_count;
                TAILQ_HEAD(, vpte_rmap) vpte_list;
        };

  2. In vm_fault_vpagetable(), when establishing a mapping:
     - Allocate a vpte_rmap entry
     - Add it to the page's vpte_list
  3. In pmap_remove() for VPAGETABLE regions:
     - Remove the corresponding vpte_rmap entries
  4. In pmap_remove_all() etc.:
     - After the backing_list scan, also walk the page's vpte_list

Pros:
  - Precise tracking of all VPAGETABLE mappings
  - Works with existing pmap infrastructure
  - No reliance on vkernel cooperation

Cons:
  - Memory overhead: ~24 bytes per VPAGETABLE mapping
  - Requires vpte_rmap allocation/free on every mapping change
  - Adds complexity to the fault path

Cost: O(k) where k = number of VPAGETABLE mappings for this page
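A sketch of the bookkeeping helpers Option B implies is shown below.  The
helper names, the M_VPTERMAP malloc type, and the locking assumption (vm_page
spin lock held by the caller) are hypothetical:

    /*
     * Hypothetical Option B helpers.  vpte_rmap and md_page are as
     * declared above; M_VPTERMAP would need a MALLOC_DEFINE elsewhere.
     */
    static void
    vpte_rmap_add(vm_page_t m, pmap_t pmap, vm_offset_t va)
    {
            struct vpte_rmap *rm;

            rm = kmalloc(sizeof(*rm), M_VPTERMAP, M_WAITOK | M_ZERO);
            rm->pmap = pmap;
            rm->va = va;
            TAILQ_INSERT_TAIL(&m->md.vpte_list, rm, link);
            vm_page_flag_set(m, PG_VPTMAPPED);
    }

    static void
    vpte_rmap_remove(vm_page_t m, pmap_t pmap, vm_offset_t va)
    {
            struct vpte_rmap *rm;

            TAILQ_FOREACH(rm, &m->md.vpte_list, link) {
                    if (rm->pmap == pmap && rm->va == va) {
                            TAILQ_REMOVE(&m->md.vpte_list, rm, link);
                            kfree(rm, M_VPTERMAP);
                            break;
                    }
            }
            if (TAILQ_EMPTY(&m->md.vpte_list))
                    vm_page_flag_clear(m, PG_VPTMAPPED);
    }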

6.3 Option C: Lazy Tracking with Bloom Filter
---------------------------------------------

Concept: Use a probabilistic data structure to quickly determine whether a
page MIGHT be VPAGETABLE-mapped, avoiding expensive scans in the common case.

Implementation:
  1. Each VPAGETABLE pmap has a Bloom filter
  2. When mapping a page via VPAGETABLE, add its PA to the filter
  3. When checking reverse-maps:
     - Test each VPAGETABLE pmap's Bloom filter
     - If negative: definitely not mapped there (skip)
     - If positive: might be mapped, do a full scan of that pmap

Pros:
  - Very fast negative lookups (~O(1))
  - Low memory overhead (fixed-size filter per pmap)
  - No per-mapping tracking needed

Cons:
  - False positives require falling back to a full scan
  - A Bloom filter cannot handle deletions (needs rebuilding or counting)
  - Still requires some form of scan on a positive match

Cost: O(1) for a negative, O(pmap_size) for a positive (with ~1% false
positive rate)
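A per-pmap filter can be a small fixed bit array.  The generic sketch below
illustrates the idea only; the size, the hash mixing, and wherever the filter
would live in struct pmap are assumptions, not from the patch:

    /*
     * Generic Bloom-filter sketch for Option C (kernel context assumed).
     * A counting filter or periodic rebuild would be needed for removals.
     */
    #define VPT_BLOOM_BITS  4096            /* 512 bytes per VPAGETABLE pmap */

    struct vpt_bloom {
            uint64_t bits[VPT_BLOOM_BITS / 64];
    };

    static inline uint32_t
    vpt_bloom_hash(vm_paddr_t pa, uint32_t salt)
    {
            uint64_t h = (pa >> PAGE_SHIFT) * 0x9E3779B97F4A7C15ULL + salt;

            return ((h >> 32) % VPT_BLOOM_BITS);
    }

    static inline void
    vpt_bloom_add(struct vpt_bloom *bf, vm_paddr_t pa)
    {
            uint32_t h1 = vpt_bloom_hash(pa, 0x1234);
            uint32_t h2 = vpt_bloom_hash(pa, 0x9abc);

            bf->bits[h1 / 64] |= 1ULL << (h1 % 64);
            bf->bits[h2 / 64] |= 1ULL << (h2 % 64);
    }

    static inline int
    vpt_bloom_maybe_mapped(const struct vpt_bloom *bf, vm_paddr_t pa)
    {
            uint32_t h1 = vpt_bloom_hash(pa, 0x1234);
            uint32_t h2 = vpt_bloom_hash(pa, 0x9abc);

            return ((bf->bits[h1 / 64] & (1ULL << (h1 % 64))) &&
                    (bf->bits[h2 / 64] & (1ULL << (h2 % 64))));
    }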

6.4 Option D: Shadow PTE Table
------------------------------

Concept: Maintain a kernel-side shadow of the vkernel's page tables,
indexed by physical address for reverse lookups.

Implementation:
  1. Per-VPAGETABLE pmap, maintain an RB-tree or hash table:
       Key:   physical page address
       Value: list of (guest_va, vpte_pointer) pairs
  2. Intercept all writes to VPAGETABLE regions:
     - Make VPAGETABLE regions read-only initially
     - On write fault, update shadow table and allow write
  3. For reverse-mapping:
     - Look up physical address in shadow table
     - Get all VAs directly

Pros:
  - O(log n) or O(1) reverse lookups
  - Precise tracking
  - No vkernel cooperation required

Cons:
  - High overhead for intercepting every PTE write
  - Memory overhead for shadow table
  - Complexity of keeping shadow in sync

Cost: O(1) lookup, but O(1) overhead on every guest PTE modification
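For the shadow index itself, a per-pmap RB-tree keyed by (physical page,
guest VA) would be a natural fit with the kernel's <sys/tree.h> macros.
Everything named below is hypothetical; only the idea (a reverse index keyed
by physical page) comes from the text above:

    /*
     * Hypothetical Option D shadow-index structures.  vpte_t comes from
     * the vkernel headers; kernel vm types assumed.
     */
    #include <sys/tree.h>

    struct vpte_shadow {
            RB_ENTRY(vpte_shadow)   node;
            vm_paddr_t              pa;     /* key: physical page address */
            vm_offset_t             gva;    /* guest VA mapping that page */
            vpte_t                  *vptep; /* location of the guest PTE */
    };

    static int
    vpte_shadow_cmp(struct vpte_shadow *a, struct vpte_shadow *b)
    {
            if (a->pa != b->pa)
                    return ((a->pa < b->pa) ? -1 : 1);
            if (a->gva != b->gva)
                    return ((a->gva < b->gva) ? -1 : 1);
            return (0);
    }

    RB_HEAD(vpte_shadow_tree, vpte_shadow);     /* one per VPAGETABLE pmap */
    RB_GENERATE_STATIC(vpte_shadow_tree, vpte_shadow, node, vpte_shadow_cmp);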

6.5 Option E: Hardware Virtualization (Long-term)
-------------------------------------------------

Concept: Use Intel EPT or AMD NPT for the vkernel, as suggested in the
original commit message.

Implementation:
  - vkernel runs as a proper VM guest
  - Hardware handles guest-to-host address translation
  - Host kernel manages EPT/NPT tables
  - Normal backing_list mechanism works

Pros:
  - Native hardware performance
  - Clean architecture
  - Industry-standard approach

Cons:
  - Requires VT-x/AMD-V CPU support
  - vkernel becomes a "real" VM, loses its lightweight process nature
  - Significant implementation effort
  - Different architecture than the original vkernel design

Cost: Best possible performance, but changes vkernel's nature

==============================================================================
7. RECOMMENDATION
==============================================================================

For re-enabling VPAGETABLE with minimal disruption, I recommend a
HYBRID APPROACH combining Options A and B:

Phase 1: Cooperative + Flag (Quick Win)
---------------------------------------
  1. Re-add VM_MAPTYPE_VPAGETABLE
  2. Add a PG_VPTMAPPED flag to track "might be VPAGETABLE-mapped"
  3. Restore vm_fault_vpagetable() to walk guest page tables
  4. In reverse-mapping functions, for PG_VPTMAPPED pages:
     - Skip the normal backing_list scan (it won't find anything)
     - Call an MADV_INVAL equivalent on all VPAGETABLE regions
       that MIGHT contain this page
  5. Require the vkernel to be cooperative with MADV_INVAL

This gets the vkernel working again with minimal changes.

Phase 2: Optional Per-Page Tracking (If Needed)
-----------------------------------------------
If Phase 1 proves insufficient (too many unnecessary invalidations,
race conditions, etc.), add Option B's per-page vpte_list:

  1. Track (pmap, va) pairs for each VPAGETABLE mapping
  2. Use them for precise invalidation instead of broad MADV_INVAL
  3. Memory cost is bounded by actual VPAGETABLE usage

Phase 3: Long-term Hardware Support (Optional)
----------------------------------------------
If demand exists for better vkernel performance:

  1. Implement EPT/NPT support as in Option E
  2. Keep VPAGETABLE as a fallback for non-VT-x systems
  3. Auto-detect and use the best available method

==============================================================================
8. OPEN QUESTIONS
==============================================================================

Q1: How important is precise tracking vs. over-invalidation?
  - If we can tolerate occasional unnecessary TLB flushes for vkernel
    processes, Option A alone may be sufficient.
  - Need to understand vkernel workload characteristics.

Q2: How many active VPAGETABLE regions would typically exist?
  - Usually one vkernel with one region
  - Or multiple vkernels running simultaneously?
  - Affects cost of the "scan all VPAGETABLE regions" approach

Q3: Is the vkernel already disciplined about calling MADV_INVAL?
  - The mechanism exists and was used before
  - Need to verify vkernel code still does this properly
  - If so, cooperative invalidation is viable

Q4: What are the performance expectations for vkernel?
  - Is it acceptable to be slower than native?
  - How much slower is acceptable?
  - This affects whether we need precise tracking

Q5: Is hardware virtualization an acceptable long-term direction?
  - Would change vkernel's nature from "lightweight process" to "VM"
  - May or may not align with project goals
  - Affects investment in software VPAGETABLE solutions

==============================================================================
APPENDIX A: Key Source Files
==============================================================================

sys/vm/vm_fault.c                      - Page fault handling, vm_fault_vpagetable removed
sys/vm/vm_map.c                        - Address space management, MADV_INVAL handling
sys/vm/vm_map.h                        - vm_map_entry, vm_map_backing structures
sys/vm/vm_object.h                     - vm_object with backing_list
sys/vm/vm_page.h                       - vm_page, md_page structures
sys/vm/vm.h                            - VM_MAPTYPE_* definitions
sys/platform/pc64/x86_64/pmap.c        - Real kernel pmap, PMAP_PAGE_BACKING_SCAN
sys/platform/pc64/include/pmap.h       - Real kernel md_page (no pv_list)
sys/platform/vkernel64/platform/pmap.c - vkernel pmap (HAS pv_list!)
sys/platform/vkernel64/include/pmap.h  - vkernel md_page with pv_list
sys/sys/vkernel.h                      - vkernel definitions
sys/sys/mman.h                         - MAP_VPAGETABLE definition

==============================================================================
APPENDIX B: Relevant Commit
==============================================================================

Commit: 4d4f84f5f26bf5e9fe4d0761b34a5f1a3784a16f
Author: Matthew Dillon <dillon@apollo.backplane.com>
Date:   Thu Jan 7 11:54:11 2021 -0800

    kernel - Remove MAP_VPAGETABLE

    * This will break vkernel support for now, but after a lot of mulling
      there's just no other way forward.  MAP_VPAGETABLE was basically a
      software page-table feature for mmap()s that allowed the vkernel
      to implement page tables without needing hardware virtualization
      support.

    * The basic problem is that the VM system is moving to an extent-based
      mechanism for tracking VM pages entered into PMAPs and is no longer
      indexing individual terminal PTEs with pv_entry's.

    [... see full commit message for details ...]

    * We will need actual hardware mmu virtualization to get the vkernel
      working again.

==============================================================================
APPENDIX C: Implementation Progress (Phase 1)
==============================================================================

Branch: vpagetable-analysis

COMPLETED CHANGES:
-----------------

1. sys/vm/vm.h
   - Changed VM_MAPTYPE_UNUSED02 back to VM_MAPTYPE_VPAGETABLE (value 2)

2. sys/sys/mman.h
   - Updated comment to indicate MAP_VPAGETABLE is supported

3. sys/vm/vm_page.h
   - Added PG_VPTMAPPED flag (0x00000001) using existing PG_UNUSED0001 slot
   - Documents that it tracks pages mapped via VPAGETABLE regions

4. sys/vm/vm_fault.c
   - Added forward declaration for vm_fault_vpagetable()
   - Added struct vm_map_ilock and didilock variables to vm_fault()
   - Added full vm_fault_vpagetable() function (~140 lines)
   - Added VPAGETABLE check in vm_fault() before vm_fault_object()
   - Added VPAGETABLE check in vm_fault_bypass() to return KERN_FAILURE
   - Added VPAGETABLE check in vm_fault_page()
   - Added VM_MAPTYPE_VPAGETABLE case to vm_fault_wire()

5. sys/vm/vm_map.c (COMPLETE)
   - Added VM_MAPTYPE_VPAGETABLE to switch statements:
       * vmspace_swap_count()
       * vmspace_anonymous_count()
       * vm_map_backing_attach()
       * vm_map_backing_detach()
       * vm_map_entry_dispose()
       * vm_map_clean() (first switch)
       * vm_map_delete()
       * vm_map_backing_replicated()
       * vmspace_fork()
   - Restored MADV_SETMAP functionality
   - vm_map_insert(): Skip prefault for VPAGETABLE
   - vm_map_madvise(): Allow VPAGETABLE for MADV_INVAL (critical for
     cooperative invalidation)
   - vm_map_lookup(): Recognize VPAGETABLE as object-based
   - vm_map_backing_adjust_start/end(): Include VPAGETABLE for clipping
   - vm_map_protect(): Include VPAGETABLE in vnode write timestamp update
   - vm_map_user_wiring/vm_map_kernel_wiring(): Include VPAGETABLE for
     shadow setup
   - vm_map_copy_entry(): Accept VPAGETABLE in assert

NOT RESTORING (Strategic decisions):

1. vm_map_entry_shadow/allocate_object large object (0x7FFFFFFF)
   OLD CODE: Created absurdly large objects because the vkernel could map
   any page to any VA.
   WHY NOT:  With cooperative invalidation (MADV_INVAL), we don't need this
   hack.  Normal-sized objects work because the vkernel invalidates mappings
   when it changes its page tables.

2. vm_map_clean() whole-object flush for VPAGETABLE
   OLD CODE: Flushed the entire object for VPAGETABLE instead of a range.
   WHY NOT:  With cooperative invalidation, range-based cleaning works.
   The vkernel is responsible for calling MADV_INVAL after PTE changes.

3. vmspace_fork_normal_entry() backing chain collapse skip
   OLD CODE: Skipped the backing chain optimization for VPAGETABLE.
   WHY NOT:  The optimization should work fine.  If issues arise, the
   vkernel will call MADV_INVAL.

FUTURE WORK (After vm_map.c):
-----------------------------

1. Update pmap reverse-mapping (sys/platform/pc64/x86_64/pmap.c)
   - Handle PG_VPTMAPPED pages in PMAP_PAGE_BACKING_SCAN
   - Skip normal scan, use cooperative invalidation instead

2. Track active VPAGETABLE pmaps
   - Mechanism to broadcast invalidation to all vkernels (see the sketch
     after this list)
   - Needed when reclaiming PG_VPTMAPPED pages

3. Verify vkernel code
   - Check that sys/platform/vkernel64/ properly calls MADV_INVAL
   - Ensure the cooperative invalidation contract is maintained

4. Test compilation and runtime
   - Build kernel with changes
   - Test vkernel functionality
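For FUTURE WORK item 2, the registry plus broadcast could look roughly like
the sketch below.  Every name here (the list head, the pm_vptentry linkage
in struct pmap, the helper functions, the locking) is invented for
illustration, and the actual per-pmap flush is elided:

    /*
     * Hypothetical registry of active VPAGETABLE pmaps and a reclaim-time
     * broadcast (kernel context assumed).
     */
    static TAILQ_HEAD(, pmap) vpt_pmap_list =
            TAILQ_HEAD_INITIALIZER(vpt_pmap_list);
    static struct spinlock vpt_pmap_spin;   /* spin_init()'d at boot, not shown */

    void
    vpt_pmap_register(pmap_t pmap)          /* called at MADV_SETMAP time */
    {
            spin_lock(&vpt_pmap_spin);
            TAILQ_INSERT_TAIL(&vpt_pmap_list, pmap, pm_vptentry);
            spin_unlock(&vpt_pmap_spin);
    }

    void
    vpt_pmap_broadcast_inval(vm_page_t m)   /* reclaiming a PG_VPTMAPPED page */
    {
            pmap_t pmap;

            spin_lock(&vpt_pmap_spin);
            TAILQ_FOREACH(pmap, &vpt_pmap_list, pm_vptentry) {
                    /*
                     * Big hammer: remove any real PTEs this vkernel pmap
                     * still holds for m's physical address (full pmap scan
                     * or a per-pmap invalidation request to the vkernel).
                     */
            }
            spin_unlock(&vpt_pmap_spin);
    }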

share/man/man7/vkernel.7
------------------------
|
.\" OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
||
|
.\" SUCH DAMAGE.
|
||
|
.\"
|
||
|
.Dd September 7, 2021
|
||
|
.Dd December 18, 2025
|
||
|
.Dt VKERNEL 7
|
||
|
.Os
|
||
|
.Sh NAME
|
||
| ... | ... | |
|
.Op Fl hstUvz
|
||
|
.Op Fl c Ar file
|
||
|
.Op Fl e Ar name Ns = Ns Li value : Ns Ar name Ns = Ns Li value : Ns ...
|
||
|
.Op Fl i Ar file
|
||
|
.Op Fl I Ar interface Ns Op Ar :address1 Ns Oo Ar :address2 Oc Ns Oo Ar /netmask Oc Ns Oo Ar =mac Oc
|
||
|
.Op Fl l Ar cpulock
|
||
|
.Op Fl m Ar size
|
||
|
.Fl m Ar size
|
||
|
.Op Fl n Ar numcpus Ns Op Ar :lbits Ns Oo Ar :cbits Oc
|
||
|
.Op Fl p Ar pidfile
|
||
|
.Op Fl r Ar file Ns Op Ar :serno
|
||
| ... | ... | |
|
This option can be specified more than once.
|
||
|
.It Fl h
|
||
|
Shows a list of available options, each with a short description.
|
||
|
.It Fl i Ar file
|
||
|
Specify a memory image
|
||
|
.Ar file
|
||
|
to be used by the virtual kernel.
|
||
|
If no
|
||
|
.Fl i
|
||
|
option is given, the kernel will generate a name of the form
|
||
|
.Pa /var/vkernel/memimg.XXXXXX ,
|
||
|
with the trailing
|
||
|
.Ql X Ns s
|
||
|
being replaced by a sequential number, e.g.\&
|
||
|
.Pa memimg.000001 .
|
||
|
.It Fl I Ar interface Ns Op Ar :address1 Ns Oo Ar :address2 Oc Ns Oo Ar /netmask Oc Ns Oo Ar =MAC Oc
|
||
|
Create a virtual network device, with the first
|
||
|
.Fl I
|
||
| ... | ... | |
|
Locking the vkernel to a set of cpus is recommended on multi-socket systems
|
||
|
to improve NUMA locality of reference.
|
||
|
.It Fl m Ar size
|
||
|
Specify the amount of memory to be used by the kernel in bytes,
|
||
|
Specify the amount of memory for the virtual kernel in bytes,
|
||
|
.Cm K
|
||
|
.Pq kilobytes ,
|
||
|
.Cm M
|
||
| ... | ... | |
|
and
|
||
|
.Cm G
|
||
|
are allowed.
|
||
|
This option is mandatory.
|
||
|
.It Fl n Ar numcpus Ns Op Ar :lbits Ns Oo Ar :cbits Oc
|
||
|
.Ar numcpus
|
||
|
specifies the number of CPUs you wish to emulate.
|
||
| ... | ... | |
|
to the virtual kernel's
|
||
|
.Xr init 8
|
||
|
process.
|
||
|
.Sh MEMORY MANAGEMENT
|
||
|
The virtual kernel's memory is backed by a temporary file created in
|
||
|
.Pa /var/vkernel
|
||
|
and immediately unlinked.
|
||
|
The file descriptor is kept open for the lifetime of the virtual kernel process.
|
||
|
Both the
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
mapping, which implements the virtual kernel's page tables, and the
|
||
|
direct memory access
|
||
|
.Pq DMAP
|
||
|
region reference this backing store.
|
||
|
.Pp
|
||
|
When the virtual kernel exits, the file descriptor is closed and the
|
||
|
backing store is automatically reclaimed by the operating system.
|
||
|
This ensures proper cleanup of memory resources even if the virtual
|
||
|
kernel terminates abnormally.
|
||
|
.Pp
|
||
|
The following
|
||
|
.Xr sysctl 8
|
||
|
variables provide statistics and debugging for the
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
mechanism:
|
||
|
.Bl -tag -width "vm.vpagetable_faults" -compact
|
||
|
.It Va vm.vpagetable_mmap
|
||
|
Number of
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
mappings created.
|
||
|
.It Va vm.vpagetable_setmap
|
||
|
Number of
|
||
|
.Dv MADV_SETMAP
|
||
|
operations performed.
|
||
|
.It Va vm.vpagetable_inval
|
||
|
Number of
|
||
|
.Dv MADV_INVAL
|
||
|
operations performed.
|
||
|
.It Va vm.vpagetable_faults
|
||
|
Number of page faults handled through VPTE translation.
|
||
|
.It Va vm.debug_vpagetable
|
||
|
Enable debug output for
|
||
|
.Dv MAP_VPAGETABLE
|
||
|
operations.
|
||
|
.El
|
||
|
.Sh DEBUGGING
|
||
|
It is possible to directly gdb the virtual kernel's process.
|
||
|
It is recommended that you do a
|
||

sys/kern/kern_timeout.c
-----------------------
|
/*
|
||
|
* Double check the validity of the callout, detect
|
||
|
* if the originator's structure has been ripped out.
|
||
|
*
|
||
|
* Skip the address range check for virtual kernels
|
||
|
* since vkernel addresses are in host user space.
|
||
|
*/
|
||
|
#ifndef _KERNEL_VIRTUAL
|
||
|
if ((uintptr_t)c->verifier < VM_MAX_USER_ADDRESS) {
|
||
|
spin_unlock(&wheel->spin);
|
||
|
panic("_callout %p verifier %p failed "
|
||
|
"func %p/%p\n",
|
||
|
c, c->verifier, c->rfunc, c->qfunc);
|
||
|
}
|
||
|
#endif
|
||
|
if (c->verifier->toc != c) {
|
||
|
spin_unlock(&wheel->spin);
|
||
|
panic("_callout %p verifier %p failed "
|
||
|
panic("_callout %p verifier %p toc %p (expected %p) "
|
||
|
"func %p/%p\n",
|
||
|
c, c->verifier, c->rfunc, c->qfunc);
|
||
|
c, c->verifier, c->verifier->toc, c,
|
||
|
c->rfunc, c->qfunc);
|
||
|
}
|
||
|
/*
|
||

sys/platform/pc64/x86_64/pmap.c
-------------------------------
|
if ((m->flags & PG_MAPPED) == 0)
|
||
|
return;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan because the VA formula doesn't apply. The vkernel is
|
||
|
* responsible for calling MADV_INVAL to remove real PTEs when it
|
||
|
* modifies its page tables.
|
||
|
*
|
||
|
* For PG_VPTMAPPED pages, conservatively assume the page is
|
||
|
* modified and referenced. The backing_list scan below won't
|
||
|
* find these mappings, but that's OK - the vkernel should have
|
||
|
* already removed them via MADV_INVAL.
|
||
|
*
|
||
|
* Clear PG_VPTMAPPED along with PG_MAPPED at the end.
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED) {
|
||
|
vm_page_dirty(m);
|
||
|
vm_page_flag_set(m, PG_REFERENCED);
|
||
|
}
|
||
|
retry = ticks + hz * 60;
|
||
|
again:
|
||
|
PMAP_PAGE_BACKING_SCAN(m, NULL, ipmap, iptep, ipte, iva) {
|
||
| ... | ... | |
|
m, m->md.interlock_count);
|
||
|
}
|
||
|
}
|
||
|
vm_page_flag_clear(m, PG_MAPPED | PG_MAPPEDMULTI | PG_WRITEABLE);
|
||
|
vm_page_flag_clear(m, PG_MAPPED | PG_MAPPEDMULTI | PG_WRITEABLE |
|
||
|
PG_VPTMAPPED);
|
||
|
}
|
||
|
/*
|
||
| ... | ... | |
|
if (bit == PG_M_IDX && (m->flags & PG_WRITEABLE) == 0)
|
||
|
return FALSE;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan because the VA formula doesn't apply (vkernel can map any
|
||
|
* physical page to any VA). Return TRUE conservatively - the page
|
||
|
* may have the bit set in a vkernel's mapping. The vkernel is
|
||
|
* responsible for calling MADV_INVAL when it modifies its page
|
||
|
* tables.
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED)
|
||
|
return TRUE;
|
||
|
/*
|
||
|
* Iterate the mapping
|
||
|
*/
|
||
| ... | ... | |
|
if ((m->flags & (PG_MAPPED | PG_WRITEABLE)) == 0)
|
||
|
return;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan. The vkernel is responsible for calling MADV_INVAL when it
|
||
|
* modifies its page tables.
|
||
|
*
|
||
|
* For the RW bit: conservatively mark page dirty, clear WRITEABLE.
|
||
|
* For other bits: cannot clear, just return (vkernel handles this).
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED) {
|
||
|
if (bit_index == PG_RW_IDX) {
|
||
|
vm_page_dirty(m);
|
||
|
vm_page_flag_clear(m, PG_WRITEABLE);
|
||
|
}
|
||
|
return;
|
||
|
}
|
||
|
/*
|
||
|
* Being asked to clear other random bits, we don't track them
|
||
|
* so we have to iterate.
|
||
| ... | ... | |
|
if (__predict_false(!pmap_initialized || (m->flags & PG_FICTITIOUS)))
|
||
|
return rval;
|
||
|
/*
|
||
|
* Pages mapped via VPAGETABLE cannot be found via the backing_list
|
||
|
* scan. Return non-zero conservatively to indicate the page may
|
||
|
* be referenced. The vkernel is responsible for calling MADV_INVAL
|
||
|
* when it modifies its page tables.
|
||
|
*/
|
||
|
if (m->flags & PG_VPTMAPPED)
|
||
|
return 1;
|
||
|
PMAP_PAGE_BACKING_SCAN(m, NULL, ipmap, iptep, ipte, iva) {
|
||
|
if (ipte & ipmap->pmap_bits[PG_A_IDX]) {
|
||
|
npte = ipte & ~ipmap->pmap_bits[PG_A_IDX];
|
||

sys/platform/vkernel64/include/md_var.h
---------------------------------------
|
extern char cpu_vendor[]; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_vendor_id; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_id; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_feature; /* XXX belongs in pc64 */
|
||
|
extern u_int cpu_feature2; /* XXX belongs in pc64 */
|
||
|
extern struct vkdisk_info DiskInfo[VKDISK_MAX];
|
||
|
extern int DiskNum;
|
||

sys/platform/vkernel64/include/pmap.h
-------------------------------------
|
#define __VM_MEMATTR_T_DEFINED__
|
||
|
typedef char vm_memattr_t;
|
||
|
#endif
|
||
|
#ifndef __VM_PROT_T_DEFINED__
|
||
|
#define __VM_PROT_T_DEFINED__
|
||
|
typedef u_char vm_prot_t;
|
||
|
#endif
|
||
|
void pmap_bootstrap(vm_paddr_t *, int64_t);
|
||
|
void *pmap_mapdev (vm_paddr_t, vm_size_t);
|
||

sys/platform/vkernel64/platform/copyio.c
----------------------------------------
|
#include <cpu/lwbuf.h>
|
||
|
#include <vm/vm_page.h>
|
||
|
#include <vm/vm_extern.h>
|
||
|
#include <vm/pmap.h>
|
||
|
#include <assert.h>
|
||
|
#include <sys/stat.h>
|
||

sys/platform/vkernel64/platform/init.c
--------------------------------------
|
void *dmap_min_address;
|
||
|
void *vkernel_stack;
|
||
|
u_int cpu_feature; /* XXX */
|
||
|
u_int cpu_feature2; /* XXX */
|
||
|
int tsc_present;
|
||
|
int tsc_invariant;
|
||
|
int tsc_mpsync;
|
||
| ... | ... | |
|
int eflag;
|
||
|
int real_vkernel_enable;
|
||
|
int supports_sse;
|
||
|
uint32_t mxcsr_mask;
|
||
|
size_t vsize;
|
||
|
size_t msize;
|
||
|
size_t kenv_size;
|
||
| ... | ... | |
|
tsc_oneus_approx = ((tsc_frequency|1) + 999999) / 1000000;
|
||
|
/*
|
||
|
* Check SSE
|
||
|
* Check SSE and get the host's MXCSR mask. The mask must be set
|
||
|
* before init_fpu() because npxprobemask() may not work correctly
|
||
|
* in userspace context.
|
||
|
*/
|
||
|
vsize = sizeof(supports_sse);
|
||
|
supports_sse = 0;
|
||
|
sysctlbyname("hw.instruction_sse", &supports_sse, &vsize, NULL, 0);
|
||
|
sysctlbyname("hw.mxcsr_mask", &mxcsr_mask, &msize, NULL, 0);
|
||
|
msize = sizeof(npx_mxcsr_mask);
|
||
|
sysctlbyname("hw.mxcsr_mask", &npx_mxcsr_mask, &msize, NULL, 0);
|
||
|
init_fpu(supports_sse);
|
||
|
if (supports_sse)
|
||
|
cpu_feature |= CPUID_SSE | CPUID_FXSR;
|
||
| ... | ... | |
|
/*
|
||
|
* Initialize system memory. This is the virtual kernel's 'RAM'.
|
||
|
*
|
||
|
* We always use an anonymous memory file (created in /tmp or /var/vkernel
|
||
|
* and immediately unlinked). This ensures proper cleanup of PG_VPTMAPPED
|
||
|
* pages when the vkernel exits - the backing object is destroyed and all
|
||
|
* pages are freed.
|
||
|
*
|
||
|
* The -i option is deprecated but still accepted for compatibility.
|
||
|
*/
|
||
|
static
|
||
|
void
|
||
|
init_sys_memory(char *imageFile)
|
||
|
{
|
||
|
struct stat st;
|
||
|
int i;
|
||
|
int fd;
|
||
|
char *tmpfile;
|
||
|
/*
|
||
|
* Warn if -i was specified (deprecated)
|
||
|
*/
|
||
|
if (imageFile != NULL) {
|
||
|
fprintf(stderr,
|
||
|
"WARNING: -i option is deprecated and ignored.\n"
|
||
|
" Memory is now always anonymous (unlinked file).\n");
|
||
|
}
|
||
|
/*
|
||
|
* Figure out the system memory image size. If an image file was
|
||
|
* specified and -m was not specified, use the image file's size.
|
||
|
* Require -m to be specified
|
||
|
*/
|
||
|
if (imageFile && stat(imageFile, &st) == 0 && Maxmem_bytes == 0)
|
||
|
Maxmem_bytes = (vm_paddr_t)st.st_size;
|
||
|
if ((imageFile == NULL || stat(imageFile, &st) < 0) &&
|
||
|
Maxmem_bytes == 0) {
|
||
|
errx(1, "Cannot create new memory file %s unless "
|
||
|
"system memory size is specified with -m",
|
||
|
imageFile);
|
||
|
if (Maxmem_bytes == 0) {
|
||
|
errx(1, "System memory size must be specified with -m");
|
||
|
/* NOT REACHED */
|
||
|
}
|
||
| ... | ... | |
|
}
|
||
|
/*
|
||
|
* Generate an image file name if necessary, then open/create the
|
||
|
* file exclusively locked. Do not allow multiple virtual kernels
|
||
|
* to use the same image file.
|
||
|
*
|
||
|
* Don't iterate through a million files if we do not have write
|
||
|
* access to the directory, stop if our open() failed on a
|
||
|
* non-existant file. Otherwise opens can fail for any number
|
||
|
* Create an anonymous memory backing file. We create a temp file
|
||
|
* and immediately unlink it. The file descriptor keeps the file
|
||
|
* alive until the vkernel exits, at which point all pages are
|
||
|
* properly freed (including clearing PG_VPTMAPPED).
|
||
|
*/
|
||
|
if (imageFile == NULL) {
|
||
|
for (i = 0; i < 1000000; ++i) {
|
||
|
asprintf(&imageFile, "/var/vkernel/memimg.%06d", i);
|
||
|
fd = open(imageFile,
|
||
|
O_RDWR|O_CREAT|O_EXLOCK|O_NONBLOCK, 0644);
|
||
|
if (fd < 0 && stat(imageFile, &st) == 0) {
|
||
|
free(imageFile);
|
||
|
continue;
|
||
|
}
|
||
|
break;
|
||
|
}
|
||
|
} else {
|
||
|
fd = open(imageFile, O_RDWR|O_CREAT|O_EXLOCK|O_NONBLOCK, 0644);
|
||
|
}
|
||
|
fprintf(stderr, "Using memory file: %s\n", imageFile);
|
||
|
if (fd < 0 || fstat(fd, &st) < 0) {
|
||
|
err(1, "Unable to open/create %s", imageFile);
|
||
|
/* NOT REACHED */
|
||
|
}
|
||
|
asprintf(&tmpfile, "/var/vkernel/.memimg.%d", (int)getpid());
|
||
|
fd = open(tmpfile, O_RDWR|O_CREAT|O_EXCL, 0600);
|
||
|
if (fd < 0)
|
||
|
err(1, "Unable to create %s", tmpfile);
|
||
|
unlink(tmpfile);
|
||
|
free(tmpfile);
|
||
|
fprintf(stderr, "Using anonymous memory (%llu MB)\n",
|
||
|
(unsigned long long)Maxmem_bytes / (1024 * 1024));
|
||
|
/*
|
||
|
* Truncate or extend the file as necessary. Clean out the contents
|
||
|
* of the file, we want it to be full of holes so we don't waste
|
||
|
* time reading in data from an old file that we no longer care
|
||
|
* about.
|
||
|
* Size the file. It will be sparse (no actual disk space used
|
||
|
* until pages are faulted in).
|
||
|
*/
|
||
|
ftruncate(fd, 0);
|
||
|
ftruncate(fd, Maxmem_bytes);
|
||
|
if (ftruncate(fd, Maxmem_bytes) < 0) {
|
||
|
err(1, "Unable to size memory backing file");
|
||
|
/* NOT REACHED */
|
||
|
}
|
||
|
MemImageFd = fd;
|
||
|
Maxmem = Maxmem_bytes >> PAGE_SHIFT;
|
||
| ... | ... | |
|
"\t-c\tSpecify a readonly CD-ROM image file to be used by the kernel.\n"
|
||
|
"\t-e\tSpecify an environment to be used by the kernel.\n"
|
||
|
"\t-h\tThis list of options.\n"
|
||
|
"\t-i\tSpecify a memory image file to be used by the virtual kernel.\n"
|
||
|
"\t-i\t(DEPRECATED) Memory is now always anonymous.\n"
|
||
|
"\t-I\tCreate a virtual network device.\n"
|
||
|
"\t-l\tSpecify which, if any, real CPUs to lock virtual CPUs to.\n"
|
||
|
"\t-m\tSpecify the amount of memory to be used by the kernel in bytes.\n"
|
||
|
"\t-m\tSpecify the amount of memory to be used by the kernel in bytes (required).\n"
|
||
|
"\t-n\tSpecify the number of CPUs and the topology you wish to emulate:\n"
|
||
|
"\t\t\tnumcpus - number of cpus\n"
|
||
|
"\t\t\tlbits - specify the number of bits within APICID(=CPUID)\n"
|
||

sys/platform/vkernel64/platform/pmap.c
--------------------------------------
|
psize = x86_64_btop(size);
|
||
|
if ((object->type != OBJT_VNODE) ||
|
||
|
((limit & MAP_PREFAULT_PARTIAL) && (psize > MAX_INIT_PT) &&
|
||
|
((limit & COWF_PREFAULT_PARTIAL) && (psize > MAX_INIT_PT) &&
|
||
|
(object->resident_page_count > MAX_INIT_PT))) {
|
||
|
return;
|
||
|
}
|
||
| ... | ... | |
|
* don't allow an madvise to blow away our really
|
||
|
* free pages allocating pv entries.
|
||
|
*/
|
||
|
if ((info->limit & MAP_PREFAULT_MADVISE) &&
|
||
|
if ((info->limit & COWF_PREFAULT_MADVISE) &&
|
||
|
vmstats.v_free_count < vmstats.v_free_reserved) {
|
||
|
return(-1);
|
||
|
}
|
||

sys/platform/vkernel64/x86_64/cpu_regs.c
----------------------------------------
|
char *sp;
|
||
|
regs = lp->lwp_md.md_regs;
|
||
|
oonstack = (lp->lwp_sigstk.ss_flags & SS_ONSTACK) ? 1 : 0;
|
||
|
/* Save user context */
|
||
| ... | ... | |
|
do_cpuid(1, regs);
|
||
|
cpu_feature = regs[3];
|
||
|
cpu_feature2 = regs[2];
|
||
|
/*
|
||
|
* The vkernel uses fxsave64/fxrstor64 for FPU state management,
|
||
|
* not xsave/xrstor. Mask out AVX/XSAVE features that we don't
|
||
|
* support, otherwise userland (libc/libm) may try to use AVX
|
||
|
* instructions and the FPU state won't be properly saved/restored,
|
||
|
* leading to FPE or corrupted state.
|
||
|
*/
|
||
|
cpu_feature2 &= ~(CPUID2_XSAVE | CPUID2_OSXSAVE | CPUID2_AVX |
|
||
|
CPUID2_FMA | CPUID2_F16C);
|
||
|
}
|
||

sys/platform/vkernel64/x86_64/exception.c
-----------------------------------------
|
int save;
|
||
|
save = errno;
|
||
|
#if 0
|
||
|
kprintf("CAUGHT SIG %d RIP %08lx ERR %08lx TRAPNO %ld "
|
||
|
"err %ld addr %08lx\n",
|
||
|
signo,
|
||
|
ctx->uc_mcontext.mc_rip,
|
||
|
ctx->uc_mcontext.mc_err,
|
||
|
ctx->uc_mcontext.mc_trapno & 0xFFFF,
|
||
|
ctx->uc_mcontext.mc_trapno >> 16,
|
||
|
ctx->uc_mcontext.mc_addr);
|
||
|
#endif
|
||
|
kern_trap((struct trapframe *)&ctx->uc_mcontext.mc_rdi);
|
||
|
splz();
|
||
|
errno = save;
|
||

sys/platform/vkernel64/x86_64/npx.c
-----------------------------------
|
#define fnstcw(addr) __asm __volatile("fnstcw %0" : "=m" (*(addr)))
|
||
|
#define fnstsw(addr) __asm __volatile("fnstsw %0" : "=m" (*(addr)))
|
||
|
#define frstor(addr) __asm("frstor %0" : : "m" (*(addr)))
|
||
|
#define fxrstor(addr) __asm("fxrstor %0" : : "m" (*(addr)))
|
||
|
#define fxsave(addr) __asm __volatile("fxsave %0" : "=m" (*(addr)))
|
||
|
#define fxrstor(addr) __asm("fxrstor64 %0" : : "m" (*(addr)))
|
||
|
#define fxsave(addr) __asm __volatile("fxsave64 %0" : "=m" (*(addr)))
|
||
|
#define ldmxcsr(csr) __asm __volatile("ldmxcsr %0" : : "m" (csr))
|
||
|
static void fpu_clean_state(void);
|
||
| ... | ... | |
|
* fnsave are broken, so our treatment breaks fnclex if it is the
|
||
|
* first FPU instruction after a context switch.
|
||
|
*/
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~0xFFBF) && cpu_fxsr) {
|
||
|
krateprintf(&badfprate,
|
||
|
"FXRSTOR: illegal FP MXCSR %08x didinit = %d\n",
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr, didinit);
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= 0xFFBF;
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~npx_mxcsr_mask) && cpu_fxsr) {
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= npx_mxcsr_mask;
|
||
|
lwpsignal(curproc, curthread->td_lwp, SIGFPE);
|
||
|
}
|
||
|
fpurstor(curthread->td_savefpu, 0);
|
||
| ... | ... | |
|
if (td == mdcpu->gd_npxthread)
|
||
|
npxsave(td->td_savefpu);
|
||
|
bcopy(mctx->mc_fpregs, td->td_savefpu, sizeof(*td->td_savefpu));
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~0xFFBF) &&
|
||
|
if ((td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~npx_mxcsr_mask) &&
|
||
|
cpu_fxsr) {
|
||
|
krateprintf(&badfprate,
|
||
|
"pid %d (%s) signal return from user: "
|
||
| ... | ... | |
|
td->td_proc->p_pid,
|
||
|
td->td_proc->p_comm,
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr);
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= 0xFFBF;
|
||
|
td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= npx_mxcsr_mask;
|
||
|
}
|
||
|
td->td_flags |= TDF_USINGFP;
|
||
|
break;
|
||

sys/platform/vkernel64/x86_64/trap.c
------------------------------------
|
eva = frame->tf_addr;
|
||
|
else
|
||
|
eva = 0;
|
||
|
#if 0
|
||
|
kprintf("USER_TRAP AT %08lx xflags %ld trapno %ld eva %08lx\n",
|
||
|
frame->tf_rip, frame->tf_xflags, frame->tf_trapno, eva);
|
||
|
#endif
|
||
|
/*
|
||
|
* Everything coming from user mode runs through user_trap,
|
||
|
* including system calls.
|
||
| ... | ... | |
|
*/
|
||
|
gd = mycpu;
|
||
|
gd->gd_flags |= GDF_VIRTUSER;
|
||
|
r = vmspace_ctl(id, VMSPACE_CTL_RUN, tf,
|
||
|
&curthread->td_savevext);
|
||
| ... | ... | |
|
}
|
||
|
crit_exit();
|
||
|
gd->gd_flags &= ~GDF_VIRTUSER;
|
||
|
#if 0
|
||
|
kprintf("GO USER %d trap %ld EVA %08lx RIP %08lx RSP %08lx XFLAGS %02lx/%02lx\n",
|
||
|
r, tf->tf_trapno, tf->tf_addr, tf->tf_rip, tf->tf_rsp,
|
||
|
tf->tf_xflags, frame->if_xflags);
|
||
|
#endif
|
||
|
/* DEBUG: Only log errors and FPU-related traps */
|
||
|
if (r < 0) {
|
||
|
if (errno == EFAULT) {
|
||
|
panic("vmspace_ctl failed with EFAULT");
|
||
|
}
|
||
|
if (errno != EINTR)
|
||
|
panic("vmspace_ctl failed error %d", errno);
|
||
|
} else {
|
||

sys/sys/mman.h
--------------
|
/*
|
||
|
* Mapping type
|
||
|
*
|
||
|
* NOTE! MAP_VPAGETABLE is no longer supported and will generate a mmap()
|
||
|
* error.
|
||
|
* NOTE! MAP_VPAGETABLE is used by vkernels for software page tables.
|
||
|
*
|
||
|
*/
|
||
|
#define MAP_FILE 0x0000 /* map from file (default) */
|
||
|
#define MAP_ANON 0x1000 /* allocated from memory, swap space */
|
||

sys/vm/vm.h
-----------
|
*/
|
||
|
#define VM_MAPTYPE_UNSPECIFIED 0
|
||
|
#define VM_MAPTYPE_NORMAL 1
|
||
|
#define VM_MAPTYPE_UNUSED02 2 /* was VPAGETABLE */
|
||
|
#define VM_MAPTYPE_VPAGETABLE 2 /* vkernel software page table */
|
||
|
#define VM_MAPTYPE_SUBMAP 3
|
||
|
#define VM_MAPTYPE_UKSMAP 4 /* user-kernel shared memory */
|
||

sys/vm/vm_fault.c
-----------------
|
SYSCTL_INT(_vm, OID_AUTO, debug_fault, CTLFLAG_RW, &debug_fault, 0, "");
|
||
|
__read_mostly static int debug_cluster = 0;
|
||
|
SYSCTL_INT(_vm, OID_AUTO, debug_cluster, CTLFLAG_RW, &debug_cluster, 0, "");
|
||
|
/* VPAGETABLE debugging - counts and optional verbose output */
|
||
|
static long vpagetable_fault_count = 0;
|
||
|
SYSCTL_LONG(_vm, OID_AUTO, vpagetable_faults, CTLFLAG_RW,
|
||
|
&vpagetable_fault_count, 0, "Number of VPAGETABLE faults");
|
||
|
__read_mostly int debug_vpagetable = 0;
|
||
|
SYSCTL_INT(_vm, OID_AUTO, debug_vpagetable, CTLFLAG_RW,
|
||
|
&debug_vpagetable, 0, "Debug VPAGETABLE operations");
|
||
|
#if 0
|
||
|
static int virtual_copy_enable = 1;
|
||
|
SYSCTL_INT(_vm, OID_AUTO, virtual_copy_enable, CTLFLAG_RW,
|
||
| ... | ... | |
|
vm_pindex_t first_count, int *mextcountp,
|
||
|
vm_prot_t fault_type);
|
||
|
static int vm_fault_object(struct faultstate *, vm_pindex_t, vm_prot_t, int);
|
||
|
static int vm_fault_vpagetable(struct faultstate *, vm_pindex_t *,
|
||
|
vpte_t, int, int);
|
||
|
static void vm_set_nosync(vm_page_t m, vm_map_entry_t entry);
|
||
|
static void vm_prefault(pmap_t pmap, vm_offset_t addra,
|
||
|
vm_map_entry_t entry, int prot, int fault_flags);
|
||
| ... | ... | |
|
struct proc *p;
|
||
|
#endif
|
||
|
thread_t td;
|
||
|
struct vm_map_ilock ilock;
|
||
|
int mextcount;
|
||
|
int didilock;
|
||
|
int growstack;
|
||
|
int retry = 0;
|
||
|
int inherit_prot;
|
||
| ... | ... | |
|
if (vm_fault_bypass_count &&
|
||
|
vm_fault_bypass(&fs, first_pindex, first_count,
|
||
|
&mextcount, fault_type) == KERN_SUCCESS) {
|
||
|
didilock = 0;
|
||
|
fault_flags &= ~VM_FAULT_BURST;
|
||
|
goto success;
|
||
|
}
|
||
| ... | ... | |
|
fs.first_ba_held = 1;
|
||
|
/*
|
||
|
* The page we want is at (first_object, first_pindex).
|
||
|
* The page we want is at (first_object, first_pindex), but if the
|
||
|
* vm_map_entry is VM_MAPTYPE_VPAGETABLE we have to traverse the
|
||
|
* page table to figure out the actual pindex.
|
||
|
*
|
||
|
* NOTE! DEVELOPMENT IN PROGRESS, THIS IS AN INITIAL IMPLEMENTATION
|
||
|
* ONLY
|
||
|
*/
|
||
|
didilock = 0;
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
++vpagetable_fault_count;
|
||
|
if (debug_vpagetable) {
|
||
|
kprintf("VPAGETABLE fault: vaddr=%lx pde=%lx type=%02x pid=%d\n",
|
||
|
vaddr, fs.entry->aux.master_pde, fault_type,
|
||
|
(curproc ? curproc->p_pid : -1));
|
||
|
}
|
||
|
vm_map_interlock(fs.map, &ilock, vaddr, vaddr + PAGE_SIZE);
|
||
|
didilock = 1;
|
||
|
result = vm_fault_vpagetable(&fs, &first_pindex,
|
||
|
fs.entry->aux.master_pde,
|
||
|
fault_type, 1);
|
||
|
if (result == KERN_TRY_AGAIN) {
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
++retry;
|
||
|
goto RetryFault;
|
||
|
}
|
||
|
if (result != KERN_SUCCESS) {
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
goto done;
|
||
|
}
|
||
|
}
|
||
|
/*
|
||
|
* Now we have the actual (object, pindex), fault in the page. If
|
||
|
* vm_fault_object() fails it will unlock and deallocate the FS
|
||
|
* data. If it succeeds everything remains locked and fs->ba->object
|
||
| ... | ... | |
|
}
|
||
|
if (result == KERN_TRY_AGAIN) {
|
||
|
if (didilock)
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
++retry;
|
||
|
goto RetryFault;
|
||
|
}
|
||
|
if (result != KERN_SUCCESS) {
|
||
|
if (didilock)
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
goto done;
|
||
|
}
|
||
| ... | ... | |
|
KKASSERT(fs.lookup_still_valid != 0);
|
||
|
vm_page_flag_set(fs.mary[0], PG_REFERENCED);
|
||
|
/*
|
||
|
* Mark pages mapped via VPAGETABLE so the pmap layer knows
|
||
|
* that the backing_list scan won't find these mappings.
|
||
|
* The vkernel is responsible for calling MADV_INVAL when
|
||
|
* it modifies its page tables.
|
||
|
*/
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
for (n = 0; n < mextcount; ++n)
|
||
|
vm_page_flag_set(fs.mary[n], PG_VPTMAPPED);
|
||
|
}
|
||
|
for (n = 0; n < mextcount; ++n) {
|
||
|
pmap_enter(fs.map->pmap, vaddr + (n << PAGE_SHIFT),
|
||
|
fs.mary[n], fs.prot | inherit_prot,
|
||
|
fs.wflags & FW_WIRED, fs.entry);
|
||
|
}
|
||
|
if (didilock)
|
||
|
vm_map_deinterlock(fs.map, &ilock);
|
||
|
/*
|
||
|
* If the page is not wired down, then put it where the pageout daemon
|
||
|
* can find it.
|
||
| ... | ... | |
|
if (fs->fault_flags & VM_FAULT_WIRE_MASK)
|
||
|
return KERN_FAILURE;
|
||
|
/*
|
||
|
* Can't handle VPAGETABLE - requires vm_fault_vpagetable() to
|
||
|
* translate the pindex.
|
||
|
*/
|
||
|
if (fs->entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
#ifdef VM_FAULT_QUICK_DEBUG
|
||
|
++vm_fault_bypass_failure_count1;
|
||
|
#endif
|
||
|
return KERN_FAILURE;
|
||
|
}
|
||
|
/*
|
||
|
* Ok, try to get the vm_page quickly via the hash table. The
|
||
|
* page will be soft-busied on success (NOT hard-busied).
|
||
| ... | ... | |
|
fs.vp = vnode_pager_lock(fs.first_ba); /* shared */
|
||
|
/*
|
||
|
* The page we want is at (first_object, first_pindex).
|
||
|
* The page we want is at (first_object, first_pindex), but if the
|
||
|
* vm_map_entry is VM_MAPTYPE_VPAGETABLE we have to traverse the
|
||
|
* page table to figure out the actual pindex.
|
||
|
*
|
||
|
* NOTE! DEVELOPMENT IN PROGRESS, THIS IS AN INITIAL IMPLEMENTATION
|
||
|
* ONLY
|
||
|
*/
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE) {
|
||
|
result = vm_fault_vpagetable(&fs, &first_pindex,
|
||
|
fs.entry->aux.master_pde,
|
||
|
fault_type, 1);
|
||
|
first_count = 1;
|
||
|
if (result == KERN_TRY_AGAIN) {
|
||
|
++retry;
|
||
|
goto RetryFault;
|
||
|
}
|
||
|
if (result != KERN_SUCCESS) {
|
||
|
*errorp = result;
|
||
|
fs.mary[0] = NULL;
|
||
|
goto done;
|
||
|
}
|
||
|
}
|
||
|
/*
|
||
|
* Now we have the actual (object, pindex), fault in the page. If
|
||
|
* vm_fault_object() fails it will unlock and deallocate the FS
|
||
|
* data. If it succeeds everything remains locked and fs->ba->object
|
||
| ... | ... | |
|
* modifications made by ptrace().
|
||
|
*/
|
||
|
vm_page_flag_set(fs.mary[0], PG_REFERENCED);
|
||
|
/*
|
||
|
* Mark pages mapped via VPAGETABLE so the pmap layer knows
|
||
|
* that the backing_list scan won't find these mappings.
|
||
|
*/
|
||
|
if (fs.entry->maptype == VM_MAPTYPE_VPAGETABLE)
|
||
|
vm_page_flag_set(fs.mary[0], PG_VPTMAPPED);
|
||
|
#if 0
|
||
|
pmap_enter(fs.map->pmap, vaddr, fs.mary[0], fs.prot,
|
||
|
fs.wflags & FW_WIRED, NULL);
|
||
| ... | ... | |
|
pmap_remove(fs.map->pmap,
|
||
|
vaddr & ~PAGE_MASK,
|
||
|
(vaddr & ~PAGE_MASK) + PAGE_SIZE);
|
||
|
#ifdef _KERNEL_VIRTUAL
|
||
|
/*
|
||
|
* For the vkernel, we must also call pmap_enter() to install
|
||
|
* the new page in the software page table (VPTE) after COW.
|
||
|
* The native kernel doesn't need this because the hardware
|
||
|
* MMU will fault again, but the vkernel writes via DMAP and
|
||
|
* the guest reads via the VPTE, so the VPTE must be updated
|
||
|
* immediately.
|
||
|
*/
|
||
|
pmap_enter(fs.map->pmap, vaddr, fs.mary[0],
|
||
|
fs.prot, fs.wflags & FW_WIRED, NULL);
|
||
|
#endif
|
||
|
}
|
||
|
/*
|
||
| ... | ... | |
|
return(fs.mary[0]);
|
||
|
}
|
||
|
/*
|
||
|
* Translate the virtual page number (first_pindex) that is relative
|
||
|
* to the address space into a logical page number that is relative to the
|
||
|
* backing object. Use the virtual page table pointed to by (vpte).
|
||
|
*
|
||
|
* Possibly downgrade the protection based on the vpte bits.
|
||
|
*
|
||
|
* This implements an N-level page table. Any level can terminate the
|
||
|
* scan by setting VPTE_PS. A linear mapping is accomplished by setting
|
||
|
* VPTE_PS in the master page directory entry set via mcontrol(MADV_SETMAP).
|
||
|
*/
|
||
|
static
|
||
|
int
|
||
|
vm_fault_vpagetable(struct faultstate *fs, vm_pindex_t *pindex,
|
||
|
vpte_t vpte, int fault_type, int allow_nofault)
|
||
|
{
|
||
|
struct lwbuf *lwb;
|
||
|
struct lwbuf lwb_cache;
|
||
|
int vshift = VPTE_FRAME_END - PAGE_SHIFT; /* index bits remaining */
|
||
|
int result;
|
||
|
vpte_t *ptep;
|
||
|
ASSERT_LWKT_TOKEN_HELD(vm_object_token(fs->first_ba->object));
|
||
|
for (;;) {
|
||
|
/*
|
||
|
* We cannot proceed if the vpte is not valid, not readable
|
||
|
* for a read fault, not writable for a write fault, or
|
||
|
* not executable for an instruction execution fault.
|
||
|
*/
|
||
|
if ((vpte & VPTE_V) == 0) {
|
||
|
unlock_things(fs);
|
||
|
return (KERN_FAILURE);
|
||
|
}
|
||
|
if ((fault_type & VM_PROT_WRITE) && (vpte & VPTE_RW) == 0) {
|
||
|
unlock_things(fs);
|
||
|
return (KERN_FAILURE);
|
||
|
}
|
||
|
if ((fault_type & VM_PROT_EXECUTE) && (vpte & VPTE_NX)) {
|
||
|
unlock_things(fs);
|
||
|
return (KERN_FAILURE);
|
||
|
}
|
||
|
if ((vpte & VPTE_PS) || vshift == 0)
|
||
|
break;
|
||
|
/*
|
||
|
* Get the page table page. Nominally we only read the page
|
||
|
* table, but since we are actively setting VPTE_M and VPTE_A,
|
||
|
* tell vm_fault_object() that we are writing it.
|
||
|
*
|
||
|
* There is currently no real need to optimize this.
|
||
|
*/
|
||