Bug #2819
Random micro system freezes after a week of uptime
Description
On a file server, the system freezes for a few seconds to more than a minute after approximately a week of uptime.
The longer the machine stays up, the worse the freezes become. The micro-freezes happen more often and become longer.
It is a complete kernel freeze: characters typed on the console stop appearing on the screen when it happens.
Characters don't stop being displayed when the operating system is stopped and the kernel debugger is active, which indicates that the problem is not of hardware origin.
Hardware specs:
- Xeon E5-2620, 64 GB RAM
- Areca RAID controller
- 2x 500GB system disks (RAID 1)
- 11x 2 GB data disks (RAID 5)
- 1x 512 GB SSD (JBOD), used entirely for swap
- 10Gb Intel X540 ethernet adapter
Software configuration:
- swapcache enabled, up to 85% of the available swap size
- deduplication enabled on the data volume
The data volume is used for two things:
- protein sequences
- a rsnapshot backup directory for various servers
       Updated by ftigeot over 10 years ago
    vmstat -m output when the machine suffers from micro-freezes:
Memory statistics by type                          Type  Kern
              Type   InUse  MemUse HighUse       Limit  Requests Limit Limit
      HAMMER-inodes 4300462 3762905K      0K  134203388K  35443885 0     0
      HAMMER-others 233194  40792K      0K    6498816K 525146675 0     0
    tmpfs name zone      0      0K      0K    6498816K      5197 0     0
       tmpfs dirent     56      5K      0K    6498816K        56 0     0
         tmpfs node     64     18K      0K    6498816K        64 0     0
      HAMMER-inodes   1014    888K      0K  134203388K    176625 0     0
      HAMMER-others   1799    357K      0K    6498816K   2181892 0     0
           pci_link     16      2K      0K    6498816K        16 0     0
           acpitask      0      0K      0K    6498816K         6 0     0
             acpica  42238   1467K      0K    6498816K    245264 0     0
            acpidev    245     10K      0K    6498816K       245 0     0
            acpisem     59      3K      0K    6498816K        59 0     0
       eventhandler     35      2K      0K    6498816K        35 0     0
               disk      6      1K      0K    6498816K         6 0     0
           atkbddev      2      1K      0K    6498816K         2 0     0
                bus   1512    191K      0K    6498816K     31013 0     0
            callout     12  49152K      0K    6498816K        12 0     0
           nexusdev      7      1K      0K    6498816K         7 0     0
             sysctl      0      0K      0K    6498816K   1043131 0     0
          sysctloid   7003    229K      0K    6498816K      7155 0     0
            tslpque     11    704K      0K    6498816K        11 0     0
            syscons     41    167K      0K    6498816K        41 0     0
         aesni_data      1      1K      0K    6498816K         1 0     0
             dsched   8366    792K      0K    6498816K      8366 0     0
       lwkt message     22     21K      0K    6498816K      5574 0     0
             thread    233    321K      0K    6498816K       235 0     0
            scsi_da      0      0K      0K    6498816K         9 0     0
            memdesc      1      4K      0K    6498816K         1 0     0
        MPipe Array      2      3K      0K    6498816K         2 0     0
              cache 133175  13201K      0K    6498816K    133267 0     0
             devbuf   2044   2583K      0K    6498816K      2060 0     0
               temp    501    129K      0K    6498816K 958039631 0     0
             ip6ndp     12      1K      0K    6498816K        15 0     0
          CAM queue     32      8K      0K    6498816K      1549 0     0
              xform      0      0K      0K    6498816K     18666 0     0
             crypto      1      1K      0K    6498816K         1 0     0
           propstng   1383     55K      0K    6498816K      1383 0     0
        prop string   1375     11K      0K    6498816K      1457 0     0
           propnmbr   1636     90K      0K    6498816K      1636 0     0
            pdict16     20      2K      0K    6498816K        20 0     0
           propdict    681    128K      0K    6498816K       681 0     0
    prop dictionary    677    170K      0K    6498816K       715 0     0
             kbdmux      6      8K      0K    6498816K         6 0     0
             isadev     20      2K      0K    6498816K        20 0     0
               ZONE      1      4K      0K    6498816K         1 0     0
            uidinfo      4     65K      0K    6498816K    183908 0     0
               cred     14      2K      0K    6498816K    977404 0     0
               pgrp     26      4K      0K    6498816K    194639 0     0
            session     23      2K      0K    6498816K    194593 0     0
            vmspace    144    126K      0K    6498816K       154 0     0
               proc     41     71K      0K    6498816K   1175580 0     0
                lwp     42     27K      0K    6498816K    981838 0     0
            subproc     84    122K      0K    6498816K    983962 0     0
        tmpfs mount      1      1K      0K    6498816K         1 0     0
       HAMMER-mount      2    136K      0K    6498816K         2 0     0
           objcache     64     43K      0K    6498816K        64 0     0
              devfs   3710    592K      0K    6498816K      4185 0     0
  objcache magazine  11459  11149K      0K    6498816K     11459 0     0
        UFS dirhash     21     16K      0K    6498816K        21 0     0
          UFS mount      3      5K      0K    6498816K         3 0     0
          UFS ihash      1  16384K      0K    6498816K         1 0     0
           FFS node   1184    370K      0K  134203388K      1193 0     0
            pagedep      1   8192K      0K    6498816K         1 0     0
           inodedep      1  65536K      0K    6498816K         1 0     0
             newblk      1      1K      0K    6498816K         1 0     0
           p1003.1b      1      1K      0K    6498816K         1 0     0
              lockf      7      1K      0K    6498816K      3856 0     0
             atexit      2      1K      0K    6498816K         2 0     0
          proc-args     34      2K      0K    6498816K    641794 0     0
          exec-args     20   5200K      0K    6498816K        20 0     0
             kqueue     40      5K      0K    6498816K  39853084 0     0
               kenv     37      6K      0K    6498816K        37 0     0
          file desc     41     62K      0K    6498816K    985898 0     0
               file     99     13K      0K    6498816K  83268868 0     0
              sigio      1      1K      0K    6498816K         1 0     0
         NFS daemon      5     18K      0K    6498816K         5 0     0
      NFSV3 srvdesc      0      0K      0K    6498816K  53078720 0     0
           NFS hash      1  65536K      0K    6498816K         1 0     0
        NFS srvsock      2      2K      0K    6498816K        39 0     0
       ip6_moptions      1      1K      0K    6498816K         1 0     0
           syncache      8     96K      0K    6498816K      2652 0     0
            tcptemp     25      2K      0K    6498816K        25 0     0
               sblk      2      1K      0K    6498816K     34712 0     0
          tseg_qent      0      0K      0K    6498816K       664 0     0
                ipq    250     12K      0K    6498816K       250 0     0
                kld    127      8K      0K    6498816K       134 0     0
           in_multi     25      2K      0K    6498816K        25 0     0
               igmp      1      1K      0K    6498816K         1 0     0
             module    338     32K      0K    6498816K       338 0     0
           routetbl    927    143K      0K    6498816K     10772 0     0
             varsym    258     10K      0K    6498816K       272 0     0
              faith      1      1K      0K    6498816K         1 0     0
            CAM SIM      8      2K      0K    6498816K         8 0     0
         CAM periph     13      2K      0K    6498816K       231 0     0
        ISOFS mount      1  65536K      0K    6498816K         1 0     0
           vn_softc      4     11K      0K    6498816K         4 0     0
              clone      6     24K      0K    6498816K         6 0     0
             ifaddr    109     87K      0K    6498816K       109 0     0
        ether_multi     87      5K      0K    6498816K        87 0     0
              ifnet      1      1K      0K    6498816K         8 0     0
                BPF      8      1K      0K    6498816K         8 0     0
      MSDOSFS mount      1  65536K      0K    6498816K         1 0     0
       NULLFS mount      6      4K      0K    6498816K         6 0     0
             vnodes 4302707 1747975K      0K  134203388K  35629122 0     0
        Export Host      1      1K      0K    6498816K         1 0     0
           vnodeops     22     13K      0K    6498816K        22 0     0
          nameibufs     44     44K      0K    6498816K        44 0     0
              mount     13     13K      0K    6498816K        16 0     0
       cluster_save      0      0K      0K    6498816K    194908 0     0
           vfscache 41361964 3672964K      0K    6498816K 1424298426 0     0
         BIO buffer      2      3K      0K    6498816K        23 0     0
              unpcb     14      3K      0K    6498816K     13786 0     0
      CAM dev queue      8      1K      0K    6498816K         8 0     0
             socket     37     26K      0K    6498816K    203194 0     0
             soname      5      1K      0K    6498816K    216667 0     0
                pcb    148    163K      0K    6498816K    272388 0     0
                tag      0      0K      0K    6498816K     67129 0     0
               mbuf 105755  52878K      0K    6498816K 318727634 0     0
             mbufcl  44054 129120K      0K    6498816K     44054 0     0
            mclmeta  44054    689K      0K    6498816K     44054 0     0
               ptys    257    129K      0K    6498816K       259 0     0
               ttys    879    113K      0K    6498816K      3257 0     0
                shm      1     40K      0K    6498816K         1 0     0
            CAM XPT    353    216K      0K    6498816K      1551 0     0
                sem      1    144K      0K    6498816K         1 0     0
                msg      4     27K      0K    6498816K         4 0     0
            MD disk      2      2K      0K    6498816K         2 0     0
             Unitno      1      1K      0K    6498816K         1 0     0
               rman    168     18K      0K    6498816K       534 0     0
               pipe    446    104K      0K    6498816K   1072362 0     0
           ioctlops      0      0K      0K    6498816K    195930 0     0
          taskqueue     28      2K      0K    6498816K        28 0     0
               sbuf      0      0K      0K    6498816K        24 0     0
               SWAP      2 131077K      0K    6498816K         2 0     0
               kobj    234    527K      0K    6498816K       234 0     0
Memory Totals:  In Use    Free    Requests
              9915877K      0K    3486074011
       Updated by ftigeot over 10 years ago
    The bug reporting tool sadly doesn't preserve whitespace.
The following lines are interesting and way out of line compared to other values:
Type InUse MemUse HighUse Limit Requests Limit Limit
HAMMER-inodes 4300462 3762905K 0K 134203388K 35443885 0 0
vnodes 4302707 1747975K 0K 134203388K 35629122 0 0
vfscache 41361964 3672964K 0K 6498816K 1424298426 0 0
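
In the raw vmstat -m output the InUse and MemUse columns of these three types run together (for example 43004623762905K). The Python sketch below shows one possible way to separate them. It is only a heuristic, not something taken from the report: it assumes these three types account for most of the "Memory Totals: In Use" figure, tries every way of cutting each digit string in two, and keeps the combination whose MemUse values add up closest to that total without exceeding it. The function names are made up for illustration.

```python
from itertools import product

# "Memory Totals: In Use" from the vmstat -m output, in kilobytes.
TOTAL_IN_USE_K = 9915877

# Fused InUse + MemUse digit strings from the report (trailing "K" removed).
FUSED = {
    "HAMMER-inodes": "43004623762905",
    "vnodes": "43027071747975",
    "vfscache": "413619643672964",
}


def candidate_splits(digits):
    """Return every (in_use, mem_use_k) pair obtained by cutting the string once."""
    pairs = []
    for cut in range(1, len(digits)):
        head, tail = digits[:cut], digits[cut:]
        if len(tail) > 1 and tail[0] == "0":
            continue  # vmstat would not print MemUse with a leading zero
        pairs.append((int(head), int(tail)))
    return pairs


def best_combination(fused, total_k):
    """Heuristic (an assumption): pick the splits whose MemUse sum is the
    largest value that still fits inside the reported total."""
    names = list(fused)
    best = None
    for combo in product(*(candidate_splits(fused[n]) for n in names)):
        residual = total_k - sum(mem for _, mem in combo)
        if residual < 0:
            continue  # these types alone cannot exceed the total
        if best is None or residual < best[0]:
            best = (residual, dict(zip(names, combo)))
    return best


if __name__ == "__main__":
    residual, splits = best_combination(FUSED, TOTAL_IN_USE_K)
    for name, (in_use, mem_k) in splits.items():
        print(f"{name}: InUse={in_use} MemUse={mem_k}K "
              f"(~{mem_k * 1024 // in_use} bytes each)")
    print(f"left for all other types: {residual}K")
```

The split chosen for HAMMER-inodes can be cross-checked against the second, unfused HAMMER-inodes row (1014 objects in 888K), which gives nearly the same size per object.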
       Updated by ftigeot over 10 years ago
    The /data filesystem contains about 5 million files.
95% of them are hard links or subdirectories in the rsnapshot directory.
       Updated by ftigeot over 10 years ago
    - Status changed from New to In Progress
With the kern.maxvnodes sysctl set to 300,000 after a fresh reboot, the same system shows no micro-freezes after 15 days of uptime.
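
For anyone wanting to try the same workaround from a script, here is a minimal Python sketch. It assumes the usual BSD `sysctl name=value` syntax and root privileges; the helper names are hypothetical, and by default it only returns the command instead of running it.

```python
import subprocess


def maxvnodes_command(limit):
    """Build the sysctl(8) invocation that sets kern.maxvnodes to `limit`."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return ["sysctl", f"kern.maxvnodes={limit}"]


def apply_maxvnodes(limit, dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    cmd = maxvnodes_command(limit)
    if not dry_run:
        # Requires root; raises CalledProcessError if sysctl fails.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # 300000 is the value used in this report.
    print(apply_maxvnodes(300000))
```

The setting made this way does not survive a reboot. Adding the same `kern.maxvnodes=300000` line to /etc/sysctl.conf is the usual way to make it persistent.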
       Updated by ftigeot over 10 years ago
Another data point: when the problems happened, atime was enabled on /data.
The trouble-free uptime of more than 2 weeks is with noatime=on on /data.