Your cluster patching and maintenance strategies

Discussion in 'Servers and Enterprise Solutions' started by TheOracle, Aug 1, 2018.

  1. TheOracle


    Joined: Sep 30, 2005

    Posts: 9,913

    Hi Everyone,

    Just wondering what everyone is doing regarding patching and general maintenance on clusters.

    AFAIK, the options are

    1. Use Cluster-Aware Updating (do you run this manually or on a schedule?)
    2. Use SCCM and either reboot manually or use PS scripts on a schedule
    3. Use SCCM with the beta server groups feature (how's that going?)
    4. Use SCVMM patching
    5. Use WSUS
    6. Do everything manually
    7. Something else

    What are you doing about reporting? Every Friday we run and present reports which show our estate and the current patch levels. These reports are saved and given to an external auditor every year to maintain our compliance.

  2. Conanius


    Joined: Oct 18, 2002

    Posts: 12,201

    We've come a long way with our patching strategy over the last 2 years.

    Edit - I've rewritten this with more detail to try and help a bit more.

    We've gone from the Service Desk (24x7 team) cranking pretty much everything by hand to having SCCM auto-patch our Windows boxes and Spacewalk patch our CentOS and RHEL boxes.

    On a platform-by-platform basis, we go T&D/Pre Prod/BCDR/Production with 1 week of separation between platforms, and we patch every month following the vendor patch release. Critical vulnerabilities are assessed and pushed out of band if needed.

    Within a specific platform, we set up a 4 hour window and 3 patch groups. The first and last groups have any clustered/shared services like domain controllers/SQL/etc split across them. Non-resilient app servers are done in the 2nd patch group. SCCM pushes the patches to the boxes on the day of the install, but no install is run. When a patch group 'starts', it has a deployed time set 1 minute in the past, so the install starts immediately.
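
The group split described above can be sketched in PowerShell. This is only an illustration under assumptions: the function name `Get-PatchGroup`, the `Cluster` property and the server names are all made up, and it simply alternates members of the same clustered service between the first and last groups.

```powershell
# Hypothetical sketch: assign servers to three patch groups.
# Members of the same cluster alternate between group 1 and group 3,
# so a clustered service is never patched entirely in one group.
# Servers with no cluster (non-resilient app servers) go in group 2.
function Get-PatchGroup {
    param(
        [Parameter(Mandatory)] [object[]] $Servers   # objects with Name and Cluster
    )
    $seen = @{}   # cluster name -> number of members assigned so far
    foreach ($s in $Servers) {
        if ([string]::IsNullOrEmpty($s.Cluster)) {
            $group = 2
        }
        else {
            $count = [int]$seen[$s.Cluster]
            $group = if ($count % 2 -eq 0) { 1 } else { 3 }
            $seen[$s.Cluster] = $count + 1
        }
        [pscustomobject]@{ Name = $s.Name; Cluster = $s.Cluster; PatchGroup = $group }
    }
}

# Made-up inventory for illustration only.
$servers = @(
    [pscustomobject]@{ Name = 'DC01';  Cluster = 'DomainControllers' }
    [pscustomobject]@{ Name = 'DC02';  Cluster = 'DomainControllers' }
    [pscustomobject]@{ Name = 'SQL01'; Cluster = 'SqlCluster' }
    [pscustomobject]@{ Name = 'SQL02'; Cluster = 'SqlCluster' }
    [pscustomobject]@{ Name = 'APP01'; Cluster = $null }
)
$plan = Get-PatchGroup -Servers $servers
$plan
```

In practice the resulting groups would map onto SCCM collections; how those are built is outside this sketch.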

    We set up our monitoring tool to mute alerts for the duration of the window. When everything is back up at the end of the window, our ITSD checks whether anything hasn't come back up as expected.
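
A rough sketch of what an end-of-window check could look like in PowerShell. The function name `Get-UnresponsiveServer` is hypothetical, and a single ping is assumed to be a good enough "is it back up" signal; the check is passed in as a script block so it can be swapped for a service or port test.

```powershell
# Hypothetical sketch: return the servers that fail a reachability check.
function Get-UnresponsiveServer {
    param(
        [Parameter(Mandatory)] [string[]] $ComputerName,
        # Default check is one ping; replace with whatever "up" means for the app.
        [scriptblock] $Test = { param($name) Test-Connection -ComputerName $name -Count 1 -Quiet }
    )
    foreach ($name in $ComputerName) {
        if (-not (& $Test $name)) { $name }
    }
}

# Demonstration with a stand-in check so the sketch runs anywhere:
# only 'server1' is treated as reachable here.
$down = Get-UnresponsiveServer -ComputerName 'server1', 'server2' -Test { param($n) $n -eq 'server1' }
$down

# Real use would drop -Test and pass the hosts from the patch group, e.g.
# Get-UnresponsiveServer -ComputerName 'server1', 'server2'
```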

    In the main our success has been superb. Some customers have apps that don't cope well with reboots (servers need to come up in the right order), but we've worked with them on start-up scripts that wait for the other servers to come online correctly.

    With SQL, we're still doing a couple of clusters manually, as nothing we've tried for failover on those SQL clusters seems to work properly. I'm interested to see what others say here. Other clusters just work perfectly: when the box is rebooted, the failover works as expected. We've tried all sorts of comparisons and can't see the difference.
    Last edited: Aug 2, 2018
  3. TheOracle


    Joined: Sep 30, 2005

    Posts: 9,913

    Thanks, that is basically what I've ended up doing: SCCM for the patching, with PowerShell scripts to handle the node failovers.

    Did some testing yesterday which went well.

    E.g. a 15 node HV cluster:

    SCCM delivers the updates.
    On a set day, a scheduled task runs a batch file every 2hrs for 23hrs. This checks the time and runs a PS script against certain hosts depending on what time it is.
    The PS script does the failover, reboots the node and migrates resources back, depending on what node it is.

    The reason for this is that we are currently over committed until we buy some new nodes (hopefully next month)
  4. Conanius


    Joined: Oct 18, 2002

    Posts: 12,201

    I don't suppose you're willing to share that PowerShell script?
  5. TheOracle


    Joined: Sep 30, 2005

    Posts: 9,913

    Of course. Basically we use SCCM to deliver the patches (so they show up on our weekly reports) and reboot the hosts once a month.

    A scheduled task runs a batch file on the first Sunday of every month, starting at 7am and repeating every 2hrs for the next 23hrs.

    This is the batch file

    rem HyperV Reboot Script which runs from *NAME OF OUR SCVMM SERVER* every first Sunday of the month. It repeats every 2hrs for 23hrs
    @Echo off

    for /f "tokens=2 delims==" %%I in ('wmic os get localdatetime /format:list') do set datetime=%%I
    set datetime=%datetime:~0,8%-%datetime:~8,6%

    echo Scripts running at %datetime% >> c:\scripts\log-%datetime%.txt

    echo %time% | find /i "07:00:">nul && (
    echo it's 7 o'clock so let's reboot *SERVER1* and *SERVER2* >> c:\scripts\log-%datetime%.txt
    powershell.exe -executionpolicy bypass -file c:\scripts\server1.ps1 >> c:\scripts\log-%datetime%.txt
    powershell.exe -executionpolicy bypass -file c:\scripts\server2.ps1 >> c:\scripts\log-%datetime%.txt
    )

    echo %time% | find /i "09:00:">nul && (
    echo it's 9 o'clock so let's reboot *SERVER3* and *SERVER4* >> c:\scripts\log-%datetime%.txt
    powershell.exe -executionpolicy bypass -file c:\scripts\server3.ps1 >> c:\scripts\log-%datetime%.txt
    powershell.exe -executionpolicy bypass -file c:\scripts\server4.ps1 >> c:\scripts\log-%datetime%.txt
    )
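
The same time-based dispatch could also be written as a lookup table in PowerShell, which avoids matching on the text of `%time%` (its format can vary with regional settings). This is a sketch, not the script from the post: `Get-HostsForHour` and the `$schedule` table are hypothetical names, and it only prints what it would run.

```powershell
# Hypothetical sketch: map each start hour to the hosts handled in that slot.
$schedule = @{
    7 = 'server1', 'server2'
    9 = 'server3', 'server4'
}

function Get-HostsForHour {
    param(
        [Parameter(Mandatory)] [hashtable] $Schedule,
        [int] $Hour = (Get-Date).Hour   # defaults to the current hour
    )
    # Return the hosts for this hour, or nothing if the slot is empty.
    if ($Schedule.ContainsKey($Hour)) { $Schedule[$Hour] }
}

foreach ($node in (Get-HostsForHour -Schedule $schedule)) {
    $script = "C:\scripts\$node.ps1"
    "Would run $script"     # replace with: & $script
}

# Explicit hour, for checking the table without waiting for the slot:
$nine = Get-HostsForHour -Schedule $schedule -Hour 9
$nine
```

Adding a slot then means adding one line to the table rather than another `find` block.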

    This is one of the PS scripts

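    # Pause the node and drain its roles to the other nodes
    Invoke-Command -ComputerName server1 {Suspend-ClusterNode -Drain}
    # Give the drain 10 minutes to finish
    Start-Sleep -Seconds 600
    # Reboot and wait for the host to come back
    Restart-Computer -Force -Wait -ComputerName server1
    Start-Sleep -Seconds 300
    # Bring the node back into the cluster
    Invoke-Command -ComputerName server1 {Resume-ClusterNode}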
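
The fixed sleeps in the script are a guess at how long the drain and the reboot take. One alternative is to poll for a condition with a timeout. The helper below is a hypothetical sketch (`Wait-Until` is not a built-in cmdlet), demonstrated with a stand-in condition so it runs anywhere. `Suspend-ClusterNode` also has a `-Wait` switch that blocks until the drain finishes, which may be enough on its own for the first sleep.

```powershell
# Hypothetical helper: poll a condition until it is true or the timeout expires.
# Returns $true if the condition was met, $false on timeout.
function Wait-Until {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSeconds = 600,
        [int] $PollSeconds = 15
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        if (& $Condition) { return $true }
        Start-Sleep -Seconds $PollSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Stand-in condition for demonstration: becomes true on the third poll.
$polls = 0
$ok = Wait-Until -PollSeconds 0 -TimeoutSeconds 5 -Condition {
    $script:polls++
    $script:polls -ge 3
}
$ok

# On a real cluster the condition might check the node's drain status, e.g.
# (assumes the FailoverClusters module and that DrainStatus reports 'Completed'):
# Wait-Until { (Invoke-Command -ComputerName server1 { (Get-ClusterNode -Name $env:COMPUTERNAME).DrainStatus.ToString() }) -eq 'Completed' }
```

If the helper returns `$false`, the script could log the timeout and skip the reboot rather than restarting a host that still holds VMs.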

    There is one host which is slightly different as that has to always hold a special VM so the script for that one is:

    Stop-VM -Name SPECIALVM -Force -Confirm:$false
    Invoke-Command -ComputerName server5 {Suspend-ClusterNode -Drain}
    Start-Sleep -Seconds 600
    Restart-Computer -Force -Wait -ComputerName server5
    Start-Sleep -Seconds 300
    Invoke-Command -ComputerName server5 {Resume-ClusterNode -Failback Immediate}
    Start-Sleep -Seconds 120
    Start-VM -Name SPECIALVM -Confirm:$false