Your cluster patching and maintenance strategies

glenimp617 · 1 Aug 2018 at 09:57

Hi Everyone,

Just wondering what everyone is doing regarding patching and general maintenance on clusters.

AFAIK, the options are:

1. Use Cluster-Aware Updating (do you run this manually or on a schedule?)
2. Use SCCM and either reboot manually or use PS scripts on a schedule
3. Use SCCM with the beta server groups feature (how's that going?)
4. Use SCVMM patching
5. Use WSUS
6. Do everything manually
7. Something else

What are you doing about reporting? We run and present reports every Friday which show our estate and the current patch levels. These reports are saved and given to an external auditor every year to maintain our compliance.

Thanks

Conanius · 2 Aug 2018 at 19:18

We've come a long way with our patching strategy over the last 2 years.

Edit - I've rewritten this with more detail to try and help a bit more.

We've gone from the Service Desk (24x7 team) pretty much cranking everything manually to having SCCM auto-patch our Windows boxes and Spacewalk patch our CentOS and RHEL boxes.

On a platform-by-platform basis, we go T&D/Pre Prod/BCDR/Production with 1 week of separation between platforms, and we patch every month following the vendor patch release. Critical vulnerabilities are assessed and pushed out of band if needed.

Within a specific platform, we set up a 4-hour window and 3 patch groups. Clustered/shared services such as domain controllers and SQL are split across the first and last groups, so part of each service stays up while the other part is patched. Non-resilient app servers are done in the 2nd patch group. SCCM pushes the patches to the boxes on the day of the install, but no install is run at that point. When a patch group 'starts', it has a deployed time set 1 minute in the past, so the install starts immediately.
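As a rough illustration of the grouping rule described above, here is a minimal PowerShell sketch. The inventory, the property names (Name, Cluster) and the function name Get-PatchGroup are all made up for the example; this is not how SCCM collections are actually built, just the assignment logic: cluster members alternate between group 1 and group 3, and anything non-resilient lands in group 2.

```powershell
# Hypothetical helper: assigns each server to patch group 1, 2 or 3.
function Get-PatchGroup {
    param(
        [Parameter(Mandatory = $true)]
        [object[]] $Server
    )

    # Clustered/shared services: alternate each cluster's members between
    # the first (1) and last (3) patch group.
    $clusters = $Server | Where-Object { $_.Cluster } | Group-Object -Property Cluster
    foreach ($cluster in $clusters) {
        $index = 0
        foreach ($member in $cluster.Group) {
            $group = if ($index % 2 -eq 0) { 1 } else { 3 }
            [pscustomobject]@{ Name = $member.Name; Cluster = $member.Cluster; PatchGroup = $group }
            $index++
        }
    }

    # Non-resilient servers (no cluster) all go in the middle group.
    foreach ($standalone in ($Server | Where-Object { -not $_.Cluster })) {
        [pscustomobject]@{ Name = $standalone.Name; Cluster = $null; PatchGroup = 2 }
    }
}

# Made-up inventory purely for illustration.
$inventory = @(
    [pscustomobject]@{ Name = 'DC01';  Cluster = 'DomainControllers' }
    [pscustomobject]@{ Name = 'DC02';  Cluster = 'DomainControllers' }
    [pscustomobject]@{ Name = 'SQL01'; Cluster = 'SQL' }
    [pscustomobject]@{ Name = 'SQL02'; Cluster = 'SQL' }
    [pscustomobject]@{ Name = 'APP01'; Cluster = $null }
)

Get-PatchGroup -Server $inventory | Sort-Object PatchGroup, Name | Format-Table
```

With more than two members in a cluster the alternation still keeps at least one member out of each group, but whether that is enough for quorum is something you'd have to check per service.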

We set up our monitoring tool to mute alerts for the duration of the window, and then when it's all back up at the end of the window, our ITSD checks to see if anything hasn't come back up as expected.

In the main, our success has been superb. We've got some customers whose apps don't cope well with reboots (servers need to come up in the right order), but we've worked with them on startup scripts so each app waits for the servers it depends on to come online first.

With SQL, we're still doing a couple of clusters manually, as nothing we've tried for automated failover on those SQL clusters ever seems to work properly... interested to see what others say here. Other SQL clusters just work perfectly: when the box is rebooted, the failover happens as expected. We've tried all sorts of comparisons and can't see the difference.

glenimp617 · 3 Aug 2018 at 08:34

Thanks, that is basically what I've ended up doing: SCCM for the patching, with PowerShell scripts to handle the node failovers.

Did some testing yesterday which went well.

E.g. for a 15-node Hyper-V cluster:

SCCM delivers the updates.
On a set day, a scheduled task runs a batch file every 2hrs for 23hrs. The batch file checks the time and runs a PS script against certain hosts depending on what time it is.
The PS script does the failover, reboots the node and migrates resources back, depending on which node it is.

The reason for this is that we are currently over committed until we buy some new nodes (hopefully next month)

Conanius · 7 Aug 2018 at 17:49

I don't suppose you'd be willing to share that PowerShell script?

glenimp617 · 9 Aug 2018 at 15:06

Conanius said:
I don't suppose you'd be willing to share that PowerShell script?

Of course. Basically we use SCCM to deliver the patches (so they show up on our weekly reports) and reboot the hosts once a month.

A scheduled task runs a batch file on the first Sunday of every month, starting at 7am and repeating every 2hrs for the next 23hrs.

This is the batch file
=============

@echo off
rem Hyper-V reboot script which runs from *NAME OF OUR SCVMM SERVER* every first Sunday of the month. It repeats every 2hrs for 23hrs

rem Build a locale-independent timestamp (YYYYMMDD-HHMMSS) from WMI
for /f "tokens=2 delims==" %%I in ('wmic os get localdatetime /format:list') do set datetime=%%I
set datetime=%datetime:~0,8%-%datetime:~8,6%

rem Use the HHMM part of that timestamp for the checks below; the built-in TIME variable
rem changes format with regional settings (e.g. " 7:00" instead of "07:00")
set hhmm=%datetime:~9,4%
set logfile=c:\scripts\log-%datetime%.txt

echo Scripts running at %datetime% >> "%logfile%"

if "%hhmm%"=="0700" (
echo it's 7 o'clock so let's reboot *SERVER1* and *SERVER2* >> "%logfile%"
powershell.exe -executionpolicy bypass -file c:\scripts\server1.ps1 >> "%logfile%"
powershell.exe -executionpolicy bypass -file c:\scripts\server2.ps1 >> "%logfile%"
)

if "%hhmm%"=="0900" (
echo it's 9 o'clock so let's reboot *SERVER3* and *SERVER4* >> "%logfile%"
powershell.exe -executionpolicy bypass -file c:\scripts\server3.ps1 >> "%logfile%"
powershell.exe -executionpolicy bypass -file c:\scripts\server4.ps1 >> "%logfile%"
)
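
If anyone would rather keep the whole thing in PowerShell, the time-slot dispatch in the batch file above could be sketched roughly like this. It's only an illustration: the $schedule table and the Get-ScriptsForSlot function are names invented for the example, and the paths simply mirror the ones in the batch file.

```powershell
# Map each start time (HHmm) to the per-host scripts to run in that slot.
$schedule = @{
    '0700' = 'server1.ps1', 'server2.ps1'
    '0900' = 'server3.ps1', 'server4.ps1'
}

# Hypothetical helper: returns the scripts due at the given time, or nothing.
function Get-ScriptsForSlot {
    param(
        [Parameter(Mandatory = $true)]
        [hashtable] $Schedule,
        [datetime] $Now = (Get-Date)
    )

    $slot = $Now.ToString('HHmm')
    if ($Schedule.ContainsKey($slot)) {
        return $Schedule[$slot]
    }
    return @()
}

$scriptRoot = 'c:\scripts'
$logFile = '{0}\log-{1:yyyyMMdd-HHmmss}.txt' -f $scriptRoot, (Get-Date)

# Runs nothing unless the current time matches one of the slots exactly.
foreach ($script in (Get-ScriptsForSlot -Schedule $schedule)) {
    "Running $script" | Out-File -FilePath $logFile -Append
    & "$scriptRoot\$script" *>&1 | Out-File -FilePath $logFile -Append
}
```

Adding another pair of hosts is then one more line in the table rather than another copy of the if block.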

This is one of the PS scripts
==================

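# Pause the node and drain its roles/VMs to the other nodes, then give the drain 10 minutes
Invoke-Command -ComputerName server1 {Suspend-ClusterNode -Drain}
Start-Sleep -Seconds 600
# Reboot and wait for the host to come back, then give the cluster service 5 minutes
Restart-Computer -Force -Wait -ComputerName server1
Start-Sleep -Seconds 300
# Bring the node back into the cluster
Invoke-Command -ComputerName server1 {Resume-ClusterNode}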
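
For what it's worth, the fixed sleeps in the script above could probably be swapped for checks on the node itself. The following is an untested sketch, not what we run: the function name Restart-DrainedNode and the timeout values are made up, and it assumes the FailoverClusters module is present on the node (Server 2012 or later, where Suspend-ClusterNode has -Wait and nodes expose DrainStatus).

```powershell
# Hypothetical wrapper: drain, reboot and resume one cluster node.
function Restart-DrainedNode {
    param(
        [Parameter(Mandatory = $true)]
        [string] $NodeName,
        [int] $TimeoutSeconds = 1800   # assumed limit, tune for your cluster
    )

    # -Wait blocks until the drain has finished instead of sleeping a fixed 600s.
    $drainStatus = Invoke-Command -ComputerName $NodeName -ErrorAction Stop -ScriptBlock {
        Suspend-ClusterNode -Drain -Wait -ErrorAction Stop | Out-Null
        "$((Get-ClusterNode -Name $env:COMPUTERNAME).DrainStatus)"
    }
    if ($drainStatus -ne 'Completed') {
        throw "Drain of $NodeName ended with status '$drainStatus'; not rebooting."
    }

    # Returns once PowerShell remoting answers again after the reboot.
    Restart-Computer -ComputerName $NodeName -Force -Wait -For PowerShell `
        -Timeout $TimeoutSeconds -ErrorAction Stop

    # Poll until the cluster service reports the node as Paused (up, but still suspended).
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        Start-Sleep -Seconds 15
        $state = Invoke-Command -ComputerName $NodeName -ErrorAction SilentlyContinue -ScriptBlock {
            "$((Get-ClusterNode -Name $env:COMPUTERNAME -ErrorAction SilentlyContinue).State)"
        }
    } until ($state -eq 'Paused' -or (Get-Date) -gt $deadline)

    if ($state -ne 'Paused') {
        throw "$NodeName did not rejoin the cluster within $TimeoutSeconds seconds."
    }

    Invoke-Command -ComputerName $NodeName -ErrorAction Stop -ScriptBlock { Resume-ClusterNode }
}

# Example call (commented out so nothing reboots by accident):
# Restart-DrainedNode -NodeName server1
```

The main gain is that a failed drain stops the script before the reboot, rather than rebooting a host that might still be holding VMs.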

There is one host which is slightly different as that has to always hold a special VM so the script for that one is:

# -Confirm is a switch, so it has to be written as -Confirm:$false
Stop-VM -Name SPECIALVM -Force -Confirm:$false
Invoke-Command -ComputerName server5 {Suspend-ClusterNode -Drain}
Start-Sleep -Seconds 600
Restart-Computer -Force -Wait -ComputerName server5
Start-Sleep -Seconds 300
# Fail the drained roles back to this node straight away
Invoke-Command -ComputerName server5 {Resume-ClusterNode -Failback Immediate}
Start-Sleep -Seconds 120
Start-VM -Name SPECIALVM -Confirm:$false