I see that yet another CBT bug has surfaced in ESXi 6.
THE WORD FROM GOSTEV
So vSphere 6 keeps cementing its position as the worst VMware release ever, remaining unusable in production for over 8 months after its original release. Just when we thought it was over with vSphere 6 Update 1a, yet another critical issue was discovered last week and documented by VMware as KB2136854. In short, Changed Block Tracking (CBT) data on ESXi 6 cannot be trusted, and there are no real workarounds except downgrading to vSphere 5.5. Yet again, those of you who decided to hold off on jumping to the new vSphere release until it stabilizes will get some pats on the back from colleagues, while the rest will have a few tough weeks ahead...
Unfortunately, there is very little technical information available on this issue from VMware, so at this point it's hard to estimate the impact. And since the article was only published late last week, we have not had a chance to do enough testing to draw solid conclusions. However, what we do know by now is that the issue does exist. Some of our QC folks were working this weekend, and they managed to reproduce the issue by taking incremental backups of a VM with continuous heavy write I/O generated inside the guest. They then performed the exact same test in a vSphere 5.5 lab, and there were no issues: CBT information about changed blocks was correct.
As you can see from the KB article, the workarounds proposed by VMware are not really feasible. Luckily, Veeam users have one more option, and that is to disable the use of CBT data in the advanced job settings. Your backups and replicas will remain incremental, but they will take longer, because the job will need to read the entire source disk to determine the changes. Disabling CBT is essential; otherwise, even an Active Full backup may contain corruption, because CBT data is used there too, to determine and skip zeroed regions of virtual disks. On the other hand, disabling CBT is sufficient to both prevent and remediate the issue, because this makes jobs physically compare the latest state of the disk in the backup or replica with its actual state, and transfer any non-matching blocks as part of the incremental backup (along with actually changed blocks), thus fixing any corruption that may already be in place.
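To make that last point concrete, here is a minimal Python sketch of why a compare-based incremental repairs what a faulty change list misses. It is only an illustration of the idea described above, not Veeam's or VMware's actual implementation: the function names, the 4-byte block size, and the in-memory "disks" are all assumptions made up for the example.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to read


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut a disk image (here just bytes) into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks_by_compare(source: bytes, backup: bytes,
                              block_size: int = BLOCK_SIZE) -> list:
    """Read the whole source and return indexes of blocks that differ from the backup."""
    backup_blocks = split_blocks(backup, block_size)
    changed = []
    for index, block in enumerate(split_blocks(source, block_size)):
        missing = index >= len(backup_blocks)
        if missing or (hashlib.sha256(block).digest()
                       != hashlib.sha256(backup_blocks[index]).digest()):
            changed.append(index)
    return changed


def apply_incremental(source: bytes, backup: bytes, block_indexes: list,
                      block_size: int = BLOCK_SIZE) -> bytes:
    """Copy only the listed blocks from the source over the backup copy."""
    result = bytearray(backup.ljust(len(source), b"\0")[:len(source)])
    for index in block_indexes:
        start = index * block_size
        result[start:start + block_size] = source[start:start + block_size]
    return bytes(result)


# Blocks 1 and 3 changed on the source since the last backup...
source_disk = b"AAAABBBBCCCCDDDD"
last_backup = b"AAAAbbbbCCCCdddd"

# ...but a faulty change list (standing in for bad CBT data) reports only block 3.
faulty_change_list = [3]
cbt_result = apply_incremental(source_disk, last_backup, faulty_change_list)

# With the change list ignored, every block is compared, so block 1 is caught too.
compared = changed_blocks_by_compare(source_disk, last_backup)
compare_result = apply_incremental(source_disk, last_backup, compared)

print(cbt_result == source_disk)      # backup built from the faulty list
print(compare_result == source_disk)  # backup built from the full compare
```

The trade-off is the one the newsletter mentions: the compare path has to read every block of the source disk, so the job takes longer, but stale blocks left behind by earlier bad change lists get overwritten on the next run.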