SRM: Lessons Learned in the First 90 Days

Lee Dilworth and Dave Burgess are two SRM guys from the UK. Right now it sounds like EMEA is stronger on SRM than its counterparts in the US! This was my best session of VMworld, mainly because it relates to a topic which is very dear to my heart, and because it was a technical presentation which really needed the audience to know VMware SRM pretty well. Most of the SRM/BC/DR sessions I went to were quite “general” and conceptual, and didn’t deal with the nitty-gritty day-to-day issues which are what I have to deal with.

So I got plenty of juicy gotchas, tips and FAQs, some of which I had seen but many of which I hadn’t. I decided to mention the ones I hadn’t seen or understood before…

Firstly, something that I got this week was the realities of replication. In fairness, I got this from a presentation by Adam Carter (of Lefthand Networks) on the realities of replication types and their limitations. Although synchronous replication offers the highest level of data integrity, the reality is that the distance limitations on this technology currently mean that, for many people, the second location is not far enough away to class as a recovery site. The maximum distance with fibre is some 450km. That assumes a very good link, with latencies well within the 5ms range. The reality is that most people’s pipes don’t offer this super-low latency. In practice, synchronous replication is limited to little more than 50-60km from the primary site. In other words, not far enough to protect you from a true disaster.
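To see how the 450km and 5ms figures hang together, here is a back-of-the-envelope sketch. It assumes light travels through fibre at roughly 200,000 km/s (about two thirds of its speed in a vacuum) and it only counts propagation delay. Switches, array processing and protocol overhead are ignored, and they are a big part of why real links fall well short of the theoretical distance. The function names are my own, purely for illustration.

```python
# Assumption: signal speed in optical fibre is roughly two thirds of c.
SPEED_IN_FIBRE_KM_PER_S = 200_000.0


def round_trip_ms(distance_km: float) -> float:
    """Propagation-only round-trip time, in milliseconds, over a fibre run."""
    return 2 * distance_km / SPEED_IN_FIBRE_KM_PER_S * 1000.0


def max_distance_km(latency_budget_ms: float) -> float:
    """Longest fibre run whose propagation-only round trip fits the budget."""
    return latency_budget_ms / 1000.0 * SPEED_IN_FIBRE_KM_PER_S / 2


if __name__ == "__main__":
    for km in (60, 450):
        print(f"{km} km -> {round_trip_ms(km):.2f} ms round trip")
    print(f"5 ms budget -> {max_distance_km(5):.0f} km at best")
```

Every synchronous write has to wait for at least one such round trip before it is acknowledged, so 450km eats almost the whole 5ms budget on propagation alone.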

Lee & Dave then went on to outline the top gotchas they have seen in the first 90 days since VMware SRM went GA. I will list the ones I didn’t know, because they are NOT in my book (yet!). So for the very privileged 50 people who managed to grab a copy of my book on SRM: you will be fully up-to-date. Hopefully, you attended this session just like I did.

Gotcha 1: Expired Eval Licenses
Surprisingly, many people get caught out by licensing the SRM system with a .lic file which then expires. Although there is an alarm within SRM to warn you about expiring/expired licenses, if you don’t configure it you don’t get the standard “insufficient licenses to…” message you would normally get from VirtualCenter.

Gotcha 2: Rename Sites
When you install SRM you set a “site name” to identify the site. Unfortunately, you currently cannot rename a site once you have set it (even though it is exposed in some of the .XML files). You have no alternative but to re-install SRM. So the moral of the story, until this is fixed: pick a name you never want to change!!!

Gotcha 3: Storage Replication Adapters are not the same
Most of the SRAs I’ve used so far just use IP and proprietary TCP ports to speak to the storage array. I was surprised to hear that some SRAs can or need to be configured with an RDM to speak to a “management” LUN on the array; apparently EMC Symmetrix is a case in point. Believe it or not, I’ve yet to read the long PDFs for each and every vendor and array type. It’s something I intend to do this week. Some guys at EMC and NetApp have very kindly offered to provide me with storage hardware to do further development on, and reading those docs will certainly be a prerequisite before that happens. Chatting to the SRM guys this week, it became clear that these PDF guides from the vendors are very much pitched towards full-time storage admins, rather than part-time storage guys like you and me. They assume a level of knowledge that sometimes leaves VMware PoC guys like me scratching their heads. That’s not such a bad thing for me: it means there’s scope for RTFM “Getting Started with…” guides for all the storage vendors, not just Lefthand Networks, which is what I currently have in the book.

FAQ: Low Resources Message with SRM
I’ve seen this lots in my lab environment. It’s caused by the settings in the vmware-dr.xml file. I think I might have mentioned this in the book. When I get back to the UK I will have to double-check my files to see. The vmware-dr.xml file has some tolerances for CPU and memory; if these are breached, the “alarm” is triggered. The alarm is benign: it won’t stop a virtual machine from being recovered.

FAQ: Why do you need two VCs?
There’s no workaround to this requirement in SRM. Two sites require two VirtualCenters. Contrary to popular belief, this design decision has nothing to do with VMware trying to make more money by selling more VirtualCenter licenses. By having separate VirtualCenter databases rather than a clustered VirtualCenter, you are less vulnerable to database errors, arbitrary permissions changes or service packs. The split model guarantees that there are no interdependency issues between the Protected and Recovery Sites. It means your Recovery Plans can reside in a VirtualCenter that has survived the disaster. Additionally, the two-VirtualCenter model allows for subtle differences in the configuration of the VirtualCenter at the Recovery Site, which gives you flexibility in your folder and resource pool structures.

FAQ: Why does the install & pairing process say that port 80 will be used to communicate with VirtualCenter?
Even though SRM uses SSL when it communicates with VC, it does not use 443. SRM establishes a TCP connection to port 80, then uses HTTP CONNECT to establish a tunnel to the VC server, then does an SSL handshake with VC over that tunnelled connection.
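For anyone who hasn’t seen the “CONNECT first, SSL second” pattern before, here is a minimal generic sketch of it using Python’s standard socket and ssl modules. This is not SRM’s code and I have no idea what it looks like internally. In particular, the session didn’t say what the tunnel is pointed at once it is inside VC, so target_port is a placeholder you would have to supply, and the function names are made up for illustration.

```python
import socket
import ssl


def build_connect_request(target_host: str, target_port: int) -> bytes:
    """Build a plain-text HTTP CONNECT request asking for a tunnel."""
    authority = f"{target_host}:{target_port}"
    return f"CONNECT {authority} HTTP/1.1\r\nHost: {authority}\r\n\r\n".encode("ascii")


def tunnel_established(response_head: bytes) -> bool:
    """True if the status line of the reply carries a 2xx status code."""
    parts = response_head.split(b"\r\n", 1)[0].split()
    return len(parts) >= 2 and parts[0].startswith(b"HTTP/") and parts[1].startswith(b"2")


def open_tls_over_port_80(vc_host: str, target_port: int, timeout: float = 10.0) -> ssl.SSLSocket:
    """Connect to port 80, request a tunnel, then run the TLS handshake inside it.

    target_port is a hypothetical parameter: the tunnel's real target is not
    documented in the session notes.
    """
    sock = socket.create_connection((vc_host, 80), timeout=timeout)
    try:
        sock.sendall(build_connect_request(vc_host, target_port))
        head = b""
        while b"\r\n\r\n" not in head:  # read until the end of the reply headers
            chunk = sock.recv(4096)
            if not chunk:
                break
            head += chunk
        if not tunnel_established(head):
            raise ConnectionError("tunnel request was refused")
        # From here on the same TCP connection carries encrypted traffic.
        return ssl.create_default_context().wrap_socket(sock, server_hostname=vc_host)
    except Exception:
        sock.close()
        raise
```

The point to take away is that a firewall only ever sees a TCP connection to port 80, even though everything after the initial CONNECT exchange is encrypted.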

FAQ: Why do I see “Recompute Datastore” frequently in the Taskbar?
I’ve seen this a lot, and more or less ignored it because it seemed to me very much a benign message. So it was nice to know why it happens. Put very simply, practically any change (right-click, Edit Settings) to a protected virtual machine causes this. SRM must check the virtual machine each time to ensure you haven’t fundamentally changed its configuration in a way that could affect how SRM will protect it. So it must check for things like whether you have added or removed a virtual disk, or patched the VM to a different portgroup.

TIP: Log files are good!
You wouldn’t expect an SE not to recommend using log files to pull more information out of the system. Lee & Dave gave a good example of a virtual machine not being protected properly due to an invalid/incomplete “Inventory Mapping”. This resulted in an “unset” entry in the SRM log file. I only briefly name-check the log files in my book. But I’m thinking of coming back to it, creating a load of very common errors, and then seeing if I can “profile” these from the log file. Kinda like: if you search for “X” in the log file it means you have “Y” problem. The trouble with log files, after all, is interpreting their often “cryptic” meaning.
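The “search for X, it means Y” idea could be sketched as a simple lookup table run over the log. The only signature that comes from the session is “unset” pointing at an Inventory Mapping problem. The log lines in the usage example are made up, since I haven’t profiled the real log format yet, and the names are my own.

```python
# Known signatures: text to look for -> what it probably means.
# Only the "unset" entry comes from the session; add more as they are profiled.
SIGNATURES = {
    "unset": "Inventory Mapping is probably invalid or incomplete",
}


def profile_log(lines, signatures=SIGNATURES):
    """Return (line_number, signature, diagnosis) for every matching log line."""
    findings = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for needle, diagnosis in signatures.items():
            if needle.lower() in lowered:
                findings.append((number, needle, diagnosis))
    return findings


if __name__ == "__main__":
    # Invented sample lines, purely to show the shape of the output.
    sample = [
        "info: protecting vm 'web01'",
        "warning: network mapping for 'web01' is Unset",
    ]
    for number, needle, diagnosis in profile_log(sample):
        print(f"line {number}: found '{needle}': {diagnosis}")
```

Plain substring matching is crude, and a word like “unset” could easily turn up in unrelated messages, so treat a hit as a hint about where to look rather than a diagnosis.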

TIP: The LUN is King
The LUN is a critical aspect of SRM. Bad LUN structures, like one big LUN that contains everything, really limit the flexibility you have in being able to recover teams of virtual machines that make up a particular application. For me this is a bit of a Goldilocks issue. One big LUN with everything in it is too much of a blunt stick and wouldn’t be optimized for performance. It would also leave you replicating virtual machines unnecessarily. The opposite, one virtual machine per LUN, gives you ultimate flexibility but a lot of replication work to do for each and every virtual machine that needs protection. Although you get a lot of flexibility, you are creating a lot of work in SRM, such as having to create a protection group for every individual VM you have. Personally, I’m a fan of groups of LUNs (boot, log, data) being gathered together and put in the same replication group. Kinda like 3xLUNs for Citrix, 3xLUNs for Domain Controllers, 3xLUNs for SQL. In this respect you are using the LUNs to distribute the I/O, whilst at the same time creating groups of application LUNs that can be included in the same cycle of replication, so they share the same integrity and the same protection groups in SRM.
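To put some rough numbers on the trade-off, here is a small sketch that lays out the same six VMs both ways: one LUN and protection group per VM, versus three LUNs (boot, log, data) and one protection group per application. The VM names, the LUN naming scheme and the functions are all invented for illustration. Nothing here talks to SRM or an array.

```python
from collections import defaultdict

LUN_ROLES = ("boot", "log", "data")


def groups_per_application(vm_to_app):
    """One protection group per application, backed by boot/log/data LUNs."""
    vms_by_app = defaultdict(list)
    for vm, app in sorted(vm_to_app.items()):
        vms_by_app[app].append(vm)
    return {
        app: {"luns": [f"{app}-{role}" for role in LUN_ROLES], "vms": vms}
        for app, vms in vms_by_app.items()
    }


def groups_per_vm(vm_to_app):
    """The other extreme: every VM gets its own LUN and protection group."""
    return {vm: {"luns": [f"{vm}-lun"], "vms": [vm]} for vm in sorted(vm_to_app)}


# Hypothetical inventory: VM name -> application it belongs to.
inventory = {
    "ctx01": "citrix", "ctx02": "citrix",
    "dc01": "dc", "dc02": "dc",
    "sql01": "sql", "sql02": "sql",
}

if __name__ == "__main__":
    for label, plan in (("per application", groups_per_application(inventory)),
                        ("per VM", groups_per_vm(inventory))):
        lun_count = sum(len(group["luns"]) for group in plan.values())
        print(f"{label}: {len(plan)} protection groups, {lun_count} LUNs")
```

With only six VMs the difference looks small, but the per-VM layout grows one protection group and one replicated LUN for every VM you add, while the per-application layout only grows when you add a new application.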

Recommendation: SRM on a separate Windows instance
Apparently, when you execute a recovery plan SRM can be CPU intensive, hence the SRM guys recommended putting it on a separate box. Admittedly, a lot depends on how big or busy your existing VirtualCenter environment is. My book was written in a test lab with a few small VMs doing very little processing, so I didn’t see it. Personally, I’m really pleased my new book assumes SRM is a separate virtual machine. I’ve become increasingly uneasy with the growing number of management roles the VirtualCenter box is being asked to handle. It’s just a bit too much “eggs in one basket” for me…
