Multisite (light) Disaster recovery

I have created a 4-part video series that demonstrates how NSX-T multi-site disaster recovery works.

Part 4 is definitely the cherry on the cake, but make sure you watch parts 1, 2 and 3 as well to get a good understanding of the environment and to fully understand what is happening.

The full high-level steps

A bit of a spoiler here: recovering from a site failure is a lengthy process with a lot of (manual) steps, checks and specific prerequisites. The whole process took me around 45 minutes (if I subtract my own slowness)!

The full high-level steps that should be taken are described below and can be watched in Part 4 of the video series:

  1. Make sure DC1 NSX-T Manager(s) is using FQDN for component registration and backup
    1. This is not the case out of the box
    2. This can only be turned on (and off) with a REST API call (see the example call after this list)
  2. Verify that the backup completes correctly and that the FQDN is used in the backup folder name
  3. Verify that the FQDN is used in the registration of the Host Transport Nodes and the Edge Transport Nodes towards the controller
  4. Deploy (a) new NSX-T Manager(s) in DC2 with a new IP address in a different IP range than the one the DC1 NSX-T Manager was in
  5. SIMULATE A DISASTER IN DC1 + START CONTINUOUS PING FROM WEB01 (172.16.10.11) + START STOPWATCH
    1. Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around (a simple outage-timer sketch is included after this list)
  6. Repoint the DNS A record to the (new) NSX-T Manager(s) in DC2
  7. Make sure this new DC2 NSX-T Manager(s) is using FQDN for component registration and backup
    1. This is basically the same as we did in step 1
  8. Restore the backup on the new DC2 NSX-T Manager(s)
    1. This may take around 20 minutes to finish
  9. Verify that the FQDN is used in the registration of the Host Transport Nodes and the Edge Transport Nodes towards the controller
    1. This is basically the same as we did in step 3
  10. Run the SRM Recovery Plan on the DC2 SRM Server and recover the Web, App and DB VMs of DC1
  11. Log in to the (newly restored from backup) NSX-T Manager(s)
  12. Move the T1 Gateway from the DC1-EN-CLUSTER (which is no longer available) to the DC2-EN-CLUSTER (see the Policy API sketch after this list)
  13. Move the uplink from the DC1-T0 Gateway (that is no longer available) to the DC2-T0 Gateway
  14. Verify that the ping starts working again
    1. Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
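
For reference, steps 1 and 7 (switching the NSX-T Manager(s) to FQDN-based component registration and backup) come down to flipping the publish_fqdns flag on the /api/v1/configs/management endpoint. Below is a minimal Python sketch of that call; the manager FQDN and the credentials are placeholders from my lab, so adjust them to your own environment.

```python
import requests

# Lab placeholders -- replace with your own NSX-T Manager FQDN and credentials.
NSX_MANAGER = "https://nsxtmanager.lab.local"
AUTH = ("admin", "VMware1!VMware1!")

session = requests.Session()
session.auth = AUTH
session.verify = False  # lab only: the manager uses a self-signed certificate

# Read the current management config (this also gives us the required _revision).
config = session.get(f"{NSX_MANAGER}/api/v1/configs/management").json()
print("publish_fqdns before:", config.get("publish_fqdns"))

# Enable FQDN publishing (set it back to False to turn it off again).
config["publish_fqdns"] = True
response = session.put(f"{NSX_MANAGER}/api/v1/configs/management", json=config)
response.raise_for_status()
print("publish_fqdns after:", response.json().get("publish_fqdns"))
```

Once publish_fqdns is true, the backup folder name and the registration of the Host and Edge Transport Nodes (steps 2, 3 and 9) should show the FQDN instead of the IP address of the NSX-T Manager.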
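
For steps 5 and 14, instead of eyeballing the continuous ping and a stopwatch, you can let a small script log exactly when connectivity between WEB01 and EXTERNAL drops and when it comes back. This is just a hypothetical helper for timing the outage, not part of the recovery procedure itself.

```python
import datetime
import subprocess
import time

# Run this on WEB01 (172.16.10.11): ping EXTERNAL once per second and log
# every change in reachability with a timestamp, so the outage can be timed.
TARGET = "192.168.99.100"

previous = None
while True:
    reachable = subprocess.call(
        ["ping", "-c", "1", "-W", "1", TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ) == 0
    if reachable != previous:
        state = "reachable" if reachable else "UNREACHABLE"
        print(f"{datetime.datetime.now().isoformat()} {TARGET} is {state}")
        previous = reachable
    time.sleep(1)
```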
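
In the video I move the T1 Gateway to the DC2-EN-CLUSTER (step 12) through the UI. If you would rather script it, the NSX-T Policy API lets you patch the Tier-1 locale services with the path of the DC2 edge cluster. The sketch below assumes a gateway ID, locale-services ID and edge cluster UUID from my lab, so look up the real values in your own environment first.

```python
import requests

# Lab placeholders -- the gateway ID, locale-services ID and edge cluster UUID
# below are made up for illustration; look up the real ones via the UI or API.
NSX_MANAGER = "https://nsxtmanager.lab.local"
AUTH = ("admin", "VMware1!VMware1!")
TIER1_ID = "T1-GATEWAY"
LOCALE_SERVICES_ID = "default"
DC2_EDGE_CLUSTER_UUID = "11111111-2222-3333-4444-555555555555"

session = requests.Session()
session.auth = AUTH
session.verify = False  # lab only: the manager uses a self-signed certificate

# Point the Tier-1 locale services at the DC2 edge cluster.
body = {
    "edge_cluster_path": (
        "/infra/sites/default/enforcement-points/default"
        f"/edge-clusters/{DC2_EDGE_CLUSTER_UUID}"
    ),
}
url = (
    f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/{TIER1_ID}"
    f"/locale-services/{LOCALE_SERVICES_ID}"
)
response = session.patch(url, json=body)
response.raise_for_status()
print(f"Tier-1 {TIER1_ID} now uses edge cluster {DC2_EDGE_CLUSTER_UUID}")
```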

Have fun testing this out! I will be writing an extensive blog post about this soon, so look out for that one, but for now you will have to watch the 4-part video series:

PART 1» Introduction to the Lab POC and Virtual Network environment

PART 2» Ping and Trace-route tests to demonstrate normal operation of the Active and Standby deployment

PART 3» Ping and Trace-route tests to demonstrate normal operation of the Active and Active deployment

PART 4» Simulate failure on DC1 and continue operations from DC2

I am always trying to improve the quality of my articles, so if you see any errors or mistakes in this article, or have suggestions for improvement, please contact me and I will fix them.