My bad luck at work
Forest underscore-zone zone transfers matter
In our environment, we are sharing a single DNS namespace with Windows and *NIX hosts. Because of this, BIND owns foo.com (the same namespace as our forest root).
The forest root was upgraded and then, one by one, each forest DC was wiped and Server 2003 was installed. The forest was happy and functioning properly. However, by default, zone transfers are disabled in Windows DNS. This minor oversight on our part led to domain devastation.
Next in line was the child domain, ad.foo.com. Instead of upgrading, we had brand new hardware to replace the three current DCs. Friday afternoon we decided to dcpromo the first 2003 server into ad.foo.com. The dcpromo went rather slowly, but we expected that. The child domain has many, many more objects to replicate than the forest. Dcpromo finally finished and we noticed the first odd message. The summary said that since no subnet existed for the current server, a site was picked at random. It just happened to pick our office in Florida. Crap! We forgot to add the new subnets into Sites and Services. Oh well, no biggie. We'll fix that.
So we added the 3 new subnets into Sites and Services and waited until the new DC replicated. Tick tock... tick tock. Hmmmm. This isn't good. It's been a half hour and the new DC still doesn't have a SYSVOL share, it's in the Florida site, and it's only registering as a DC on FloridaDC1. No other DC in the directory seems to know our new DC is, in fact, a DC.
Oh boy. So we start running diagnostics to see what the problem is. Uh oh! Why has replication completely stopped? That's right, not a single DC in the entire directory could replicate with anything. We broke the domain. This is not good.
It turns out that, by coincidence, we dcpromo'd the new server about 24 hours after finishing the forest; the forest in which we DID NOT enable zone transfers. BIND has a time limit for slave zones (the expire value in the zone's SOA record, which in our case happened to be around 24 hours): if it cannot contact a master to refresh its slave copy within that time, it says, "Ok, fine! I give up. You don't exist anymore." That was pretty much it. Dns1 couldn't grab its slave copies of the forest underscore zones, so it deemed them non-existent.
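To make that timer concrete, here is a minimal sketch of the expiry check a slave effectively performs. This is not BIND's actual code; the `SlaveZone` class and the numbers are assumptions for illustration, with the 24-hour figure taken from what we saw.

```python
from dataclasses import dataclass


@dataclass
class SlaveZone:
    """Hypothetical model of a slave zone's expiry state (not BIND internals)."""
    name: str
    expire_seconds: int        # SOA "expire" field, handed down by the master
    last_good_refresh: float   # time of the last successful refresh, in seconds

    def is_expired(self, now: float) -> bool:
        # Once expire_seconds pass without a successful refresh,
        # the slave stops answering for the zone.
        return now - self.last_good_refresh >= self.expire_seconds


HOUR = 3600
# Assumed values: a 24 hour expire, last good transfer at t=0.
zone = SlaveZone("_msdcs.foo.com", expire_seconds=24 * HOUR, last_good_refresh=0.0)

still_fine = zone.is_expired(now=23 * HOUR)   # one hour before the deadline
gone = zone.is_expired(now=25 * HOUR)         # one hour after the deadline
print(still_fine, gone)
```

Every refused zone transfer leaves `last_good_refresh` where it was, so the clock just keeps running until the zone drops off.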
But why is that a problem? Don't the underscore zones only affect the Windows side of the network? BIND only has a slave copy, not the master. What gives?
Ok, give up? Here's why it matters. In order for a DC to be fully functional, it needs to be capable of replicating with any DC in the forest. That means that it has to be able to replicate to AdDC1 as well as FooDC1. To be able to replicate, it must have registered its records in the underscore zones in both the child domain and the forest. That means that it must have a way of finding _msdcs.ad.foo.com as well as _msdcs.foo.com. So it starts the process... It finds _msdcs.ad.foo.com because it houses a copy and is in a site (Florida) where it can also contact another DC (FloridaDC1), but since replication isn't working that change doesn't get any further than those two servers. Next it looks for _msdcs.foo.com to add its record there. It queries DNS on FloridaDC1, but FloridaDC1 is configured to send foo.com requests to dns1. Dns1 checks the foo.com zone and sees that _msdcs.foo.com is delegated to the forest DCs, which it has deemed non-existent. So the process ends there for NewDC1.
But why did the entire domain stop working? Well, each DC has a GUID. That GUID is registered as an alias (CNAME) record under _msdcs in the forest root zone, right alongside the SRV records. In order to replicate, each DC will look up the GUID of the server to which it is to replicate by asking the DNS server for that record. But just like before, dns1 is in the loop, and since the forest DCs aren't talking to dns1 anymore, dns1 says, "I don't know what you're talking about." And just like that every DC is stranded and cannot talk to anything.
After numerous tests and some comments about massive amounts of zone transfer errors on dns1, we understood that a problem existed with the underscore zones. A quick look showed us that the forest DCs were not allowing zone transfers, and that was quickly remedied. After some waiting, everything began talking again, NewDC1 completed its journey to DC Valhalla, and it was moved out of the Florida site and into the Pennsylvania site.
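If the lookup path is hard to follow in prose, here is a toy model of it. The server and zone names come from the story above; the tables and the `lookup` function are made up for illustration and are not how either DNS server is actually implemented.

```python
# Hypothetical tables describing who holds what, per the story above.
LOCAL_ZONES = {
    "FloridaDC1": {"ad.foo.com", "_msdcs.ad.foo.com"},
    "dns1": {"foo.com"},
}
FORWARDERS = {"FloridaDC1": {"foo.com": "dns1"}}   # foo.com queries go to dns1
EXPIRED_SLAVES = {"dns1": {"_msdcs.foo.com"}}      # slave copies dns1 gave up on


def lookup(server: str, zone: str) -> str:
    """Follow a query for `zone` starting at `server` and report where it ends."""
    if zone in EXPIRED_SLAVES.get(server, set()):
        return f"{server}: lookup failed (slave copy of {zone} expired)"
    if zone in LOCAL_ZONES.get(server, set()):
        return f"{server}: answered from local copy of {zone}"
    for suffix, target in FORWARDERS.get(server, {}).items():
        if zone == suffix or zone.endswith("." + suffix):
            return lookup(target, zone)
    return f"{server}: no answer for {zone}"


child = lookup("FloridaDC1", "_msdcs.ad.foo.com")
forest = lookup("FloridaDC1", "_msdcs.foo.com")
print(child)
print(forest)
```

The child zone is answered locally, while the forest zone gets handed to dns1 and dies there, which is exactly where NewDC1 got stuck.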
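For reference, here is a small sketch that builds the kind of names a DC has to be able to resolve before it can replicate. The `dc_locator_names` helper and the all-zero GUID are placeholders I made up; the name layouts follow the usual AD convention of an `_ldap._tcp.dc._msdcs.<domain>` SRV name and a GUID-based alias under `_msdcs.<forest root>`.

```python
def dc_locator_names(domain: str, forest_root: str, dsa_guid: str) -> dict:
    """Return the DNS names a replication partner lookup depends on (illustrative)."""
    return {
        # SRV name used to find domain controllers for a domain
        "ldap_srv": f"_ldap._tcp.dc._msdcs.{domain}",
        # GUID-based alias for one specific DC, kept in the forest root's _msdcs zone
        "guid_alias": f"{dsa_guid}._msdcs.{forest_root}",
    }


# Placeholder GUID, not a real DC's identifier.
names = dc_locator_names(
    domain="ad.foo.com",
    forest_root="foo.com",
    dsa_guid="00000000-0000-0000-0000-000000000000",
)
for label, name in names.items():
    print(label, name)
```

Note that the GUID name hangs off the forest root no matter which domain the DC lives in, so even a child domain DC depends on _msdcs.foo.com being resolvable.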
Lesson learned: DNS is always the problem.