Friday, February 1, 2013

Odd ADSI OpenQuery Performance Issue When Domain Controller was Retired

I recently ran into a Sharepoint/SSRS performance issue with slow report load times. That is, the report itself was slow to load before picking parameters and actually executing it. Once it did load, though, the report ran just fine. Some tracing determined that the hangup was on the population of the parameter lists for the report. The lists are dynamic, displaying specific values based on the executing user's Active Directory group memberships. The user's group membership are determined through a stored procedure that uses an ADSI OpenQuery to  interrogate AD.

Because AD groups can be nested or bundled with a role, the procedure processes groups/subgroups recursively. Our environment is pretty complex, so each time this procedure runs, it is often doing about 100 or so ADSI queries. Normally, this still completes in a fraction of a second and report load times are acceptable. Until about a week ago when suddenly report load times began taking about 2 minutes. Having traced it to this procedure, a statement-level Profiler trace showed that each ADSI query was taking about 1015ms, where in all of our test and development environments, they took about 15ms. Each ADSI query was taking almost exactly 1 second longer in our production environment. What the heck?

Grasping at straws at this point, I remembered from our change control meetings that a old Domain Controller had been retired about the same time the problem began. I theorized that SQL Server had cached the domain controller it used for these queries, even though the query was written to interrogate the domain, not a particular controller. Perhaps when this query was first run, it hit the now-retired controller, and now it was attempting to contact it first each and every time. Perhaps it was then timing out after 1s, and then retrying on another DC, which would succeed.

Since this was affecting our reporting system and not anything transactional, an after hours restart of the SQL service was simple to schedule, so we decided to try it. When in doubt, reboot, right? We hoped that when the service account logged in, it would pick up a new domain controller and the problem would go away. Sure enough, after the service restart, query times returned to the 15ms range and report load times returned to normal.

So, the question is, was our theory correct? Because we decided to test the theory by restarting the service instead of attempting to trace AD query traffic, I guess we'll never know, and that is bothering me now. I wish, in hindsight, that we had worked with our infrastructure team to trace that traffic and know for sure.

Our ADSI linked server is using the default timeout values (0), I've done some Googling to see if I could find a reference to ADSI timeouts being 1s by default, but found that it seems to be 30s by default. So, I'm at a loss to confirm that our theory was correct. Any additional thoughts on the matter are much appreciated!

No comments:

Post a Comment