The problem
This afternoon, out of the blue, the development folks called over wanting to know why the DB server was not responding. Sure enough, the databases were not accessible from the application or from SSMS. I knew there was no maintenance happening, so I logged in to the server remotely and found that the SQL services were still running as usual and had not restarted. To my surprise, within 10-15 minutes everyone was able to connect to the server again. My first thought was that it had been a network issue and that the servers were unreachable during that period because of the glitch.
Environment details:
- The SQL Server was on a Hyper-V VM with a single CPU and 1024 MB of memory
- There were 80+ transactional replications set up, plus a further 20-30 non-replication SQL Server jobs
What was checked
Out of curiosity I rechecked the SQL Server error log (using sp_readerrorlog – see "Reading the SQL Server log files using TSQL") and couldn't find anything unusual, but I did notice quite a few messages in the Agent log. Most of the messages related to the replication jobs and pointed to the "maximum number of worker threads", which didn't ring a bell at first glance, but after reading the message several times I knew something wasn't right.
What was in the Agent error log:

[398] The job (Server\I01-) has been queued because the maximum number of working threads (100) are already running. This job will be executed as soon as one of the working threads finishes execution.
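As a side note, the Agent log can also be searched from T-SQL rather than the log file viewer. A minimal sketch, assuming the undocumented (but widely used) xp_readerrorlog procedure, whose second parameter selects the log type and whose third filters on a search string:

```sql
-- Hedged sketch: xp_readerrorlog is undocumented, so parameter behaviour
-- may vary between builds. On common builds: param 1 = log file number
-- (0 = current), param 2 = log type (1 = SQL Server, 2 = SQL Server Agent),
-- param 3 = search string filter.
EXEC master.dbo.xp_readerrorlog
     0,                  -- current log file
     2,                  -- 2 = SQL Server Agent log
     N'worker threads';  -- filter for the queued-job messages
```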
- As it was obviously something to do with the maximum number of worker threads, I checked whether we had changed the server's default "max worker threads" setting, and found that it was still at the default:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE WITH OVERRIDE;
GO
EXEC sp_configure 'max worker threads';
GO
- Subsequently I wanted to see how many worker threads SQL Server Agent had allocated to each subsystem:
USE msdb;
GO
SELECT subsystem, max_worker_threads
FROM dbo.syssubsystems;
Most subsystems had been allocated 100 worker threads, with the exception of the LogReader agent, which was assigned 25. After searching the web for a while, I was convinced the problem was to do with jobs executing simultaneously: there weren't sufficient threads to support the connections requested by the replication jobs.
When the total number of threads supporting a subsystem reaches its max_worker_threads value, new requests are queued until a consumed connection is freed back to the pool. This has a direct impact on the LogReader subsystem, because log reader agents run continuously; if there are more than 25 log reader job requests, the queued requests may never get executed.
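A commonly cited workaround (not the route we ultimately took, and not officially supported on every version) is to raise the per-subsystem limit directly in msdb. A hedged sketch:

```sql
-- Hedged sketch: SQL Server Agent reads syssubsystems at start-up, so an
-- Agent service restart is required for the change to take effect.
-- Back the table up first; treat this as an at-your-own-risk tweak.
USE msdb;
GO
UPDATE dbo.syssubsystems
SET    max_worker_threads = 50      -- illustrative value, up from the default of 25
WHERE  subsystem = N'LogReader';
GO
```

After the update, restart the SQL Server Agent service (e.g. from the Services console) so the new limit is picked up.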
The same problem exists for the Transact-SQL subsystem, which I believe is why external applications were not able to make connections to SQL Server.
“Because the Transact-SQL subsystem default is 20, the maximum number of concurrent Transact-SQL jobs is reached quickly if several Transact-SQL jobs are set to run. Add the Transact-SQL subsystem value as a string with a value that is higher than 20 if you notice that the Transact-SQL jobs are being queued for this reason.
A design limitation imposes a one second delay between jobs. This limitation was set so that the job scheduler does not monopolize the resources on the hosting server. Because of this limitation, up to 60 jobs can be started in the same one-minute interval. If jobs run and finish in less than one minute and are started at the same time based on a one-minute schedule, some jobs may never run. The jobs may never run because jobs that were already executed are retriggered every minute. Other jobs are left in a starvation situation.” - http://support.microsoft.com/kb/306457
What was done
After further research, I found how important the "max worker threads" setting is and how SQL Server calculates its default (http://msdn.microsoft.com/en-us/library/ms187024.aspx).
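Per that article, the default (a configured value of 0) is derived from the CPU count and architecture; on 64-bit, an instance with up to 4 CPUs gets 512 worker threads, and 512 + ((CPUs - 4) * 16) beyond that. The value the engine actually settled on can be read from a DMV:

```sql
-- Inspect the computed worker thread ceiling and the CPU count it was based on.
SELECT max_workers_count, cpu_count
FROM   sys.dm_os_sys_info;
```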
After allocating one more CPU and a further 1024 MB of memory to the Hyper-V VM, the problem disappeared.
Conclusion
As DBAs we need to know the load a server will be put through, so that settings such as max_worker_threads at the subsystem level don't turn around and bite us. We were lucky this was a Hyper-V VM, where CPU and memory can be allocated much more easily; it wouldn't have been the same if this were a standalone physical server.
References
Reading the SQL Server log files using TSQL - http://www.mssqltips.com/sqlservertip/1476/reading-the-sql-server-log-files-using-tsql/