Skip to main content

The maximum number of working threads (100) are already running

The problem
                This afternoon, out of the blue, the development folks called over wanting to know why the DB server was not responding, sure enough the databases were not accessible from application and from MMS. I knew there weren’t any maintenance happening and so I logged in to the server remotely and found that the sql services were still running as usual and the services had not restarted. To my surprise, in 10-15 mins everyone was able connect to the server again.  My first thoughts were,  it would have been an issue with the network and due to the glitch the servers weren’t accessible during the  time period.
Environment details :
-          The sql server were on a hyper v with a single CPU and 1024 memory
-          There was 80 + transaction replications setup and further 20-30 sql server non replication jobs
What was checked
                Out of curiosity I rechecked the sql server error log ( using sp_readerrorlog – “Reading the SQL Server log files using TSQL“)  and couldn’t find anything unusual, but did notice quite a few messages in the agent log. Most of the messages were related to the replication jobs pointing to “maximum number of worker threads” which didn’t ring a bell from the first look, but after reading the message several times I knew it didn’t look right.
           What was in the error log  [398] The job (Server\I01-) has been queued because the maximum number of working threads (100) are already running.  This job will be executed as soon as one of the working thread finishes execution.
·       
  •       As it was obvious that it was something to do with maximum number of worker threads, I checked if we had changed the default setting for “Max worker threads” of the server and found that it was having the default setting
sp_configure 'show advanced option' , '1'
reconfigure with override
go
sp_configure 'max worker threads'
  • ·           Subsequently I wanted to see how much “Max worker threads” sql server had set for the each subsystem
use msdb
Select * from syssubsystems

Most subsystems had been allocated with 100 “Max worker threads” with the exception to Logreader agent, which was assigned with 25. After searching on the web for a while,  I  was convinced the problem was to do with the simultaneous jobs  been executed and that there weren’t sufficient threads to support connection requested by the replication jobs.
When the total number of threads supporting the subsystem reached the max_worker_thread value, the new request gets queued till a consumed connection is freed back to the pool. This will have a direct impact on the logreader subsystem as it continuously runs and if there are more than 25 log reader job request the queued request may never get executed.
                This same problem exists for the transact-sql subsystem which I believe why external applications were not able to make a connections to sql server.
“Because the Transact-SQL subsystem default is 20, the maximum number of concurrent Transact-SQL jobs is reached quickly if several Transact-SQL jobs are set to run. Add the Transact-SQL subsystem value as a string with a value that is higher than 20 if you notice that the Transact-SQL jobs are being queued for this reason.

A design limitation imposes a one second delay between jobs. This limitation was set so that the job scheduler does not monopolize the resources on the hosting server. Because of this limitation, up to 60 jobs can be started in the same one-minute interval. If jobs run and finish in less than one minute and are started at the same time based on a one-minute schedule, some jobs may never run. The jobs may never run because jobs that were already executed are retriggered every minute. Other jobs are left in a starvation situation.”     - http://support.microsoft.com/kb/306457

What was done
After further research, I found that how important the “Max worker threads” setting was and how sql server calculated it (http://msdn.microsoft.com/en-us/library/ms187024.aspx)
After allocating one more CPU and a further 1024 MB to the hyperv the problem disappeared

Conclusion
                As  DBAs we need know the load that a server will be put through so that setting such as “max_worker_thread” at subsystems level so they don’t turnaround and byte us back. We were lucky this was a hyper-v , were CPU and memory can be much easily be allocated but it wouldn’t have been the same if this was a standalone server.

Referenced

Comments

Popular posts from this blog

Create a dacpac To Compare Schema Differences

It's been some time since i added anything to the blog and a lot has happened in the last few months. I have run into many number of challenging stuff at Xero and spread my self to learn new things. As a start i want to share a situation where I used a dacpac to compare the differences of a database schema's. - This involves of creating the two dacpacs for the different databases - Comparing the two dacpacs and generating a report to know the exact differences - Generate a script that would have all the changes How to generate a dacbpac The easiest way to create a dacpac for a database is through management studio ( right click on the databae --> task --> Extract data-tier-application). This will work under most cases but will error out when the database has difffrent settings. ie. if CDC is enabled To work around this blocker, you need to use command line to send the extra parameters. Bellow is the command used to generate the dacpac. "%ProgramFiles...

High Watermarks For Incremental Models in dbt

The last few months it’s all been dbt. Dbt is a transform and load tool which is provided by fishtown analytics. For those that have created incremental models in dbt would have found the simplicity and easiness of how it drives the workload. Depending on the target datastore, the incremental model workload implementation changes. But all that said, the question is, should the incremental model use high-watermark as part of the implementation. How incremental models work behind the scenes is the best place to start this investigation. And when it’s not obvious, the next best place is to investigate the log after an test incremental model execution and find the implementation. Following are the internal steps followed for a datastore that does not support the merge statements. This was observed in the dbt log. - As the first step, It will copy all the data to a temp table generated from the incremental execution. - It will then delete all the data from the base table th...