If you’ve stumbled on this blog, then the likelihood is that SQL Server performance is pretty critical to your business. Whether you’re chiefly responsible for your organisation’s IT strategy, or a newly-appointed DBA needing to make an impact, the onus is on you to maximise the not-inconsiderable investment your organisation has made.
Over the past six months, we shared hints and tips on our LinkedIn page to help you get the most out of your SQL Server database environments. In this blog, we’ve brought together and expanded on that information.
Databases and Tables
Hands up if you have a database called ‘Test’! Chances are, you’re still performing maintenance tasks on it without even realising.
Identifying and dropping the databases and tables that aren’t being used will speed up those maintenance tasks, such as backups, reducing their overall impact on server performance. If you’re running a Platform as a Service (PaaS) engine, such as Azure SQL Database or Amazon RDS, then those unused databases and tables could be costing you money, too!
Of course, we know what you’re thinking:
"How do I tell if a database is being used?"
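One place to start answering that question is the index usage DMV. This is a sketch, not a definitive test: `sys.dm_db_index_usage_stats` only holds data gathered since the last service restart, so check your uptime before trusting an absence of rows.

```sql
-- A rough usage check: the latest read/write activity recorded per database.
-- Databases with no rows here (over a meaningful uptime) are drop candidates.
-- Caution: this DMV is cleared on every SQL Server restart.
SELECT DB_NAME(database_id)   AS database_name,
       MAX(last_user_seek)    AS last_seek,
       MAX(last_user_scan)    AS last_scan,
       MAX(last_user_update)  AS last_update
FROM sys.dm_db_index_usage_stats
GROUP BY database_id
ORDER BY database_name;
```

Before dropping anything, corroborate with other evidence such as application connection strings, trace/Extended Events data, or simply taking the database offline for a trial period.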
Drop unused Indexes
Creating the right indexes is critical for the performance of SELECT queries, but when table data is modified, SQL Server has to modify the indexes too. So, the more indexes a database has, the slower the Inserts, Updates and Deletes become.
Dropping unused indexes is a relatively easy way to improve the performance of queries (and maintenance tasks!). This script from Pinal Dave will identify good candidates for dropping. If you see high updates and low seeks or scans, then the index is probably doing more harm than good (high scan numbers are generally bad too, but that’s a subject for a different blog).
If you’re nervous about dropping these indexes, consider disabling them to start with, or at least script the index before dropping it. Either way, the index will need to be rebuilt if you decide you do need it after all.
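Disabling first can be sketched as below. The index and table names are hypothetical, purely for illustration; a disabled nonclustered index keeps its definition in the metadata but is no longer maintained or used.

```sql
-- Disable rather than drop: the definition stays in the metadata,
-- so no saved CREATE INDEX script is needed to bring it back.
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders DISABLE;

-- If it turns out the index was needed after all, rebuild it:
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD;
```

Note that disabling a clustered index makes the table inaccessible, so this approach is for nonclustered indexes only.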
We also recommend removing or combining similar indexes. Do this with caution, however, accounting for differences in index type, column order, and included columns.
Measuring Throughput
Throughput is the measure of how much work a database server is performing. An increase in throughput (i.e. demand) can lead to a decrease in overall performance.
When measuring throughput, it’s important to ask yourself these questions:
"What is the 'normal' throughput for my SQL Server?"
"If challenged, could I prove if my SQL Server is simply busier than it used to be?"
In a SQL Server environment, you can begin to answer these questions by monitoring the number of requests that occur each second. You can capture and store this metric using a utility such as Perfmon (Batch Requests/sec counter) or plain-old T-SQL! Over time, you’ll develop a useful baseline for comparison when things start to slow down!
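For the "plain-old T-SQL" route, one minimal sketch samples the Batch Requests/sec counter twice. The counter value exposed in `sys.dm_os_performance_counters` is cumulative, so the rate comes from the delta over a known interval (10 seconds here):

```sql
-- 'Batch Requests/sec' in the DMV is a cumulative count,
-- so sample twice and divide the difference by the interval.
DECLARE @v1 BIGINT, @v2 BIGINT;

SELECT @v1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

WAITFOR DELAY '00:00:10';

SELECT @v2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

SELECT (@v2 - @v1) / 10 AS batch_requests_per_sec;
```

Scheduling something like this via SQL Agent and logging the results to a table is a simple way to build the baseline described above.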
Once you have a baseline, think about the following:
Is the number of Batch Requests/sec higher at certain times and does that correlate with when users report slow performance?
When compared to your baseline, is the server now dealing with significantly more requests than it was previously?
If you answer “yes” to either of these questions, then you may be considering upgrading your server hardware or moving up that PaaS pricing tier.
Well…before you do, let’s first establish how hard your server is working, and see if you can reduce its work-rate by tuning your workload.
We know throughput is the measure of how much work SQL Server is performing. Work-rate is the measure of how hard your server is working.
If users are complaining about a slow application, and you’ve recently seen an increase in throughput, then this may mean your SQL Server is overloaded, but not necessarily. If a bottleneck elsewhere is the real cause of performance issues, then that extra hardware or cloud spend you’re planning won’t help. It’s the job of the DBA to prove it one way or another.
Measuring CPU is one way of establishing work-rate and can be easily captured using a built-in Windows monitoring tool such as Perfmon. PaaS offerings such as Azure SQL Database, and AWS RDS for SQL Server also expose this data via their built-in dashboards.
The % Processor Time counter indicates how busy the CPU is and Processor Queue Length logs how many threads are waiting. Both are good indicators of work-rate. Baselining these values during periods of ‘good’ performance and correlating with throughput during slow performance will help identify if CPU pressure is (part of) your problem.
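You can also look for CPU pressure from inside SQL Server itself. As a rough, hedged check: if `runnable_tasks_count` in `sys.dm_os_schedulers` is consistently above zero across schedulers, tasks are queuing for CPU time.

```sql
-- Runnable tasks are ready to execute but waiting for a CPU.
-- Persistently non-zero values across schedulers suggest CPU pressure.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';
```

This complements, rather than replaces, the Perfmon counters above, since it shows queuing within SQL Server specifically.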
Memory usage is another critical metric to monitor. This can be done using Perfmon, and specifically, the Page Life Expectancy (PLE) counter. PLE is a measure of how long data pages stay in SQL Server’s memory. If a data page isn’t in memory, then SQL Server has to first fetch it from disk, which is significantly slower. A sustained drop in PLE indicates memory pressure (sudden drops during maintenance windows are normal). Again, baselining during periods of ‘good’ performance is critical.
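If you’d rather capture PLE with T-SQL than Perfmon, the same counter is exposed through `sys.dm_os_performance_counters`:

```sql
-- Page Life Expectancy in seconds. On NUMA hardware there is one
-- value per Buffer Node as well as the overall Buffer Manager value.
SELECT object_name,
       counter_name,
       cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```

As with Batch Requests/sec, logging this on a schedule gives you the baseline to compare against when performance dips.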
Tuning your Workload
Who doesn’t want faster queries, right?
We all do, and sure, if you know you’ve got a hardware bottleneck then improving that hardware will generally result in faster queries.
But wouldn’t it be great if they could go faster without investing in expensive hardware?
Start by identifying your most ‘expensive’ queries. Expensive queries are often considered to be those that run the longest or most frequently, and consume the largest amounts of CPU time. However, when performing a SQL Server Health Check for our customers, we typically start by examining the queries that perform the largest number of page reads.
Why? It’s because the more page reads your queries perform, the more CPU and memory they demand. Higher memory requirements can in turn drive up disk I/O (input and output operations). And the more pages your queries need access to, the more locking and blocking occurs across your server. If you can reduce the number of reads, the benefits will be felt throughout.
SQL Server Query Store is a great tool for identifying these expensive queries. For older versions of SQL Server, where Query Store isn’t available, you’ll need to query the SQL Server Plan Cache instead, perhaps using a free tool such as sp_BlitzCache. These tools make it easy to identify those queries with the highest number of reads, so you know where to focus your tuning efforts.
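If you’re querying the plan cache directly rather than using Query Store or sp_BlitzCache, a common pattern looks like this. Treat the numbers as indicative only: the cache holds statistics gathered since each plan was cached, not full history.

```sql
-- Top 10 cached statements by total logical reads, with the statement
-- text extracted from the surrounding batch (offsets are in bytes).
SELECT TOP (10)
       qs.total_logical_reads,
       qs.execution_count,
       qs.total_logical_reads / qs.execution_count AS avg_reads,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                     WHEN -1 THEN DATALENGTH(st.text)
                     ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
```

Sorting by `total_logical_reads / execution_count` instead surfaces the heaviest individual executions rather than the heaviest cumulative offenders.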
Implementing missing indexes can often provide a significant performance boost to your most expensive queries, without modifying the SQL code itself. Costly index scans can often be replaced by a more efficient seek. Don’t be tempted to just create every missing index that SQL Server suggests. It’s pretty good with its suggestions, but it’s far from perfect, and no match for an experienced DBA.
Next, look for queries with large memory grants (into the GBs), regular re-compiles and sort warnings. All these are signs of serious resource-consumers and offer the best chance for significant gains.
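Memory grants currently in flight can be inspected with the query below, a simple starting point for spotting those GB-sized consumers. A non-zero `wait_time_ms` means a query is queuing for memory before it can even begin executing.

```sql
-- Sessions holding, or waiting for, the largest memory grants right now.
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       wait_time_ms
FROM sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;
```

This DMV is a point-in-time view; for historical grant data, Query Store (SQL Server 2017+) records memory grant information per plan.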
Database queries compete with other queries for access to the resources they need (e.g. CPU, disk or the data itself). If a resource isn’t available, SQL Server must wait until it is, and this ‘wait’ time is internally recorded for us to interrogate. Understanding and interpreting SQL Server Wait Statistics is one of the most important but difficult aspects of database performance tuning.
A high number of CPU-related waits could naturally indicate CPU pressure. High disk-related waits could indicate excessive disk operations caused by a long-running query. Amongst the most common waits are those that relate to locking and blocking (i.e. multiple queries that are trying to read and/or modify the same data pages). SQL Server’s default locking strategy will prevent a query from reading data that is being modified by a different process – generally, that’s a good thing! However, that query is therefore going to have to wait (and will generate a wait stat!).
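A common starting point is to rank the waits accumulated since the last restart (or since the stats were last cleared). The exclusion list below is illustrative, not exhaustive; benign background waits will otherwise dominate the results.

```sql
-- Top waits by total wait time, excluding a few common benign wait types.
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       waiting_tasks_count,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'CHECKPOINT_QUEUE',
                        'BROKER_TO_FLUSH', 'SQLTRACE_BUFFER_FLUSH', 'XE_TIMER_EVENT')
ORDER BY wait_time_ms DESC;
```

Because these figures are cumulative, comparing snapshots taken before and after a slow period is far more informative than a single reading.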
Throughout this blog, we’ve championed the approach of tuning your workload before buying hardware (or increasing your cloud spend) – this is no different. For example, just because you have high physical I/O waits, doesn’t mean you should automatically splash out on faster disks. Ask yourself, why is this query requiring physical I/O (disk) instead of logical I/O (memory)? Does my server have memory pressure and can I improve that by tuning my workload?
For a more in-depth look at wait statistics, see this blog post.
High Availability (HA) / Disaster Recovery (DR)
Before we wrap up this blog post, let’s talk about what’s perhaps the most critical aspect of our SQL Server Health Check – availability!
When talking to customers about HA and DR, we start with two key questions:
1. What is your Recovery Time Objective (RTO)?
This is the maximum amount of time that the business can tolerate your data being unavailable.
2. What is your Recovery Point Objective (RPO)?
This is the amount of data loss that the business can tolerate in the event of a DR scenario.
It will come as no surprise that lower levels of tolerance are likely to result in higher licensing and running costs. There’s no such thing as a free lunch, unfortunately!
SQL Server has several different HA/DR options, including:
- Always-on Availability Groups (AG)
- Replication
- Log-shipping
Each of the above options has well-documented pros and cons. For example, AG and Replication can offer relatively low RPO/RTO, plus the ability to offload your reporting workloads. However, maintenance can be complex.
Log-shipping is simple to implement but has a higher RPO. Log-shipping can handle read-only workloads providing you don’t need the data to be 100% ‘live’.
When it comes to cloud platforms, such as Azure SQL Database or Amazon RDS, uptime is “guaranteed” 99.99% of the time. However, these platforms each offer different architectural models (with cost implications) and your choice will impact how well the environment responds to a DR incident.