Resource governance in Azure SQL Database

This blog post continues the Azure SQL Database architecture series where we share background on how we run the service, as described by the architects who originally created the service. The first two posts covered data integrity in Azure SQL Database and how cloud speed helps SQL Server database administrators. In this blog post, we will talk about how we use governance to help achieve a balanced system.

Allocated and governed resources

When you choose a specific Azure SQL Database service tier, you are selecting a pre-defined set of allocated resources across several dimensions such as CPU, storage type, storage limit, memory, and more. Ideally you will select a service tier that meets the workload demands of your application, however if you over or under-size your selection you can easily scale up or down accordingly.
With each service tier selection, you are also inherently selecting a set of resource usage boundaries or limits. For example, a business critical, Gen 4 database with eight cores has the following resource allocations and associated limits:
Compute sizeBC_Gen4_8
Memory (GB)56
In-memory OLTP storage (GB)8
Storage typeLocal SSD
Max data size (GB)650
Max log size (GB)195
TempDB size (GB)256
IO latency (approximate)

Target IOPS (64KB)
1-2 millisecond (write)
1-2 millisecond (read)
40000
Log rate limits (MBps)48
Max concurrent workers (requests)1600
Max concurrent logins (requests)1600
Max allowed sessions30000
Number of replicas4
As you increase the resources in your tier, you may also see changes in limits up to a certain threshold. Furthermore, these limits can be automatically relaxed over time, but never further restricted without penalty to the customer.
We document resource allocation by service tier and also the associated resource governance limits in the following resources:
While the resource allocation by service tier is intuitive to customers because the more you pay, the more resources you get, resource governance and boundaries has historically been less clear of a subject with customers. While we are increasing transparency around these governing mechanisms, it is important to understand the broader purposes behind resource governance in a database as a service (DBaaS). For this, we’ll talk next about what it takes to achieve a balanced system.

Providing a balanced database as a service (DBaaS)

For the context of this blog post, we define a system as balanced if all resources are sufficiently maximized without encountering bottlenecks. This balance includes an interplay of resources such as CPU, IO, memory, network paired with an application’s workload characteristics, maximum tolerated latency, and desired throughput.
With Azure SQL Database, our view of a balanced system must also take a broad and comprehensive perspective in order to meet articulated DBaaS requirements and customer expectations.
Azure SQL Database surfaces a familiar and popular database ecosystem with the intent of giving customers the following additional benefits:
  • Elasticity of scale – Customers can provision a database based on the throughput requirements of their application. As throughput requirements change, the customer can easily scale up or down.
  • Automated backups with self-service restore to any point in time – Database backups are automatically handled by the service, with log backups generally occurring every five to ten minutes.
  • High availability – Azure SQL Database supports a differentiated availability SLA with a maximum of 99.995 percent, backed by availability zone resilience to infrastructure failures.
  • Predictable performance – Customers on the same provisioned resource level always get the same performance with the same workload.
  • Predictable scalability – Customers using the hyperscale service tier can rely on predictable latency of the online scaling operations backed by a verifiable scaling SLA. This gives the customer a reliable tool to react to, changing compute capacity demands in a timely manner.
  • Automatic upgrades – Azure SQL Database is designed to facilitate transparent hardware, software upgrades, and periodic, lightweight software updates.
  • Global scale – Customers can deploy databases around the world and easily provision geographically distributed database replicas enabling regional data access and disaster recovery solutions. These solutions are backed by strong geo-replication and failover SLAs.
For the Azure SQL Database engineering team, providing a balanced DBaaS system for customers goes well beyond simply providing the purchased CPU, IO, memory, and storage. We must also honor all aforementioned factors and aim to balance these key DBaaS factors along with overall performance requirements.
The following figure shows some of the key resources that are governed within the service.
Figure 1: Governed resources in Azure SQL Database
We need to provide this balanced system in such a way that allows us to continually improve the service over time. This requirement for continual improvement implies a necessary level of component abstraction and over-arching governance. Governance in Azure SQL Database ensures that we properly balance requirements around scale, high availability, recoverability, disaster recovery, and predictable performance.
To illustrate, let’s use transaction log rate governance as an example of why we actively manage in order to provide a balanced DBaaS. Transaction log governance is a process in Azure SQL Database used to limit high ingestion rates for workloads such as bulk insert, select into, and index builds.
Why govern this type of activity? Consider the following dimensions and the impact of transaction log generation rate.
Dimension
Log generation rate impact
Database recoverability
We make guarantees around the maximum window of possible data loss based on transaction log backup frequency.
High availability
Local replicas must remain within a recoverability and availability (up-time) range that aligns with our SLAs.
Disaster recovery
Globally distributed replicas must remain within a recoverability range that minimizes data loss.
Predictable performance
Log generation rates must not over-saturate the system or create unpredictable performance.
Log rates are set such that they can be achieved and sustained in a variety of scenarios, while the overall system can maintain its functionality with minimized impact to the user load. Log rate governance ensures that transaction log backups stay within published recoverability SLAs and prevents an excessive backlog on secondary replicas. We have similar impact and interdependencies across other governed areas including CPU, memory, and data IOPs.

How we govern resources in Azure SQL Database

While we use a multi-faceted approach to governance, today we do rely primarily on three main technologies, Job Objects, File Server Resource Manager (FSRM), and SQL Server Resource Governor.

Job Objects

Azure SQL Database leverages multiple mechanisms for governing overall performance for a database. One of the features we leverage is Windows Job Objects, which allows a group of processes to be managed and governed as a unit.   We use this functionality to govern file virtual memory commit, working set caps, CPU affinity, and rate caps. We onboard new governance capabilities as the Windows team releases them.

File Source Resource Manager (FSRM)

Available in Windows Server, we use FSRM to govern file directory quotas.

SQL Server Resource Governor

A SQL Server instance has multiple consumers of resources, including user requests and system tasks. SQL Server Resource Governor was introduced to ensure fair sharing of resources and prevent out-of-control requests from starving other requests. This feature was introduced in SQL Server years ago and over time was extended to help govern several resources including CPU, physical IO, memory, and more for a SQL Server instance. We use this functionality in Azure SQL Database as well to help govern IOPs both local and remote, CPU caps, memory, worker counts, session counts, memory grant limits, and the maximum number of concurrent requests.
Beyond the three main technologies, we also created additional mechanisms for governing transaction log rate.

Configurations for safe and predictable operations

Consider all the settings one must configure for a well-tuned on-premises SQL Server instance, including database file settings, max memory, max degree of parallelism, and more. In Azure SQL Database we pre-configure several settings based on similar best practices. And as mentioned earlier, we pre-configure SQL Server Resource Governor, FSRM, and Job Objects to deliver fairness and prevent starvation. The reasoning behind this is to aim for safe and predictable operation. We can also provide varying settings for customers based on their workload and specific needs, assuming it conforms to safety limits defined for the service.

Improvements over time

Sometimes we deploy software changes that improve the performance and scalability of specific operations. Customers benefit automatically and we might exceed the defined limits and/or increase them for all customers in the future. Furthermore, as we enhance the hardware of machines, storage, and network, these benefits may also be transparently available to an application. This is because we have defined this DBaaS abstraction layer instead of just providing a specific physical machine.

Evolving governance

The Azure SQL Database engineering team regularly enhances governance capabilities used in the service. We continually review our models based on feedback and production telemetry and we modify our limits to maximize available resources, increase safety, and reduce the impact of system tasks.

Comments

Popular posts from this blog

Microsoft Whiteboard: now more colorful and customizable than ever

SharePoint page enhancements

New device modeling experience in Azure IoT Central