Failover Time-Prediction

In the previous post, we learned the basic definition of SQL’s new feature AlwaysOn and the various events that occur when a failover happens.

Today, we will learn how SQL Database engine predicts the time involved during the failover which can be very helpful for deciding the Service Level Agreement(SLA) of the Recovery Time Objective(RTO) for high availability solutions.

Depending when the SQL Server Instance last performed the checkpoint operation, there can be substantial amount of dirty pages in the buffer pool. The failover lasts as long as it takes time to write the dirty pages from buffer cache to disk which can lead to long and unpredictable time.

In SQL 2102, the Failover Cluster Instance (FCI) can use the Indirect Checkpoints to control the amount of dirty pages kept in the buffer cache. This makes the time more predictable and configurable.

To understand this more, lets see Checkpoints and its type.

What’s a CheckPoint?

Whenever a modification is made to a database the changes are not directly applied to physical files (MDF & LDF). Instead for performance reasons the Database Engine performs these modifications in memory known as buffer cache. A checkpoint is issued periodically on each database which then further flushes the changes to disk. A checkpoint also writes the information about the transaction log from memory to disk.
What are the different types of CheckPoint?
There are 4 types of Checkpoints-

Automatic Checkpoint
Indirect Checkpoint
Manual Checkpoint
Internal Checkpoint

Automatic:- It is issued automatically in background. This is a server level configuration decided for all the databases present in the instance. The recovery interval option decides the maximum time it should take to recover the databases in the event of system crash,clean shutdown or a failover. The automatic checkpoint is issued considering this recovery interval option. The default recovery interval is 0 which results in automatic checkpoints every minute and it recovers in less than a minute.

To change the recovery interval to 2 minutes we can use below example.

USE AdventureWorks2012 ;

GO

EXEC sp_configure ‘show advanced options’, 1;

GO

RECONFIGURE ;

GO

EXEC sp_configure ‘recovery interval’, 2 ;

GO

RECONFIGURE;

GO

Indirect- is again issued in background but this time it is issued to meet the user defined target recovery time for a give database. This is a new feature introduced in SQL Server 2012. It is much faster than automatic checkpoint but at the cost of high I/O spikes.

The Automatic checkpoint determines the number of transactions to determine the recovery time whereas Indirect uses the number of dirty pages. Indirect checkpoints ensures that the dirty pages are below certain threshold so that the recovery time is within the target time.

When indirect checkpoint is enabled on a database and a database which receives large DML operations, the background writer aggressively start flushing the dirty pages to disk causing additional I/O.

The below example changes the target recovery time to 60 seconds.

ALTER DATABASE AdventureWorks2012 SET TARGET_RECOVERY_TIME = 60 SECONDS;

Manual- Issue when a manual TSQL script is executed with the syntax of CHECKPOINT.

IF issued without any parameter then it will take the default time to complete the checkpoint and if issued with the parameter value say 5 then it will attempts to complete the operation in 5 seconds.

The less time specified will result to more resource usage. Therefore one should be careful when using the parameter value.

Internal- This is issued by various server operations like, backups, a file have been added or removed by using the Alter Database, or a database snapshot is created, database shutdown is performed.

Now that we know what is the use of checkpoint and the various checkpoints let’s understand how the indirect checkpoint helps in recovering the database in short period of time only.

Prior to SQL 2012, if recovery time was required to be changed then update was required to be made at server level which resulted to set across all the databases and since the default value was 1 all we could do to update the value was to set to a longer time.

With SQL Server 2012, the target recovery time of a user database can now be changed. Meaning that, if there are more critical databases then we can change the Target recovery time of a database lets say to 30 seconds then if there is a sudden shutdown or a failover happens then this database can recover in 30 seconds and rest other databases will recover in their default 1 minute time.

Thus the discussion ends here summarizing that the FCI can use this indirect checkpoints for the databases who’s up-time is essential. To study more check the links below.

SQL Server 2012 Indirect Checkpoints

Checkpoints

In the upcoming topics we will learn more about the AlwaysOn Availability Groups. Stay tuned for more updates.