Microsoft SQL Server Always On Availability Groups: Implementation, Monitoring, Troubleshooting, and Restoring

Fabian John-Baptiste
Jan 24, 2025
9 min read

Microsoft SQL Server Always On Availability Groups: Implementation, Monitoring, Troubleshooting, and Restoring

Always On Availability Groups (AOAGs) are a critical feature in Microsoft SQL Server that offer high availability and disaster recovery for your databases 1. By allowing multiple copies of a database to be synchronized across different servers, AOAGs ensure that your data remains accessible even if one server fails. This comprehensive guide will walk you through the process of implementing, monitoring, troubleshooting, and restoring failed availability groups, as well as reviewing SQL Server error logs for related issues.

Implementing Always On Availability Groups

Before implementing AOAGs, ensure you meet the following prerequisites 1:

SQL Server Editions: You need SQL Server 2019 Enterprise or Standard edition. AOAGs are not supported in Express or Web editions.
Windows Server: Servers must run a supported Windows Server version (e.g., Windows Server 2016 or later).
Domain Configuration: All servers should be part of the same Active Directory domain.
Cluster Service: Install and configure the Failover Cluster feature on all participating nodes.

Step 1: Configure Windows Server Failover Clustering (WSFC)

Install Failover Clustering Feature: In Server Manager, navigate to Manage -> Add Roles and Features and follow the wizard to install the Failover Clustering feature on all servers.
Create a Failover Cluster: Open the Failover Cluster Manager and click Create Cluster. Follow the wizard, specifying the cluster name and IP address. Validate the configuration and complete the cluster creation.
Configure a File Share Witness: To ensure quorum in the WSFC cluster, configure a file share witness on a separate server. This witness acts as a tie-breaker in case of a split-brain scenario where communication between cluster nodes is lost 2.

Step 2: Configure SQL Server for Always On

Enable Always On Availability Groups: Open SQL Server Configuration Manager, go to SQL Server Services, right-click the SQL Server instance, and select Properties. Navigate to the Always On High Availability tab, check Enable Always On Availability Groups, specify the WSFC instance name, and click OK. Restart the SQL Server service.
Configure Service Account: Create a domain user account for the SQL Server service and add it to the Administrators group. This service account needs appropriate permissions to access cluster resources and manage the availability group 3.

Step 3: Create the Availability Group

Create Database Backups: Before adding a database to an availability group, back up the databases you want to include. Perform a full backup and a transaction log backup of each database.
Create an Availability Group: In SQL Server Management Studio (SSMS), connect to the primary SQL Server instance. Right-click the Always On High Availability node and select New Availability Group Wizard. Follow the wizard, specifying the availability group name, selecting the databases to include, adding replicas (secondary servers), configuring failover settings, and setting backup preferences. Configure the availability group listener, which provides a virtual network name for applications to connect to.
Specify Replicas: When adding replicas, you can choose between synchronous commit and asynchronous commit modes. Synchronous commit ensures that transactions are committed on both the primary and secondary replicas before being acknowledged, guaranteeing no data loss. Asynchronous commit allows transactions to be committed on the primary replica first and then replicated to the secondary replicas, providing better performance but with the potential for data loss 3. SQL Server 2012 supports up to 4 secondary replicas, while SQL Server 2014 and later versions support up to 8 replicas 3.
Backup Preferences: Configure backup preferences to specify where backups should be taken and how replicas are prioritized for backups. The available options are:
Prefer Secondary: Backups are preferentially taken on secondary replicas if available.
Only Secondary: Backups are only taken on secondary replicas.
Secondary Only: Similar to "Only Secondary," but with different behavior in some scenarios.
Primary: Backups are taken on the primary replica 3.

Availability Mode	Failover Mode	Data Loss Potential	Synchronization
Synchronous Commit	Automatic	No Data Loss	Primary and secondary replicas are synchronized before the transaction is committed.
Asynchronous Commit	Automatic	Potential Data Loss	Transactions are committed on the primary replica first and then replicated to the secondary replicas.
Synchronous Commit	Manual	No Data Loss	Same as above, but failover requires manual intervention.
Asynchronous Commit	Manual	Potential Data Loss	Same as above, but failover requires manual intervention.

Monitoring Availability Group Health

Monitoring the health of your AOAGs is crucial to ensure high availability and data integrity. Here are some key aspects to monitor 4:

1. Database Health Checks

Synchronization State: Verify that secondary replicas are synchronized with the primary replica.
Data Loss Potential: Evaluate the potential for data loss during a failover.
Database Status: Ensure databases are online and accessible.

2. Performance Metrics

Log Send Queue: Monitor the size of the log send queue to identify potential bottlenecks in log transmission.
Redo Queue: Track the redo queue size to assess the performance of log replay on secondary replicas.
Synchronization Health: Use this metric to get an overall view of the health of data replication.

3. Cluster Health Checks

Node Status: Check if all cluster nodes are up and running.
Quorum Configuration: Ensure the cluster has a quorum, allowing it to make decisions even if some nodes fail.
Resource Status: Monitor the availability of cluster resources.

Database Level Health Detection

Starting in SQL Server 2016, you can enable database level health detection (DB_FAILOVER) for an availability group 5. This feature allows the availability group to automatically fail over if a database is no longer online due to issues such as transaction log write failures. This option is enabled for the entire availability group and cannot be selectively applied to specific databases.

To enable database level health detection, you can use the New Availability Group Wizard in SSMS, modify the availability group properties, or use the ALTER AVAILABILITY GROUP T-SQL statement with the DB_FAILOVER = ON option.

The DB_FAILOVER option works in conjunction with the FAILURE_CONDITION_LEVEL option, which configures the thresholds for SQL Server process health detection. These two options are independent and can be configured separately.

Tools for Monitoring

SQL Server Management Studio (SSMS): The Always On Dashboard in SSMS provides real-time information about replicas and databases.
Performance Monitor (PerfMon): Monitor Always On counters under SQLServer:Replica and SQLServer:Database Replica.
SQL Server Agent Alerts: Configure alerts for critical events like lease timeouts.
Custom Monitoring Scripts: Develop scripts to check replica health and monitor database states.

Best Practices

Regular Health Checks: Implement automated and frequent health checks to detect issues early on.
Alerts Configuration: Set up immediate notifications for critical events like lease timeouts or database failures.
Documentation: Maintain comprehensive documentation of your monitoring configurations and any incidents.
Testing: Regularly test failover processes to validate your infrastructure's resilience and ensure that failovers occur as expected.
Updates: Keep SQL Server and Windows Server updated with the latest patches and security updates to benefit from bug fixes and performance improvements 4.

Troubleshooting Common Issues

Troubleshooting AOAG issues requires a systematic approach. It's crucial to investigate and address the root cause of failovers rather than simply restarting services or manually failing over. Correlating recent changes in your environment with health problems can help you pinpoint the source of the issue 6. Here are some common problems and troubleshooting steps 6:

1. Failover Issues

Cluster Health Events: Investigate cluster log errors for issues like node removal or communication failures.
SQL Server Service Down: Ensure the SQL Server service is running on all replicas.
Lease Timeouts: Check for network latency, resource contention, or disk I/O bottlenecks that might cause lease timeouts 7.
Health Check Timeouts: Investigate non-yielding scheduler events or system performance issues. You can use Extended Events to capture information about non-yielding scheduler events, such as scheduler_monitor_non_yielding_ring_buffer_recorded and scheduler_monitor_stalled_dispatcher_ring_buffer_recorded 6.
SQL Server Health Issues: Analyze SQL Server error logs and extended events for issues like excessive memory dumps or deadlocks. You can use Extended Events to capture diagnostic information related to failover events, such as the availability_replica_database_fault_reporting event 6.

2. Synchronization Issues

Network Connectivity: Verify network connectivity between replicas.
Disk Space: Ensure sufficient disk space on secondary replicas.
Endpoint Configuration: Check endpoint configuration, including port numbers and permissions.
Database Status: Troubleshoot databases in Restoring, Recovering, Recovery Pending, or Suspect states.

3. Performance Issues

Network Latency: High network latency can impact synchronization and failover performance.
Resource Contention: Identify and address resource bottlenecks on replicas.
Disk I/O: Optimize disk I/O performance to improve log transmission and replay.

Restoring Failed Availability Groups

When an availability group fails, you need to restore it to ensure high availability. Here are some common scenarios and restoration methods 8:

1. Restoring a Single Database

Remove the database from the availability group using SSMS or the ALTER AVAILABILITY GROUP statement with the REMOVE DATABASE clause.
Drop the database on secondary replicas.
Restore the database on the primary replica from a backup.
Back up the database and transaction log on the primary replica.
Restore the database and transaction log on secondary replicas with the NORECOVERY option.
Add the database back to the availability group using SSMS or the ALTER AVAILABILITY GROUP statement with the ADD DATABASE clause.

2. Recovering from a Failed Replica

Remove the failed replica from the availability group.
Resolve the underlying issue causing the failure.
Re-join the replica to the availability group.

3. Dealing with a Damaged Database

Remove the replica hosting the damaged database.
Resolve any system issues contributing to the database failure.
Restore the replica to the availability group.

Reviewing SQL Server Error Logs

The SQL Server error log is a valuable resource for troubleshooting AOAG issues. It reports events related to 10:

WSFC cluster communication
Availability replica state transitions
Availability database state transitions
Connectivity between replicas
Endpoint and listener statuses
Lease status
Error events

Review the error log for symptoms like 10:

Inability to access availability databases
Unexpected failovers
Availability group in Resolving state
Availability group in an indeterminate state

Use tools like SSMS or the sp_readerrorlog stored procedure to review the error log and filter for specific keywords like RESOLVING to identify potential issues 11.

Keyword	Description	Possible Causes
RESOLVING	Indicates that the availability group is in the process of resolving a synchronization or failover issue.	Network communication problems, endpoint permission issues, corrupted storage on the secondary replica, offline WSFC node, or excessive VLFs on the database.
NOT SYNCHRONIZING	Indicates that one or more replicas are not synchronized with the primary replica.	Network issues, disk space problems, or high transaction volume.

You can also configure SQL Server for additional logging into the error logs. For example, you can enable a trace flag to capture deadlocks information 12. To manage error logs effectively, you can use the sp_cycle_errorlog stored procedure to manually recycle them 12.

Regularly reviewing error logs is crucial for identifying potential issues before they escalate and cause data loss or downtime 12.

Reviewing SQL Server Agent Logs

In addition to the SQL Server error log, the SQL Server Agent also maintains its own error log. This log contains information about SQL Server Agent jobs, alerts, and other activities.

The SQL Server Agent error log is a text file with the extension *.OUT. By default, it is located in the \MSSQL\Log directory of your SQL Server instance. You can view the log file location and configure the logging level (warnings, errors, or information messages) by right-clicking on the Error Logs folder under SQL Server Agent in SSMS and selecting Configure.

To archive the current SQL Server Agent error log, you can use the sp_cycle_errorlog stored procedure. This procedure creates a new log file and renames the existing log files, similar to the process for SQL Server error logs 12.

Conclusion

Always On Availability Groups are a powerful solution for achieving high availability and disaster recovery in SQL Server. By following the steps outlined in this guide, you can implement, monitor, troubleshoot, and restore AOAGs effectively. Key takeaways include:

Prerequisites: Ensure you meet the requirements for SQL Server editions, Windows Server versions, domain configuration, and cluster service installation.
Implementation: Configure WSFC, enable Always On Availability Groups, create a service account, and configure the availability group with appropriate settings for replicas and backups.
Monitoring: Monitor database health, performance metrics, and cluster health. Utilize database level health detection and various tools like SSMS, PerfMon, and SQL Server Agent alerts.
Troubleshooting: Systematically investigate failover, synchronization, and performance issues. Analyze error logs and extended events to identify root causes.
Restoration: Employ appropriate methods to restore failed availability groups, including restoring individual databases, recovering failed replicas, and dealing with damaged databases.
Log Review: Proactively review SQL Server error logs and SQL Server Agent logs to identify potential issues and ensure the health and availability of your databases.

By implementing AOAGs, proactively monitoring their health, and effectively troubleshooting any issues, you can significantly minimize the risk of data loss and downtime, ensuring that your critical SQL Server databases remain highly available and resilient.

Works cited

1. How to implement Always on Availability Groups in SQL Server 2019 on Windows?, accessed January 24, 2025, https://rafaelrampineli.medium.com/how-to-implement-always-on-availability-groups-in-sql-server-2019-on-windows-11f6fb8aad5f

2. Configure SQL Server Always On availability groups with synchronous commit using an internal load balancer | Compute Engine Documentation | Google Cloud, accessed January 24, 2025, https://cloud.google.com/compute/docs/instances/sql-server/configure-availability

3. Configuring a SQL Server AlwaysOn High Availability Group - SQLShack, accessed January 24, 2025, https://www.sqlshack.com/configuring-a-sql-server-alwayson-high-availability-group/

4. SQL Server Always On Health Check and Lease Timeout Monitoring, accessed January 24, 2025, https://www.sqltabletalk.com/?p=855

5. Database-level health detection - SQL Server Always On | Microsoft Learn, accessed January 24, 2025, https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/sql-server-always-on-database-health-detection-failover-option?view=sql-server-ver16

6. Troubleshoot failover of Always On Availability Groups - SQL Server - Microsoft Learn, accessed January 24, 2025, https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/availability-groups/troubleshooting-availability-group-failover

7. Troubleshoot Always On Availability Groups Configuration (SQL Server) - GitHub, accessed January 24, 2025, https://github.com/MicrosoftDocs/sql-docs/blob/live/docs/database-engine/availability-groups/windows/troubleshoot-always-on-availability-groups-configuration-sql-server.md

8. Restoring backup to AlwaysOn Availability Group primary server. - SQLServerCentral, accessed January 24, 2025, https://www.sqlservercentral.com/forums/topic/restoring-backup-to-alwayson-availability-group-primary-server

9. Restore an existing availability group database participating in SQL Server Always On Availability Groups - SQLShack, accessed January 24, 2025, https://www.sqlshack.com/restore-an-existing-availability-group-database-participating-in-sql-server-always-on-availability-groups/

10. SQL Server error log (Always On Availability Groups) - Microsoft Learn, accessed January 24, 2025, https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/sql-server-error-log-always-on-availability-groups?view=sql-server-ver16

11. Reviewing the SQL Server Error Logs to Find the Two Most Common Keywords That Can Cause Availability Issues, accessed January 24, 2025, https://www.edwinmsarmiento.com/sqlserver-error-log-troubleshooting-ha/

12. How to manage SQL Server logs effectively - SQLShack, accessed January 24, 2025, https://www.sqlshack.com/how-to-manage-sql-server-logs-effectively/