Have you noticed the SQL Server Agent Job RETRY is set to 2147? Well it is not, the dialog window just cuts off the rest of the digits.
Actually the Replication Agent are set by default to retry 2147483647 times, once every minute. That is over 4000 years!
2147483647 = 35791394 Hours = 1491308 Days = 4085 Years.
How does this impact Replication?
Should a Replication Agent fail and the agent be configured for “continuous”, then agent will not report back to SQL Agent Service, but will wait 1 minute, then try again for 4000 years. Any ALTERTS placed with SQL Agent for “job failure” will not get called.
For example, if the Distribution Agent, again running continuous, is unable to delivery data because of 20598 or a 2627 error, the Agent will go into an infinite retry logic.
The row was not found at the Subscriber when applying the replicated command.
Violation of PRIMARY KEY constraint ‘PK__A’. Cannot insert duplicate key in object ‘dbo.A’.
If undetected, the Agent will stay in retry mode for 36 hours until the Distribution Cleanup job removes the “cached” transactions from the Distribution Database. At this point the Distribution Agent will no longer have the data needed to “catch up” the subscriber. As result, the Subscriber will be marked “inactive” and post the following error in the repl_errors table in the Distribution database.
|18854||One or more subscriptions have been marked inactive. Drop and re-create all subscriptions for this node that are failing with this error.|
Which Agent are marked to Retry?
You can look at the individual Agent job properties in SQL Server Management Studio or run the following query in the MSDB database.
Select * from msdb.dbo.sysjobsteps where retry_attempts =2147483647
What should I do instead?
I would recommend setting all of your Replication Agent retry_attempts at smaller interval, say 60 (fail after 1 hour). The actual value you select is dependent on your business needs. For extra measure, you can also add notification ALERT that the JOB has failed.
This should give you enough time to detect the problem, resolve the issue, and get the Replication Agent up and running before the subscriber is marked as “expired”.