Primary Key design considerations for Merge Replication
by Holger Linke, Microsoft SQL Escalation Engineer
GUIDs and Clustered Indexes
Uniqueness is a key factor when synchronizing data between SQL Server/Azure and multiple endpoints like Slates and Smartphones. With data simultaneously created and updated on servers and clients, ensuring rows are unique to avoid key collisions is critical. As you know, each row is uniquely identified by its Primary Key.
When creating Primary Keys, it’s common to use a compound key based on things like account numbers, insert time and other appropriate business items. It’s even more popular to create Identity Columns for the Primary Key based on an Int or BigInt data type based on what I see from my customers. When you designate a column(s) to be a Primary Key, SQL Server automatically makes it a Clustered Index. Clustered indexes are faster than normal indexes for sequential values because the B-Tree leaf nodes are the actual data pages on disk, rather than just pointers to data pages.
While Identity Columns work well in most database situations, they often break down in a data synchronization scenario since multiple clients could find themselves creating new rows using the same key value. When these clients sync their data with SQL Server, key collisions would occur. Merge Replication includes a feature that hands out blocks of Identity Ranges to each client to prevent this.
When using other Microsoft sync technologies like the Sync Framework or RDA, no such Identity Range mechanism exists and therefore I often see GUIDs utilized as Primary Keys to ensure uniqueness across all endpoints. In fact, I see this more and more with Merge Replication too since SQL Server adds a GUID column to the end of each row for tracking purposes anyway. Two birds get killed with one Uniqueidentifier stone.
Using the Uniqueidentifier data type is not necessarily a bad idea. Despite the tradeoff of reduced join performance vs. integers, the solved uniqueness problem allows sync pros to sleep better at night. The primary drawback with using GUIDs as Primary Keys goes back to the fact that SQL Server automatically gives those columns a Clustered Index.
I thought Clustered Indexes were a good thing?
They are a good thing when the values found in the indexed column are sequential. Unfortunately, GUIDs generated with the default NewId() function are completely random and therefore create a serious performance problem. All those mobile devices uploading captured data means lots of Inserts for SQL Server. Inserting random key values like GUIDs can cause fragmentation in excess of 90% because new pages have to be allocated with rows pushed to the new page in order to insert the record on the existing page. This performance-killing, space-wasting page splitting wouldn’t happen with sequential Integers or Datetime values since they actually help fill the existing page.
What about NEWSEQUENTIALID()?
Generating your GUIDs on SQL Server with this function will dramatically reduce fragmentation and wasted space since it guarantees that each GUID will be sequential. Unfortunately, this isn’t bulletproof. If your Windows Server is restarted for any reason, your GUIDs may start from a lower range. They’ll still be globally unique, but your fragmentation will increase and performance will decrease. Also keep in mind that all the devices synchronizing with SQL Server will be creating their own GUIDs which blows the whole NEWSEQUENTIALID() strategy out of the water.
Takeaway
If you’re going to use the Uniqueidentifier data type for your Primary Keys and you plan to sync your data with RDA, the Sync Framework or Merge Replication, ensure that Create as Clustered == No for better performance. You’ll still get fragmentation, but it will be closer to the ~30% range instead almost 100%.