28 Posts
0
944
February 3rd, 2009 06:00
How does "Normal Incremental Sync" work under the hood
The tech manual says:
"Normal Incremental Synchronization
In this mode, each file is analyzed at the source and target. This
requires the file to be fully scanned at each end. A list of cyclical
redundancy checks (CRCs) are exchanged and the source determines
how to patch the target to make it identical to the source. The source
sends the appropriate patches to the target, which updates the file.
All aspects of the file will be updated, including alternate streams,
attributes, and permissions.
If the source determined that the file is sufficiently different from the
file on the target, it may not patch it but send the entire file, similar to
a Full Synchronization."
My reading of this is that the RepliStors do a block (or similar low-level) scan of the entire file, generating CRCs, and then patch the target with the differences to bring it in sync.
This implies that the Sync does NOT necessarily copy the whole file if the source has not changed significantly enough to satisfy the second part of the statement.
What constitutes "sufficiently different": 10% changed, 20%, 60%? It would be nice to know.
Secondly, the wording implies that the WHOLE scan operation completes at both ends, building CRCs for EVERY file, and only THEN are these lists exchanged and checked for changes. Is this the case?
If so, I would expect a long period of little data transfer while the scans are in progress, followed by high data transfer as the lists are exchanged and the "patches" start shipping. (My observations of a test sync seem to point in this direction.)
Am I on the right track here? Can anyone confirm the detailed workings of Sync?
It is of critical importance to me as we have very slow links here, and we need to calculate whether we can get the anticipated changes per day transferred quickly enough.
Thanks
JohnG
dramjass
151 Posts
0
February 3rd, 2009 06:00
"An incremental sync splits the file into "portions" (about 256k, I think). For each portion, it splits that further into segments of 2k. For each segment, it takes an MD5 hash. For each portion, it sends a list of these hashes to the target where it creates a hash list in the same way. It compares the hashes. If any differ, it sends that information back to the source and the source will send over each 2k segment that was different and patches the target file."
Therefore, there is meta-data sent between the Source and Target during the Inc Sync to compare the files. Only changes in the file are sent. As for the "sufficiently different" part, the service decides this and I do not think it can be quantified as a percentage. I do not believe this piece has ever been documented, as it relates to the code side of RepliStor. Besides, if a file is that much different on the Source vs. Target and most of it is being sent anyway, why not just do a Full Sync of the file? In my experience, I have never really seen this be a factor, even with SQL DBs.
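The portion/segment comparison described above can be sketched roughly as follows. This is only an illustration of the technique, not RepliStor's actual code; the 256 KB and 2 KB sizes are taken from the quote, and the function names are my own.

```python
import hashlib

PORTION = 256 * 1024  # portion size quoted above (approximate)
SEGMENT = 2 * 1024    # segment size within each portion

def segment_hashes(portion):
    """Return one MD5 hash per 2 KB segment of a portion."""
    return [hashlib.md5(portion[i:i + SEGMENT]).hexdigest()
            for i in range(0, len(portion), SEGMENT)]

def diff_segments(source_portion, target_portion):
    """Indices of segments whose hashes differ; these are the 'patches' sent."""
    src = segment_hashes(source_portion)
    tgt = segment_hashes(target_portion)
    length = max(len(src), len(tgt))
    # A segment missing on either side also counts as different
    return [i for i in range(length)
            if i >= len(src) or i >= len(tgt) or src[i] != tgt[i]]

def apply_patches(target_portion, source_portion, changed):
    """Overwrite only the changed 2 KB segments on the target copy."""
    out = bytearray(target_portion)
    for i in changed:
        seg = source_portion[i * SEGMENT:(i + 1) * SEGMENT]
        out[i * SEGMENT:i * SEGMENT + len(seg)] = seg
    return bytes(out[:len(source_portion)])
```

So for a file where only one 2 KB segment changed, the sync ships the hash lists plus that one segment, rather than the whole file.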
Scanning of every file and of the entire file does occur on both the Source and Target. This is the case.
If there is an Inc Sync occurring, new I/O occurring in the environment does not get sent until the Sync is complete. Therefore, the data is sent in a FIFO fashion. This is required for data consistency. You will see the Kernel Cache, and potentially the Kernel Logs, grow during a Sync. This is by design.
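The FIFO behaviour above can be pictured with a small sketch: new writes captured while a sync is running are queued (the Kernel Cache stand-in here) and only shipped, in order, once the sync finishes. The class and method names are illustrative, not RepliStor internals.

```python
from collections import deque

class SyncQueue:
    """Toy model of write ordering around an incremental sync."""

    def __init__(self):
        self.pending = deque()        # stand-in for the Kernel Cache
        self.sync_in_progress = False
        self.sent = []                # what has gone over the wire, in order

    def write(self, change):
        if self.sync_in_progress:
            self.pending.append(change)   # cache grows during the sync
        else:
            self.sent.append(change)      # normal mirroring path

    def finish_sync(self):
        self.sync_in_progress = False
        while self.pending:               # drain in FIFO order for consistency
            self.sent.append(self.pending.popleft())
```

This is why the cache growing during a long sync is expected rather than a fault: the writes are held back, not lost.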
Let me know if you have any questions beyond this.
Duncan
tribicic
157 Posts
1
February 3rd, 2009 06:00
Mirroring continuously captures granular changes on the filesystem and sends them over; it will not send anything else except for the actual data changed.
You can monitor the amount of data sent over using the performance monitor while testing this between two local servers. You don't even have to care whether the data can be transferred in real time; for example, it can accumulate over the office hours, then catch up during the night and manage to send all the data over.
Personally, I would create a sync in between local machines, sync them, and then break the network link. After 24 hours you can check the size of the accumulated cache files in the data directory, that way you can see the amount of changes you can expect per day, and easily calculate whether it is theoretically possible to push that much data over to the other side.
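Once you have measured the accumulated cache size, the feasibility check is simple arithmetic. A rough sketch, with placeholder numbers (the efficiency factor is an assumption to allow for protocol overhead, not a RepliStor figure):

```python
def hours_to_transfer(daily_change_mb, link_kbps, efficiency=0.7):
    """Hours needed to push one day's accumulated changes over the link.

    efficiency discounts protocol overhead and link contention
    (70% is an assumed, conservative figure).
    """
    bits = daily_change_mb * 1024 * 1024 * 8
    seconds = bits / (link_kbps * 1000 * efficiency)
    return seconds / 3600.0
```

For example, 2 GB of daily changes over a 512 kbps link works out to roughly 13 hours; that only stays feasible as long as the result is comfortably under 24.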
Regarding the questions about the sync, I don't have any firm info about this either.
jtgowing
28 Posts
0
February 3rd, 2009 08:00
Your detailed explanation of Sync confirms my expectation that it is, indeed, a granular update.
The reason I am concerned is that with a low-speed link (typical link speeds here are only a few hundred Kbps), the steady daily stream of Mirror updates is not the problem; it is the massive transfer that can occur during a Sync, which can take several days, that is the problem.
So I am keen to understand the process in detail so I can best optimise my approach, hence the other thread regarding Pre-Sync Manual Copies etc.
You have confirmed most of what I considered in terms of doing local syncs, and then moving the data drive, Pre-copying to portable drives etc etc.
I will now do my measurements and calculations and figure out the best way to do this.
Thanks again and good luck
JohnG