next up previous contents
Next: System Model Up: PDA Synchronization Previous: CPISync   Contents


Deterministic CPISync

Synchronization is done on two sets of similar data sets held on two different devices. By similar we mean that most of the records in the data sets are identical. Three types of updates may occur in databases - an addition of a record, a deletion of a record or a modification of a record, each introducing a difference. Consider a set $ S_{PDA}$, a data set on the PDA and a similar set on the PC, namely $ S_{PC}$. $ S_{PDA}$ and $ S_{PC}$ were probably derived from the same identical set (a previous synchronization and subsequent independent updates on the hosts for example). For simplicity we consider sets of integers. For example $ S_{PDA} =
\{1,2,4,5\}$ and $ S_{PC} = \{5,1,6\}$. The objective of an efficient synchronization routine is to determine the set $ S =
S_{PDA} \bigcup S_{PC} = \{1,2,4,5,6\}$ and install this set $ S$ in the PDA and the PC in the least possible time, using the least amount of communication and computation. CPISync first constructs a characteristic polynomial for each of the data sets it is reconciling which is simply a univariate polynomial whose factors are the elements in of the data set. These characteristic polynomials is sampled at $ \overline{m}$ sample points. The PDA sends these evaluations to the PC which calculates rational function evaluations by taking the ratios of the evaluations of the characteristic polynomials at the same sample points and interpolating a rational function $ f(z)$ fitting the points. The factors of the numerator and denominator of $ f(z)$ give the different (missing) set data. We now explain the CPISync Algorithm in detail by means of the example given above. Figure 3.1 outlines the steps of the algorithm which are itemized below.
Figure 3.1: CPISync Algorithm - a simple example showing interpolation based synchronization
\begin{figure}\epsfysize =9cm
\centerline{{\epsffile{example3.eps}}}
\end{figure}
  1. The PDA calculates the characteristic polynomial function from its data set $ S_{PDA}$. A characteristic polynomial function has all the data set elements as its factors, as is shown in figure 3.1. For example, for $ S_{PDA} = S_{A} =
\{1,2,4,5\}$, the characteristic polynomial function $ \chi_{PDA}(X) = (X-1)\cdot(X-2)\cdot(X-4)\cdot(X-5)$. This is the `signature' of the information contained in the data set. The PC also calculates the characteristic polynomial function for its own data set $ S_{PC}$ = $ S_B = \{5,1,6\}$; $ \chi_{PC}(X) =
(X-5)\cdot(X-1)\cdot(X-6)$. Note that the characteristic polynomials are always monic, i.e., the highest degree term of the polynomial has coefficient $ 1$.
  2. The characteristic function on the PDA is then `sampled' at some pre-selected random sample points. In our example, these are $ \{7,8,9\}$. The number of sample points has to be at least equal to the number of differences in the system which explains the need to know an upper bound $ \overline{m}$ on the number of differences in the system.
  3. The sampled characteristic polynomial function values of the PDA are transferred to the PC from the PDA. This almost completes the PDA's job, it is now up to the computationally stronger PC to do the remaining steps of the reconciliation synchronization. In Figure 3.1 the shaded text indicates the computation performed on the PC: this is the asymmetric capability of the CPISync algorithm.
  4. The PC calculates rational function instantiations by dividing each sample point evaluation of the PDA's characteristic polynomial function with its own characteristic polynomial function's evaluation at that same sample point, ending up with 2-tuples of (sample point, rational function evaluation).
  5. These 2-tuples are then interpolated to obtain the rational function $ F(X)$. This rational function is of the type $ \frac{Pu(X)}{Qu(X)}$ where $ Pu(X)$ and $ Qu(X)$ are (unreduced) polynomials with coefficients in the finite field $ F_{11}$ (in our example). $ Pu(X)$ and $ Qu(X)$ are divided with their GCD to obtain an irreducible polynomial of the form $ \frac{P(X)}{Q(X)}$. The sum of the degrees of $ P(X)$ and $ Q(X)$ equal the total number of differences between the data sets $ S_{PDA}$ and $ S_{PC}$.
  6. $ P(X)$ and $ Q(X)$ are separately factored yielding the differences the PDA and the PC are missing respectively. The factors of $ P(X)$ are sent to the PDA to update its data set, while the factors of $ Q(X)$ are used by the PC to update its own data set and incorporate the new data from the PDA that it had been missing. Note that the last step is the most simplistic scenario. There might be instances of conflicts (see section 2.1) which would need to be dealt with separately.
There are some subtleties that need to be explained at this point. The first is the use of finite field arithmetic to limit the size of the characteristic polynomial evaluations because their calculation involves evaluating a product of factors which grows very rapidly. We have used $ F_{32749}$ for the PC-PDA synchronization, which gives us a 15-bit space for the mathematical operations. One bit out of these is used for the sample points, which are discussed next.
Figure 3.2: The finite field space is divided into two regions, one for sample points and the other for data points
\begin{figure}\epsfysize =7cm
\centerline{{\epsffile{fig2of3.eps}}}
\end{figure}
There are two things that need to be explained about the choice of the sample points. The sample points need to be separate from the data points in the set. This is because otherwise, the characteristic polynomial of the corresponding data set will evaluate to 0 in step $ 2$. In our example (Figure 3.1), if we selected $ 2$ as a sample point, the characteristic polynomial evaluation $ \chi_{PDA}(2) =
(2-1)\cdot(2-2)\cdot(2-4)\cdot(2-5) = 0$. A bad choice may result in a division-by-zero error in step $ 4$ of the algorithm. Since we do not know a priori the number of differences in the data to be reconciled, we increase the size of the finite field by approximately $ 1$ bit. One bit provides adequate space to accommodate the sample points because for $ b$ bit data, there can be a maximum of $ 2^b$ differences between any two data sets and CPISync requires at least as many sample points as there are differences. This concept is shown in figure 3.2. The idea here is to pick the sample points from a separate `region' of the number line than the data points, eliminating the possibility of a `collision' between the data and sample points. The specific choice of the region of sample points in Figure 3.2 is because choosing sample points symmetrically over the 0 point speeds up calculations in the Gaussian elimination involved in interpolation in step $ 2$ of the algorithm described above. Symmetric sample points (such as $ 0,1,-1,2,-2...$) help in making the Gaussian elimination faster by making it simple to get many zeros in the matrix while solving for the coefficients of the rational function, thus speeding up Gaussian elimination. This is particularly important because as we will see later, most of the latency arises from this Gaussian elimination. Note that exact symmetry about 0 is possible only for an odd number of sample points. A second point of interest to mention here is the choice of the finite field. We have used $ F_p$ where $ p$ is a prime number. For the Palm PDA, the word size is 16 bits. We carefully chose $ p = 32749 (= 2^{15} - 19)$. This is the largest prime number less than $ 2^{15} (= 32768)$. In case the records are variable length strings, they may be hashed into fixed small length hashes to be used as the data set elements that are reconciled using CPISync. The difference information in terms of hashes reported by CPISync may then be transformed into information about the corresponding differing strings using a reverse table lookup. We have avoided the issues of hashing by restricting our record database entries to 15-bit integers. We note that, in practice, the hashing operation needs to be performed only once per entry, at the time that the entry is added to the data set, thus the complexity of the hashing is not a bottleneck for synchronization. By restricting entries to 15 bits, we have also avoided the issues of multiple-precision arithmetic on the PDA.
Figure 3.3: The overall scheme of the PC-PDA synchronization system
\begin{figure}\epsfysize =8cm
\centerline{{\epsffile{fig3of3.eps}}}
\end{figure}
Figure 3.3 shows the high level scheme of the PDA-PC synchronization. The thing to note here is that the transfer of data (indicated by the arrows) is typically through a slow link. In the case of a Palm PDA, the transfer is through a serial cable or IrDA (Async Serial-IR 9600-115.2kb/s) [28]. Newer PDAs have USB [18] and Ethernet connectivity. These technologies are substantially faster than serial links, though newer PDAs often run more data intensive applications, leaving the problem of slow data transfer between the PDA and the PC network intact, as was shown in Figure 2.5. Figure 3.3 indicates that the total communication between the PDA and the PC is simply the transmission of the characteristic polynomial function evaluations from the PDA to the PC (there are $ \overline{m}$ such evaluations). After the reconciliation is complete, the number of data set elements the PDA is missing ( $ \overline{m}$ at worst) are transferred back to the PDA. Thus taken totally, the number of data elements transferred is O( $ \overline{m}$), irrespective of the sizes of the data sets on the PDA and the PC.

Subsections
next up previous contents
Next: System Model Up: PDA Synchronization Previous: CPISync   Contents
Sachin Kumar Agarwal 2002-07-12