CATC Indexing
CATC Indexing
Overview
In a CATC indexing scheme:
- Tuple signature formed by concatenating attribute codewords.
- Signature is
mbits long, with ~=m / 2bits set to 1. - Codeword for
A_iisu_ibits long and has ~=u_i / 2bits set to 1. - Each codeword could be different length, but always
sum(u_i for i in range(n)) == m.
A tuple descriptor desc(t) is:
desc(t) = cw(A_n) + cw(A_n-1) + .. + cw(A_1).- Where
+denotes bit-string concatenation.
Queries
To answer query q in CATC:
- Generate
desc(q)by combining codewords for all attributes.- For known
A_iusecw(A_i); for unknownA_i, usecw(A_i).
- For known
- Attempt to match
desc(q)against all signatures in signature file.- Match implementation is exactly the same as in SIMC.
CATC Parameters
Let p_F denote the likelihood of a false match.
Ways to reduce false matches:
- Use different hash function for each attribute (
h_iforA_i). - Increase descriptor size (
m).
Choosing Optimal m and u
- Start by choosing acceptable
p_F. - Choose
maccording to same formula in SIMC. - Choose
u_i:- Each
A_ihas sameu_i, or - Allocate
u_ibased on size of attribute domains.
- Each
Query Cost for CATC
Exactly the same as in SIMC.
Variation on CATC
Page-level descriptors:
- Same as in SIMC.
Bit slices:
- Same as in SIMC.
Comparison with SIMC
Assume same m, p_F, n for each method:
- CATC has
u_i-bit codewords, each has~= u_i / 2bits set to 1. - SIMC has
m-bit codewords, each haskbits set to 1. - Signatures for both have
mbits, with~= m / 2bits set to 1. - CATC has flexibility in
u_ibut small(er) codewords so more hash collisions. - SIMC has less hash collisions, but has errors from unfortunate overlays.
