Created by Robert Berry / @no0p_

Record linking solves one specific data quality problem.

Two or more records,

referencing the same entity,

without a key.

“Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources.

Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier”

But how can we discern which are matches without a key?

Label the records in table1 as set **A**, the records in table2 as set **B**.

Define a set of possible matches as **A** X **B**, such that each pair (a∈A,b∈B) is a possible match or non-match.

The set of true pairs is labeled set **T** where (a=b), and the set of false pairs is labeled set **F** where (a≠b).

Solve the problem by distinguishing whether a given pair belongs to set **T** or set **F**.

**T** and **F** are a partition of **A** X **B**.

Which is resolved by a function returning the probability that any old tuple from **A** and any old tuple from **B** are a true pair.

$P(truepair) = f(a, b)$

$P(truepair) = f(a, b) = f(x_1, x_2, x_3, \ldots x_n)$

Where $x_1, x_2, \ldots x_n$ are numeric metrics of the pairing

$d_j = \frac{1}{3}(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m})$

Where:

$m$ is the number of matching characters.

$t$ is half the number of transpositions.

Levenshtein Distance of Name | $\mathbb{Z}$ |

Trigram Distance of Name | $[0,1]$ |

Geo Spatial Distance of Address | $\mathbb{R}$ |

Name Shares Middle Initial Character | $0/1$ |

What is the optimal threshold?

$P(true pair) = g(t) = \frac{1}{1 + e^{-t}}$

$t = \beta_0 + \beta_1x_1 + \ldots \beta_nx_n$

Barack Obama vs. Barry Obama

$x_1$ = Jaro Winkler Distance = 0.89

$x_2$ = Trigram Distance = 0.2

$P(true pair) = f(x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2)}}$

Logistic regression model solves for $\beta$ coefficients to complete equation.

Can derive features from joined tables

Same Company/Organization/Relation | $0/1$ |

Date/Time of Overlapping Activity | $\mathbb{R}$ |

Geo Spatial Distance of Record or Relation | $\mathbb{R}$ |

Same Phone Number | $0/1$ |

- Regression Models
- Tree Models
- Conditional Random Field
- ARIMA
- Latent Dirichlet Allocation
- K-means clustering
- Descriptive Statistics
- Inference
- SVM
- Cardinality Estimators
- Utility Functions and Types

```
pgxnclient install madlib
```

or from source on github:
```
Configure; make
madpack -p postgres -c user@0.0.0.0/db install
```

Create training tables, apply SQL function, use model tables

Manually insert records into table.

Or leverage easily linkable records, e.g. 20% of company_a and company_b records have an IRS TID you can link on, but still have representative typo and alternative spelling examples.

Provides coefficients for $t = \beta_0 + \beta_1x_1 + \ldots \beta_nx_n$

Returns a probability tuple pair is a true pair.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = 0.995$

$Precision = \frac{TP}{TP + FP} = 0.985$

$Recall = \frac{TP}{TP + FN} = 0.992$

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = 0.99$

$Precision = \frac{TP}{TP + FP} = 0.914$

$Recall = \frac{TP}{TP + FN} = 0.662$

Model only had name string as a parameter. Poor recall results.

- Does not require application changes

- Preserves Original Information

- 'Probabilistic Foreign Key' Sounds fun

- Reasonably Painless Access Pattern

- Duplication and Linking Case

- Manual Review

- M*N ~ (N^2) pair comparisons to find best match

- Compare only against best candidates

- Indexable value that can exclude candidates

- trigram distance

- Geographic restriction

- Kmeans cluster id

- LSH bucket

- Probably OK to have fairly loose comparison set

Identify Problem as multiple records for same entity without a key

This can be formulated as a binary classification of paired records as members of a set of true pairs or false pairs in A X B

Solve binary classification in Postgresql with Madlib logistic regression module

String metric features & features from relations

Create probabilistic foreign keys

Manually review close calls

- We should strive for correct data with referential integrity

- Be comfortable in a world of imperfect databases