src.pes_match package¶
Submodules¶
src.pes_match.cleaning module¶
- src.pes_match.cleaning.alpha_name(df, input_col, output_col)¶
Sorts the characters of each string in a column alphabetically, after removing whitespace/special characters and setting strings to upper case.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
input_col (str) – Name of column to be sorted alphabetically
output_col (str) – Name of column to be output
- Returns
Pandas dataframe with output_col appended
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import re
>>> df = pd.DataFrame({'forename': ['Charlie']})
>>> df['forename'].head(n=1)
0    Charlie
Name: forename, dtype: object
>>> df = alpha_name(df, input_col='forename', output_col='alphaname')
>>> df['alphaname'].head(n=1)
0    ACEHILR
Name: alphaname, dtype: object
- src.pes_match.cleaning.change_types(df, input_cols, types)¶
Casts specific dataframe columns to a specified type. The function can either take a single column or a list of columns.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
input_cols (str or list of str) – The subset of columns that are having their datatypes converted.
types – The datatype that the column values will be converted into.
- Returns
Returns the complete dataframe with changes to the datatypes on specified columns.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'number': [1]})
>>> df.dtypes[0]
dtype('int64')
>>> df = change_types(df, input_cols='number', types='str')
>>> df.dtypes[0]
dtype('O')
- src.pes_match.cleaning.clean_name(df, name_column, suffix='')¶
Derives a cleaned version of a column contained in a pandas dataframe.
- Parameters
df (pandas.DataFrame) – Input dataframe with name_column present
name_column (str) – Name of column containing name as string type
suffix (str, default = "") – Optional suffix to append to name component column names
- Returns
clean_name returns the dataframe with a cleaned version of name_column.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> import re
>>> df = pd.DataFrame({'Name': ['Charlie!']})
>>> df.head(n=1)
       Name
0  Charlie!
>>> df = clean_name(df, name_column='Name', suffix='_cen')
>>> df.head(n=1)
       Name Name_clean_cen
0  Charlie!        CHARLIE
- src.pes_match.cleaning.concat(df, columns, output_col, sep=' ')¶
Concatenates strings from specified columns into a single string and stores the new string value in a new column.
- Parameters
df (pandas.DataFrame) – Dataframe to which the function is applied.
columns (list of strings, default = []) – The list of columns being concatenated into one string
output_col (str) – The name, in string format, of the output column for the new concatenated strings to be stored in.
sep (str, default = ' ') – This is the value used to separate the strings in the different columns when combining them into a single string.
- Returns
Returns dataframe with ‘output_col’ column containing the concatenated string.
- Return type
pandas.DataFrame
See also
replace_vals
Uses regular expressions to replace values within dataframe columns.
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'Forename': ['John'],
...                    'Surname': ['Smith']})
>>> df.head(n=1)
  Forename Surname
0     John   Smith
>>> df = concat(df, columns=['Forename', 'Surname'], output_col='Fullname', sep=' ')
>>> df.head(n=1)
  Forename Surname    Fullname
0     John   Smith  John Smith
- src.pes_match.cleaning.derive_list(df, partition_var, list_var, output_col)¶
Aggregate function: collects a list of values from one column after partitioning by another column, storing the result in a new column.
- Parameters
df (pandas.DataFrame) – Input dataframe with partition_var and list_var present
partition_var (str) – Name of column to partition on e.g. household ID
list_var (str) – Variable to collect list of values over chosen partition e.g. names
output_col (str) – Name of list column to be output
- Returns
derive_list returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Forename': ['John', 'Steve', 'Charlie', 'James'],
...                    'Household': [1, 1, 2, 2]})
>>> df.head(n=4)
  Forename  Household
0     John          1
1    Steve          1
2  Charlie          2
3    James          2
>>> df = derive_list(df, partition_var='Household', list_var='Forename',
...                  output_col='Forename_List')
>>> df.head(n=4)
  Forename  Household     Forename_List
0     John          1     [John, Steve]
1    Steve          1     [John, Steve]
2  Charlie          2  [Charlie, James]
3    James          2  [Charlie, James]
- src.pes_match.cleaning.derive_names(df, clean_fullname_column, suffix='')¶
Derives first name, middle name(s) and last name from a pandas dataframe column containing a cleaned fullname column.
- Parameters
df (pandas.DataFrame) – Input dataframe with clean_fullname_column present
clean_fullname_column (str) – Name of column containing fullname as string type
suffix (str, default = "") – Optional suffix to append to name component column names
- Returns
derive_names returns the dataframe with additional columns for first name, middle name(s) and last name.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Clean_Name': ['John Paul William Smith']})
>>> df.head(1)
                Clean_Name
0  John Paul William Smith
>>> df = derive_names(df, clean_fullname_column='Clean_Name', suffix="")
>>> df.head(n=1)
                Clean_Name forename   middle_name last_name
0  John Paul William Smith     John  Paul William     Smith
- src.pes_match.cleaning.n_gram(df, input_col, output_col, missing_value, n)¶
Generates the upper case n-gram sequence for all strings in a column.
- Parameters
df (pandas.DataFrame) – Input dataframe with input_col present
input_col (str) – name of column to apply n_gram to
output_col (str) – name of column to be output
missing_value – Value used to represent missingness in input_col; the same value will be used for missingness in output_col
n (int) – Chosen n-gram
- Returns
n_gram returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'Forename': ['Jonathon', np.nan]})
>>> df.head(n=2)
   Forename
0  Jonathon
1       NaN
>>> df = n_gram(df, input_col='Forename', output_col='First_Two', missing_value=np.nan, n=2)
>>> df = n_gram(df, input_col='Forename', output_col='Last_Two', missing_value=np.nan, n=-2)
>>> df.head(n=2)
   Forename First_Two Last_Two
0  Jonathon        JO       ON
1       NaN       NaN      NaN
- src.pes_match.cleaning.pad_column(df, input_col, output_col, length)¶
Pads a column (int or string type) with leading zeros. Values in input_col that are longer than the chosen pad length will not be padded and will remain unchanged.
- Parameters
df (pandas.DataFrame) – Input dataframe with input_col present
input_col (str or int) – name of column to apply pad_column to
output_col (str) – name of column to be output
length (int) – Chosen length of strings in column AFTER padding with zeros
- Returns
pad_column returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Age': [2, 5, 100]})
>>> df.head(n=3)
   Age
0    2
1    5
2  100
>>> df = pad_column(df, 'Age', 'Age_Padded', 3)
>>> df.head(n=3)
   Age Age_Padded
0    2        002
1    5        005
2  100        100
- src.pes_match.cleaning.replace_vals(df, subset, dic)¶
Uses regular expressions to replace values within dataframe columns.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
dic (dict) – Dictionary of replacements. Each key is the replacement string; each value is the pattern to be replaced within the subset of columns, given either as a regex statement in the form of a string or as a numpy NaN value.
subset (str or list of str) – The subset is the list of columns in the dataframe on which replace_vals is performing its actions.
- Returns
replace_vals returns the dataframe with the column values changed appropriately.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Sex': ['M', 'F']})
>>> df.head(n=2)
  Sex
0   M
1   F
>>> df = replace_vals(df, dic={'MALE': 'M', 'FEMALE': 'F'}, subset='Sex')
>>> df.head(n=2)
      Sex
0    MALE
1  FEMALE
- src.pes_match.cleaning.select(df, columns)¶
Retains only specified list of columns.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
columns (str or list of str, default = None) – Column name or list of column names to retain. If a single column name is entered as a string, only that column is selected.
- Returns
Dataframe with only selected columns included
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Sex': ['M', 'F'],
...                    'Age': [10, 29]})
>>> df.head(n=2)
  Sex  Age
0   M   10
1   F   29
>>> df = select(df, columns='Sex')
>>> df.head(n=2)
  Sex
0   M
1   F
- src.pes_match.cleaning.soundex(df, input_col, output_col, missing_value)¶
Generates the soundex phonetic encoding for all strings in a column.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
input_col (str) – name of column to apply soundex to
output_col (str) – name of column to be output
missing_value – Value used to represent missing values in input_col; the same value will be used for missing values in output_col
- Returns
soundex returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import jellyfish
>>> df = pd.DataFrame({'Forename': ['Charlie', 'Rachel', '-9']})
>>> df.head(n=3)
  Forename
0  Charlie
1   Rachel
2       -9
>>> df = soundex(df, input_col='Forename', output_col='sdx_Forename', missing_value='-9')
>>> df.head(n=3)
  Forename sdx_Forename
0  Charlie         C640
1   Rachel         R240
2       -9           -9
src.pes_match.cluster module¶
- src.pes_match.cluster.cluster_number(df, id_column, suffix_1, suffix_2)¶
Takes dataframe of matches with two id columns and assigns a cluster number to the dataframe based on the unique id pairings.
- Parameters
df (pandas.DataFrame) – DataFrame to add new column ‘Cluster_Number’ to.
id_column (str) – ID column that should be common to both DataFrames (excluding suffixes).
suffix_1 (str) – Suffix used for id_column in first DataFrame.
suffix_2 (str) – Suffix used for id_column in second DataFrame.
- Raises
TypeError – if variables id_column, suffix_1 or suffix_2 are not strings.
- Returns
df – dataframe with Cluster_Number added
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> import networkx as nx
>>> df = pd.DataFrame({"id_1": ["C1", "C2", "C3", "C4", "C5", "C6"],
...                    "id_2": ["P1", "P2", "P2", "P3", "P1", "P6"]})
>>> df.head(n=6)
  id_1 id_2
0   C1   P1
1   C2   P2
2   C3   P2
3   C4   P3
4   C5   P1
5   C6   P6
>>> df = cluster_number(df=df, id_column='id', suffix_1="_1", suffix_2="_2")
>>> df.head(n=6)
  id_1 id_2  Cluster_Number
0   C1   P1               1
1   C2   P2               2
2   C3   P2               2
3   C4   P3               3
4   C5   P1               1
5   C6   P6               4
src.pes_match.crow module¶
- src.pes_match.crow.collect_conflicts(df, id_1, id_2)¶
Collects non-unique matches from a set of matches, removing all unique cases.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
id_1 (str) – ID column in first DataFrame (including suffix).
id_2 (str) – ID column in second DataFrame (including suffix).
- Returns
Pandas dataframe with unique matches removed
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id_1": ["A1", "A2", "A3", "A4", "A5", "A6"],
...                    "id_2": ["B1", "B2", "B3", "B3", "B4", "B5"]})
>>> df.head(n=6)
  id_1 id_2
0   A1   B1
1   A2   B2
2   A3   B3
3   A4   B3
4   A5   B4
5   A6   B5
>>> collect_conflicts(df, id_1='id_1', id_2='id_2')
  id_1 id_2  CLERICAL
0   A3   B3         1
1   A4   B3         1
- src.pes_match.crow.collect_uniques(df, id_1, id_2, match_type)¶
Collects unique matches from a set of matches, removing all non-unique cases.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
id_1 (str) – ID column in first DataFrame (including suffix).
id_2 (str) – ID column in second DataFrame (including suffix).
match_type (str) – Indicator that is added to specify which stage the matches were made on.
- Returns
Pandas dataframe with non-unique matches removed and indicator column appended
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id_1": ["A1", "A2", "A3", "A4", "A5", "A6"],
...                    "id_2": ["B1", "B2", "B3", "B3", "B4", "B5"]})
>>> df.head(n=6)
  id_1 id_2
0   A1   B1
1   A2   B2
2   A3   B3
3   A4   B3
4   A5   B4
5   A6   B5
>>> collect_uniques(df, id_1='id_1', id_2='id_2', match_type='Stage_X_Matchkeys')
  id_1 id_2  CLERICAL         Match_Type
0   A1   B1         0  Stage_X_Matchkeys
1   A2   B2         0  Stage_X_Matchkeys
2   A5   B4         0  Stage_X_Matchkeys
3   A6   B5         0  Stage_X_Matchkeys
- src.pes_match.crow.combine_crow_results(stage)¶
Takes all matches made in CROW from a chosen stage and combines them into a single pandas DataFrame. All matching in CROW for the chosen stage must be completed before running this function.
- Parameters
stage (str) – Chosen stage of matching e.g., ‘Stage_1’. The function will look inside CLERICAL_PATH and combine all clerically matched CSV files that contain this string. File names for completed matches must also end in ‘_DONE.csv’, otherwise they will not be included in the final set of combined clerical matches.
- Returns
Pandas dataframe with all clerically matched records from a selected stage combined.
- Return type
pandas.DataFrame
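Since no worked example is given, the combination step can be sketched in plain pandas. The function name `combine_crow_results_sketch` and the explicit `clerical_path` argument are illustrative assumptions (the package reads its path from a CLERICAL_PATH setting); only the filename rules come from the description above.

```python
import glob
import os

import pandas as pd


def combine_crow_results_sketch(stage, clerical_path):
    """Combine completed CROW clerical-matching CSVs for one stage.

    A file counts as completed if its name contains the stage string
    and ends in '_DONE.csv'.
    """
    paths = sorted(
        p for p in glob.glob(os.path.join(clerical_path, "*.csv"))
        if stage in os.path.basename(p) and p.endswith("_DONE.csv")
    )
    # Stack every completed file into a single DataFrame.
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```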
- src.pes_match.crow.crow_output_updater(output_df, id_column, source_column, suffix_1, suffix_2, match_type)¶
Returns the outputs of CROW in a pairwise linked format. Only matched pairs are retained.
- Parameters
output_df (pandas.DataFrame) – The dataframe containing CROW matched output
id_column (str) – Name of record_id column in CROW matched output
source_column (str) – Name of column in CROW matched output identifying which data source the record is from
suffix_1 (str) – Suffix used for the first data source.
suffix_2 (str) – Suffix used for the second data source.
match_type (str) – indicator to say which stage matches were made at
- Returns
df – CROW matches in pairwise format
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"puid": ["A1", "B1", "A2", "B2", "B3", "A4", "B4"],
...                    "Cluster_Number": [1, 1, 2, 2, 2, 3, 3],
...                    "Match": ["A1,B1", "A1,B1", "A2,B2,B3", "A2,B2,B3",
...                              "A2,B2,B3", "No match in cluster", "No match in cluster"],
...                    "Source_Dataset": ["_cen", "_pes", "_cen", "_pes",
...                                       "_pes", "_cen", "_pes"]})
>>> df
  puid  Cluster_Number                Match Source_Dataset
0   A1               1                A1,B1           _cen
1   B1               1                A1,B1           _pes
2   A2               2             A2,B2,B3           _cen
3   B2               2             A2,B2,B3           _pes
4   B3               2             A2,B2,B3           _pes
5   A4               3  No match in cluster           _cen
6   B4               3  No match in cluster           _pes
>>> df_updated = crow_output_updater(output_df=df, id_column='puid',
...                                  source_column='Source_Dataset',
...                                  suffix_1='_cen', suffix_2='_pes',
...                                  match_type='Stage_1_Conflicts')
>>> df_updated
  puid_cen puid_pes         Match_Type  CLERICAL  MK
0       A1       B1  Stage_1_Conflicts         1   0
1       A2       B2  Stage_1_Conflicts         1   0
2       A2       B3  Stage_1_Conflicts         1   0
- src.pes_match.crow.remove_large_clusters(df, n)¶
Filters out clusters containing n or more unique records. This can be required if clusters are too large for the CROW system.
- Parameters
df (pandas.DataFrame) – DataFrame containing all clusters ready for CROW.
n (int) – Minimum size of clusters that will be removed.
- Returns
Pandas dataframe with clusters >= size n removed.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({"puid": ["A1", "B1", "A2", "B2", "B3", "A4", "B4"],
...                    "Cluster_Number": [1, 1, 2, 2, 2, 3, 3],
...                    "Source_Dataset": ["_cen", "_pes", "_cen", "_pes",
...                                       "_pes", "_cen", "_pes"]})
>>> df
  puid  Cluster_Number Source_Dataset
0   A1               1           _cen
1   B1               1           _pes
2   A2               2           _cen
3   B2               2           _pes
4   B3               2           _pes
5   A4               3           _cen
6   B4               3           _pes
>>> df = remove_large_clusters(df, n=3)
>>> df
  puid  Cluster_Number Source_Dataset
0   A1               1           _cen
1   B1               1           _pes
2   A4               3           _cen
3   B4               3           _pes
- src.pes_match.crow.save_for_crow(df, id_column, suffix_1, suffix_2, file_name, no_of_files=1)¶
Takes candidate matches, updates their format ready for CROW and then saves them. Matches can be split into multiple files if desired. Large clusters (size 12 or more) that are too big for CROW are removed.
- Parameters
df (pandas.DataFrame) – DataFrame containing all candidate pairs ready for CROW.
id_column (str) – Name of record_id column in CROW candidates.
suffix_1 (str) – Suffix used for the first data source.
suffix_2 (str) – Suffix used for the second data source.
file_name (str) – Name of file that will be saved. If multiple files are saved, each file will have a different suffix e.g. “_1”, “_2”, etc.
no_of_files (int, default = 1) – Number of csv files that the output will be split into.
See also
cluster_number
remove_large_clusters
split_save
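The pipeline this function describes can be sketched by composing the three helpers listed under "See also": cluster numbering via connected components, removal of clusters of size 12+, and a cluster-aware split. The name `save_for_crow_sketch` and the exact output layout are assumptions, not the package's implementation.

```python
import networkx as nx
import numpy as np
import pandas as pd


def save_for_crow_sketch(df, id_column, suffix_1, suffix_2, file_name, no_of_files=1):
    id_1, id_2 = id_column + suffix_1, id_column + suffix_2
    # 1. Assign cluster numbers: connected components of the match graph.
    g = nx.Graph()
    g.add_edges_from(zip(df[id_1], df[id_2]))
    lookup = {rec: i + 1
              for i, comp in enumerate(nx.connected_components(g))
              for rec in comp}
    df = df.assign(Cluster_Number=df[id_1].map(lookup))
    # 2. Drop clusters with 12 or more unique records (too big for CROW).
    sizes = df.groupby("Cluster_Number")[[id_1, id_2]].nunique().sum(axis=1)
    df = df[df["Cluster_Number"].map(sizes) < 12]
    # 3. Split on whole clusters and write one CSV per file.
    chunks = np.array_split(df["Cluster_Number"].unique(), no_of_files)
    for i, chunk in enumerate(chunks, start=1):
        df[df["Cluster_Number"].isin(chunk)].to_csv(
            f"{file_name}_{i}.csv", index=False)
```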
- src.pes_match.crow.split_save(df, file_name, no_of_files)¶
Splits clusters (that are already in a format ready for CROW) into multiple smaller files.
- Parameters
df (pandas.DataFrame) – DataFrame containing all clusters ready for CROW.
file_name (str) – Name of files that will be saved. Each file will have a different suffix e.g. “_1”, “_2”, etc.
no_of_files (int) – Number of csv files that the output will be split into.
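A minimal sketch of the splitting logic, assuming a `Cluster_Number` column (as produced by cluster_number) and splitting on whole clusters so that no cluster is divided across files. `split_save_sketch` is illustrative, not the package's implementation.

```python
import numpy as np
import pandas as pd


def split_save_sketch(df, file_name, no_of_files):
    # Split the cluster IDs (not the rows) so each cluster stays whole.
    clusters = df["Cluster_Number"].unique()
    for i, chunk in enumerate(np.array_split(clusters, no_of_files), start=1):
        part = df[df["Cluster_Number"].isin(chunk)]
        part.to_csv(f"{file_name}_{i}.csv", index=False)
```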
src.pes_match.matching module¶
- src.pes_match.matching.age_diff_filter(df, age_1, age_2)¶
Filters a set of matched records to keep only records within certain age tolerances. Age tolerances increase slightly as age increases.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
age_1 (str) – Name of age column (integer type) from first dataset
age_2 (str) – Name of age column (integer type) from second dataset
- Returns
Filtered pandas dataframe which only includes records that meet the age tolerance criteria.
- Return type
pandas.DataFrame
See also
age_tolerance
Function that returns True or False depending on whether two integer ages are within certain tolerances.
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'age_1': [5, 15, 25, 50, 99],
...                    'age_2': [5, 20, 22, 52, 90]})
>>> df.head(n=5)
   age_1  age_2
0      5      5
1     15     20
2     25     22
3     50     52
4     99     90
>>> df = age_diff_filter(df, 'age_1', 'age_2')
>>> df.head(n=5)
   age_1  age_2
0      5      5
1     25     22
2     50     52
- src.pes_match.matching.age_tolerance(val1, val2)¶
Function that returns True or False depending on whether two integer ages are within certain tolerances. Used in the age_diff_filter filtering function.
- Parameters
val1 (int) – First age value
val2 (int) – Second age value
- Returns
The return value is True for cases that meet the age tolerance rules, False otherwise.
- Return type
bool
See also
age_diff_filter
Filters a set of matched records to keep only records within certain age tolerances. Age tolerances increase slightly as age increases.
Example
>>> age_tolerance(5, 5)
True
>>> age_tolerance(5, 8)
False
>>> age_tolerance(45, 49)
True
>>> age_tolerance(45, 50)
False
- src.pes_match.matching.combine(matchkeys, person_id, suffix_1, suffix_2, keep)¶
Takes results from a set of matchkeys and combines into a single deduplicated dataframe. If duplicate matches are made across matchkeys, the version with the lowest matchkey number is retained.
- Parameters
matchkeys (list of pandas.DataFrame) – List of dataframes containing matches made from each matchkey
person_id (str) – Name of person id used in both datasets (without suffix)
suffix_1 (str) – Suffix used for columns in the first dataframe
suffix_2 (str) – Suffix used for columns in the second dataframe
keep (list of str) – List of variables to retain. Suffixes not required. New matchkey column “MK” will also be retained
See also
run_single_matchkey
Function to collect matches from a chosen matchkey
- Returns
df – Combined dataset containing all matches made across matchkeys.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> mk1 = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5],
...                     'puid_2': [21, 22, 23, 24, 25],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE', 'SAM', 'PAUL'],
...                     'name_2': ['CHARLES', 'JON', 'STEPHEN', 'SAMM', 'PAUL']})
>>> mk1.head(n=5)
   puid_1  puid_2   name_1   name_2
0       1      21  CHARLIE  CHARLES
1       2      22     JOHN      JON
2       3      23    STEVE  STEPHEN
3       4      24      SAM     SAMM
4       5      25     PAUL     PAUL
>>> mk2 = pd.DataFrame({'puid_1': [1, 2, 3, 6, 7],
...                     'puid_2': [21, 22, 30, 31, 32],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE', 'MARK', 'DAVE'],
...                     'name_2': ['CHARLES', 'JON', 'STEUE', 'MARL', 'DAVE']})
>>> mk2.head(n=5)
   puid_1  puid_2   name_1   name_2
0       1      21  CHARLIE  CHARLES
1       2      22     JOHN      JON
2       3      30    STEVE    STEUE
3       6      31     MARK     MARL
4       7      32     DAVE     DAVE
>>> matches = combine(matchkeys=[mk1, mk2], suffix_1="_1", suffix_2="_2",
...                   person_id="puid", keep=['puid', 'name'])
>>> matches.head(n=8)
   puid_1   name_1  puid_2   name_2  MK
0       1  CHARLIE      21  CHARLES   1
1       2     JOHN      22      JON   1
2       3    STEVE      23  STEPHEN   1
3       4      SAM      24     SAMM   1
4       5     PAUL      25     PAUL   1
5       3    STEVE      30    STEUE   2
6       6     MARK      31     MARL   2
7       7     DAVE      32     DAVE   2
- src.pes_match.matching.generate_matchkey(suffix_1, suffix_2, hh_id, level, variables, swap_variables=None)¶
Function to generate a single matchkey for matching two dataframes together. ‘swap_variables’ enables different variables to be used across dataframes e.g. require agreement between forename (on dataframe 1) and surname (on dataframe 2).
- Parameters
suffix_1 (str) – Suffix used for columns in the first dataframe to match
suffix_2 (str) – Suffix used for columns in the second dataframe to match
hh_id (str) – Name of household ID column in dataframes to match (without suffixes). Required when level='associative'.
level (str) – Level of geography to include in the matchkey e.g. household, enumeration area etc. If level = ‘associative’ then an associative matchkey is applied instead.
variables (list of str) – List of variables to use in matchkey rule (excluding level of geography)
swap_variables (list of tuple, optional) – Use if you want to match a variable from one dataframe to a different variable on the other dataframe. For example, to match forename on dataframe 1 to surname on dataframe 2, swap_variables = [(‘forename_1’, ‘surname_2’)]
- Returns
df1_link_vars (list) – Variables to match on, suffixed with suffix_1
df2_link_vars (list) – Variables to match on, suffixed with suffix_2
Example
>>> mk = generate_matchkey(
...     suffix_1="_cen",
...     suffix_2="_pes",
...     hh_id="hid",
...     level="Eaid",
...     variables=["forename", "dob", "sex"],
...     swap_variables=[("middlename_cen", "surname_pes")])
>>> mk[0]
['forename_cen', 'dob_cen', 'sex_cen', 'Eaid_cen', 'middlename_cen']
>>> mk[1]
['forename_pes', 'dob_pes', 'sex_pes', 'Eaid_pes', 'surname_pes']
- src.pes_match.matching.get_assoc_candidates(df1, df2, suffix_1, suffix_2, matches, person_id, hh_id)¶
Associative Matching Function. Takes all person matches made between two datasets and collects their unique household pairs. Unmatched person records from these household pairs are then grouped together (associatively). This is done by merging on the household ID pair to each unmatched record.
- Parameters
df1 (pandas.DataFrame) – The first dataframe being matched - must contain a person_id and hh_id
df2 (pandas.DataFrame) – The second dataframe being matched - must contain a person_id and hh_id
suffix_1 (str) – Suffix used for columns in the first dataframe
suffix_2 (str) – Suffix used for columns in the second dataframe
matches (pandas.DataFrame) – All unique person matches that will be used to make additional associative matches. This DataFrame should contain two person ID columns only
person_id (str) – Name of person ID column (without suffixes)
hh_id (str) – Name of household ID column (without suffixes)
- Returns
df1 (pandas.DataFrame) – Unmatched person records from df1 with additional household ID column from df2
df2 (pandas.DataFrame) – Unmatched person records from df2 with additional household ID column from df1
See also
get_residuals
Function for collecting person records not yet matched
Example
>>> import pandas as pd
>>> df1 = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5],
...                     'hhid_1': [1, 1, 1, 1, 1],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE',
...                                'SAM', 'PAUL']})
>>> df2 = pd.DataFrame({'puid_2': [21, 22, 23, 24, 25],
...                     'hhid_2': [2, 2, 2, 2, 2],
...                     'name_2': ['CHARLES', 'JON',
...                                'STEPHEN', 'SAMANTHA', 'PAUL']})
>>> matches = pd.DataFrame({'puid_1': [1, 5],
...                         'puid_2': [21, 25]})
>>> df1, df2 = get_assoc_candidates(df1, df2, suffix_1='_1', suffix_2='_2',
...                                 matches=matches, person_id='puid',
...                                 hh_id='hhid')
>>> df1.head(n=5)
   puid_1  hhid_1 name_1  hhid_2
0       2       1   JOHN       2
1       3       1  STEVE       2
2       4       1    SAM       2
>>> df2.head(n=5)
   puid_2  hhid_2    name_2  hhid_1
0      22       2       JON       1
1      23       2   STEPHEN       1
2      24       2  SAMANTHA       1
An additional household ID column has been added to each DataFrame of residuals, associating households 1 and 2 via the existing matches between persons 1 and 21, and persons 5 and 25.
- src.pes_match.matching.get_residuals(all_records, matched_records, id_column)¶
Removes already-matched records from a set of person records, leaving only the unmatched residuals.
- Parameters
all_records (pandas.DataFrame) – The dataframe containing all person records.
matched_records (pandas.DataFrame) – The dataframe containing all matched person records
id_column (str) – Name of person ID column (including suffixes)
- Returns
Matched records removed, leaving only the residuals.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> all_records = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5]})
>>> all_records.head(n=5)
   puid_1
0       1
1       2
2       3
3       4
4       5
>>> matched_records = pd.DataFrame({'puid_1': [1, 2, 3],
...                                 'puid_2': [21, 22, 23]})
>>> matched_records.head(n=5)
   puid_1  puid_2
0       1      21
1       2      22
2       3      23
>>> residuals = get_residuals(all_records=all_records,
...                           matched_records=matched_records,
...                           id_column='puid_1')
>>> residuals.head(n=5)
   puid_1
0       4
1       5
- src.pes_match.matching.mult_match(df, hh_id_1, hh_id_2)¶
Filters a set of matched records by retaining only those where 2 or more matches have been made across a pair of households.
- Parameters
df (pandas.DataFrame) – dataframe to filter containing all person matches / candidates.
hh_id_1 (str) – Name of household ID column in first dataset (including suffix)
hh_id_2 (str) – Name of household ID column in second dataset (including suffix)
- Returns
Retains cases where multiple matches have been made between two households. Other cases are discarded.
- Return type
pandas.DataFrame
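The filtering rule can be sketched with a grouped count over household pairs. `mult_match_sketch` is an illustrative equivalent written from the description above, not the package's implementation.

```python
import pandas as pd


def mult_match_sketch(df, hh_id_1, hh_id_2):
    # Count candidate matches per household pair; keep rows belonging
    # to pairs with 2 or more matches.
    pair_size = df.groupby([hh_id_1, hh_id_2])[hh_id_1].transform("size")
    return df[pair_size >= 2]
```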
- src.pes_match.matching.run_single_matchkey(df1, df2, suffix_1, suffix_2, hh_id, level, variables, swap_variables=None, lev_variables=None, age_threshold=None)¶
Function to collect matches from a chosen matchkey. Partial agreement can be included using std_lev_filter, and age filters can be applied using age_threshold. Use swap_variables to match across different variables e.g. forename = surname.
- Parameters
df1 (pandas.DataFrame) – The first dataframe being matched
df2 (pandas.DataFrame) – The second dataframe being matched
suffix_1 (str) – Suffix used for columns in the first dataframe
suffix_2 (str) – Suffix used for columns in the second dataframe
hh_id (str) – Name of household ID column in df1 and df2 (without suffixes). Required when level='associative'.
level (str) – Level of geography to include in the matchkey e.g. household, EA etc. If level = ‘associative’ then an associative matchkey is applied instead.
variables (list of str) – List of variables to use in matchkey rule (excluding level of geography)
swap_variables (list of tuple, optional) – Use if you want to match a variable from one dataset to a different variable on the other dataset. For example, to match forename on df1 to surname on df2, swap_variables = [(‘forename_1’, ‘surname_2’)]
lev_variables (list of tuple, optional) – Use if you want to apply the std_lev_filter function within the matchkey. For example, to apply to forenames (threshold = 0.80): lev_variables = [(‘forename_1’, ‘forename_2’, 0.80)]
age_threshold (bool, optional) – Use if you want to apply the age_diff_filter function within the matchkey. To apply, simply set age_threshold = True
- Returns
matches – All matches made from chosen matchkey (non-unique matches included)
- Return type
pandas.DataFrame
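The exact-agreement core of a matchkey (ignoring the associative, Levenshtein and age-filter options) amounts to an inner join on the suffixed matchkey variables plus the geography level. `run_matchkey_sketch` and its column names are illustrative assumptions, not the package's implementation.

```python
import pandas as pd


def run_matchkey_sketch(df1, df2, suffix_1, suffix_2, level, variables):
    # Build the suffixed join keys for each dataframe.
    left = [v + suffix_1 for v in variables] + [level + suffix_1]
    right = [v + suffix_2 for v in variables] + [level + suffix_2]
    # Inner join keeps only candidate pairs agreeing on every key.
    return pd.merge(df1, df2, how="inner", left_on=left, right_on=right)
```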
- src.pes_match.matching.std_lev(string1, string2)¶
Function that compares two strings (usually names) and returns the standardised Levenshtein edit distance score, between 0 and 1. Used in the std_lev_filter filtering function.
- Parameters
string1 (str or None) – First string for comparison
string2 (str or None) – Second string for comparison
- Returns
Score between 0 and 1. The closer to 1, the stronger the similarity between the two strings (1 = full agreement / exact match).
- Return type
float
See also
std_lev_filter
Filters a set of matched records to keep only records where names have a similarity greater than a chosen threshold.
Example
>>> std_lev('CHARLIE', 'CHARLIE')
1.0
>>> std_lev('CHARLIE', 'CHARLES')
0.7142857142857143
>>> std_lev('CHARLIE', None)
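The score is one minus the Levenshtein edit distance divided by the length of the longer string. A pure-Python sketch follows (the package may instead use a library such as jellyfish); it is named `std_lev_sketch` to mark it as illustrative.

```python
def std_lev_sketch(string1, string2):
    # Missing input gives a missing score.
    if string1 is None or string2 is None:
        return None
    # Classic dynamic-programming Levenshtein edit distance.
    m, n = len(string1), len(string2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    # Standardise to [0, 1]: 1 means exact agreement.
    return 1 - prev[n] / max(m, n)
```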
- src.pes_match.matching.std_lev_filter(df, column1, column2, threshold)¶
Filters a set of matched records to keep only records where names have a similarity greater than a chosen threshold.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
column1 (str) – Name column (string type) from first dataset
column2 (str) – Name column (string type) from second dataset
threshold (float) – Record pairs with a standardised Levenshtein edit distance below this threshold will be discarded
- Returns
Filtered pandas dataframe which only includes records that meet the edit distance filter criteria.
- Return type
pandas.DataFrame
See also
std_lev
Function that compares two strings (usually names) and returns the standardised Levenshtein edit distance score, between 0 and 1.
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'name_1': ['CHARLES', None, 'C', 'CHRLEI', 'CH4RL1E'],
...                    'name_2': ['CHARLIE', 'CHARLIE', 'CHARLIE', 'CHARLIE',
...                               'CHARLIE']})
>>> df.head(n=5)
    name_1   name_2
0  CHARLES  CHARLIE
1     None  CHARLIE
2        C  CHARLIE
3   CHRLEI  CHARLIE
4  CH4RL1E  CHARLIE
>>> df = std_lev_filter(df, column1='name_1', column2='name_2', threshold=0.60)
>>> df.head(n=5)
    name_1   name_2
0  CHARLES  CHARLIE
1  CH4RL1E  CHARLIE