src.pes_match package¶
Submodules¶
src.pes_match.cleaning module¶
- src.pes_match.cleaning.alpha_name(df, input_col, output_col)¶
Sorts the characters of each string in a column alphabetically, after removing whitespace/special characters and setting strings to upper case.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
input_col (str) – Name of column to be sorted alphabetically
output_col (str) – Name of column to be output
- Returns
Pandas dataframe with output_col appended
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import re
>>> df = pd.DataFrame({'forename': ['Charlie']})
>>> df['forename'].head(n=1)
0    Charlie
Name: forename, dtype: object
>>> df = alpha_name(df, input_col='forename', output_col='alphaname')
>>> df['alphaname'].head(n=1)
0    ACEHILR
Name: alphaname, dtype: object
- src.pes_match.cleaning.change_types(df, input_cols, types)¶
Casts specific dataframe columns to a specified type. The function can either take a single column or a list of columns.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
input_cols (str or list of str) – The subset of columns that are having their datatypes converted.
types – The datatype that the column values will be converted into.
- Returns
Returns the complete dataframe with changes to the datatypes on specified columns.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'number': [1]})
>>> df.dtypes[0]
dtype('int64')
>>> df = change_types(df, input_cols='number', types='str')
>>> df.dtypes[0]
dtype('O')
- src.pes_match.cleaning.clean_name(df, name_column, suffix='')¶
Derives a cleaned version of a column contained in a pandas dataframe.
- Parameters
df (pandas.DataFrame) – Input dataframe with name_column present
name_column (str) – Name of column containing name as string type
suffix (str, default = "") – Optional suffix to append to name component column names
- Returns
clean_name returns the dataframe with a cleaned version of name_column.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> import re
>>> df = pd.DataFrame({'Name': ['Charlie!']})
>>> df.head(n=1)
       Name
0  Charlie!
>>> df = clean_name(df, name_column='Name', suffix='_cen')
>>> df.head(n=1)
       Name Name_clean_cen
0  Charlie!        CHARLIE
- src.pes_match.cleaning.concat(df, columns, output_col, sep=' ')¶
Concatenates strings from specified columns into a single string and stores the new string value in a new column.
- Parameters
df (pandas.DataFrame) – Dataframe to which the function is applied.
columns (list of strings, default = []) – The list of columns being concatenated into one string
output_col (str) – The name, in string format, of the output column for the new concatenated strings to be stored in.
sep (str, default = ' ') – This is the value used to separate the strings in the different columns when combining them into a single string.
- Returns
Returns dataframe with ‘output_col’ column containing the concatenated string.
- Return type
pandas.DataFrame
See also
replace_vals
Uses regular expressions to replace values within dataframe columns.
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'Forename': ['John'],
...                    'Surname': ['Smith']})
>>> df.head(n=1)
  Forename Surname
0     John   Smith
>>> df = concat(df, columns=['Forename', 'Surname'], output_col='Fullname', sep=' ')
>>> df.head(n=1)
  Forename Surname    Fullname
0     John   Smith  John Smith
- src.pes_match.cleaning.derive_list(df, partition_var, list_var, output_col)¶
Aggregate function: collects a list of values from one column after partitioning by another column, storing the result in a new column.
- Parameters
df (pandas.DataFrame) – Input dataframe with partition_var and list_var present
partition_var (str) – Name of column to partition on e.g. household ID
list_var (str) – Variable to collect list of values over chosen partition e.g. names
output_col (str) – Name of list column to be output
- Returns
derive_list returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Forename': ['John', 'Steve', 'Charlie', 'James'],
...                    'Household': [1, 1, 2, 2]})
>>> df.head(n=4)
  Forename  Household
0     John          1
1    Steve          1
2  Charlie          2
3    James          2
>>> df = derive_list(df, partition_var='Household', list_var='Forename',
...                  output_col='Forename_List')
>>> df.head(n=4)
  Forename  Household     Forename_List
0     John          1     [John, Steve]
1    Steve          1     [John, Steve]
2  Charlie          2  [Charlie, James]
3    James          2  [Charlie, James]
- src.pes_match.cleaning.derive_names(df, clean_fullname_column, suffix='')¶
Derives first name, middle name(s) and last name from a pandas dataframe column containing a cleaned fullname column.
- Parameters
df (pandas.DataFrame) – Input dataframe with clean_fullname_column present
clean_fullname_column (str) – Name of column containing fullname as string type
suffix (str, default = "") – Optional suffix to append to name component column names
- Returns
derive_names returns the dataframe with additional columns for first name, middle name(s) and last name.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Clean_Name': ['John Paul William Smith']})
>>> df.head(1)
                Clean_Name
0  John Paul William Smith
>>> df = derive_names(df, clean_fullname_column='Clean_Name', suffix="")
>>> df.head(n=1)
                Clean_Name forename   middle_name last_name
0  John Paul William Smith     John  Paul William     Smith
- src.pes_match.cleaning.n_gram(df, input_col, output_col, missing_value, n)¶
Generates the upper case n-gram sequence for all strings in a column.
- Parameters
df (pandas.DataFrame) – Input dataframe with input_col present
input_col (str) – name of column to apply n_gram to
output_col (str) – name of column to be output
missing_value – Value used to represent missingness in input_col; the same value will be used for missingness in output_col
n (int) – Chosen n-gram
- Returns
n_gram returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'Forename': ['Jonathon', np.nan]})
>>> df.head(n=2)
   Forename
0  Jonathon
1       NaN
>>> df = n_gram(df, input_col='Forename', output_col='First_Two', missing_value=np.nan, n=2)
>>> df = n_gram(df, input_col='Forename', output_col='Last_Two', missing_value=np.nan, n=-2)
>>> df.head(n=2)
   Forename First_Two Last_Two
0  Jonathon        JO       ON
1       NaN       NaN      NaN
- src.pes_match.cleaning.pad_column(df, input_col, output_col, length)¶
Pads a column (int or string type) with leading zeros. Values in input_col that are longer than the chosen pad length will not be padded and will remain unchanged.
- Parameters
df (pandas.DataFrame) – Input dataframe with input_col present
input_col (str or int) – name of column to apply pad_column to
output_col (str) – name of column to be output
length (int) – Chosen length of strings in column AFTER padding with zeros
- Returns
pad_column returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Age': [2, 5, 100]})
>>> df.head(n=3)
   Age
0    2
1    5
2  100
>>> df = pad_column(df, 'Age', 'Age_Padded', 3)
>>> df.head(n=3)
   Age Age_Padded
0    2        002
1    5        005
2  100        100
- src.pes_match.cleaning.replace_vals(df, subset, dic)¶
Uses regular expressions to replace values within dataframe columns.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
dic (dict) – Dictionary of replacements. Each key is the replacement string; each value is the pattern to be replaced within the subset of columns, given either as a regex statement in the form of a string or as a numpy NaN value.
subset (str or list of str) – The subset is the list of columns in the dataframe on which replace_vals is performing its actions.
- Returns
replace_vals returns the dataframe with the column values changed appropriately.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Sex': ['M', 'F']})
>>> df.head(n=2)
  Sex
0   M
1   F
>>> df = replace_vals(df, dic={'MALE': 'M', 'FEMALE': 'F'}, subset='Sex')
>>> df.head(n=2)
      Sex
0    MALE
1  FEMALE
- src.pes_match.cleaning.select(df, columns)¶
Retains only specified list of columns.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
columns (str or list of str, default = None) – Column name or list of column names to retain. If a single column name is entered as a string, only that column is selected.
- Returns
Dataframe with only selected columns included
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'Sex': ['M', 'F'],
...                    'Age': [10, 29]})
>>> df.head(n=2)
  Sex  Age
0   M   10
1   F   29
>>> df = select(df, columns='Sex')
>>> df.head(n=2)
  Sex
0   M
1   F
- src.pes_match.cleaning.soundex(df, input_col, output_col, missing_value)¶
Generates the soundex phonetic encoding for all strings in a column.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
input_col (str) – name of column to apply soundex to
output_col (str) – name of column to be output
missing_value – Value used to represent missing values in input_col; the same value will be used for missing values in output_col
- Returns
soundex returns the dataframe with additional column output_col
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import jellyfish
>>> df = pd.DataFrame({'Forename': ['Charlie', 'Rachel', '-9']})
>>> df.head(n=3)
  Forename
0  Charlie
1   Rachel
2       -9
>>> df = soundex(df, input_col='Forename', output_col='sdx_Forename', missing_value='-9')
>>> df.head(n=3)
  Forename sdx_Forename
0  Charlie         C640
1   Rachel         R240
2       -9           -9
src.pes_match.cluster module¶
- src.pes_match.cluster.cluster_number(df, id_column, suffix_1, suffix_2)¶
Takes dataframe of matches with two id columns and assigns a cluster number to the dataframe based on the unique id pairings.
- Parameters
df (pandas.DataFrame) – DataFrame to add new column ‘Cluster_Number’ to.
id_column (str) – ID column that should be common to both DataFrames (excluding suffixes).
suffix_1 (str) – Suffix used for id_column in first DataFrame.
suffix_2 (str) – Suffix used for id_column in second DataFrame.
- Raises
TypeError – if variables id_column, suffix_1 or suffix_2 are not strings.
- Returns
df – dataframe with Cluster_Number added
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> import networkx as nx
>>> df = pd.DataFrame({"id_1": ["C1", "C2", "C3", "C4", "C5", "C6"],
...                    "id_2": ["P1", "P2", "P2", "P3", "P1", "P6"]})
>>> df.head(n=6)
  id_1 id_2
0   C1   P1
1   C2   P2
2   C3   P2
3   C4   P3
4   C5   P1
5   C6   P6
>>> df = cluster_number(df=df, id_column='id', suffix_1="_1", suffix_2="_2")
>>> df.head(n=6)
  id_1 id_2  Cluster_Number
0   C1   P1               1
1   C2   P2               2
2   C3   P2               2
3   C4   P3               3
4   C5   P1               1
5   C6   P6               4
src.pes_match.crow module¶
- src.pes_match.crow.collect_conflicts(df, id_1, id_2)¶
Collects non-unique matches from a set of matches, removing all unique cases.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
id_1 (str) – ID column in first DataFrame (including suffix).
id_2 (str) – ID column in second DataFrame (including suffix).
- Returns
Pandas dataframe with unique matches removed
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id_1": ["A1", "A2", "A3", "A4", "A5", "A6"],
...                    "id_2": ["B1", "B2", "B3", "B3", "B4", "B5"]})
>>> df.head(n=6)
  id_1 id_2
0   A1   B1
1   A2   B2
2   A3   B3
3   A4   B3
4   A5   B4
5   A6   B5
>>> collect_conflicts(df, id_1='id_1', id_2='id_2')
  id_1 id_2  CLERICAL
0   A3   B3         1
1   A4   B3         1
- src.pes_match.crow.collect_uniques(df, id_1, id_2, match_type)¶
Collects unique matches from a set of matches, removing all non-unique cases.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
id_1 (str) – ID column in first DataFrame (including suffix).
id_2 (str) – ID column in second DataFrame (including suffix).
match_type (str) – Indicator that is added to specify which stage the matches were made on.
- Returns
Pandas dataframe with non-unique matches removed and indicator column appended
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id_1": ["A1", "A2", "A3", "A4", "A5", "A6"],
...                    "id_2": ["B1", "B2", "B3", "B3", "B4", "B5"]})
>>> df.head(n=6)
  id_1 id_2
0   A1   B1
1   A2   B2
2   A3   B3
3   A4   B3
4   A5   B4
5   A6   B5
>>> collect_uniques(df, id_1='id_1', id_2='id_2', match_type='Stage_X_Matchkeys')
  id_1 id_2  CLERICAL         Match_Type
0   A1   B1         0  Stage_X_Matchkeys
1   A2   B2         0  Stage_X_Matchkeys
2   A5   B4         0  Stage_X_Matchkeys
3   A6   B5         0  Stage_X_Matchkeys
- src.pes_match.crow.combine_crow_results(stage)¶
Takes all matches made in CROW from a chosen stage and combines them into a single pandas DataFrame. All matching in CROW for the chosen stage must be completed before running this function.
- Parameters
stage (str) – Chosen stage of matching e.g., ‘Stage_1’. The function will look inside CLERICAL_PATH and combine all clerically matched CSV files that contain this string. File names for completed matches must also end in ‘_DONE.csv’, otherwise they will not be included in the final set of combined clerical matches.
- Returns
Pandas dataframe with all clerically matched records from a selected stage combined.
- Return type
pandas.DataFrame
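Since no worked example is given, the combination step can be sketched in plain pandas. The function name `combine_crow_results_sketch` and the explicit `clerical_path` argument are illustrative assumptions (the package reads its path from a CLERICAL_PATH setting); only the filename rules come from the description above.

```python
import glob
import os

import pandas as pd


def combine_crow_results_sketch(stage, clerical_path):
    """Combine completed CROW clerical-matching CSVs for one stage.

    A file counts as completed if its name contains the stage string
    and ends in '_DONE.csv'.
    """
    paths = sorted(
        p for p in glob.glob(os.path.join(clerical_path, "*.csv"))
        if stage in os.path.basename(p) and p.endswith("_DONE.csv")
    )
    # Stack every completed file into a single DataFrame.
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```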
- src.pes_match.crow.crow_output_updater(output_df, id_column, source_column, suffix_1, suffix_2, match_type)¶
Returns the outputs of CROW in a pairwise linked format. Only matched pairs are retained.
- Parameters
output_df (pandas.DataFrame) – The dataframe containing CROW matched output
id_column (str) – Name of record_id column in CROW matched output
source_column (str) – Name of column in CROW matched output identifying which data source the record is from
suffix_1 (str) – Suffix used for the first data source.
suffix_2 (str) – Suffix used for the second data source.
match_type (str) – indicator to say which stage matches were made at
- Returns
df – CROW matches in pairwise format
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"puid": ["A1", "B1", "A2", "B2", "B3", "A4", "B4"],
...                    "Cluster_Number": [1, 1, 2, 2, 2, 3, 3],
...                    "Match": ["A1,B1", "A1,B1", "A2,B2,B3", "A2,B2,B3",
...                              "A2,B2,B3", "No match in cluster", "No match in cluster"],
...                    "Source_Dataset": ["_cen", "_pes", "_cen", "_pes",
...                                       "_pes", "_cen", "_pes"]})
>>> df
  puid  Cluster_Number                Match Source_Dataset
0   A1               1                A1,B1           _cen
1   B1               1                A1,B1           _pes
2   A2               2             A2,B2,B3           _cen
3   B2               2             A2,B2,B3           _pes
4   B3               2             A2,B2,B3           _pes
5   A4               3  No match in cluster           _cen
6   B4               3  No match in cluster           _pes
>>> df_updated = crow_output_updater(output_df=df, id_column='puid',
...                                  source_column='Source_Dataset',
...                                  suffix_1='_cen', suffix_2='_pes',
...                                  match_type='Stage_1_Conflicts')
>>> df_updated
  puid_cen puid_pes         Match_Type  CLERICAL  MK
0       A1       B1  Stage_1_Conflicts         1   0
1       A2       B2  Stage_1_Conflicts         1   0
2       A2       B3  Stage_1_Conflicts         1   0
- src.pes_match.crow.remove_large_clusters(df, n)¶
Filters out clusters containing n or more unique records. This can be required if clusters are too large for the CROW system.
- Parameters
df (pandas.DataFrame) – DataFrame containing all clusters ready for CROW.
n (int) – Minimum size of clusters that will be removed.
- Returns
Pandas dataframe with clusters >= size n removed.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({"puid": ["A1", "B1", "A2", "B2", "B3", "A4", "B4"],
...                    "Cluster_Number": [1, 1, 2, 2, 2, 3, 3],
...                    "Source_Dataset": ["_cen", "_pes", "_cen", "_pes",
...                                       "_pes", "_cen", "_pes"]})
>>> df
  puid  Cluster_Number Source_Dataset
0   A1               1           _cen
1   B1               1           _pes
2   A2               2           _cen
3   B2               2           _pes
4   B3               2           _pes
5   A4               3           _cen
6   B4               3           _pes
>>> df = remove_large_clusters(df, n=3)
>>> df
  puid  Cluster_Number Source_Dataset
0   A1               1           _cen
1   B1               1           _pes
2   A4               3           _cen
3   B4               3           _pes
- src.pes_match.crow.save_for_crow(df, id_column, suffix_1, suffix_2, file_name, no_of_files=1)¶
Takes candidate matches, updates their format ready for CROW and then saves them. Matches can be split into multiple files if desired. Large clusters (size 12 or more) that are too big for CROW are removed.
- Parameters
df (pandas.DataFrame) – DataFrame containing all candidate pairs ready for CROW.
id_column (str) – Name of record_id column in CROW candidates.
suffix_1 (str) – Suffix used for the first data source.
suffix_2 (str) – Suffix used for the second data source.
file_name (str) – Name of file that will be saved. If multiple files are saved, each file will have a different suffix e.g. “_1”, “_2”, etc.
no_of_files (int, default = 1) – Number of csv files that the output will be split into.
See also
cluster_number
remove_large_clusters
split_save
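The pipeline this function describes can be sketched by composing the three helpers listed under "See also": cluster numbering via connected components, removal of clusters of size 12+, and a cluster-aware split. The name `save_for_crow_sketch` and the exact output layout are assumptions, not the package's implementation.

```python
import networkx as nx
import numpy as np
import pandas as pd


def save_for_crow_sketch(df, id_column, suffix_1, suffix_2, file_name, no_of_files=1):
    id_1, id_2 = id_column + suffix_1, id_column + suffix_2
    # 1. Assign cluster numbers: connected components of the match graph.
    g = nx.Graph()
    g.add_edges_from(zip(df[id_1], df[id_2]))
    lookup = {rec: i + 1
              for i, comp in enumerate(nx.connected_components(g))
              for rec in comp}
    df = df.assign(Cluster_Number=df[id_1].map(lookup))
    # 2. Drop clusters with 12 or more unique records (too big for CROW).
    sizes = df.groupby("Cluster_Number")[[id_1, id_2]].nunique().sum(axis=1)
    df = df[df["Cluster_Number"].map(sizes) < 12]
    # 3. Split on whole clusters and write one CSV per file.
    chunks = np.array_split(df["Cluster_Number"].unique(), no_of_files)
    for i, chunk in enumerate(chunks, start=1):
        df[df["Cluster_Number"].isin(chunk)].to_csv(
            f"{file_name}_{i}.csv", index=False)
```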
- src.pes_match.crow.split_save(df, file_name, no_of_files)¶
Splits clusters (that are already in a format ready for CROW) into multiple smaller files.
- Parameters
df (pandas.DataFrame) – DataFrame containing all clusters ready for CROW.
file_name (str) – Name of files that will be saved. Each file will have a different suffix e.g. “_1”, “_2”, etc.
no_of_files (int) – Number of csv files that the output will be split into.
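A minimal sketch of the splitting logic, assuming a `Cluster_Number` column (as produced by cluster_number) and splitting on whole clusters so that no cluster is divided across files. `split_save_sketch` is illustrative, not the package's implementation.

```python
import numpy as np
import pandas as pd


def split_save_sketch(df, file_name, no_of_files):
    # Split the cluster IDs (not the rows) so each cluster stays whole.
    clusters = df["Cluster_Number"].unique()
    for i, chunk in enumerate(np.array_split(clusters, no_of_files), start=1):
        part = df[df["Cluster_Number"].isin(chunk)]
        part.to_csv(f"{file_name}_{i}.csv", index=False)
```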
src.pes_match.matching module¶
- src.pes_match.matching.age_diff_filter(df, age_1, age_2)¶
Filters a set of matched records to keep only records within certain age tolerances. Age tolerances increase slightly as age increases.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
age_1 (str) – Name of age column (integer type) from first dataset
age_2 (str) – Name of age column (integer type) from second dataset
- Returns
Filtered pandas dataframe which only includes records that meet the age tolerance criteria.
- Return type
pandas.DataFrame
See also
age_tolerance
Function that returns True or False depending on whether two integer ages are within certain tolerances.
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'age_1': [5, 15, 25, 50, 99],
...                    'age_2': [5, 20, 22, 52, 90]})
>>> df.head(n=5)
   age_1  age_2
0      5      5
1     15     20
2     25     22
3     50     52
4     99     90
>>> df = age_diff_filter(df, 'age_1', 'age_2')
>>> df.head(n=5)
   age_1  age_2
0      5      5
1     25     22
2     50     52
- src.pes_match.matching.age_tolerance(val1, val2)¶
Function that returns True or False depending on whether two integer ages are within certain tolerances. Used in the age_diff_filter filtering function.
- Parameters
val1 (int) – First age value
val2 (int) – Second age value
- Returns
The return value is True for cases that meet the age tolerance rules, False otherwise.
- Return type
bool
See also
age_diff_filter
Filters a set of matched records to keep only records within certain age tolerances. Age tolerances increase slightly as age increases.
Example
>>> age_tolerance(5, 5)
True
>>> age_tolerance(5, 8)
False
>>> age_tolerance(45, 49)
True
>>> age_tolerance(45, 50)
False
- src.pes_match.matching.combine(matchkeys, person_id, suffix_1, suffix_2, keep)¶
Takes results from a set of matchkeys and combines into a single deduplicated dataframe. If duplicate matches are made across matchkeys, the version with the lowest matchkey number is retained.
- Parameters
matchkeys (list of pandas.DataFrame) – List of dataframes containing matches made from each matchkey
person_id (str) – Name of person id used in both datasets (without suffix)
suffix_1 (str) – Suffix used for columns in the first dataframe
suffix_2 (str) – Suffix used for columns in the second dataframe
keep (list of str) – List of variables to retain. Suffixes not required. New matchkey column “MK” will also be retained
See also
run_single_matchkey
Function to collect matches from a chosen matchkey
- Returns
df – Combined dataset containing all matches made across matchkeys.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> mk1 = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5],
...                     'puid_2': [21, 22, 23, 24, 25],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE', 'SAM', 'PAUL'],
...                     'name_2': ['CHARLES', 'JON', 'STEPHEN', 'SAMM', 'PAUL']})
>>> mk1.head(n=5)
   puid_1  puid_2   name_1   name_2
0       1      21  CHARLIE  CHARLES
1       2      22     JOHN      JON
2       3      23    STEVE  STEPHEN
3       4      24      SAM     SAMM
4       5      25     PAUL     PAUL
>>> mk2 = pd.DataFrame({'puid_1': [1, 2, 3, 6, 7],
...                     'puid_2': [21, 22, 30, 31, 32],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE', 'MARK', 'DAVE'],
...                     'name_2': ['CHARLES', 'JON', 'STEUE', 'MARL', 'DAVE']})
>>> mk2.head(n=5)
   puid_1  puid_2   name_1   name_2
0       1      21  CHARLIE  CHARLES
1       2      22     JOHN      JON
2       3      30    STEVE    STEUE
3       6      31     MARK     MARL
4       7      32     DAVE     DAVE
>>> matches = combine(matchkeys=[mk1, mk2], suffix_1="_1", suffix_2="_2",
...                   person_id="puid", keep=['puid', 'name'])
>>> matches.head(n=8)
   puid_1   name_1  puid_2   name_2  MK
0       1  CHARLIE      21  CHARLES   1
1       2     JOHN      22      JON   1
2       3    STEVE      23  STEPHEN   1
3       4      SAM      24     SAMM   1
4       5     PAUL      25     PAUL   1
5       3    STEVE      30    STEUE   2
6       6     MARK      31     MARL   2
7       7     DAVE      32     DAVE   2
- src.pes_match.matching.generate_matchkey(suffix_1, suffix_2, hh_id, level, variables, swap_variables=None)¶
Function to generate a single matchkey for matching two dataframes together. ‘swap_variables’ enables different variables to be used across dataframes e.g. require agreement between forename (on dataframe 1) and surname (on dataframe 2).
- Parameters
suffix_1 (str) – Suffix used for columns in the first dataframe to match
suffix_2 (str) – Suffix used for columns in the second dataframe to match
hh_id (str) – Name of household ID column in dataframes to match (without suffixes). Required when level='associative'.
level (str) – Level of geography to include in the matchkey e.g. household, enumeration area etc. If level = ‘associative’ then an associative matchkey is applied instead.
variables (list of str) – List of variables to use in matchkey rule (excluding level of geography)
swap_variables (list of tuple, optional) – Use if you want to match a variable from one dataframe to a different variable on the other dataframe. For example, to match forename on dataframe 1 to surname on dataframe 2, swap_variables = [(‘forename_1’, ‘surname_2’)]
- Returns
df1_link_vars (list) – Variables to match on, suffixed with suffix_1
df2_link_vars (list) – Variables to match on, suffixed with suffix_2
Example
>>> mk = generate_matchkey(
...     suffix_1="_cen",
...     suffix_2="_pes",
...     hh_id="hid",
...     level="Eaid",
...     variables=["forename", "dob", "sex"],
...     swap_variables=[("middlename_cen", "surname_pes")])
>>> mk[0]
['forename_cen', 'dob_cen', 'sex_cen', 'Eaid_cen', 'middlename_cen']
>>> mk[1]
['forename_pes', 'dob_pes', 'sex_pes', 'Eaid_pes', 'surname_pes']
- src.pes_match.matching.get_assoc_candidates(df1, df2, suffix_1, suffix_2, matches, person_id, hh_id)¶
Associative Matching Function. Takes all person matches made between two datasets and collects their unique household pairs. Unmatched person records from these household pairs are then grouped together (associatively). This is done by merging on the household ID pair to each unmatched record.
- Parameters
df1 (pandas.DataFrame) – The first dataframe being matched - must contain a person_id and hh_id
df2 (pandas.DataFrame) – The second dataframe being matched - must contain a person_id and hh_id
suffix_1 (str) – Suffix used for columns in the first dataframe
suffix_2 (str) – Suffix used for columns in the second dataframe
matches (pandas.DataFrame) – All unique person matches that will be used to make additional associative matches. This DataFrame should contain two person ID columns only
person_id (str) – Name of person ID column (without suffixes)
hh_id (str) – Name of household ID column (without suffixes)
- Returns
df1 (pandas.DataFrame) – Unmatched person records from df1 with additional household ID column from df2
df2 (pandas.DataFrame) – Unmatched person records from df2 with additional household ID column from df1
See also
get_residuals
Function for collecting person records not yet matched
Example
>>> import pandas as pd
>>> df1 = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5],
...                     'hhid_1': [1, 1, 1, 1, 1],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE',
...                                'SAM', 'PAUL']})
>>> df2 = pd.DataFrame({'puid_2': [21, 22, 23, 24, 25],
...                     'hhid_2': [2, 2, 2, 2, 2],
...                     'name_2': ['CHARLES', 'JON',
...                                'STEPHEN', 'SAMANTHA', 'PAUL']})
>>> matches = pd.DataFrame({'puid_1': [1, 5],
...                         'puid_2': [21, 25]})
>>> df1, df2 = get_assoc_candidates(df1, df2, suffix_1='_1', suffix_2='_2',
...                                 matches=matches, person_id='puid',
...                                 hh_id='hhid')
>>> df1.head(n=5)
   puid_1  hhid_1 name_1  hhid_2
0       2       1   JOHN       2
1       3       1  STEVE       2
2       4       1    SAM       2
>>> df2.head(n=5)
   puid_2  hhid_2    name_2  hhid_1
0      22       2       JON       1
1      23       2   STEPHEN       1
2      24       2  SAMANTHA       1
An additional household ID column has been added to each DataFrame of residuals, associating households 1 and 2 via the existing matches between persons 1 and 21, and persons 5 and 25.
- src.pes_match.matching.get_residuals(all_records, matched_records, id_column)¶
Removes already-matched records from a set of person records, leaving only the unmatched residuals.
- Parameters
all_records (pandas.DataFrame) – The dataframe containing all person records.
matched_records (pandas.DataFrame) – The dataframe containing all matched person records
id_column (str) – Name of person ID column (including suffixes)
- Returns
Matched records removed, leaving only the residuals.
- Return type
pandas.DataFrame
Example
>>> import pandas as pd
>>> all_records = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5]})
>>> all_records.head(n=5)
   puid_1
0       1
1       2
2       3
3       4
4       5
>>> matched_records = pd.DataFrame({'puid_1': [1, 2, 3],
...                                 'puid_2': [21, 22, 23]})
>>> matched_records.head(n=5)
   puid_1  puid_2
0       1      21
1       2      22
2       3      23
>>> residuals = get_residuals(all_records=all_records,
...                           matched_records=matched_records,
...                           id_column='puid_1')
>>> residuals.head(n=5)
   puid_1
0       4
1       5
- src.pes_match.matching.mult_match(df, hh_id_1, hh_id_2)¶
Filters a set of matched records by retaining only those where 2 or more matches have been made across a pair of households.
- Parameters
df (pandas.DataFrame) – dataframe to filter containing all person matches / candidates.
hh_id_1 (str) – Name of household ID column in first dataset (including suffix)
hh_id_2 (str) – Name of household ID column in second dataset (including suffix)
- Returns
Retains cases where multiple matches have been made between two households. Other cases are discarded.
- Return type
pandas.DataFrame
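The filtering rule can be sketched with a grouped count over household pairs. `mult_match_sketch` is an illustrative equivalent written from the description above, not the package's implementation.

```python
import pandas as pd


def mult_match_sketch(df, hh_id_1, hh_id_2):
    # Count candidate matches per household pair; keep rows belonging
    # to pairs with 2 or more matches.
    pair_size = df.groupby([hh_id_1, hh_id_2])[hh_id_1].transform("size")
    return df[pair_size >= 2]
```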
- src.pes_match.matching.run_single_matchkey(df1, df2, suffix_1, suffix_2, hh_id, level, variables, swap_variables=None, lev_variables=None, age_threshold=None)¶
Function to collect matches from a chosen matchkey. Partial agreement can be included using std_lev_filter, and age filters can be applied using age_threshold. Use swap_variables to match across different variables e.g. forename = surname.
- Parameters
df1 (pandas.DataFrame) – The first dataframe being matched
df2 (pandas.DataFrame) – The second dataframe being matched
suffix_1 (str) – Suffix used for columns in the first dataframe
suffix_2 (str) – Suffix used for columns in the second dataframe
hh_id (str) – Name of household ID column in df1 and df2 (without suffixes). Required when level='associative'.
level (str) – Level of geography to include in the matchkey e.g. household, EA etc. If level = ‘associative’ then an associative matchkey is applied instead.
variables (list of str) – List of variables to use in matchkey rule (excluding level of geography)
swap_variables (list of tuple, optional) – Use if you want to match a variable from one dataset to a different variable on the other dataset. For example, to match forename on df1 to surname on df2, swap_variables = [(‘forename_1’, ‘surname_2’)]
lev_variables (list of tuple, optional) – Use if you want to apply the std_lev_filter function within the matchkey. For example, to apply to forenames (threshold = 0.80): lev_variables = [(‘forename_1’, ‘forename_2’, 0.80)]
age_threshold (bool, optional) – Use if you want to apply the age_diff_filter function within the matchkey. To apply, simply set age_threshold = True
- Returns
matches – All matches made from chosen matchkey (non-unique matches included)
- Return type
pandas.DataFrame
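The exact-agreement core of a matchkey (ignoring the associative, Levenshtein and age-filter options) amounts to an inner join on the suffixed matchkey variables plus the geography level. `run_matchkey_sketch` and its column names are illustrative assumptions, not the package's implementation.

```python
import pandas as pd


def run_matchkey_sketch(df1, df2, suffix_1, suffix_2, level, variables):
    # Build the suffixed join keys for each dataframe.
    left = [v + suffix_1 for v in variables] + [level + suffix_1]
    right = [v + suffix_2 for v in variables] + [level + suffix_2]
    # Inner join keeps only candidate pairs agreeing on every key.
    return pd.merge(df1, df2, how="inner", left_on=left, right_on=right)
```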
- src.pes_match.matching.std_lev(string1, string2)¶
Function that compares two strings (usually names) and returns the standardised Levenshtein edit distance score, between 0 and 1. Used in the std_lev_filter filtering function.
- Parameters
string1 (str or None) – First string for comparison
string2 (str or None) – Second string for comparison
- Returns
Score between 0 and 1. The closer to 1, the stronger the similarity between the two strings (1 = full agreement / exact match).
- Return type
float
See also
std_lev_filter
Filters a set of matched records to keep only records where names have a similarity greater than a chosen threshold.
Example
>>> std_lev('CHARLIE', 'CHARLIE')
1.0
>>> std_lev('CHARLIE', 'CHARLES')
0.7142857142857143
>>> std_lev('CHARLIE', None)
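The score is one minus the Levenshtein edit distance divided by the length of the longer string. A pure-Python sketch follows (the package may instead use a library such as jellyfish); it is named `std_lev_sketch` to mark it as illustrative.

```python
def std_lev_sketch(string1, string2):
    # Missing input gives a missing score.
    if string1 is None or string2 is None:
        return None
    # Classic dynamic-programming Levenshtein edit distance.
    m, n = len(string1), len(string2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    # Standardise to [0, 1]: 1 means exact agreement.
    return 1 - prev[n] / max(m, n)
```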
- src.pes_match.matching.std_lev_filter(df, column1, column2, threshold)¶
Filters a set of matched records to keep only records where names have a similarity greater than a chosen threshold.
- Parameters
df (pandas.DataFrame) – The dataframe to which the function is applied.
column1 (str) – Name column (string type) from first dataset
column2 (str) – Name column (string type) from second dataset
threshold (float) – Record pairs with a standardised Levenshtein edit distance below this threshold will be discarded
- Returns
Filtered pandas dataframe which only includes records that meet the edit distance filter criteria.
- Return type
pandas.DataFrame
See also
std_lev
Function that compares two strings (usually names) and returns the standardised Levenshtein edit distance score, between 0 and 1.
Example
>>> import pandas as pd
>>> df = pd.DataFrame({'name_1': ['CHARLES', None, 'C', 'CHRLEI', 'CH4RL1E'],
...                    'name_2': ['CHARLIE', 'CHARLIE', 'CHARLIE', 'CHARLIE',
...                               'CHARLIE']})
>>> df.head(n=5)
    name_1   name_2
0  CHARLES  CHARLIE
1     None  CHARLIE
2        C  CHARLIE
3   CHRLEI  CHARLIE
4  CH4RL1E  CHARLIE
>>> df = std_lev_filter(df, column1='name_1', column2='name_2', threshold=0.60)
>>> df.head(n=5)
    name_1   name_2
0  CHARLES  CHARLIE
1  CH4RL1E  CHARLIE