src.pes_match package

Subpackages

Submodules

src.pes_match.cleaning module

src.pes_match.cleaning.alpha_name(df, input_col, output_col)

Sorts the characters of each string in a column alphabetically, after removing whitespace/special characters and converting strings to upper case.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • input_col (str) – Name of column to be sorted alphabetically

  • output_col (str) – Name of column to be output

Returns

Pandas dataframe with output_col appended

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import re
>>> df = pd.DataFrame({'forename': ['Charlie']})
>>> df['forename'].head(n=1)
0    Charlie
Name: forename, dtype: object
>>> df = alpha_name(df, input_col='forename', output_col='alphaname')
>>> df['alphaname'].head(n=1)
0    ACEHILR
Name: alphaname, dtype: object
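A minimal sketch consistent with the example above (an illustration only, not the package's implementation; `alpha_name_sketch` is a hypothetical name):

```python
import re

import pandas as pd


def alpha_name_sketch(df, input_col, output_col):
    """Upper-case each string, strip non-letters, then sort its characters."""
    def alphabetise(s):
        s = re.sub(r"[^A-Z]", "", str(s).upper())
        return "".join(sorted(s))

    df = df.copy()
    df[output_col] = df[input_col].map(alphabetise)
    return df
```

Applied to the 'Charlie' example, this reproduces 'ACEHILR'.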
src.pes_match.cleaning.change_types(df, input_cols, types)

Casts specific dataframe columns to a specified type. The function can either take a single column or a list of columns.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • input_cols (str or list of str) – The subset of columns that are having their datatypes converted.

  • types – The datatype that the column values will be converted into.

Returns

Returns the complete dataframe with changes to the datatypes on specified columns.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'number': [1]})
>>> df.dtypes[0]
dtype('int64')
>>> df = change_types(df, input_cols='number', types='str')
>>> df.dtypes[0]
dtype('O')
src.pes_match.cleaning.clean_name(df, name_column, suffix='')

Derives a cleaned version of a column contained in a pandas dataframe.

Parameters
  • df (pandas.DataFrame) – Input dataframe with name_column present

  • name_column (str) – Name of column containing name as string type

  • suffix (str, default = "") – Optional suffix to append to name component column names

Returns

clean_name returns the dataframe with a cleaned version of name_column.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> import re
>>> df = pd.DataFrame({'Name': ['Charlie!']})
>>> df.head(n=1)
       Name
0  Charlie!
>>> df = clean_name(df, name_column='Name', suffix='_cen')
>>> df.head(n=1)
       Name Name_clean_cen
0  Charlie!        CHARLIE
src.pes_match.cleaning.concat(df, columns, output_col, sep=' ')

Concatenates strings from specified columns into a single string and stores the new string value in a new column.

Parameters
  • df (pandas.DataFrame) – Dataframe to which the function is applied.

  • columns (list of str) – The list of columns being concatenated into one string

  • output_col (str) – The name, in string format, of the output column for the new concatenated strings to be stored in.

  • sep (str, default = ' ') – This is the value used to separate the strings in the different columns when combining them into a single string.

Returns

Returns dataframe with ‘output_col’ column containing the concatenated string.

Return type

pandas.DataFrame

See also

replace_vals

Uses regular expressions to replace values within dataframe columns.

Example

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'Forename': ['John'],
...                    'Surname': ['Smith']})
>>> df.head(n=1)
  Forename Surname
0     John   Smith
>>> df = concat(df, columns=['Forename', 'Surname'], output_col='Fullname', sep=' ')
>>> df.head(n=1)
  Forename Surname    Fullname
0     John   Smith  John Smith
src.pes_match.cleaning.derive_list(df, partition_var, list_var, output_col)

Aggregate function: collects a list of values from one column after partitioning by another column. Results are stored in a new column.

Parameters
  • df (pandas.DataFrame) – Input dataframe with partition_var and list_var present

  • partition_var (str) – Name of column to partition on e.g. household ID

  • list_var (str) – Variable to collect list of values over chosen partition e.g. names

  • output_col (str) – Name of list column to be output

Returns

derive_list returns the dataframe with additional column output_col

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'Forename': ['John', 'Steve', 'Charlie', 'James'],
...                    'Household': [1, 1, 2, 2]})
>>> df.head(n=4)
  Forename  Household
0     John          1
1    Steve          1
2  Charlie          2
3    James          2
>>> df = derive_list(df, partition_var='Household', list_var='Forename',
...                  output_col='Forename_List')
>>> df.head(n=4)
  Forename  Household     Forename_List
0     John          1     [John, Steve]
1    Steve          1     [John, Steve]
2  Charlie          2  [Charlie, James]
3    James          2  [Charlie, James]
src.pes_match.cleaning.derive_names(df, clean_fullname_column, suffix='')

Derives first name, middle name(s) and last name from a pandas dataframe column containing a cleaned fullname column.

Parameters
  • df (pandas.DataFrame) – Input dataframe with clean_fullname_column present

  • clean_fullname_column (str) – Name of column containing fullname as string type

  • suffix (str, default = "") – Optional suffix to append to name component column names

Returns

derive_names returns the dataframe with additional columns for first name, middle name(s) and last name.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'Clean_Name': ['John Paul William Smith']})
>>> df.head(1)
                Clean_Name
0  John Paul William Smith
>>> df = derive_names(df, clean_fullname_column='Clean_Name', suffix="")
>>> df.head(n=1)
                Clean_Name forename   middle_name last_name
0  John Paul William Smith     John  Paul William     Smith
src.pes_match.cleaning.n_gram(df, input_col, output_col, missing_value, n)

Generates the upper case n-gram sequence for all strings in a column.

Parameters
  • df (pandas.DataFrame) – Input dataframe with input_col present

  • input_col (str) – name of column to apply n_gram to

  • output_col (str) – name of column to be output

  • missing_value – Value used to represent missingness in input_col; the same value is used for missingness in output_col

  • n (int) – Chosen n-gram

Returns

n_gram returns the dataframe with additional column output_col

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'Forename': ['Jonathon', np.nan]})
>>> df.head(n=2)
   Forename
0  Jonathon
1       NaN
>>> df = n_gram(df, input_col='Forename', output_col='First_Two', missing_value=np.nan, n=2)
>>> df = n_gram(df, input_col='Forename', output_col='Last_Two', missing_value=np.nan, n=-2)
>>> df.head(n=2)
   Forename First_Two Last_Two
0  Jonathon        JO       ON
1       NaN       NaN      NaN
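A sketch consistent with the behaviour shown above, where a positive n takes the first n characters and a negative n takes the last |n| (a hypothetical implementation for illustration):

```python
import pandas as pd


def n_gram_sketch(df, input_col, output_col, missing_value, n):
    """Upper-case n-gram: first n characters if n > 0, last |n| if n < 0.
    Missing values pass through unchanged."""
    def gram(s):
        if s != s or s == missing_value:  # NaN compares unequal to itself
            return missing_value
        s = str(s).upper()
        return s[:n] if n > 0 else s[n:]

    df = df.copy()
    df[output_col] = df[input_col].map(gram)
    return df
```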
src.pes_match.cleaning.pad_column(df, input_col, output_col, length)

Pads a column (int or string type) with leading zeros. Values in input_col that are longer than the chosen pad length will not be padded and will remain unchanged.

Parameters
  • df (pandas.DataFrame) – Input dataframe with input_col present

  • input_col (str or int) – name of column to apply pad_column to

  • output_col (str) – name of column to be output

  • length (int) – Chosen length of strings in column AFTER padding with zeros

Returns

pad_column returns the dataframe with additional column output_col

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'Age': [2, 5, 100]})
>>> df.head(n=3)
   Age
0    2
1    5
2  100
>>> df = pad_column(df, 'Age', 'Age_Padded', 3)
>>> df.head(n=3)
   Age Age_Padded
0    2        002
1    5        005
2  100        100
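The padding behaviour above matches Python's string `zfill`, so a minimal sketch could be (an illustration, not the package's code):

```python
import pandas as pd


def pad_column_sketch(df, input_col, output_col, length):
    """Zero-pad values to `length`; longer values pass through unchanged."""
    df = df.copy()
    df[output_col] = df[input_col].astype(str).str.zfill(length)
    return df
```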
src.pes_match.cleaning.replace_vals(df, subset, dic)

Uses regular expressions to replace values within dataframe columns.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • subset (str or list of str) – The subset is the list of columns in the dataframe on which replace_vals is performing its actions.

  • dic (dict) – Dictionary of replacements: each key is the replacement string and each value is the pattern to be replaced within the subset of columns. Patterns must be either regex statements in string form, or numpy NaN values.

Returns

replace_vals returns the dataframe with the column values changed appropriately.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'Sex': ['M', 'F']})
>>> df.head(n=2)
  Sex
0   M
1   F
>>> df = replace_vals(df, dic={'MALE':'M', 'FEMALE':'F'}, subset='Sex')
>>> df.head(n=2)
      Sex
0    MALE
1  FEMALE
src.pes_match.cleaning.select(df, columns)

Retains only specified list of columns.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • columns (str or list of str) – Columns to retain. This argument can be entered as a list of column headers to select, or as a single string naming one column, in which case only that column is retained.

Returns

Dataframe with only selected columns included

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'Sex': ['M', 'F'],
...                    'Age': [10, 29]})
>>> df.head(n=2)
  Sex  Age
0   M   10
1   F   29
>>> df = select(df, columns = 'Sex')
>>> df.head(n=2)
  Sex
0   M
1   F
src.pes_match.cleaning.soundex(df, input_col, output_col, missing_value)

Generates the soundex phonetic encoding for all strings in a column.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • input_col (str) – name of column to apply soundex to

  • output_col (str) – name of column to be output

  • missing_value – Value used to represent missing values in input_col; the same value is used for missing values in output_col

Returns

soundex returns the dataframe with additional column output_col

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import jellyfish
>>> df = pd.DataFrame({'Forename': ['Charlie', 'Rachel', '-9']})
>>> df.head(n=3)
  Forename
0  Charlie
1   Rachel
2       -9
>>> df = soundex(df, input_col='Forename', output_col='sdx_Forename', missing_value='-9')
>>> df.head(n=3)
  Forename sdx_Forename
0  Charlie         C640
1   Rachel         R240
2       -9           -9

src.pes_match.cluster module

src.pes_match.cluster.cluster_number(df, id_column, suffix_1, suffix_2)

Takes dataframe of matches with two id columns and assigns a cluster number to the dataframe based on the unique id pairings.

Parameters
  • df (pandas.DataFrame) – DataFrame to add new column ‘Cluster_Number’ to.

  • id_column (str) – ID column that should be common to both DataFrames (excluding suffixes).

  • suffix_1 (str) – Suffix used for id_column in first DataFrame.

  • suffix_2 (str) – Suffix used for id_column in second DataFrame.

Raises

TypeError – if variables id_column, suffix_1 or suffix_2 are not strings.

Returns

df – dataframe with Cluster_Number added

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> import networkx as nx
>>> df = pd.DataFrame({"id_1":["C1","C2","C3","C4","C5","C6"],
...                    "id_2":["P1","P2","P2","P3","P1","P6"]})
>>> df.head(n=6)
  id_1 id_2
0   C1   P1
1   C2   P2
2   C3   P2
3   C4   P3
4   C5   P1
5   C6   P6
>>> df = cluster_number(df = df, id_column='id', suffix_1="_1", suffix_2="_2")
>>> df.head(n=6)
  id_1 id_2  Cluster_Number
0   C1   P1               1
1   C2   P2               2
2   C3   P2               2
3   C4   P3               3
4   C5   P1               1
5   C6   P6               4
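Cluster numbering amounts to finding connected components of the id-pair graph and numbering them in order of first appearance (the doctest imports networkx, so that is presumably what the package uses). A dependency-free union-find sketch that reproduces the example output:

```python
import pandas as pd


def cluster_number_sketch(df, id_column, suffix_1, suffix_2):
    """Number connected components of the id-pair graph, in order of
    first appearance (hypothetical union-find illustration)."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in zip(df[id_column + suffix_1], df[id_column + suffix_2]):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    numbers, seen = [], {}
    for a in df[id_column + suffix_1]:
        root = find(a)
        numbers.append(seen.setdefault(root, len(seen) + 1))
    out = df.copy()
    out["Cluster_Number"] = numbers
    return out
```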

src.pes_match.crow module

src.pes_match.crow.collect_conflicts(df, id_1, id_2)

Collects non-unique matches from a set of matches, removing all unique cases.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • id_1 (str) – ID column in first DataFrame (including suffix).

  • id_2 (str) – ID column in second DataFrame (including suffix).

Returns

Pandas dataframe with unique matches removed

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id_1":["A1","A2","A3","A4","A5","A6"],
...                    "id_2":["B1","B2","B3","B3","B4","B5"]})
>>> df.head(n=6)
  id_1 id_2
0   A1   B1
1   A2   B2
2   A3   B3
3   A4   B3
4   A5   B4
5   A6   B5
>>> collect_conflicts(df, id_1='id_1', id_2='id_2')
  id_1 id_2  CLERICAL
0   A3   B3         1
1   A4   B3         1
src.pes_match.crow.collect_uniques(df, id_1, id_2, match_type)

Collects unique matches from a set of matches, removing all non-unique cases.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • id_1 (str) – ID column in first DataFrame (including suffix).

  • id_2 (str) – ID column in second DataFrame (including suffix).

  • match_type (str) – Indicator that is added to specify which stage the matches were made on.

Returns

Pandas dataframe with non-unique matches removed and indicator column appended

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id_1":["A1","A2","A3","A4","A5","A6"],
...                    "id_2":["B1","B2","B3","B3","B4","B5"]})
>>> df.head(n=6)
  id_1 id_2
0   A1   B1
1   A2   B2
2   A3   B3
3   A4   B3
4   A5   B4
5   A6   B5
>>> collect_uniques(df, id_1='id_1', id_2='id_2', match_type='Stage_X_Matchkeys')
  id_1 id_2  CLERICAL         Match_Type
0   A1   B1         0  Stage_X_Matchkeys
1   A2   B2         0  Stage_X_Matchkeys
2   A5   B4         0  Stage_X_Matchkeys
3   A6   B5         0  Stage_X_Matchkeys
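A sketch of the uniqueness filter, assuming "unique" means the id appears exactly once on both sides (a hypothetical implementation consistent with the example above):

```python
import pandas as pd


def collect_uniques_sketch(df, id_1, id_2, match_type):
    """Keep rows whose ids appear exactly once on both sides; flag for QA."""
    non_unique = df.duplicated(id_1, keep=False) | df.duplicated(id_2, keep=False)
    out = df[~non_unique].copy().reset_index(drop=True)
    out["CLERICAL"] = 0
    out["Match_Type"] = match_type
    return out
```

Inverting the mask gives the complementary collect_conflicts behaviour.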
src.pes_match.crow.combine_crow_results(stage)

Takes all matches made in CROW from a chosen stage and combines them into a single pandas DataFrame. All matching in CROW for the chosen stage must be completed before running this function.

Parameters

stage (str) – Chosen stage of matching e.g., ‘Stage_1’. The function will look inside CLERICAL_PATH and combine all clerically matched CSV files that contain this string. File names for completed matches must also end in ‘_DONE.csv’, otherwise they will not be included in the final set of combined clerical matches.

Returns

Pandas dataframe with all clerically matched records from a selected stage combined.

Return type

pandas.DataFrame

src.pes_match.crow.crow_output_updater(output_df, id_column, source_column, suffix_1, suffix_2, match_type)

Returns the outputs of CROW in a pairwise linked format. Only matched pairs are retained.

Parameters
  • output_df (pandas.DataFrame) – The dataframe containing CROW matched output

  • id_column (str) – Name of record_id column in CROW matched output

  • source_column (str) – Name of column in CROW matched output identifying which data source the record is from

  • suffix_1 (str) – Suffix used for the first data source.

  • suffix_2 (str) – Suffix used for the second data source.

  • match_type (str) – indicator to say which stage matches were made at

Returns

df – CROW matches in pairwise format

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"puid": ["A1", "B1", "A2", "B2", "B3", "A4", "B4"],
...                  "Cluster_Number": [1, 1, 2, 2, 2, 3, 3],
...                  "Match": ["A1,B1", "A1,B1", "A2,B2,B3", "A2,B2,B3",
...                  "A2,B2,B3", "No match in cluster", "No match in cluster"],
...                  "Source_Dataset": ["_cen", "_pes", "_cen", "_pes",
...                  "_pes", "_cen", "_pes"]})
>>> df
  puid  Cluster_Number                Match Source_Dataset
0   A1               1                A1,B1           _cen
1   B1               1                A1,B1           _pes
2   A2               2             A2,B2,B3           _cen
3   B2               2             A2,B2,B3           _pes
4   B3               2             A2,B2,B3           _pes
5   A4               3  No match in cluster           _cen
6   B4               3  No match in cluster           _pes
>>> df_updated = crow_output_updater(output_df = df, id_column = 'puid',
...                                  source_column = 'Source_Dataset',
...                                  suffix_1 = '_cen', suffix_2 = '_pes',
...                                  match_type = 'Stage_1_Conflicts')
>>> df_updated
  puid_cen puid_pes         Match_Type  CLERICAL  MK
0       A1       B1  Stage_1_Conflicts         1   0
1       A2       B2  Stage_1_Conflicts         1   0
2       A2       B3  Stage_1_Conflicts         1   0
src.pes_match.crow.remove_large_clusters(df, n)

Filters out clusters containing n or more unique records. This can be required if clusters are too large for the CROW system.

Parameters
  • df (pandas.DataFrame) – DataFrame containing all clusters ready for CROW.

  • n (int) – Minimum size of clusters that will be removed.

See also

save_for_crow

Returns

Pandas dataframe with clusters >= size n removed.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({"puid": ["A1", "B1", "A2", "B2", "B3", "A4", "B4"],
...                  "Cluster_Number": [1, 1, 2, 2, 2, 3, 3],
...                  "Source_Dataset": ["_cen", "_pes", "_cen", "_pes",
...                  "_pes", "_cen", "_pes"]})
>>> df
  puid  Cluster_Number Source_Dataset
0   A1               1           _cen
1   B1               1           _pes
2   A2               2           _cen
3   B2               2           _pes
4   B3               2           _pes
5   A4               3           _cen
6   B4               3           _pes
>>> df = remove_large_clusters(df, n=3)
>>> df
  puid  Cluster_Number Source_Dataset
0   A1               1           _cen
1   B1               1           _pes
2   A4               3           _cen
3   B4               3           _pes
src.pes_match.crow.save_for_crow(df, id_column, suffix_1, suffix_2, file_name, no_of_files=1)

Takes candidate matches, updates their format ready for CROW and then saves them. Matches can be split across multiple files if desired. Large clusters (size 12+) that are too big for CROW are removed.

Parameters
  • df (pandas.DataFrame) – DataFrame containing all candidate pairs ready for CROW.

  • id_column (str) – Name of record_id column in CROW candidates.

  • suffix_1 (str) – Suffix used for the first data source.

  • suffix_2 (str) – Suffix used for the second data source.

  • file_name (str) – Name of file that will be saved. If multiple files are saved, each file will have a different suffix e.g. “_1”, “_2”, etc.

  • no_of_files (int, default = 1) – Number of csv files that the output will be split into.

See also

cluster_number, remove_large_clusters, split_save

src.pes_match.crow.split_save(df, file_name, no_of_files)

Splits clusters (that are already in a format ready for CROW) into multiple smaller files.

Parameters
  • df (pandas.DataFrame) – DataFrame containing all clusters ready for CROW.

  • file_name (str) – Name of files that will be saved. Each file will have a different suffix e.g. “_1”, “_2”, etc.

  • no_of_files (int) – Number of csv files that the output will be split into.

See also

save_for_crow

src.pes_match.matching module

src.pes_match.matching.age_diff_filter(df, age_1, age_2)

Filters a set of matched records to keep only records within certain age tolerances. Age tolerances increase slightly as age increases.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • age_1 (str) – Name of age column (integer type) from first dataset

  • age_2 (str) – Name of age column (integer type) from second dataset

Returns

Filtered pandas dataframe which only includes records that meet the age tolerance criteria.

Return type

pandas.DataFrame

See also

age_tolerance

Function that returns True or False depending on whether two integer ages are within certain tolerances.

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'age_1': [5, 15, 25, 50, 99],
...                    'age_2': [5, 20, 22, 52, 90]})
>>> df.head(n=5)
   age_1  age_2
0      5      5
1     15     20
2     25     22
3     50     52
4     99     90
>>> df = age_diff_filter(df, 'age_1', 'age_2')
>>> df.head(n=5)
   age_1  age_2
0      5      5
1     25     22
2     50     52
src.pes_match.matching.age_tolerance(val1, val2)

Function that returns True or False depending on whether two integer ages are within certain tolerances. Used in the age_diff_filter filtering function.

Parameters
  • val1 (int) – First age value

  • val2 (int) – Second age value

Returns

The return value is True for cases that meet the age tolerance rules, False otherwise.

Return type

bool

See also

age_diff_filter

Filters a set of matched records to keep only records within certain age tolerances. Age tolerances increase slightly as age increases.

Example

>>> age_tolerance(5,5)
True
>>> age_tolerance(5,8)
False
>>> age_tolerance(45,49)
True
>>> age_tolerance(45,50)
False
src.pes_match.matching.combine(matchkeys, person_id, suffix_1, suffix_2, keep)

Takes results from a set of matchkeys and combines into a single deduplicated dataframe. If duplicate matches are made across matchkeys, the version with the lowest matchkey number is retained.

Parameters
  • matchkeys (list of pandas.DataFrame) – List of dataframes containing matches made from each matchkey

  • person_id (str) – Name of person id used in both datasets (without suffix)

  • suffix_1 (str) – Suffix used for columns in the first dataframe

  • suffix_2 (str) – Suffix used for columns in the second dataframe

  • keep (list of str) – List of variables to retain. Suffixes not required. New matchkey column “MK” will also be retained

See also

run_single_matchkey

Function to collect matches from a chosen matchkey

Returns

df – Combined dataset containing all matches made across matchkeys.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> mk1 = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5],
...                     'puid_2': [21, 22, 23, 24, 25],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE', 'SAM', 'PAUL'],
...                     'name_2': ['CHARLES', 'JON', 'STEPHEN', 'SAMM', 'PAUL']})
>>> mk1.head(n=5)
   puid_1  puid_2   name_1   name_2
0       1      21  CHARLIE  CHARLES
1       2      22     JOHN      JON
2       3      23    STEVE  STEPHEN
3       4      24      SAM     SAMM
4       5      25     PAUL     PAUL
>>> mk2 = pd.DataFrame({'puid_1': [1, 2, 3, 6, 7],
...                     'puid_2': [21, 22, 30, 31, 32],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE', 'MARK', 'DAVE'],
...                     'name_2': ['CHARLES', 'JON', 'STEUE', 'MARL', 'DAVE']})
>>> mk2.head(n=5)
   puid_1  puid_2   name_1   name_2
0       1      21  CHARLIE  CHARLES
1       2      22     JOHN      JON
2       3      30    STEVE    STEUE
3       6      31     MARK     MARL
4       7      32     DAVE     DAVE
>>> matches = combine(matchkeys=[mk1, mk2], suffix_1="_1", suffix_2="_2",
...                   person_id="puid", keep=['puid', 'name'])
>>> matches.head(n=8)
   puid_1   name_1  puid_2   name_2  MK
0       1  CHARLIE      21  CHARLES   1
1       2     JOHN      22      JON   1
2       3    STEVE      23  STEPHEN   1
3       4      SAM      24     SAMM   1
4       5     PAUL      25     PAUL   1
5       3    STEVE      30    STEUE   2
6       6     MARK      31     MARL   2
7       7     DAVE      32     DAVE   2
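The deduplication rule ("lowest matchkey number retained") can be sketched with a stable concatenation followed by a drop of repeated id pairs (a simplified illustration; the real function also applies the `keep` column selection):

```python
import pandas as pd


def combine_sketch(matchkeys, person_id, suffix_1, suffix_2):
    """Stack matchkey results; on duplicate id pairs keep the lowest MK."""
    frames = []
    for i, mk in enumerate(matchkeys, start=1):
        mk = mk.copy()
        mk["MK"] = i  # matchkey number; lower = higher priority
        frames.append(mk)
    combined = pd.concat(frames, ignore_index=True)
    pair = [person_id + suffix_1, person_id + suffix_2]
    # concat preserves matchkey order, so keep="first" retains the lowest MK
    return combined.drop_duplicates(subset=pair, keep="first").reset_index(drop=True)
```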
src.pes_match.matching.generate_matchkey(suffix_1, suffix_2, hh_id, level, variables, swap_variables=None)

Function to generate a single matchkey for matching two dataframes together. ‘swap_variables’ enables different variables to be used across dataframes e.g. require agreement between forename (on dataframe 1) and surname (on dataframe 2).

Parameters
  • suffix_1 (str) – Suffix used for columns in the first dataframe to match

  • suffix_2 (str) – Suffix used for columns in the second dataframe to match

  • hh_id (str) – Name of household ID column in dataframes to match (without suffixes). Required when level=’associative’.

  • level (str) – Level of geography to include in the matchkey e.g. household, enumeration area etc. If level = ‘associative’ then an associative matchkey is applied instead.

  • variables (list of str) – List of variables to use in matchkey rule (excluding level of geography)

  • swap_variables (list of tuple, optional) – Use if you want to match a variable from one dataframe to a different variable on the other dataframe. For example, to match forename on dataframe 1 to surname on dataframe 2, swap_variables = [(‘forename_1’, ‘surname_2’)]

Returns

  • df1_link_vars (list) – Variables to match on, suffixed with suffix_1

  • df2_link_vars (list) – Variables to match on, suffixed with suffix_2

Example

>>> mk = generate_matchkey(
...     suffix_1="_cen",
...     suffix_2="_pes",
...     hh_id="hid",
...     level="Eaid",
...     variables=["forename", "dob", "sex"],
...     swap_variables=[("middlename_cen", "surname_pes")])
>>> mk[0]
['forename_cen', 'dob_cen', 'sex_cen', 'Eaid_cen', 'middlename_cen']
>>> mk[1]
['forename_pes', 'dob_pes', 'sex_pes', 'Eaid_pes', 'surname_pes']
src.pes_match.matching.get_assoc_candidates(df1, df2, suffix_1, suffix_2, matches, person_id, hh_id)

Associative Matching Function. Takes all person matches made between two datasets and collects their unique household pairs. Unmatched person records from these household pairs are then grouped together (associatively). This is done by merging on the household ID pair to each unmatched record.

Parameters
  • df1 (pandas.DataFrame) – The first dataframe being matched - must contain a person_id and hh_id

  • df2 (pandas.DataFrame) – The second dataframe being matched - must contain a person_id and hh_id

  • suffix_1 (str) – Suffix used for columns in the first dataframe

  • suffix_2 (str) – Suffix used for columns in the second dataframe

  • matches (pandas.DataFrame) – All unique person matches that will be used to make additional associative matches. This DataFrame should contain two person ID columns only

  • person_id (str) – Name of person ID column (without suffixes)

  • hh_id (str) – Name of household ID column (without suffixes)

Returns

  • df1 (pandas.DataFrame) – Unmatched person records from df1 with additional household ID column from df2

  • df2 (pandas.DataFrame) – Unmatched person records from df2 with additional household ID column from df1

See also

get_residuals

Function for collecting person records not yet matched

Example

>>> import pandas as pd
>>> df1 = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5],
...                     'hhid_1': [1, 1, 1, 1, 1],
...                     'name_1': ['CHARLIE', 'JOHN', 'STEVE',
...                                'SAM', 'PAUL']})
>>> df2 = pd.DataFrame({'puid_2': [21, 22, 23, 24, 25],
...                     'hhid_2': [2, 2, 2, 2, 2],
...                     'name_2': ['CHARLES', 'JON',
...                                'STEPHEN', 'SAMANTHA', 'PAUL']})
>>> matches = pd.DataFrame({'puid_1': [1, 5],
...                         'puid_2': [21, 25]})
>>> df1, df2 = get_assoc_candidates(df1, df2, suffix_1='_1', suffix_2='_2',
...                                 matches=matches,person_id='puid',
...                                 hh_id='hhid')
>>> df1.head(n=5)
   puid_1  hhid_1 name_1  hhid_2
0       2       1   JOHN       2
1       3       1  STEVE       2
2       4       1    SAM       2
>>> df2.head(n=5)
   puid_2  hhid_2    name_2  hhid_1
0      22       2       JON       1
1      23       2   STEPHEN       1
2      24       2  SAMANTHA       1

An additional column is added to each DataFrame of residuals, associating households 1 and 2 with each other via the existing matches between persons 1 and 21, and between persons 5 and 25.

src.pes_match.matching.get_residuals(all_records, matched_records, id_column)

Removes previously matched records from the set of all person records, leaving only the unmatched residuals.

Parameters
  • all_records (pandas.DataFrame) – The dataframe containing all person records.

  • matched_records (pandas.DataFrame) – The dataframe containing all matched person records

  • id_column (str) – Name of person ID column (including suffixes)

Returns

Matched records removed, leaving only the residuals.

Return type

pandas.DataFrame

Example

>>> import pandas as pd
>>> all_records = pd.DataFrame({'puid_1': [1, 2, 3, 4, 5]})
>>> all_records.head(n=5)
   puid_1
0       1
1       2
2       3
3       4
4       5
>>> matched_records = pd.DataFrame({'puid_1': [1, 2, 3],
...                                 'puid_2': [21, 22, 23]})
>>> matched_records.head(n=5)
   puid_1  puid_2
0       1      21
1       2      22
2       3      23
>>> residuals = get_residuals(all_records=all_records,
...                           matched_records=matched_records,
...                           id_column='puid_1')
>>> residuals.head(n=5)
   puid_1
0       4
1       5
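A one-line sketch consistent with the example above (a hypothetical implementation; an anti-join on the id column):

```python
import pandas as pd


def get_residuals_sketch(all_records, matched_records, id_column):
    """Drop any record whose id appears in the matched set."""
    mask = ~all_records[id_column].isin(matched_records[id_column])
    return all_records[mask].reset_index(drop=True)
```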
src.pes_match.matching.mult_match(df, hh_id_1, hh_id_2)

Filters a set of matched records by retaining only those where 2 or more matches have been made across a pair of households.

Parameters
  • df (pandas.DataFrame) – dataframe to filter containing all person matches / candidates.

  • hh_id_1 (str) – Name of household ID column in first dataset (including suffix)

  • hh_id_2 (str) – Name of household ID column in second dataset (including suffix)

Returns

Retains cases where multiple matches have been made between two households. Other cases are discarded.

Return type

pandas.DataFrame
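A minimal sketch of this filter, assuming the documented behaviour (keep a candidate pair only if its household pair occurs at least twice; `mult_match_sketch` is a hypothetical name):

```python
import pandas as pd


def mult_match_sketch(df, hh_id_1, hh_id_2):
    """Keep only matches whose household pair appears 2+ times."""
    pair_size = df.groupby([hh_id_1, hh_id_2])[hh_id_1].transform("size")
    return df[pair_size >= 2]
```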

src.pes_match.matching.run_single_matchkey(df1, df2, suffix_1, suffix_2, hh_id, level, variables, swap_variables=None, lev_variables=None, age_threshold=None)

Function to collect matches from a chosen matchkey. Partial agreement can be included using std_lev_filter, and age filters can be applied using age_threshold. Use swap_variables to match across different variables e.g. forename = surname.

Parameters
  • df1 (pandas.DataFrame) – The first dataframe being matched

  • df2 (pandas.DataFrame) – The second dataframe being matched

  • suffix_1 (str) – Suffix used for columns in the first dataframe

  • suffix_2 (str) – Suffix used for columns in the second dataframe

  • hh_id (str) – Name of household ID column in df1 and df2 (without suffixes). Required when level=’associative’.

  • level (str) – Level of geography to include in the matchkey e.g. household, EA etc. If level = ‘associative’ then an associative matchkey is applied instead.

  • variables (list of str) – List of variables to use in matchkey rule (excluding level of geography)

  • swap_variables (list of tuple, optional) – Use if you want to match a variable from one dataset to a different variable on the other dataset. For example, to match forename on df1 to surname on df2, swap_variables = [(‘forename_1’, ‘surname_2’)]

  • lev_variables (list of tuple, optional) – Use if you want to apply the std_lev_filter function within the matchkey. For example, to apply to forenames (threshold = 0.80): lev_variables = [(‘forename_1’, ‘forename_2’, 0.80)]

  • age_threshold (bool, optional) – Use if you want to apply the age_diff_filter function within the matchkey. To apply, simply set age_threshold = True

Returns

matches – All matches made from chosen matchkey (non-unique matches included)

Return type

pandas.DataFrame
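The core join inside a matchkey can be sketched as an inner merge on the suffixed link variables (a simplified illustration only; the real function also handles associative matchkeys and the optional swap, Levenshtein and age filters):

```python
import pandas as pd


def exact_matchkey_sketch(df1, df2, variables, level, suffix_1, suffix_2):
    """Inner-join two frames on the matchkey variables plus geography level."""
    left = [v + suffix_1 for v in variables + [level]]
    right = [v + suffix_2 for v in variables + [level]]
    return df1.merge(df2, left_on=left, right_on=right, how="inner")
```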

src.pes_match.matching.std_lev(string1, string2)

Function that compares two strings (usually names) and returns the standardised Levenshtein edit distance score, between 0 and 1. Used in the std_lev_filter filtering function.

Parameters
  • string1 (str or None) – First string for comparison

  • string2 (str or None) – Second string for comparison

Returns

Score between 0 and 1. The closer to 1, the stronger the similarity between the two strings (1 = full agreement / exact match).

Return type

float

See also

std_lev_filter

Filters a set of matched records to keep only records where names have a similarity greater than a chosen threshold.

Example

>>> std_lev('CHARLIE','CHARLIE')
1.0
>>> std_lev('CHARLIE','CHARLES')
0.7142857142857143
>>> std_lev('CHARLIE',None)
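The scores above are consistent with one minus the Levenshtein edit distance divided by the length of the longer string (for CHARLIE vs CHARLES, distance 2 gives 1 - 2/7 ≈ 0.714). A pure-Python sketch of that calculation, which is an assumption about the implementation rather than the package's own code:

```python
def std_lev_sketch(string1, string2):
    """Standardised Levenshtein similarity: 1 - distance / max length.
    Returns None if either input is None, matching the doctest above."""
    if string1 is None or string2 is None:
        return None
    # classic dynamic-programming edit distance
    prev = list(range(len(string2) + 1))
    for i, c1 in enumerate(string1, start=1):
        curr = [i]
        for j, c2 in enumerate(string2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return 1 - prev[-1] / max(len(string1), len(string2))
```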
src.pes_match.matching.std_lev_filter(df, column1, column2, threshold)

Filters a set of matched records to keep only records where names have a similarity greater than a chosen threshold.

Parameters
  • df (pandas.DataFrame) – The dataframe to which the function is applied.

  • column1 (str) – Name column (string type) from first dataset

  • column2 (str) – Name column (string type) from second dataset

  • threshold (float) – Record pairs with a standardised Levenshtein edit distance below this threshold will be discarded

Returns

Filtered pandas dataframe which only includes records that meet the edit distance filter criteria.

Return type

pandas.DataFrame

See also

std_lev

Function that compares two strings (usually names) and returns the standardised Levenshtein edit distance score, between 0 and 1.

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'name_1': ['CHARLES', None, 'C', 'CHRLEI', 'CH4RL1E'],
...                    'name_2': ['CHARLIE', 'CHARLIE', 'CHARLIE', 'CHARLIE',
...                               'CHARLIE']})
>>> df.head(n=5)
    name_1   name_2
0  CHARLES  CHARLIE
1     None  CHARLIE
2        C  CHARLIE
3   CHRLEI  CHARLIE
4  CH4RL1E  CHARLIE
>>> df = std_lev_filter(df, column1='name_1', column2='name_2', threshold=0.60)
>>> df.head(n=5)
    name_1   name_2
0  CHARLES  CHARLIE
1  CH4RL1E  CHARLIE

src.pes_match.parameters module

Module contents