Look-alike models for Social Good – Servian

Background

Community Hubs Australia (CHA) is a not-for-profit organisation that aims to help migrant and refugee families, especially mothers with young children, to connect, learn and receive health, education and settlement support. The primary unit of impact is the hub: a school where CHA sessions are run.

The goal of my analysis is to make it easier to select schools for new hubs by identifying Australian schools that are similar to existing CHA hubs. I will use some basic look-alike modelling for this purpose. The problem can be approached in two ways: unsupervised and supervised. However, there are no true negative instances: a school without a hub is not necessarily unsuitable for one. Hence a supervised method is not a good fit. To keep things simple, we will measure the “similarity” of every school to each of the current hubs, sum these similarities and rank the schools in decreasing order of total similarity.

Diving in

As with any data science problem, the features used to measure similarity are chosen based on the business problem. I looked for features that could identify schools in LGAs (Local Government Areas) with more migrant and refugee families (generally non-English speaking), especially those with young children (census data, early childhood development).

I narrowed my search to these data sources:

  • ACARA school profiles, with information such as enrolments (girls, boys, total), staff (full-time, teaching, non-teaching), the SEA (Socio-Educational Advantage) quarters of the students, language background other than English, and the type (primary, secondary, combined) and location of schools
  • Australian PHIDU Social Atlas for Local Government Areas (LGA) with indicators about Families, Migrant Statistics (Skilled, Humanitarian, Family), Birthplace (Non-English Speaking Countries), Early Childhood Development, Child care, Income Support

Preprocessing

The hubs data contains the school name, latitude and longitude, so the first step is to merge it with the ACARA schools data to obtain further information such as enrolments and staff. I used the Google Maps Reverse Geocoding API to derive the address (postcode, state, suburb etc.) from the latitude and longitude. This list was then joined with the schools on postcode and school name. The process identified most of the schools associated with the hubs. Some hubs were still not identified because of mismatches in school names (e.g. words like “primary” or the suburb name may be present in one name but not the other).
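The join on postcode and school name can be sketched as follows. This is a minimal illustration rather than the code used in the analysis: the field names, the `GENERIC_WORDS` list, the helper names and the records are all hypothetical, and normalising names before joining is shown as one possible way to reduce the mismatches mentioned above.

```python
import re

# Hypothetical list of generic words to ignore when comparing names.
GENERIC_WORDS = {"primary", "school", "college", "ps"}

def name_key(name, suburb=""):
    """Lower-case a name and drop generic words and suburb words.

    Falls back to the full token list if nothing would remain.
    """
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    drop = GENERIC_WORDS | set(re.findall(r"[a-z0-9]+", suburb.lower()))
    kept = [t for t in tokens if t not in drop]
    return " ".join(kept or tokens)

def match_hubs(hubs, schools):
    """Join hubs to schools on (postcode, normalised name).

    Returns (matched, unmatched): matched maps hub name -> school id,
    unmatched lists the hub names that found no school.
    """
    index = {
        (s["postcode"], name_key(s["name"], s["suburb"])): s["school_id"]
        for s in schools
    }
    matched, unmatched = {}, []
    for hub in hubs:
        key = (hub["postcode"], name_key(hub["name"], hub["suburb"]))
        if key in index:
            matched[hub["name"]] = index[key]
        else:
            unmatched.append(hub["name"])
    return matched, unmatched

# Made-up records, for illustration only.
schools = [
    {"school_id": 1, "name": "Sampleton Heights Primary School",
     "suburb": "Sampleton", "postcode": "3000"},
    {"school_id": 2, "name": "St Example School",
     "suburb": "Testville", "postcode": "3001"},
]
hubs = [
    {"name": "Heights Primary", "suburb": "Sampleton", "postcode": "3000"},
    {"name": "St Example Primary School", "suburb": "Testville", "postcode": "3001"},
    {"name": "Unknown Hub", "suburb": "Testville", "postcode": "3001"},
]
matched, unmatched = match_hubs(hubs, schools)
```

Hubs left in `unmatched` would still need to be resolved by hand or with fuzzier matching.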

ACARA school profiles track year-on-year changes, so the dataset is filtered to keep the latest profile for every school. Some additional features were created: girls % enrolment, boys % enrolment, start and end year levels (e.g. Prep, K, U), and teaching and non-teaching staff relative to enrolment, the last because hubs are operated with the help of volunteers and school staff. School names were also cleaned (suburbs and words such as primary/east/west removed), since it might be easier for CHA to approach other branches of existing hubs.
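A small sketch of the "latest profile" filter and the derived ratio features is given below. The column names and figures are assumptions made for illustration; the real ACARA field names will differ.

```python
def latest_profiles(profiles):
    """Keep the row with the highest 'year' for each school_id."""
    latest = {}
    for row in profiles:
        current = latest.get(row["school_id"])
        if current is None or row["year"] > current["year"]:
            latest[row["school_id"]] = row
    return list(latest.values())

def add_ratio_features(row):
    """Return a copy of row with enrolment shares and staff-per-student ratios."""
    out = dict(row)
    total = row["total_enrolments"]
    if total > 0:  # skip the ratios for schools with no recorded enrolments
        out["girls_pct"] = 100.0 * row["girls_enrolments"] / total
        out["boys_pct"] = 100.0 * row["boys_enrolments"] / total
        out["teaching_staff_per_student"] = row["teaching_staff"] / total
        out["non_teaching_staff_per_student"] = row["non_teaching_staff"] / total
    return out

# Made-up profiles: school 1 has two yearly snapshots, school 2 has one.
profiles = [
    {"school_id": 1, "year": 2016, "total_enrolments": 180,
     "girls_enrolments": 90, "boys_enrolments": 90,
     "teaching_staff": 12, "non_teaching_staff": 4},
    {"school_id": 1, "year": 2017, "total_enrolments": 200,
     "girls_enrolments": 110, "boys_enrolments": 90,
     "teaching_staff": 10, "non_teaching_staff": 5},
    {"school_id": 2, "year": 2017, "total_enrolments": 400,
     "girls_enrolments": 200, "boys_enrolments": 200,
     "teaching_staff": 25, "non_teaching_staff": 10},
]
features = [add_ratio_features(r) for r in latest_profiles(profiles)]
```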

Since the PHIDU datasets are at LGA level, geographic boundaries were fetched from data.gov.au in order to merge them with the schools/hubs information. The geopandas library was used for all geospatial analysis. The latitude and longitude of all ACARA schools were fetched using the Google Maps Geocoding API (note: we have latitude/longitude for the hubs but not for all schools). Finally, the datasets were merged using the spatial join (sjoin) functionality of geopandas.
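To show what the spatial join does conceptually, here is a dependency-free stand-in for it: a ray-casting point-in-polygon test that assigns each school to the LGA whose boundary contains it. This is not the geopandas code used in the analysis. The function names, square "LGAs" and coordinates are made up, and the sketch ignores things a real spatial join handles, such as multi-part polygons, holes and coordinate reference systems.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside polygon.

    polygon is a list of (lon, lat) vertices in order around the boundary.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_lga(schools, lga_boundaries):
    """Attach the first LGA containing each school (None if there is none)."""
    joined = []
    for school in schools:
        lga = next(
            (name for name, polygon in lga_boundaries.items()
             if point_in_polygon(school["lon"], school["lat"], polygon)),
            None,
        )
        joined.append({**school, "lga": lga})
    return joined

# Made-up square boundaries and school locations.
lga_boundaries = {
    "LGA A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "LGA B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
schools = [
    {"name": "School 1", "lon": 0.5, "lat": 0.5},
    {"name": "School 2", "lon": 1.5, "lat": 0.25},
    {"name": "School 3", "lon": 5.0, "lat": 5.0},
]
joined = assign_lga(schools, lga_boundaries)
```

Once each school carries an LGA name, the LGA-level PHIDU indicators can be attached with an ordinary key-based merge.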

Modelling

To measure similarity we need a distance metric. Gower distance is chosen for this analysis because it handles both numerical features (e.g. enrolment, PHIDU indicators, staff) and categorical features (e.g. suburb, postcode, school name).

Gower dissimilarity is calculated as the average of the per-feature dissimilarities. If a feature (f) is numerical, its dissimilarity is the absolute difference between the two values divided by the range of the feature. If it is categorical, the dissimilarity is 0 when the hub and the school have the same value and 1 otherwise. Similarity is one minus dissimilarity.

This similarity is calculated for every hub-school pair and, for each school, summed over all hubs; sorting the totals gives a ranked list of schools.

Since Gower similarity is an average over features, every feature has an equal impact. Some features might not be relevant, however, and can degrade the ranking, so feature selection is important. For this analysis, we use information value and weight of evidence to determine which features are most important in defining a “hub-worthy” school. The top 20 features:
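For reference, information value for a single binned feature can be computed as in the sketch below, using the standard weight-of-evidence definition. It makes two assumptions the article does not spell out: non-hub schools are treated as the comparison group (even though they are not true negatives), and features are already binned into categories. The `smoothing` parameter is added here only to avoid taking the log of zero.

```python
import math
from collections import Counter

def information_value(values, is_hub, smoothing=0.5):
    """Information value of a binned feature for separating hubs from other schools.

    values: bin label for each school; is_hub: parallel list of booleans.
    smoothing is added to every count so empty cells do not produce log(0).
    """
    hub_counts = Counter(v for v, h in zip(values, is_hub) if h)
    other_counts = Counter(v for v, h in zip(values, is_hub) if not h)
    bins = set(values)
    hub_total = sum(hub_counts.values()) + smoothing * len(bins)
    other_total = sum(other_counts.values()) + smoothing * len(bins)
    iv = 0.0
    for b in bins:
        p_hub = (hub_counts[b] + smoothing) / hub_total
        p_other = (other_counts[b] + smoothing) / other_total
        woe = math.log(p_hub / p_other)  # weight of evidence for this bin
        iv += (p_hub - p_other) * woe
    return iv

# Made-up data: three hubs followed by six other schools.
is_hub = [True, True, True, False, False, False, False, False, False]
informative = ["high", "high", "high", "low", "low", "low", "low", "low", "high"]
uninformative = ["a", "b", "a", "b", "a", "b", "a", "b", "a"]

iv_informative = information_value(informative, is_hub)
iv_uninformative = information_value(uninformative, is_hub)
```

Features would then be sorted by this value, highest first, before the selection step.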

Our goal is a ranking that surfaces the existing hubs as early as possible among all schools. In other words, we aim to maximise recall (sensitivity) at K: the proportion of hubs that appear in the top K schools. I set K to 1.5 × the number of hubs, roughly 105, to give the model ample opportunity to identify the hubs within the top 105 schools.
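Recall at K is simple to compute once a ranking exists. In the sketch below the function names and school identifiers are illustrative, and the hub count of 70 is only an example consistent with K being roughly 105; the article does not state the exact number.

```python
import math

def recall_at_k(ranked_schools, hubs, k):
    """Fraction of known hubs that appear among the first k ranked schools."""
    top_k = set(ranked_schools[:k])
    return len(top_k & set(hubs)) / len(hubs)

def choose_k(n_hubs, factor=1.5):
    """K as a multiple of the number of hubs, rounded up."""
    return math.ceil(n_hubs * factor)

# Made-up ranking of four schools, two of which are hubs.
ranked = ["s1", "s2", "s3", "s4"]
hubs = ["s1", "s4"]
recall_top2 = recall_at_k(ranked, hubs, 2)  # only s1 is in the top 2
recall_top4 = recall_at_k(ranked, hubs, 4)  # both hubs are in the top 4
k_example = choose_k(70)                    # example hub count, not from the data
```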

Finally, the model iterates through the features in order of information value and selects those that increase the recall. Two models were created: one with location (suburb/LGA) and school name information, and one without. If the business wants to venture into new suburbs that have no existing hubs, it can use the model without the location features. The algorithm selected the following features:
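This selection loop amounts to a greedy forward selection, sketched below. The `recall_for` callable stands in for "rank the schools with these features and compute recall at K"; here it is replaced by a made-up toy function so that the example runs on its own. Whether ties were kept or dropped in the original analysis is not stated; this sketch keeps a feature only if recall strictly improves.

```python
def forward_select(features_by_iv, recall_for):
    """Greedy forward selection over features sorted by information value.

    features_by_iv: feature names, most informative first.
    recall_for: callable taking a list of features and returning recall at K.
    """
    selected, best = [], 0.0
    for feature in features_by_iv:
        score = recall_for(selected + [feature])
        if score > best:  # keep the feature only if recall strictly improves
            selected.append(feature)
            best = score
    return selected, best

def toy_recall(features):
    """Made-up stand-in for the real 'rank schools, then compute recall' step."""
    gains = {"f1": 0.2, "f2": -0.05, "f3": 0.1}
    return sum(gains[f] for f in features)

selected, best = forward_select(["f1", "f2", "f3"], toy_recall)
```

In the toy run, `f2` is skipped because adding it lowers the score, while `f1` and `f3` are kept.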

Model with location/school name features (65% recall):

Suburb, school_name_processed, % Children developmentally vulnerable in communication domain

Model without location/school name features (30% recall):

% children in jobless families, Language Background Other Than English (%), % Permanent migrants under the Humanitarian Program (2000 to 2006), Pensioner Concession Card holders, Health Care Card holders, Jobless families with children under 15 years, Children developmentally vulnerable in language and cognitive domain, People receiving an unemployment benefit, People receiving an unemployment benefit for less than 6 months, School_Year_Start

We can check whether the ranking actually works by analysing whether the top 105 ranked schools behave similarly to the hubs on the selected features. We will analyse the model without location/school name information.

Feature summary for all schools:

Feature summary for the Top 105 ranked schools:

Feature summary for existing hubs:

As a final spot check, I compared the ranks of the existing hubs in Model 1 (with location/school name features) and Model 2 (without them) and verified that the hubs rank higher than most other schools.