Text can be beautiful

Identifying company commitments with SpaCy

To understand what companies are actively doing, and what they are committing to do, we need an intelligent way of identifying such commitments in each Modern Slavery return.

A typical return includes a lot of irrelevant information, such as background on the company and on the Modern Slavery Act. Thankfully, the SpaCy NLP library has powerful matching features that let us filter this out.

The problem with text matching is that it can quickly become burdensome, even when using techniques like regular expressions. The difficulty lies in the number of different phrasings you need to consider to find even a simple pattern. For example, we are interested in identifying phrases which contain statements like:

"We are committed to..."

However, the phrases below would be of interest as well. How can we include these in our analysis without having to write code for every example?

"We promise to"
"We have committed to"
"We will continue to"
"[COMPANY NAME] has committed to"
"[COMPANY NAME] has implemented"
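To make the maintenance burden concrete, here is a minimal sketch using Python's built-in `re` module. The company name "Acme Ltd" is a made-up placeholder and both patterns are illustrative only; they are not part of the original analysis.

```python
import re

# Phrases from the examples above; "Acme Ltd" is a hypothetical company name.
phrases = [
    "We are committed to",
    "We promise to",
    "We have committed to",
    "We will continue to",
    "Acme Ltd has committed to",
    "Acme Ltd has implemented",
]

# A first attempt that only covers the original wording.
simple = re.compile(r"\bwe are committed to\b", re.IGNORECASE)

# Covering the other wordings means listing each alternative by hand.
extended = re.compile(
    r"\b(?:[Ww]e|[A-Z]\w+(?: [A-Z]\w+)*) "
    r"(?:are committed to|promise to|have committed to|will continue to"
    r"|has committed to|has implemented)\b"
)

simple_hits = [p for p in phrases if simple.search(p)]
extended_hits = [p for p in phrases if extended.search(p)]
print(len(simple_hits), len(extended_hits))
```

The extended pattern catches every phrase in the list, but each new wording needs another hand-written alternative, which is the problem the approach in the next section avoids.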

Part of Speech matching

The matching engine in SpaCy allows you to use Part of Speech (POS) tags to match phrases to a specific pattern. For example, rather than searching for specific words, we could filter for a sequence of POS tags:

PRON, VERB, VERB

This pattern identifies the following phrase in a snippet from a Modern Slavery return:

Even using a very simple POS filter, we can identify phrases which denote commitments made by businesses in their Modern Slavery returns. The match here is highlighted in yellow.
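The idea behind matching on a POS sequence can be sketched in plain Python. The tokens below are tagged by hand purely for illustration (in practice SpaCy assigns the tags), and `find_pos_sequence` is a hypothetical helper, not part of SpaCy.

```python
# Hand-tagged tokens for illustration; a real pipeline would assign these tags.
tagged = [
    ("We", "PRON"), ("have", "VERB"), ("committed", "VERB"), ("to", "PART"),
    ("auditing", "VERB"), ("all", "DET"), ("suppliers", "NOUN"), (".", "PUNCT"),
]

def find_pos_sequence(tokens, pattern):
    """Return (start, end) index pairs where consecutive POS tags equal pattern."""
    tags = [pos for _, pos in tokens]
    n = len(pattern)
    return [(i, i + n) for i in range(len(tags) - n + 1) if tags[i:i + n] == pattern]

matches = find_pos_sequence(tagged, ["PRON", "VERB", "VERB"])
matched_text = [" ".join(word for word, _ in tagged[s:e]) for s, e in matches]
print(matched_text)
```

Because the pattern looks at tags rather than words, "We have committed" and "We will continue" would both match without any extra rules. Note that the tags a model assigns can differ between versions; newer SpaCy models may label auxiliaries such as "have" as AUX rather than VERB, in which case the pattern would need adjusting.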

SpaCy even provides an online tool for helping to build and review the results of different rules:

SpaCy’s rule based matcher

It does not take long at all to create a set of rules which produce good results. The code below implements these rules and returns the whole sentence wherever a match is found:

from spacy.matcher import Matcher

# nlp is the SpaCy pipeline (with a POS tagger) loaded earlier in the analysis
matched_sents = []  # Collects one displaCy-style dict per match

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append mock entity for match in displaCy style to matched_sents.
    # Get the match span by offsetting the start and end of the span with the
    # start of the sentence in the doc
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

matcher = Matcher(nlp.vocab)
# This type of pattern matching requires SpaCy 2.1 or later.
# The add() calls below use the SpaCy 2.x signature; SpaCy 3.x expects
# matcher.add("commit", [pattern], on_match=collect_sents) instead.

# Rule 1: subject, then a verb that is not a hedging modal, then a verb or determiner
pattern = [{'POS': {'IN': ['PROPN', 'PRON']}, 'LOWER': {'NOT_IN': ['they', 'who', 'you', 'it', 'us']}},
           {'POS': 'VERB', 'LOWER': {'NOT_IN': ['may', 'might', 'could']}},
           {'POS': {'IN': ['VERB', 'DET']}, 'LOWER': {'NOT_IN': ['a']}}]
matcher.add("commit", collect_sents, pattern)
# Rule 2: subject, verb, adjective, adposition (e.g. "We are committed to")
pattern = [{'POS': {'IN': ['PROPN', 'PRON']}, 'LOWER': {'NOT_IN': ['they', 'who', 'you', 'it', 'us']}},
           {'POS': 'VERB', 'LOWER': {'NOT_IN': ['may', 'might', 'could']}},
           {'POS': 'ADJ'},
           {'POS': 'ADP'}]
matcher.add("commit", collect_sents, pattern)
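The offset arithmetic inside collect_sents can be checked with plain strings. This is only a sketch: the text is invented, and the three `*_char` variables stand in for the `sent.start_char`, `span.start_char` and `span.end_char` attributes that SpaCy would supply.

```python
# A tiny two-sentence "document"; the text is made up for illustration.
doc_text = "This statement covers our group. During the year we promised to audit all suppliers."
sent_text = "During the year we promised to audit all suppliers."
match_text = "we promised"

# Character offsets measured from the start of the whole document.
sent_start_char = doc_text.index(sent_text)
span_start_char = doc_text.index(match_text)
span_end_char = span_start_char + len(match_text)

# Same subtraction as in collect_sents: make the offsets sentence-relative.
ent = {
    "start": span_start_char - sent_start_char,
    "end": span_end_char - sent_start_char,
    "label": "MATCH",
}
highlighted = sent_text[ent["start"]:ent["end"]]
print(ent, highlighted)
```

Slicing the sentence with the relative offsets recovers exactly the matched phrase, which is what a highlighter needs. Dictionaries of this shape should be usable with displaCy's manual mode, for example `displacy.render(matched_sents, style="ent", manual=True)`, to produce highlighted sentences.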

How are different industries tackling Modern Slavery?

Now that we have a set of statements filtered for commitments and actions from company submissions, what can this tell us about how different industries are responding to Modern Slavery?

For this analysis we will use the fantastic ScatterText library developed by Jason Kessler.

This uses a simple, yet powerful approach to find key words and phrases which separate two categories of text. The results can then be output easily into an interactive visualisation.

The code below filters our Modern Slavery returns to two high-risk industries: Construction and Retail. It then creates a corpus of text for use in the Scattertext visualisation:

import scattertext as st

# Select industries to compare:
ind1 = 'Specialty Retail'
ind2 = 'Construction & Engineering'
# Filter into a new df with 3 columns: one for industry, one for company and the third containing the text
ftr = (df['Industry'] == ind1) | (df['Industry'] == ind2)
df_corp = df.loc[ftr]
df_corp = df_corp[['Industry', 'Company', 'clean text']]
# Create a scattertext corpus from the df:
corpus = st.CorpusFromPandas(df_corp,
                             category_col='Industry',
                             text_col='clean text',
                             nlp=nlp).build()

Once this is complete, we can run the below to create an interactive scattertext output:

from IPython.display import HTML

html = st.produce_scattertext_explorer(corpus,
                                       category=ind2,
                                       category_name=ind2,
                                       not_category_name=ind1,
                                       width_in_pixels=1600)
# Save the interactive chart to disk, then display it inline in the notebook
with open("MS-Visualization.html", 'wb') as f:
    f.write(html.encode('utf-8'))
HTML(html)

This produces the following output:

This plots the distribution of words across the two categories (in this case the Retail and Construction industries). Words at the top right are common in both categories, while words at the bottom left are uncommon in both.

It is the words at the top left and bottom right which show the key differences between the two industries in their approach to combating Modern Slavery. Clicking on a word reveals where it has been used within the corpus. This is useful to find the context and the reasons why certain words and phrases occur within one industry and not another. The full output is available to download at the end of this article.
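The intuition behind the chart can be sketched with simple word counts. The statements below are invented for illustration, and the thresholds are arbitrary; Scattertext applies its own frequency scaling and scoring, which this simplified sketch does not reproduce.

```python
from collections import Counter

# Toy statements, invented for illustration; not taken from real returns.
construction = [
    "we audit subcontractors and train line managers",
    "our iso processes cover subcontractors",
]
retail = [
    "we audit suppliers in high risk countries",
    "our suppliers are mapped beyond the first tier",
]

def word_counts(texts):
    """Count how often each whitespace-separated word occurs across texts."""
    return Counter(word for text in texts for word in text.split())

cons_counts = word_counts(construction)
retail_counts = word_counts(retail)
vocab = set(cons_counts) | set(retail_counts)

# One (x, y) point per word: its frequency in each category.
points = {w: (retail_counts[w], cons_counts[w]) for w in vocab}

# Words frequent in one category and absent from the other land in the
# off-diagonal corners, which is where the two industries differ.
construction_only = sorted(w for w, (x, y) in points.items() if y >= 2 and x == 0)
retail_only = sorted(w for w, (x, y) in points.items() if x >= 2 and y == 0)
shared = sorted(w for w, (x, y) in points.items() if x > 0 and y > 0)
print(construction_only, retail_only, shared)
```

Even on this toy data, "subcontractors" and "suppliers" separate the two categories, while words such as "we" and "audit" appear in both and carry little distinguishing information.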

After just a few minutes of analysis, it is easy to find significant differences in the way the two industries are approaching the issue of Modern Slavery (items in bold represent the words on the chart which have been analysed):

Construction

  • The construction industry already has regulation in place regarding quality management (ISO 9001) and environmental management systems (ISO 14001). Companies are leveraging processes put in place by these standards to help combat modern slavery risks.
  • Although the industry is aware that subcontractors pose a risk, little is currently being done to implement checks or controls on them.
  • It places greater emphasis on its internal workforce. Responsibility is placed with the HR department and line managers to put processes in place to reduce risk.

Retail

  • The retail industry is more externally facing in its approach, placing importance on audits performed with suppliers at high-risk locations (with India, China and Turkey often being categorised as high-risk countries).
  • In retail, more focus is placed on the supply chain and mapping beyond direct suppliers to understand what lies below the first tier of the supply network. It is clear that some companies have made more progress in this area than others.

Closing thoughts:

The value of being able to look across thousands of documents and instantly understand trends across industries is huge. It can be used to:

  • highlight best practice;
  • help to bring innovation from one industry to others; and
  • identify where not enough is being done to prevent Modern Slavery.

Hopefully this article has been helpful in showing that some simple but powerful NLP and visualisation techniques can unlock insights that are otherwise hidden within unstructured data.