Scraping the Top 5 Tech Company Job Boards

How to scrape Facebook’s job board, along with Apple, Amazon, Google and Netflix.

Gustave Caillebotte [Public domain]

In this project, I wanted to scrape the job search results from Apple, Amazon, Facebook, Google, and Netflix to help expedite my job search. It is tedious to visit each site and collect the job results for different cities, so I figured I would automate it. The catch is that I had never scraped data off a site before, but I really wanted to try the tools available for it. I do realize that these sites probably don’t want you scraping their data, so do this at your own risk. This is the long but fun journey of figuring out how to do it.

Also, to give credit where credit is due, I learned a lot of this from the data science program over at dataquest.io. Check them out because it’s a really great program.

This project was supposed to be an easy exercise in web scraping using Beautiful Soup and dumping the data into a simple Excel file. It turned out not to be that kind of project.

Since there are five sites, I will only go over Facebook here. It was a little more difficult at the end than some of the others, and covering all five would turn this post into something as long as a CVS coupon printout and bore you. I am working on a separate post for Amazon’s code, since that site couldn’t be scraped using Beautiful Soup and had many blocking issues, and I should have that up soon.

The Python code for Amazon, Apple, Google, Netflix and Facebook can be found on my GitHub page here.

To start, each of these sites had its own issues, which prevented me from creating one big script to run on all five. So I created one script per site, which I can still run all at once. For Facebook, Beautiful Soup ran well, but as you will see, it grouped multiple cities in one cell for the same job. I wanted each job to be unique with its own city, and that took a while to figure out. For Amazon, there are just so many jobs to scrape that I kept getting blocked after a while, but I figured that out too. Of course, I am just beginning, so there must be better ways to do this, and I am all for feedback to make it better.

Okay, that is enough talk. Let’s look at some code.

Scraping Facebook’s Job Search Page

To begin, here’s the full code I used for Facebook:

Modules to Import

These are the modules I had to import to make it work:

from time import time
from datetime import datetime
from time import sleep
from random import randint
from requests import get
from IPython.display import clear_output
from warnings import warn
import lxml
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np
import tempfile

The Job Search URL

When searching for jobs on Facebook, you can search by city. Here is the URL used for the cities I needed. There is probably a cleaner way to build this from parameters for each city instead of hard-coding it, but I only needed it to see how many pages to scrape. Plus, I found it easier to iterate through one URL than to create parameters for each city separately in another for loop.

https://www.facebook.com/careers/jobs?page=1&results_per_page=100&locations[0]=Fremont%2C%20CA&locations[1]=Northridge%2C%20CA&locations[2]=San%20Francisco%2C%20CA&locations[3]=Santa%20Clara%2C%20CA&locations[4]=Mountain%20View%2C%20CA&locations[5]=Los%20Angeles%2C%20CA&locations[6]=Seattle%2C%20WA&locations[7]=Woodland%20Hills%2C%20CA&locations[8]=Sunnyvale%2C%20CA&locations[9]=Menlo%20Park%2C%20CA&locations[10]=Redmond%2C%20WA#search_result
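
One possible way to avoid hard-coding the query string is to build it from a list of cities. The sketch below is an assumption about how that could look, not the code used in this project: build_search_url and its parameters are hypothetical names, and it relies only on urllib.parse from the standard library. It reproduces the locations[0], locations[1], ... pattern and the %2C%20 encoding seen in the URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.facebook.com/careers/jobs"


def build_search_url(page, cities, results_per_page=100):
    """Build a job search URL for one results page (hypothetical helper)."""
    params = [("page", page), ("results_per_page", results_per_page)]
    # One indexed locations[i] parameter per city, as in the URL above.
    params += [(f"locations[{i}]", city) for i, city in enumerate(cities)]
    # quote_via=quote encodes spaces as %20 (not +); safe="[]" keeps the
    # square brackets in the parameter names unescaped.
    query = urlencode(params, safe="[]", quote_via=quote)
    return f"{BASE_URL}?{query}#search_result"


cities = ["Fremont, CA", "Seattle, WA", "Menlo Park, CA"]
print(build_search_url(1, cities))
```

With a helper like this, only the page argument changes from one request to the next, so the city list lives in one place.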

The main Function

I created a main function that sets up the pages (requests) to make within a known page range. You have to look at the job results from the URL above to determine how many pages you will need to scrape. At the time of this writing, there are 19 pages of search results at 100 results per page, so that’s almost 1,900 job search results.

def main():
    # Set the number of pages to scrape: pages 1 through 19
    # (range() stops before its second argument)
    pages = [str(i) for i in range(1, 20)]

Monitoring the loop while it’s working

From this main function, a pages variable is created and passed to the get_all_jobs(pages) function defined at the beginning of the code. That function monitors the frequency of the requests and shows how fast they are being sent to the site.

The requests variable will count the number of requests through each iteration of pages.

The start_time variable stores the start time, using the time function from the time module that was imported.

The value returned by datetime.now() is stored in the total_runtime variable and used to show how long it took to run the entire script.

For each page, the URL of that results page is requested and the result is assigned to the response variable.

Here, on each iteration, the requests counter is increased by one.

# Monitor the frequency of requests
requests += 1

This next section pauses the loop for 8 to 15 seconds and displays the number of requests, how fast the requests are being sent, and the total run time so far.

  • The sleep function pauses for a random 8 to 15 seconds
  • The result of the time function is assigned to the current_time variable
  • The starting time of the loop is subtracted from the current time and the result is assigned to the elapsed_time variable
  • The variables are printed out to display the results
  • The clear_output function clears the output after each iteration and replaces it with information about the most recent request. This shows each result one at a time instead of a long list of results.

# Pause the loop for 8-15 seconds and mark the elapsed time
sleep(randint(8, 15))
current_time = time()
elapsed_time = current_time - start_time
print("Facebook Request:{}; Frequency: {} request/s; Total Run Time: {}"
      .format(requests, requests / elapsed_time,
              datetime.now() - total_runtime))
clear_output(wait=True)

The request frequency displayed when the script is executed.

Here, we check whether the request was successful. A 200 status code is what we want, but if a non-200 code comes back, a warning is raised.

# Throw a warning for non-200 status codes
if response.status_code != 200:
    warn("Request: {}; Status code: {}"
         .format(requests, response.status_code))

For the last ‘if’ statement below, if the number of requests exceeds the number expected, the loop will break. As long as we stay within the number of requested pages, we should be fine. In this case, we are making 19 requests because there are 19 pages.

The yield from get_job_infos(response) at the bottom passes every row produced by get_job_infos on to whatever is iterating over get_all_jobs(pages).

# Set page requests. Break the loop if the number of requests is
# greater than expected
if requests > 19:
    warn("Number of requests was greater than expected.")
    break
yield from get_job_infos(response)
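
The fragments in this section can be assembled into a single generator. The following is a minimal sketch under stated assumptions rather than the original function: the fetch, parse, pause, and max_requests parameters and the FakeResponse class are hypothetical names added so the loop can run offline, and clear_output is left out to keep the example limited to the standard library.

```python
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn


def get_all_jobs(pages, fetch, parse, pause=lambda: sleep(randint(8, 15)),
                 max_requests=19):
    """Yield job rows page by page while reporting request frequency.

    fetch(page) must return an object with a status_code attribute and
    parse(response) must return an iterable of rows. Both are hypothetical
    stand-ins for the real HTTP call and for get_job_infos.
    """
    requests = 0
    start_time = time()
    total_runtime = datetime.now()

    for page in pages:
        response = fetch(page)
        requests += 1

        # Wait between requests, then report how fast they are going out.
        pause()
        elapsed_time = max(time() - start_time, 1e-9)  # never divide by 0
        print("Request: {}; Frequency: {:.3f} request/s; Total Run Time: {}"
              .format(requests, requests / elapsed_time,
                      datetime.now() - total_runtime))

        if response.status_code != 200:
            warn("Request: {}; Status code: {}"
                 .format(requests, response.status_code))

        if requests > max_requests:
            warn("Number of requests was greater than expected.")
            break

        yield from parse(response)


class FakeResponse:
    """Hypothetical stand-in for a real response object."""
    status_code = 200

    def __init__(self, page):
        self.page = page


# Offline run: no network access and no waiting between "requests".
rows = list(get_all_jobs(["1", "2"], fetch=FakeResponse,
                         parse=lambda response: [("job on page " + response.page,)],
                         pause=lambda: None))
```

Passing the fetch and pause steps in as arguments is only a convenience for trying the loop without hitting the site; the real script would call get with the page URL and keep the random sleep.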

Scraping with Beautiful Soup

In the get_job_infos(response) function, Beautiful Soup is used to parse html tags from the response variable.

  • The BeautifulSoup constructor takes in the response variable as text and uses lxml to parse the html.
  • It then finds all a tags that have the _69jm class and puts them into the job_containers variable.
  • A for loop iterates through job_containers, finds more values from other tags and classes, and puts those values into the variables for site, title, location, and job link.

def get_job_infos(response):
    page_soup = BeautifulSoup(response.text, "lxml")
    job_containers = page_soup.find_all("a", "_69jm")

    for container in job_containers:
        site = page_soup.find("title").text
        title = container.find("div", "_69jo").text
        location = container.find("div", "_1n-z _6hy- _21-h").text
        job_link = "https://www.facebook.com" + container.get("href")

        yield site, title, location, job_link

Writing to a temp file

These last couple of sections were what really stumped me, but after an awful lot of hours searching Stack Overflow and Google, I finally figured it out. Well, I figured out one way of doing it.

This creates a temporary CSV file with a header row and writes the output from the get_all_jobs(pages) generator function to this temporary file. We need a temporary file to avoid creating two CSV files in the end. This temp file will be read into a pandas data frame since, as far as I know, you cannot create a data frame from a generator. Please let me know if I am far off on this. As you’ll see in the following steps, the temp file is only an intermediate step and you’re left with the pandas data frame to work with. Note that because delete=False is passed, the file itself stays on disk until something removes it.

# Writes to a temp file
with tempfile.NamedTemporaryFile(mode='w+', delete=False, newline='',
                                 encoding="utf-8") as temp_csv:
    writer = csv.writer(temp_csv)
    writer.writerow(["Website", "Title", "Location", "Job URL"])
    writer.writerows(get_all_jobs(pages))
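
To see the temp-file round trip in isolation, here is a small standard-library sketch. It is an illustration under assumptions, not the project code: rows_via_temp_csv and fake_jobs are hypothetical names, csv.DictReader stands in for pd.read_csv, and the explicit os.remove is added because a file created with delete=False is not cleaned up automatically.

```python
import csv
import os
import tempfile


def rows_via_temp_csv(rows, header):
    """Write rows (any iterable, e.g. a generator) to a temp CSV, read them
    back as dicts, then delete the file. Hypothetical helper."""
    with tempfile.NamedTemporaryFile(mode="w+", delete=False, newline="",
                                     encoding="utf-8") as temp_csv:
        writer = csv.writer(temp_csv)
        writer.writerow(header)
        writer.writerows(rows)  # consumes the generator row by row
    try:
        with open(temp_csv.name, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    finally:
        os.remove(temp_csv.name)  # delete=False means cleanup is manual


def fake_jobs():
    """Hypothetical generator standing in for get_all_jobs(pages)."""
    yield "Example Site", "Engineer", "Seattle, WA", "https://example.com/1"


records = rows_via_temp_csv(fake_jobs(),
                            ["Website", "Title", "Location", "Job URL"])
print(records[0]["Location"])
```

If the goal is only a data frame, collecting the generator into a list first, as in pd.DataFrame(list(get_all_jobs(pages)), columns=[...]), should also work and would skip the temp file entirely.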

Separating Cities Into Their Own Cell

The biggest problem I ran across was figuring out how to parse multiple cities listed on each row for the same job. I needed each row to be a single job in a single city. This meant duplicating each row once for every city listed in its location column and showing one city per row.

Since I had a little experience with the pandas library, I found it easier to put the data into a data frame and parse out the cities by doing the following:

  • Read the temp CSV file into the fb_df variable.
  • Take that variable and isolate the location column. Then use a regular expression to split the cities at each spot where it finds a comma followed by a space, a capital letter, and then a lowercase letter, like so: New York, NY, Seattle, WA, Menlo Park, CA. It will split on the commas right after ‘NY’ and ‘WA’.
  • The expand parameter is set to True in order to put each city in its own cell.
  • The pandas add_prefix function labels the columns that have cities with ‘city_’ followed by the column number that city is in.
  • The fillna function fills any empty cell left in the city columns after the split with a NaN value, in case there are any.

# Reads the temp file into a data frame for output to csv file
fb_df = pd.read_csv(temp_csv.name)
fb_df = fb_df.join(fb_df['Location']
                   .str.split(r'\W+(?=\s[A-Z][a-z])', expand=True)
                   .add_prefix('city_').fillna(np.nan))

  • Set the data frame index using the first four columns. These four columns are Website, Title, Location, and Job URL.
  • Then use the stack function to duplicate the rows based on the number of cities split.
  • Reset the index.
  • Copy the split cities into the location column and strip off any white space.

fb_df = fb_df.set_index(list(fb_df.columns.values[0:4])).stack()
fb_df = fb_df.reset_index()
fb_df['Location'] = fb_df.iloc[:, -1].str.strip()

Finally, drop the last two columns and write to a new CSV file without an index.

fb_df.drop(fb_df.columns[[-1, -2]], axis=1, inplace=True)
fb_df.to_csv('facebook_jobs.csv', index=False)
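
To check what the split pattern does without pandas, the same idea can be tried in plain Python. This is a sketch that swaps the join and stack steps for a simple loop, so it is a stand-in for the data frame approach rather than a copy of it; split_cities, one_row_per_city, and the sample row are hypothetical.

```python
import re

# Split on punctuation that is followed by a space, a capital letter and a
# lowercase letter, i.e. the comma just before the next city name.
CITY_BOUNDARY = re.compile(r"\W+(?=\s[A-Z][a-z])")


def split_cities(location):
    """Return the individual 'City, ST' entries in a location string."""
    return [city.strip() for city in CITY_BOUNDARY.split(location)]


def one_row_per_city(rows):
    """Yield a copy of each (site, title, location, link) row per city."""
    for site, title, location, link in rows:
        for city in split_cities(location):
            yield site, title, city, link


# Made-up sample data for illustration only.
jobs = [("Example Site", "Engineer",
         "New York, NY, Seattle, WA, Menlo Park, CA",
         "https://example.com/1")]
for row in one_row_per_city(jobs):
    print(row)
```

The comma between a city and its two-letter state is left alone because the state abbreviation has no lowercase letter after its first capital. Within pandas, DataFrame.explode may be a shorter route to the same one-row-per-city result than set_index and stack.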

This file can be executed on its own since it calls the main function but I’ll be importing this into another file and running it from there along with the other scraping files to make one big CSV file with search results from each site.
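
Combining the per-site files into one big CSV could be sketched as follows. This assumes every scraper writes the same header row; combine_csv_files and any file names other than facebook_jobs.csv are hypothetical.

```python
import csv


def combine_csv_files(paths, out_path):
    """Concatenate CSV files that share a header into one file."""
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)  # keep only the first header
                    header_written = True
                writer.writerows(reader)  # copy the remaining data rows
```

A call such as combine_csv_files(["facebook_jobs.csv", "google_jobs.csv"], "all_jobs.csv") would then produce the combined file, where google_jobs.csv and all_jobs.csv are placeholder names.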

I’m always open to feedback, so let me know how to make this better. Also, I’d like to find a way to analyze this data somehow. I could probably map all the jobs geographically but there must be other things to analyze as well.

Heck, maybe there’s an API to get this info to begin with but it was a great challenge anyway.