If you like to travel, let Python help you scrape the best fares!

Well, every Selenium project starts with a webdriver. I’m using Chromedriver, but there are alternatives: PhantomJS and Firefox are also popular. After downloading it, place it in a folder and that’s it. These first lines will open a blank Chrome tab.

Please bear in mind I’m not breaking new ground here. There are way more advanced ways of finding cheap deals, but I want my posts to share something simple yet practical!

These are the packages I will use for the whole project. I’ll use randint to make the bot sleep a random number of seconds between each search. That is usually a must have feature for any bot. If you ran the previous code, you should have a Chrome window open, which is where the bot will navigate.
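A minimal sketch of that random pause could look like the block below. The helper name random_pause, the sleep_fn parameter and the 5 to 15 second range are assumptions made for this illustration, not values taken from the project.

```python
from random import randint
from time import sleep


def random_pause(low=5, high=15, sleep_fn=sleep):
    """Sleep for a random whole number of seconds between low and high.

    random_pause, low, high and sleep_fn are hypothetical names used for
    this illustration only.
    """
    seconds = randint(low, high)  # randint includes both endpoints
    sleep_fn(seconds)
    return seconds
```

Passing the sleep function in as a parameter is just a convenience: it lets you try the helper with a dummy function instead of really waiting.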

So let’s make a quick test and go to kayak.com in a different window. Select the cities you want to fly from and to, and the dates. When selecting the dates, make sure you select “+-3 days”. I have written the code with that results page in mind, so there is a high chance you will need to make a few adjustments if you want to search specific dates only. I’ll try to point out the changes throughout the text, but if you get stuck, shoot me a message in the comments.

Hit the search button and get the link in the address bar. It should look something like the link I use below, where I define the variable kayak as the url and execute the get method from the webdriver. Your search results should appear.
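Here is a rough sketch of that step. The path layout inside build_search_url is only an assumption for illustration (including the “flexible” suffix standing in for the +-3 days option), so treat the link you copied from your own address bar as the source of truth. The function and parameter names are hypothetical too; only the kayak variable and the webdriver get method come from the text.

```python
def build_search_url(city_from, city_to, date_start, date_end):
    """Fill a results-page link with the chosen cities and dates.

    ASSUMPTION: this path layout is an illustration only. Copy the real
    pattern from your own address bar after running a search.
    """
    return (
        "https://www.kayak.com/flights/"
        f"{city_from}-{city_to}/{date_start}-flexible/{date_end}-flexible"
    )


def open_search(driver, kayak):
    """Load the results page in an already open webdriver window."""
    driver.get(kayak)  # get() is the standard Selenium navigation call
```

With a Chrome window already open through Selenium, calling open_search(driver, kayak) on your copied link should bring up the same results page you saw in the browser.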

Whenever I used the get command more than two or three times in a few minutes, I would be presented with a reCaptcha check. You can solve the reCaptcha yourself and keep running your tests until the next one comes up. In my testing, the first search was always fine, so if you want to play with the code it’s really a matter of solving the puzzle yourself, and then leaving the script running by itself with long intervals between searches. You really don’t need 10 minute updates on those prices, do you?!

Every XPath has its puddle

So far we’ve opened a window and loaded a website. In order to start getting prices and other information, we have to use XPath or CSS selectors. I’ve chosen XPath and didn’t feel the need to mix it up with CSS, but it is perfectly possible to do so. Navigating webpages with XPath can be confusing. The method I described in the Instagram article, the “copy XPath” trick directly from the inspector view, works, but I realized it’s really not the optimal way to get to the elements you want. Sometimes the copied path is so specific that it quickly becomes obsolete. The book Web Scraping with Python does a phenomenal job explaining the basics of navigating with XPath and CSS selectors.

Moving on, let’s use Python to select the cheapest results. The red text in the code above is the XPath selector, and you can see it if you right click anywhere on the webpage and select “inspect”. Then right click again on the element whose code you want to see, and select “inspect” again.

To illustrate my previous observation on the shortcomings of copying the path from the inspector, consider these differences:

# This is what the copy method would return. Right click highlighted rows on the right side and select "copy > Copy XPath"
# This is what I used to define the "Cheapest" button
cheap_results = '//a[@data-code = "price"]'

The simplicity of the second option is clearly visible. It searches for an element a which has an attribute data-code equal to price. The first alternative looks for an element with an id equal to wtKI-price_aTab and follows the first div element, four more divs, and two spans. It will work… this time. I can tell you right now that the id attribute will change the next time you load the page. The letters wtKI change dynamically every time a page loads, so your code would be useless as soon as the page reloads. Invest a little time reading about XPath and I promise it will pay off.

Nevertheless, using the copy method will work on less “sophisticated” websites, and that’s fine too!
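To make the difference between the two selectors concrete, here is a small self-contained sketch. It does not touch Kayak at all: the markup is a toy page invented for the example, the second id prefix is made up to stand in for a reload, and it uses the standard library’s ElementTree, which only supports a subset of XPath (enough for these two expressions).

```python
import xml.etree.ElementTree as ET


def make_page(id_prefix):
    """Build a toy version of the sort tabs; the markup is invented."""
    return ET.fromstring(
        "<div>"
        f'<a id="{id_prefix}-price_aTab" data-code="price">Cheapest</a>'
        f'<a id="{id_prefix}-other_aTab" data-code="other">Other</a>'
        "</div>"
    )


# Brittle: tied to the id seen on one particular page load.
COPIED_XPATH = ".//*[@id='wtKI-price_aTab']"
# Sturdier: tied to an attribute that stays the same between loads.
ATTRIBUTE_XPATH = ".//a[@data-code='price']"

first_load = make_page("wtKI")
second_load = make_page("xQ2z")  # made-up prefix standing in for a reload

for name, page in [("first load", first_load), ("second load", second_load)]:
    by_id = page.find(COPIED_XPATH)
    by_attr = page.find(ATTRIBUTE_XPATH)
    print(name,
          "| copied id path found:", by_id is not None,
          "| attribute path found:", by_attr is not None)
```

The id-based path only matches while the prefix happens to be wtKI, while the attribute-based path keeps matching after the prefix changes.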

Building on what I displayed above, what if we wanted to get all the search results as several strings inside a list? Simple. Each result is inside an object with the class “resultWrapper”. Fetching all the results can be achieved with a for loop like the next one. If you understand this part, you should be able to understand most of the code that will follow. It’s basically pointing to what you want (the results wrapper), using some method (XPath) to get the text there, and placing it in a readable object (first with the flight_containers and then with the flights_list).
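A minimal sketch of that loop, assuming a Selenium 4 style driver, might look like this. The function name get_flights_list and the constant XP_RESULTS are hypothetical; flight_containers, flights_list and the resultWrapper class come from the text.

```python
XP_RESULTS = '//*[@class = "resultWrapper"]'


def get_flights_list(driver):
    """Return the visible text of every result on the current page."""
    # "xpath" is the value of Selenium's By.XPATH constant, so this call
    # is equivalent to driver.find_elements(By.XPATH, XP_RESULTS).
    flight_containers = driver.find_elements("xpath", XP_RESULTS)
    # Each container is a web element; .text holds its rendered text.
    flights_list = [flight.text for flight in flight_containers]
    return flights_list
```

Slicing the returned list, for example get_flights_list(driver)[0:3], is a quick way to eyeball the first few results.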

The first 3 rows are displayed and we can clearly see everything we need, but we have better alternatives to get the information. We need to scrape each element individually.

Clear for take-off!

The easiest function to code is the one that loads more results, so let’s start with that. I want to maximize the number of flights I get without triggering the security check, so I will click the “load more results” button once every time a page is displayed. The only new thing is the try statement, which I added because sometimes the button was not loading properly. If it acts up for you too, simply comment it out in the start_kayak function that I will show ahead.
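As a hedged sketch, such a function could be written as below. The moreButton class in the default selector is a placeholder I am assuming for illustration, so inspect the page to find the real one; passing the driver in as an argument is also my choice for the example.

```python
def load_more(driver, more_results='//a[@class = "moreButton"]'):
    """Click the "load more results" button once; return True on success.

    ASSUMPTION: the moreButton class is a placeholder selector, not
    necessarily the one used on the live page.
    """
    try:
        driver.find_element("xpath", more_results).click()
        return True
    except Exception:
        # Selenium raises when the button is missing or not clickable.
        # The scrape can carry on with the results already on the page.
        return False
```

Returning True or False instead of silently passing makes it easier to tell, while testing, whether the button was actually clicked.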

And now, after a long intro (I can get carried away at times!) we’re ready to define the function that will actually scrape the pages.

I already compiled most of the elements in the next function, called page_scrape. Sometimes the elements come back as lists that interleave the information for the first and second legs. I used a simple method to split them, for instance in the first section_a_list and section_b_list variables. The function also returns a dataframe flights_df, so we can separate the results we get from the different sorts and merge them later.
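The splitting can be sketched with plain list slicing, assuming the scraped list strictly alternates first leg, second leg, first leg, and so on. The function name split_legs and the parameter name sections_list are hypothetical.

```python
def split_legs(sections_list):
    """Split an alternating [a, b, a, b, ...] list into two lists."""
    section_a_list = sections_list[::2]   # items 0, 2, 4, ...: first leg
    section_b_list = sections_list[1::2]  # items 1, 3, 5, ...: second leg
    return section_a_list, section_b_list
```

If a result is ever missing one leg, the alternation breaks and every later item lands in the wrong list, so it is worth checking that both lists end up the same length.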

I’ve tried to make the names clear to follow. Remember that the variables with a are related to the first leg of the trip, and the ones with b to the second. On to the next function.

Wait, there’s more?!

So far we have a function to load more results, and a function to scrape those results. I could end the article here and you would still be able to use these manually and use the scraping function on a page you browsed by yourself, but I did mention something about sending an email to yourself and some other information! That is all inside the next function start_kayak!

It requires you to declare the cities and the dates. From there, it will open the address in the kayak string, which goes directly to the results page sorted by “best”. After the first scrape, I took the liberty of getting the top matrix with the prices. It will be used to calculate an average and a minimum, to be sent in the email along with Kayak’s prediction (on the page it should be at the top left). This is one of the things that could cause an error on a single date search, since there is no matrix element there.
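The average and minimum can be sketched as below. I am assuming the text of the matrix element holds one price per line, formatted like “$1,234”; the real element may need different cleaning. The function name matrix_stats is hypothetical.

```python
def matrix_stats(matrix_text):
    """Return (minimum, average) of the prices found in the matrix text.

    ASSUMPTION: one price per line, formatted like "$1,234".
    """
    prices = [
        int(line.replace("$", "").replace(",", ""))
        for line in matrix_text.splitlines()
        if line.strip()  # skip blank lines
    ]
    matrix_min = min(prices)
    matrix_avg = sum(prices) / len(prices)
    return matrix_min, matrix_avg
```

On a single date search there is no matrix, so prices would be empty and min would raise a ValueError. Wrapping the call in a try statement is one way to handle that case.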

I tested this using an Outlook account (hotmail.com). Although I did not test it using a Gmail account to send the email, there are lots of alternatives you can search for, and the book I mentioned earlier has other ways to do it too. If you already have a Hotmail account, it should work once you replace the details with your own.
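Here is one possible sketch of the email part using only the standard library. The function names, the subject line and the wording of the body are my assumptions, and so are the default host and port, which are the commonly documented Outlook.com SMTP settings. Check your provider’s current settings, since some accounts require an app password or block basic logins.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, matrix_min, matrix_avg, prediction):
    """Assemble a plain-text summary email (wording is illustrative)."""
    msg = EmailMessage()
    msg["Subject"] = "Flight Scraper"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Cheapest flight: {matrix_min}\n"
        f"Average price: {matrix_avg}\n"
        f"Recommendation: {prediction}\n"
    )
    return msg


def send_message(msg, username, password,
                 host="smtp-mail.outlook.com", port=587):
    """Send msg over SMTP with STARTTLS.

    ASSUMPTION: host and port are typical Outlook.com values; replace
    them with whatever your email provider documents.
    """
    with smtplib.SMTP(host, port) as server:
        server.starttls()                  # upgrade to an encrypted link
        server.login(username, password)
        server.send_message(msg)
```

Keeping the message building separate from the sending means you can print the message and check it before any login is attempted.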

If you want to explore what some parts of the script are doing, please copy them and run them outside the functions. That is the only way you will fully understand it.

Using everything we just created

After all this, we might as well come up with a simple loop to start using the functions we just created and keep them busy. It comes complete with four “fancy” prompts for you to actually write the cities and dates (the inputs). Since we don’t want to type these variables every time we’re testing, swap the prompts for the explicit assignments below them when needed.
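A stripped-down sketch of such a loop is shown below. The helper name keep_searching and its parameters are hypothetical, and the number of runs and the waiting time are up to you.

```python
from time import sleep


def keep_searching(search, n_runs, wait_seconds, sleep_fn=sleep):
    """Call search() n_runs times, pausing between consecutive runs."""
    for run in range(n_runs):
        search()
        if run < n_runs - 1:  # no need to wait after the last run
            sleep_fn(wait_seconds)
```

Assuming start_kayak takes the two cities and the two dates, and that you stored the answers to the four input prompts in city_from, city_to, date_start and date_end (names I am making up here), a call such as keep_searching(lambda: start_kayak(city_from, city_to, date_start, date_end), 5, 4 * 60 * 60) would run five searches four hours apart.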

If you made it this far, congratulations! There are plenty of improvements I can think of, like integrating with Twilio to send you a text message instead of an email. You can also use VPNs or more obscure ways to grind the search results from several servers at the same time. There’s the captcha issue, which may pop up from time to time, but there are workarounds for this sort of thing. I think you have a pretty solid base here, and I encourage you to try and add some extra features. Maybe you want the Excel file sent as an attachment. I always welcome constructive feedback, so feel free to leave a comment below. I try to respond to every one!

example of a test run with the script

If you want to learn more about web scraping, I strongly recommend the book Web Scraping with Python. I really liked the examples and the clear explanation of how the code works. And if you prefer social media scraping, there’s also an excellent book exclusively on the subject. I’m using the latter for my next article using the Twitter API, but there is stuff there to scrape even LinkedIn! (If you decide to purchase and use my links, I receive a small fee at no extra cost to you. I do need a lot of coffee to write these articles! Thanks in advance!)