Learning to scrape the web using Python can be quite challenging. When I first got started, it took many hours. I tried libraries, consulted Reddit, browsed Stack Overflow, and googled my heart out until I got the code to finally work.
Since then, I really haven’t had the need to learn anything else. I just reused the same code over and over again, applying it to different websites in a variety of projects.
This tutorial will teach you the basics of web-scraping in Python and will also explain some pitfalls to watch out for. After completing this guide, you will be ready to work on your own web-scraping projects. Happy coding!
These are the tools you will use. I have included some explanation of each tool’s function and what you’ll need to do in order to get them set up correctly.
Google Chrome: To get the web-scraper to work you need either Google Chrome or Firefox. We will use Google Chrome. If you don’t have it already downloaded, click here. Once you have it installed, click the three stacked dots icon in the upper right. Then click “Help” and then “About Chrome”. Note the version number. This will be important for the next tool.
Chrome Driver: Our next tool is called Chrome Driver. Chrome Driver is the bridge between our Python code and the browser: Selenium sends it commands, and it carries them out in Chrome. Click this link to download it, and make sure the Chrome Driver version number matches the Google Chrome version number you recorded earlier. Periodically, you may find that your code has suddenly stopped working. In my experience, this is usually caused by Google Chrome updating to a new version that leaves the Chrome Driver outdated. If this ever happens to you, simply download the newer Chrome Driver version, delete the old one, and place the new one where the old one used to be in your files.
Anaconda: The next step is to download Anaconda, which you can find here. Anaconda contains a bundle of resources, the most important of which, for our purposes, is Jupyter Notebook. The installer’s default options are fine, so you can simply click through it.
Jupyter Notebook: Next we have Jupyter Notebook. It is a relatively simple code editor. If you already have Anaconda installed, launch Jupyter Notebook and it should open in your browser. Navigate to the folder where you want the Python code to be located, then press “New” and click “Python 3” to create your web-scraping file.
Selenium: The last tool you will use is the Selenium package for Python. This package provides the functions you will use to write your web-scraper. If you don’t already have it installed, open up Anaconda Prompt. Then type “pip install selenium” and wait for Selenium to be installed. Setup over! Bring on the copy and pasting, am I right?
You may copy and paste the following base code into your Jupyter Notebook file:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
The above code imports the parts of the Selenium library you will need and gives a shorter name, EC, to Selenium’s expected_conditions module.
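Next, you can link the Python code to the Chrome Driver. Use the following code with the path set to your machine’s Chrome Driver location. Selenium 4 deprecated the older executable_path argument and newer releases removed it, so the path is passed in through a Service object instead. Mine looks like this:
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('C:/Users/Ethan Schreur/Documents/chromedriver.exe'))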
Base code over! Now things will get interesting because you are ready to actually code the scraper and interact with your desired website.
This section will teach you the basic commands you can give your program to do the scraping.
Opening the Website
After the line where you tell your code the Chrome Driver’s location, you can write code that opens your chosen website. Type the following, with your website’s address in place of the placeholder:
driver.get("Your website's URL")
Easy, right? Now how will you interact with the website’s elements? Here is where XPath comes in.
XPath is an incredibly easy way to help Chrome Driver find elements on a website. To get the XPath of an element, right-click over that element and press “inspect”. This will open up Chrome’s Dev Tools. You can look in the HTML code and hover your cursor over different lines which will highlight elements on the displayed website. You can do all of these things (look at the code, right-click/inspect, or look at the highlights) to find the right code for the element you wish to scrape or interact with. Then, right-click on the element’s code, press “Copy”, and press one of two options: “Copy XPath” or “Copy full XPath”.
“Copy full XPath” gives the absolute path from the top of the page (it starts with /html), while “Copy XPath” gives a shorter path that usually starts from the nearest element with an id. For the most part, the regular XPath works fine. But it's good to be aware of the longer path in case it ever becomes useful. Knowing how to find the XPath of an element is in my opinion quite an important skill for the amateur scraper. It’s also quite fun!
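To make the difference between the two kinds of path concrete, here is a small sketch you can run without a browser. It uses Python's built-in xml.etree.ElementTree, which only understands a subset of XPath, instead of Chrome. The page, its ids, and both paths are made up for illustration; they are not something Chrome generated.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page invented for this example.
PAGE = """
<html>
  <body>
    <div id="header"><p>Welcome</p></div>
    <div id="content"><p class="price">19.99</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Short, id-based path, in the spirit of "Copy XPath": //*[@id="content"]/p
short_match = root.find(".//*[@id='content']/p")

# Position-based path, in the spirit of "Copy full XPath": /html/body/div[2]/p
# (ElementTree paths are written relative to the root element, hence "./body/...")
full_match = root.find("./body/div[2]/p")

print(short_match.text)
print(full_match.text)
```

Both paths land on the same paragraph. The short one keeps working if another div is added above the content, while the position-based one would then point at the wrong element, which is one reason the regular XPath is usually the better first choice.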
One problem you may come across on your web-scraping journey is this: You’ve found the correct XPath. Your code is correct. Yet the web-scraper still doesn’t work. The reason may be that the page hasn’t fully loaded when your program is trying to scrape the page. The solution is to make your web-driver wait until the element is clickable with this code:
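WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, "Element's XPath")))
This code waits up to 50 seconds for the element to load and become clickable, and raises a TimeoutException if it never does. Fifty seconds is probably excessive. But just to be safe, I use this code anytime my program selects an element, whether or not I plan to click it. Keep in mind that an element that is hidden or disabled never counts as clickable, so for such elements EC.presence_of_element_located is the condition to wait on instead.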
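If you're curious what a wait like this is doing, here is a rough sketch of the idea in plain Python: keep checking a condition every so often, return as soon as it holds, and give up once the time limit passes. This is a simplified stand-in and not Selenium's actual implementation; the names wait_until and element_ready, and the fake "page" that becomes ready on the third check, are all made up for illustration.

```python
import time


def wait_until(condition, timeout=50, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)  # pause before checking again


# Stand-in for a page whose button only appears on the third check.
checks = {"count": 0}


def element_ready():
    checks["count"] += 1
    return "button" if checks["count"] >= 3 else None


found = wait_until(element_ready, timeout=5, poll_interval=0.01)
print(found, checks["count"])
```

The takeaway is that a long timeout costs nothing when the page loads quickly, because the wait returns the moment the condition is met. The full timeout is only spent when the element never shows up.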
You’ve navigated to the website and you’ve waited until your target element loads. If the target element contains text, this code will scrape that text:
your_element = driver.find_element(By.XPATH, "Element's XPath")
your_element_text = your_element.text
Clicking Elements
If you want to click an element, this code will do just that:
driver.find_element(By.XPATH, "Element's XPath").click()
Filling Out Forms (Logging In)
Finally, to fill out forms in order to, for example, log in or sign up, your code will need to send some text to each element that accepts text. You do this by sending keys to the various text-receiving elements until the form is filled out:
driver.find_element(By.XPATH, "Element's XPath").send_keys('Text you wish to send')
Then, if there is a submit button you wish to click, follow the code in the “Clicking Elements” section to submit the form.
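When a form has several fields, the same two steps (send keys, then click) can be wrapped in one small helper. The sketch below is one possible way to do it, not the only one. The name fill_and_submit is my own invention, and the helper also clears each field before typing, which is an extra step the rest of this guide doesn't use. To keep the block free of imports, it passes the string "xpath" as the locator type, which is the value that By.XPATH stands for.

```python
def fill_and_submit(driver, field_values, submit_xpath):
    """Type text into each field, then click the submit button.

    field_values maps each field's XPath to the text to send to it.
    """
    for xpath, text in field_values.items():
        field = driver.find_element("xpath", xpath)  # "xpath" is the value of By.XPATH
        field.clear()  # remove any pre-filled text first
        field.send_keys(text)
    # Once every field is filled in, click the submit button.
    driver.find_element("xpath", submit_xpath).click()
```

In your notebook you would call it with the driver you created earlier, for example fill_and_submit(driver, {"Username field's XPath": "my_username", "Password field's XPath": "my_password"}, "Submit button's XPath"), swapping in the XPaths you copied from Chrome's Dev Tools.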
There you go! You’ve learned the basics of web-scraping in Python and are now equipped to work on your own web-scraping projects.
Further reading and what I'm up to
Here is a link to the original article published on Medium:
I am working on building a full online Bootcamp on Medium. If you want to follow my progress, check it out at this link.
Thanks for reading and I hope you found it useful!