Master Web Scraping with Python: Step-by-Step Tutorial for Beginners

If you are looking to get into data analysis, cybersecurity, or any other field that requires you to work through a lot of data, you will need to know how to build and use a web scraper. Fortunately, it isn’t hard, even for beginners, and with only a few lines of Python code, you will have a powerful and versatile tool.

What Is Web Scraping?

Web scraping is the automatic extraction of data from websites using software, which in this case is a Python script. It lets you gather large amounts of information from social media, news outlets, and many other sources far faster than you could by hand.

Is Web Scraping Wrong?

Generally, no. As long as you are not violating any rules in the site’s terms and conditions, trying to get around paywalls, or ignoring a site’s request that robots stay away (usually published in its robots.txt file), there is little harm in automatically collecting information you could copy and paste yourself. Laws differ from place to place, so if you are unsure, choose another site. In the example below, we scrape a site that specifically says it’s ok, so you can use that to practice if scraping makes you uneasy.
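To make the robots check concrete, here is a minimal sketch that uses Python's built-in urllib.robotparser module to test whether a set of robots.txt rules permits a given URL. The is_allowed helper, the sample rules, and the example.com addresses are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/catalogue/page-1.html"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/report.html"))
```

Against a live site, you would instead call the parser's set_url() method with the address of the site's robots.txt file and then read() to download the rules before calling can_fetch().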

Why is Learning to Web Scrape a Good Idea?

Learning how to web scrape can help you find a new job in two ways. A scraper can dig through the many job sites quickly to find the listings that are right for you, and the ability to scrape websites effectively is a skill that many companies will pay for, as is being skilled at programming in Python.

Before You Get Started

Before you get started on this project, you will need to have Python installed, and we recommend having at least a beginner’s understanding of how it works, trying out the Hello World script, and knowing how to run the code that you write.

If you are writing a lot of Python code, it can also be helpful to choose an IDE you are comfortable with to make the work a little easier.

You will also need to download and install two Python libraries. One is titled “requests,” and the other is “BeautifulSoup,” which is published under the package name beautifulsoup4. Install them from the command line with the following command before you begin:

pip install requests beautifulsoup4

  • I’ll put all of the code at the end so you can copy and paste, but you will still need to do this part separately.
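If you want to confirm that the installation worked before writing the scraper, a short check like this one may help. It is only a sketch: the missing_modules helper is a name invented for this example, and it relies on the standard library's importlib.util.find_spec to look for each module without importing it.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if find_spec(name) is None]


# "bs4" is the import name of the beautifulsoup4 package.
missing = missing_modules(["requests", "bs4"])

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("requests and BeautifulSoup are ready to use")
```

If anything is reported as not installed, run the pip command again and check that pip belongs to the same Python installation you use to run your scripts.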

Writing the Code

Import the Necessary Libraries

The first thing you will need to do in our web scraping program is to import the libraries we just downloaded so Python can make use of them. The requests library allows Python to use the internet and fetch the web pages we need to scrape, while the BeautifulSoup library will help us break down the data we receive.

import requests

from bs4 import BeautifulSoup

Sending an HTTP Request

Once we import the libraries, the first thing we want our script to do is request the web page that has the information we want and let us know whether it could retrieve it. A status code of 200 means the request succeeded, and the program can continue. So, we need to enter a URL.

url = 'http://books.toscrape.com/'
response = requests.get(url)

if response.status_code == 200:
    print("Successfully fetched the web page")
else:
    print("Failed to fetch the web page")
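The status check above covers the case where the server answers with an error, but a request can also fail because the site is down or slow to respond. One possible variation, sketched here, sets a timeout and lets requests raise an exception for error responses by calling raise_for_status() instead of comparing status codes by hand. The fetch_html function and its session parameter are names chosen for this example, not part of the requests library.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Fetch url and return its HTML as text, raising on HTTP errors."""
    # Use the caller's session if one is given, otherwise the requests module.
    http = requests if session is None else session
    # Give up if the server takes longer than `timeout` seconds to respond.
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

To use it, call fetch_html('http://books.toscrape.com/') inside a try block and catch requests.RequestException, which is the base class for the connection, timeout, and HTTP errors that requests raises.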

Parsing the HTML Content

Once Python downloads the website, you will need to parse it with BeautifulSoup so you can retrieve the information you need.

soup = BeautifulSoup(response.content, 'html.parser')
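To get a feel for what the parsed soup object gives you, it can help to try BeautifulSoup on a tiny page first. The HTML in this sketch is hand-written for illustration and does not come from any real site.

```python
from bs4 import BeautifulSoup

# A small made-up page so the example runs without a network call.
html = """
<html>
  <head><title>Sample Shop</title></head>
  <body>
    <h1>Books</h1>
    <p class="note">Two titles in stock.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                    # text inside <title>
heading = soup.h1.get_text()                      # first <h1> on the page
note = soup.find("p", class_="note").get_text()   # first <p> with class "note"

print(page_title)
print(heading)
print(note)
```

Writing soup.title or soup.h1 jumps to the first tag with that name, while find() lets you narrow the search by attributes such as the class.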

Extracting Data

Now, we are ready to extract the data we need, and we do so by having the web scraper look for specific HTML tags on the page. For instance, let's look at the following code:

books = soup.find_all('article', class_='product_pod')

for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    print(f'Title: {title}, Price: {price}')

In the code above, we are telling Python to look for every article tag with the class 'product_pod'. You can find this tag yourself by viewing the source code of http://books.toscrape.com/. We also want Python to look inside each of those tags for other tags: the title comes from the title attribute of the link inside the h3 heading, and the price comes from the paragraph with the class 'price_color'.
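The same lookups can be tried offline on a simplified stand-in for the page's markup. The sample HTML, the extract_books function, and the step that converts the price to a number are all assumptions added for illustration; the real page contains more tags than this.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the tags the scraper looks for.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="book-one.html" title="A Sample Book">A Sample ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
<article class="product_pod">
  <h3><a href="book-two.html" title="Another Sample Book">Another ...</a></h3>
  <p class="price_color">£8.00</p>
</article>
"""


def extract_books(html):
    """Return a list of (title, price) tuples found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for book in soup.find_all("article", class_="product_pod"):
        # The full title lives in the link's title attribute, not its text.
        title = book.h3.a["title"]
        price_text = book.find("p", class_="price_color").get_text(strip=True)
        # Keep only digits and the decimal point, then convert to a number.
        price = float("".join(ch for ch in price_text if ch.isdigit() or ch == "."))
        results.append((title, price))
    return results


for title, price in extract_books(SAMPLE_HTML):
    print(f"Title: {title}, Price: {price:.2f}")
```

Turning the price into a float is optional, but it makes sorting and arithmetic possible later on.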

Saving the Data

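Printing the results is fine for a quick check, but you will usually want to keep them. Python’s built-in csv module can write each title and price to a file named books.csv, which opens in any spreadsheet program.

import csv

with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Price'])

    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        writer.writerow([title, price])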
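A quick way to check that the file was written correctly is to read it back. This sketch writes two made-up rows to a temporary file and loads them again with csv.DictReader; the save_rows and load_rows helpers and the sample data are invented for the example.

```python
import csv
import os
import tempfile


def save_rows(path, rows):
    """Write a header row followed by (title, price) rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Price"])
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV file back as a list of dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file))


# Made-up sample data; the second title contains a comma on purpose.
sample = [("A Sample Book", "£12.50"), ("Commas, Quoted", "£8.00")]

# Use a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "books_sample.csv")
    save_rows(path, sample)
    loaded = load_rows(path)

for row in loaded:
    print(row["Title"], row["Price"])
```

The csv module quotes the title that contains a comma when writing and removes the quotes when reading, which is one reason to use it rather than joining values with commas yourself.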

Entire Code for Web Scraper

import csv

import requests
from bs4 import BeautifulSoup

# URL of the website to scrape
url = 'http://books.toscrape.com/'

# Send an HTTP request to the website
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the web page")
else:
    # Stop the script here, since there is nothing to parse
    raise SystemExit("Failed to fetch the web page")

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the book listings on the page
books = soup.find_all('article', class_='product_pod')

# Open a CSV file to save the extracted data
with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['Title', 'Price'])

    # Iterate through the books and extract title and price
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text

        # Write the data to the CSV file
        writer.writerow([title, price])

print("Data has been saved to books.csv")

Next Steps

Once you get the code working, it’s important to understand it. Try modifying it to get a list of books and their star rating instead of price.
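If you get stuck on the star rating, here is one possible approach, sketched under an assumption about the markup: that each book's rating is stored as a second class name on a paragraph tag, as in class="star-rating Three". Check the page source to confirm this before relying on it. The sample HTML, the RATING_WORDS table, and the extract_ratings function are invented for this example.

```python
from bs4 import BeautifulSoup

# Maps the rating word assumed to appear in the class attribute to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Made-up markup that follows the assumed structure of a book listing.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="book-one.html" title="A Sample Book">A Sample ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""


def extract_ratings(html):
    """Return a list of (title, stars) tuples; stars is None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    ratings = []
    for book in soup.find_all("article", class_="product_pod"):
        title = book.h3.a["title"]
        # BeautifulSoup returns the class attribute as a list of names,
        # for example ['star-rating', 'Three'].
        classes = book.find("p", class_="star-rating")["class"]
        stars = next((RATING_WORDS[c] for c in classes if c in RATING_WORDS), None)
        ratings.append((title, stars))
    return ratings


for title, stars in extract_ratings(SAMPLE_HTML):
    print(f"Title: {title}, Stars: {stars}")
```

Swapping this lookup in for the price lookup in the main script, and changing the CSV header to match, would give you a file of titles and ratings.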

Once you are successful at getting different types of data from the test website, try getting data from other websites.
