How to Use BeautifulSoup and the Requests Library to Extract Data From a Website

Photo by Mick Haupt on Unsplash


Scraping the ASPCA Website

Introduction to Web Scraping

Web scraping is the process of automatically extracting data from websites. It helps you gather information without manually inputting data from each page on a website. Web scraping has a wide range of applications including:

  • Automating repetitive tasks such as checking prices or deals on e-commerce sites.

  • Building data sets for machine learning algorithms.

  • Gathering data for research and analysis.

Web Scraping In Python

To perform web scraping in Python, we’ll use two libraries: Requests and BeautifulSoup.
The Requests library is a popular Python library for making HTTP requests. It allows you to send HTTP requests and handle responses easily. You'll use Requests to fetch web pages, which you can then parse with BeautifulSoup.

Below is an example of how to use Requests to fetch the content of a web page:

import requests
# Define the URL of the web page you want to scrape
url = 'https://example.com'
# Send an HTTP GET request to the URL
response = requests.get(url)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the page content
    print(response.text)
else:
    # Handle the error
    print(f'Request failed with status code {response.status_code}')

In this code, we import the Requests library and use it to send an HTTP GET request to a website’s URL. We then check whether the request was successful by confirming the status code is 200. If it was, we parse the page content; otherwise, we handle the error.

After getting the page content, use BeautifulSoup to parse the HTML. BeautifulSoup provides a convenient way to navigate and search the Document Object Model (DOM). For example, you can search for elements by their class names, tag names, and attributes.

Here’s an example of how you can use BeautifulSoup to parse the content fetched using the Requests library:

from bs4 import BeautifulSoup
import requests
# Define the URL of the web page you want to scrape
url = 'https://example.com'
# Send an HTTP GET request to the URL
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Now you can work with the 'soup' object to extract data
else:
    print('Failed to retrieve the web page')

Once you get the soup object, you can start extracting data from the web page.

Parsing the HTML Content

The DOM is a tree-like representation of the structure and content of a web page. Before scraping a website, it’s important to understand what this structure looks like. Consider the following simple HTML page.

<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div class="content">
            <p>This is a sample paragraph.</p>
            <a href="https://example.com">Visit Example</a>
            <ul id="fruits">
                <li class="apple">Apple</li>
                <li class="banana">Banana</li>
                <li class="cherry">Cherry</li>
            </ul>
        </div>
    </body>
</html>

On this page, you can see tags like <div>, <p>, and <a>. There are also attributes within the tags, such as href and class. Elements nested inside other elements, such as the <p> tag nested inside the <div> element, are called children. Elements that share the same parent, like the <p> and <a> tags, are called siblings.
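
To make these relationships concrete, here is a minimal, self-contained sketch that parses the sample page above and navigates between parent, child, and sibling elements (the variable names are illustrative):

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div class="content">
            <p>This is a sample paragraph.</p>
            <a href="https://example.com">Visit Example</a>
            <ul id="fruits">
                <li class="apple">Apple</li>
                <li class="banana">Banana</li>
                <li class="cherry">Cherry</li>
            </ul>
        </div>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# The div is the parent of the <p>, <a>, and <ul> elements
content_div = soup.find('div', class_='content')
# Iterate over the div's direct children
for child in content_div.find_all(recursive=False):
    print(child.name)  # p, a, ul
# The <a> tag is the next sibling of the <p> tag
paragraph = content_div.find('p')
print(paragraph.find_next_sibling('a').text)  # Visit Example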

Searching Items Within the Parsed HTML Elements

In the HTML sample code above, the <ul> tag has an id with the value “fruits”. Using BeautifulSoup, you can find that specific HTML element by ID like this:

fruits = soup.find(id="fruits")

To make the HTML easy to read, use the prettify method.

print(fruits.prettify())

Aside from IDs, you can use BeautifulSoup to find elements with a specific class.

For example, access the div with the class content using the following code:

content_div = soup.find('div', class_='content')

The underscore after class is essential because class is a reserved keyword in Python.
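
If you prefer not to use the class_ keyword, you can also pass an attrs dictionary; the two calls below are equivalent ways of finding the same element:

# Using the class_ keyword argument
content_div = soup.find('div', class_='content')
# Using the attrs dictionary instead
content_div = soup.find('div', attrs={'class': 'content'})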

To access the link’s href value within the div, use the following:

content_div.find('a').get("href")

If you want to find all the divs with the content class, call find_all() on the soup object.

content_divs = soup.find_all('div', class_='content')

This will return a list-like object containing every element with the specified class, which you can iterate over.
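
For example, a short sketch that loops over the <li> elements in the fruits list of the sample page and prints each one's text:

for li in soup.find_all('li'):
    print(li.text)  # Apple, Banana, Cherry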

To access the title of the HTML page, use the title attribute of the soup object:

title = soup.title

This will return the title element of the page. You can use .text to retrieve the text content of an HTML element. For instance, call .text on the title object to get the title text.

title_text = title.text

Real-World Example: Parsing the ASPCA Website For Plant Information

Setting Up Your Environment

A virtual environment isolates your project’s dependencies from the system-wide Python installation and prevents dependency conflicts between projects. Use the following commands to create and activate a virtual environment:

Create a virtual environment:

python -m venv myenv

This command creates a virtual environment called myenv.

Activate the virtual environment:

On macOS and Linux:

source myenv/bin/activate

On Windows:

myenv\Scripts\activate

Installing the Required Libraries

Use the following command to install BeautifulSoup via pip:

pip install beautifulsoup4

You’ll also need to install the requests package.

pip install requests

The data we want to scrape is about plants that are poisonous to pets, taken from the ASPCA website. The ASPCA website maintains an extensive list of plants categorized into toxic and non-toxic varieties, which is an invaluable resource for pet owners and animal enthusiasts.

The screenshot below shows how the plants are displayed on the website.

The ASPCA website organizes the plants in a paginated manner, where each page displays a subset of the plant data. Each plant listed on the website serves as a clickable link, leading to a dedicated page containing detailed information about the plant's toxicity. We will be collecting this data.

The initial step of scraping the site is to understand the HTML structure of the website. You can do this by right-clicking on any section of interest and selecting "Inspect" from the context menu of the browser. This action will open the browser's developer tools and highlight the corresponding HTML elements.

When you hover over the HTML, the browser should highlight the corresponding element.

Observe the div with the class named views-row. The div contains the individual plant data. This means that we will have to get the row content and then iterate over the contents of individual rows to get the data.

But wait, the page is paginated. Each page shows a limited number of plants. To ensure we capture all the data, we need to implement the logic for navigating through the pages and scraping the content from each one.

When you scroll to the bottom of the page, you will see a label like the following:

Items 1-15 of 1028

This tells us that each page displays 15 items and that there are 1028 items in total. Dividing the total count (1028) by the items per page (15) gives approximately 68.53, which rounds up to 69 pages.

When you navigate to the second page, you'll notice that the URL changes to something like https://www.aspca.org/pet-care/animal-poison-control/toxic-and-non-toxic-plants?page=1

The "page" parameter in the URL enables us to construct a loop that iterates through the pages. This loop dynamically appends the appropriate search query to the URL to accommodate the different page numbers.

This means we can create a loop that runs from 0 to 68 (69 pages in total) and appends the appropriate query string to the URL depending on the page number.

Scraping the First Page

Create a new file and name it scrape.py. This is the file where you’ll write the web scraping code. Begin by importing BeautifulSoup and the requests package at the top of this file.

# Import necessary libraries 
import requests 
from bs4 import BeautifulSoup 
from math import ceil

Defining the Base URL and Pagination Details

Begin by defining the base URL of the website you wish to scrape. In our case, we'll be scraping plant information from the ASPCA website. It's also important to determine the number of items displayed per page and the total number of items you wish to scrape.

base_url = "https://www.aspca.org/pet-care/animal-poison-control/toxic-and-non-toxic-plants"
items_per_page = 15
total_items = 1028
total_pages = ceil(total_items / items_per_page)

Making a Request For the First Page

We'll create a function named scrapePage to fetch the content of a given page. This function sends an HTTP request to the specified URL and checks if the response status code is 200, indicating a successful request.

def scrapePage(url):
    scraped_data = []
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")

The next step is to retrieve the links to individual plant pages. We'll locate the div elements with a class named views-row as these contain the href attributes leading to the plant pages.

        row_content = soup.find_all("div", class_="views-row")
        links = []
        for row in row_content:
            item_link = row.find("a").get("href")
            links.append("https://www.aspca.org" + item_link)

Scraping Data from Individual Plant Pages

Now that we have the links to each plant's page, we can proceed to extract the plant names, additional names, scientific names, family names, and toxicity information from each page.

        for link in links:
            response = requests.get(link)
            soup = BeautifulSoup(response.content, "html.parser")
            name = soup.find("h1").text
            additional_names = None
            additional_name_pane = soup.find(
                "div", class_="field-name-field-additional-common-names"
            )
            if additional_name_pane:
                additional_names_values = additional_name_pane.find(
                    "span", class_="values"
                )
                if additional_names_values:
                    additional_names = additional_names_values.text
            scientific_names = None
            scientific_name_pane = soup.find(
                "div", class_="field-name-field-scientific-name"
            )
            if scientific_name_pane:
                scientific_name_values = scientific_name_pane.find(
                    "span", class_="values"
                )
                if scientific_name_values:
                    scientific_names = scientific_name_values.text
            family_names = None
            family_name_pane = soup.find("div", class_="field-name-field-family")
            if family_name_pane:
                family_name_values = family_name_pane.find("span", class_="values")
                if family_name_values:
                    family_names = family_name_values.text
            toxic = None
            nonToxic = None
            nontoxicity_status_pane = soup.find(
                "div", class_="field-name-field-non-toxicity"
            )
            if nontoxicity_status_pane:
                non_toxic_status = nontoxicity_status_pane.find("span", class_="values")
                if non_toxic_status:
                    nonToxic = non_toxic_status.text
            toxicity_status_pane = soup.find("div", class_="field-name-field-toxicity")
            if toxicity_status_pane:
                toxic_status = toxicity_status_pane.find("span", class_="values")
                if toxic_status:
                    toxic = toxic_status.text
            data = {
                "name": name,
                "additional_names": additional_names,
                "family_names": family_names,
                "scientific_names": scientific_names,
                "toxic": toxic,
                "non_toxic": nonToxic,
            }
            scraped_data.append(data)
    return scraped_data

Get Data For All the Pages

As mentioned earlier, the data on the ASPCA website is paginated, meaning there are multiple pages to scrape. So far, we've created a function to scrape data from the first page. To obtain data from all the pages, we'll loop through the total number of pages and call the scrapePage function for each page.

scraped_data = []
for page_number in range(total_pages):
    # Construct the URL for the current page
    url = base_url
    if page_number != 0:
        url = base_url + "?page=" + str(page_number)
    # Call the scrapePage function for the current page
    data = scrapePage(url)
    # Add the page's records to the overall list
    scraped_data.extend(data)

In this code, we create an empty list scraped_data to store the data we extract from each page. We then use a for loop to iterate through the range of total_pages. For each page, we construct the URL, taking into account the page number. The scrapePage function is called with the updated URL, and the records it returns are added to scraped_data using extend, so scraped_data ends up as a single flat list of dictionaries.

Storing the Scraped Data in a CSV File

Now that we have successfully extracted data from multiple pages, the next step is to store this data for further analysis or reference. One of the most common and versatile ways to store tabular data is by using CSV (Comma-Separated Values) files. In this section, we'll explore how to save your scraped data to a CSV file.

Using the csv Module

Python's built-in csv module provides a convenient way to work with CSV files. To use this module, you'll need to import it at the beginning of your script:

import csv

Writing Data to a CSV File

Let's continue by writing our scraped data to a CSV file. We'll create a function to handle this task:

def saveToCSV(data, filename):
    with open(filename, 'w', newline='') as csvfile:
        fieldnames = data[0].keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in data:
            writer.writerow(row)

Here's what this function does:

  • It opens a CSV file with the specified filename in write mode ('w').

  • It defines the fieldnames as the keys of the first item in your data list, which represent the column names.

  • It creates a csv.DictWriter object, which allows us to write dictionaries to the CSV file.

  • It writes the header (column names) to the CSV file using writer.writeheader().

  • It iterates through the data list and writes each row (dictionary) to the CSV file.

Saving Your Data

After defining the saveToCSV function, you can use it to save your scraped data to a CSV file.

# Define the filename for your CSV file
csv_filename = "scraped_data.csv"
# Save the scraped data to the CSV file
saveToCSV(scraped_data, csv_filename)
# Notify the user that the data has been saved
print(f"Scraped data has been saved to {csv_filename}")

In this example, we specify the csv_filename as the name of your CSV file, and then call the saveToCSV function with your scraped_data and the csv_filename.

Your scraped data will now be stored in a CSV file, making it easy to analyze, share, or use for any other purposes. CSV files are widely supported and can be imported into various data analysis tools, making them a practical choice for data storage in web scraping projects.
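
For instance, a minimal sketch of loading the saved file back for analysis with pandas (assuming the pandas library, which is installed in the next section, and the scraped_data.csv filename used above):

import pandas as pd
# Load the CSV file produced by saveToCSV into a DataFrame
df = pd.read_csv("scraped_data.csv")
# Preview the first few plant records
print(df.head())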

Other Data Storage Options

Depending on your project requirements, there are several other formats in which you can store the scraped data. Below are a few alternatives:

Save to Excel

Use the pandas library to save your scraped data to an Excel file. Start by installing the pandas and openpyxl libraries via pip:

pip install pandas openpyxl

Pandas is a powerful and popular Python library for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, such as databases and spreadsheets.

To work with Excel files in Python using the pandas library, you need to have openpyxl installed. Openpyxl is a Python library for working with Excel files in Python. It allows you to create, read, and modify Excel files.

After installing these two libraries, use the code below to save your data in a spreadsheet.

import pandas as pd
def saveExcel():
    # Build a DataFrame from the full list of scraped records
    df = pd.DataFrame(scraped_data)
    # Define the filename for your Excel file
    excel_filename = "scraped_data.xlsx"
    # Save the DataFrame to an Excel file
    df.to_excel(excel_filename, index=False)
    print(f"Scraped data has been saved to {excel_filename}")
# Call the function to create the spreadsheet
saveExcel()

If successful, you should see the following statement on the console.

Scraped data has been saved to scraped_data.xlsx

Save to JSON

JSON is a flexible data format you can use to represent complex data structures, including nested objects and arrays. Therefore, if the data you are scraping has a nested structure, consider saving it in JSON format using the json library, as shown below:

import json
def saveToJSON(data, json_filename):
    with open(json_filename, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, ensure_ascii=False, indent=4)
# Save the scraped data to scraped_data.json
saveToJSON(scraped_data, "scraped_data.json")

When you call this function, a new JSON file containing the scraped data is created in the root of the project folder.

Conclusion

Web scraping is a powerful technique for automating data extraction from websites. It has various real-world applications, from automating repetitive tasks to building datasets for machine learning and conducting research. Python libraries like Requests and BeautifulSoup make it easy to fetch web pages, parse HTML, and extract content from websites.

In this guide, we used the Requests and BeautifulSoup libraries to scrape the ASPCA website for information about plant toxicity to animals. We also explored different data storage options for the scraped data including CSV, Excel, and JSON. You can now create a User Interface to display this data in an aesthetically pleasing manner.