Beautiful Soup Web Scraping


Beautiful Soup is a Python package that allows you to parse HTML and XML files. It builds a parse tree from a page's markup, which can be used to extract data from HTML, making it useful for web scraping.

We’ll go over how to do web scraping with Python from the ground up in this tutorial.
Then we’ll work on a real-world web scraping project, using a phone manufacturer’s product listing page as an example.
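Before touching a live site, it helps to see the whole idea at a small scale. The sketch below parses a tiny hypothetical HTML snippet (the markup and the `model` class name are made up for illustration) and pulls the text out of selected tags:

```python
from bs4 import BeautifulSoup

# a small, hypothetical HTML snippet standing in for a downloaded page
html = """
<html>
  <body>
    <h1>Phones</h1>
    <p class="model">Nokia X20</p>
    <p class="model">Nokia G20</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find every <p> tag with class "model" and print its text content
for p in soup.find_all('p', {'class': 'model'}):
    print(p.get_text())
# Output:
# Nokia X20
# Nokia G20
```

The rest of the tutorial does exactly this, except the HTML comes from a real web page instead of a string.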

What is Web Scraping?

Web scraping refers to the process of extracting and processing large quantities of data from the internet using software or algorithms. If you find data on the web that you can’t download directly, web scraping with Python is a skill you can use to convert that data into a usable format you can import.

How to set up a virtual environment?

virtualenv is used to manage Python packages for different projects. By using virtualenv instead of installing Python packages globally, you avoid breaking system tools or other projects. virtualenv itself can be installed with pip.

# beautiful soup python setup example
# install and create a virtual environment
pip install virtualenv

# make a project directory
mkdir soup
cd soup

# create a virtual environment
virtualenv venv

# activate the virtual environment
# macos / linux
source venv/bin/activate

# windows
venv\Scripts\activate

# to deactivate the virtual environment (if needed)
deactivate

Once you have created the project folder and activated the virtual environment inside it, the prompt will look like this.

# virtual env
# in our case the project_folder_name is soup
(venv) user@machine:~/Desktop/project_folder_name

Now you are in the virtual environment. You can add or remove packages here. They are independent of your global settings and configurations.

Next, add a Python file. For example, we will create a file called scrape.py (the name is arbitrary).

# create file
nano scrape.py

Within this file, we will import two libraries: Requests and Beautiful Soup. You can install both via pip in the terminal.

# example install modules
# install requests
pip install requests
# install Beautiful Soup
pip install beautifulsoup4
# install html5lib (an optional parser)
pip install html5lib

The Requests library makes it easy to use HTTP in Python in a human-readable way, and the Beautiful Soup module helps you scrape the web quickly.

With the import statement, we’ll import both Requests and Beautiful Soup. Beautiful Soup will be imported from bs4, the package that contains Beautiful Soup 4.

import requests
from bs4 import BeautifulSoup
url = ''
# get request
r = requests.get(url)
# get html content
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')

Now we have all the HTML content of the web page and can access the data inside it. For example, we will display phones from the Nokia site.

# access the headings, which contain the phone names
phones = soup.find_all('h3', {'class': 'css-17c0ng7-Heading'})
for phone in phones:
    print(phone)

The above loop will output something like this. We are selecting h3 headings with class='css-17c0ng7-Heading'. Note that we get the full HTML tags as well.

<!--  Output -->
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia X20</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia X10</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia G20</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia G10</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia C20</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia C10</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia 5310</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia 8000 4G</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia 6300 4G</h3>
<!-- /output -->

We can extract only the text with the get_text() method.

phones = soup.find_all('h3', {'class': 'css-17c0ng7-Heading'})
for phone in phones:
    print(phone.get_text())

The output will be a simple list of phone names.

#List of h3 headings
Nokia X20
Nokia X10
Nokia G20
Nokia G10
Nokia C20
Nokia C10
Nokia 5310
Nokia 8000 4G
Nokia 6300 4G

Beautiful Soup: a list of useful methods

In this section, we will learn how to access various parts of the document object model. BeautifulSoup comes with many methods for accessing data.

Here is a list of some BeautifulSoup methods.

Beautiful Soup prettify() method

The prettify() method is used to beautify the content: it formats the HTML code with consistent indentation.
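For example, a minimal sketch (the one-line HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

# a compact, hypothetical HTML string with no whitespace
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns the document as a string with one tag per line, indented
print(soup.prettify())
```

The printed output places each tag and each piece of text on its own line, which makes a page's structure much easier to read while you figure out which tags to select.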

Beautiful Soup: Accessing HTML Tags

We can easily find and access the content of various HTML tags such as head, title, div, p, and h1 using the BeautifulSoup module. Let’s look at a quick example where we’ll print the webpage’s title tag.

# getting the 'title' tag
title_tag = soup.title
print(title_tag)
# Output: <title>The latest Nokia Android smartphones and mobile phones</title>

# getting the 'title' tag text only
title_text = soup.title.text
print(title_text)
# Output: The latest Nokia Android smartphones and mobile phones

We can access the head, div, p, and heading tags in the same manner.

# getting 'title' tag text
print(soup.title.text)
# Output: The latest Nokia Android smartphones and mobile phones

# getting inline CSS from the head tag
print(soup.head.style)

# getting the first div tag
print(soup.div)

# getting all div tags
print(soup.find_all('div'))

# getting the first p tag
print(soup.p)

# getting all p tags
print(soup.find_all('p'))

Accessing HTML Tag Attributes

We can get the attributes of any HTML tag with dictionary-style indexing, tag['attribute'] (or tag.get('attribute'), which returns None instead of raising an error when the attribute is missing).
In our HTML code, let’s extract the href attribute from the anchor tag.

# get the anchor tag
link = soup.a
print(link)
# Output: <a href="/phones/en_pk"><svg aria-label="Nokia" class="icon" focusable="false" viewbox="0 0 105 18"><use xlink:href="#nokia"></use></svg><span>Phones</span></a>

# print the 'href' attribute of the anchor tag
print(link['href'])
# Output: /phones/en_pk

The contents method

The contents method is used to display all of the tags inside a parent tag. Using the contents method, we can get a list of all the child HTML tags of, say, the head or body tag.


# Using the contents method
the_head = soup.head
all_tags = the_head.contents
for i in all_tags:
    print(i, '\n')
# Output will print all the available tags in the <head></head> tag

The children method

The children method is similar to the contents method, except that contents returns a list of all the children, while children returns an iterator.

#Using the children method
the_head = soup.head
all_children = the_head.children
for i in all_children:
    print(i, '\n')
##Output will print all the available tags in <head></head> tag
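The list-versus-iterator difference can be sketched with a tiny, hypothetical document: contents supports list operations such as len() and indexing, while children has to be iterated (and is consumed as you go):

```python
from bs4 import BeautifulSoup

# a hypothetical head with two child tags
html = "<html><head><title>Demo</title><meta charset='utf-8'/></head></html>"
soup = BeautifulSoup(html, 'html.parser')

contents = soup.head.contents   # a real list: supports len() and indexing
children = soup.head.children   # an iterator: consumed lazily, once

print(type(contents))              # <class 'list'>
print(contents[0].name)            # title
print(list(children) == contents)  # True - same elements, different container
```

In practice, use contents when you need random access or the number of children, and children when you only loop over them once.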

The descendants method

The descendants method is useful for retrieving all of a parent tag’s nested tags. It looks similar to the children and contents methods but works differently: instead of stopping at direct children, it walks the whole subtree. If we use it on the body tag, it will print the first div tag, then that div’s child, then their child, and so on until it reaches the end, after which it moves on to the next div tag.

#Using the descendants method
the_head = soup.head
all_descendants = the_head.descendants
for i in all_descendants:
    print(i, '\n')

The parent method

The parent method is used to get the parent tag. By default, it returns the whole parent tag, including everything inside it. In this example, we will only print the name of the parent tag.

the_head = soup.head
head_parent = the_head.parent
# use the 'name' attribute to get the name of the tag
print(head_parent.name)
# Output: html

The parents method

The parents method is used to get all the parent tags, all the way up the tree. It gives you a generator as a result. Consider the following example:

the_head = soup.head
head_parents = the_head.parents
# use the 'name' attribute to get the name of each tag
# every ancestor of the tag will be printed, up to the document itself
for i in head_parents:
    print(i.name, '\n')
# Output: html
#         [document]

The sibling methods

There are four sibling methods.

  • next_sibling method
  • previous_sibling method
  • next_siblings method
  • previous_siblings method

As their name suggests they work on HTML tag siblings.

The next_sibling method is used to get the next tag under the same parent.

The previous_sibling method is used to get the previous sibling tag. Both simply return an HTML tag, or None when no such sibling exists.

# next_sibling example
metaTag = soup.find('meta')
print(metaTag.next_sibling)
# Output: None
# returns None if the next sibling is not available

# previous_sibling example
title = soup.title
print(title.previous_sibling)
# Output: <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>

The next_siblings and previous_siblings methods are similar to the methods above, except that they return a generator containing all the available siblings.

You can use a loop to display their content.

# next_siblings example
metaTag = soup.find('meta')
for sibling in metaTag.next_siblings:
    print(sibling)

# previous_siblings example
title = soup.title
print(title.previous_siblings)
# Output: <generator object PageElement.previous_siblings at 0x7f8860e21190>
