Web Scraping with Python in 5 Minutes

Reading Time: 4 minutes

Web scraping is the process of gathering information from the Internet. Even copy-pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

The Python libraries requests and Beautiful Soup are powerful tools for the job. In this tutorial, you’ll learn how to:

  • Use requests and Beautiful Soup for scraping and parsing data from the Web
  • Build a script that fetches prices and product names from a popular website (TrovaPrezzi.it) and displays the relevant information

Making requests

The first step in scraping a web page is to download it, which we can do with the Python requests library. The requests library will make a GET request to a web server, which downloads the HTML contents of a given web page for us. There are several different types of requests we can make using requests, and the GET request is just one of them.
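For instance, requests exposes one helper function per HTTP verb. A quick sketch (using httpbin.org purely as a neutral test endpoint, not part of this tutorial):

import requests

r = requests.get("https://httpbin.org/get")    # fetch a resource
r = requests.head("https://httpbin.org/get")   # headers only, no body
r = requests.post("https://httpbin.org/post", data={"key": "value"})  # submit data
print(r.status_code)  # 200 if the last request succeeded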

Web scraping is not allowed on all websites, and to avoid problems when making requests we need to simulate a web browser by using a fake user agent. There is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents. In our example we will use a random user agent, so that every request comes from a different user agent.

Let’s try downloading a sample web page, https://www.trovaprezzi.it/cuffie-microfoni/prezzi-scheda-prodotto/apple_airpods_2nd_generation, using the requests.get method.

import requests
from fake_useragent import UserAgent

url = "https://www.trovaprezzi.it/cuffie-microfoni/prezzi-scheda-prodotto/apple_airpods_2nd_generation"

# Browser-like headers with a random User-Agent for every run
ua = UserAgent()
headers = {"User-Agent": ua.random,
           "Accept-Encoding": "gzip, deflate",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "DNT": "1",
           "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}

r = requests.get(url, headers=headers)
r.raise_for_status()  # raise an exception if the download failed
content = r.content   # the raw HTML of the page

The content variable now holds the whole HTML source of the page.
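As a quick sanity check (assuming the request above succeeded), we can print the status code and peek at the first bytes of the downloaded HTML:

print(r.status_code)               # 200 means the page was downloaded
print(r.headers["Content-Type"])   # typically text/html for a web page
print(content[:200])               # the first bytes of the raw HTML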

Parsing a page with BeautifulSoup

Let’s jump into the browser and first inspect our page (F12). Then we need to select the element we are interested in on the web page. To do this, we just click the selector (1) and then the element (2).

Once we click, the corresponding HTML tag will be highlighted and we just need to copy the class name.

Let’s go back to our IDE, first import the libraries, and create an instance of the BeautifulSoup class to parse our document. To extract the tags we need, we can use the find_all method, which finds all the instances of a tag on a page. Note that find_all returns a list, so we’ll have to loop through it.

from bs4 import BeautifulSoup
import re
import pandas as pd

soup = BeautifulSoup(content, 'html.parser')

# Collect the tags whose class names we copied from the browser
prices = soup.find_all('div', attrs={'class': 'item_total_price'})
names = soup.find_all('a', attrs={'class': 'item_name no_desc'})
links = soup.find_all('a', attrs={'class': 'listing_item_button cta_button'})

if prices and names and links:  # find_all returns empty lists, never None
    df = pd.DataFrame({'name': names, 'price': prices, 'link': links})
    for index, product in df.iterrows():
        # iterrows() yields copies, so write the cleaned values back with df.at
        df.at[index, 'name'] = product['name'].text.strip()
        df.at[index, 'price'] = re.findall(r'\d+,\d+', product['price'].text.strip())[0]
        df.at[index, 'link'] = "https://trovaprezzi.it" + product['link'].attrs['href']

In our example we use a pandas dataframe to collect all the extracted data (the three lists have the same length) and then we loop through the dataframe. In the for loop we clean up the product name, extract the numeric price with a regular expression, and build the full hyperlink.
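To preview the data stored in our dataframe, we can print it and, if we like, save it to a CSV file (the file name below is just an example):

print(df.head())                               # preview the first rows
df.to_csv("airpods_prices.csv", index=False)   # optional: persist the results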

Now we can replace the url string with another TrovaPrezzi.it URL. Note that the HTML structure must be the same and the page must contain the same classes we are looking for; otherwise the lists will be empty, because we would be “scraping” something that does not exist on that page.
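To make this reuse easier, the whole flow can be wrapped in a small function. A minimal sketch, reusing the imports above (the function name scrape_trovaprezzi is our own, and it assumes the same CSS classes as before):

def scrape_trovaprezzi(url):
    headers = {"User-Agent": UserAgent().random}
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    soup = BeautifulSoup(r.content, 'html.parser')
    prices = soup.find_all('div', attrs={'class': 'item_total_price'})
    names = soup.find_all('a', attrs={'class': 'item_name no_desc'})
    links = soup.find_all('a', attrs={'class': 'listing_item_button cta_button'})
    if not (prices and names and links):
        return pd.DataFrame()  # different page structure: nothing to scrape
    return pd.DataFrame({
        'name': [n.text.strip() for n in names],
        'price': [re.findall(r'\d+,\d+', p.text.strip())[0] for p in prices],
        'link': ["https://trovaprezzi.it" + l.attrs['href'] for l in links],
    })

df = scrape_trovaprezzi(url)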
