A Detailed Guide to Web Scraping using Python

Mark Gacoka Mbui title banner

The art of web scraping is becoming increasingly popular, driven by the rapid expansion of data across the web. From data scientists extracting valuable research material to an individual like you collecting Donald Trump’s tweets and arranging them systematically to peruse in the morning over a hot cup of coffee, a web scraper is the tool for the job.

What is web scraping?

Web scraping is a powerful and efficient way of downloading structured data from the web, extracting valuable content from it, and presenting it in a format that is easy to understand, all while imitating human actions and behavior.

In the past, people used to manually copy the data needed from the web to a local file. This was a very inefficient method especially when it involved large amounts of data.

Then came the age of spreadsheets. They had basic web scraping abilities, like extracting HTML tables from web pages. Later, offline downloaders were used to download a whole web page for offline viewing. Honorable mentions among pertinent inventions include the Wayback Machine, created by the Internet Archive, a non-profit organization based in San Francisco. The Wayback Machine essentially archives snapshots of most sites as an online library.

Web scraping services

Lastly came the web scraping services offered by companies for large-scale data extraction. Soon it all boiled down to web scraping software and AI (Artificial Intelligence) web scrapers. This is what we will be doing in detail. Let’s start!

Overview:

  • Use cases for a web scraping bot
  • Ways to extract data from the web and why choose web scraping
  • Setting up the Python and the web scraper using the Beautiful Soup library
  • Let’s start scraping!
  • Playing with the results
  • Advanced Uses
  • References

Use cases of Web Scraping

Trading graph
A trading graph
  1. Web scraping has many use cases. A web scraper can be used by investors to scrape the opening and closing prices of trades from an investing website. They can then format the data and use it for further analysis. Instead of painstakingly retrieving each market price, the investor can now focus on analysis. This illustrates the power of web scraping.
  2. Web scraping can also be used by programmers. The core of making successful software is testing that the program actually works as intended. To do this, programmers need data for testing and debugging. Let’s say, for example, a web developer needs a list of people’s names as test data. Instead of copying and pasting the names of newborn babies manually, he or she can simply scrape the data and arrange it neatly as a JSON file.
  3. You. Let’s say you are searching for the best ticket prices for a certain concert you wish to attend. A web scraping robot could search the web faster than any human could. You could even program the bot to purchase the tickets for you.

Ways to extract data from the web

API’s

An API (Application Programming Interface) is exactly what its name suggests: an interface that allows you to design an application that uses the features and data provided by a particular website, operating system or service. Numerous websites offer APIs, e.g. Twitter, Facebook and Google. An API is almost always preferred over a web scraping program because the service grants it more authority and features. An example of this is the Twitter API, which can tweet on your behalf. However, an API is specific to a website, and not all websites have APIs.
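To illustrate the difference, an API typically returns structured JSON that you can parse directly, with no HTML involved. A minimal sketch using Python’s json module (the payload and field names below are invented for illustration, not an actual Twitter API response):

```python
import json

# A made-up JSON payload, like what a tweet-fetching API endpoint
# might return; real API responses have their own field names
api_response = '''
{
  "tweets": [
    {"user": "realDonaldTrump", "text": "Covfefe", "retweets": 127000},
    {"user": "realDonaldTrump", "text": "MAKE AMERICA GREAT AGAIN!", "retweets": 98000}
  ]
}
'''

# json.loads turns the response straight into Python lists and dicts,
# so there is nothing to "scrape"
data = json.loads(api_response)
for tweet in data["tweets"]:
    print(tweet["user"], "-", tweet["text"])
```

With HTML you would first have to locate the right tags before getting at the text; here the structure is already in place.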

RSS Feeds

RSS (short for Rich Site Summary, also called Really Simple Syndication) is a web feed that receives updates of online content and displays them in a readable format. Sadly, not all websites publish one. RSS feeds have their limitations too: the format is not widely adopted by most sites, and it can be difficult to trace the origin of an RSS feed.

RSS Feed Logo
RSS Logo
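To see what a web feed looks like in practice, here is a minimal sketch that parses an RSS document with Python’s built-in xml.etree.ElementTree (the feed content and URLs below are made up; a real feed would be fetched from a site’s RSS URL):

```python
import xml.etree.ElementTree as ET

# A minimal, invented RSS 2.0 feed for illustration
rss_xml = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Foobar Blog</title>
    <item><title>10 Magnificent Dishes</title><link>https://foobar.com/blog/10-magnificent-dishes</link></item>
    <item><title>Scraping 101</title><link>https://foobar.com/blog/scraping-101</link></item>
  </channel>
</rss>"""

# Each <item> is one piece of content; pull out the titles
root = ET.fromstring(rss_xml)
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)
```

Because RSS is plain XML with a fixed structure, extracting entries is far simpler than scraping an arbitrary HTML page.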

Setting up!

I chose Python because of its rich ecosystem of libraries, especially for web scraping. It is also easy to understand since it resembles the English language, which helps beginners grasp what is going on. The libraries we are going to use are requests and BeautifulSoup.

Urllib2 has similar functionality to requests. For a full comparison and feature breakdown, visit this Quora page.

The Setup

To set up you need:

Mac and Linux Users

Python comes pre-installed on macOS as well as on Ubuntu distributions. Simply type:

python --version

You should see Python 2.7.15 or earlier. For this tutorial we will use Python 2, which is the default on Mac and Linux systems.

To use Python 3 you can create a virtual environment:

$ python3 -m venv venv
$ . ./venv/bin/activate

Windows Users

To install Python please follow the detailed instructions provided on the official website.

Python Documentation page
Python setup page

Next, install the requests and BeautifulSoup libraries:

easy_install pip  
pip install BeautifulSoup4
pip install requests

Once installed, you need to familiarize yourself with HTML tags. These tags are the basis of scraping an element’s contents, and are thus needed. Here is a sample HTML snippet that I will break down. If you already have a basic understanding of HTML, skip this part.

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1> What we will scrape </h1>
<p> Its paragraph content </p>
</body>
</html>
  1. HTML uses tags to enclose and represent the data structures (blocks) in the web page. When scraping, you can mostly ignore the tags themselves and pay attention to what’s inside them.
  2. <!DOCTYPE html> Basic HTML code should start with a declaration of the document type.
  3. <html></html> All data structures are enclosed in html tags.
  4. <head></head> The <meta></meta> and <script></script> tags are located in here. They identify the web page to search engines and initialize non-HTML content like JavaScript.
  5. The visible part of the webpage is contained within the <body> tag. It houses the <p> tag for paragraphs; <h1> through <h6> for title headings and text sizes; <a> for links; and <table> for tables, with <tr> for table rows and <td> for table cells.
  6. Class: HTML uses the class attribute for grouping styles together.
  7. Id: the id attribute gives a unique identifier to an element in an HTML document.

For more details on HTML visit the W3 Schools website for free tutorials.

Tips before web scraping:

  • First of all, website layouts change frequently, so check occasionally to avoid errors or scraping the wrong data.
  • Second, always check the Terms of Service of the particular website you wish to scrape. Some websites do not allow web scraping, and scraping them would violate their terms. To check whether scraping is prohibited, type the URL followed by /robots.txt to access the file, e.g. https://foobar.com/robots.txt
  • Third, do not scrape aggressively. Scraping with many threads could trigger the website’s security systems into thinking you are a spammer or a hacker. This could result in your IP getting blacklisted from accessing the website. If your scraper simulates a human’s behavior, it is fine.
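The robots.txt check mentioned above can also be automated. Here is a minimal sketch using Python 3’s urllib.robotparser (in Python 2 the equivalent module is robotparser); the rules below are an invented example, and in practice you would point the parser at the site’s real robots.txt:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt body; in practice you would fetch
# https://foobar.com/robots.txt and feed its lines the same way
robots_txt = """User-agent: *
Disallow: /private/
Allow: /blog/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch tells you whether a given user agent may scrape a URL
print(parser.can_fetch("*", "https://foobar.com/blog/10-magnificent-dishes"))
print(parser.can_fetch("*", "https://foobar.com/private/data"))
```

Running this check before scraping keeps your bot polite and on the right side of the site’s rules.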

Let’s start scraping!

Decide on the website you wish to scrape. Thereafter, open the URL of the website in your browser, e.g. https://investing.com/analysis/forex. Right click anywhere on the page and choose Inspect Element (or Inspect).

To visualize the element under review, hover your mouse over the web page until you see a blue rectangle over the item you wish to scrape. The HTML code related to the element will also be highlighted in the Elements tab. Here is some sample code to get you going:

#Let's import the requests and BeautifulSoup libraries
from bs4 import BeautifulSoup
import requests

#Place your own url. Make sure you are permitted to scrape it
url = 'https://foobar.com/blog/10-magnificent-dishes'

#We then ask for the content of the url using requests
response = requests.get(url, timeout=5)

#We then parse the contents of the page with an HTML parser.
#There are several parsers you can use...

webpage_content = BeautifulSoup(response.content, "html.parser")

#Here I loop through the first 10 paragraphs and store their
#text in a list
content = []
for i in range(0, 10):
    paragraphs = webpage_content.find_all("p")[i].text
    content.append(paragraphs)

All parsers have their benefits depending on how you intend to parse. The BeautifulSoup parsers available are:

  • Python’s html.parser, used as BeautifulSoup(markup, "html.parser").
  • lxml’s HTML parser, used as BeautifulSoup(markup, "lxml").
  • lxml’s XML parser, used as BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml").
  • html5lib, used as BeautifulSoup(markup, "html5lib").

For more info on web scraping with BeautifulSoup check their Documentation page by visiting this link.

Playing with the results

paragraphs = webpage_content.find_all("p")[i].text

This code finds all the paragraph tags, takes the text of the i-th one, and stores it in the ‘paragraphs’ variable. If we were to remove the .text from the code and print the variable, we would get something like this (possibly with a couple of link tags):

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum</p>

A Guide to extracting page attributes/tags

  1. soup.title – Returns the title tag along with its contents. <title>
  2. soup.title.string – Returns the text inside the title tag as a string. <title>
  3. soup.p – Returns the first paragraph tag and its content. <p>
  4. soup.p['class'] – Returns the class name of the element. <p class="">
  5. soup.a – Returns the first link tag in the page. <a>
  6. soup.find_all('a') – Alternatively, returns all matching link tags. <a>
  7. soup.find(id="link3") – Returns the element matching the id. Useful for elements with ids, like buttons and links.
  8. soup.findAll('div', attrs={"class": "button"}) – Finds elements by a specific attribute. This will locate all div elements with the class button.
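The lookups above can be tried out on a small inline document. A quick sketch (the markup, class and id names below are invented for the demonstration):

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>What we will scrape</title></head>
<body>
  <p class="intro">Its paragraph content</p>
  <a href="https://foobar.com/one" id="link1">First</a>
  <a href="https://foobar.com/two" id="link2">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)        # text inside <title>
print(soup.p["class"])          # class names of the first <p>, as a list
print(soup.a["href"])           # the first link's address
print(len(soup.find_all("a")))  # number of link tags
print(soup.find(id="link2").text)
```

Note that the class attribute comes back as a list, since an element can belong to several classes at once.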

Other helpful code…

Prettify feature
prettify feature

print webpage_content.prettify() – prints out the HTML document in an organized manner. You can use it after webpage_content is initialized (webpage_content = BeautifulSoup(response.content, "html.parser")).

**If you are having problems finding a particular element, just right-click and navigate to View Page Source (or similar). Do not be overwhelmed by the data; instead press Ctrl+F to search for the phrase you’re looking for.**

Page source
Example Web page Source Code

Advanced Uses

Here, we will take the first 5 question titles in a Quora topic. We can then use them for further analysis after web scraping.

Exporting to CSV

import csv
import requests
from bs4 import BeautifulSoup

#store the url
url = 'https://www.quora.com/topic/Google-company-5'

#query the website and parse the html into 'webpage_content'
response = requests.get(url, timeout=5)
webpage_content = BeautifulSoup(response.content, "html.parser")

question_title = []
print ("The top 5 questions asked in the Quora Google Company topic are: \n")

#loop through the top 5 question links in the page, append
#their content to the list. Encode to remove 'u' in the list,
#then print.
for i in range(0, 5):
    title = webpage_content.findAll('a', attrs={"class": "question_link"})[i].text
    question_title.append(title)
    title = title.encode("utf-8")
    print ('{}'.format(title))

#open a csv file with append, so old data will not be erased
with open('names.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for title in question_title:
        writer.writerow([title])
The top 5 questions asked in the Quora Google Company topic are:

Which will ultimately be the biggest company out of Facebook, Amazon, Apple, Netflix and Google?
Why does China ban Google and YouTube?
The New York Police Department has sent a cease-and-desist letter to Google, demanding its investigation app remove a feature that allows users to report police locations. Should Google comply?
How do Google and Facebook keep their source code secure when hundreds of staff members have access to it?
How do big companies (like Google, Microsoft, Apple, Facebook, Amazon, etc.) fire their employees?
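Once the titles are in the CSV file, they can be read back for further analysis. A minimal sketch with Python’s csv module (the rows below are stand-ins written just for this demonstration, and the file is opened with "w" here so the example is repeatable):

```python
import csv

# Write a couple of stand-in rows the same way the scraper does,
# then read them back
rows = [["Why does China ban Google and YouTube?"],
        ["How do big companies fire their employees?"]]

with open("names.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for row in rows:
        writer.writerow(row)

# csv.reader yields each row back as a list of strings
with open("names.csv", newline="") as csv_file:
    loaded = list(csv.reader(csv_file))

print(loaded)
```

From here the list could be fed into pandas, counted, or filtered however your analysis requires.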

References:

Thank you finish logo
Thank you for reaching the end!

In conclusion, if you still have any queries about web scraping, leave them in the comments and I will help you out. Also, if you would like to know how to predict stock prices after web scraping, check this article.

