Hey,
Web scraping is a technique for automatically accessing and extracting large amounts of information from a website. Let's see how to use Python as our web scraping language. Follow the steps below:
1. If you are using Windows, install Python from the official website.
2. Install the library we need, BeautifulSoup, using pip, a package management tool for Python.
3. In the terminal, type the following (the first line is only needed if pip is not already installed; recent Python installers include it):
easy_install pip
pip install BeautifulSoup4
4. Before we jump into coding, you should know the basics of HTML.
5. Next, inspect the page. Let's take this page as an example: https://www.bloomberg.com/quote/SPX:IND
6. First, right-click and open your browser’s inspector to inspect the web page.
7. Once you click on inspect, the related HTML will be selected in the browser console.
8. From the result, you can see that the price is nested inside a few levels of HTML tags:
<div class="basic-quote">
→ <div class="price-container up">
→ <div class="price">.
9. Similarly, if you click the name “S&P 500 Index”, you will see that it is inside:
<div class="basic-quote">
<h1 class="name">.
10. Now we know the location of the data, identified by the class attributes of the tags.
11. Let's jump into the code. Now that we know where our data is located, we can start coding the web scraper. Open your text editor.
12. First, we need to import the libraries that we are going to use:
# import libraries (urllib2 is available in Python 2 only)
import urllib2
from bs4 import BeautifulSoup
13. Then we need to declare a variable for the URL of the page:
# specify the url
quote_page = 'paste the url'
14. Then we use Python's urllib2 to get the HTML page at the URL we declared:
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
15. Finally, we parse the page into a BeautifulSoup object so that we can work with it:
# parse the html using beautiful soup and store in variable `store`
store = BeautifulSoup(page, 'html.parser')
Now we have a variable, store, containing the parsed HTML of the page, so we can start coding the part that extracts the data.
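The fetching steps above use urllib2, which is only available in Python 2. As a rough sketch of the same fetch on Python 3, assuming the standard-library urllib.request module, something like the following could work. The helper name fetch_html, the User-Agent value, and the timeout are illustrative assumptions, not part of the original steps.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # Some sites reject requests without a browser-like User-Agent;
    # the value used here is only an assumed placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset announced in the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (needs network access, so it is left commented out):
# page = fetch_html("https://www.bloomberg.com/quote/SPX:IND")
```

The returned string can be passed to BeautifulSoup(page, 'html.parser') in the same way as in step 15.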
16. Here we can extract the content with find(). Since the HTML class name is unique on this page, we can simply query:
<h1 class="name">.
# Take out the <h1> of name and get its value
name_box = store.find('h1', attrs={'class': 'name'})
17. Once we get the tag, we can get the data by getting its text.
name = name_box.text.strip() # strip() removes leading and trailing whitespace
print name
18. Similarly, we can get the price:
# get the index price
price_box = store.find('div', attrs={'class': 'price'})
price = price_box.text
print price
Once you run the program, you will see that it prints the name and the current price of the S&P 500 Index.
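The snippets above are written for Python 2 (urllib2 and the print statement). As a hedged sketch of the extraction part on Python 3, the following runs against a small inline HTML sample that mirrors the structure from steps 8 and 9, so it works without network access. The sample markup, the price value in it, and the helper name extract_quote are assumptions for illustration; the markup of the live page may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure described in steps 8 and 9;
# the price value is made up for illustration.
SAMPLE_HTML = """
<div class="basic-quote">
  <h1 class="name"> S&amp;P 500 Index </h1>
  <div class="price-container up">
    <div class="price">2,700.00</div>
  </div>
</div>
"""


def extract_quote(html):
    """Return (name, price) as text, with None for anything not found."""
    store = BeautifulSoup(html, "html.parser")
    name_box = store.find("h1", attrs={"class": "name"})
    price_box = store.find("div", attrs={"class": "price"})
    # find() returns None when nothing matches, so guard before reading text.
    name = name_box.text.strip() if name_box is not None else None
    price = price_box.text.strip() if price_box is not None else None
    return name, price


name, price = extract_quote(SAMPLE_HTML)
print(name)
print(price)
```

Checking for None before reading .text avoids an AttributeError if the page layout changes and a tag is no longer found.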
I hope this helps. To learn more about Jupyter, you can go through this cheat sheet: https://www.edureka.co/blog/cheatsheets/jupyter-notebook-cheat-sheet