Web Scraping in Python

Web scraping: the process of extracting and collecting data from a web page is known as web scraping.

We are going to perform web scraping on a Wikipedia page to get a practical understanding of it.

Data science requires lots of data. 

To learn data science and become a successful data scientist, you should never run short of data; an abundance of datasets is essential.

The most important work of data scientists is to analyze datasets and draw insights from the patterns in them. But where do those datasets come from?

To analyze data, we first need data, and collecting it from every possible source is one of the most rigorous tasks.

If you have already started working on real-world problems, then you might have felt a shortage of data, wished you had your own dataset, or simply preferred not to rely on anybody else's.

This is where web scraping comes in.


RELATED POSTS: 

1. LOAN PREDICTION PROBLEM PROJECT

2. UNDERSTANDING MACHINE LEARNING ALGORITHMS


Now, enough explanation; let's dive into the code.

The GitHub repository is linked at the bottom.


You can see from the snippet below that I have used only 2 libraries in this project.

Note that urllib ships with Python's standard library, so it needs no installation. If you don't have BeautifulSoup installed already, you can use this command to install it on your machine:

  • pip install beautifulsoup4
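
For reference, here is a minimal sketch of the imports (the exact style in the original notebook may differ slightly):

    # Import the two libraries used for scraping
    import urllib.request          # fetch pages over HTTP (standard library)
    from bs4 import BeautifulSoup  # parse the returned HTML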




  • urllib - A module in Python's standard library used for URL handling. Its most basic and important function here is fetching URLs.
  • BeautifulSoup - A Python library used to parse HTML and XML documents.

Together, these two libraries can be used to scrape just about any page on the web.

Now, after importing the libraries, we will begin with our first step, which is to request the page from its URL using urllib.request.
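
A sketch of this step; the URL here is my assumption, chosen because the column names later in the post (Postal Code, Borough, Neighbourhood) match Wikipedia's Toronto postal-codes page:

    # Request the page from its URL (URL assumed, see note above)
    import urllib.request

    url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
    page = urllib.request.urlopen(url)  # returns an http.client.HTTPResponse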



The returned page is stored in a variable named page. Its type is http.client.HTTPResponse because it is the response object returned by the urllib module.

A BeautifulSoup object has also been created and stored in the variable soup. The returned page is passed as its parameter, which tells BeautifulSoup to parse the HTML of that particular page.
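
A sketch of the parsing step; "html.parser" is Python's built-in parser, though the original may have used another such as "lxml":

    # Hand the HTTP response to BeautifulSoup for parsing
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page, "html.parser")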

Now we have all the data present on the Wikipedia page.





We can see that there is a lot of unwanted data. The data we need lives in a table, so the next step is to extract the tables present on the page.
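
A sketch of that step:

    # Collect every <table> element on the page
    all_tables = soup.find_all("table")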





We have stored all the tables present on the page in a variable named all_tables.
  • find_all - Finds every element matching the tag name passed to soup.find_all(). Here we need tables, so we pass the parameter "table".

More than one table may exist on the page, so we have to choose the right one according to our requirements.

To select the right table, we have to look at the class of the required table.

Inspecting the page's HTML, we can see that the table we need has the class named wikitable sortable. So we will use this class to extract the right table.
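
A sketch of that lookup:

    # Select only the table whose class is "wikitable sortable"
    right_table = soup.find("table", class_="wikitable sortable")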





The required table is stored in a variable named right_table.
Notice that we used soup.find() instead of soup.find_all(): this time we are looking for the single table that belongs to a particular class rather than all the tables on the page.

After successfully completing the above steps, right_table holds exactly the table we need. If you have followed along carefully, the major tasks are done. 

Now we need to extract the values present in the table so that we can convert them into a data frame.
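
A sketch of that extraction; the list names A, B, and C match the description below, and the exact text cleanup is an assumption:

    # Walk the table row by row and collect each column into its own list
    A = []  # Postal Code
    B = []  # Borough
    C = []  # Neighbourhood

    for row in right_table.find_all("tr"):  # tr: one table row
        cells = row.find_all("td")          # td: the cells in that row
        if len(cells) == 3:                 # skip the header row (it uses th)
            A.append(cells[0].text.strip())
            B.append(cells[1].text.strip())
            C.append(cells[2].text.strip())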





A, B, and C are empty lists that will hold the column values, i.e. Postal Code, Borough, and Neighbourhood.
Try to understand the logic written above. 
In HTML:
  • tr denotes rows
  • td denotes cells in those particular rows
After understanding and executing the above lines, we have 3 separate lists that hold column values. 

Now the final step is to convert those lists into a data frame.
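
A sketch of this step, assuming pandas is imported in addition to the two scraping libraries:

    # Assemble the three column lists into a pandas DataFrame
    import pandas as pd

    df = pd.DataFrame({
        "Postal Code": A,
        "Borough": B,
        "Neighbourhood": C,
    })

    df.head()  # looks like any other dataset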





You can see that the dataset is named df. It looks just like any other dataset we generally use for data science and analysis purposes.



This is why web scraping plays an important role if you want to become a successful data scientist. 

We have successfully performed web scraping in Python.

Kindly follow the Telegram channel for future updates.
