Oasist Blog

Posts on linguistics, engineering and life, written whenever I feel like it.

Web Scraping in Python + Selenium Vol.1 - Simple DOM -

[Image: Web Scraping]

1. Introduction

As a trial of Udemy, I took the online course "Web Scraping for Efficiency in Business".
I would like to share what I learned from it over three articles:

  1. Web Scraping in Python + Selenium Vol.1 - Simple DOM - (this post)
  2. Web Scraping in Python + Selenium Vol.2 - More Complicated DOM -
  3. Web Scraping in Python + Selenium Vol.3 - Image Collection -

For the complete source code, jump to "5. Source Code".

2. Deliverables

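Get the lecturer's profile from the 講師情報 - Webスクレイピング入門 (Lecturer Information - Introduction to Web Scraping) page and export it as a CSV file.

     項目        値
  0  講師名      今西 航平
  1  所属企業    株式会社キカガク
  2  生年月日    1994年7月15日
  3  出身        千葉県
  4  趣味        バスケットボール、読書、ガジェット集め

(項目 = item, 値 = value. The rows are lecturer name, company, date of birth, birthplace and hobbies.)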

3. Implementation

3-1. Login

First, launch Google Chrome and open a URL.
This happens when an instance of the TextExtractor class is created.
The constructor takes a URL (https://scraping-for-beginner.herokuapp.com/login_page) as its argument.
The get method opens the URL; remember to import webdriver beforehand.

TextExtractor#__init__

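from selenium import webdriver  # module-level import

def __init__(self, url):
    # Launch Chrome via the local chromedriver, then open the URL.
    self.chrome = webdriver.Chrome(executable_path="../exec/chromedriver.exe")
    self.chrome.get(url)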
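
The constructor above uses the Selenium 3 signature. In Selenium 4 the executable_path argument was deprecated and later removed, and the driver path is passed through a Service object instead. A minimal sketch of the same step under that assumption follows; open_browser and its driver_path parameter are hypothetical names, not part of the original class.

```python
def open_browser(url, driver_path=None):
    """Launch Chrome and open `url` (sketch, assuming Selenium 4 is installed)."""
    # Imported inside the function so this file still loads where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Without an explicit path, recent Selenium releases try to locate a driver themselves.
    service = Service(executable_path=driver_path) if driver_path else Service()
    chrome = webdriver.Chrome(service=service)
    chrome.get(url)
    return chrome
```

It could be called as open_browser("https://scraping-for-beginner.herokuapp.com/login_page", "../exec/chromedriver.exe"), and the returned object plays the role of self.chrome.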

TextExtractor#login takes a username and a password as arguments.
Use the browser's inspector to find #username, #password and #login-btn in the DOM.

[Screenshot: Username]
[Screenshot: Password]
[Screenshot: Login Button]

webdriver provides the find_element_by_id method to find an element by its ID.
Fill in the required values with the send_keys method and click the button with the click method.

TextExtractor#login

def login(self, user_name, pwd):
    username = self.chrome.find_element_by_id("username")
    username.send_keys(user_name)
    password = self.chrome.find_element_by_id("password")
    password.send_keys(pwd)
    login = self.chrome.find_element_by_id("login-btn")
    login.click()
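
The find_element_by_* helpers belong to the Selenium 3 API. They were deprecated in Selenium 4 and later removed in favour of find_element(by, value). Below is a rough sketch of the same login step in that style, written as a standalone function so it works with any driver-like object. The function form and the ID constant are illustrative assumptions; in real code you would normally import By from selenium.webdriver.common.by and pass By.ID.

```python
ID = "id"  # same string value as selenium.webdriver.common.by.By.ID


def login(driver, user_name, pwd):
    """Fill in the login form and press the login button."""
    # Each lookup returns one element, which is then typed into or clicked.
    driver.find_element(ID, "username").send_keys(user_name)
    driver.find_element(ID, "password").send_keys(pwd)
    driver.find_element(ID, "login-btn").click()
```

Because the driver is passed in, the function can be exercised with a small fake object in a test, without opening a browser.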

After logging in, you can see the lecturer information table.

[Screenshot: Lecturer Information]

3-2. Get Information

Get the keys from the th elements and the values from the td elements.
The find_elements_by_tag_name method returns a list of all elements with the given tag.

* find_element_by_tag_name (singular) returns only the first element with the given tag.

Once you have the list of elements, extract the text of each element as follows.

  1. Create an empty list to hold the text of each element.
  2. Loop over the elements with a for statement and read each one's text from its .text attribute.
  3. If the text contains a newline (\n), replace it with a separator string ("、" here).
  4. Append the text to the list created in step 1.

Finally, return the keys and values.

I also built a {key: value} dictionary and return it as well, which makes the unit test easier to read.

TextExtractor#get_lecturer_info

def get_lecturer_info(self):
    # Header cells hold the item names.
    ths = self.chrome.find_elements_by_tag_name("th")
    keys = []
    for th in ths:
        keys.append(th.text)
    # Data cells hold the values; multi-line cells are joined with "、".
    tds = self.chrome.find_elements_by_tag_name("td")
    vals = []
    for td in tds:
        if "\n" in td.text:
            vals.append(td.text.replace("\n", "、"))
        else:
            vals.append(td.text)
    # Pair each key with the value at the same position.
    profile = {}
    for i in range(len(keys)):
        profile[keys[i]] = vals[i]
    return profile, keys, vals
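
The loops above can also be written more compactly. The sketch below is one possible equivalent, assuming the cells are any objects with a .text attribute; build_profile and the stand-in sample cells are hypothetical and not part of the original class.

```python
from types import SimpleNamespace


def build_profile(ths, tds, sep="、"):
    """Return (profile, keys, vals) built from header cells and data cells."""
    keys = [th.text for th in ths]
    # str.replace leaves text without a newline unchanged, so no `if` is needed.
    vals = [td.text.replace("\n", sep) for td in tds]
    # zip pairs each key with the value at the same position
    # and stops at the shorter of the two lists.
    profile = dict(zip(keys, vals))
    return profile, keys, vals


# Stand-in cells that mimic what the element lookups would return.
ths = [SimpleNamespace(text="出身"), SimpleNamespace(text="趣味")]
tds = [
    SimpleNamespace(text="千葉県"),
    SimpleNamespace(text="バスケットボール\n読書\nガジェット集め"),
]
profile, keys, vals = build_profile(ths, tds)
# vals[1] == "バスケットボール、読書、ガジェット集め"
```

One difference in behaviour: if there are fewer values than keys, the index-based loop raises an IndexError, whereas zip silently drops the unmatched keys.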

3-3. Export CSV

TextExtractor#export_csv takes the keys, the values and a path, and exports a CSV file.

  1. Create an empty data frame with pd.DataFrame() (remember to import pandas as pd beforehand).
  2. Assign keys and vals to columns with df[label].
  3. Pass the path to the to_csv method, which writes the CSV file to that path.

TextExtractor#export_csv

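import pandas as pd  # module-level import

def export_csv(self, keys, vals, path):
    df = pd.DataFrame()
    df["項目"] = keys  # column header: "item"
    df["値"] = vals    # column header: "value"
    df.to_csv(path)    # the row index is written as the first column by default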
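
To make the shape of the exported file concrete, here is a small sketch that writes the same two columns with the standard library csv module instead of pandas. This is a swapped-in technique for illustration, not the author's method, and write_profile_csv is a hypothetical name. Unlike DataFrame.to_csv with default arguments, it does not write a leading index column.

```python
import csv


def write_profile_csv(keys, vals, path):
    """Write key/value pairs as a two-column CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["項目", "値"])
        # One row per (key, value) pair.
        writer.writerows(zip(keys, vals))
```

With pandas itself, passing index=False to to_csv gives the same index-free layout.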

4. Unit Test

test/test_text_extractor.py

import unittest
import sys
sys.path.append("../lib")
from os import path
from text_extractor import TextExtractor

class TestTextExtractor(unittest.TestCase):
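    def setUp(self):
        self.text_extractor = TextExtractor("https://scraping-for-beginner.herokuapp.com/login_page")
        self.text_extractor.login("imanishi", "kohei")

    def tearDown(self):
        # Close the browser opened in setUp so windows do not pile up between tests.
        self.text_extractor.chrome.quit()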

    def test_get_lecturer_info(self):
        profile, *_ = self.text_extractor.get_lecturer_info()
        self.assertEqual({
            "講師名": "今西 航平",
            "所属企業": "株式会社キカガク",
            "生年月日": "1994年7月15日",
            "出身": "千葉県",
            "趣味": "バスケットボール、読書、ガジェット集め"
        }, profile)

    def test_export_csv(self):
        _, keys, vals = self.text_extractor.get_lecturer_info()
        self.text_extractor.export_csv(keys, vals, "../csv/lecturer_info.csv")
        self.assertTrue(path.exists("../csv/lecturer_info.csv"))

if __name__ == "__main__":
    unittest.main()

5. Source Code

oasis-forever/web_scraping_tutorial