Oasist Blog

Delivering posts on linguistics, engineering, and life at my own pace.

Web Scraping in Python + Selenium Vol.2 - More Complicated DOM -

[Figure: Web Scraping]

Contents

1. Deliverables

Get tour reviews from all pages of ランキング - Webスクレイピング入門 and export them as a CSV file.

観光地 総合評価 楽しさ 人混みの多さ 景色 アクセス
0 観光地 1 4.7 4.6 4.5 4.9 4.2
1 観光地 2 4.7 4.6 4.5 4.9 4.2
2 観光地 3 4.6 4.5 4.4 4.8 4.1
3 観光地 4 4.5 4.4 4.4 4.8 4.0
4 観光地 5 4.5 4.4 4.3 4.7 4.0
5 観光地 6 4.4 4.3 4.3 4.7 3.9
6 観光地 7 4.3 4.2 4.2 4.6 3.8
7 観光地 8 4.3 4.2 4.2 4.6 3.8
8 観光地 9 4.2 4.1 4.1 4.5 3.7
9 観光地 10 4.1 4.0 4.1 4.4 3.6
10 観光地 11 4.1 4.0 4.0 4.4 3.6
11 観光地 12 4.0 3.9 4.0 4.3 3.5
12 観光地 13 3.9 3.8 3.9 4.3 3.4
13 観光地 14 3.9 3.8 3.9 4.2 3.4
14 観光地 15 3.8 3.7 3.8 4.2 3.3
15 観光地 16 3.7 3.6 3.8 4.1 3.2
16 観光地 17 3.7 3.6 3.7 4.1 3.2
17 観光地 18 3.6 3.5 3.7 4.0 3.1
18 観光地 19 3.5 3.4 3.6 3.9 3.0
19 観光地 20 3.5 3.4 3.6 3.9 3.0
20 観光地 21 3.4 3.3 3.5 3.8 2.9
21 観光地 22 3.3 3.2 3.5 3.8 2.8
22 観光地 23 3.3 3.2 3.4 3.7 2.8
23 観光地 24 3.2 3.1 3.4 3.7 2.7
24 観光地 25 3.1 3.0 3.3 3.6 2.6
25 観光地 26 3.1 3.0 3.3 3.6 2.6
26 観光地 27 3.0 2.9 3.2 3.5 2.5
27 観光地 28 2.9 2.8 3.2 3.4 2.4
28 観光地 29 2.9 2.8 3.1 3.4 2.4
29 観光地 30 2.8 2.7 3.1 3.3 2.3

2. Implementation

2-1. Get Information

First, launch Google Chrome and store the base URL.
This process runs when an instance of the InfoCollector class is created.
The constructor takes the base URL (https://scraping-for-beginner.herokuapp.com/ranking/) as an argument.

InfoCollector#__init__

def __init__(self, base_url):
    self.chrome = webdriver.Chrome(executable_path="../exec/chromedriver.exe")
    self.base_url = base_url

InfoCollector#_get_url is called from every public method except InfoCollector#export_csv.
This method combines the base URL with a query string ("?page={page_num}") and navigates to the resulting URL.
InfoCollector#get_categories needs no query string, so the parameter's default value is defined as "".
The get method navigates the browser to the URL, but do not forget to import webdriver from selenium in advance.

InfoCollector#_get_url

def _get_url(self, query_str=""):
    url = self.base_url + query_str
    self.chrome.get(url)
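As a quick sanity check, the URL composition above can be sketched without launching a browser. The build_url helper below is hypothetical, mirroring only the string concatenation in _get_url:

```python
# Sketch of the URL composition performed by _get_url (no browser needed).
base_url = "https://scraping-for-beginner.herokuapp.com/ranking/"

def build_url(base_url, query_str=""):
    # Same concatenation as _get_url, but returned instead of visited.
    return base_url + query_str

print(build_url(base_url))             # first page, no query string
print(build_url(base_url, "?page=2"))  # second ranking page
```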

[Figure: Tour Title]

InfoCollector#get_titles gets the elements of the .u_title class with find_elements_by_class_name.
Once you have the list of elements, extract the text of each one as follows.

  1. Create an empty list to hold the text of each element.
  2. Take out each element in a for statement and extract its text with the .text attribute.
  3. If the text contains newline characters, split it into a list of strings and take the last element, which is the title itself.
  4. Append the text to the list created in step 1.

InfoCollector#get_titles

def get_titles(self, query_str):
    self._get_url(query_str)
    elem_titles = self.chrome.find_elements_by_class_name("u_title")
    titles = []
    for elem_title in elem_titles:
        titles.append(elem_title.text.split("\n")[-1])
    return titles
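The split in step 3 can be checked in isolation. The "1位" rank label below is an illustrative assumption about what precedes the title in the element's text:

```python
# The title element's text contains a rank label and the title on
# separate lines (the exact label "1位" is an assumption for illustration).
raw_text = "1位\n観光地 1"
title = raw_text.split("\n")[-1]  # take the last chunk: the title itself
print(title)  # 観光地 1
```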

[Figure: Tour Evaluation]

InfoCollector#get_evaluations gets the elements of the .u_rankBox class.
Once you have the list of elements, extract the decimal from each one as follows.

  1. Create an empty list to hold the decimals.
  2. Take out each element in a for statement and get its .evaluateNumber element with find_element_by_class_name.
  3. Take out the text from that element.
  4. Convert the text to a decimal with the float function.
  5. Append the decimal to the list created in step 1.

InfoCollector#get_evaluations

def get_evaluations(self, query_str):
    self._get_url(query_str)
    elem_rank_boxes = self.chrome.find_elements_by_class_name("u_rankBox")
    evaluations = []
    for elem_rank_box in elem_rank_boxes:
        evaluations.append(float(elem_rank_box.find_element_by_class_name("evaluateNumber").text))
    return evaluations

InfoCollector#_get_ranking_items returns a list of .u_categoryTipsItem elements.
This private method is shared by the public methods below.

InfoCollector#_get_ranking_items

def _get_ranking_items(self):
    elem_ranking_items = self.chrome.find_elements_by_class_name("u_categoryTipsItem")
    return elem_ranking_items

[Figure: Tour Categories]

InfoCollector#get_categories returns a list of the categories.
Every ranking item lists the same categories, so the method returns only the first inner list to avoid duplicates.

InfoCollector#get_categories

def get_categories(self):
    self._get_url()
    elem_ranking_items = self._get_ranking_items()
    categories = tag_elems_list(elem_ranking_items, "dt")
    return categories[0]

Here you may notice the tag_elems_list function, whose logic is as follows.

  1. Create an empty list to hold the lists of text extracted from each item.
  2. Take out each item (elem_ranking_items) in a for statement.
  3. Create an empty temporary list to hold the text.
  4. Take out each tag element from the list returned by find_elements_by_tag_name in a for statement.
  5. Extract the text with the .text attribute and append it to the temporary list created in step 3.
  6. Append the temporary list to the list created in step 1.

list_hander.tag_elems_list

def tag_elems_list(items, tag):
    elems_list = []
    for item in items:
        _elems_list = []
        for elem in item.find_elements_by_tag_name(tag):
            _elems_list.append(elem.text)
        elems_list.append(_elems_list)
    return elems_list
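Because tag_elems_list only relies on each item exposing find_elements_by_tag_name and each element exposing .text, its nesting behaviour can be sketched with stub objects; the stub classes below are hypothetical, for illustration only:

```python
class StubElem:
    """Stub for a Selenium WebElement, exposing only .text."""
    def __init__(self, text):
        self.text = text

class StubItem:
    """Stub for a ranking item, exposing find_elements_by_tag_name."""
    def __init__(self, texts):
        self._texts = texts
    def find_elements_by_tag_name(self, tag):
        return [StubElem(t) for t in self._texts]

def tag_elems_list(items, tag):
    # Same logic as in the article: one inner list of texts per item.
    elems_list = []
    for item in items:
        _elems_list = []
        for elem in item.find_elements_by_tag_name(tag):
            _elems_list.append(elem.text)
        elems_list.append(_elems_list)
    return elems_list

items = [StubItem(["楽しさ", "景色"]), StubItem(["楽しさ", "景色"])]
print(tag_elems_list(items, "dt"))  # [['楽しさ', '景色'], ['楽しさ', '景色']]
```

The identical inner lists illustrate why get_categories keeps only the first element.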

[Figure: Tour Rankings]

InfoCollector#get_rankings returns a two-dimensional list of the rankings.

InfoCollector#get_rankings

def get_rankings(self, query_str):
    self._get_url(query_str)
    elem_ranking_items = self._get_ranking_items()
    rankings = class_elems_list(elem_ranking_items, "is_rank")
    return rankings

[Figure: Tour Comments]

InfoCollector#get_comments returns a two-dimensional list of the comments.

InfoCollector#get_comments

def get_comments(self, query_str):
    self._get_url(query_str)
    elem_ranking_items = self._get_ranking_items()
    comments = class_elems_list(elem_ranking_items, "comment")
    return comments

Here you may notice the class_elems_list function, whose logic is as follows.

  1. Create an empty list to hold the lists of values extracted from each item.
  2. Take out each item (elem_ranking_items) in a for statement.
  3. Create an empty temporary list to hold the values.
  4. Take out each class element from the list returned by find_elements_by_class_name in a for statement.
  5. Extract the text with the .text attribute and append it to the temporary list created in step 3, as a float if isfloat returns True or as a string otherwise.
  6. Append the temporary list to the list created in step 1.

list_hander.class_elems_list

def class_elems_list(items, klass):
    elems_list = []
    for item in items:
        _elems_list = []
        for elem in item.find_elements_by_class_name(klass):
            if isfloat(elem.text):
                _elems_list.append(float(elem.text))
            else:
                _elems_list.append(elem.text)
        elems_list.append(_elems_list)
    return elems_list

Here you may notice the isfloat function: if isdecimal shows that the parameter is not an integer string, it returns True unless the float function raises a ValueError; otherwise it returns False.

decimal_handler.isfloat

def isfloat(param):
    if not param.isdecimal():
        try:
            float(param)
            return True
        except ValueError:
            return False
    else:
        return False
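A few quick checks make this behaviour concrete: decimal strings are accepted, integer-looking strings are rejected by the isdecimal guard, and anything float cannot parse is rejected as well.

```python
def isfloat(param):
    # Same logic as decimal_handler.isfloat above.
    if not param.isdecimal():
        try:
            float(param)
            return True
        except ValueError:
            return False
    else:
        return False

print(isfloat("4.5"))   # True  - a decimal string
print(isfloat("4"))     # False - isdecimal() is True, so integer strings are rejected
print(isfloat("景色"))  # False - float() raises ValueError
```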

2-2. Export CSV

InfoCollector#export_csv takes titles, evaluations, categories, rankings and a path, and exports a CSV file.

  1. Create an empty data frame with the DataFrame constructor (do not forget to import pandas as pd beforehand).
  2. Assign titles and evaluations to df["観光地"] and df["総合評価"] respectively.
  3. Create another data frame from rankings.
  4. Assign categories to the columns attribute of df_rankings.
  5. Concatenate df with df_rankings and assign the result back to df.
  6. Pass the path to the to_csv method, which exports the CSV file to the given path.

InfoCollector#export_csv

def export_csv(self, titles, evaluations, categories, rankings, path):
    df = pd.DataFrame()
    df["観光地"] = titles
    df["総合評価"] = evaluations
    df_rankings = pd.DataFrame(rankings)
    df_rankings.columns = categories
    df = pd.concat([df, df_rankings], axis=1)
    df.to_csv(path)
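The data-frame assembly can be sketched with a couple of rows of dummy data (the values below are made up for illustration, and the to_csv step is omitted so no file is written):

```python
import pandas as pd

# Dummy inputs mirroring the shapes export_csv expects.
titles = ["観光地 1", "観光地 2"]
evaluations = [4.7, 4.7]
categories = ["楽しさ", "人混みの多さ", "景色", "アクセス"]
rankings = [[4.6, 4.5, 4.9, 4.2], [4.6, 4.5, 4.9, 4.2]]

df = pd.DataFrame()
df["観光地"] = titles            # one row per tour title
df["総合評価"] = evaluations     # overall evaluation column
df_rankings = pd.DataFrame(rankings)
df_rankings.columns = categories  # label the per-category columns
df = pd.concat([df, df_rankings], axis=1)  # join side by side

print(list(df.columns))  # ['観光地', '総合評価', '楽しさ', '人混みの多さ', '景色', 'アクセス']
print(df.shape)          # (2, 6)
```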

3. Unit Test

  • TestInfoCollector#setUp launches Google Chrome, navigates to the base URL, and builds lists of the titles, evaluations and rankings from all pages, plus the categories from the first page only.
  • TestInfoCollector#test_get_titles checks if InfoCollector#get_titles returns all titles.
  • TestInfoCollector#test_get_evaluations checks if InfoCollector#get_evaluations returns all evaluations.
  • TestInfoCollector#test_get_categories checks if InfoCollector#get_categories returns all categories without duplicates.
  • TestInfoCollector#test_get_rankings checks if InfoCollector#get_rankings returns all rankings.
  • TestInfoCollector#test_get_comments checks if InfoCollector#get_comments returns all comments.
  • TestInfoCollector#test_export_csv checks if InfoCollector#export_csv exports a CSV file to the right path.

test/test_text_extractor.py

import unittest
import sys
sys.path.append("../lib")
sys.path.append("../lib/concerns")
import os.path
from os import path
from info_collector import InfoCollector

class TestInfoCollector(unittest.TestCase):
    def setUp(self):
        self.info_collector = InfoCollector("https://scraping-for-beginner.herokuapp.com/ranking/")
        titles = []
        evaluations = []
        rankings = []
        for i in range(1, 4):
            titles.append(self.info_collector.get_titles("?page={}".format(i)))
            evaluations.append(self.info_collector.get_evaluations("?page={}".format(i)))
            rankings.append(self.info_collector.get_rankings("?page={}".format(i)))
        self.titles = sum(titles, [])
        self.evaluations = sum(evaluations, [])
        self.rankings = sum(rankings, [])
        self.categories = self.info_collector.get_categories()

    def test_get_titles(self):
        self.assertEqual([
            "観光地 1",
            "観光地 2",
            "観光地 3",
            "観光地 4",
            "観光地 5",
            "観光地 6",
            "観光地 7",
            "観光地 8",
            "観光地 9",
            "観光地 10",
            "観光地 11",
            "観光地 12",
            "観光地 13",
            "観光地 14",
            "観光地 15",
            "観光地 16",
            "観光地 17",
            "観光地 18",
            "観光地 19",
            "観光地 20",
            "観光地 21",
            "観光地 22",
            "観光地 23",
            "観光地 24",
            "観光地 25",
            "観光地 26",
            "観光地 27",
            "観光地 28",
            "観光地 29",
            "観光地 30",
        ], self.titles)

    def test_get_evaluations(self):
        self.assertEqual([
            4.7, 4.7, 4.6, 4.5, 4.5, 4.4, 4.3, 4.3, 4.2, 4.1,
            4.1, 4.0, 3.9, 3.9, 3.8, 3.7, 3.7, 3.6, 3.5, 3.5,
            3.4, 3.3, 3.3, 3.2, 3.1, 3.1, 3.0, 2.9, 2.9, 2.8
        ], self.evaluations)

    def test_get_categories(self):
        self.categories = self.info_collector.get_categories()
        self.assertEqual(["楽しさ", "人混みの多さ", "景色", "アクセス"], self.categories)

    def test_get_rankings(self):
        self.assertEqual([
            [4.6, 4.5, 4.9, 4.2],
            [4.6, 4.5, 4.9, 4.2],
            [4.5, 4.4, 4.8, 4.1],
            [4.4, 4.4, 4.8, 4.0],
            [4.4, 4.3, 4.7, 4.0],
            [4.3, 4.3, 4.7, 3.9],
            [4.2, 4.2, 4.6, 3.8],
            [4.2, 4.2, 4.6, 3.8],
            [4.1, 4.1, 4.5, 3.7],
            [4.0, 4.1, 4.4, 3.6],
            [4.0, 4.0, 4.4, 3.6],
            [3.9, 4.0, 4.3, 3.5],
            [3.8, 3.9, 4.3, 3.4],
            [3.8, 3.9, 4.2, 3.4],
            [3.7, 3.8, 4.2, 3.3],
            [3.6, 3.8, 4.1, 3.2],
            [3.6, 3.7, 4.1, 3.2],
            [3.5, 3.7, 4.0, 3.1],
            [3.4, 3.6, 3.9, 3.0],
            [3.4, 3.6, 3.9, 3.0],
            [3.3, 3.5, 3.8, 2.9],
            [3.2, 3.5, 3.8, 2.8],
            [3.2, 3.4, 3.7, 2.8],
            [3.1, 3.4, 3.7, 2.7],
            [3.0, 3.3, 3.6, 2.6],
            [3.0, 3.3, 3.6, 2.6],
            [2.9, 3.2, 3.5, 2.5],
            [2.8, 3.2, 3.4, 2.4],
            [2.8, 3.1, 3.4, 2.4],
            [2.7, 3.1, 3.3, 2.3]
        ], self.rankings)

    # Comments are shown at random every time the browser is launched, so the value of each element cannot be tested.
    def test_get_comments(self):
        comments = []
        for i in range(1, 4):
            comments.append(self.info_collector.get_comments("?page={}".format(i)))
        self.assertEqual(30, len(sum(comments, [])))

    def test_export_csv(self):
        self.info_collector.export_csv(self.titles, self.evaluations, self.categories, self.rankings, "../csv/tour_reviews.csv")
        self.assertEqual(True, path.exists("../csv/tour_reviews.csv"))

if __name__ == "__main__":
    unittest.main()

4. Source Code

oasis-forever/web_scraping_tutorial