Oasist Blog

Posts on linguistics, engineering, and life, delivered at my own pace.

Web Scraping in Python + Selenium Vol.3 - Image Collection -

[Image: Web Scraping]


1. Deliverables

Get the images from 画像 - Webスクレイピング入門 (Images: Introduction to Web Scraping) and save them to your local storage.

2. Implementation

2-1. Get Images

First, you need to launch Google Chrome and navigate to a URL.
This happens when an instance of the ImageCollector class is created.
The constructor takes a URL (http://scraping-for-beginner.herokuapp.com/image) as its argument.
The get method navigates to that URL; do not forget to import webdriver in advance.
To share the instance variable images across the class, assign None as its initial value here.

ImageCollector#__init__

def __init__(self, url):
    self.chrome = webdriver.Chrome(executable_path="../exec/chromedriver.exe")
    self.chrome.get(url)
    self.images = None

[Image: Material Placeholder]

ImageCollector#get_images collects the elements with the class .material-placeholder via find_elements_by_class_name.
Once you have the list of matching elements, extract the image from each one as follows (do not forget to import io, request from urllib, and Image from PIL):

  1. Create an empty list to insert the image URL of each element.
  2. Take out each element in for statement, access the img tag with find_element_by_tag_name and extract the image URL with get_attribute("src").
  3. Create an empty list to insert the image.
  4. Access the image URL, read the response body, and wrap the resulting bytes in an in-memory binary stream (io.BytesIO).
  5. Open the stream as an image file with Image.open.
  6. Append the image to the list created in step 3.
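Steps 4 and 5 can be hard to picture without a network call. This is a minimal sketch of step 4 using hard-coded in-memory bytes instead of a downloaded image, so the payload below is a stand-in, not data from the tutorial page:

```python
import io

# Stand-in for request.urlopen(img_url).read(): the raw bytes of a response body
downloaded = b"\x89PNG\r\n\x1a\n fake image payload"

# Step 4: wrap the bytes in an in-memory binary stream
f = io.BytesIO(downloaded)

# Step 5 would pass this stream to Image.open; here we only confirm the
# stream behaves like a file opened in binary mode
print(f.read(8))  # the first eight bytes, i.e. the PNG signature
```

Image.open accepts any file-like object, which is why wrapping the bytes in io.BytesIO is enough; nothing is ever written to disk at this stage.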

ImageCollector#get_images

def get_images(self):
    # Collect the div elements with the class "material-placeholder"
    img_divs = self.chrome.find_elements_by_class_name("material-placeholder")
    img_urls = []
    for img_div in img_divs:
        img_urls.append(img_div.find_element_by_tag_name("img").get_attribute("src"))
    self.images = []
    for img_url in img_urls:
        # Download the image bytes and wrap them in an in-memory stream
        f = io.BytesIO(request.urlopen(img_url).read())
        image = Image.open(f)
        self.images.append(image)
    return self.images

2-2. Save Images

ImageCollector#save_images takes each image out of the list and saves it to the given path.

ImageCollector#save_images

def save_images(self, path):
    # Number the images from 1 and embed the index into the path template
    for i, image in enumerate(self.images, start=1):
        image.save(path.format(i))
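The path argument is a format template: the `{:0=2}` spec zero-pads the index to two digits, which is what produces file names like image_01.jpg. A quick check of how the template behaves (the directory in this string is just an example):

```python
template = "../img/image_{:0=2}.jpg"

# '=' aligns padding after the sign; fill character '0', minimum width 2
print(template.format(1))   # ../img/image_01.jpg
print(template.format(24))  # ../img/image_24.jpg
```

Because the padding keeps every name the same length, the saved files also sort correctly in a plain alphabetical directory listing.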

3. Unit Test

  • TestImageCollector#setUp launches Google Chrome and accesses 画像 - Webスクレイピング入門.
  • TestImageCollector#test_get_images checks the number of images ImageCollector#get_images returns.
  • TestImageCollector#test_save_images checks whether ImageCollector#save_images saves the images to the right paths.

test/test_image_collector.py

import unittest
import sys
sys.path.append("../lib")
sys.path.append("../img")
import os.path
from os import path
from image_collector import ImageCollector

class TestImageCollector(unittest.TestCase):
    def setUp(self):
        self.image_collector = ImageCollector("https://scraping-for-beginner.herokuapp.com/image")

    def test_get_images(self):
        self.assertEqual(24, len(self.image_collector.get_images()))

    def test_save_images(self):
        self.image_collector.get_images()
        self.image_collector.save_images("../img/image_{:0=2}.jpg")
        for i in range(1, 25):
            self.assertTrue(path.exists("../img/image_{:0=2}.jpg".format(i)))

if __name__ == "__main__":
    unittest.main()

4. Source Code

oasis-forever/web_scraping_tutorial

5. Conclusion

If you have experience with DOM manipulation in JavaScript or E2E testing in a testing framework, web scraping will feel familiar.
In my case, I have implemented System Specs in RSpec with Selenium and Capybara, so it was not difficult at all.

What I learned from Udemy's "Web Scraping for Efficiency in Business" course was very basic, so I will need to refer to advanced materials for more complicated cases.
But the basics are the most important part, so I can say it was good learning.

Please make sure that you do not scrape aggressively or indiscriminately: it can slow down a web server, and such traffic can be regarded as a DDoS attack.

Scraping, when done in a somewhat abusive manner, can slow down the scraped website's servers by sending them inadequately large amounts of requests in a short period of time, wreaking havoc on bandwidth and sometimes making the website completely unresponsive; it is a bit similar to a DDoS attack in a way. Source: web scraping ddos - Google Search
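One simple safeguard is to wait between requests. This is a hedged sketch of such throttling with time.sleep; the fetch placeholder and the one-second default delay are illustrative choices, not part of the course material:

```python
import time

def fetch(url):
    # Placeholder for a real download such as request.urlopen(url).read();
    # here it just returns the URL so the sketch runs without a network
    return url

def polite_fetch_all(urls, delay=1.0):
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between consecutive requests
        results.append(fetch(url))
    return results
```

With the 24 image URLs above and a one-second delay, the whole run takes roughly 23 extra seconds, which is a small price for not hammering someone else's server.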