References
This lesson prepares for lesson 15, where we will create an image classifier. The content is similar to the first lesson of the fastai course. If you have time, we recommend watching the lesson recording.
- Practical Deep Learning for Coders - Lesson 1: Image classification by fastai [video]
Goal
Before we can create an image classifier we need a dataset with training data. We could use one of the standard image datasets, but in real life you usually need to be creative to get enough labeled data for your use case. We will therefore use Google image search to download images for each category we are interested in.
At the end of this notebook you should have at least one dataset with images of different categories. The format of the dataset should be the following:
data
│
└───dog_vs_cat_dataset
    │
    └───cat
    │   │   cat_img_1.png
    │   │   cat_img_2.png
    │   │   ...
    │
    └───dog
        │   dog_img_1.png
        │   dog_img_2.png
        │   ...
This is the same structure we already encountered when we trained a text classifier with ULMFiT in the last lecture. Every class (i.e. image label) is contained in a folder that is named after the class. We built a helper class called ImageDownloader that we can use to set this up. It also downloads Google image search results automatically into these folders.
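Once you have created a dataset, a quick way to check that it follows this layout is to list the class folders and count the files in each one. Below is a minimal sketch using pathlib; the summarise_dataset helper and the example path are not part of dslectures, just an illustration.
from pathlib import Path

def summarise_dataset(dataset_path):
    # print every class folder and the number of files it contains
    for class_dir in sorted(p for p in Path(dataset_path).iterdir() if p.is_dir()):
        n_files = sum(1 for f in class_dir.iterdir() if f.is_file())
        print(f'{class_dir.name}: {n_files} images')

# summarise_dataset('../data/dog_vs_cat_dataset')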
Prerequisites
In order to get the images from Google search we need chromium-chromedriver. If you run this notebook on Binder, chromium-chromedriver is already installed. If you want to run it on another machine, you need to install the driver manually. On a Linux machine this can be done with the following commands:
# ! apt-get update
# ! apt-get install chromium-chromedriver
For more information visit https://www.chromium.org/.
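If you are unsure whether the driver is already available, you can look it up on the PATH from Python. This is only a quick sanity check; depending on the distribution the binary may live outside the PATH (for example under /usr/lib/chromium-browser/), so a None result does not necessarily mean it is missing.
import shutil
# prints the path to the chromedriver binary, or None if it cannot be found on the PATH
print(shutil.which('chromedriver'))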
!pip install fastai --no-cache-dir -qq
Then we can import the ImageDownloader and the fastai helper functions.
# uncomment if running locally or on Google Colab
# !pip install --upgrade dslectures
from dslectures.image_downloader import ImageDownloader
from fastai.vision import *
data_path = Path('../data/')
dataset_name = 'memes'
With this information we can create a new ImageDownloader object.
img_dl = ImageDownloader(data_path, dataset_name)
This will also create a new folder in the data_path called 'memes':
data_path.ls()
Create new class
We are now ready to create our first class in this dataset. We want to create two classes: one containing dank memes and one containing lame memes. Therefore we call the first class 'dank_meme'. This will also be the name of the class folder inside the dataset folder. It is usually a good idea to avoid whitespace when naming folders and files; you can replace spaces with underscores or dashes.
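If you start from a name that contains spaces, a one-liner is enough to clean it up (this is just an illustration, not something the ImageDownloader requires):
# hypothetical example: turn a human-readable name into a folder-friendly one
raw_name = 'dank meme'
print(raw_name.replace(' ', '_'))  # -> dank_meme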
The second piece of information we need is the search query. This is what you would enter on the Google search website. In this case we search for 'dank meme'.
class_name = 'dank_meme'
search_query = 'dank meme'
img_dl.add_images_to_class(class_name, search_query)
Depending on the search query this should yield somewhere between 100 and 700 images, which are stored in the class folder. You may want to download images from several search queries into the same class folder. You can do that with the image downloader: simply create another query and pass it together with the same class name.
search_query = 'great meme'
img_dl.add_images_to_class(class_name, search_query)
search_query = 'funny meme'
img_dl.add_images_to_class(class_name, search_query)
This should be enough images for that class to work with.
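If you want to check how many images actually ended up in the class folder, you can count the files in it. A small sketch using pathlib's glob; the exact number will vary from run to run:
# count the files that were downloaded into the class folder
n_images = len(list((data_path/dataset_name/class_name).glob('*')))
print(f'{class_name}: {n_images} images')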
We now want to create the second class in our dataset with lame memes. To do so we run the same function but with a different class_name.
class_name = 'lame_meme'
search_query = 'lame meme'
img_dl.add_images_to_class(class_name, search_query)
You can create as many classes as you want within one dataset. For the purpose of next week's lecture we suggest creating at least one dataset with 2-20 classes. Of course you can also create more than one dataset!
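If you plan to build a dataset with several classes, you can also loop over pairs of class names and search queries and call add_images_to_class for each one. A sketch, assuming the class names and queries below are the ones you want:
# class name -> search queries; the queries here are only examples
classes = {
    'dank_meme': ['dank meme', 'great meme', 'funny meme'],
    'lame_meme': ['lame meme', 'boring meme'],
}
for class_name, queries in classes.items():
    for search_query in queries:
        img_dl.add_images_to_class(class_name, search_query)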
Delete dataset
It can happen that you want to start from scratch. Unfortunately, deleting folders that contain files is not possible from the JupyterLab file browser. You can use the following commands to delete a folder with all of its contents:
# !rm -rf ../data/my_dataset_name/
# !rm -rf ../data/my_dataset_name/my_class/
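Alternatively, you can delete folders from Python with shutil.rmtree. It removes a folder and everything in it without asking, so the calls are commented out here as well:
import shutil
# shutil.rmtree(data_path/dataset_name)              # remove the whole dataset
# shutil.rmtree(data_path/dataset_name/'lame_meme')  # remove a single class folder
After any cleanup, we can list the remaining class folders and load the dataset with fastai to check that everything is in place: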
(data_path/dataset_name).ls()
data = ImageDataBunch.from_folder(data_path/dataset_name, valid_pct=0.2, size=224)
ImageDataBunch.from_folder loads the images directly from the class folders, holds out 20% of them as a validation set, and resizes them to 224 pixels. Once the data is loaded we can plot a random selection of images from the dataset.
data.show_batch(rows=3, figsize=(8, 8))
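As a final check we can look at which classes fastai found and how the images were split between the training and validation sets; classes, train_ds and valid_ds are standard attributes of a fastai v1 ImageDataBunch:
print(data.classes)
print(len(data.train_ds), len(data.valid_ds))
Finally, we compress the dataset folder into a single .tar.gz archive so that it is easy to download: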
!tar -zcf {data_path/dataset_name}.tar.gz {data_path/dataset_name}
By clicking on the folder icon at the top left of the JupyterLab user interface you can navigate to the '../data/' folder, right-click on the compressed file (with the file ending .tar.gz), and download it.
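If right-clicking does not work in your setup, IPython's FileLink can create a download link directly in the notebook. Note that the link only resolves if the file lies inside the directory served by Jupyter, so this is just an optional alternative:
from IPython.display import FileLink
# link to the archive created above
FileLink(f'{data_path/dataset_name}.tar.gz')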