Wikimedia Commons has thousands of images that need to be cleaned up.
These are organized into categories at Images for cleanup. While most of these categories
require manual fixing, some of them can be partly or entirely automated (and
should be fun programming exercises).
The Images with borders category was the first to catch my attention. Just look at
it:
How hard could it be, right? Not very, but to make things just a bit easier I
decided to work on the Plymouth bus subset (n = 380) first, as these all have
similar (but not identical) borders.
Grabbing the images
The first step is to fetch a list of all the images in the Images with borders
category, then select and download the Plymouth subset.
This script does just that:
Notice that multiple API calls are made. This is due to the limit of 500 results
per request (5000 for accounts with the bot privilege set). The API returns a
parameter called gcmcontinue. By adding this to our next API call, the next batch
of 500 will be returned. This is repeated until the API runs out of items to
return, at which point the gcmcontinue parameter is no longer included in the
API’s response. Its absence is the trigger that makes the script exit the for loop.
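The continuation logic described above can be sketched in Python (the original script used shell tools, so this is an illustration, not the author's code). It assumes the current MediaWiki response format, in which gcmcontinue arrives inside a top-level `continue` object; older API versions returned it elsewhere. The function names, the User-Agent string, and the use of a while loop in place of the original loop are all assumptions made for this sketch.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def base_params(category):
    """Query parameters for one batch of files in a category."""
    return {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "500",  # 500 per request without the bot privilege
        "prop": "imageinfo",
        "iiprop": "url",
    }


def http_fetch(params):
    """Perform one API call and return the decoded JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "border-cleanup-sketch/0.1 (hypothetical)"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def fetch_all_batches(category, fetch=http_fetch):
    """Call the API repeatedly until no continuation values come back."""
    params = base_params(category)
    batches = []
    while True:
        response = fetch(params)
        batches.append(response)
        # The last batch carries no "continue" object (and so no gcmcontinue).
        continuation = response.get("continue")
        if not continuation:
            break
        # Feed gcmcontinue (and any sibling keys) into the next request.
        params = {**base_params(category), **continuation}
    return batches
```

Passing `fetch` as a parameter keeps the loop testable without network access; `fetch_all_batches("Category:Images with borders")` would use the real API.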
jq is used to extract the image URLs from each API response and add them to an
array.
Once we have our array with the image URLs from all of the API calls, they are
downloaded one by one using wget.
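As a rough Python counterpart to the jq and wget steps, the sketch below pulls the image URLs out of one API response and downloads them one at a time. It assumes the default JSON layout where `query.pages` is an object keyed by page ID. The `title_contains` filter is hypothetical, since the post does not say how the Plymouth subset was selected, and the `images_unprocessed` target directory is assumed from the directory used later in the post.

```python
import posixpath
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def image_urls(response, title_contains=None):
    """Collect imageinfo URLs from one API response (like the jq filter)."""
    pages = response.get("query", {}).get("pages", {})
    urls = []
    for page in pages.values():
        # Hypothetical subset selection: keep titles containing a substring.
        if title_contains and title_contains not in page.get("title", ""):
            continue
        for info in page.get("imageinfo", []):
            urls.append(info["url"])
    return urls


def local_name(url):
    """Derive a local file name from the URL path, decoding %-escapes."""
    path = urllib.parse.urlparse(url).path
    return posixpath.basename(urllib.parse.unquote(path))


def download_all(urls, target_dir="images_unprocessed"):
    """Download the URLs one by one (the role wget plays in the post)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    headers = {"User-Agent": "border-cleanup-sketch/0.1 (hypothetical)"}
    for url in urls:
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request) as reply, \
                open(target / local_name(url), "wb") as out:
            shutil.copyfileobj(reply, out)
```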
Cropping an image
As recommended by the {{Remove border}} template, jpegtran is used to do the
cropping because it allows for lossless manipulation of JPEG images. Given
that we’re going to crop hundreds of images manually (sorry, no fancy machine
learning here), the processing pipeline needs to be as efficient as possible.
This is what the procedure looks like for this image:
1. For each corner of the image, cut out a 32 by 32 pixel patch, blow it up to
   512 by 512 pixels, and save it to disk (once with horizontal lines, once with
   vertical lines):
2. For each side of the image, concatenate the corners on that side into one
   image:
3. Show the sides one by one and prompt the user to select the first line that
   falls outside the border region (or falls exactly on the break between border
   and image). The lines are numbered from border to image. For example, the
   response for the bottom side of our example image would be “2”:
4. For the right and bottom sides, after the line has been selected, extract
   the strip that contains the break between border and image, blow it up, add
   lines for each row or column of pixels (depending on the orientation of the
   lines), and prompt the user to select the line that falls exactly on the
   break. For example, the response for the bottom strip would be “3”:
5. Crop the image with the obtained parameters:
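The final crop could be assembled roughly as follows. This is a sketch under assumptions, not the original command: it takes the four border widths in pixels as already known, turns them into jpegtran's `WxH+X+Y` crop geometry, and returns the argument list without running it. The function name and the choice of `-copy all` (to keep metadata) are mine.

```python
def jpegtran_crop_args(src, dst, width, height, left, top, right, bottom):
    """Build a lossless-crop command from image size and border widths."""
    crop_w = width - left - right
    crop_h = height - top - bottom
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("borders are larger than the image")
    # jpegtran expects the region as WIDTHxHEIGHT+XOFFSET+YOFFSET.
    geometry = f"{crop_w}x{crop_h}+{left}+{top}"
    return ["jpegtran", "-copy", "all", "-crop", geometry, "-outfile", dst, src]
```

The list can be handed to `subprocess.run(args, check=True)` on a machine where jpegtran is installed.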
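The reason step 4 is not performed on the top and left sides of the image is
that jpegtran can only cut at MCU boundaries for these sides. As far as I can
tell, this follows from how JPEG stores an image: as a grid of MCUs (blocks of
typically 8 by 8 or 16 by 16 pixels) anchored at the top-left corner, so only the
right and bottom edges can end on a partial block.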
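To make the MCU constraint concrete, here is a small sketch of snapping a top or left offset to an MCU boundary. It assumes a 16 by 16 pixel MCU, which is typical for images with 4:2:0 chroma subsampling; images without subsampling use 8 by 8. jpegtran itself moves a misaligned origin up and to the left (the `snap_down` case), which would keep part of the border, while choosing the first line inside the image corresponds to `snap_up`. The helper names are hypothetical.

```python
def snap_down(offset, mcu=16):
    """Largest MCU boundary at or before the offset (jpegtran's behaviour)."""
    return (offset // mcu) * mcu


def snap_up(offset, mcu=16):
    """Smallest MCU boundary at or after the offset (removes all border)."""
    return -(-offset // mcu) * mcu


def aligned_origin(left, top, mcu_w=16, mcu_h=16):
    """Top-left crop corner that removes the whole border on both sides."""
    return snap_up(left, mcu_w), snap_up(top, mcu_h)
```

With a 5 pixel border and a 16 pixel MCU, `snap_down(5)` gives 0 and `snap_up(5)` gives 16, so removing the border completely costs 11 pixels of image on that side.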
Here’s a video of the whole procedure:
Of course, using feh both to display the borders and strips and to capture user
input is a rather quick-and-dirty approach, but it is quite functional, which is
what I was aiming for after all.
The script that does all of the above:
Cropping them all
To crop all of the images, we simply loop over the images in the
images_unprocessed/ directory, like this:
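A minimal Python sketch of such a loop, assuming the per-image procedure is available as a callable (here the hypothetical `crop_one`) and that only files with a JPEG extension should be processed:

```python
from pathlib import Path


def pending_images(src_dir="images_unprocessed"):
    """Return the JPEG files in the directory, sorted by name."""
    return sorted(
        p for p in Path(src_dir).iterdir()
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )


def crop_all(src_dir, crop_one):
    """Run the interactive crop procedure on every pending image."""
    done = []
    for image in pending_images(src_dir):
        crop_one(image)  # hypothetical: the single-image procedure above
        done.append(image.name)
    return done
```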
Uploading
Once all the images have been cropped, pywikibot is used to upload them to
Commons and to remove the {{Remove border}} template from each file page.
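The template removal itself is a plain text transformation, sketched below. The regular expression and function name are assumptions, not the author's code; it handles the template with or without parameters, but not nested templates inside it. The pywikibot calls for fetching and saving the page are left out because they need a configured account.

```python
import re

# Matches {{Remove border}} or {{remove_border|...}} plus a trailing newline.
REMOVE_BORDER_RE = re.compile(
    r"\{\{\s*[Rr]emove[ _]border\s*(?:\|[^{}]*)?\}\}[ \t]*\n?"
)


def strip_remove_border(wikitext):
    """Return the page text without the {{Remove border}} template."""
    return REMOVE_BORDER_RE.sub("", wikitext)
```

With pywikibot, the result would presumably be assigned to the page's `text` attribute and written back with `save()`.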