Wikimedia Commons has thousands of images that need to be cleaned up. These are organized into categories at Images for cleanup. While most of these categories require manual fixing, some of them can be partly or entirely automated (and should be fun programming exercises).

The Images with borders category was the first to catch my attention. Just look at it:

Examples of images with borders

How hard could it be, right? Not very, but to make things just a bit easier I decided to work on the Plymouth bus subset (n = 380) first, as these all have similar (but not identical) borders.

Grabbing the images

The first step is to fetch a list of all the images in the Image with borders category, then select and download the Plymouth subset. This script does just that:

mkdir -p images_unprocessed

cd images_unprocessed

basequery="https://commons.wikimedia.org/w/api.php?\
action=query&format=json&\
prop=imageinfo&\
generator=categorymembers&\
iiprop=url&\
gcmtitle=Category%3A+Images+with+borders&\
gcmtype=file&\
gcmlimit=500"

apiquery=$basequery

# Query API
while [ -z $done ]; do
  response=$($(echo "curl --silent $apiquery"))

  urls+=( $(echo $response | jq ".query.pages[].imageinfo[].url") )

  if echo $response | grep -q "\"gcmcontinue\":"; then
    gcmcont=$(echo $response | jq ".continue.gcmcontinue")
    apiquery=$(echo $basequery\&gcmcontinue=$gcmcont)
  else
    done=1
  fi
done

# Grab all the files (or in this case all the ones from the Plymouth set)
for url in "${urls[@]}"; do
  if echo $url | grep -q Plymouth; then
    wget $(echo $url | sed -e 's/^"//' -e 's/"$//')
  fi
done

cd ..

Notice that multiple API calls are made. This is due to the 500 results limit (it’s 5000 for accounts with the bot privilege set). The API returns a parameter called gcmcontinue. By adding this to our next API call, the next batch of 500 will be returned. This is repeated until the API runs out of items to return (in which point the gcmcontinue parameter won’t be included in the API’s response anymore, which is the trigger that makes the script exit the for loop).

jq is used to extract the image URLs from the API query and add them to an array.

Once we have our array with the image URLs from all of the API calls, they are downloaded one by one using wget.

Cropping an image

As recommended by the {{Remove border}} template, jpegtran is used to do the cropping because it allows for lossless manipulation of JPEG images. Given that we’re going to crop hundreds of images manually (sorry, no fancy machine learning here), the processing pipeline needs to be as efficient as possible. This is what the procedure looks like for this image:

Example image: uncropped

  1. For each corner of the image, cut out a 32 by 32 pixel patch, blow it up to 512 by 512 pixels, and save it to disk (once with horizontal lines, once with vertical lines):

    Example image: corners

  2. For each side of the image, concatenate the corners on that side into one image:

    Example image: borders

  3. Show the sides one by one and prompt the user to select the first line that falls outside the border region (or falls exactly on the break between border and image). The lines are numbered from border to image. For example, the response for the bottom side of our example image would be “2”:

    Example image: right side

  4. For the right and bottom sides, after the line has been selected, extract the strip that contains the break between border and image, blow it up, add lines for each row or column of pixels (depending on the orientation of the lines) and prompt the user to select the line that falls exactly on the break). For example, the response for the bottom strip would be “3”:

    Example image: right side zoom

  5. Crop image with the obtained parameters:

    Example image: cropped

The reason step 4 is not performed on the top and left side of the image is that jpegtran can only cut at MCU boundaries for these sides. I suppose this is a limitation imposed by the JPEG standard itself.

Here’s a video of the whole procedure:

Of course, using feh to display the borders and strips, and capture user input is a rather quick and dirty approach, but quite functional, which is what I was aiming for after all.

The script that does all of the above:

# This function extracts a 32 by 32 pixel block from one of the image's corners,
# blows it up to 512 by 512 pixels, and draws some lines based on which we can
# tell in which section the break between the border and the image is located.
function corner {
  # Get parameters
  offsetx=$1
  offsety=$2
  lineori=$3
  name=$4

  # Crop, magnify, draw lines
  if [ $lineori == 'horizontal' ]; then
    convert ../$filename \
      -crop 32x32+$offsetx+$offsety \
      +repage \
      -scale 512x512 \
      -fill red \
      -draw "line 0,128 512,128" \
      -draw "line 0,256 512,256" \
      -draw "line 0,384 512,384" \
    $name.png
  elif [ $lineori == 'vertical' ]; then
    convert ../$filename \
      -crop 32x32+$offsetx+$offsety \
      +repage \
      -scale 512x512 \
      -fill red \
      -draw "line 128,0 128,512" \
      -draw "line 256,0 256,512" \
      -draw "line 384,0 384,512" \
    $name.png
  fi
}

# This function crops the four sides to just the section that contains the break
# between the border and the image, and then enlarges this section so we can
# select the border with pixel level accuracy.
function zoom {
  # What side?
  side=$(wc -l cropparams | cut -d ' ' -f 1)

  # Only run this function on the right and bottom sides (jpegtran can only crop
  # the top and left sides at the JPEG MCU block level, usually 8 by 8 blocks).
  if [ "$side" -gt 2 ]; then

    # If right:
    if [ $side -eq 3 ]; then
      # Slice previously created border into strips.
      convert right.png -crop 128x right_%d.png
      # Read the cropparams file to learn which strip needs to be worked on.
      strip=$((4-($(tail -1 cropparams)/8)))
      # Draw lines on the strip.
      convert right_$strip.png \
        -fill red \
        -draw "line 0,0 0,1024" \
        -draw "line 127,0 127,1024" \
        -fill "#808080" \
        -draw "line 32,0,32,1024" \
        -draw "line 64,0 64,1024" \
        -draw "line 96,0 96,1024" \
        -fill "#404040" \
        -draw "line 16,0 16,1024" \
        -draw "line 48,0 48,1024" \
        -draw "line 80,0 80,1024" \
        -draw "line 112,0 112,1024" \
      tmp.png
      # Display the strip using feh; catch the response (i.e. at which line the
      # break between image and border is located) and write it to a file.
      feh tmp.png --cycle-once \
      --action1 "echo 8 >> cropparamsdetail" \
      --action2 "echo 7 >> cropparamsdetail" \
      --action3 "echo 6 >> cropparamsdetail" \
      --action4 "echo 5 >> cropparamsdetail" \
      --action5 "echo 4 >> cropparamsdetail" \
      --action6 "echo 3 >> cropparamsdetail" \
      --action7 "echo 2 >> cropparamsdetail" \
      --action8 "echo 1 >> cropparamsdetail" \
      --action9 "echo 0 >> cropparamsdetail"
    # If bottom, idem:
    elif [ $side -eq 4 ]; then
      strip=$((3-(($(tail -1 cropparams)/8)-1)))
      convert bottom.png -crop x128 bottom_%d.png
      convert bottom_$strip.png \
        -fill red \
        -draw "line 0,0 1024,0" \
        -draw "line 0,127 1024,127" \
        -fill "#808080" \
        -draw "line 0,32 1024,32" \
        -draw "line 0,64 1024,64" \
        -draw "line 0,96 1024,96" \
        -fill "#404040" \
        -draw "line 0,16 1024,16" \
        -draw "line 0,48 1024,48" \
        -draw "line 0,80 1024,80" \
        -draw "line 0,112 1024,112" \
      tmp.png
      feh tmp.png --cycle-once \
      --action1 "echo 8 >> cropparamsdetail" \
      --action2 "echo 7 >> cropparamsdetail" \
      --action3 "echo 6 >> cropparamsdetail" \
      --action4 "echo 5 >> cropparamsdetail" \
      --action5 "echo 4 >> cropparamsdetail" \
      --action6 "echo 3 >> cropparamsdetail" \
      --action7 "echo 2 >> cropparamsdetail" \
      --action8 "echo 1 >> cropparamsdetail" \
      --action9 "echo 0 >> cropparamsdetail"
    fi
  # If the side is the left or top side, skip the zoom crop procedure and write
  # a 0.
  else
    echo 0 >> cropparamsdetail
  fi
  exit;
}





# If we already selected the 8 pixels wide strip in which the border is located,
# call the zoom function to specify precisely the height or width at which we
# will crop.
if [ "$1" == "zoom" ]; then
  zoom $1
fi

# Check if filename was provided and file exists
if [ -z "$1" ]; then
  echo "No filename provided"
  exit
fi
if [ ! -f $1 ]; then
  echo "File does not exist"
  exit
fi

# Create directories
mkdir -p tmp
mkdir -p images_processed

# Get image width and height
filename=$1
width=$(mediainfo $1 | grep Width | tr -d ' ' | grep -o "[[:digit:]]*")
height=$(mediainfo $1 | grep Height | tr -d ' ' | grep -o "[[:digit:]]*")

# Move to temporary directory
cd tmp/

# For each of the corners of the images, extract the corner (see corner
# function) and write it to disk with vertical and once with horizontal lines
corner 0 $(($height-32)) vertical leftbottom
corner 0 $(($height-32)) horizontal bottomleft
corner 0 0 vertical lefttop
corner 0 0 horizontal topleft
corner $(($width-32)) 0 horizontal topright
corner $(($width-32)) 0 vertical righttop
corner $(($width-32)) $(($height-32)) vertical rightbottom
corner $(($width-32)) $(($height-32)) horizontal bottomright

# For each of the sides of the image, create an image containing the corners
# of the side.
montage lefttop.png leftbottom.png -tile 1x2 -geometry +0+0 left.png
montage topleft.png topright.png -geometry +0+0 top.png
montage righttop.png rightbottom.png -tile 1x2 -geometry +0+0 right.png
montage bottomleft.png bottomright.png -geometry +0+0 bottom.png

# For each of the side images just created, show the image and record the user
# input that tells in which strip the break is located.
feh left.png top.png right.png bottom.png --cycle-once \
--action1 'echo 8 >> cropparams && bash ../crop.sh zoom' \
--action2 'echo 16 >> cropparams && bash ../crop.sh zoom' \
--action3 'echo 24 >> cropparams && bash ../crop.sh zoom'

# Calculate the final cropping parameters (expressed as number of pixels from
# left/top/right/bottom).
paste -d \- cropparams cropparamsdetail | bc > cropparamsfinal

# Read the final cropping parameters into an array.
readarray -t cropparams < cropparamsfinal

# Remove the temporary directory
cd ..
rm -r tmp/

# Put each of the cropping parameters in a separate variable for convenience.
w=$(($width-${cropparams[0]}-${cropparams[2]}))
h=$(($height-${cropparams[1]}-${cropparams[3]}))
x=${cropparams[0]}
y=${cropparams[1]}

# Crop the image using jpegtran.
jpegtran -copy all -optimize -perfect -crop $w\x$h+$x+$y $filename > tmp.jpg

# Check if output file has the right dimensions
new_width=$(mediainfo tmp.jpg | grep Width | tr -d ' ' | grep -o "[[:digit:]]*")
new_height=$(mediainfo tmp.jpg | grep Height | tr -d ' ' | \
grep -o "[[:digit:]]*")

if [[ ! new_width -eq w ]]; then
  echo "Width of output file is $new_width instead of $w"
  exit
fi
if [[ ! new_height -eq h ]]; then
  echo "Height of output file is $new_width instead of $w"
  exit
fi

# Move the cropped image to images_processed and delete the original.
mv tmp.jpg images_processed/$filename
rm $filename

Cropping them all

To crop all of the images, we simply loop over the images in the images_unprocessed/ directory, like this:

for path in images_unprocessed/*jpg; do
  filename=$(echo $path | sed 's/images_unprocessed\///')
  cp $path $filename
  ./crop.sh $filename
  # If something went wrong during cropping, abort. Currently only checks
  # whether the dimensions match the dimensions specified during cropping.
  if [ -f error.token ]; then
    echo "Aborting loop."
    exit
  fi
done

Uploading

Once all the images have been cropped, pywikibot is used to upload them to Commons and to remove the {{Remove border}} template from page.

cd images_processed/

for f in *jpg; do
  # Upload the image.
  python ../../pywikibot/pwb.py upload -always -ignorewarn $f \
  "removed borders (losslessly, using jpegtran)"
  # Remove the \{\{Remove border\}\} template.
  page=$(echo "File:$f")
  python ../../pywikibot/pwb.py replace -always -nocase -regex \
  -summary:"Removed \{\{remove border\}\} template" -page:$page \
  "\{\{(crop|removeborder|remove frame|remove borders|remove border)\}\}" ""
done