The “normal” (class 0) observations were generated from tiles of the original DDSM images, so a number of them contain chart annotations from the image background, such as the following:

import matplotlib.pyplot as plt

def plot_img_from_file(filename, ax, title=None):
    """Display the image stored at `filename` on the given axes, with an optional title."""
    img = plt.imread(filename)
    ax.imshow(img, cmap='gray')
    if title is not None:
        ax.set_title(title)

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10,15))
axes = axes.ravel()
for ax, file in zip(axes, ['c311.png', 'c545.png', 'c578.png', 'c596.png', 'c1191.png']):
    plot_img_from_file(f"data/train_images/{file}", ax)


It was easy to remove uninformative observations that were mostly or entirely black, based on the distribution of pixel intensities in each image, but this approach failed to sort out observations with the kind of “misinformative” content shown above. Because these images contain distinctive content that would be easy for a CNN to recognize, and because only the normal-class images contain this type of content, we feared that the classifier would learn to associate annotative lettering with the normal class. The trained model would then perform poorly when used to classify the full images, in which chart annotations are not uncommon even in scans that contain abnormalities (masses or calcifications).
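The intensity-based filter described above can be sketched as follows. This is a minimal illustration rather than the code used for the project: the function name `is_mostly_black` and both thresholds are assumptions made for the example, and pixel values are assumed to be scaled to [0, 1].

```python
import numpy as np

def is_mostly_black(img, dark_threshold=0.05, min_dark_fraction=0.95):
    """Return True if at least `min_dark_fraction` of the pixels are near-black.

    `img` is assumed to hold intensities scaled to [0, 1]; both thresholds
    are illustrative values, not the ones used for the original filtering.
    """
    img = np.asarray(img, dtype=float)
    dark_fraction = np.mean(img < dark_threshold)
    return bool(dark_fraction >= min_dark_fraction)

# Synthetic stand-ins for real tiles
black_tile = np.zeros((128, 128))
tissue_tile = np.full((128, 128), 0.4)
annotated_tile = np.zeros((128, 128))
annotated_tile[40:80, 20:110] = 1.0   # bright "lettering" on a black background

print(is_mostly_black(black_tile))      # True
print(is_mostly_black(tissue_tile))     # False
print(is_mostly_black(annotated_tile))  # False
```

The last case shows why such a filter cannot catch annotated tiles: the lettering contributes enough bright pixels that the tile no longer looks empty.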

We decided to use a model to automate the identification of these “misinformative” images within the set of all normal-class observations. Since we did not have a large set of scans with lettering that could be used to train a model in a supervised fashion, we had to adopt an unsupervised method instead. We therefore trained an autoencoder on the normal images, expecting that it would capture the latent characteristics of the text annotations that we ourselves could recognize by sight. We then performed unsupervised clustering in the learned latent space to identify clusters of observations with these types of features.

Load class 0 (“normal”) images

import pandas as pd

# training set: keep only class 0 rows and drop the columns that are no longer needed
train_data_c0_df = pd.read_csv('data/training_data.csv')
train_data_c0_df = train_data_c0_df.loc[train_data_c0_df['class']==0,:]
train_data_c0_df = train_data_c0_df.drop(columns=['normal_class','class','Unnamed: 0'])
train_data_c0_df['filename'] = train_data_c0_df['id'] + '.png'

# test set: keep only class 0 rows
test_data_c0_df = pd.read_csv('data/test_data.csv')
test_data_c0_df = test_data_c0_df.loc[test_data_c0_df['class']==0,:]
test_data_c0_df['filename'] = test_data_c0_df['id'] + '.png'

Build and fit autoencoder


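# Imports are assumed to come from standalone Keras 2.x, which still provides fit_generator
import numpy as np
from keras import backend as K
from keras.callbacks import EarlyStopping
from keras.layers import (Activation, BatchNormalization, Conv2D, Conv2DTranspose,
                          Dense, Flatten, Input, MaxPooling2D, Reshape, UpSampling2D)
from keras.models import Model
from keras.preprocessing import image

latent_dim = 150
img_dims = (128,128,3)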

input_img = Input(shape=img_dims)

x = Conv2D(128, (3,3), padding='same', use_bias=False)(input_img)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = MaxPooling2D((2,2))(x)
x = Conv2D(64, (3,3), padding='same', use_bias=False)(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = MaxPooling2D((2,2))(x)
x = Conv2D(32, (3,3), padding='same', use_bias=False)(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = MaxPooling2D((2,2))(x)
shape_before_flattening = K.int_shape(x)
x = Flatten()(x)
x = Dense(256)(x)
z = Dense(latent_dim, activation='relu')(x)

encoder = Model(input_img, z)
decoder_input = Input(K.int_shape(z)[1:])

x = Dense(256)(decoder_input)
x = Dense(np.prod(shape_before_flattening[1:]), activation='relu')(x)
x = Reshape(shape_before_flattening[1:])(x)
x = UpSampling2D((2,2))(x)
x = Conv2DTranspose(32, (3,3), padding='same', use_bias=False)(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = UpSampling2D((2,2))(x)
x = Conv2DTranspose(64, (3,3), padding='same', use_bias=False)(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = UpSampling2D((2,2))(x)
x = Conv2DTranspose(128, (3,3), padding='same', activation='relu')(x)

decoder_output = Conv2D(3, 1, activation='sigmoid')(x)

decoder = Model(decoder_input, decoder_output)
ae_out = decoder(encoder(input_img))

ae = Model(input_img, ae_out)
ae.compile(optimizer='adam', loss='binary_crossentropy')
batch_size = 128
epochs = 30
steps_per_epoch = int(np.ceil(len(train_data_c0_df) / batch_size))
validation_steps = int(np.ceil(len(test_data_c0_df) / batch_size))
train_datagen = image.ImageDataGenerator(rescale=1./255)
test_datagen = image.ImageDataGenerator(rescale=1./255)

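# The arguments below are assumed, since they are cut off in the source listing;
# 'data/test_images' in particular is a hypothetical path.
# class_mode='input' makes each batch its own target, as an autoencoder requires.
train_generator = train_datagen.flow_from_dataframe(
    train_data_c0_df, directory='data/train_images', x_col='filename',
    target_size=img_dims[:2], class_mode='input', batch_size=batch_size)

test_generator = test_datagen.flow_from_dataframe(
    test_data_c0_df, directory='data/test_images', x_col='filename',
    target_size=img_dims[:2], class_mode='input', batch_size=batch_size)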
Found 48596 images.
Found 13360 images.
early_stopping = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='auto')

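# arguments after the generator are assumed from the variables defined above
history = ae.fit_generator(train_generator,
                           steps_per_epoch=steps_per_epoch,
                           epochs=epochs,
                           validation_data=test_generator,
                           validation_steps=validation_steps,
                           callbacks=[early_stopping])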
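from skimage.transform import resize

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12,8))

for i, imgnum in enumerate(['c311', 'c545', 'c578']):
    filepath = f"data/train_images/{imgnum}.png"
    img = plt.imread(filepath)
    img = resize(img, img_dims, mode='constant')

    # Top row: original image
    axes[0,i].imshow(img)
    axes[0,i].set_title(f"Original: {imgnum}")

    # Map original image into the latent space, then decode it again
    img_encoded = encoder.predict(np.array([img]))
    img_decoded = decoder.predict(img_encoded)[0]

    # Bottom row: reconstruction
    axes[1,i].imshow(img_decoded)
    axes[1,i].set_title(f"Reconstructed: {imgnum}")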


The images reconstructed by the autoencoder only vaguely resemble the originals, which suggests that the model may not have learned to distinguish the alphanumeric annotations from the desired content (breast tissue). We also trained several variational autoencoder (VAE) models, but they performed no better than this vanilla autoencoder.
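The visual impression of poor reconstructions could be backed by a number. The sketch below computes a per-image mean squared error between originals and reconstructions. It is an illustration only: `reconstruction_mse` is a hypothetical helper, and the arrays are synthetic stand-ins for a batch of images and the corresponding output of `decoder.predict(encoder.predict(batch))`.

```python
import numpy as np

def reconstruction_mse(originals, reconstructions):
    """Per-image mean squared error between two batches shaped (n, h, w, c)."""
    originals = np.asarray(originals, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    return np.mean((originals - reconstructions) ** 2, axis=(1, 2, 3))

rng = np.random.default_rng(0)
originals = rng.random((4, 128, 128, 3))                # synthetic "images" in [0, 1]
perfect = originals.copy()                              # an ideal reconstruction
blurry = np.full_like(originals, originals.mean())      # a featureless gray reconstruction

mse_perfect = reconstruction_mse(originals, perfect)
mse_blurry = reconstruction_mse(originals, blurry)
print(mse_perfect)   # all zeros
print(mse_blurry)    # strictly positive
```

Comparing this error between tiles with and without lettering would be one way to check whether the annotations are reconstructed any worse than tissue.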

Map the original images into the learned latent feature space

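encodings = []
ids = []
idx = []
for row in train_data_c0_df.iterrows():
    filepath = "data/train_images/" + row[1]['filename']
    img = plt.imread(filepath)
    img = resize(img, img_dims, mode='constant')
    # encode the image and record it together with its id and DataFrame index
    encodings.append(encoder.predict(np.array([img]))[0])
    ids.append(row[1]['id'])
    idx.append(row[0])

# prepare DataFrame with latent feature values for each image in the training set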
encodings_df = pd.DataFrame({'idx':idx, 'id':ids, 'encodings':encodings}).set_index('idx')
encodings_df = train_data_c0_df.merge(encodings_df, left_index=True, right_index=True)
encodings_df = encodings_df.drop(columns='id_y')

encodings_feats = encodings_df['encodings'].apply(pd.Series)
encodings_feats = encodings_feats.rename(columns = lambda x : 'latentvar_' + str(x+1))
encodings_df = encodings_df.merge(encodings_feats, left_index=True, right_index=True)

Cluster analysis

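from sklearn.preprocessing import StandardScaler

latentvar_cols = [col for col in encodings_df.columns if col.startswith('latentvar_')]
X = encodings_df[latentvar_cols].copy()
y = encodings_df['id_x'].copy()

# standardize each latent feature to zero mean and unit variance
X = StandardScaler().fit_transform(X)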


We first use DBSCAN to identify observations similar to the five displayed above. We do not know the structure of the latent feature space, nor do we know even roughly how many observations contain only annotation text (i.e. should be removed). DBSCAN suits this situation because it does not require the number of clusters to be specified in advance and does not impose assumptions on the shape of the clusters.
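To make these two properties concrete, here is a minimal NumPy sketch of the DBSCAN idea on synthetic 2-D data. The function `dbscan_sketch` and the toy data are assumptions made for illustration. It is not scikit-learn's implementation, and its dense distance matrix would be far too large for the full dataset.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Minimal DBSCAN: returns one label per row, with -1 marking noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each neighborhood includes the point itself
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from an unlabelled core point
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue   # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(30, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.05, size=(30, 2))
outlier = np.array([[2.5, -4.0]])
X_demo = np.vstack([blob_a, blob_b, outlier])

labels = dbscan_sketch(X_demo, eps=0.5, min_samples=5)
print(sorted(set(labels.tolist())))   # [-1, 0, 1]
print(labels[-1])                     # -1
```

The number of clusters (two) is discovered rather than supplied, and the isolated point is labelled -1, which is the same convention scikit-learn uses for noise.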

With the default parameters (eps=0.5 and min_samples=5 in scikit-learn), DBSCAN labels all five observations as “noisy” (not belonging to any cluster). We therefore increased eps, the maximum distance between two samples for them to be considered part of the same neighborhood, to 0.9, but obtained the same result.

from sklearn.cluster import DBSCAN

# eps raised from the scikit-learn default of 0.5; min_samples left at its default of 5
dbscan_results = DBSCAN(eps=0.9).fit(X)
# cluster membership of each image (-1 represents "noisy" observations)
dbscan_df = pd.DataFrame({'img_id': y, 'cluster_id': dbscan_results.labels_})
dbscan_df[dbscan_df['img_id'].isin(['c311', 'c545', 'c578', 'c596', 'c1191'])]
      img_id  cluster_id
310     c311          -1
544     c545          -1
577     c578          -1
595     c596          -1
1190   c1191          -1
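When every point of interest comes back as noise, one thing worth checking is whether eps is on the scale of the actual neighbor distances. For standardized features, the mean squared Euclidean distance between two distinct points is about twice the number of features, so with 150 features a typical pairwise distance is around 17, although nearest-neighbor distances can be much smaller. The sketch below applies the usual k-distance heuristic: with min_samples=5, a core point needs 4 other points within eps, so it computes the distance from each point to its 4th nearest neighbor. It runs on synthetic Gaussian data standing in for the scaled encodings, so the numbers say nothing about the real latent space, and `kth_neighbor_distances` is a hypothetical helper.

```python
import numpy as np

def kth_neighbor_distances(X, k):
    """Distance from each row of X to its k-th nearest other row (brute force)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clip tiny negatives from round-off
    np.fill_diagonal(sq_dists, 0.0)           # each point is at distance 0 from itself
    dists = np.sqrt(sq_dists)
    # After sorting each row, column 0 is the point itself
    return np.sort(dists, axis=1)[:, k]

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((500, 150))   # synthetic stand-in for the scaled encodings
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

k_dist = kth_neighbor_distances(X_demo, k=4)
print(f"median 4th-neighbor distance: {np.median(k_dist):.1f}")
print(f"share of points with 4 neighbors within eps=0.9: {np.mean(k_dist <= 0.9):.2f}")
```

Running the same function on the real `X` (or a random sample of it, given the quadratic memory cost) would show whether any eps near 0.9 could produce core points at all.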

k nearest neighbors

Since DBSCAN did not assign any of these observations to a cluster, we instead looked for the k nearest neighbors in latent space of each original observation, using a k-d tree. The results confirm that these observations do not sit close to other observations that resemble them in any obvious way.

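from sklearn.neighbors import KDTree

kdt = KDTree(X, metric='euclidean')
# get k nearest neighbors to each original image
# (named n_neighbors to avoid shadowing the backend alias K used when building the model)
n_neighbors = 5
orig_imgs = ['c311', 'c578', 'c596']

fig, axes = plt.subplots(nrows=len(orig_imgs), ncols=n_neighbors, figsize=(12,8))

for i, orig_img in enumerate(orig_imgs):
    # the query image is in the tree, so its first neighbor is normally itself
    dists, idxs = kdt.query(X[encodings_df['id_x']==orig_img,:], k=n_neighbors)
    # idxs are row positions in X, so index positionally with iloc rather than by label
    nbr_imgs = encodings_df.iloc[idxs.squeeze()]['filename'].tolist()
    for j, imgfile in enumerate(nbr_imgs):
        plot_img_from_file(f"data/train_images/{imgfile}", axes[i,j], f"{imgfile}")
# label each row of the grid with the id of its query image
for ax, row in zip(axes[:,0], orig_imgs):
    ax.annotate(row, xy=(0, 0.5), xytext=(-ax.yaxis.labelpad+2, 0),
                xycoords=ax.yaxis.label, textcoords='offset points',
                fontsize=16, ha='right', va='center')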
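A brute-force search is a cheap way to sanity-check a tree-based neighbor query on a small sample. The sketch below uses only NumPy on synthetic data, and `knn_brute_force` is a hypothetical helper rather than part of scikit-learn. It also makes two details visible: a point that belongs to the indexed set is its own nearest neighbor, and the returned indices are row positions in the array, not DataFrame labels.

```python
import numpy as np

def knn_brute_force(X, query, k):
    """Return (distances, positions) of the k rows of X closest to `query`."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X - np.asarray(query, dtype=float), axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return dists[order], order

rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 10))   # synthetic stand-in for latent encodings

dists, positions = knn_brute_force(X_demo, X_demo[17], k=5)
print(positions[0])   # 17: the query point is its own nearest neighbor
print(dists[0])       # 0.0
```

To see only genuinely different images in a neighbor grid, one option is to query for k+1 neighbors and discard the first.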

