# Computer Vision Tutorial: Implementing Mask R-CNN for Image Segmentation (with Python Code)

604-李同学

Source: https://www.analyticsvidhya.com/blog/2019/07/computer-vision-implementing-mask-r-cnn-image-segmentation/


# Overview

• We will learn how Mask R-CNN works in a step-by-step manner
• We will also look at how to implement Mask R-CNN in Python and use it for our own images

# Introduction

I am fascinated by self-driving cars. The sheer complexity and mix of different computer vision techniques that go into building a self-driving car system is a dream for a data scientist like me.

So, I set about trying to understand the computer vision technique behind how a self-driving car potentially detects objects. A simple object detection framework might not work because it simply detects an object and draws a fixed shape around it.

That’s a risky proposition in a real-world scenario. Imagine if there’s a sharp turn in the road ahead and our system draws a rectangular box around the road. The car might not be able to understand whether to turn or go straight. That’s a potential disaster!

Instead, we need a technique that can detect the exact shape of the road so our self-driving car system can safely navigate the sharp turns as well.

The latest state-of-the-art framework that we can use to build such a system? That’s Mask R-CNN!

So, in this article, we will first quickly look at what image segmentation is. Then we’ll look at the core of this article – the Mask R-CNN framework. Finally, we will dive into implementing our own Mask R-CNN model in Python. Let’s begin!

# A Brief Overview of Image Segmentation

We learned the concept of image segmentation in a lot of detail in part 1 of this series. We discussed what image segmentation is and its different techniques, like region-based segmentation, edge detection segmentation, and segmentation based on clustering.
I would recommend checking out that article first if you need a quick refresher (or want to learn image segmentation from scratch).

Part 1 of this series: Image Segmentation


I’ll quickly recap that article here. Image segmentation creates a pixel-wise mask for each object in the image. This technique gives us a far more granular understanding of the object(s) in the image. The image shown below will help you to understand what image segmentation is:

Here, you can see that each object (the cells in this particular image) has been segmented. This is how image segmentation works.

We also discussed the two types of image segmentation: Semantic Segmentation and Instance Segmentation. Again, let’s take an example to understand both of these types:

All 5 objects in the left image are people. Hence, semantic segmentation will classify all the people as a single instance. Now, the image on the right also has 5 objects (all of them are people). But here, different objects of the same class have been assigned as different instances. This is an example of instance segmentation.

Part one covered different techniques and their implementation in Python to solve such image segmentation problems. In this article, we will be implementing a state-of-the-art image segmentation technique called Mask R-CNN to solve an instance segmentation problem.

# Understanding Mask R-CNN

Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is widely used for object detection tasks. For a given image, it returns the class label and bounding box coordinates for each object in the image. So, let’s say you pass the following image:

The Faster R-CNN model will return something like this:

The Mask R-CNN framework is built on top of Faster R-CNN. So, for a given image, Mask R-CNN, in addition to the class label and bounding box coordinates for each object, will also return the object mask.

Let’s first quickly understand how Faster R-CNN works. This will help us grasp the intuition behind Mask R-CNN as well.

• Faster R-CNN first uses a ConvNet to extract feature maps from the images
• These feature maps are then passed through a Region Proposal Network (RPN), which returns the candidate bounding boxes
• We then apply an RoI pooling layer on these candidate bounding boxes to bring all the candidates to the same size
• Finally, the proposals are passed to a fully connected layer to classify them and output the bounding boxes for the objects
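
The four steps above can be sketched schematically. The stub functions below only mimic the shape of the pipeline; every name and return value is an illustrative placeholder, not the real networks:

```python
# Schematic sketch of the Faster R-CNN pipeline (placeholders only).

def extract_feature_maps(image):
    # Stand-in for the ConvNet backbone that produces feature maps
    return {"features": image}

def region_proposal_network(features):
    # Stand-in for the RPN: returns candidate boxes (x1, y1, x2, y2)
    return [(0, 0, 10, 10), (5, 5, 20, 20)]

def roi_pooling(features, boxes, size=(7, 7)):
    # Brings every candidate region to the same fixed size
    return [(box, size) for box in boxes]

def classify_and_regress(pooled):
    # Stand-in for the fully connected head: label + refined box
    return [{"box": box, "label": "object"} for box, _ in pooled]

def faster_rcnn(image):
    features = extract_feature_maps(image)
    proposals = region_proposal_network(features)
    pooled = roi_pooling(features, proposals)
    return classify_and_regress(pooled)
```

Mask R-CNN keeps this exact skeleton and adds a parallel mask branch, which is what the rest of this section walks through.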

Once you understand how Faster R-CNN works, understanding Mask R-CNN will be very easy. So, let’s understand it step-by-step, starting from the input all the way to predicting the class label, bounding box, and object mask.

## 1. Backbone Model

Similar to the ConvNet that we use in Faster R-CNN to extract feature maps from the image, we use the ResNet 101 architecture to extract features from the images in Mask R-CNN. So, the first step is to take an image and extract features using the ResNet 101 architecture. These features act as an input for the next layer.

## 2. Region Proposal Network (RPN)

Now, we take the feature maps obtained in the previous step and apply a region proposal network (RPN). This basically predicts whether an object is present in a given region (or not). In this step, we get those regions or feature maps which the model predicts contain some object.

## 3. Region of Interest (RoI)

The regions obtained from the RPN might be of different shapes, right? Hence, we apply a pooling layer and convert all the regions to the same shape. Next, these regions are passed through a fully connected network so that the class label and bounding boxes are predicted.

Up to this point, the steps are almost identical to how Faster R-CNN works. Now comes the difference between the two frameworks: in addition to the above, Mask R-CNN also generates the segmentation mask.

For that, we first compute the region of interest so that the computation time can be reduced. For all the predicted regions, we compute the Intersection over Union (IoU) with the ground truth boxes. We can compute IoU like this:

• IoU = Area of the intersection / Area of the union

Now, only if the IoU is greater than or equal to 0.5 do we consider it a region of interest. Otherwise, we neglect that particular region. We do this for all the regions and keep only the set of regions for which the IoU is greater than or equal to 0.5.
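
To make the IoU computation and the 0.5 threshold concrete, here is a minimal sketch in plain Python. The box format (x1, y1, x2, y2) and the helper names are my own for illustration, not taken from the Mask R-CNN codebase:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_rois(proposals, gt_box, threshold=0.5):
    """Keep only proposals whose IoU with the ground truth is >= threshold."""
    return [p for p in proposals if iou(p, gt_box) >= threshold]
```

For two identical boxes the IoU is 1.0, for disjoint boxes it is 0.0, and partially overlapping boxes fall in between, which is exactly the quantity being thresholded at 0.5 above.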

Let’s understand it using an example. Consider this image:

Here, the red box is the ground truth box for this image. Now, let’s say we got 4 regions from the RPN as shown below:

Here, the IoU of Box 1 and Box 2 is likely less than 0.5, whereas the IoU of Box 3 and Box 4 is greater than 0.5. Hence, we can say that Box 3 and Box 4 are the regions of interest for this particular image, whereas Box 1 and Box 2 will be neglected.

Next, let’s see the final step of Mask R-CNN.

## 4. Segmentation Mask

Once we have the RoIs based on the IoU values, we can add a mask branch to the existing architecture. This returns the segmentation mask for each region that contains an object. It returns a mask of size 28 x 28 for each region, which is then scaled up for inference.
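
As a toy illustration of that scale-up step: real implementations resize the 28 x 28 mask to the detected box with interpolation, but nearest-neighbour upscaling with `np.repeat` is a minimal stand-in that shows the idea:

```python
import numpy as np

# A toy 28x28 mask with a square 'object' in the middle
small_mask = np.zeros((28, 28), dtype=int)
small_mask[7:21, 7:21] = 1

# Scale up by 4x (e.g. to cover a 112x112 region), repeating each
# mask pixel along both axes
scale = 4
big_mask = np.repeat(np.repeat(small_mask, scale, axis=0), scale, axis=1)
```

Each pixel of the low-resolution mask simply becomes a 4 x 4 block, so the total masked area grows by the square of the scale factor.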

Again, let’s understand this visually. Consider the following image:

The segmentation mask for this image would look something like this:

Here, our model has segmented all the objects in the image. This is the final step in Mask R-CNN, where we predict the masks for all the objects in the image.

Keep in mind that the training time for Mask R-CNN is quite high. It took me around 1 to 2 days to train the Mask R-CNN on the famous COCO dataset. So, for the scope of this article, we will not be training our own Mask R-CNN model.

COCO dataset: https://cocodataset.org/#home


We will instead use the pretrained weights of the Mask R-CNN model trained on the COCO dataset. Now, before we dive into the Python code, let’s look at the steps to use the Mask R-CNN model to perform instance segmentation.

It’s time to perform some image segmentation tasks! We will be using the mask rcnn framework created by the data scientists and researchers at Facebook AI Research (FAIR).

Mask R-CNN framework on GitHub: https://github.com/matterport/Mask_RCNN


# Steps to Implement Mask R-CNN

Let’s have a look at the steps we will follow to perform image segmentation using Mask R-CNN.

## Step 1: Clone the repository

First, we will clone the mask rcnn repository, which has the architecture for Mask R-CNN. Use the following command to clone the repository:

git clone https://github.com/matterport/Mask_RCNN.git


Once this is done, we need to install the dependencies required by Mask R-CNN.

## Step 2: Install the dependencies

Here is a list of all the dependencies for Mask R-CNN:

numpy
scipy
Pillow
cython
matplotlib
scikit-image
tensorflow>=1.3.0
keras>=2.0.8
opencv-python
h5py
imgaug
IPython


You must install all these dependencies before using the Mask R-CNN framework.
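
Assuming the dependency list above is saved as a requirements file (the matterport repository ships a requirements.txt and a setup.py), the installation can be done in one go:

```shell
# Run from inside the cloned Mask_RCNN folder
pip install -r requirements.txt

# Install the mrcnn package itself
python setup.py install
```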

## Step 3: Download the pre-trained weights

Next, we need to download the pretrained weights. You can use the link below to download them. These weights are obtained from a model that was trained on the MS COCO dataset. Once you have downloaded the weights, paste the file into the samples folder of the Mask_RCNN repository that we cloned in Step 1.

Download the pre-trained weights: https://github.com/matterport/Mask_RCNN/releases



## Step 4: Predicting for our image

Finally, we will use the Mask R-CNN architecture and the pretrained weights to generate predictions for our own images.

Once you’re done with these four steps, it’s time to jump into your Jupyter Notebook! We will implement all these things in Python and then generate the masks along with the classes and bounding boxes for objects in our images.

# Implementing Mask R-CNN in Python

To execute all the code blocks which I will be covering in this section, create a new Python notebook inside the “samples” folder of the cloned Mask_RCNN repository.

Let’s start by importing the required libraries:

import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt

# Root directory of the project
ROOT_DIR = os.path.abspath("../")

import warnings
warnings.filterwarnings("ignore")

sys.path.append(ROOT_DIR)  # To find local version of the library
from mrcnn import utils
import mrcnn.model as modellib
from mrcnn import visualize
# Import COCO config
sys.path.append(os.path.join(ROOT_DIR, "samples/coco/"))  # To find local version
import coco

%matplotlib inline


Next, we will define the paths for the pretrained weights and the images on which we would like to perform segmentation:

# Directory to save logs and trained model
MODEL_DIR = os.path.join(ROOT_DIR, "logs")

# Local path to trained weights file
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
# Download COCO trained weights from Releases if needed
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)

# Directory of images to run detection on
IMAGE_DIR = os.path.join(ROOT_DIR, "images")


If you have not placed the weights in the samples folder, this will again download the weights. Now we will create an inference config, which will be used to run inference with the Mask R-CNN model:

class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()


What can you infer from the above summary? We can see the various specifications of the Mask R-CNN model that we will be using.

So, the backbone is resnet101, as we discussed earlier. The mask shape returned by the model is 28 x 28, as it was trained on the COCO dataset. And we have a total of 81 classes (including the background).

We can also see various other statistics, like:

• The input shape
• The number of GPUs to be used
• The validation steps, among other things

You should spend a few moments understanding these specifications. If you have any doubts regarding them, feel free to ask me in the comments section below.

Next, we will create our model and load the pretrained weights which we downloaded earlier. Make sure that the pretrained weights are in the same folder as the notebook; otherwise, you have to give the location of the weights file:

# Create model object in inference mode.
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)

# Load weights trained on MS-COCO
model.load_weights(COCO_MODEL_PATH, by_name=True)

Now, we will define the classes of the COCO dataset, which will help us in the prediction phase:

# COCO Class names
class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard',
'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
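
As a quick illustration of how this list is used: the integer IDs the model returns in r['class_ids'] index directly into class_names. Below is a toy lookup with a truncated list and made-up IDs, purely for illustration:

```python
# Truncated version of the COCO class list above (index 0 is background)
class_names = ['BG', 'person', 'bicycle', 'car']

# Hypothetical class IDs, standing in for r['class_ids'] from the model
detected_ids = [1, 3, 2]
labels = [class_names[i] for i in detected_ids]
```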


Let’s load an image and try to see how the model performs. You can use any of your images to test the model.

# Load a random image from the images folder
file_names = next(os.walk(IMAGE_DIR))[2]
image = skimage.io.imread(os.path.join(IMAGE_DIR, random.choice(file_names)))

# original image
plt.figure(figsize=(12,10))
skimage.io.imshow(image)


This is the image we will work with. You can clearly identify that there are a couple of cars (one in the front and one in the back) along with a bicycle.

## 2. Making Predictions

It’s prediction time! We will use the Mask R-CNN model along with the pretrained weights and see how well it segments the objects in the image. We will first take the predictions from the model and then plot the results to visualize them:

# Run detection
results = model.detect([image], verbose=1)

# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'], class_names, r['scores'])


Interesting. The model has done pretty well to segment both the cars as well as the bicycle in the image. We can look at each mask or segmented object separately as well. Let’s see how we can do that.

I will first take all the masks predicted by our model and store them in the mask variable. These masks are in boolean form (True and False), so we need to convert them to numbers (1 and 0). Let’s do that first:

mask = r['masks']
mask = mask.astype(int)
mask.shape

# Output:
# (480, 640, 3)


This gives us an array of 0s and 1s, where 0 means that there is no object at that particular pixel, and 1 means that there is an object at that pixel. Note that the shape of the mask is similar to that of the original image (you can verify that by printing the shape of the original image).

However, the 3 here in the shape of the mask does not represent the channels. Instead, it represents the number of objects segmented by our model. Since the model identified 3 objects in the above sample image, the shape of the mask is (480, 640, 3). Had there been 5 objects, this shape would have been (480, 640, 5).
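
A small NumPy example makes this layout concrete (toy data, not real model output):

```python
import numpy as np

# A 4x4 'image' with 2 detected objects gives a boolean mask
# array of shape (height, width, num_objects) = (4, 4, 2)
masks = np.zeros((4, 4, 2), dtype=bool)
masks[0:2, 0:2, 0] = True   # object 0 occupies the top-left corner
masks[2:4, 2:4, 1] = True   # object 1 occupies the bottom-right corner

num_objects = masks.shape[-1]                   # last axis = object count
pixels_per_object = masks.astype(int).sum(axis=(0, 1))
```

Slicing the last axis, e.g. masks[:, :, 0], gives the full-resolution mask for a single object.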

We now have the original image and the array of masks. To print or get each segment from the image, we will create a for loop and multiply each mask with the original image to get each segment:

for i in range(mask.shape[2]):
    temp = image.copy()               # start from a fresh copy of the image
    for j in range(temp.shape[2]):    # zero out pixels outside object i
        temp[:,:,j] = temp[:,:,j] * mask[:,:,i]
    plt.figure(figsize=(8,8))
    plt.imshow(temp)


This is how we can plot each mask or object from the image. This can have a lot of interesting as well as useful use cases. Getting the segments from the image can reduce the computation cost, as we no longer have to preprocess the entire image but only the segments.
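
A self-contained NumPy sketch of the same masking idea, using a toy image and mask instead of the model output:

```python
import numpy as np

# A white 4x4 RGB 'image' and a 0/1 mask for one object in the centre
image = np.full((4, 4, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=int)
mask[1:3, 1:3] = 1

# Multiplying each colour channel by the mask blacks out everything
# outside the object, exactly like the loop above
segment = image.copy()
for j in range(segment.shape[2]):
    segment[:, :, j] = segment[:, :, j] * mask
```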

## 3. Inferences

Below are a few more results which I got using our Mask R-CNN model:

Looks awesome! You have just built your own image segmentation model using Mask R-CNN – well done.

## 4. End Notes

I love working with this awesome Mask R-CNN framework. Perhaps I will now try to integrate that into a self-driving car system.

Image segmentation has a wide range of applications, ranging from the healthcare industry to the manufacturing industry. I would suggest you try this framework on different images and see how well it performs. Feel free to share your results with the community.
