Preface


一、Topics of Focus

  • Key topics

Retrain mobileNet (transfer learning).

Train your own Object Detector.

This part covers the theory; the next post covers practice.

 

  • Other resources

Convolutional neural networks on the iPhone with VGGNet

How to build a video object recognition system with the TensorFlow API

These look practically useful as references:

http://blog.csdn.net/muwu5635/article/details/75309434

https://github.com/zunzhumu/darknet-mobilenet

Google’s MobileNets on the iPhone

 

 

 

Image retraining


一、Concepts

  • Tensorflow tutorial

Reference: the TensorFlow tutorial, which is worth consulting.

https://www.tensorflow.org/tutorials/image_retraining

 

  • Difference from fine-tuning

Ref: https://stackoverflow.com/questions/45134834/fine-tuning-vs-retraining

[Fine-tuning]

Usually in the ML literature we call fine tuning the process of:

    1. Keep a trained model. Model = feature extractor layers + classification layers
    2. Remove the classification layers
    3. Attach a new classification layer           【replace the last layer】
    4. Retrain the whole model end-to-end. 【train the entire network】

This allows you to start from a good configuration of the feature-extractor weights and thus reach an optimal value in a short time.

You can think of fine-tuning as a way to start a new training run with a very good weight initialization (although you still have to initialize your new classification layers).

[Retraining]

When, instead, we talk about retraining a model, we usually refer to the process of:

    1. Keep a model architecture
    2. Change the last classification layer in order to produce the number of classes you want to classify  【only the last layer is changed】
    3. Train the model end to end.

In this case you don't start from a good starting point as above, but instead you start from a random point in the solution space.

This means that you have to train the model for a longer time because the initial solution is not as good as the initial solution that a pretrained model gives you.
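The two procedures above can be sketched in tf.keras. The tiny randomly initialized model below is a stand-in for a real pretrained network such as `keras.applications.MobileNet(weights="imagenet")` (avoided here only so the example needs no weight download):

```python
# Minimal sketch of "fine-tuning" vs. "retraining" as defined above.
import numpy as np
from tensorflow import keras

# Stand-in "pretrained" model: feature extractor + old classification head.
inp = keras.Input(shape=(32,))
feat = keras.layers.Dense(16, activation="relu", name="features")(inp)
old_head = keras.layers.Dense(1000, activation="softmax", name="old_head")(feat)
pretrained = keras.Model(inp, old_head)

# Fine-tuning: remove the old classification layer, attach a new one,
# then train the whole model end-to-end from the pretrained weights.
new_head = keras.layers.Dense(2, activation="softmax", name="new_head")(
    pretrained.get_layer("features").output)
finetune_model = keras.Model(pretrained.input, new_head)
finetune_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# "Retraining" in the Stack Overflow sense: same architecture and new head,
# but starting from a random point in the solution space.
retrain_model = keras.models.clone_model(finetune_model)  # fresh random weights

x = np.random.rand(8, 32).astype("float32")
y = np.random.randint(0, 2, size=(8,))
finetune_model.fit(x, y, epochs=1, verbose=0)  # converges faster in practice
```

The only structural difference between the two models is where the weights start, which is exactly why fine-tuning converges sooner.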

 
[In practice, agonizing over this distinction isn't worth much.]
 
 
 

二、The Paper

Released April 2017: https://arxiv.org/pdf/1704.04861.pdf

  

 

三、Applications

Not Hotdog: released June 23, 2017; the app is available in the Play Store.

[Tensorflow] Object Detection API - retrain mobileNet

 

From: https://techcrunch.com/2017/06/28/thinking-about-hotdogs/

Fortunately, Google had just published their MobileNets paper, putting forth a novel way to run neural networks on mobile devices. The solution presented by Google offered a middle ground between the bloated Inception and the frail SqueezeNet. And more importantly, it allowed Anglade to easily tune the network to balance accuracy and compute availability.

Anglade used an open source Keras implementation from GitHub as a jumping off point. He then made a number of changes to streamline the model and optimize it for a single specialized use case.

The final model was trained on a dataset of 150,000 images.

A large majority, at a ratio of 49:1:

    • 147,000 images were not hotdogs;
    • 3,000 images were hotdogs.

 

This ratio was intentional to reflect the fact that most objects in the world are not hotdogs.

Anglade was even able to live-inject updates to his neural net after submitting it to the App Store. And while this app was created as a complete joke, he saves time at the end for an insightful discussion about the importance of UX/UI and the biases he had to account for during the training process.

 

 

四、Training

  • Training from scratch is not viable

How HBO’s Silicon Valley built “Not Hotdog” with mobile TensorFlow, Keras & React Native

Using a conventional full-size network directly is not a good fit.

The app itself runs in a React Native shell.

It took slightly longer to use their transfer learning script, which helps you retrain the Inception architecture to deal with a more specific image problem. Inception is the name of a family of neural architectures built by Google to deal with image recognition problems. Inception is available “pre-trained” which means the training phase has been completed and the weights are set. Most often for image recognition networks, they have been trained on ImageNet, a yearly competition to find the best neural architecture at recognizing over 20,000 different types of objects (hotdogs are one of them). However, much like Google Cloud’s Vision API, the competition rewards breadth as much as depth here, and out-of-the-box accuracy on a single one of the 20,000+ categories can be lacking.

 

  • Transfer learning / retraining

As such, retraining (also called “transfer learning”) aims to take a full-trained neural net, and retrain it to perform better on the specific problem you’d like to handle. This usually involves some degree of “forgetting”, either by excising entire layers from the stack, or by slowly erasing the network’s ability to distinguish a type of object (e.g. chairs) in favor of better accuracy at recognizing the one you care about (i.e. hotdogs). 

Just a few thousand hotdog images were enough to get drastically enhanced hotdog recognition.

The big advantage of transfer learning is that you get better results much faster, and with less data, than if you train from scratch. A full training might take months on multiple GPUs and require millions of images, while retraining can conceivably be done in hours on a laptop with a couple thousand images.

 

  • Where the real difficulty lies

One of the biggest challenges we encountered was understanding exactly what should count as a hotdog and what should not.

Defining what a “hotdog” is ends up being surprisingly difficult (do cut up sausages count, and if so, which kinds?) and subject to cultural interpretation.

 

【Can different forms of the same object simply be treated as multiple separate targets to recognize? Does that work?】

Similarly, the “open world” nature of our problem meant we had to deal with an almost infinite number of inputs.

While certain computer-vision problems have relatively limited inputs (say, x-rays of bolts with or without a mechanical defect), we had to prepare the app to be fed selfies, nature shots and any number of foods.

Suffice to say, this approach was promising, and did lead to some improved results, however, it had to be abandoned for a couple of reasons.

 

First, the nature of our problem meant a strong imbalance in training data: there are many more examples of things that are not hotdogs than things that are hotdogs.

In practice this means that if you train your algorithm on 3 hotdog images and 97 non-hotdog images, and it recognizes 0% of the former but 100% of the latter, it will still score 97% accuracy by default! This was not straightforward to solve out of the box using TensorFlow's retrain tool, and basically necessitated setting up a deep learning model from scratch, importing weights, and training in a more controlled manner.
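The 97% figure is just arithmetic; a minimal illustration of the accuracy trap:

```python
def accuracy(preds, labels):
    # fraction of predictions that match the labels
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

labels = [1] * 3 + [0] * 97       # 3 hotdog images, 97 non-hotdog images
never_hotdog = [0] * 100          # degenerate classifier: always "not hotdog"

print(accuracy(never_hotdog, labels))  # 0.97, despite recognizing zero hotdogs
```

This is why plain accuracy is a misleading metric on imbalanced data, and why class weighting (below) matters.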

At this point we decided to bite the bullet and get something started with Keras, a deep learning library that provides nicer, easier-to-use abstractions on top of TensorFlow, including pretty awesome training tools, and a class_weights option which is ideal for dealing with the sort of dataset imbalance we were facing.

 

  • A CNN that fits on a phone is needed

None of these full-size architectures could comfortably fit on an iPhone.

They consumed too much memory, which led to app crashes, and would sometimes take up to 10 seconds to compute, which was not ideal from a UX standpoint. Many things were attempted to mitigate that, but in the end these architectures were just too big to run efficiently on mobile.

 

 

五、A Small Network: the Keras & SqueezeNet Approach

  • Parameters and metrics


SqueezeNet vs. AlexNet, the grand-daddy of computer vision architectures. Source: SqueezeNet paper.
 

The problem directly ahead of us was simple: if Inception and VGG were too big, was there a simpler, pre-trained neural network we could retrain?

At the suggestion of the always excellent Jeremy P. Howard (where has that guy been all our life?), we explored Xception, Enet and SqueezeNet. We quickly settled on SqueezeNet due to its explicit positioning as a solution for embedded deep learning, and the availability of a pre-trained Keras model on GitHub (yay open-source).

VGG16 requires 138 million parameters (essentially the number of numbers necessary to model the neurons and the connections between them).

Inception is already a massive improvement, requiring only 23 million parameters.

SqueezeNet, in comparison, only requires 1.25 million.

 

This has two advantages:

  1. A smaller network is faster to train. There are fewer parameters to map in memory, which means you can parallelize your training a bit more (larger batch size), and the network will converge (i.e., approximate the idealized mathematical function) more quickly.
  2. A smaller network needs less than 10MB of RAM, while something like Inception requires 100MB or more. The delta is huge, and particularly important when running on mobile devices that may have less than 100MB of RAM available to run your app. Smaller networks also compute a result much faster than bigger ones.

There are tradeoffs of course:

  1. A smaller neural architecture has less available “memory”: it will not be as efficient at handling complex cases (such as recognizing 20,000 different objects), or even handling complex subcases (like say, appreciating the difference between a New York-style hotdog and a Chicago-style hotdog)
    As a corollary, smaller networks are usually less accurate overall than big ones. When trying to recognize ImageNet's 20,000 different objects, SqueezeNet will only score around 58%, whereas VGG will be accurate 72% of the time.
  2. Transfer learning on SqueezeNet was always more disappointing than training it from scratch. This could also be caused or worsened by the open-world nature of our problem.
  3. (Reference: the Keras Blog.)
 

Two techniques proved worth noting: Batch Normalization and different activation functions.

  • Batch Normalization helps your network learn faster by “smoothing” the values at various stages in the stack. Exactly why this works is seemingly not well-understood yet, but it has the effect of helping your network converge much faster, meaning it achieves higher accuracy with less training, or higher accuracy after the same amount of training, often dramatically so.
  • Activation functions are the internal mathematical functions determining whether your “neurons” activate or not. Many papers still use ReLU, the Rectified Linear Unit, but we had our best results using ELU instead.
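For concreteness, the two activation functions compared above differ only in how they treat negative inputs; a minimal sketch:

```python
import math

def relu(x):
    # Rectified Linear Unit: hard zero for all negative inputs
    return max(0.0, x)

def elu(x, alpha=1.0):
    # Exponential Linear Unit: smooth negative values saturating at -alpha,
    # which keeps mean activations closer to zero than ReLU's hard cutoff
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

print(relu(-2.0), elu(-2.0))  # 0.0 vs. roughly -0.865
```

Both are identity for positive inputs; the difference is entirely in the negative regime, which is where ELU's gentler behavior can help training.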

These networks could achieve 90%+ accuracy when training from scratch; however, they were relatively brittle, meaning the same network would overfit in some cases, or underfit in others, when confronted with real-life testing. Even adding more examples to the dataset and playing with data augmentation failed to deliver a network that met expectations.

So while this phase was promising, and for the first time gave us a functioning app that could work entirely on an iPhone, in less than a second, we eventually moved to our 4th & final architecture.

【Overfitting and underfitting appear to be the real problem here.】

 

The above is transitional material showing the author's line of thought, setting the stage for what follows.

 

 

六、The DeepDog Architecture

  • Design

Then came the MobileNets paper, promising a new neural architecture with Inception-like accuracy on simple problems like ours, with only 4M or so parameters.

This meant it sat in an interesting sweet spot between a SqueezeNet that had maybe been overly simplistic for our purposes, and the possibly overwrought elephant-trying-to-squeeze-in-a-tutu of using Inception or VGG on Mobile. The paper introduced some capacity to tune the size & complexity of the network specifically to trade memory/CPU consumption against accuracy, which was very much top of mind for us at the time.

A Keras implementation was already offered publicly on GitHub by Refik Can Malli, a student at Istanbul Technical University, whose work we had already benefitted from when we took inspiration from his excellent Keras SqueezeNet implementation. The depth & openness of the deep learning community, and the presence of talented minds like R.C., is what makes deep learning viable for applications today — but they also make working in this field more thrilling than any tech trend we've been involved with.

Our final architecture ended up making significant departures from the MobileNets architecture or from convention, in particular:

      • 【no Batch Normalization & activation between the depthwise and pointwise convolutions】
      • 【ELU proved more effective than ReLU, as on SqueezeNet】
      • 【PELU was not used】
      • 【SELU was not used】
      • 【BN + ELU】
      • 【BN placed before the activation】
      • 【CLR (cyclical learning rates)】
      • 【ρ values (the MobileNets multiplier)】
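The depthwise/pointwise split mentioned in the first bullet is also the core of why MobileNets are small. A back-of-the-envelope parameter count (biases ignored; the 256-channel example is illustrative, not from the paper):

```python
def standard_conv_params(k, c_in, c_out):
    # a k×k convolution mixing every input channel into every output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k×k filter per input channel;
    # pointwise: a 1×1 convolution that mixes channels
    return k * k * c_in + c_in * c_out

# 3×3 convolution, 256 → 256 channels
print(standard_conv_params(3, 256, 256))        # 589824
print(depthwise_separable_params(3, 256, 256))  # 67840, roughly 8.7× fewer
```

This factor-of-roughly-k² saving, applied at every layer, is what lets MobileNets reach Inception-like accuracy with only ~4M parameters.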

So how does this stack work exactly? Deep Learning often gets a bad rap for being a “black box”, and while it’s true many components of it can be mysterious, the networks we use often leak information about how some of their magic works. We can look at the layers of this stack and how they activate on specific input images, giving us a sense of each layer’s ability to recognize sausage, buns, or other particularly salient hotdog features.

 

This appears to be a convolution visualization tool: https://github.com/keplr-io/quiver 【covered in the topic on convolution visualization】


 

 

  • Training

Training data quality was of the utmost importance. A neural network can only be as good as the data that trained it, and improving training-set quality was probably one of the top 3 things the authors spent time on during this project. The key things done to improve it were:

    • 【gather more images】
    • 【find images as close as possible to what users will actually shoot】
    • 【also gather as many similar-looking non-hotdog images as possible】
    • 【phone photos are low quality, so augment/distort accordingly】
    • Keras’ channel shift feature resolved most of these issues.
    • It could also be damaging to spend too much of the network’s capacity training for soft focus, when realistically most images taken with a mobile phone will not have that feature. This was left largely unaddressed as a result.

Of the final 150k training images, only 3k were hotdogs:

There are only so many hotdogs you can look at, but there are many more not-hotdogs to look at. The 49:1 imbalance was dealt with by setting a Keras class weight of 49:1 in favor of hotdogs. Of the remaining 147k images, most were of food, with just 3k photos of non-food items, to help the network generalize a bit more and not get tricked into seeing a hotdog if presented with an image of a human in a red outfit.
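Assuming the usual Keras convention of integer class labels (0 = not hotdog, 1 = hotdog — the label indices are an assumption), the 49:1 weighting reduces to a small dict passed to `model.fit`:

```python
n_not_hotdog, n_hotdog = 147_000, 3_000
ratio = n_not_hotdog // n_hotdog   # 49

# Weight each hotdog example 49× in the loss, so the rare class
# contributes as much total gradient as the common one. Passed as:
#   model.fit(x, y, class_weight=class_weight, ...)
class_weight = {0: 1.0, 1: float(ratio)}
print(class_weight)
```

This directly counteracts the accuracy trap described earlier: a classifier that ignores hotdogs no longer gets a cheap 97% score on the weighted loss.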

  

  • Data augmentation

Our data augmentation rules were as follows:

      • We applied rotations within ±135 degrees — significantly more than average, because we coded the application to disregard phone orientation.
      • Height and width shifts of 20%
      • Shear range of 30%
      • Zoom range of 10%
      • Channel shifts of 20%
      • Random horizontal flips to help the network generalize

These numbers were derived intuitively, based on experiments and our understanding of the real-life usage of our app, as opposed to careful experimentation.
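Assuming Keras' ImageDataGenerator was the augmentation mechanism (the project used Keras, though the exact API call is an assumption), the rules above map roughly onto its parameters. Note that Keras expresses shear in degrees and channel shift in pixel-value units, so the 30% and 20% figures are interpreted here for images scaled to [0, 1]:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=135,        # rotations within ±135 degrees
    width_shift_range=0.20,    # width shifts of 20%
    height_shift_range=0.20,   # height shifts of 20%
    shear_range=30,            # shear, interpreted as 30 degrees
    zoom_range=0.10,           # zoom range of 10%
    channel_shift_range=0.20,  # channel shifts of 20% (images scaled to [0, 1])
    horizontal_flip=True,      # random horizontal flips
)
```

In training, `datagen.flow(x, y, batch_size=128)` would then yield randomly transformed batches instead of the raw images.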

【image augmentation】

The network was trained using a 2015 MacBook Pro and attached external GPU (eGPU), specifically an Nvidia GTX 980 Ti (we’d probably buy a 1080 Ti if we were starting today).

We were able to train the network on batches of 128 images at a time.

The network was trained for a total of 240 epochs, meaning we ran all 150k images through the network 240 times. This took about 80 hours.

We trained the network in 3 phases:

      • Phase 1 ran for 112 epochs (7 full CLR cycles with a step size of 8 epochs), with a learning rate between 0.005 and 0.03, on a triangular 2 policy (meaning the max learning rate was halved every 16 epochs).
      • Phase 2 ran for 64 more epochs (4 CLR cycles with a step size of 8 epochs), with a learning rate between 0.0004 and 0.0045, on a triangular 2 policy.
      • Phase 3 ran for 64 more epochs (4 CLR cycles with a step size of 8 epochs), with a learning rate between 0.000015 and 0.0002, on a triangular 2 policy.

UPDATED: a previous version of this chart contained inaccurate learning rates.
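A minimal sketch of the triangular2 policy described above (epoch-indexed, following the standard CLR formulation; the function name is ours):

```python
import math

def triangular2_lr(epoch, step_size, base_lr, max_lr):
    """Cyclical learning rate, 'triangular2' policy: the LR bounces between
    base_lr and max_lr over 2*step_size epochs, with the peak amplitude
    halved every cycle (so with step_size=8 the max halves every 16 epochs)."""
    cycle = math.floor(1 + epoch / (2 * step_size))
    x = abs(epoch / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))  # halve the triangle's height each cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

# Phase 1: learning rate between 0.005 and 0.03, step size of 8 epochs
print(triangular2_lr(8, 8, 0.005, 0.03))   # peak of cycle 1 (≈0.03)
print(triangular2_lr(24, 8, 0.005, 0.03))  # peak of cycle 2 (≈0.0175)
```

Plugging in the other phases' base/max values reproduces the schedule of the remaining 8 CLR cycles.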

 

While learning rates were identified by running the linear experiment recommended by the CLR paper, they seem to intuitively make sense, in that the max for each phase is within a factor of 2 of the previous minimum, which is aligned with the industry standard recommendation of halving your learning rate if your accuracy plateaus during training.

The network was also trained on a Paperspace P5000 instance running Ubuntu. In those cases, we were able to double the batch size, and found that optimal learning rates for each phase were roughly double as well.

 

  • Further reading

Running Neural Networks on Mobile Phones

Even having designed a relatively compact neural architecture, and having trained it to handle situations it may find in a mobile context, we had a lot of work left to make it run properly. Trying to run a top-of-the-line neural net architecture out of the box can quickly burn hundreds of megabytes of RAM, which few mobile devices can spare today. Beyond network optimizations, it turns out the way you handle images, or even load TensorFlow itself, can have a huge impact on how quickly your network runs, how little RAM it uses, and how crash-free the experience will be for your users.

This was maybe the most mysterious part of this project. Relatively little information can be found about it, possibly due to the dearth of production deep learning applications running on mobile devices as of today. However, we must commend the Tensorflow team, and particularly Pete Warden, Andrew Harp and Chad Whipkey for the existing documentation and their kindness in answering our inquiries.

    • 【the network is already small, so further shrinking isn't really necessary】
    • 【automatic optimization at compile time】
    • 【trimming down TensorFlow itself】
    • This was a relatively simple trick, so there may be more areas of TensorFlow’s iOS code that can be optimized for your purposes.

The network was also rebuilt with Apple’s new libraries (imported via CoreML), with the parameters loaded into that new implementation. However, the biggest obstacle was that these new Apple libraries are only available on iOS 10+, and we wanted to support older versions of iOS. As iOS 10+ adoption and these frameworks continue to improve, there may not be a case for using TensorFlow on device in the near future.

 

Changing App Behavior by Injecting Neural Networks on The fly

(...)

 

What We Would Do Differently

There are a lot of things that didn’t work or we didn’t have time to do, and these are the ideas we’d investigate in the future:

    • More carefully tune our data-augmentation parameters.
    • Measure accuracy end-to-end, i.e. the final determination made by the app, abstracting over things like whether the app has 2 or many more categories, what the final threshold for hotdog recognition is (we ended up having the app say “hotdog” if recognition is above 0.90, as opposed to the default of 0.5), what happens after weights are rounded, etc.
    • Building a feedback mechanism into the app — to let users vent frustration if results are erroneous, or actively improve the neural network.
    •  > 1.0

 
