
Research on Object Detection Technology Based on Deep Learning

Object Detection Based on Deep Learning

【Author】 刘博 (Liu Bo)

【Advisor】 董远 (Dong Yuan)

【Author Information】 Beijing University of Posts and Telecommunications, Electronics and Communication Engineering, 2016, Master's thesis

【Abstract (Chinese)】 Today, as image information expands rapidly, how to quickly and effectively annotate static images and detect and localize objects of target classes within them is one of the most fundamental challenges in machine learning and computer vision. Object detection refers to detecting and localizing general object classes in a static image. The problem is technically hard to solve effectively for several reasons; one is that many object classes can vary greatly in appearance, and these variations arise not only from differences in illumination and viewpoint but also from non-rigid deformation: cars, for example, can take on many different shapes. In recent years, progress in object detection performance has slowed and largely stagnated. The best-performing detection systems are complex ensembles that combine multiple low-level image features extracted by object detectors with high-level context obtained from scene classifiers. Because these systems are complex and rely only on hand-crafted low-level image features such as SIFT and HOG, they cannot detect and localize object classes accurately and quickly. In this thesis we analyze in depth the application of deep convolutional neural networks to object detection in static images. Combining the ideas of region proposal extraction, model fine-tuning, and feature extraction, we address the problem of training and optimizing deep convolutional neural network models on datasets that differ from the classification task: we propose a fine-tuning method, design three convolutional neural networks of different depths and sizes, first train pre-trained models, then fine-tune them, and finally use the fine-tuned deep models for object detection. The detection algorithm in this thesis can accurately detect and localize general object classes in images, which also indirectly demonstrates the strong generalization ability of deep models. In the detection pipeline, image segmentation algorithms such as selective search are introduced at an early stage to cut the image into many sub-regions, referred to in this thesis as region proposals, which may contain the object classes to be detected; the recognized proposals are then passed through a trained region regressor to obtain regions closer to the true object locations. To address the problem that the internal features of deep models are opaque and the networks are too abstract, which makes them hard for researchers to train and optimize, this thesis designs a deconvolution-like network that reconstructs high-level features back into the RGB color space, providing a visualization technique for deep convolutional neural networks. From this we learn that different layers learn different kinds of features, so the key to improving deep models is to effectively analyze and exploit the features extracted by the deep convolutional neural network, identify what needs to be optimized, and then optimize the network accordingly. Based on this research on deep-learning-based object detection, we entered the 2015 ImageNet Large Scale Visual Recognition Challenge; on the ILSVRC2014 object detection dataset, a single model achieved a mean AP of 42.3% on the detection test set. The results are still being submitted.
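As a concrete illustration of the fine-tuning step described above (replacing the classifier of a pre-trained network with a new one while reusing the stacked convolutional and pooling layers as the feature extractor), the following is a minimal sketch. It assumes PyTorch/torchvision and an AlexNet-style model, neither of which is specified in the thesis; the class count and the layer-freezing policy are illustrative choices only.

```python
# Hedged sketch: fine-tune an ImageNet-pretrained CNN for a new label set by
# swapping the classifier head and keeping the convolutional feature extractor.
# PyTorch is used purely for illustration; the thesis does not name a framework.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 200 + 1  # hypothetical: 200 detection classes plus a background class

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Replace only the final fully connected layer so the classifier matches the new task;
# model.features (conv + pooling stack) is kept as the pre-trained feature extractor.
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, NUM_CLASSES)

# One common policy: freeze the feature extractor and train only the new classifier
# (alternatively, fine-tune everything with a smaller learning rate for pretrained layers).
for p in model.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9,
)
criterion = nn.CrossEntropyLoss()
```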

【Abstract (English)】 Along with the rapid expansion of image and video information, how to quickly and effectively analyze and annotate static images, and detect and localize object classes within them, is one of the most basic challenges in machine learning and computer vision. Object detection is the task of detecting and localizing objects of given classes. It is difficult to implement effectively for two main reasons: first, many objects can look very different in appearance; second, these differences arise not only from changes in illumination but also from non-rigid deformations. Object detection performance has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. These systems are very complex and rely heavily on hand-crafted low-level image features such as SIFT and HOG, so they cannot detect and localize objects efficiently and quickly. Deep learning has achieved great performance in computer vision, and deep models have been applied to many tasks such as handwritten character recognition, image classification, and object detection. This thesis analyzes the application of deep convolutional neural networks to object detection in static images. We aim to optimize a large convolutional neural network model on a small-scale dataset with the help of a large auxiliary dataset. Deep architectures perform well on large datasets but may overfit on small ones, so transferring a large deep model trained on a large-scale dataset to a new small-scale dataset is meaningful. We propose an algorithm, called fine-tuning, to implement this transfer: a convolutional neural network model pre-trained on a large-scale dataset is adapted by replacing its classifier with a new one while keeping its feature extractor. The stacked convolutional and pooling layers are treated as the feature extractor, and the fully connected and softmax layers are treated as the classifier. In addition, we compare the performance of features extracted by the convolutional neural network with hand-crafted features such as HOG and SIFT, and we compare the performance of our deep model with an SVM classifier. In the detection pipeline, we introduce image segmentation algorithms such as selective search and edge boxes to generate category-independent region proposals; these proposals define the set of candidate detections available to our detector. Large convolutional network models have recently demonstrated impressive classification performance on the ImageNet benchmark, yet there is no clear understanding of why they perform so well or how they might be improved. In this thesis we introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier, and we perform an ablation study to discover the performance contribution of different model layers. Based on the algorithms above, we participated in the 2015 ImageNet Large Scale Visual Recognition Challenge. On the detection validation set, we achieved a mean average precision (mAP) of 42.3% with a single model, compared with other methods that fuse multiple models.
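To make the region-proposal stage more concrete, here is a small sketch of generating category-independent proposals with selective search and pruning scored boxes with greedy non-maximum suppression. It uses OpenCV's contrib implementation of selective search (opencv-contrib-python) purely as an illustration; the thesis does not name a specific implementation, and the learned region regressor and CNN scoring step are omitted for brevity.

```python
# Hedged sketch of an R-CNN-style proposal pipeline: category-independent
# candidate boxes from selective search, later reduced by non-maximum suppression
# after a detector has assigned per-class scores.
import cv2
import numpy as np

def propose_regions(image_bgr, max_proposals=2000):
    """Generate candidate regions as (x, y, w, h) rectangles via selective search."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()
    return ss.process()[:max_proposals]

def to_corners(rects):
    """Convert (x, y, w, h) proposals to (x1, y1, x2, y2) boxes for NMS."""
    rects = np.asarray(rects, dtype=np.float32)
    return np.stack([rects[:, 0], rects[:, 1],
                     rects[:, 0] + rects[:, 2],
                     rects[:, 1] + rects[:, 3]], axis=1)

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression; returns indices of boxes to keep."""
    boxes = np.asarray(boxes, dtype=np.float32)
    order = np.argsort(np.asarray(scores))[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

In a full pipeline each proposal would be warped to the network's input size, scored by the fine-tuned CNN, refined by the region regressor, and only then passed through per-class NMS; the functions above show only the proposal and suppression pieces.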

  • 【CLC Number】 TP391.41; TP18
  • 【Cited By】 4
  • 【Downloads】 383