AbstractsComputer Science

Monocular depth estimation in images and sequences using occlusion cues

by Guillem Palou Visa




Institution: Universitat Politècnica de Catalunya
Department:
Year: 2014
Record ID: 1124618
Full text PDF: http://hdl.handle.net/10803/144653


Abstract

When humans observe a scene, they are able to perfectly distinguish the different parts composing it. Moreover, humans can easily reconstruct the spatial position of these parts and conceive a consistent structure. The mechanisms involving visual perception have been studied since the beginning of neuroscience but, still today, not all the processes composing it are known. In usual situations, humans can make use of three different methods to estimate the scene structure. The first one is the so called divergence and it makes use of both eyes. When objects lie in front of the observed at a distance up to hundred meters, subtle differences in the image formation in each eye can be used to determine depth. When objects are not in the field of view of both eyes, other mechanisms should be used. In these cases, both visual cues and prior learned information can be used to determine depth. Even if these mechanisms are less accurate than divergence, humans can almost always infer the correct depth structure when using them. As an example of visual cues, occlusion, perspective or object size provide a lot of information about the structure of the scene. A priori information depends on each observer, but it is normally used subconsciously by humans to detect commonly known regions such as the sky, the ground or different types of objects. In the last years, since technology has been able to handle the processing burden of vision systems, there has been lots of efforts devoted to design automated scene interpreting systems. In this thesis we address the problem of depth estimation using only one point of view and using only occlusion depth cues. The thesis objective is to detect occlusions present in the scene and combine them with a segmentation system so as to generate a relative depth order depth map for a scene. We explore both static and dynamic situations such as single images, frame inside sequences or full video sequences. In the case where a full image sequence is available, a system exploiting motion information to recover depth structure is also designed. Results are promising and competitive with respect to the state of the art literature, but there is still much room for improvement when compared to human depth perception performance.; Quan els humans observen una escena, son capaços de distingir perfectament les parts que la composen i organitzar-les espacialment per tal de poder-se orientar. Els mecanismes que governen la percepció visual han estat estudiats des dels principis de la neurociència, però encara no es coneixen tots els processos biològic que hi prenen part. En situacions normals, els humans poden fer servir tres eines per estimar l’estructura de l’escena. La primera és l’anomenada divergència. Aprofita l’ús de dos punts de vista (els dos ulls) i és capaç¸ de determinar molt acuradament la posició dels objectes ,que a una distància de fins a cent metres, romanen enfront de l’observador. A mesura que augmenta la distància o els objectes no es troben en el camp de visió dels dos ulls, altres…