Visualization of Oral papers presented at CVPR 2014
(PDF links coming soon)
Hover over a node to see the paper title. Click on a color to only show papers connected to that cluster. Zoom and move around with normal map controls.
Papers are linked together based on TF-IDF similarity and are colored using their predicted topic index.
Toggle the topics below to sort by category. The top 10 words from each cluster are shown.
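The page does not say how the TF-IDF similarity between papers is computed. As a minimal sketch, assuming whitespace tokenization of the abstracts, a smoothed IDF, and cosine similarity between the resulting vectors, the linking step might look like the following. The function names (tfidf_vectors, cosine, rank_by_similarity) are hypothetical and not taken from the site.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): terms found in every document
        # still receive a small positive weight.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(docs, query_index):
    """Indices of the other documents, most similar to the query first."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [
        (cosine(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    # Sort by descending score; ties keep their original order.
    return [i for _, i in sorted(scored, key=lambda pair: -pair[0])]


abstracts = [
    "image stitching with projective warps",
    "parallax handling for image stitching",
    "deep neural networks for pose estimation",
]
ranking = rank_by_similarity(abstracts, 0)
```

Here the second abstract shares the terms "image" and "stitching" with the first, so it ranks ahead of the third, which shares none. A real pipeline would likely also remove stop words and threshold the similarity before drawing a link between two papers.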
#61 - Reconstructing Storyline Graphs for Image Recommendation from Web Community Photos [pdf]
Gunhee Kim, Eric Xing |
Abstract: In this paper, we investigate an approach for reconstructing storyline graphs from large-scale Internet images, and optionally other side information such as friendship graphs. The storyline graphs can be an effective structural summary that visualizes various events or activities recurring across the input photo sets, along with various branching narrative structures associated with the topic. In order to explore the usefulness of the storyline graphs further, we leverage them to perform image sequential prediction tasks, from which photo recommendation applications can benefit. We formulate the storyline reconstruction problem as an inference of sparse time-varying directed graphs, and develop an optimization algorithm that addresses a number of key challenges of Web-scale storyline reconstruction, including global optimality, linear complexity, and easy parallelization. With experiments on more than 3.3 million images of 24 classes and user studies via Amazon Mechanical Turk, we demonstrate that the proposed algorithm is more successful than other candidate methods for both the storyline reconstruction and the image prediction tasks.
#120 - Unsupervised One-Class Learning for Automatic Outlier Removal [pdf]
Wei Liu, Gang Hua, John Smith |
Abstract: Outliers are pervasive in many computer vision and pattern recognition problems. Automatically eliminating outliers scattered among practical data collections is becoming increasingly important, especially for Internet-inspired vision applications. In this paper, we propose a novel one-class learning approach which is robust to contamination of input training data and able to discover the outliers that corrupt one class of data source. Our approach works in a fully unsupervised manner, differing from traditional one-class learning supervised by known positive labels. By design, our approach optimizes a kernel-based max-margin objective which jointly learns a large margin one-class classifier and a soft label assignment for inliers and outliers. An alternating optimization algorithm is then designed to iteratively refine the classifier and the labeling, achieving a provably convergent solution in only a few iterations. Extensive experiments conducted on four image datasets in the presence of artificial and real-world outliers demonstrate that the proposed approach is considerably superior to the state of the art in removing outliers from a contaminated single class of images, exhibiting strong robustness at high outlier proportions of up to 60%.
Abstract: In this paper, we study the configurations of motion and structure that lead to inherent ambiguities in radial distortion estimation (or 3D reconstruction with unknown radial distortions). By analyzing the motion field of radially distorted images using a very general radial distortion model, we solve for critical surface pairs that can lead to the same motion field under different radial distortions. We study the properties of the discovered critical configurations and discuss practically important configurations that can occur in real problems. We demonstrate the impact of the radial distortion ambiguity on multi-view reconstruction with synthetic and real experiments.
#150 - Active Flattening of Curved Document Images via Two Structured Beams [pdf]
Gaofeng Meng, Ying Wang, Shenquan Qu, Shiming Xiang, Chunhong Pan |
Abstract: Document images captured by a digital camera often suffer from serious geometric distortions. In this paper, we propose an active method to correct geometric distortions in a camera-captured document image. Unlike many passive rectification methods that rely on text-lines or features extracted from images, our method uses two structured beams illuminating upon the document page to recover two spatial curves. A developable surface is then interpolated to the curves by finding the correspondence between them. The developable surface is finally flattened onto a plane by solving a system of ordinary differential equations. Our method is a content-free approach and can restore a corrected document image of high accuracy with undistorted contents. Experimental results on a variety of real-captured document images demonstrate the effectiveness and efficiency of the proposed method.
Abstract: State-of-the-art dynamic scene deblurring methods based on accurate motion segmentation assume that motion blur is small or that the specific type of motion causing the blur is known. In this paper, we study a motion segmentation-free dynamic scene deblurring method, which is unlike other conventional methods. When the motion can be approximated to linear motion that is locally (pixel-wise) varying, we can handle various types of blur caused by camera shake, including out-of-plane motion, depth variation, radial distortion, and so on. Thus, we propose a new energy model simultaneously estimating motion flow and the latent image based on robust total variation (TV)-L1 model. This approach is necessary to handle abrupt changes in motion without segmentation. Furthermore, we address the problem of the traditional coarse-to-fine deblurring framework, which gives rise to artifacts when restoring small structures with distinct motion. We thus propose a novel re-initialization method which reduces the reliance of motion flow propagated from a coarser level. Moreover, a highly effective convex optimization-based solution mitigating the computational difficulties of the TV-L1 model is established. Comparative experimental results on challenging real blurry images demonstrate the efficiency of the proposed method.
#241 - Diffuse Mirrors: 3D Reconstruction from Diffuse Indirect Illumination Using Inexpensive Time-of-Flight Sensors [pdf]
Felix Heide, Lei Xiao, Wolfgang Heidrich, Matthias B. Hullin |
Abstract: The functional difference between a diffuse wall and a mirror is well understood: one scatters back into all directions, and the other one preserves the directionality of reflected light. The temporal structure of the light, however, is left intact by both: assuming simple surface reflection, photons that arrive first are reflected first. In this paper, we exploit this insight to recover objects outside the line of sight from second-order diffuse reflections, effectively turning walls into mirrors. We formulate the reconstruction task as a linear inverse problem on the transient response of a scene, which we acquire using an affordable setup consisting of a modulated light source and a time-of-flight image sensor. By exploiting sparsity in the reconstruction domain, we achieve resolutions in the order of a few centimeters for object shape (depth and laterally) and albedo. Our method is robust to ambient light and works for large room-sized scenes. It is drastically faster and less expensive than previous approaches using femtosecond lasers and streak cameras, and does not require any moving parts.
#242 - A Primal-Dual Method for Higher-Order Multilabel Markov Random Fields [pdf]
Alexander Fix, Chen Wang, Ramin Zabih |
Abstract: Graph cut methods such as alpha-expansion [4] and fusion moves [21] have been successful at solving many optimization problems in computer vision. Higher-order Markov Random Fields (MRFs), which are important for numerous applications, have proven to be very difficult to optimize, especially multilabel MRFs (i.e., more than 2 labels). In this paper we propose a new primal-dual energy minimization method for arbitrary higher-order multilabel MRFs. Primal-dual methods provide guaranteed approximation bounds, and can exploit information in the dual variables to improve their efficiency. Our algorithm generalizes the PD3 [18] technique for first-order MRFs, and relies on a variant of max-flow that can exactly optimize certain higher-order binary MRFs [14]. We provide approximation bounds similar to PD3 [18], and the method is fast in practice. It can optimize non-submodular MRFs, and additionally can incorporate problem-specific knowledge in the form of fusion proposals. We compare experimentally against one of the few approaches that can efficiently handle these difficult energy functions [6,10]. For higher-order denoising and stereo MRFs, our method produces lower energy while running significantly faster.
Abstract: This paper addresses extracting two layers from an image where one layer is smoother than the other. This problem arises most notably in intrinsic image decomposition and reflection interference removal. Layer decomposition from a single image is inherently ill-posed and solutions require additional constraints to be enforced. We introduce a novel strategy that regularizes the gradients of two layers such that one has a long tail distribution and the other a short tail distribution. While imposing the long tail distribution is a common practice, our introduction of the short tail distribution on the second layer is unique. We formulate our problem in a probabilistic framework and describe an optimization scheme to solve this regularization with only a few iterations. We apply our approach to the intrinsic image and reflection removal problems and demonstrate high quality layer separation on par with other techniques while being significantly faster than prevailing methods.
#261 - L0 Regularized Stationary Time Estimation for Crowd Group Analysis [pdf]
Shuai Yi, Xiaogang Wang, Cewu Lu, Jiaya Jia |
Abstract: In this paper, we introduce the research topic of stationary crowd analysis, which is as important as modeling mobile groups in crowd scenes and has many important applications in crowd surveillance. Our key contribution is a robust algorithm for estimating how long a foreground pixel has been stationary. This is much more challenging than background subtraction. A failure at a single frame due to local movement of objects, lighting variation, or occlusion leads to a large error in the estimated stationary time. To achieve robust and accurate estimation, sparsity constraints along spatial and temporal dimensions are jointly added by second-order gradients to shape a 3D stationary time map, and it is formulated as an $L_0$ optimization problem. Optimization is jointly done on a batch of frames. Besides background subtraction, it distinguishes different foreground objects which are close or overlapped in the spatio-temporal space by using a locally shared foreground codebook. As exemplar applications, the proposed technologies are used to detect four types of stationary group activities and analyze crowd scene structures. We provide the first public benchmark dataset for stationary time estimation and stationary group analysis.
#262 - Shape-Preserving Half-Projective Warps for Image Stitching [pdf]
Che-Han Chang, Yoichi Sato, Yung-Yu Chuang |
Abstract: This paper proposes a novel parametric warp which is a spatial combination of a projective transformation and a similarity transformation. Given the projective transformation relating two input images, based on an analysis of projective transformations, our method smoothly extrapolates the projective transformation of the overlapping regions into the non-overlapping regions, and the warp gradually changes from projective to similarity across the image. The proposed warp has the strengths of both projective and similarity warps. It provides good alignment accuracy like projective warps while preserving the perspective of individual images like similarity warps. It can also be combined with more advanced local-warp-based alignment methods such as the as-projective-as-possible warp for better alignment accuracy. With the proposed warp, the field of view can be extended by stitching images with less projective distortion (stretched shapes and enlarged sizes).
Abstract: Real-world videos of human activities exhibit temporal structure at various scales; long videos are typically composed out of multiple action instances, where each instance is itself composed of sub-actions with variable duration and orderings. Temporal grammars can presumably model such hierarchical structure, but are computationally difficult to apply for long video streams. We describe simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state-machine. This makes parsing linear time, constant storage, and naturally online. We train grammar parameters using a latent structural SVM, where latent subactions are learned automatically. We illustrate the effectiveness of our approach over common baselines on a new 1-million frame dataset of continuous YouTube videos.
#312 - Multiscale Centerline Detection by Learning a Scale-Space Distance Transform [pdf]
Amos Sironi, Vincent Lepetit, Pascal Fua |
Abstract: We propose a robust and accurate method to extract the centerlines and scale of tubular structures in 2D images and 3D volumes. Existing techniques rely either on filters designed to respond to ideal cylindrical structures, which lose accuracy when the linear structures become very irregular, or on classification, which is inaccurate because locations on centerlines and locations immediately next to them are extremely difficult to distinguish. We solve this problem by reformulating centerline detection as a regression problem. We first train regressors to return the distances to the closest centerline in scale-space, and we apply them to the input images or volumes. The centerlines and the corresponding scale then correspond to the regressors' local maxima, which can be easily identified. We show that our method outperforms state-of-the-art techniques for various 2D and 3D datasets.
#320 - Context Driven Scene Parsing with Attention to Rare Classes [pdf]
Jimei Yang, Brian Price, Scott Cohen, Ming-Hsuan Yang |
Abstract: This paper presents a scalable scene parsing algorithm based on image retrieval and superpixel matching. We focus on rare object classes, which play an important role in achieving richer semantic understanding of visual scenes, compared to common background classes. We make two novel contributions: rare class expansion and spatial context description. Considering the long-tailed nature of the label distribution, we build a superpixel dictionary by mining exemplars of each class, which provides better regularization to superpixel classification. We construct both global and local semantic context descriptors based on classification likelihood maps, together with appearance descriptors for image retrieval and superpixel matching. Results on the SIFTflow and LMSun datasets show the superior performance of our algorithm, especially on the rare classes, without sacrificing overall labeling accuracy.
#337 - Multi-Object Tracking via Constrained Sequential Labeling [pdf]
Sheng Chen, Alan Fern, Sinisa Todorovic |
Abstract: This paper presents a new approach to tracking people in crowded scenes, where people are subject to long-term (partial) occlusions and may assume varying postures and articulations. In such videos, detection-based trackers give poor performance since detecting people occurrences is not reliable, and common assumptions about locally smooth trajectories do not hold. Rather, we use temporal mid-level features (e.g., supervoxels or dense point trajectories) as a more coherent spatiotemporal basis for handling occlusion and pose variations. Thus, we formulate tracking as labeling mid-level features by object identifiers, and specify a new approach, called constrained sequential labeling (CSL), for performing this labeling. CSL uses a cost function to sequentially assign labels while respecting the implications of hard constraints computed via constraint propagation. A key feature of this approach is that it allows for the use of flexible cost functions and constraints that capture complex dependencies that cannot be represented in standard network-flow formulations. To exploit this flexibility we describe how to learn constraints and give a provably correct learning algorithm for cost functions that achieves finite-time convergence at a rate that improves with the strength of the constraints. Our experimental results indicate that CSL outperforms the state of the art on challenging real-world videos of volleyball, basketball, and pedestrians walking.
#363 - Unsupervised Spectral Dual Assignment Clustering of Human Actions in Context [pdf]
Simon Jones, Ling Shao |
Abstract: A recent trend of research has shown how contextual information related to an action, such as a scene or object, can enhance the accuracy of human action recognition systems. However, using context to improve unsupervised human action clustering has never been considered before, and cannot be achieved using existing clustering methods. To solve this problem we introduce a novel, general purpose algorithm, Dual Assignment k-Means (DAKM), which is uniquely capable of performing two co-occurring clustering tasks simultaneously, while exploiting the correlation information to enhance both clusterings. Furthermore, we describe a spectral extension of DAKM (SDAKM) for better performance on realistic data. Extensive experiments on synthetic data and on three realistic human action datasets with scene context show that DAKM/SDAKM can significantly outperform the state-of-the-art clustering methods by taking into account the contextual relationship between actions and scenes.
#366 - Bayesian View Synthesis and Image-Based Rendering Principles [pdf]
Sergi Pujades, Bastian Goldluecke, Frederic Devernay |
Abstract: In this paper, we address the problem of synthesizing novel views from a set of input images. State of the art methods, such as the Unstructured Lumigraph, have been using heuristics to combine information from the original views, often using an explicit or implicit approximation of the scene geometry. While the proposed heuristics have been largely explored and proven to work effectively, a Bayesian formulation was recently introduced, which formalizes some of the previously proposed heuristics, pointing out which physical phenomena could lie behind each. However, some important heuristics were still not taken into account and lack proper formalization. We contribute a new physics-based generative model and the corresponding Maximum a Posteriori estimate, providing the desired unification between heuristics-based methods and a Bayesian formulation. The key point is to systematically consider the error induced by the uncertainty in the geometric proxy. We provide an extensive discussion, analyzing how the obtained equations explain the heuristics developed in previous methods. Our theoretical contribution is supported by numerical results obtained on publicly available datasets, and the source code for the proposed method is also available. Furthermore, we show that our novel Bayesian model significantly improves the quality of novel views, in particular if the scene geometry estimate is inaccurate.
#373 - BING: Binarized Normed Gradients for Objectness Estimation at 300fps [pdf]
Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, Philip Torr |
Abstract: Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with well-defined closed boundaries share surprisingly strong similarity in the space of gradient magnitudes when their corresponding image windows are resized to a small fixed size. Based on this observation and for computational reasons, we propose to resize an image window to 8 × 8 and use the gradient magnitudes as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g., ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300fps on a single laptop CPU) generates a small set of category-independent, high quality object windows, yielding a 96.2% object detection rate (DR) with 1,000 proposals. By increasing the number of proposals and color spaces for computing BING features, our performance can be further improved to 99.5% DR.
#384 - Multivariate General Linear Models (MGLM) on Riemannian Manifolds with Applications to Statistical Analysis of Diffusion Weighted Images [pdf]
Hyunwoo Kim, Nagesh Adluru, Maxwell Collins, Moo Chung, Barbara Bendlin, Sterling Johnson, Richard Davidson, Vikas Singh |
Abstract: Linear regression is a parametric model which is ubiquitous in scientific analysis. The classical setup where the observations and responses, i.e., $(x_i, y_i)$ pairs, are Euclidean is well studied. The setting where $y_i$ is manifold valued is a topic of much interest, motivated by applications in shape analysis, topic modeling, and medical imaging. Recent work gives strategies for max-margin classifiers, principal components analysis, and dictionary learning on certain types of manifolds. For parametric regression specifically, results within the last year provide mechanisms to regress one real-valued parameter, $x_i \in R$, against a manifold-valued variable, $y_i \in M$. We seek to substantially extend the operating range of such methods by deriving schemes for multivariate multiple linear regression --- a manifold-valued dependent variable against multiple independent variables, i.e., $f: R^n \to M$. Our variational algorithm efficiently solves for multiple geodesic bases on the manifold concurrently via gradient updates. This allows us to answer questions such as: what is the relationship of the measurement at voxel $y$ to disease when conditioned on age and gender? We show applications to statistical analysis of diffusion weighted images, which give rise to regression tasks on the manifold $GL(n)/O(n)$ for diffusion tensor images (DTI) and the Hilbert unit sphere for orientation distribution functions (ODF) from high angular resolution acquisition. The companion open-s
#386 - Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation [pdf]
Fabio Galasso, Margret Keuper, Thomas Brox, Bernt Schiele |
Abstract: Computational and memory costs restrict spectral techniques to rather small graphs, which is a serious limitation especially in video segmentation. In this paper, we propose the use of a reduced graph based on superpixels. In contrast to previous work, the reduced graph is reweighted such that the resulting segmentation is equivalent, under certain assumptions, to that of the full graph. We consider equivalence in terms of the normalized cut and of its spectral clustering relaxation. The proposed method reduces runtime and memory consumption and yields on par results in image and video segmentation. Further, it enables an efficient data representation and update for a new streaming video segmentation approach that also achieves state-of-the-art performance.
#396 - Learning Scalable Discriminative Attributes with Sample Relatedness [pdf]
Jiashi Feng, Stefanie Jegelka, Trevor Darrell, Huan Xu, Shuicheng Yan |
Abstract: Attributes are widely used as mid-level descriptors of object properties in object recognition and retrieval. Mostly, such attributes are manually pre-defined based on domain knowledge, and their number is fixed. However, pre-defined attributes may fail to adapt to the properties of the data at hand, may not necessarily be discriminative, and/or may not generalize well. In this work, we propose an attribute learning framework that flexibly adapts to the complexity of the given data set and reliably discovers the inherent discriminative attributes in the data. In addition, we use the sample relatedness information to improve the generalization of the attribute representation. We demonstrate that our framework is applicable to both object recognition and complex image retrieval tasks even with few training examples. Moreover, the learned attributes also help classify novel object categories. Experimental results on the Animals with Attributes, ILSVRC2010 and PASCAL VOC2007 datasets indicate that using relatedness information leads to significant performance gains over established baselines.
#402 - Object Non-detection: Hiding an Object from Many Viewpoints [pdf]
Andrew Owens, Connelly Barnes, Alex Flint, Hanumant Singh, Bill Freeman |
Abstract: We address the problem of camouflaging a 3D object so that it is hidden from human observers in many different views. We are inspired by biological camouflage in species such as flatfish and cuttlefish, which use vision to observe the environment, and then adjust pigments in their skin to create texture which conceals the animal. We propose a similar modular approach for digital camouflage, with two stages: capture, and computation of the camouflage. In the first stage, capture, we take a set of photographs from many views of a scene. One photograph is taken from each view from which an object should be concealed. We recover the camera poses and place a synthetic (digital) 3D object in the captured scene. In the second stage, computation, we propose several new camouflage models for computing the object's texture given the captured scene. Ideally, the object would be hidden from every view, but this is typically impossible except for very flat objects. We evaluate whether the object was successfully hidden by psychophysical experiments on Amazon Mechanical Turk. Our proposed methods significantly improve over a naive camouflage strategy. We contribute a dataset of 20 indoor and outdoor scenes that can be used to objectively measure the success of a camouflage algorithm.
#418 - DeepPose: Human Pose Estimation via Deep Neural Networks [pdf]
Alexander Toshev, Christian Szegedy |
Abstract: We propose a method for human pose estimation based on Deep Neural Networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. We present a cascade of such DNN regressors which results in high precision pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple yet powerful formulation which capitalizes on recent advances in Deep Learning. We present a detailed empirical analysis with state-of-the-art or better performance on four academic benchmarks of diverse real-world images.
#432 - Neural Decision Forests for Semantic Image Labelling [pdf]
Samuel Rota Bulò, Peter Kontschieder |
Abstract: In this work we present Neural Decision Forests, a novel approach to jointly tackle data representation and discriminative learning within randomized decision trees. Recent advances in deep learning architectures demonstrate the power of embedding representation learning within the classifier, an idea that is intuitively supported by the hierarchical nature of the decision forest model where the input space is typically left unchanged during training and testing. We bridge this gap by introducing randomized Multi-Layer Perceptrons (rMLP) as new split nodes which are capable of learning non-linear, data-specific representations and taking advantage of them by finding optimal predictions for the emerging child nodes. To prevent overfitting, we i) randomly select the image data fed to the input layer, ii) automatically adapt the rMLP topology to meet the complexity of the data arriving at the node and iii) introduce an L1-norm based regularization that additionally sparsifies the network. The key findings in our experiments on three different semantic image labelling datasets are consistently improved results and significantly compressed trees compared to conventional classification trees.
#444 - Learning Everything about Anything: Webly-Supervised Visual Concept Learning [pdf]
Santosh Kumar Divvala, Ali Farhadi, Carlos Guestrin |
Abstract: Intra-class appearance variation has been regarded as one of the main nuisances in recognition. Recent works have proposed several interesting cues to reduce the visual complexity of a class, ranging from the use of simple annotations such as viewpoint or aspect-ratio to those requiring expert knowledge, e.g., visual phrases, poselets, attributes, etc. However, exploring intra-class variance still remains open. In this paper, we introduce an approach to discover an exhaustive concept-specific vocabulary of visual variance, that is biased towards what the human race has ever cared about. We present a fully automated method that learns models of actions, interactions, and attributes for any concept (i.e., scenes, actions, objects, emotions, places, celebrities, professions, etc). Using our framework, we have already trained models for over 10000 variations within 100 concepts and automatically annotated over 2 million images. We show a list of potential applications that our model enables across vision and NLP. We invite the interested reader to use our (doubly anonymous) system at http://goo.gl/O99uZ2 to train a detector for a concept of their choice.
Abstract: This paper addresses the problem of assigning class labels to image pixels, where classes of interest include objects and scene surfaces. Following recent holistic formulations, we cast scene labeling as the MAP assignment of a fully connected conditional random field (CRF) grounded onto superpixels. CRF inference is posed as a quadratic program (QP), and solved using our new stochastic search algorithm, called Heuristic-Score Beam Search (HSBS). HSBS gradually builds a search tree, where search states correspond to candidate scene labelings. Successor states are generated from a select set of parent states until convergence. HSBS is defined by three functions: Successor -- for stochastic exploration of the search space by generating successor states; Heuristic -- for evaluating and selecting top $B$ states for exploration; and Score -- for selecting the "best" leaf state as the solution. We prove that HSBS efficiently maximizes the QP objective of our CRF inference. HSBS is well-suited for scene labeling, because it explicitly accounts for spatial extents of objects; strictly conforms to inconsistency constraints from domain knowledge; and has low memory and computational costs. Effectiveness of HSBS for scene labeling is evaluated on the MSRC, Stanford Background, PASCAL VOC 2009 and 2010 datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Parallax handling is a challenging task for image stitching. This paper presents a local stitching method to handle parallax based on the observation that input images do not need to be perfectly aligned over the whole overlapped region for stitching. Instead, they only need to be aligned in a way that there exists a local region where they can be seamlessly blended together without noticeable artifacts. We adopt a hybrid alignment model that combines homography and content-preserving warping to provide flexibility for handling parallax and avoid objectionable local distortion. We then develop an efficient randomized algorithm to search for a homography, which, combined with content-preserving warping, allows for optimal stitching. We predict how well a homography enables plausible stitching by finding a plausible seam and using the seam cost as the quality score. We develop a seam finding method that can estimate a plausible seam from only roughly aligned images by considering both geometrical alignment and image content. We then pre-align input images using the optimal homography and further use content-preserving warping to locally refine alignment. We finally compose aligned images together using a standard seam-cutting algorithm and multi-band blending algorithm. Our experiments show that our method can effectively stitch images with large parallax that are difficult for existing methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#489 - Second-order Shape Optimization for Geometric Inverse Problems in Vision [pdf]
Jonathan Balzer, Stefano Soatto |
Abstract: We develop a method for optimization in shape spaces, i.e., sets of surfaces modulo re-parametrization. Unlike previously proposed gradient flows, we achieve superlinear convergence rates through a subtle approximation of the shape Hessian, which is generally hard to compute and suffers from a series of degeneracies. Our analysis highlights the role of mean curvature motion in comparison with first-order schemes: instead of surface area, our approach penalizes deformation, either by its Dirichlet energy or total variation. The latter regularizer sparks the development of an alternating direction method of multipliers on triangular meshes. Therein, a conjugate-gradients solver enables us to bypass formation of the Gaussian normal equations appearing in the course of the overall optimization. We combine all of the aforementioned ideas in a versatile geometric variation-regularized Levenberg-Marquardt-type method applicable to a variety of shape functionals, depending on intrinsic properties of the surface such as normal field and curvature as well as its embedding into space. Promising experimental results are reported.
|
Similar papers:
[rank all papers by similarity to this]
|
#492 - Hierarchical Subquery Evaluation for Active Learning on a Graph [pdf]
Oisin Mac Aodha, Neill Campbell, Jan Kautz, Gabriel Brostow |
Abstract: To train good supervised and semi-supervised object classifiers, it is critical that we not waste the time of the human experts who are providing the training labels. Existing active learning strategies can have uneven performance, being efficient on some datasets but wasteful on others, or inconsistent just between runs on the same dataset. We propose perplexity-based graph construction and a new hierarchical subquery evaluation algorithm to combat this variability, and to release the potential of Expected Error Reduction. Under some specific circumstances, Expected Error Reduction has been one of the strongest-performing informativeness criteria for active learning. Until now, it has also been prohibitively costly to compute for sizeable datasets. We demonstrate our highly practical algorithm, comparing it to other active learning measures on classification datasets that vary in sparsity, dimensionality, and size. Our algorithm is consistent over multiple runs and achieves high accuracy, while querying the human expert for labels at a frequency that matches their desired time budget.
|
Similar papers:
[rank all papers by similarity to this]
|
#493 - PANDA: Pose Aligned Networks for Deep Attribute Modeling [pdf]
Ning Zhang, Lubomir Bourdev, Marc'Aurelio Ranzato, Manohar Paluri, Trevor Darrell |
Abstract: We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNN) have been shown to perform very well on large scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and it may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets which are not presently available. Part-based models, such as poselets and DPM have been shown to perform well for this problem but they are limited by flat low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.
|
Similar papers:
[rank all papers by similarity to this]
|
#507 - Realtime and Robust Hand Tracking from Depth [pdf]
Chen Qian, Xiao Sun, Yichen Wei, Xiaoou Tang, Jian Sun |
Abstract: We present a realtime and robust hand tracking system using a depth sensor. It tracks a fully articulated hand under large viewpoints in realtime (25 FPS on a desktop without using a GPU) and with high accuracy (below 10 mm). To our knowledge, it is the first system that achieves such robustness, accuracy, and performance simultaneously. Our system is made of several novel and effective components. We use a simple hand model and define a fast cost function. Those are critical for realtime performance. Previous optimization methods are not suitable for our simple cost function. We instead propose a hybrid optimization scheme that overcomes their drawbacks, achieves quick convergence and good accuracy. We present new finger detection and hand initialization methods that greatly enhance the robustness of tracking.
|
Similar papers:
[rank all papers by similarity to this]
|
#512 - Energy based multi-model fitting & matching for 3D reconstruction [pdf]
Hossam Isack, Yuri Boykov |
Abstract: Standard geometric model fitting methods take as an input a fixed set of feature pairs greedily matched based only on their appearances. Inadvertently, many valid matches are discarded due to repetitive texture or large baseline between view points. To address this problem, matching should consider both feature appearances and geometric fitting errors. We jointly solve feature matching and multi-model fitting problems by optimizing one energy. The formulation is based on our generalization of the assignment problem and its efficient min-cost-max-flow solver. Our approach significantly increases the number of correctly matched features, improves the accuracy of fitted models, and is robust to larger baselines.
|
Similar papers:
[rank all papers by similarity to this]
|
#513 - Robust Subspace Segmentation with Block-diagonal Prior [pdf]
Jiashi Feng, Zhouchen Lin, Huan Xu, Shuicheng Yan |
Abstract: The subspace segmentation problem is addressed in this paper by effectively constructing an exactly block diagonal sample affinity matrix. The block diagonal structure is highly desirable for accurate sample clustering but rather difficult to obtain. Most current state-of-the-art subspace segmentation methods (such as SSC and LRR) resort to alternative structural priors (such as sparseness and low-rankness) to construct the affinity matrix. In this work, we propose a graph Laplacian constraint based formulation to directly pursue the block diagonal structure, and then develop an efficient stochastic subgradient algorithm for optimization. Moreover, two new subspace segmentation methods, the block diagonal SSC and LRR, are devised in this work. To the best of our knowledge, this is the first research attempt to explicitly pursue such a block diagonal structure. Extensive experiments on face clustering, motion segmentation and graph construction for semi-supervised learning clearly demonstrate the superiority of the proposed subspace segmentation methods. In particular, we achieve state-of-the-art performance for the motion segmentation task on the Hopkins 155 benchmark dataset.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: As the collection of large datasets becomes increasingly automated, the occurrence of outliers will increase -- "big data" implies "big outliers". While principal component analysis (PCA) is often used to reduce the size of data, and scalable solutions exist, it is well-known that outliers can arbitrarily corrupt the results. Unfortunately, state-of-the-art approaches for robust PCA do not scale beyond small-to-medium sized datasets. To address this, we introduce the Grassmann Average (GA), which expresses dimensionality reduction as an average of the subspaces spanned by the data. Because averages can be efficiently computed, we immediately gain scalability. While GA coincides with PCA for Gaussian data, it is already more robust to outliers. We then exploit the fact that averages can be made robust to formulate the Robust Grassmann Average (RGA). Robustness can be with respect to vectors (subspaces) or elements of vectors; we focus on the latter and use a trimmed average. The resulting Trimmed Grassmann Average (TGA) is particularly appropriate for computer vision because it is robust to pixel outliers. The algorithm has low computational complexity and minimal memory requirements, making it scalable to "big noisy data." We demonstrate TGA for background modeling, video restoration, and shadow removal. We show scalability by performing robust PCA on the entire Star Wars IV movie; a task beyond any currently existing method. Source code will be made available.
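The robust-averaging ingredient mentioned in this abstract, an element-wise trimmed average, can be illustrated in a few lines. This is only a minimal sketch of a trimmed mean applied per coordinate, not the Grassmann Average algorithm itself; the function names, the `trim_fraction` parameter, and the toy `frames` data are assumptions made for illustration.

```python
def trimmed_mean(values, trim_fraction=0.25):
    """Mean of `values` after dropping the smallest and largest
    `trim_fraction` share of the entries (hypothetical helper)."""
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # entries dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def elementwise_trimmed_mean(vectors, trim_fraction=0.25):
    """Apply the trimmed mean independently to each coordinate (pixel)."""
    return [trimmed_mean(column, trim_fraction) for column in zip(*vectors)]


# Toy data: four two-pixel "frames"; one pixel value (255.0) is an outlier.
frames = [
    [10.0, 20.0],
    [11.0, 21.0],
    [12.0, 255.0],
    [13.0, 22.0],
]
robust_average = elementwise_trimmed_mean(frames)
plain_average = [sum(column) / len(column) for column in zip(*frames)]
print(robust_average, plain_average)
```

In this toy case the plain average of the second pixel is pulled far towards the outlier, while the trimmed average ignores it because trimming is done separately for each coordinate.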
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: The world is full of objects with complex reflectances, situated in complex illumination environments. Past work on full 3D geometry recovery, however, has tried to handle this complexity by framing it into simplistic models of reflectance (Lambertian, mirrored, or diffuse plus specular) or illumination (one or more point light sources). Though there has been some recent progress in directly utilizing such complexities for recovering a single view geometry, it is not clear how such single-view methods can be extended to reconstruct the full geometry. To this end, we derive a probabilistic geometry estimation method that fully exploits the rich signal embedded in complex appearance. Though each observation provides partial and unreliable information, we show how to estimate the reflectance responsible for the diverse appearance, and unite the orientation cues embedded in each observation to reconstruct the underlying geometry. We demonstrate the effectiveness of our method on synthetic and real-world objects. The results show that our method performs accurately across a wide range of real-world environments and reflectances that lie between the extremes that have been the focus of past work.
|
Similar papers:
[rank all papers by similarity to this]
|
#562 - Adaptive Partial Differential Equation Learning for Visual Saliency Detection [pdf]
Risheng Liu, Zhouchen Lin, Shiguang Shan |
Abstract: Partial differential equations (PDEs) have been successful for solving many low-level vision tasks. However, it is a challenging task to directly utilize PDEs for visual saliency detection due to the difficulty of incorporating human perception and high-level priors into a PDE system. Instead of designing a PDE with fixed formulation and boundary condition, this paper proposes a novel framework to adaptively learn a PDE system from the image for visual saliency detection. We assume that the saliency of image elements can be derived from their relevance to the saliency seeds (i.e., the most representative salient elements). In this view, a general Linear Elliptic System with Dirichlet boundary (LESD) is introduced to model the diffusions from seeds to other relevant points. For a given image, we first learn a guidance map to fuse human prior knowledge into the diffusion system. Then by optimizing a discrete submodular function constrained with this LESD and a uniform matroid, the saliency seeds (i.e., boundary conditions) can be learnt for this image, thus achieving an optimal PDE system to model the evolution of visual saliency. Experimental results on various challenging image sets show the superiority of our proposed learning-based PDE for visual saliency detection.
|
Similar papers:
[rank all papers by similarity to this]
|
#570 - Local Regularity-driven City-scale Facade Detection from Aerial Images [pdf]
Yanxi Liu, Jingchen Liu |
Abstract: We present a novel regularity-driven framework for facade detection from aerial images of urban scenes. By exploring the reciprocal relation between regularity and sparsity computationally, we choose GINI-index as a new, local regularity measurement for urban scenes. We illustrate a simple and fast GINI-index optimization algorithm using GRASP for near-regular facade-region detection from high resolution airborne images, each of which typically contains more than 200 facades. Our experimental results on images from three different cities (NYC, SF, Rome) demonstrate superior performance on facade detection in both accuracy and speed over state of the art methods.
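For readers unfamiliar with the Gini index as a sparsity measure, the following is a minimal sketch of the standard Gini coefficient for a list of non-negative values. How the paper applies it to image regions, and its exact definition there, are not stated in the abstract; the function name `gini_index` and the toy inputs are assumptions for illustration.

```python
def gini_index(values):
    """Standard Gini coefficient of non-negative values.

    Returns 0.0 for a perfectly uniform input and approaches 1.0 as the
    total mass concentrates in a single entry (i.e. the input gets sparser).
    """
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    # Rank-weighted sum over the sorted values, with ranks i = 1..n.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * total)


uniform_response = [1, 1, 1, 1]  # evenly spread: not sparse at all
sparse_response = [0, 0, 0, 1]   # all mass in one entry: maximally sparse
print(gini_index(uniform_response), gini_index(sparse_response))
```

For n entries the maximum value is 1 - 1/n, reached when a single entry carries all the mass.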
|
Similar papers:
[rank all papers by similarity to this]
|
#630 - L0 norm based dictionary learning method with global convergence [pdf]
Chenglong Bao, Hui Ji, Yuhui Quan, Zuowei Shen |
Abstract: In recent years, dictionary learning for sparse modelling has been an important tool for many applications in computer vision, which usually results in solving a non-convex minimization problem that is challenging in terms of computational feasibility and convergence analysis. Many iterative methods have been proposed to tackle such non-convex optimization problems by either replacing the L0 norm with its convex relaxation (e.g. the L1 norm) or using a greedy algorithm based on some heuristic such as orthogonal matching pursuit. In this paper, we propose a fast iterative scheme for L0 norm based dictionary learning. In particular, we show that the algorithm is globally convergent to a stationary point, which is the first such result among the existing methods. The advantages of the proposed method in stability and convergence are demonstrated in applications to image recovery and recognition. The experiments show that the proposed method takes less computational time to achieve performance comparable to widely used methods such as K-SVD.
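The contrast between the L0 penalty and its L1 relaxation can be made concrete with the two scalar shrinkage rules they induce. This is an illustrative sketch only, not the paper's algorithm: it assumes the per-coefficient objective 0.5 * (x - y)^2 + lam * penalty(x), and the names `hard_threshold`, `soft_threshold`, and `lam` are hypothetical.

```python
import math


def hard_threshold(y, lam):
    """Minimizer of 0.5 * (x - y)**2 + lam * [x != 0] (L0 penalty).

    Keeping x = y costs lam, setting x = 0 costs 0.5 * y**2, so the
    coefficient survives unchanged only if |y| > sqrt(2 * lam).
    """
    return y if abs(y) > math.sqrt(2.0 * lam) else 0.0


def soft_threshold(y, lam):
    """Minimizer of 0.5 * (x - y)**2 + lam * |x| (L1 penalty).

    Every coefficient is shrunk towards zero by lam.
    """
    return math.copysign(max(abs(y) - lam, 0.0), y)


coefficients = [3.0, -0.5, 1.5]
lam = 0.5
hard = [hard_threshold(y, lam) for y in coefficients]
soft = [soft_threshold(y, lam) for y in coefficients]
print(hard, soft)
```

Hard thresholding leaves the surviving coefficients untouched, whereas soft thresholding biases all of them towards zero; the price of the L0 rule is that the underlying problem is non-convex.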
|
Similar papers:
[rank all papers by similarity to this]
|
#648 - Optimal Decisions from Probabilistic Models: the Intersection-over-Union Case [pdf]
Sebastian Nowozin |
Abstract: A probabilistic model allows us to reason about the world and make statistically optimal decisions using Bayesian decision theory. However, in practice the intractability of the decision problem forces us to adopt simplistic loss functions such as the 0/1 loss or Hamming loss and as a result we make poor decisions through MAP estimates or through low-order marginal statistics. In this work we investigate optimal decision making for more realistic loss functions. Specifically we consider the popular intersection-over-union (IoU) score used in image segmentation benchmarks and show that it results in a hard combinatorial decision problem. To make this problem tractable we propose a statistical approximation to the objective function, as well as an approximate algorithm based on parametric linear programming. We apply the algorithm on three benchmark datasets and obtain improved intersection-over-union scores compared to maximum-posterior-marginal decisions. Our work raises important questions for proponents of probabilistic models in computer vision.
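To see why the IoU score and the Hamming loss can disagree, a small sketch on flattened binary masks is enough. It only evaluates the two measures; it does not implement the paper's decision procedure. The function names, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions for illustration.

```python
def intersection_over_union(predicted, reference):
    """IoU of two equally long binary masks (flattened pixel lists)."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    union = sum(1 for p, r in zip(predicted, reference) if p or r)
    return intersection / union if union else 1.0  # both empty: perfect match


def hamming_loss(predicted, reference):
    """Fraction of pixels whose labels differ."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    return sum(1 for p, r in zip(predicted, reference) if bool(p) != bool(r)) / len(reference)


reference = [1, 1, 0, 0, 0, 0, 0, 0]
all_background = [0, 0, 0, 0, 0, 0, 0, 0]   # misses the object entirely
over_segmented = [1, 1, 1, 1, 0, 0, 0, 0]   # covers the object plus two extra pixels

for mask in (all_background, over_segmented):
    print(hamming_loss(mask, reference), intersection_over_union(mask, reference))
```

Both predictions mislabel two of eight pixels, so their Hamming loss is identical, yet their IoU scores differ, which is the kind of mismatch between loss functions that the abstract refers to.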
|
Similar papers:
[rank all papers by similarity to this]
|
#673 - Rich feature hierarchies for accurate object detection and semantic segmentation [pdf]
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik |
Abstract: Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
|
Similar papers:
[rank all papers by similarity to this]
|
#683 - Surface-from-Gradients: An Approach Based on Discrete Geometry Processing [pdf]
Wuyuan Xie, Yunbo Zhang, Charlie C L Wang |
Abstract: In this paper, we propose an efficient method to reconstruct surface-from-gradients (SfG). Our method is formulated under the framework of discrete geometry processing. Unlike the existing SfG approaches, we transfer the continuous reconstruction problem into a discrete space and efficiently solve the problem via a sequence of least-square optimization steps. Our discrete formulation brings three advantages: 1) the reconstruction preserves sharp features, 2) sparse/incomplete sets of gradients can be well handled, and 3) domains of computation can have irregular boundaries. Generally, these strengths of our method help overcome the unwanted distortions during the surface reconstruction. Our formulation is direct and easy to implement, and the comparisons with state-of-the-art methods show the effectiveness of our method.
|
Similar papers:
[rank all papers by similarity to this]
|
#691 - Nearest Neighbor-based Label Transfer for Weakly Supervised Multiclass Video Segmentation [pdf]
Xiao Liu, Dacheng Tao, Mingli Song, Ying Ruan, Chun Chen, Jiajun Bu |
Abstract: The desire to enable computers to learn semantic concepts from large quantities of Internet videos has motivated increasing interest in semantic video understanding, while video segmentation is important yet challenging for understanding videos. The main difficulty of video segmentation arises from the burden of labeling training samples, making the problem largely unsolved. In this paper, we present a novel nearest neighbor-based label transfer scheme for weakly supervised video segmentation. Whereas previous weakly supervised video segmentation methods have been limited to the two-class case, our proposed scheme focuses on more challenging multiclass video segmentation, which finds a semantically meaningful label for every pixel in a video. Our scheme enjoys several favorable properties when compared with conventional methods. First, a weakly supervised hashing procedure is carried out to handle both metric and semantic similarity. Second, the proposed nearest neighbor-based label transfer algorithm effectively avoids overfitting caused by weakly supervised data. Third, a multi-video graph model is built to encourage smoothness between regions that are spatiotemporally adjacent and similar in appearance. We demonstrate the effectiveness of the proposed scheme by comparing it with several other state-of-the-art weakly supervised segmentation methods on one new Wild8 dataset and a publicly available YTO dataset.
|
Similar papers:
[rank all papers by similarity to this]
|
#754 - Minimal Solvers for Relative Pose with a Single Unknown Radial Distortion [pdf]
Yubin Kuang, Jan Erik Solem, Kalle Astroem, Fredrik Kahl |
Abstract: In this paper, we study the problems of estimating relative pose between two cameras in the presence of radial distortion. Specifically, we consider minimal problems where one of the cameras has no or known radial distortion. There are three useful cases for this setup of a single camera with unknown distortion: (i) fundamental matrix estimation where the two cameras are uncalibrated, (ii) essential matrix estimation for partially calibrated camera pair, (iii) essential matrix estimation for one calibrated camera and one camera with unknown focal length. We study the parameterization of these three problems and derive fast polynomial solvers based on Gröbner basis methods. We demonstrate the numerical stability of the solvers on synthetic data. We have also applied these minimal solvers on real images with convincing results.
|
Similar papers:
[rank all papers by similarity to this]
|
#776 - Partial Optimality by Pruning for MAP-inference with General Graphical Models [pdf]
Paul Swoboda, Bogdan Savchynskyy, Joerg Kappes, Christoph Schnörr |
Abstract: We consider the energy minimization problem for undirected graphical models, also known as the MAP-inference problem for Markov random fields, which is NP-hard in general. We propose a novel polynomial time algorithm to obtain a part of its optimal non-relaxed integral solution. Our algorithm is initialized with variables taking integral values in the solution of a convex relaxation of the MAP-inference problem and iteratively prunes those which do not satisfy our criterion for partial optimality. We show that our pruning strategy is in a certain sense theoretically optimal. Also empirically our method outperforms previous approaches in terms of the number of persistently labelled variables. The method is very general, as it is applicable to models with arbitrary factors of an arbitrary order and can employ any solver for the considered relaxed problem. Our method's runtime is determined by the runtime of the convex relaxation solver for the MAP-inference problem.
|
Similar papers:
[rank all papers by similarity to this]
|
#782 - Learning to disambiguate indistinguishable objects over time: weakly supervised structured learning [pdf]
Luca Fiaschi, Ferran Diego, Konstantin Gregor, Ullrich Koethe, Marta Zlatic, Fred Hamprecht |
Abstract: We use weakly supervised structured learning to track and disambiguate the identity of multiple indistinguishable, translucent and deformable objects that can overlap for many frames. For this challenging problem, we propose a novel model which handles occlusions, complex motions and non-rigid deformations by jointly optimizing the flows of multiple latent intensities across frames. These flows are latent variables for which the user cannot directly provide labels. Instead, we propose a structured learning formulation that uses only partial user annotations to find the best hyperparameters of the model. The approach is evaluated on a challenging dataset for multiple Drosophila larvae tracking which we make publicly available. Our method tracks multiple larvae in spite of their poor distinguishability and minimizes the number of identity switches during prolonged mutual occlusion.
|
Similar papers:
[rank all papers by similarity to this]
|
#787 - Filter Forests for Learning Data-Dependent Convolutional Kernels [pdf]
Sean Ryan Fanello, Cem Keskin, Pushmeet Kohli, Shahram Izadi, Jamie Shotton, Antonio Criminisi, Ugo Pattacini, Tim Paek |
Abstract: We propose 'filter forests' (FF), an efficient new discriminative approach for predicting continuous variables given a signal and its context. FF can be used for general signal restoration tasks that can be tackled via convolutional filtering, where it attempts to learn the optimal filtering kernels to be applied to each data point. The model can learn both the size of the kernel and its values, conditioned on the observation and its spatial or temporal context. We show that FF compares favorably to both Markov random field based and recently proposed regression forest based approaches for labelling problems in terms of efficiency and accuracy. In particular, we demonstrate how FF can be used to learn optimal denoising filters for natural images as well as for other tasks such as depth image refinement, and 1D signal magnitude estimation. Numerous experiments and quantitative comparisons show that FFs achieve accuracy on par with or superior to recent state-of-the-art techniques, while being several orders of magnitude faster.
|
Similar papers:
[rank all papers by similarity to this]
|
#803 - Rate-Invariant Analysis of Trajectories on Riemannian Manifolds with Application in Visual Speech Recognition [pdf]
Jingyong Su, Anuj Srivastava, Fillipe Souza, Sudeep Sarkar |
Abstract: In statistical analysis of video sequences for speech recognition, and more generally activity recognition, it is natural to treat temporal evolutions of features as trajectories on Riemannian manifolds. However, different evolution patterns result in arbitrary parameterizations of trajectories. We investigate a recent framework from statistics literature that handles this nuisance variability using a cost function/distance for temporal registration and statistical summarization & modeling of trajectories. It is based on a mathematical representation of trajectories, termed transported square-root vector field (TSRVF), and the L2 norm on the space of TSRVFs. We apply this framework to the problem of speech recognition using both audio and visual components. In each case, we extract features, form trajectories on corresponding manifolds, and compute parametrization-invariant distances using TSRVFs for speech classification. On the OuluVS database the classification performance under metric increases significantly by nearly 100% under both modalities and for all choices of features. We obtained speaker-dependent classification rate of 70% and 96% for visual and audio components, respectively.
|
Similar papers:
[rank all papers by similarity to this]
|
#819 - Video Event Detection by Inferring Temporal Instance Labels [pdf]
Kuan-Ting Lai, Felix Yu, Ming-Syan Chen, Shih-Fu Chang |
Abstract: Video event detection allows intelligent indexing of video content based on events. Traditional approaches extract features from video frames or shots, then quantize and pool the features to form a single vector representation for the entire video. Though simple and efficient, the final pooling step may lead to loss of temporally local information, which is important in indicating which part in a long video signifies presence of the event. In this work, we propose a novel instance-based video event detection approach. We represent each video as multiple "instances", defined as video segments of different temporal intervals. The objective is to learn an instance-level event detection model based on only video-level labels. To solve this problem, we propose a large-margin formulation which treats the instance labels as hidden latent variables, and simultaneously infers the instance labels as well as the instance-level classification model. Our framework infers optimal solutions that assume positive videos have a large number of positive instances while negative videos have the fewest ones. Extensive experiments on large-scale video event datasets demonstrate significant performance gains. The proposed method is also useful in explaining the detection results by localizing the temporal segments in a video that are responsible for the positive detection.
|
Similar papers:
[rank all papers by similarity to this]
|
#824 - Decorrelating Semantic Visual Attributes by Resisting the Urge to Share [pdf]
Dinesh Jayaraman, Fei Sha, Kristen Grauman |
Abstract: Existing methods to learn visual attributes are prone to learning the wrong thing, namely properties that are correlated with the attribute of interest among training samples. Yet, many proposed applications of attributes rely on being able to learn the correct semantic concept corresponding to each attribute. We propose to resolve such confusions by jointly learning decorrelated, discriminative attribute models. Leveraging side information about semantic relatedness, we develop a multi-task learning approach that uses structured sparsity to encourage feature competition among unrelated attributes and feature sharing among related attributes. On three challenging datasets, we show that accounting for structure in the visual attribute space is key to learning attribute models that preserve semantics, yielding improved generalizability that helps in the recognition and discovery of unseen object categories.
|
Similar papers:
[rank all papers by similarity to this]
|
#826 - Triangulation embedding and democratic aggregation for image search [pdf]
Herve Jegou, Andrew Zisserman |
Abstract: We consider the design of a single vector representation for an image that embeds and aggregates a set of local patch descriptors such as SIFT. More specifically we aim to construct a dense representation, like the Fisher Vector or VLAD, though of small or intermediate size. We make two contributions, both aimed at regularizing the individual contributions of the local descriptors in the final representation. The first is a novel embedding method that avoids the dependency on absolute distances by encoding directions. The second contribution is a "democratization" strategy that further limits the interaction of unrelated descriptors in the aggregation stage. These methods are complementary and give a substantial performance boost over the state of the art in image search with short or mid-size vectors, as demonstrated by our experiments on standard public image retrieval benchmarks.
|
Similar papers:
[rank all papers by similarity to this]
|
#829 - Image Fusion with Local Spectral Consistency and Dynamic Gradient Sparsity [pdf]
Chen Chen, Junzhou Huang, Wei Liu |
Abstract: In this paper, we propose a novel method for image fusion from a high resolution panchromatic image and a low resolution multispectral image at the same geographical location. Different from previous methods, we do not make any assumption about the upsampled multispectral image, but only assume that the fused image after downsampling should be close to the original multispectral image. This is a severely ill-posed problem and a dynamic gradient sparsity penalty is thus proposed for regularization. Incorporating the intra-correlations of different bands, this penalty can effectively exploit the prior information (e.g. sharp boundaries) from the panchromatic image. A new convex optimization algorithm is proposed to efficiently solve this problem. Extensive experiments on four multispectral datasets demonstrate that the proposed method significantly outperforms the state of the art in terms of both spatial and spectral qualities.
|
Similar papers:
[rank all papers by similarity to this]
|
#846 - Transparent Object Reconstruction via Coded Transport of Intensity [pdf]
Chenguang Ma, Xing Lin, Jinli Suo, Qionghai Dai, Gordon Wetzstein |
Abstract: Capturing and understanding visual signals is one of the core interests of computer vision. Much progress has been made w.r.t. many aspects of imaging, but the reconstruction of refractive phenomena, such as turbulence, gas and heat flows, liquids, or transparent solids, has remained a challenging problem. In this paper, we derive an intuitive formulation of light transport in refractive media using light fields and the transport of intensity equation. We show how coded illumination in combination with pairs of recorded images allows for robust computational reconstruction of dynamic two- and three-dimensional refractive phenomena.
|
Abstract: Groups are the primary entities that make up a crowd. Understanding group-level dynamics and properties is thus scientifically important and practically useful in a wide range of applications, especially for crowd understanding. In this study we show that fundamental group-level properties, such as intra-group stability and inter-group conflict, can be systematically quantified by visual descriptors. This is made possible through learning a novel Collective Transition prior, which leads to a robust approach for group segregation in public spaces. From the prior, we further devise a rich set of group property visual descriptors. These descriptors are scene-independent, and can be effectively applied to public spaces with a variety of crowd densities and distributions. Extensive experiments on hundreds of public scene video clips demonstrate that such property descriptors are not only useful but also necessary for group state analysis and crowd scene understanding.
|
#886 - Learning Euclidean-to-Riemannian Metric for Point-to-Set Classification [pdf]
Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xilin Chen |
Abstract: Recently, an increasing number of methods have been suggested for the problem of point-to-set classification, e.g., still-to-video face recognition, which tends to prevail in computer vision. To the best of our knowledge, few works focus on learning a desirable point-to-set distance metric. While points lie in classical Euclidean space, recent studies have gained success by modeling sets as points in specific Riemannian space. In this paper, we propose to learn the Euclidean-to-Riemannian Metric between Euclidean points (i.e., original points) and Riemannian points (i.e., models of the original sets) for point-to-set classification. Due to the heterogeneity of the two spaces, we need to map the Euclidean points and Riemannian points to a common Euclidean subspace, where Euclidean distance can be applied. Specifically, we derive a unified framework to first embed the two heterogeneous spaces into reproducing kernel Hilbert spaces by preserving the original data distribution and geometric structure, and then learn corresponding transformations from each of the Hilbert spaces to the final common subspace. Extensive experiments clearly demonstrate the superiority of our approach over the state-of-the-art methods.
|
Abstract: The limitations of current state-of-the-art methods for single-view depth estimation and semantic segmentation are closely tied to the property of perspective geometry, that the perceived size of the objects scales inversely with the distance. In this paper, we show that we can use this property to reduce the learning of a pixel-wise depth classifier to a much simpler classifier predicting only the likelihood of a pixel being at an arbitrarily fixed canonical depth. The likelihoods for any other depths can be obtained by applying the same classifier after appropriate image manipulations. Such transformation of the problem to the canonical depth removes the training data bias towards certain depths and the effect of perspective. The approach can be straightforwardly generalized to multiple semantic classes, improving both depth estimation and semantic segmentation performance by directly targeting the weaknesses of independent approaches. Conditioning the semantic label on the depth provides a way to align the data to their physical scale, allowing us to learn a more discriminative classifier. Conditioning depth on the semantic class helps the classifier to distinguish between ambiguities of the otherwise ill-posed problem. We tested our algorithm on the KITTI road scene dataset and NYU2 indoor dataset and obtained results that significantly outperform the current state of the art in both the single-view depth and semantic segmentation domains.
|
#959 - Adaptive Color Attributes for Real-Time Visual Tracking [pdf]
Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, Joost van de Weijer |
Abstract: Visual tracking is a challenging problem in computer vision. Most state-of-the-art visual trackers either rely on luminance information or use simple color representations for image description. Contrary to visual tracking, for object recognition and detection, sophisticated color features when combined with luminance have been shown to provide excellent performance. Due to the complexity of the tracking problem, the desired color feature should be computationally efficient, and possess a certain amount of photometric invariance while maintaining high discriminative power. This paper investigates the contribution of color in a tracking-by-detection framework. Our results suggest that color attributes provide superior performance for visual tracking. We further propose an adaptive low-dimensional variant of color attributes. Both quantitative and attribute-based evaluations are performed on 41 challenging benchmark color sequences. The proposed approach improves the baseline intensity-based tracker by 24% in median distance precision. Furthermore, we show that our approach outperforms state-of-the-art tracking methods while running at more than 100 frames per second.
|
Abstract: We address the problem of automatically populating object category detection datasets with dense, per-object 3D reconstructions, bootstrapped from class labels, ground truth figure-ground segmentations and a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion, then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions. The visual hull sampling process attempts to intersect an object's projection cone with the cones of minimal subsets of other similar objects among those pictured from certain vantage points. We show that our method is able to produce recognizable per-object 3D reconstructions on one of the most challenging existing object-category detection datasets, PASCAL VOC. Our results may re-stimulate once popular geometry-oriented model-based recognition approaches.
|
Abstract: When do the visual rays associated with triplets of point correspondences converge, that is, intersect in a common point? Classical models of trinocular geometry based on the fundamental matrices and trifocal tensor associated with the corresponding cameras only provide partial answers to this fundamental question, in large part because of underlying, but seldom explicit, general configuration assumptions. This paper uses elementary tools from projective line geometry to provide necessary and sufficient geometric and analytical conditions for convergence in terms of transversals to triplets of visual rays, without any such assumptions. In turn, this yields a novel and simple minimal parameterization of trinocular geometry for cameras with non-collinear or collinear pinholes.
|
#971 - 3D Shape and Indirect Appearance by Structured Light Transport [pdf]
Matthew O'Toole, John Mather, Kyros Kutulakos |
Abstract: We consider the problem of deliberately manipulating the direct and indirect light flowing through a time-varying, fully-general scene in order to simplify its visual analysis. Our approach rests on a crucial link between stereo geometry and light transport: while direct light always obeys the epipolar geometry of a projector-camera pair, indirect light overwhelmingly does not. We show that it is possible to turn this observation into an imaging method that analyzes light transport in real time in the optical domain, prior to acquisition. This yields three key abilities, which we demonstrate with our experimental prototype: (1) producing live indirect-only video that works for any scene, regardless of geometric or photometric complexity; (2) capturing images that make existing structured-light shape recovery algorithms robust to indirect transport; and (3) turning them into one-shot methods appropriate for dynamic shape capture.
|
Abstract: Many state-of-the-art image restoration approaches do not scale well to larger images, such as megapixel images common in the consumer segment. Computationally expensive optimization is often the culprit. While efficient alternatives exist, they have not reached the same level of image quality. The goal of this paper is to develop an effective approach to image restoration that offers both computational efficiency and high restoration quality. To that end we propose shrinkage fields, a random field-based architecture that combines the image model and the optimization algorithm in a single unit. The underlying shrinkage operation bears connections to wavelet approaches, but is used here in a random field context. Computational efficiency is achieved by construction through the use of convolution and DFT as the core components; high restoration quality is attained through loss-based training of all model parameters and the use of a cascade architecture. Unlike heavily engineered solutions, our learning approach can be adapted easily to different trade-offs between efficiency and image quality. We demonstrate state-of-the-art restoration results with high levels of computational efficiency, and significant further speedup possibilities through inherent parallelism.
|
#1012 - Temporal Sequence Modeling For Video Event Detection [pdf]
Quanfu Fan, Yu Cheng, Sharath Pankanti |
Abstract: We present a novel approach for event detection in video by temporal sequence modeling. Exploiting temporal information has lain at the core of many approaches for action and activity recognition. Unlike previous works doing temporal modeling at semantic event level with ground truth, we propose to model temporal dependencies in the data at sub-event granularity level without using event annotations. This frees our model from ground truth and addresses several limitations of temporal modeling in previous work. Based on this idea, we represent a video by a sequence of visual words learnt from the video, and apply the Sequence Memoizer to capture long-range dependencies in a temporal context in the visual sequence. This temporal model is further integrated with event classification for jointly performing segmentation and classification of events in a video. We demonstrate the efficacy of our approach on two challenging datasets for visual recognition.
|
#1031 - Reflectance and Fluorescent Spectra Recovery based on Fluorescent Chromaticity Invariance under Varying Illumination [pdf]
Ying Fu, Antony Lam, Yasuyuki Kobashi, Imari Sato, Takahiro Okabe, Yoichi Sato |
Abstract: In recent years, fluorescent analysis of scenes has received attention. Fluorescence can provide additional information about scenes, and has been used in applications such as camera spectral sensitivity estimation, 3D reconstruction, and color relighting. In particular, hyperspectral images of reflective-fluorescent scenes provide a rich amount of data. However, due to the complex nature of fluorescence, hyperspectral imaging methods rely on specialized equipment such as hyperspectral cameras and specialized illuminants. In this paper, we propose a more practical approach to hyperspectral imaging of reflective-fluorescent scenes using only a conventional RGB camera and varied colored illuminants. The key idea of our approach is to exploit a unique property of fluorescence: the chromaticity of fluorescent emissions is invariant under different illuminants. This allows us to robustly estimate spectral reflectance and fluorescent emission chromaticity. We then show that given the spectral reflectance and fluorescent chromaticity, the fluorescence absorption and emission spectra can also be estimated. We demonstrate in results that all scene spectra can be accurately estimated from RGB images. Finally, we show that our method can be used to accurately relight scenes under novel lighting.
|
#1054 - Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group [pdf]
Raviteja Vemulapalli, Felipe Arrate, Rama Chellappa |
Abstract: Recently introduced cost-effective depth sensors coupled with the real-time skeleton estimation algorithm of Shotton et al. [16] have resulted in a renewed interest in skeleton-based human action recognition. Most of the earlier skeleton-based approaches used either the joint locations or the joint angles to represent a human skeleton. In this paper, we propose a new skeletal representation that explicitly models the 3D geometric relationships between various body parts using translations and rotations in 3D space. Since 3D rigid body motions are members of the special Euclidean group SE(3), the proposed skeletal representation lies in the Lie group SE(3) × … × SE(3), which is a curved manifold. With the proposed representation human actions can be modeled as curves in this Lie group. Since classification of curves in this Lie group is not an easy task, we map the action curves from the Lie group to its Lie algebra, which is a vector space. We then perform classification using a combination of dynamic time warping, Fourier temporal pyramid representation and linear SVM. Experimental results on three action datasets show that the proposed representation performs better than various other commonly-used skeletal representations. The proposed approach also outperforms various state-of-the-art skeleton-based human action recognition approaches.
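As a rough illustration of mapping from a Lie group to its Lie algebra, the sketch below computes the logarithm map for the rotation group SO(3) only, turning a rotation matrix into a 3-vector (axis times angle). It is not the authors' implementation: the abstract works with SE(3), which also carries a translation part, and the function name `so3_log` and the nested-list matrix layout are assumptions made for this example. The sketch does not handle rotations by angles close to π, where the sine in the denominator vanishes.

```python
import math

def so3_log(R):
    """Map a 3x3 rotation matrix (nested lists) to its rotation vector."""
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp guards against round-off pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    # R - R^T = 2*sin(theta)*[axis]_x, so v = 2*sin(theta)*axis.
    v = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    if theta < 1e-9:
        # Small-angle limit: theta / (2*sin(theta)) tends to 1/2.
        return (0.5 * v[0], 0.5 * v[1], 0.5 * v[2])
    scale = theta / (2.0 * math.sin(theta))
    return (scale * v[0], scale * v[1], scale * v[2])

# Hypothetical input: a 90-degree rotation about the z-axis.
R_z90 = [[0.0, -1.0, 0.0],
         [1.0,  0.0, 0.0],
         [0.0,  0.0, 1.0]]
omega = so3_log(R_z90)
```

Because the result is an ordinary vector, sequences of such vectors can be fed to vector-space tools such as the dynamic time warping and linear SVM mentioned in the abstract.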
|
Abstract: Humans are capable of perceiving a scene at a glance as well as obtaining a deep understanding given additional time. The time course of this vital competence is riddled with controversy and remains a mystery still today. Today's vision algorithms rarely address computational budget constraints and even less so have principled ways to optimize for a fixed computational budget or even any budget. We present a computational model for learning strategies that optimize anytime performance of a visual architecture. It parametrizes the deployment of feature computation and binding/classification in a process oriented view. Computation is performed incrementally and decisions are taken at test time dependent on observed data as well as intermediate results. We show the applicability to standard recognition problems in scene and object recognition. In addition, we show how to incorporate a semantic back-off strategy into our model that mimics the time course of human perception. At each point in time our system gives the maximally specific answer given a desired level of accuracy.
|
#1062 - Preconditioning for Accelerated Iteratively Reweighted Least Squares in Structured Sparsity Reconstruction [pdf]
Chen Chen, Junzhou Huang, Lei He |
Abstract: In this paper, we propose a novel algorithm for structured sparsity reconstruction. This algorithm is based on the iteratively reweighted least squares (IRLS) framework, and accelerated by the preconditioned conjugate gradient method. The convergence rate of the proposed algorithm is almost the same as that of the traditional IRLS algorithms, that is, exponentially fast. Moreover, with the devised preconditioner, the computational cost for each iteration is significantly less than that of traditional IRLS algorithms, which makes it feasible for large scale problems. Besides the fast convergence, this algorithm can be flexibly applied to standard sparsity, group sparsity, and overlapping group sparsity problems. Experiments are conducted on a practical application, compressive sensing magnetic resonance imaging. Results demonstrate that the proposed algorithm achieves superior performance over 9 state-of-the-art algorithms in terms of both accuracy and computational cost.
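To make the IRLS idea concrete, here is a minimal sketch on a deliberately simple problem chosen for this example: per-coordinate denoising with a standard sparsity penalty, minimising 0.5*(x_i - b_i)^2 + lam*|x_i|. Each iteration replaces the absolute value by a weighted quadratic built from the current estimate and then solves the resulting least-squares problem in closed form. This is not the authors' algorithm: the problem is diagonal, so there is no linear system and hence no preconditioned conjugate gradient step, and the names `irls_l1_denoise`, `lam` and `eps` are assumptions.

```python
def irls_l1_denoise(b, lam, iters=200, eps=1e-12):
    """Minimise 0.5*(x_i - b_i)**2 + lam*|x_i| per coordinate with IRLS."""
    x = list(b)  # start from the observations
    for _ in range(iters):
        # Reweighting step: |x_i| is majorised by 0.5 * w_i * x_i**2
        # (plus a constant), with w_i = 1 / (|x_i| + eps) taken from
        # the current estimate.
        w = [1.0 / (abs(xi) + eps) for xi in x]
        # Weighted least-squares step: (1 + lam * w_i) * x_i = b_i.
        x = [bi / (1.0 + lam * wi) for bi, wi in zip(b, w)]
    return x

# Hypothetical observations; small entries should shrink to (almost) zero.
estimate = irls_l1_denoise([3.0, 0.5, -2.0], lam=1.0)
```

On this toy problem the iterates approach the soft-thresholded values of `b`, which is the known closed-form minimiser, so the sketch can be checked against it.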
|
#1078 - Photometric Stereo using Constrained Bivariate Regression for General Isotropic Surfaces [pdf]
Satoshi Ikehata, Kiyoharu Aizawa |
Abstract: This paper presents a purely pixel-wise photometric stereo method that stably handles general isotropic surfaces. Following the recently proposed sum-of-lobes representation of the isotropic reflectance function, we construct a constrained bivariate regression problem where the regression function is approximated by smooth, bivariate Bernstein polynomials. By considering the inverse representation of the image formation process, the unknown normal vector is separated from the unknown inverse reflectance function, and then we may accurately compute the unknown surface normals by solving a simple and efficient quadratic programming problem. Extensive evaluations are performed that show state-of-the-art performance using both synthetic and real-world images.
|
Abstract: In this paper, we propose the application of principal component analysis (PCA) to scale-spaces. PCA is a standard method used in computer vision. Because the translation of an input image into scale-space is a continuous operation, it requires the extension of conventional finite matrix-based PCA to an infinite number of dimensions. Here, we use spectral theory to resolve this infinite eigenproblem through the use of integration, and we propose an approximate solution based on polynomial equations. In order to clarify its eigensolutions, we apply spectral decomposition to Gaussian scale-space and scale-normalized Laplacian of Gaussian (LoG) space. As an application of this proposed method, we introduce a method for generating Gaussian blur images and scale-normalized LoG images, demonstrating that the accuracy of such an image can be made very high by using an arbitrary scale calculated through simple linear combination. Then, as a more practical example, we propose a new Scale Invariant Feature Transform (SIFT) detector.
|
#1108 - Optimizing Over Radial Kernels on Compact Manifolds [pdf]
Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, Mehrtash Harandi |
Abstract: We tackle the problem of optimizing over all possible positive definite radial kernels on Riemannian manifolds for classification. Kernel methods on Riemannian manifolds have recently become increasingly popular in computer vision. However, the number of known positive definite kernels on manifolds remains very limited. Furthermore, most kernels typically depend on at least one parameter that needs to be tuned for the problem at hand. A poor choice of kernel, or of parameter value, may yield significant performance drop-off. Here, we show that positive definite radial kernels on the unit n-sphere, the Grassmann manifold and Kendall's shape manifold can be expressed in a simple form whose parameters can be automatically optimized within a support vector machine framework. We demonstrate the benefits of our kernel learning algorithm on object, face, action and shape recognition.
|
Abstract: When one records a video/image sequence through a transparent medium (e.g. glass), the image is often a superposition of a transmitted layer (scene behind the medium) and a reflected layer. Recovering the two layers from such images seems to be a highly ill-posed problem since the number of unknowns to recover is twice as many as the given measurements. In this paper, we propose a robust method to separate these two layers from multiple images, which exploits the correlation of the transmitted layer across multiple images, and the sparsity and independence of the gradient fields of the two layers. A novel Augmented Lagrangian Multiplier based algorithm is designed to efficiently and effectively solve the decomposition problem. The experimental results on both simulated and real data demonstrate the superior performance of the proposed method over the state of the art, in terms of speed, accuracy, and simplicity.
|
#1175 - Face Alignment at 3000 FPS via Regressing Local Binary Features [pdf]
Shaoqing Ren, Xudong Cao, Yichen Wei, Jian Sun |
Abstract: This paper presents a highly efficient, very accurate regression approach for face alignment. Our approach has two novel components: a set of local binary features, and a locality principle for learning those features. The locality principle guides us to learn a set of highly discriminative local binary features for every facial landmark independently. The obtained local binary features are used to jointly learn a linear regression for final output. Because extracting and regressing local binary features is computationally very cheap, our system achieves over 3,000 fps for locating 68 landmarks while achieving state-of-the-art results on the current most challenging benchmarks.
|
#1210 - Image-based Synthesis and Re-Synthesis of Viewpoints Guided by 3D Models [pdf]
Konstantinos Rematas, Tobias Ritschel, Mario Fritz, Tinne Tuytelaars |
Abstract: We propose a technique to use the structural information extracted from a set of 3D models of an object class to improve novel-view synthesis for images showing unknown instances of this class. These novel views can be used to "amplify" training image collections that typically contain only a low number of views or lack certain classes of views entirely (e.g. top views). We extract the correlation of position, normal, reflectance and appearance from computer-generated images of a few exemplars and use this information to infer new appearance for new instances. We show that our approach can improve performance of state-of-the-art detectors using real-world training data. Additional applications include guided versions of inpainting, 2D-to-3D conversion, super-resolution and non-local smoothing.
|
#1230 - Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [pdf]
Maxime Oquab, Ivan Laptev, Leon Bottou, Josef Sivic |
Abstract: Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with a limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on PASCAL VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
|
Abstract: Subspace clustering is a powerful technology for clustering data according to the underlying subspaces. Representation based methods are the most popular subspace clustering approach in recent years. In this paper, we analyze the grouping effect of representation based methods in depth. In particular, we introduce the enforced grouping effect conditions, which greatly facilitate the analysis of grouping effect. We further find that grouping effect is important for subspace clustering, which should be explicitly enforced in the data self-representation model, rather than implicitly implied by the model as in some prior work. Based on our analysis, we propose the SMooth Representation (SMR) model. We also propose a new affinity measure based on the grouping effect, which proves to be much more effective than the commonly used one. As a result, our SMR significantly outperforms the state-of-the-art ones on benchmark datasets.
|
Abstract: The initial steps of many computer vision algorithms are interest point extraction and matching. In larger image sets the pairwise matching of interest point descriptors between images is an important bottleneck. For each descriptor in one image the (approximate) nearest neighbor in the other one has to be found and checked against the second-nearest neighbor to ensure the correspondence is unambiguous. Here, we ask how to best decimate the list of interest points without losing matches, i.e. we aim to speed up matching by filtering out, in advance, those points which would not survive the matching stage. It turns out that the best filtering criterion is not the response of the interest point detector, which in fact is not surprising: the goal of detection is to find repeatable and well-localized points, whereas the objective of the selection is to find points whose descriptors can be matched successfully. We show that one can in fact learn to predict which descriptors are matchable, and thus reduce the number of interest points significantly without losing too many matches. We show that this strategy, as simple as it is, greatly improves the matching success with the same number of points per image. Moreover, we embed the prediction in a state-of-the-art Structure-from-Motion pipeline and demonstrate that it also outperforms other selection methods at system level.
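The matching stage described in this abstract, where each nearest neighbor is checked against the second-nearest one, can be sketched as follows. This is a brute-force illustration rather than the approximate search the abstract mentions, and it is not the authors' learned matchability predictor. The function name `ratio_test_matches`, the Euclidean distance and the 0.8 threshold are assumptions made for this example.

```python
import math

def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Return (i, j) pairs where desc_a[i]'s nearest neighbour desc_b[j]
    is clearly closer than its second-nearest neighbour."""
    matches = []
    for i, d in enumerate(desc_a):
        # Brute-force distances to every descriptor in the other image.
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the test needs a second-nearest neighbour
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if it is unambiguous.
        if best < max_ratio * second:
            matches.append((i, j))
    return matches

# Hypothetical 2-D "descriptors"; real ones would be e.g. 128-D SIFT vectors.
image_a = [(0.0, 0.0), (5.0, 5.0)]
image_b = [(0.1, 0.0), (9.0, 9.0), (5.0, 4.0), (5.0, 6.0)]
matches = ratio_test_matches(image_a, image_b)
```

In this toy data the first descriptor has one clear neighbor and is kept, while the second has two equally close candidates and is discarded as ambiguous. Points of the second kind are the ones a matchability filter would aim to drop before matching.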
|
#1279 - Robust Orthonormal Subspace Learning: Efficient Recovery of Corrupted Low-rank Matrices [pdf]
Xianbiao Shu, Fatih Porikli, Narendra Ahuja |
Abstract: Low-rank recovery from a corrupted observation has many applications in computer vision. Conventional methods address this problem by iterating between nuclear norm minimization and sparsity minimization. However, low-rank recovery by nuclear norm minimization is computationally prohibitive for large scale problems such as video segmentation. Here, we propose a Robust Orthonormal Subspace Learning (ROSL) method to achieve efficient low-rank recovery. Our intuition is a novel rank measure on the low-rank matrix that imposes the group sparsity of its coefficients under an orthonormal subspace, which enables its recovery by fast sparse coding. We describe an efficient algorithm to solve the low-rank recovery at quadratic complexity of the matrix size. We analyze theoretical bounds to validate that this rank measure is lower bounded by the nuclear norm and that it has the same global minimum as the latter. To further accelerate ROSL to linear complexity, we also describe a version empowered by random sampling, ROSL+. Our extensive evaluations and comparisons demonstrate that both ROSL and ROSL+ provide superior efficiency (several orders of magnitude speed-up) against the state-of-the-art methods without compromising the accuracy.
|
#1294 - Cut, Glue & Cut: A Fast, Approximate Solver for Multicut Partitioning [pdf]
Thorsten Beier, Thorben Kroeger, Joerg Kappes, Ullrich Koethe, Fred Hamprecht |
Abstract: Recently, unsupervised image segmentation has become increasingly popular. Starting from a superpixel segmentation, an edge-weighted region adjacency graph is constructed. Amongst all segmentations of the graph, the one which best conforms to the given image evidence, as measured by the sum of cut edge weights, is chosen. Since this problem is NP-hard, we propose a new approximate solver based on the move-making paradigm: first, the graph is recursively partitioned into small regions (cut phase). Then, for any two adjacent regions, we consider alternative cuts of these two regions defining possible moves (glue & cut phase). For planar problems, the optimal move can be found, whereas for non-planar problems, efficient approximations exist. We evaluate our algorithm on published and new benchmark datasets, which we make available here. The proposed algorithm finds segmentations that, as measured by a loss function, are as close to the ground-truth as the global optimum found by exact solvers. It does so significantly faster than existing approximate methods, which is important for large-scale problems.
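The quantity that scores a segmentation here, the sum of cut edge weights, is simple to write down. The sketch below only evaluates that objective for a given labeling and does not implement the Cut, Glue & Cut solver. The graph, the weights and the name `cut_cost` are hypothetical, and the sign convention (lower cost is better, negative weights reward cutting an edge) is an assumption that the abstract does not state.

```python
def cut_cost(edges, labels):
    """Sum the weights of edges whose endpoints lie in different segments."""
    return sum(w for (u, v, w) in edges if labels[u] != labels[v])

# Hypothetical region adjacency graph over four superpixels:
# each entry is (superpixel u, superpixel v, edge weight).
edges = [(0, 1, -2.0), (1, 2, 3.0), (2, 3, -1.5), (0, 3, 0.5)]

# Two candidate segmentations, given as superpixel -> segment label.
merged = {0: "a", 1: "a", 2: "a", 3: "a"}
split = {0: "a", 1: "b", 2: "b", 3: "a"}

cost_merged = cut_cost(edges, merged)
cost_split = cut_cost(edges, split)
```

Under the assumed convention, `split` scores lower than `merged` because it cuts exactly the two negatively weighted edges. A move-making solver would compare candidate moves by evaluating this kind of cost difference.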
|
#1310 - Local Layering for Joint Motion Estimation and Occlusion Detection [pdf]
Deqing Sun, Ce Liu, Hanspeter Pfister |
Abstract: Most motion estimation algorithms (optical flow, layer models) cannot handle large amounts of occlusion in textureless regions, as motion is often initialized with a no-occlusion assumption even though occlusion may be included in the final objective. To handle such situations, we propose a local layering model where motion and occlusion relationships are inferred jointly. In particular, the uncertainties of occlusion relationships are retained so that motion is inferred by considering all the possibilities of local occlusion relationships. In addition, the local layer model handles articulated objects with self-occlusion. We demonstrate that the local layering model can handle motion and occlusion well for both challenging synthetic and real sequences.
|
Abstract: Recent works have considered shape recovery for an object of unknown BRDF using light source or object motions. This paper proposes a theory that addresses the remaining problem of determining shape from the (small or differential) motion of the camera, for unknown isotropic BRDFs. Our theory derives a differential stereo relation that relates camera motion to depth of a surface with unknown isotropic BRDF, which generalizes traditional Lambertian assumptions. Under orthographic projection, we show shape may not be constrained by differential stereo for general isotropic BRDFs, but two motions suffice to yield an invariant for several restricted (still unknown) BRDFs. For the perspective case, we show that three differential motions suffice to yield the surface depth for unknown isotropic BRDF and unknown directional lighting, while additional constraints are obtained with restrictions on the BRDF or lighting. The limits imposed by our theory are intrinsic to the shape recovery problem and independent of choice of reconstruction method. We outline with experiments how potential reconstruction methods may exploit our theory.
|
#1333 - Multi-Output Learning for Camera Relocalization [pdf]
Abner Guzman-Rivera, Pushmeet Kohli, Ben Glocker, Jamie Shotton, Shahram Izadi, Andrew Fitzgibbon, Toby Sharp |
Abstract: We address the problem of estimating the pose of a camera relative to a known 3D scene from a single RGB-D frame. We formulate this problem as inversion of the generative rendering procedure, i.e., we want to find the pose for the camera under which the rendered 3D scene is most similar to the input. This is a non-convex optimization problem which has a number of local optima. We propose a hybrid generative-discriminative learning based architecture that consists of: (i) a set of M predictors which generate M camera-pose hypotheses; and (ii) a selector or aggregator that tries to infer the best pose from the multiple pose hypotheses based on a similarity function. We are interested in predictors that not only produce good hypotheses but also hypotheses that are different from each other. Thus, we propose and study a number of methods for learning marginally relevant predictors; and compare their performance when used with different selection procedures. We evaluate our method on a recently released dataset of challenging camera-pose estimation problems. Experiments show that our method learns to make multiple predictions that are marginally relevant and can effectively select an accurate prediction. Furthermore, our method outperforms state-of-the-art discriminative learning based methods for camera relocalization.
|
#1335 - 3D Pictorial Structures: From Single to Multiple Human Pose Estimation [pdf]
Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, Slobodan Ilic |
Abstract: In this work, we address the problem of 3D pose estimation of multiple humans from multiple views. Compared to single human 3D pose estimation, this problem is more challenging because of the much larger state space and occlusions. Not knowing the identity of the humans in advance results in ambiguities across views. To address these problems, we first create a reduced state space by triangulation of corresponding body joints obtained from part detectors in pairs of camera views. In order to resolve the ambiguities of wrong and mixed body parts of multiple humans after triangulation and also those coming from false positive body part detections, we introduce a novel 3D Pictorial Structures (3DPS) model. Our model infers 3D human body configurations from our reduced state space. The 3DPS model is generic and applicable to both single and multiple human pose estimation. In order to compare to the state-of-the-art approaches, we first evaluate our method on single human 3D pose estimation on the HumanEva-I [24] and KTH Multiview Football Dataset II [8] datasets. Then we introduce and evaluate our method on two datasets for multiple human 3D pose estimation.
|
#1347 - Novel methods for multilinear data completion and de-noising based on tensor-SVD [pdf]
Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, Misha Kilmer |
Abstract: In this paper we propose novel methods for completion (from limited samples) and de-noising of multilinear (tensor) data and as an application consider 3-D and 4-D (color) video data completion and de-noising. We exploit the recently proposed tensor-Singular Value Decomposition (t-SVD) [11]. Based on t-SVD, the notion of multilinear rank and a related tensor nuclear norm was proposed in [11] to characterize informational and structural complexity of multilinear data. We first show that videos with linear camera motion can be represented more efficiently using t-SVD compared to the approaches based on vectorizing or flattening of the tensors. Since efficiency in representation implies efficiency in recovery, we outline a tensor nuclear norm penalized algorithm for video completion from missing entries. Application of the proposed algorithm for video recovery from missing entries is shown to yield a superior performance over existing methods. We also consider the problem of tensor robust Principal Component Analysis (PCA) for de-noising 3-D video data from sparse random corruptions. We show superior performance of our method compared to the matrix robust PCA adapted to this setting as proposed in [4].
|
#1429 - Convolutional Neural Networks for No-Reference Image Quality Assessment [pdf]
Le Kang, Peng Ye, Yi Li, David Doermann |
Abstract: In this work we describe a Convolutional Neural Network (CNN) to accurately predict image quality without a reference image. Taking image patches as input, the proposed CNN works in the spatial domain without encoding hand-crafted features that are employed by most previous methods. The network consists of one convolutional layer with max and min pooling, and two fully connected layers followed by an output node. Within the network structure, feature learning and regression are integrated into one optimization process, which leads to a more effective model for estimating image quality. The proposed approach achieves state-of-the-art performance on the LIVE dataset and shows excellent generalizing ability in cross-dataset experiments. Furthermore, experiments on images with local distortions demonstrate the local quality estimation ability of our CNN, which is rarely reported in previous literature.
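The "max and min pooling" step can be illustrated with a small sketch. This is not the authors' network: it assumes that each feature map is pooled over its whole extent into one maximum and one minimum, and the patch values, kernel values, and function names below are hypothetical toy choices. In a real network the kernels would be learned, and the resulting feature vector would feed the fully connected layers.

```python
def conv2d_valid(patch, kernel):
    """Valid (no padding) 2D cross-correlation of a patch with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(patch) - kh + 1
    out_w = len(patch[0]) - kw + 1
    return [
        [
            sum(patch[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_min_pool(feature_map):
    """Reduce a whole feature map to its (max, min) pair (assumed pooling scope)."""
    values = [v for row in feature_map for v in row]
    return max(values), min(values)


def patch_features(patch, kernels):
    """Concatenate the max and min responses of every kernel."""
    features = []
    for kernel in kernels:
        hi, lo = max_min_pool(conv2d_valid(patch, kernel))
        features.extend([hi, lo])
    return features


# Toy 4x4 patch and two hypothetical 2x2 kernels (learned in a real network).
patch = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernels = [
    [[1, 0], [0, 1]],    # responds to diagonal structure
    [[1, -1], [1, -1]],  # responds to horizontal intensity change
]
features = patch_features(patch, kernels)  # two values per kernel
```

Under this assumption the feature vector has twice as many entries as there are kernels, regardless of the patch size.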
|
#1475 - Patch to the Future: Unsupervised Visual Prediction [pdf]
Jacob Walker, Abhinav Gupta, Martial Hebert |
Abstract: In this paper we present a conceptually simple but surprisingly powerful method for visual prediction which combines the effectiveness of mid-level visual elements with temporal modeling from a decision-theoretic framework. Our framework can be learned in a completely unsupervised manner from a large collection of videos. However, more importantly, because our approach models the prediction framework on these mid-level elements, we can not only predict the possible motion in the scene but also predict visual appearances, that is, how appearances are going to change with time. This yields a visual "hallucination" of probable events on top of the scene. We show that our method is able to accurately predict and visualize simple future events. We also show that our approach is comparable to supervised methods for event prediction.
|
#1585 - Fast and Accurate Image Matching with Cascade Hashing for 3D Reconstruction [pdf]
Jian Cheng, Cong Leng, Jiaxiang Wu, Hainan Cui, Hanqing Lu |
Abstract: Image matching is one of the most challenging stages in 3D reconstruction: it usually accounts for half of the computational cost, and inaccurate matching may lead to failure of reconstruction. Therefore, fast and accurate image matching is crucial for 3D reconstruction. In this paper, we propose a Cascade Hashing strategy to speed up image matching. The proposed cascade hashing method is designed as a three-layer structure: hashing lookup, hashing remapping, and hashing ranking. Each layer adopts different measures and filtering strategies, which is demonstrated to be less sensitive to noise. Extensive experiments show that our approach accelerates image matching by hundreds of times over brute-force matching, and by ten times or more over Kd-tree based matching, while retaining comparable accuracy.
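The general idea of filtering candidates in stages with binary hash codes can be sketched as follows. This is a generic coarse-to-fine filter built on random-hyperplane hashing, not the paper's lookup, remapping, and ranking layers. The class name, bit lengths, and the `keep` parameter are all hypothetical choices made for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, rng):
    """Draw one random Gaussian hyperplane per hash bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vec, planes):
    """Sign-of-projection binary code, returned as a tuple of 0/1 bits."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0.0)
                 for plane in planes)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CoarseToFineMatcher:
    """Hypothetical three-stage filter: bucket lookup, Hamming ranking, exact check."""

    def __init__(self, database, short_bits=4, long_bits=32, seed=0):
        rng = random.Random(seed)
        dim = len(database[0])
        self.database = database
        self.short_planes = make_hyperplanes(short_bits, dim, rng)
        self.long_planes = make_hyperplanes(long_bits, dim, rng)
        self.long_codes = [hash_code(v, self.long_planes) for v in database]
        self.buckets = {}
        for idx, vec in enumerate(database):
            key = hash_code(vec, self.short_planes)
            self.buckets.setdefault(key, []).append(idx)

    def match(self, query, keep=5):
        # Stage 1: only descriptors sharing the query's short code are considered.
        candidates = self.buckets.get(hash_code(query, self.short_planes), [])
        if not candidates:
            return None
        # Stage 2: keep the few candidates closest in Hamming distance on long codes.
        q_long = hash_code(query, self.long_planes)
        candidates = sorted(
            candidates, key=lambda i: hamming(q_long, self.long_codes[i]))[:keep]
        # Stage 3: exact squared Euclidean distance on the survivors only.
        return min(candidates, key=lambda i: sq_dist(query, self.database[i]))


rng = random.Random(1)
database = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
matcher = CoarseToFineMatcher(database)
best = matcher.match(database[17])  # index of the best surviving candidate
```

Each stage is cheaper per candidate than the next, so the exact distance is computed for only a handful of descriptors. The price is that a true nearest neighbor falling into a different bucket is missed, which is why such schemes trade a little accuracy for speed.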
|
#1599 - FAUST: Dataset and evaluation for 3D mesh registration [pdf]
Federica Bogo, Javier Romero, Matthew Loper, Michael Black |
Abstract: New scanning technologies are increasing the importance of 3D mesh data and the need for algorithms that can reliably align it. Surface registration is important for building full 3D models from partial scans, creating statistical shape models, shape retrieval, and tracking. The problem is particularly challenging for non-rigid and articulated objects like human bodies. Existing synthetic datasets (e.g. TOSCA) do not represent the challenge of registering real-world data. Establishing ground-truth correspondences for real 3D scans, however, is difficult. We address this with a novel mesh registration technique that integrates 3D shape and appearance information to produce high-quality alignments. We define a new dataset called FAUST that contains 300 scans of 10 subjects in a wide range of poses together with an evaluation methodology. To remove ambiguities, we paint the subjects with high-frequency textures and use an extensive validation process to ensure accurate ground truth. We observe that current shape registration methods have trouble with this real-world data as they are limited to low-resolution, watertight meshes without holes or noise. The dataset and evaluation website will be available for research purposes.
|
#1670 - A Compact and Discriminative Face Track Descriptor [pdf]
Omkar Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman |
Abstract: Our goal is to learn a compact, discriminative vector representation of a face-track, suitable for the face recognition tasks of verification and classification. To this end, we propose a novel face-track descriptor, based on the Fisher Vector representation, and make the following contributions: first, the descriptor is suitable for tracks of both frontal and profile faces, and is agnostic about their pose; second, we obtain a compact descriptor using discriminative dimensionality reduction, and an extremely compact descriptor using binarization; third, the descriptor can be computed quickly (using hard quantization) and its compact size and fast computation render it very suitable for large scale visual repositories. In the experiments we show that the descriptor exceeds the state of the art on both the face verification task (using the standard YouTube Faces and INRIA-Buffy benchmarks) and the face classification task (using the standard Oxford-Buffy dataset). Furthermore, the descriptor demonstrates good generalization when trained on one dataset and tested on another, reflecting its tolerance to dataset bias.
|
#1710 - Local Submodular Approximations for Binary Pairwise Energies [pdf]
Lena Gorelick, Yuri Boykov, Olga Veksler, Ismail BenAyed, Andrew Delong |
Abstract: Many computer vision problems require optimization of binary non-submodular energies. We propose a general optimization framework based on local submodular approximations (LSA). Unlike standard LP relaxation methods that linearize the whole energy globally, our approach iteratively approximates the energies locally. On the other hand, unlike standard local optimization methods (e.g. gradient descent or projection techniques) we use non-linear submodular approximations and optimize them without leaving the domain of integer solutions. We discuss two specific LSA algorithms based on "trust region" and "auxiliary function" principles, LSA-TR and LSA-AUX. These methods obtain state-of-the-art results on a wide range of applications outperforming many standard techniques such as LBP, QPBO, and TRWS. While our paper is focused on pairwise energies, our ideas extend to higher-order problems.
|
Abstract: Histogram-based features, such as SIFT local descriptors, have significantly contributed to recent developments in image classification. In this paper, we propose a method to efficiently transform those histogram features for improving the classification performance. The (L1-normalized) histogram feature is regarded as a probability mass function, which is modeled by a Dirichlet distribution. Based on the probabilistic modeling, we induce the Dirichlet Fisher kernel for transforming the histogram feature vector. The method works on the individual histogram feature to enhance the discriminative power at a low computational cost. On the other hand, in the bag-of-feature (BoF) framework, the Dirichlet mixture model can be extended to a Gaussian mixture by transforming histogram-based local descriptors, e.g., SIFT, and thereby we propose the method of Dirichlet-derived GMM Fisher kernel. In the experiments on diverse image classification tasks including recognition of subordinate objects and material textures, the proposed methods improve the performance of the histogram-based features and BoF-based Fisher kernel, and compare favorably with the state of the art.
|
#1759 - Optimal Visibility Estimation for Large-Scale Dynamic 3D Reconstruction [pdf]
Hanbyul Joo, Hyun Soo Park, Yaser Sheikh |
Abstract: We present an algorithm to reconstruct the 3D motion of an event from a large number of videos, by explicitly estimating the time-varying visibility of each 3D point. Our algorithm takes, as input, camera poses and image sequences, and outputs the optimal set of the visible cameras and the reconstructed 3D trajectories. We reconstruct the patch motion (location and normal) by triangulating image flow in each camera within a RANSAC framework. The obtained patch motion enables us to define the likelihood of visibility in each camera, based on the geometric consistency of motion and the local similarity of appearance. We adaptively fuse these two sources of information based on the motion characteristics of the point, in conjunction with a Markov Random Field model that rewards consistent visibilities in proximal cameras. An optimal estimate of visibility is obtained by finding the minimum cut in a graph over cameras. We demonstrate that as the number of cameras increases, our algorithm produces longer trajectories, at more locations, and at higher accuracies than methods that ignore visibility or use appearance cues alone.
|
Abstract: We consider the discrete pairwise energy minimization problem (weighted constraint satisfaction, max-sum labeling) and methods that identify a partial assignment of variables that is globally optimal. Existing methods are based on seemingly different sufficient conditions. We propose a new sufficient condition for partial optimality which is: (1) verifiable in polynomial time, (2) invariant to reparametrization of the problem and permutation of labels, and (3) includes many existing methods as special cases. We obtain a unified description of different methods and identify their common properties. The proposed condition is derived by using the relaxation technique coherent with the relaxation for energy minimization. We study the problem of finding the maximum partial optimal assignment identifiable by the new sufficient condition. For a subclass within the sufficient condition, we propose polynomial algorithms that are guaranteed to find the same or larger part of optimal assignment than several existing methods we have unified.
|
#1779 - Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation [pdf]
Catalin Ionescu, Joao Carreira, Cristian Sminchisescu |
Abstract: Recently, the emergence of Kinect systems demonstrated the benefits of predicting an intermediate body part labeling for 3D human pose estimation, in conjunction with RGB-D imagery. The availability of depth information plays a critical role, so an important question is whether similar ideas can be developed with sufficient robustness towards estimating 3D pose from RGB images. This paper provides evidence for a positive answer, by leveraging (a) 2D human body part labeling in images, (b) second-order label-sensitive pooling over dynamically computed regions resulting from a hierarchical decomposition of the body, and (c) fixed-point structured-output modeling to contextualize the process based on 3D pose estimates. For robustness and generalization, we take advantage of a recent large-scale 3D human motion capture dataset, Human3.6M [17], which we augment with additional human body part image labeling annotations. We provide extensive experimental analysis where alternative intermediate representations are compared and report a substantial 35% error reduction over competitive discriminative baselines that regress 3D human pose against global HOG features.
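For readers unfamiliar with second-order pooling, the sketch below shows only its plain average form: the local descriptors inside a region are summarized by the mean of their outer products. The label-sensitive and iterated parts of the method are not modeled here, and the function name and toy descriptors are hypothetical.

```python
def second_order_avg_pool(descriptors):
    """Average of the outer products x x^T over all descriptors in a region."""
    n = len(descriptors)
    d = len(descriptors[0])
    pooled = [[0.0] * d for _ in range(d)]
    for x in descriptors:
        for i in range(d):
            for j in range(d):
                pooled[i][j] += x[i] * x[j] / n
    return pooled


# Four toy 2-D descriptors standing in for local features of one region.
region = [[1, 0], [0, 1], [1, 1], [2, 0]]
pooled = second_order_avg_pool(region)
```

The result is a symmetric d x d matrix, so a region with any number of descriptors maps to a fixed-size representation that also captures pairwise feature interactions.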
|
#1796 - Video Motion Segmentation Using New Adaptive Manifold Denoising Model [pdf]
Dijun Luo, Heng Huang |
Abstract: Video motion segmentation techniques automatically segment and track objects and regions from videos or image sequences as a primary processing step for many computer vision applications. We propose a novel motion segmentation approach for both rigid and non-rigid objects using adaptive manifold denoising. We first introduce an adaptive kernel space in which two feature trajectories are mapped into the same point if they belong to the same rigid object. After that, we employ an embedded manifold denoising approach with the adaptive kernel to segment the motion of rigid and non-rigid objects. The major observation is that the non-rigid objects often lie on a smooth manifold with deviations which can be removed by manifold denoising. We also show that performing manifold denoising on the kernel space is equivalent to doing so on its range space, which theoretically justifies the embedded manifold denoising on the adaptive kernel space. Experimental results indicate that our algorithm, named Adaptive Manifold Denoising (AMD), is suitable for both rigid and non-rigid motion segmentation. Our algorithm works well in many cases where several state-of-the-art algorithms fail.
|
#1797 - A Riemannian framework for matching point clouds represented by the Schrödinger distance transform [pdf]
Yan Deng, Anand Rangarajan, Baba Vemuri |
Abstract: In this paper, we cast the problem of point cloud matching as a shape matching problem by transforming each of the given point clouds into a shape representation called the Schrödinger distance transform (SDT) representation. This is achieved by solving a static Schrödinger equation instead of the corresponding static Hamilton-Jacobi equation in this setting. The SDT representation is an analytic expression and, following the theoretical physics literature, can be normalized to have unit ℓ2 norm, making it a square-root density, which is identified with a point on a unit Hilbert sphere, whose intrinsic geometry is fully known. The Fisher-Rao metric, a natural metric for the space of densities, leads to analytic expressions for the geodesic distance between points on this sphere. In this paper, we use this well known Riemannian framework, never before used for point cloud matching, and present a novel point cloud matching algorithm. We pose the point set matching under rigid and non-rigid transformations in this framework and solve for the transformations using standard nonlinear optimization techniques. Finally, to evaluate the performance of our algorithm, dubbed SDTM, we present several synthetic and real data examples along with extensive comparisons to state-of-the-art techniques. The experiments show that our algorithm outperforms state-of-the-art point set registration algorithms on many quantitative metrics.
|
#1801 - Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models [pdf]
Mathieu Aubry, Bryan Russell, Alexei Efros, Josef Sivic |
Abstract: This paper poses object category detection in images as a type of 2D to 3D alignment problem, utilizing the large quantities of 3D CAD models that have been made publicly available on-line. Using the "chair" class as a running example, we propose an exemplar-based 3D category representation, which can explicitly model chairs of different styles as well as the large variation in viewpoint. We develop an approach to establish part-based correspondences between 3D CAD models and real photographs. This is achieved by (i) representing each 3D model using a set of view-dependent mid-level visual elements learned from synthesized views in a discriminative fashion, (ii) carefully calibrating the individual element detectors with respect to each other on a common dataset of negative images, and (iii) matching them to the test image allowing for small mutual deformations but preserving the viewpoint and style constraints. We demonstrate the ability of our system to align 3D models with 2D objects on the challenging PASCAL VOC images, which depict a wide variety of chairs in complex scenes.
|
#1818 - Fourier Analysis on Transient Imaging by Multifrequency Time-of-Flight Camera [pdf]
Jingyu Lin, Yebin Liu, Matthias Hullin, Qionghai Dai |
Abstract: In this paper we investigate the problem of transient image reconstruction from measurements captured by multifrequency time-of-flight (TOF) cameras and reveal the system frequency condition for exact transient image reconstruction. We also propose a reconstruction approach based on Fourier analysis, which addresses the issues of denoising the correlation matrix, removing harmonic component disturbance, and recovering the missing low spectrum of a transient image. Our approach is implemented pixel-wise, so its computational and spatial costs are low. We evaluate our approach on both synthetic and real data sets, and obtain high quality transient images comparable to or even better than the state-of-the-art approach.
|
#1819 - DeepFace: Closing the Gap to Human-Level Performance in Face Verification [pdf]
Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf |
Abstract: In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We contribute to both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a rigid transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities, where each identity has an average of over a thousand samples. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.25% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 25%, closely approaching human-level performance.
|
#1921 - Large-scale Video Classification using Convolutional Neural Networks [pdf]
Andrej Karpathy, Sanketh Shetty, George Toderici, Rahul Sukthankar, Thomas Leung, Li Fei-Fei |
Abstract: Convolutional Neural Networks (CNNs) are a class of deep learning models that have shown impressive performance on large-scale image recognition problems. In this paper, we report results of extensive large-scale experiments on applying CNNs to a dataset of 1 million YouTube videos belonging to 487 classes. We identify and evaluate multiple approaches to extending the connectivity of CNNs in time domain and propose a multiresolution architecture that learns features on two separate streams of processing: one stream models scene context via low-resolution frames, while the other focuses on high-resolution detail in a center region. We demonstrate that our networks significantly outperform standard feature-based methods. Moreover, we show that the learned features generalize to other video classification benchmarks and in particular use them to obtain state-of-the-art results on the UCF-101 Action Recognition dataset (63.3%, up from 43.9%).
|
#1927 - Joint Coupled-Feature Representation and Coupled Boosting for AD Diagnosis [pdf]
Yinghuan Shi, Heung-Il Suk, Yang Gao, Dinggang Shen |
Abstract: Recently, there has been great interest in computer-aided AD/MCI diagnosis. Previous machine learning based methods defined the diagnosis process as a classification task and considered the low-level features extracted from neuroimaging data with no consideration of relations among them. However, from a neuroscience point of view, it is well known that the human brain is a complex system in which multiple brain regions are anatomically connected and functionally interact with each other. Therefore, it is natural to hypothesize that the low-level features extracted from MRI and PET in multiple ROIs are related to each other in some ways. To this end, in this paper, we first devise a coupled feature representation by utilizing intra- and inter-coupled interaction relationships. Regarding multi-modal data fusion, we propose a novel coupled boosting algorithm that analyzes the pairwise coupled-diversity correlation between modalities. Specifically, we formulate a new weight updating function, which considers both incorrectly and inconsistently classified samples. In our experimental results on the ADNI dataset, the proposed method achieved the best performance with accuracies of 94.7% and 80.1% for AD vs. NC and MCI vs. NC classifications, respectively, outperforming the competing methods and the state-of-the-art methods.
|
#1938 - Covariance Trees for 2D and 3D Processing [pdf]
Thierry Guillemot, Andrés Almansa, Tamy Boubekeur |
Abstract: Gaussian Mixture Models have become one of the major tools in modern statistical image processing because they closely capture the fine grained structure of natural images, whose high-dimensional patches seem to lie on a low-dimensional non-linear smooth manifold. The use of such models for various restoration tasks, from denoising to reconstruction from incomplete data, has been demonstrated in the last few years to allow for significant improvements with respect to the state of the art. Nevertheless the adoption of such models has been slow, and their real potential has not yet been completely unleashed, mainly because fitting such models to large image databases requires either extremely computer-intensive brute-force approaches [Levin-Nadler 2011], or simplifying assumptions on the number of Gaussians in the mixture [Yu-Sapiro-Mallat 2012], not to mention technical issues on the initialization of non-convex minimization procedures that are used to fit such models to data. This work provides a general and flexible tool for dealing with continuous families of multivariate Gaussian models: both efficient learning over a large database and fast querying are supported. In order to circumvent the difficulties of previous approaches: (i) a hierarchical data structure is used that accelerates both learning and querying; (ii) rather than fixing the number of Gaussian models in advance, our hierarchical structure represents the data manifold at various scales during the l
|
#1978 - Latent Regression Forest: Structural Estimation of 3D Articulated Hand Posture [pdf]
Danhang Tang, Alykhan Tejani, Hyung Jin Chang, Tae-Kyun Kim |
Abstract: In this paper we present the Latent Regression Forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt Latent Tree Model which reflects the hierarchical topology of the hand. Our main contributions can be summarized as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments show that the LRF outperforms state-of-the-art methods in both accuracy and efficiency.
|
#1995 - A Mixture of Manhattan Frames: Beyond the Manhattan World [pdf]
Julian Straub, Guy Rosman, Oren Freifeld, John Leonard, John Fisher III |
Abstract: Man-made objects and buildings exhibit a clear structure in the form of orthogonal and parallel planes. This observation, commonly referred to as the Manhattan-world (MW) model, has been widely exploited in computer vision and robotics. At both larger and smaller scales, the scale of a city, indoor scenes or smaller objects, a more flexible model is merited. Here, we propose a novel probabilistic model that describes scenes as mixtures of Manhattan Frames (MF): sets of orthogonal and parallel planes. By exploiting the geometry of both orthogonality constraints and the unit sphere, our approach allows us to describe man-made structures in a flexible way. We propose an inference that is a hybrid of Gibbs sampling and gradient-based optimization of a robust cost function over the SO(3) manifold. An MF merging mechanism allows us to infer the model order. We show the versatility of our Mixture-of-Manhattan-Frames (MMF) model by describing complex scenes from ASUS Xtion PRO depth images and aerial-LiDAR measurements of an urban center. Additionally, we demonstrate that the model lends itself to depth focal-length calibration of RGB-D cameras as well as to plane segmentation.
|
Abstract: Images and videos are often described by multiple types of local descriptors such as SIFT, HOG, HOF etc. Classification and retrieval tasks can benefit from fusing descriptors of different types. Current fusing methods mainly consist of early fusion and late fusion, which conduct fusing before or after encoding local descriptors. In this paper, we propose a new representation, Multi-View Super Vector (MVSV), which unifies fusing and coding multiple types of local feature descriptors into the same framework. The main contributions are twofold. First, we propose a generative mixture model of probabilistic canonical correlation analyzers (M-PCCA) to capture the local correlation between different descriptor types, and develop learning algorithms to estimate the parameters of M-PCCA. Second, we utilize the hidden variables and gradient vectors of M-PCCA to construct MVSV for representing video or image. MVSV encodes both the shared and private information from different types of local descriptors. We examine the performance of the proposed methods on video based action recognition tasks. Experimental results on HMDB51 and UCF101 datasets show that MVSV outperforms Fisher vectors with early fusion and late fusion strategy. Our method also achieves state-of-the-art performance on these datasets.
|
#2026 - Recognition of Complex Events exploiting Temporal Dynamics between Underlying Concepts [pdf]
Subhabrata Bhattacharya |
Abstract: Concept based representations are readily being used to understand structure of complex events in web videos, as opposed to representations that are based on bag of low level features. We present two complementary algorithms to extract much richer temporal information from available concept based representations, improving understanding and recognition of complex events. In our approach, each video is represented as an ordered vector time-series, where each time-step is a vector of confidences returned by a set of pre-trained action concept detectors. Assuming that a set of vector time-series from the same event class emanates from a single linear dynamical system, we obtain its signature using a subspace system identification technique based on Singular Value Decomposition of Hankel Matrices (SSID-S). A rather complementary signature (H-S) is then computed based on Harmonic Analysis of a vector time-series, which exploits characteristics such as lag-independence, frequency proximity etc. These two signatures are finally concatenated to form a combined signature for a video which is further used in a linear SVM framework for recognition purposes. Experiments conducted on NIST's TRECVID datasets for Multimedia Event Detection (MED 2011 & MED 2012) demonstrate promising fidelity of our method in extracting meaningful temporal interactions between concepts, outperforming the state of the art in complex event recognition.
|
#2043 - Socially-aware Large-scale Crowd Forecasting [pdf]
Alexandre Alahi, Vignesh Ramanathan, Li Fei-Fei |
Abstract: In crowded spaces such as city centers or train stations, human mobility looks complex, but is often influenced only by a few causes. We propose to quantitatively study crowded environments by introducing a dataset of 42 million trajectories collected in two train stations. Given the dataset, we address the problem of forecasting pedestrians' destinations, a central problem in understanding large-scale crowd mobility. In this setting, we need to overcome the challenges posed by a limited number of observations (e.g. sparse cameras), and changes in pedestrian appearance cues across different cameras. In addition, we often have restrictions in the way pedestrians can move in a scene, encoded as priors over origin and destination (OD) preferences. We propose a new descriptor coined as Social Affinity Maps (SAM) to link broken or unobserved trajectories of individuals in the crowd, while using the OD prior in our framework. Our experiments show improvement in performance through the use of SAM features and the OD prior. To the best of our knowledge, our work is one of the first studies that provides encouraging results towards a better understanding of crowd behavior at the scale of millions of pedestrians.
|
#2090 - Deformable Registration of Feature-Endowed Point Sets Based on Tensor Fields [pdf]
Demian Wassermann, James Ross, George Washko, Sandy Wells, Raul San Jose-Estepar |
Abstract: The main contribution of this work is a framework to register anatomical structures characterized as a point set where each point has an associated symmetric matrix. These matrices can represent problem-dependent characteristics of the registered structure. For example, in airways, matrices can represent the orientation and thickness of the structure. Our framework relies on a dense tensor field representation which we implement sparsely as a kernel mixture of tensor fields. We equip the space of tensor fields with a norm that serves as a similarity measure. To calculate the optimal transformation between two structures we minimize this measure using an analytical gradient for the similarity measure and the deformation field, which we restrict to be a diffeomorphism. We illustrate the value of our tensor field model by comparing our results with scalar and vector field based models. Finally, we evaluate our registration algorithm on synthetic data sets and validate our approach on manually annotated airway trees.
|
#11 - Total Variation Blind Deconvolution: The Devil is in the Details [pdf]
Daniele Perrone, Paolo Favaro |
Abstract: In this paper we study the problem of blind deconvolution. Our analysis is based on the algorithm of Chan and Wong (1998), which popularized the use of sparse gradient priors via total variation. We use this algorithm because many methods in the literature are essentially adaptations of this framework. This algorithm is an iterative alternating energy minimization where at each step either the sharp image or the blur function is reconstructed. Recent work of Levin et al. (2011) showed that any algorithm that tries to minimize that same energy would fail, as the desired solution has a higher energy than the no-blur solution, where the sharp image is the blurry input and the blur is a Dirac delta. However, experimentally one can observe that Chan and Wong's algorithm converges to the desired solution even when initialized with the no-blur one. We provide both analysis and experiments to resolve this paradoxical conundrum. We find that both claims are right. The key to understanding how this is possible lies in the details of Chan and Wong's implementation and in how seemingly harmless choices result in dramatic effects. Our analysis reveals that the delayed scaling (normalization) in the iterative step of the blur kernel is fundamental to the convergence of the algorithm. This then results in a procedure that eludes the no-blur solution, despite it being a global minimum of the original energy. We introduce an adaptation of
|
#15 - Jointly Summarizing Large-Scale Web Images and Videos for the Storyline Reconstruction [pdf]
Gunhee Kim, Leonid Sigal, Eric Xing |
Abstract: In this paper, we address the problem of jointly summarizing large-scale Flickr images and YouTube user videos. Starting from the intuition that the characteristics of the two media are different yet complementary, we develop a fast and easily parallelizable approach for creating not only a high-quality video summary but also a novel structural summary of online images as storyline graphs, which can illustrate various events or activities associated with the topic in the form of a branching network. In our approach, the video summarization is achieved by diversity ranking on the similarity graphs between images and video frames. The reconstruction of storyline graphs is formulated as the inference of sparse time-varying directed graphs from a set of photo streams with the assistance of videos. For evaluation, we create datasets of 20 outdoor recreational activities, consisting of 2.7M Flickr images and 16K YouTube user videos. Due to the large-scale nature of our problems, we evaluate our algorithm via crowdsourcing using Amazon Mechanical Turk. In our experiments, we demonstrate that the proposed joint summarization approach outperforms other important baselines and our own methods using videos or images only.
|
#20 - Stable and Informative Spectral Signatures for Graph Matching [pdf]
Nan Hu, Raif Rustamov, Leonidas J. Guibas |
Abstract: In this paper, we consider the approximate weighted graph matching problem and introduce stable and informative first and second order compatibility terms suitable for inclusion into the popular integer quadratic program formulation. Our approach relies on a rigorous analysis of stability of spectral signatures based on the graph Laplacian. In the case of the first order term, we derive an objective function that measures both the stability and informativeness of a given spectral signature. By optimizing this objective, we design new spectral node signatures tuned to a specific graph to be matched. We also introduce the pairwise heat kernel distance as a stable second order compatibility term; we justify its plausibility by showing that in a certain limiting case it converges to the classical adjacency matrix-based second order compatibility function. We have tested our approach on a set of synthetic graphs, the widely-used CMU house sequence, and a set of real images. These experiments show the superior performance of our first and second order compatibility terms as compared with the commonly used ones.
|
Abstract: Metric learning is very useful for image retrieval, classification and identification. This paper introduces a regularization method to explicitly control the rank of a learned symmetric positive semidefinite distance matrix. To this end, we propose to incorporate in the objective function a linear regularization term that consists in minimizing the k smallest eigenvalues of the distance matrix. It is equivalent to minimizing the trace of the product of the distance matrix with a matrix in the convex hull of rank-k projection matrices, called a Fantope. Based on this new regularization method, we derive an optimization scheme to efficiently learn the distance matrix. We demonstrate the effectiveness of the method on synthetic and challenging real datasets of face verification and image classification with relative attributes, on which our method outperforms state-of-the-art metric learning algorithms.
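The equivalence stated in this abstract can be written out explicitly. In the block below, M denotes the learned d × d positive semidefinite matrix, λ_1(M) ≤ … ≤ λ_d(M) its eigenvalues in ascending order, and F_k the Fantope; these symbols are notation chosen here for illustration and may differ from the paper's.

```latex
\sum_{i=1}^{k} \lambda_i(M)
  \;=\; \min_{W \in \mathcal{F}_k} \operatorname{tr}(W M),
\qquad
\mathcal{F}_k
  \;=\; \operatorname{conv}\{\, P : P = P^\top,\ P^2 = P,\ \operatorname{rank}(P) = k \,\}
  \;=\; \{\, W : 0 \preceq W \preceq I,\ \operatorname{tr}(W) = k \,\}.
```

The minimum is attained at the projector onto the eigenvectors of the k smallest eigenvalues. For a fixed W the term is linear in M, and driving it to zero forces the k smallest eigenvalues of the PSD matrix M to vanish, so rank(M) ≤ d − k.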
|
#23 - One Millisecond Face Alignment with an Ensemble of Regression Trees [pdf]
Vahid Kazemi, Josephine Sullivan |
Abstract: This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high-quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of squared error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.
|
#24 - From Categories to Individuals in Real Time --- A Unified Boosting Approach [pdf]
David Hall, Pietro Perona |
Abstract: A method for online, real-time learning of individual-object detectors is presented. Starting with a pre-trained boosted category detector, an individual-object detector is trained with near-zero computational cost. The individual detector is obtained by using the same feature cascade as the category detector along with elementary manipulations of the thresholds of the weak classifiers. This is ideal for online operation on a video stream or for interactive learning with a human in the loop. Applications addressed by this technique are reidentification and individual tracking. Experiments on two challenging pedestrian and face datasets indicate that it is indeed possible to learn identity classifiers in real-time; besides being faster-trained, our classifier has better detection rates than previous methods.
|
#34 - Visual Persuasion: Inferring the Communicative Intents of Images [pdf]
Jungseock Joo, Weixin Li, Francis Steen, Song Chun Zhu |
Abstract: In this paper we introduce the novel problem of understanding visual persuasion. Modern mass media and advertising make extensive use of images and video to present arguments and influence public opinion, and their techniques are widely studied in media research, political science, and psychology, typically using small, hand-coded datasets. We propose to extend the significant advances in syntactic analyses, such as the detection and identification of objects and sentiments in images and video, to the higher-level challenge of understanding the underlying communicative intent implied in the images. We define the problem of inferring communicative intents from images in a computational framework, and demonstrate the feasibility of progress in a case study from politics, a domain of intense competitive persuasion with continuously measurable outcomes in opinion polls. To this end, we identify 9 dimensions of persuasive intent latent in images of politicians, e.g., "Trustworthy", as well as 12 syntactical attributes, e.g., "Smile", from which one can semantically infer communicative intents. We present a new dataset of 866 images of politicians labeled with ground-truth intents in the form of ranking. In this application, we show that our learned model predicts communicative intents in a large dataset. These results demonstrate that a systematic focus on visual persuasion opens up the field of computer vision to a new class of investigations around mediated images, inter
|
#35 - BirdMachine: Large-scale Fine-grained Visual Categorization of Birds [pdf]
Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle Alexander, David Jacobs, Peter Belhumeur |
Abstract: We address the problem of large-scale fine-grained visual categorization, describing new methods we have used to produce an online field guide to 500 North American bird species. We focus on the challenges raised when such a system is asked to distinguish between highly similar species of birds. First, we develop "one-vs-most" classifiers. By eliminating highly similar species during training, these classifiers achieve more accurate and intuitive results. Second, we show how to estimate spatio-temporal class priors from observations that are sampled at irregular and biased locations. We show how these priors can be used to significantly improve performance. We then show recognition performance that significantly exceeds the state-of-the-art on a new, large dataset that we make publicly available. These recognition methods are integrated into the online field guide, which is also publicly available.
|
#36 - DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting [pdf]
Chen Sun, Ram Nevatia |
Abstract: We propose a unified framework to simultaneously classify high-level events, identify important segments and generate descriptions for large amounts of unconstrained web videos. The motivation is our observation that many video events are characterized by certain evidence types of important segments. Our goal is to find the important segments and capture their information for event classification and recounting (description). We introduce an evidence localization model (ELM) where evidence types and locations are modeled as latent variables. We impose constraints on global video appearance, local evidence appearance and the temporal structure of evidence types. The model is learned via a max-margin framework and allows efficient inference. Our method does not require annotating sources of evidence, and is jointly optimized for event classification and recounting. Experimental results are shown on the challenging TRECVID 2013 MEDTest dataset.
|
#39 - Transfer Joint Matching for Visual Domain Adaptation [pdf]
Mingsheng Long, Jianmin Wang, Guiguang Ding, Philip Yu |
Abstract: Visual domain adaptation, which learns an accurate classifier for a new domain using labeled images from an old domain, has shown promising value in computer vision yet remains a challenging problem. Most prior works have explored two learning strategies independently for domain adaptation: feature matching and instance reweighting. In this paper, we show that both strategies are important and inevitable when the domain difference is substantially large. We therefore put forward a novel Transfer Joint Matching (TJM) approach to model them in a unified optimization problem. Specifically, TJM aims to reduce the domain difference by jointly matching the features and reweighting the instances across domains in a principled dimensionality reduction procedure, and to construct a new feature representation that is invariant to both the distribution difference and the irrelevant instances. Comprehensive experimental results verify that TJM can significantly outperform competitive methods for cross-domain image recognition problems.
|
#40 - Occluding Contours for Multi-View Stereo [pdf]
Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, Steve Seitz |
Abstract: This paper leverages occluding contours (aka "internal silhouettes") to improve the performance of multi-view stereo methods. The contributions are 1) a new technique to identify free-space regions arising from occluding contours, and 2) a new approach for incorporating the resulting free-space constraints into Poisson surface reconstruction. The proposed approach outperforms state of the art MVS techniques for challenging Internet datasets, yielding dramatic quality improvements both around object contours and in surface detail.
|
#43 - Raw-to-raw: Mapping between image sensor color responses [pdf]
Rang Nguyen, Dilip Prasad, Michael Brown |
Abstract: Camera images saved in raw format are being adopted in computer vision tasks since raw values represent minimally processed sensor responses. Camera manufacturers, however, have yet to adopt a standard for raw images, and current raw-rgb values are device specific due to different sensors' spectral sensitivities. This results in significantly different raw images for the same scene captured with different cameras. This paper focuses on estimating a mapping that can convert a raw image of an arbitrary scene and illumination from one camera's raw space to another. To this end, we examine various mapping strategies including linear and non-linear transformations applied both in a global and an illumination-specific manner. We show that illumination-specific mappings give the best result, but at the expense of requiring a large number of transformations. To address this issue, we introduce an illumination-independent mapping approach that uses white-balancing to assist in reducing the number of required transformations. We show that this approach achieves state-of-the-art results on a range of consumer cameras and images of arbitrary scenes and illuminations.
|
#45 - Sparse Representation for Edit Propagation of High-Resolution Images [pdf]
Xiaowu Chen, Jianwei Li, Dongqing Zou, Xiaochun Cao, Qinping Zhao, Hao (Richard) Zhang |
Abstract: We introduce the use of sparse representation for edit propagation of high-resolution images or video. Previous approaches for edit propagation typically employ a global optimization over the whole set of image pixels, incurring prohibitively high memory and time consumption for high-resolution images. Rather than propagating an edit pixel by pixel, we follow the principle of sparse representation to obtain a compact set of representative samples (or features) and perform edit propagation on the samples instead. The sparse set of samples provides an intrinsic basis for an input image, and the coding coefficients capture the linear relationship between all pixels and the samples. The representative set of samples is computed by a novel scheme which maximizes the KL-divergence between each sample pair. We show several applications of sparsity-based edit propagation including video recoloring, theme editing, and seamless cloning, operating on both color and texture features. We demonstrate that with a sample-to-pixel ratio on the order of 0.01%, signifying a significant reduction in memory, our method still maintains a high degree of visual fidelity.
|
#54 - The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities [pdf]
Hilde Kuehne, Ali Arslan, Thomas Serre |
Abstract: This paper describes a framework for modeling human activities as temporally structured processes. Our approach is motivated by the inherently hierarchical nature of human activities and the close correspondence between human actions and speech: We model action units using HMMs much like words in speech; action units then form the building blocks for more complex activities using an "action grammar", much like words for sentences. To evaluate our approach, we collected a large dataset of daily cooking activities: The dataset includes a total of 52 participants, each performing a total of 10 cooking activities in multiple real-life kitchens, resulting in more than 77 hrs of video footage. We fully annotated the dataset at both a fine, motor-command level and a coarser, goal-oriented level. We test the approach using the HTK toolkit, a state-of-the-art speech recognition engine, in combination with different feature descriptors. We evaluate the proposed approach on multiple tasks from activity recognition to frame-based action recognition and semantic parsing. Our results demonstrate the benefits of structured temporal generative approaches over existing discriminative approaches in coping with the complexity of human daily life activities.
|
#55 - Newton Greedy Pursuit: a Quadratic Approximation Method for Sparsity-Constrained Optimization [pdf]
Xiao-Tong Yuan, Qingshan Liu |
Abstract: First-order greedy selection algorithms have been widely applied to sparsity-constrained convex optimization. The main theme of this type of method is to evaluate the function gradient in the previous iteration to update the non-zero entries and their values in the next iteration. In contrast, relatively less effort has been made in the study of second-order greedy selection methods that additionally utilize the Hessian information. Inspired by the classic constrained Newton method, we propose in this paper the NewTon Greedy Pursuit (NTGP) method to approximately minimize a twice differentiable function under a sparsity constraint. At each iteration, NTGP constructs a second-order Taylor expansion to approximate the cost function, and then estimates the next iterate by optimizing the constructed quadratic model under the sparsity constraint. Theoretical analysis shows that under proper conditions NTGP converges superlinearly until an estimation error bound is reached. We demonstrate the improved computational efficiency of our method over first-order greedy selection methods in sparse logistic regression tasks.
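The iteration described in this abstract can be written out as follows. The symbols f (the cost function), x_t (the current iterate), k (the sparsity level), and Q_t (the quadratic model) are notation assumed here for illustration; the abstract itself does not fix them, and how the constrained subproblem is solved is not shown.

```latex
Q_t(x) \;=\; f(x_t) + \nabla f(x_t)^\top (x - x_t)
        + \tfrac{1}{2}\,(x - x_t)^\top \nabla^2 f(x_t)\,(x - x_t),
\qquad
x_{t+1} \;\approx\; \arg\min_{\|x\|_0 \le k} Q_t(x).
```

A first-order greedy method would, by contrast, use only the gradient term when selecting and updating the non-zero entries.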
|
Abstract: Hashing has become a promising approach for fast similarity search. Most existing hashing research pursues binary codes for the same type of entities by preserving their similarities. In practice, there are many scenarios involving nearest neighbor search on data given in matrix form, where two different types of naturally associated entities correspond to its two dimensions or views. To fully explore the duality between the two views, we propose a collaborative hashing scheme for data in matrix form to enable fast search in various applications such as image search using bag of words and recommendation using user-item ratings. By simultaneously preserving both the entity similarities in each view and the interrelationship between views, our collaborative hashing effectively learns the compact binary codes and the explicit hash functions for out-of-sample extension in an alternating optimization way. Extensive evaluations are conducted on three well-known datasets for search inside a single view and search across different views, demonstrating that our proposed method outperforms state-of-the-art baselines, with significant relative accuracy gains ranging from 7.67% to 45.87%.
|
Abstract: In this paper, we present the first local descriptor designed for dynamic surfaces. A dynamic surface is a surface that can undergo non-rigid deformation (e.g., human body surface). Using state-of-the-art technology, details on dynamic surfaces such as cloth wrinkle or facial expression can be accurately reconstructed. Hence, various results (e.g., surface rigidity, elasticity, etc.) could be derived by microscopic categorization of surface elements. We propose a timing-based descriptor to model local spatiotemporal variations of surface intrinsic properties. The low-level descriptor encodes gaps between local event dynamics of neighboring keypoints using timing structure of linear dynamical systems (LDS). We also introduce the bag-of-timings (BoT) paradigm for surface dynamics characterization. Experiments are performed on synthesized and real-world datasets. We show the proposed descriptor can be used for challenging dynamic surface classification and segmentation with respect to rigidity at surface keypoints.
|
#66 - Learning to Detect Ground Control Points for Improving the Accuracy of Stereo Matching [pdf]
Aristotle Spyropoulos, Nikos Komodakis, Philippos Mordohai |
Abstract: While machine learning has been instrumental to the ongoing progress in most areas of computer vision, it has not been applied to the problem of stereo matching with similar frequency or success. We present a supervised learning approach for predicting the correctness of stereo matches based on a random forest and a set of features that capture various forms of information about each pixel. We show highly competitive results in predicting the correctness of matches and in confidence estimation, which allows us to rank pixels according to the reliability of their assigned disparities. Moreover, we show how these confidence values can be used to improve the accuracy of disparity maps by integrating them with an MRF-based stereo algorithm. This is an important distinction from current literature that has mainly focused on sparsification by removing potentially erroneous disparities to generate quasi-dense disparity maps.
|
Abstract: This paper solves the speed bottleneck of the deformable part model (DPM), while maintaining state-of-the-art detection accuracy on challenging datasets. Three prohibitive steps in the cascade version of DPM are accelerated: 2D correlation between the root filter and the feature map, cascade part pruning, and HOG feature extraction. For 2D correlation, the root filter is constrained to be low rank, so that 2D correlation can be calculated by a more efficient linear combination of 1D correlations. A proximal gradient algorithm is adopted to progressively learn the low-rank filter in a discriminative manner. For cascade part pruning, a neighborhood-aware cascade is proposed to capture the dependence in neighborhood regions for aggressive pruning. Instead of explicit computation of part scores, hypotheses can be pruned by scores of neighborhoods under the first-order approximation. For HOG feature extraction, look-up tables are constructed to replace expensive calculations of orientation partition and magnitude with simpler matrix index operations. Extensive experiments show that (a) the proposed method is 4 times faster than the current fastest DPM method with similar accuracy on Pascal VOC, and (b) the proposed method achieves state-of-the-art accuracy on pedestrian and face detection tasks at frame-rate speed.
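The low-rank idea in this abstract can be illustrated on a toy example: if a 2D filter is a sum of a few outer products of a column vector and a row vector, correlating with it equals a sum of 1D row correlations followed by 1D column correlations. The sketch below is a minimal pure-Python check of that identity. All function names, the toy image, and the rank-2 factors are assumptions made here; this is not the authors' implementation and it says nothing about how the low-rank filter is learned.

```python
def correlate2d_valid(image, kernel):
    """Direct 2D 'valid' cross-correlation of image with kernel."""
    H, W = len(image), len(image[0])
    h, w = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(h) for j in range(w))
             for x in range(W - w + 1)]
            for y in range(H - h + 1)]


def correlate_rows(image, row):
    """1D 'valid' correlation of every image row with a horizontal filter."""
    w = len(row)
    return [[sum(r[x + j] * row[j] for j in range(w))
             for x in range(len(r) - w + 1)]
            for r in image]


def correlate_cols(image, col):
    """1D 'valid' correlation of every image column with a vertical filter."""
    h = len(col)
    return [[sum(image[y + i][x] * col[i] for i in range(h))
             for x in range(len(image[0]))]
            for y in range(len(image) - h + 1)]


def correlate_low_rank(image, factors):
    """Correlate with the kernel sum(outer(col, row)) using only 1D passes."""
    out = None
    for col, row in factors:
        part = correlate_cols(correlate_rows(image, row), col)
        if out is None:
            out = part
        else:
            out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, part)]
    return out


def outer_sum(factors):
    """Materialize the full 2D kernel from its (col, row) factors."""
    h, w = len(factors[0][0]), len(factors[0][1])
    return [[sum(col[i] * row[j] for col, row in factors) for j in range(w)]
            for i in range(h)]


# Hypothetical rank-2, 3x3 filter and a small integer image.
factors = [([1, 2, 1], [1, 0, -1]), ([1, 0, -1], [1, 2, 1])]
image = [[1, 2, 0, 3, 1],
         [4, 1, 2, 0, 2],
         [0, 3, 1, 1, 5],
         [2, 2, 4, 0, 1]]
direct = correlate2d_valid(image, outer_sum(factors))
separable = correlate_low_rank(image, factors)
```

For an h × w filter of rank r, the separable route costs roughly r(h + w) multiplications per output value instead of hw, which is where the saving comes from when r is small.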
|
#75 - Packing and Padding: Coupled Multi-index for Accurate Image Retrieval [pdf]
Liang Zheng, Shengjin Wang, Ziqiong Liu, Qi Tian |
Abstract: In Bag-of-Words (BoW) based image retrieval, the SIFT visual word has a low discriminative power, so false positive matches occur prevalently. Apart from the information loss during quantization, another cause is that the SIFT feature only describes the local gradient distribution. To address this problem, this paper proposes a coupled Multi-Index (c-MI) framework to perform feature fusion at indexing level. Basically, complementary features are coupled into a multi-dimensional inverted index. Each dimension of c-MI corresponds to one kind of feature, and the retrieval process votes for images similar in both SIFT and other feature spaces. Specifically, we exploit the fusion of local color feature into c-MI. While the precision of visual match is greatly enhanced, we adopt Multiple Assignment to improve recall. The joint cooperation of SIFT and color features significantly reduces the impact of false positive matches. Extensive experiments on several benchmark datasets demonstrate that c-MI improves the retrieval accuracy significantly, while consuming only half of the query time compared to the baseline. Importantly, we show that c-MI is well complementary to many prior techniques. Assembling these methods, we have obtained an mAP of 85.8% and N-S score of 3.85 on Holidays and Ukbench datasets, respectively, which are the best results ever published.
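To make the indexing idea concrete, here is a minimal sketch of a two-dimensional inverted index in which an indexed feature votes for an image only when both its SIFT word and its color word match the query feature. The class name `CoupledMultiIndex`, its methods, and the toy word IDs are hypothetical; the abstract does not describe the actual data structures, quantizers, or weighting, and Multiple Assignment is not modeled.

```python
from collections import defaultdict


class CoupledMultiIndex:
    """Toy 2D inverted index keyed on (sift_word, color_word)."""

    def __init__(self):
        # (sift_word, color_word) -> list of image ids containing such a feature
        self.entries = defaultdict(list)

    def add(self, image_id, sift_word, color_word):
        self.entries[(sift_word, color_word)].append(image_id)

    def query(self, features):
        """Vote for images that match a query feature in BOTH feature spaces."""
        votes = defaultdict(int)
        for sift_word, color_word in features:
            for image_id in self.entries.get((sift_word, color_word), []):
                votes[image_id] += 1
        # Highest vote count first; ties broken by image id for determinism.
        return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


index = CoupledMultiIndex()
index.add("a", sift_word=3, color_word=7)
index.add("b", sift_word=3, color_word=9)  # same SIFT word, different color
index.add("a", sift_word=5, color_word=1)
ranking = index.query([(3, 7), (5, 1)])
```

In this toy run image "b" shares SIFT word 3 with the query but receives no vote because its color word differs, which is the kind of false positive match the coupling is meant to suppress.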
|
#83 - From Stochastic Grammar to Bayes Network: Probabilistic Parsing of Complex Activity [pdf]
Nam Vo, Aaron Bobick |
Abstract: We propose a probabilistic method for parsing complex activities that are defined as compositions of sub-activities. The temporal structure is represented by a string-length limited stochastic context-free grammar. Given the grammar, a Bayes network is generated where the variable nodes correspond to the start and end times of component actions, and the network integrates information about the duration of each primitive action, visual detection results for each primitive action, and the activity's temporal structure. At each moment in time during the activity, message passing is used to perform exact inference, yielding the posterior probabilities of the start and end times for each of the different actions. We provide demonstrations of this framework being applied to various vision tasks such as action prediction, classification of high-level activities, or temporal segmentation of a test sequence; the method is also applicable in the Human-Robot Interaction domain, where continual prediction of human actions is needed.
|
Abstract: We describe an approach for simultaneous localization and calibration of a stream of range images. Our approach jointly optimizes the camera trajectory and a calibration function that corrects the camera's unknown nonlinear distortion. Experiments with real-world benchmark data and synthetic data show that our approach increases the accuracy of camera trajectories and geometric models estimated from range video produced by consumer-grade cameras.
|
#90 - Global Optimization for Depth Reconstruction from Speckle Patterns [pdf]
Qifeng Chen, Vladlen Koltun |
Abstract: We present an approach to increasing the accuracy of range images produced by speckle-based range cameras. Our approach optimizes a global objective on the range image. The optimization is performed by a convergent block coordinate descent scheme that updates a horizontal or vertical line in each iteration. We show that this update can be performed optimally in linear time. The resulting algorithm is extremely efficient and trivially parallelizable. Experiments with ground-truth data demonstrate that our algorithm is significantly more accurate than alternative algorithms for optimizing the same objective and that our approach is significantly more accurate than alternative range image rectification schemes.
|
Abstract: We propose a kernel-based framework for computing components from a set of surface normals. This framework allows us to easily demonstrate that component analysis can be performed directly upon normals. We link previously proposed mapping functions, the azimuthal equidistant projection (AEP) and principal geodesic analysis (PGA), to our kernel-based framework. We also propose a new mapping function based upon the cosine distance between normals. We demonstrate the robustness of our proposed kernel when trained with noisy training sets. We also compare our kernels within an existing shape-from-shading (SFS) algorithm. Our spherical representation of normals, when combined with the robust properties of the cosine kernel, produces a very robust subspace analysis technique. In particular, our results within SFS show a substantial qualitative and quantitative improvement over existing techniques.
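One plausible reading of a cosine-based kernel between normals is the Gram matrix of cosines of the angles between them, which for unit normals is simply their dot product. The sketch below computes that matrix; the function names and toy normals are assumptions made here, and the paper's exact mapping function and any centering or per-pixel treatment are not specified in the abstract.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def cosine_kernel(normals):
    """Gram matrix K[i][j] = cos(angle between normal i and normal j)."""
    unit = [normalize(n) for n in normals]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Hypothetical normals: two pointing along z (different lengths), one along x.
normals = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
K = cosine_kernel(normals)
cosine_distance = [[1.0 - k for k in row] for row in K]
```

A matrix like `K` could then be passed to a standard kernel PCA routine to obtain components without mapping the normals to a tangent plane first.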
|
#103 - The Shape-Time Random Field for Semantic Video Labeling [pdf]
Andrew Kae, Erik Learned-Miller, Benjamin Marlin |
Abstract: We propose a novel discriminative model for semantic labeling in videos by incorporating a temporal shape prior to model both the shape and temporal dependencies of an object in video. While the conditional random field (CRF) can label regions in video frames, and can be extended to incorporate temporal dependencies between frames, it typically lacks a global shape prior, which can be informative. Recent work has shown how to incorporate a global shape prior into a CRF for image labeling, but this prior does not account for temporal dependencies. The conditional restricted Boltzmann machine (CRBM) can model temporal dependencies and has been used to successfully learn walking styles from motion-capture data. In this work we use the CRBM to model not only the shape of an object in a video but also the temporal dependencies of the object from previous frames. We incorporate this CRBM prior (to model the shape and temporal dependencies) along with the CRF (to model local dependencies) to create a new state-of-the-art model for the task of semantic labeling in videos. In particular, we explore the task of labeling faces into Hair/Skin/Background regions in videos from the YouTube Faces Database (YFDB). Our combined approach outperforms two baselines: a CRF with temporal potentials and a CRF with a global shape prior but without temporal dependencies.
|
Similar papers:
[rank all papers by similarity to this]
|
#105 - Bayes Merging of Multiple Vocabularies for Scalable Image Retrieval [pdf]
Liang Zheng, Shengjin Wang, Wengang Zhou, Qi Tian |
Abstract: The Bag-of-Words (BoW) representation is widely used in recent state-of-the-art image retrieval works. In this model, the vocabulary is of key importance. Typically, multiple vocabularies are generated to correct quantization artifacts and improve recall. However, this routine is corrupted by vocabulary correlation, i.e., overlapping among different vocabularies. Vocabulary correlation leads to an over-counting of the indexed features in the overlapped area, or the intersection set, thus compromising the retrieval accuracy. In order to address the correlation problem while preserving the benefit of high recall, this paper proposes a Bayes merging approach to down-weight the indexed features in the intersection set. Through explicitly modeling the correlation problem in a probabilistic view, a joint similarity on both image- and feature-level is estimated for the indexed features in the intersection set. We evaluate our method through extensive experiments on three benchmark datasets. Albeit simple, Bayes merging can be well applied in various merging tasks, and consistently improves the baselines on multi-vocabulary merging. Moreover, Bayes merging is efficient in terms of both time and memory cost, and yields competitive performance compared with the state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#108 - Joint Depth Estimation and Camera Shake Removal from Single Blurry Image [pdf]
Zhe Hu, Li Xu, Ming-Hsuan Yang |
Abstract: Camera shake during exposure time often results in a spatially variant blur effect in the image. The non-uniform blur effect is caused not only by the camera motion, but also by the depth variation of the scene. The objects close to the camera sensors are likely to appear more blurry than those at a distance in such cases. However, recent non-uniform deblurring methods do not explicitly consider the depth factor or assume fronto-parallel scenes with constant depth for simplicity. While single image non-uniform deblurring is a challenging problem, the blurry result in fact contains depth information which can be exploited. We propose to jointly estimate scene depth and remove non-uniform blur caused by camera motion by exploiting their underlying geometric relationships, with only a single blurry image as input. Toward this, we present a unified layer-based model for depth-involved deblurring, and develop an expectation-maximization scheme to solve the problem. Experiments on challenging examples demonstrate that both depth and camera shake removal can be well addressed within the unified framework.
|
Similar papers:
[rank all papers by similarity to this]
|
#128 - Unsupervised Learning for Graph Matching: An Attempt to Define and Extract Soft Attributed Patterns [pdf]
Quanshi Zhang, Xuan Song, Xiaowei Shao, Huijing Zhao, Ryosuke Shibasaki |
Abstract: Graph matching is a fundamental problem in computer vision, and is widely applied to the matching of 2D and 3D objects. In this paper, we define the soft attributed pattern (SAP) oriented towards attributed relational graphs (ARGs), which describes the pattern of common sub-graphs among the ARGs, considering both the graphical structure and graph attributes. We propose a direct solution to extract the maximal SAP among the ARGs without node enumeration, and thus use it to extend the concept of the unsupervised learning for graph matching. Given an initial graph template and a number of ARGs, we modify the graph template into the maximal SAP in an unsupervised fashion, achieving good matching performance between the template and the ARGs. Our method exhibits superior performance to conventional methods for learning graph matching on RGB and RGB-D images.
|
Similar papers:
[rank all papers by similarity to this]
|
#132 - When 3D Reconstruction Meets Ubiquitous RGB-D Images [pdf]
Quanshi Zhang, Xuan Song, Xiaowei Shao, Huijing Zhao, Ryosuke Shibasaki |
Abstract: 3D reconstruction from a single image is a classical problem in computer vision. However, it still poses great challenges for the reconstruction of daily-use objects with irregular shapes. In this paper, we propose to learn 3D reconstruction knowledge from informally captured RGB-D images, which will probably be ubiquitously used in daily life. The learning of 3D reconstruction is defined as a category modeling problem, in which a model for each category is trained to encode category-specific knowledge for 3D reconstruction. The category model estimates the pixel-level 3D structure of an object from its 2D appearance, by taking into account considerable variations in rotation, 3D structure, and texture. Learning 3D reconstruction from ubiquitous RGB-D images creates a new set of challenges. Experimental results have demonstrated the effectiveness of the proposed approach.
|
Similar papers:
[rank all papers by similarity to this]
|
#133 - Scalable 3D Tracking of Multiple Interacting Objects [pdf]
Nikolaos Kyriazis, Antonis Argyros |
Abstract: We consider the problem of tracking multiple interacting objects in 3D, using RGBD input and by considering a hypothesize-and-test approach. Due to their interaction, objects to be tracked are expected to occlude each other in the field of view of the camera observing them. A naive approach would be to employ a Set of Independent Trackers (SIT) and to assign one tracker to each object. This approach scales well with the number of objects but fails as occlusions become stronger due to their disjoint consideration. The solution representing the current state of the art employs a single Joint Tracker (JT) that accounts for all objects simultaneously. This directly resolves ambiguities due to occlusions but has a computational complexity that grows geometrically with the number of tracked objects. We propose a middle ground, namely an Ensemble of Collaborative Trackers (ECT), that combines best traits from both worlds to deliver a practical and accurate solution to the multi-object 3D tracking problem. We present quantitative and qualitative experiments with several synthetic and real world sequences of diverse complexity. Experiments demonstrate that ECT manages to track far more complex scenes than JT at a computational time that is only slightly larger than that of SIT.
|
Similar papers:
[rank all papers by similarity to this]
|
#137 - The Role of Context for Object Detection and Semantic Segmentation in the Wild [pdf]
Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille |
Abstract: In this paper we study the role of context in modern detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge. We believe this data will give plenty of extra challenges to the community, as it provides 548 new object classes for detection and 603 classes for semantic segmentation. We analyze the ability of state-of-the-art methods to perform semantic segmentation in this new setting. Our analyses show that NN based approaches perform poorly on semantic segmentation of context classes, which shows the variability of PASCAL imagery. Furthermore, the improvements of existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that the model significantly helps in detecting objects at all scales.
|
Similar papers:
[rank all papers by similarity to this]
|
#142 - Simplex-Based 3D Spatio-Temporal Feature Description for Action Recognition [pdf]
Hao Zhang, Wenjun Zhou, Christopher Reardon, Lynne Parker |
Abstract: We present a novel feature description algorithm to describe 3D local spatio-temporal features for human action recognition. Our descriptor avoids the singularity and limited discrimination power issues of traditional 3D descriptors by quantizing and describing visual features in the simplex topological vector space. Specifically, given a feature's support region containing a set of 3D visual cues, we decompose the cues' orientation into three angles, transform the decomposed angles into the simplex space, and describe them in such a space. Then, quadrant decomposition is performed to improve discrimination, and a final feature vector is composed from the resulting histograms. We develop intuitive visualization tools for analyzing feature characteristics in the simplex topological vector space. Experimental results demonstrate that our novel simplex-based orientation decomposition (SOD) descriptor substantially outperforms traditional 3D descriptors for the challenging KTH, UCF Sport, and Hollywood-2 benchmark action datasets. In addition, the results show that our SOD descriptor is a superior individual descriptor for action recognition.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Given a static scene, a human can trivially enumerate the myriad of things that can happen next and also characterize the relative likelihood of each. In the process, we make use of enormous amounts of common-sense knowledge about how the world works. In this paper, we investigate learning this common sense knowledge from data. To overcome a lack of densely annotated spatiotemporal data, we learn from bounding-box-level annotation of sequences of abstract images gathered using crowdsourcing. We demonstrate qualitatively and quantitatively that our models produce convincing scene predictions on both the abstract images as well as natural images taken from the internet.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: This paper proposes a robust tracking method that uses interval analysis. Any single posterior model necessarily includes a modeling uncertainty (error), and thus, the posterior should be represented as an interval of probability. Then, the objective of visual tracking becomes to find the best state that maximizes the posterior and minimizes its interval simultaneously. By minimizing the interval of the posterior, our method can reduce the modeling uncertainty in the posterior. In this paper, the aforementioned objective is achieved by using the M4 estimation, which combines the Maximum a Posteriori (MAP) estimation with Minimum Mean-Square Error (MMSE), Maximum Likelihood (ML), and Minimum Interval Length (MIL) estimations. In the M4 estimation, our method maximizes the posterior over the state obtained by the MMSE estimation. The method also minimizes the interval of the posterior by reducing the gap between the lower and upper bounds of the posterior. The gap is reduced when the likelihood is maximized by the ML estimation and the interval length of the state is minimized by the MIL estimation. The experimental results demonstrate that M4 estimation can be easily integrated into conventional tracking methods and can greatly enhance their tracking accuracy. In several challenging datasets, our method outperforms state-of-the-art tracking methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#162 - Scanline Sampler without Detailed Balance: An Efficient MCMC for MRF Optimization [pdf]
Wonsik Kim, Kyoung Mu Lee |
Abstract: Markov chain Monte Carlo (MCMC) is an elegant tool, widely used in a variety of areas. In computer vision, it has been used for inference on the Markov random field (MRF) model. However, MCMC has received less attention than deterministic approaches, although it converges to the globally optimal solution in theory. The major obstacle is its slow convergence. To come up with a faster sampling method, we investigate two ideas: breaking detailed balance and updating multiple nodes at a time. Although detailed balance is considered to be an essential element of MCMC, it is not actually a necessary condition for convergence. In addition, exploiting the structure of the MRF, we introduce a new kernel which updates multiple nodes in a scanline rather than a single node. These two ideas are integrated in a novel way to develop an efficient method called scanline sampler without detailed balance. In the experimental section, we apply our method to the OpenGM2 benchmark of MRF optimization and show that the proposed method achieves faster convergence than the conventional approaches.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Depth captured by consumer RGB-D cameras is often noisy and misses values at some pixels, especially around object boundaries. Most existing methods complete the missing depth values guided by the corresponding color image. When the color image is noisy or the correlation between color and depth is weak, the depth map cannot be properly enhanced. In this paper, we present a depth map enhancement algorithm that performs depth map completion and de-noising simultaneously. Our method is based on the observation that for each RGB-D patch, if we find similar-looking patches, they lie in a very low-dimensional subspace. We can then assemble similar patches into a matrix and enforce this low-rank subspace constraint. This low-rank subspace constraint essentially captures the underlying structure in the RGB-D patches and enables robust depth enhancement against the noise or weak correlation between color and depth. Based on this subspace constraint, our method formulates depth map enhancement as a low-rank matrix completion problem. Since the rank varies from matrix to matrix, we develop a data-driven method to automatically determine the rank number for each matrix. The experiments with our method on a public benchmark show that our method can effectively enhance depth maps from consumer RGB-D cameras.
|
Similar papers:
[rank all papers by similarity to this]
|
#178 - Describing Textures in the Wild [pdf]
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Andrea Vedaldi |
Abstract: Patterns and textures are defining characteristics of many natural objects: a shirt can be striped, the wings of a butterfly can be veined, and the skin of an animal can be scaly. Aiming at supporting this analytical dimension in image understanding, we address the challenging problem of describing textures with semantic attributes. We identify a rich vocabulary of forty-seven texture terms and use them to describe a large dataset of patterns collected in the wild. The resulting Describable Textures Dataset (DTD) is the basis for seeking the best texture representation for recognizing describable texture attributes in images. We port from object recognition to texture recognition the Improved Fisher Vector (IFV) and show that, surprisingly, it outperforms specialized texture descriptors not only on our problem, but also in established material recognition datasets. We also show that the describable attributes are excellent texture descriptors, transferring between datasets and tasks; in particular, combined with IFV, they significantly outperform the state-of-the-art by more than 8% on both FMD and KTHTIPS-2b benchmarks. We also demonstrate that they produce intuitive descriptions of materials and Internet images.
|
Similar papers:
[rank all papers by similarity to this]
|
#180 - Parsing World's Skylines using Shape-Constrained MRFs [pdf]
Rashmi Tonge, Subhransu Maji, C.V. Jawahar |
Abstract: We propose an approach for extracting the detailed structure of buildings in typical skyline images. Our approach is based on a Markov Random Field (MRF) formulation that exploits the fact that such images contain highly overlapping objects of similar shapes. Our contributions are the following: (1) A dataset of 120 skyline images containing over 4,000 individually labeled buildings, which allows us to quantitatively evaluate the performance of various methods, (2) An analysis of low-level features that are useful for segmentation of buildings, and (3) A shape-constrained MRF that enforces shape priors over the regions. We perform experiments in automatic and interactive settings, and show that in both cases our formulation offers an order of magnitude speedup over traditional approaches and improves performance.
|
Similar papers:
[rank all papers by similarity to this]
|
#182 - StoryGraphs: Narrative Charts for TV series [pdf]
Makarand Tapaswi, Martin Bäuml, Rainer Stiefelhagen |
Abstract: We present a novel way to automatically summarize and represent the storyline of a TV episode by visualizing character interactions in a narrative chart. We also propose a scene detection method that lends itself well to generating over-segmented scenes, which are used to partition the video. The positioning of the characters in the chart is formulated as an optimization problem wherein we trade off between the aesthetics of the chart and its functionality. Using automatic person identification, we generate StoryGraphs on 3 diverse TV series encompassing a total of 22 episodes. We define quantitative criteria to evaluate StoryGraphs and also compare them against episode summaries to evaluate their ability to provide the episode overview.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We pose unseen view synthesis as a probabilistic tensor completion problem. Given images of people organized by their rough viewpoint, we form a 3D appearance tensor indexed by images (pose examples), viewpoints, and image positions. After discovering the low-dimensional latent factors that approximate that tensor, we can impute its missing entries. In this way, we generate novel synthetic views of people, even when they are observed from just one camera viewpoint. We show that the inferred views are both visually and quantitatively accurate. Furthermore, we demonstrate their value for recognizing actions in unseen views and estimating viewpoint in novel images. While existing methods are often forced to choose between data that is either realistic or multi-view, our virtual views offer both, thereby allowing greater robustness to viewpoint in novel images.
|
Similar papers:
[rank all papers by similarity to this]
|
#200 - Orientational Pyramid Matching for Recognizing Indoor Scenes [pdf]
Lingxi Xie, Jingdong Wang, Bo Zhang, Qi Tian |
Abstract: Scene recognition is a basic task towards image understanding. Spatial Pyramid Matching (SPM) has been shown to be a popular solution for spatial context modeling. In this paper, we introduce an alternative approach, Orientational Pyramid Matching (OPM), for orientational context modeling. Our approach is motivated by the observation that the 3D orientations of objects are a crucial factor in discriminating indoor scenes. The novelty lies in that OPM uses the 3D orientations to form the pyramid and produce the pooling regions, unlike SPM, which uses the spatial positions to form the pyramid. Experimental results on the challenging MIT Indoor-67 dataset show that OPM achieves performance comparable to SPM, that OPM and SPM make complementary contributions, and that their combination gives state-of-the-art performance.
|
Similar papers:
[rank all papers by similarity to this]
|
#212 - Joint Unsupervised Multi-Class Image Segmentation [pdf]
Fan Wang, Qixing Huang, Maks Ovsjanikov, Leonidas J. Guibas |
Abstract: Joint segmentation of image sets is a challenging problem, especially when there are multiple objects with variable appearances shared among the images in the collection and the set of objects present in each particular image is itself varying and unknown. In this paper, we present a novel method to jointly segment a set of images with objects from multiple classes. We first establish consistent functional maps across the input images, and introduce a formulation that explicitly models partial similarity across images instead of global consistency. Given the optimized maps across the images, multiple groups of consistent segmentations are found such that they align with segmentation cues in the images, agree with the functional maps, and are mutually exclusive. The proposed fully unsupervised approach exhibits a significant improvement over the state-of-the-art methods, as shown on the co-segmentation data sets MSRC, Flickr, and PASCAL.
|
Similar papers:
[rank all papers by similarity to this]
|
#213 - Looking Beyond the Visible Scene [pdf]
Joseph Lim, Aditya Khosla, Antonio Torralba, Byoungkwon An |
Abstract: A common thread that ties previous works in scene understanding together is their focus on the aspects directly present in a scene such as its categorical classification or the set of objects. In this work, we propose to look beyond the visible elements of a scene; we demonstrate that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels, it is so much more. From a simple observation of a scene, we can tell a lot about the environment surrounding the scene such as the potential establishments near it, the potential crime rate in the area, or even the economic climate. In this work, we explore several areas from both the human perception and computer vision perspective. Specifically, we show that it is possible to predict the distance of surrounding establishments such as McDonald's or hospitals even by using scenes located far from them. We go a step further to show that both humans and computers perform reasonably at navigating the environment based only on visual cues from scenes that contain no direct information about the target. Lastly, we show that it is possible to predict the crime rates in an area simply by looking at a scene without any real-time criminal activity. Simply put, here, we illustrate that it is possible to look beyond the visible scene.
|
Similar papers:
[rank all papers by similarity to this]
|
#218 - Towards Multi-view and Partially-occluded Face Alignment [pdf]
Junliang Xing, Zhiheng Niu, Junshi Huang, Weiming Hu, Shuicheng Yan |
Abstract: We present a robust algorithm to locate facial landmarks under different views and possibly severe occlusions. To build reliable relationships between face appearance and shape with large view variations, we propose to formulate face alignment as an $\ell_1$-induced Stagewise Relational Dictionary (SRD) learning problem. During each training stage, the SRD model learns a relational dictionary to capture consistent relationships between face appearance and shape, which are respectively modeled by the pose-indexed image features and the shape displacements for the currently estimated landmarks. During testing, the SRD model automatically selects a sparse set of the most related shape displacements for the testing sample and uses them to refine its shape iteratively. To locate face landmarks under occlusions, we further propose to learn an occlusion dictionary to model different kinds of partial face occlusions. By deploying the occlusion dictionary into the SRD model, the alignment performance for occluded faces can be further improved. Our algorithm is simple, effective, and easy to implement. Extensive experiments on two benchmark datasets and two newly built datasets have demonstrated its superior performance over the state-of-the-art methods, especially for faces with large view variations and/or occlusions.
|
Similar papers:
[rank all papers by similarity to this]
|
#220 - Learning Mid-level Filters for Person Re-identification [pdf]
Rui Zhao, Wanli Ouyang, Xiaogang Wang |
Abstract: In this paper, we propose a novel approach of learning mid-level filters from automatically discovered patch clusters for person re-identification. It is well motivated by our study on what are good filters for person re-identification. Our mid-level filters are discriminatively learned for identifying specific visual patterns and distinguishing persons, and have good cross-view invariance. First, local patches are qualitatively measured and classified with their discriminative power. Discriminative and representative patches are collected for filter learning. Second, patch clusters with coherent appearance are obtained by pruning hierarchical clustering trees, and a simple but effective cross-view training strategy is proposed to learn filters that are view-invariant and discriminative. Third, filter responses are integrated with patch matching scores in RankSVM training. The effectiveness of our approach is validated on the VIPeR dataset and the CUHK Campus dataset. The learned mid-level features are complementary to existing handcrafted low-level features, and improve the best Rank-1 matching rate on the VIPeR dataset by 14%.
|
Similar papers:
[rank all papers by similarity to this]
|
#230 - Automatic Face Reenactment [pdf]
Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormaehlen, Patrick Perez, Christian Theobalt |
Abstract: We propose an image-based facial reenactment system that replaces the face of an actor in an existing target video with the face of a user from a source video, while preserving the original target performance. Our system is fully automatic and does not require a database of source expressions. Instead, it is able to produce convincing reenactment results from a short source video of the user performing arbitrary facial gestures captured with an off-the-shelf camera, such as a webcam. Our reenactment pipeline is conceived as part image retrieval and part face transfer: Image retrieval is based on temporal clustering of target frames and a novel image matching metric that combines appearance and motion to select candidate frames from the source video, while face transfer is done by a 2D warping strategy that preserves the user's identity. Our system excels in simplicity because it does not rely on a 3D face model, it is robust under head motion and does not require the source and target performance to be similar. We show convincing reenactment results for videos that we recorded ourselves and for low-quality footage taken from the Internet.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We present an approach that takes a single photograph of a child as input and automatically produces a series of age-progressed outputs between 1 and 80 years of age, accounting for pose, expression, and illumination. Leveraging thousands of photos of children and adults at many ages from the Internet, we first show how to compute average image subspaces that are pixel-to-pixel aligned and model variable lighting. These averages depict a prototype man and woman aging from 0 to 80, under any desired illumination, and capture the differences in shape and texture between ages. Applying these differences to a new photo yields an age progressed result. Contributions include re-lightable age subspaces, a novel technique for subspace-to-subspace alignment, and the most extensive evaluation of age progression techniques in the literature.
|
Similar papers:
[rank all papers by similarity to this]
|
#234 - Hash-SVM: Scalable Kernel Machines for Large-Scale Visual Classification [pdf]
Yadong Mu, Gang Hua, Wei Fan, Shi-Fu Chang |
Abstract: This paper presents a novel algorithm which uses hash bits for efficiently optimizing non-linear kernel SVM in very large scale visual classification problems. Our key idea is to represent each sample with compact hash bits and define an inner product over these bits, which serves as the surrogate of the original nonlinear kernels. Then the optimal solution of the nonlinear SVM can be transformed into solving a linear SVM over the hash bits. The proposed Hash-SVM enjoys both greatly reduced data storage owing to the compact binary representation, as well as the (sub-)linear training complexity via linear SVM. As a crucial component of Hash-SVM, we propose a novel hashing scheme for arbitrary non-linear kernels via random subspace projection in reproducing kernel Hilbert space. Our comprehensive analysis reveals a well-behaved theoretical bound on the deviation between the proposed hashing-based kernel approximation and the original kernel function. We also derive moderate requirements on the hash bits for achieving a satisfactory accuracy level. Several experiments on large-scale visual classification benchmarks are conducted, including one with over 1 million images. The results clearly demonstrate the superiority of our algorithm when compared with other alternatives.
|
Similar papers:
[rank all papers by similarity to this]
|
#237 - Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts [pdf]
Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille |
Abstract: Detecting objects becomes difficult when we must deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformation and partial occlusion in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals of low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables our model to represent an exponential number of holistic object and body part combinations to better deal with different detectability patterns caused by deformations, occlusions or low resolution. We apply our method to the six animal categories in the Pascal VOC dataset and show that our method significantly improves on the state-of-the-art by 4.1 AP and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc.). This makes use of a new dataset of fully annotated object parts for Pascal VOC 2010, which provides the mask for the parts.
|
Similar papers:
[rank all papers by similarity to this]
|
#247 - Quality-based Multimodal Classification Using Tree-Structured Sparsity [pdf]
Soheil Bahrampour, Asok Ray, Nasser Nasrabadi, Kenneth Jenkins |
Abstract: Recent studies have demonstrated advantages of information fusion based on sparsity models for multimodal classification. Among several sparsity models, tree-structured sparsity provides a flexible framework for extraction of cross-correlated information from different sources and for enforcing group sparsity at multiple granularities. However, the existing algorithm only solves an approximated version of the cost functional and the resulting solution is not necessarily sparse at group levels. This paper reformulates the tree-structured sparse model for the multimodal classification task. An accelerated proximal algorithm is proposed to solve the optimization problem, which is an efficient tool for feature-level fusion among either homogeneous or heterogeneous sources of information. In addition, a (fuzzy-set-theoretic) possibilistic scheme is proposed to weight the available modalities, based on their respective reliability, in a joint optimization problem for finding the sparsity codes. This approach provides a general framework for quality-based fusion that offers added robustness to several sparsity-based multimodal classification algorithms. To demonstrate their efficacy, the proposed methods are evaluated on three different applications: multiview face recognition, multimodal face recognition, and target classification.
|
Similar papers:
[rank all papers by similarity to this]
|
#249 - Diversity-Enhanced Condensation Algorithm and Its Application for Robust and Accurate Endoscope Electromagnetic Tracking [pdf]
Ying Wan, Xiongbiao Luo, Sean He, Jie Yang, Terry Peters, Kensaku Mori |
Abstract: This paper proposes a diversity-enhanced condensation algorithm to address the particle degeneracy or impoverishment problem from which particle filtering methods usually suffer. Particle diversity plays an important role in state propagation since it affects the algorithm's performance. Unfortunately, the condensation algorithm easily gets trapped in local minima due to the shortage of particle modes. We introduce a modified evolutionary computing method, adaptive differential evolution, to resolve the particle impoverishment under a proper size of the particle population. We apply our proposed method to endoscope electromagnetic tracking for estimating three-dimensional motion of the endoscopic camera. Validation on a dynamic phantom shows that our proposed method offers a more robust and accurate tracking framework than previous methods by reducing the tracking error from 4.8 mm to 3.2 mm.
|
Similar papers:
[rank all papers by similarity to this]
|
#267 - Cross-Scale Cost Aggregation for Stereo Matching [pdf]
Kang Zhang, Yuqiang Fang, Dongbo Min, Lifeng Sun, Shiqiang Yang, Shuicheng Yan, Qi Tian |
Abstract: Human beings process stereoscopic correspondence across multiple scales. However, this bio-inspiration is ignored by state-of-the-art cost aggregation methods for dense stereo correspondence. In this paper, a generic cross-scale cost aggregation framework is proposed to allow multi-scale interaction in cost aggregation. We firstly reformulate cost aggregation from a unified optimization perspective and show that different cost aggregation methods essentially differ in the choices of similarity kernels. Then, an inter-scale regularizer is introduced into optimization and solving this new optimization problem leads to the proposed framework. Since the regularization term is independent of the similarity kernel, various cost aggregation methods can be integrated into the proposed general framework. We show that the cross-scale framework is important as it effectively and efficiently expands state-of-the-art cost aggregation methods and leads to significant improvements, when evaluated on Middlebury, KITTI and New Tsukuba datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
#276 - Learning an image-based motion context for multiple people tracking [pdf]
Laura Leal-Taixé, Michele Fenzi, Alina Kuznetsova, Bodo Rosenhahn, Silvio Savarese |
Abstract: We present a novel method for multiple people tracking that leverages a generalized model for capturing interactions among individuals. At the core of our model is a learned dictionary of interaction feature strings which capture the relationship between the motion of the targets. These feature strings, created from low-level image features, lead to a much richer representation of the physical interactions between targets compared to hand-specified social force models that previous works have introduced for tracking. One disadvantage of using social forces is that all pedestrians need to be detected in order for the forces to be applied, while our method is able to encode the effect of undetected targets, making the tracker more robust to partial occlusions. The interaction feature strings are used in a Random Forest framework to track the targets according to the features surrounding them. Results on six publicly available sequences show that our method outperforms state-of-the-art approaches in multiple people tracking.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Scan-line optimization via cost accumulation has become very popular for stereo estimation in computer vision applications and is often combined with a semi-global integration strategy, known as SGM. This paper introduces this combination as a general and effective optimization technique. It is the first time that this concept is applied to 3D medical image registration. The presented algorithm, SGM-3D, employs a coarse-to-fine strategy and reduces the search space dimension for consecutive pyramid levels by a fixed linear rate. This allows it to handle large displacements to an extent that is required for clinical applications in high dimensional data. SGM-3D is evaluated in context of pulmonary motion analysis on the recently extended DIR-lab benchmark that provides ten 4D computed tomography (CT) image data sets, as well as ten challenging 3D CT scan pairs from the COPDgene study archive. Results show that both registration errors as well as run-time performance are very competitive with current state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Most previous works on video action recognition primarily use complex hand-designed local features, such as popular SIFT, HOG and SURF, but these approaches are time-consuming and difficult to extend to other sensor modalities. Recent studies discover that there are no universally best hand-engineered features for all datasets, and learning features directly from the dataset itself may be more advantageous. One such endeavor is Slow Feature Analysis (SFA) proposed by Wiskott and Sejnowski \cite{sfa}. SFA can learn the invariant and slowly varying features from input signals and has proved to be valuable in human action recognition \cite{sfa_action}. It is also observed that the multi-layer feature representation has succeeded remarkably in widespread machine learning applications. In this paper, we propose to combine SFA with deep learning techniques to learn hierarchical representations from the high-resolution video data. Specifically, we use a two-layered SFA learning structure with 3D convolution and max pooling operations to scale up the method to large inputs. Sharing the same merits of deep learning, the proposed method is generic and fully automated. Our classification results on Hollywood2, KTH and UCF sports datasets are superior to most previously published results. To highlight some, on the challenging Hollywood2 dataset, our recognition rate shows approximately 1% improvement in comparison to most of hand-designed methods even without supervising and dense sa
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We propose Ordered Subspace Clustering (OSC) to segment data drawn from a sequentially ordered union of subspaces. Current subspace clustering techniques learn the relationships within a set of data and then use a separate clustering algorithm such as NCut for final segmentation. In contrast our technique, under certain conditions, is capable of segmenting clusters intrinsically without providing the number of clusters as a parameter. Similar to Sparse Subspace Clustering (SSC) we formulate the problem as one of finding a sparse representation but include a new penalty term to take care of sequential data. We test our method on data drawn from infrared hyper spectral data, video sequences and face images. Our experiments show that our method, OSC, outperforms the state of the art methods: Spatial Subspace Clustering (SpatSC), Low-Rank Representation (LRR) and SSC.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: The output of many algorithms in computer-vision is either non-binary maps or binary maps (e.g., salient object detection and object segmentation). Several measures have been suggested to evaluate the accuracy of these foreground maps. In this paper, we show that the most commonly-used measures for evaluating both non-binary maps and binary maps do not always provide a reliable evaluation. This includes the Area-Under-the-Curve measure, the Average-Precision measure, the F-measure, and the evaluation measure of the PASCAL VOC segmentation challenge. We start by identifying three causes of inaccurate evaluation. We then propose a new measure that amends these flaws. An appealing property of our measure is being an intuitive generalization of the F-measure. Finally we propose four meta-measures to compare the adequacy of evaluation measures. We show via experiments that our novel measure is preferable.
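For reference, the classical F-measure that this abstract sets out to generalize can be computed from a binary map and its ground truth as sketched below. This shows only the standard F-beta definition, not the new measure proposed in the paper; the function name `f_measure` and the flat-list input format are assumptions made for this example.

```python
def f_measure(prediction, truth, beta=1.0):
    """Classical F-beta score for two equally sized binary maps.

    `prediction` and `truth` are flat sequences of 0/1 (or False/True) values.
    Returns 0.0 when there are no true positives.
    """
    if len(prediction) != len(truth):
        raise ValueError("maps must have the same number of pixels")
    tp = sum(1 for p, t in zip(prediction, truth) if p and t)
    fp = sum(1 for p, t in zip(prediction, truth) if p and not t)
    fn = sum(1 for p, t in zip(prediction, truth) if not p and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# One hit, one false alarm, one miss: precision = recall = 0.5.
score = f_measure([1, 1, 0, 0], [1, 0, 1, 0])
```

Note that this score depends only on the counts of hits, false alarms and misses, so it ignores where in the map the errors occur.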
|
Similar papers:
[rank all papers by similarity to this]
|
#295 - An Automated Estimator of Image Visual Realism Based on Human Cognition [pdf]
Shaojing Fan, Tian-Tsong Ng, Jonathan Herberg, Bryan Koenig, Cheston Tan, Rang-ding Wang |
Abstract: Assessing the visual realism of images is increasingly becoming an essential aspect of fields ranging from computer graphics (CG) rendering to photo manipulation. In this paper we systematically evaluate factors underlying human perception of visual realism and use that information to create an automated assessment of visual realism. We make the following unique contributions. First, we established a benchmark dataset of images with empirically determined visual realism scores. Second, we identified attributes potentially related to image realism, and used correlational techniques to determine that realism was most related to image naturalness, familiarity, aesthetics, and semantics. Third, we created an attributes-motivated, automated computational model that estimated image visual realism quantitatively. Using human assessment as a benchmark, the model was below human performance, but outperformed other state-of-the-art algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
#302 - Robust Online Multi-Object Tracking based on Tracklet Confidence and Online Discriminative Appearance Learning [pdf]
Seung-Hwan Bae, Kuk-Jin Yoon |
Abstract: Online multi-object tracking aims at producing complete tracks of multiple objects using the information up to the present time. It still remains a difficult problem in complex scenes, because of frequent occlusion by a clutter or other objects, similar appearances of different objects, and so on. In this paper, we propose a robust online multi-object tracking method that can handle those difficulties effectively. We first propose the tracklet confidence using the detectability and continuity of a tracklet, and formulate a multi-object tracking problem based on the tracklet confidence. The multi-object tracking problem is then solved by associating tracklets in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections and fragmented tracklets are linked up with others without any iterative and expensive associations. Here, for the reliable association between tracklets and detections, we also propose a novel online learning method using an incremental linear discriminant analysis for discriminating the appearances of objects. By exploiting the proposed learning method, the tracklet association can be successfully achieved even under severe occlusion. Experiments with challenging public datasets show obvious performance improvement over other batch and online tracking methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#304 - Discrete-Continuous Gradient Orientation Estimation for Faster Unsupervised Segmentation [pdf]
Michael Donoser, Dieter Schmalstieg |
Abstract: The state-of-the-art in fully unsupervised segmentation builds hierarchical segmentation structures based on analyzing local feature cues in spectral settings. Due to their impressive performance, such segmentation approaches have become building blocks in many computer vision applications. Nevertheless, the main bottlenecks are still the computationally demanding processes of local feature extraction and subsequent spectral analysis. In this paper, we demonstrate that based on effectively trained random forests aiming at a discrete-continuous optimization of oriented gradient signals, we are able to provide segmentation performance competitive to state-of-the-art (even without any additional spectral analysis) while reducing computation time by a factor of 30. The output of our algorithm is a hierarchy of segmentation results with differing granularity, and in such a way we are able to provide useful input to various computer vision applications at significantly reduced runtime.
|
Similar papers:
[rank all papers by similarity to this]
|
#305 - Good Vibrations: A Modal Analysis Approach for Sequential Non-Rigid Structure from Motion [pdf]
Antonio Agudo, Lourdes Agapito, Begoña Calvo, Jose M. Montiel |
Abstract: We propose an online solution to Non-Rigid Structure from Motion that performs camera pose and 3D shape estimation of highly deformable surfaces on a frame-by-frame basis. Our method models non-rigid deformations as a linear combination of some mode shapes obtained using modal analysis from continuum mechanics. The shape is first discretized into linear elastic triangles, modelled by means of finite elements, which are used to pose the force balance equations for an un-damped free vibrations model. The shape basis computation comes down to solving an eigenvalue problem, without the requirement of a learning step. The camera pose and time varying weights that define the shape at each frame are then estimated on the fly, in an online fashion, using bundle adjustment over a sliding window of image frames. The result is a low computational cost method that can run sequentially in real-time. We show experimental results on synthetic sequences with ground truth 3D data and real videos for different scenarios ranging from sparse to dense scenes. Our system exhibits a good trade-off between accuracy and computational budget, it can handle missing data and performs favourably compared to competing methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#306 - Incremental Learning of NCM Forests for Large-Scale Image Classification [pdf]
Marko Ristin, Matthieu Guillaumin, Juergen Gall, Luc Van Gool |
Abstract: In recent years, large image data sets such as ImageNet, TinyImages or ever-growing social networks like Flickr have emerged, posing new challenges to image classification that were not apparent in smaller image sets. In particular, the efficient handling of dynamically growing data sets, where not only the amount of training images, but also the number of classes increases over time, is a relatively unexplored problem. To remedy this, we introduce Nearest Class Mean Forests (NCMF), a variant of Random Forests where the decision nodes are based on nearest class mean (NCM) classification. NCMFs not only outperform conventional random forests, but are also well suited for integrating new classes. To this end, we propose and compare several approaches to incorporate data from new classes, so as to seamlessly extend the previously trained forest instead of re-training it from scratch. In our experiments, we show that NCMFs trained on small data sets with 10 classes can be extended to large data sets with 1000 classes without significant loss of accuracy.
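The nearest class mean (NCM) rule that the decision nodes build on can be sketched in a few lines. The example below is a plain NCM classifier that accepts new classes without retraining; it is not the forest described in the paper, and the class and method names (`NearestClassMean`, `add_examples`, `predict`) are hypothetical.

```python
import math


class NearestClassMean:
    """Minimal NCM classifier: assign a sample to the class with the closest mean.

    Per-class running sums make it cheap to add examples of an existing class
    or to introduce an entirely new class later on.
    """

    def __init__(self):
        self.sums = {}    # label -> component-wise sum of feature vectors
        self.counts = {}  # label -> number of examples seen

    def add_examples(self, label, vectors):
        for vec in vectors:
            if label not in self.sums:
                self.sums[label] = [0.0] * len(vec)
                self.counts[label] = 0
            self.sums[label] = [s + x for s, x in zip(self.sums[label], vec)]
            self.counts[label] += 1

    def mean(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, vec):
        # Euclidean distance to each class mean; the smallest distance wins.
        return min(self.sums, key=lambda label: math.dist(vec, self.mean(label)))


clf = NearestClassMean()
clf.add_examples("a", [[0.0, 0.0], [0.0, 2.0]])
clf.add_examples("b", [[10.0, 10.0]])
before = clf.predict([1.0, 1.0])
clf.add_examples("c", [[1.0, 1.5]])  # a new class arrives after initial training
after = clf.predict([1.0, 1.0])
```

Because only class means are stored, adding class "c" leaves the statistics of "a" and "b" untouched, which is the property that makes the rule attractive for growing label sets.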
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: The probabilistic methods based on the Symmetrical Gauss Mixture Model (SGMM) [3,12,7] have achieved great success in point set registration, but are seldom used to find the correspondences between two images due to the complexity of the non-rigid transformation and too many outliers. In this paper we propose an Asymmetrical GMM (AGMM) for point set matching between a pair of images. Different from the previous SGMM, the AGMM gives each Gauss component a different weight which is related to the feature similarity between the data point and model point, which leads to two effective algorithms: the Single Gauss Model for Mismatch Rejection (SGMR) algorithm and the AGMM algorithm for point set matching. The SGMR algorithm iteratively filters mismatches by estimating a non-rigid transformation between two images based on the spatial coherence of point sets. The AGMM algorithm combines the feature information with position information of the SIFT feature points extracted from the images to achieve point set matching so that many more correct correspondences with high precision can be found. A number of comparison and evaluation experiments reveal the excellent performance of the proposed SGMR algorithm and AGMM algorithm.
|
Similar papers:
[rank all papers by similarity to this]
|
#313 - A Compositional Model for Low-Dimensional Image Set Representation [pdf]
Hossein Mobahi, Ce Liu, Bill Freeman |
Abstract: Learning a low-dimensional representation of images is useful for various applications in graphics and computer vision. Manifold learning on images addresses this problem. However, existing works either require very dense sampling of the space, or are applicable to patch level only, ignoring global structures in the images. We present a simple method that operates on the entire image, but can learn from small sized datasets. The model relies on a composition structure of color, shape, and appearance. We show that each component can be approximated by a low-dimensional subspace when the others are factored out. Our formulation allows for very efficient learning and experiments show encouraging results.
|
Similar papers:
[rank all papers by similarity to this]
|
#317 - Associative embeddings for large-scale knowledge transfer with self-assessment [pdf]
Alexander Vezhnevets, Vittorio Ferrari |
Abstract: We propose a method for knowledge transfer between semantically related classes in ImageNet. By transferring knowledge from the images that have bounding-box annotations to the others, our method is capable of automatically populating ImageNet with many more bounding-boxes and even pixel-level segmentations. The underlying assumption that objects from semantically related classes look alike is formalized in our novel Associative Embedding (AE) representation. AE recovers the latent low-dimensional space of appearance variations among image windows. The dimensions of AE space tend to correspond to aspects of window appearance (e.g. side view, close up, background). We model the overlap of a window with an object using Gaussian Processes (GP) regression, which spreads annotation smoothly through AE space. The probabilistic nature of GP allows our method to perform self-assessment, i.e. assigning a quality estimate to its own output. It enables trading off the amount of returned annotations for their quality. A large scale experiment on 219 classes and 0.5 million images demonstrates that our method outperforms state-of-the-art methods and baselines for both object localization and segmentation. Using self-assessment we can automatically return bounding-box annotations for 30% of all images with high localization accuracy (i.e. 73% average overlap with ground-truth).
|
Similar papers:
[rank all papers by similarity to this]
|
#318 - Max-Margin Boltzmann Machines for Object Segmentation [pdf]
Jimei Yang, Simon Safar, Ming-Hsuan Yang |
Abstract: We present Max-Margin Boltzmann Machines (MMBMs) for object segmentation. MMBMs are essentially a class of Conditional Boltzmann Machines that model the joint distribution of hidden variables and output labels conditioned on input observations. In addition to image-to-label connections, we build direct image-to-hidden connections to facilitate global shape prediction, and thus derive a simple Iterated Conditional Modes algorithm for efficient maximum a posteriori inference. We formulate a max-margin objective function for discriminative training, and analyze the effects of different margin functions on learning. We evaluate MMBMs using three datasets against state-of-the-art methods to demonstrate the strength of the proposed algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Occlusion poses a significant difficulty for object recognition due to the combinatorial diversity of possible occlusion patterns. We take a strongly supervised, non-parametric approach to modeling occlusion by learning deformable models with many local part mixture templates using large quantities of synthetically generated training data. This allows the model to learn the appearance of different occlusion patterns including figure-ground cues such as the shapes of occluding contours as well as the co-occurrence statistics of occlusion between neighboring parts. We test the resulting model on human pose estimation under heavy occlusion and find it produces improved localization accuracy. The underlying part mixture-structure also allows the model to make compelling predictions of figure-ground-occluder segmentations.
|
Similar papers:
[rank all papers by similarity to this]
|
#325 - Occlusion Coherence: Localizing Occluded Faces with a Hierarchical Deformable Part Model [pdf]
Golnaz Ghiasi, Charless Fowlkes |
Abstract: The presence of occluding objects significantly impacts performance of systems for object recognition. However, occlusion is typically treated as an unstructured source of noise and explicit models for occluders have lagged behind those for object appearance and shape. In this paper we describe a hierarchical deformable part model for face detection and keypoint localization that explicitly models occlusions of parts. The proposed model structure makes it possible to augment positive training data with large numbers of synthetically occluded instances. This allows us to easily incorporate the statistics of occlusion patterns in a discriminatively trained model. We test the model on several benchmarks for keypoint localization including challenging sets featuring significant occlusion. We find that the addition of an explicit model of occlusion yields a system that outperforms existing approaches in keypoint localization accuracy.
|
Similar papers:
[rank all papers by similarity to this]
|
#328 - Modeling long-tail distributions of object subcategories [pdf]
Xiangxin Zhu, Dragomir Anguelov, Deva Ramanan |
Abstract: We argue that object subcategories follow a long-tail distribution: a few subcategories are common, while many are rare. We describe distributed algorithms for learning large-mixture models that capture long-tail distributions, which are hard to model with current approaches. We introduce a generalized notion of mixtures (or subcategories) that allow for examples to be shared across multiple subcategories. We optimize our models with a discriminative meanshift clustering algorithm that searches over mixtures in a distributed, brute-force fashion. We have used our scalable system to train tens of thousands of deformable mixtures for VOC objects. We demonstrate significant performance improvements, particularly for object classes that are characterized by large appearance variation.
|
Similar papers:
[rank all papers by similarity to this]
|
#331 - A General and Simple Method for Camera Pose and Focal Length Determination [pdf]
Yinqiang Zheng, Shigeki Sugimoto, Imari Sato, Masatoshi Okutomi |
Abstract: In this paper, we revisit the pose determination problem of a partially calibrated camera with unknown focal length, hereafter referred to as the P\(n\)Pf problem, by using \(n\) (\(n\geq4\)) 3D-to-2D point correspondences. Our core contribution is to introduce the angle constraint and derive a compact bivariate polynomial equation for each point triplet. Based on this polynomial equation, we propose a truly general method for the P\(n\)Pf problem, which is best suited both to the minimal 4-point based RANSAC application, and also to large scale scenarios with thousands of points, irrespective of the 3D point configuration. In addition, by solving bivariate polynomial systems via Sylvester resultant, our method is very simple and easy to implement. Its simplicity is especially obvious when one needs to develop a fast enough solver for the 4-point case. Experiment results have also demonstrated its superiority in accuracy and efficiency when compared with the existing state-of-the-art solutions.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Research in computer vision has resulted in many models, some specialized for one problem, others more general. In the meantime, experimental vision scientists have collected invaluable behavioral data. Here, to help focus research efforts onto the hardest unsolved problems, and bridge computer and human vision, we define a battery of 5 tests that measure the gap between human and machine performances in several dimensions (generalization across scene categories, generalization from images to edge maps and line drawings, invariance to rotation and scaling, local/global information with jumbled images, object recognition performance). These tests assess models in achieving human-level object and scene recognition, irrespective of implementation details (biologically-inspired or not). To objectively quantify this, in addition to accuracy, we also measure the correlation between model and human error patterns. Experimenting over 7 scene and object datasets, where human data is available, and gauging 14 well-established models, we find that none fully resembles humans in all aspects, and we learn from each test which models and features are more promising in approaching humans in the tested dimension. Across all tests, we find that models based on local edge histograms consistently resemble humans more, while several scene statistics or gist models do perform well with both scenes and objects. While computer vision has long been inspired by human vision
|
Similar papers:
[rank all papers by similarity to this]
|
#335 - SteadyFlow: Spatially Smooth Optical Flow for Video Stabilization [pdf]
Shuaicheng Liu, Lu Yuan, Ping Tan, Jian Sun |
Abstract: We propose a novel motion model, SteadyFlow, to represent the motion between neighboring video frames for stabilization. A SteadyFlow is a specific optical flow by enforcing strong spatial coherence, such that smoothing feature trajectories can be replaced by smoothing pixel (motion) profiles, which are motion vectors collected at the same pixel location in the SteadyFlow over time. In this way, we can avoid brittle feature tracking in a video stabilization system. Besides, SteadyFlow is a more general 2D motion model which can deal with spatially-variant motion. We initialize the SteadyFlow by optical flow and then discard discontinuous motions by a spatial-temporal analysis and fill in missing regions by motion completion. Our experiments demonstrate the effectiveness of our stabilization on real-world challenging videos.
|
Similar papers:
[rank all papers by similarity to this]
|
#336 - Nonparametric Context Modeling of Local Appearance for Pose- and Expression-Robust Facial Landmark Localization [pdf]
Brandon Smith, Jonathan Brandt, Zhe Lin, Li Zhang |
Abstract: This paper addresses the problem of facial landmark localization on faces with extreme head poses and expressions. We propose a data-driven approach that models the correlations between each landmark and its surrounding appearance features. At runtime, each feature casts a weighted vote to predict landmark locations, where the weight is precomputed to take into account the feature's discriminative power. The feature voting-based landmark detection is more robust than previous local appearance-based detectors; we combine it with non-parametric shape regularization to build a novel facial landmark localization pipeline that is robust to scale, in-plane rotation, expression, and most importantly, extreme head pose. We achieve state-of-the-art performance on two especially challenging datasets populated by faces with extreme head poses and expressions.
|
Similar papers:
[rank all papers by similarity to this]
|
#338 - Fast Edge-Preserving PatchMatch for Large Displacement Optical Flow [pdf]
Linchao Bao, Qingxiong Yang, Hailin Jin |
Abstract: We present a fast local optical flow algorithm that can handle large displacement motions. Our algorithm is inspired by recent successes of local methods in stereo matching and optical flow as well as approximate nearest neighbor field algorithms. The main novelty is a fast randomized edge-preserving approximate nearest neighbor field algorithm which propagates self-similarity patterns in addition to propagating offsets. Together with a hierarchical matching scheme, our method can produce high-quality flow at very high speed. Experimental results on public optical flow benchmarks show that our method is significantly faster than competitors without compromising on quality, especially when scenes contain large motions. In fact, the performance on the MPI Sintel benchmark clearly demonstrates the effectiveness of our method for handling large displacement motions.
|
Similar papers:
[rank all papers by similarity to this]
|
#341 - Co-Occurrence Statistics for Zero-Shot Classification [pdf]
Thomas Mensink, Cees Snoek, Efstratios Gavves |
Abstract: In this paper we aim for zero-shot classification, that is, visual recognition of an unseen class by using knowledge transfer from known classes. Different from the common strategy in the literature, which requires manually defined attribute-to-class mappings, we rely on easy-to-obtain co-occurrence statistics of class labels harvested from existing annotations, web-search hit counts or image tags. Our main contribution is to use inter-dependencies that arise naturally between classes for zero-shot classification. We propose various similarity metrics for leveraging these co-occurrences, and show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three challenging multi-labelled datasets reveal that our proposed zero-shot methods are approaching and occasionally outperforming supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification.
|
Similar papers:
[rank all papers by similarity to this]
|
#345 - Analysis by Synthesis: Object Recognition by Object Reconstruction [pdf]
Mohsen Hejrati, Deva Ramanan |
Abstract: We introduce a new approach for recognizing and reconstructing 3D objects in images. Our approach is based on an analysis by synthesis strategy. We use a forward synthesis model to construct possible geometric interpretations of the world, and then select the interpretation that best agrees with the measured visual evidence. This forward model synthesizes visual templates defined on invariant (HOG) features. These visual templates are discriminatively trained to be accurate for inverse estimation. We introduce an efficient brute-force approach to inference that searches through a large number of candidate reconstructions, returning the optimal one (or multiple likely candidates, if desired). One benefit of such an approach is that recognition is inherently (re)constructive. We show state-of-the-art performance for detection and reconstruction on two challenging 3D object recognition datasets of cars and cuboids.
|
Similar papers:
[rank all papers by similarity to this]
|
#346 - Object Classification with Adaptive Regions [pdf]
Hakan Bilen, Marco Pedersoli, Vinay Namboodiri, Tinne Tuytelaars, Luc Van Gool |
Abstract: In classification of objects, substantial work has gone into improving the low-level representation of an image by considering various aspects such as different features, a number of feature pooling and coding techniques, and different kernels. Unlike these works, in this paper, we propose to enhance the semantic representation of an image. We aim to learn the most important visual components of an image and how they interact in order to classify the objects correctly. To achieve our objective, we propose a new latent SVM model for category level object classification. Starting from image-level annotations, we jointly learn the object class and its context in terms of spatial location (where) and appearance (what). Furthermore, to regularize the complexity of the model we learn the spatial and co-occurrence relations between adjacent regions, such that unlikely configurations are penalized. Experimental results demonstrate that the proposed method can consistently enhance results on the challenging Pascal VOC dataset in terms of classification. We also show how semantic representation can be exploited for finding similar content.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Weighted median, in the form of either solver or filter, has been employed in a wide range of computer vision applications for its beneficial properties in sparsity representation. But it is hard to accelerate, compared with other local filters, due to both the spatially varying weights and the median property. We propose an efficient scheme to reduce computation complexity from O(r^2) to O(r), where r is the kernel size. Our contribution is a new joint-histogram representation, median tracking, and a new data structure that enables fast data access. The effectiveness of this scheme is demonstrated on optical flow estimation, stereo matching, structure-texture separation, and image filtering, to name a few. The running time is largely shortened from several minutes to less than 1 second.
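As a point of reference for what is being accelerated, a weighted median can be defined by sorting the values and walking the cumulative weight until it reaches half of the total. The sketch below is this naive sort-based definition for a single window; it is not the joint-histogram scheme from the paper, and the function name `weighted_median` is an assumption for this example.

```python
def weighted_median(values, weights):
    """Naive weighted median: the smallest value whose cumulative weight
    reaches at least half of the total weight.

    Assumes non-negative weights with a positive sum. Sorting makes this
    O(n log n) per window, which is the cost a fast scheme tries to avoid.
    """
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    half = total / 2.0
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


# Equal weights reduce to the ordinary median.
plain = weighted_median([3, 1, 2], [1, 1, 1])
# A heavily weighted sample pulls the result toward itself.
pulled = weighted_median([1, 2, 10], [1, 1, 5])
```

In a filter, the weights would typically come from the similarity between the center pixel and each neighbor in a guidance image, so they change at every pixel; that is why the per-window cost matters.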
|
#356 - Tracklet Association with Online Reidentification in Network Flow Optimization for Long-term Multi-Person Tracking [pdf]
BING WANG, Gang Wang, Kap Luk Chan, LI WANG |
Abstract: This paper presents a novel introduction of online reidentification in track fragment (tracklet) association by network flow optimization for long-term multi-person tracking. Different from other network flow formulations, each node in our network represents a tracklet, and each edge represents the likelihood of neighboring tracklets belonging to the same trajectory as measured by our proposed affinity score. In our method, target-specific similarity metrics are learned, leading to the appearance-based models used in the reidentification. Trajectory-based tracklets are refined by the learned metrics to account for appearance consistency and to identify reliable tracklets. The metrics are then re-learned using reliable tracklets for computing tracklet affinity scores. Long-term trajectories are then obtained by network flow optimization. Occlusions and missed detections are handled by a trajectory completion step. Our method is effective for long-term tracking even when the targets are spatially close or completely occluded by others. We validate our proposed framework on several public datasets and show that it outperforms several state-of-the-art methods.
|
#368 - Piecewise Planar and Compact Floorplan Reconstruction from Images [pdf]
Ricardo Cabral, Yasutaka Furukawa |
Abstract: This paper presents a system that automatically reconstructs piecewise planar and compact floorplans from panorama images, which are then converted to high quality texture-mapped models for free-viewpoint scene visualization. There are two main challenges in image-based floorplan reconstruction. The first challenge is the lack of 3D information that can be extracted from images through Structure from Motion and Multi-View Stereo, since indoor scenes abound with non-diffuse and homogeneous surfaces plus clutter. The second challenge is the need of a sophisticated regularization technique that enforces piecewise planarity to suppress clutter and yields high quality texture mapped models. Our technical contributions are twofold. First, we propose a novel structure classification technique to classify each pixel to three structure regions, which provides 3D cues even from a single image. Second, we cast floorplan reconstruction as a shortest path problem on a specially crafted graph, which enables us to enforce piecewise planarity. Besides producing compact piecewise planar models, this formulation allows us to directly control the output complexity (i.e., the number of vertices). We evaluate our system on a number of real businesses, and show that our texture mapped mesh models provide compelling free-viewpoint visualization experiences, when compared against the state-of-the-art and ground truth.
|
Abstract: Persistent surveillance of large geographic areas from unmanned aerial vehicles allows us to learn much about the daily activities in the region of interest. Nearly all of the approaches addressing tracking in this imagery are detection based and rely on background subtraction or frame differencing to provide detections. This, however, makes it difficult to track targets once they slow down or stop, which is not acceptable for persistent tracking, our goal. We present a multiple target tracking approach that does not exclusively rely on background subtraction and is better able to track targets through stops. It accomplishes this by effectively running two trackers in parallel: one based on detections from background subtraction providing target initialization and reacquisition, and one based on a target state regressor providing frame to frame tracking. We evaluated the proposed approach on a long sequence from a wide area aerial imagery dataset, and the results show improved object detection rates and id-switch rates with limited increases in false alarms compared to the competition.
|
#375 - Are Cars Just 3D Boxes? - Jointly Estimating the 3D Shape of Multiple Objects [pdf]
Muhammad Zeeshan Zia, Michael Stark, Konrad Schindler |
Abstract: Current systems for scene understanding typically represent objects as 2D or 3D bounding boxes. While these representations have proven robust in a variety of applications, they provide only coarse approximations to the true 2D and 3D extent of objects. As a result, object-object interactions, such as occlusions or ground-plane contact, can be represented only superficially. In this paper, we approach the problem of scene understanding from the perspective of 3D shape modeling, and design a 3D scene representation that reasons jointly about the 3D shape of multiple objects. This representation allows us to express 3D geometry and occlusion at the fine detail level of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way. In our experiments, we demonstrate the benefit of jointly estimating the 3D shape of multiple objects in a scene over working with coarse boxes, on the recently proposed KITTI dataset of realistic street scenes.
|
#378 - Graph Cut based Continuous Stereo Matching using Locally Shared Labels [pdf]
Tatsunori Taniai, Yasuyuki Matsushita, Takeshi Naemura |
Abstract: We present an accurate and efficient stereo matching method using locally shared labels, a new labeling scheme that enables spatial propagation in MRF inference using graph cuts. They give each pixel and region a set of candidate disparity labels, which are randomly initialized, spatially propagated, and refined for continuous disparity estimation. We cast the selection and propagation of locally-defined disparity labels as fusion-based energy minimization. The joint use of graph cuts and locally shared labels has advantages over previous approaches based on fusion moves or belief propagation; it produces submodular moves deriving a subproblem optimality; enables powerful randomized search; helps to find good smooth, locally planar disparity maps, which are reasonable for natural scenes; allows parallel computation of both unary and pairwise costs. Our method is evaluated using the Middlebury stereo benchmark and achieves first place in sub-pixel accuracy.
|
#382 - Two-View Camera Calibration for Multi-Layer Flat Refractive Interface [pdf]
Xida Chen, Yee Hong Yang |
Abstract: In this paper, we present a novel refractive calibration method for an underwater stereo camera system where both cameras are looking through multiple parallel flat refractive interfaces. At the heart of our method is an important finding that the thickness of the interface can be estimated from a set of pixel correspondences in the stereo images when the refractive axis is given. To our best knowledge, such a finding has not been studied or reported. Moreover, by exploring the search space for the refractive axis and using reprojection error as a measure, both the refractive axis and the thickness of the interface can be recovered simultaneously. Our method does not require any calibration target such as a checkerboard pattern which may be difficult to manipulate when the cameras are deployed deep undersea. The implementation of our method is simple. In particular, it only requires solving a set of linear equations of the form $Ax = b$ and applies sparse bundle adjustment to refine the initial estimated results. Extensive experiments have been carried out which include simulations with and without outliers to verify the correctness of our method as well as to test its robustness to noise and outliers. The results of real experiments are also provided. The accuracy of our results is comparable to that of a state-of-the-art method that requires known 3D geometry of a scene.
|
#390 - Deblurring Low-light Images with Light Streaks [pdf]
Zhe Hu, Sunghyun Cho, Jue Wang, Ming-Hsuan Yang |
Abstract: Images taken in low-light conditions with handheld cameras are often blurry due to the required long exposure time. Although significant progress has been made recently on image deblurring, state-of-the-art approaches often fail on low-light images, as these images do not contain sufficient salient features that deblurring methods rely on. On the other hand, light streaks are common phenomena in low-light images that contain rich blur information, but have not been extensively explored in previous approaches. In this work, we propose a new method that utilizes light streaks to help deblur low-light images. Our approach first automatically detects useful light streaks in the input image, and then poses them as constraints for estimating the blur kernel in an optimization framework. Experimental results show that by explicitly modeling light streaks in the deblurring process, our approach obtains good results on challenging real-world examples that no other method could achieve before.
|
Abstract: We present a simple vector quantizer that combines low distortion with fast search and apply it to approximate nearest neighbor (ANN) search in high dimensional spaces. Leveraging the very same data structure that is used to provide non-exhaustive search, i.e., inverted lists or a multi-index, the idea is to locally optimize an individual product quantizer (PQ) per cell and use it to encode residuals. Local optimization is over rotation and space decomposition; interestingly, we apply a parametric solution that assumes a normal distribution and is extremely fast to train. With a reasonable space and time overhead that is constant in the data size, we set a new state-of-the-art on several public datasets, including a billion-scale one.
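As background for the abstract above, here is a minimal sketch of plain product quantization: a vector is split into equal sub-vectors and each sub-vector is replaced by the index of its nearest centroid in a per-subspace codebook. The per-cell local optimization and residual encoding described in the abstract are not shown; the function names and the toy codebooks are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Encode `vector` as one centroid index per sub-codebook.

    `codebooks[m]` is a list of centroids for the m-th sub-vector.
    Plain PQ only: no rotation, no per-cell training, no residuals.
    """
    num_sub = len(codebooks)
    if num_sub == 0 or len(vector) % num_sub != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    sub_dim = len(vector) // num_sub
    codes = []
    for m, centroids in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        # Nearest centroid in this subspace.
        codes.append(min(range(len(centroids)),
                         key=lambda k: squared_distance(sub, centroids[k])))
    return codes


def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx
```

A usage example with two hypothetical 2-D sub-codebooks: `pq_encode([0.9, 1.2, 4.0, 4.5], [[[0, 0], [1, 1]], [[0, 0], [5, 5]]])` selects the second centroid in each subspace, and `pq_decode` then reconstructs the approximation from those two indices.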
|
#405 - Generalized Nonconvex Nonsmooth Low-Rank Minimization [pdf]
Canyi Lu, Shuicheng Yan, Zhouchen Lin |
Abstract: As surrogate functions of $L_0$-norm, many nonconvex penalty functions have been proposed to enhance the sparse vector recovery. It is easy to extend these nonconvex penalty functions on singular values of a matrix to enhance low-rank matrix recovery. However, different from convex optimization, solving the nonconvex low-rank minimization problem is much more challenging than the nonconvex sparse minimization problem. We observe that all the existing nonconvex penalty functions are concave and monotonically increasing on $[0,\infty)$. Thus their gradients (or supergradient at the nonsmooth point) are decreasing functions. Based on this property, we propose an Iteratively Reweighted Nuclear Norm (IRNN) algorithm to solve the nonconvex nonsmooth low-rank minimization problem. IRNN iteratively solves a Weighted Singular Value Thresholding (WSVT) problem. By setting the weight vector as the gradient of the concave penalty function, the WSVT problem has a closed form solution, whose computational cost is the same as Singular Value Thresholding (SVT). In theory, we prove that IRNN decreases the objective function value monotonically, and any limit point is a stationary point. Extensive experiments on both synthetic data and real images demonstrate that the proposed algorithm enhances the low-rank matrix recovery compared with state-of-the-art convex algorithms.
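The iteration described in this abstract can be sketched as follows. This is a reading of the abstract under stated assumptions, not a reproduction of the paper's derivation: the smooth loss $f$, the step parameter $\mu$ (assumed larger than the Lipschitz constant of $\nabla f$), the penalty symbol $g$, and the iterate notation $X^k$ are all introduced here for illustration.

```latex
% Assumed problem: \min_X \sum_i g(\sigma_i(X)) + f(X), with g concave and
% monotonically increasing on [0, \infty), and \sigma_i(X) the singular values of X.
\begin{align*}
  w_i^k &\in \partial g\bigl(\sigma_i(X^k)\bigr)
    && \text{(supergradient of the penalty as weight)} \\
  Y^k &= X^k - \tfrac{1}{\mu}\,\nabla f(X^k)
    && \text{(gradient step on the smooth part)} \\
  X^{k+1} &= \arg\min_X \; \sum_i w_i^k\,\sigma_i(X)
            + \tfrac{\mu}{2}\,\lVert X - Y^k \rVert_F^2
    && \text{(weighted singular value thresholding)} \\
          &= U \,\operatorname{diag}\!\Bigl(\max\bigl(\Sigma_{ii} - \tfrac{w_i^k}{\mu},\, 0\bigr)\Bigr)\, V^{\top},
    && Y^k = U \Sigma V^{\top} \ \text{(SVD)}
\end{align*}
```

Because the supergradient of a concave increasing penalty is decreasing, larger singular values receive smaller weights, which is what makes the closed-form shrinkage step valid and keeps its cost at one SVD per iteration, as for ordinary SVT.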
|
#406 - Multiple Target Tracking Based on Hierarchical Relation Hypergraph [pdf]
Longyin Wen, Wenbo Li, Zhen Lei, Stan Li |
Abstract: Multi-target tracking is an important but challenging task in computer vision field. Most of the previous data association based methods merely consider the relationships between detections in the limited local temporal domain, leading to their difficulties in handling long-term occlusion and distinguishing the spatially close targets with similar appearance in crowded scenes. In this paper, we propose a novel data association approach based on the hierarchical relation hypergraph, which formulates the tracking task as a hierarchical dense neighborhoods searching problem on the dynamically constructed affinity graph. The relationships between different detections across the spatio-temporal domain are considered in a high-order way, which makes the tracker robust to the spatially close targets with similar appearance. Meanwhile, the hierarchical design of the optimization process fuels our tracker to the long-term occlusion with more robustness. Extensive experiments on various challenging datasets (i.e. PETS2009, ParkingLot), including both low-density and high-density sequences, validate the superiority of our tracker over other state-of-the-art methods.
|
#407 - Discriminative Deep Metric Learning for Face Verification in the Wild [pdf]
Junlin Hu, Jiwen Lu, Yap-Peng Tan |
Abstract: This paper presents a new discriminative deep metric learning (DDML) method for face verification in the wild. Different from existing metric learning-based face verification methods which aim to learn a Mahalanobis distance metric to maximize the inter-class variations and minimize the intra-class variations, simultaneously, the proposed DDML trains a deep neural network which learns a pair of hierarchical nonlinear transformations to project face pairs into two feature subspaces, one subspace for each sample in the pair, under which the distance of each positive face pair is less than a smaller threshold and that of each negative pair is higher than a larger threshold, respectively, so that discriminative information can be exploited in the deep network. Our method achieves the state-of-the-art face verification performance on the widely used LFW and YouTube Faces (YTF) datasets.
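The two-threshold constraint in this abstract (positive pairs closer than a smaller threshold, negative pairs farther than a larger one) can be illustrated with a simple per-pair penalty. This is a plain hinge used only as a sketch; the paper's actual objective may differ, and the function name and the default thresholds `low` and `high` are hypothetical.

```python
def pair_hinge_loss(distance, same_identity, low=1.0, high=3.0):
    """Penalty for one face pair given the distance between its embeddings.

    Positive pairs are penalized when farther apart than `low`;
    negative pairs are penalized when closer than `high`.
    Illustrative hinge only, not the objective from the paper.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    if same_identity:
        return max(0.0, distance - low)
    return max(0.0, high - distance)
```

Pairs that already satisfy their threshold contribute zero loss, so only violating pairs would drive updates to the network in a training loop built on this penalty.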
|
Abstract: We focus on the problem of estimating the ground plane orientation and location in monocular video sequences from a moving observer. Our only assumptions are that the 3D ego motion t and the ground plane normal n are orthogonal, and that n and t are smooth over time. We formulate the problem as a state-continuous Hidden Markov Model (HMM) where the hidden state contains t and n and may be estimated by sampling and decomposing homographies. We show that using blocked Gibbs sampling, we can infer the hidden state with high robustness towards outliers, drifting trajectories, rolling shutter and an imprecise intrinsic calibration. Since our approach does not need any initial orientation prior, it works for arbitrary camera orientations.
|
#414 - Human Body Shape Estimation Using a Multi-Resolution Manifold Forest [pdf]
Frank Perbet, Sam Johnson, Minh-Tri Pham, Björn Stenger |
Abstract: This paper proposes a method for estimating the 3D body shape of a person with robustness to clothing. We formulate the problem as optimization over the manifold of valid depth maps of body shapes learned from synthetic training data. The manifold itself is represented using a novel data structure, a Multi-Resolution Manifold Forest (MRMF), which contains vertical edges between tree nodes as well as horizontal edges between nodes that correspond to overlapping partitions. We show that this data structure allows both efficient localization and navigation on the manifold, for on-the-fly building of local linear models (manifold charting). We demonstrate shape estimation results on clothed users, showing significant improvement in accuracy over global shape models and models using pre-computed clusters. We further compare the MRMF with alternative manifold charting methods on a public dataset for reconstructing 3-D motion from noisy 2-D marker observations, obtaining state-of-the-art results.
|
#416 - Separation of Line Drawings Based on Split Faces for 3D Object Reconstruction [pdf]
Changqing ZOU |
Abstract: Reconstructing 3D objects from single line drawings is often desirable in computer vision and graphics applications. If the line drawing of a complex 3D object is decomposed into primitives of simple shape, the object can be easily reconstructed. We propose an effective method to conduct the line drawing separation and turn a complex line drawing into parametric 3D models. This is achieved by recursively separating the line drawing using two types of split faces. Our experiments show that the proposed separation method can generate more basic and simple line drawings, and its combination with the example-based reconstruction can robustly recover more complex parametric 3D objects than previous methods.
|
Abstract: Sparse coding is a widely involved technique in computer vision. However, the expensive computational cost can hamper its applications, typically when the codebook size must be limited due to concerns on running time. In this paper, we study a special case of sparse coding in which the codebook is a Cartesian product of two subcodebooks. We present algorithms to decompose this sparse coding problem into smaller subproblems, which can be separately solved. Our solution, named as Product Sparse Coding (PSC), reduces the time complexity from $O(K)$ to $O(\sqrt{K})$ in the codebook size $K$. In practice, this can be 20-100x faster than standard sparse coding. In experiments we demonstrate the efficiency and quality of this method on the applications of image classification and
|
#423 - Robust and Efficient Full-Angle Quaternions for Matching Arrays of 3D Rotations [pdf]
Stephan Liwicki, Stefanos Zafeiriou, Maja Pantic, Björn Stenger, Minh-Tri Pham |
Abstract: Matching sets of features often involves dealing with corrupted data. In this paper, we introduce a new distance for robustly matching arrays of 3D rotations. We show that the distance leads to a new and efficient representation for 3D rotations which we coin full-angle quaternion (FAQ). We apply the distance and the representation to 3D shape recognition and 2D object tracking from color video. In the former application, we introduce efficient hashing of scaling and translation concurrently. In the latter application, we utilize subspace learning with the proposed FAQ representation. In both cases, our approach outperforms state-of-the-art approaches.
|
#427 - Look at the Driver, Look at the Road: No Distraction! No Accident! [pdf]
Mahdi Rezaei, Reinhard Klette |
Abstract: The paper proposes an advanced driver-assistance system that correlates driver's attention to the road and traffic conditions by analyzing both simultaneously. In particular, we aim at the prevention of rear-end crashes due to driver fatigue or distraction. We propose an asymmetric appearance-modeling technique and 2D-to-3D registration to define the driver's head pose (in 6 degrees of freedom), yawing detection, and head-nodding detection. Global Haar (GHaar) classifiers are used for vehicle detection. Using a fuzzy-logic inference system, we develop an integrated solution to cover all of the above subjects. We demonstrate real-time performance of the proposed method for real-world scenarios.
|
#430 - Efficient Boosted Exemplar-based Face Detection [pdf]
Haoxiang Li, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Gang Hua |
Abstract: Despite the fact that face detection has been studied intensively over the past several decades, the problem is still not completely solved. Challenging conditions, such as extreme pose, lighting, and occlusion, have historically hampered traditional, model-based methods. In contrast, exemplar-based face detection has been shown to be effective, even under these challenging conditions, primarily because a large exemplar database is leveraged to cover all possible visual variations. However, relying heavily on a large exemplar database to deal with the face appearance variations makes the detector impractical due to the high space and time complexity. We construct an efficient boosted exemplar-based face detector which overcomes the defect of the previous work by being faster, more memory efficient, and more accurate. In our method, exemplars as weak detectors are discriminatively trained and selectively assembled in the boosting framework which largely reduces the number of required exemplars. Notably, we propose to include non-face images as negative exemplars to actively suppress false detections to further improve the detection accuracy. We verify our approach over two public face detection benchmarks and one personal photo album, and achieve significant improvement over the state-of-the-art algorithms in terms of both accuracy and efficiency.
|
#436 - Multiscale Combinatorial Grouping [pdf]
Pablo Arbelaez, Jordi Pont-Tuset, Jon Barron, Ferran Marques, Jitendra Malik |
Abstract: We propose a unified approach for bottom-up hierarchical image segmentation and object candidate generation for recognition, called Multiscale Combinatorial Grouping (MCG). For this purpose, we first develop a fast normalized cuts algorithm. We then propose a high-performance hierarchical segmenter that makes effective use of multiscale information and diversified inputs. Finally, we propose a grouping strategy that combines our multiscale regions into highly-accurate object candidates by exploring efficiently their combinatorial space. We conduct extensive experiments on both the BSDS500 and on the PASCAL 2012 segmentation datasets, showing that MCG produces state-of-the-art contours, regions and object candidates.
|
#439 - Non-Parametric Bayesian Constrained Local Models [pdf]
Pedro Martins, Rui Caseiro, Jorge Batista |
Abstract: This work presents a novel non-parametric Bayesian formulation for aligning faces in unseen images. Popular approaches, such as the Constrained Local Models (CLM) or the Active Shape Models (ASM), perform facial alignment through a local search, combining an ensemble of detectors with a global optimization strategy that constrains the facial feature points to be within the subspace spanned by a Point Distribution Model (PDM). The global optimization can be posed as a Bayesian inference problem, looking to maximize the posterior distribution of the PDM parameters in a maximum a posteriori (MAP) sense. Previous approaches rely exclusively on Gaussian inference techniques, i.e. both the likelihood (detector responses) and the prior (PDM) are Gaussians, resulting in a posterior which is also Gaussian, whereas in this work the posterior distribution is modeled as being non-parametric by a Kernel Density Estimator (KDE). We show that this posterior distribution can be efficiently inferred using Sequential Monte Carlo methods, in particular using a Regularized Particle Filter (RPF). The technique is evaluated in detail on several standard datasets (IMM, BioID, XM2VTS, LFW and FGNET Talking Face) and compared against state-of-the-art CLM methods. We demonstrate that inferring the PDM parameters non-parametrically significantly increases the face alignment performance.
|
#441 - Feature-Independent Action Spotting Without Human Localization, Segmentation or Frame-wise Tracking [pdf]
Chuan Sun, Hassan Foroosh |
Abstract: In this paper, we propose an unsupervised framework for action spotting in videos that does not depend on any specific feature (e.g. HOG/HOF, STIP, silhouette, bag-of-words, etc.). Furthermore, our solution requires no human localization, segmentation, or framewise tracking. This is achieved by treating the problem holistically as that of extracting the internal dynamics of video cuboids by modeling them in their natural form as multilinear tensors. To extract their internal dynamics, we devised a novel Two-Phase Decomposition (TP-Decomp) of a tensor that generates very compact and discriminative representations that are robust to even heavily perturbed data. Technically, a Rank-based Tensor Core Pyramid (Rank-TCP) descriptor is generated by combining multiple tensor cores under multiple ranks, allowing video cuboids to be represented in a hierarchical tensor pyramid. The problem then reduces to a template matching problem, which is solved efficiently by using two boosting strategies: (1) to reduce the search space, we filter the dense trajectory cloud extracted from the target video; (2) to boost the matching speed, we perform matching in an iterative coarse-to-fine manner. Experiments on 5 benchmarks show that our method outperforms the current state-of-the-art under various challenging conditions. We also created a challenging dataset called Heavily Perturbed Video Array (HPVA) to validate the robustness of our framework under heavily perturbed situations.
|
#446 - A Multigraph Representation for Improved Unsupervised/Semi-supervised Learning of Human Actions [pdf]
Simon Jones, Ling Shao |
Abstract: Graph-based methods are a useful class of methods for improving the performance of unsupervised and semi-supervised machine learning tasks, such as clustering or information retrieval. However, the performance of such methods is highly dependent on how well the affinity graph reflects the original data structure. We propose that multimedia such as images or videos consist of multiple separate components, and therefore more than one graph is required to fully capture the relationship between them. Accordingly, we present a new spectral method -- the Feature Grouped Spectral Multigraph (FGSM) -- which comprises the following steps. First, mutually independent subsets of the original feature space are generated through feature clustering. Secondly, a separate graph is generated from each feature subset. Finally, a spectral embedding is calculated on each graph, and the embeddings are scaled/aggregated into a single representation. Using this representation, a variety of experiments are performed on three learning tasks -- clustering, retrieval and recognition -- on human action datasets, demonstrating considerably better performance than the state-of-the-art.
|
#447 - Semi-supervised Relational Topic Model for Weakly Annotated Image Recognition in Social Media [pdf]
Zhenxing Niu, Gang Hua, Xinbo Gao, Qi Tian |
Abstract: In this paper, we address the problem of recognizing images with weakly annotated text tags. Most previous work either cannot be applied to scenarios where the tags are loosely related to the images, or simply takes a pre-fusion at the feature level or a post-fusion at the decision level to combine the visual and textual content. Instead, we first encode the text tags as the relations among the images, and then propose a semi-supervised relational topic model (ss-RTM) to explicitly model the image content and their relations. In this way, we can efficiently leverage the loosely related tags, and build an intermediate level representation for a collection of weakly annotated images. The intermediate level representation can be regarded as a mid-level fusion of the visual and textual content, which is able to explicitly model their intrinsic relationships. Moreover, image category labels are also modeled in the ss-RTM, and recognition can be conducted without training an additional discriminative classifier. Our extensive experiments on social multimedia datasets (images+tags) demonstrate the advantages of the proposed model.
|
#450 - Learning-Based Atlas Selection for Multiple-Atlas Segmentation [pdf]
Gerard Sanroma, Guorong Wu, Yaozong Gao, Dinggang Shen |
Abstract: Recently, multi-atlas segmentation (MAS) has achieved a great success in the medical imaging area. The key assumption of MAS is that multiple atlases encompass richer anatomical variability than a single atlas. Therefore, we can label the target image more accurately by mapping the label information from the appropriate atlas images that have the most similar structures. The problem of atlas selection, however, still remains unexplored. Current state-of-the-art MAS methods rely on image similarity to select a set of atlases. Unfortunately, this heuristic criterion is not necessarily related to segmentation performance and, thus may undermine segmentation results. To solve this simple but critical problem, we propose a learning-based atlas selection method to pick up the best atlases that would eventually lead to more accurate image segmentation. Our idea is to learn the relationship between the pairwise appearance of observed instances (a pair of atlas and target images) and their final labeling performance (in terms of Dice ratio). In this way, we can select the best atlases according to their expected labeling accuracy. It is worth noting that our atlas selection method is general enough to be integrated with existing MAS methods. As is shown in the experiments, we achieve significant improvement after we integrate our method with 3 widely used MAS methods on ADNI and LONI LPBA40 datasets.
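The Dice ratio used above as the labeling performance measure can be computed as in the following minimal sketch for binary label masks given as flat sequences. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_ratio(predicted, reference):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks.

    Masks are flat sequences of truthy/falsy labels of equal length.
    Returns 1.0 when both masks are empty (a convention assumed here).
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0
    return 2.0 * overlap / total
```

A value of 1.0 means the two segmentations coincide exactly and 0.0 means they do not overlap at all, which is why it can serve as a regression target when ranking candidate atlases.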
|
#462 - Modeling Image Patches with a Generic Dictionary of Mini-Epitomes [pdf]
George Papandreou, Liang-Chieh Chen, Alan Yuille |
Abstract: The goal of this paper is to question the necessity of features like SIFT in categorical visual recognition tasks. As an alternative, we develop a generative model for the raw intensity of image patches and show that it can support image classification performance on par with optimized SIFT-based techniques in a bag-of-visual-words setting. A key ingredient of the proposed model is a compact dictionary of mini-epitomes, learned in an unsupervised fashion on a large collection of images. The use of epitomes allows us to explicitly account for photometric and position variability in image appearance. We show that this flexibility considerably increases the capacity of the dictionary to accurately approximate the appearance of image patches and support recognition tasks. For image classification, we develop histogram-based image encoding methods tailored to the epitomic representation, as well as an ``epitomic footprint'' encoding which is easy to visualize and highlights the generative nature of our model. We discuss in detail computational aspects and develop efficient algorithms to make the model scalable to large tasks. The proposed techniques are evaluated with experiments on the challenging PASCAL VOC-07 image classification benchmark.
|
#464 - Model Transport: Towards Scalable Transfer Learning on Manifolds [pdf]
Oren Freifeld, Soren Hauberg, Michael Black |
Abstract: We consider the intersection of two research fields: \emph{transfer learning} and \emph{statistics on manifolds}. In particular, we consider, for manifold-valued data, transfer-learning of tangent-space models such as Gaussian distributions, PCA, regression, or classifiers. Though one would hope to simply use ordinary $\mathbb{R}^n$-transfer-learning ideas, the manifold structure prevents it. We overcome this by basing our method on (inner-product-preserving) \emph{parallel transport}, a well-known tool used in other problems of statistics on manifolds in computer vision. At first, this straightforward idea seems to suffer from an obvious shortcoming: Transporting large datasets is prohibitively expensive, hindering the scalability of the approach. Fortunately, with our approach, \emph{we never transport data}. Rather, we show how the \emph{statistical models} themselves can be transported, and prove that for the above tangent-space models the transport ``commutes'' with learning. Consequently, our compact framework, applicable to a large class of manifolds, is not restricted by the size of either the training or test sets. We demonstrate the approach by transferring PCA and regression models of real-world data involving 3D shapes and image descriptors.
|
#471 - Image Preconditioning: Balancing Contrast and Ringing [pdf]
Yu Ji, Jinwei Ye, Sing Bing Kang, Jingyi Yu |
Abstract: The goal of image preconditioning is to process an image such that, after being convolved with a known kernel, it appears close to the sharp reference image. In a practical setting, the preconditioned image has significantly higher dynamic range than the latent image. As a result, some form of tone mapping is needed. In this paper, we show how global tone mapping functions affect contrast and ringing in image preconditioning. In particular, we show that linear tone mapping eliminates ringing but incurs severe contrast loss, while non-linear tone mapping functions such as Gamma curves slightly enhance contrast but introduce ringing. To enable quantitative analysis, we design new metrics to measure the contrast of an image with ringing. Specifically, we set out to find its "equivalent ringing-free" image that matches its intensity histogram and use its contrast as the measure. We illustrate our approach on projector defocus compensation and visual acuity enhancement. Compared with the state of the art, our approach significantly improves the contrast. We believe our technique is the first to analytically trade off between contrast and ringing.
|
#475 - Complex Non-Rigid Motion 3D Reconstruction by Union of Subspaces [pdf]
Yingying Zhu, Dong Huang, Fernando de la Torre, Simon Lucey |
Abstract: With the increasing need of the human action/behaviour analysis community to recover complex 3D nonrigid motion (e.g. multiple actions and human-object/human-human interaction) from 2D projections in image sequences, the existing approaches for Non-Rigid Structure from Motion (NRSfM) have met the grand challenge of complex 3D nonrigid motion reconstruction. The standard NRSfM models nonrigid motion by a single low rank subspace [7], while the literature shows that complex nonrigid motion (multiple human actions) stems from a union of subspaces [11, 6, 13]. Solving complex 3D motion in a single subspace, one can only approximate the union of subspaces by its convex envelope and therefore produces random combinations of the original 3D actions within the envelope. An ideal solution is to cluster the 3D motion into local motion subspaces, and apply the standard NRSfM in each subspace. However, 3D motion is not available in the first place, and clustering 2D projections does not produce 3D subspaces due to the projection ambiguities and relative camera motion. To address this dilemma, we propose to directly solve for complex 3D nonrigid motion that resides in a union of subspaces. By simultaneously solving for NRSfM and subspace clustering, our approach registers the 2D observations in a union of subspaces automatically grouped by the reconstructed 3D motion. Experiments on both synthetic and real videos illustrate the benefits of our approach for the comple
|
Abstract: In this work, we propose a new framework for recognizing RGB images captured by the conventional cameras by leveraging a set of labeled RGB-D data, in which the depth features can be additionally extracted from the depth images. We formulate this task as a new unsupervised domain adaptation (UDA) problem, in which we aim to take advantage of the additional depth features in the source domain and also cope with the data distribution mismatch between the source and target domains. To effectively utilize the additional depth features, we seek two optimal projection matrices to map the samples from both domains into a common space by preserving as much as possible correlations between the visual features and depth features. To effectively employ the training samples from the source domain for learning the target classifier, we reduce the data distribution mismatch by minimizing the Maximum Mean Discrepancy (MMD) criterion, which compares the data distributions for each type of feature in the common space. Based on the above two motivations, we propose a new SVM based objective function to simultaneously learn the two projection matrices and the optimal target classifier in order to well separate the source samples from different classes when using each type of feature in the common space. An efficient alternating optimization algorithm is developed to solve our new objective function. Comprehensive experiments for object recognition and gender recognition demonstrate the effectiv
|
#481 - Cross-view Action Modeling, Learning and Recognition [pdf]
Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, Song Chun Zhu |
Abstract: Existing methods on video-based action recognition are generally view-dependent, i.e., performing recognition from the same views seen in the training data. We present a novel multiview spatio-temporal AND-OR graph (MST-AOG) representation for cross-view action recognition, i.e., the recognition is performed on the video from an unknown and unseen view. As a compositional model, MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling the geometry, appearance and motion variations. This paper proposes effective methods to learn the structure and parameters of MST-AOG. The inference based on MST-AOG enables action recognition from novel views. The training of MST-AOG takes advantage of the 3D human skeleton data obtained from Kinect cameras to avoid annotating enormous multi-view video frames, which is error-prone and time-consuming, but the recognition does not need 3D information and is based on 2D video input. A new Multi-view Action3D dataset has been created and will be released. Extensive experiments have demonstrated that this new action representation significantly improves the accuracy and robustness for cross-view action recognition on 2D videos.
|
Abstract: We describe a new approach for generating regular-speed, low-frame-rate (LFR) video from a high-frame-rate (HFR) input while preserving the important moments in the original. We call this time-mapping, a time-based analogy to high dynamic range to low dynamic range spatial tone-mapping. Our approach makes these contributions: (1) a robust space-time saliency method for evaluating visual importance, (2) a re-timing technique to temporally resample based on frame importance, and (3) temporal filters to enhance the rendering of salient motion. Results of our space-time saliency method on a benchmark dataset show it is state-of-the-art. In addition, the benefits of our approach to HFR-to-LFR time-mapping over more direct methods are demonstrated in a user study.
|
#485 - Stereo under Sequential Optimal Sampling: A Statistical Analysis Framework for Search Space Reduction [pdf]
Yilin Wang, Jan-Michael Frahm, Enrique Dunn, Ke Wang |
Abstract: We develop a sequential optimal sampling framework for stereo disparity estimation by adapting the Sequential Probability Ratio Test (SPRT) model. The proposed framework operates over local image neighborhoods by iteratively estimating single pixel disparity values until sufficient evidence has been gathered to either validate or contradict the current hypothesis regarding local scene structure. The output of our sampling within a given region is a set of sampled pixel positions along with a robust and compact estimate of the set of disparities contained within that region. The attainment of such disparity set enables the effective reduction of the disparity search space for all remaining non-sampled pixels. Accordingly, our sampling framework is a general pre-processing mechanism aimed at reducing computational complexity of disparity search algorithms. We build upon this framework to propose an efficient plane propagation mechanism that leverages the pre-computed sampling positions and the local structure model described by the local disparity set. Our experiments demonstrate the effectiveness and efficiency of the proposed approach when compared to recent state of the art.
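The abstract above builds on the Sequential Probability Ratio Test (SPRT). As a rough illustration of the underlying test only (not of the paper's stereo framework), here is a minimal sketch of Wald's SPRT for Bernoulli observations. The function name `sprt`, the hypothesis rates `p0` and `p1`, and the error rates `alpha` and `beta` are assumptions chosen for this example.

```python
import math


def sprt(samples, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for 0/1 observations (illustrative sketch).

    H0: success probability is p0; H1: success probability is p1.
    Returns (decision, number_of_samples_consumed).
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(samples, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    # Evidence ran out before either threshold was reached.
    return "undecided", n
```

The test stops as soon as the accumulated evidence crosses a threshold, which is what lets a sequential scheme examine only as many samples as the data require.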
|
#486 - Empirical Minimum Bayes Risk Prediction: How to extract an extra 3% performance from vision models with just two more parameters [pdf]
Vittal Premachandran, Daniel Tarlow, Dhruv Batra |
Abstract: When building vision systems to predict structured objects like image segmentations or human pose, we are often concerned with performing well under a task-specific evaluation measure. An ongoing research challenge is how to make predictions so as to maximize performance on these evaluation measures. In this work, we present a simple meta-algorithm that is surprisingly effective. The algorithm takes as input a model that would normally be the final product, and learns two parameters so as to optimize performance on the task-specific measure. We demonstrate the approach in several domains, taking existing state-of-the-art algorithms and improving performance by up to 5%, simply with two extra parameters.
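For readers unfamiliar with minimum Bayes risk (MBR) prediction, the following is a generic sketch of the idea over a finite candidate set. It is not the paper's meta-algorithm: the abstract does not say which two parameters are learned, so the `temperature` parameter, the softmax weighting of model scores, and the function name `mbr_predict` are all assumptions made for illustration.

```python
import math


def mbr_predict(candidates, scores, loss, temperature=1.0):
    """Pick the candidate with the lowest expected loss (illustrative sketch).

    Scores are turned into probabilities with a softmax; each candidate's
    risk is its expected loss against all candidates under those
    probabilities.
    """
    scaled = [s / temperature for s in scores]
    top = max(scaled)                                  # for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    risks = [
        sum(p * loss(cand, other) for other, p in zip(candidates, probs))
        for cand in candidates
    ]
    best = min(range(len(candidates)), key=lambda i: risks[i])
    return candidates[best], risks


# The highest-scoring candidate (0) is far from the other two, so the
# MBR choice is 10, which is close to most of the probability mass.
prediction, risks = mbr_predict([0, 10, 11], [1.0, 0.9, 0.9],
                                loss=lambda a, b: abs(a - b))
```

The usage lines show the point of the approach: the top-scoring output need not be the one that minimizes the expected task loss.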
|
#497 - Matrix-Similarity Based Loss Function and Feature Selection for Alzheimer's Disease Diagnosis [pdf]
Xiaofeng Zhu, Heung-Il Suk, Dinggang Shen |
Abstract: Recent studies on AD and/or MCI diagnosis have shown that the tasks of identifying brain disease and predicting clinical scores are highly related to each other. Furthermore, it has been shown that feature selection with a manifold learning or a sparse model can handle the problems of high feature dimensionality and small sample size. However, the tasks of clinical score regression and clinical label classification were often conducted separately in the previous studies. Regarding the feature selection, to the best of our knowledge, most of the previous work considered a loss function defined as an element-wise difference between the target values and the predicted ones. In this paper, we consider the problems of joint regression and classification for AD/MCI diagnosis and propose a novel matrix-similarity based loss function that uses high-level information inherent in the target response matrix and imposes the information to be preserved in the predicted response matrix. The newly devised loss function is combined with a group lasso method for joint feature selection across tasks, i.e., prediction of clinical scores and a class label. In order to validate the effectiveness of the proposed method, we conducted experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and showed that the newly devised loss function helps enhance the performance of both clinical score prediction and disease status identification, outperforming the state-of-the-art methods.
|
#498 - Active Frame, Location, and Detector Selection for Automated and Manual Video Annotation [pdf]
Vasiliy Karasev, Avinash Ravichandran, Stefano Soatto |
Abstract: We describe an information-driven active selection approach to determine which detectors to deploy at which location in which frame of a video shot to minimize semantic class label uncertainty at every pixel, with the smallest computational cost that ensures a given uncertainty bound. We show minimal performance reduction compared to a "paragon" algorithm running all detectors at all locations in all frames, at a small fraction of the computational cost. Our method can handle uncertainty in the labeling mechanism, so it can handle both "oracles" (manual annotation) and noisy detectors (automated annotation).
|
#508 - Saliency Optimization from Robust Background Detection [pdf]
Wangjiang Zhu, Shuang Liang, Yichen Wei, Jian Sun |
Abstract: Recent progress in salient object detection has exploited the boundary prior, or background information, to assist other saliency cues such as contrast and achieve state-of-the-art results. However, the usage of the boundary prior is still simple and fragile, and the integration with other cues is mostly heuristic. In this work, we present new methods to address these issues. Firstly, we propose a robust background measure, called boundary connectivity. It characterizes the spatial layout of image regions with respect to image boundaries and is much more robust. It has an intuitive geometrical interpretation and provides unique benefits that are absent in previous saliency measures. Secondly, we propose a principled optimization framework to integrate multiple low level cues, including our background measure, to obtain clean and uniform saliency maps. Our formulation is intuitive, efficient and obtains state-of-the-art results on several benchmark datasets.
|
#514 - Robust Refinement of GPS-Tags Using Random Walks with an Adaptive Damping Factor [pdf]
Amir Roshan Zamir |
Abstract: The number of GPS-tagged images available on the web is increasing at a rapid rate. The majority of such location tags are specified by the users, either through manual tagging or localization chips embedded in the cameras. However, a known issue with user-shared images is the unreliability of such GPS-tags; in this paper, we propose a method for addressing this problem. We assume that a large dataset of GPS-tagged images, which includes an unknown subset with contaminated tags, is available. We develop a robust method for identification and refinement of the subset with contaminated tags using the rest of the images in the dataset. In the proposed method, we form triplets of matching images and use them for estimating the location of the query image utilizing structure from motion. We generate a large number of such estimations, which include inaccurate ones due to the noisy GPS-tags in the dataset, and perform random walks on them in order to identify the subset with the maximal agreement. Finally, we refine the GPS-tag of the image utilizing the identified consistent subset using a weighted mean. We propose a new damping factor for random walks which adapts itself to various levels of noise in the input. We evaluated the proposed framework on a dataset of over 18k user-shared images; the experiments show it robustly and consistently improves the accuracy of GPS-tags under diverse scenarios.
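To make the random-walk step more concrete, here is a small sketch of a damped random walk over a similarity graph of location estimates, followed by a weighted mean. It uses a fixed damping factor in the style of PageRank; the adaptive damping factor proposed in the paper is not reproduced here. The function names, the similarity matrix, and the treatment of coordinates as plain 2D points are assumptions for illustration.

```python
def random_walk_scores(similarity, damping=0.85, iters=100):
    """Stationary scores of a damped random walk (illustrative sketch).

    similarity[i][j] >= 0 measures agreement between estimates i and j.
    Estimates that agree with many others end up with higher scores.
    """
    n = len(similarity)
    transition = []
    for row in similarity:
        total = sum(row)
        # Rows with no agreement at all jump uniformly at random.
        transition.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def weighted_mean(points, weights):
    """Weighted mean of equally sized coordinate tuples."""
    total = sum(weights)
    dims = len(points[0])
    return tuple(
        sum(w * p[k] for p, w in zip(points, weights)) / total for k in range(dims)
    )


# Three mutually consistent estimates and one isolated outlier (index 3).
sim = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
scores = random_walk_scores(sim)
```

In this toy graph the isolated estimate receives the lowest score, so it contributes least when the scores are used as weights in `weighted_mean`.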
|
#515 - Color Transfer using Probabilistic Moving Least Squares [pdf]
Youngbae Hwang, Joon-Young Lee, In So Kweon, Seon Joo Kim |
Abstract: This paper introduces a new method for color transfer, the process of transferring the color of an image to match the color of another image of the same scene. The color of a scene may vary from image to image because the photographs are taken at different times, with different cameras, and under different camera settings. To solve for a full nonlinear and nonparametric color mapping in the 3D RGB color space, we propose a scattered point interpolation scheme using moving least squares and strengthen it with a probabilistic modeling of the color transfer in the 3D color space to deal with misalignments and noise. Experiments show the effectiveness of our method over previous color transfer works both quantitatively and qualitatively. In addition, our framework can be applied to various instances of color transfer such as transferring color between different camera models, camera settings, and illumination conditions, as well as to video color transfer.
|
#517 - Single-View 3D Scene Parsing by Attributed Grammar [pdf]
Xiaobai Liu, Yibiao Zhao, Song Chun Zhu |
Abstract: In this paper, we present an attributed grammar for parsing man-made outdoor scenes into semantic surfaces, and recovering their 3D model simultaneously. The grammar takes superpixels as its terminal nodes and uses five production rules to generate the scene into a hierarchical parse graph. Each graph node correlates with a surface or a composite of surfaces in the 3D world or the 2D image. They are described by attributes for the global scene model, e.g. focal length, vanishing points, or the surface properties, e.g. surface normal, contact line with other surfaces, and relative spatial location etc. Each production rule is associated with some equations that constrain the attributes of the parent nodes and those of their children nodes. Given an input image, our goal is to construct a hierarchical parse graph by recursively applying the five grammar rules while preserving the attribute constraints. We develop an effective top-down/bottom-up cluster sampling procedure which can explore this constrained space efficiently. We evaluate our method on both public benchmarks and newly built datasets, and achieve state-of-the-art performance in terms of layout estimation and region segmentation. We also demonstrate that our method is able to recover a detailed 3D model with relaxed Manhattan structures, which clearly advances the state of the art of single-view 3D reconstruction.
|
#518 - Calibrating a non-isotropic near point light source using a plane [pdf]
Jaesik Park, Sudipta Sinha, Yasuyuki Matsushita, Yu-Wing Tai, In So Kweon |
Abstract: We show that a non-isotropic near point light source rigidly attached to a camera can be calibrated using multiple images of a weakly textured planar scene. We prove that if the radiant intensity distribution (RID) of a light source is radially symmetric with respect to its dominant direction, then the shading observed on a Lambertian scene plane is bilaterally symmetric with respect to a 2D line on the plane. The symmetry axis detected in an image provides a linear constraint for estimating the dominant light axis. The light position and RID parameters can then be estimated using a linear method as well. Specular highlights if available can also be used for light position estimation. We also extend our method to handle non-Lambertian surfaces which we model using biquadratic BRDFs. We have evaluated our method on synthetic data. Our experiments on real scenes show that our method works well in practice and enables light calibration without the need for specialized hardware.
|
#519 - Exploiting Shading Cues in Kinect IR Images for Geometry Refinement [pdf]
Gyeongmin Choe, Jaesik Park, Yu-Wing Tai, In So Kweon |
Abstract: In this paper, we propose a method to refine the geometry of 3D meshes from Kinect fusion by exploiting shading cues captured from the infrared (IR) camera of Kinect. A major benefit of using the Kinect IR camera instead of an RGB camera is that the IR images captured by Kinect are narrow-band images that filter out most undesired ambient light, which makes our system robust to natural indoor illumination. We define a near light IR shading model which describes the captured intensity as a function of surface normals, albedo, lighting direction, and distance between light source and surface points. To resolve ambiguity in our model between normals and distance, we utilize an initial 3D mesh from Kinect fusion and multi-view information to reliably estimate surface details that were not reconstructed by Kinect fusion. Our approach operates directly on the mesh model for geometry refinement. The effectiveness of our approach is demonstrated through several challenging real-world examples.
|
Abstract: Given two images, we want to predict which exhibits a particular visual attribute more than the other, even when the two images are quite similar. Existing relative attribute methods rely on global ranking functions; yet rarely will the visual cues relevant to a comparison be constant for all data, nor will humans' perception of the attribute necessarily permit a global ordering. To address these issues, we propose a local learning approach for fine-grained visual comparisons. Given a novel pair of images, we learn a local ranking model on the fly, using only analogous training comparisons. We show how to identify these analogous pairs using learned metrics. With results on three challenging datasets, including a large newly curated dataset for fine-grained comparisons, our method outperforms state-of-the-art methods for relative attribute prediction.
|
#525 - Compact Representation for Image Classification: To Choose or to Compress? [pdf]
Yu Zhang, Jianxin Wu, Jianfei Cai |
Abstract: In large scale image classification, features such as Fisher vector or VLAD have achieved state-of-the-art results. However, the combination of large number of examples and high dimensional vectors necessitates dimensionality reduction, in order to reduce its storage and CPU costs to a reasonable range. In spite of the popularity of various feature compression methods, this paper argues that feature selection is a better choice than feature compression. We show that strong multicollinearity among feature dimensions may not exist, which undermines feature compression's effectiveness and renders feature selection a natural choice. We also show that many dimensions are noise and throwing them away is helpful for classification. We propose a supervised mutual information (MI) based importance sorting algorithm to choose features. Combining with 1-bit quantization, MI feature selection has achieved both higher accuracy and less computational cost than state-of-the-art feature compression methods such as product quantization and BPBC.
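As a rough illustration of ranking feature dimensions by mutual information after 1-bit quantization, consider the sketch below. It is not the paper's algorithm: the zero threshold used for binarization, the order of operations (quantize first, then score), and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def binarize(value, threshold=0.0):
    """1-bit quantization: keep only whether the value exceeds a threshold."""
    return 1 if value > threshold else 0


def mutual_information(bits, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(bits)
    joint = Counter(zip(bits, labels))
    bit_counts = Counter(bits)
    label_counts = Counter(labels)
    mi = 0.0
    for (b, y), count in joint.items():
        p_joint = count / n
        p_indep = (bit_counts[b] / n) * (label_counts[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def select_features(X, y, k, threshold=0.0):
    """Return indices of the k dimensions with the highest MI, plus all scores."""
    dims = range(len(X[0]))
    scores = [
        mutual_information([binarize(row[d], threshold) for row in X], y)
        for d in dims
    ]
    order = sorted(dims, key=lambda d: scores[d], reverse=True)
    return order[:k], scores


# Dimension 0 determines the label by its sign; dimension 1 is unrelated to it.
X = [[1.0, 0.3], [2.0, -0.2], [-1.0, 0.5], [-3.0, -0.1]]
y = [1, 1, 0, 0]
chosen, scores = select_features(X, y, k=1)
```

Unlike compression, selection of this kind simply discards low-scoring dimensions, so the kept dimensions retain their original meaning.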
|
#535 - Alert: Predicting Failures [pdf]
Peng Zhang, Jiuling Wang, Ali Farhadi, Martial Hebert, Devi Parikh |
Abstract: In real applications, not only is it important for computer vision systems to fail infrequently, it is also important for them to fail gracefully (e.g. with some warning). While the former has been the primary focus of the community, in this work, we hope to draw the community's attention to the latter problem. We introduce ALERT: a straightforward and general system that can predict the likely accuracy (or failure) of any computer vision system on an input instance. We promote two metrics to evaluate such failure prediction systems. We show that ALERT fares surprisingly well at these metrics on a variety of applications such as semantic segmentation, vanishing point and camera parameter estimation, and image memorability prediction. We also explore attribute prediction, where classifiers are typically meant to generalize to new unseen categories. We show that ALERT can be useful in predicting failures of this transfer. Finally, we leverage ALERT to improve the performance of a downstream application of attribute prediction: zero-shot learning. We show that ALERT can outperform several strong baselines for zero-shot learning on four datasets.
|
#542 - Surface Registration by Optimization in Constrained Diffeomorphism Space [pdf]
Wei Zeng, Lok Ming Lui, Xianfeng Gu |
Abstract: This work proposes a novel framework for optimization in the constrained diffeomorphism space for deformable surface registration. The registration is formulated as an optimization problem in a constrained diffeomorphism space. First the diffeomorphism space is modeled as a special complex functional space on the source surface, the Beltrami coefficient space. The landmark constraints and the physical feasibility constraints define subspaces in the Beltrami coefficient space. Then the harmonic energy of the registration is minimized in the constrained subspaces. The minimization is achieved by alternating the optimization step and the projection step. The optimization step is to diffuse the Beltrami coefficient, and the projection step first deforms the conformal structure by the current Beltrami coefficient, then composes with a harmonic map from the deformed conformal structure to the target. The registration result is diffeomorphic, guarantees the landmark constraints, satisfies the physical constraints, and minimizes the conformality distortion.
|
#553 - Low-Cost Compressive Sensing for Color Video and Depth [pdf]
Xin Yuan, Patrick Llull, Xuejun Liao, Jianbo Yang, David Brady, Guillermo Sapiro, Lawrence Carin |
Abstract: A simple and inexpensive (low-power and low-bandwidth) modification is made to a conventional off-the-shelf color video camera, from which we recover multiple color frames for each of the original measured frames, and each of the recovered frames can be focused at a different depth. The recovery of multiple frames for each measured frame is made possible via high-speed coding, manifested via translation of a single coded aperture; the inexpensive translation is constituted by mounting the binary code on a piezoelectric device. To simultaneously recover depth information, a liquid lens is modulated at high speed, via a variable voltage. Consequently, during the aforementioned coding process, the liquid lens allows the camera to sweep the focus through multiple depths. In addition to designing and implementing the camera, fast recovery is achieved by an anytime algorithm exploiting the group-sparsity of wavelet/DCT coefficients.
|
#557 - Deformable Object Matching via Deformation Decomposition based 2D Label MRF [pdf]
Kangwei Liu, Junge Zhang, Kaiqi Huang, Tieniu Tan |
Abstract: Deformable object matching, which is also called elastic matching or deformation matching, is an important and challenging problem in computer vision. Although numerous deformation models have been proposed in different tasks, not many of them investigate the intrinsic physics underlying deformation. Due to the lack of physical analysis, these models cannot describe the structure changes of deformable objects very well. Motivated by this, we analyze the deformation physically and propose a novel deformation decomposition model to represent various deformations. Based on the physical model, we formulate the matching problem as a two-dimensional label Markov Random Field. The MRF energy function is derived from the deformation decomposition model. Furthermore, we propose a two-stage method to optimize the MRF energy function. To provide a quantitative benchmark, we build a deformation matching database with an evaluation criterion. Experimental results show that our method outperforms previous approaches especially on complex deformations.
|
#558 - Similarity-Aware Patchwork Assembly for Depth Image Super-Resolution [pdf]
Jing Li, Zhichao Lu, Gang Zeng, Hongbin Zha |
Abstract: This paper describes a patchwork assembly algorithm for depth image super-resolution. An input low resolution depth image is disassembled into parts by matching similar regions on a set of high resolution training images, and a super-resolution image is then assembled using these corresponding matched counterparts. We convert the super-resolution problem into a Markov random field labeling problem, and propose a unified formulation embedding (1) the consistency between the resolution enhanced image and the original input, (2) the similarity of disassembled parts with the corresponding regions on training images, (3) the depth smoothness in local neighborhoods, (4) the additional geometric constraints from self-similar structures in the scene, and (5) the boundary coincidence between the resolution enhanced depth image and an optional aligned high resolution intensity image. Experimental results on both synthetic and real-world data demonstrate that the proposed algorithm is capable of recovering high quality depth images with ×4 resolution enhancement along each coordinate direction, and that it outperforms the state of the art [14] in both qualitative and quantitative evaluations.
|
Abstract: We present a nonrigid shape matching technique for establishing correspondences of incomplete 3D surfaces that exhibit intrinsic reflectional symmetry. We formulate the shape matching problem as a quadratic assignment problem (QAP) which incorporates point-wise and pairwise matching constraints. The key to solving the symmetry ambiguity problem is to define a point-wise constraint from a local descriptor that is sensitive to local asymmetry such that we can discriminate global symmetry pairs, e.g. the left hand and the right hand. The proposed descriptor is based on a local depth map whose view-up direction is aligned with the gradient of a scalar field computed on a surface. Because this scalar field is smooth and isometry-invariant, the proposed descriptor is robust to isometric deformations as well as local geometric changes. Incompleteness of input surfaces is handled by constructing a pairwise constraint using the diffusion distance. Since we use a binary representation for a pairwise affinity, our technique is also robust to non-isometric deformations. To solve the QAP efficiently, we propose a graph matching algorithm called iterative spectral relaxation which combines spectral embedding and spectral graph matching. The benefit of this algorithm is its near global convergence, while retaining efficiency. Experimental results show that our method can match a wide range of models and achieve a comparable result with other state-of-the-art techniques on a surface correspond
|
Abstract: State-of-the-art patch-based image representations involve a pooling operation that aggregates statistics computed from local descriptors. Standard pooling operations include average and max pooling. Average pooling lacks discriminability because the resulting representation is strongly influenced by frequent yet often uninformative descriptors, but only weakly influenced by rare yet potentially highly-informative ones. Max pooling equalizes the influence of frequent and rare descriptors but is only applicable to representations that rely on count statistics, such as the bag-of-visual-words (BOV). We propose a novel pooling mechanism that involves re-weighting the per-patch statistics. It achieves the same equalization effect as max pooling but is applicable beyond the BOV and especially to the state-of-the-art Fisher Vector -- hence the name Generalized Max Pooling (GMP). We show on five public image classification benchmarks that the proposed GMP performs on par with, and sometimes significantly better than, heuristic alternatives.
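The contrast between average and max pooling described above can be seen on a toy bag-of-visual-words example. The sketch below covers only the two standard operators, not the proposed GMP; the data and function names are made up for illustration.

```python
def average_pool(encodings):
    """Element-wise mean over per-patch encodings."""
    n = len(encodings)
    return [sum(column) / n for column in zip(*encodings)]


def max_pool(encodings):
    """Element-wise maximum over per-patch encodings."""
    return [max(column) for column in zip(*encodings)]


# Nine patches are assigned to visual word 0 (frequent, possibly
# uninformative); a single patch is assigned to visual word 1 (rare).
patches = [[1, 0]] * 9 + [[0, 1]]

avg = average_pool(patches)  # the frequent word dominates
mx = max_pool(patches)       # both words contribute equally
```

With average pooling the rare word contributes a ninth of what the frequent word does, while max pooling gives both the same weight, which is the equalization effect the abstract refers to.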
|
#583 - The Synthesizability of texture examples [pdf]
Dengxin Dai, Hayko Riemenschneider, Luc Van Gool |
Abstract: While example-based texture synthesis (ETS) has been widely used to generate impressive high quality textures of desired size, not all images are equally good as the examples. In this paper we investigate the problem of predicting the synthesizability of a given image, i.e., how synthesizable it is by ETS. We introduce a database (32,000 texture samples) of which all images have been annotated in terms of their synthesizability. We design a set of texture features, such as homogeneity, repetitiveness, and regularity, and train a predictor using these features on the data collection. This work is the first attempt to quantify this image property, and we find that the synthesizability of images can be learned and predicted. In experiments, we verify the effectiveness of several designed features, and verify the usefulness of image synthesizability for multiple applications: performing an initial selection of examples for large-scale texture synthesis, trimming images to parts that are more synthesizable, and serving as a feature for image recognition. Also, we suggest which texture synthesis method is best suited for synthesis of the given image.
|
#585 - Unsupervised Learning of Dictionaries of Hierarchical Compositional Models [pdf]
Jifeng Dai, Yi Hong, Wenze Hu, Ying Nian Wu |
Abstract: This paper proposes an unsupervised method for learning dictionaries of hierarchical compositional models for representing natural images. Each model is in the form of a template that consists of a small group of part templates that are allowed to shift their locations and orientations relative to each other, and each part template is in turn a composition of Gabor wavelets that are also allowed to shift their locations and orientations relative to each other. Given a set of unannotated training images, a dictionary of such hierarchical templates is learned so that each training image can be represented by a small number of templates that are spatially translated, rotated and scaled versions of the templates in the learned dictionary. The learning algorithm iterates between the following two steps: (1) image encoding by a template matching pursuit process that involves a bottom-up template matching sub-process and a top-down template localization sub-process; (2) dictionary re-learning by a shared matching pursuit process. Experimental results show that the proposed approach is capable of learning meaningful templates, and that the learned templates are useful for tasks such as domain adaptation and image cosegmentation.
|
Abstract: The use of wearable cameras makes it possible to record life-logging egocentric videos. Browsing such long unstructured videos is time consuming and tedious. Segmentation into meaningful chapters is an important first step towards adding structure to egocentric videos, enabling efficient browsing, indexing and summarization of the long videos. Two sources of information for video segmentation are (i) the motion of the camera wearer, and (ii) the objects and activities recorded in the video. In this paper we address the motion cues for video segmentation. Motion-based segmentation is especially difficult in egocentric videos when the camera is constantly moving due to natural head movement of the wearer. We propose a robust temporal segmentation of egocentric videos into a hierarchy of motion classes using a new Array of Motion Integrators. Unlike instantaneous motion vectors, segmentation using integrated motion vectors performs well even in dynamic and crowded scenes. No assumptions are made on the underlying scene structure, and the algorithm works in indoor as well as outdoor situations. We demonstrate the effectiveness of our approach using publicly available videos as well as choreographed videos. An approach is also presented to compute the fixation of the wearer's gaze in the walking portion of the egocentric videos.
|
#591 - Efficient Localization with Fisher Vectors using Approximate Normalizations [pdf]
Dan Oneata, Jakob Verbeek, Cordelia Schmid |
Abstract: The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-words representation. Transformation of the FV by power and $\ell_2$ normalizations has been shown to significantly improve its performance. With these normalizations included, this representation has yielded state-of-the-art results for a wide range of image and video classification and retrieval tasks. The normalizations, however, render the representation non-additive over local descriptors. Combined with its high dimensionality, this makes the FV computationally very expensive for localization tasks. In this paper we first present approximations to both these normalizations, which yield significant improvements in the memory requirements and computational costs of the FV when used for localization. Second, we show how these approximations can be used to define upper bounds on the score function that can be efficiently evaluated, which paves the way for the use of branch-and-bound search as an alternative to exhaustive scanning window search. We present experimental evaluation results on classification and temporal localization of actions in videos. These show that the proposed approximations lead to speed-ups of at least one order of magnitude, while maintaining state-of-the-art action localization performance.
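The non-additivity mentioned in this abstract can be seen in a small numeric sketch. The code below applies a signed power normalization followed by $\ell_2$ normalization to toy two-dimensional vectors; the exponent 0.5 and the vectors themselves are assumptions chosen for illustration, and nothing here reproduces the paper's approximations.

```python
import math

def power_l2_normalize(v, alpha=0.5):
    """Signed power normalization followed by L2 normalization.

    alpha=0.5 (signed square root) is an assumed, commonly used exponent.
    """
    powered = [math.copysign(abs(x) ** alpha, x) for x in v]
    norm = math.sqrt(sum(x * x for x in powered))
    return [x / norm for x in powered] if norm > 0 else powered

# Hypothetical unnormalized statistics from two groups of local descriptors.
a = [4.0, 0.0]
b = [0.0, 9.0]

# Normalizing the aggregated vector ...
norm_of_sum = power_l2_normalize([x + y for x, y in zip(a, b)])
# ... differs from aggregating the separately normalized vectors.
sum_of_norms = [x + y for x, y in
                zip(power_l2_normalize(a), power_l2_normalize(b))]
print(norm_of_sum, sum_of_norms)
```

Because the two results differ, the normalized representation of a window cannot be obtained simply by summing normalized contributions of its parts, which is what makes exhaustive localization expensive.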
|
#592 - Fast Approximate Inference in Higher Order MRF-MAP Labeling Problems [pdf]
Chetan Arora, S.N. Maheshwari, Subhashis Banerjee, Prem Kalra |
Abstract: The use of higher-order clique potentials for modeling inference problems has exploded in the last few years. The algorithmic schemes proposed so far do not scale well with increasing clique size, thus limiting their usage to cliques of size 4 in practice. Generic Cuts (GC) of Arora et al. [8] shows that when potentials are submodular, inference problems can be solved optimally in polynomial time for fixed-size cliques. In this paper we report an algorithm called Approximate Cuts (AC) which uses a generalization of the gadget of GC and provides an approximate solution to inference in 2-label MRF-MAP problems with cliques of size k 2. The algorithm gives an optimal solution for submodular potentials. When potentials are non-submodular, we show that important properties such as weak persistency hold for the solution inferred by AC. AC is a polynomial-time primal-dual approximation algorithm for fixed clique size. We show experimentally that AC not only provides significantly better solutions in practice, but is also hundreds of times faster than message passing schemes like Dual Decomposition [20] and TRWS [17] or reduction-based techniques like [10, 13, 15].
|
#594 - Multi Label Generic Cuts: Optimal Inference in Multi Label Multi Clique MRF-MAP Problems [pdf]
Chetan Arora, S.N. Maheshwari |
Abstract: We propose an algorithm called Multi Label Generic Cuts (MLGC) for computing optimal solutions to MRF-MAP problems with submodular multi-label multi-clique potentials. A transformation is introduced to convert an m-label k-clique problem to an equivalent 2-label (mk)-clique problem. We show that if the original multi-label problem is submodular then the transformed 2-label multi-clique problem is also submodular. We exploit sparseness in the feasible configurations of the transformed 2-label problem to suggest an improvement to Generic Cuts [3] to solve the 2-label problems efficiently. The algorithm runs in time O(m^k n^3) in the worst case (k is the order of cliques, m is the number of labels and n is the number of pixels), generalizing the O(2^k n^3) running time of Generic Cuts. We show experimentally that MLGC is an order of magnitude faster than the current state of the art [17, 19]. While the result of MLGC is optimal for submodular clique potentials, it is significantly better than the compared methods even for problems with non-submodular clique potentials.
|
Abstract: We present a novel object recognition framework based on multiple figure-ground hypotheses with a large object spatial support, generated by bottom-up processes and mid-level cues in an unsupervised manner. We exploit the benefit of regression for discriminating segment categories and qualities, where a regressor is trained for each category using the overlapping observations between each figure-ground segment hypothesis and the ground truth of the target category in an image. Object recognition is achieved by maximizing a submodular objective function, which maximizes the similarities between the selected segments (i.e., facility locations) and their group elements (i.e., clients), penalizes the number of selected segments, and more importantly, encourages the consistency of object categories corresponding to maximum regression values from different category-specific regressors for the selected segments. The proposed framework achieves impressive recognition results on three benchmark datasets, including PASCAL VOC 2007, Caltech-101 and ETHZ-shape.
|
#598 - A Probabilistic Framework for Multitarget Tracking with Mutual Occlusions [pdf]
Menglong Yang, Yiguang Liu, Stan Li |
Abstract: Mutual occlusions among targets can cause track loss or target position deviation. This is because the observation likelihood of an occluded target can vanish even when we have the estimated location of the target. This paper presents a novel probabilistic framework for multitarget tracking with mutual occlusions. The primary contribution of this work is the introduction of a vectorial occlusion variable as part of the solution. The occlusion variable describes occlusion states of the targets. This forms the basis of the proposed probabilistic framework, with the following further contributions: 1) Likelihood: A new observation likelihood model is presented, in which the likelihood of an occluded target is computed by referring to both the occluded and occluding targets. 2) Prior: A Markov random field (MRF) is used to model the occlusion prior such that less likely "circular" or "cascading" types of occlusions have lower prior probabilities. Both the occlusion prior and the motion prior take into consideration the state of occlusion. 3) Optimization: A real-time RJMCMC-based algorithm with a new move type called "occlusion state update" is presented. Experiments are performed in comparison with several state-of-the-art algorithms. Results show that the proposed framework can handle occlusions well, including long-duration full occlusions, which may cause tracking failures in traditional methods.
|
#600 - Multi-fold MIL Training for Weakly Supervised Object Localization [pdf]
Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid |
Abstract: Object category localization is a challenging and fundamental problem in computer vision. Standard supervised training requires bounding box annotations of object instances. The time-consuming manual annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images. We represent detection windows using the powerful Fisher vector representation, and reduce the storage and computational costs using a selective search strategy. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from prematurely locking onto erroneous object locations. This procedure is particularly important when high-dimensional representations, such as the Fisher vector, are used. We present a detailed experimental evaluation using the VOC 2007 dataset. Compared to state-of-the-art weakly supervised detectors, our approach better localizes objects in the training images, which translates into an improvement of detection performance from 15.0% to 22.4% mAP.
|
#605 - Fast Supervised Hashing with Decision Trees for High-Dimensional Data [pdf]
Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, David Suter |
Abstract: Supervised hashing aims to map the original features to compact binary codes that are able to preserve label-based similarity in the Hamming space. Non-linear hash functions have demonstrated an advantage over linear ones due to their powerful generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing, which achieve encouraging retrieval performance at the price of slow evaluation and training time. For the first time, we propose to use boosted decision trees for achieving non-linearity in hashing, which are fast to train and evaluate, and hence more suitable for hashing with high-dimensional data. We separate the problem of learning hash functions into two independent sub-problems: binary code inference (via efficient Graph-Cut) and training of boosted decision trees by fitting the binary codes. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods in retrieval precision and training time. Especially for high-dimensional data, our method is orders of magnitude faster than many methods in terms of training time.
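As a reminder of what similarity in the Hamming space means for binary codes, here is a minimal sketch. The codes are made-up 8-bit integers; it shows only the distance computation used at retrieval time, not the hash-function learning described in the abstract.

```python
def hamming_distance(code_a, code_b):
    """Number of bit positions in which two binary codes differ."""
    return bin(code_a ^ code_b).count("1")

# Hypothetical 8-bit codes: the first two are meant to share a label.
query = 0b10110010
same_label = 0b10110011
other_label = 0b01001101

d_same = hamming_distance(query, same_label)
d_other = hamming_distance(query, other_label)
print(d_same, d_other)
```

A good supervised hash function would assign codes so that items with the same label end up at a small Hamming distance, as in this toy pair.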
|
#607 - Point Matching in the Presence of Outliers in Both Point Sets: A Concave Optimization Approach [pdf]
Wei Lian, Lei Zhang |
Abstract: Recently, a concave optimization approach has been proposed to solve the robust point matching (RPM) problem. This method is globally optimal, but it requires that each point in the model point set has a counterpart in the data point set. Unfortunately, such a requirement may not be satisfied in some applications due to the presence of outliers in both point sets. To address this problem, we drop this condition and reduce the objective function of RPM to a function with few nonlinear terms by eliminating the transformation variables. The resulting function, however, is no longer quadratic. We prove that it is still concave over the feasible region of point correspondence. The branch-and-bound algorithm can then be used for optimization. To improve the efficiency of the branch-and-bound algorithm, whose bottleneck lies in the computation of the lower bound, we propose a new lower bounding scheme which has a k-cardinality linear assignment formulation and can be efficiently solved. Experimental results demonstrate that the proposed concave optimization algorithm outperforms state-of-the-art methods in its robustness to disturbances and in point matching accuracy.
|
#621 - Robust Surface Reconstruction via Triple Sparsity [pdf]
Hicham Badri, Hussein Yahia, Driss Aboutajdine |
Abstract: Reconstructing a surface/image from corrupted gradient fields is a crucial step in many imaging applications where a gradient field is subject to both noise and unlocalized outliers, resulting typically in a non-integrable field. We present in this paper a new optimization method for robust surface reconstruction. The proposed formulation is based on a triple sparsity prior: a sparse prior on the residual gradient field and a double sparse prior on the surface itself. We develop an efficient alternating minimization strategy to solve the proposed optimization problem. The method is able to recover a good-quality surface from severely corrupted gradients thanks to its ability to handle both noise and outliers. We demonstrate the performance of the proposed method on synthetic and real data. Experiments show that the proposed solution outperforms some existing methods in the three possible cases: noise only, outliers only and mixed noise/outliers.
|
#622 - Robust Estimation of 3D Human Poses from Single Images [pdf]
Chunyu Wang, Yizhou Wang, Zhouchen Lin, Alan Yuille, Wen Gao |
Abstract: Human pose estimation is a key step toward action recognition. We propose a method of estimating 3D human poses from single images, which works in conjunction with an existing 2D pose/joint detector. 3D pose estimation is challenging because multiple 3D poses may correspond to the same 2D pose after projection due to the lack of depth information. Moreover, current 2D pose estimators are usually inaccurate, which may cause large errors in the 3D pose estimation. We address the challenges in three ways: (i) We represent a 3D pose as a linear combination of a sparse set of bases learned from 3D human skeletons. (ii) We enforce limb length constraints to eliminate anthropomorphically implausible poses. (iii) We estimate a 3D pose by minimizing the $L_1$-norm error between the projection of 3D joints and the corresponding 2D detections. The $L_1$-norm loss term is robust to inaccurate 2D joint estimations. We use the alternating direction method (ADM) to solve the $L_1$ minimization problem efficiently. Our approach outperforms state-of-the-art methods on three benchmark datasets.
|
#624 - A Minimal Solution to the Generalized Pose-and-Scale Problem [pdf]
Jonathan Ventura, Clemens Arth, Gerhard Reitmayr, Dieter Schmalstieg |
Abstract: We propose a solution to a novel generalized camera pose problem which includes the internal scale of the generalized camera as an unknown parameter. This further generalization of the well-known absolute camera pose problem has applications in multi-frame loop closure. While a well-calibrated camera rig has a fixed and known scale, camera trajectories produced by monocular motion estimation necessarily lack a scale estimate. Thus, when performing loop closure in monocular visual odometry, or registering separate structure-from-motion reconstructions, we must estimate a seven degree-of-freedom similarity transform from corresponding observations. Existing approaches solve this problem, in specialized configurations, by aligning 3D triangulated points or individual camera pose estimates. Our approach handles general configurations of rays and points and directly estimates the full similarity transformation from the 2D-3D correspondences. Four correspondences are needed in the minimal case, which has eight possible solutions. The minimal solver can be used in a hypothesize-and-test architecture for robust transformation estimation. Our solver also produces a least-squares estimate in the overdetermined case. The approach is evaluated experimentally on synthetic and real datasets, and is shown to produce higher accuracy solutions to multi-frame loop closure than existing approaches.
|
Abstract: Fisher Kernels and Deep Belief Networks were two developments with significant impact on large-scale object categorization in recent years. Both approaches were shown to achieve state-of-the-art results on large-scale object categorization datasets, such as ImageNet. Conceptually, however, they are perceived as very different, and it is not uncommon for heated debates to spring up when advocates of both paradigms meet at conferences or workshops. In this work, we emphasize the similarities between both architectures rather than their differences, and we argue that such a unified view allows us to transfer ideas from one domain to the other. As a concrete example we introduce a training method that learns a support vector machine classifier with Fisher kernel at the same time as a task-specific data representation. The basis for this is a reinterpretation of support vector classifiers with Fisher kernel as a multi-layer feed-forward network. Its final layer is the classifier, parameterized by a weight vector, and the two previous layers compute Fisher vectors, parameterized by the coefficients of a Gaussian mixture model. We introduce a gradient-descent based learning algorithm that, in contrast to other feature learning techniques, is not just derived from intuition or biological analogy, but has a theoretical justification in the framework of statistical learning theory. Our experiments show that the new training procedure leads to significant improvements in classificat
|
#637 - MILCut: A Sweeping Line Multiple Instance Learning Paradigm for Interactive Image Segmentation [pdf]
Jiajun Wu, Yibiao Zhao, Jun-Yan Zhu, Zhuowen Tu |
Abstract: Interactive segmentation, in which a user provides a bounding box around an object of interest for image segmentation, has been applied to a variety of applications in image editing, crowdsourcing, computer vision, and medical imaging. The challenge of this semi-automatic image segmentation task is to deal with the uncertainty of the foreground object within the bounding box. Here, we turn the interactive image segmentation problem into a multiple instance learning (MIL) formulation, named MILCut, by generating positive bags from pixels of sweeping lines within the bounding box. We provide a justification for our formulation and develop an algorithm with significant performance and efficiency gains over existing state-of-the-art systems. The results on two benchmark datasets for interactive segmentation demonstrate the evident advantage of our approach.
|
#638 - Beyond Comparing Image Pairs: Setwise Active Learning for Relative Attributes [pdf]
Lucy Liang, Kristen Grauman |
Abstract: It is useful to automatically compare images based on their visual properties, for example to predict which image is brighter, more feminine, or more blurry. However, comparative models are inherently more costly to train than their classification counterparts. Manually labeling all pairwise comparisons is intractable, so which pairs should a human supervisor compare? We explore active learning strategies for training relative attribute ranking functions, with the goal of requesting human comparisons only where they are most informative. We introduce a novel setwise criterion that requests a partial ordering for a set of examples that minimizes the cumulative rank margin in attribute space, subject to a visual diversity constraint. The setwise criterion helps amortize effort by identifying mutually informative comparisons, and the diversity requirement safeguards against requests a human viewer will find ambiguous. We develop an efficient strategy to search for sets that meet this criterion. On three challenging datasets, the proposed method outperforms existing active rank learning methods, demonstrating the importance of focusing attention when learning comparative attribute models.
|
Abstract: The objective of this work is object category detection in large-scale image datasets, where the object category is specified by a sliding window HOG classifier, and retrieval should be immediate at run time in the manner of Video Google. We make the following three contributions: (i) a new image representation based on mid-level discriminative patches, designed to be suited to immediate object category detection and inverted file indexing; (ii) a sparse representation of a HOG classifier using a set of mid-level discriminative classifier patches; and (iii) a fast method for spatially reranking images based on their detections. We evaluate the detection method on the standard PASCAL VOC 2007 dataset, together with an 85K image subset of ImageNet, and demonstrate near state-of-the-art detection performance at low ranks whilst maintaining immediate retrieval speeds. Applications are also demonstrated using an exemplar-SVM for pose-matched retrieval.
|
Abstract: Capturing multiple images is a simple way to increase the chance of obtaining a good photo with a lightweight hand-held camera, for which camera-shake blur is typically a nuisance. The naive approach of selecting the single best captured photo as output does not take full advantage of all the observations. Conventional multi-image blind deblurring methods can take all observations as input but usually require that the multiple images be well aligned. However, multiple blurry images captured in the presence of camera shake are rarely free from misalignment. Registering multiple blurry images is a challenging task due to the presence of blur, while deblurring multiple blurry images requires accurate alignment, leading to an intrinsically coupled problem. In this paper, we propose a blind multi-image restoration method which can achieve joint alignment, non-uniform deblurring, and resolution enhancement from multiple low-quality images. Experiments on several real-world images, with comparison to some previous methods, validate the effectiveness of the proposed method.
|
#643 - Data-driven Flower Petal Modeling with Botany Priors [pdf]
Chenxi Zhang, Mao Ye, Bo Fu, Ruigang Yang |
Abstract: In this paper we focus on the 3D modeling of flowers, in particular the petals. The complex structure, severe occlusions, and wide variations make the reconstruction of their 3D models a challenging task. Therefore, even though a flower is the most distinctive part of a plant, there has been little modeling study devoted to it. We overcome these challenges by combining data-driven modeling techniques with domain knowledge from botany. Taking a 3D point cloud of an input flower scanned from a single view, our method starts with a level-set based segmentation of each individual petal, using both appearance and position information. Each segmented petal is then fitted with a scale-invariant morphable petal shape model, which is constructed from individually scanned exemplar petals. Novel constraints based on botany studies, such as the number and spatial layout of petals, are incorporated into the fitting process for realistically reconstructing occluded regions and maintaining correct 3D spatial relations. Finally, the reconstructed petal shape is texture mapped using the registered color images, with occluded regions filled in by content from visible ones. Experiments show that our approach can obtain realistic modeling of flowers with noticeable occlusions and shape variations, and is invariant to flower size.
|
Abstract: Existing saliency detection approaches use images as inputs and are sensitive to foreground/background similarities, complex background textures, and occlusions. We explore the problem of using light fields as input for saliency detection. Our technique is enabled by the availability of commercial plenoptic cameras that capture the light field of a scene in a single shot. We show that the unique refocusing capability of light fields provides useful focusness, depth, and objectness cues. We further develop a new saliency detection algorithm tailored for light fields. To validate our approach, we acquire a light field database of a range of indoor and outdoor scenes and generate the ground truth saliency map. Experiments show that our saliency detection scheme can robustly handle challenging scenarios such as similar foreground and background, cluttered background, complex occlusions, etc., and achieve high accuracy and robustness.
|
#646 - Pedestrian Detection in Low-resolution Imagery by Learning Multi-scale Intrinsic Motion Structures (MIMS) [pdf]
Jiejie Zhu |
Abstract: Detecting pedestrians at a distance from large-format wide-area imagery is a challenging problem because of low ground sampling distance (GSD) and the low frame rate of the imagery. In such a scenario, approaches based on appearance cues alone fail easily because pedestrians are only a few pixels in size. Frame-differencing and optical-flow based approaches also give poor detection results due to noise, camera jitter and parallax. To overcome these challenges, we propose a novel approach to extract Multi-scale Intrinsic Motion Structure (MIMS) features from the pedestrian's motion patterns for pedestrian detection. The MIMS feature encodes the intrinsic motion properties of an object consisting of a few pixels, and is location, velocity and trajectory-shape invariant. The extracted MIMS representation is highly robust to noise in comparison with other approaches.
|
#652 - Measuring Distance Between Unordered Sets of Different Sizes [pdf]
Andrew Gardner, Jinko Kanno, Rastko Selmic, Christian Duncan |
Abstract: We present a distance metric based upon the notion of minimum-cost injective mappings between sets. Our function satisfies metric properties as long as the cost of the minimum mappings is derived from a semimetric, for which the triangle inequality is not necessarily satisfied. We show that the Jaccard distance (alternatively biotope, Tanimoto, or Marczewski-Steinhaus distance) may be considered the special case for finite sets where costs are derived from the discrete metric. Extensions that allow premetrics (not necessarily symmetric), multisets (generalized to include probability distributions), and multiple mappings are given that expand the versatility of the metric without sacrificing metric properties. The function has potential applications in pattern recognition, machine learning, and information retrieval.
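The Jaccard distance named in this abstract as a special case can be written down directly for finite sets. The sketch below implements only that standard distance, not the paper's minimum-cost mapping construction; the example sets are invented, and the convention of returning 0 for two empty sets is an assumption.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two finite sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are at distance 0
    return 1.0 - len(a & b) / len(union)

x = {1, 2, 3}
y = {2, 3, 4}
z = {4, 5}

d_xy = jaccard_distance(x, y)
d_yz = jaccard_distance(y, z)
d_xz = jaccard_distance(x, z)
print(d_xy, d_yz, d_xz)
```

The three example sets have different sizes, and the distances between them still satisfy the triangle inequality, as expected of a metric.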
|
#656 - Strokelets: A Learned Multi-Scale Representation for Scene Text Recognition [pdf]
Cong Yao, Xiang Bai, Baoguang Shi, Wenyu Liu |
Abstract: Driven by a wide range of applications, scene text detection and recognition have become active research topics in computer vision. Though extensively studied, localizing and reading text in uncontrolled environments remain extremely challenging, due to various interference factors. In this paper, we propose a novel multi-scale representation for scene text recognition. This representation consists of a set of detectable primitives, termed strokelets, which capture the essential substructures of characters at different granularities. Strokelets possess four distinctive advantages: (1) Usability: automatically learned from bounding box labels; (2) Robustness: insensitive to interference factors; (3) Generality: applicable to various languages; and (4) Expressivity: effective at describing characters in natural scenes. Extensive experiments on standard benchmarks verify the advantages of strokelets and demonstrate that the proposed algorithm outperforms the state-of-the-art methods in the literature.
|
Abstract: The appearance of an attribute can vary considerably from class to class (e.g., a "fluffy" dog vs. a "fluffy" towel), making standard class-independent attribute models break down. Yet, training object-specific models for each attribute can be impractical, and defeats the purpose of using attributes to bridge category boundaries. We propose a novel form of transfer learning that addresses this dilemma. We develop a tensor factorization approach which, given a sparse set of class-specific attribute classifiers, can infer new ones for object-attribute pairs unobserved during training. For example, even though the system has no labeled images of striped dogs, it can use its knowledge of other attributes and objects to tailor "stripedness" to the dog category. With two large-scale datasets, we demonstrate both the need for category-sensitive attributes as well as our method's successful transfer. Our inferred attribute classifiers perform similarly well to those trained with the luxury of labeled class-specific instances, and much better than those restricted to traditional modes of transfer.
|
#668 - Online Object Tracking, Learning and Parsing with And-Or Graphs [pdf]
Yang Lu, Tianfu Wu, Song Chun Zhu |
Abstract: This paper presents a framework for simultaneously tracking, learning and parsing objects with a hierarchical and compositional And-Or graph (AOG) representation. The AOG is discriminatively learned online to account for the appearance (e.g., lighting and partial occlusion) and structural (e.g., different poses and viewpoints) variations of the object itself, as well as the distractors (e.g., similar objects) in the scene background. In tracking, the state of the object (i.e., bounding box) is inferred by parsing with the current AOG using a spatial-temporal dynamic programming (DP) algorithm. When the AOG grows big for handling objects with large variations in long-term tracking, we propose a bottom-up/top-down scheduling scheme for efficient inference, which performs focused inference with the most stable and discriminative small sub-AOG. During online learning, the AOG is re-learned iteratively in two steps: (i) identifying the false positives and false negatives of the current AOG in a new frame by exploiting the spatial and temporal constraints observed in the trajectory; (ii) updating the structure of the AOG, and re-estimating the parameters based on the augmented training dataset. In experiments, the proposed method outperforms the state-of-the-art tracking algorithms on a recent public tracking benchmark with 50 testing videos and 29 publicly available trackers evaluated.
|
Similar papers:
[rank all papers by similarity to this]
|
#669 - Co-Segmentation of Textured 3D Shapes with Sparse Annotations [pdf]
Mehmet Yumer, Won Chun, Ameesh Makadia |
Abstract: We present a novel co-segmentation method for textured 3D shapes. Our algorithm takes a collection of textured shapes belonging to the same category and sparse annotations of foreground segments, and produces a joint dense segmentation of the shapes in the collection. We model the segments present in the shape collection by a collectively trained Gaussian mixture model. The final model segmentation is formulated as an energy minimization across all models jointly, where intra-model edges control the smoothness and separation of model segments, and inter-model edges impart global consistency. We show promising results on two large real-world datasets, and also compare with previous shape-only 3D segmentation methods using publicly available datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: With the widespread availability of video cameras, we are facing an ever-growing enormous collection of unedited and unstructured video data. Due to the lack of an automatic way to generate summaries from this large collection of consumer videos, they can be tedious and time consuming to index or search. In this work, we propose online video highlighting, a principled way of generating a short video summarizing the most important and interesting contents of an unedited and unstructured video, which is costly both time-wise and financially to process manually. Specifically, our method learns a dictionary from the given video using group sparse coding, and updates atoms in the dictionary on-the-fly. A summary video is then generated by combining segments that cannot be sparsely reconstructed using the learned dictionary. The online fashion of our proposed method enables it to process arbitrarily long videos and start generating summaries before seeing the end of the video. Moreover, the processing time required by our proposed method is close to the original video length, achieving quasi real-time summarization speed. Theoretical analysis, together with experimental results on more than 12 hours of surveillance and YouTube videos, is provided, demonstrating the effectiveness of online video highlighting.
|
Similar papers:
[rank all papers by similarity to this]
|
#671 - Beyond Human Opinion Scores: Blind Image Quality Assessment based on Synthetic Scores [pdf]
Peng Ye, David Doermann |
Abstract: General purpose blind image quality assessment (BIQA) aims to develop a computational model that can predict the human perceived quality of distorted images without access to the non-distorted reference images or any prior knowledge of the types of image distortions. State-of-the-art general purpose BIQA methods rely on 1) examples of distorted images and 2) corresponding human opinion scores to learn a regression function that maps image features to the quality score. These types of models are considered "opinion-aware" (OA) BIQA models. A large set of human scored training examples is usually required to train a reliable OA-BIQA model. However, obtaining human opinion scores through subjective testing is often expensive and time-consuming. It is therefore desirable to develop "opinion-free" (OF) BIQA models that do not require human opinion scores for training. This paper proposes BLISS (Blind Learning of Image Quality using Synthetic Scores). BLISS is a simple, yet effective method for extending OA-BIQA models to OF-BIQA models. Instead of training on human opinion scores, we propose to train BIQA models on Full-Reference (FR) IQA measures. State-of-the-art FR measures yield high correlation with human opinion scores, therefore they can serve as an approximation to human opinion scores. Unsupervised rank aggregation is applied to combine different FR measures to generate a synthetic score, which serves as a better "gold standard". Extensive experiments on three standard IQA
|
Similar papers:
[rank all papers by similarity to this]
|
#672 - Detecting Objects using Deformation Dictionaries [pdf]
Bharath Hariharan, Piotr Dollar, Larry Zitnick |
Abstract: Several popular and effective object detectors model intra-class variations due to deformations and appearance changes separately. This reduces model complexity while enabling detection of objects across change in viewpoint, object pose, etc. The Deformable Part Model (DPM) is perhaps the most successful such model to date. A common assumption is that the exponential number of templates enabled by a DPM is critical to its success. In this paper, we show the counter-intuitive result that it is possible to achieve similar accuracy using a small dictionary of global deformations. Each component in our model is represented by a single HOG template and a dictionary of flow fields that determine the deformations the template may undergo. While the number of candidate deformations is dramatically fewer than that for a DPM, the deformed templates tend to be plausible and interpretable. In addition, we discover that the set of deformation bases is actually transferable across object categories and that learning shared bases across similar categories can even boost accuracy.
|
Similar papers:
[rank all papers by similarity to this]
|
#675 - Using k-poselets for detecting people and localizing their keypoints [pdf]
Bharath Hariharan, Georgia Gkioxari, Ross Girshick, Jitendra Malik |
Abstract: A k-poselet is a Deformable Part Model with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground truth annotations. A separate HOG template is used to learn the appearance of each part. The parts are allowed to move with respect to each other with a deformation cost that is learned at training time. This model is richer than both the traditional version of poselets (Bourdev et al) and DPMs (Felzenszwalb et al) and experimental results verify its superiority at person detection as well as keypoint prediction.
|
Similar papers:
[rank all papers by similarity to this]
|
#678 - Super Normal Vector for Activity Recognition Using Depth Sequences [pdf]
Xiaodong Yang, Yingli Tian |
Abstract: This paper presents a new framework for human activity recognition from video sequences captured by a depth camera. We cluster hypersurface normals in depth sequences to form the polynormal, which is used to jointly characterize the local motion and shape information. In order to globally capture the spatial and temporal orders, an adaptive spatio-temporal pyramid is introduced to subdivide a depth video into a set of space-time grids. We then propose a novel scheme of aggregating the low-level polynormals into the Super Normal Vector (SNV), which can be seen as a simplified version of the Fisher kernel representation. In extensive experiments, we achieve classification results superior to all previously published results on four public benchmark datasets, i.e., MSRAction3D, MSRDailyActivity3D, MSRGesture3D, and MSRActionPairs3D.
|
Similar papers:
[rank all papers by similarity to this]
|
#689 - Weighted Nuclear Norm Minimization with Application to Image Denoising [pdf]
Shuhang Gu, Lei Zhang, Xiangchu Feng, Wangmeng Zuo |
Abstract: As a convex relaxation of the low rank matrix factorization problem, the nuclear norm minimization has been attracting significant research interest in recent years. The standard nuclear norm minimization regularizes each singular value equally to pursue the convexity of the objective function. However, this greatly restricts its capability and flexibility in dealing with many practical problems (e.g., denoising), where the singular values have clear physical meanings and should be treated differently. In this paper we study the weighted nuclear norm minimization (WNNM) problem with F-norm data fidelity, where the singular values are assigned different weights. The solutions of the WNNM problem are analyzed under different weighting conditions. We then apply the proposed WNNM algorithm to image denoising by exploiting the image nonlocal self-similarity. Experimental results clearly show that the proposed WNNM algorithm outperforms many state-of-the-art denoising algorithms such as BM3D in terms of both quantitative measure and visual perception quality.
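The idea of assigning different weights to different singular values can be illustrated with a small sketch. The code below is not taken from the paper: it assumes a weighted soft-thresholding rule in which each singular value is shrunk by its own weight and clamped at zero, and it assumes the singular values have already been computed by an SVD elsewhere. The function name `weighted_shrink` and the example weights are hypothetical.

```python
def weighted_shrink(singular_values, weights):
    """Shrink each singular value by its own weight, clamping at zero.

    Illustrative assumption: weighted soft-thresholding, not the paper's code.
    """
    if len(singular_values) != len(weights):
        raise ValueError("need exactly one weight per singular value")
    return [max(s - w, 0.0) for s, w in zip(singular_values, weights)]


# Hypothetical singular values, largest first.
sigmas = [9.0, 4.0, 1.0]

# Equal weights: every singular value is penalized by the same amount,
# which corresponds to the standard (unweighted) shrinkage.
uniform = weighted_shrink(sigmas, [1.0, 1.0, 1.0])

# Unequal weights (hypothetical): large singular values are penalized less,
# so dominant structure is preserved while small values are suppressed.
weighted = weighted_shrink(sigmas, [0.5, 1.0, 2.0])

print(uniform)
print(weighted)
```

With equal weights the largest singular value loses as much as the smallest one; with unequal weights the dominant component is kept closer to its original magnitude, which is the flexibility the abstract argues for.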
|
Similar papers:
[rank all papers by similarity to this]
|
#692 - Semi-Supervised Coupled Dictionary Learning for Person Re-identification [pdf]
Xiao Liu, Mingli Song, Dacheng Tao, Xingchen Zhou, Chun Chen, Jiajun Bu |
Abstract: The desirability of being able to search for specific persons in surveillance videos captured by different cameras has increasingly motivated interest in the problem of person re-identification, which is a critical yet under-addressed challenge in multi-camera tracking systems. The main difficulty of person re-identification arises from the variations in human appearances from different camera views. In this paper, to bridge the human appearance variations across cameras, two coupled dictionaries that relate to the gallery and probe cameras are jointly learned in the training phase from both labeled and unlabeled images. The labeled training images carry the relationship between features from different cameras, and the abundant unlabeled training images are introduced to exploit the geometry of the marginal distribution for obtaining robust sparse representation. In the testing phase, the feature of each target image from the probe camera is first encoded by the sparse representation and then recovered in the feature space spanned by the images from the gallery camera. The features of the same person from different cameras are similar following the above transformation. Experimental results on publicly available datasets demonstrate the superiority of our method.
|
Similar papers:
[rank all papers by similarity to this]
|
#703 - Object-based Multiple Foreground Video Co-segmentation [pdf]
Huazhu Fu, Dong Xu, Bao Zhang, Stephen Lin |
Abstract: We present a video co-segmentation method that uses category-independent object proposals as its basic element and can extract multiple foreground objects in a video set. The use of object elements overcomes limitations of low-level feature representations in separating complex foregrounds and backgrounds. We formulate object-based co-segmentation as a co-selection graph in which regions with foreground-like characteristics are favored while also accounting for intra-video and inter-video foreground coherence. To handle multiple foreground objects, we expand the co-selection graph model into a proposed multi-state selection graph model (MSG) that optimizes the segmentations of different objects jointly. This extension into the MSG can be applied not only to our co-selection graph, but also can be used to turn any standard graph model into a multi-selection solution that can be optimized directly by existing energy minimization techniques. Our experiments show that our object-based multiple foreground video co-segmentation method (ObMiC) compares well to related techniques on both single and multiple foreground cases.
|
Similar papers:
[rank all papers by similarity to this]
|
#708 - A 3D Feature for Moving Range Scanning Systems [pdf]
Xiangqi Huang, Bo Zheng, Takeshi Masuda, Katsushi Ikeuchi |
Abstract: Laser range sensors are often mounted on a moving platform to improve the efficiency of 3D reconstruction, SLAM, object recognition, etc. However, such a moving system often suffers from the difficulty of matching the distorted 3D range images. In this paper, we propose novel 3D features which can be robustly extracted and matched for distorted range scans captured by a moving system. Our feature extraction employs Morse function theory to construct a measure function that yields invariant critical points under 3D surface distortion. Then, at each critical point, we extract the maximally stable region as the interest region by disconnectivity, as well as the extremal region for comparison. Our feature description is designed as two processes: 1) affine-based normalization and 2) critical net construction. The former normalizes the detected local regions to canonical shapes, while the latter connects detected local regions with a subgraph. In experiments, we demonstrate that the proposed 3D feature achieves substantially better performance for distorted surface matching in comparison to state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#709 - Learning Fine-grained Image Similarity with Deep Ranking [pdf]
Jiang Wang, Yang Song, Thomas Leung, Charles Rosenberg, James Philbin, Bo Chen, Ying Wu |
Abstract: Learning fine-grained image similarity models is a very challenging task. Fine-grained image similarity is usually characterized by very subtle differences that are difficult to be distinguished with hand-crafted features. We propose a ranking model that employs deep neural network learning techniques to learn image similarity models directly from images. We call this a deep ranking model. Compared to similar models based on hand-crafted features, the deep ranking model has higher learning capacity to better characterize the subtle differences required for fine-grained image similarity. We also propose an effective triplet sampling algorithm to learn the model with distributed asynchronized stochastic gradient. The experimental results show that the proposed algorithm outperforms both the state-of-the-art hand-crafted visual feature-based methods and deep neural-network classification models.
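A ranking model trained on triplets is commonly driven by a hinge loss over (query, positive, negative) examples. The sketch below illustrates that general idea only: the squared Euclidean distance, the margin value, and the function names are assumptions for illustration and are not stated in the abstract. The embeddings are plain lists standing in for the output of a network.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_hinge_loss(query, positive, negative, margin=1.0):
    """Zero when the positive is closer to the query than the negative by
    at least `margin`; otherwise grows with the size of the violation."""
    gap = squared_distance(query, positive) - squared_distance(query, negative)
    return max(0.0, margin + gap)


# Hypothetical 2-D embeddings.
query = [0.0, 0.0]
near = [1.0, 0.0]
far = [3.0, 0.0]

# Correctly ranked triplet: the similar image is much closer, so no loss.
print(triplet_hinge_loss(query, near, far))

# Violated triplet: the "positive" is farther than the "negative".
print(triplet_hinge_loss(query, far, near))
```

Triplets whose loss is already zero contribute no gradient, which is one reason a sampling strategy for informative triplets matters in practice.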
|
Similar papers:
[rank all papers by similarity to this]
|
#711 - Domain Adaptation on the Statistical Manifold [pdf]
Mahsa Baktashmotlagh, Mehrtash Harandi, Brian Lovell, Mathieu Salzmann |
Abstract: In this paper, we tackle the problem of unsupervised domain adaptation for classification. In the unsupervised scenario where no labeled samples from the target domain are provided, a popular approach consists in transforming the data such that the source and target distributions become similar. To compare the two distributions, existing approaches make use of the Maximum Mean Discrepancy (MMD). However, this does not exploit the fact that probability distributions lie on a Riemannian manifold. Here, we propose to make better use of the structure of this manifold and rely on the distance on the manifold to compare the source and target distributions. In this framework, we introduce a sample selection method and a subspace-based method for unsupervised domain adaptation, and show that both these manifold-based techniques outperform the corresponding approaches based on the MMD. Furthermore, we show that our subspace-based approach yields state-of-the-art results on a standard object recognition benchmark.
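For reference, the Maximum Mean Discrepancy baseline mentioned above can be sketched in a few lines. This is a minimal illustration of the biased empirical estimate of squared MMD, assuming a Gaussian kernel with a hand-picked bandwidth; the function names and sample data are hypothetical, and the sketch does not implement the manifold-based distance the paper proposes.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))


# Hypothetical 1-D samples from a source domain and a shifted target domain.
source = [[0.0], [1.0]]
target = [[5.0], [6.0]]

print(mmd_squared(source, source))  # identical samples: essentially zero
print(mmd_squared(source, target))  # shifted samples: clearly positive
```

A transformation that makes the source and target distributions similar would drive this quantity toward zero.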
|
Similar papers:
[rank all papers by similarity to this]
|
#712 - Human Action Recognition Based on Context-Dependent Graph Kernels [pdf]
Baoxin Wu, Chunfeng Yuan, Weiming Hu |
Abstract: Graphs are a powerful tool to model structured objects, but it is nontrivial to measure the similarity between two graphs. In this paper, we construct a two-graph model to represent human actions by recording the spatial and temporal relationships among local features. We also propose a novel family of context-dependent graph kernels (CGKs) to measure similarity between graphs. First, local features are used as the vertices of the two-graph model and the relationships among local features in the intra-frames and inter-frames are characterized by the edges. Then, the proposed CGKs are applied to measure the similarity between actions represented by the two-graph model. Graphs can be decomposed into a number of primary walk groups with different walk lengths, and our CGKs are based on context-dependent primary walk group matching. Taking advantage of the context information makes the correctly matched primary walk groups dominate in the CGKs and improves the performance of similarity measurement between graphs. Finally, a generalized multiple kernel learning with a proposed l12-norm regularization is applied to combine these CGKs optimally and simultaneously train a set of action classifiers. We conduct a series of experiments on four public action datasets. Our approach achieves performance comparable to the state-of-the-art approaches, which demonstrates the effectiveness of the two-graph model and the CGKs in recognizing human actions.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Recently, high dimensional representation such as VLAD or FV has shown excellent accuracy in action recognition. This paper shows that a proper encoding built upon VLAD can get further accuracy boost comparable to the accuracy gains achieved by replacing bag-of-features with the FV or VLAD representation. We empirically evaluated various VLAD improvement technologies to determine good practices in VLAD-based video encoding. Furthermore, we propose an interpretation that VLAD is a maximum entropy linear feature learning process. Combining this new perspective with observed VLAD data distribution properties, we propose a simple, lightweight, but powerful bimodal encoding method. Evaluated on 3 benchmark action recognition datasets (UCF101, HMDB51 and Youtube), the bimodal encoding consistently improves VLAD by large margins in action recognition.
|
Similar papers:
[rank all papers by similarity to this]
|
#719 - Multi-source Deep Learning for Human Pose Estimation [pdf]
Wanli Ouyang, Xiaogang Wang, Xiao Chu |
Abstract: Visual appearance score, appearance mixture type and deformation are three important information sources for human pose estimation. This paper proposes to build a multi-source deep model in order to extract non-linear representation from these different aspects of information sources. With the deep model, the global, high-order human body articulation patterns in these information sources are extracted for pose estimation. The task for estimating body locations and the task for human detection are jointly learned using a unified deep model. The proposed approach can be viewed as a post-processing of pose estimation results and can flexibly integrate with existing methods by taking their information sources as input. By extracting the non-linear representation from multiple information sources, the deep model outperforms state-of-the-art by up to 8.6 percent on three public benchmark datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: In kernel based learning, the kernel trick transforms the original representation of a feature instance into a vector of similarities with the training feature instances, known as the kernel representation. However, feature instances are sometimes ambiguous and the kernel representation calculated from them does not possess any discriminative information, which can eventually harm the trained classifier. To address this issue, we propose to automatically select good feature instances when calculating the kernel representation in multiple kernel learning. Specifically, for the kernel representation calculated for each input feature instance, we multiply it element-wise with a latent binary vector, called the instance selection variables, resulting in a new kernel representation with an attenuated effect from the similarities calculated on ambiguous feature instances. A Beta process is employed for generating the prior distribution for the introduced latent instance selection variables. We then propose a Bayesian graphical model which integrates both MKL learning and inference for the distribution of the latent instance selection variables. Variational inference is derived for model learning under a max-margin principle. Qualitative and quantitative evaluations on synthetic data, UCL toy datasets, two image classification benchmarks and an action recognition video benchmark demonstrate the effectiveness of the proposed method and its high discriminative capabilit
|
Similar papers:
[rank all papers by similarity to this]
|
#726 - SCAMS: Simultaneous Clustering and Model Selection [pdf]
Zhuwen Li, Loong-Fah Cheong, Steven Zhiying Zhou |
Abstract: While clustering has been well studied in the past decade, model selection has drawn less attention. This paper addresses both problems in a joint manner with an indicator matrix formulation, in which the clustering cost is penalized by a Frobenius inner product term and the group number estimation is achieved by a rank minimization. As affinity graphs generally contain positive edge values, a sparsity term is further added to avoid the trivial solution. We then carefully investigate the convex relaxations of this unified problem and solve it efficiently using the Alternating Direction Method of Multipliers. The highly constrained nature of the optimization provides our algorithm with the robustness to deal with the varying and often imperfect input affinity matrices arising from different applications and different group numbers. Evaluations on the synthetic data as well as two real world problems show the superiority of the method across a large variety of settings.
|
Similar papers:
[rank all papers by similarity to this]
|
#732 - Collective Matrix Factorization Hashing for Multimodal Data [pdf]
Guiguang Ding, Yuchen Guo, Jile Zhou |
Abstract: Nearest neighbor search methods based on hashing have attracted considerable attention for effective and efficient large-scale similarity search in the computer vision and information retrieval communities. In this paper, we study the problem of learning hash functions in the context of multimodal data for cross-view similarity search. We put forward a novel hashing method, which is referred to as Collective Matrix Factorization Hashing (CMFH). CMFH learns unified hash codes by collective matrix factorization with a latent factor model from different modalities of one instance, which not only supports cross-view search but also increases the search accuracy by merging multiple view information sources. We also prove that CMFH, a similarity-preserving hashing learning method, has upper and lower boundaries. Extensive experiments verify that CMFH significantly outperforms several state-of-the-art methods on three different datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: In this paper, we tackle the problem of co-localization in real-world images. Co-localization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images. Although similar problems such as co-segmentation and weakly supervised localization have been previously studied, we focus on being able to perform co-localization in real-world settings, which are typically characterized by large amounts of intra-class variation, inter-class diversity, and annotation noise. To address these issues, we present a joint image-box formulation for solving the co-localization problem, and show how it can be relaxed to a convex quadratic program which can be efficiently solved. We perform an extensive evaluation of our method compared to previous state-of-the-art approaches on the challenging PASCAL VOC 2007 and Object Discovery datasets. In addition, we also present a large-scale study of co-localization on ImageNet, involving ground-truth annotations for 3,624 classes and approximately 1 million images.
|
Similar papers:
[rank all papers by similarity to this]
|
#735 - User-Specific Hand Modeling from Monocular Depth Sequences [pdf]
Jonathan Taylor, Richard Stebbing, Varun Ramakrishna, Cem Keskin, Jamie Shotton, Shahram Izadi, Andrew Fitzgibbon, Aaron Hertzmann |
Abstract: This paper presents a method for acquiring dense non-rigid shape and deformation from a single monocular depth sensor. We consider an important special case of acquisition from nonrigid scenes: when a rough model template is available. We focus on modeling the human hand, and assume that a single rough model template is available. We combine and extend existing work on model-based tracking, subdivision surface fitting, and mesh deformation to acquire detailed hand models from as few as 15 frames of depth data. We propose an objective that measures the error of fit between each sampled data point and a continuous model surface defined by a rigged control mesh, and use as-rigid-as-possible (ARAP) regularizers to cleanly separate the model and template geometries. Our use of a smooth model based on subdivision surfaces allows simultaneous optimization over both correspondences and model parameters, avoiding the use of iterated closest point (ICP) which can lead to slow convergence. Automatic initialization is obtained using a regression forest trained to infer approximate correspondences. Experiments show that the resulting meshes model the user's hand shape more accurately than just adapting the shape parameters of the skeleton, and that the retargeted skeleton accurately models the user's articulations. We investigate the effect of various modeling choices, and show the benefits of using subdivision surfaces and ARAP regularization.
|
Similar papers:
[rank all papers by similarity to this]
|
#750 - Multi-feature Spectral Clustering with Minimax Optimization [pdf]
Hongxing Wang, Chaoqun Weng, Junsong Yuan |
Abstract: In this paper, we propose a novel formulation for multi-feature clustering using minimax optimization. To find a consensus clustering result that is agreeable to all feature modalities, our objective is to find a universal feature embedding, which not only fits each individual feature modality well, but also unifies different feature modalities by minimizing their pairwise disagreements. The loss function consists of both (1) unary embedding cost for each modality, and (2) pairwise disagreement cost for each pair of modalities, with weighting parameters automatically selected to maximize the loss. By performing minimax optimization, we can minimize the loss for a worst case with maximum disagreements, thus can better handle noisy feature modalities. To solve the minimax optimization, an iterative solution is proposed to update the universal embedding, individual embedding, and fusion weights, separately. Our minimax optimization has only one global parameter. The superior results on various multi-feature clustering tasks validate the effectiveness of our approach when compared with the state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#751 - DAISY Filter Flow: A Generalized Discrete Approach to Dense Correspondences [pdf]
Hongsheng Yang, Wen-Yan Lin, Jiangbo Lu |
Abstract: Establishing dense correspondences reliably between a pair of images is an important vision task with many applications. Though significant advances have been made towards estimating dense stereo and optical flow fields for two images adjacent in viewpoint or in time, building reliable dense correspondence fields for two general images still remains largely unsolved. For instance, two given images sharing some content may exhibit dramatic photometric and geometric variations, or they may depict different 3D scenes with similar scene characteristics. Fundamental challenges to such an image or scene alignment task are often multifold, causing many existing techniques to fall short of producing dense correspondences robustly and efficiently. This paper presents a novel approach called DAISY filter flow to address this challenging task. The DAISY filter flow algorithm leverages and extends a few established techniques: 1) DAISY descriptors, 2) filter-based efficient flow inference, and 3) the PatchMatch fast search. Coupling and optimizing these modules seamlessly with image segments as the bridge, our approach enables efficient dense descriptor-based correspondence field estimation in a generalized high-dimensional label space, which is augmented by scales and rotations. Experiments on a variety of challenging scenes show that the proposed approach estimates spatially coherent yet discontinuity-preserving image alignment results both robustly and efficiently.
|
Similar papers:
[rank all papers by similarity to this]
|
#766 - Occlusion Geodesics for Online Multi-Object Tracking [pdf]
Horst Possegger, Thomas Mauthner, Peter Roth, Horst Bischof |
Abstract: Robust multi-object tracking-by-detection requires the correct assignment of noisy detection results to object trajectories. We address this problem by proposing an online approach based on the observation that object detectors primarily fail if objects are significantly occluded. In contrast to most existing work, we only rely on geometric information to efficiently overcome detection failures. In particular, we exploit the spatio-temporal evolution of occlusion regions, detector reliability, and target motion prediction to robustly handle missed detections. In combination with a conservative association scheme for visible objects, this allows for real-time tracking of multiple objects from a single static camera, even in complex scenarios. Our evaluations on publicly available multi-object tracking benchmark datasets demonstrate superior performance compared to the state-of-the-art in online and offline multi-object tracking.
|
Similar papers:
[rank all papers by similarity to this]
|
#772 - Multi-target Tracking with Motion Context in Tensor Power Iteration [pdf]
Xinchu Shi, Haibin Ling, Weiming Hu, Chunfeng Yuan |
Abstract: Multiple target tracking (MTT) is often formulated as a (multi-frame) data association problem, and different optimization approaches have been proposed to capture the association solution. Most existing approaches, however, treat different targets as independent of each other, thereby ignoring the interaction between subjects. In this paper, we model interactions between neighbor targets by pair-wise motion context, and further encode such context into the global association optimization. To solve the resulting global non-convex maximization, we propose an effective and efficient power iteration framework. This solution enjoys two advantages for MTT: First, it allows us to combine the global energy accumulated from individual trajectories and the between-trajectory interaction energy into a united optimization, which can be solved by the proposed power iteration algorithm. Second, the framework is flexible to accommodate various types of pairwise context models and we in fact studied two different context models in this paper. For evaluation, we apply the proposed methods to four public datasets involving different challenging scenarios such as dense aerial borne traffic tracking, dense point set tracking, and semi-crowded pedestrian tracking. In all the experiments, our approaches demonstrate very promising results in comparison with state-of-the-art trackers.
|
Similar papers:
[rank all papers by similarity to this]
|
#778 - Facial Expression Recognition via a Boosted Deep Belief Network [pdf]
Ping Liu, Shizhong Han, Zibo Meng, Yan Tong |
Abstract: A training process for facial expression recognition is usually performed sequentially in three individual stages: feature learning, feature selection, and classifier construction. Extensive empirical studies are needed to search for an optimal combination of feature representation, feature set, and classifier to achieve good recognition performance. This paper presents a novel Boosted Deep Belief Network (BDBN) for performing the three training stages jointly in a unified framework. Through the proposed BDBN framework, a set of features, which is effective to characterize expression-related facial appearance/shape changes, can be learned and selected to form a boosted strong classifier in a statistical way. As learning continues, the strong classifier is improved iteratively and more importantly, the discriminative capabilities of selected features are strengthened as well according to their relative importance to the strong classifier via a joint fine-tune process in the BDBN framework. Extensive experiments on two public databases showed that the BDBN framework yielded significant improvements in facial expression analysis.
|
#794 - Bayesian Active Contours with Affine-Invariant, Elastic Shape Prior [pdf]
Darshan Bryner, Anuj Srivastava |
Abstract: Active contour, especially in conjunction with prior-shape models, has become an important tool in image segmentation. However, most contour methods use shape priors based on similarity-shape analysis, i.e., analysis that is invariant to rotation, translation, and scale. In practice, the training shapes used for prior-shape models may be collected from viewing angles different from those for the test images and require invariance to a larger class of transformation. Using an elastic, affine-invariant shape modeling of planar curves, we propose an active contour algorithm in which the training and test shapes can be at arbitrary affine transformations, and the resulting segmentation is robust to perspective skews. We construct a shape space of affine-standardized curves and derive a statistical model for capturing class-specific shape variability. The active contour is then driven by the true gradient of a total energy composed of a data term, a smoothing term, and an affine-invariant shape-prior term. This framework is demonstrated using a number of examples involving the segmentation of occluded or noisy images of targets subject to perspective skew.
|
#795 - Frequency-Based 3D Reconstruction of Transparent and Specular Objects [pdf]
Ding Liu, Xida Chen, Yee-Hong Yang |
Abstract: 3D reconstruction of transparent and specular objects is a very challenging topic in computer vision. For transparent and specular objects, which have complex interior and exterior structures that can reflect and refract light in a complex fashion, it is difficult, if not impossible, to use either passive stereo or the traditional structured light methods to do the reconstruction. We propose a frequency-based 3D reconstruction method, which incorporates the frequency-based matting method. Similar to the structured light methods, a set of frequency-based patterns are projected onto the object, and a camera captures the scene at the same time. Each pixel of the captured image is analyzed along the time axis and the corresponding signal is transformed to the frequency-domain using the Discrete Fourier Transform. Since the frequency is only determined by the source that creates it, the frequency of the signal can uniquely identify the location of the pixel in the patterns. In this way, the correspondences between the pixels in the captured images and the points in the patterns can be acquired. Using a new labelling procedure, the surface of transparent and specular objects can be reconstructed with very encouraging results.
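The per-pixel frequency analysis described in this abstract can be illustrated with a small sketch. It assumes a single pixel observed over `num_frames` frames under a pattern that oscillates at one integer frequency; the signal model, the names, and the amplitude values are hypothetical, and the paper's labelling procedure is not reproduced.

```python
import cmath
import math

def dominant_frequency(signal):
    """Return the non-DC DFT bin with the largest magnitude."""
    n = len(signal)
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):  # skip k = 0, the constant (DC) component
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin

# Hypothetical intensity of one pixel over time: an offset plus a cosine
# at the frequency assigned to its source location in the pattern.
num_frames = 32
pattern_frequency = 5
pixel_signal = [0.5 + 0.4 * math.cos(2 * math.pi * pattern_frequency * t / num_frames + 0.3)
                for t in range(num_frames)]
recovered = dominant_frequency(pixel_signal)
```

The phase offset and the constant term do not affect the recovered bin, which is what makes the frequency usable as an identifier of the pattern location.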
|
#801 - Bi-label Propagation for Generic Multiple Object Tracking [pdf]
Wenhan Luo, Tae-Kyun Kim, Björn Stenger, Xiaowei Zhao, Roberto Cipolla |
Abstract: In this paper, we propose a label propagation framework to handle the multiple object tracking (MOT) problem for a generic object type (cf. pedestrian tracking). Given a target object by an initial bounding box, all objects of the same type are localized together with their identities. We treat this as a problem of propagating bi-labels, i.e. a binary class label for detection and individual object labels for tracking. To propagate the binary class label, we adopt clustered Multiple Task Learning (cMTL) while enforcing spatio-temporal consistency and show that this improves the performance when given limited training data. To track objects, we propagate labels from trajectories to detections based on affinity using appearance, motion, and context. Experiments on public and challenging new sequences show that the proposed method improves over the current state of the art on this task.
|
Abstract: In this paper we provide the first, to the best of our knowledge, probabilistic formulation of one of the most successful and well-studied statistical models of shape and texture, i.e. Active Appearance Models (AAMs). To this end, we use a simple probabilistic model for texture generation assuming both Gaussian noise and Gaussian prior over the latent texture space. We retrieve the shape parameters by formulating a cost function obtained by marginalizing out the latent texture space. This results in a fast implementation when compared to other simultaneous algorithms for fitting AAMs, mainly due to the removal of the calculation of texture parameters. We proceed with demonstrating that, contrary to what is believed regarding the performance of AAMs in generic fitting scenarios, optimization of the proposed cost function produces results that outperform discriminatively trained state-of-the-art algorithms in the problem of facial alignment "in the wild".
|
#814 - Transformation Pursuit for Image Classification [pdf]
Mattis Paulin, Jerome Revaud, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid |
Abstract: An approach to learning invariances in image classification is to augment the training set with transformed versions of the original images. However, given a large set of possible transformations, selecting an optimal subset of transformations is challenging. Indeed, transformations are not equally informative and adding uninformative transformations increases training time with no gain in accuracy. We propose a principled algorithm, Image Transformation Pursuit (ITP), for the automatic selection of transformations. ITP works in a greedy fashion, by selecting at each iteration the one that yields the highest accuracy gain. ITP also allows efficient exploration of complex transformations that are combinations of basic transformations. We report results on two public benchmarks: the CUB dataset of bird images and the ImageNet 2010 challenge. We report on CUB an improvement of top-1 accuracy from 28.2% to 45.2% and on ImageNet an improvement of top-5 accuracy from 70.1% to 74.9%.
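The greedy selection described in this abstract follows the general shape of forward selection, sketched below. This is not the authors' implementation: `greedy_select`, `toy_score`, and the gain values are hypothetical, and a real scorer would train and validate a classifier for each candidate subset.

```python
def greedy_select(candidates, score, max_picks):
    """Repeatedly add the candidate with the largest positive score gain."""
    selected = []
    current = score(selected)
    for _ in range(max_picks):
        best_gain, best_candidate = 0.0, None
        for candidate in candidates:
            if candidate in selected:
                continue
            gain = score(selected + [candidate]) - current
            if gain > best_gain:
                best_gain, best_candidate = gain, candidate
        if best_candidate is None:
            break  # no remaining candidate improves the score
        selected.append(best_candidate)
        current = score(selected)
    return selected

# Made-up per-transformation gains, used only to exercise the loop.
TOY_GAINS = {"flip": 0.05, "crop": 0.03, "rotate": 0.01, "blur": 0.0}

def toy_score(subset):
    """Hypothetical validation accuracy: a baseline plus additive gains."""
    return 0.5 + sum(TOY_GAINS[name] for name in subset)

chosen = greedy_select(list(TOY_GAINS), toy_score, max_picks=4)
```

Stopping as soon as no candidate yields a positive gain reflects the abstract's point that uninformative transformations add training time without improving accuracy.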
|
#817 - Dense Non-Rigid Shape Correspondence using Random Forests [pdf]
Emanuele Rodolà, Samuel Rota Bulò, Thomas Windheuser, Matthias Vestner, Daniel Cremers |
Abstract: We propose a shape matching method that produces dense correspondences tuned to a specific class of shapes and deformations. In a scenario where this class is represented by a small set of typical example shapes, the proposed method learns a shape descriptor that captures these shapes and their deformations. The approach enables the wave kernel signature to extend the class of recognized deformations from near isometries to the deformations appearing in the example set by means of a random forest classifier. With the help of the introduced spatial regularization, the proposed method achieves significant improvements over the baseline approach and obtains state-of-the-art results while keeping short computation times.
|
#820 - Finding the Subspace Mean or Median to Fit Your Need [pdf]
Timothy Marrinan, Michael Kirby, Bruce Draper, Chris Peterson |
Abstract: Many computer vision algorithms employ subspace models to represent data. Many of these approaches benefit from the ability to create an average or prototype for a set of subspaces. The most popular method in these situations is the Karcher mean, also known as the Riemannian center of mass. The prevalence of the Karcher mean may lead some to assume that it provides the best average in all scenarios. However, other subspace averages that appear less frequently in the literature may be more appropriate for certain tasks. The extrinsic manifold mean, the $L_2$-median, and the flag mean are alternative averages that can be substituted directly for the Karcher mean in many applications. This paper evaluates the characteristics and performance of these four averages on synthetic and real-world data. While the Karcher mean generalizes the Euclidean mean to the Grassmann manifold, we show that the extrinsic manifold mean, the $L_2$-median, and the flag mean behave more like medians and are therefore more robust to the presence of outliers among the subspaces being averaged. We also show that while the Karcher mean and $L_2$-median are computed using iterative algorithms, the extrinsic manifold mean and flag mean can be found analytically and are therefore orders of magnitude faster in practice. Finally, we show that the flag mean is a generalization of the extrinsic manifold mean that permits subspaces with different numbers of dimensions to be averaged. The result is a "coo
|
#836 - Deblurring Text Images via L0-Regularized Intensity and Gradient Prior [pdf]
Jinshan Pan, Zhe Hu, Zhixun Su, Ming-Hsuan Yang |
Abstract: We propose a simple yet effective L0-regularized prior based on intensity and gradient for text image deblurring. The proposed image prior is motivated by observing distinct properties of text images. Based on this prior, we develop an efficient optimization method to generate reliable intermediate results for kernel estimation. The proposed method does not require any complex filtering strategies to select salient edges which are critical to the state-of-the-art deblurring algorithms. We discuss the relationship with other deblurring algorithms based on edge selection and provide insight on how to select salient edges more principally. In the final latent image restoration step, we develop a simple method to remove artifacts and render better deblurred images. Experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art text image deblurring methods. In addition, we show that the proposed method can be effectively applied to deblur low-illumination images.
|
#848 - Manifold Based Dynamic Texture Synthesis from Extremely Few Samples [pdf]
Hongteng Xu, Hongyuan Zha, Mark Davenport |
Abstract: In this paper, we present a novel method to synthesize dynamic texture sequences from extremely few samples, e.g., merely two possibly disparate frames, leveraging both Markov Random Fields (MRFs) and manifold learning. Decomposing textural image as a set of patches, we achieve dynamic texture synthesis by estimating sequences of temporal patches. We select candidates for each temporal patch from spatial patches based on MRFs and regard them as samples from a low-dimensional manifold. After mapping candidates to a low-dimensional latent space, we estimate the sequence of temporal patches by finding an optimal trajectory in the latent space. Guided by some key properties of trajectories of realistic temporal patches, we derive a curvature-based trajectory selection algorithm. In contrast to the methods based on MRFs or dynamic systems that rely on a large amount of samples, our method is able to deal with the case of extremely few samples and requires no training phase. We compare our method with the state of the art and show that our method not only exhibits superior performance on synthesizing textures but it also produces results with pleasing visual effects.
|
Abstract: Our goal is to obtain a noise-free, high resolution (HR) image, from an observed, noisy, low resolution (LR) image. The conventional approach of preprocessing the image with a denoising algorithm, followed by applying a super-resolution (SR) algorithm, has an important limitation: Along with noise, some high frequency content of the image (particularly textural detail) is invariably lost during the denoising step. This 'denoising loss' restricts the performance of the subsequent SR step, wherein the challenge is to synthesize such textural details. In this paper, we show that high frequency details in the noisy image (which is ordinarily removed by denoising algorithms) can be effectively used to obtain the missing textural details in the HR domain. To do so, we first obtain HR versions of both the noisy and the denoised images, using a patch-similarity based SR algorithm. We then show that by taking a convex linear combination of orientation and frequency selective bands of the noisy and the denoised HR images, we can obtain a desired HR image where (i) some of the textural signal lost in the denoising step is effectively recovered in the HR domain, and (ii) additional textures can be easily synthesized by appropriately constraining the parameters of the convex combination. We show that this part-recovery and part-synthesis of textures through our algorithm yields HR images that are visually more pleasing than those obtained using the conventional processing pipeline. Furthe
|
#857 - Local Readjustment for High-Resolution 3D Reconstruction [pdf]
Siyu Zhu, Tian Fang, Jianxiong Xiao, Long Quan |
Abstract: Global bundle adjustment usually converges to a nonzero residual and produces sub-optimal camera poses for local areas, which leads to loss of details for high resolution reconstruction. Instead of trying harder to optimize everything globally, we argue that we should live with the non-zero residual and adapt the camera poses to local areas. To this end, we propose a segment-based approach to readjust the camera poses locally and improve the reconstruction for fine geometry details. The key idea is to partition the globally optimized structure-from-motion points into well-conditioned segments for re-optimization, reconstruct their geometry individually, and fuse everything back into a consistent global model. This significantly reduces severe propagated errors and estimation biases caused by the initial global adjustment. The results on several datasets demonstrate that this approach can significantly improve the reconstruction accuracy, while maintaining the consistency of the 3D structure between segments.
|
Abstract: While techniques that segment shapes into visually meaningful parts have generated impressive results, these techniques have focused only on relatively simple shapes, such as those composed of a single object either without holes or with few simple holes. In many applications, shapes created from images can contain many overlapping objects and holes. These holes may come from sensor noise or may be an important part of the shape, and they may be arbitrarily complex. These complexities that appear in real-world 2D shapes can pose grand challenges to the existing part segmentation methods. In this paper, we propose a new decomposition method, called Dual-space Decomposition, that handles complex 2D shapes by recognizing the importance of holes and classifying holes as either topological noise or structurally important features. Our method creates a nearly convex decomposition of a given shape by segmenting both positive and negative regions of the shape. We compare our results to segmentation produced by non-expert human subjects. Based on two evaluation methods, we show that this new decomposition method creates statistically similar and sometimes better segmentation compared to those produced by human subjects.
|
#863 - Aliasing Detection and Reduction in Plenoptic Imaging [pdf]
Zhaolin Xiao, Qing Wang, Jingyi Yu, Guoqing Zhou |
Abstract: When using a plenoptic camera for digital refocusing, angular undersampling can cause severe (angular) aliasing artifacts. Previous approaches have focused on avoiding aliasing by pre-processing the acquired light field via prefiltering, demosaicing, reparameterization, etc. In this paper, we present a different solution that first detects and then removes aliasing at the light field refocusing stage. Different from previous frequency domain aliasing analysis, we carry out a spatial domain analysis to reveal whether the aliasing would occur and uncover where in the image it would occur. The spatial analysis also facilitates easy separation of the aliasing vs. non-aliasing regions and aliasing removal. Experiments on both synthetic scene and real light field camera array data sets demonstrate that our approach has a number of advantages over the classical prefiltering and depth-dependent light field rendering techniques.
|
#873 - Finding Vanishing Points via Point Alignments in Image Primal and Dual Domains [pdf]
José Lezama, Rafael Grompone von Gioi, Jean-Michel Morel, Gregory Randall |
Abstract: We present a novel method for automatic vanishing point detection based on primal and dual point alignment detection. The very same point alignment detection algorithm is used twice: first in the image domain to group line segment endpoints into more precise lines. Second, it is used in the dual domain where converging lines become aligned points. The use of the recently introduced PCLines dual spaces and a robust point alignment detector, leads to a very accurate algorithm. Experimental results on two public standard datasets show that our method significantly advances the state-of-the-art in the Manhattan world scenario, while producing state-of-the-art performances in non-Manhattan scenes.
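The statement that converging lines become aligned points in a dual domain can be checked with a small sketch. It uses the classical slope-intercept dual, where the line y = m·x + c maps to the point (m, c), rather than the PCLines parameterization named in the abstract; the point coordinates and function names are illustrative assumptions.

```python
def line_through(p, q):
    """Return (slope, intercept) of the non-vertical line through p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def collinear(a, b, c, tol=1e-9):
    """True if three 2D points are collinear (cross product within tol of zero)."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return abs(cross) <= tol

# Three lines that all pass through one hypothetical vanishing point.
vanishing_point = (4.0, 2.0)
anchors = [(0.0, 0.0), (0.0, 1.0), (1.0, 5.0)]

# Each line becomes the dual point (slope, intercept).
dual_points = [line_through(a, vanishing_point) for a in anchors]
aligned = collinear(*dual_points)
```

Every line through (x0, y0) satisfies c = y0 − m·x0, so its dual point lies on a single line in (m, c) space, and detecting that alignment recovers the vanishing point. The slope-intercept form cannot represent vertical lines, which is one reason bounded dual spaces are used in practice.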
|
#877 - Investigating Haze-relevant Features in A Learning Framework for Image Dehazing [pdf]
Ketan Tang, Jianchao Yang, Jue Wang |
Abstract: Haze is one of the major factors that degrade outdoor images. Removing haze from a single image is known to be severely ill-posed, and assumptions made in previous methods do not hold in many situations. In this paper, we systematically investigate different haze-relevant features in a learning framework to identify the best feature combination for image dehazing. We show that the dark-channel feature is the most informative one for this task, which confirms the observation of He et al. from a learning perspective, while other haze-relevant features also contribute significantly in a complementary way. We also find that, surprisingly, the synthetic hazy image patches we use for feature investigation serve well as training data for real-world images, which allows us to train specific models for specific applications. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art methods on both synthetic and real-world datasets.
|
#880 - Stacked Progressive Auto-Encoder (SPAE) for Face Recognition Across Poses [pdf]
Meina Kan, Shiguang Shan, Hong Chang, Xilin Chen |
Abstract: Identifying subjects with variations caused by poses is one of the most challenging tasks in face recognition, since the difference in appearances caused by poses may be even larger than the difference due to identity. Inspired by the observation that pose variations change non-linearly but smoothly, we propose to learn pose-robust features by modeling the complex non-linear transform from the non-frontal face images to frontal ones through a deep network in a progressive way, termed stacked progressive auto-encoders (SPAE). Specifically, each shallow progressive auto-encoder of the stacked network is developed to map the face images at large poses to a virtual view at smaller ones, and meanwhile keep those images already at smaller poses unchanged. Then, stacking multiple such shallow auto-encoders can convert a non-frontal face image to a frontal one progressively, which means the pose variations are narrowed down to zero step by step. As a result, the outputs of the topmost hidden layers of the stacked network contain very small pose variations, which can be used as the pose-robust features for face recognition. An additional attraction of the proposed method is that no pose estimation is needed for the test images. The proposed method is tested on two datasets with pose variations, i.e., MultiPIE and FERET datasets, and the experimental results demonstrate the superiority of our method to the existing works, especially to those 2D ones.
|
#900 - Locality in Generic Instance Search from One Example [pdf]
Ran Tao, Efstratios Gavves, Cees Snoek, Arnold Smeulders |
Abstract: This paper aims for generic instance search from a single example. Where the state-of-the-art relies on global image representation for the search, we proceed by including locality at all steps of the method. As the first novelty, we consider many boxes as candidate targets by an efficient point-indexed representation independent of the number of boxes considered. The same representation allows, as the second novelty, the application of very large vocabularies in the powerful Fisher vector and VLAD. As the third novelty we propose to emphasize local search in feature space by an exponential similarity function. Locality is advantageous in instance search as it will rest on the matching unique details. We demonstrate a substantial increase in generic instance search performance from one example on three standard datasets with buildings, logos, and scenes from 0.443 to 0.620 in mAP.
|
#907 - High Quality Photometric Reconstruction using a Depth Camera [pdf]
Avishek Chatterjee, Sk Mohammadul Haque, Venu Madhav Govindu |
Abstract: In this paper we present a depth-guided photometric 3D reconstruction method that works solely with a depth camera like the Kinect. Existing methods that fuse depth with normal estimates use an external RGB camera to obtain photometric information and treat the depth camera as a black box that provides a low quality depth estimate. Our contributions to such methods are twofold. Firstly, instead of using an extra RGB camera, we use the infra-red (IR) camera of the depth camera system itself to directly obtain high resolution photometric information. We believe that ours is the first method to use an IR depth camera system in this manner. Secondly, photometric methods applied to complex objects result in numerous holes in the reconstructed surface due to shadows and self-occlusions. To mitigate this problem, we develop a simple and effective multiview approach that fuses depth and normal information from multiple viewpoints to build a complete, consistent and accurate 3D surface representation. We demonstrate the efficacy of our method to generate high quality 3D surface reconstructions for some complex 3D figurines.
|
Abstract: Rotation in shape recognition is regarded as a puzzling nuisance in most algorithms. In this paper we address three fundamental issues brought by rotated shapes: 1) is alignment among shapes necessary? If the answer is no, 2) how to exploit information in different rotations of the same shape? and 3) how to use rotation unaware local features for rotation aware shape recognition? We argue that the origin of these issues is the use of hand crafted rotation-unfriendly features and measurements. Therefore our goal is to learn a set of hierarchical features that describe all rotated versions of a shape as one class, with the capability of distinguishing different such classes. We propose to rotate shapes as many times as possible as training samples, and learn the hierarchical feature representation by effectively adopting a convolutional neural network. We further show that our method is very efficient because the convolutional network responses of all n rotated versions of the same shape can be computed at the expense of O(log n) factor, instead of the naive O(n). We tested the algorithm on three real datasets: Swedish Leaves dataset, ETH-80 Shape dataset, and a subset of the recently collected Leafsnap dataset. Our approach used the curvature scale space and outperformed the state of the art.
|
Abstract: Current natural text detection methods focus on the bottom up approaches, which rely heavily on strong multi-stage hypotheses for characters and words, and use strong hand crafted heuristic rules to group potential elements into text lines. In contrast to this methodology, we suggest using weak hypotheses in a similarity clustering framework, followed by a simple region-based filtering as post-processing. We treat text line detection as a graph partitioning problem, where each vertex is represented by a Maximally Stable Extremal Region (MSER). First, weak hypotheses are proposed by grouping MSERs into multiple overlapping regions, based on their spatial alignment with respect to their neighbors. Then, higher-order correlation clustering (HOCC) is used to partition the MSERs into text line candidates, using the hypotheses as soft constraints to enforce long range interactions. We further propose a regularization method to solve the Semidefinite Programming problem in the inference. Finally we use a simple texton-based texture classifier to filter out the non-text areas. This framework allows us to naturally handle multiple orientations, languages and fonts. Experiments show competitive performance using this framework. On a recent dataset, we achieved a 9% performance gain in precision, with comparable recall versus competing methods.
|
#915 - Detection, Rectification and Segmentation of Co-planar Repeated Patterns [pdf]
James Pritts, Ondrej Chum, Jiri Matas |
Abstract: This paper presents a novel and general method for detection, rectification and segmentation of co-planar repeated patterns imaged. The only assumption on the image content is that repeated elements of the pattern can be mapped to each other in the scene plane by a set of Euclidean transformations. This is a very general assumption that covers nearly all commonly seen man-made repetitive patterns. In addition, novel linear constraints are exploited that enable geometric ambiguity reduction between the rectification of the imaged pattern and the real-world pattern. The remaining ambiguity is within a similarity if the scene plane contains repeated elements that are rotated differently, or within a similarity with a scale ambiguity along the axis of symmetry if any of the elements are reflected. The method is successfully tested on a broad range of image types including those where state-of-the-art methods fail.
|
Abstract: The common existence of image blur brings out a practically important question -- what is effective to differentiate between blurred and unblurred image regions by nature. We address it by studying a few blur feature representations in image gradient and Fourier domains and through data-driven local filters. Unlike previous methods which are often based on restoration and deconvolution mechanisms, our features are constructed to enhance discriminative power and are adaptive to varying blur scale in images. To avail evaluation, we build a new blur perception dataset containing thousands of images with labeled ground-truth. Our results are applied to facilitate several applications, including blur region segmentation, deblurring.
|
Abstract: In this paper, we present a novel online visual tracking method based on linear representation. First, we present a novel probability continuous outlier model (PCOM) to depict the continuous outliers that occur in the linear representation model. In the proposed model, the element of the noisy observation sample can be either represented by a PCA subspace with small Gaussian noise or treated as an arbitrary value with a uniform prior, in which the spatial consistency prior is exploited by using a binary Markov random field model. Then, we derive the objective function of the PCOM method, the solution of which can be iteratively obtained by the outlier-free least squares and standard max-flow/min-cut steps. Finally, based on the proposed PCOM method, we design an effective observation likelihood function and a simple update scheme for visual tracking. Both qualitative and quantitative evaluations demonstrate that our tracker achieves very favorable performance in terms of both accuracy and speed.
|
#923 - Clothing Co-Parsing by Joint Image Segmentation and Labeling [pdf]
Wei Yang, Liang Lin, Ping Luo |
Abstract: This paper aims at developing an integrated system of clothing co-parsing: given a database of clothes/human images that are unsegmented but annotated with tags, jointly parse them into semantic clothing configurations. We propose a data-driven framework consisting of two phases of inference. The first phase, referred to as "image co-segmentation", iterates to extract consistent regions on images and jointly refines the regions over all images by employing the exemplar-SVM (E-SVM) technique. In the second phase (i.e. "region co-labeling"), we construct a multi-image graphical model by taking the segmented regions as vertices, incorporating several contexts of clothing configuration (e.g., item location and mutual interactions). The joint label assignment can be solved using an efficient message passing algorithm. In addition to evaluating our framework on the Fashionista dataset, we construct a dataset called CCP consisting of 2098 high-resolution street fashion photos to demonstrate the performance of our system. We achieve 90.29% / 88.23% segmentation accuracy and 65.52% / 63.89% recognition rate on the Fashionista and the CCP datasets, respectively, which are superior compared with state-of-the-art methods.
|
#926 - Randomized Max-Margin Compositions for Visual Recognition [pdf]
Angela Eigenstetter, Björn Ommer |
Abstract: Discriminative part-based models are currently a main theme in object detection. The powerful model that combines all parts is then typically only feasible for few constituents, which are in turn iteratively trained to make them as strong as possible. We follow the opposite strategy by randomly sampling a large number of instance specific part classifiers. Due to their number, we cannot directly train a powerful classifier to combine all parts. Therefore, we randomly group them into fewer, overlapping compositions that are trained using a maximum-margin approach. In contrast to the common rationale of compositional approaches, we do not aim for semantically meaningful ensembles. Rather we seek randomized compositions that are discriminative and generalize over all instances of a category. Our approach not only localizes objects in cluttered scenes, but also explains them by parsing with compositions and their constituent parts. Experiments on PASCAL VOC07, on the VOC10 evaluation server, and on the MITIndoor scene dataset show the competitive performance of the approach. Moreover, we evaluate the individual contributions and potential of compositions and their parts in separate experiments.
|
#935 - Learning Expressionlets on Spatio-Temporal Manifold for Dynamic Facial Expression Recognition [pdf]
Mengyi Liu, Shiguang Shan, Ruiping Wang, Xilin Chen |
Abstract: Facial expressions are temporally dynamic events which can be decomposed into a set of muscle motions occurring in different facial regions over various time intervals. For dynamic expression recognition, two key issues, temporal alignment and semantics-aware dynamic representation, must be taken into account. In this paper, we attempt to solve both problems via manifold modeling of videos based on a novel mid-level representation, i.e. expressionlet. Specifically, our method contains three key components: 1) each expression video clip is modeled as a spatiotemporal manifold (STM) formed by dense low-level features; 2) a Universal Manifold Model (UMM) is learned over all low-level features and represented as a set of local ST modes to statistically unify all the STMs; 3) the local modes on each STM can be instantiated by fitting to UMM, and the corresponding expressionlet is constructed by modeling the variations in each local ST mode. With the above strategy, expression videos are naturally aligned both spatially and temporally. To enhance the discriminative power, the expressionlet-based STM representation is further processed with discriminant embedding. Our method is evaluated on four public expression databases, CK+, MMI, Oulu-CASIA, and AFEW. In all cases, our method reports results better than the known state-of-the-art.
|
Abstract: In this paper, we propose a Switchable Deep Network (SDN) for pedestrian detection. The SDN automatically learns hierarchical features, salience maps, and mixture representations of different body parts. Pedestrian detection faces the challenges of background clutter and large variations of pedestrian appearance due to pose and viewpoint changes and other factors. One of our key contributions is to propose a Switchable Restricted Boltzmann Machine (SRBM) to explicitly model the complex mixture of visual variations at multiple levels. At the feature levels, it automatically estimates saliency maps for each test sample in order to separate background clutters from discriminative regions for pedestrian detection. At the part and body levels, it is able to infer the most appropriate template for the mixture models of each part and the whole body. We have devised a new generative algorithm to effectively pre-train the SDN and then fine-tune it with back-propagation. Our approach is evaluated on the Caltech and ETH datasets and achieves the state-of-the-art detection performance.
|
#945 - Patch-based Evaluation of Image Segmentation [pdf]
Christian Ledig, Wenzhe Shi, Wenjia Bai, Daniel Rueckert |
Abstract: The quantification of similarity between image segmentations is a complex yet important task. The ideal similarity measure should be unbiased to segmentations of different volume and complexity, and be able to quantify and visualise segmentation bias. Similarity measures based on overlap, e.g. Dice score, or surface distances, e.g. Hausdorff distance, clearly do not satisfy all of these properties. To address this problem, we introduce Patch-based Evaluation of Image Segmentation (PEIS), a general method to assess segmentation quality. Our method is based on finding patch correspondences and the associated patch displacements, which allow the estimation of segmentation bias. We quantify both the agreement of the segmentation boundary and the conservation of the segmentation shape. We further assess the segmentation complexity within patches to weight the contribution of local segmentation similarity to the global score. We evaluate PEIS on both synthetic data and two medical imaging datasets. On synthetic segmentations of different shapes, we provide evidence that PEIS, in comparison to the Dice score, produces more comparable scores, has increased sensitivity and estimates segmentation bias accurately. On cardiac MR images, we demonstrate that PEIS can evaluate the performance of a segmentation method independent of the size or complexity of the segmentation under consideration. On brain MR images, we compare five different automatic hippocampus segmentation techniques using
|
#951 - Learning to Learn, from Transfer Learning to Domain Adaptation: A Unifying Perspective [pdf]
Novi Patricia, Barbara Caputo |
Abstract: The transfer learning and domain adaptation problems originate from a distribution mismatch between the source and target data distribution. The causes of such mismatch are traditionally considered different. Thus, transfer learning and domain adaptation algorithms are designed to address different issues, and cannot be used in both settings unless substantially modified. Still, one might argue that these problems are just different declinations of learning to learn, i.e. the ability to leverage over prior knowledge when attempting to solve a new task. We propose a learning to learn framework able to leverage over source data regardless of the origin of the distribution mismatch. We consider prior models as experts, and use their output confidence value as features. We use them to build the new target model, combined with the features from the target data through a high-level cue integration scheme. This results in a class of algorithms usable in a plug-and-play fashion over any learning to learn scenario, from binary and multi-class transfer learning to single and multiple source domain adaptation settings. Experiments on several public datasets show that our approach consistently achieves the state of the art.
|
#956 - Who Do I Look Like? Determining Parent-Offspring Resemblance via Genetic Features [pdf]
Afshin Dehghan, Enrique Ortiz |
Abstract: Recent years have seen a major push for face recognition technology due to the large expansion of image sharing on social networks. In this paper, we consider the difficult task of determining parent-offspring resemblance using genetic features to answer the question "Who do I look like?" Although humans can perform this job at a rate higher than chance, it is not clear how humans do it [2]. However, recent studies in anthropology [23] have determined which features tend to be the most genetically discriminative. In this study, we aim to not only create an accurate system for resemblance detection, but also bridge the gap between studies in anthropology and computer vision techniques. In this paper, we aim to answer two key questions: 1) Do offspring resemble their parents? and 2) Do offspring resemble one parent more than the other? We propose an algorithm that fuses the features and metrics discovered via gated autoencoders with a discriminative neural network layer that learns the optimal, or what we call genetic, features to delineate parent-offspring relationships. We further analyze the correlation between our automatically detected features and those found in anthropological studies. Meanwhile, our method outperforms the state-of-the-art in kinship verification by 3-10% depending on the relationship using specific (father-son, mother-daughter, etc.) and generic models.
|
#957 - From Human-Annotated to Machine-Discovered Concepts using Consensus Regularization [pdf]
Afshin Dehghan, Haroon Idrees |
Abstract: A video captures a sequence and interactions of concepts that can be static, for instance, objects or scenes, or dynamic, such as actions. For large datasets containing hundreds of thousands of images or videos, it is impractical to manually annotate all the concepts, or all the instances of a single concept. However, a large set of concepts can be discovered automatically from unlabeled videos which can capture and express the entire dataset. The downside to these machine-discovered concepts is meaninglessness, i.e., they are devoid of semantics and interpretation. In this paper, we present an approach that leverages the strengths of human-annotated and machine-discovered concepts by learning a relationship between them. Since instances of a human concept share visual similarity, the proposed approach uses a novel soft-consensus regularization to learn the mapping that enforces instances from each human concept to have similar representations. The testing is performed by projecting the query onto the machine-discovered concepts and new representations, with non-negativity and unit summation constraints for probabilistic interpretation. We tested our formulation on TRECVID MED and SIN tasks, and obtained encouraging results.
|
#966 - Bags of Spacetime Energies for Dynamic Scene Recognition [pdf]
Christoph Feichtenhofer, Axel Pinz, Richard Wildes |
Abstract: This paper presents a unified bag of visual word (BoW) framework for dynamic scene recognition. The approach builds on primitive features that uniformly capture spatial and temporal orientation structure of the imagery (e.g., video), as extracted via application of a bank of spatiotemporally oriented filters. Various feature encoding techniques are investigated to abstract the primitives to an intermediate representation that is best suited to dynamic scene representation. Further, a novel approach to adaptive pooling of the encoded features is presented that captures spatial layout of the scene even while being robust to situations where camera motion and scene dynamics are confounded. The resulting overall approach has been evaluated on two standard, publicly available dynamic scene datasets. The results show that in comparison to a representative set of alternatives, the proposed approach outperforms the previous state-of-the-art in classification accuracy by 10%.
|
Abstract: We have discovered that 3D reconstruction can be achieved from a single still photographic capture due to accidental motions of the photographer, even while attempting to hold the camera still. We present a novel 3D reconstruction system tailored for this problem that produces depth maps from short video sequences or bursts of still photos from standard cell phone or point-and-shoot cameras, without the need for multi-lens optics, active sensors, or special motions by the photographer. This result leads to the possibility that depth maps of sufficient quality for applications like perspective change, simulated aperture, and object segmentation can come "for free" for a certain fraction of still photographs. Our system first uses bundle adjustment to estimate camera poses, whose initialization and parameterization make use of the small motion assumption. In multiview stereo, we propose to build long range connections between pixels to effectively regularize the noisy photo consistency measurement.
|
#978 - Using Projection Kurtosis Concentration Of Natural Images For Blind Noise Covariance Matrix Estimation [pdf]
Siwei Lyu |
Abstract: Kurtosis of 1D projections provides important statistical characteristics of natural images. In this work, we first provide a theoretical underpinning to a recently observed phenomenon known as projection kurtosis concentration, in which the kurtosis of natural images over different band-pass channels tends to concentrate around a "typical" value. Based on this analysis, we further describe a new method to estimate the covariance matrix of correlated Gaussian noise from a noise corrupted image using random band-pass filters. We demonstrate the effectiveness of our blind noise covariance matrix estimation method on natural images.
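As a reminder of the statistic this abstract relies on, here is a minimal sketch of sample kurtosis (fourth central moment divided by the squared variance). It is not the paper's estimator. The Laplace samples are only an assumed stand-in for heavy-tailed band-pass filter responses, and the function name `kurtosis` is chosen for illustration.

```python
import numpy as np

def kurtosis(x):
    # Non-excess kurtosis: E[(x - mean)^4] / (E[(x - mean)^2])^2
    x = np.asarray(x, dtype=float).ravel()
    c = x - x.mean()
    return (c ** 4).mean() / (c ** 2).mean() ** 2

rng = np.random.default_rng(0)
gauss = rng.normal(size=100_000)     # Gaussian noise: kurtosis close to 3
heavy = rng.laplace(size=100_000)    # assumed stand-in for heavy-tailed responses

k_gauss = kurtosis(gauss)
k_heavy = kurtosis(heavy)
```

A Gaussian signal has kurtosis 3, so responses whose kurtosis sits well above 3 are distinguishable from Gaussian noise, which is the kind of contrast a kurtosis-based noise estimate can exploit.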
|
#981 - Ask the image: supervised pooling to preserve feature locality [pdf]
Sean Ryan Fanello, Nicoletta Noceti, Carlo Ciliberto, Giorgio Metta, Francesca Odone |
Abstract: In this paper we propose a weighted supervised pooling method for visual recognition systems. We combine a standard Spatial Pyramid Representation which is commonly adopted to encode spatial information, with an appropriate Feature Space Representation favouring semantic information in an appropriate feature space. For the latter, we propose a weighted pooling strategy exploiting data supervision to weigh each local descriptor coherently with its likelihood to belong to a given object class. The two representations are then combined adaptively with Multiple Kernel Learning. Experiments on common benchmarks (Caltech-256 and PASCAL VOC-2007) show that our image representation improves the current visual recognition pipeline and it is competitive with similar state-of-art pooling methods. We also evaluate our method on a real Human-Robot Interaction setting, where the pure Spatial Pyramid Representation does not provide sufficient discriminative power, obtaining a remarkable improvement.
|
#987 - Light Field Stereo Matching Using Bilateral Statistics of Surface Cameras [pdf]
Can Chen, Haiting Lin, Zhan Yu, Sing Bing Kang, Jingyi Yu |
Abstract: In this paper, we introduce a bilateral consistency metric on the surface camera (SCam) for light field stereo matching to handle significant occlusion. The concept of SCam is used to model angular radiance distribution with respect to a 3D point. Our bilateral consistency metric is used to indicate the probability of occlusions by analyzing the SCams. We further show how to distinguish between on-surface and free space, textured and non-textured, and Lambertian and specular through bilateral SCam analysis. To speed up the matching process, we apply the edge-preserving guided filter on the consistency-disparity curves. Experimental results show that our technique outperforms both the state-of-the-art and the recent light field stereo matching methods, especially near occlusion boundaries.
|
#1001 - Video Classification Based on Generalized Maximum Co-occurrence Cliques [pdf]
Amir Roshan Zamir, Shayan Modiri Assari |
Abstract: We address the problem of classifying complex videos based on their content. A typical approach to this problem is performing the classification using semantic attributes, commonly termed concepts, which occur in the video. In this paper, we propose a contextual approach to video classification based on Generalized Maximum Clique Problem (GMCP) which leverages the co-occurrence of concepts as the context model. Specifically, we propose to represent a class based on the co-occurrence of its concepts and classify a video based on matching its semantic co-occurrence pattern to each class representation. We perform the matching using GMCP which finds the strongest clique of co-occurring semantic concepts in a video. We argue that, in principle, the co-occurrence of concepts yields a richer representation of a video compared to most of the current approaches. Additionally, we propose a novel optimal solution to GMCP based on Mixed Binary Integer Programming (MBIP). The evaluations show our novel approach, which opens new opportunities for further research in this direction, outperforms several well established video categorization methods.
|
#1009 - Congruency-Based Reranking [pdf]
Itai Ben Shalom, Adiel Ben Shalom, Noga Levy, Lior Wolf, Tamir Hazan, Nachum Dershowitz, Yaniv Bar, Roni Shweka, Yaacov Choueka |
Abstract: We present a tool for re-ranking the results of a specific query by considering the (n+1) × (n+1) matrix of pairwise similarities among the elements of the set of n retrieved results and the query itself. The re-ranking thus makes use of the similarities between the various results and does not employ additional sources of information. The tool is based on employing graphical Bayesian models, which reinforce retrieved items strongly linked to other retrievals, and on repeated clustering in order to measure the stability of the obtained associations. The utility of the tool is demonstrated within the context of visual search of documents from the Cairo Genizah and for retrieval of paintings by the same artist and of the same style.
|
#1016 - Predicting User Annoyance Using Image Attributes [pdf]
Gordon Christie, Amar Parkash, Ujwal Krothapalli, Devi Parikh |
Abstract: Computer Vision algorithms make mistakes. In human-centric applications, some mistakes are more annoying to users than others. In order to design algorithms that minimize the annoyance to users, we need access to an annoyance or cost matrix that holds the annoyance of each type of mistake. Such matrices are not readily available, especially for a wide gamut of human-centric applications where annoyance is tied closely to human perception. To avoid having to conduct extensive user studies to gather the annoyance matrix for all possible mistakes, we propose predicting the annoyance of previously unseen mistakes by learning from example mistakes and their corresponding annoyance. We promote the use of attribute-based representations to transfer this knowledge of annoyance. Our experimental results with faces and scenes demonstrate that our approach can predict annoyance more accurately than baselines. We show that as a result, our approach makes less annoying mistakes in a real-world image retrieval application.
|
#1019 - Talking Heads: Detecting Humans and Recognizing Their Interactions [pdf]
Minh Hoai, Andrew Zisserman |
Abstract: The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contributions: first, we introduce a new learnable context aware configuration model for detecting sets of people in TV material. The model predicts the scale and location of each upper body in the configuration, has efficient and globally optimal inference, and is trained using a maximum margin framework. Second, we show that the configuration model outperforms a Deformable Part Model (DPM) for predicting upper body locations in video frames. Experiments are performed over two datasets: the TV Human Interaction dataset and 150 episodes from four different TV shows. We also demonstrate the benefits of the model in recognizing interactions in TV shows.
|
#1023 - 3D Modeling from Wide Baseline Range Scans using Contour Coherence [pdf]
Ruizhe Wang, Jongmoo Choi, Gerard Medioni |
Abstract: Registering 2 or more range scans is a fundamental problem, with application to 3D modeling. While this problem is well addressed by existing techniques such as ICP when the views overlap significantly, no satisfactory solution exists for wide baseline registration. We propose here a novel approach which leverages contour coherence and allows us to align two wide baseline range scans with limited overlap. We maximize the contour coherence by iteratively building robust corresponding pairs on apparent contours and minimizing their distances. We use the contour coherence under a multi-view rigid registration framework, and this enables the reconstruction of accurate and complete 3D models from as few as 4 frames. We further extend it to handle articulations. After modeling with a few frames, in case higher accuracy is required, more frames can be easily added in a drift-free manner by a conventional registration method. Experimental results on both synthetic and real data demonstrate the effectiveness and robustness of our contour coherence based registration approach to wide baseline range scans, and to 3D modeling.
|
Abstract: A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLAD's difference coding, even with L2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The results are an 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the 2011 CU-Bird species.
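The per-codeword integral images mentioned in this abstract build on the standard integral image (summed-area table), which returns the sum over any box from four lookups. The following is a minimal dense sketch of that building block only, not the authors' sparse FLAIR implementation. The toy `counts` map and the function names are assumptions for illustration.

```python
import numpy as np

def integral_image(a):
    # Zero-padded cumulative sum, so that ii[r, c] == a[:r, :c].sum()
    ii = np.zeros((a.shape[0] + 1, a.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, bottom, right):
    # Sum of a[top:bottom, left:right] using four table lookups
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

# Hypothetical count map for one codeword (how often it occurs at each pixel)
counts = np.arange(12).reshape(3, 4)
ii = integral_image(counts)
total_in_box = box_sum(ii, 1, 1, 3, 3)   # same as counts[1:3, 1:3].sum()
```

Because the cost of `box_sum` does not depend on the box size, the table is built once per codeword and then reused for every candidate box.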
|
#1029 - Stable Learning in Coding Space for Multi-Class Decoding and Its Extension for Multi-Class Hypothesis Transfer Learning [pdf]
Bang Zhang, Yi Wang, Yang Wang, Fang Chen |
Abstract: Many prevalent multi-class classification approaches can be unified and generalized by the output coding framework, which usually consists of three phases: (1) coding, (2) learning binary classifiers, and (3) decoding. Most of these approaches focus on the first two phases, and a predefined distance function is used for decoding. In this paper, however, we propose to perform learning in coding space for more adaptive decoding, thereby improving overall performance. Ramp loss is exploited for measuring multi-class decoding error. The proposed algorithm has uniform stability. It is insensitive to data noise and scales to large datasets. Generalization error bound and numerical results are given with promising outcomes. The outcome of the coding space learning in turn helps to improve binary classifiers. This is useful for resolving some difficult machine learning problems. To show this, the proposed method is extended for hypothesis transfer learning (HTL), which is a transfer learning framework only exploiting source domain hypotheses. Our method efficiently transfers knowledge from multiple source domains to multiple target domains by alternating coding space learning and target domain classifier learning. Empirical results are encouraging.
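For context, the "predefined distance function" baseline that this abstract contrasts with can be sketched as Hamming decoding over a fixed code matrix. This illustrates the conventional decoding step only, not the learned decoding the paper proposes. The code matrix `CODE` and the function name are hypothetical.

```python
import numpy as np

# Hypothetical code matrix: 4 classes (rows) encoded over 6 binary classifiers (columns).
# Any two rows differ in 4 positions, so one wrong classifier output is tolerated.
CODE = np.array([
    [+1, +1, +1, -1, -1, -1],
    [+1, -1, -1, +1, +1, -1],
    [-1, +1, -1, +1, -1, +1],
    [-1, -1, +1, -1, +1, +1],
])

def hamming_decode(bits, code=CODE):
    # Return the class whose codeword disagrees with the classifier outputs least often
    bits = np.asarray(bits)
    distances = (code != bits).sum(axis=1)
    return int(distances.argmin())

clean = hamming_decode([+1, -1, -1, +1, +1, -1])   # exact codeword of class 1
noisy = hamming_decode([+1, -1, -1, +1, +1, +1])   # last classifier output flipped
```

The distance here is fixed in advance and treats every classifier as equally reliable, which is the limitation that motivates learning the decoding instead.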
|
#1041 - PatchMatch Based Joint View Selection and Depthmap Estimation [pdf]
Enliang Zheng, Vladimir Jojic, Enrique Dunn, Jan-Michael Frahm |
Abstract: We propose a multi-view depthmap estimation approach aimed at adaptively ascertaining the pixel level data associations between a reference image and all the elements of a source image set. Namely, we address the question, What aggregation subset of the source image set should we use to estimate the depth of a particular pixel in the reference image? We pose the problem within a probabilistic framework that jointly models pixel-level view selection and depthmap estimation given the local pairwise image photoconsistency. The corresponding graphical model is solved by combining variational inference with PatchMatch-like depth sampling and propagation. Experimental results on standard multi-view benchmarks convey the state-of-the-art estimation accuracy afforded by mitigating spurious pixel-level data associations. Conversely, experiments on large internet crowd sourced data demonstrate the robustness of our approach against unstructured and heterogeneous image capture characteristics. Moreover, the linear computational and storage requirements of our formulation, as well as its inherent parallelism, enable an efficient and scalable GPU based implementation.
|
#1045 - Asymmetric sparse kernel approximations for large-scale visual search [pdf]
Damek Davis, Stefano Soatto, Jonathan Balzer |
Abstract: We introduce an asymmetric sparse approximate embedding optimized for fast kernel comparison operations arising in large-scale visual search. In contrast to other methods that perform an explicit approximate embedding using kernel PCA followed by a distance compression technique in R^d, which loses information at both steps, our method utilizes the implicit kernel representation directly. In addition, we empirically demonstrate that our method needs no explicit training step and can operate with a dictionary of random exemplars from the dataset. We evaluate our method on three benchmark image retrieval datasets: SIFT1M, ImageNet, and 80M-TinyImages.
|
#1052 - Multiple Granularity Analysis for Fine-grained Action Detection [pdf]
Bingbing Ni, Pierre Moulin |
Abstract: We propose to decompose the fine-grained human activity analysis problem into two sequential tasks with increasing granularity. Firstly, we infer the rough interaction status (i.e., which object is being manipulated). Knowing that the major challenge is frequent mutual occlusions during manipulation, we propose an interaction tracking framework in which hand/object position and status of interaction are jointly tracked by explicitly modeling the contextual information between occlusion and interaction status. Secondly, the inferred hand/object position and rough interaction status are utilized to form a more compact and discriminative action representation and detection strategy, which effectively prunes a large amount of motion features from irrelevant spatio-temporal positions. We perform comprehensive experiments on two challenging fine-grained activity datasets (i.e., cooking action) and the results show that the proposed framework achieves high accuracy/robustness in tracking multiple mutually occluded hands/objects during manipulation as well as a significant recognition accuracy improvement on fine-grained action recognition over the state-of-the-art methods.
|
Abstract: We introduce a general framework for quickly augmenting a dataset containing pre-existing source annotations of one type (e.g., segmentations) with a new type of target annotation (e.g., part annotations). As annotators label new target annotations, we incrementally learn a translator from source to target labels as well as a computer-vision-based structured predictor. These two components are combined together to form an improved prediction system that is used to accelerate collection of target annotations via active learning and interactive labeling techniques. We show how the method can be applied to a wide variety of computer vision learning problems and annotation schemes, including bounding boxes, segmentations, 2D and 3D part-based systems, and class and attribute labels. The proposed system will be a useful tool toward exploring new types of representations beyond simple bounding boxes, object segmentations, and class labels, toward building interactive methods for evolving definitions of part, attribute, and action vocabularies without relabeling the entire dataset, and toward finding new ways to exploit existing large datasets with traditional types of annotations like SUN, ImageNet, and Pascal VOC.
|
#1064 - Scattering Parameters and Surface Normals from Homogeneous Translucent Materials using Photometric Stereo [pdf]
Bo Dong, Kathleen Moore, Weiyi Zhang, Pieter Peers |
Abstract: This paper proposes a novel photometric stereo solution to jointly estimate surface normals and scattering parameters from a flat homogeneous translucent object. Similar to classic photometric stereo, our method only requires as few as three observations of the translucent object under directional lighting. Naively applying classic photometric stereo results in blurred photometric normals. We develop a novel blind deconvolution algorithm based on inverse rendering for recovering the sharp surface normals and the material properties. We demonstrate our method on a variety of translucent objects.
|
#1070 - Three Guidelines of Online Learning for Large-Scale Visual Recognition [pdf]
Yoshitaka Ushiku, Tatsuya Harada |
Abstract: Combinations of high-dimensional features and linear classifiers are widely used today for large-scale visual recognition. Numerous so-called mid-level features have been developed and mutually compared on an experimental basis. Although various learning methods for linear classification have also been proposed in machine learning and natural language processing literature, they have rarely been evaluated for visual recognition. In this paper, we give guidelines via investigations of state-of-the-art online learning methods of linear classifiers. Many methods have been evaluated using toy data and natural language processing problems such as document classification. Consequently, we gave those methods a unified interpretation from the viewpoint of visual recognition. Results of controlled comparisons indicate three guidelines that might change the pipeline for visual recognition.
|
#1071 - The Secrets of Salient Object Segmentation [pdf]
Yin Li, Xiaodi Hou, Christof Koch, James Rehg, Alan Yuille |
Abstract: In this paper we provide an extensive evaluation of fixation prediction and salient object segmentation algorithms as well as statistics of major datasets. Our analysis identifies serious design flaws of existing salient object benchmarks, called the dataset design bias, which arise from overemphasising the stereotypical concepts of saliency. The dataset design bias not only creates a discomforting disconnection between fixations and salient object segmentation, but also misleads algorithm design. Based on our analysis, we propose a new high quality dataset that offers both fixation and salient object segmentation ground-truth. With fixations and salient objects being presented simultaneously, we are able to bridge the gap between fixations and salient objects, and propose a novel method for salient object segmentation. Our model gives superior performance in segmenting salient objects. We report significant benchmark progress on existing datasets, as well as our newly proposed dataset of salient object segmentation.
|
Abstract: We introduce a method to reduce most higher-order terms of Markov Random Fields with binary labels into lower-order ones without introducing any new variables, while keeping the minimizer of the energy unchanged. While the method does not reduce all terms, it can be used with existing techniques that transform arbitrary terms (by introducing auxiliary variables) and improve the speed. The method eliminates a higher-order term in the polynomial representation of the energy by finding the value assignment to the variables involved that cannot be part of a global minimum and increasing the potential value only when that particular combination occurs by the exact amount that makes the potential of lower order. We also introduce a heuristic that forgoes the guarantee of exact equivalence of minimizer in favor of speed. With experiments on the same field of experts dataset used in previous work, we show that the roof-dual labeling after the reduction labels significantly more variables and the energy converges more rapidly.
|
#1086 - Discriminative Ferns Ensemble for Hand Pose Recognition [pdf]
Eyal Krupka, Aharon Bar Hillel, Ben Klein, Alon Vinnikov, Daniel Freedman, Simon Stachniak |
Abstract: We present the Discriminative Ferns Ensemble (DFE) classifier for efficient visual object recognition. The classifier architecture is designed to optimize both classification speed and accuracy when a large training set is available. Speed is obtained using simple binary features and direct indexing into a set of tables, and accuracy by using a large capacity model and careful discriminative optimization. The proposed framework is applied to the problem of hand pose recognition in depth and infra-red images, using a very large training set. Both the accuracy and the classification time obtained are considerably superior to relevant competing methods, allowing one to reach accuracy targets with run times orders of magnitude faster than the competition. We also show empirically that using DFE, we can significantly reduce classification time by increasing training sample size for a fixed target accuracy.
|
#1089 - Large-scale visual font recognition [pdf]
Guang Chen, Jianchao Yang, Hailin Jin, Jonathan Brandt, Eli Shechtman, Aseem Agarwala, Tony Han |
Abstract: This paper addresses the large-scale visual font recognition (VFR) problem, which aims at automatic identification of the typeface, weight, and slope of the text in an image or photo without any knowledge of content. Although visual font recognition has many practical applications, it has largely been neglected by the vision community. To address the VFR problem, we construct a large-scale dataset containing 2,420 font classes, which easily exceeds the scale of most image categorization datasets in computer vision. As font recognition is inherently dynamic and open-ended, i.e., new classes and data for existing categories are constantly added to the database over time, we propose a scalable solution based on the nearest class mean classifier (NCM). The core algorithm is built on local feature embedding, local feature metric learning and max-margin template selection, which is naturally amenable to NCM and thus to such open-ended classification problems. The new algorithm can generalize to new classes and new data at little added cost. Extensive experiments demonstrate that our approach is very effective on our synthetic test images, and achieves promising results on real world test images.
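To make the nearest class mean (NCM) idea concrete, here is a minimal Python sketch. It uses plain squared Euclidean distance on toy 2D vectors; the paper's local feature embedding, metric learning, and template selection are not reproduced, and the class names and helper functions are hypothetical.

```python
from collections import defaultdict


def fit_class_means(features, labels):
    """Return {label: mean feature vector} for the labelled training vectors."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(means, x):
    """Assign x to the class whose mean is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(means, key=lambda y: sq_dist(means[y], x))


# Toy 2D features for two hypothetical font classes.
features = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
labels = ["serif", "serif", "sans", "sans"]
means = fit_class_means(features, labels)

# Adding a new class only requires storing one more mean; nothing is retrained.
means["mono"] = [3.0, 3.0]
```

The last line is the point of the design: a new class is added by computing its mean, which is what makes NCM attractive for open-ended problems.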
|
Similar papers:
[rank all papers by similarity to this]
|
#1092 - Bregman Divergences for Infinite Dimensional Covariance Matrices [pdf]
Mehrtash Harandi, Mathieu Salzmann, Fatih Porikli |
Abstract: We introduce an approach to computing and comparing Covariance Descriptors (CovDs) in infinite-dimensional spaces. CovDs have become increasingly popular to address classification problems in computer vision. While CovDs offer some robustness to measurement variations, they also throw away part of the information contained in the original data by only retaining the second-order statistics over the measurements. Here, we propose to overcome this limitation by first mapping the original data to a high-dimensional Hilbert space, and only then computing the CovDs. We show that several Bregman divergences can be computed between the resulting CovDs in Hilbert space via the use of kernels. We then exploit these divergences for classification purposes. Our experiments demonstrate the benefits of our approach on several tasks, such as material and texture recognition, person re-identification, and action recognition from motion capture data.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: While most existing multilabel ranking methods assume the availability of a single objective label ranking for each instance in the training set, this paper deals with a more common case where subjective, inconsistent rankings from multiple rankers are associated with each instance. The key idea is to learn a latent preference distribution for each instance. The proposed method consists of two main steps. The first step is to generate a common preference distribution that is most compatible with all the personal rankings. The second step is to learn a mapping from the instances to the preference distributions. The proposed preference distribution learning (PDL) method is applied to the problem of natural scene image annotation. Experimental results show that PDL can effectively incorporate the information given by the inconsistent rankers, and performs remarkably better than the compared state-of-the-art multilabel ranking algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
#1109 - Deep Learning Hidden Identity Features for Face Verification [pdf]
Yi Sun, Xiaogang Wang, Xiaoou Tang |
Abstract: This paper proposes a set of effective high-level features, referred to as hidden identity features (HIFs), for face verification. The HIFs are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize thousands of face identities in the training set simultaneously, these deep ConvNets gradually form high-level features, which are more relevant to face identities, in the top layers. With this extremely challenging recognition task as supervision, we learned features that consistently correspond to identity and can be generalized well to new identities in test. Moreover, we found that classifying a large amount of training identities while retaining a small number of last hidden layer neurons is key to learning compact and discriminative features. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifiers can be learned based on these high-level representations for face verification. The performance of our model is among the best for all of the published methods on LFW, despite using only weakly aligned faces.
|
Similar papers:
[rank all papers by similarity to this]
|
#1110 - Latent Dictionary Learning for Sparse Representation based Classification [pdf]
Meng Yang, Luc Van Gool |
Abstract: Dictionary learning (DL) for sparse coding has shown promising results in classification tasks, but how to adaptively build a relationship between dictionary atoms and class labels is still an important open question. Existing dictionary learning approaches simply fix a dictionary atom to be class-specific or shared by all classes beforehand, and do not update this relationship during DL. To address this issue, in this paper we propose a novel latent dictionary learning (LDL) method to learn a discriminative dictionary and build its relationship to class labels adaptively. Each dictionary atom is jointly learned with a latent vector, which associates this atom to the representation of different classes. More specifically, we introduce a latent representation model, in which discrimination of the learned dictionary is exploited via minimizing the within-class scatter of coding coefficients and the latent-value weighted dictionary coherence. The optimal solution is efficiently obtained by the proposed solving algorithm. Correspondingly, a latent sparse representation based classifier is also presented. Experimental results demonstrate that our algorithm outperforms many recently proposed sparse representation and dictionary learning approaches for action, gender and face recognition.
|
Similar papers:
[rank all papers by similarity to this]
|
#1111 - CID: Combined Image Denoising in Spatial and Frequency Domains Using Web Images [pdf]
Huanjing Yue, Xiaoyan Sun, Jingyu Yang, Feng Wu |
Abstract: In this paper, we propose a novel two-step scheme to filter heavy noise from images with the assistance of retrieved Web images. There are two key technical contributions in our scheme. First, for every noisy image block, we build two three-dimensional (3D) data cubes by using similar blocks in retrieved Web images and similar non-local blocks within the noisy image, respectively. To better use their correlations, we propose different denoising strategies. The denoising in the 3D cube built upon the retrieved images is performed as median filtering in the spatial domain, whereas the denoising in the other 3D cube is performed in the frequency domain. These two denoising results are then combined in the frequency domain to produce a denoised image. Second, to handle heavy noise, we further propose using the denoised image to improve image registration of the retrieved Web images, 3D cube building, and the estimation of filtering parameters in the frequency domain. Afterwards, the proposed denoising is performed on the noisy image again to generate the final denoising result. Our experimental results show that when the noise is high, the proposed scheme is better than BM3D by more than 2 dB in PSNR and the visual quality improvement is clear to see.
|
Similar papers:
[rank all papers by similarity to this]
|
#1112 - A New Perspective on Material Classification and Ink Identification [pdf]
Rakesh Shiradkar, Li Shen, George Landon, Sim Heng Ong, Ping Tan |
Abstract: The surface bi-directional reflectance distribution function (BRDF) can be used to distinguish different materials. The BRDFs of many real materials are near isotropic and can be approximated well by a 2D function. However, when the camera principal axis is coincident with the surface normal of the material sample, the captured BRDF slice is nearly 1D, which suffers from significant information loss. Thus, dramatic improvement in classification performance can be achieved by simply setting the camera at a slanted view to capture a larger portion of the BRDF domain. We further use a handheld flashlight camera to capture a 1D BRDF slice for material classification. This 1D slice captures important reflectance properties such as specular reflection and retro-reflectance. We apply these results to ink classification, which can be used in forensics and in analyzing historical manuscripts. For the first time, we show that most inks on the market can be well distinguished by their reflectance properties. Our system achieves 85% overall classification accuracy over 55 different inks with a 2D BRDF slice, and 71% accuracy with a 1D BRDF slice.
|
Similar papers:
[rank all papers by similarity to this]
|
#1120 - Fast Rotation Search with Stereographic Projections for 3D Registration [pdf]
Alvaro Parra Bustos, Tat-Jun Chin, David Suter |
Abstract: Recently there has been a surge of interest in using branch-and-bound (bnb) optimisation for 3D point cloud registration. While bnb guarantees globally optimal solutions, it is usually too slow to be practical. A fundamental source of difficulty is the search for the rotation parameters in the 3D rigid transform. In this work, assuming that the translation parameters are known, we focus on constructing a fast rotation search algorithm. With respect to an inherently robust geometric matching criterion, we propose a novel bounding function for bnb that allows rapid evaluation. Underpinning our bounding function is the usage of stereographic projections to precompute and spatially index all possible point matches. This yields a robust and global algorithm that is significantly faster than previous methods. To conduct full 3D registration, the translation can be supplied by 3D feature matching, or by another optimisation framework that provides the translation. On various challenging point clouds, including those taken out of lab settings, our approach demonstrates superior efficiency.
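For readers unfamiliar with the term, a stereographic projection maps points on a sphere to a plane. The Python sketch below shows one common convention (projecting the unit sphere from the pole (0, 0, 1) onto the plane z = 0) together with its inverse. The convention and the function names are assumptions for illustration; the abstract does not say which variant the authors use or how the projected points are indexed.

```python
import math


def stereographic(p):
    """Project a unit-sphere point (x, y, z) from the pole (0, 0, 1) onto the plane z = 0."""
    x, y, z = p
    if math.isclose(z, 1.0):
        raise ValueError("the projection pole has no image on the plane")
    return (x / (1.0 - z), y / (1.0 - z))


def inverse_stereographic(q):
    """Map a plane point (u, v) back to the unit sphere."""
    u, v = q
    r2 = u * u + v * v
    return (2.0 * u / (r2 + 1.0), 2.0 * v / (r2 + 1.0), (r2 - 1.0) / (r2 + 1.0))


# Round trip for a point on the unit sphere (0.6^2 + 0.8^2 = 1).
point = (0.6, 0.0, 0.8)
plane_point = stereographic(point)
recovered = inverse_stereographic(plane_point)
```

Because the map is a bijection away from the pole, points on the sphere can be stored and queried with ordinary 2D spatial data structures.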
|
Similar papers:
[rank all papers by similarity to this]
|
#1121 - Filter Pairing Neural Network for Person Re-identification [pdf]
Wei Li, Rui Zhao, Tong Xiao, Xiaogang Wang |
Abstract: Person re-identification is to match pedestrian images from disjoint camera views detected by pedestrian detectors. Challenges are presented in the form of complex variations of lightings, poses, viewpoints, blurring effects, image resolutions, camera settings, occlusions and background clutter across camera views. In addition, misalignment introduced by the pedestrian detector will affect most existing person re-identification methods that use manually cropped pedestrian images and assume perfect detection. In this paper, we propose a novel filter pairing neural network (FPNN) to jointly handle misalignment, photometric and geometric transforms, occlusions and background clutter. All the key components are jointly optimized to maximize the strength of each component when cooperating with others. In contrast to existing works that use handcrafted features, our method automatically learns features optimal for the re-identification task from data. The learned filter pairs encode photometric transforms. Its deep architecture makes it possible to model a mixture of complex photometric and geometric transforms. We build the world's largest benchmark dataset with 13,164 images of 1,360 pedestrians and will release it to the public. Unlike existing datasets, which only provide manually cropped pedestrian images, our dataset provides automatically detected bounding boxes for evaluation close to practical applications. Our neural network significantly outperforms state-of-the-art m
|
Similar papers:
[rank all papers by similarity to this]
|
#1128 - Sequential Convex Relaxation for Mutual-Information-Based Unsupervised Figure-Ground Segmentation [pdf]
Youngwook Kee, Mohamed Souiai, Daniel Cremers, Junmo Kim |
Abstract: We propose an optimization algorithm for mutual-information-based unsupervised figure-ground separation. The algorithm jointly estimates the color distributions of the foreground and background, and separates these based on their mutual information with geometric regularity. To this end, we revisit the mutual information and reformulate it in terms of the photometric variable and the indicator function; and propose a sequential convex optimization strategy for solving the non-convex optimization problem that arises. We minimize a sequence of convex sub-problems for the mutual-information-based non-convex energy functional and we efficiently attain high quality solutions for challenging figure-ground segmentation problems. We demonstrate the capacity of our approach in numerous experiments that show convincing fully unsupervised figure-ground separation, in terms of both segmentation quality and robustness to initialization.
|
Similar papers:
[rank all papers by similarity to this]
|
#1129 - A Convex Relaxation of Ambrosio-Tortorelli's Elliptic Functional for the Mumford-Shah Functional [pdf]
Youngwook Kee, Junmo Kim |
Abstract: In this paper we revisit Ambrosio-Tortorelli's nonconvex elliptic functional for approximating the Mumford-Shah functional. Rather than solving the nonconvex functional directly, we propose a convex relaxation for it in an attempt to compute both globally optimal and visually better solutions; this is the main contribution of this paper. Inspired by McCormick's seminal work on factorable nonconvex problems, we split a nonconvex product term that arises in the Ambrosio-Tortorelli functional in a way that a typical alternating gradient method guarantees a globally optimal solution without taking coupling effects completely away. Furthermore, we not only provide a fruitful analysis of the proposed relaxation, but also demonstrate its capacity in numerous experiments that show convincing results compared to a naive extension of the McCormick relaxation and its quadratic variant. Indeed, we believe that the proposed relaxation would open up a possibility for convexifying a new class of functions in the context of energy minimization for computer vision.
|
Similar papers:
[rank all papers by similarity to this]
|
#1132 - Recovering Surface Details under General Unknown Illumination Using Shading and Coarse Multi-view Stereo [pdf]
Di Xu, Qi Duan, Jianmin Zheng, Juyong Zhang, Jianfei Cai, Tat-Jen Cham |
Abstract: Reconstructing the shape of a 3D object from multi-view images under unknown, general illumination is a fundamental problem in computer vision and high quality reconstruction is usually challenging especially when high detail is needed. This paper presents a total variation (TV) based approach for recovering surface details using shading and multi-view stereo (MVS). Behind the approach are our two important observations: (1) the illumination over the surface of an object tends to be piecewise smooth and (2) the recovery of surface orientation is not sufficient for reconstructing geometry, which were previously overlooked. Thus we introduce TV to regularize the lighting and use visual hull to constrain partial vertices. The reconstruction is formulated as a constrained TV-minimization problem that treats the shape and lighting as unknowns simultaneously. An augmented Lagrangian method is proposed to quickly solve the TV-minimization problem. As a result, our approach is robust, stable and is able to efficiently recover high quality of surface details even starting with a coarse MVS. These advantages are demonstrated by the experiments with synthetic and real world examples.
|
Similar papers:
[rank all papers by similarity to this]
|
#1139 - Fully Automated Non-rigid Segmentation with Distance Regularized Level Set Evolution Initialized and Constrained by Deep-structured Inference [pdf]
Tuan Ngo, Gustavo Carneiro |
Abstract: We propose a new fully automated non-rigid segmentation approach based on the distance regularized level set method that is initialized and constrained by the results of a structured inference using deep belief networks. This recently proposed level-set formulation achieves reasonably accurate results in several segmentation problems, and has the advantage of eliminating periodic re-initializations during the optimization process, and as a result it avoids numerical errors. Nevertheless, when applied to challenging problems, such as the left ventricle segmentation from short axis cine magnetic resonance (MR) images, the accuracy obtained by this distance regularized level set is lower than the state of the art. The main reasons behind this lower accuracy are the dependence on a good initial guess for the level set optimization and on reliable appearance models. We address these two issues with an innovative structured inference using deep belief networks that produces a reliable initial guess and appearance model. The effectiveness of our method is demonstrated on the MICCAI 2009 left ventricle segmentation challenge, where we show that our approach achieves one of the most competitive results (in terms of segmentation accuracy) in the field.
|
Similar papers:
[rank all papers by similarity to this]
|
#1147 - Non-rigid Segmentation using Sparse Low Dimensional Manifolds and Deep Belief Networks [pdf]
Jacinto Nascimento, Gustavo Carneiro |
Abstract: In this paper, we propose a new methodology for segmenting non-rigid visual objects, where the search procedure is conducted directly on a sparse low-dimensional manifold, guided by the classification results computed from a deep belief network. Our main contribution is the fact that we do not rely on the typical sub-division of segmentation tasks into rigid detection and non-rigid delineation. Instead, the non-rigid segmentation is performed directly, where points in the sparse low-dimensional manifold can be mapped to an explicit contour representation in image space. Our proposal shows significantly smaller search and training complexities given that the dimensionality of the manifold is much smaller than the dimensionality of the search spaces for the aforementioned rigid detection and non-rigid delineation, and that we no longer require a two-stage segmentation process. We focus on the problem of left ventricle endocardial segmentation from ultrasound images, and lip segmentation from frontal facial images using the extended Cohn-Kanade (CK+) database. Our experiments show that the use of sparse low dimensional manifolds reduces the search and training complexities of current segmentation approaches without a significant impact on the segmentation accuracy shown by state-of-the-art approaches.
|
Similar papers:
[rank all papers by similarity to this]
|
#1148 - 3D-aided face recognition robust to expression and pose variations [pdf]
Baptiste Chu, Sami Romdhani, Liming Chen |
Abstract: Expression and pose variations are major challenges for reliable face recognition (FR) in 2D. In this paper, we aim to endow state-of-the-art face recognition SDKs with robustness to facial expression variations and pose changes by using an extended 3D Morphable Model (3DMM) which isolates identity variations from those due to facial expressions. Specifically, given a probe with expression, a novel view of the face is generated where the pose is rectified and the expression neutralized. We present two methods of expression neutralization. The first one uses prior knowledge to infer the neutral expression image from an input image. The second method, specifically designed for verification, is based on the transfer of the gallery face expression to the probe. Experiments using rectified and neutralized views with a standard commercial FR SDK on two 2D face databases, namely Multi-PIE and AR, show significant improvement in the commercial SDK's ability to deal with expression and pose variations and demonstrate the effectiveness of the proposed approach.
|
Similar papers:
[rank all papers by similarity to this]
|
#1151 - Remote Heart Rate Measurement From Face Videos Under Realistic Situations [pdf]
Xiaobai Li, Jie Chen, Guoying Zhao, Matti Pietikäinen |
Abstract: Heart rate is an important indicator of people's physiological state. Recently, several papers have reported methods that can measure heart rate remotely from face videos. Those methods work well on stationary subjects under well controlled conditions, but their performance significantly degrades if the videos are recorded under more challenging conditions, specifically when subjects' motions and illumination variations are involved. We propose a framework which utilizes face tracking and Normalized Least Mean Square adaptive filtering methods to counter their influences. We test our framework on a large, difficult, and public database, MAHNOB-HCI, and demonstrate that our method substantially outperforms all previous methods. We also use our method for long term heart rate monitoring in a game evaluation scenario and achieve promising results.
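The following Python sketch illustrates generic Normalized Least Mean Square (NLMS) adaptive filtering, the technique named in the abstract. It assumes that a reference signal correlated with the interference is available; the synthetic signals, filter order, step size, and the name `nlms_cancel` are all hypothetical choices for this example and do not describe the authors' pipeline.

```python
import math
import random


def nlms_cancel(reference, observed, order=4, mu=0.1, eps=1e-8):
    """Subtract the part of `observed` that is linearly predictable from `reference`."""
    weights = [0.0] * order
    residual = []
    for n, d in enumerate(observed):
        # Most recent `order` reference samples (zero-padded at the start).
        u = [reference[n - k] if n >= k else 0.0 for k in range(order)]
        estimate = sum(w * x for w, x in zip(weights, u))
        e = d - estimate
        # Normalising by the input energy is what distinguishes NLMS from plain LMS.
        norm = sum(x * x for x in u) + eps
        weights = [w + mu * e * x / norm for w, x in zip(weights, u)]
        residual.append(e)
    return residual


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic data: a weak periodic "pulse" buried under interference driven by `reference`.
rng = random.Random(0)
N = 1000
reference = [rng.uniform(-1.0, 1.0) for _ in range(N)]
pulse = [0.05 * math.sin(0.2 * n) for n in range(N)]
interference = [0.8 * reference[n] - 0.3 * (reference[n - 1] if n else 0.0)
                for n in range(N)]
observed = [p + i for p, i in zip(pulse, interference)]

residual = nlms_cancel(reference, observed)
mse_before = mse(observed[-200:], pulse[-200:])  # error of the raw signal vs. the pulse
mse_after = mse(residual[-200:], pulse[-200:])   # error after cancellation
```

Once the filter has adapted, the residual tracks the weak periodic component far more closely than the raw observation does.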
|
Similar papers:
[rank all papers by similarity to this]
|
#1157 - Discriminative Feature-to-Point Matching in Image-Based Localization [pdf]
Michael Donoser, Dieter Schmalstieg |
Abstract: The prevalent approach to image-based localization is to match interest points detected in the query image to a sparse 3D point cloud representing the known world. The obtained correspondences are then used to recover a precise camera pose. In this field, the state of the art often ignores the availability of a set of 2D descriptors per 3D point, for example by representing each 3D point by only its centroid. In this paper we demonstrate that these sets contain useful information that can be exploited by formulating matching as a discriminative classification problem. Since memory demands and computational complexity are crucial in such a setup, we base our algorithm on the efficient and effective random fern principle. We propose an extension which projects features to fern-specific embedding spaces, which yields improved matching rates in short runtime. Experiments show that our novel formulation provides improved matching performance in comparison to the standard nearest neighbor approach and that we outperform related methods in our localization scenario.
|
Similar papers:
[rank all papers by similarity to this]
|
#1162 - Action localization by tubelets from motion [pdf]
Mihir Jain, Jan Van Gemert, Herve Jegou, Patrick Bouthemy, Cees Snoek |
Abstract: This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art techniques, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in a context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-the-art on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences.
|
Similar papers:
[rank all papers by similarity to this]
|
#1163 - Noising versus Smoothing for Vertex Identification in Unknown Shapes [pdf]
Konstantinos Raftopoulos, Marin Ferecatu |
Abstract: A method for identifying vertices and estimating shape features of a local nature (e.g., curvature) on the shape's boundary is presented. The boundary is seen as a real function and a study of a certain distance function reveals, almost counter-intuitively, that vertices can be defined and localized better in the presence of high frequency Fourier components (hfFc). The proposed method works on both smooth and noisy shapes, with the presence of hfFc improving on the results of the smoothed version. Experiments with noise and a comparison to the Local Area Integral Invariant descriptor (LAII) validate the method.
|
Similar papers:
[rank all papers by similarity to this]
|
#1164 - Multiple Structured-Instance Learning for Semantic Segmentation with Uncertain Training Data [pdf]
Feng-Ju Chang, Yen-Yu Lin, Kuang-Jui Hsu |
Abstract: We present an approach MSIL-CRF that incorporates multiple instance learning (MIL) into conditional random fields (CRFs). It can generalize CRFs to work on training data with uncertain labels by the principle of MIL. In this work, it is applied to saving manual efforts on annotating training data for semantic segmentation. Specifically, we consider the setting in which the training dataset for semantic segmentation is a mixture of a few object segments and an abundant set of objects' bounding boxes. Our goal is to infer the unknown object segments enclosed by the bounding boxes so that they can serve as training data for semantic segmentation. To this end, we generate multiple segment hypotheses for each bounding box with the assumption that at least one hypothesis is close to the ground truth. By treating a bounding box as a bag with its segment hypotheses as structured instances, MSIL-CRF selects the most likely segment hypotheses by leveraging the knowledge derived from both the labeled and uncertain training data. The experimental results on the Pascal VOC segmentation task demonstrate that MSIL-CRF can provide effective alternatives to manually labeled segments for semantic segmentation.
|
Similar papers:
[rank all papers by similarity to this]
|
#1180 - Constructing Robust Affinity Graph for Spectral Clustering [pdf]
Xiatian Zhu, Chen Change Loy, Shaogang Gong |
Abstract: It is desirable for spectral clustering to have as input robust and meaningful affinity/similarity graphs in order to form clusters with desired structures that can better support human intuition. To construct such affinity graphs is non-trivial due to the ambiguity and uncertainty inherent in the raw data. In contrast to most existing clustering methods that typically employ all available features to construct affinity matrices with the Euclidean distance, which is often not an accurate representation of the underlying data structures, we propose a novel unsupervised approach to generating more robust affinity graphs via identifying and exploiting discriminative features for improving spectral clustering. Specifically, our model is capable of capturing and combining subtle similarity information distributed over discriminative feature subspaces for better revealing the latent data distribution and thereby leading to improved data clustering, especially with heterogeneous data sources. We demonstrate the efficacy of the proposed approach on challenging image and video datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
#1184 - Reconstructing Evolving Tree Structures in Time Lapse Sequences [pdf]
Przemysław Głowacki, Miguel Pinheiro, Raphael Sznitman, Engin Turetken, Daniel Lebrecht, Anthony Holtmaat, Jan Kybic, Pascal Fua |
Abstract: We propose an approach to reconstructing tree structures that evolve over time in 2D images and 3D image stacks such as neuronal axons or plant branches. Instead of reconstructing structures in each image independently, we do so for all images simultaneously to take advantage of temporal-consistency constraints. We show that this problem can be formulated as a Quadratic Mixed Integer Program and solved efficiently. The outcome of our approach is a framework that provides substantial improvements in reconstructions over traditional single time-instance formulations. Furthermore, an added benefit of our approach is the ability to automatically detect places where significant changes have occurred over time, which is challenging when considering large amounts of data.
|
Similar papers:
[rank all papers by similarity to this]
|
#1185 - Beyond Pixel Labels: Image Parsing with Object Instances and Occlusion Ordering [pdf]
Joseph Tighe, Marc Niethammer, Svetlana Lazebnik |
Abstract: This work proposes a method to interpret a scene by assigning a semantic label at every pixel and inferring the spatial extent of individual object instances together with their occlusion relationships. Starting with an initial pixel labeling and a set of candidate object masks for a given test image, we select a subset of objects that explain the image well and have valid overlap relationships and occlusion ordering. This is done by minimizing an integer quadratic program either using a greedy method or a standard solver. Then we alternate between using the object predictions to improve the pixel labels and using the pixel labels to improve the object predictions. The proposed system obtains promising results on two challenging subsets of the LabelMe dataset, the largest of which contains 45,676 images and 232 classes.
|
Similar papers:
[rank all papers by similarity to this]
|
#1187 - Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection [pdf]
Luis Ferraz, Xavier Binefa, Francesc Moreno-Noguer |
Abstract: We propose a real-time, outlier-robust, and accurate solution to the Perspective-n-Point (PnP) problem. The main advantages of our solution are twofold: first, it integrates the outlier rejection within the pose estimation pipeline with a negligible computational overhead; and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system where the solution lies on its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting back all 3D points on the image plane at each step, we achieve speed gains of more than 100x compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate pose estimation results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5 ms.
|
Similar papers:
[rank all papers by similarity to this]
|
#1192 - A Hierarchical Probabilistic Model for Facial Feature Detection [pdf]
Yue Wu, Ziheng Wang, Qiang Ji |
Abstract: Facial feature detection from facial images has attracted great attention in the field of computer vision. It is a nontrivial task since the appearance and shape of the face tend to change under different conditions. In this paper, we propose a hierarchical probabilistic model that can infer the true locations of facial features given the image measurements even when the face exhibits significant facial expression and pose. The hierarchical model implicitly captures the lower level shape variations of facial components using the mixture model. Furthermore, in the higher level, it also learns the joint relationship among facial components, the facial expression, and the pose information through automatic structure learning and parameter estimation of the probabilistic model. Experimental results on benchmark databases demonstrate the effectiveness of the proposed hierarchical probabilistic model.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Subjective Image Quality Assessment (IQA) is the most reliable way to evaluate the visual quality of digital images perceived by the end user. It is often used to construct image quality datasets and provide the groundtruth for building and evaluating objective quality measures. Subjective tests based on the Mean Opinion Score (MOS) have been widely used in previous studies, but have many known problems such as an ambiguous scale definition and dissimilar interpretations of the scale among subjects. To overcome these limitations, Paired Comparison (PC) tests have been proposed as an alternative and are expected to yield more reliable results. However, PC tests can be expensive and time consuming, since for $n$ images they require $\binom{n}{2}$ comparisons. We present a hybrid subjective test which combines MOS and PC tests via a unified probabilistic model and an active sampling method. The proposed method actively constructs a set of queries consisting of MOS and PC tests based on the expected information gain provided by each test and can effectively reduce the number of tests required for achieving a target accuracy. Our method can be used in conventional laboratory studies as well as crowdsourcing experiments. Experimental results show the proposed method outperforms state-of-the-art subjective IQA tests in a crowdsourced setting.
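To make the cost of an exhaustive paired-comparison test concrete, the short sketch below counts the comparisons needed for a few dataset sizes. The function name `num_pairs` and the example sizes are illustrative choices, not taken from the paper.

```python
import math

def num_pairs(n: int) -> int:
    """Number of unordered image pairs in a full paired-comparison test."""
    return math.comb(n, 2)  # equals n * (n - 1) / 2

# Quadratic growth: 10x more images means roughly 100x more comparisons.
for n in (10, 100, 1000):
    print(n, num_pairs(n))
```

This quadratic growth is the reason a method that selects only the most informative tests can reduce the total number of queries.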
|
Similar papers:
[rank all papers by similarity to this]
|
#1195 - Informed Haar-like Features Improve Pedestrian Detection [pdf]
Shanshan Zhang, Christian Bauckhage, Armin Cremers |
Abstract: We propose a simple yet effective detector for pedestrian detection. The basic idea is to incorporate common sense and everyday knowledge into the design of simple and computationally efficient features. As pedestrians usually appear up-right in image or video data, the problem of pedestrian detection is considerably simpler than general purpose people detection. We therefore employ a statistical model of the up-right human body where the head, the upper body, and the lower body are treated as three distinct components. Our main contribution is to systematically design a pool of rectangular templates that are tailored to this shape model. As we incorporate different kinds of low-level measurements, the resulting multi-modal and multi-channel Haar-like features represent characteristic differences between parts of the human body yet are robust against variations in clothing or environmental settings. Our approach avoids exhaustive searches over all possible configurations of rectangle features, nor does it rely on random sampling. It thus marks a middle ground among recently published techniques and yields efficient low-dimensional yet highly discriminative features. Experimental results on the INRIA and Caltech pedestrian datasets show that our detector reaches state-of-the-art performance at low computational costs and that our features are robust against occlusions.
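As background for the rectangular templates mentioned above, here is a minimal sketch of a generic two-rectangle Haar-like feature computed with an integral image. It is not the paper's informed template pool; the function names and the toy image are assumptions made for illustration.

```python
def integral_image(img):
    """Summed-area table with an extra leading zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """Upper half minus lower half of a window (height assumed even)."""
    half = height // 2
    upper = rect_sum(ii, top, left, half, width)
    lower = rect_sum(ii, top + half, left, half, width)
    return upper - lower

# Toy 4x4 "image": bright upper half, dark lower half.
img = [[9, 9, 9, 9],
       [9, 9, 9, 9],
       [1, 1, 1, 1],
       [1, 1, 1, 1]]
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 4))  # 72 - 8 = 64
```

Once the table is built, each rectangle sum costs four lookups regardless of its size, which is why features of this family are cheap to evaluate.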
|
Similar papers:
[rank all papers by similarity to this]
|
#1197 - Shadow Removal from Single RGB-D Images [pdf]
Yao Xiao, Efstratios Tsougenis, Chi-keung Tang |
Abstract: We present the first automatic method to remove shadows from single RGB-D images. Using normal cues directly derived from depth, we can remove hard and soft shadows while preserving surface texture and shading. Our key assumption is that pixels with similar normals, spatial locations and chromaticity should have similar colors. A modified nonlocal matching is used to compute a shadow confidence map that localizes hard shadow boundaries well, thus handling hard and soft shadows within the same framework. We compare our results with those produced using state-of-the-art shadow removal on single RGB images, and intrinsic image decomposition on standard RGB-D datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
#1200 - Sign Language Spotting using Hierarchical Sequential Patterns with Temporal Intervals [pdf]
Nicolas Pugeault, Eng-Jon Ong, Richard Bowden, Oscar Koller |
Abstract: This paper tackles the problem of spotting a set of signs occurring in videos with sequences of signs. To achieve this, we propose to model the spatio-temporal signatures of a sign using an extension of sequential patterns that contain temporal intervals called Sequential Interval Patterns (SIP). We then propose a novel multi-class classifier that organises different sequential interval patterns in a hierarchical tree structure called a Hierarchical SIP Tree (HSP-Tree). This allows one to exploit any subsequence sharing that exists between different SIPs of different classes. Multiple trees are then combined together into a forest of HSP-Trees resulting in a strong classifier that can be used to spot signs. We then show how the HSP-Forest can be used to spot sequences of signs that occur in an input video. We have evaluated the method on both concatenated sequences of isolated signs and continuous sign sequences. We also show that the proposed method is superior in robustness and accuracy to a state-of-the-art sign recogniser when applied to spotting a sequence of signs.
|
Similar papers:
[rank all papers by similarity to this]
|
#1201 - Accurate Localization and Pose Estimation for Large 3D Models [pdf]
Linus Svärm, Olof Enqvist, Magnus Oskarsson, Fredrik Kahl |
Abstract: We consider the problem of localizing a novel image in a large 3D model. In principle, this is just an instance of camera pose estimation, but the scale introduces some challenging problems. For one, it makes the correspondence problem very difficult and it is likely that there will be a significant rate of outliers to handle. In this paper we use recent theoretical as well as technical advances to tackle these problems. Many modern cameras and phones have gravitational sensors that allow us to reduce the search space. Further, there are new techniques to efficiently and reliably deal with extreme rates of outliers. We extend these methods to camera pose estimation by using accurate approximations and fast polynomial solvers. Experimental results are given that demonstrate that it is possible to reliably estimate the camera pose despite more than 99% of outlier correspondences.
|
Similar papers:
[rank all papers by similarity to this]
|
#1206 - Efficient Nonlinear Markov Models for Human Motion [pdf]
Andreas Lehrmann, Peter Gehler, Sebastian Nowozin |
Abstract: Dynamic Bayesian networks such as Hidden Markov Models (HMMs) are successfully used as probabilistic models for human motion. The use of hidden variables makes them expressive models, but inference is only approximate and requires procedures such as particle filters or Markov chain Monte Carlo methods. In this work we propose to instead use simple Markov models that only model observed quantities. We retain a highly expressive dynamic model by using interactions that are nonlinear and non-parametric. A presentation of our approach in terms of latent variables shows logarithmic growth for the computation of exact loglikelihoods in the number of latent states. We validate our model on human motion capture data and demonstrate state-of-the-art performance on action recognition and motion completion tasks.
|
Similar papers:
[rank all papers by similarity to this]
|
#1207 - Object Partitioning using Local Convexity [pdf]
Simon Christoph Stein, Jeremie Papon, Markus Schoeler, Florentin Woergoetter |
Abstract: The problem of how to arrive at an appropriate 3D-segmentation of a scene remains difficult. While current state-of-the-art methods continue to gradually improve in benchmark performance, they also grow more and more complex, for example by incorporating chains of classifiers, which require training on large manually annotated data-sets. As an alternative to this, we present a new, efficient learning- and model-free approach for the segmentation of 3D point clouds into object parts. The algorithm begins by decomposing the scene into an adjacency-graph of surface patches based on a voxel grid. Edges in the graph are then classified as either convex or concave using a novel combination of simple criteria which operate on the local geometry of these patches. This way the graph is divided into locally convex connected subgraphs, which -- with high accuracy -- represent object parts. Additionally, we propose a novel depth dependent voxel grid to deal with the decreasing point-density at far distances in the point clouds. This improves segmentation, allowing the use of fixed parameters for vastly different scenes. The algorithm is straightforward to implement and requires no training data, while nevertheless producing results that are comparable to state-of-the-art methods which incorporate high-level concepts involving classification, learning and model fitting.
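The convex/concave edge classification described above can be illustrated with a simple geometric test on two adjacent patches. The sketch below assumes one common criterion (compare how the two surface normals project onto the line joining the patch centroids); the abstract does not spell out the paper's exact combination of criteria, so the test and all names here are assumptions.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_convex(c1, n1, c2, n2):
    """Assumed test: the edge between two patches is convex when the
    normals open away from each other along the centroid-to-centroid line.

    c1, c2: patch centroids; n1, n2: outward unit normals (3-tuples).
    """
    d = sub(c1, c2)
    return dot(sub(n1, n2), d) > 0

s = math.sqrt(0.5)
left, right = (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)

# Roof-like ridge: normals tilt away from each other -> convex.
print(is_convex(left, (-s, 0.0, s), right, (s, 0.0, s)))
# Valley: normals tilt towards each other -> concave.
print(is_convex(left, (s, 0.0, s), right, (-s, 0.0, s)))
```

Cutting the adjacency graph at edges classified as concave would then leave locally convex connected subgraphs, which is the partitioning idea the abstract describes.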
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: How much data do we need to describe a location? We explore this question in the context of 3D scene reconstructions created from running structure from motion on large Internet photo collections, where reconstructions can contain many millions of 3D points. We consider several methods for computing much more compact representations of such reconstructions for the task of location recognition, with the goal of maintaining good performance with very small models. In particular, we introduce a new method for computing compact models that takes into account both image-point relationships, as well as feature distinctiveness, and show that this method produces small models that yield better recognition performance than previous model reduction techniques.
|
Similar papers:
[rank all papers by similarity to this]
|
#1219 - Efficient High-Resolution Stereo Matching using Local Plane Sweeps [pdf]
Sudipta Sinha, Daniel Scharstein, Richard Szeliski |
Abstract: We present a stereo algorithm designed for speed and efficiency that uses local slanted plane sweeps to propose disparity hypotheses for a semi-global matching algorithm. Our local plane hypotheses are derived from initial sparse feature correspondences followed by an iterative clustering step. Local plane sweeps are then performed around each slanted plane to produce out-of-plane parallax and matching-cost estimates. A final global optimization stage, implemented using semi-global matching, assigns each pixel to one of the local plane hypotheses. By only exploring a small fraction of the whole disparity space volume, our technique achieves significant speedups over previous algorithms and achieves state-of-the-art accuracy on high-resolution stereo pairs of up to 19 megapixels.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Given a single outdoor image this paper proposes a collaborative learning approach for labeling the image as either sunny or cloudy. Never adequately addressed, this two-class labeling problem is by no means trivial given the great variety of outdoor images. Our weather feature combines everyday weather cues after properly encoding them into feature vectors. These encoded cues then work collaboratively in synergy under a unified optimization framework that is aware of the presence (or absence) of a given weather cue during learning and classification. Extensive experiments and comparisons are performed to verify our method. The other contribution consists of a new weather image dataset consisting of 10K sunny and cloudy images which will also be freely available with the executable of our implementation.
|
Similar papers:
[rank all papers by similarity to this]
|
#1226 - A Procrustean Markov Process for Non-Rigid Structure Recovery [pdf]
Minsik Lee, Chong-Ho Choi, Songhwai Oh |
Abstract: Recovering a non-rigid 3D structure from a series of 2D observations is still a difficult problem to solve accurately. Many constraints have been proposed to facilitate the recovery, and one of the most successful constraints is smoothness, due to the fact that most real-world objects change continuously. However, many existing methods require the degree of smoothness to be determined beforehand, which is not viable in practical situations. In this paper, we propose a new probabilistic model that incorporates the smoothness constraint without requiring any prior knowledge. Our approach regards the sequence of 3D shapes as a simple stationary Markov process with Procrustes alignment, whose parameters are learned during the fitting process. The Markov process is assumed to be stationary because deformation is finite and recurrent in general, and the 3D shapes are assumed to be Procrustes aligned in order to discriminate deformation from motion. The proposed method outperforms the state-of-the-art methods, even though the computation time is rather moderate compared to the other existing methods.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Many binary code embedding techniques have been proposed for large-scale approximate nearest neighbor search in computer vision. Recently, product quantization that encodes the cluster index in each subspace has been shown to provide impressive accuracy for nearest neighbor search. In this paper, we explore a simple question: is it best to use all the bit budget for encoding a cluster index in each subspace? We have found that as data points are located farther away from the centers of their clusters, the error of estimated distances among those points becomes larger. To address this issue, we propose a novel encoding scheme that distributes the available bit budget to encoding both the cluster index and the quantized distance between a point and its cluster center. We also propose two different distance metrics tailored to our encoding scheme. We have tested our method against state-of-the-art techniques on several well-known benchmarks, and found that our method consistently improves the accuracy over other tested methods. This result is achieved mainly because our method accurately estimates distances between two data points with the new binary codes and distance metric.
|
Similar papers:
[rank all papers by similarity to this]
|
#1233 - Gyro-Based Multi-Image Deconvolution for Removing Handshake Blur [pdf]
Sung Hee Park, Marc Levoy |
Abstract: Image deblurring to remove blur caused by camera shake has been intensively studied. Nevertheless, most methods are brittle and computationally expensive. In this paper we analyze multi-image approaches, which capture and combine multiple frames in order to make deblurring more robust and tractable. In particular, we compare the performance of two approaches: align-and-average and multi-image deconvolution. Our deconvolution is non-blind, using a blur model obtained from real camera motion as measured by a gyroscope. We show that in most situations such deconvolution outperforms align-and-average. We also show, perhaps surprisingly, that deconvolution does not benefit from increasing exposure time beyond a certain threshold. To demonstrate the effectiveness and efficiency of our method, we apply it to still-resolution imagery of natural scenes captured using a mobile camera with flexible camera control and an attached gyroscope.
|
Similar papers:
[rank all papers by similarity to this]
|
#1234 - Joint Motion Segmentation and Background Subtraction in Dynamic Scenes [pdf]
Adeel Mumtaz, Weichen Zhang, Antoni Chan |
Abstract: We propose a joint foreground-background mixture model (FBM) that simultaneously performs background subtraction and motion segmentation in complex dynamic scenes. Our FBM consists of a set of location-specific dynamic texture (DT) components, for modeling local background motion, and a set of global DT components, for modeling consistent foreground motion. We derive an EM algorithm for estimating the parameters of the FBM. We also apply spatial constraints to the FBM using a Markov random field grid, and derive a corresponding variational approximation for inference. Unlike existing approaches to background subtraction, our FBM does not require a manually selected threshold or a separate training video. Unlike existing motion segmentation techniques, our FBM can segment foreground motions over complex background with mixed motions, and detect stopped objects. Since most dynamic scene datasets only contain videos with a single foreground object over a simple background, we develop a new challenging dataset with multiple foreground objects over complex dynamic backgrounds. In experiments, we show that jointly modeling the background and foreground segments with FBM yields significant improvements in accuracy on both background subtraction and motion segmentation, compared to state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#1237 - Leveraging Hierarchical Parametric Network for Skeletal Joints Action Segmentation and Recognition [pdf]
Di Wu, Ling Shao |
Abstract: Over the last few years, with the popularity of the Kinect, there has been renewed interest in developing methods for human gesture and action recognition from 3D skeletal data. A number of approaches have been proposed to extract representative features from 3D skeletal data, such as, most commonly, hard wired geometric or bio-inspired shape context features. We propose a hierarchical, dynamic framework that first extracts high level skeletal joints features and then uses the learned representation for estimating emission probability to infer the action class. Gaussian mixture models are primarily used for modeling the emission distribution of hidden Markov models. We show that better action recognition using skeletal features can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features to predict probability distributions over states of hidden Markov models. The framework can be easily extended to include an ergodic state to segment and recognize actions simultaneously.
|
Similar papers:
[rank all papers by similarity to this]
|
#1238 - T-Linkage: a Continuous Relaxation of J-Linkage for Multi-Model Fitting [pdf]
Luca Magri, Andrea Fusiello |
Abstract: This paper presents an improvement of the J-linkage algorithm for fitting multiple instances of a model to noisy data corrupted by outliers. The binary preference analysis implemented by J-linkage is replaced by a continuous (soft, or fuzzy) generalization that proves to perform better than J-linkage on simulated data, and compares favorably with state of the art methods on public domain real datasets.
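To illustrate the move from binary to continuous preferences, the sketch below compares the Jaccard distance on binary preference sets with the Tanimoto distance on real-valued preference vectors. The abstract does not name the distance used; the Tanimoto distance is assumed here as a standard continuous generalization of Jaccard, and the toy preference values are made up.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Distance between two binary preference sets (indices of preferred models)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def tanimoto_distance(p, q) -> float:
    """Continuous counterpart on preference vectors with entries in [0, 1]."""
    pq = sum(x * y for x, y in zip(p, q))
    denom = sum(x * x for x in p) + sum(y * y for y in q) - pq
    if denom == 0:
        return 0.0
    return 1.0 - pq / denom

# With 0/1 vectors the two distances coincide.
print(jaccard_distance({0, 1}, {1, 2}))
print(tanimoto_distance([1, 1, 0, 0], [0, 1, 1, 0]))

# Soft preferences keep information that a hard inlier threshold discards.
print(tanimoto_distance([0.9, 0.6, 0.0, 0.0], [0.1, 0.7, 0.8, 0.0]))
```

In an agglomerative scheme of this kind, the pair of clusters with the smallest such distance would be merged at each step.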
|
Similar papers:
[rank all papers by similarity to this]
|
#1248 - Partial Symmetry in Polynomial Systems and Its Application in Computer Vision [pdf]
Yubin Kuang, Yinqiang Zheng, Kalle Astroem |
Abstract: Polynomial solving is one of the key components for solving geometry problems in computer vision. Fast and stable polynomial solvers are essential for numerous applications, e.g. minimal problems or finding all stationary points of certain algebraic errors. Recently, full symmetry in polynomial systems has been utilized to simplify and speed up state-of-the-art polynomial solvers based on the Gröbner basis method (Ask et al., 2012). In this paper, we further explore partial symmetry (i.e. only a subset of the unknowns is symmetric) in polynomial systems. We develop novel numerical schemes to utilize such partial symmetry. We then demonstrate the advantage of our schemes in several computer vision problems. In both synthetic and real experiments, we show that utilizing partial symmetry allows us to obtain faster and more accurate polynomial solvers than the general solvers.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: This paper proposes a new vectorial total variation prior (VTV) for color images. Different from existing VTVs, our VTV, named the decorrelated vectorial total variation prior (D-VTV), measures the discrete gradients of the luminance component and that of the chrominance one in a separated manner, which significantly reduces undesirable uneven color effects. Moreover, a higher-order generalization of the D-VTV, which we call the decorrelated vectorial total generalized variation prior (D-VTGV), is also developed for avoiding the staircasing effect that accompanies the use of VTVs. A noteworthy property of the D-VT(G)V is that it enables us to efficiently minimize objective functions involving it by a primal-dual splitting method. Experimental results illustrate their utility.
|
Similar papers:
[rank all papers by similarity to this]
|
#1269 - Locally Linear Hashing for Extracting Non-Linear Manifolds [pdf]
Go Irie, Zhenguo Li, Xiao-Ming Wu, Shi-Fu Chang |
Abstract: Most hashing methods aim to preserve either the variance (e.g. PCA-based hashing) or the pairwise affinity (e.g. spectral hashing) of data manifolds. However, neither property is adequate to capture their non-linear geometric structures. In this paper, we tackle this problem by exploring the locally linear structures of manifolds. We propose a new hashing method to reconstruct their locally linear structures in the binary Hamming space, which are learned by locality-sensitive sparse coding. The problem is naturally cast as a joint minimization of reconstruction error and quantization error, which is NP-hard. Nevertheless, a local optimum can be obtained efficiently via alternating optimization between optimal reconstruction and quantization. Our method distinguishes itself from others in its remarkable ability to extract nearest neighbors of the query lying on the same manifold instead of in the ambient space. We perform extensive experiments on various image benchmark datasets. Our results improve the performances of the state-of-the-art methods by 28-74% typically, and 627% in the best case for face data.
|
Similar papers:
[rank all papers by similarity to this]
|
#1272 - Gesture Recognition Portfolios for Personalization [pdf]
Angela Yao, Luc Van Gool, Pushmeet Kohli |
Abstract: Human gestures, like speech and handwriting, are often unique to the individual. Training a generic classifier which is applicable to everyone can be very difficult and as such, it has become a standard to use personalized classifiers in speech and handwriting recognition. In this paper, we address the problem of personalization in the context of gesture recognition, and propose a novel and extremely efficient way of doing personalization. Unlike traditional personalization methods which learn a single classifier that later gets adapted, our approach learns a set (portfolio) of classifiers during training, one of which is selected for each test subject based on the personalization data. We formulate classifier personalization as a selection problem and propose several algorithms to compute the set of candidate classifiers. Our experiments show that such an approach is much more efficient than adapting the classifier parameters but can still achieve comparable or better results.
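The selection step described above can be sketched as follows: given a portfolio of already-trained classifiers and a few labelled samples from the test subject, pick the classifier that scores best on those samples. The threshold classifiers and the data below are toy assumptions; the paper's algorithms for building the portfolio are not shown.

```python
def select_classifier(portfolio, personalization_data):
    """Return the index of the classifier with the highest accuracy
    on the subject's labelled personalization samples."""
    def accuracy(clf):
        hits = sum(clf(x) == y for x, y in personalization_data)
        return hits / len(personalization_data)
    return max(range(len(portfolio)), key=lambda i: accuracy(portfolio[i]))

# Hypothetical portfolio: three threshold classifiers on a 1D gesture feature.
portfolio = [lambda x: x > 0.3, lambda x: x > 0.5, lambda x: x > 0.7]

# A few (feature, label) pairs from the new subject.
personalization_data = [(0.4, False), (0.6, True), (0.55, True), (0.45, False)]

print(select_classifier(portfolio, personalization_data))  # 1
```

Because personalization reduces to evaluating each candidate on a handful of samples, no classifier parameters need to be re-estimated per subject, which is where the efficiency claimed in the abstract would come from.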
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We introduce a new compression scheme for high-dimensional vectors that approximates the vectors using sums of M codewords coming from M different codebooks. We show that the proposed scheme permits efficient distance and scalar product computations between compressed and uncompressed vectors. We further suggest vector encoding and codebook learning algorithms that can minimize the coding error within the proposed scheme. In the experiments, we demonstrate that the proposed compression can be used instead of or together with product quantization. Compared to product quantization and its optimized versions, the proposed compression approach leads to lower coding approximation errors, higher accuracy of approximate nearest neighbor search in the datasets of visual descriptors, and lower image classification error, whenever the classifiers are learned on or applied to compressed vectors.
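The efficient scalar product mentioned above follows from linearity: if a vector is stored as a sum of one codeword per codebook, its scalar product with a query is the sum of per-codeword scalar products, which can be precomputed once per query. The sketch below shows only this decoding and lookup step with toy codebooks; the encoding and codebook learning algorithms are not shown, and all names are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode(codebooks, code):
    """Reconstruct a vector as the sum of one codeword from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for book, idx in zip(codebooks, code):
        for j, v in enumerate(book[idx]):
            out[j] += v
    return out

def build_tables(codebooks, query):
    """Scalar products of the query with every codeword (done once per query)."""
    return [[dot(query, w) for w in book] for book in codebooks]

def fast_dot(tables, code):
    """Scalar product with a compressed vector using M table lookups."""
    return sum(table[idx] for table, idx in zip(tables, code))

# Toy setup: M = 2 codebooks, each holding two 2D codewords.
codebooks = [[(1.0, 0.0), (0.0, 1.0)],
             [(0.5, 0.5), (-0.5, 0.5)]]
code = (1, 0)          # compressed vector = (0, 1) + (0.5, 0.5)
query = (2.0, 4.0)

tables = build_tables(codebooks, query)
print(decode(codebooks, code))             # [0.5, 1.5]
print(dot(query, decode(codebooks, code))) # 7.0
print(fast_dot(tables, code))              # 7.0
```

The lookup version never reconstructs the vector, so the per-item cost depends on M rather than on the vector dimension.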
|
Similar papers:
[rank all papers by similarity to this]
|
#1282 - Covariance descriptors for 3D shape matching and retrieval [pdf]
Hedi Tabia, Hamid Laga, David Picard, Philippe-Henri Gosselin |
Abstract: Several descriptors have been proposed in the past for 3D shape analysis, yet none of them achieves best performance on all shape classes. In this paper we propose a novel method for 3D shape analysis using the covariance matrices of the descriptors rather than the descriptors themselves. Covariance matrices enable efficient fusion of different types of features and modalities. They capture the geometric and the spatial properties as well as their correlation within the same representation. Covariance matrices however lie on the manifold of Symmetric Positive Definite (SPD) tensors, a special type of Riemannian manifolds, which makes comparison and clustering of such matrices challenging. In this paper we study covariance matrices in their native space and make use of the geodesic distances on the manifold as a dissimilarity measure. We demonstrate the performance of this metric on 3D face matching and recognition tasks. We then generalize the Bag of Features paradigm, originally designed in Euclidean spaces, to the Riemannian manifold of SPD matrices. We propose a new clustering procedure that takes into account the geometry of the Riemannian manifold. We evaluate the performance of the proposed framework on 3D shape matching and retrieval applications and demonstrate its superiority compared to descriptor-based techniques.
|
Similar papers:
[rank all papers by similarity to this]
|
#1287 - Photometric Bundle Adjustment for Dense Multi-View 3D Modeling [pdf]
Amal Delaunoy, Marc Pollefeys |
Abstract: Motivated by a Bayesian view of the problem of multi-view 3D reconstruction from images, we propose a dense 3D reconstruction technique that jointly refines the shape and the camera parameters of a scene by minimizing the photometric reprojection error between a generated model and the observed images, hence considering all pixels in the original images. The minimization is performed using a gradient descent scheme coherent with the shape representation (here a triangular mesh), where we carefully derive evolution equations including the derivatives of the visibility function. This can be used as a final refinement step in 3D reconstruction pipelines and helps improve the 3D reconstruction's quality by estimating the 3D shape and camera calibration more accurately. Examples are shown for multi-view stereo where the texture is also jointly optimized and improved, but the technique could be used for any generative approach dealing with multi-view reconstruction settings (e.g. depth map fusion, multi-view photometric stereo).
|
Similar papers:
[rank all papers by similarity to this]
|
#1291 - Super-resolving Appearance of 3D Deformable Shapes from Multiple Videos [pdf]
Jean-Sebastien Franco, Vagia Tsiminaki, Edmond Boyer |
Abstract: We examine the problem of retrieving and super-resolving the appearance of objects observed in multiple videos under small object motions. Super-resolution has been vastly explored in the case of monocular video, where the data redundancy necessary to reconstruct the image stems from temporal accumulation. On the other hand, a handful of methods have examined texture super-resolution of a static 3D object observed from several cameras, where the data redundancy is obtained through the different viewpoints. We introduce a unified framework to leverage both possibilities for super-resolution, which uniformly deals with any source of geometric variability. To this goal we use 2D warps for all views and temporal frames, and a simple linear projection model from texture to image space. Despite its simplicity, the method is able to successfully improve the texture appearance with temporal information, as shown experimentally. Additionally, we show that our method obtains better results than state of the art 3D shape super-resolution methods existing for the static case.
|
Similar papers:
[rank all papers by similarity to this]
|
#1314 - Relative Parts: Distinctive Parts for Learning Relative Attributes [pdf]
Yashaswi Verma, Ramachandruni Sandeep, C.V. Jawahar |
Abstract: The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance-coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method not only achieves significant improvement in relative attribute prediction accuracy, but is also shown to significantly improve the performance of relative attribute feedback based interactive image search.
|
Similar papers:
[rank all papers by similarity to this]
|
#1315 - SeamSeg: Video Object Segmentation using Patch Seams [pdf]
Avinash Ramakanth, Venkatesh Babu Radhakrishnan |
Abstract: In this paper, we propose a video object segmentation algorithm by extending the formulation of seams from image and video retargeting. In retargeting, the primary aim is to reduce the image size while preserving the salient image contents. To achieve this, the energy function used is based on edge strength. Typically, seams, which are connected paths of low energy, are utilised for retargeting. Here, we modify the formulation of seams to facilitate robust video object segmentation. The energy function associated with the proposed video seams provides temporal linking of objects across frames, while accurately modelling object motion. The proposed energy function takes into account the similarity of patches along the seam, temporal consistency of motion and spatial coherency of seams. Label propagation in the boundary regions, the most critical step in accurate object segmentation, is achieved with high fidelity, utilising the proposed video seams. To achieve accurate object segmentation without additional overheads, we curtail the error propagation from boundary regions using rough-set based modelling. The performance of the proposed approach is evaluated on benchmark datasets and found to outperform existing supervised and unsupervised state-of-the-art approaches.
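For readers unfamiliar with seams, the sketch below finds a classic minimum-energy vertical seam in a single image by dynamic programming, as used in retargeting. It is background only: the paper's video seams use a different energy (patch similarity, motion consistency, spatial coherency) that is not reproduced here, and the toy energy map is made up.

```python
def min_vertical_seam(energy):
    """Return (seam, total_energy) for the cheapest top-to-bottom path in
    which consecutive rows differ by at most one column."""
    h, w = len(energy), len(energy[0])
    cost = [row[:] for row in energy]
    # Forward pass: cheapest cumulative energy reaching each pixel.
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(0, x - 1), min(w - 1, x + 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backward pass: trace the cheapest path from the bottom row upwards.
    x = min(range(w), key=lambda i: cost[h - 1][i])
    total = cost[h - 1][x]
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(0, x - 1), min(w - 1, x + 1)
        x = min(range(lo, hi + 1), key=lambda i: cost[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam, total

# Toy energy map (e.g. edge strength); low values are cheap to cross.
energy = [[5, 1, 5],
          [5, 5, 1],
          [5, 1, 5]]
print(min_vertical_seam(energy))  # ([1, 2, 1], 3)
```

The returned list gives the seam's column index in each row, from top to bottom.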
|
Similar papers:
[rank all papers by similarity to this]
|
#1319 - Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision [pdf]
Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun |
Abstract: Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and an expenditure of at least tens of thousands of dollars. Thus, coming up with solutions that can automatically do labeling given weak supervision is key to reducing this cost. In this paper we show how to exploit 3D information (i.e., stereo and/or point clouds) to automatically generate very accurate object segmentations given annotated 3D bounding boxes. We formulate the problem as one of inference in a binary MRF which exploits appearance models, stereo and/or noisy point clouds, a repository of 3D CAD models as well as topological constraints. We demonstrate the effectiveness of our approach in the context of autonomous driving, and show that we can segment cars with 86% intersection over union, performing as well as highly recommended MTurkers!
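The intersection-over-union score quoted above can be computed for two binary masks as in the sketch below. The masks are toy data, and returning 1.0 for two empty masks is a convention assumed here.

```python
def intersection_over_union(pred, truth):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0

pred  = [[0, 1, 1],
         [0, 1, 1]]
truth = [[0, 0, 1],
         [0, 1, 1]]
print(intersection_over_union(pred, truth))  # 3 / 4 = 0.75
```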
|
#1325 - A Hierarchical Context Model for Event Recognition in Surveillance Video [pdf]
Xiaoyang Wang, Qiang Ji |
Abstract: Due to great challenges such as tremendous intra-class variations and low image resolution, context information has been playing a more and more important role for accurate and robust event recognition in surveillance videos. The context information can generally be divided into the feature level context, the semantic level context, and the prior level context. These three levels of context provide crucial bottom-up, middle level, and top down information that can benefit the recognition task itself. Unlike existing research that generally integrates the context information at one of the three levels, we propose a hierarchical context model that simultaneously exploits contexts at all three levels and systematically incorporates them into event recognition. To tackle the learning and inference challenges brought in by the model hierarchy, we develop complete learning and inference algorithms for the proposed hierarchical context model based on the variational Bayes method. Experiments on VIRAT 1.0 and 2.0 Ground Datasets demonstrate the effectiveness of the proposed hierarchical context model for improving the event recognition performance even under great challenges like large intra-class variations and low image resolution.
|
#1328 - Tell Me What You See and I will Show You Where It Is [pdf]
Jia Xu, Alexander Schwing, Raquel Urtasun |
Abstract: We tackle the problem of weakly labeled semantic segmentation, where the information is given only in the form of image tags that encode which classes are present in the scene. This is an extremely difficult problem as no pixel-wise labelings are available, not even at training time. In this paper, we show that this problem can be formalized as performing learning and inference in a latent structure prediction framework. The graphical model encodes the presence and absence of a class as well as the assignments of semantic labels to super-pixels. As a consequence, we are able to leverage techniques and algorithms with good theoretical properties. We demonstrate the effectiveness of our approach on the challenging SIFT Flow dataset and show superior performance to the state-of-the-art.
|
#1330 - Event Detection using Multi-Level Relevance Labels and Multiple Features [pdf]
Zhongwen Xu, Ivor W. Tsang, Yi Yang, Zhigang Ma, Alexander Hauptmann |
Abstract: We address the challenging problem of utilizing related exemplars for complex event detection when multiple features are available. Related exemplars are labeled as related to the event but not exactly matched. Related exemplars share certain positive elements of the event, but have no uniform pattern due to the huge variance of relevance levels among different related exemplars. None of the existing multiple feature fusion methods can deal with the related exemplars. In this paper, we propose an algorithm which adaptively utilizes the related exemplars by cross feature learning. Ordinal labels are used to represent the multiple relevance levels of the related videos. Label candidates of related exemplars are generated by exploring the possible relevance levels of each related exemplar via a cross-feature voting strategy. A maximum margin criterion is then applied in our framework to discriminate the positive and negative exemplars, as well as the related exemplars from different relevance levels. We test our algorithm on the large-scale TRECVID 2011 dataset, where it achieves promising performance.
|
#1331 - Fast and Robust Archetypal Analysis for Representation Learning [pdf]
Yuansi Chen, Julien Mairal, Zaid Harchaoui |
Abstract: We revisit a pioneering unsupervised learning technique called archetypal analysis, which is related to successful data analysis methods such as sparse coding and non-negative matrix factorization. Since it was proposed, archetypal analysis has not gained much popularity even though it produces more interpretable models than other alternatives. Because no efficient implementation has ever been made publicly available, its application to impactful problems has indeed been severely limited. Our paper addresses this issue in order to reinstate archetypal analysis. We develop a fast optimization scheme based on an active-set strategy, and provide the first scalable open-source implementation. Then, we demonstrate the usefulness of archetypal analysis for computer vision tasks, such as codebook learning, signal classification, and large-scale image collection visualization.
|
#1332 - Superpixel-grounded Deformable Part Models [pdf]
Eduard Trulls, Iasonas Kokkinos, Francesc Moreno-Noguer, Alberto Sanfeliu |
Abstract: In this work we propose a simple and fast technique of combining bottom-up segmentation, in the form of SLIC superpixels, with Deformable Part Models (DPMs). Our approach can be understood as 'cleaning up' the low-level HOG features by exploiting the spatial support of SLIC superpixels; effectively we split feature variation into object-specific changes, and generic background/contextual changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine these in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough that we can repeat this process over every candidate window, during training and detection, for both the root and part filters. We use these masks to construct enhanced, background-invariant features to train DPMs. On the PASCAL VOC 2007 dataset, our approach outperforms the standard DPM in 13 out of 15 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach by extending it to dense SIFT descriptors for large displacement optical flow.
|
#1336 - Relative Pose Estimation for a Multi-Camera System with Known Vertical Direction [pdf]
Gim Hee Lee, Marc Pollefeys, Friedrich Fraundorfer |
Abstract: In this paper, we present our minimal 4-point and linear 8-point algorithms to estimate the relative pose of a multi-camera system with known vertical directions, i.e. known absolute roll and pitch angles. We solve the minimal 4-point algorithm with the hidden variable resultant method and show that it leads to an 8-degree univariate polynomial that gives up to 8 real solutions. We identify a degenerate case of the linear 8-point algorithm when it is solved with the standard Singular Value Decomposition (SVD) method and adopt a simple alternative solution which is easy to implement. We show that our proposed algorithms can be efficiently used within RANSAC for robust estimation. We evaluate the accuracy of our proposed algorithms by comparisons with various existing algorithms for the multi-camera system on simulations and show the feasibility of our proposed algorithms with results from multiple real-world datasets.
|
#1337 - Mirror Symmetry Histograms for Capturing Geometric Properties in Images [pdf]
Marcelo Cicconet, Davi Geiger, Michael Werman, Kristin Gunsalus |
Abstract: We propose a data structure that captures global geometric properties in images: histograms of mirror symmetry coefficients. We compute such a coefficient for every pair of pixels taking into account their respective tangents and group them in a 6-dimensional histogram. By marginalizing this symmetry histogram in various ways, we develop algorithms for a range of applications: recovery of the contour representation of an image; detection of nearly-circular cells; location of the main axis of reflection symmetry; detection of cell-division in movies of developing embryos; detection of worm-tips and indirect cell-counting via Machine Learning. Our approach generalizes a series of histogram-related methods, and the proposed algorithms perform with state-of-the-art accuracy.
|
#1339 - Region-based Discriminative Feature Pooling for Scene Text Recognition [pdf]
Chen-Yu Lee, Anurag Bhardwaj, Wei Di, Vignesh Jagadeesh, Robinson Piramuthu |
Abstract: We present a new feature representation method for the scene text recognition problem, particularly focusing on improving scene character recognition. Many existing methods rely on histogram of oriented gradient (HOG) or part-based models, which do not span the feature space well for characters in natural scene images, especially given large variation in fonts with cluttered backgrounds. In this work, we propose a discriminative feature pooling method that automatically learns the most informative sub-regions of each scene character within a multi-class classification framework, where each sub-region seamlessly integrates a set of low-level image features through integral images. The proposed feature representation is compact, computationally efficient, and able to effectively model distinctive spatial structures of each individual character class. Extensive experiments conducted on challenging datasets (Chars74K, ICDAR'03, ICDAR'11, SVT) show that our method significantly outperforms existing methods on scene character classification and scene text recognition tasks.
|
#1341 - Turning Mobile Phones into 3D Scanners [pdf]
Kalin Kolev, Petri Tanskanen, Pablo Speciale, Marc Pollefeys |
Abstract: In this paper, we propose an efficient and accurate scheme for the integration of multiple stereo-based depth measurements. For each provided depth map a confidence-based weight is assigned to each depth estimate by evaluating local geometry orientation, underlying camera setting and photometric evidence. Subsequently, all hypotheses are fused together into a compact and consistent 3D model. Thereby, visibility conflicts are identified and resolved, and fitting measurements are averaged with regard to their confidence scores. The individual stages of the proposed approach are validated by comparing it to two alternative techniques which rely on a conceptually different fusion scheme and a different confidence inference, respectively. Pursuing live 3D reconstruction on mobile devices as a primary goal, we demonstrate that the developed method can easily be integrated into a system for monocular interactive 3D modeling by substantially improving its accuracy while adding an almost negligible overhead to its performance and retaining its interactive potential.
|
#1346 - Predicting Multiple Attributes via Relative Multi-task Learning [pdf]
Lin Chen, Qiang Zhang, Baoxin Li |
Abstract: Relative attribute learning aims to learn ranking functions describing the relative strength of attributes. Most current learning approaches learn ranking functions for each attribute independently without considering possible intrinsic relatedness among the attributes. For a problem involving multiple attributes, it is reasonable to assume that utilizing such relatedness among the attributes would benefit learning, especially when the number of labeled training pairs is very limited. In this paper, we propose a relative multi-attribute learning framework that integrates relative attributes into a multi-task learning scheme. The formulation allows us to exploit the advantages of state-of-the-art regularization-based multi-task learning for improved attribute learning. In particular, using joint feature learning as the case study, we evaluate our framework with both synthetic data and two real datasets. Experimental results suggest that the proposed framework has a clear performance gain in ranking accuracy and zero-shot learning accuracy over existing methods of independent relative attribute learning and multi-task learning.
|
#1351 - Lacunarity Analysis on Image Patterns for Texture Classification [pdf]
Yuhui Quan, Yong Xu, Yuping Sun, Yu Luo |
Abstract: This paper introduces a statistical approach to texture description, which can achieve highly discriminative ability for classifying texture images under a wide range of transformations, including photometric changes and geometric changes. The proposed method is based on the concept of lacunarity of the image patterns. Built upon the local binary patterns that are encoded at multiple scales, lacunarity analysis is applied to capture the self-similar behavior of the local structures. The proposed texture descriptor was applied to texture classification. Our method has demonstrated excellent performance in comparison with the existing state-of-the-art approaches on four challenging benchmark datasets.
|
#1362 - Pseudoconvex Proximal Splitting for $L_\infty$ Problems in Multiview Geometry [pdf]
Anders Eriksson |
Abstract: In this paper we study optimization methods for minimizing large-scale pseudoconvex $L_\infty$ problems in multiview geometry. We present a novel algorithm for solving this class of problems based on proximal splitting methods. We provide a brief derivation of the proposed method along with a general convergence analysis. The resulting meta-algorithm requires very little effort in terms of implementation and instead makes use of existing advanced solvers for non-linear optimization. Preliminary experiments on a number of real image datasets indicate that the proposed method matches or outperforms current state-of-the-art solvers for this class of problems.
|
#1371 - High Accuracy Monocular Localization for Autonomous Driving Using Adaptive Ground Estimation [pdf]
Shiyu Song, Manmohan Chandraker |
Abstract: Scale drift is a crucial challenge that prevents monocular autonomous driving from emulating the performance of stereo. This paper presents a real-time monocular SFM system that corrects for scale drift using a highly effective cue combination framework for ground plane estimation, yielding accuracy comparable to stereo over long driving sequences. Our ground plane estimation uses multiple cues like sparse features, dense inter-frame stereo and (when applicable) object bounding boxes. A data-driven mechanism is proposed to learn models from training data that relate observation covariances for each cue to error behavior of its underlying variables. During testing, this allows per-frame adaptation of observation covariances based on relative confidences inferred from visual data. Our framework significantly boosts not only the accuracy of monocular self-localization, but also that of applications like object localization that rely on the ground plane. Experiments on the KITTI dataset demonstrate the accuracy of our ground plane estimation, monocular SFM and object localization relative to ground truth, with detailed comparisons to prior art.
|
#1381 - Multi-Cue Visual Tracking Using Robust Feature-Level Fusion Based on Joint Sparse Representation [pdf]
Xiangyuan Lan, Pong C YUEN, Andy Jinhua Ma |
Abstract: The use of multiple features for tracking has proved to be an effective approach because the limitations of each feature can be compensated for by the others. Since different types of variations such as illumination, occlusion and pose may occur in a video sequence, especially in long sequences, how to dynamically select the appropriate features is one of the key problems in this approach. To address this issue in multi-cue visual tracking, this paper proposes a new joint sparse representation model for robust feature-level fusion. The proposed method dynamically removes unstable features from those to be fused for tracking by using the advantages of sparse representation. As a result, robust tracking performance is obtained. Experimental results on publicly available videos show that the proposed method outperforms both existing sparse representation based and fusion-based trackers.
|
#1382 - Mixing Body-Part Sequences for Human Pose Estimation [pdf]
Anoop Cherian, Julien Mairal, Karteek Alahari, Cordelia Schmid |
Abstract: In this paper, we present a method for estimating articulated human poses in videos. We cast this as an optimization problem defined on body parts with spatio-temporal links between them. Previous approaches for addressing this intractable problem have used different approximate solutions. Although such methods perform well on certain body parts, e.g. head, their performance on lower arms, i.e. elbows, wrists, remains poor. We present an alternative approximate method adapted to the pose estimation problem. Firstly, our approach takes into account temporal links with subsequent frames for the less-certain parts, namely elbows and wrists. Secondly, our method decomposes poses into limbs, generates limb sequences across time, and recomposes poses by mixing these body part sequences. We introduce a new dataset "Poses in the Wild", which is more challenging than existing ones, with sequences containing background clutter, occlusions, and severe camera motion. We experimentally compare our method with recent works on this new dataset as well as on two publicly available datasets, and show significant improvement.
|
#1389 - 3D Pose from Motion for Cross-view Action Recognition via Non-linear Circulant Temporal Encoding [pdf]
Ankur Gupta, Julieta Martinez, Jim Little, Robert Woodham |
Abstract: We describe a new approach to transfer knowledge across views for action recognition by using examples from a large collection of unlabelled motion capture (mocap) data to connect different views. We achieve this by directly matching purely motion based features from videos to mocap. Our approach is able to recover 3D pose sequences without performing any body part tracking. We use these matches to generate multiple motion projections and thus add view invariance to our action recognition model. We also introduce a closed form solution for approximate non-linear Circulant Temporal Encoding (nCTE), which allows us to efficiently perform the matches in the frequency domain. We test our approach on the challenging unsupervised modality of the IXMAS dataset, and use publicly available motion capture data for matching. Without any additional annotation effort, we are able to significantly outperform the current state-of-the-art.
|
#1391 - Better Feature Tracking Through Subspace Constraints [pdf]
Bryan Poling, Gilad Lerman, Arthur Szlam |
Abstract: Feature tracking in video is a crucial task in computer vision. Usually, the tracking problem is handled one feature at a time, using a single-feature tracker like the Kanade-Lucas-Tomasi algorithm, or one of its derivatives. While this approach works quite well when dealing with high-quality video and "strong" features, it often falters when faced with dark and noisy video containing low-quality features. We present a framework for jointly tracking a set of features, which enables sharing information between the different features in the scene. We show that our method can be employed to track features for both rigid and nonrigid motions (possibly of few moving bodies) even when some features are occluded. Furthermore, it can be used to significantly improve tracking results in poorly-lit scenes (where there is a mix of good and bad features). Our approach does not require direct modeling of the structure or the motion of the scene, and runs in real time on a single CPU core.
|
#1394 - Efficient Computation of Relative Pose for Multi-Camera Systems [pdf]
Laurent Kneip, Hongdong Li |
Abstract: We present a novel solution to compute the relative pose of a generalized camera. Existing solutions are either not general, have too high computational complexity, or require too many correspondences, which impedes efficient or accurate usage within RANSAC schemes. We factorize the problem as a low-dimensional, iterative optimization over relative rotation only, directly derived from well-known epipolar constraints. Common generalized cameras often consist of camera clusters, and give rise to omni-directional landmark observations. We prove that our iterative scheme performs well in such practically relevant situations, eventually resulting in computational efficiency similar to linear solvers, and accuracy close to bundle adjustment, while using fewer correspondences. Experiments on both virtual and real multi-camera systems prove superior overall performance for robust, real-time multi-camera motion estimation.
|
#1399 - Aerial Reconstructions via Probabilistic Data Fusion [pdf]
Randi Cabezas, Oren Freifeld, Guy Rosman, John Fisher III |
Abstract: We propose an integrated probabilistic model for multi-modal fusion of aerial imagery and LiDAR data. The resulting model allows for reconstruction and analysis of large 3D scenes. An advantage of the approach is that it explicitly models uncertainty, allows for missing data, and provides a consistent framework for incorporating additional measurement modalities. As compared with image-based methods, dense reconstruction of complex urban scenes is feasible with relatively few observations. Furthermore, the proposed model allows one to estimate absolute scale and orientation and reason about other aspects of the scene, e.g., detection of moving objects. As formulated, the model lends itself to massively-parallel computations; that is, utilizing both general-purpose and domain-specific components of modern graphics hardware, we are able to do fast inference over complex and detailed scenes. We demonstrate our results on large-scale reconstruction of an urban terrain from LiDAR and visual aerial photography data.
|
#1404 - Discriminative Hierarchical Modeling of Spatio-Temporally Composable Human Activities [pdf]
Ivan Lillo, Juan Carlos Niebles, Alvaro Soto |
Abstract: This paper proposes a framework for recognizing complex human activities in videos. Our method describes human activities in a hierarchical discriminative model that operates at three semantic levels. At the lower level, body poses are encoded in a representative but discriminative pose dictionary. At the intermediate level encoded poses span a space on which simple human actions are composed. At the highest level, our model captures temporal and spatial compositions of actions into complex human activities. Our human activity classifier simultaneously models which body parts are relevant to the action of interest as well as their appearance and composition using a discriminative approach. By formulating model learning in a max-margin framework, our approach achieves powerful multi-class discrimination while providing useful annotations at the intermediate semantic level. We show how our hierarchical compositional model provides natural handling of occlusions, as well as novel compositions. To evaluate the effectiveness of our proposed framework, we introduce a new dataset of composed human activities. We provide empirical evidence that our method achieves state-of-the-art classification accuracies.
|
#1405 - Unifying Spatial and Attribute Selection for Distracter-resilient Tracking [pdf]
Nan Jiang, Ying Wu |
Abstract: Visual distracters are detrimental and generally very difficult to handle in target tracking, because they generate false positive candidates for target matching. The resilience of region-based matching to the distracters depends not only on the matching metric, but also on the characteristics of the target region to be matched. The two tasks, i.e., learning the best metric and selecting the distracter-resilient target regions, actually correspond to the attribute selection and spatial selection processes in the human visual perception. This paper presents an initial attempt to unify the modeling of these two tasks for an effective solution, based on the introduction of a new quantity called Soft Visual Margin. As a function of both matching metric and spatial location, it measures the discrimination between the target and its spatial distracters, and characterizes the reliability of matching. Different from other formulations of margin, this new quantity is analytical and is insensitive to noisy data. This paper presents a novel method to jointly determine the best spatial location and the optimal metric. Based on that, a solid distracter-resilient region tracker is designed, and its effectiveness is validated and demonstrated through extensive experiments.
|
#1406 - Complex Activity Recognition using Granger Constrained DBN (GCDBN) in Sports and Surveillance Video [pdf]
Eran Swears, Anthony Hoogs, Qiang Ji, Kim Boyer |
Abstract: Modeling interactions of multiple co-occurring objects in a complex activity is becoming increasingly popular in the video domain. The Dynamic Bayesian Network (DBN) has been applied to this problem in the past due to its natural ability to statistically capture complex temporal dependencies. However, standard DBN structure learning algorithms are generatively learned, require manual structure definitions, and/or are computationally complex or restrictive. We propose a novel structure learning solution that fuses the Granger Causality statistic, a direct measure of temporal dependence, with the Adaboost feature selection algorithm to automatically constrain the temporal links of a DBN in a discriminative manner. This approach enables us to completely define the DBN structure prior to parameter learning, which reduces computational complexity in addition to providing a more descriptive structure. We refer to this modeling approach as the Granger Constrained DBN (GCDBN). Our experiments show that the GCDBN outperforms two of the most relevant state-of-the-art graphical models in complex activity classification on handball video data, surveillance data, and synthetic data.
|
Abstract: Accurate ground truth pose is essential to the training of most existing head pose estimation algorithms. However, in many cases, the "ground truth" pose is obtained in rather subjective ways, such as asking the human subjects to stare at different markers on the wall. In such cases, it is better to use soft labels rather than explicit hard labels as the ground truth. Therefore, this paper proposes to associate a multivariate label distribution (MLD) with each image. An MLD covers a neighborhood around the original pose. Labeling the images with MLD can not only alleviate the problem of inaccurate pose labels, but also boost the training examples associated with each pose without actually increasing the total amount of training examples. Two algorithms are proposed to learn from the MLD by minimizing the weighted Jeffrey's divergence between the predicted MLD and the ground truth MLD. Experimental results show that the MLD-based methods perform significantly better than the compared state-of-the-art head pose estimation algorithms.
|
#1413 - Total-Variation Minimization on Unstructured Volumetric Mesh: Biophysical Applications on Reconstruction of 3D Ischemic Myocardium [pdf]
Jingjia Xu, Azar Rahimi Dehaghani, Fei Gao, Linwei Wang |
Abstract: This paper describes the development and application of a new approach to total-variation (TV) minimization for reconstruction problems on geometrically-complex and unstructured volumetric meshes. The driving application of this study is the reconstruction of 3D ischemic regions in the heart from noninvasive body-surface potential data, where the use of a TV-prior can be expected to promote the reconstruction of two piecewise smooth regions of healthy and ischemic electrical properties with localized gradient in between. Compared to TV minimization on regular grids of pixels/voxels, the complex unstructured volumetric mesh of the heart poses unique challenges including the impact of mesh resolutions on the TV-prior and the difficulty of gradient calculation. In this paper, we introduce a variational TV-prior and, by combining it with the iteratively re-weighted least-squares concept, a new algorithm for TV minimization that is computationally efficient and robust to the discretization resolution. In a large set of simulation studies as well as two initial real-data studies, we demonstrate that the use of a TV prior outperforms L2-based penalties in the reconstruction of ischemic regions, and that the proposed TV-minimization algorithm shows higher accuracy, robustness, and computational efficiency compared to that with the commonly used discrete TV prior. Furthermore, we also compare the performance of the proposed TV minimization algorithm in combination with a L2- versus L1-based
|
#1421 - RGB-D Depth Map Enhancement with Depth and Motion in Complement [pdf]
Tak-Wai Hui, King-Ngi Ngan |
Abstract: Low-cost RGB-D imaging systems such as Kinect are widely utilized for dense 3D reconstruction. However, Kinect generally suffers from two major problems. First, the spatial resolution of the depth image is low. Second, the depth image often contains numerous holes where no depth measurements were available, which can be due to the poor infra-red reflectance of some objects in the scene. Since the spatial resolution of the color image is higher than that of the depth image, this paper introduces a new method to enhance the depth images from a moving Kinect using the depth cue from the induced optical flow. We not only fill holes in the raw depth image, but also recover the fine details of the scene. We address the problem of depth image enhancement by minimizing an energy functional. In order to reduce the computational complexity, we treat the textured and homogeneous regions in the color image differently. Experimental results on real-image data are provided to show the effectiveness of the proposed method.
|
Abstract: Interactive object segmentation has great practical importance in computer vision. Many interactive methods have been proposed utilizing user input in the form of mouse clicks and mouse strokes, and often requiring a lot of user intervention. In this paper, we present a system with a far simpler input method: the user needs only give the name of the desired object. With the tag provided by the user we query a text image database to gather exemplars of the object. Using object proposals and borrowing ideas from image retrieval and object detection, the object is localized in the image. An appearance model generated from the exemplars and the location prior are used in an energy minimization framework to select the object. Our method outperforms the state-of-the-art on existing datasets and on a more challenging dataset we collected.
|
#1432 - Learning-by-Synthesis for Appearance-based 3D Gaze Estimation [pdf]
Yusuke Sugano, Yasuyuki Matsushita, Yoichi Sato |
Abstract: Inferring human gaze from low-resolution eye images is still a challenging task despite its practical importance in many application scenarios. This paper presents a learning-by-synthesis approach to accurate image-based gaze estimation that is person- and head pose-independent. Unlike existing appearance-based methods that assume person-specific training data, we use a large amount of cross-subject training data to train a 3D gaze estimator. We collect the largest and fully calibrated multi-view gaze dataset and perform a 3D reconstruction in order to generate dense training data of eye images. By using the synthesized dataset to learn a random regression forest, we show that our method outperforms existing methods that use low-resolution eye images.
|
#1436 - Time Machine: Continuous Manifold Based Adaptation for Evolving Visual Domains [pdf]
Judy Hoffman, Trevor Darrell, Kate Saenko |
Abstract: We pose the following question: what happens when test data not only differs from training data, but differs from it in a continually evolving way? The classic domain adaptation paradigm considers the world to be separated into stationary domains with clear boundaries between them. However, in many real-world applications, examples cannot be naturally separated into discrete domains, but arise from a continuously evolving underlying process. Examples include video with gradually changing lighting and spam email with evolving spammer tactics. We formulate a novel problem of adapting to such continuous domains, and present a solution based on smoothly varying embeddings. Recent work has shown the utility of considering discrete visual domains as fixed points embedded in a manifold of lower-dimensional subspaces. Adaptation can be achieved via transforms or kernels learned between such stationary source and target subspaces. We propose a method to consider non-stationary domains, which we refer to as Continuous Manifold Adaptation (CMA). We treat each target sample as potentially being drawn from a different subspace on the domain manifold, and present a novel technique for continuous transform-based adaptation. Our approach can learn to distinguish categories using training data collected at some point in the past, and continue to update its model of the categories for some time into the future, without receiving any additional labels. Experiments on two visual datasets demonst
|
#1437 - Real-time Simultaneous Pose and Shape Estimation for Articulated Objects with a Single Depth Camera [pdf]
Mao Ye, Ruigang Yang |
Abstract: In this paper we present a novel real-time algorithm for simultaneous pose and shape estimation for articulated objects, such as human beings and animals. The key to our pose estimation component is embedding the articulated deformation model, with an exponential-map-based parametrization, into a Gaussian Mixture Model. Benefiting from the probabilistic measurement model, our algorithm requires no explicit point correspondences, as opposed to most existing methods. Consequently, our approach is less sensitive to local minima and handles fast and complex motions well. Extensive evaluations on publicly available datasets demonstrate that our method outperforms most state-of-the-art pose estimation algorithms by a large margin, especially in the case of challenging motions. Moreover, our novel shape adaptation algorithm, based on the same probabilistic model, automatically captures the shape of the subjects during the dynamic pose estimation process. Experiments show that our shape estimation method achieves accuracy comparable to the state of the art, yet requires no extra calibration procedure.
|
#1438 - Multi-modal Learning in Loosely-organized Web Images [pdf]
Kun Duan, David Crandall, Dhruv Batra |
Abstract: Photo-sharing websites have become very popular in the last few years, leading to huge collections of online images. In addition to image data, these websites collect a variety of multi-modal metadata about photos including text tags, captions, GPS coordinates, camera metadata, user profiles, etc. However, this metadata is not well constrained and is often noisy, sparse, or missing altogether. In this paper, we propose a framework to model these "loosely organized" multi-modal datasets, and show how to perform loosely-supervised learning using a novel latent Conditional Random Field framework. We also show how to learn parameters of the LCRF automatically from a small set of validation data, using Information Theoretic Metric Learning (ITML) to learn distance functions and a structural SVM formulation to learn the potential functions. We apply our framework to four datasets of images from Flickr, evaluating our approach both qualitatively and quantitatively against several baselines.
|
#1440 - Visual Tracking Using Pertinent Patch Selection and Masking [pdf]
Dae-Youn Lee, Jae-Young Sim, Chang-Su Kim |
Abstract: A novel visual tracking algorithm using patch-based appearance models is proposed in this paper. We first divide the bounding box of a target object into multiple patches and then select only pertinent patches, which occur repeatedly near the center of the bounding box, to construct the foreground appearance model. We also divide the input image into non-overlapping blocks, construct a background model at each block location, and integrate these background models for tracking. Using the appearance models, we obtain an accurate foreground probability map. Finally, we estimate the optimal object position by maximizing the likelihood, which is obtained by convolving the foreground probability map with the pertinence mask. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art tracking algorithms significantly in terms of center position errors and success rates.
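The final localization step described above can be illustrated with a minimal sketch. It assumes the foreground probability map and the pertinence mask are already available as small 2D lists; the function name `best_position` and the toy values are hypothetical and not taken from the paper. The mask is slid over the map as a cross-correlation, which coincides with convolution when the mask is symmetric.

```python
def best_position(prob_map, mask):
    """Return the (top, left) window position with the highest mask-weighted score.

    prob_map: 2D list of foreground probabilities (assumed already computed).
    mask:     2D list of pertinence weights (assumed already computed).
    """
    H, W = len(prob_map), len(prob_map[0])
    h, w = len(mask), len(mask[0])
    best, best_score = None, float("-inf")
    # Exhaustively slide the mask over every valid window position.
    for top in range(H - h + 1):
        for left in range(W - w + 1):
            score = sum(
                prob_map[top + i][left + j] * mask[i][j]
                for i in range(h)
                for j in range(w)
            )
            if score > best_score:
                best, best_score = (top, left), score
    return best, best_score


# Toy example: a 2x2 high-probability blob in a 4x4 map, uniform mask.
toy_map = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.1, 0.1],
]
toy_mask = [[1.0, 1.0], [1.0, 1.0]]
position, score = best_position(toy_map, toy_mask)
```

How the paper builds the probability map and the mask is not specified in the abstract, so both are treated here as given inputs.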
|
Abstract: The objective of this paper is image reconstruction from the Bag-of-Visual-Words (BoVW), which is the de facto standard feature for image retrieval and recognition. Despite its wide use, no one has reconstructed an original image from its BoVW. This task is challenging for two reasons: 1) BoVW contains quantization errors when local descriptors are assigned to visual words. 2) BoVW lacks geometry information of local descriptors, because the occurrences of visual words are counted while ignoring their locations. To tackle this difficult task, we use a large-scale image database to estimate the spatial arrangement of local descriptors; this task then becomes a jigsaw puzzle problem with adjacency and global location costs of local descriptors. Solving this optimization problem is also challenging because it is known to be NP-hard. We propose a heuristic but efficient method to optimize it. To underscore the effectiveness of our method, we apply it to BoVWs calculated from about 100 different categories, and demonstrate that our method surprisingly can reconstruct original images, although the image features lack spatial information and include quantization errors.
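To make the two sources of information loss concrete, here is a minimal sketch of how a BoVW histogram is typically built. The function name `bovw_histogram`, the two-word codebook, and the toy descriptors are illustrative assumptions, not the paper's setup.

```python
def bovw_histogram(descriptors, codebook):
    """Count how many descriptors fall on each visual word.

    Each descriptor is replaced by its nearest codebook entry (quantization
    error), and only counts are kept (descriptor locations are discarded).
    """
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        hist[nearest] += 1
    return hist


# Hypothetical 2D descriptors and a two-word codebook.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 11.0), (0.0, 2.0)]
hist = bovw_histogram(descriptors, codebook)
# Reordering the descriptors gives the same histogram: the spatial order is gone.
hist_shuffled = bovw_histogram(list(reversed(descriptors)), codebook)
```

Because the histogram is invariant to the order (and position) of the descriptors, recovering an image from it requires extra information, which is the gap the abstract says it fills with a large-scale image database.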
|
#1444 - Rigid Motion Segmentation using Randomized Voting [pdf]
Heechul Jung, Jeongwoo Ju, Junmo Kim |
Abstract: In this paper, we propose a novel rigid motion segmentation algorithm called randomized voting (RV). This algorithm is based on epipolar geometry, and computes a score using the distance between the feature point and the corresponding epipolar line. This score is accumulated and utilized for final grouping. Our algorithm basically deals with two frames, so it is also applicable to the two-view motion segmentation problem. For evaluation of our algorithm, Hopkins 155 dataset, which is a representative test set for rigid motion segmentation, is adopted; it consists of two and three rigid motions. Our algorithm has provided the most accurate motion segmentation results among all of the state-of-the-art algorithms. The average error rate is 0.77%. In addition, when there is measurement noise, our algorithm is comparable with other state-of-the-art algorithms.
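The geometric quantity the abstract refers to, the distance between a feature point and its corresponding epipolar line, can be sketched as follows. This is only the standard point-to-line distance under the convention x2ᵀ F x1 = 0; the paper's actual voting score and grouping procedure are not given in the abstract. The function name `epipolar_distance` and the example matrix are assumptions chosen for illustration.

```python
import math


def epipolar_distance(F, x1, x2):
    """Distance from point x2 (second frame) to the epipolar line F * x1.

    F is a 3x3 fundamental matrix given as nested lists; x1 and x2 are
    (x, y) pixel coordinates in the first and second frame.
    """
    # Homogeneous coordinates of the point in the first frame.
    p = (x1[0], x1[1], 1.0)
    # Epipolar line l = F p with coefficients (a, b, c) of a*x + b*y + c = 0.
    a, b, c = (sum(F[r][k] * p[k] for k in range(3)) for r in range(3))
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)


# Example: pure horizontal camera translation, so epipolar lines are y2 = y1.
F_shift = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]
on_line = epipolar_distance(F_shift, (2.0, 3.0), (5.0, 3.0))
off_line = epipolar_distance(F_shift, (2.0, 3.0), (5.0, 7.0))
```

A point that moves consistently with the motion encoded by `F_shift` lies on its epipolar line (distance zero), while a point belonging to a different rigid motion generally does not.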
|
#1445 - Salient Region Detection via High-Dimensional Color Transform [pdf]
Jiwhan Kim, Dongyoon Han, Yu-Wing Tai, Junmo Kim |
Abstract: In this paper, we introduce a novel technique to automatically detect the salient region of an image via high-dimensional color transform. Our main idea is to represent a saliency map of an image as a linear combination of high-dimensional color space where salient regions and backgrounds can be distinctively separated. This is based on an observation that salient regions often have distinctive colors compared to the background in human perception, but human perception is often complicated and highly nonlinear. By mapping a low-dimensional RGB color to a feature vector in a high-dimensional color space, we show that we can linearly separate the salient regions from the background by finding an optimal linear combination of color coefficients in the high-dimensional color space. Our high-dimensional color space incorporates multiple color representations, including RGB, CIELab, and HSV, together with gamma corrections to enrich its representative power. Our experimental results on three benchmark datasets show that our technique is effective, and it is computationally efficient in comparison to previous state-of-the-art techniques.
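The idea of lifting a low-dimensional color into a higher-dimensional feature vector and then scoring it with a linear combination can be sketched roughly as below. This is a simplified illustration, not the paper's transform: it uses only RGB and HSV (CIELab is omitted because it is not in the Python standard library), the gamma values in `GAMMAS` are hypothetical, and the weights are hand-picked rather than optimized.

```python
import colorsys

# Hypothetical gamma exponents; the abstract does not list the values used.
GAMMAS = (0.5, 1.0, 1.5, 2.0)


def color_features(r, g, b):
    """Map an RGB color (components in [0, 1]) to a 24-dimensional vector."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    channels = (r, g, b, h, s, v)
    # Each channel is repeated once per gamma exponent.
    return [c ** gamma for c in channels for gamma in GAMMAS]


def saliency(features, weights):
    """Saliency score as a linear combination of the color features."""
    return sum(f * w for f, w in zip(features, weights))


features = color_features(1.0, 0.0, 0.0)   # pure red
weights = [0.0] * len(features)
weights[1] = 1.0                            # weight only the R channel at gamma = 1
score = saliency(features, weights)
```

In the approach described above the weights would be found by an optimization over the image rather than set by hand; that step is not shown here.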
|
#1459 - Blind Image Quality Assessment using Semi-supervised Rectifier Networks [pdf]
Huixuan Tang, Neel Joshi, Ashish Kapoor |
Abstract: It is often desirable to evaluate the quality of images with a perceptually relevant measure that does not require a reference image. Recent approaches to this problem have used human-provided quality scores with machine learning to learn a measure. The biggest hurdles to these efforts are: 1) the difficulty of generalizing across diverse types of distortions and 2) collecting the enormous amount of human-scored training data that is needed to learn the measure. We present a new blind image quality measure that addresses these difficulties by learning a robust, nonlinear kernel regression function using a rectifier neural network. The method is pre-trained with unlabeled data and fine-tuned with labeled data. It generalizes across a large set of images and distortion types without the need for a large amount of labeled data. We evaluate our approach on two benchmark datasets and show that our method outperforms the current state of the art. Furthermore, we show that our semi-supervised approach is robust to using varying amounts of labeled data.
|
#1469 - Quality Assessment for Comparing Image Enhancement Algorithms [pdf]
Zhengying Chen, Tingting Jiang, Yonghong Tian |
Abstract: As image enhancement algorithms have been developed in recent years, comparing the performance of different image enhancement algorithms has become a new task. In this paper, we propose a framework for quality assessment aimed at comparing image enhancement algorithms. Unlike traditional image quality assessment approaches, we focus on the relative quality ranking between enhanced images rather than giving an absolute quality score for a single enhanced image. We construct a dataset which contains source images with poor visibility and their enhanced images processed by different enhancement algorithms, and then perform subjective assessment in a pairwise manner to get the relative ranking of these enhanced images. A rank function is trained to fit the subjective assessment results, and can be used to predict ranks of new enhanced images, which indicate the relative quality of enhancement algorithms. The experimental results show that our proposed approach statistically outperforms state-of-the-art general-purpose NR-IQA algorithms.
|
#1474 - Tracking on the Product Manifold of Shape and Orientation for Tractography from Diffusion MRI [pdf]
Yuanxiang Wang, Hesamoddin Salehian, Guang Cheng, Baba Vemuri |
Abstract: Tractography refers to the process of tracing out the nerve fiber bundles from diffusion Magnetic Resonance Images (dMRI) data acquired either in vivo or ex-vivo. Tractography is a mature research topic within the field of diffusion MRI analysis, nevertheless, several new methods are being proposed on a regular basis thereby justifying the need, as the problem is not fully solved. Tractography is usually applied to the model (used to represent the diffusion MR signal or a derived quantity) reconstructed from the acquired data. Separating shape and orientation of these models was previously shown to approximately preserve diffusion anisotropy (a useful bio-marker) in the ubiquitous problem of interpolation. However, no further intrinsic geometric properties of this framework were exploited to date in literature. In this paper, we propose a new intrinsic recursive filter on the product manifold of shape and orientation. The recursive filter, dubbed IUKFPro, is a generalization of the unscented Kalman filter (UKF) to this product manifold. The salient contributions of this work are: (1) A new intrinsic UKF for the product manifold of shape and orientation. (2) Derivation of the Riemannian geometry of the product manifold. (3) IUKFPro is tested on synthetic and real data sets from various tractography challenge competitions. From the experimental results, it is evident that IUKFPro performs better than several competing schemes in literature with regards to the some of the err
|
#1483 - Optimizing Average Precision using Weakly Supervised Data [pdf]
Aseem Behl, M. Pawan Kumar, C.V. Jawahar |
Abstract: The performance of binary classification tasks, such as action classification and object detection, is often measured in terms of the average precision (AP). Yet it is common practice in computer vision to employ the support vector machine (SVM) classifier, which optimizes a surrogate 0-1 loss. The popularity of SVM can be attributed to its empirical performance. Specifically, in fully supervised settings, SVM tends to provide similar accuracy to the AP-SVM classifier, which directly optimizes an AP-based loss. However, we hypothesize that in the significantly more challenging and practically useful setting of weakly supervised learning, it becomes crucial to optimize the right accuracy measure. In order to test this hypothesis, we propose a novel latent AP-SVM that minimizes a carefully designed upper bound on the AP-based loss function over a weakly supervised dataset. Using publicly available datasets, we demonstrate the advantage of our approach over standard loss-based binary classifiers on two challenging problems: action classification and character recognition.
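For readers unfamiliar with the metric, the following is a minimal sketch of how average precision is commonly computed from classifier scores and binary labels. It shows only the evaluation measure, not the latent AP-SVM training procedure; the function name `average_precision` and the toy data are illustrative.

```python
def average_precision(scores, labels):
    """Average precision of a ranking induced by scores.

    scores: list of real-valued classifier outputs.
    labels: list of 0/1 ground-truth labels (1 = positive).
    """
    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            # Precision at the rank of each positive sample.
            total += hits / rank
    return total / hits if hits else 0.0


# Positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2.
ap_mixed = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
# All positives ranked above all negatives: AP = 1.
ap_perfect = average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

AP depends only on the relative ordering of positives and negatives, which is why a loss based on per-sample classification errors can rank two classifiers differently than AP does.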
|
#1488 - Pyramid-based Visual Tracking Using Sparsity Represented Mean Transform [pdf]
Zhe Zhang, Kin Hong Wong |
Abstract: In this paper, we propose a robust method for visual tracking relying on mean shift, sparse coding and spatial pyramids. Firstly, we extend the original mean shift approach to handle orientation space and scale space and name this new method mean transform. The mean transform method estimates the motion, including the location, orientation and scale, of the object window of interest simultaneously and effectively. Secondly, a pixel-wise dense patch sampling technique and a region-wise trivial template designing scheme are introduced, which enable our approach to run very accurately and efficiently. Additionally, instead of using either holistic representation or local representation only, we apply spatial pyramids, combining these two representations in our approach to deal with partial occlusion problems robustly. The experimental results show that our approach outperforms state-of-the-art methods on many benchmark sequences.
|
#1492 - Similarity Comparisons for Interactive Fine-Grained Categorization [pdf]
Catherine Wah, Grant Van Horn, Steven Branson, Subhransu Maji, Pietro Perona, Serge Belongie |
Abstract: Current human-in-the-loop fine-grained visual categorization systems depend on a predefined vocabulary of attributes and parts, usually determined by experts. In this work, we move away from that expert-driven and attribute-centric paradigm and present a novel interactive classification system that incorporates computer vision and perceptual similarity metrics in a unified framework. At test time, users are asked to judge relative similarity between a query image and various sets of images; these general queries do not require expert-defined terminology and are applicable to other domains and basic-level categories, enabling a flexible, efficient, and scalable system for fine-grained categorization with humans in the loop. Our system outperforms existing state-of-the-art systems for relevance feedback-based image retrieval as well as interactive classification, resulting in a reduction of up to 43% in the average number of questions needed to correctly classify an image.
|
#1497 - A Cause and Effect analysis of motion trajectories for modeling actions [pdf]
Sanath Narayan, Kalpathi Ramakrishnan |
Abstract: An action is typically composed of different parts of the object moving in particular sequences. The presence of different motions (represented as a 1D histogram) has been used in the traditional bag-of-words (BoW) approach for recognizing actions. However, the interactions among the motions also form a crucial part of an action. Different object-parts have varying degrees of interactions with the other parts during an action cycle. It is these interactions we want to quantify in order to bring in additional information about the actions. In this paper we propose a causality-based approach for quantifying the interactions to aid action classification. Granger causality is used to compute the cause and effect relationships for pairs of motion trajectories of a video. A 2D histogram descriptor for the video is constructed using these pairwise measures. Our proposed method of obtaining pairwise measures for videos is also applicable to large datasets. We have conducted experiments on challenging action recognition databases such as HMDB51 and UCF50 and shown that our causality descriptor helps in encoding additional information regarding the actions and outperforms the state-of-the-art approaches.
|
Abstract: We address the problem of joint detection and segmentation of multiple object instances in an image, a key step towards scene understanding. Inspired by data-driven methods, we propose an exemplar-based approach to the task of instance segmentation, in which a set of reference image/shape masks is used to find multiple objects. We design a novel CRF framework that jointly models object appearance, shape deformation, and object occlusion. To tackle the challenging MAP inference problem, we derive an alternating procedure that interleaves object segmentation and shape/appearance adaptation. We evaluate our method on two datasets with instance labels and show promising results.
|
#1518 - Generating object segmentation proposals using global and local search [pdf]
Pekka Rantalankila, Juho Kannala, Esa Rahtu |
Abstract: We present a method for generating object segmentation proposals from groups of superpixels. The goal is to propose accurate segmentations for all objects of an image. The proposed object hypotheses can be used as input to object detection systems and thereby improve efficiency by replacing exhaustive search. The segmentations are generated in a class-independent manner and therefore the computational cost of the approach is independent of the number of object classes. Our approach combines both global and local search in the space of sets of superpixels. The local search is implemented by greedily merging adjacent pairs of superpixels to build a bottom-up segmentation hierarchy. The regions from such a hierarchy directly provide a part of our region proposals. The global search provides the other part by performing a set of graph cut segmentations on a superpixel graph obtained from an intermediate level of the hierarchy. The parameters of the graph cut problems are learnt in such a manner that they provide complementary sets of regions. Experiments with Pascal VOC images show that we reach state-of-the-art with greatly reduced computational cost.
|
#1530 - Learning Important Spatial Pooling Regions for Scene Classification [pdf]
Di Lin, Cewu Lu, Renjie Liao, Jiaya Jia |
Abstract: We address the false response influence problem when learning and applying discriminative parts to construct the mid-level representation in scene classification. It is often caused by the complexity of latent image structure that yields false response when convolving part filters with input images. This problem makes mid-level representation, even after pooling, not distinct enough to classify input data correctly to scene categories. Our solution is to learn important spatial pooling regions along with their appearance. Our experiments show that this new framework significantly suppresses false response and produces good results on several datasets, including MIT-Indoor, 15-Scene, and UIUC 8-Sport. When combined with global image features, we achieve state-of-the-art performance.
|
#1531 - Multipoint Filtering with Local Polynomial Approximation and Range Guidance [pdf]
Xiao Tan, Changming Sun, Tuan Pham |
Abstract: This paper presents a novel method for performing guided image filtering using multipoint local polynomial approximation (LPA) with range guidance. In our method, the LPA is extended from a pointwise model into a multipoint model for reliable filtering and better preserving image gradients which usually contain the essential information in the image to be filtered. In addition, we develop a scheme for generating a spatial adaptive support region around each point in constant time invariant to the size of the region. By using the hybrid of the local polynomial model and color/intensity based range guidance, the proposed method not only preserves edges but also does a much better job in preserving gradients than existing popular filtering methods. Our method proves to be effective in a number of applications: depth image upsampling, joint image de-noising, details enhancement, and image abstraction. Experimental results show that our method provides better results than state-of-the-art methods and it is also computationally efficient.
|
#1532 - Real-time Model-based Articulated Object Pose Detection and Tracking with Variable Rigidity Constraints [pdf]
Karl Pauwels, Leonardo Rubio, Eduardo Ros |
Abstract: A novel model-based approach is introduced for real-time detection and tracking of the pose of general articulated objects. A variety of dense motion and depth cues are integrated into a novel articulated Iterative Closest Point (ICP) approach. The proposed method can independently track the six-degrees-of-freedom pose of over a hundred rigid parts in real-time while, at the same time, imposing articulation constraints on the relative motion of different parts. We propose a novel rigidization framework for optimally handling unobservable parts during tracking. This involves rigidly attaching the minimal number of unseen parts to the rest of the structure in order to most effectively use the currently available knowledge. We show how this framework can also be used for detecting rather than tracking, which allows for automatic system initialization or for incorporating pose estimates obtained from independent object part detectors. Improved performance over alternative solutions is shown on real-world sequences.
|
Abstract: In this paper, we approach the image deblurring problem from a new perspective by proposing a separable kernel to represent the inherent properties of the camera and scene system. Specifically, we decompose a blur kernel into three individual descriptors (trajectory, intensity and point spread function) so that they can be optimized separately. To demonstrate the advantages, we extract one-pixel-width trajectories of blur kernels and propose a random perturbation algorithm to optimize them while preserving their continuity. In many cases where current deblurring approaches fall into local minima, excellent deblurred results and correct blur kernels can be obtained by individually optimizing the kernel trajectories. Our work strongly suggests that more constraints and priors should be introduced to blur kernels in solving the deblurring problem because blur kernels have lower dimensions than images.
|
#1546 - Discrete-Continuous Depth Estimation from a Single Image [pdf]
Miaomiao Liu, Mathieu Salzmann, Xuming He |
Abstract: In this paper, we tackle the problem of estimating the depth of a scene from a single image. This is a challenging task, since a single image on its own does not provide any depth cue. To address this, we exploit the availability of a pool of images for which the depth is known. More specifically, we formulate monocular depth estimation as a discrete-continuous optimization problem, where the continuous variables encode the depth of the superpixels in the input image, and the discrete ones represent relationships between neighboring superpixels. The solution to this discrete-continuous optimization problem is then obtained by performing inference in a graphical model using particle belief propagation. The unary potentials in this graphical model are computed by making use of the images with known depth. We demonstrate the effectiveness of our model in both indoor and outdoor scenarios. Our experimental evaluation shows that our depth estimates are more accurate than those of existing methods on standard datasets.
|
#1548 - A Learning-to-Rank Approach for Image Color Enhancement [pdf]
Jianzhou Yan, Stephen Lin, Sing Bing Kang, Xiaoou Tang |
Abstract: We present a machine-learned ranking approach for automatically enhancing the color of a photograph. Unlike previous techniques that train on pairs of images before and after adjustment by a human user, our method takes into account the intermediate steps taken in the enhancement process, which provide detailed information on the person's color preferences. To make use of this data, we formulate the color enhancement task as a learning-to-rank problem in which ordered pairs of images are used for training, and then various color enhancements of a novel input image can be evaluated from their corresponding rank values. From the parallels between the decision tree structures we use for ranking and the decisions made by a human during the editing process, we posit that breaking a full enhancement sequence into individual steps can facilitate training. Our experiments show that this approach compares well to existing methods for automatic color enhancement.
|
#1552 - Gait Recognition under Speed Transition [pdf]
Al Mansur, Rasyid Aqmar, Yasushi Makihara, Yasushi Yagi |
Abstract: This paper describes a method of gait recognition from accelerated or decelerated gait image sequences. As a speed change occurs due to a change of pitch (the first-order derivative of a phase, namely, a gait stance) and/or stride, we model this speed change using a cylindrical manifold whose azimuth and height correspond to the phase and the stride, respectively. A radial basis function (RBF) interpolation framework is used to learn subject-specific mapping matrices for mapping from the manifold to image space. Given an input gait image sequence with a speed transition from a test subject, we estimate the mapping matrix of the test subject as well as the phase and stride sequence using an energy minimization framework. The following three points are considered: (1) fitness of the synthesized images to the input image sequence as well as to an eigenspace constructed by exemplars of training subjects; (2) smoothness of the phase and the stride sequence; and (3) pitch and stride fitness to the pitch-stride preference model. Using the estimated mapping matrix, we synthesize a constant-speed gait image sequence, and extract a conventional period-based gait feature from it for matching. We conducted experiments using real gait image sequences with speed transitions from 179 subjects and demonstrated the effectiveness of the proposed method.
|
#1553 - Towards Unified Human Parsing and Pose Estimation [pdf]
Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, Shuicheng Yan |
Abstract: We study the problem of human body configuration analysis, more specifically, human parsing and human pose estimation. These two tasks, i.e., identifying the semantic regions and body joints respectively over the human body image, are intrinsically highly correlated. However, previous works generally solve these two problems separately or iteratively. In this work, we propose a unified framework for simultaneous human parsing and pose estimation based on semantic parts. By utilizing Parselets (ICCV 2013) and Mixture of Joint-Group Templates (MJGT) as the representations for these semantic parts, we seamlessly formulate the human parsing and pose estimation problem jointly within a unified framework via a tailored And-Or graph. A novel Grid Layout Feature is then designed to effectively capture the spatial co-occurrence/occlusion information between/within the Parselets and MJGTs. Thus the mutually complementary nature of these two tasks can be harnessed to boost the performance of each other. The resultant unified model can be solved using the structure learning framework in a principled way. Comprehensive evaluations on two benchmark datasets for both human parsing and pose estimation tasks demonstrate the effectiveness of the proposed framework when compared with the state-of-the-art methods.
|
#1554 - Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images [pdf]
Eleonora Vig, Michael Dorr, David Cox |
Abstract: Saliency prediction typically relies on hand-crafted (multiscale) features that are combined in different ways to form a "master" saliency map, which encodes local image conspicuity. Recent improvements to the state of the art on standard benchmarks such as MIT1003 have been achieved mostly by incrementally adding more and more hand-tuned features (such as car or face detectors) to existing models. In contrast, we here follow an entirely automatic data-driven approach that performs a large scale search for optimal features. We identify those instances of a richly-parameterized bio-inspired model family (hierarchical neuromorphic networks) that successfully predict image saliency. Because of the high dimensionality of this parameter space, we use automated hyperparameter optimization to efficiently guide the search. The optimal blend of such multilayer features combined with a simple linear classifier achieves excellent performance on several image saliency benchmarks. Models outperform the state of the art on MIT1003, on which features and classifiers are learned. Without additional training, these models generalize well to two other image saliency data sets, Toronto and NUSEF, despite their different image content. Finally, our algorithm scores best of all the 19 models evaluated to date on the MIT300 saliency challenge, which uses a hidden test set to facilitate an unbiased comparison.
|
#1556 - NMF-KNN: Image Annotation using Weighted Multi-view Non-Negative Matrix Factorization [pdf]
Mahdi Kalayeh, Haroon Idrees, Mubarak Shah |
Abstract: Real-world image databases such as Flickr are characterized by the continuous addition of new images. Recent approaches to image annotation - the problem of assigning tags to images - have two major drawbacks. First, either models are learned using the entire training data, or, to handle the issue of dataset imbalance, tag-specific discriminative models are trained. Such models become obsolete and require relearning when new images and tags are added to the database. Second, the task of feature fusion is typically dealt with using ad hoc approaches. In this paper, we present a weighted extension of Multi-view Non-Negative Matrix Factorization (NMF) to address the aforementioned drawbacks. The key idea is to learn a query-specific generative model on the features of nearest neighbors using the proposed NMF-KNN, which imposes a consensus constraint on the coefficient matrices across different features. This makes the coefficient vectors consistent across features and thus naturally solves the problem of feature fusion, while the weight matrices introduced in the proposed formulation alleviate the issue of dataset imbalance. Furthermore, our approach, being query-specific, is agnostic to the addition of images and tags to a database. We tested our method on two datasets used for evaluation of image annotation and obtained competitive results.
|
Abstract: We develop a binary action descriptor that is depth-aware and thus achieves good invariance, for the same action type, under varying time, scale, viewpoint, rotation and background. It is robust to occlusion and data corruption as well. The descriptor runs very fast thanks to its binary features. Working together with a standard learning algorithm, the proposed descriptor achieves state-of-the-art or even better performance on benchmark datasets in our extensive experimental validation, with impressive time performance.
|
#1565 - Human Action Recognition across Datasets by Foreground-weighted Histogram Decomposition [pdf]
Waqas Sultani, Imran Saleemi |
Abstract: This paper attempts to address the problem of recognizing human actions while training and testing on distinct datasets, when test videos are neither labeled nor available during training. In this scenario, learning a joint vocabulary and domain transfer techniques are not applicable. In the process of attempting the problem at hand, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We perform different types of partitioning of the gist feature space for several datasets and compute measures of background scene complexity, as well as of the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region, using motion, appearance, and saliency together in a 3D MRF based framework. We also propose multiple ways to exploit the foreground confidence: to improve bag-of-words vocabulary, histogram representation of a video, and a novel histogram decomposition based representation and kernel. Extensive experiments on several datasets show improved recognition accuracy over baseline methods, especially when training and testing across datasets.
|
Abstract: Scribbles in scribble-based interactive segmentation such as graph-cut are usually assumed to be perfectly accurate, i.e., foreground scribble pixels will never be segmented as background in the final segmentation. However, it can be hard to draw perfectly accurate scribbles, especially on fine structures of the image or on mobile touch-screen devices. In this paper, we propose a novel ratio energy function that tolerates errors in the user input while encouraging maximum use of the user input information. More specifically, the ratio energy aims to minimize the graph-cut energy while maximizing the user input respected in the segmentation. The ratio energy function can be exactly optimized using an efficient iterated graph cut algorithm. The robustness of the proposed method is validated on the GrabCut dataset using both synthetic scribbles and manual scribbles. The experimental results show that the proposed algorithm is robust to the errors in the user input and preserves the "anchoring" capability of the user input.
|
#1567 - Efficient Squared Curvature [pdf]
Claudia Nieuwenhuis, Eno Toeppe, Lena Gorelick, Olga Veksler, Yuri Boykov |
Abstract: Curvature has received increased attention as an important alternative to length based regularization in computer vision. In contrast to length, it preserves elongated structures and fine details. Existing approaches are either inefficient, or have low angular resolution and yield results with strong block artifacts. We derive a new model for computing squared curvature based on integral geometry. The model counts responses of straight line triple cliques. The corresponding energy decomposes into submodular and supermodular pairwise potentials. We show that this energy can be efficiently minimized even for high angular resolutions using the trust region framework. Our results confirm that we obtain accurate and visually pleasing solutions without strong artifacts at reasonable runtimes.
|
#1570 - Backscatter Compensated Photometric Stereo with 3 Sources [pdf]
Chourmouzios Tsiotsios, Maria Angelopoulou, Tae-Kyun Kim, Andrew Davison |
Abstract: Photometric stereo offers the possibility of object shape reconstruction via reasoning about the amount of light reflected from oriented surfaces. However, in murky media such as sea water, the illuminating light interacts with the medium and some of it is backscattered towards the camera. Due to this additive light component, the standard Photometric Stereo equations lead to poor quality shape estimation. Previous authors have attempted to reformulate the approach but have either neglected backscatter entirely or disregarded its non-uniformity on the sensor when camera and lights are close to each other. We show that effectively compensating for the backscatter component permits a linear formulation of Photometric Stereo which recovers an accurate normal map using only 3 lights. Our backscatter compensation method for point-sources can be used for estimating the uneven backscatter directly from single images without any prior knowledge about the characteristics of the medium or the scene. We support our method by comparing it with previous approaches through extensive experimental results, where a variety of objects are imaged in a big water tank whose turbidity is systematically increased, and show reconstruction quality which degrades little relative to clean water results even with a very significant scattering level.
|
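As a rough illustration of the linear three-light formulation mentioned in the abstract above, the following sketch solves the textbook Lambertian photometric stereo system for a single pixel. It assumes the backscatter component has already been removed from the intensities; the compensation step itself is not shown, and the function names (`det3`, `solve3`, `photometric_stereo_pixel`) are hypothetical, not taken from the paper.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve the 3x3 linear system a x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("light directions are (nearly) coplanar")
    x = []
    for col in range(3):
        # Replace one column of a with b and take the determinant ratio.
        m = [[b[r] if c == col else a[r][c] for c in range(3)]
             for r in range(3)]
        x.append(det3(m) / d)
    return x

def photometric_stereo_pixel(lights, intensities):
    """Recover (albedo, unit normal) for one Lambertian pixel.

    lights: three unit light directions (rows of L).
    intensities: three backscatter-free measurements, I = albedo * L n.
    Assumption: no shadows or specular highlights at this pixel.
    """
    g = solve3(lights, intensities)          # g = albedo * normal
    albedo = math.sqrt(sum(v * v for v in g))
    if albedo == 0.0:
        return 0.0, [0.0, 0.0, 0.0]
    return albedo, [v / albedo for v in g]
```

With three non-coplanar light directions the system has a unique solution, which is why three sources are the minimum for this linear model.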
#1588 - Discriminative Sparse Inverse Covariance Matrix: Application in Brain Functional Network Classification [pdf]
Luping Zhou, Lei Wang, Philip Ogunbona |
Abstract: Recent studies show that mental disorders change the functional organization of the brain, which could be investigated via various imaging techniques. Analyzing such changes is becoming critical as it could provide new biomarkers for diagnosing and monitoring the progression of the diseases. Functional connectivity analysis studies the covarying activity of neuronal populations in different brain regions. The sparse inverse covariance estimation (SICE), also known as graphical LASSO, is one of the most important tools for functional connectivity analysis, which estimates the interregional partial correlations of the brain. Although it is increasingly used for predicting mental disorders, SICE is basically a generative method that may not necessarily perform well on classifying neuroimaging data. In this paper, we propose a learning framework to effectively improve the discriminative power of SICEs by taking advantage of the samples in the opposite class. We formulate our objective as convex optimization problems for both one-class and two-class classifications. By analyzing these optimization problems, we not only solve them efficiently in their dual form, but also gain insights into this new learning framework. The proposed framework is applied to analyzing the brain metabolic covariant networks built upon FDG-PET images for the prediction of Alzheimer's disease, and shows significant improvement of classification performance for both one-class and two-class scenarios.
|
#1594 - FAST LABEL: Easy and Efficient Optimization of Joint Multi-Label and Estimation Problems [pdf]
Byung-Woo Hong, Ganesh Sundaramoorthi |
Abstract: In this paper, we derive an easy-to-implement and efficient algorithm for solving multi-label image partitioning problems (specifically for the problem addressed by Region Competition) where it is desired to jointly determine a parameter for each of the regions defined by the partition. Given an estimate of the parameters, a fast approximate solution to the multi-label sub-problem is derived by a global update using simple smoothing and thresholding steps. The method is empirically validated to be robust to fine details of the image that plague local solutions. Further, in comparison to global methods for the multi-label problem, our method is more efficient and it is easy for a non-specialist to implement. Indeed, we give sample Matlab code for the multi-label Chan-Vese problem in this paper! We perform experiments to compare the proposed method to the state-of-the-art in multi-label solutions to Region Competition and show our method achieves equal or better accuracy, with the advantage being speed and ease of implementation.
|
#1600 - Dense Semantic Image Segmentation with Objects and Attributes [pdf]
Shuai Zheng, Ming-Ming Cheng, Jonathan Warrell, Paul Sturgess, Vibhav Vineet, Carsten Rother, Philip Torr |
Abstract: The concepts of objects and attributes are both important for precisely describing images, since verbal descriptions often contain both adjectives and nouns (e.g. "I see a shiny red wall"). In this paper, we formulate the problem of joint visual attribute and object class image segmentation as a dense multi-labeling problem, where each pixel in an image can be associated with both an object-class and a set of visual attribute labels. In order to learn the label correlations, we adopt a boosting based piecewise training approach with respect to the visual appearance and co-occurrence cues. We use a filtering-based mean-field approximation approach for efficient joint inference. Further, we develop a hierarchical model to incorporate region-level object and attribute information. Experiments on the aPascal, CORE and attribute augmented NYU indoor scenes datasets show that the proposed approach is able to achieve state-of-the-art results.
|
#1603 - Learning Non-Linear Reconstruction Models for Image Set Classification [pdf]
Munawar Hayat, Mohammed Bennamoun, Senjian An |
Abstract: We propose a deep learning framework for image set classification with application to face recognition. An Adaptive Deep Network Template (ADNT) is defined whose parameters are initialized by performing unsupervised pre-training in a layer-wise fashion using Gaussian Restricted Boltzmann Machines (GRBMs). The pre-initialized ADNT is then separately trained for images of each class and class-specific models are learnt. Based on the minimum reconstruction error from the learnt class-specific models, a majority voting strategy is used for classification. The proposed framework is extensively evaluated for the task of image set classification based face recognition on Honda/UCSD, CMU Mobo, YouTube Celebrities and a Kinect dataset. Our experimental results show that the proposed method achieves the best performance on all datasets with a 9% relative increase in the performance compared with the existing state-of-the-art for the challenging YouTube Celebrities dataset.
|
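The classification rule described in the abstract above (pick the class whose model reconstructs each image best, then take a majority vote over the image set) can be sketched as follows. This is only an illustration of the voting step: the reconstruction errors are assumed to be given, the trained class-specific networks are not modeled, and the function name and tie-breaking rule are assumptions rather than details from the paper.

```python
def classify_image_set(errors):
    """Majority vote over per-image minimum reconstruction errors.

    errors[i][c] is the (assumed precomputed) reconstruction error of
    image i under the model for class c. Returns the winning class index.
    Ties are broken in favor of the lowest class index; this rule is an
    assumption, the abstract does not specify one.
    """
    if not errors:
        raise ValueError("image set is empty")
    num_classes = len(errors[0])
    votes = [0] * num_classes
    for row in errors:
        # Each image votes for the class with the smallest error.
        best = min(range(num_classes), key=lambda c: row[c])
        votes[best] += 1
    return max(range(num_classes), key=lambda c: (votes[c], -c))
```

For example, with three images and three classes, `classify_image_set([[0.9, 0.2, 0.5], [0.3, 0.4, 0.8], [0.7, 0.1, 0.6]])` yields class 1, since two of the three images are best reconstructed by that class's model.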
#1606 - Depth and Skeleton Associated Action Recognition without Online Accessible RGB-D Cameras [pdf]
Yen-Yu Lin, Ju-Hsuan Hua, Nick Tang, Min-Hung Chen, Hong-Yuan Liao |
Abstract: The recent advances in RGB-D cameras have allowed us to better solve increasingly complex computer vision tasks. However, modern RGB-D cameras are still restricted by their short effective distances. This limitation may make RGB-D cameras not accessible online in practice, and degrade their applicability. We propose an alternative scenario to address this problem, and illustrate it with the application to action recognition. We use Kinect to collect, offline, an auxiliary multi-modal database, in which not only the RGB videos but also the depth maps and skeleton structures of actions of interest are available. Our approach aims to enhance action recognition in RGB videos by leveraging the extra database. Specifically, it optimizes a feature transformation, by which the actions to be recognized can be concisely reconstructed by entries in the auxiliary database. In this way, the inter-database variations are adapted. More importantly, each action can be augmented with additional depth and skeleton images retrieved from the auxiliary database. The proposed approach has been evaluated on three benchmarks of action recognition. The promising results show that the augmented depth and skeleton features can lead to a remarkable boost in recognition accuracy.
|
#1609 - Subspace Tracking under Dynamic Dimensionality for Online Background Subtraction [pdf]
Matthew Berger, Lee Seversky |
Abstract: Long-term modeling of background motion in videos is an important and challenging problem used in numerous applications such as segmentation and event recognition. A major challenge in modeling the background from point trajectories lies in dealing with the variable length duration of trajectories, which can be due to such factors as trajectories entering and leaving the frame or occlusion from different depth layers. This work proposes an online method for background modeling of dynamic point trajectories via tracking of a linear subspace describing the background motion. To cope with variability in trajectory durations, we cast subspace tracking as an instance of subspace estimation under missing data, using a least-absolute deviations formulation to robustly estimate the background in the presence of arbitrary foreground motion. Relative to previous works, our approach is extremely fast and scales to arbitrarily long videos by processing new frames as they arrive in a sequential fashion.
|
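The abstract above relies on a least-absolute deviations (L1) formulation for robustness to foreground motion. The robustness argument is easiest to see in the scalar case, where the L1 minimizer is the median and the least-squares minimizer is the mean. The sketch below shows only that scalar analogy; it is not the paper's subspace tracker, and the function names are hypothetical.

```python
from statistics import mean, median

def l2_estimate(values):
    """Least-squares location estimate: minimizes sum (v - c)^2."""
    return mean(values)

def l1_estimate(values):
    """Least-absolute-deviations estimate: minimizes sum |v - c|."""
    return median(values)

def total_abs_dev(values, c):
    """Sum of absolute deviations of the values from c."""
    return sum(abs(v - c) for v in values)
```

With four "background" observations near 1.0 and one gross outlier at 9.0, the L1 estimate stays with the background values while the least-squares estimate is pulled toward the outlier.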
#1611 - Adaptive Object Retrieval with Kernel Reconstructive Hashing [pdf]
Haichuan Yang, Xiao Bai, Jun Zhou, Peng Ren, Jian Cheng, Zhihong Zhang |
Abstract: Hashing is very useful for fast approximate similarity search on large databases. In the unsupervised setting, most hashing methods aim at preserving the similarity defined by Euclidean distance. Hash codes generated by these approaches only keep their Hamming distance corresponding to the pairwise Euclidean distance, ignoring the local distribution of each data point. This objective does not hold for k-nearest neighbors search. In this paper, we first propose a new adaptive similarity measure which is consistent with k-NN search, and prove that it leads to a valid kernel. Then we propose a hashing scheme which uses binary codes to preserve the kernel function. Using low-rank approximation, our hashing framework is more effective than existing methods that preserve similarity over arbitrary kernels. The proposed kernel function, hashing framework, and their combination have demonstrated significant advantages compared with several state-of-the-art methods.
|
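To make the retrieval side of the abstract above concrete, here is a minimal sketch of k-nearest-neighbor search over binary hash codes ranked by Hamming distance. It assumes the codes have already been produced by some hashing scheme and are stored as Python integers; it does not implement the kernel reconstructive hashing method itself, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two codes stored as integers."""
    return bin(a ^ b).count("1")

def knn_hamming(query, codes, k):
    """Indices of the k database codes closest to the query.

    Ties in distance are broken by database order (an assumption made
    here only to keep the result deterministic).
    """
    ranked = sorted(range(len(codes)),
                    key=lambda i: (hamming(query, codes[i]), i))
    return ranked[:k]
```

For instance, with `codes = [0b1010, 0b1000, 0b0101, 0b1011]` and `query = 0b1010`, the Hamming distances are 0, 1, 4 and 1, so `knn_hamming(query, codes, 3)` returns `[0, 1, 3]`.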
#1643 - Learning Inhomogeneous FRAME Models for Object Patterns [pdf]
Jianwen Xie, Wenze Hu, Song Chun Zhu, Ying Nian Wu |
Abstract: The FRAME (Filters, Random field, And Maximum Entropy) model is a spatially stationary (homogeneous) Markov random field model for texture patterns. The model is a maximum entropy distribution that reproduces the observed marginal histograms of responses from a bank of filters, where the histograms are pooled spatially over all the image pixels. In this article, we investigate an inhomogeneous version of the FRAME model and apply it to modeling object patterns. The inhomogeneous FRAME is a non-stationary Markov random field model that reproduces the observed distributions or statistics of filter responses at all the individual locations, scales and orientations without spatial pooling. Our experiments show that the inhomogeneous FRAME model is capable of generating a wide variety of object patterns in natural images. We then propose a sparsified version of the inhomogeneous FRAME model where the model reproduces observed statistical properties at selected locations, scales and orientations. We propose to select these locations, scales and orientations by a shared sparse coding scheme, and we explore the connection between the sparse FRAME model and the linear additive sparse coding model. Our experiments show that it is possible to learn sparse FRAME models in unsupervised fashion and the learned models are useful for object classification.
|
#1646 - Automatic Feature Learning for Robust Shadow Detection [pdf]
Salman Khan, Mohammed Bennamoun, Ferdous Sohel, Roberto Togneri |
Abstract: We present a practical framework to automatically detect shadows in real world scenes from a single photograph. Previous works on shadow detection put a lot of effort in designing shadow variant and invariant hand-crafted features. In contrast, our framework automatically learns the most relevant features in a supervised manner using multiple convolutional deep neural networks (ConvNets). The 7-layer network architecture of each ConvNet consists of alternating convolution and sub-sampling layers. The proposed framework learns features at the super-pixel level and along the object boundaries. In both cases, features are extracted using a context aware window centered at interest points. The predicted posteriors based on the learned features are fed to a conditional random field model to generate smooth shadow contours. Our proposed framework consistently performed better than the state-of-the-art on all major shadow databases (collected under a variety of conditions).
|
#1658 - In Search of Inliers: 3D Correspondence by Local and Global Voting [pdf]
Anders Buch, Yang Yang, Norbert Krüger, Henrik Petersen |
Abstract: We present a method for finding correspondence between 3D models. From an initial set of feature correspondences, our method uses a fast voting scheme to separate the inliers from the outliers. The novelty of our method lies in the use of a combination of local and global constraints to determine if a vote should be cast. On a local scale, we use simple, low-level geometric invariants. On a global scale, we apply covariant constraints for finding compatible correspondences. We guide the sampling for collecting voters by downward dependencies on previous voting stages. All of this together results in an accurate matching procedure. We evaluate our algorithm by controlled and comparative testing on different datasets, giving superior performance compared to state of the art methods. In a final experiment, we apply our method for 3D object detection, showing potential use of our method within higher-level vision.
|
#1662 - Probabilistic Labeling Cost for High-Accuracy Multi-view Reconstruction [pdf]
Ilya Kostrikov, Esther Horbert, Bastian Leibe |
Abstract: In this paper, we propose a novel labeling cost for globally optimal continuous optimization for multi-view reconstruction. Existing approaches use data terms with specific weaknesses that are vulnerable to common challenges, such as low-textured regions or specularities. Our new probabilistic method implicitly discards outliers and can be shown to become more exact the closer we get to the true object surface. Our approach achieves top results among all published methods on the Middlebury DINO SPARSE dataset and also delivers accurate results on several other datasets with widely varying challenges, for which it works in unchanged form.
|
#1672 - Unified Face Analysis by Iterative Multi-Output Random Forests [pdf]
Xiaowei Zhao, Tae-Kyun Kim, Wenhan Luo |
Abstract: In this paper, we present a unified method for face image analysis, i.e., jointly estimating facial pose, expression and detecting facial landmarks in real-world facial images. The relations among the tasks are fully exploited to boost the performance of each task. To achieve this goal, we cast it as a joint probability estimation problem and propose an iterative Multi-Output Random Forests (iMORF) algorithm. Specifically, a hierarchical face analysis forest is learned to perform classification of head pose and facial expression at the top level. With the latent shape prior provided by the estimated head pose and facial expression, more accurate facial landmark detection is obtained at the bottom level. Once we get the prediction of facial landmarks, the shape-related geometric features are extracted together with the image appearance features to further improve the estimation of the head pose and facial expression. These two steps, for pose/expression and landmarks, are iterated until convergence, i.e., no change in estimated landmark positions. Experiments on publicly available real world face datasets demonstrate that the performance of all individual tasks is greatly improved by our iMORF algorithm, and our method outperforms state-of-the-art methods.
|
#1678 - SphereFlow: 6 DoF Scene Flow from RGB-D Pairs [pdf]
Michael Hornacek, Andrew Fitzgibbon, Margrit Gelautz, Carsten Rother |
Abstract: We address the problem of computing dense scene flow between a pair of consecutive RGB-D frames. We seek correspondences between the two frames with respect to patches of 3D points that we identify as the inliers of spheres. Our main contribution is to show that by reasoning in terms of such patches under 6 DoF rigid body motions in 3D, we succeed in obtaining compelling results without relying on either of two simplifying assumptions that permeate much of the earlier literature: brightness constancy or local surface planarity. As a consequence, our output is a dense field of 6 DoF 3D rigid body motions, in contrast to the 3D translations that are the norm in scene flow. Reasoning in terms of 6 DoF motions additionally allows us to introduce a 6 DoF consistency check for the flow computed in both directions, a patchwise silhouette check to help reason about alignments in occlusion areas, and an intuitive local rigidity prior to promote smoothness of the flow fields. We carry out our optimization in two steps, obtaining a first correspondence field using PatchMatch, and subsequently using α-expansion to jointly handle occlusions and regularize the field. We show attractive flow results on challenging synthetic and real-world scenes that push the practical limits of the aforementioned assumptions.
|
#1685 - Partial Occlusion Handling for Visual Tracking via Robust Part Matching [pdf]
Tianzhu Zhang, Kui Jia, Changsheng Xu, Yi Ma, Narendra Ahuja |
Abstract: Part-based visual tracking is advantageous due to its robustness against partial occlusion. However, how to effectively exploit the confidence scores of individual parts to construct a robust tracker is still a challenging problem. In this paper, we address this problem by simultaneously matching parts in each of multiple frames, which is realized by a locality-constrained low-rank sparse learning method that establishes multi-frame part correspondences through optimization of partial permutation matrices. The proposed part matching tracker (PMT) has a number of attractive properties. (1) It exploits the spatial-temporal locality-constrained property for robust part matching. (2) It matches local parts from multiple frames jointly by considering their low-rank and sparse structure information, which can effectively handle part appearance variations due to occlusion or noise. (3) The proposed PMT model has the inbuilt mechanism of leveraging multi-mode target templates, so that the dilemma of template updating when encountering occlusion in tracking can be better handled. This contrasts with existing methods that only do part matching between a pair of frames. We evaluate PMT and compare it with 10 popular state-of-the-art methods on challenging benchmarks. Experimental results show that PMT consistently outperforms these existing trackers.
|
#1687 - Finding Matches in a Haystack: A Max-Pooling Strategy for Graph Matching in the Presence of Outliers [pdf]
Minsu Cho, Jian Sun, Jean Ponce |
Abstract: A major challenge in real-world matching problems is to tolerate the numerous outliers arising in typical visual tasks. Variations in object appearance, shape, and structure within the same object class make it hard to distinguish inliers from outliers due to clutter. In this paper, we propose a novel approach to graph matching, which is not only resilient to deformations but also remarkably tolerant to outliers. By adopting a max-pooling strategy within the graph matching framework, the proposed algorithm evaluates each candidate match using its most promising neighbors, and gradually propagates the corresponding scores to update the neighbors. As final output, it assigns a reliable score to each match together with its supporting neighbors, thus providing contextual information for further verification. We demonstrate the robustness and utility of our method with synthetic and real image experiments.
|
#1690 - Histograms of Pattern Sets for Image Classification and Object Recognition [pdf]
Winn Voravuthikunchai, Bruno Cremilleux, Frederic Jurie |
Abstract: This paper introduces a novel image representation capturing feature dependencies through the mining of meaningful combinations of visual features. This representation leads to a compact and discriminative encoding of images that can be used for image classification, object detection or object recognition. The method relies on (i) multiple random projections of the input space followed by local binarization of projected histograms encoded as sets of items, and (ii) the representation of images as Histograms of Pattern Sets (HoPS). The approach is validated on four publicly available datasets (Daimler Pedestrian Classification, Oxford Flowers Classification, KTH Texture Categorization, PASCAL VOC2007), allowing comparisons with many recent approaches. The proposed image representation reaches state-of-the-art performance on each of these datasets.
|
Abstract: We introduce a method that can register challenging specular and poorly textured 3D environments, on which previous approaches fail. We assume that a small set of reference images of the environment and a partial 3D model is already available. Like previous approaches, we register the input images by aligning them with one of the reference images using the 3D information. However, previous approaches typically rely on the pixel intensities for the alignment, which is prone to fail in presence of specularities or in absence of texture. A key component of our approach is an efficient novel local descriptor that we use to describe each image location. We show that we can rely on this descriptor in place of the intensity to significantly improve the alignment robustness at a minor increase of the computational cost, and we explain why our descriptor performs so well.
|
#1700 - Using a deformation field model for localizing faces and facial points under weak supervision [pdf]
Marco Pedersoli, Tinne Tuytelaars, Luc Van Gool |
Abstract: Face detection and facial point localization are interconnected tasks. Recently it has been shown that solving these two tasks jointly with a mixture of trees of parts (MTP) leads to state-of-the-art results. However, MTP and most of the methods for facial point localization proposed so far require a complete annotation of the training data at facial point level. This is used to predefine the structure of the trees and to place the parts correctly. In this work we extend the mixtures from trees to more general loopy graphs. In this way we can learn in a weakly supervised manner (using only the face location and orientation) a powerful deformable detector that implicitly aligns its parts to the detected face in the image. By attaching some reference points to the correct parts of our detector we can then localize the facial points. In terms of detection our method clearly outperforms the state-of-the-art even if competing with methods that use facial point annotations during training. Additionally, without any facial point annotation at the level of individual training images, our method can localize facial points with an accuracy similar to fully supervised approaches.
|
#1701 - Seeing What You're Told: Sentence-Guided Activity Recognition In Video [pdf]
Siddharth Narayanaswamy, Andrei Barbu, Jeffrey Siskind |
Abstract: We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium, not only for top-down and bottom-up integration, but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole sentential descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by leveraging the framework in different manners.
|
#1709 - Decomposable Nonlocal Tensor Dictionary Learning for Multispectral Image Denoising [pdf]
Yi Peng, Deyu Meng, Zongben Xu, Biao Zhang, Chenqiang Gao, Yang Yi |
Abstract: Compared to the conventional RGB or gray-scale image, the multispectral image (MSI) helps to deliver a more faithful representation of real scenes, and greatly enhances the performance of many computer vision tasks. In practice, however, an MSI is always corrupted by various kinds of noise. In this paper we propose an effective MSI denoising approach by combinatorially considering two intrinsic characteristics underlying an MSI: the nonlocal similarity over space and the global correlation across spectrum. Specifically, through explicitly considering spatial self-similarity of an MSI we construct a nonlocal tensor dictionary learning model with a group-block-sparsity constraint, which makes similar full-band patches (FBPs) share the same atoms from the spatial and spectral dictionaries. Furthermore, through exploiting spectral correlation of an MSI and assuming over-redundancy of dictionaries, the constrained nonlocal MSI dictionary learning model can be decomposed into a series of unconstrained low-rank tensor approximation problems, and can be readily solved by off-the-shelf higher order statistics. Experimental results show that our method outperforms all state-of-the-art MSI denoising methods under comprehensive quantitative performance measures.
|
Similar papers:
[rank all papers by similarity to this]
|
#1711 - Class Specific 3D Object Shape Priors Using Surface Normals [pdf]
Christian Häne, Nikolay Savinov, Marc Pollefeys |
Abstract: Dense 3D reconstruction of real world objects containing textureless, reflective and specular parts is a challenging task. Using general smoothness priors such as surface area regularization can lead to defects in form of disconnected parts or unwanted indentations. We argue that this problem can be solved by exploiting the object class specific local surface orientations, e.g. a car is always close to horizontal in the roof area. Therefore, we formulate an object class specific shape prior in form of spatially varying anisotropic smoothness terms. The parameters of the shape prior are extracted from training data. We detail how our shape prior formulation directly fits into recently proposed volumetric multi-label reconstruction approaches. This allows a segmentation between the object and its supporting ground. In our experimental evaluation we show reconstructions using our trained shape prior on several challenging datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
#1729 - Human Shape and Pose Tracking Using Keyframes [pdf]
Chun-Hao Huang, Edmond Boyer, Slobodan Ilic |
Abstract: This paper considers human motion tracking with multi-view set-ups and investigates a robust strategy that learns online key poses to drive a shape tracking method. The interest arises with realistic dynamic scenes where occlusions or segmentation errors occur. The resulting corrupted observations present missing data and outliers that deteriorate tracking results. In order to cope with such data we propose to use key poses of the tracked person as multiple reference models. In contrast to many existing approaches that rely on a single reference model, multiple templates represent a larger variability of the human poses. They can therefore provide better initial hypotheses when tracking with ambiguous and noisy data. Our approach identifies these reference models online, during tracking, as distinctive keyframes. The most suitable one is then chosen as the reference model for the tracking initialization at each frame. In addition, taking advantage of the proximity between successive frames, an efficient outlier handling technique is proposed to prevent the model from associating to irrelevant outliers. The two strategies are successfully tested with a surface deformation framework that estimates both the pose and the shape. Evaluations and comparisons on existing datasets also demonstrate the benefit of the approach with respect to the state of the art.
|
Similar papers:
[rank all papers by similarity to this]
|
#1737 - Speeding Up Tracking by Ignoring Features [pdf]
Lu Zhang, Hamdi Dibeklioglu, Laurens van der Maaten |
Abstract: Most modern object trackers combine a motion prior with sliding-window detection, using binary classifiers that predict the presence of the target object based on histogram features. Although the accuracy of such trackers is generally very good, they are often impractical because of their high computational requirements. To resolve this problem, the paper presents a new approach that limits the computational costs of trackers by ignoring features in image regions that --after inspecting a few features-- are unlikely to contain the target object. To this end, we derive an upper bound on the probability that a location is most likely to contain the target object, and we ignore (features in) locations for which this upper bound is small. We demonstrate the effectiveness of our new approach in experiments with model-free and model-based trackers that use linear models in combination with HOG features. The results of our experiments demonstrate that our approach allows us to reduce the average number of inspected features by up to 90% without affecting the accuracy of the tracker.
|
Similar papers:
[rank all papers by similarity to this]
|
#1738 - A Bayesian Framework For the Local Configuration of Retinal Junctions [pdf]
Touseef Qureshi, Andrew Hunter, Bashir Al-Diri |
Abstract: Retinal images contain forests of mutually intersecting and overlapping venous and arterial vascular trees. The geometry of these trees shows adaptation to vascular diseases including diabetes, stroke and hypertension. Segmentation of the retinal vascular network is complicated by inconsistent vessel contrast, fuzzy edges, variable image quality, media opacities, complex intersections and overlaps. This paper presents a Bayesian approach to resolving the configuration of vascular junctions to correctly construct the vascular trees. A probabilistic model of vascular joints (terminals, bridges and bifurcations) and their configuration in junctions is built, and Maximum A Posteriori (MAP) estimation is used to select the most likely configurations. The model is built using a reference set of 4208 joints extracted from the DRIVE public domain vascular segmentation dataset, and evaluated on 4361 joints from the DRIVE test set, demonstrating an accuracy of 95.2%.
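The MAP selection step mentioned in the abstract can be illustrated with a minimal sketch. The candidate labels and all probabilities below are invented for illustration only; they are not taken from the paper or from the DRIVE dataset, and `map_estimate` is a hypothetical helper name.

```python
def map_estimate(prior, likelihood):
    """Return the candidate c that maximises prior[c] * likelihood[c]."""
    return max(prior, key=lambda c: prior[c] * likelihood[c])

# Made-up numbers: how common each candidate is (prior) and how well
# each one explains the observed local geometry (likelihood).
prior = {"bridge": 0.6, "bifurcation": 0.3, "terminal": 0.1}
likelihood = {"bridge": 0.2, "bifurcation": 0.5, "terminal": 0.9}

best = map_estimate(prior, likelihood)
```

With these toy values the likelihood alone would favour "terminal", but the prior shifts the MAP choice to "bifurcation", which is the kind of effect a learned prior over configurations is meant to provide.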
|
Similar papers:
[rank all papers by similarity to this]
|
#1741 - Fast, Approximate Piecewise-Planar Modeling Based on Sparse Structure-from-Motion and Dense Superpixels [pdf]
Andras Bodis-Szomoru, Hayko Riemenschneider, Luc Van Gool |
Abstract: We present a novel approach for producing dense reconstructions from multiple images and from the underlying sparse Structure-from-Motion (SfM) data in an efficient way. State-of-the-art Multi-View Stereo (MVS) algorithms deliver dense depth maps and/or complex meshes with very high detail, and redundancy over regular surfaces. In turn, our interest lies in a light-weight method that is applicable in large-scale, primarily in the field of urban scene reconstruction from ground-based images. To overcome the problem of sparsity, we assume piecewise planarity of man-made scenes and exploit both visibility information and a fast over-segmentation of the images. The reconstruction problem is an energy formulation of a multi-view plane labelling problem, which we solve jointly over the superpixels while avoiding expensive photoconsistency computations. The resulting planar primitives, augmented by detailed superpixel boundaries are computed in about 10 s per image.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We extend the classical linear discriminant analysis (LDA) technique to linear ranking analysis (LRA), by considering the ranking order of class centroids on the projected subspace. Under the constraint on the ranking order of the classes, two criteria are proposed: 1) minimization of the classification error with the assumption that each class is homogeneous Gaussian distributed; 2) maximization of the sum (average) of the $k$ minimum distances of all neighboring-class (centroid) pairs. Both criteria can be efficiently solved by convex optimization for a one-dimensional subspace. A greedy algorithm is applied to extend the results to the multi-dimensional subspace. Experimental results show that 1) LRA with both criteria achieves state-of-the-art performance on the tasks of ranking learning and zero-shot learning; and 2) the maximum margin criterion provides a discriminative subspace selection method, which can significantly remedy the class separation problem in comparison with several representative extensions of LDA.
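For orientation, the classical one-dimensional LDA criterion that this abstract takes as its starting point can be written as follows. This is the textbook Fisher criterion, not the LRA objective itself; the symbols ($w$, $S_B$, $S_W$, $\mu_c$, $n_c$) are notation chosen here.

```latex
w^{\ast} = \arg\max_{w} \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_B = \sum_{c} n_c\, (\mu_c - \mu)(\mu_c - \mu)^{\top},
\qquad
S_W = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top}
```

Here $\mu_c$ and $n_c$ are the mean and size of class $c$, and $\mu$ is the overall mean. The criterion separates projected centroids but places no constraint on their order along $w$, which is the aspect the abstract says LRA adds.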
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: In this work we reconsider labeling problems with (virtually) continuous state spaces, which are of relevance in low level computer vision. In order to cope with such huge state spaces multi-scale methods have been proposed to approximately solve such labeling tasks. Although performing well in many cases, these methods do usually not come with any guarantees on the returned solution. A general and principled approach to solve labeling problems is based on the well-known linear programming relaxation, which appears to be prohibitive for large state spaces at the first glance. We demonstrate that a coarse-to-fine exploration strategy in the label space is able to optimize the LP relaxation for non-trivial problem instances with reasonable run-times and moderate memory requirements.
|
Similar papers:
[rank all papers by similarity to this]
|
#1756 - Structured Output Random Forests for Accurate Object Detection [pdf]
Samuel Schulter, Christian Leistner, Peter Roth, Horst Bischof |
Abstract: In this paper, we present a novel object detection approach that is capable of regressing the aspect ratio of objects, which results in accurately predicted bounding boxes having high overlap with the ground truth. In contrast to most recent works, we employ a Random Forest for learning a template based model but exploit the nature of this learning algorithm to predict arbitrary output spaces. In this way, we can simultaneously predict the object probability of a window in a sliding window approach, as well as regressing its aspect ratio with a single model. Furthermore, we also exploit the additional information of the aspect ratio during the training of the structured output Random Forest, resulting in better detection models. Our experiments demonstrate that (i) our approach gives comparable or even better results on standard detection benchmarks, (ii) the structured output prediction of the Random Forest delivers more accurate bounding boxes in terms of overlap with ground truth, especially when tightening the evaluation criterion and (iii) the detector itself becomes better by only including the structured output information during training.
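The abstract does not say how overlap with the ground truth is measured. A minimal sketch, assuming the intersection-over-union (IoU) ratio that is common in detection benchmarks; the function name `iou` and the `(x1, y1, x2, y2)` box layout are illustrative choices, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Under this assumption, a square prediction `(0, 0, 3, 3)` against a wide ground-truth box `(0, 0, 4, 2)` scores 6/11, which passes a 0.5 threshold but fails a stricter 0.7 one. That is the regime in which a regressed aspect ratio would be expected to matter.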
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We propose a purely geometric correspondence-free approach to urban geo-localization using 3D point-ray features extracted from the Digital Elevation Map of an urban environment. We derive a novel formulation for estimating the camera pose locus using 3D-to-2D correspondence of a single point and a single direction alone. We show how this allows us to compute putative correspondences between building corners in the DEM and the query image by exhaustively combining pairs of point-ray features. Then, we employ the two-point method to estimate both the camera pose and compute correspondences between buildings in the DEM and the query image. Finally, we show that the computed camera poses can be efficiently ranked by a simple skyline projection step using building edges from the DEM. Our experimental evaluation illustrates the promise of a purely geometric approach to the urban geo-localization problem.
|
Similar papers:
[rank all papers by similarity to this]
|
#1765 - Semi-supervised Spectral Clustering for Image Set Classification [pdf]
Arif Mahmood, Ajmal Mian, Robyn Owens |
Abstract: We present an image set classification algorithm based on unsupervised clustering of labeled training and unlabeled test data where labels are only used in the stopping criterion. The probability distribution of each class over the set of clusters is used to define a true set based similarity measure. To this end, we propose an iterative sparse spectral clustering algorithm. In each iteration, the proximity matrix is efficiently recomputed to better represent the local subspace structure. Initial clusters capture the global data structure and finer clusters at the later stages capture the subtle class differences not visible at the global scale. Image sets are compactly represented with multiple Grassmannian manifolds which are subsequently embedded in Euclidean space with the proposed spectral clustering algorithm. We also propose an efficient eigenvector solver which not only reduces the computational cost of spectral clustering many-fold but also improves the clustering quality and final classification results. Experiments on five standard datasets and comparison with seven existing techniques show the efficacy of our algorithm.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: In this work we address the problem of 3D human pose estimation from a single image. It constitutes an extremely hard problem due to inherent depth ambiguities corresponding to bits of information that are not observable by typical generative models. We address this by introducing posebits. Posebits are units of information that resolve typical ambiguities in monocular imagery. They are boolean geometric relationships between body parts designed to provide qualitative information about poses (e.g., left-leg in front of right-leg or hands close to each other). We infer posebits bottom-up from image features using structural SVMs. Then, pose samples consistent with the posebits are sampled and evaluated against the image in a top-down fashion. Using posebits as a mid-layer representation for inference has several other advantages: First, pose estimation becomes a much less ambiguous task conditioned on posebits. Second, annotation simplifies to answering a small set of simple yes/no questions, and 3D MoCap data can be easily clustered in semantically similar classes. This allows for fast collection of large datasets in contrast to manual annotation of 3D poses from images. There exist several new potential applications of posebits; here we show how they can be used successfully to estimate pose from a single image and for semantic image retrieval.
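To make the two example relationships from the abstract concrete, here is a minimal sketch of how such boolean attributes could be read off a known 3D pose (for instance MoCap data). The joint names, the depth convention, and the 0.2 m threshold are assumptions of this sketch, not definitions from the paper.

```python
import math

def posebits(joints, near=0.2):
    """Compute two example boolean pose attributes from 3D joint positions.

    joints maps a joint name to (x, y, z) in metres, with z increasing
    away from the camera (an assumption of this sketch).
    """
    l_foot, r_foot = joints["left_foot"], joints["right_foot"]
    l_hand, r_hand = joints["left_hand"], joints["right_hand"]
    return {
        # Smaller depth means closer to the camera, i.e. "in front".
        "left_leg_in_front_of_right_leg": l_foot[2] < r_foot[2],
        # Hands count as close when their distance is below the threshold.
        "hands_close": math.dist(l_hand, r_hand) < near,
    }

pose = {
    "left_foot": (0.1, 0.0, 2.0),
    "right_foot": (-0.1, 0.0, 2.3),
    "left_hand": (0.05, 1.0, 2.1),
    "right_hand": (-0.05, 1.0, 2.1),
}
bits = posebits(pose)
```

Each output is a yes/no answer, which is why the abstract can describe annotation as a small set of simple questions rather than full 3D pose labelling.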
|
Similar papers:
[rank all papers by similarity to this]
|
#1768 - On Projective Reconstruction In Arbitrary Dimensions [pdf]
Behrooz Nasihatkon, Richard Hartley, Jochen Trumpf |
Abstract: We study the theory of projective reconstruction for multiple projections from an arbitrary dimensional projective space into lower-dimensional spaces. This problem is important due to its applications in the analysis of dynamical scenes. The current theory, due to Hartley and Schaffalitzky, is based on the Grassmann tensor, generalizing the ideas of fundamental matrix, trifocal tensor and quadrifocal tensor in the well-studied case of 3D to 2D projections. We present a theory whose point of departure is the projective equations rather than the Grassmann tensor. This is a better fit for the analysis of approaches such as bundle adjustment and projective factorization which seek to directly solve the projective equations. In a first step, we prove that there is a unique Grassmann tensor corresponding to each set of image points, a question that remained open in the work of Hartley and Schaffalitzky. Then, we prove that projective equivalence follows from the set of projective equations, provided that the depths are all nonzero. Finally, we demonstrate possible wrong solutions to the projective factorization problem, where not all the projective depths are restricted to be nonzero.
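The projective equations that the abstract refers to can be written in the usual form below. The notation ($P_i$, $X_j$, $x_{ij}$, $\lambda_{ij}$, dimensions $r$ and $s_i$) is chosen here for illustration and may differ from the paper's.

```latex
\lambda_{ij}\, x_{ij} = P_i\, X_j,
\qquad
X_j \in \mathbb{R}^{r},\quad
P_i \in \mathbb{R}^{s_i \times r},\quad
x_{ij} \in \mathbb{R}^{s_i},\quad
\lambda_{ij} \in \mathbb{R}
```

The familiar 3D to 2D case corresponds to $r = 4$ and $s_i = 3$ in homogeneous coordinates. The scalars $\lambda_{ij}$ are the projective depths, and the abstract's condition is that they are all nonzero.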
|
Similar papers:
[rank all papers by similarity to this]
|
#1770 - An Online Learned Elementary Grouping Model for Multi-target Tracking [pdf]
Xiaojing Chen, Zhen Qin, Le An, Bir Bhanu |
Abstract: We introduce an online approach to learn possible elementary groups (groups that contain only two targets) for inferring high level context that can be used to improve multi-target tracking in a data-association based framework. Unlike most existing association-based tracking approaches that use only low level information (e.g., time, appearance, and motion) to build the affinity model and consider each target as an independent agent, we online learn social grouping behavior to provide additional information for producing more robust tracklets affinities. Social grouping behavior of pairwise targets is first learned from confident tracklets and encoded in a disjoint grouping graph. The grouping graph is further completed with the help of group tracking. The proposed method is efficient and can be easily integrated into any basic affinity model. We evaluate our approach on two public datasets, and show significant improvements compared with the state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#1771 - Efficient Structured Parsing of Facades Using Dynamic Programming [pdf]
Andrea Cohen, Alexander Schwing, Marc Pollefeys |
Abstract: We propose a sequential optimization technique for segmenting a rectified image of a facade into semantic categories. Our method retrieves a parsing which respects common architectural constraints and also returns a certificate for global optimality. Contrasting the suggested method, the considered facade labeling problem is typically tackled as a classification task or as grammar parsing. Both approaches are not capable of fully exploiting the regularity of the problem. Therefore, our technique very significantly improves the accuracy compared to the state-of-the-art while being an order of magnitude faster. In addition, in over 90% of the test images we obtain a certificate for optimality.
|
Similar papers:
[rank all papers by similarity to this]
|
#1775 - Simultaneous Twin Kernel Learning for Structured Prediction [pdf]
Chetan Tonde, Ahmed Elgammal |
Abstract: Many problems in computer vision, including human pose estimation, image segmentation, handwritten digit reconstruction and others, can be posed as structured prediction problems. Kernel methods for structured prediction like structured support vector machines (SVMStruct), twin Gaussian processes (TGPs), structured Gaussian processes (GPStruct), vector valued RKHSs and many others, offer a powerful way of solving these problems. However, for all of these kernel-based approaches, poor choice of the kernel often results in reduced performance. Learning the kernel function has received significant interest, but most of the techniques are computationally expensive, restrictive in terms of the kernel they can learn, or they focus only on learning kernels on inputs (one-way). In this work, we propose a novel technique for learning the kernels on both inputs and outputs, simultaneously. We call this approach Twin Kernel Learning (TKL). This technique is general in the sense that it can learn arbitrary kernels, and as a special case includes 'one-way' kernel learning. We formulate this problem specifically for the case of structured prediction using Twin Gaussian Processes, where we learn the covariance functions of both inputs and outputs and compare it with the baseline results where no kernel learning is performed. We demonstrate through our experimental evaluation on several synthetic and real world datasets that we can consistently improve the performance of our algorithms with a le
|
Similar papers:
[rank all papers by similarity to this]
|
#1776 - Fast and Exact: Shape Segmentation Using ADMM and Structured Prediction [pdf]
Haithem Boussaid, Iasonas Kokkinos |
Abstract: In this work we address the multi-label shape segmentation problem in the energy minimization setting, by using a graphical model for an ensemble of shapes where each shape is represented as a cyclic graph and shape consistency is enforced by additional inter-shape connections. Our contributions are two-fold: firstly, we build on Dual Decomposition (DD) to efficiently solve the resulting optimization problems. We decompose the model's graph into a set of open, chain-structured, graphs that can be rapidly optimized using Dynamic Programming/Generalized Distance Transforms; we achieve rapid convergence by using the Alternating Direction Method of Multipliers (ADMM) and show that for graphs with spatial variables, as is the case for shape models, ADMM yields substantially faster convergence than plain DD-based methods. Secondly, we employ structured prediction to encompass loss functions that better match the medical image segmentation performance criteria: using the commonly employed mean contour distance (MCD) as a structured loss during training, we obtain a clear performance improvement. We obtain systematic improvements over the current state-of-the-art in a large X-Ray image segmentation benchmark, demonstrating the merit of exact and efficient inference with sophisticated, structured models.
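As background for the ADMM step, the generic textbook iteration (scaled form) for a problem split into two blocks with a consensus constraint is shown below. This is the standard formulation, not the paper's specific decomposition into chain-structured subgraphs; $f$, $g$, $\rho$ and $u$ are generic symbols chosen here.

```latex
\min_{x,\,z}\; f(x) + g(z) \quad \text{s.t.}\quad x = z
\\[4pt]
x^{k+1} = \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert x - z^{k} + u^{k} \rVert_2^2
\\
z^{k+1} = \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert x^{k+1} - z + u^{k} \rVert_2^2
\\
u^{k+1} = u^{k} + x^{k+1} - z^{k+1}
```

Compared with a plain dual (subgradient) update, the quadratic penalty with weight $\rho$ pulls the subproblem solutions toward agreement at every iteration, which is the usual explanation for the faster convergence the abstract reports.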
|
Similar papers:
[rank all papers by similarity to this]
|
#1781 - Word Channel Based Multiscale Pedestrian Detection Without Image Resizing and Using Only One Classifier [pdf]
Arthur Costea, Sergiu Nedevschi |
Abstract: Most pedestrian detection approaches that achieve high accuracy and precision rate and can be used for real-time applications are based on histograms of gradient orientations. Multiscale detection is attained by resizing the image several times and by recomputing the image features or using multiple classifiers for different scales. In this paper we present a pedestrian detection approach that uses the same classifier for all pedestrian scales based on image features computed for a single scale. We go beyond the low level pixel-wise gradient orientation bins and use higher level visual words from a trained dictionary. Boosting is used to learn classification features from integral visual word channels. The proposed approach is evaluated on multiple datasets and achieves outstanding results on the INRIA and Caltech-USA benchmarks, outperforming current state of the art methods. By using a GPU implementation we achieve a classification rate of over 10 million bounding boxes per second and a 16 FPS rate for multiscale detection in a 640×480 image.
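The "integral visual word channels" in the abstract presumably build on integral images (summed-area tables), which give the sum over any rectangle in constant time. A minimal sketch under that assumption, treating one channel as a 2D grid in which each cell holds a count or indicator for a single visual word; the function names are illustrative.

```python
def integral_image(channel):
    """Summed-area table with an extra leading zero row and column."""
    h, w = len(channel), len(channel[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += channel[y][x]
            # Sum of everything above plus the running sum of this row.
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table

def box_sum(table, x0, y0, x1, y1):
    """Sum of channel[y][x] for x0 <= x < x1 and y0 <= y < y1."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]

channel = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
table = integral_image(channel)
```

Because `box_sum` needs only four lookups regardless of rectangle size, features for windows of different scales can be read from the same table, which fits the abstract's claim of avoiding image resizing.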
|
Similar papers:
[rank all papers by similarity to this]
|
#1786 - Spectral Clustering with Jensen-type kernels and their multi-point extensions [pdf]
Debarghya Ghoshdastidar, Ambedkar Dukkipati, Ajay Adsul, Aparna Vijayan |
Abstract: Motivated by multi-distribution divergences, which originate in information theory, we propose a notion of 'multi-point' kernels, and study their applications. We study a class of kernels based on Jensen type divergences and show that these can be extended to measure similarity among multiple points. We study tensor flattening methods and develop a multi-point (kernel) spectral clustering (MSC) method. We further emphasize a special case of the proposed kernels, which is a multi-point extension of the linear (dot-product) kernel, and show the existence of a cubic time tensor flattening algorithm in this case. Finally, we illustrate the usefulness of our contributions using standard data sets and image segmentation tasks.
|
Similar papers:
[rank all papers by similarity to this]
|
#1804 - Evolutionary Quasi-random Search for Hand Articulations Tracking [pdf]
Iason Oikonomidis, Manolis Lourakis, Antonis Argyros |
Abstract: We present a new method for tracking the 3D position, global orientation and full articulation of human hands. Inspired by recent advances in model-based, hypothesize-and-test methods, the high-dimensional parameter space of hand configurations is explored with a novel evolutionary optimization technique. The proposed method capitalizes on the fact that the quasi-random samples of the Sobol sequence have low discrepancy and exhibit a more uniform coverage of the sampled space compared to random samples obtained from the uniform distribution. The method has been tested for the problems of tracking the articulation of a single hand (27D parameter space) and two hands (54D space). Extensive experiments have been carried out with synthetic and real data, in comparison with state of the art methods. The quantitative evaluation shows that the new approach achieves a speed-up of four (single hand tracking) and eight (two hands tracking) without compromising tracking accuracy. Interestingly, the proposed method is preferable compared to the state of the art either in the case of limited computational resources or in the case of more complex (i.e., higher dimensional) problems, a fact that raises considerably the applicability of the method in a number of application domains.
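The coverage property mentioned in the abstract can be illustrated in one dimension. The sketch below uses the base-2 van der Corput sequence, which coincides with the first coordinate of the standard (unscrambled) Sobol construction; it is a stand-in for illustration, not the paper's 27D or 54D sampler, and the bin-counting measure is a simple proxy for discrepancy.

```python
import random

def van_der_corput(i):
    """Radical inverse of the non-negative integer i in base 2."""
    value, scale = 0.0, 0.5
    while i:
        value += scale * (i & 1)
        i >>= 1
        scale *= 0.5
    return value

def occupied_bins(points, bins):
    """Number of equal-width bins of [0, 1) that contain at least one point."""
    return len({int(p * bins) for p in points})

n = 64
quasi = [van_der_corput(i) for i in range(n)]
rng = random.Random(0)  # fixed seed, only to make the comparison repeatable
pseudo = [rng.random() for _ in range(n)]
print(occupied_bins(quasi, n), occupied_bins(pseudo, n))
```

The 64 quasi-random points land in 64 distinct bins by construction, whereas 64 independent uniform draws occupy only about 40 bins on average, leaving gaps and clusters. That gap in coverage is the effect an evolutionary search can exploit when each sample is an expensive hypothesis evaluation.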
|
Similar papers:
[rank all papers by similarity to this]
|
#1810 - Seeing the Arrow of Time [pdf]
Lyndsey Pickup, Zheng Pan, Donglai Wei, Yichang Shih, Andrew Zisserman, Bill Freeman, Bernhard Schoelkopf |
Abstract: We explore whether we can observe Time's Arrow in a temporal sequence--is it possible to tell whether a video is running forwards or backwards? We investigate this somewhat philosophical question using computer vision and machine learning techniques. We explore three methods by which we might detect Time's Arrow in video sequences, based on distinct ways in which motion in video sequences might be asymmetric in time. We demonstrate good video forwards/backwards classification results on a selection of YouTube video clips, and on natively-captured sequences (with no temporally-dependent video compression). The motions our models have learned help discriminate forwards from backwards time.
|
Similar papers:
[rank all papers by similarity to this]
|
#1813 - Actionness Ranking with Lattice Conditional Ordinal Random Fields [pdf]
Wei Chen, Caiming Xiong, Jason Corso |
Abstract: Action analysis in image and video has been attracting more and more attention in the computer vision area. Recognizing specific actions in video clips has been the main focus. We move in a new, more general direction in this paper and ask the critical fundamental question: what is action, how is action different from motion, and in a given image or video where is the action? We study the philosophical and visual characteristics of action, which lead us to define actionness: intentional bodily movement of biological agents (people, animals). To solve the general problem, we propose the lattice conditional ordinal random field model that incorporates local evidence as well as neighboring order agreement. We implement the new model in the continuous domain and apply it to scoring actionness in both image and video datasets. Our experiments demonstrate not only that our new random field model can outperform the popular ranking SVM but also that action is indeed distinct from motion.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Arguably, Constrained Local Models (CLMs) are one of the most prominent approaches for fitting deformable models with impressive results being recently reported for both controlled lab and unconstrained settings. Fitting in most CLM methods is typically formulated as a two-step process during which local templates are first correlated with the image to yield a filter response for each landmark and then shape optimization is performed over these filter responses. We argue that such a fitting strategy may be problematic because optimization of shape and appearance is decoupled. To address this limitation, in this paper, we propose a new model/fitting strategy which results in a joint translational motion model for the model parts so that a cost function of shape and appearance is jointly minimized using Gauss-Newton optimization. We additionally show how significant computational reductions can be achieved by building a full model during training but then efficiently optimizing the proposed cost function on a sparse grid during fitting. This results in complexity that could possibly allow a close to real-time implementation. We coin the proposed formulation Gauss-Newton CLM (GN-CLM). Finally, we compare its performance against another recently proposed state-of-the-art CLM method and show that the proposed GN-CLM outperforms it by a large margin.
|
Similar papers:
[rank all papers by similarity to this]
|
#1827 - Incremental Face Alignment in the Wild [pdf]
Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, Maja Pantic |
Abstract: The development of facial databases with an abundance of annotated facial data captured under unconstrained 'in-the-wild' conditions has made discriminative facial deformable models the de facto choice for generic facial landmark localization. Even though very good performance for facial landmark localization has been shown by many recently proposed discriminative techniques, when it comes to applications that require excellent accuracy, such as facial behaviour analysis and facial motion capture, semi-automatic person-specific or even tedious manual tracking is still the preferred choice. One way to construct a person-specific model automatically is through incremental updating of the generic model. This paper deals with the problem of updating a discriminative facial deformable model, a problem that has not been thoroughly studied in the literature. In particular, we study for the first time, to the best of our knowledge, the strategies to update a discriminative model that is trained by a cascade of regressors. We propose very efficient strategies to update the model and we show that it is possible to automatically construct robust discriminative person and imaging condition specific models 'in-the-wild' that outperform state-of-the-art generic face alignment strategies.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: Curse of dimensionality is a practical and challenging problem in image categorization, especially in cases with a large number of classes. Multi-class classification encounters severe computational and storage problems when dealing with these large scale tasks. In this paper, we propose hierarchical feature hashing to effectively reduce dimensionality of parameter space without sacrificing classification accuracy, and at the same time exploit information in semantic taxonomy among categories. We provide detailed theoretical analysis on our proposed hashing method. Moreover, experimental results on object recognition and scene classification further demonstrate the effectiveness of hierarchical feature hashing.
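As background, the plain (non-hierarchical) feature hashing trick that the proposed method builds on can be sketched as follows. This is the generic technique, not the hierarchical variant described in the abstract; the function name, the use of MD5, and the way the sign bit is chosen are assumptions made for this illustration.

```python
import hashlib

def hash_features(features, dim):
    """Project a sparse {name: value} feature map into a vector of length dim."""
    out = [0.0] * dim
    for name, value in features.items():
        digest = int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)
        index = digest % dim
        # A separate hash bit picks the sign, so collisions tend to cancel
        # rather than accumulate.
        sign = 1.0 if (digest >> 64) & 1 else -1.0
        out[index] += sign * value
    return out

vec = hash_features({"sift_17": 1.0, "sift_204": 2.0, "hog_3": 3.0}, 8)
```

The parameter vector of a classifier trained on `vec` has `dim` entries no matter how many distinct feature names exist, which is the storage saving the abstract is after. How the paper shares or separates hashed parameters across the class taxonomy is not specified here.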
|
Similar papers:
[rank all papers by similarity to this]
|
#1830 - Geometric Generative Gaze Estimation (G3E) for Remote RGB-D Cameras [pdf]
Kenneth Funes Mora, Jean-Marc Odobez |
Abstract: We propose a head pose independent gaze estimation model for distant RGB-D cameras. It relies on a geometric understanding of the 3D gaze action and generation of eye images. By introducing a semantic segmentation of the eye region within a generative process, the model (i) avoids the critical feature tracking of geometrical approaches requiring high resolution images; (ii) decouples the person dependent geometry from the ambient conditions, allowing adaptation to different conditions without retraining. Priors in the generative framework are adequate for training from few samples. In addition, the model is capable of gaze extrapolation allowing for less restrictive training schemes. Comparisons with state of the art methods validate these properties which make our method highly valuable for addressing many diverse tasks in sociology, HRI and HCI.
|
Similar papers:
[rank all papers by similarity to this]
|
#1835 - Efficient feature extraction, encoding and classification for action recognition [pdf]
Vadim Kantorov, Ivan Laptev |
Abstract: Local video features provide state-of-the-art performance for action recognition. While the accuracy of action recognition has been continuously improved over the recent years, the low speed of feature extraction and subsequent recognition prevents current methods from scaling up to real-size problems. We address this issue and first develop highly efficient video features using motion information in video compression. We next explore feature encoding by Fisher vectors and demonstrate accurate action recognition using fast linear classifiers. Our method improves the speed of video feature extraction, feature encoding and action classification by two orders of magnitude at the cost of minor reduction in recognition accuracy. We validate our approach and compare it to the state of the art on three recent action recognition datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
#1837 - Tissue Classification via Multispectral Convolutional Sparse Coding [pdf]
Yin Zhou, Hang Chang, Kenneth Barner, Paul Spellman, Bahram Parvin |
Abstract: Image-based classification of tissue histology plays an important role in predicting clinical outcomes. However this task is very challenging due to the presence of large technical variations (e.g., fixation, staining) and biological heterogeneities (e.g., cell type, cell state). In the field of biomedical imaging, for the purposes of visualization and/or quantification, different stains are typically used for different targets of interest (e.g., cellular/subcellular events), which generates multispectrum data (images) through various types of microscopes and, as a result, provides the possibility of learning biological component-specific features by exploiting multispectral information. We propose a multispectral feature learning model that automatically learns a set of convolution filter banks from separate spectrums to efficiently discover the intrinsic tissue morphometric signatures, based on convolutional sparse coding (CSC). The learned feature representations are then aggregated through the spatial pyramid matching framework (SPM) and finally classified using a linear SVM. The proposed system has been evaluated using two large-scale tumor cohorts, collected from The Cancer Genome Atlas (TCGA). Experimental results show that the proposed model 1) outperforms systems utilizing sparse coding for unsupervised feature learning (e.g., PSDSPM [8]); 2) is competitive with systems built upon features with biological prior knowledge (e.g., SMLSPM [7]).
|
Similar papers:
[rank all papers by similarity to this]
|
#1840 - Nonparametric Part Transfer for Fine-grained Recognition [pdf]
Christoph Göring, Erik Rodner, Alexander Freytag, Joachim Denzler |
Abstract: In the following paper, we present an approach for fine-grained recognition based on a new part detection method. In particular, we propose a nonparametric label transfer technique which transfers part constellations from objects with similar global shapes. The possibility for transferring part annotations to unseen images allows for coping with a high degree of pose and view variations in scenarios where traditional detection models (such as deformable part models) fail. Our approach is especially valuable for fine-grained recognition scenarios where intraclass variations are extremely high, and precisely localized features need to be extracted. Furthermore, we show the importance of carefully designed visual extraction strategies, such as combination of complementary feature types and iterative image segmentation, and the resulting impact on the recognition performance. In experiments, our simple yet powerful approach achieves 35.9% and 57.8% accuracy on the CUB-2010 and 2011 bird datasets, which is the current best performance for these benchmarks.
|
Similar papers:
[rank all papers by similarity to this]
|
#1843 - On the quotient representation for the essential manifold [pdf]
Roberto Tron, Kostas Daniilidis |
Abstract: The essential matrix, which encodes the epipolar constraint between projected points in two views, is a cornerstone of modern computer vision. Previous works have proposed different characterizations of the space of essential matrices as a Riemannian manifold. However, these works either do not consider the symmetric role played by the two views, or do not fully take into account the geometric peculiarities of the epipolar constraint. We address these limitations and give a characterization as a quotient manifold which preserves the geometrical interpretation in terms of camera poses. While our main focus is on theoretical aspects, we include experiments in pose averaging, and show that the proposed formulation produces a meaningful distance between essential matrices.
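For readers unfamiliar with the epipolar constraint mentioned in this abstract, the following is a minimal sketch of the standard textbook relation E = [t]x R and x2^T E x1 = 0. It does not implement the quotient-manifold characterization proposed in the paper; the helper names (`essential_matrix`, `epipolar_residual`) and the example pose are illustrative assumptions.

```python
import math

def cross_matrix(t):
    """Skew-symmetric matrix [t]x such that [t]x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def rotation_y(angle):
    """Rotation about the y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def essential_matrix(R, t):
    """Essential matrix for a second camera with pose X2 = R X1 + t."""
    return matmul(cross_matrix(t), R)

def epipolar_residual(E, X, R, t):
    """Project a 3D point X (camera-1 frame) into both views and return x2^T E x1."""
    x1 = [X[0] / X[2], X[1] / X[2], 1.0]
    X2 = [a + b for a, b in zip(matvec(R, X), t)]
    x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]
    return sum(p * q for p, q in zip(x2, matvec(E, x1)))

# Hypothetical relative pose and scene point, chosen only for illustration.
R = rotation_y(0.1)
t = [1.0, 0.0, 0.2]
E = essential_matrix(R, t)
residual = epipolar_residual(E, [0.5, -0.3, 4.0], R, t)
```

For noise-free correspondences the residual vanishes up to floating-point error, which is the constraint that any distance between essential matrices must respect.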
|
Similar papers:
[rank all papers by similarity to this]
|
#1848 - Topic Modeling of Multimodal Data: an Autoregressive Approach [pdf]
Yin Zheng, Yu-Jin Zhang, Hugo Larochelle |
Abstract: Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to deal with multimodal data, such as in image annotation tasks. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for text document modeling. In this work, we show how to successfully apply and extend this model to multimodal data, such as simultaneous image classification and annotation. Specifically, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model and show how to employ SupDocNADE to learn a joint representation from image visual words, annotation words and class label information. We also describe how to leverage information about the spatial position of the visual words for SupDocNADE to achieve better performance in a simple, yet effective manner. We test our model on the LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA and a Spatial Matching Pyramid (SPM) approach.
|
Similar papers:
[rank all papers by similarity to this]
|
#1849 - What are you talking about? Text-to-Image Co-reference [pdf]
Chen Kong, Sanja Fidler, Mohit Bansal, Dahua Lin, Raquel Urtasun |
Abstract: In this paper we exploit complex sentential descriptions of RGB-D scenes in order to improve 3D object detection as well as to determine which particular object each noun/pronoun is referring to in the image. Towards this goal, we developed a structure prediction model that is able to parse both the image in terms of 3D object cuboids as well as complex sentences describing the visual content. We demonstrate the effectiveness of our approach in the challenging NYU-RGBD, which we enrich with complex descriptions, and show that our approach can improve 3D detection as well as scene classification, and is able to estimate reliably the text-image alignment problem. Furthermore, by employing the visual information, our approach is able to beat the Stanford parser in estimating co-references.
|
Similar papers:
[rank all papers by similarity to this]
|
#1851 - Curvilinear Structure Tracking by Low Rank Tensor Approximation with Model Propagation [pdf]
Erkang Cheng, Yu Pang, Ying Zhu, Haibin Ling |
Abstract: Robust tracking of deformable objects such as catheters or vascular structures in X-ray images is an important technique used in image-guided medical interventions for effective motion compensation and dynamic multi-modality image fusion. Tracking of such anatomical structures and devices is very challenging due to large appearance changes, low visibility in X-ray images and the deformable nature of the underlying motion field as a result of complex 3D anatomical movements projected into 2D images. To address these issues, we propose a new deformable tracking method using a tensor-based algorithm with model propagation. Specifically, the deformable tracking is formulated as a multi-dimensional assignment problem which is solved by rank-1 l1 tensor approximation. The model prior is propagated in the course of deformable tracking. Both the higher order information and the model prior provide powerful discriminative cues for reducing ambiguity arising from the complex background, and consequently improve the tracking robustness. To validate the proposed approach, we applied it to catheter and vascular structure tracking and tested it on X-ray fluoroscopic sequences obtained from 17 clinical cases. The results show, both quantitatively and qualitatively, that our approach achieves a mean tracking error of 1.4 pixels for vascular structures and 1.3 pixels for catheter tracking.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We study the problem of cross-population age estimation. Human aging is determined by the genes and influenced by many factors. Different populations, e.g., males and females, Caucasian and Asian, may age differently. Previous research has discovered the aging difference among different populations, and reported large errors in age estimation when crossing gender and/or ethnicity. In this paper we propose novel methods for cross-population age estimation with a good performance. The proposed methods are based on projecting the different aging patterns into a common space where the aging patterns can be correlated even though they come from different populations. The projections are also discriminative between age classes due to the integration of the classical discriminant analysis technique. Further, we study the amount of data needed in the target population to learn a cross-population age estimator. Finally, we study the feasibility of multi-source cross-population age estimation. Experiments are conducted on a large database of more than 21,000 face images selected from the MORPH. Our studies are valuable to significantly reduce the burden of training data collection for age estimation on a new population, utilizing existing aging patterns even from different populations.
|
Similar papers:
[rank all papers by similarity to this]
|
#1868 - Edge-aware Gradient Domain Optimization Framework for Image Filtering by Local Propagation [pdf]
Miao Hua, Xiaohui Bie, Wencheng Wang |
Abstract: Gradient domain methods are popular for image processing; however, these methods, even the edge-preserving ones, cannot preserve edges well in some cases. In this paper, we present edge-aware constraints to better preserve edges for general gradient domain image filtering and theoretically analyse why those constraints are edge-aware. Our edge-aware constraints are easy to implement, fast to compute and can be seamlessly integrated into the general gradient domain optimization framework. The new gradient domain optimization framework can better preserve edges while maintaining similar image filtering effects as the original image filters. We also demonstrate the strength of our edge-aware constraints on various applications such as image smoothing, image colorization and Poisson image cloning.
|
Similar papers:
[rank all papers by similarity to this]
|
#1869 - Visual Semantic Search: Retrieving Videos via Complex Textual Queries [pdf]
Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun |
Abstract: In this paper, we tackle the problem of semantic retrieval of videos from complex queries. Towards this goal we first parse the descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed specifically for semantic search in the context of autonomous driving. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
|
Similar papers:
[rank all papers by similarity to this]
|
#1874 - Unsupervised Trajectory Modelling using Temporal Information via Minimal Paths [pdf]
Brais Cancela, Alberto Iglesias, Marcos Ortega, Manuel Penedo |
Abstract: This paper presents a novel methodology for modelling pedestrian trajectories over a scene, based on the hypothesis that, when people try to reach a destination, they use the path that takes the least time, taking into account environmental information like the type of terrain or what other people did before. Thus, a minimal path approach can be used to model human trajectory behaviour. We develop a modified Fast Marching Method that allows us to include both velocity and orientation in the Front Propagation Approach, without increasing its computational complexity. Combining all the information, we create a time surface that shows the time a target needs to reach any given position in the scene. We also create different metrics in order to compare the time surface against the real behaviour. Experimental results over a public dataset support the initial hypothesis.
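The idea of a time surface can be illustrated with a small sketch. Note that this is not the modified Fast Marching Method from the paper: it swaps in plain Dijkstra on a 4-connected grid, which only approximates front propagation and ignores orientation. The function name `time_surface`, the speed grid, and the step-cost rule (unit distance divided by the mean speed of the two cells) are all assumptions made for illustration.

```python
import heapq

def time_surface(speed, source):
    """Arrival time from `source` to every cell of a grid of walking speeds.

    speed[r][c] > 0 is the local speed; speed 0 marks an impassable cell.
    Dijkstra stand-in for Fast Marching (an assumption, not the paper's method).
    """
    rows, cols = len(speed), len(speed[0])
    arrival = [[float("inf")] * cols for _ in range(rows)]
    sr, sc = source
    arrival[sr][sc] = 0.0
    heap = [(0.0, sr, sc)]
    while heap:
        t, r, c = heapq.heappop(heap)
        if t > arrival[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and speed[nr][nc] > 0:
                # Time for one unit step at the mean speed of the two cells.
                step = 2.0 / (speed[r][c] + speed[nr][nc])
                if t + step < arrival[nr][nc]:
                    arrival[nr][nc] = t + step
                    heapq.heappush(heap, (t + step, nr, nc))
    return arrival

# Hypothetical scenes: open ground, and a wall (speed 0) with a gap at the bottom.
open_ground = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
walled = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
t_open = time_surface(open_ground, (0, 0))
t_wall = time_surface(walled, (0, 0))
```

In the walled scene the arrival time at the top-right cell grows because the fastest route detours through the gap, which is the kind of behaviour a minimal-path model is meant to capture.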
|
Similar papers:
[rank all papers by similarity to this]
|
#1875 - Incremental Activity Modeling and Recognition in Streaming Videos [pdf]
Mahmudul Hasan, Amit Roy-Chowdhury |
Abstract: Human activity recognition in videos is a difficult but widely studied problem in computer vision due to its numerous practical applications. Most of the state-of-the-art approaches to human activity recognition need an intensive training stage and assume that all of the training examples are labeled and available beforehand. But these assumptions are unrealistic for many applications where we have to deal with streaming videos. In these continuous streaming videos, as new activities are seen, they can be leveraged to improve the current activity recognition model. In this work, we aim to develop an incremental activity learning framework that will be able to continuously update the activity models and learn new ones as more videos are seen. Our proposed approach leverages state-of-the-art machine learning tools, most notably active learning systems, and leads to the development of an online activity recognition framework for streaming videos. It does not require tedious manual labeling of every incoming example of each activity class. We perform rigorous experiments on challenging human activity datasets, which demonstrate the robustness of our incremental activity modeling framework.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: In this paper, we address the problem of object tracking in intensity images and depth data. We propose a generic framework that can be used either for tracking 2D templates in RGB images or for tracking 3D objects in depth images. To overcome problems like occlusions, strong illumination changes and motion blur, which notoriously make energy minimization-based tracking methods get trapped in local minima, we propose a learning-based method that is robust to all these problems. We use random forests to learn the relation between the parameters that define the object's motion and the changes it induces on the image intensities or the point cloud of the template. It follows that, when the template moves, we use the change in the image intensities or point cloud to predict the parameters of this motion. This leads to extremely fast tracking performance, running at less than 2 ms per frame, that is robust to occlusions when tracking in intensity or depth images. Moreover, it demonstrates extreme robustness to strong illumination changes for tracking using intensity images, and high robustness in tracking 3D objects from arbitrary viewpoints even in the presence of motion blur that causes missing or erroneous data in depth images. Exhaustive experimental evaluation and comparison to the related approaches strongly demonstrate the benefits of our method.
|
Similar papers:
[rank all papers by similarity to this]
|
#1895 - Reliable Multi-view Stereopsis Evaluation [pdf]
Anders Dahl, Henrik Aanæs, Rasmus Jensen, George Vogiatzis, Engin Tola |
Abstract: The seminal multiple view stereo benchmark evaluations from Middlebury and by Strecha et al. have played a major role in propelling the development of multi-view stereopsis methodology. Although seminal, these benchmark datasets are limited in scope with few reference scenes. Here, we try to take these works a step further by proposing a new multi-view stereo dataset, which is an order of magnitude larger in number of scenes and with a significant increase in diversity. Specifically, we propose a dataset containing 80 scenes of large variability. Each scene consists of 49 or 64 accurate camera positions and reference structured light scans, all acquired by a 6-axis industrial robot. To apply this dataset we propose an extension of the evaluation protocol from the Middlebury evaluation, reflecting the more complex geometry of some of our scenes. The proposed dataset is used to evaluate the state of the art multi-view stereo algorithms of Tola et al., Campbell et al. and Furukawa et al. Hereby we demonstrate the usability of the dataset as well as gain insight into the workings and challenges of multi-view stereopsis. Through these experiments we empirically validate some of the central hypotheses of multi-view stereopsis, as well as determining and reaffirming some of the central challenges.
|
Similar papers:
[rank all papers by similarity to this]
|
#1913 - Understanding Objects in Detail with Fine-grained Attributes [pdf]
Subhransu Maji, Iasonas Kokkinos, Stavros Tsogkas, Ross Girshick, Matthew Blaschko, Esa Rahtu, Juho Kannala, Andrea Vedaldi |
Abstract: Each of 7,413 aeroplane instances is annotated with segmentations for five part types (bottom) and their modifiers (top). The internal variability of the data is significant, including modern large airliners, ancient biplanes and triplanes, jet planes, propeller planes, gliders, etc. For convenience, aeroplanes are divided into "typical" (planes with one wing, one fuselage, and one vertical stabilizer) and "atypical" (planes with more diverse structure); this subdivision can be used as "easy" and "hard" subsets of the data. Several detailed modifiers are associated with parts. For example, the undercarriage wheel group modifier specifies whether an undercarriage has one wheel on one axle, two wheels on one axle and so on.
|
Similar papers:
[rank all papers by similarity to this]
|
#1914 - Efficient Hierarchical Graph-Based Segmentation of RGBD Videos [pdf]
Steven Hickson, Irfan Essa, Henrik Christensen, Stan Birchfield |
Abstract: We present an efficient and scalable algorithm for segmenting 3D RGBD point clouds by combining depth, color, and temporal information using a multistage, hierarchical graph-based approach. The algorithm processes a moving window over several point clouds to group similar regions over a graph, resulting in an initial over-segmentation. These regions are then merged to yield a dendrogram using agglomerative clustering via a minimum spanning tree algorithm. Bipartite graph matching at a given level of the hierarchical tree yields the final segmentation of the point clouds by maintaining region identities over arbitrarily long periods of time. We show that a multistage segmentation with depth then color yields better results than a linear combination of depth and color. Due to its incremental processing, our algorithm can process videos of any length and in a streaming pipeline. The algorithm's ability to produce robust, efficient segmentation is demonstrated with numerous experimental results on challenging sequences from our own as well as public RGBD data sets.
|
Similar papers:
[rank all papers by similarity to this]
|
#1919 - Iterative Multilevel MRF Leveraging Context and Voxel Information for Brain Tumour Segmentation in MRI [pdf]
Nagesh Subbanna, Doina Precup, Tal Arbel |
Abstract: In this paper, we introduce a fully automated multistage graphical probabilistic framework to segment brain tumours from multimodal Magnetic Resonance Images (MRIs) acquired from real patients. As a starting point, a Bayesian classification of the tumour is derived based on Gabor texture features, and subsequent computations are focused on areas of high tumour probabilities. An iterative, multistage Markov Random Field (MRF) framework is then devised to classify the various tumour subclasses (e.g. edema, tumour core, enhancing tumour and necrotic core). At the voxel level, an adapted MRF is devised based on both local observations, and neighbouring class and intensity features. This leads to over-segmentation and numerous false positive tumour subclass regions. A higher level MRF is then devised in order to leverage both contextual texture information and relative spatial consistency of the tumour subclass positions. Here, each node represents a possible subclass region and the graphical model takes the form of an irregular lattice. The higher level, regional information is then passed back down to the voxel-based MRF for further refinement and the two stages iterate until convergence. Experiments are performed on publicly available, patient brain tumour images from the MICCAI 2012 and 2013 Brain Tumour Segmentation Challenges and compared to the top performing techniques. The results demonstrate that the method achieves the top pe
|
Similar papers:
[rank all papers by similarity to this]
|
#1933 - Stable Template-Based Isometric 3D Reconstruction in All Imaging Conditions by Linear Least-Squares [pdf]
Ajad Chhatkuli, Daniel Pizarro, Adrien Bartoli, Toby Collins |
Abstract: It has been recently shown that reconstructing an isometric surface from a single input image matched to a 3D template was a well-posed problem. This however does not tell us how reconstruction algorithms will behave in practical conditions, where the amount of perspective is generally small and the projection thus behaves like weak-perspective or orthography. We here bring answers to what is theoretically recoverable in such imaging conditions, and explain why existing convex numerical solutions and analytical solutions to 3D reconstruction will become unstable. We then propose a new algorithm which works under all imaging conditions, from strong to loose perspective. We empirically show that the gain of stability is tremendous, bringing our results close to the iterative minimization of a statistically-optimal cost. Our algorithm has a low complexity, is simple and uses only one round of linear least-squares.
|
Similar papers:
[rank all papers by similarity to this]
|
#1934 - Zero-shot Event Detection using Multi-modal Fusion of Weakly Supervised Concepts [pdf]
Shuang Wu, Florian Luisier, Sravanthi Bondugula, Pradeep Natarajan |
Abstract: Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this paper, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This task goes beyond the traditional zero-shot framework of adapting a given set of classes with training data to unseen classes. We leverage video and image collections with free-form text descriptions from widely available web sources to learn a large bank of concepts, in addition to using several off-the-shelf concept detectors, speech, and video text for representing videos. We utilize natural language processing technologies to generate event description features. The extracted features are then projected to a common high-dimensional space using text expansion, and similarity is computed in this space. We present extensive experimental results on the large TRECVID MED corpus to demonstrate our approach. Our results show that the proposed concept detection methods significantly outperform current attribute classifiers such as Classemes [31], ObjectBank [19], and SUN attributes [25]. Further, we find that fusion, both within as well as between modalities, is crucial for optimal performance.
|
Similar papers:
[rank all papers by similarity to this]
|
#1940 - Merging SVMs with Linear Discriminant Analysis: A Combined Model [pdf]
Symeon Nikitidis, Stefanos Zafeiriou, Maja Pantic |
Abstract: A key problem often encountered by many learning algorithms in computer vision dealing with high dimensional data is the so-called "curse of dimensionality", which arises when the available training samples are fewer than the input feature space dimensionality. To remedy this problem, we propose a joint dimensionality reduction and classification framework by formulating an optimization problem within the maximum margin class separation task. The proposed optimization problem is solved using alternative optimization where we jointly compute the low dimensional maximum margin projections and the separating hyperplanes in the projection subspace. Moreover, in order to reduce the computational cost of the developed optimization algorithm we incorporate orthogonality constraints on the derived projection bases and show that the resulting combined model is an alternation between identifying the optimal separating hyperplanes and performing a linear discriminant analysis on the support vectors. Experiments on facial expression and object recognition validate the effectiveness of the proposed method against state-of-the-art dimensionality reduction algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: We address the face cluster recognition problem, where there are multiple images per subject in both gallery and probe sets. A clear spatio-temporal relation among the multiple images of each subject is never guaranteed. Considering that the image vectors of each subject, either in gallery or in probe, span a subspace, we develop an algorithm, Dual Linear Regression Classification (DLRC), for the face cluster recognition problem, in which the distance between two subspaces is defined as the similarity value between a gallery subject and a probe subject. DLRC attempts to find a "virtual" face image located in the intersection of the subspaces spanned by both clusters of face images. The "distance" between the "virtual" face images reconstructed from both subspaces is then taken as the distance between these two subspaces. We further prove that such a distance can be formulated under a single linear regression model where we can indeed find the "distance" without reconstructing the "virtual" face images. Extensive experimental evaluations demonstrate the effectiveness of the DLRC algorithm compared to other algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
#1954 - Transitive Distance Clustering with K-Means Duality [pdf]
Zhiding Yu, Chunjing Xu, Deyu Meng, Zhuo Hui, Fanyi Xiao, Wenbo Liu |
Abstract: We propose a very intuitive and simple approximation for the conventional spectral clustering methods. It effectively alleviates the computational burden of spectral clustering - reducing the time complexity from O(n^3) to O(n^2) - while capable of gaining better performance in our experiments. Specifically, by involving a more realistic and effective distance and the "k-means duality" property, our algorithm can handle datasets with complex cluster shapes, multi-scale clusters and noise. We also show its superiority in a series of its real applications on tasks including digit clustering as well as image segmentation.
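As a rough illustration of the distance named in the title, the sketch below assumes that the "transitive distance" between two points is the minimax path distance: the smallest achievable value of the largest hop along any path connecting them. The abstract does not spell out the definition, so this reading, the function name `transitive_distances`, and the toy data are assumptions. The sketch uses the known fact that minimax path distances equal the largest edge on the minimum spanning tree path, and fills them in while growing the tree with Prim's algorithm in O(n^2) time; it does not reproduce the paper's clustering step.

```python
def transitive_distances(dist):
    """All-pairs minimax path distances from a symmetric distance matrix.

    Grows a minimum spanning tree with Prim's algorithm; when vertex v joins
    via edge (p, v) of weight w, its distance to every tree vertex u is
    max(D[u][p], w).
    """
    n = len(dist)
    D = [[0.0] * n for _ in range(n)]
    in_tree = [0]
    best = list(dist[0])        # cheapest known edge from each vertex to the tree
    parent = [0] * n            # tree endpoint of that cheapest edge
    remaining = set(range(1, n))
    while remaining:
        v = min(remaining, key=lambda j: best[j])
        p, w = parent[v], best[v]
        for u in in_tree:
            D[u][v] = D[v][u] = max(D[u][p], w)
        in_tree.append(v)
        remaining.remove(v)
        for j in remaining:
            if dist[v][j] < best[j]:
                best[j] = dist[v][j]
                parent[j] = v
    return D

# Hypothetical 1D data: three nearby points and one far away.
points = [0.0, 1.0, 2.0, 10.0]
pairwise = [[abs(a - b) for b in points] for a in points]
D = transitive_distances(pairwise)
```

Under this reading, points chained together by short hops end up close to each other regardless of their direct Euclidean distance, which is why such a distance can follow elongated or irregular cluster shapes.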
|
Similar papers:
[rank all papers by similarity to this]
|
#1958 - Learning Receptive Fields for Pooling from Tensors of Feature Response [pdf]
Can Xu, Nuno Vasconcelos |
Abstract: A new method for learning pooling receptive fields for recognition is presented. The method exploits the statistics of the 3D tensor of SIFT responses to an image. It is argued that the eigentensors of this tensor contain the information necessary for learning class-specific pooling receptive fields. It is shown that this information can be extracted by a simple PCA analysis of a specific tensor flattening. A novel algorithm is then proposed for fitting box-like receptive fields to the eigenimages extracted from a collection of images. The resulting receptive fields can be combined with any of the recently popular coding strategies for image classification. This combination is experimentally shown to improve classification accuracy for both vector quantization and Fisher vector (FV) encodings. It is then shown that the combination of the FV encoding with the proposed receptive fields has state-of-the-art performance for both object recognition and scene classification. Finally, when compared with previous attempts at learning receptive fields for pooling, the method is simpler and achieves better results.
|
Similar papers:
[rank all papers by similarity to this]
|
#1962 - Generalized Pupil-Centric Imaging and Analytical Calibration for a Non-frontal Camera [pdf]
Avinash Kumar, Narendra Ahuja |
Abstract: We consider the problem of calibrating a small field of view central perspective non-frontal camera whose lens and sensor may not lie on parallel planes due to manufacturing imperfections or intentional tilting. Generally, all lens-sensor configurations can be modeled as non-frontal with varying degrees. For modeling non-frontal sensors, approaches based on a generic rotation matrix (three Euler angles) relating lens and sensor lead to additional degrees of freedom which make linear calibration equations under-determined. This problem is altogether avoided by a different decentering distortion based approach, which models the effect of non-frontalness on image formation. This model is approximate, can handle only small tilts and cannot estimate the tilt explicitly. Thus, it cannot be used to calibrate cameras where tilt is important, e.g., a tilt-shift camera. Also, calibrating a rotation-based non-frontal sensor in a pupil-centric setting has been shown to be more accurate in estimating sensor tilt as compared to using a thin-lens setting. But prior work has developed pupil-centric imaging for a single axis lens-sensor tilt while real cameras have arbitrary tilt. In this paper, we focus on non-frontal calibration based on rotation modeling and first show that only two Euler angles are sufficient to parameterize sensor tilt. Second, we generalize pupil-centric imaging for an arbitrarily rotated sensor. Third, we propose to use a novel pupil-centric base
|
Similar papers:
[rank all papers by similarity to this]
|
#1963 - Confidence-Rated Multiple Instance Boosting for Object Detection [pdf]
Karim Ali, Kate Saenko |
Abstract: Over the past years, Multiple Instance Learning (MIL) has proven to be an effective framework for learning with weakly labeled data. Applications of MIL to object detection, however, were limited to handling the uncertainties of manual annotations. In this paper, we propose a new MIL method for object detection that is capable of handling noisier automatically obtained annotations. Our approach consists in first obtaining confidence estimates over the label space and second incorporating these estimates within a new Boosting procedure. We demonstrate the efficiency of our procedure on two detection tasks, namely horse detection and pedestrian detection, where the training data is primarily annotated by a coarse area of interest detector and show substantial improvements over existing MIL methods. In both cases, we demonstrate that an efficient appearance model can be learned using our approach.
|
Similar papers:
[rank all papers by similarity to this]
|
#1964 - 6 Seconds of Sound and Vision: Creativity in Micro-Videos [pdf]
Miriam Redi, Michele Trevisiol, Rossano Schifanella, Neil O'Hare, Alejandro Jaimes |
Abstract: The general notion of creativity, as opposed to related concepts such as beauty or interestingness, has not been studied from the perspective of automatic analysis of multimedia content. Meanwhile, short online videos shared on social media platforms, or micro-videos, have arisen as a new medium for creative expression. In this paper we study creative micro-videos in an effort to understand the features that make a video creative, and to address the problem of automatic detection of creative content. Defining creative video as videos that are novel and have aesthetic value, we conduct a crowdsourcing experiment to create a dataset of 4,000 micro-videos labelled as creative and non-creative. We propose a set of computational features that we map to the components of our definition of creativity, and conduct an analysis to determine which of these features correlate most with creative video. Finally, we evaluate a supervised approach to automatically detect creative video, with promising results, showing that it is necessary to model both aesthetic value and novelty to achieve optimal classification accuracy.
|
Similar papers:
[rank all papers by similarity to this]
|
#1972 - RAPS: Robust and Efficient Automatic Construction of Person-Specific Deformable Models [pdf]
Christos Sagonas, Stefanos Zafeiriou, Yannis Panagakis, Maja Pantic |
Abstract: Construction of Facial Deformable Models (FDMs) is a very active research field in Computer Vision, mainly due to their numerous applications, and to the very challenging nature of the problem itself: the face is a highly deformable object, the appearance of which drastically changes under different poses, expressions and illuminations. Although several methodologies for constructing generic FDMs that can be robustly fitted in static images, mainly for facial landmark localization, have recently appeared, when it comes to tasks that require very high accuracy, for example behaviour analysis or facial motion capture, person-specific FDMs are mainly applied, requiring manual facial landmark annotation for each person and person-specific training. Recently, due to advancements in automatic subspace recovery and image congealing, it became possible to learn a person-specific model by applying image congealing methodologies to a set of images of the person. Unfortunately, these methodologies involve time-consuming optimization procedures requiring eigendecompositions of high-dimensional matrices. In this paper, by using a generic texture model, we show that it is possible not only to reduce the computational complexity but also to increase landmark localization accuracy. Finally, we show that the proposed method is not only faster but also robust to gross non-Gaussian noise compared to the state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: This paper presents a framework for object recognition using topological persistence. In particular, we show that the so-called persistence diagrams built from functions defined on the objects can serve as compact and informative descriptors for images and shapes. Complementary to the bag-of-features representation, which captures the distribution of values of a given function, persistence diagrams can be used to characterize its structural properties, reflecting spatial information in an invariant way. In practice, the choice of function is simple: each dimension of the feature vector can be viewed as a function. The proposed method is general: it can work on various multimedia data, including 2d shapes, textures and triangle meshes. Extensive experiments on 3D shape retrieval, hand gesture recognition and texture classification demonstrate the performance of the proposed method in comparison with state-of-the-art methods. Additionally, our approach yields higher recognition accuracy when used in conjunction with the bag-of-features.
|
Similar papers:
[rank all papers by similarity to this]
|
#1983 - Region-based particle filter for video object segmentation [pdf]
David Varas, Ferran Marques |
Abstract: We present a video object segmentation approach that extends the particle filter to a region-based image representation. Image partition is considered part of the particle filter measurement, which enriches the available information and leads to a re-formulation of the particle filter. The prediction step uses a co-clustering between the previous image object partition and a partition of the current one, which allows us to tackle the evolution of non-rigid structures. Particles are defined as unions of regions in the current image partition and their propagation is computed through a single co-clustering. The proposed technique is assessed quantitatively on the SegTrack dataset and qualitatively on the LabelMe Video dataset, leading to satisfactory perceptual results outperforming state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#1997 - Fast and robust identification of persistent homotopy types of noisy images [pdf]
Vitaliy Kurlin |
Abstract: We present a fast algorithm to identify the topological shape (homotopy type) of a noisy dotted image in the plane. The algorithm has O(n log n) time and O(n) space in the number n of points in a given image. The only input is a point cloud. The output is the number of all non-trivial loops that persist (have a long life span) when the image is analyzed at all possible scales. We give theoretical guarantees when the algorithm correctly identifies the homotopy type by using only a noisy sample of a triangulable set.
|
Similar papers:
[rank all papers by similarity to this]
|
#1999 - Learning optimal features for salient object detection [pdf]
Song Lu, Vijay Mahadevan, Nuno Vasconcelos |
Abstract: We introduce a novel approach for salient object detection. The approach starts by partitioning an image into superpixels, and computing two types of features for each superpixel. One is the bottom-up saliency of the superpixel region, and the other is a set of "objectness" features that are informative of how likely the superpixel is to be part of an object. A graph is then formed with the superpixels as nodes, and edge weights representing a measure of similarity between two superpixels. Starting from an arbitrary initialization, the saliency information is propagated over the graph using a random walk process, whose equilibrium state yields the object saliency map. Unlike other graph based salient object detection approaches, we learn the initial salient seed locations using a large margin framework. We show that the proposed approach outperforms the state of the art on a number of salient object detection datasets.
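The propagation step described in this abstract, a random walk on a weighted graph that settles into an equilibrium independent of its starting point, can be illustrated with a generic sketch. This is not the authors' method: the learned seed locations, the saliency and objectness features, and the superpixel graph are all omitted, and the function name `stationary_distribution` and the toy similarity matrix are assumptions made for illustration only.

```python
def stationary_distribution(weights, init=None, num_iters=10000, tol=1e-13):
    """Power iteration for the equilibrium of a random walk on a weighted graph.

    `weights` is a square list of lists of positive edge similarities
    (hypothetical toy input, standing in for superpixel similarities).
    """
    n = len(weights)
    # Row-normalise similarities into transition probabilities.
    P = [[w / sum(row) for w in row] for row in weights]
    # Any initial distribution works for an ergodic chain; default to uniform.
    pi = list(init) if init is not None else [1.0 / n] * n
    for _ in range(num_iters):
        # One step of the walk: new_j = sum_i pi_i * P[i][j].
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, pi)) < tol
        pi = new
        if converged:
            break
    return pi


if __name__ == "__main__":
    # Symmetric, strictly positive similarities between three nodes.
    W = [[1.0, 2.0, 1.0],
         [2.0, 1.0, 3.0],
         [1.0, 3.0, 1.0]]
    print(stationary_distribution(W))
    print(stationary_distribution(W, init=[1.0, 0.0, 0.0]))
```

For a symmetric weight matrix the equilibrium is proportional to each node's total edge weight, so both calls converge to the same vector regardless of the initialization.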
|
Similar papers:
[rank all papers by similarity to this]
|
#2002 - Fast and Reliable Two-View Translation Estimation [pdf]
Johan Fredriksson, Olof Enqvist, Fredrik Kahl |
Abstract: It has long been recognized that one of the fundamental difficulties in the estimation of two-view epipolar geometry is the capability of handling outliers. In this paper, we develop a fast and tractable algorithm that maximizes the number of inliers under the assumption of a purely translating camera. Compared to classical random sampling methods, our approach is guaranteed to compute the optimal solution of a cost function based on reprojection errors and it has better time complexity. The performance is in fact independent of the inlier/outlier ratio of the data. This opens up for a more reliable approach to robust ego-motion estimation. Our basic translation estimator can be embedded into a system that computes the full camera rotation. We demonstrate the applicability in several difficult settings with large amount of outliers. It turns out to be particularly tractable for small rotations and rotations around one axis (which is the case for cellular phones where the gravitation axis can be measured). Experimental results show that compared to standard RANSAC methods based on minimal solvers, our algorithm produces more accurate estimates in the presence of large outlier ratios.
|
Similar papers:
[rank all papers by similarity to this]
|
#2003 - Single Image Super-resolution using Deformable Patches [pdf]
Yu Zhu, Yanning Zhang, Alan Yuille |
Abstract: We propose a deformable patch based method for single image super-resolution. By the concept of deformation, a patch is not regarded as a fixed vector but a flexible deformation flow. Via deformable patches, the dictionary can cover more patterns that do not appear, thus becoming more expressive. We present the energy function with slow, smooth and flexible prior for deformation model. During example-based super-resolution, we develop the deformation similarity based on the minimized energy function for basic patch matching. For robustness, we combine multiple deformed patches for the final reconstruction. Experiments evaluate the deformation effectiveness and super-resolution visual quality, showing that the deformable patch helps improve the representation accuracy and achieves better results than the state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: This paper presents a novel method to generate a hypothesis set of class-independent object regions. It has been shown that such object regions can be used to focus computer vision techniques on the parts of an image that matter most leading to significant improvements in both object localisation and semantic segmentation in recent years. Of course, the higher quality of class-independent object regions, the better subsequent computer vision algorithms can perform. In this paper we focus on generating higher quality object hypotheses. We start from an oversegmentation for which we propose to extract a wide variety of region-features. We group regions together in a hierarchical fashion, for which we train a Random Forest which predicts at each stage of the hierarchy the best possible merge. Hence unlike other approaches, we use relatively powerful features and classifiers at an early stage of the generation of likely object regions. Finally, we identify and combine stable regions in order to capture objects which consist of dissimilar parts. We show on the PASCAL 2007 and 2012 datasets that our method yields higher quality regions than competing approaches while it is at the same time more computationally efficient.
|
Similar papers:
[rank all papers by similarity to this]
|
#2010 - Object Discovery and Segmentation via Discriminative Visual Subcategories [pdf]
Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta |
Abstract: In this paper, we propose a simple yet surprisingly powerful approach that combines the power of generative modeling for segmentation with effectiveness of discriminative models for detection to propose an algorithm that can discover objects and their segmentations from noisy Internet images. The key idea behind our approach is to learn and exploit top-down priors for joint segmentation. Unlike previous approaches which build a single prior model for each semantic class, our approach develops prior models for visually homogeneous clusters called visual subcategories. Our approach jointly discovers these visual subcategories and learns segmentation prior models for each subcategory. The strong priors learned from these visual subcategories are then combined with discriminatively trained detectors and bottom up cues to produce clean object segmentations. Our experimental results indicate state-of-the-art performance on the difficult dataset introduced by [34].
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: The basic idea of shape from shading is to infer the shape of a surface from its shading information in a single image. Since this problem is ill-posed, a number of simplifying assumptions have often been used. However, they rarely hold in practice. This paper presents a simple shading-correction algorithm that transforms the image to a new image that better satisfies the assumptions typically needed by existing algorithms, thus improving the accuracy of shape recovery. The algorithm takes advantage of some local shading measures that have been derived under these assumptions. The method is successfully evaluated on real data with ground-truth 3D shapes.
|
Similar papers:
[rank all papers by similarity to this]
|
#2022 - Human Pose Estimation: New Benchmark and State of the Art Analysis [pdf]
Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, Bernt Schiele |
Abstract: Human pose estimation has made significant progress during the last years. However current benchmark datasets are limited in their coverage of the overall pose estimation challenges. Still these serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark dataset that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 600 human activities. The collected images cover a wider range of human poses than previous datasets such as sports, recreational activities and householding, and also include special cases such as frontal views and strongly articulated people. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, along with activity labels. For each image we are providing adjacent video frames to facilitate the use of motion information. Given these rich annotations we perform a detailed analysis of leading human pose estimation approaches and provide insights for the success and failures of these methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#2027 - Automatic Construction of Deformable Models In-The-Wild [pdf]
Epameinondas Antonakos, Stefanos Zafeiriou |
Abstract: Deformable objects are everywhere. Faces, cars, bicycles, chairs etc. Recently, there has been a wealth of research on training deformable models for object detection, part localization and recognition using annotated data. In order to train deformable models with good generalization ability, a large amount of carefully annotated data is required, which is a highly time consuming and costly task. We propose the first - to the best of our knowledge - method for automatic construction of deformable models using images captured in totally unconstrained conditions, recently referred to as in-the-wild. The only requirements of the method are a crude bounding box object detector and a priori knowledge of the object's shape (e.g. a point distribution model). The object detector can be as simple as the Viola-Jones algorithm (e.g. even the cheapest digital camera features a robust face detector). The 2D shape model can be created simply by deforming and projecting to the camera plane a 3D CAD model of the object. In our experiments on facial deformable models, we show that the proposed automatically built model not only performs well, but also outperforms discriminative models trained on carefully annotated data. To the best of our knowledge, this is the first time it is shown that an automatically constructed model can perform as well as methods trained directly on annotated data.
|
Similar papers:
[rank all papers by similarity to this]
|
#2039 - Scalable Object Detection using Deep Neural Networks [pdf]
Dumitru Erhan, Christian Szegedy, Alexander Toshev, Dragomir Anguelov |
Abstract: Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.
|
Similar papers:
[rank all papers by similarity to this]
|
#2040 - The Photometry of Intrinsic Images [pdf]
Marc Serra, Robert Benavente, Maria Vanrell, Dimitris Samaras, Olivier Penacchio |
Abstract: Intrinsic characterization of scenes is often the best way to overcome the illumination variability artifacts that complicate most computer vision problems from 3D reconstruction to object or material recognition. This paper examines the deficiency of existing intrinsic image models to accurately account for the effects of illuminant color and sensor characteristics in the estimation of intrinsic images and presents a generic framework which incorporates insights from color constancy research to the intrinsic image decomposition problem. The proposed mathematical formulation includes information about the color of the illuminant and the effects of the camera sensors, both of which modify the observed color of the reflectance of the objects in the scene during the acquisition process. By modeling these effects, we get a "truly intrinsic" reflectance image, which we call absolute reflectance, which is invariant to changes of illuminant or camera sensors. This model allows us to represent a wide range of intrinsic image decompositions depending on the specific assumptions on the geometric properties of the scene configuration and the spectral properties of the light source and the acquisition system, thus unifying previous models in a single general framework. We demonstrate that even partial information about sensors significantly improves the estimated reflectance images, thus making our method applicable for a wide range of sensors. We validate our general intrinsic imag
|
Similar papers:
[rank all papers by similarity to this]
|
#2047 - Scalable Multitask Representation Learning for Scene Classification [pdf]
Maksim Lapin, Matthias Hein, Bernt Schiele |
Abstract: The basic idea of multitask learning is that learning tasks jointly is better than learning each task individually. In particular, if only a few training samples are available for each task, sharing a jointly trained representation with related tasks helps to improve performance. In this paper we propose a novel multitask learning method which jointly learns a low-dimensional representation and the corresponding classifiers thus profiting from inter-class relations. Our method scales with respect to the original dimension of the features and thus can be used for very high-dimensional feature representations such as the Fisher Vector. Our multitask learning approach outperforms the current state of the art on the SUN397 scene classification benchmark consistently for varying numbers of training samples.
|
Similar papers:
[rank all papers by similarity to this]
|
#2051 - Incorporating Scene Context and Object Layout into Appearance Modeling [pdf]
Hamid Izadinia, Fereshteh Sadeghi, Ali Farhadi |
Abstract: A scene category imposes tight distributions over the kind of objects that might appear in the scene, the appearance of those objects and their layout. In this paper, we propose a method to learn scene structures that can encode three main interlacing components of a scene: the scene category, the context-specific appearance of objects, and their layout. Our experimental evaluations show that our learned scene structures outperform state-of-the-art method of Deformable Part Models in detecting objects in a scene. Our scene structure provides a level of scene understanding that is amenable to deep inferences such as intelligent predictions about a covered part of an image. The scene structures can also generate features that can later be used for scene categorization. We also show promising results on scene categorization.
|
Similar papers:
[rank all papers by similarity to this]
|
#2052 - Instance-weighted Transfer Learning of Active Appearance Models [pdf]
Daniel Haase, Erik Rodner, Joachim Denzler |
Abstract: There has been a lot of work on face modeling, analysis, and landmark detection, with Active Appearance Models being one of the most successful techniques. A major drawback of these models is the large number of detailed annotated training examples needed for learning. Therefore, we present a transfer learning method that is able to learn from related training data using an instance-level transfer technique. Our method is derived using a generalization of importance sampling and in contrast to previous work we explicitly try to tackle the transfer already during learning instead of adapting the fitting process. In our studied application of face landmark detection, we efficiently transfer facial expressions from other human individuals and are thus able to learn a precise face active appearance model only from neutral faces of a single individual. Our approach is evaluated on two common face datasets and outperforms previous transfer methods.
|
Similar papers:
[rank all papers by similarity to this]
|
#2064 - FastSeg: More Efficiency on Multiple Figure-Ground Segmentations [pdf]
Ahmad Humayun, Fuxin Li, James Rehg |
Abstract: Recently, figure-ground segmentation algorithms that generate a pool of overlapping segment proposals have been popular. These algorithms have high recall on most objects in a scene and could be used to generate boundary-aligning proposals for subsequent object recognition engines, achieving excellent performance. What has remained unexplored is the idea of obtaining such a hypotheses pool in a computationally efficient way. By precomputing a graph which can be used for parametric min-cut over different seed enumerations, we save time spent on generating the segment pool. Besides, we have made design choices that avoid extensive computations and achieve better efficiency without losing performance. In particular, we show the segmentation performance of our algorithm is similar to the state-of-the-art on the PASCAL VOC dataset, while being an order of magnitude faster.
|
Similar papers:
[rank all papers by similarity to this]
|
#2072 - Random Laplace Feature Maps for Semigroup Kernels on Histograms [pdf]
Jiyan Yang, Vikas Sindhwani, Quanfu Fan, Haim Avron, Michael Mahoney |
Abstract: To dramatically accelerate the training and testing of nonlinear kernel methods, several recent papers have proposed explicit embeddings of the input data into low-dimensional feature spaces where fast linear methods can instead be used to generate approximate solutions. Analogous to random Fourier feature maps to approximate shift-invariant kernels, such as the Gaussian kernel, on R^d, we develop a new randomized technique called random Laplace features, to approximate a family of kernel functions adapted to the semigroup structure of R_+^d. This is the space in which histograms and other non-negative data representations reside. We provide theoretical results on the uniform convergence of random Laplace features. Empirical analyses on image classification and surveillance event detection tasks demonstrate the attractiveness of using random Laplace features relative to several other feature maps proposed in the literature.
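As background for the analogy drawn in this abstract, the following is a minimal sketch of random Fourier features for the Gaussian kernel on R^d, the technique the paper takes as its starting point. It does not implement the random Laplace features the paper proposes. The helper names (`make_feature_map`, `gaussian_kernel`, `dot`) and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_feature_map(dim, num_features, sigma=1.0, seed=0):
    """Return z(.) such that dot(z(x), z(y)) approximates gaussian_kernel(x, y)."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, I / sigma^2).
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(num_features)]
    # Random phases, uniform on [0, 2*pi).
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    z = make_feature_map(dim=3, num_features=5000)
    x, y = [0.2, -0.1, 0.5], [0.0, 0.3, 0.4]
    print("exact :", gaussian_kernel(x, y))
    print("approx:", dot(z(x), z(y)))
```

Once inputs are mapped through `z`, a linear method applied to the features stands in for the nonlinear kernel method; the approximation error shrinks roughly as one over the square root of the number of features.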
|
Similar papers:
[rank all papers by similarity to this]
|
#2080 - Laplacian Coordinates for Seeded Image Segmentation [pdf]
Wallace Casaca, Gustavo Nonato, Gabriel Taubin |
Abstract: Seed-based image segmentation methods have gained much attention lately, mainly due to their good performance in segmenting complex images with little user interaction. Such popularity leveraged the development of many new variations of seed-based image segmentation, which vary greatly regarding mathematical formulation and complexity. Most existing methods in fact rely on complex mathematical formulations that typically do not guarantee a unique solution for the segmentation problem while still being prone to be trapped in local minima. In this work we present a novel framework for seed-based image segmentation that is mathematically simple, easy to implement, and guaranteed to produce a unique solution. Moreover, the formulation holds an anisotropic behavior, that is, pixels sharing similar attributes are kept closer to each other while big jumps are naturally imposed on the boundary between image regions, thus ensuring better fitting on object boundaries. We show that the proposed framework outperforms state-of-the-art techniques in terms of quantitative quality metrics as well as qualitative visual results.
|
Similar papers:
[rank all papers by similarity to this]
|
#2081 - Quality Dynamic Human Body Modeling Using a Single Low-cost Depth Camera [pdf]
Qing Zhang, Bo Fu |
Abstract: In this paper we present a novel autonomous pipeline to build a personalized parametric model (pose-driven avatar) using only a single depth sensor. Our method first captures a few high-quality scans of the user rotating herself at multiple poses from different views. We fit each incomplete scan using template fitting techniques with a generic human template, and register all scans to every pose using global consistency constraints. After registration, these watertight models under different poses are used to train a parametric model in a fashion similar to the SCAPE method. Once the parametric model is built, it can be used as an animatable avatar or, more interestingly, for creating dynamic 3D models from single-view depth videos. Experimental results demonstrate the effectiveness of our system to produce dynamic models.
|
Similar papers:
[rank all papers by similarity to this]
|
#2083 - A Reverse Hierarchy Model for Predicting Eye Fixations [pdf]
Tianlin Shi, Xiaolin Hu, Ming Liang |
Abstract: A number of psychological and physiological evidences suggest that visual attention works in a coarse-to-fine way, which lays a basis for the reverse hierarchy theory (RHT). This theory states that attention propagates from the top level of the visual hierarchy that processes gist and abstract information of input, to the bottom level that processes local details. Inspired by the theory, we develop a computational model for saliency detection in images. First, the original image is downsampled to different scales to constitute a fine-to-coarse pyramid. Then, saliency on each layer is obtained by image super-resolution reconstruction from the layer above, which is defined as unpredictability from this coarse to fine reconstruction. Finally, the saliency on each layer of the pyramid is converted into stochastic fixations through a probabilistic model, where attention initiates from the top layer and propagates down the pyramid. Extensive experiments on two standard eye-tracking datasets show that the proposed method can achieve competitive results with state-of-the-art models.
|
Similar papers:
[rank all papers by similarity to this]
|
Abstract: This paper proposes a novel mean field-based Chamfer template matching method. In our method, each template is represented as a field model and matching a template with an input image is formulated as estimation of a maximum of posterior in the field model. A variational approach is then adopted to approximate the estimation. The proposed method was applied to two different variants of Chamfer template matching and evaluated through the task of object detection. Experimental results on benchmark datasets including ETHZShapeClass and INRIAHorse have shown that the proposed method could significantly improve the accuracy of template matching without sacrificing much efficiency. Comparisons with other recent template matching algorithms have also shown the robustness of the proposed method.
|
Similar papers:
[rank all papers by similarity to this]
|
#2113 - Efficient pruning LMI conditions for Branch-and-Prune Rank and Chirality-Constrained Estimation of the Dual Absolute Quadric [pdf]
Adlane Habed, Danda Pani Paudel, Cédric Demonceaux, David Fofi |
Abstract: We present a new globally optimal algorithm for self-calibrating a moving camera with constant parameters. Our method aims at estimating the Dual Absolute Quadric (DAQ) under the rank-3 and, optionally, camera centers chirality constraints. We employ the Branch-and-Prune paradigm and explore the space of only 5 parameters. Pruning in our method relies on solving Linear Matrix Inequality (LMI) feasibility and Generalized Eigenvalue (GEV) problems that solely depend upon the entries of the DAQ. These LMI and GEV problems are used to rule out branches in the search tree in which a quadric not satisfying the rank and chirality conditions on camera centers is guaranteed not to exist. The chirality LMI conditions are obtained by relying on the mild assumption that the camera undergoes a rotation of no more than 90° between consecutive views. Unlike existing global methods for DAQ estimation, our algorithm can optimize a normalized objective and achieves global optimality in a competitive running time.
|
Similar papers:
[rank all papers by similarity to this]
|